Commit Graph

1300604 Commits

Author SHA1 Message Date
Gao Xiang
7c3ca1838a erofs: restrict pcluster size limitations
Error out if {en,de}encoded size of a pcluster is unsupported:
  Maximum supported encoded size (of a pcluster):  1 MiB
  Maximum supported decoded size (of a pcluster): 12 MiB

Users can still choose to use supported large configurations (e.g.,
for archival purposes), but there may be performance penalties in
low-memory scenarios compared to smaller pclusters.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240912074156.2925394-1-hsiangkao@linux.alibaba.com
2024-09-12 23:00:09 +08:00
Chunhai Guo
79f504a2cd erofs: allocate more short-lived pages from reserved pool first
This patch aims to allocate bvpages and short-lived compressed pages
from the reserved pool first.

After applying this patch, there are three benefits.

1. It reduces the page allocation time.
 The bvpages and short-lived compressed pages account for about 4% of
the pages allocated from the system in the multi-app launch benchmarks
[1]. It reduces the page allocation time accordingly and lowers the
likelihood of blockage by page allocation in low memory scenarios.

2. The pages in the reserved pool will be allocated on demand.
 Currently, bvpages and short-lived compressed pages are short-lived
pages allocated from the system, and the pages in the reserved pool all
originate from short-lived pages. Consequently, the number of reserved
pool pages will increase to z_erofs_rsv_nrpages over time.
 With this patch, all short-lived pages are allocated from the reserved
pool first, so the number of reserved pool pages will only increase when
there are not enough pages. Thus, even if z_erofs_rsv_nrpages is set to
a large number for specific reasons, the actual number of reserved pool
pages may remain low as per demand. In the multi-app launch benchmarks
[1], z_erofs_rsv_nrpages is set at 256, while the number of reserved
pool pages remains below 64.

3. When erofs cache decompression is disabled
   (EROFS_ZIP_CACHE_DISABLED), all pages will *only* be allocated from
the reserved pool for erofs. This will significantly reduce the memory
pressure from erofs.

[1] For additional details on the multi-app launch benchmarks, please
refer to commit 0f6273ab46 ("erofs: add a reserved buffer pool for lz4
decompression").

Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240906121110.3701889-1-guochunhai@vivo.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-09-12 22:59:49 +08:00
Bibo Mao
3abb708ec0 LoongArch: KVM: Implement function kvm_para_has_feature()
Implement function kvm_para_has_feature() to detect supported paravirt
features. It can be used by device driver to detect and enable paravirt
features, such as the EIOINTC irqchip driver is able to detect feature
KVM_FEATURE_VIRT_EXTIOI and do some optimization.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-09-12 22:56:14 +08:00
Riyan Dhiman
26e197b7f9 block: fix potential invalid pointer dereference in blk_add_partition
The blk_add_partition() function initially used a single if-condition
(IS_ERR(part)) to check for errors when adding a partition. This was
modified to handle the specific case of -ENXIO separately, allowing the
function to proceed without logging the error in this case. However,
this change unintentionally left a path where md_autodetect_dev()
could be called without confirming that part is a valid pointer.

This commit separates the error handling logic by splitting the
initial if-condition, improving code readability and handling specific
error scenarios explicitly. The function now distinguishes the general
error case from -ENXIO without altering the existing behavior of
md_autodetect_dev() calls.

Fixes: b72053072c (block: allow partitions on host aware zone devices)
Signed-off-by: Riyan Dhiman <riyandhiman14@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240911132954.5874-1-riyandhiman14@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-12 08:46:40 -06:00
Arnd Bergmann
168c3e0d44 Allwinner SoC device tree changes for 6.12 part 2
ARM64 device tree and binding-only changes
 - Add system and pin voltage regulator supplies for NanoPi NEO Plus2
 -----BEGIN PGP SIGNATURE-----
 
 iQJCBAABCgAsFiEE2nN1m/hhnkhOWjtHOJpUIZwPJDAFAmbipcUOHHdlbnNAY3Np
 ZS5vcmcACgkQOJpUIZwPJDB4vg//Q0MmjEWVfuWKq2nwr3OXcsvwtu6RIyuzzup4
 plWbA+njC+ED1150Az52i/gjHo4GyxrIscHJ+HWk06yGapFqUbofolGaadoKCJXN
 fYyL23PxWJRcehqD8zmXHxRTjEjZtPRBzpSUr22L6ypRp5vJTsjSOkB2fp7Fhzbm
 C84MQunF1J5K5PcyPsS9il3tdcuKR2JaMoblfYYhNXz55HfRAoEjL39wXHB8Y9Hr
 WXNZQxOlkQIeClDSR1p8QYu4LY/1Q+BGHFqslWK+M2aNOLY/ed5l8aBNfX7ZXF1h
 MQXRCfGEzGbsrm9qjjVWZE2Ge/618EnFWru7J1gwX/FEYS8YJJkpQeuRd56eFjga
 DFqqoFpHtFcOCD9W7vn/NMvQWxbfnQhj1wNliyzGpoeByIegxCaaTN4uMhW65sFu
 NOXdk953v/mKmjGqLycFwjaGV/dysrQ4l2etzKH/DcXJP6b475UKZcc1o82ffkfG
 zIDTcBs5jZi2KfGr9xD1gDSPCCKqlHL1eP2LM0mqd9gSgEV/MRHTeNEfnEYv+9L6
 5vcZ3LXeCFCSyyPWVLUwJY/5RVwNyxrrjdvXyUDvaZM06gUaZsJJ0UtOn7SLbY25
 2TvecnlBYPnixIe45twl+pFmDoee9219J6E6LKqTo8e+l3a8f6qhmn8YKxHHaADj
 QpKjmls=
 =jGM2
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmbi+eAACgkQYKtH/8kJ
 Uid6+g/8DIzjrn55xdjHit4cTxjpmRysXEQyACkYjtbXQ1whGBkcgv2HJ6JLUHEU
 IW0deXfz5tMbkopBJDtATeiVzZWL7bzA8YvEwR26VUrdMsJuFb2G3ru+A3hsn2Wn
 3KBQLN+9ufrxFx6yHXFOqzuEtrxaygPVjVJcbm9mLzgWVhRGtrFdz8EZxhP9Ty4N
 TdVVhjnQeuXlVjiwcanupX9hAOjUhDI6gCdGx+m7rIP+vo2k/76b+3pvoG7QtH8b
 qUyBjOGSTvZ5q9U2UFkkz+rLr0pd0n6rcO/ONmIV7K8XgLer8KY8RtuVcU1lFFzK
 Rfomq+9mF0BknBcQFP5gdgjEVOjdRm6FDbk8Bf3V6rWvs2YjcpCWtK07o7UeB2qe
 6WFWcaigpeAt5wfrhPREemEItF/K/KBjSYCucskPRKnMIV38dFvjHGjFxP4zHcJX
 wAmnZTYWKIKrrUxMq0jZqglSSIvs8kCjJbz6a0JW7MKjZQGH3m4aekUbMNQ9h74r
 PukXi9BwogxLDsM62y+UNNFg/McRxzYleTAC52J+gaKyBZ3npk+RITrrAIpH8Zdt
 qT6fAkgFOfXzVS6N50TNoZSH4c59x5PJGRq7q3f5FxyekgedL32KGR0SboSc2UVE
 DTa4Yl/wfG/6CzZfuvJwhWORqev7xHiWJIGpLpbz++mQzxVezuc=
 =cl3e
 -----END PGP SIGNATURE-----

Merge tag 'sunxi-dt-for-6.12-2' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux into soc/dt

Allwinner SoC device tree changes for 6.12 part 2

ARM64 device tree and binding-only changes
- Add system and pin voltage regulator supplies for NanoPi NEO Plus2

* tag 'sunxi-dt-for-6.12-2' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux:
  arm64: dts: allwinner: h5: NanoPi NEO Plus2: Use regulators for pio
  arm64: dts: allwinner: h5: NanoPi Neo Plus2: Fix regulators

Link: https://lore.kernel.org/r/ZuKmwD8VQrvNx8ir@wens.tw
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-09-12 14:25:36 +00:00
Matthew Auld
94c4aa2661
drm/xe/client: add missing bo locking in show_meminfo()
bo_meminfo() wants to inspect bo state like tt and the ttm resource,
however this state can change at any point leading to stuff like NPD and
UAF, if the bo lock is not held. Grab the bo lock when calling
bo_meminfo(), ensuring we drop any spinlocks first. In the case of
object_idr we now also need to hold a ref.

v2 (MattB)
  - Also add xe_bo_assert_held()

Fixes: 0845233388 ("drm/xe: Implement fdinfo memory stats printing")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240911155527.178910-6-matthew.auld@intel.com
(cherry picked from commit 4f63d712fa104c3ebefcb289d1e733e86d8698c7)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-09-12 10:07:22 -04:00
Matthew Auld
9bd7ff293f
drm/xe/client: fix deadlock in show_meminfo()
There is a real deadlock as well as sleeping in atomic() bug in here, if
the bo put happens to be the last ref, since bo destruction wants to
grab the same spinlock and sleeping locks.  Fix that by dropping the ref
using xe_bo_put_deferred(), and moving the final commit outside of the
lock. Dropping the lock around the put is tricky since the bo can go
out of scope and delete itself from the list, making it difficult to
navigate to the next list entry.

Fixes: 0845233388 ("drm/xe: Implement fdinfo memory stats printing")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2727
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240911155527.178910-5-matthew.auld@intel.com
(cherry picked from commit 0083b8e6f11d7662283a267d4ce7c966812ffd8a)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-09-12 10:07:18 -04:00
Ashutosh Dixit
a262cc8d55
drm/xe/oa: Enable Xe2+ PES disaggregation
Enable Xe2+ PES disaggregation (for OAG) to retrieve disaggregated metrics
when disaggregated data is needed. Userspace can select whether to receive
aggregated or disaggregated metrics via the particular OA configuration it
uses (programmed via DRM_XE_OBSERVATION_OP_ADD_CONFIG).

Bspec: 61101
Fixes: e936f885f1 ("drm/xe/oa/uapi: Expose OA stream fd")
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240909165933.2638765-1-ashutosh.dixit@intel.com
Cc: stable@vger.kernel.org
(cherry picked from commit fb2551a0e93897aec7fb3d4f473ebc06b146d160)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-09-12 10:07:15 -04:00
Jani Nikula
dd10595c32
drm/xe/display: fix compat IS_DISPLAY_STEP() range end
It's supposed to be an open range at the end like in i915. Fingers
crossed that nobody relies on this definition.

Fixes: 44e694958b ("drm/xe/display: Implement display support")
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Acked-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/fe8743770694e429f6902491cdb306c97bdf701a.1724180287.git.jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
(cherry picked from commit 453afb1a43)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-09-12 10:06:55 -04:00
Nirmoy Das
062d59eb96
drm/xe: Fix access_ok check in user_fence_create
Check size of the data not size of the pointer.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407300421.IBkAja96-lkp@intel.com/
Fixes: ddeb7989a9 ("drm/xe: Validate user fence during creation")
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Reviewed-by: Apoorva Singh <apoorva.singh@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240806110722.28661-1-nirmoy.das@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
(cherry picked from commit e102b5ed6e)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-09-12 10:06:30 -04:00
Matthew Brost
5e2d1d4dc1
drm/xe: Fix possible UAF in guc_exec_queue_process_msg
Store xe_device ahead of processing message as message can be free'd in
some cases.

v2:
 - Including missing local changes
v3:
 - Resend for CI

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202407231445.rpisd1vA-lkp@intel.com/
Fixes: 55ea73aacf ("drm/xe: Build PM into GuC CT layer")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240724164341.1848954-1-matthew.brost@intel.com
(cherry picked from commit 1a394b4f50)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-09-12 10:06:22 -04:00
Matthew Brost
572239f7f1
drm/xe: Remove fence check from send_tlb_invalidation
'fence' argument in send_tlb_invalidation cannot be NULL, remove
non-NULL check from send_tlb_invalidation.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202407231049.esig0Fkb-lkp@intel.com/
Fixes: 58bfe66744 ("drm/xe: Drop xe_gt_tlb_invalidation_wait")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240723190714.1744653-1-matthew.brost@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
(cherry picked from commit 6482253e6e)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-09-12 10:06:16 -04:00
Lucas De Marchi
a2655358cb
drm/xe/gt: Remove double include
The header generated/xe_wa_oob.h is included twice. Remove one.

Fixes: 27cb2b7fec ("drm/xe/bmg: implement Wa_16023588340")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202407052122.AzuWSPuo-lkp@intel.com/
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240708173301.1543871-1-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 3d122660dc)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-09-12 10:06:02 -04:00
Lorenzo Bianconi
3e705251d9 net: netfilter: move nf flowtable bpf initialization in nf_flow_table_module_init()
Move nf flowtable bpf initialization in nf_flow_table module load
routine since nf_flow_table_bpf is part of nf_flow_table module and not
nf_flow_table_inet one. This patch allows to avoid the following kernel
warning running the reproducer below:

$modprobe nf_flow_table_inet
$rmmod nf_flow_table_inet
$modprobe nf_flow_table_inet
modprobe: ERROR: could not insert 'nf_flow_table_inet': Invalid argument

[  184.081501] ------------[ cut here ]------------
[  184.081527] WARNING: CPU: 0 PID: 1362 at kernel/bpf/btf.c:8206 btf_populate_kfunc_set+0x23c/0x330
[  184.081550] CPU: 0 UID: 0 PID: 1362 Comm: modprobe Kdump: loaded Not tainted 6.11.0-0.rc5.22.el10.x86_64 #1
[  184.081553] Hardware name: Red Hat OpenStack Compute, BIOS 1.14.0-1.module+el8.4.0+8855+a9e237a9 04/01/2014
[  184.081554] RIP: 0010:btf_populate_kfunc_set+0x23c/0x330
[  184.081558] RSP: 0018:ff22cfb38071fc90 EFLAGS: 00010202
[  184.081559] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
[  184.081560] RDX: 000000000000006e RSI: ffffffff95c00000 RDI: ff13805543436350
[  184.081561] RBP: ffffffffc0e22180 R08: ff13805543410808 R09: 000000000001ec00
[  184.081562] R10: ff13805541c8113c R11: 0000000000000010 R12: ff13805541b83c00
[  184.081563] R13: ff13805543410800 R14: 0000000000000001 R15: ffffffffc0e2259a
[  184.081564] FS:  00007fa436c46740(0000) GS:ff1380557ba00000(0000) knlGS:0000000000000000
[  184.081569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  184.081570] CR2: 000055e7b3187000 CR3: 0000000100c48003 CR4: 0000000000771ef0
[  184.081571] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  184.081572] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  184.081572] PKRU: 55555554
[  184.081574] Call Trace:
[  184.081575]  <TASK>
[  184.081578]  ? show_trace_log_lvl+0x1b0/0x2f0
[  184.081580]  ? show_trace_log_lvl+0x1b0/0x2f0
[  184.081582]  ? __register_btf_kfunc_id_set+0x199/0x200
[  184.081585]  ? btf_populate_kfunc_set+0x23c/0x330
[  184.081586]  ? __warn.cold+0x93/0xed
[  184.081590]  ? btf_populate_kfunc_set+0x23c/0x330
[  184.081592]  ? report_bug+0xff/0x140
[  184.081594]  ? handle_bug+0x3a/0x70
[  184.081596]  ? exc_invalid_op+0x17/0x70
[  184.081597]  ? asm_exc_invalid_op+0x1a/0x20
[  184.081601]  ? btf_populate_kfunc_set+0x23c/0x330
[  184.081602]  __register_btf_kfunc_id_set+0x199/0x200
[  184.081605]  ? __pfx_nf_flow_inet_module_init+0x10/0x10 [nf_flow_table_inet]
[  184.081607]  do_one_initcall+0x58/0x300
[  184.081611]  do_init_module+0x60/0x230
[  184.081614]  __do_sys_init_module+0x17a/0x1b0
[  184.081617]  do_syscall_64+0x7d/0x160
[  184.081620]  ? __count_memcg_events+0x58/0xf0
[  184.081623]  ? handle_mm_fault+0x234/0x350
[  184.081626]  ? do_user_addr_fault+0x347/0x640
[  184.081630]  ? clear_bhb_loop+0x25/0x80
[  184.081633]  ? clear_bhb_loop+0x25/0x80
[  184.081634]  ? clear_bhb_loop+0x25/0x80
[  184.081637]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  184.081639] RIP: 0033:0x7fa43652e4ce
[  184.081647] RSP: 002b:00007ffe8213be18 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[  184.081649] RAX: ffffffffffffffda RBX: 000055e7b3176c20 RCX: 00007fa43652e4ce
[  184.081650] RDX: 000055e7737fde79 RSI: 0000000000003990 RDI: 000055e7b3185380
[  184.081651] RBP: 000055e7737fde79 R08: 0000000000000007 R09: 000055e7b3179bd0
[  184.081651] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000040000
[  184.081652] R13: 000055e7b3176fa0 R14: 0000000000000000 R15: 000055e7b3179b80

Fixes: 391bb6594f ("netfilter: Add bpf_xdp_flow_lookup kfunc")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://patch.msgid.link/20240911-nf-flowtable-bpf-modprob-fix-v1-1-f9fc075aafc3@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-09-12 15:41:03 +02:00
Paolo Abeni
8700970971 netfilter pull request 24-09-12
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmbiF/oACgkQ1V2XiooU
 IOQRQBAAoeOp8RlVU4XTm2++A+jwZODeeKoDcnyrdXWFZFUmORHLuUmIzVUDMJyB
 rp9/Jnw/J8aS5pT4Bx79cWuPlZ9UylSsPr8lt7oJ3NNxRwzbdQjX97wKhONAYUGZ
 j4Jnd1ObWj5uDHvI4hbMoqZlqzk64Cgdw6tMgYDxmjTnUuJbJztNL6QkUWvZz4mR
 0gY3SluDadKc3dLqoFDefi7ZLidn5Fc3W99JZu295y40pe3qzWNhFPhQPC1s+1T6
 9svzC95cIN1JncEqnuplcZQJEylRikOH2W6sH1SFflvI4QhiUXnOqluwjVGiDuxm
 nPsoXZx3/uqlWwNKEMn/WwjTg9bqUnTzaT8M6RlZKimwmo+FKIgOQb6QjCgaBMWm
 kbp2XWM/rmsDhjM58asCVgw7Bcftw6mrh8qt5fq2gxGnyE6G+3M9WDR8F9VmhxME
 AtjgNqogrFYJkMSjVBvrqIr6CSzad3GgG+upWCbArAAyIZvAJdX54DSvXxgiv7bS
 CV5J03/zIMohiLAwiGZC5q0IPNaHN4hUnBSTRhWSsQpnjg00dUeplYLlOfIRvgJO
 uPKhzpYBzo+E4CzZLWeeNC+ArWT7iIO4NOYLxxuD1Olm71LNNHMlt2XFCJWV9oJl
 UsB9QKWDh3MwFmbgyzxf/0QalGawGL/MVWa8MGmRKcUsxM9uXuU=
 =Oc51
 -----END PGP SIGNATURE-----

Merge tag 'nf-24-09-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following batch contains two fixes from Florian Westphal:

Patch #1 fixes a sk refcount leak in nft_socket on mismatch.

Patch #2 fixes cgroupsv2 matching from containers due to incorrect
	 level in subtree.

netfilter pull request 24-09-12

* tag 'nf-24-09-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_socket: make cgroupsv2 matching work with namespaces
  netfilter: nft_socket: fix sk refcount leaks
====================

Link: https://patch.msgid.link/20240911222520.3606-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-09-12 15:26:19 +02:00
Bibo Mao
cdc118f802 LoongArch: KVM: Enable paravirt feature control from VMM
Export kernel paravirt features to user space, so that VMM can control
each single paravirt feature. By default paravirt features will be the
same with kvm supported features if VMM does not set it.

Also a new feature KVM_FEATURE_VIRT_EXTIOI is added which can be set
from user space. This feature indicates that the virt EIOINTC can route
interrupts to 256 vCPUs, rather than 4 vCPUs like with real HW.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-09-12 20:53:40 +08:00
Song Gao
f4e40ea9f7 LoongArch: KVM: Add PMU support for guest
On LoongArch, the host and guest have their own PMU CSRs registers and
they share PMU hardware resources. A set of PMU CSRs consists of a CTRL
register and a CNTR register. We can set which PMU CSRs are used by the
guest by writing to the GCFG register [24:26] bits.

On KVM side:
- Save the host PMU CSRs into structure kvm_context.
- If the host supports the PMU feature.
  - When entering guest mode, save the host PMU CSRs and restore the guest PMU CSRs.
  - When exiting guest mode, save the guest PMU CSRs and restore the host PMU CSRs.

Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-09-12 20:53:40 +08:00
Philipp Stanner
fc8c818e75 PCI: Fix potential deadlock in pcim_intx()
25216afc9d ("PCI: Add managed pcim_intx()") moved the allocation step for
pci_intx()'s device resource from pcim_enable_device() to pcim_intx(). As
before, pcim_enable_device() sets pci_dev.is_managed to true; and it is
never set to false again.

Due to the lifecycle of a struct pci_dev, it can happen that a second
driver obtains the same pci_dev after a first driver ran.  If one driver
uses pcim_enable_device() and the other doesn't, this causes the other
driver to run into managed pcim_intx(), which will try to allocate when
called for the first time.

Allocations might sleep, so calling pci_intx() while holding spinlocks
becomes then invalid, which causes lockdep warnings and could cause
deadlocks:

  ========================================================
  WARNING: possible irq lock inversion dependency detected
  6.11.0-rc6+ #59 Tainted: G        W
  --------------------------------------------------------
  CPU 0/KVM/1537 just changed the state of lock:
  ffffa0f0cff965f0 (&vdev->irqlock){-...}-{2:2}, at:
  vfio_intx_handler+0x21/0xd0 [vfio_pci_core] but this lock took another,
  HARDIRQ-unsafe lock in the past: (fs_reclaim){+.+.}-{0:0}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:

  Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
			       local_irq_disable();
			       lock(&vdev->irqlock);
			       lock(fs_reclaim);
  <Interrupt>
    lock(&vdev->irqlock);

  *** DEADLOCK ***

Have pcim_enable_device()'s release function, pcim_disable_device(), set
pci_dev.is_managed to false so that subsequent drivers using the same
struct pci_dev do not implicitly run into managed code.

Link: https://lore.kernel.org/r/20240905072556.11375-2-pstanner@redhat.com
Fixes: 25216afc9d ("PCI: Add managed pcim_intx()")
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Closes: https://lore.kernel.org/all/20240903094431.63551744.alex.williamson@redhat.com/
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
2024-09-12 07:52:50 -05:00
Will Deacon
75078ba2b3 Merge branch 'for-next/timers' into for-next/core
* for-next/timers:
  arm64: Implement prctl(PR_{G,S}ET_TSC)
2024-09-12 13:44:03 +01:00
Will Deacon
2ef52ca02c Merge branch 'for-next/selftests' into for-next/core
* for-next/selftests:
  kselftest/arm64: Fix build warnings for ptrace
  kselftest/arm64: Actually test SME vector length changes via sigreturn
  kselftest/arm64: signal: fix/refactor SVE vector length enumeration
2024-09-12 13:43:57 +01:00
Will Deacon
982a847c71 Merge branch 'for-next/poe' into for-next/core
* for-next/poe: (31 commits)
  arm64: pkeys: remove redundant WARN
  kselftest/arm64: Add test case for POR_EL0 signal frame records
  kselftest/arm64: parse POE_MAGIC in a signal frame
  kselftest/arm64: add HWCAP test for FEAT_S1POE
  selftests: mm: make protection_keys test work on arm64
  selftests: mm: move fpregs printing
  kselftest/arm64: move get_header()
  arm64: add Permission Overlay Extension Kconfig
  arm64: enable PKEY support for CPUs with S1POE
  arm64: enable POE and PIE to coexist
  arm64/ptrace: add support for FEAT_POE
  arm64: add POE signal support
  arm64: implement PKEYS support
  arm64: add pte_access_permitted_no_overlay()
  arm64: handle PKEY/POE faults
  arm64: mask out POIndex when modifying a PTE
  arm64: convert protection key into vm_flags and pgprot values
  arm64: add POIndex defines
  arm64: re-order MTE VM_ flags
  arm64: enable the Permission Overlay Extension for EL0
  ...
2024-09-12 13:43:41 +01:00
Will Deacon
3175e051c3 Merge branch 'for-next/pkvm-guest' into for-next/core
* for-next/pkvm-guest:
  arm64: smccc: Reserve block of KVM "vendor" services for pKVM hypercalls
  drivers/virt: pkvm: Intercept ioremap using pKVM MMIO_GUARD hypercall
  arm64: mm: Add confidential computing hook to ioremap_prot()
  drivers/virt: pkvm: Hook up mem_encrypt API using pKVM hypercalls
  arm64: mm: Add top-level dispatcher for internal mem_encrypt API
  drivers/virt: pkvm: Add initial support for running as a protected guest
  firmware/smccc: Call arch-specific hook on discovering KVM services
2024-09-12 13:43:22 +01:00
Will Deacon
119e3eef32 Merge branch 'for-next/perf' into for-next/core
* for-next/perf: (33 commits)
  perf: arm-ni: Fix an NULL vs IS_ERR() bug
  perf: arm_pmuv3: Use BR_RETIRED for HW branch event if enabled
  MAINTAINERS: List Arm interconnect PMUs as supported
  perf: Add driver for Arm NI-700 interconnect PMU
  dt-bindings/perf: Add Arm NI-700 PMU
  perf/arm-cmn: Improve format attr printing
  perf/arm-cmn: Clean up unnecessary NUMA_NO_NODE check
  perf/arm-cmn: Support CMN S3
  dt-bindings: perf: arm-cmn: Add CMN S3
  perf/arm-cmn: Refactor DTC PMU register access
  perf/arm-cmn: Make cycle counts less surprising
  perf/arm-cmn: Improve build-time assertion
  perf/arm-cmn: Ensure dtm_idx is big enough
  perf/arm-cmn: Fix CCLA register offset
  perf/arm-cmn: Refactor node ID handling. Again.
  drivers/perf: hisi_pcie: Export supported Root Ports [bdf_min, bdf_max]
  drivers/perf: hisi_pcie: Fix TLP headers bandwidth counting
  drivers/perf: hisi_pcie: Record hardware counts correctly
  drivers/perf: arm_spe: Use perf_allow_kernel() for permissions
  perf/dwc_pcie: Add support for QCOM vendor devices
  ...
2024-09-12 13:43:16 +01:00
Will Deacon
c2c9402369 Merge branch 'for-next/mm' into for-next/core
* for-next/mm:
  arm64/mm: use lm_alias() with addresses passed to memblock_free()
  mm: arm64: document why pte is not advanced in contpte_ptep_set_access_flags()
  arm64: Expose the end of the linear map in PHYSMEM_END
  arm64: trans_pgd: mark PTEs entries as valid to avoid dead kexec()
  arm64/mm: Delete __init region from memblock.reserved
2024-09-12 13:43:08 +01:00
Will Deacon
f661eb5f8d Merge branch 'for-next/misc' into for-next/core
* for-next/misc:
  arm64: hibernate: Fix warning for cast from restricted gfp_t
  arm64: esr: Define ESR_ELx_EC_* constants as UL
  arm64: Constify struct kobj_type
  arm64: smp: smp_send_stop() and crash_smp_send_stop() should try non-NMI first
  arm64/sve: Remove unused declaration read_smcr_features()
  arm64: mm: Remove unused declaration early_io_map()
  arm64: el2_setup.h: Rename some labels to be more diff-friendly
  arm64: signal: Fix some under-bracketed UAPI macros
  arm64/mm: Drop TCR_SMP_FLAGS
  arm64/mm: Drop PMD_SECT_VALID
2024-09-12 13:42:57 +01:00
Will Deacon
dd22f44485 Merge branch 'for-next/errata' into for-next/core
* for-next/errata:
  arm64: errata: Enable the AC03_CPU_38 workaround for ampere1a
2024-09-12 13:42:50 +01:00
Will Deacon
d2ea63804b Merge branch 'for-next/acpi' into for-next/core
* for-next/acpi:
  ACPI/IORT: Add PMCG platform information for HiSilicon HIP10/11
  ACPI: ARM64: add acpi_iort.h to MAINTAINERS
  ACPI/IORT: Switch to use kmemdup_array()
2024-09-12 13:42:42 +01:00
Gao Xiang
2349d2fa02 erofs: sunset unneeded NOFAILs
With iterative development, our codebase can now deal with compressed
buffer misses properly if both in-place I/O and compressed buffer
allocation fail.

Note that if readahead fails (with non-uptodate folios), the original
request will then fall back to synchronous read, and `.read_folio()`
should return appropriate errnos; otherwise -EIO will be passed to
user space, which is unexpected.

To simplify rarely encountered failure paths, a mimic decompression
will be just used.  Before that, failure reasons are recorded in
compressed_bvecs[] and they also act as placeholders to avoid in-place
pages.  They will be parsed just before decompression and then pass
back to `.read_folio()`.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240905084732.2684515-1-hsiangkao@linux.alibaba.com
2024-09-12 20:26:43 +08:00
Dan Carpenter
2e091a805f perf: arm-ni: Fix an NULL vs IS_ERR() bug
The devm_ioremap() function never returns error pointers, it returns a
NULL pointer if there is an error.

Fixes: 4d5a7680f2 ("perf: Add driver for Arm NI-700 interconnect PMU")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/04d6ccc3-6d31-4f0f-ab0f-7a88342cc09a@stanley.mountain
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-12 12:49:33 +01:00
Min-Hua Chen
ecdd16df45 arm64: hibernate: Fix warning for cast from restricted gfp_t
This patch fixes the following warning by adding __force
to the cast:
arch/arm64/kernel/hibernate.c:410:44: sparse: warning: cast from restricted gfp_t

No functional change intended.

Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Link: https://lore.kernel.org/r/20240910232507.313555-1-minhuadotchen@gmail.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-12 12:48:58 +01:00
Jinjie Ruan
07f1eb718d
spi: geni-qcom: Use devm functions to simplify code
Use devm_pm_runtime_enable(), devm_request_irq() and
devm_spi_register_controller() to simplify code.

And also register a callback spi_geni_release_dma_chan() with
devm_add_action_or_reset(), to release dma channel in both error
and device detach path, which can make sure the release sequence is
consistent with the original one.

1. Unregister spi controller.
2. Free the IRQ.
3. Free DMA chans
4. Disable runtime PM.

So the remove function can also be removed.

Reviewed-by: Douglas Anderson <dianders@chromium.org>
Suggested-by: Doug Anderson <dianders@chromium.org>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://patch.msgid.link/20240912091701.3720857-1-ruanjinjie@huawei.com
Signed-off-by: Mark Brown <broonie@kernel.org>
2024-09-12 12:39:04 +01:00
Mark Brown
f10d52087c
spi: Merge up fixes
A patch for Qualcomm depends on some fixes.
2024-09-12 12:38:44 +01:00
Dennis Lam
4b40d43d9f
docs: filesystems: corrected grammar of netfs page
Fixed the word "aren't" to "isn't" based on singular word "bufferage".

Signed-off-by: Dennis Lam <dennis.lamerice@gmail.com>
Link: https://lore.kernel.org/r/20240912012550.13748-2-dennis.lamerice@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:43 +02:00
Christian Brauner
3956e7284c
Merge branch 'netfs-writeback' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs into vfs.netfs
Merge patch series "netfs: Read/write improvements" from David Howells
<dhowells@redhat.com>.

* 'netfs-writeback' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (25 commits)
  cifs: Don't support ITER_XARRAY
  cifs: Switch crypto buffer to use a folio_queue rather than an xarray
  cifs: Use iterate_and_advance*() routines directly for hashing
  netfs: Cancel dirty folios that have no storage destination
  cachefiles, netfs: Fix write to partial block at EOF
  netfs: Remove fs/netfs/io.c
  netfs: Speed up buffered reading
  afs: Make read subreqs async
  netfs: Simplify the writeback code
  netfs: Provide an iterator-reset function
  netfs: Use new folio_queue data type and iterator instead of xarray iter
  cifs: Provide the capability to extract from ITER_FOLIOQ to RDMA SGEs
  iov_iter: Provide copy_folio_from_iter()
  mm: Define struct folio_queue and ITER_FOLIOQ to handle a sequence of folios
  netfs: Use bh-disabling spinlocks for rreq->lock
  netfs: Set the request work function upon allocation
  netfs: Remove NETFS_COPY_TO_CACHE
  netfs: Reserve netfs_sreq_source 0 as unset/unknown
  netfs: Move max_len/max_nr_segs from netfs_io_subrequest to netfs_io_stream
  netfs, cifs: Move CIFS_INO_MODIFIED_ATTR to netfs_inode
  ...

Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:42 +02:00
David Howells
4aa571d67e
cifs: Don't support ITER_XARRAY
There's now no need to support ITER_XARRAY in cifs as netfslib hands down
ITER_FOLIOQ instead - and that's simpler to use with iterate_and_advance()
as it doesn't hold the RCU read lock over the step function.

This is part of the process of phasing out ITER_XARRAY.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Tom Talpey <tom@talpey.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: linux-cifs@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-26-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:42 +02:00
David Howells
a2906d3316
cifs: Switch crypto buffer to use a folio_queue rather than an xarray
Switch cifs from using an xarray to hold the transport crypto buffer to
using a folio_queue and use ITER_FOLIOQ rather than ITER_XARRAY.

This is part of the process of phasing out ITER_XARRAY.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Tom Talpey <tom@talpey.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: linux-cifs@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-25-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:42 +02:00
David Howells
2982c8c19b
cifs: Use iterate_and_advance*() routines directly for hashing
Replace the bespoke cifs iterators of ITER_BVEC and ITER_KVEC to do hashing
with iterate_and_advance_kernel() - a variant on iterate_and_advance() that
only supports kernel-internal ITER_* types and not UBUF/IOVEC types.

The bespoke ITER_XARRAY is left because we don't really want to be calling
crypto_shash_update() under the RCU read lock for large amounts of data;
besides, ITER_XARRAY is going to be phased out.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Tom Talpey <tom@talpey.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: linux-cifs@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-24-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:42 +02:00
David Howells
8f246b7c0a
netfs: Cancel dirty folios that have no storage destination
Kafs wants to be able to cache the contents of directories (and symlinks),
but whilst these are downloaded from the server with the FS.FetchData RPC
op and similar, the same as for regular files, they can't be updated by
FS.StoreData, but rather have special operations (FS.MakeDir, etc.).

Now, rather than redownloading a directory's content after each change made
to that directory, kafs modifies the local blob.  This blob can be saved
out to the cache, and since it's using netfslib, kafs just marks the folios
dirty and lets ->writepages() on the directory take care of it, as for an
regular file.

This is fine as long as there's a cache as although the upload stream is
disabled, there's a cache stream to drive the procedure.  But if the cache
goes away in the meantime, suddenly there's no way do any writes and the
code gets confused, complains "R=%x: No submit" to dmesg and leaves the
dirty folio hanging.

Fix this by just cancelling the store of the folio if neither stream is
active.  (If there's no cache at the time of dirtying, we should just not
mark the folio dirty).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-23-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:41 +02:00
David Howells
c4f1450ecc
cachefiles, netfs: Fix write to partial block at EOF
Because it uses DIO writes, cachefiles is unable to make a write to the
backing file if that write is not aligned to and sized according to the
backing file's DIO block alignment.  This makes it tricky to handle a write
to the cache where the EOF on the network file is not correctly aligned.

To get around this, netfslib attempts to tell the driver it is calling how
much more data there is available beyond the EOF that it can use to pad the
write (netfslib preclears the part of the folio above the EOF).  However,
it tries to tell the cache what the maximum length is, but doesn't
calculate this correctly; and, in any case, cachefiles actually ignores the
value and just skips the block.

Fix this by:

 (1) Change the value passed to indicate the amount of extra data that can
     be added to the operation (now ->submit_extendable_to).  This is much
     simpler to calculate as it's just the end of the folio minus the top
     of the data within the folio - rather than having to account for data
     spread over multiple folios.

 (2) Make cachefiles add some of this data if the subrequest it is given
     ends at the network file's i_size if the extra data is sufficient to
     pad out to a whole block.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-22-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:41 +02:00
David Howells
86b374d061
netfs: Remove fs/netfs/io.c
Remove fs/netfs/io.c as it is no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-21-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:41 +02:00
David Howells
ee4cdf7ba8
netfs: Speed up buffered reading
Improve the efficiency of buffered reads in a number of ways:

 (1) Overhaul the algorithm in general so that it's a lot more compact and
     split the read submission code between buffered and unbuffered
     versions.  The unbuffered version can be vastly simplified.

 (2) Read-result collection is handed off to a work queue rather than being
     done in the I/O thread.  Multiple subrequests can be processes
     simultaneously.

 (3) When a subrequest is collected, any folios it fully spans are
     collected and "spare" data on either side is donated to either the
     previous or the next subrequest in the sequence.

Notes:

 (*) Readahead expansion is massively slows down fio, presumably because it
     causes a load of extra allocations, both folio and xarray, up front
     before RPC requests can be transmitted.

 (*) RDMA with cifs does appear to work, both with SIW and RXE.

 (*) PG_private_2-based reading and copy-to-cache is split out into its own
     file and altered to use folio_queue.  Note that the copy to the cache
     now creates a new write transaction against the cache and adds the
     folios to be copied into it.  This allows it to use part of the
     writeback I/O code.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-20-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:41 +02:00
David Howells
2e45b92297
afs: Make read subreqs async
Perform AFS read subrequests in a work item rather than in the calling
thread.  For normal buffered reads, this will allow the calling thread to
copy data from the pagecache to the application at the same time as the
demarshalling thread is shovelling data from skbuffs into the pagecache.

This will also allow the RA mark to trigger a new read before we've
finished shovelling the data from the current one.

Note: This would be a bit safer if the FS.FetchData RPC ops returned the
metadata (including the data version number) before returning the data.
This would allow me to flush the pagecache before installing the new data.

In future, it may be possible to asynchronously flush the pagecache either
side of the region being read.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-19-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:40 +02:00
David Howells
983cdcf8fe
netfs: Simplify the writeback code
Use the new folio_queue structures to simplify the writeback code.  The
problem with referring to the i_pages xarray directly is that we may have
gaps in the sequence of folios we're writing from that we need to skip when
we're removing the writeback mark from the folios we're writing back from.

At the moment the code tries to deal with this by carefully tracking the
gaps in each writeback stream (eg. write to server and write to cache) and
divining when there's a gap that spans folios (something that's not helped
by folios not being a consistent size).

Instead, the folio_queue buffer contains pointers only the folios we're
dealing with, has them in ascending order and indicates a gap by placing
non-consequitive folios next to each other.  This makes it possible to
track where we need to clean up to by just keeping track of where we've
processed to on each stream and taking the minimum.

Note that the I/O iterator is always rounded up to the end of the folio,
even if that is beyond the EOF position, so that the cache can do DIO from
the page.  The excess space is cleared, though mmapped writes clobber it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-18-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:40 +02:00
David Howells
bfaa33b8ba
netfs: Provide an iterator-reset function
Provide a function to reset the iterator on a subrequest.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-17-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:40 +02:00
David Howells
cd0277ed0c
netfs: Use new folio_queue data type and iterator instead of xarray iter
Make the netfs write-side routines use the new folio_queue struct to hold a
rolling buffer of folios, with the issuer adding folios at the tail and the
collector removing them from the head as they're processed instead of using
an xarray.

This will allow a subsequent patch to simplify the write collector.

The primary mark (as tested by folioq_is_marked()) is used to note if the
corresponding folio needs putting.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-16-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:40 +02:00
David Howells
c45ebd636c
cifs: Provide the capability to extract from ITER_FOLIOQ to RDMA SGEs
Make smb_extract_iter_to_rdma() extract page fragments from an ITER_FOLIOQ
iterator into RDMA SGEs.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Tom Talpey <tom@talpey.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: linux-cifs@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-15-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:40 +02:00
David Howells
197a3de607
iov_iter: Provide copy_folio_from_iter()
Provide a copy_folio_from_iter() wrapper.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20240814203850.2240469-14-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:39 +02:00
David Howells
db0aa2e956
mm: Define struct folio_queue and ITER_FOLIOQ to handle a sequence of folios
Define a data structure, struct folio_queue, to represent a sequence of
folios and a kernel-internal I/O iterator type, ITER_FOLIOQ, to allow a
list of folio_queue structures to be used to provide a buffer to
iov_iter-taking functions, such as sendmsg and recvmsg.

The folio_queue structure looks like:

	struct folio_queue {
		struct folio_batch	vec;
		u8			orders[PAGEVEC_SIZE];
		struct folio_queue	*next;
		struct folio_queue	*prev;
		unsigned long		marks;
		unsigned long		marks2;
	};

It does not use a list_head so that next and/or prev can be set to NULL at
the ends of the list, allowing iov_iter-handling routines to determine that
they *are* the ends without needing to store a head pointer in the iov_iter
struct.

A folio_batch struct is used to hold the folio pointers which allows the
batch to be passed to batch handling functions.  Two mark bits are
available per slot.  The intention is to use at least one of them to mark
folios that need putting, but that might not be ultimately necessary.
Accessor functions are used to access the slots to do the masking and an
additional accessor function is used to indicate the size of the array.

The order of each folio is also stored in the structure to avoid the need
for iov_iter_advance() and iov_iter_revert() to have to query each folio to
find its size.

With careful barriering, this can be used as an extending buffer with new
folios inserted and new folio_queue structs added without the need for a
lock.  Further, provided we always keep at least one struct in the buffer,
we can also remove consumed folios and consumed structs from the head end
as we without the need for locks.

[Questions/thoughts]

 (1) To manage this, I need a head pointer, a tail pointer, a tail slot
     number (assuming insertion happens at the tail end and the next
     pointers point from head to tail).  Should I put these into a struct
     of their own, say "folio_queue_head" or "rolling_buffer"?

     I will end up with two of these in netfs_io_request eventually, one
     keeping track of the pagecache I'm dealing with for buffered I/O and
     the other to hold a bounce buffer when we need one.

 (2) Should I make the slots {folio,off,len} or bio_vec?

 (3) This is intended to replace ITER_XARRAY eventually.  Using an xarray
     in I/O iteration requires the taking of the RCU read lock, doing
     copying under the RCU read lock, walking the xarray (which may change
     under us), handling retries and dealing with special values.

     The advantage of ITER_XARRAY is that when we're dealing with the
     pagecache directly, we don't need any allocation - but if we're doing
     encrypted comms, there's a good chance we'd be using a bounce buffer
     anyway.

     This will require afs, erofs, cifs, orangefs and fscache to be
     converted to not use this.  afs still uses it for dirs and symlinks;
     some of erofs usages should be easy to change, but there's one which
     won't be so easy; ceph's use via fscache can be fixed by porting ceph
     to netfslib; cifs is using xarray as a bounce buffer - that can be
     moved to use sheaves instead; and orangefs has a similar problem to
     erofs - maybe orangefs could use netfslib?

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Gao Xiang <xiang@kernel.org>
cc: Mike Marshall <hubcap@omnibond.com>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: linux-erofs@lists.ozlabs.org
cc: devel@lists.orangefs.org
Link: https://lore.kernel.org/r/20240814203850.2240469-13-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:21 +02:00
Christian Brauner
2077006d47
uidgid: make sure we fit into one cacheline
When I expanded uidgid mappings I intended for a struct uid_gid_map to
fit into a single cacheline on x86 as they tend to be pretty
performance sensitive (idmapped mounts etc). But a 4 byte hole was added
that brought it over 64 bytes. Fix that by adding the static extent
array and the extent counter into a substruct. C's type punning for
unions guarantees that we can access ->nr_extents even if the last
written to member wasn't within the same object. This is also what we
rely on in struct_group() and friends. This of course relies on
non-strict aliasing which we don't do.

99) If the member used to read the contents of a union object is not the
    same as the member last used to store a value in the object, the
    appropriate part of the object representation of the value is
    reinterpreted as an object representation in the new type as
    described in 6.2.6 (a process sometimes called "type punning").

Link: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf
Link: https://lore.kernel.org/r/20240910-work-uid_gid_map-v1-1-e6bc761363ed@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:16:09 +02:00
Christian Brauner
24a988f75c
Merge patch series "file: remove f_version"
Christian Brauner <brauner@kernel.org> says:

The f_version member in struct file isn't particularly well-defined. It
is mainly used as a cookie to detect concurrent seeks when iterating
directories. But it is also abused by some subsystems for completely
unrelated things.

It is mostly a directory specific thing that doesn't really need to live
in struct file and with its wonky semantics it really lacks a specific
function.

For pipes, f_version is (ab)used to defer poll notifications until a
write has happened. And struct pipe_inode_info is used by multiple
struct files in their ->private_data so there's no chance of pushing
that down into file->private_data without introducing another pointer
indirection.

But this should be a solvable problem. Only regular files with
FMODE_ATOMIC_POS and directories require f_pos_lock. Pipes and other
files don't. So this adds a union into struct file encompassing
f_pos_lock and a pipe specific f_pipe member that pipes can use. This
union of course can be extended to other file types and is similar to
what we do in struct inode already.

* patches from https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-0-6d3e4816aa7b@kernel.org:
  fs: remove f_version
  pipe: use f_pipe
  fs: add f_pipe
  ubifs: store cookie in private data
  ufs: store cookie in private data
  udf: store cookie in private data
  proc: store cookie in private data
  ocfs2: store cookie in private data
  input: remove f_version abuse
  ext4: store cookie in private data
  ext2: store cookie in private data
  affs: store cookie in private data
  fs: add generic_llseek_cookie()
  fs: use must_set_pos()
  fs: add must_set_pos()
  fs: add vfs_setpos_cookie()
  s390: remove unused f_version
  ceph: remove unused f_version
  adi: remove unused f_version
  file: remove pointless comment

Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-0-6d3e4816aa7b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 11:58:46 +02:00