linux/drivers
Dragos Tatulea 73790bdfba vdpa/mlx5: Fix hang when cvq commands are triggered during device unregister
Currently the vdpa device is unregistered after the workqueue that
processes vq commands is disabled. However, the device unregister
process can still send commands to the cvq (a vlan delete for example)
which leads to a hang because the handing workqueue has been disabled
and the command never finishes:

 [ 2263.095764] rcu: INFO: rcu_sched self-detected stall on CPU
 [ 2263.096307] rcu:        9-....: (5250 ticks this GP) idle=dac4/1/0x4000000000000000 softirq=111009/111009 fqs=2544
 [ 2263.097154] rcu:        (t=5251 jiffies g=393549 q=347 ncpus=10)
 [ 2263.097648] CPU: 9 PID: 94300 Comm: kworker/u20:2 Not tainted 6.3.0-rc6_for_upstream_min_debug_2023_04_14_00_02 #1
 [ 2263.098535] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 [ 2263.099481] Workqueue: mlx5_events mlx5_vhca_state_work_handler [mlx5_core]
 [ 2263.100143] RIP: 0010:virtnet_send_command+0x109/0x170
 [ 2263.100621] Code: 1d df f5 ff 85 c0 78 5c 48 8b 7b 08 e8 d0 c5 f5 ff 84 c0 75 11 eb 22 48 8b 7b 08 e8 01 b7 f5 ff 84 c0 75 15 f3 90 48 8b 7b 08 <48> 8d 74 24 04 e8 8d c5 f5 ff 48 85 c0 74 de 48 8b 83 f8 00 00 00
 [ 2263.102148] RSP: 0018:ffff888139cf36e8 EFLAGS: 00000246
 [ 2263.102624] RAX: 0000000000000000 RBX: ffff888166bea940 RCX: 0000000000000001
 [ 2263.103244] RDX: 0000000000000000 RSI: ffff888139cf36ec RDI: ffff888146763800
 [ 2263.103864] RBP: ffff888139cf3710 R08: ffff88810d201000 R09: 0000000000000000
 [ 2263.104473] R10: 0000000000000002 R11: 0000000000000003 R12: 0000000000000002
 [ 2263.105082] R13: 0000000000000002 R14: ffff888114528400 R15: ffff888166bea000
 [ 2263.105689] FS:  0000000000000000(0000) GS:ffff88852cc80000(0000) knlGS:0000000000000000
 [ 2263.106404] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [ 2263.106925] CR2: 00007f31f394b000 CR3: 000000010615b006 CR4: 0000000000370ea0
 [ 2263.107542] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [ 2263.108163] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [ 2263.108769] Call Trace:
 [ 2263.109059]  <TASK>
 [ 2263.109320]  ? check_preempt_wakeup+0x11f/0x230
 [ 2263.109750]  virtnet_vlan_rx_kill_vid+0x5a/0xa0
 [ 2263.110180]  vlan_vid_del+0x9c/0x170
 [ 2263.110546]  vlan_device_event+0x351/0x760 [8021q]
 [ 2263.111004]  raw_notifier_call_chain+0x41/0x60
 [ 2263.111426]  dev_close_many+0xcb/0x120
 [ 2263.111808]  unregister_netdevice_many_notify+0x130/0x770
 [ 2263.112297]  ? wq_worker_running+0xa/0x30
 [ 2263.112688]  unregister_netdevice_queue+0x89/0xc0
 [ 2263.113128]  unregister_netdev+0x18/0x20
 [ 2263.113512]  virtnet_remove+0x4f/0x230
 [ 2263.113885]  virtio_dev_remove+0x31/0x70
 [ 2263.114273]  device_release_driver_internal+0x18f/0x1f0
 [ 2263.114746]  bus_remove_device+0xc6/0x130
 [ 2263.115146]  device_del+0x173/0x3c0
 [ 2263.115502]  ? kernfs_find_ns+0x35/0xd0
 [ 2263.115895]  device_unregister+0x1a/0x60
 [ 2263.116279]  unregister_virtio_device+0x11/0x20
 [ 2263.116706]  device_release_driver_internal+0x18f/0x1f0
 [ 2263.117182]  bus_remove_device+0xc6/0x130
 [ 2263.117576]  device_del+0x173/0x3c0
 [ 2263.117929]  ? vdpa_dev_remove+0x20/0x20 [vdpa]
 [ 2263.118364]  device_unregister+0x1a/0x60
 [ 2263.118752]  mlx5_vdpa_dev_del+0x4c/0x80 [mlx5_vdpa]
 [ 2263.119232]  vdpa_match_remove+0x21/0x30 [vdpa]
 [ 2263.119663]  bus_for_each_dev+0x71/0xc0
 [ 2263.120054]  vdpa_mgmtdev_unregister+0x57/0x70 [vdpa]
 [ 2263.120520]  mlx5v_remove+0x12/0x20 [mlx5_vdpa]
 [ 2263.120953]  auxiliary_bus_remove+0x18/0x30
 [ 2263.121356]  device_release_driver_internal+0x18f/0x1f0
 [ 2263.121830]  bus_remove_device+0xc6/0x130
 [ 2263.122223]  device_del+0x173/0x3c0
 [ 2263.122581]  ? devl_param_driverinit_value_get+0x29/0x90
 [ 2263.123070]  mlx5_rescan_drivers_locked+0xc4/0x2d0 [mlx5_core]
 [ 2263.123633]  mlx5_unregister_device+0x54/0x80 [mlx5_core]
 [ 2263.124169]  mlx5_uninit_one+0x54/0x150 [mlx5_core]
 [ 2263.124656]  mlx5_sf_dev_remove+0x45/0x90 [mlx5_core]
 [ 2263.125153]  auxiliary_bus_remove+0x18/0x30
 [ 2263.125560]  device_release_driver_internal+0x18f/0x1f0
 [ 2263.126052]  bus_remove_device+0xc6/0x130
 [ 2263.126451]  device_del+0x173/0x3c0
 [ 2263.126815]  mlx5_sf_dev_remove+0x39/0xf0 [mlx5_core]
 [ 2263.127318]  mlx5_sf_dev_state_change_handler+0x178/0x270 [mlx5_core]
 [ 2263.127920]  blocking_notifier_call_chain+0x5a/0x80
 [ 2263.128379]  mlx5_vhca_state_work_handler+0x151/0x200 [mlx5_core]
 [ 2263.128951]  process_one_work+0x1bb/0x3c0
 [ 2263.129355]  ? process_one_work+0x3c0/0x3c0
 [ 2263.129766]  worker_thread+0x4d/0x3c0
 [ 2263.130140]  ? process_one_work+0x3c0/0x3c0
 [ 2263.130548]  kthread+0xb9/0xe0
 [ 2263.130895]  ? kthread_complete_and_exit+0x20/0x20
 [ 2263.131349]  ret_from_fork+0x1f/0x30
 [ 2263.131717]  </TASK>

The fix is to disable and destroy the workqueue after the device
unregister. It is expected that vhost will not trigger kicks after
the unregister. But even if it would, the wq is disabled already by
setting the pointer to NULL (done so in the referenced commit).

Fixes: ad6dc1daaf ("vdpa/mlx5: Avoid processing works if workqueue was destroyed")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20230516095800.3549932-1-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jason Wang <jasowang@redhat.com>
2023-06-08 15:43:08 -04:00
..
accel accel/qaic: Fix NNC message corruption 2023-05-23 09:51:38 -06:00
accessibility braille_console: remove MODULE_LICENSE in non-modules 2023-04-13 13:13:53 -07:00
acpi First batch of EFI fixes for v6.4: 2023-06-01 20:43:11 -04:00
amba ARM: tegra: remove MODULE_LICENSE in non-modules 2023-04-13 13:13:50 -07:00
android binder: fix UAF of alloc->vma in race with munmap() 2023-05-20 17:56:23 +01:00
ata ata: libata-scsi: Use correct device no in ata_find_dev() 2023-05-30 08:08:18 +09:00
atm
auxdisplay
base Char/Misc driver fixes for 6.4-rc5 2023-06-04 08:32:30 -04:00
bcma bcma: Add explicit of_device.h include 2023-04-14 15:32:56 +03:00
block xen: branch for v6.4-rc4 2023-05-27 09:42:56 -07:00
bluetooth Bluetooth: btnxpuart: Fix compiler warnings 2023-05-19 15:38:29 -07:00
bus modules-6.4-rc1 2023-04-27 16:36:55 -07:00
cdrom
cdx cdx: fix build failure due to sysfs 'bus_type' argument needing to be const 2023-04-27 16:21:32 -07:00
char tpm, tpm_tis: correct tpm_tis_flags enumeration values 2023-06-02 17:35:22 -04:00
clk A couple more patches that would be good to get into -rc1. 2023-05-07 10:31:45 -07:00
clocksource Timekeeping and clocksource/event driver updates the second batch: 2023-04-29 10:24:30 -07:00
comedi
connector
counter - New Drivers 2023-05-02 10:41:31 -07:00
cpufreq cpufreq: amd-pstate: Update policy->cur in amd_pstate_adjust_perf() 2023-05-25 19:35:13 +02:00
cpuidle RISC-V: Align SBI probe implementation with spec 2023-04-29 13:04:50 -07:00
crypto This push fixes the following problems: 2023-05-07 10:57:14 -07:00
cxl cxl: Explicitly initialize resources when media is not ready 2023-05-26 13:34:39 -07:00
dax
dca Mainly singleton patches all over the place. Series of note are: 2023-04-27 19:57:00 -07:00
devfreq Driver core changes for 6.4-rc1 2023-04-27 11:53:57 -07:00
dio
dma dmaengine: at_hdmac: Extend the Flow Controller bitfield to three bits 2023-05-24 11:20:28 +05:30
dma-buf - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of 2023-04-27 19:42:02 -07:00
edac Driver core changes for 6.4-rc1 2023-04-27 11:53:57 -07:00
eisa
extcon
firewire firewire: net: fix unexpected release of object for asynchronous request packet 2023-05-11 09:06:49 +09:00
firmware First batch of EFI fixes for v6.4: 2023-06-01 20:43:11 -04:00
fpga Char/Misc drivers for 6.4-rc1 2023-04-27 12:07:50 -07:00
fsi
gnss
gpio gpio-f7188x: fix chip name and pin count on Nuvoton chip 2023-05-23 10:47:41 +02:00
gpu Merge tag 'drm-intel-fixes-2023-06-01' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes 2023-06-02 10:33:29 +10:00
greybus
hid for-linus-2023060101 2023-06-01 09:02:04 -04:00
hsi
hte Devicetree updates for v6.4, part 2: 2023-04-27 10:09:05 -07:00
hv hyperv-next for v6.4 2023-04-27 17:17:12 -07:00
hwmon hwmon: (k10temp) Add PCI ID for family 19, model 78h 2023-05-08 11:36:19 +02:00
hwspinlock hwspinlock: remove MODULE_LICENSE in non-modules 2023-04-13 13:13:52 -07:00
hwtracing coresight: perf: Release Coresight path when alloc trace id failed 2023-05-11 11:18:21 +01:00
i2c i2c: gxp: fix build failure without CONFIG_I2C_SLAVE 2023-05-03 17:27:29 +02:00
i3c i3c: ast2600: set variable ast2600_i3c_ops storage-class-specifier to static 2023-04-30 23:50:26 +02:00
idle intel_idle: mark few variables as __read_mostly 2023-04-27 19:37:36 +02:00
iio iio: imu: inv_icm42600: fix timestamp reset 2023-05-20 17:33:14 +01:00
infiniband RDMA/irdma: Fix Local Invalidate fencing 2023-05-29 14:06:29 -03:00
input Input updates for 6.4 merge window: 2023-05-01 17:18:56 -07:00
interconnect modules-6.4-rc1 2023-04-27 16:36:55 -07:00
iommu iommu/mediatek: Flush IOTLB completely only if domain has been attached 2023-06-01 11:50:13 +02:00
ipack
irqchip irqchip/gic: Correctly validate OF quirk descriptors 2023-05-30 11:01:22 +01:00
isdn Including fixes from netfilter. 2023-05-05 19:12:01 -07:00
leds leds: qcom-lpg: Fix PWM period limits 2023-06-03 17:00:28 +02:00
macintosh powerpc updates for 6.4 2023-04-28 16:24:32 -07:00
mailbox mailbox: mailbox-test: fix a locking issue in mbox_test_message_write() 2023-05-31 13:26:44 -05:00
mcb mcb-lpc: Reallocate memory region to avoid memory overlapping 2023-04-20 14:24:01 +02:00
md md/raid5: fix miscalculation of 'end_sector' in raid5_read_one_chunk() 2023-05-24 10:44:19 -07:00
media media: uvcvideo: Don't expose unsupported formats to userspace 2023-06-02 18:48:02 +01:00
memory ARM: SoC drivers for v6.4 2023-04-25 12:02:16 -07:00
memstick
message Objtool changes for v6.4: 2023-04-28 14:02:54 -07:00
mfd - New Drivers 2023-05-02 10:41:31 -07:00
misc misc: fastrpc: reject new invocations during device removal 2023-05-29 15:09:50 +01:00
mmc mmc: pwrseq: sd8787: Fix WILC CHIP_EN and RESETN toggling order 2023-05-24 14:33:32 +02:00
most
mtd mtd: rawnand: marvell: don't set the NAND frequency select 2023-06-01 18:12:33 +02:00
mux
net mlx5-fixes-2023-05-31 2023-06-01 10:15:43 -07:00
nfc nfcsim.c: Fix error checking for debugfs_create_dir 2023-05-26 12:18:35 +01:00
ntb
nubus
nvdimm
nvme nvme: fix the name of Zone Append for verbose logging 2023-05-31 09:21:26 -07:00
nvmem modules-6.4-rc1 2023-04-27 16:36:55 -07:00
of Devicetree fixes for 6.4, part 1: 2023-05-05 13:27:59 -07:00
opp Devicetree updates for v6.4, part 2: 2023-04-27 10:09:05 -07:00
parisc parisc: Replace regular spinlock with spin_trylock on panic path 2023-05-03 17:43:26 +02:00
parport
pci PCI/DPC: Quirk PIO log size for Intel Ice Lake Root Ports 2023-05-11 17:38:46 -05:00
pcmcia
peci
perf RISC-V: Align SBI probe implementation with spec 2023-04-29 13:04:50 -07:00
phy phy: qcom-snps: correct struct qcom_snps_hsphy kerneldoc 2023-05-16 19:48:55 +05:30
pinctrl Pin control bulk changes for the v6.4 kernel: 2023-05-02 15:40:41 -07:00
platform platform/x86/intel/ifs: Annotate work queue on stack so object debug does not complain 2023-05-23 12:55:16 +02:00
pnp
power power: supply: Fix logic checking if system is running from battery 2023-05-16 23:02:56 +02:00
powercap
pps
ps3
ptp Driver core changes for 6.4-rc1 2023-04-27 11:53:57 -07:00
pwm pwm: Changes for v6.4-rc1 2023-05-03 11:25:01 -07:00
rapidio Mainly singleton patches all over the place. Series of note are: 2023-04-27 19:57:00 -07:00
ras
regulator regulator: mt6359: add read check for PMIC MT6359 2023-05-18 19:24:47 +09:00
remoteproc Mainly singleton patches all over the place. Series of note are: 2023-04-27 19:57:00 -07:00
reset Nothing looks out of the ordinary in this batch of clk driver updates. There 2023-04-29 17:29:39 -07:00
rpmsg Driver core changes for 6.4-rc1 2023-04-27 11:53:57 -07:00
rtc - New Drivers 2023-05-02 10:41:31 -07:00
s390 block-6.4-2023-05-20 2023-05-20 08:48:04 -07:00
sbus Driver core changes for 6.4-rc1 2023-04-27 11:53:57 -07:00
scsi scsi: stex: Fix gcc 13 warnings 2023-05-31 11:36:40 -04:00
sh
siox
slimbus
soc soc: fsl: cpm1: Fix TSA and QMC dependencies in case of COMPILE_TEST 2023-05-30 12:25:25 +01:00
soundwire soundwire: intel_auxdevice: improve pm_prepare step 2023-04-12 15:36:55 +05:30
spi spi: spi-cadence: Interleave write of TX and read of RX FIFO 2023-05-22 11:41:05 +01:00
spmi spmi: Add a check for remove callback when removing a SPMI driver 2023-04-20 14:16:39 +02:00
ssb
staging media: staging: media: imx: initialize hs_settle to avoid warning 2023-06-02 18:45:52 +01:00
target scsi: target: iscsi: Prevent login threads from racing between each other 2023-05-22 16:29:39 -04:00
tc
tee Fixes an uninitialized variable in OP-TEE driver 2023-05-25 17:16:52 +02:00
thermal thermal: intel: int340x: Add new line for UUID display 2023-05-24 19:50:04 +02:00
thunderbolt thunderbolt: Clear registers properly when auto clear isn't in use 2023-05-09 09:39:03 +03:00
tty serial: cpm_uart: Fix a COMPILE_TEST dependency 2023-05-30 12:25:47 +01:00
ufs scsi: ufs: core: Fix MCQ nr_hw_queues 2023-05-16 21:07:26 -04:00
uio
usb usb: typec: tps6598x: Fix broken polling mode after system suspend/resume 2023-05-30 15:29:41 +01:00
vdpa vdpa/mlx5: Fix hang when cvq commands are triggered during device unregister 2023-06-08 15:43:08 -04:00
vfio vfio/type1: check pfn valid before converting to struct page 2023-05-23 14:16:29 -06:00
vhost vhost: use kzalloc() instead of kmalloc() followed by memset() 2023-06-07 16:22:30 -04:00
video fbdev: bw2: Convert to platform remove callback returning void 2023-05-30 18:33:25 +02:00
virt Devicetree updates for v6.4, part 2: 2023-04-27 10:09:05 -07:00
virtio - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of 2023-04-27 19:42:02 -07:00
vlynq
w1 Char/Misc drivers for 6.4-rc1 2023-04-27 12:07:50 -07:00
watchdog linux-watchdog 6.4-rc1 tag 2023-05-04 18:33:56 -07:00
xen xen: branch for v6.4-rc4 2023-05-27 09:42:56 -07:00
zorro
Kconfig
Makefile