All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
@ 2020-08-28 12:22 ` Wen Gong
  0 siblings, 0 replies; 24+ messages in thread
From: Wen Gong @ 2020-08-28 12:22 UTC (permalink / raw)
  To: ath10k; +Cc: linux-wireless, wgong

It happened "Kernel panic - not syncing: hung_task: blocked tasks" when
test simulate crash and ifconfig down/rmmod meanwhile.

Test steps:

1.Test commands, either can reproduce the hang for PCIe, SDIO and SNOC.
echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down
echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio
echo hw-restart > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_pci

2. dmesg:
[ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash
[ 5622.655995] ieee80211 phy0: Hardware restart was requested
[ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds.
[ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds.
[ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks
[ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G        W         4.19.86 #137
[ 5776.359846] Hardware name: MediaTek krane sku176 board (DT)
[ 5776.359855] Call trace:
[ 5776.359868]  dump_backtrace+0x0/0x170
[ 5776.359881]  show_stack+0x20/0x2c
[ 5776.359896]  dump_stack+0xd4/0x10c
[ 5776.359916]  panic+0x12c/0x29c
[ 5776.359937]  hung_task_panic+0x0/0x50
[ 5776.359953]  kthread+0x120/0x130
[ 5776.359965]  ret_from_fork+0x10/0x18
[ 5776.359986] SMP: stopping secondary CPUs
[ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000
[ 5776.360026] CPU features: 0x0,2188200c
[ 5776.360035] Memory Limit: none

command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked
callstack of ifconfig:
[<0>] __switch_to+0x120/0x13c
[<0>] msleep+0x28/0x38
[<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio]
[<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
[<0>] ath10k_halt+0x120/0x178 [ath10k_core]
[<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
[<0>] drv_stop+0xe0/0x1e4 [mac80211]
[<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
[<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
[<0>] ieee80211_stop+0x20/0x30 [mac80211]
[<0>] __dev_close_many+0xb8/0x11c
[<0>] __dev_change_flags+0xe0/0x1d0
[<0>] dev_change_flags+0x30/0x6c
[<0>] devinet_ioctl+0x370/0x564
[<0>] inet_ioctl+0xdc/0x304
[<0>] sock_do_ioctl+0x50/0x288
[<0>] compat_sock_ioctl+0x1b4/0x1aac
[<0>] __se_compat_sys_ioctl+0x100/0x26fc
[<0>] __arm64_compat_sys_ioctl+0x20/0x2c
[<0>] el0_svc_common+0xa4/0x154
[<0>] el0_svc_compat_handler+0x2c/0x38
[<0>] el0_svc_compat+0x8/0x18
[<0>] 0xffffffffffffffff

callstack of rmmod:
[<0>] __switch_to+0x120/0x13c
[<0>] msleep+0x28/0x38
[<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio]
[<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
[<0>] ath10k_halt+0x120/0x178 [ath10k_core]
[<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
[<0>] drv_stop+0xe0/0x1e4 [mac80211]
[<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
[<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
[<0>] ieee80211_stop+0x20/0x30 [mac80211]
[<0>] __dev_close_many+0xb8/0x11c
[<0>] dev_close_many+0x70/0x100
[<0>] dev_close+0x4c/0x80
[<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211]
[<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211]
[<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211]
[<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
[<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core]
[<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio]
[<0>] sdio_bus_remove+0x48/0x108
[<0>] device_release_driver_internal+0x138/0x1ec
[<0>] driver_detach+0x6c/0xa8
[<0>] bus_remove_driver+0x78/0xa8
[<0>] driver_unregister+0x30/0x50
[<0>] sdio_unregister_driver+0x28/0x34
[<0>] cleanup_module+0x14/0x6bc [ath10k_sdio]
[<0>] __arm64_sys_delete_module+0x1e0/0x22c
[<0>] el0_svc_common+0xa4/0x154
[<0>] el0_svc_compat_handler+0x2c/0x38
[<0>] el0_svc_compat+0x8/0x18
[<0>] 0xffffffffffffffff

SNOC:
[  647.156863] Call trace:
[  647.162166] [<ffffff80080855a4>] __switch_to+0x120/0x13c
[  647.164512] [<ffffff800899d8b8>] __schedule+0x5ec/0x798
[  647.170062] [<ffffff800899dad8>] schedule+0x74/0x94
[  647.175050] [<ffffff80089a0848>] schedule_timeout+0x314/0x42c
[  647.179874] [<ffffff80089a0a14>] schedule_timeout_uninterruptible+0x34/0x40
[  647.185780] [<ffffff80082a494>] msleep+0x28/0x38
[  647.192546] [<ffffff800117ec4c>] ath10k_snoc_hif_stop+0x4c/0x1e0 [ath10k_snoc]
[  647.197439] [<ffffff80010dfbd8>] ath10k_core_stop+0x50/0x7c [ath10k_core]
[  647.204652] [<ffffff80010c8f48>] ath10k_halt+0x114/0x16c [ath10k_core]
[  647.211420] [<ffffff80010cad68>] ath10k_stop+0x4c/0x88 [ath10k_core]
[  647.217865] [<ffffff8000fdbf54>] drv_stop+0x110/0x244 [mac80211]
[  647.224367] [<ffffff80010147ac>] ieee80211_stop_device+0x48/0x54 [mac80211]
[  647.230359] [<ffffff8000ff3eec>] ieee80211_do_stop+0x6a4/0x73c [mac80211]
[  647.237033] [<ffffff8000ff4500>] ieee80211_stop+0x20/0x30 [mac80211]
[  647.243942] [<ffffff80087e39b8>] __dev_close_many+0xa0/0xfc
[  647.250435] [<ffffff80087e3888>] dev_close_many+0x70/0x100
[  647.255651] [<ffffff80087e3a60>] dev_close+0x4c/0x80
[  647.261244] [<ffffff8000f1ba54>] cfg80211_shutdown_all_interfaces+0x44/0xcc [cfg80211]
[  647.266383] [<ffffff8000ff3fdc>] ieee80211_remove_interfaces+0x58/0x1b4 [mac80211]
[  647.274128] [<ffffff8000fda540>] ieee80211_unregister_hw+0x50/0x120 [mac80211]
[  647.281659] [<ffffff80010ca314>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
[  647.288839] [<ffffff80010dfc94>] ath10k_core_unregister+0x48/0x90 [ath10k_core]
[  647.296027] [<ffffff800117e598>] ath10k_snoc_remove+0x5c/0x150 [ath10k_snoc]
[  647.303229] [<ffffff80085625fc>] platform_drv_remove+0x28/0x50
[  647.310517] [<ffffff80085601a4>] device_release_driver_internal+0x114/0x1b8
[  647.316257] [<ffffff80085602e4>] driver_detach+0x6c/0xa8
[  647.323021] [<ffffff800855e5b8>] bus_remove_driver+0x78/0xa8
[  647.328571] [<ffffff800856107c>] driver_unregister+0x30/0x50
[  647.334213] [<ffffff8008562674>] platform_driver_unregister+0x1c/0x28
[  647.339876] [<ffffff800117fefc>] cleanup_module+0x1c/0x120 [ath10k_snoc]
[  647.346196] [<ffffff8008143ab8>] SyS_delete_module+0x1dc/0x22c

PCIe:
[  615.392770] rmmod           D    0  3523   3458 0x00000080
[  615.392777] Call Trace:
[  615.392784]  __schedule+0x617/0x7d3
[  615.392791]  ? __mod_timer+0x263/0x35c
[  615.392797]  schedule+0x62/0x72
[  615.392803]  schedule_timeout+0x8d/0xf3
[  615.392809]  ? run_local_timers+0x6b/0x6b
[  615.392814]  msleep+0x1b/0x22
[  615.392824]  ath10k_pci_hif_stop+0x68/0xd6 [ath10k_pci]
[  615.392844]  ath10k_core_stop+0x44/0x67 [ath10k_core]
[  615.392859]  ath10k_halt+0x102/0x153 [ath10k_core]
[  615.392873]  ath10k_stop+0x38/0x75 [ath10k_core]
[  615.392893]  drv_stop+0x9a/0x13c [mac80211]
[  615.392915]  ieee80211_do_stop+0x772/0x7cd [mac80211]
[  615.392937]  ieee80211_stop+0x1a/0x1e [mac80211]
[  615.392945]  __dev_close_many+0x9e/0xf0
[  615.392952]  dev_close_many+0x62/0xe8
[  615.392958]  dev_close+0x54/0x7d
[  615.392975]  cfg80211_shutdown_all_interfaces+0x6e/0xa5 [cfg80211]
[  615.393021]  ieee80211_remove_interfaces+0x52/0x1aa [mac80211]
[  615.393049]  ieee80211_unregister_hw+0x54/0x136 [mac80211]
[  615.393068]  ath10k_mac_unregister+0x19/0x4a [ath10k_core]
[  615.393091]  ath10k_core_unregister+0x39/0x7e [ath10k_core]
[  615.393104]  ath10k_pci_remove+0x3d/0x7f [ath10k_pci]
[  615.393117]  pci_device_remove+0x41/0xa6
[  615.393129]  device_release_driver_internal+0x123/0x1ec
[  615.393140]  driver_detach+0x60/0x90
[  615.393152]  bus_remove_driver+0x72/0x9f
[  615.393164]  pci_unregister_driver+0x1e/0x87
[  615.393177]  SyS_delete_module+0x1d7/0x277
[  615.393188]  do_syscall_64+0x6b/0xf7
[  615.393199]  entry_SYSCALL_64_after_hwframe+0x41/0xa6

The test command run simulate_fw_crash firstly and it call into
ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable
is called and bit NAPI_STATE_SCHED is set. After that, function
ath10k_sdio_hif_stop is called again from ath10k_stop by command
"ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked.

It is blocked by napi_synchronize, napi_disable will set bit with
NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop
becuase bit NAPI_STATE_SCHED is set by napi_disable.

function of napi_synchronize
static inline void napi_synchronize(const struct napi_struct *n)
{
	if (IS_ENABLED(CONFIG_SMP))
		while (test_bit(NAPI_STATE_SCHED, &n->state))
			msleep(1);
	else
		barrier();
}

function of napi_disable
void napi_disable(struct napi_struct *n)
{
	might_sleep();
	set_bit(NAPI_STATE_DISABLE, &n->state);

	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
		msleep(1);
	while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
		msleep(1);

	hrtimer_cancel(&n->timer);

	clear_bit(NAPI_STATE_DISABLE, &n->state);
}

Add flag for it avoid the hang and crash.

Tested-on: QCA6174 hw3.2 SDIO WLAN.RMH.4.4.1-00049
Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00110-QCARMSWP-1
Tested-on: WCN3990 hw1.0 SNOC hw1.0 WLAN.HL.3.1-01307.1-QCAHLSWMTPL-2

Signed-off-by: Wen Gong <wgong@codeaurora.org>
---
v3: change all bus type

v2: change to use flag 
 drivers/net/wireless/ath/ath10k/ahb.c  |  5 ++---
 drivers/net/wireless/ath/ath10k/core.c | 20 ++++++++++++++++++++
 drivers/net/wireless/ath/ath10k/core.h |  3 +++
 drivers/net/wireless/ath/ath10k/pci.c  |  7 ++++---
 drivers/net/wireless/ath/ath10k/sdio.c |  5 ++---
 drivers/net/wireless/ath/ath10k/snoc.c |  5 ++---
 6 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/ahb.c b/drivers/net/wireless/ath/ath10k/ahb.c
index 05a61975c83f..869524852fba 100644
--- a/drivers/net/wireless/ath/ath10k/ahb.c
+++ b/drivers/net/wireless/ath/ath10k/ahb.c
@@ -626,7 +626,7 @@ static int ath10k_ahb_hif_start(struct ath10k *ar)
 {
 	ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot ahb hif start\n");
 
-	napi_enable(&ar->napi);
+	ath10k_core_napi_enable(ar);
 	ath10k_ce_enable_interrupts(ar);
 	ath10k_pci_enable_legacy_irq(ar);
 
@@ -644,8 +644,7 @@ static void ath10k_ahb_hif_stop(struct ath10k *ar)
 	ath10k_ahb_irq_disable(ar);
 	synchronize_irq(ar_ahb->irq);
 
-	napi_synchronize(&ar->napi);
-	napi_disable(&ar->napi);
+	ath10k_core_napi_sync_disable(ar);
 
 	ath10k_pci_flush(ar);
 }
diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
index cfffd20df0cc..7e52fd14659d 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -2277,6 +2277,26 @@ static int ath10k_init_hw_params(struct ath10k *ar)
 	return 0;
 }
 
+void ath10k_core_napi_enable(struct ath10k *ar)
+{
+	if (ar->napi_enabled)
+		return;
+
+	napi_enable(&ar->napi);
+	ar->napi_enabled = true;
+}
+EXPORT_SYMBOL(ath10k_core_napi_enable);
+
+void ath10k_core_napi_sync_disable(struct ath10k *ar)
+{
+	if (ar->napi_enabled) {
+		napi_synchronize(&ar->napi);
+		napi_disable(&ar->napi);
+		ar->napi_enabled = false;
+	}
+}
+EXPORT_SYMBOL(ath10k_core_napi_sync_disable);
+
 static void ath10k_core_restart(struct work_struct *work)
 {
 	struct ath10k *ar = container_of(work, struct ath10k, restart_work);
diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h
index 5c18f6c20462..efb26420cc20 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -1230,6 +1230,7 @@ struct ath10k {
 	/* NAPI */
 	struct net_device napi_dev;
 	struct napi_struct napi;
+	bool napi_enabled;
 
 	struct work_struct set_coverage_class_work;
 	/* protected by conf_mutex */
@@ -1276,6 +1277,8 @@ static inline bool ath10k_peer_stats_enabled(struct ath10k *ar)
 
 extern unsigned long ath10k_coredump_mask;
 
+void ath10k_core_napi_sync_disable(struct ath10k *ar);
+void ath10k_core_napi_enable(struct ath10k *ar);
 struct ath10k *ath10k_core_create(size_t priv_size, struct device *dev,
 				  enum ath10k_bus bus,
 				  enum ath10k_hw_rev hw_rev,
diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 36426efdb2ea..7786accc5f96 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1958,7 +1958,7 @@ static int ath10k_pci_hif_start(struct ath10k *ar)
 
 	ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif start\n");
 
-	napi_enable(&ar->napi);
+	ath10k_core_napi_enable(ar);
 
 	ath10k_pci_irq_enable(ar);
 	ath10k_pci_rx_post(ar);
@@ -2075,8 +2075,9 @@ static void ath10k_pci_hif_stop(struct ath10k *ar)
 
 	ath10k_pci_irq_disable(ar);
 	ath10k_pci_irq_sync(ar);
-	napi_synchronize(&ar->napi);
-	napi_disable(&ar->napi);
+
+	ath10k_core_napi_sync_disable(ar);
+
 	cancel_work_sync(&ar_pci->dump_work);
 
 	/* Most likely the device has HTT Rx ring configured. The only way to
diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index 81ddaafb6721..8834f118ef77 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -1859,7 +1859,7 @@ static int ath10k_sdio_hif_start(struct ath10k *ar)
 	struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar);
 	int ret;
 
-	napi_enable(&ar->napi);
+	ath10k_core_napi_enable(ar);
 
 	/* Sleep 20 ms before HIF interrupts are disabled.
 	 * This will give target plenty of time to process the BMI done
@@ -1986,8 +1986,7 @@ static void ath10k_sdio_hif_stop(struct ath10k *ar)
 
 	spin_unlock_bh(&ar_sdio->wr_async_lock);
 
-	napi_synchronize(&ar->napi);
-	napi_disable(&ar->napi);
+	ath10k_core_napi_sync_disable(ar);
 }
 
 #ifdef CONFIG_PM
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index 645ed5f63ef8..d54438c01df8 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -913,8 +913,7 @@ static void ath10k_snoc_hif_stop(struct ath10k *ar)
 	if (!test_bit(ATH10K_FLAG_CRASH_FLUSH, &ar->dev_flags))
 		ath10k_snoc_irq_disable(ar);
 
-	napi_synchronize(&ar->napi);
-	napi_disable(&ar->napi);
+	ath10k_core_napi_sync_disable(ar);
 	ath10k_snoc_buffer_cleanup(ar);
 	ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif stop\n");
 }
@@ -923,7 +922,7 @@ static int ath10k_snoc_hif_start(struct ath10k *ar)
 {
 	struct ath10k_snoc *ar_snoc = ath10k_snoc_priv(ar);
 
-	napi_enable(&ar->napi);
+	ath10k_core_napi_enable(ar);
 	ath10k_snoc_irq_enable(ar);
 	ath10k_snoc_rx_post(ar);
 
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
@ 2020-08-28 12:22 ` Wen Gong
  0 siblings, 0 replies; 24+ messages in thread
From: Wen Gong @ 2020-08-28 12:22 UTC (permalink / raw)
  To: ath10k; +Cc: linux-wireless, wgong

It happened "Kernel panic - not syncing: hung_task: blocked tasks" when
test simulate crash and ifconfig down/rmmod meanwhile.

Test steps:

1.Test commands, either can reproduce the hang for PCIe, SDIO and SNOC.
echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down
echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio
echo hw-restart > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_pci

2. dmesg:
[ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash
[ 5622.655995] ieee80211 phy0: Hardware restart was requested
[ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds.
[ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds.
[ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks
[ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G        W         4.19.86 #137
[ 5776.359846] Hardware name: MediaTek krane sku176 board (DT)
[ 5776.359855] Call trace:
[ 5776.359868]  dump_backtrace+0x0/0x170
[ 5776.359881]  show_stack+0x20/0x2c
[ 5776.359896]  dump_stack+0xd4/0x10c
[ 5776.359916]  panic+0x12c/0x29c
[ 5776.359937]  hung_task_panic+0x0/0x50
[ 5776.359953]  kthread+0x120/0x130
[ 5776.359965]  ret_from_fork+0x10/0x18
[ 5776.359986] SMP: stopping secondary CPUs
[ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000
[ 5776.360026] CPU features: 0x0,2188200c
[ 5776.360035] Memory Limit: none

command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked
callstack of ifconfig:
[<0>] __switch_to+0x120/0x13c
[<0>] msleep+0x28/0x38
[<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio]
[<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
[<0>] ath10k_halt+0x120/0x178 [ath10k_core]
[<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
[<0>] drv_stop+0xe0/0x1e4 [mac80211]
[<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
[<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
[<0>] ieee80211_stop+0x20/0x30 [mac80211]
[<0>] __dev_close_many+0xb8/0x11c
[<0>] __dev_change_flags+0xe0/0x1d0
[<0>] dev_change_flags+0x30/0x6c
[<0>] devinet_ioctl+0x370/0x564
[<0>] inet_ioctl+0xdc/0x304
[<0>] sock_do_ioctl+0x50/0x288
[<0>] compat_sock_ioctl+0x1b4/0x1aac
[<0>] __se_compat_sys_ioctl+0x100/0x26fc
[<0>] __arm64_compat_sys_ioctl+0x20/0x2c
[<0>] el0_svc_common+0xa4/0x154
[<0>] el0_svc_compat_handler+0x2c/0x38
[<0>] el0_svc_compat+0x8/0x18
[<0>] 0xffffffffffffffff

callstack of rmmod:
[<0>] __switch_to+0x120/0x13c
[<0>] msleep+0x28/0x38
[<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio]
[<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
[<0>] ath10k_halt+0x120/0x178 [ath10k_core]
[<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
[<0>] drv_stop+0xe0/0x1e4 [mac80211]
[<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
[<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
[<0>] ieee80211_stop+0x20/0x30 [mac80211]
[<0>] __dev_close_many+0xb8/0x11c
[<0>] dev_close_many+0x70/0x100
[<0>] dev_close+0x4c/0x80
[<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211]
[<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211]
[<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211]
[<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
[<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core]
[<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio]
[<0>] sdio_bus_remove+0x48/0x108
[<0>] device_release_driver_internal+0x138/0x1ec
[<0>] driver_detach+0x6c/0xa8
[<0>] bus_remove_driver+0x78/0xa8
[<0>] driver_unregister+0x30/0x50
[<0>] sdio_unregister_driver+0x28/0x34
[<0>] cleanup_module+0x14/0x6bc [ath10k_sdio]
[<0>] __arm64_sys_delete_module+0x1e0/0x22c
[<0>] el0_svc_common+0xa4/0x154
[<0>] el0_svc_compat_handler+0x2c/0x38
[<0>] el0_svc_compat+0x8/0x18
[<0>] 0xffffffffffffffff

SNOC:
[  647.156863] Call trace:
[  647.162166] [<ffffff80080855a4>] __switch_to+0x120/0x13c
[  647.164512] [<ffffff800899d8b8>] __schedule+0x5ec/0x798
[  647.170062] [<ffffff800899dad8>] schedule+0x74/0x94
[  647.175050] [<ffffff80089a0848>] schedule_timeout+0x314/0x42c
[  647.179874] [<ffffff80089a0a14>] schedule_timeout_uninterruptible+0x34/0x40
[  647.185780] [<ffffff80082a494>] msleep+0x28/0x38
[  647.192546] [<ffffff800117ec4c>] ath10k_snoc_hif_stop+0x4c/0x1e0 [ath10k_snoc]
[  647.197439] [<ffffff80010dfbd8>] ath10k_core_stop+0x50/0x7c [ath10k_core]
[  647.204652] [<ffffff80010c8f48>] ath10k_halt+0x114/0x16c [ath10k_core]
[  647.211420] [<ffffff80010cad68>] ath10k_stop+0x4c/0x88 [ath10k_core]
[  647.217865] [<ffffff8000fdbf54>] drv_stop+0x110/0x244 [mac80211]
[  647.224367] [<ffffff80010147ac>] ieee80211_stop_device+0x48/0x54 [mac80211]
[  647.230359] [<ffffff8000ff3eec>] ieee80211_do_stop+0x6a4/0x73c [mac80211]
[  647.237033] [<ffffff8000ff4500>] ieee80211_stop+0x20/0x30 [mac80211]
[  647.243942] [<ffffff80087e39b8>] __dev_close_many+0xa0/0xfc
[  647.250435] [<ffffff80087e3888>] dev_close_many+0x70/0x100
[  647.255651] [<ffffff80087e3a60>] dev_close+0x4c/0x80
[  647.261244] [<ffffff8000f1ba54>] cfg80211_shutdown_all_interfaces+0x44/0xcc [cfg80211]
[  647.266383] [<ffffff8000ff3fdc>] ieee80211_remove_interfaces+0x58/0x1b4 [mac80211]
[  647.274128] [<ffffff8000fda540>] ieee80211_unregister_hw+0x50/0x120 [mac80211]
[  647.281659] [<ffffff80010ca314>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
[  647.288839] [<ffffff80010dfc94>] ath10k_core_unregister+0x48/0x90 [ath10k_core]
[  647.296027] [<ffffff800117e598>] ath10k_snoc_remove+0x5c/0x150 [ath10k_snoc]
[  647.303229] [<ffffff80085625fc>] platform_drv_remove+0x28/0x50
[  647.310517] [<ffffff80085601a4>] device_release_driver_internal+0x114/0x1b8
[  647.316257] [<ffffff80085602e4>] driver_detach+0x6c/0xa8
[  647.323021] [<ffffff800855e5b8>] bus_remove_driver+0x78/0xa8
[  647.328571] [<ffffff800856107c>] driver_unregister+0x30/0x50
[  647.334213] [<ffffff8008562674>] platform_driver_unregister+0x1c/0x28
[  647.339876] [<ffffff800117fefc>] cleanup_module+0x1c/0x120 [ath10k_snoc]
[  647.346196] [<ffffff8008143ab8>] SyS_delete_module+0x1dc/0x22c

PCIe:
[  615.392770] rmmod           D    0  3523   3458 0x00000080
[  615.392777] Call Trace:
[  615.392784]  __schedule+0x617/0x7d3
[  615.392791]  ? __mod_timer+0x263/0x35c
[  615.392797]  schedule+0x62/0x72
[  615.392803]  schedule_timeout+0x8d/0xf3
[  615.392809]  ? run_local_timers+0x6b/0x6b
[  615.392814]  msleep+0x1b/0x22
[  615.392824]  ath10k_pci_hif_stop+0x68/0xd6 [ath10k_pci]
[  615.392844]  ath10k_core_stop+0x44/0x67 [ath10k_core]
[  615.392859]  ath10k_halt+0x102/0x153 [ath10k_core]
[  615.392873]  ath10k_stop+0x38/0x75 [ath10k_core]
[  615.392893]  drv_stop+0x9a/0x13c [mac80211]
[  615.392915]  ieee80211_do_stop+0x772/0x7cd [mac80211]
[  615.392937]  ieee80211_stop+0x1a/0x1e [mac80211]
[  615.392945]  __dev_close_many+0x9e/0xf0
[  615.392952]  dev_close_many+0x62/0xe8
[  615.392958]  dev_close+0x54/0x7d
[  615.392975]  cfg80211_shutdown_all_interfaces+0x6e/0xa5 [cfg80211]
[  615.393021]  ieee80211_remove_interfaces+0x52/0x1aa [mac80211]
[  615.393049]  ieee80211_unregister_hw+0x54/0x136 [mac80211]
[  615.393068]  ath10k_mac_unregister+0x19/0x4a [ath10k_core]
[  615.393091]  ath10k_core_unregister+0x39/0x7e [ath10k_core]
[  615.393104]  ath10k_pci_remove+0x3d/0x7f [ath10k_pci]
[  615.393117]  pci_device_remove+0x41/0xa6
[  615.393129]  device_release_driver_internal+0x123/0x1ec
[  615.393140]  driver_detach+0x60/0x90
[  615.393152]  bus_remove_driver+0x72/0x9f
[  615.393164]  pci_unregister_driver+0x1e/0x87
[  615.393177]  SyS_delete_module+0x1d7/0x277
[  615.393188]  do_syscall_64+0x6b/0xf7
[  615.393199]  entry_SYSCALL_64_after_hwframe+0x41/0xa6

The test command run simulate_fw_crash firstly and it call into
ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable
is called and bit NAPI_STATE_SCHED is set. After that, function
ath10k_sdio_hif_stop is called again from ath10k_stop by command
"ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked.

It is blocked by napi_synchronize, napi_disable will set bit with
NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop
becuase bit NAPI_STATE_SCHED is set by napi_disable.

function of napi_synchronize
static inline void napi_synchronize(const struct napi_struct *n)
{
	if (IS_ENABLED(CONFIG_SMP))
		while (test_bit(NAPI_STATE_SCHED, &n->state))
			msleep(1);
	else
		barrier();
}

function of napi_disable
void napi_disable(struct napi_struct *n)
{
	might_sleep();
	set_bit(NAPI_STATE_DISABLE, &n->state);

	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
		msleep(1);
	while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
		msleep(1);

	hrtimer_cancel(&n->timer);

	clear_bit(NAPI_STATE_DISABLE, &n->state);
}

Add flag for it avoid the hang and crash.

Tested-on: QCA6174 hw3.2 SDIO WLAN.RMH.4.4.1-00049
Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00110-QCARMSWP-1
Tested-on: WCN3990 hw1.0 SNOC hw1.0 WLAN.HL.3.1-01307.1-QCAHLSWMTPL-2

Signed-off-by: Wen Gong <wgong@codeaurora.org>
---
v3: change all bus type

v2: change to use flag 
 drivers/net/wireless/ath/ath10k/ahb.c  |  5 ++---
 drivers/net/wireless/ath/ath10k/core.c | 20 ++++++++++++++++++++
 drivers/net/wireless/ath/ath10k/core.h |  3 +++
 drivers/net/wireless/ath/ath10k/pci.c  |  7 ++++---
 drivers/net/wireless/ath/ath10k/sdio.c |  5 ++---
 drivers/net/wireless/ath/ath10k/snoc.c |  5 ++---
 6 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/ahb.c b/drivers/net/wireless/ath/ath10k/ahb.c
index 05a61975c83f..869524852fba 100644
--- a/drivers/net/wireless/ath/ath10k/ahb.c
+++ b/drivers/net/wireless/ath/ath10k/ahb.c
@@ -626,7 +626,7 @@ static int ath10k_ahb_hif_start(struct ath10k *ar)
 {
 	ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot ahb hif start\n");
 
-	napi_enable(&ar->napi);
+	ath10k_core_napi_enable(ar);
 	ath10k_ce_enable_interrupts(ar);
 	ath10k_pci_enable_legacy_irq(ar);
 
@@ -644,8 +644,7 @@ static void ath10k_ahb_hif_stop(struct ath10k *ar)
 	ath10k_ahb_irq_disable(ar);
 	synchronize_irq(ar_ahb->irq);
 
-	napi_synchronize(&ar->napi);
-	napi_disable(&ar->napi);
+	ath10k_core_napi_sync_disable(ar);
 
 	ath10k_pci_flush(ar);
 }
diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
index cfffd20df0cc..7e52fd14659d 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -2277,6 +2277,26 @@ static int ath10k_init_hw_params(struct ath10k *ar)
 	return 0;
 }
 
+void ath10k_core_napi_enable(struct ath10k *ar)
+{
+	if (ar->napi_enabled)
+		return;
+
+	napi_enable(&ar->napi);
+	ar->napi_enabled = true;
+}
+EXPORT_SYMBOL(ath10k_core_napi_enable);
+
+void ath10k_core_napi_sync_disable(struct ath10k *ar)
+{
+	if (ar->napi_enabled) {
+		napi_synchronize(&ar->napi);
+		napi_disable(&ar->napi);
+		ar->napi_enabled = false;
+	}
+}
+EXPORT_SYMBOL(ath10k_core_napi_sync_disable);
+
 static void ath10k_core_restart(struct work_struct *work)
 {
 	struct ath10k *ar = container_of(work, struct ath10k, restart_work);
diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h
index 5c18f6c20462..efb26420cc20 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -1230,6 +1230,7 @@ struct ath10k {
 	/* NAPI */
 	struct net_device napi_dev;
 	struct napi_struct napi;
+	bool napi_enabled;
 
 	struct work_struct set_coverage_class_work;
 	/* protected by conf_mutex */
@@ -1276,6 +1277,8 @@ static inline bool ath10k_peer_stats_enabled(struct ath10k *ar)
 
 extern unsigned long ath10k_coredump_mask;
 
+void ath10k_core_napi_sync_disable(struct ath10k *ar);
+void ath10k_core_napi_enable(struct ath10k *ar);
 struct ath10k *ath10k_core_create(size_t priv_size, struct device *dev,
 				  enum ath10k_bus bus,
 				  enum ath10k_hw_rev hw_rev,
diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 36426efdb2ea..7786accc5f96 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1958,7 +1958,7 @@ static int ath10k_pci_hif_start(struct ath10k *ar)
 
 	ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif start\n");
 
-	napi_enable(&ar->napi);
+	ath10k_core_napi_enable(ar);
 
 	ath10k_pci_irq_enable(ar);
 	ath10k_pci_rx_post(ar);
@@ -2075,8 +2075,9 @@ static void ath10k_pci_hif_stop(struct ath10k *ar)
 
 	ath10k_pci_irq_disable(ar);
 	ath10k_pci_irq_sync(ar);
-	napi_synchronize(&ar->napi);
-	napi_disable(&ar->napi);
+
+	ath10k_core_napi_sync_disable(ar);
+
 	cancel_work_sync(&ar_pci->dump_work);
 
 	/* Most likely the device has HTT Rx ring configured. The only way to
diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index 81ddaafb6721..8834f118ef77 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -1859,7 +1859,7 @@ static int ath10k_sdio_hif_start(struct ath10k *ar)
 	struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar);
 	int ret;
 
-	napi_enable(&ar->napi);
+	ath10k_core_napi_enable(ar);
 
 	/* Sleep 20 ms before HIF interrupts are disabled.
 	 * This will give target plenty of time to process the BMI done
@@ -1986,8 +1986,7 @@ static void ath10k_sdio_hif_stop(struct ath10k *ar)
 
 	spin_unlock_bh(&ar_sdio->wr_async_lock);
 
-	napi_synchronize(&ar->napi);
-	napi_disable(&ar->napi);
+	ath10k_core_napi_sync_disable(ar);
 }
 
 #ifdef CONFIG_PM
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index 645ed5f63ef8..d54438c01df8 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -913,8 +913,7 @@ static void ath10k_snoc_hif_stop(struct ath10k *ar)
 	if (!test_bit(ATH10K_FLAG_CRASH_FLUSH, &ar->dev_flags))
 		ath10k_snoc_irq_disable(ar);
 
-	napi_synchronize(&ar->napi);
-	napi_disable(&ar->napi);
+	ath10k_core_napi_sync_disable(ar);
 	ath10k_snoc_buffer_cleanup(ar);
 	ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif stop\n");
 }
@@ -923,7 +922,7 @@ static int ath10k_snoc_hif_start(struct ath10k *ar)
 {
 	struct ath10k_snoc *ar_snoc = ath10k_snoc_priv(ar);
 
-	napi_enable(&ar->napi);
+	ath10k_core_napi_enable(ar);
 	ath10k_snoc_irq_enable(ar);
 	ath10k_snoc_rx_post(ar);
 
-- 
2.23.0


_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
  2020-08-28 12:22 ` Wen Gong
@ 2020-08-28 14:09   ` Krishna Chaitanya
  -1 siblings, 0 replies; 24+ messages in thread
From: Krishna Chaitanya @ 2020-08-28 14:09 UTC (permalink / raw)
  To: Wen Gong; +Cc: ath10k, linux-wireless

On Fri, Aug 28, 2020 at 5:53 PM Wen Gong <wgong@codeaurora.org> wrote:
>
> It happened "Kernel panic - not syncing: hung_task: blocked tasks" when
> test simulate crash and ifconfig down/rmmod meanwhile.
>
> Test steps:
>
> 1.Test commands, either can reproduce the hang for PCIe, SDIO and SNOC.
> echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down
> echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio
> echo hw-restart > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_pci
>
> 2. dmesg:
> [ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash
> [ 5622.655995] ieee80211 phy0: Hardware restart was requested
> [ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds.
> [ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds.
> [ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks
> [ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G        W         4.19.86 #137
> [ 5776.359846] Hardware name: MediaTek krane sku176 board (DT)
> [ 5776.359855] Call trace:
> [ 5776.359868]  dump_backtrace+0x0/0x170
> [ 5776.359881]  show_stack+0x20/0x2c
> [ 5776.359896]  dump_stack+0xd4/0x10c
> [ 5776.359916]  panic+0x12c/0x29c
> [ 5776.359937]  hung_task_panic+0x0/0x50
> [ 5776.359953]  kthread+0x120/0x130
> [ 5776.359965]  ret_from_fork+0x10/0x18
> [ 5776.359986] SMP: stopping secondary CPUs
> [ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000
> [ 5776.360026] CPU features: 0x0,2188200c
> [ 5776.360035] Memory Limit: none
>
> command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked
> callstack of ifconfig:
> [<0>] __switch_to+0x120/0x13c
> [<0>] msleep+0x28/0x38
> [<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio]
> [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> [<0>] __dev_close_many+0xb8/0x11c
> [<0>] __dev_change_flags+0xe0/0x1d0
> [<0>] dev_change_flags+0x30/0x6c
> [<0>] devinet_ioctl+0x370/0x564
> [<0>] inet_ioctl+0xdc/0x304
> [<0>] sock_do_ioctl+0x50/0x288
> [<0>] compat_sock_ioctl+0x1b4/0x1aac
> [<0>] __se_compat_sys_ioctl+0x100/0x26fc
> [<0>] __arm64_compat_sys_ioctl+0x20/0x2c
> [<0>] el0_svc_common+0xa4/0x154
> [<0>] el0_svc_compat_handler+0x2c/0x38
> [<0>] el0_svc_compat+0x8/0x18
> [<0>] 0xffffffffffffffff
>
> callstack of rmmod:
> [<0>] __switch_to+0x120/0x13c
> [<0>] msleep+0x28/0x38
> [<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio]
> [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> [<0>] __dev_close_many+0xb8/0x11c
> [<0>] dev_close_many+0x70/0x100
> [<0>] dev_close+0x4c/0x80
> [<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211]
> [<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211]
> [<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211]
> [<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
> [<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core]
> [<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio]
> [<0>] sdio_bus_remove+0x48/0x108
> [<0>] device_release_driver_internal+0x138/0x1ec
> [<0>] driver_detach+0x6c/0xa8
> [<0>] bus_remove_driver+0x78/0xa8
> [<0>] driver_unregister+0x30/0x50
> [<0>] sdio_unregister_driver+0x28/0x34
> [<0>] cleanup_module+0x14/0x6bc [ath10k_sdio]
> [<0>] __arm64_sys_delete_module+0x1e0/0x22c
> [<0>] el0_svc_common+0xa4/0x154
> [<0>] el0_svc_compat_handler+0x2c/0x38
> [<0>] el0_svc_compat+0x8/0x18
> [<0>] 0xffffffffffffffff
>
> SNOC:
> [  647.156863] Call trace:
> [  647.162166] [<ffffff80080855a4>] __switch_to+0x120/0x13c
> [  647.164512] [<ffffff800899d8b8>] __schedule+0x5ec/0x798
> [  647.170062] [<ffffff800899dad8>] schedule+0x74/0x94
> [  647.175050] [<ffffff80089a0848>] schedule_timeout+0x314/0x42c
> [  647.179874] [<ffffff80089a0a14>] schedule_timeout_uninterruptible+0x34/0x40
> [  647.185780] [<ffffff80082a494>] msleep+0x28/0x38
> [  647.192546] [<ffffff800117ec4c>] ath10k_snoc_hif_stop+0x4c/0x1e0 [ath10k_snoc]
> [  647.197439] [<ffffff80010dfbd8>] ath10k_core_stop+0x50/0x7c [ath10k_core]
> [  647.204652] [<ffffff80010c8f48>] ath10k_halt+0x114/0x16c [ath10k_core]
> [  647.211420] [<ffffff80010cad68>] ath10k_stop+0x4c/0x88 [ath10k_core]
> [  647.217865] [<ffffff8000fdbf54>] drv_stop+0x110/0x244 [mac80211]
> [  647.224367] [<ffffff80010147ac>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [  647.230359] [<ffffff8000ff3eec>] ieee80211_do_stop+0x6a4/0x73c [mac80211]
> [  647.237033] [<ffffff8000ff4500>] ieee80211_stop+0x20/0x30 [mac80211]
> [  647.243942] [<ffffff80087e39b8>] __dev_close_many+0xa0/0xfc
> [  647.250435] [<ffffff80087e3888>] dev_close_many+0x70/0x100
> [  647.255651] [<ffffff80087e3a60>] dev_close+0x4c/0x80
> [  647.261244] [<ffffff8000f1ba54>] cfg80211_shutdown_all_interfaces+0x44/0xcc [cfg80211]
> [  647.266383] [<ffffff8000ff3fdc>] ieee80211_remove_interfaces+0x58/0x1b4 [mac80211]
> [  647.274128] [<ffffff8000fda540>] ieee80211_unregister_hw+0x50/0x120 [mac80211]
> [  647.281659] [<ffffff80010ca314>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
> [  647.288839] [<ffffff80010dfc94>] ath10k_core_unregister+0x48/0x90 [ath10k_core]
> [  647.296027] [<ffffff800117e598>] ath10k_snoc_remove+0x5c/0x150 [ath10k_snoc]
> [  647.303229] [<ffffff80085625fc>] platform_drv_remove+0x28/0x50
> [  647.310517] [<ffffff80085601a4>] device_release_driver_internal+0x114/0x1b8
> [  647.316257] [<ffffff80085602e4>] driver_detach+0x6c/0xa8
> [  647.323021] [<ffffff800855e5b8>] bus_remove_driver+0x78/0xa8
> [  647.328571] [<ffffff800856107c>] driver_unregister+0x30/0x50
> [  647.334213] [<ffffff8008562674>] platform_driver_unregister+0x1c/0x28
> [  647.339876] [<ffffff800117fefc>] cleanup_module+0x1c/0x120 [ath10k_snoc]
> [  647.346196] [<ffffff8008143ab8>] SyS_delete_module+0x1dc/0x22c
>
> PCIe:
> [  615.392770] rmmod           D    0  3523   3458 0x00000080
> [  615.392777] Call Trace:
> [  615.392784]  __schedule+0x617/0x7d3
> [  615.392791]  ? __mod_timer+0x263/0x35c
> [  615.392797]  schedule+0x62/0x72
> [  615.392803]  schedule_timeout+0x8d/0xf3
> [  615.392809]  ? run_local_timers+0x6b/0x6b
> [  615.392814]  msleep+0x1b/0x22
> [  615.392824]  ath10k_pci_hif_stop+0x68/0xd6 [ath10k_pci]
> [  615.392844]  ath10k_core_stop+0x44/0x67 [ath10k_core]
> [  615.392859]  ath10k_halt+0x102/0x153 [ath10k_core]
> [  615.392873]  ath10k_stop+0x38/0x75 [ath10k_core]
> [  615.392893]  drv_stop+0x9a/0x13c [mac80211]
> [  615.392915]  ieee80211_do_stop+0x772/0x7cd [mac80211]
> [  615.392937]  ieee80211_stop+0x1a/0x1e [mac80211]
> [  615.392945]  __dev_close_many+0x9e/0xf0
> [  615.392952]  dev_close_many+0x62/0xe8
> [  615.392958]  dev_close+0x54/0x7d
> [  615.392975]  cfg80211_shutdown_all_interfaces+0x6e/0xa5 [cfg80211]
> [  615.393021]  ieee80211_remove_interfaces+0x52/0x1aa [mac80211]
> [  615.393049]  ieee80211_unregister_hw+0x54/0x136 [mac80211]
> [  615.393068]  ath10k_mac_unregister+0x19/0x4a [ath10k_core]
> [  615.393091]  ath10k_core_unregister+0x39/0x7e [ath10k_core]
> [  615.393104]  ath10k_pci_remove+0x3d/0x7f [ath10k_pci]
> [  615.393117]  pci_device_remove+0x41/0xa6
> [  615.393129]  device_release_driver_internal+0x123/0x1ec
> [  615.393140]  driver_detach+0x60/0x90
> [  615.393152]  bus_remove_driver+0x72/0x9f
> [  615.393164]  pci_unregister_driver+0x1e/0x87
> [  615.393177]  SyS_delete_module+0x1d7/0x277
> [  615.393188]  do_syscall_64+0x6b/0xf7
> [  615.393199]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
>
> The test command run simulate_fw_crash firstly and it call into
> ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable
> is called and bit NAPI_STATE_SCHED is set. After that, function
> ath10k_sdio_hif_stop is called again from ath10k_stop by command
> "ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked.
>
> It is blocked by napi_synchronize, napi_disable will set bit with
> NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop
> becuase bit NAPI_STATE_SCHED is set by napi_disable.
>
> function of napi_synchronize
> static inline void napi_synchronize(const struct napi_struct *n)
> {
>         if (IS_ENABLED(CONFIG_SMP))
>                 while (test_bit(NAPI_STATE_SCHED, &n->state))
>                         msleep(1);
>         else
>                 barrier();
> }
>
> function of napi_disable
> void napi_disable(struct napi_struct *n)
> {
>         might_sleep();
>         set_bit(NAPI_STATE_DISABLE, &n->state);
>
>         while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
>                 msleep(1);
>         while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
>                 msleep(1);
>
>         hrtimer_cancel(&n->timer);
>
>         clear_bit(NAPI_STATE_DISABLE, &n->state);
> }
>
> Add flag for it avoid the hang and crash.
>
> Tested-on: QCA6174 hw3.2 SDIO WLAN.RMH.4.4.1-00049
> Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00110-QCARMSWP-1
> Tested-on: WCN3990 hw1.0 SNOC hw1.0 WLAN.HL.3.1-01307.1-QCAHLSWMTPL-2
>
> Signed-off-by: Wen Gong <wgong@codeaurora.org>
> ---
> v3: change all bus type
>
> v2: change to use flag
>  drivers/net/wireless/ath/ath10k/ahb.c  |  5 ++---
>  drivers/net/wireless/ath/ath10k/core.c | 20 ++++++++++++++++++++
>  drivers/net/wireless/ath/ath10k/core.h |  3 +++
>  drivers/net/wireless/ath/ath10k/pci.c  |  7 ++++---
>  drivers/net/wireless/ath/ath10k/sdio.c |  5 ++---
>  drivers/net/wireless/ath/ath10k/snoc.c |  5 ++---
>  6 files changed, 33 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath10k/ahb.c b/drivers/net/wireless/ath/ath10k/ahb.c
> index 05a61975c83f..869524852fba 100644
> --- a/drivers/net/wireless/ath/ath10k/ahb.c
> +++ b/drivers/net/wireless/ath/ath10k/ahb.c
> @@ -626,7 +626,7 @@ static int ath10k_ahb_hif_start(struct ath10k *ar)
>  {
>         ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot ahb hif start\n");
>
> -       napi_enable(&ar->napi);
> +       ath10k_core_napi_enable(ar);
>         ath10k_ce_enable_interrupts(ar);
>         ath10k_pci_enable_legacy_irq(ar);
>
> @@ -644,8 +644,7 @@ static void ath10k_ahb_hif_stop(struct ath10k *ar)
>         ath10k_ahb_irq_disable(ar);
>         synchronize_irq(ar_ahb->irq);
>
> -       napi_synchronize(&ar->napi);
> -       napi_disable(&ar->napi);
> +       ath10k_core_napi_sync_disable(ar);
>
>         ath10k_pci_flush(ar);
>  }
> diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
> index cfffd20df0cc..7e52fd14659d 100644
> --- a/drivers/net/wireless/ath/ath10k/core.c
> +++ b/drivers/net/wireless/ath/ath10k/core.c
> @@ -2277,6 +2277,26 @@ static int ath10k_init_hw_params(struct ath10k *ar)
>         return 0;
>  }
>
> +void ath10k_core_napi_enable(struct ath10k *ar)
> +{
> +       if (ar->napi_enabled)
> +               return;
> +
> +       napi_enable(&ar->napi);
> +       ar->napi_enabled = true;
> +}
> +EXPORT_SYMBOL(ath10k_core_napi_enable);
> +
> +void ath10k_core_napi_sync_disable(struct ath10k *ar)
> +{
> +       if (ar->napi_enabled) {
> +               napi_synchronize(&ar->napi);
> +               napi_disable(&ar->napi);
> +               ar->napi_enabled = false;
> +       }
> +}
> +EXPORT_SYMBOL(ath10k_core_napi_sync_disable);
> +
>  static void ath10k_core_restart(struct work_struct *work)
>  {
>         struct ath10k *ar = container_of(work, struct ath10k, restart_work);
> diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h
> index 5c18f6c20462..efb26420cc20 100644
> --- a/drivers/net/wireless/ath/ath10k/core.h
> +++ b/drivers/net/wireless/ath/ath10k/core.h
> @@ -1230,6 +1230,7 @@ struct ath10k {
>         /* NAPI */
>         struct net_device napi_dev;
>         struct napi_struct napi;
> +       bool napi_enabled;
>
>         struct work_struct set_coverage_class_work;
>         /* protected by conf_mutex */
> @@ -1276,6 +1277,8 @@ static inline bool ath10k_peer_stats_enabled(struct ath10k *ar)
>
>  extern unsigned long ath10k_coredump_mask;
>
> +void ath10k_core_napi_sync_disable(struct ath10k *ar);
> +void ath10k_core_napi_enable(struct ath10k *ar);
>  struct ath10k *ath10k_core_create(size_t priv_size, struct device *dev,
>                                   enum ath10k_bus bus,
>                                   enum ath10k_hw_rev hw_rev,
> diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
> index 36426efdb2ea..7786accc5f96 100644
> --- a/drivers/net/wireless/ath/ath10k/pci.c
> +++ b/drivers/net/wireless/ath/ath10k/pci.c
> @@ -1958,7 +1958,7 @@ static int ath10k_pci_hif_start(struct ath10k *ar)
>
>         ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif start\n");
>
> -       napi_enable(&ar->napi);
> +       ath10k_core_napi_enable(ar);
>
>         ath10k_pci_irq_enable(ar);
>         ath10k_pci_rx_post(ar);
> @@ -2075,8 +2075,9 @@ static void ath10k_pci_hif_stop(struct ath10k *ar)
>
>         ath10k_pci_irq_disable(ar);
>         ath10k_pci_irq_sync(ar);
> -       napi_synchronize(&ar->napi);
> -       napi_disable(&ar->napi);
> +
> +       ath10k_core_napi_sync_disable(ar);
> +
>         cancel_work_sync(&ar_pci->dump_work);
>
>         /* Most likely the device has HTT Rx ring configured. The only way to
> diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
> index 81ddaafb6721..8834f118ef77 100644
> --- a/drivers/net/wireless/ath/ath10k/sdio.c
> +++ b/drivers/net/wireless/ath/ath10k/sdio.c
> @@ -1859,7 +1859,7 @@ static int ath10k_sdio_hif_start(struct ath10k *ar)
>         struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar);
>         int ret;
>
> -       napi_enable(&ar->napi);
> +       ath10k_core_napi_enable(ar);
>
>         /* Sleep 20 ms before HIF interrupts are disabled.
>          * This will give target plenty of time to process the BMI done
> @@ -1986,8 +1986,7 @@ static void ath10k_sdio_hif_stop(struct ath10k *ar)
>
>         spin_unlock_bh(&ar_sdio->wr_async_lock);
>
> -       napi_synchronize(&ar->napi);
> -       napi_disable(&ar->napi);
> +       ath10k_core_napi_sync_disable(ar);
>  }
>
>  #ifdef CONFIG_PM
> diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
> index 645ed5f63ef8..d54438c01df8 100644
> --- a/drivers/net/wireless/ath/ath10k/snoc.c
> +++ b/drivers/net/wireless/ath/ath10k/snoc.c
> @@ -913,8 +913,7 @@ static void ath10k_snoc_hif_stop(struct ath10k *ar)
>         if (!test_bit(ATH10K_FLAG_CRASH_FLUSH, &ar->dev_flags))
>                 ath10k_snoc_irq_disable(ar);
>
> -       napi_synchronize(&ar->napi);
> -       napi_disable(&ar->napi);
> +       ath10k_core_napi_sync_disable(ar);
>         ath10k_snoc_buffer_cleanup(ar);
>         ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif stop\n");
>  }
> @@ -923,7 +922,7 @@ static int ath10k_snoc_hif_start(struct ath10k *ar)
>  {
>         struct ath10k_snoc *ar_snoc = ath10k_snoc_priv(ar);
>
> -       napi_enable(&ar->napi);
> +       ath10k_core_napi_enable(ar);
>         ath10k_snoc_irq_enable(ar);
>         ath10k_snoc_rx_post(ar);
>
LGTM, except maybe a shorted title "ath10k: Prevent deinitializing NAPI twice".

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
@ 2020-08-28 14:09   ` Krishna Chaitanya
  0 siblings, 0 replies; 24+ messages in thread
From: Krishna Chaitanya @ 2020-08-28 14:09 UTC (permalink / raw)
  To: Wen Gong; +Cc: linux-wireless, ath10k

On Fri, Aug 28, 2020 at 5:53 PM Wen Gong <wgong@codeaurora.org> wrote:
>
> It happened "Kernel panic - not syncing: hung_task: blocked tasks" when
> test simulate crash and ifconfig down/rmmod meanwhile.
>
> Test steps:
>
> 1.Test commands, either can reproduce the hang for PCIe, SDIO and SNOC.
> echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down
> echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio
> echo hw-restart > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_pci
>
> 2. dmesg:
> [ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash
> [ 5622.655995] ieee80211 phy0: Hardware restart was requested
> [ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds.
> [ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds.
> [ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks
> [ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G        W         4.19.86 #137
> [ 5776.359846] Hardware name: MediaTek krane sku176 board (DT)
> [ 5776.359855] Call trace:
> [ 5776.359868]  dump_backtrace+0x0/0x170
> [ 5776.359881]  show_stack+0x20/0x2c
> [ 5776.359896]  dump_stack+0xd4/0x10c
> [ 5776.359916]  panic+0x12c/0x29c
> [ 5776.359937]  hung_task_panic+0x0/0x50
> [ 5776.359953]  kthread+0x120/0x130
> [ 5776.359965]  ret_from_fork+0x10/0x18
> [ 5776.359986] SMP: stopping secondary CPUs
> [ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000
> [ 5776.360026] CPU features: 0x0,2188200c
> [ 5776.360035] Memory Limit: none
>
> command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked
> callstack of ifconfig:
> [<0>] __switch_to+0x120/0x13c
> [<0>] msleep+0x28/0x38
> [<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio]
> [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> [<0>] __dev_close_many+0xb8/0x11c
> [<0>] __dev_change_flags+0xe0/0x1d0
> [<0>] dev_change_flags+0x30/0x6c
> [<0>] devinet_ioctl+0x370/0x564
> [<0>] inet_ioctl+0xdc/0x304
> [<0>] sock_do_ioctl+0x50/0x288
> [<0>] compat_sock_ioctl+0x1b4/0x1aac
> [<0>] __se_compat_sys_ioctl+0x100/0x26fc
> [<0>] __arm64_compat_sys_ioctl+0x20/0x2c
> [<0>] el0_svc_common+0xa4/0x154
> [<0>] el0_svc_compat_handler+0x2c/0x38
> [<0>] el0_svc_compat+0x8/0x18
> [<0>] 0xffffffffffffffff
>
> callstack of rmmod:
> [<0>] __switch_to+0x120/0x13c
> [<0>] msleep+0x28/0x38
> [<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio]
> [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> [<0>] __dev_close_many+0xb8/0x11c
> [<0>] dev_close_many+0x70/0x100
> [<0>] dev_close+0x4c/0x80
> [<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211]
> [<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211]
> [<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211]
> [<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
> [<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core]
> [<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio]
> [<0>] sdio_bus_remove+0x48/0x108
> [<0>] device_release_driver_internal+0x138/0x1ec
> [<0>] driver_detach+0x6c/0xa8
> [<0>] bus_remove_driver+0x78/0xa8
> [<0>] driver_unregister+0x30/0x50
> [<0>] sdio_unregister_driver+0x28/0x34
> [<0>] cleanup_module+0x14/0x6bc [ath10k_sdio]
> [<0>] __arm64_sys_delete_module+0x1e0/0x22c
> [<0>] el0_svc_common+0xa4/0x154
> [<0>] el0_svc_compat_handler+0x2c/0x38
> [<0>] el0_svc_compat+0x8/0x18
> [<0>] 0xffffffffffffffff
>
> SNOC:
> [  647.156863] Call trace:
> [  647.162166] [<ffffff80080855a4>] __switch_to+0x120/0x13c
> [  647.164512] [<ffffff800899d8b8>] __schedule+0x5ec/0x798
> [  647.170062] [<ffffff800899dad8>] schedule+0x74/0x94
> [  647.175050] [<ffffff80089a0848>] schedule_timeout+0x314/0x42c
> [  647.179874] [<ffffff80089a0a14>] schedule_timeout_uninterruptible+0x34/0x40
> [  647.185780] [<ffffff80082a494>] msleep+0x28/0x38
> [  647.192546] [<ffffff800117ec4c>] ath10k_snoc_hif_stop+0x4c/0x1e0 [ath10k_snoc]
> [  647.197439] [<ffffff80010dfbd8>] ath10k_core_stop+0x50/0x7c [ath10k_core]
> [  647.204652] [<ffffff80010c8f48>] ath10k_halt+0x114/0x16c [ath10k_core]
> [  647.211420] [<ffffff80010cad68>] ath10k_stop+0x4c/0x88 [ath10k_core]
> [  647.217865] [<ffffff8000fdbf54>] drv_stop+0x110/0x244 [mac80211]
> [  647.224367] [<ffffff80010147ac>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [  647.230359] [<ffffff8000ff3eec>] ieee80211_do_stop+0x6a4/0x73c [mac80211]
> [  647.237033] [<ffffff8000ff4500>] ieee80211_stop+0x20/0x30 [mac80211]
> [  647.243942] [<ffffff80087e39b8>] __dev_close_many+0xa0/0xfc
> [  647.250435] [<ffffff80087e3888>] dev_close_many+0x70/0x100
> [  647.255651] [<ffffff80087e3a60>] dev_close+0x4c/0x80
> [  647.261244] [<ffffff8000f1ba54>] cfg80211_shutdown_all_interfaces+0x44/0xcc [cfg80211]
> [  647.266383] [<ffffff8000ff3fdc>] ieee80211_remove_interfaces+0x58/0x1b4 [mac80211]
> [  647.274128] [<ffffff8000fda540>] ieee80211_unregister_hw+0x50/0x120 [mac80211]
> [  647.281659] [<ffffff80010ca314>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
> [  647.288839] [<ffffff80010dfc94>] ath10k_core_unregister+0x48/0x90 [ath10k_core]
> [  647.296027] [<ffffff800117e598>] ath10k_snoc_remove+0x5c/0x150 [ath10k_snoc]
> [  647.303229] [<ffffff80085625fc>] platform_drv_remove+0x28/0x50
> [  647.310517] [<ffffff80085601a4>] device_release_driver_internal+0x114/0x1b8
> [  647.316257] [<ffffff80085602e4>] driver_detach+0x6c/0xa8
> [  647.323021] [<ffffff800855e5b8>] bus_remove_driver+0x78/0xa8
> [  647.328571] [<ffffff800856107c>] driver_unregister+0x30/0x50
> [  647.334213] [<ffffff8008562674>] platform_driver_unregister+0x1c/0x28
> [  647.339876] [<ffffff800117fefc>] cleanup_module+0x1c/0x120 [ath10k_snoc]
> [  647.346196] [<ffffff8008143ab8>] SyS_delete_module+0x1dc/0x22c
>
> PCIe:
> [  615.392770] rmmod           D    0  3523   3458 0x00000080
> [  615.392777] Call Trace:
> [  615.392784]  __schedule+0x617/0x7d3
> [  615.392791]  ? __mod_timer+0x263/0x35c
> [  615.392797]  schedule+0x62/0x72
> [  615.392803]  schedule_timeout+0x8d/0xf3
> [  615.392809]  ? run_local_timers+0x6b/0x6b
> [  615.392814]  msleep+0x1b/0x22
> [  615.392824]  ath10k_pci_hif_stop+0x68/0xd6 [ath10k_pci]
> [  615.392844]  ath10k_core_stop+0x44/0x67 [ath10k_core]
> [  615.392859]  ath10k_halt+0x102/0x153 [ath10k_core]
> [  615.392873]  ath10k_stop+0x38/0x75 [ath10k_core]
> [  615.392893]  drv_stop+0x9a/0x13c [mac80211]
> [  615.392915]  ieee80211_do_stop+0x772/0x7cd [mac80211]
> [  615.392937]  ieee80211_stop+0x1a/0x1e [mac80211]
> [  615.392945]  __dev_close_many+0x9e/0xf0
> [  615.392952]  dev_close_many+0x62/0xe8
> [  615.392958]  dev_close+0x54/0x7d
> [  615.392975]  cfg80211_shutdown_all_interfaces+0x6e/0xa5 [cfg80211]
> [  615.393021]  ieee80211_remove_interfaces+0x52/0x1aa [mac80211]
> [  615.393049]  ieee80211_unregister_hw+0x54/0x136 [mac80211]
> [  615.393068]  ath10k_mac_unregister+0x19/0x4a [ath10k_core]
> [  615.393091]  ath10k_core_unregister+0x39/0x7e [ath10k_core]
> [  615.393104]  ath10k_pci_remove+0x3d/0x7f [ath10k_pci]
> [  615.393117]  pci_device_remove+0x41/0xa6
> [  615.393129]  device_release_driver_internal+0x123/0x1ec
> [  615.393140]  driver_detach+0x60/0x90
> [  615.393152]  bus_remove_driver+0x72/0x9f
> [  615.393164]  pci_unregister_driver+0x1e/0x87
> [  615.393177]  SyS_delete_module+0x1d7/0x277
> [  615.393188]  do_syscall_64+0x6b/0xf7
> [  615.393199]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
>
> The test command run simulate_fw_crash firstly and it call into
> ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable
> is called and bit NAPI_STATE_SCHED is set. After that, function
> ath10k_sdio_hif_stop is called again from ath10k_stop by command
> "ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked.
>
> It is blocked by napi_synchronize, napi_disable will set bit with
> NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop
> becuase bit NAPI_STATE_SCHED is set by napi_disable.
>
> function of napi_synchronize
> static inline void napi_synchronize(const struct napi_struct *n)
> {
>         if (IS_ENABLED(CONFIG_SMP))
>                 while (test_bit(NAPI_STATE_SCHED, &n->state))
>                         msleep(1);
>         else
>                 barrier();
> }
>
> function of napi_disable
> void napi_disable(struct napi_struct *n)
> {
>         might_sleep();
>         set_bit(NAPI_STATE_DISABLE, &n->state);
>
>         while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
>                 msleep(1);
>         while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
>                 msleep(1);
>
>         hrtimer_cancel(&n->timer);
>
>         clear_bit(NAPI_STATE_DISABLE, &n->state);
> }
>
> Add flag for it avoid the hang and crash.
>
> Tested-on: QCA6174 hw3.2 SDIO WLAN.RMH.4.4.1-00049
> Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00110-QCARMSWP-1
> Tested-on: WCN3990 hw1.0 SNOC hw1.0 WLAN.HL.3.1-01307.1-QCAHLSWMTPL-2
>
> Signed-off-by: Wen Gong <wgong@codeaurora.org>
> ---
> v3: change all bus type
>
> v2: change to use flag
>  drivers/net/wireless/ath/ath10k/ahb.c  |  5 ++---
>  drivers/net/wireless/ath/ath10k/core.c | 20 ++++++++++++++++++++
>  drivers/net/wireless/ath/ath10k/core.h |  3 +++
>  drivers/net/wireless/ath/ath10k/pci.c  |  7 ++++---
>  drivers/net/wireless/ath/ath10k/sdio.c |  5 ++---
>  drivers/net/wireless/ath/ath10k/snoc.c |  5 ++---
>  6 files changed, 33 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath10k/ahb.c b/drivers/net/wireless/ath/ath10k/ahb.c
> index 05a61975c83f..869524852fba 100644
> --- a/drivers/net/wireless/ath/ath10k/ahb.c
> +++ b/drivers/net/wireless/ath/ath10k/ahb.c
> @@ -626,7 +626,7 @@ static int ath10k_ahb_hif_start(struct ath10k *ar)
>  {
>         ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot ahb hif start\n");
>
> -       napi_enable(&ar->napi);
> +       ath10k_core_napi_enable(ar);
>         ath10k_ce_enable_interrupts(ar);
>         ath10k_pci_enable_legacy_irq(ar);
>
> @@ -644,8 +644,7 @@ static void ath10k_ahb_hif_stop(struct ath10k *ar)
>         ath10k_ahb_irq_disable(ar);
>         synchronize_irq(ar_ahb->irq);
>
> -       napi_synchronize(&ar->napi);
> -       napi_disable(&ar->napi);
> +       ath10k_core_napi_sync_disable(ar);
>
>         ath10k_pci_flush(ar);
>  }
> diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
> index cfffd20df0cc..7e52fd14659d 100644
> --- a/drivers/net/wireless/ath/ath10k/core.c
> +++ b/drivers/net/wireless/ath/ath10k/core.c
> @@ -2277,6 +2277,26 @@ static int ath10k_init_hw_params(struct ath10k *ar)
>         return 0;
>  }
>
> +void ath10k_core_napi_enable(struct ath10k *ar)
> +{
> +       if (ar->napi_enabled)
> +               return;
> +
> +       napi_enable(&ar->napi);
> +       ar->napi_enabled = true;
> +}
> +EXPORT_SYMBOL(ath10k_core_napi_enable);
> +
> +void ath10k_core_napi_sync_disable(struct ath10k *ar)
> +{
> +       if (ar->napi_enabled) {
> +               napi_synchronize(&ar->napi);
> +               napi_disable(&ar->napi);
> +               ar->napi_enabled = false;
> +       }
> +}
> +EXPORT_SYMBOL(ath10k_core_napi_sync_disable);
> +
>  static void ath10k_core_restart(struct work_struct *work)
>  {
>         struct ath10k *ar = container_of(work, struct ath10k, restart_work);
> diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h
> index 5c18f6c20462..efb26420cc20 100644
> --- a/drivers/net/wireless/ath/ath10k/core.h
> +++ b/drivers/net/wireless/ath/ath10k/core.h
> @@ -1230,6 +1230,7 @@ struct ath10k {
>         /* NAPI */
>         struct net_device napi_dev;
>         struct napi_struct napi;
> +       bool napi_enabled;
>
>         struct work_struct set_coverage_class_work;
>         /* protected by conf_mutex */
> @@ -1276,6 +1277,8 @@ static inline bool ath10k_peer_stats_enabled(struct ath10k *ar)
>
>  extern unsigned long ath10k_coredump_mask;
>
> +void ath10k_core_napi_sync_disable(struct ath10k *ar);
> +void ath10k_core_napi_enable(struct ath10k *ar);
>  struct ath10k *ath10k_core_create(size_t priv_size, struct device *dev,
>                                   enum ath10k_bus bus,
>                                   enum ath10k_hw_rev hw_rev,
> diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
> index 36426efdb2ea..7786accc5f96 100644
> --- a/drivers/net/wireless/ath/ath10k/pci.c
> +++ b/drivers/net/wireless/ath/ath10k/pci.c
> @@ -1958,7 +1958,7 @@ static int ath10k_pci_hif_start(struct ath10k *ar)
>
>         ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif start\n");
>
> -       napi_enable(&ar->napi);
> +       ath10k_core_napi_enable(ar);
>
>         ath10k_pci_irq_enable(ar);
>         ath10k_pci_rx_post(ar);
> @@ -2075,8 +2075,9 @@ static void ath10k_pci_hif_stop(struct ath10k *ar)
>
>         ath10k_pci_irq_disable(ar);
>         ath10k_pci_irq_sync(ar);
> -       napi_synchronize(&ar->napi);
> -       napi_disable(&ar->napi);
> +
> +       ath10k_core_napi_sync_disable(ar);
> +
>         cancel_work_sync(&ar_pci->dump_work);
>
>         /* Most likely the device has HTT Rx ring configured. The only way to
> diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
> index 81ddaafb6721..8834f118ef77 100644
> --- a/drivers/net/wireless/ath/ath10k/sdio.c
> +++ b/drivers/net/wireless/ath/ath10k/sdio.c
> @@ -1859,7 +1859,7 @@ static int ath10k_sdio_hif_start(struct ath10k *ar)
>         struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar);
>         int ret;
>
> -       napi_enable(&ar->napi);
> +       ath10k_core_napi_enable(ar);
>
>         /* Sleep 20 ms before HIF interrupts are disabled.
>          * This will give target plenty of time to process the BMI done
> @@ -1986,8 +1986,7 @@ static void ath10k_sdio_hif_stop(struct ath10k *ar)
>
>         spin_unlock_bh(&ar_sdio->wr_async_lock);
>
> -       napi_synchronize(&ar->napi);
> -       napi_disable(&ar->napi);
> +       ath10k_core_napi_sync_disable(ar);
>  }
>
>  #ifdef CONFIG_PM
> diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
> index 645ed5f63ef8..d54438c01df8 100644
> --- a/drivers/net/wireless/ath/ath10k/snoc.c
> +++ b/drivers/net/wireless/ath/ath10k/snoc.c
> @@ -913,8 +913,7 @@ static void ath10k_snoc_hif_stop(struct ath10k *ar)
>         if (!test_bit(ATH10K_FLAG_CRASH_FLUSH, &ar->dev_flags))
>                 ath10k_snoc_irq_disable(ar);
>
> -       napi_synchronize(&ar->napi);
> -       napi_disable(&ar->napi);
> +       ath10k_core_napi_sync_disable(ar);
>         ath10k_snoc_buffer_cleanup(ar);
>         ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif stop\n");
>  }
> @@ -923,7 +922,7 @@ static int ath10k_snoc_hif_start(struct ath10k *ar)
>  {
>         struct ath10k_snoc *ar_snoc = ath10k_snoc_priv(ar);
>
> -       napi_enable(&ar->napi);
> +       ath10k_core_napi_enable(ar);
>         ath10k_snoc_irq_enable(ar);
>         ath10k_snoc_rx_post(ar);
>
LGTM, except maybe a shorted title "ath10k: Prevent deinitializing NAPI twice".

_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
  2020-08-28 12:22 ` Wen Gong
  (?)
  (?)
@ 2020-09-07 16:23 ` Kalle Valo
  -1 siblings, 0 replies; 24+ messages in thread
From: Kalle Valo @ 2020-09-07 16:23 UTC (permalink / raw)
  To: Wen Gong; +Cc: ath10k, linux-wireless

Wen Gong <wgong@codeaurora.org> writes:

> It happened "Kernel panic - not syncing: hung_task: blocked tasks" when
> test simulate crash and ifconfig down/rmmod meanwhile.
>
> Test steps:
>
> 1.Test commands, either can reproduce the hang for PCIe, SDIO and SNOC.
> echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down
> echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio
> echo hw-restart > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_pci
>
> 2. dmesg:
> [ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash
> [ 5622.655995] ieee80211 phy0: Hardware restart was requested
> [ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds.
> [ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds.
> [ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks
> [ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G        W         4.19.86 #137
> [ 5776.359846] Hardware name: MediaTek krane sku176 board (DT)
> [ 5776.359855] Call trace:
> [ 5776.359868]  dump_backtrace+0x0/0x170
> [ 5776.359881]  show_stack+0x20/0x2c
> [ 5776.359896]  dump_stack+0xd4/0x10c
> [ 5776.359916]  panic+0x12c/0x29c
> [ 5776.359937]  hung_task_panic+0x0/0x50
> [ 5776.359953]  kthread+0x120/0x130
> [ 5776.359965]  ret_from_fork+0x10/0x18
> [ 5776.359986] SMP: stopping secondary CPUs
> [ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000
> [ 5776.360026] CPU features: 0x0,2188200c
> [ 5776.360035] Memory Limit: none
>
> command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked
> callstack of ifconfig:
> [<0>] __switch_to+0x120/0x13c
> [<0>] msleep+0x28/0x38
> [<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio]
> [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> [<0>] __dev_close_many+0xb8/0x11c
> [<0>] __dev_change_flags+0xe0/0x1d0
> [<0>] dev_change_flags+0x30/0x6c
> [<0>] devinet_ioctl+0x370/0x564
> [<0>] inet_ioctl+0xdc/0x304
> [<0>] sock_do_ioctl+0x50/0x288
> [<0>] compat_sock_ioctl+0x1b4/0x1aac
> [<0>] __se_compat_sys_ioctl+0x100/0x26fc
> [<0>] __arm64_compat_sys_ioctl+0x20/0x2c
> [<0>] el0_svc_common+0xa4/0x154
> [<0>] el0_svc_compat_handler+0x2c/0x38
> [<0>] el0_svc_compat+0x8/0x18
> [<0>] 0xffffffffffffffff
>
> callstack of rmmod:
> [<0>] __switch_to+0x120/0x13c
> [<0>] msleep+0x28/0x38
> [<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio]
> [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> [<0>] __dev_close_many+0xb8/0x11c
> [<0>] dev_close_many+0x70/0x100
> [<0>] dev_close+0x4c/0x80
> [<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211]
> [<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211]
> [<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211]
> [<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
> [<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core]
> [<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio]
> [<0>] sdio_bus_remove+0x48/0x108
> [<0>] device_release_driver_internal+0x138/0x1ec
> [<0>] driver_detach+0x6c/0xa8
> [<0>] bus_remove_driver+0x78/0xa8
> [<0>] driver_unregister+0x30/0x50
> [<0>] sdio_unregister_driver+0x28/0x34
> [<0>] cleanup_module+0x14/0x6bc [ath10k_sdio]
> [<0>] __arm64_sys_delete_module+0x1e0/0x22c
> [<0>] el0_svc_common+0xa4/0x154
> [<0>] el0_svc_compat_handler+0x2c/0x38
> [<0>] el0_svc_compat+0x8/0x18
> [<0>] 0xffffffffffffffff
>
> SNOC:
> [  647.156863] Call trace:
> [  647.162166] [<ffffff80080855a4>] __switch_to+0x120/0x13c
> [  647.164512] [<ffffff800899d8b8>] __schedule+0x5ec/0x798
> [  647.170062] [<ffffff800899dad8>] schedule+0x74/0x94
> [  647.175050] [<ffffff80089a0848>] schedule_timeout+0x314/0x42c
> [  647.179874] [<ffffff80089a0a14>] schedule_timeout_uninterruptible+0x34/0x40
> [  647.185780] [<ffffff80082a494>] msleep+0x28/0x38
> [  647.192546] [<ffffff800117ec4c>] ath10k_snoc_hif_stop+0x4c/0x1e0 [ath10k_snoc]
> [  647.197439] [<ffffff80010dfbd8>] ath10k_core_stop+0x50/0x7c [ath10k_core]
> [  647.204652] [<ffffff80010c8f48>] ath10k_halt+0x114/0x16c [ath10k_core]
> [  647.211420] [<ffffff80010cad68>] ath10k_stop+0x4c/0x88 [ath10k_core]
> [  647.217865] [<ffffff8000fdbf54>] drv_stop+0x110/0x244 [mac80211]
> [  647.224367] [<ffffff80010147ac>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [  647.230359] [<ffffff8000ff3eec>] ieee80211_do_stop+0x6a4/0x73c [mac80211]
> [  647.237033] [<ffffff8000ff4500>] ieee80211_stop+0x20/0x30 [mac80211]
> [  647.243942] [<ffffff80087e39b8>] __dev_close_many+0xa0/0xfc
> [  647.250435] [<ffffff80087e3888>] dev_close_many+0x70/0x100
> [  647.255651] [<ffffff80087e3a60>] dev_close+0x4c/0x80
> [  647.261244] [<ffffff8000f1ba54>] cfg80211_shutdown_all_interfaces+0x44/0xcc [cfg80211]
> [  647.266383] [<ffffff8000ff3fdc>] ieee80211_remove_interfaces+0x58/0x1b4 [mac80211]
> [  647.274128] [<ffffff8000fda540>] ieee80211_unregister_hw+0x50/0x120 [mac80211]
> [  647.281659] [<ffffff80010ca314>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
> [  647.288839] [<ffffff80010dfc94>] ath10k_core_unregister+0x48/0x90 [ath10k_core]
> [  647.296027] [<ffffff800117e598>] ath10k_snoc_remove+0x5c/0x150 [ath10k_snoc]
> [  647.303229] [<ffffff80085625fc>] platform_drv_remove+0x28/0x50
> [  647.310517] [<ffffff80085601a4>] device_release_driver_internal+0x114/0x1b8
> [  647.316257] [<ffffff80085602e4>] driver_detach+0x6c/0xa8
> [  647.323021] [<ffffff800855e5b8>] bus_remove_driver+0x78/0xa8
> [  647.328571] [<ffffff800856107c>] driver_unregister+0x30/0x50
> [  647.334213] [<ffffff8008562674>] platform_driver_unregister+0x1c/0x28
> [  647.339876] [<ffffff800117fefc>] cleanup_module+0x1c/0x120 [ath10k_snoc]
> [  647.346196] [<ffffff8008143ab8>] SyS_delete_module+0x1dc/0x22c
>
> PCIe:
> [  615.392770] rmmod           D    0  3523   3458 0x00000080
> [  615.392777] Call Trace:
> [  615.392784]  __schedule+0x617/0x7d3
> [  615.392791]  ? __mod_timer+0x263/0x35c
> [  615.392797]  schedule+0x62/0x72
> [  615.392803]  schedule_timeout+0x8d/0xf3
> [  615.392809]  ? run_local_timers+0x6b/0x6b
> [  615.392814]  msleep+0x1b/0x22
> [  615.392824]  ath10k_pci_hif_stop+0x68/0xd6 [ath10k_pci]
> [  615.392844]  ath10k_core_stop+0x44/0x67 [ath10k_core]
> [  615.392859]  ath10k_halt+0x102/0x153 [ath10k_core]
> [  615.392873]  ath10k_stop+0x38/0x75 [ath10k_core]
> [  615.392893]  drv_stop+0x9a/0x13c [mac80211]
> [  615.392915]  ieee80211_do_stop+0x772/0x7cd [mac80211]
> [  615.392937]  ieee80211_stop+0x1a/0x1e [mac80211]
> [  615.392945]  __dev_close_many+0x9e/0xf0
> [  615.392952]  dev_close_many+0x62/0xe8
> [  615.392958]  dev_close+0x54/0x7d
> [  615.392975]  cfg80211_shutdown_all_interfaces+0x6e/0xa5 [cfg80211]
> [  615.393021]  ieee80211_remove_interfaces+0x52/0x1aa [mac80211]
> [  615.393049]  ieee80211_unregister_hw+0x54/0x136 [mac80211]
> [  615.393068]  ath10k_mac_unregister+0x19/0x4a [ath10k_core]
> [  615.393091]  ath10k_core_unregister+0x39/0x7e [ath10k_core]
> [  615.393104]  ath10k_pci_remove+0x3d/0x7f [ath10k_pci]
> [  615.393117]  pci_device_remove+0x41/0xa6
> [  615.393129]  device_release_driver_internal+0x123/0x1ec
> [  615.393140]  driver_detach+0x60/0x90
> [  615.393152]  bus_remove_driver+0x72/0x9f
> [  615.393164]  pci_unregister_driver+0x1e/0x87
> [  615.393177]  SyS_delete_module+0x1d7/0x277
> [  615.393188]  do_syscall_64+0x6b/0xf7
> [  615.393199]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
>
> The test command run simulate_fw_crash firstly and it call into
> ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable
> is called and bit NAPI_STATE_SCHED is set. After that, function
> ath10k_sdio_hif_stop is called again from ath10k_stop by command
> "ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked.
>
> It is blocked by napi_synchronize, napi_disable will set bit with
> NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop
> becuase bit NAPI_STATE_SCHED is set by napi_disable.
>
> function of napi_synchronize
> static inline void napi_synchronize(const struct napi_struct *n)
> {
> 	if (IS_ENABLED(CONFIG_SMP))
> 		while (test_bit(NAPI_STATE_SCHED, &n->state))
> 			msleep(1);
> 	else
> 		barrier();
> }
>
> function of napi_disable
> void napi_disable(struct napi_struct *n)
> {
> 	might_sleep();
> 	set_bit(NAPI_STATE_DISABLE, &n->state);
>
> 	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
> 		msleep(1);
> 	while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
> 		msleep(1);
>
> 	hrtimer_cancel(&n->timer);
>
> 	clear_bit(NAPI_STATE_DISABLE, &n->state);
> }
>
> Add flag for it avoid the hang and crash.
>
> Tested-on: QCA6174 hw3.2 SDIO WLAN.RMH.4.4.1-00049
> Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00110-QCARMSWP-1
> Tested-on: WCN3990 hw1.0 SNOC hw1.0 WLAN.HL.3.1-01307.1-QCAHLSWMTPL-2
>
> Signed-off-by: Wen Gong <wgong@codeaurora.org>

[...]

> +void ath10k_core_napi_enable(struct ath10k *ar)
> +{
> +	if (ar->napi_enabled)
> +		return;
> +
> +	napi_enable(&ar->napi);
> +	ar->napi_enabled = true;
> +}
> +EXPORT_SYMBOL(ath10k_core_napi_enable);
> +
> +void ath10k_core_napi_sync_disable(struct ath10k *ar)
> +{
> +	if (ar->napi_enabled) {
> +		napi_synchronize(&ar->napi);
> +		napi_disable(&ar->napi);
> +		ar->napi_enabled = false;
> +	}
> +}
> +EXPORT_SYMBOL(ath10k_core_napi_sync_disable);

Just like with the recent firmware restart patch, isn't ar->napi_enabled
racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?

Or are we holding a lock? But then that should be documented with
lockdep_assert_held().

-- 
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
  2020-08-28 12:22 ` Wen Gong
                   ` (2 preceding siblings ...)
  (?)
@ 2020-09-07 16:23 ` Kalle Valo
  -1 siblings, 0 replies; 24+ messages in thread
From: Kalle Valo @ 2020-09-07 16:23 UTC (permalink / raw)
  To: Wen Gong; +Cc: linux-wireless, ath10k

Wen Gong <wgong@codeaurora.org> writes:

> It happened "Kernel panic - not syncing: hung_task: blocked tasks" when
> test simulate crash and ifconfig down/rmmod meanwhile.
>
> Test steps:
>
> 1.Test commands, either can reproduce the hang for PCIe, SDIO and SNOC.
> echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down
> echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio
> echo hw-restart > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_pci
>
> 2. dmesg:
> [ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash
> [ 5622.655995] ieee80211 phy0: Hardware restart was requested
> [ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds.
> [ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds.
> [ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks
> [ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G        W         4.19.86 #137
> [ 5776.359846] Hardware name: MediaTek krane sku176 board (DT)
> [ 5776.359855] Call trace:
> [ 5776.359868]  dump_backtrace+0x0/0x170
> [ 5776.359881]  show_stack+0x20/0x2c
> [ 5776.359896]  dump_stack+0xd4/0x10c
> [ 5776.359916]  panic+0x12c/0x29c
> [ 5776.359937]  hung_task_panic+0x0/0x50
> [ 5776.359953]  kthread+0x120/0x130
> [ 5776.359965]  ret_from_fork+0x10/0x18
> [ 5776.359986] SMP: stopping secondary CPUs
> [ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000
> [ 5776.360026] CPU features: 0x0,2188200c
> [ 5776.360035] Memory Limit: none
>
> command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked
> callstack of ifconfig:
> [<0>] __switch_to+0x120/0x13c
> [<0>] msleep+0x28/0x38
> [<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio]
> [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> [<0>] __dev_close_many+0xb8/0x11c
> [<0>] __dev_change_flags+0xe0/0x1d0
> [<0>] dev_change_flags+0x30/0x6c
> [<0>] devinet_ioctl+0x370/0x564
> [<0>] inet_ioctl+0xdc/0x304
> [<0>] sock_do_ioctl+0x50/0x288
> [<0>] compat_sock_ioctl+0x1b4/0x1aac
> [<0>] __se_compat_sys_ioctl+0x100/0x26fc
> [<0>] __arm64_compat_sys_ioctl+0x20/0x2c
> [<0>] el0_svc_common+0xa4/0x154
> [<0>] el0_svc_compat_handler+0x2c/0x38
> [<0>] el0_svc_compat+0x8/0x18
> [<0>] 0xffffffffffffffff
>
> callstack of rmmod:
> [<0>] __switch_to+0x120/0x13c
> [<0>] msleep+0x28/0x38
> [<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio]
> [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> [<0>] __dev_close_many+0xb8/0x11c
> [<0>] dev_close_many+0x70/0x100
> [<0>] dev_close+0x4c/0x80
> [<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211]
> [<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211]
> [<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211]
> [<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
> [<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core]
> [<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio]
> [<0>] sdio_bus_remove+0x48/0x108
> [<0>] device_release_driver_internal+0x138/0x1ec
> [<0>] driver_detach+0x6c/0xa8
> [<0>] bus_remove_driver+0x78/0xa8
> [<0>] driver_unregister+0x30/0x50
> [<0>] sdio_unregister_driver+0x28/0x34
> [<0>] cleanup_module+0x14/0x6bc [ath10k_sdio]
> [<0>] __arm64_sys_delete_module+0x1e0/0x22c
> [<0>] el0_svc_common+0xa4/0x154
> [<0>] el0_svc_compat_handler+0x2c/0x38
> [<0>] el0_svc_compat+0x8/0x18
> [<0>] 0xffffffffffffffff
>
> SNOC:
> [  647.156863] Call trace:
> [  647.162166] [<ffffff80080855a4>] __switch_to+0x120/0x13c
> [  647.164512] [<ffffff800899d8b8>] __schedule+0x5ec/0x798
> [  647.170062] [<ffffff800899dad8>] schedule+0x74/0x94
> [  647.175050] [<ffffff80089a0848>] schedule_timeout+0x314/0x42c
> [  647.179874] [<ffffff80089a0a14>] schedule_timeout_uninterruptible+0x34/0x40
> [  647.185780] [<ffffff80082a494>] msleep+0x28/0x38
> [  647.192546] [<ffffff800117ec4c>] ath10k_snoc_hif_stop+0x4c/0x1e0 [ath10k_snoc]
> [  647.197439] [<ffffff80010dfbd8>] ath10k_core_stop+0x50/0x7c [ath10k_core]
> [  647.204652] [<ffffff80010c8f48>] ath10k_halt+0x114/0x16c [ath10k_core]
> [  647.211420] [<ffffff80010cad68>] ath10k_stop+0x4c/0x88 [ath10k_core]
> [  647.217865] [<ffffff8000fdbf54>] drv_stop+0x110/0x244 [mac80211]
> [  647.224367] [<ffffff80010147ac>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [  647.230359] [<ffffff8000ff3eec>] ieee80211_do_stop+0x6a4/0x73c [mac80211]
> [  647.237033] [<ffffff8000ff4500>] ieee80211_stop+0x20/0x30 [mac80211]
> [  647.243942] [<ffffff80087e39b8>] __dev_close_many+0xa0/0xfc
> [  647.250435] [<ffffff80087e3888>] dev_close_many+0x70/0x100
> [  647.255651] [<ffffff80087e3a60>] dev_close+0x4c/0x80
> [  647.261244] [<ffffff8000f1ba54>] cfg80211_shutdown_all_interfaces+0x44/0xcc [cfg80211]
> [  647.266383] [<ffffff8000ff3fdc>] ieee80211_remove_interfaces+0x58/0x1b4 [mac80211]
> [  647.274128] [<ffffff8000fda540>] ieee80211_unregister_hw+0x50/0x120 [mac80211]
> [  647.281659] [<ffffff80010ca314>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
> [  647.288839] [<ffffff80010dfc94>] ath10k_core_unregister+0x48/0x90 [ath10k_core]
> [  647.296027] [<ffffff800117e598>] ath10k_snoc_remove+0x5c/0x150 [ath10k_snoc]
> [  647.303229] [<ffffff80085625fc>] platform_drv_remove+0x28/0x50
> [  647.310517] [<ffffff80085601a4>] device_release_driver_internal+0x114/0x1b8
> [  647.316257] [<ffffff80085602e4>] driver_detach+0x6c/0xa8
> [  647.323021] [<ffffff800855e5b8>] bus_remove_driver+0x78/0xa8
> [  647.328571] [<ffffff800856107c>] driver_unregister+0x30/0x50
> [  647.334213] [<ffffff8008562674>] platform_driver_unregister+0x1c/0x28
> [  647.339876] [<ffffff800117fefc>] cleanup_module+0x1c/0x120 [ath10k_snoc]
> [  647.346196] [<ffffff8008143ab8>] SyS_delete_module+0x1dc/0x22c
>
> PCIe:
> [  615.392770] rmmod           D    0  3523   3458 0x00000080
> [  615.392777] Call Trace:
> [  615.392784]  __schedule+0x617/0x7d3
> [  615.392791]  ? __mod_timer+0x263/0x35c
> [  615.392797]  schedule+0x62/0x72
> [  615.392803]  schedule_timeout+0x8d/0xf3
> [  615.392809]  ? run_local_timers+0x6b/0x6b
> [  615.392814]  msleep+0x1b/0x22
> [  615.392824]  ath10k_pci_hif_stop+0x68/0xd6 [ath10k_pci]
> [  615.392844]  ath10k_core_stop+0x44/0x67 [ath10k_core]
> [  615.392859]  ath10k_halt+0x102/0x153 [ath10k_core]
> [  615.392873]  ath10k_stop+0x38/0x75 [ath10k_core]
> [  615.392893]  drv_stop+0x9a/0x13c [mac80211]
> [  615.392915]  ieee80211_do_stop+0x772/0x7cd [mac80211]
> [  615.392937]  ieee80211_stop+0x1a/0x1e [mac80211]
> [  615.392945]  __dev_close_many+0x9e/0xf0
> [  615.392952]  dev_close_many+0x62/0xe8
> [  615.392958]  dev_close+0x54/0x7d
> [  615.392975]  cfg80211_shutdown_all_interfaces+0x6e/0xa5 [cfg80211]
> [  615.393021]  ieee80211_remove_interfaces+0x52/0x1aa [mac80211]
> [  615.393049]  ieee80211_unregister_hw+0x54/0x136 [mac80211]
> [  615.393068]  ath10k_mac_unregister+0x19/0x4a [ath10k_core]
> [  615.393091]  ath10k_core_unregister+0x39/0x7e [ath10k_core]
> [  615.393104]  ath10k_pci_remove+0x3d/0x7f [ath10k_pci]
> [  615.393117]  pci_device_remove+0x41/0xa6
> [  615.393129]  device_release_driver_internal+0x123/0x1ec
> [  615.393140]  driver_detach+0x60/0x90
> [  615.393152]  bus_remove_driver+0x72/0x9f
> [  615.393164]  pci_unregister_driver+0x1e/0x87
> [  615.393177]  SyS_delete_module+0x1d7/0x277
> [  615.393188]  do_syscall_64+0x6b/0xf7
> [  615.393199]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
>
> The test command run simulate_fw_crash firstly and it call into
> ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable
> is called and bit NAPI_STATE_SCHED is set. After that, function
> ath10k_sdio_hif_stop is called again from ath10k_stop by command
> "ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked.
>
> It is blocked by napi_synchronize, napi_disable will set bit with
> NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop
> becuase bit NAPI_STATE_SCHED is set by napi_disable.
>
> function of napi_synchronize
> static inline void napi_synchronize(const struct napi_struct *n)
> {
> 	if (IS_ENABLED(CONFIG_SMP))
> 		while (test_bit(NAPI_STATE_SCHED, &n->state))
> 			msleep(1);
> 	else
> 		barrier();
> }
>
> function of napi_disable
> void napi_disable(struct napi_struct *n)
> {
> 	might_sleep();
> 	set_bit(NAPI_STATE_DISABLE, &n->state);
>
> 	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
> 		msleep(1);
> 	while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
> 		msleep(1);
>
> 	hrtimer_cancel(&n->timer);
>
> 	clear_bit(NAPI_STATE_DISABLE, &n->state);
> }
>
> Add flag for it avoid the hang and crash.
>
> Tested-on: QCA6174 hw3.2 SDIO WLAN.RMH.4.4.1-00049
> Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00110-QCARMSWP-1
> Tested-on: WCN3990 hw1.0 SNOC hw1.0 WLAN.HL.3.1-01307.1-QCAHLSWMTPL-2
>
> Signed-off-by: Wen Gong <wgong@codeaurora.org>

[...]

> +void ath10k_core_napi_enable(struct ath10k *ar)
> +{
> +	if (ar->napi_enabled)
> +		return;
> +
> +	napi_enable(&ar->napi);
> +	ar->napi_enabled = true;
> +}
> +EXPORT_SYMBOL(ath10k_core_napi_enable);
> +
> +void ath10k_core_napi_sync_disable(struct ath10k *ar)
> +{
> +	if (ar->napi_enabled) {
> +		napi_synchronize(&ar->napi);
> +		napi_disable(&ar->napi);
> +		ar->napi_enabled = false;
> +	}
> +}
> +EXPORT_SYMBOL(ath10k_core_napi_sync_disable);

Just like with the recent firmware restart patch, isn't ar->napi_enabled
racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?

Or are we holding a lock? But then that should be documented with
lockdep_assert_held().

-- 
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
       [not found] ` <87d02x1rqb.fsf@codeaurora.org>
  2020-09-08  3:45   ` Wen Gong
@ 2020-09-08  3:45   ` Wen Gong
  1 sibling, 0 replies; 24+ messages in thread
From: Wen Gong @ 2020-09-08  3:45 UTC (permalink / raw)
  To: Kalle Valo; +Cc: ath10k, linux-wireless

On 2020-09-08 00:22, Kalle Valo wrote:
> Wen Gong <wgong@codeaurora.org> writes:
[...]
> Just like with the recent firmware restart patch, isn't 
> ar->napi_enabled
> racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?
> 
> Or are we holding a lock? But then that should be documented with
> lockdep_assert_held().
yes, ath10k_hif_start is only called from ath10k_core_start, it has 
"lockdep_assert_held(&ar->conf_mutex)",
and ath10k_hif_stop is only called from ath10k_core_stop, it also has 
"lockdep_assert_held(&ar->conf_mutex)".
then it will not 2 thread both enter ath10k_hif_start/ath10k_hif_stop 
meanwhile.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
       [not found] ` <87d02x1rqb.fsf@codeaurora.org>
@ 2020-09-08  3:45   ` Wen Gong
  2020-12-09  9:24       ` Kalle Valo
  2020-09-08  3:45   ` Wen Gong
  1 sibling, 1 reply; 24+ messages in thread
From: Wen Gong @ 2020-09-08  3:45 UTC (permalink / raw)
  To: Kalle Valo; +Cc: linux-wireless, ath10k

On 2020-09-08 00:22, Kalle Valo wrote:
> Wen Gong <wgong@codeaurora.org> writes:
[...]
> Just like with the recent firmware restart patch, isn't 
> ar->napi_enabled
> racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?
> 
> Or are we holding a lock? But then that should be documented with
> lockdep_assert_held().
yes, ath10k_hif_start is only called from ath10k_core_start, it has 
"lockdep_assert_held(&ar->conf_mutex)",
and ath10k_hif_stop is only called from ath10k_core_stop, it also has 
"lockdep_assert_held(&ar->conf_mutex)".
then it will not 2 thread both enter ath10k_hif_start/ath10k_hif_stop 
meanwhile.

_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
  2020-08-28 14:09   ` Krishna Chaitanya
@ 2020-12-09  9:11     ` Kalle Valo
  -1 siblings, 0 replies; 24+ messages in thread
From: Kalle Valo @ 2020-12-09  9:11 UTC (permalink / raw)
  To: Krishna Chaitanya; +Cc: Wen Gong, linux-wireless, ath10k

Krishna Chaitanya <chaitanya.mgit@gmail.com> writes:

> On Fri, Aug 28, 2020 at 5:53 PM Wen Gong <wgong@codeaurora.org> wrote:
>>
>> It happened "Kernel panic - not syncing: hung_task: blocked tasks" when
>> test simulate crash and ifconfig down/rmmod meanwhile.

(editing out hundreds of lines of unnecessary text, please edit your quotes)

> LGTM, except maybe a shorted title "ath10k: Prevent deinitializing NAPI twice".

Thanks, I fixed the title in the pending branch.

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
@ 2020-12-09  9:11     ` Kalle Valo
  0 siblings, 0 replies; 24+ messages in thread
From: Kalle Valo @ 2020-12-09  9:11 UTC (permalink / raw)
  To: Krishna Chaitanya; +Cc: linux-wireless, ath10k, Wen Gong

Krishna Chaitanya <chaitanya.mgit@gmail.com> writes:

> On Fri, Aug 28, 2020 at 5:53 PM Wen Gong <wgong@codeaurora.org> wrote:
>>
>> It happened "Kernel panic - not syncing: hung_task: blocked tasks" when
>> test simulate crash and ifconfig down/rmmod meanwhile.

(editing out hundreds of lines of unnecessary text, please edit your quotes)

> LGTM, except maybe a shorted title "ath10k: Prevent deinitializing NAPI twice".

Thanks, I fixed the title in the pending branch.

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
  2020-09-08  3:45   ` Wen Gong
@ 2020-12-09  9:24       ` Kalle Valo
  0 siblings, 0 replies; 24+ messages in thread
From: Kalle Valo @ 2020-12-09  9:24 UTC (permalink / raw)
  To: Wen Gong; +Cc: linux-wireless, ath10k

Wen Gong <wgong@codeaurora.org> writes:

> On 2020-09-08 00:22, Kalle Valo wrote:
>
>> Just like with the recent firmware restart patch, isn't
>> ar->napi_enabled
>> racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?
>>
>> Or are we holding a lock? But then that should be documented with
>> lockdep_assert_held().
>
> yes, ath10k_hif_start is only called from ath10k_core_start, it has
> "lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
> called from ath10k_core_stop, it also has
> "lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread both
> enter ath10k_hif_start/ath10k_hif_stop meanwhile.

Ok, but every function depending on a lock being held should still call
lockdep_assert_held(), that way we can catch the bug if locking changes
later. So it's not enough that ath10k_core_stop() has
lockdep_assert_held(), also these napi functions should have it.

I actually decided to switch using ATH10K_FLAG_NAPI_ENABLED with
set_bit() & co, simpler locking that way and no lockdep_assert_held()
needed anymore. Please check my changes in the pending branch, I have
only compile tested them:

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=e0a466d296bd862080f7796b41349f9f586272c9

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
@ 2020-12-09  9:24       ` Kalle Valo
  0 siblings, 0 replies; 24+ messages in thread
From: Kalle Valo @ 2020-12-09  9:24 UTC (permalink / raw)
  To: Wen Gong; +Cc: linux-wireless, ath10k

Wen Gong <wgong@codeaurora.org> writes:

> On 2020-09-08 00:22, Kalle Valo wrote:
>
>> Just like with the recent firmware restart patch, isn't
>> ar->napi_enabled
>> racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?
>>
>> Or are we holding a lock? But then that should be documented with
>> lockdep_assert_held().
>
> yes, ath10k_hif_start is only called from ath10k_core_start, it has
> "lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
> called from ath10k_core_stop, it also has
> "lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread both
> enter ath10k_hif_start/ath10k_hif_stop meanwhile.

Ok, but every function depending on a lock being held should still call
lockdep_assert_held(), that way we can catch the bug if locking changes
later. So it's not enough that ath10k_core_stop() has
lockdep_assert_held(), also these napi functions should have it.

I actually decided to switch using ATH10K_FLAG_NAPI_ENABLED with
set_bit() & co, simpler locking that way and no lockdep_assert_held()
needed anymore. Please check my changes in the pending branch, I have
only compile tested them:

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=e0a466d296bd862080f7796b41349f9f586272c9

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
  2020-12-09  9:24       ` Kalle Valo
@ 2020-12-09 15:00         ` Ben Greear
  -1 siblings, 0 replies; 24+ messages in thread
From: Ben Greear @ 2020-12-09 15:00 UTC (permalink / raw)
  To: Kalle Valo, Wen Gong; +Cc: linux-wireless, ath10k

On 12/9/20 1:24 AM, Kalle Valo wrote:
> Wen Gong <wgong@codeaurora.org> writes:
> 
>> On 2020-09-08 00:22, Kalle Valo wrote:
>>
>>> Just like with the recent firmware restart patch, isn't
>>> ar->napi_enabled
>>> racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?
>>>
>>> Or are we holding a lock? But then that should be documented with
>>> lockdep_assert_held().
>>
>> yes, ath10k_hif_start is only called from ath10k_core_start, it has
>> "lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
>> called from ath10k_core_stop, it also has
>> "lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread both
>> enter ath10k_hif_start/ath10k_hif_stop meanwhile.
> 
> Ok, but every function depending on a lock being held should still call
> lockdep_assert_held(), that way we can catch the bug if locking changes
> later. So it's not enough that ath10k_core_stop() has
> lockdep_assert_held(), also these napi functions should have it.
> 
> I actually decided to switch using ATH10K_FLAG_NAPI_ENABLED with
> set_bit() & co, simpler locking that way and no lockdep_assert_held()
> needed anymore. Please check my changes in the pending branch, I have
> only compile tested them:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=e0a466d296bd862080f7796b41349f9f586272c9
> 

Why do you not need locking?  You can't just check a bit is set and then do work and set
it later without locking, two concurrent CPU threads can pass the first check and both get into
the logic below it?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
@ 2020-12-09 15:00         ` Ben Greear
  0 siblings, 0 replies; 24+ messages in thread
From: Ben Greear @ 2020-12-09 15:00 UTC (permalink / raw)
  To: Kalle Valo, Wen Gong; +Cc: linux-wireless, ath10k

On 12/9/20 1:24 AM, Kalle Valo wrote:
> Wen Gong <wgong@codeaurora.org> writes:
> 
>> On 2020-09-08 00:22, Kalle Valo wrote:
>>
>>> Just like with the recent firmware restart patch, isn't
>>> ar->napi_enabled
>>> racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?
>>>
>>> Or are we holding a lock? But then that should be documented with
>>> lockdep_assert_held().
>>
>> yes, ath10k_hif_start is only called from ath10k_core_start, it has
>> "lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
>> called from ath10k_core_stop, it also has
>> "lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread both
>> enter ath10k_hif_start/ath10k_hif_stop meanwhile.
> 
> Ok, but every function depending on a lock being held should still call
> lockdep_assert_held(), that way we can catch the bug if locking changes
> later. So it's not enough that ath10k_core_stop() has
> lockdep_assert_held(), also these napi functions should have it.
> 
> I actually decided to switch using ATH10K_FLAG_NAPI_ENABLED with
> set_bit() & co, simpler locking that way and no lockdep_assert_held()
> needed anymore. Please check my changes in the pending branch, I have
> only compile tested them:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=e0a466d296bd862080f7796b41349f9f586272c9
> 

Why do you not need locking?  You can't just check a bit is set and then do work and set
it later without locking, two concurrent CPU threads can pass the first check and both get into
the logic below it?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
  2020-12-09  9:24       ` Kalle Valo
@ 2020-12-10  2:27         ` Wen Gong
  -1 siblings, 0 replies; 24+ messages in thread
From: Wen Gong @ 2020-12-10  2:27 UTC (permalink / raw)
  To: Kalle Valo; +Cc: linux-wireless, ath10k

On 2020-12-09 17:24, Kalle Valo wrote:
> Wen Gong <wgong@codeaurora.org> writes:
> 
>> On 2020-09-08 00:22, Kalle Valo wrote:
>> 
>>> Just like with the recent firmware restart patch, isn't
>>> ar->napi_enabled
>>> racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?
>>> 
>>> Or are we holding a lock? But then that should be documented with
>>> lockdep_assert_held().
>> 
>> yes, ath10k_hif_start is only called from ath10k_core_start, it has
>> "lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
>> called from ath10k_core_stop, it also has
>> "lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread both
>> enter ath10k_hif_start/ath10k_hif_stop meanwhile.
> 
> Ok, but every function depending on a lock being held should still call
> lockdep_assert_held(), that way we can catch the bug if locking changes
> later. So it's not enough that ath10k_core_stop() has
> lockdep_assert_held(), also these napi functions should have it.
> 
> I actually decided to switch using ATH10K_FLAG_NAPI_ENABLED with
> set_bit() & co, simpler locking that way and no lockdep_assert_held()
> needed anymore. Please check my changes in the pending branch, I have
> only compile tested them:
I checked, it only changed ar->napi_enabled to flag 
ATH10K_FLAG_NAPI_ENABLED,
not found probelm.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=e0a466d296bd862080f7796b41349f9f586272c9

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
@ 2020-12-10  2:27         ` Wen Gong
  0 siblings, 0 replies; 24+ messages in thread
From: Wen Gong @ 2020-12-10  2:27 UTC (permalink / raw)
  To: Kalle Valo; +Cc: linux-wireless, ath10k

On 2020-12-09 17:24, Kalle Valo wrote:
> Wen Gong <wgong@codeaurora.org> writes:
> 
>> On 2020-09-08 00:22, Kalle Valo wrote:
>> 
>>> Just like with the recent firmware restart patch, isn't
>>> ar->napi_enabled
>>> racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?
>>> 
>>> Or are we holding a lock? But then that should be documented with
>>> lockdep_assert_held().
>> 
>> yes, ath10k_hif_start is only called from ath10k_core_start, it has
>> "lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
>> called from ath10k_core_stop, it also has
>> "lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread both
>> enter ath10k_hif_start/ath10k_hif_stop meanwhile.
> 
> Ok, but every function depending on a lock being held should still call
> lockdep_assert_held(), that way we can catch the bug if locking changes
> later. So it's not enough that ath10k_core_stop() has
> lockdep_assert_held(), also these napi functions should have it.
> 
> I actually decided to switch using ATH10K_FLAG_NAPI_ENABLED with
> set_bit() & co, simpler locking that way and no lockdep_assert_held()
> needed anymore. Please check my changes in the pending branch, I have
> only compile tested them:
I checked, it only changed ar->napi_enabled to flag 
ATH10K_FLAG_NAPI_ENABLED,
not found probelm.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=e0a466d296bd862080f7796b41349f9f586272c9

_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
  2020-12-09 15:00         ` Ben Greear
@ 2020-12-10  2:29           ` Wen Gong
  -1 siblings, 0 replies; 24+ messages in thread
From: Wen Gong @ 2020-12-10  2:29 UTC (permalink / raw)
  To: Ben Greear; +Cc: Kalle Valo, linux-wireless, ath10k

On 2020-12-09 23:00, Ben Greear wrote:
> On 12/9/20 1:24 AM, Kalle Valo wrote:
>> Wen Gong <wgong@codeaurora.org> writes:
>> 
>>> On 2020-09-08 00:22, Kalle Valo wrote:
>>> 
>>>> Just like with the recent firmware restart patch, isn't
>>>> ar->napi_enabled
>>>> racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?
>>>> 
>>>> Or are we holding a lock? But then that should be documented with
>>>> lockdep_assert_held().
>>> 
>>> yes, ath10k_hif_start is only called from ath10k_core_start, it has
>>> "lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
>>> called from ath10k_core_stop, it also has
>>> "lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread 
>>> both
>>> enter ath10k_hif_start/ath10k_hif_stop meanwhile.
>> 
>> Ok, but every function depending on a lock being held should still 
>> call
>> lockdep_assert_held(), that way we can catch the bug if locking 
>> changes
>> later. So it's not enough that ath10k_core_stop() has
>> lockdep_assert_held(), also these napi functions should have it.
>> 
>> I actually decided to switch using ATH10K_FLAG_NAPI_ENABLED with
>> set_bit() & co, simpler locking that way and no lockdep_assert_held()
>> needed anymore. Please check my changes in the pending branch, I have
>> only compile tested them:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=e0a466d296bd862080f7796b41349f9f586272c9
>> 
> 
> Why do you not need locking?  You can't just check a bit is set and
> then do work and set
> it later without locking, two concurrent CPU threads can pass the
> first check and both get into
> the logic below it?
> 
maybe because which I said before:
ath10k_hif_start is only called from ath10k_core_start, it has
"lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
called from ath10k_core_stop, it also has
"lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread both
enter ath10k_hif_start/ath10k_hif_stop meanwhile.

> Thanks,
> Ben

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
@ 2020-12-10  2:29           ` Wen Gong
  0 siblings, 0 replies; 24+ messages in thread
From: Wen Gong @ 2020-12-10  2:29 UTC (permalink / raw)
  To: Ben Greear; +Cc: linux-wireless, ath10k, Kalle Valo

On 2020-12-09 23:00, Ben Greear wrote:
> On 12/9/20 1:24 AM, Kalle Valo wrote:
>> Wen Gong <wgong@codeaurora.org> writes:
>> 
>>> On 2020-09-08 00:22, Kalle Valo wrote:
>>> 
>>>> Just like with the recent firmware restart patch, isn't
>>>> ar->napi_enabled
>>>> racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?
>>>> 
>>>> Or are we holding a lock? But then that should be documented with
>>>> lockdep_assert_held().
>>> 
>>> yes, ath10k_hif_start is only called from ath10k_core_start, it has
>>> "lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
>>> called from ath10k_core_stop, it also has
>>> "lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread 
>>> both
>>> enter ath10k_hif_start/ath10k_hif_stop meanwhile.
>> 
>> Ok, but every function depending on a lock being held should still 
>> call
>> lockdep_assert_held(), that way we can catch the bug if locking 
>> changes
>> later. So it's not enough that ath10k_core_stop() has
>> lockdep_assert_held(), also these napi functions should have it.
>> 
>> I actually decided to switch using ATH10K_FLAG_NAPI_ENABLED with
>> set_bit() & co, simpler locking that way and no lockdep_assert_held()
>> needed anymore. Please check my changes in the pending branch, I have
>> only compile tested them:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=e0a466d296bd862080f7796b41349f9f586272c9
>> 
> 
> Why do you not need locking?  You can't just check a bit is set and
> then do work and set
> it later without locking, two concurrent CPU threads can pass the
> first check and both get into
> the logic below it?
> 
maybe because which I said before:
ath10k_hif_start is only called from ath10k_core_start, it has
"lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
called from ath10k_core_stop, it also has
"lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread both
enter ath10k_hif_start/ath10k_hif_stop meanwhile.

> Thanks,
> Ben

_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
  2020-12-09 15:00         ` Ben Greear
@ 2020-12-15  7:56           ` Kalle Valo
  -1 siblings, 0 replies; 24+ messages in thread
From: Kalle Valo @ 2020-12-15  7:56 UTC (permalink / raw)
  To: Ben Greear; +Cc: Wen Gong, linux-wireless, ath10k

Ben Greear <greearb@candelatech.com> writes:

> On 12/9/20 1:24 AM, Kalle Valo wrote:
>> Wen Gong <wgong@codeaurora.org> writes:
>>
>>> On 2020-09-08 00:22, Kalle Valo wrote:
>>>
>>>> Just like with the recent firmware restart patch, isn't
>>>> ar->napi_enabled
>>>> racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?
>>>>
>>>> Or are we holding a lock? But then that should be documented with
>>>> lockdep_assert_held().
>>>
>>> yes, ath10k_hif_start is only called from ath10k_core_start, it has
>>> "lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
>>> called from ath10k_core_stop, it also has
>>> "lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread both
>>> enter ath10k_hif_start/ath10k_hif_stop meanwhile.
>>
>> Ok, but every function depending on a lock being held should still call
>> lockdep_assert_held(), that way we can catch the bug if locking changes
>> later. So it's not enough that ath10k_core_stop() has
>> lockdep_assert_held(), also these napi functions should have it.
>>
>> I actually decided to switch using ATH10K_FLAG_NAPI_ENABLED with
>> set_bit() & co, simpler locking that way and no lockdep_assert_held()
>> needed anymore. Please check my changes in the pending branch, I have
>> only compile tested them:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=e0a466d296bd862080f7796b41349f9f586272c9
>>
>
> Why do you not need locking? You can't just check a bit is set and
> then do work and set it later without locking, two concurrent CPU
> threads can pass the first check and both get into the logic below it?

Good point, there is a race. I now fixed the patch in the pending and
documented that core_mutex needs to be held when changing the NAPI
state:

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=2fe769592ef6d4ae14260989dcbdbde4bff01cb6

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
@ 2020-12-15  7:56           ` Kalle Valo
  0 siblings, 0 replies; 24+ messages in thread
From: Kalle Valo @ 2020-12-15  7:56 UTC (permalink / raw)
  To: Ben Greear; +Cc: linux-wireless, ath10k, Wen Gong

Ben Greear <greearb@candelatech.com> writes:

> On 12/9/20 1:24 AM, Kalle Valo wrote:
>> Wen Gong <wgong@codeaurora.org> writes:
>>
>>> On 2020-09-08 00:22, Kalle Valo wrote:
>>>
>>>> Just like with the recent firmware restart patch, isn't
>>>> ar->napi_enabled
>>>> racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?
>>>>
>>>> Or are we holding a lock? But then that should be documented with
>>>> lockdep_assert_held().
>>>
>>> yes, ath10k_hif_start is only called from ath10k_core_start, it has
>>> "lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
>>> called from ath10k_core_stop, it also has
>>> "lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread both
>>> enter ath10k_hif_start/ath10k_hif_stop meanwhile.
>>
>> Ok, but every function depending on a lock being held should still call
>> lockdep_assert_held(), that way we can catch the bug if locking changes
>> later. So it's not enough that ath10k_core_stop() has
>> lockdep_assert_held(), also these napi functions should have it.
>>
>> I actually decided to switch using ATH10K_FLAG_NAPI_ENABLED with
>> set_bit() & co, simpler locking that way and no lockdep_assert_held()
>> needed anymore. Please check my changes in the pending branch, I have
>> only compile tested them:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=e0a466d296bd862080f7796b41349f9f586272c9
>>
>
> Why do you not need locking? You can't just check a bit is set and
> then do work and set it later without locking, two concurrent CPU
> threads can pass the first check and both get into the logic below it?

Good point, there is a race. I now fixed the patch in the pending and
documented that core_mutex needs to be held when changing the NAPI
state:

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=2fe769592ef6d4ae14260989dcbdbde4bff01cb6

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
  2020-12-10  2:29           ` Wen Gong
@ 2020-12-15  8:05             ` Kalle Valo
  -1 siblings, 0 replies; 24+ messages in thread
From: Kalle Valo @ 2020-12-15  8:05 UTC (permalink / raw)
  To: Wen Gong; +Cc: Ben Greear, linux-wireless, ath10k

Wen Gong <wgong@codeaurora.org> writes:

> On 2020-12-09 23:00, Ben Greear wrote:
>> On 12/9/20 1:24 AM, Kalle Valo wrote:
>>> Wen Gong <wgong@codeaurora.org> writes:
>>>
>>>> On 2020-09-08 00:22, Kalle Valo wrote:
>>>>
>>>>> Just like with the recent firmware restart patch, isn't
>>>>> ar->napi_enabled
>>>>> racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?
>>>>>
>>>>> Or are we holding a lock? But then that should be documented with
>>>>> lockdep_assert_held().
>>>>
>>>> yes, ath10k_hif_start is only called from ath10k_core_start, it has
>>>> "lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
>>>> called from ath10k_core_stop, it also has
>>>> "lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread
>>>> both
>>>> enter ath10k_hif_start/ath10k_hif_stop meanwhile.
>>>
>>> Ok, but every function depending on a lock being held should still
>>> call
>>> lockdep_assert_held(), that way we can catch the bug if locking
>>> changes
>>> later. So it's not enough that ath10k_core_stop() has
>>> lockdep_assert_held(), also these napi functions should have it.
>>>
>>> I actually decided to switch using ATH10K_FLAG_NAPI_ENABLED with
>>> set_bit() & co, simpler locking that way and no lockdep_assert_held()
>>> needed anymore. Please check my changes in the pending branch, I have
>>> only compile tested them:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=e0a466d296bd862080f7796b41349f9f586272c9
>>>
>>
>> Why do you not need locking?  You can't just check a bit is set and
>> then do work and set
>> it later without locking, two concurrent CPU threads can pass the
>> first check and both get into
>> the logic below it?
>>
> maybe because which I said before:
>
> ath10k_hif_start is only called from ath10k_core_start, it has
> "lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
> called from ath10k_core_stop, it also has
> "lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread both
> enter ath10k_hif_start/ath10k_hif_stop meanwhile.

Yeah, but that was not visible from the code. I now changed the patch in
pending branch that this is clearly documented with
lockdep_assert_held() and lockdep will warn if someone breaks the
locking later on.

If a function relies on a lock being held, lockdep_assert_held() needs
to be _always_ used to make the locking dependencies clearly visible.

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
@ 2020-12-15  8:05             ` Kalle Valo
  0 siblings, 0 replies; 24+ messages in thread
From: Kalle Valo @ 2020-12-15  8:05 UTC (permalink / raw)
  To: Wen Gong; +Cc: Ben Greear, linux-wireless, ath10k

Wen Gong <wgong@codeaurora.org> writes:

> On 2020-12-09 23:00, Ben Greear wrote:
>> On 12/9/20 1:24 AM, Kalle Valo wrote:
>>> Wen Gong <wgong@codeaurora.org> writes:
>>>
>>>> On 2020-09-08 00:22, Kalle Valo wrote:
>>>>
>>>>> Just like with the recent firmware restart patch, isn't
>>>>> ar->napi_enabled
>>>>> racy? Wouldn't test_and_set_bit() and test_and_clear_bit() be safer?
>>>>>
>>>>> Or are we holding a lock? But then that should be documented with
>>>>> lockdep_assert_held().
>>>>
>>>> yes, ath10k_hif_start is only called from ath10k_core_start, it has
>>>> "lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
>>>> called from ath10k_core_stop, it also has
>>>> "lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread
>>>> both
>>>> enter ath10k_hif_start/ath10k_hif_stop meanwhile.
>>>
>>> Ok, but every function depending on a lock being held should still
>>> call
>>> lockdep_assert_held(), that way we can catch the bug if locking
>>> changes
>>> later. So it's not enough that ath10k_core_stop() has
>>> lockdep_assert_held(), also these napi functions should have it.
>>>
>>> I actually decided to switch using ATH10K_FLAG_NAPI_ENABLED with
>>> set_bit() & co, simpler locking that way and no lockdep_assert_held()
>>> needed anymore. Please check my changes in the pending branch, I have
>>> only compile tested them:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=e0a466d296bd862080f7796b41349f9f586272c9
>>>
>>
>> Why do you not need locking?  You can't just check a bit is set and
>> then do work and set
>> it later without locking, two concurrent CPU threads can pass the
>> first check and both get into
>> the logic below it?
>>
> maybe because which I said before:
>
> ath10k_hif_start is only called from ath10k_core_start, it has
> "lockdep_assert_held(&ar->conf_mutex)", and ath10k_hif_stop is only
> called from ath10k_core_stop, it also has
> "lockdep_assert_held(&ar->conf_mutex)". then it will not 2 thread both
> enter ath10k_hif_start/ath10k_hif_stop meanwhile.

Yeah, but that was not visible from the code. I now changed the patch in
pending branch that this is clearly documented with
lockdep_assert_held() and lockdep will warn if someone breaks the
locking later on.

If a function relies on a lock being held, lockdep_assert_held() needs
to be _always_ used to make the locking dependencies clearly visible.

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
  2020-08-28 12:22 ` Wen Gong
                   ` (4 preceding siblings ...)
  (?)
@ 2020-12-17  6:52 ` Kalle Valo
  -1 siblings, 0 replies; 24+ messages in thread
From: Kalle Valo @ 2020-12-17  6:52 UTC (permalink / raw)
  To: Wen Gong; +Cc: ath10k, linux-wireless, wgong

Wen Gong <wgong@codeaurora.org> wrote:

> It happened "Kernel panic - not syncing: hung_task: blocked tasks" when
> test simulate crash and ifconfig down/rmmod meanwhile.
> 
> Test steps:
> 
> 1.Test commands, either can reproduce the hang for PCIe, SDIO and SNOC.
> echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down
> echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio
> echo hw-restart > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_pci
> 
> 2. dmesg:
> [ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash
> [ 5622.655995] ieee80211 phy0: Hardware restart was requested
> [ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds.
> [ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds.
> [ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks
> [ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G        W         4.19.86 #137
> [ 5776.359846] Hardware name: MediaTek krane sku176 board (DT)
> [ 5776.359855] Call trace:
> [ 5776.359868]  dump_backtrace+0x0/0x170
> [ 5776.359881]  show_stack+0x20/0x2c
> [ 5776.359896]  dump_stack+0xd4/0x10c
> [ 5776.359916]  panic+0x12c/0x29c
> [ 5776.359937]  hung_task_panic+0x0/0x50
> [ 5776.359953]  kthread+0x120/0x130
> [ 5776.359965]  ret_from_fork+0x10/0x18
> [ 5776.359986] SMP: stopping secondary CPUs
> [ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000
> [ 5776.360026] CPU features: 0x0,2188200c
> [ 5776.360035] Memory Limit: none
> 
> command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked
> callstack of ifconfig:
> [<0>] __switch_to+0x120/0x13c
> [<0>] msleep+0x28/0x38
> [<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio]
> [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> [<0>] __dev_close_many+0xb8/0x11c
> [<0>] __dev_change_flags+0xe0/0x1d0
> [<0>] dev_change_flags+0x30/0x6c
> [<0>] devinet_ioctl+0x370/0x564
> [<0>] inet_ioctl+0xdc/0x304
> [<0>] sock_do_ioctl+0x50/0x288
> [<0>] compat_sock_ioctl+0x1b4/0x1aac
> [<0>] __se_compat_sys_ioctl+0x100/0x26fc
> [<0>] __arm64_compat_sys_ioctl+0x20/0x2c
> [<0>] el0_svc_common+0xa4/0x154
> [<0>] el0_svc_compat_handler+0x2c/0x38
> [<0>] el0_svc_compat+0x8/0x18
> [<0>] 0xffffffffffffffff
> 
> callstack of rmmod:
> [<0>] __switch_to+0x120/0x13c
> [<0>] msleep+0x28/0x38
> [<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio]
> [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> [<0>] __dev_close_many+0xb8/0x11c
> [<0>] dev_close_many+0x70/0x100
> [<0>] dev_close+0x4c/0x80
> [<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211]
> [<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211]
> [<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211]
> [<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
> [<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core]
> [<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio]
> [<0>] sdio_bus_remove+0x48/0x108
> [<0>] device_release_driver_internal+0x138/0x1ec
> [<0>] driver_detach+0x6c/0xa8
> [<0>] bus_remove_driver+0x78/0xa8
> [<0>] driver_unregister+0x30/0x50
> [<0>] sdio_unregister_driver+0x28/0x34
> [<0>] cleanup_module+0x14/0x6bc [ath10k_sdio]
> [<0>] __arm64_sys_delete_module+0x1e0/0x22c
> [<0>] el0_svc_common+0xa4/0x154
> [<0>] el0_svc_compat_handler+0x2c/0x38
> [<0>] el0_svc_compat+0x8/0x18
> [<0>] 0xffffffffffffffff
> 
> SNOC:
> [  647.156863] Call trace:
> [  647.162166] [<ffffff80080855a4>] __switch_to+0x120/0x13c
> [  647.164512] [<ffffff800899d8b8>] __schedule+0x5ec/0x798
> [  647.170062] [<ffffff800899dad8>] schedule+0x74/0x94
> [  647.175050] [<ffffff80089a0848>] schedule_timeout+0x314/0x42c
> [  647.179874] [<ffffff80089a0a14>] schedule_timeout_uninterruptible+0x34/0x40
> [  647.185780] [<ffffff80082a494>] msleep+0x28/0x38
> [  647.192546] [<ffffff800117ec4c>] ath10k_snoc_hif_stop+0x4c/0x1e0 [ath10k_snoc]
> [  647.197439] [<ffffff80010dfbd8>] ath10k_core_stop+0x50/0x7c [ath10k_core]
> [  647.204652] [<ffffff80010c8f48>] ath10k_halt+0x114/0x16c [ath10k_core]
> [  647.211420] [<ffffff80010cad68>] ath10k_stop+0x4c/0x88 [ath10k_core]
> [  647.217865] [<ffffff8000fdbf54>] drv_stop+0x110/0x244 [mac80211]
> [  647.224367] [<ffffff80010147ac>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [  647.230359] [<ffffff8000ff3eec>] ieee80211_do_stop+0x6a4/0x73c [mac80211]
> [  647.237033] [<ffffff8000ff4500>] ieee80211_stop+0x20/0x30 [mac80211]
> [  647.243942] [<ffffff80087e39b8>] __dev_close_many+0xa0/0xfc
> [  647.250435] [<ffffff80087e3888>] dev_close_many+0x70/0x100
> [  647.255651] [<ffffff80087e3a60>] dev_close+0x4c/0x80
> [  647.261244] [<ffffff8000f1ba54>] cfg80211_shutdown_all_interfaces+0x44/0xcc [cfg80211]
> [  647.266383] [<ffffff8000ff3fdc>] ieee80211_remove_interfaces+0x58/0x1b4 [mac80211]
> [  647.274128] [<ffffff8000fda540>] ieee80211_unregister_hw+0x50/0x120 [mac80211]
> [  647.281659] [<ffffff80010ca314>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
> [  647.288839] [<ffffff80010dfc94>] ath10k_core_unregister+0x48/0x90 [ath10k_core]
> [  647.296027] [<ffffff800117e598>] ath10k_snoc_remove+0x5c/0x150 [ath10k_snoc]
> [  647.303229] [<ffffff80085625fc>] platform_drv_remove+0x28/0x50
> [  647.310517] [<ffffff80085601a4>] device_release_driver_internal+0x114/0x1b8
> [  647.316257] [<ffffff80085602e4>] driver_detach+0x6c/0xa8
> [  647.323021] [<ffffff800855e5b8>] bus_remove_driver+0x78/0xa8
> [  647.328571] [<ffffff800856107c>] driver_unregister+0x30/0x50
> [  647.334213] [<ffffff8008562674>] platform_driver_unregister+0x1c/0x28
> [  647.339876] [<ffffff800117fefc>] cleanup_module+0x1c/0x120 [ath10k_snoc]
> [  647.346196] [<ffffff8008143ab8>] SyS_delete_module+0x1dc/0x22c
> 
> PCIe:
> [  615.392770] rmmod           D    0  3523   3458 0x00000080
> [  615.392777] Call Trace:
> [  615.392784]  __schedule+0x617/0x7d3
> [  615.392791]  ? __mod_timer+0x263/0x35c
> [  615.392797]  schedule+0x62/0x72
> [  615.392803]  schedule_timeout+0x8d/0xf3
> [  615.392809]  ? run_local_timers+0x6b/0x6b
> [  615.392814]  msleep+0x1b/0x22
> [  615.392824]  ath10k_pci_hif_stop+0x68/0xd6 [ath10k_pci]
> [  615.392844]  ath10k_core_stop+0x44/0x67 [ath10k_core]
> [  615.392859]  ath10k_halt+0x102/0x153 [ath10k_core]
> [  615.392873]  ath10k_stop+0x38/0x75 [ath10k_core]
> [  615.392893]  drv_stop+0x9a/0x13c [mac80211]
> [  615.392915]  ieee80211_do_stop+0x772/0x7cd [mac80211]
> [  615.392937]  ieee80211_stop+0x1a/0x1e [mac80211]
> [  615.392945]  __dev_close_many+0x9e/0xf0
> [  615.392952]  dev_close_many+0x62/0xe8
> [  615.392958]  dev_close+0x54/0x7d
> [  615.392975]  cfg80211_shutdown_all_interfaces+0x6e/0xa5 [cfg80211]
> [  615.393021]  ieee80211_remove_interfaces+0x52/0x1aa [mac80211]
> [  615.393049]  ieee80211_unregister_hw+0x54/0x136 [mac80211]
> [  615.393068]  ath10k_mac_unregister+0x19/0x4a [ath10k_core]
> [  615.393091]  ath10k_core_unregister+0x39/0x7e [ath10k_core]
> [  615.393104]  ath10k_pci_remove+0x3d/0x7f [ath10k_pci]
> [  615.393117]  pci_device_remove+0x41/0xa6
> [  615.393129]  device_release_driver_internal+0x123/0x1ec
> [  615.393140]  driver_detach+0x60/0x90
> [  615.393152]  bus_remove_driver+0x72/0x9f
> [  615.393164]  pci_unregister_driver+0x1e/0x87
> [  615.393177]  SyS_delete_module+0x1d7/0x277
> [  615.393188]  do_syscall_64+0x6b/0xf7
> [  615.393199]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
> 
> The test command run simulate_fw_crash firstly and it call into
> ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable
> is called and bit NAPI_STATE_SCHED is set. After that, function
> ath10k_sdio_hif_stop is called again from ath10k_stop by command
> "ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked.
> 
> It is blocked by napi_synchronize, napi_disable will set bit with
> NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop
> becuase bit NAPI_STATE_SCHED is set by napi_disable.
> 
> function of napi_synchronize
> static inline void napi_synchronize(const struct napi_struct *n)
> {
> 	if (IS_ENABLED(CONFIG_SMP))
> 		while (test_bit(NAPI_STATE_SCHED, &n->state))
> 			msleep(1);
> 	else
> 		barrier();
> }
> 
> function of napi_disable
> void napi_disable(struct napi_struct *n)
> {
> 	might_sleep();
> 	set_bit(NAPI_STATE_DISABLE, &n->state);
> 
> 	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
> 		msleep(1);
> 	while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
> 		msleep(1);
> 
> 	hrtimer_cancel(&n->timer);
> 
> 	clear_bit(NAPI_STATE_DISABLE, &n->state);
> }
> 
> Add flag for it avoid the hang and crash.
> 
> Tested-on: QCA6174 hw3.2 SDIO WLAN.RMH.4.4.1-00049
> Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00110-QCARMSWP-1
> Tested-on: WCN3990 hw1.0 SNOC hw1.0 WLAN.HL.3.1-01307.1-QCAHLSWMTPL-2
> 
> Signed-off-by: Wen Gong <wgong@codeaurora.org>
> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>

Patch applied to ath-next branch of ath.git, thanks.

e2f8b74e58cb ath10k: prevent deinitializing NAPI twice

-- 
https://patchwork.kernel.org/project/linux-wireless/patch/1598617348-2325-1-git-send-email-wgong@codeaurora.org/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang
  2020-08-28 12:22 ` Wen Gong
                   ` (5 preceding siblings ...)
  (?)
@ 2020-12-17  6:52 ` Kalle Valo
  -1 siblings, 0 replies; 24+ messages in thread
From: Kalle Valo @ 2020-12-17  6:52 UTC (permalink / raw)
  To: Wen Gong; +Cc: linux-wireless, ath10k, wgong

Wen Gong <wgong@codeaurora.org> wrote:

> It happened "Kernel panic - not syncing: hung_task: blocked tasks" when
> test simulate crash and ifconfig down/rmmod meanwhile.
> 
> Test steps:
> 
> 1.Test commands, either can reproduce the hang for PCIe, SDIO and SNOC.
> echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down
> echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio
> echo hw-restart > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_pci
> 
> 2. dmesg:
> [ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash
> [ 5622.655995] ieee80211 phy0: Hardware restart was requested
> [ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds.
> [ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds.
> [ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks
> [ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G        W         4.19.86 #137
> [ 5776.359846] Hardware name: MediaTek krane sku176 board (DT)
> [ 5776.359855] Call trace:
> [ 5776.359868]  dump_backtrace+0x0/0x170
> [ 5776.359881]  show_stack+0x20/0x2c
> [ 5776.359896]  dump_stack+0xd4/0x10c
> [ 5776.359916]  panic+0x12c/0x29c
> [ 5776.359937]  hung_task_panic+0x0/0x50
> [ 5776.359953]  kthread+0x120/0x130
> [ 5776.359965]  ret_from_fork+0x10/0x18
> [ 5776.359986] SMP: stopping secondary CPUs
> [ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000
> [ 5776.360026] CPU features: 0x0,2188200c
> [ 5776.360035] Memory Limit: none
> 
> command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked
> callstack of ifconfig:
> [<0>] __switch_to+0x120/0x13c
> [<0>] msleep+0x28/0x38
> [<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio]
> [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> [<0>] __dev_close_many+0xb8/0x11c
> [<0>] __dev_change_flags+0xe0/0x1d0
> [<0>] dev_change_flags+0x30/0x6c
> [<0>] devinet_ioctl+0x370/0x564
> [<0>] inet_ioctl+0xdc/0x304
> [<0>] sock_do_ioctl+0x50/0x288
> [<0>] compat_sock_ioctl+0x1b4/0x1aac
> [<0>] __se_compat_sys_ioctl+0x100/0x26fc
> [<0>] __arm64_compat_sys_ioctl+0x20/0x2c
> [<0>] el0_svc_common+0xa4/0x154
> [<0>] el0_svc_compat_handler+0x2c/0x38
> [<0>] el0_svc_compat+0x8/0x18
> [<0>] 0xffffffffffffffff
> 
> callstack of rmmod:
> [<0>] __switch_to+0x120/0x13c
> [<0>] msleep+0x28/0x38
> [<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio]
> [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core]
> [<0>] ath10k_halt+0x120/0x178 [ath10k_core]
> [<0>] ath10k_stop+0x4c/0x8c [ath10k_core]
> [<0>] drv_stop+0xe0/0x1e4 [mac80211]
> [<0>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211]
> [<0>] ieee80211_stop+0x20/0x30 [mac80211]
> [<0>] __dev_close_many+0xb8/0x11c
> [<0>] dev_close_many+0x70/0x100
> [<0>] dev_close+0x4c/0x80
> [<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211]
> [<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211]
> [<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211]
> [<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
> [<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core]
> [<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio]
> [<0>] sdio_bus_remove+0x48/0x108
> [<0>] device_release_driver_internal+0x138/0x1ec
> [<0>] driver_detach+0x6c/0xa8
> [<0>] bus_remove_driver+0x78/0xa8
> [<0>] driver_unregister+0x30/0x50
> [<0>] sdio_unregister_driver+0x28/0x34
> [<0>] cleanup_module+0x14/0x6bc [ath10k_sdio]
> [<0>] __arm64_sys_delete_module+0x1e0/0x22c
> [<0>] el0_svc_common+0xa4/0x154
> [<0>] el0_svc_compat_handler+0x2c/0x38
> [<0>] el0_svc_compat+0x8/0x18
> [<0>] 0xffffffffffffffff
> 
> SNOC:
> [  647.156863] Call trace:
> [  647.162166] [<ffffff80080855a4>] __switch_to+0x120/0x13c
> [  647.164512] [<ffffff800899d8b8>] __schedule+0x5ec/0x798
> [  647.170062] [<ffffff800899dad8>] schedule+0x74/0x94
> [  647.175050] [<ffffff80089a0848>] schedule_timeout+0x314/0x42c
> [  647.179874] [<ffffff80089a0a14>] schedule_timeout_uninterruptible+0x34/0x40
> [  647.185780] [<ffffff80082a494>] msleep+0x28/0x38
> [  647.192546] [<ffffff800117ec4c>] ath10k_snoc_hif_stop+0x4c/0x1e0 [ath10k_snoc]
> [  647.197439] [<ffffff80010dfbd8>] ath10k_core_stop+0x50/0x7c [ath10k_core]
> [  647.204652] [<ffffff80010c8f48>] ath10k_halt+0x114/0x16c [ath10k_core]
> [  647.211420] [<ffffff80010cad68>] ath10k_stop+0x4c/0x88 [ath10k_core]
> [  647.217865] [<ffffff8000fdbf54>] drv_stop+0x110/0x244 [mac80211]
> [  647.224367] [<ffffff80010147ac>] ieee80211_stop_device+0x48/0x54 [mac80211]
> [  647.230359] [<ffffff8000ff3eec>] ieee80211_do_stop+0x6a4/0x73c [mac80211]
> [  647.237033] [<ffffff8000ff4500>] ieee80211_stop+0x20/0x30 [mac80211]
> [  647.243942] [<ffffff80087e39b8>] __dev_close_many+0xa0/0xfc
> [  647.250435] [<ffffff80087e3888>] dev_close_many+0x70/0x100
> [  647.255651] [<ffffff80087e3a60>] dev_close+0x4c/0x80
> [  647.261244] [<ffffff8000f1ba54>] cfg80211_shutdown_all_interfaces+0x44/0xcc [cfg80211]
> [  647.266383] [<ffffff8000ff3fdc>] ieee80211_remove_interfaces+0x58/0x1b4 [mac80211]
> [  647.274128] [<ffffff8000fda540>] ieee80211_unregister_hw+0x50/0x120 [mac80211]
> [  647.281659] [<ffffff80010ca314>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core]
> [  647.288839] [<ffffff80010dfc94>] ath10k_core_unregister+0x48/0x90 [ath10k_core]
> [  647.296027] [<ffffff800117e598>] ath10k_snoc_remove+0x5c/0x150 [ath10k_snoc]
> [  647.303229] [<ffffff80085625fc>] platform_drv_remove+0x28/0x50
> [  647.310517] [<ffffff80085601a4>] device_release_driver_internal+0x114/0x1b8
> [  647.316257] [<ffffff80085602e4>] driver_detach+0x6c/0xa8
> [  647.323021] [<ffffff800855e5b8>] bus_remove_driver+0x78/0xa8
> [  647.328571] [<ffffff800856107c>] driver_unregister+0x30/0x50
> [  647.334213] [<ffffff8008562674>] platform_driver_unregister+0x1c/0x28
> [  647.339876] [<ffffff800117fefc>] cleanup_module+0x1c/0x120 [ath10k_snoc]
> [  647.346196] [<ffffff8008143ab8>] SyS_delete_module+0x1dc/0x22c
> 
> PCIe:
> [  615.392770] rmmod           D    0  3523   3458 0x00000080
> [  615.392777] Call Trace:
> [  615.392784]  __schedule+0x617/0x7d3
> [  615.392791]  ? __mod_timer+0x263/0x35c
> [  615.392797]  schedule+0x62/0x72
> [  615.392803]  schedule_timeout+0x8d/0xf3
> [  615.392809]  ? run_local_timers+0x6b/0x6b
> [  615.392814]  msleep+0x1b/0x22
> [  615.392824]  ath10k_pci_hif_stop+0x68/0xd6 [ath10k_pci]
> [  615.392844]  ath10k_core_stop+0x44/0x67 [ath10k_core]
> [  615.392859]  ath10k_halt+0x102/0x153 [ath10k_core]
> [  615.392873]  ath10k_stop+0x38/0x75 [ath10k_core]
> [  615.392893]  drv_stop+0x9a/0x13c [mac80211]
> [  615.392915]  ieee80211_do_stop+0x772/0x7cd [mac80211]
> [  615.392937]  ieee80211_stop+0x1a/0x1e [mac80211]
> [  615.392945]  __dev_close_many+0x9e/0xf0
> [  615.392952]  dev_close_many+0x62/0xe8
> [  615.392958]  dev_close+0x54/0x7d
> [  615.392975]  cfg80211_shutdown_all_interfaces+0x6e/0xa5 [cfg80211]
> [  615.393021]  ieee80211_remove_interfaces+0x52/0x1aa [mac80211]
> [  615.393049]  ieee80211_unregister_hw+0x54/0x136 [mac80211]
> [  615.393068]  ath10k_mac_unregister+0x19/0x4a [ath10k_core]
> [  615.393091]  ath10k_core_unregister+0x39/0x7e [ath10k_core]
> [  615.393104]  ath10k_pci_remove+0x3d/0x7f [ath10k_pci]
> [  615.393117]  pci_device_remove+0x41/0xa6
> [  615.393129]  device_release_driver_internal+0x123/0x1ec
> [  615.393140]  driver_detach+0x60/0x90
> [  615.393152]  bus_remove_driver+0x72/0x9f
> [  615.393164]  pci_unregister_driver+0x1e/0x87
> [  615.393177]  SyS_delete_module+0x1d7/0x277
> [  615.393188]  do_syscall_64+0x6b/0xf7
> [  615.393199]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
> 
> The test command run simulate_fw_crash firstly and it call into
> ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable
> is called and bit NAPI_STATE_SCHED is set. After that, function
> ath10k_sdio_hif_stop is called again from ath10k_stop by command
> "ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked.
> 
> It is blocked by napi_synchronize, napi_disable will set bit with
> NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop
> becuase bit NAPI_STATE_SCHED is set by napi_disable.
> 
> function of napi_synchronize
> static inline void napi_synchronize(const struct napi_struct *n)
> {
> 	if (IS_ENABLED(CONFIG_SMP))
> 		while (test_bit(NAPI_STATE_SCHED, &n->state))
> 			msleep(1);
> 	else
> 		barrier();
> }
> 
> function of napi_disable
> void napi_disable(struct napi_struct *n)
> {
> 	might_sleep();
> 	set_bit(NAPI_STATE_DISABLE, &n->state);
> 
> 	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
> 		msleep(1);
> 	while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
> 		msleep(1);
> 
> 	hrtimer_cancel(&n->timer);
> 
> 	clear_bit(NAPI_STATE_DISABLE, &n->state);
> }
> 
> Add flag for it avoid the hang and crash.
> 
> Tested-on: QCA6174 hw3.2 SDIO WLAN.RMH.4.4.1-00049
> Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00110-QCARMSWP-1
> Tested-on: WCN3990 hw1.0 SNOC hw1.0 WLAN.HL.3.1-01307.1-QCAHLSWMTPL-2
> 
> Signed-off-by: Wen Gong <wgong@codeaurora.org>
> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>

Patch applied to ath-next branch of ath.git, thanks.

e2f8b74e58cb ath10k: prevent deinitializing NAPI twice

-- 
https://patchwork.kernel.org/project/linux-wireless/patch/1598617348-2325-1-git-send-email-wgong@codeaurora.org/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches


_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2020-12-17  6:54 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-28 12:22 [PATCH v3] ath10k: add flag to protect napi operation to avoid dead loop hang Wen Gong
2020-08-28 12:22 ` Wen Gong
2020-08-28 14:09 ` Krishna Chaitanya
2020-08-28 14:09   ` Krishna Chaitanya
2020-12-09  9:11   ` Kalle Valo
2020-12-09  9:11     ` Kalle Valo
2020-09-07 16:23 ` Kalle Valo
2020-09-07 16:23 ` Kalle Valo
     [not found] ` <87d02x1rqb.fsf@codeaurora.org>
2020-09-08  3:45   ` Wen Gong
2020-12-09  9:24     ` Kalle Valo
2020-12-09  9:24       ` Kalle Valo
2020-12-09 15:00       ` Ben Greear
2020-12-09 15:00         ` Ben Greear
2020-12-10  2:29         ` Wen Gong
2020-12-10  2:29           ` Wen Gong
2020-12-15  8:05           ` Kalle Valo
2020-12-15  8:05             ` Kalle Valo
2020-12-15  7:56         ` Kalle Valo
2020-12-15  7:56           ` Kalle Valo
2020-12-10  2:27       ` Wen Gong
2020-12-10  2:27         ` Wen Gong
2020-09-08  3:45   ` Wen Gong
2020-12-17  6:52 ` Kalle Valo
2020-12-17  6:52 ` Kalle Valo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.