linux-wireless.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* iwlwifi: BUG: unable to handle kernel paging request at ffffc90000e1c808
@ 2019-04-11 21:07 Michal Hocko
  2019-04-12  9:49 ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2019-04-11 21:07 UTC (permalink / raw)
  To: Johannes Berg, Emmanuel Grumbach, Luca Coelho,
	Intel Linux Wireless, linux-wireless
  Cc: LKML

Hi,
I have just encountered the following splat with 5.1-rc4. Is this a
known problem? Greping throug older logs show that "Error sending SCAN_CFG_CMD"
is not new. In older kernels it was followed by "Start IWL Error Log Dump:"
but it oopsed now. I was running 5.0 previously. Let me know if you need
more information.

[ 3076.724445] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms.
[ 3076.724449] iwlwifi 0000:01:00.0: Current CMD queue read_ptr 38 write_ptr 39
[ 3076.724468] BUG: unable to handle kernel paging request at ffffc90000e1c808
[ 3076.724469] #PF error: [normal kernel read fault]
[ 3076.724470] PGD 245924067 P4D 245924067 PUD 245925067 PMD 24441e067 PTE 0
[ 3076.724473] Oops: 0000 [#1] PREEMPT SMP PTI
[ 3076.724475] CPU: 2 PID: 10490 Comm: kworker/2:0 Not tainted 5.1.0-rc4-00003-gfd008d1a7a20 #50
[ 3076.724476] Hardware name: Dell Inc. Latitude E7470/0T6HHJ, BIOS 1.5.3 04/18/2016
[ 3076.724500] Workqueue: events iwl_mvm_tcm_work [iwlmvm]
[ 3076.724506] RIP: 0010:iwl_trans_pcie_read32+0xe/0x11 [iwlwifi]
[ 3076.724508] Code: f6 48 03 b7 28 a2 00 00 88 16 c3 0f 1f 44 00 00 89 f6 48 03 b7 28 a2 00 00 89 16 c3 0f 1f 44 00 00 89 f6 48 03 b7 28 a2 00 00 <8b> 06 c3 0f 1f 44 00 00 48 8b 47 10 83 78 48 13 19 c0 25 00 00 10
[ 3076.724509] RSP: 0000:ffffc9000622fcb0 EFLAGS: 00010282
[ 3076.724510] RAX: ffffffffa024ce32 RBX: ffff888243000018 RCX: 0000000000000003
[ 3076.724512] RDX: 0000000008000005 RSI: ffffc90000e1c808 RDI: ffff888243000018
[ 3076.724513] RBP: ffff888243009154 R08: 0000000000000001 R09: ffffffff824e062f
[ 3076.724514] R10: 0000000000000000 R11: ffffc9000622f9e7 R12: 00000001000a97a6
[ 3076.724514] R13: 0000000000000006 R14: ffff888242af4000 R15: ffff88824300a268
[ 3076.724516] FS:  0000000000000000(0000) GS:ffff888245f00000(0000) knlGS:0000000000000000
[ 3076.724517] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3076.724518] CR2: ffffc90000e1c808 CR3: 000000010131c003 CR4: 00000000003606e0
[ 3076.724519] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3076.724520] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3076.724521] Call Trace:
[ 3076.724527]  iwl_trans_sync_nmi+0xac/0x18a [iwlwifi]
[ 3076.724533]  iwl_trans_pcie_send_hcmd+0x2e1/0x3b2 [iwlwifi]
[ 3076.724536]  ? wait_woken+0x6d/0x6d
[ 3076.724542]  iwl_trans_send_cmd+0x99/0xb4 [iwlwifi]
[ 3076.724547]  iwl_mvm_send_cmd+0x2a/0x5c [iwlmvm]
[ 3076.724554]  iwl_mvm_config_scan+0x38b/0x3c2 [iwlmvm]
[ 3076.724560]  ? iwl_mvm_update_low_latency+0xbb/0xbb [iwlmvm]
[ 3076.724565]  iwl_mvm_recalc_tcm+0x473/0x48c [iwlmvm]
[ 3076.724569]  process_one_work+0x1e2/0x32b
[ 3076.724571]  ? pwq_unbound_release_workfn+0xc9/0xc9
[ 3076.724573]  worker_thread+0x1d8/0x2a3
[ 3076.724574]  kthread+0x114/0x11c
[ 3076.724576]  ? kthread_park+0x76/0x76
[ 3076.724579]  ret_from_fork+0x3a/0x50
[ 3076.724581] Modules linked in: tun ctr ccm binfmt_misc snd_hda_codec_hdmi arc4 snd_hda_codec_realtek snd_hda_codec_generic i915 i2c_algo_bit iosf_mbi snd_hda_intel iwlmvm drm_kms_helper uvcvideo snd_hda_codec videobuf2_vmalloc cfbfillrect mac80211 videobuf2_memops videobuf2_v4l2 snd_hda_core syscopyarea videobuf2_common cfbimgblt sysfillrect snd_pcm_oss videodev snd_mixer_oss sysimgblt coretemp iwlwifi fb_sys_fops hwmon cfbcopyarea x86_pkg_temp_thermal fb snd_pcm fbdev kvm_intel media drm kvm drm_panel_orientation_quirks snd_timer irqbypass i2c_i801 cfg80211 snd i2c_core video backlight
[ 3076.724596] CR2: ffffc90000e1c808
[ 3076.724598] ---[ end trace da5234017dd6ffd3 ]---
[ 3076.724603] RIP: 0010:iwl_trans_pcie_read32+0xe/0x11 [iwlwifi]
[ 3076.724605] Code: f6 48 03 b7 28 a2 00 00 88 16 c3 0f 1f 44 00 00 89 f6 48 03 b7 28 a2 00 00 89 16 c3 0f 1f 44 00 00 89 f6 48 03 b7 28 a2 00 00 <8b> 06 c3 0f 1f 44 00 00 48 8b 47 10 83 78 48 13 19 c0 25 00 00 10
[ 3076.724606] RSP: 0000:ffffc9000622fcb0 EFLAGS: 00010282
[ 3076.724607] RAX: ffffffffa024ce32 RBX: ffff888243000018 RCX: 0000000000000003
[ 3076.724608] RDX: 0000000008000005 RSI: ffffc90000e1c808 RDI: ffff888243000018
[ 3076.724609] RBP: ffff888243009154 R08: 0000000000000001 R09: ffffffff824e062f
[ 3076.724610] R10: 0000000000000000 R11: ffffc9000622f9e7 R12: 00000001000a97a6
[ 3076.724611] R13: 0000000000000006 R14: ffff888242af4000 R15: ffff88824300a268
[ 3076.724612] FS:  0000000000000000(0000) GS:ffff888245f00000(0000) knlGS:0000000000000000
[ 3076.724613] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3076.724614] CR2: ffffc90000e1c808 CR3: 000000010131c003 CR4: 00000000003606e0
[ 3076.724615] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3076.724616] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3077.332767] iwlwifi 0000:01:00.0: Queue 0 is inactive on fifo 2 and stuck for 2500 ms. SW [38, 39] HW [162, 162] FH TRB=0x0a5a5a5a2
[ 3084.756712] iwlwifi 0000:01:00.0: Queue 10 is inactive on fifo 2 and stuck for 10000 ms. SW [81, 89] HW [162, 162] FH TRB=0x0a5a5a5a2
[ 3097.044641] iwlwifi 0000:01:00.0: Queue 11 is inactive on fifo 2 and stuck for 10000 ms. SW [203, 204] HW [162, 162] FH TRB=0x0a5a5a5a2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: iwlwifi: BUG: unable to handle kernel paging request at ffffc90000e1c808
  2019-04-11 21:07 iwlwifi: BUG: unable to handle kernel paging request at ffffc90000e1c808 Michal Hocko
@ 2019-04-12  9:49 ` Michal Hocko
  2019-04-16 18:36   ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2019-04-12  9:49 UTC (permalink / raw)
  To: Johannes Berg, Emmanuel Grumbach, Luca Coelho,
	Intel Linux Wireless, linux-wireless
  Cc: LKML

On Thu 11-04-19 23:07:20, Michal Hocko wrote:
> Hi,
> I have just encountered the following splat with 5.1-rc4. Is this a
> known problem? Greping throug older logs show that "Error sending SCAN_CFG_CMD"
> is not new. In older kernels it was followed by "Start IWL Error Log Dump:"
> but it oopsed now. I was running 5.0 previously. Let me know if you need
> more information.

FTR It happened again with 5.1.0-rc2-00001-ga3ac7917b730

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: iwlwifi: BUG: unable to handle kernel paging request at ffffc90000e1c808
  2019-04-12  9:49 ` Michal Hocko
@ 2019-04-16 18:36   ` Michal Hocko
  2019-04-17  7:33     ` Luca Coelho
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2019-04-16 18:36 UTC (permalink / raw)
  To: Johannes Berg, Emmanuel Grumbach, Luca Coelho,
	Intel Linux Wireless, linux-wireless
  Cc: LKML

On Fri 12-04-19 11:49:08, Michal Hocko wrote:
> On Thu 11-04-19 23:07:20, Michal Hocko wrote:
> > Hi,
> > I have just encountered the following splat with 5.1-rc4. Is this a
> > known problem? Greping throug older logs show that "Error sending SCAN_CFG_CMD"
> > is not new. In older kernels it was followed by "Start IWL Error Log Dump:"
> > but it oopsed now. I was running 5.0 previously. Let me know if you need
> > more information.
> 
> FTR It happened again with 5.1.0-rc2-00001-ga3ac7917b730

Is there anything more I can provide to help with this issue. I had to
go back to 5.0 because the system is not really usable after the oops.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: iwlwifi: BUG: unable to handle kernel paging request at ffffc90000e1c808
  2019-04-16 18:36   ` Michal Hocko
@ 2019-04-17  7:33     ` Luca Coelho
  2019-04-17  7:35       ` [PATCH] iwlwifi: don't panic in error path on non-msix systems Luca Coelho
  0 siblings, 1 reply; 13+ messages in thread
From: Luca Coelho @ 2019-04-17  7:33 UTC (permalink / raw)
  To: Michal Hocko, Johannes Berg, Emmanuel Grumbach,
	Intel Linux Wireless, linux-wireless
  Cc: LKML

On Tue, 2019-04-16 at 20:36 +0200, Michal Hocko wrote:
> On Fri 12-04-19 11:49:08, Michal Hocko wrote:
> > On Thu 11-04-19 23:07:20, Michal Hocko wrote:
> > > Hi,
> > > I have just encountered the following splat with 5.1-rc4. Is this
> > > a
> > > known problem? Greping throug older logs show that "Error sending
> > > SCAN_CFG_CMD"
> > > is not new. In older kernels it was followed by "Start IWL Error
> > > Log Dump:"
> > > but it oopsed now. I was running 5.0 previously. Let me know if
> > > you need
> > > more information.
> > 
> > FTR It happened again with 5.1.0-rc2-00001-ga3ac7917b730
> 
> Is there anything more I can provide to help with this issue. I had
> to
> go back to 5.0 because the system is not really usable after the
> oops.

Michal, thanks for reporting and all the data.  We have a fix for it
and I'm going to send it out in a second.  Please try it out and let us
know whether it works for you.

--
Cheers,
Luca.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] iwlwifi: don't panic in error path on non-msix systems
  2019-04-17  7:33     ` Luca Coelho
@ 2019-04-17  7:35       ` Luca Coelho
  2019-04-17 10:11         ` Michal Hocko
  2019-04-22 18:07         ` Michal Hocko
  0 siblings, 2 replies; 13+ messages in thread
From: Luca Coelho @ 2019-04-17  7:35 UTC (permalink / raw)
  To: kvalo, johannes, emmanuel.grumbach, linuxwifi
  Cc: linux-kernel, linux-wireless, Shahar S Matityahu, Michal Hocko,
	Luca Coelho

From: Shahar S Matityahu <shahar.s.matityahu@intel.com>

The driver uses msix causes-register to handle both msix and non msix
interrupts when performing sync nmi.  On devices that do not support
msix this register is unmapped and accessing it causes a kernel panic.

Solve this by differentiating the two cases and accessing the proper
causes-register in each case.

Reported-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Shahar S Matityahu <shahar.s.matityahu@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
---
 .../net/wireless/intel/iwlwifi/pcie/trans.c   | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
index 79c1dc05f948..c4375b868901 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
@@ -3644,20 +3644,27 @@ struct iwl_trans *iwl_trans_pcie_alloc(struct pci_dev *pdev,
 
 void iwl_trans_pcie_sync_nmi(struct iwl_trans *trans)
 {
+	struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
 	unsigned long timeout = jiffies + IWL_TRANS_NMI_TIMEOUT;
+	u32 inta_addr, sw_err_bit;
+
+	if (trans_pcie->msix_enabled) {
+		inta_addr = CSR_MSIX_HW_INT_CAUSES_AD;
+		sw_err_bit = MSIX_HW_INT_CAUSES_REG_SW_ERR;
+	} else {
+		inta_addr = CSR_INT;
+		sw_err_bit = CSR_INT_BIT_SW_ERR;
+	}
 
 	iwl_disable_interrupts(trans);
 	iwl_force_nmi(trans);
 	while (time_after(timeout, jiffies)) {
-		u32 inta_hw = iwl_read32(trans,
-					 CSR_MSIX_HW_INT_CAUSES_AD);
+		u32 inta_hw = iwl_read32(trans, inta_addr);
 
 		/* Error detected by uCode */
-		if (inta_hw & MSIX_HW_INT_CAUSES_REG_SW_ERR) {
+		if (inta_hw & sw_err_bit) {
 			/* Clear causes register */
-			iwl_write32(trans, CSR_MSIX_HW_INT_CAUSES_AD,
-				    inta_hw &
-				    MSIX_HW_INT_CAUSES_REG_SW_ERR);
+			iwl_write32(trans, inta_addr, inta_hw & sw_err_bit);
 			break;
 		}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] iwlwifi: don't panic in error path on non-msix systems
  2019-04-17  7:35       ` [PATCH] iwlwifi: don't panic in error path on non-msix systems Luca Coelho
@ 2019-04-17 10:11         ` Michal Hocko
  2019-04-17 10:17           ` Luca Coelho
  2019-04-22 18:07         ` Michal Hocko
  1 sibling, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2019-04-17 10:11 UTC (permalink / raw)
  To: Luca Coelho
  Cc: kvalo, johannes, emmanuel.grumbach, linuxwifi, linux-kernel,
	linux-wireless, Shahar S Matityahu, Luca Coelho

Hi,
which tree is this supposed to be applied on? It doesn't do apply on top
of v5.1-rc4-3-gfd008d1a7a20. iwl_trans_pcie_sync_nmi doesn't exist in
drivers/net/wireless/intel/iwlwifi/pcie/trans.c. iwl_trans_sync_nmi
looks pretty similar. My patch massaging ended up with this. Please
double check. It compiles and even boots.

diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
index fe8269d023de..1f3969e2bcac 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
@@ -3639,20 +3639,27 @@ struct iwl_trans *iwl_trans_pcie_alloc(struct pci_dev *pdev,
 
 void iwl_trans_sync_nmi(struct iwl_trans *trans)
 {
+	struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
 	unsigned long timeout = jiffies + IWL_TRANS_NMI_TIMEOUT;
+	u32 inta_addr, sw_err_bit;
+
+	if (trans_pcie->msix_enabled) {
+		inta_addr = CSR_MSIX_HW_INT_CAUSES_AD;
+		sw_err_bit = MSIX_HW_INT_CAUSES_REG_SW_ERR;
+	} else {
+		inta_addr = CSR_INT;
+		sw_err_bit = CSR_INT_BIT_SW_ERR;
+	}
 
 	iwl_disable_interrupts(trans);
 	iwl_force_nmi(trans);
 	while (time_after(timeout, jiffies)) {
-		u32 inta_hw = iwl_read32(trans,
-					 CSR_MSIX_HW_INT_CAUSES_AD);
+		u32 inta_hw = iwl_read32(trans, inta_addr);
 
 		/* Error detected by uCode */
-		if (inta_hw & MSIX_HW_INT_CAUSES_REG_SW_ERR) {
+		if (inta_hw & sw_err_bit) {
 			/* Clear causes register */
-			iwl_write32(trans, CSR_MSIX_HW_INT_CAUSES_AD,
-				    inta_hw &
-				    MSIX_HW_INT_CAUSES_REG_SW_ERR);
+			iwl_write32(trans, inta_addr, inta_hw & sw_err_bit);
 			break;
 		}
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] iwlwifi: don't panic in error path on non-msix systems
  2019-04-17 10:11         ` Michal Hocko
@ 2019-04-17 10:17           ` Luca Coelho
  2019-04-17 10:53             ` Luca Coelho
  2019-04-17 10:59             ` Michal Hocko
  0 siblings, 2 replies; 13+ messages in thread
From: Luca Coelho @ 2019-04-17 10:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kvalo, johannes, emmanuel.grumbach, linuxwifi, linux-kernel,
	linux-wireless, Shahar S Matityahu

On Wed, 2019-04-17 at 12:11 +0200, Michal Hocko wrote:
> Hi,
> which tree is this supposed to be applied on? It doesn't do apply on
> top
> of v5.1-rc4-3-gfd008d1a7a20. iwl_trans_pcie_sync_nmi doesn't exist in
> drivers/net/wireless/intel/iwlwifi/pcie/trans.c. iwl_trans_sync_nmi
> looks pretty similar. My patch massaging ended up with this. Please
> double check. It compiles and even boots.

Ah, sorry, I didn't mention that that patch was supposed to apply on
top of wireless-drivers-next, which is on its way to v5.2.  We have
other patches touching this function there, thus the conflict.

> diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> index fe8269d023de..1f3969e2bcac 100644
> --- a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> +++ b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> @@ -3639,20 +3639,27 @@ struct iwl_trans *iwl_trans_pcie_alloc(struct
> pci_dev *pdev,
>  
>  void iwl_trans_sync_nmi(struct iwl_trans *trans)
>  {
> +	struct iwl_trans_pcie *trans_pcie =
> IWL_TRANS_GET_PCIE_TRANS(trans);
>  	unsigned long timeout = jiffies + IWL_TRANS_NMI_TIMEOUT;
> +	u32 inta_addr, sw_err_bit;
> +
> +	if (trans_pcie->msix_enabled) {
> +		inta_addr = CSR_MSIX_HW_INT_CAUSES_AD;
> +		sw_err_bit = MSIX_HW_INT_CAUSES_REG_SW_ERR;
> +	} else {
> +		inta_addr = CSR_INT;
> +		sw_err_bit = CSR_INT_BIT_SW_ERR;
> +	}
>  
>  	iwl_disable_interrupts(trans);
>  	iwl_force_nmi(trans);
>  	while (time_after(timeout, jiffies)) {
> -		u32 inta_hw = iwl_read32(trans,
> -					 CSR_MSIX_HW_INT_CAUSES_AD);
> +		u32 inta_hw = iwl_read32(trans, inta_addr);
>  
>  		/* Error detected by uCode */
> -		if (inta_hw & MSIX_HW_INT_CAUSES_REG_SW_ERR) {
> +		if (inta_hw & sw_err_bit) {
>  			/* Clear causes register */
> -			iwl_write32(trans, CSR_MSIX_HW_INT_CAUSES_AD,
> -				    inta_hw &
> -				    MSIX_HW_INT_CAUSES_REG_SW_ERR);
> +			iwl_write32(trans, inta_addr, inta_hw &
> sw_err_bit);
>  			break;
>  		}

This looks good! So it fixed the problem you were having now?


--
Cheers,
Luca.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] iwlwifi: don't panic in error path on non-msix systems
  2019-04-17 10:17           ` Luca Coelho
@ 2019-04-17 10:53             ` Luca Coelho
  2019-04-17 10:59             ` Michal Hocko
  1 sibling, 0 replies; 13+ messages in thread
From: Luca Coelho @ 2019-04-17 10:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kvalo, johannes, emmanuel.grumbach, linuxwifi, linux-kernel,
	linux-wireless, Shahar S Matityahu

On Wed, 2019-04-17 at 13:17 +0300, Luca Coelho wrote:
> On Wed, 2019-04-17 at 12:11 +0200, Michal Hocko wrote:
> > Hi,
> > which tree is this supposed to be applied on? It doesn't do apply
> > on
> > top
> > of v5.1-rc4-3-gfd008d1a7a20. iwl_trans_pcie_sync_nmi doesn't exist
> > in
> > drivers/net/wireless/intel/iwlwifi/pcie/trans.c. iwl_trans_sync_nmi
> > looks pretty similar. My patch massaging ended up with this. Please
> > double check. It compiles and even boots.
> 
> Ah, sorry, I didn't mention that that patch was supposed to apply on
> top of wireless-drivers-next, which is on its way to v5.2.  We have
> other patches touching this function there, thus the conflict.

Huh, I was confused.  My patch actually applies on top of wireless-
drivers, which has other patches that touch the same function on their
way to v5.1 (but not there yet).

--
Luca.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] iwlwifi: don't panic in error path on non-msix systems
  2019-04-17 10:17           ` Luca Coelho
  2019-04-17 10:53             ` Luca Coelho
@ 2019-04-17 10:59             ` Michal Hocko
  1 sibling, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2019-04-17 10:59 UTC (permalink / raw)
  To: Luca Coelho
  Cc: kvalo, johannes, emmanuel.grumbach, linuxwifi, linux-kernel,
	linux-wireless, Shahar S Matityahu

On Wed 17-04-19 13:17:10, Luca Coelho wrote:
> On Wed, 2019-04-17 at 12:11 +0200, Michal Hocko wrote:
> > Hi,
> > which tree is this supposed to be applied on? It doesn't do apply on
> > top
> > of v5.1-rc4-3-gfd008d1a7a20. iwl_trans_pcie_sync_nmi doesn't exist in
> > drivers/net/wireless/intel/iwlwifi/pcie/trans.c. iwl_trans_sync_nmi
> > looks pretty similar. My patch massaging ended up with this. Please
> > double check. It compiles and even boots.
> 
> Ah, sorry, I didn't mention that that patch was supposed to apply on
> top of wireless-drivers-next, which is on its way to v5.2.  We have
> other patches touching this function there, thus the conflict.

This is hard to tell. I haven't seen any "Error sending SCAN_CFG_CMD:"
message in the kernel log yet and I believe this is related to the
crash. But I might be easily wrong here.

In any case, I suspect that those events usually happen in a different
wireless environment than I am running right now. So give me some more
time before I can be more confident with an answer.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] iwlwifi: don't panic in error path on non-msix systems
  2019-04-17  7:35       ` [PATCH] iwlwifi: don't panic in error path on non-msix systems Luca Coelho
  2019-04-17 10:11         ` Michal Hocko
@ 2019-04-22 18:07         ` Michal Hocko
  2019-04-22 23:56           ` Kirtika Ruchandani
  1 sibling, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2019-04-22 18:07 UTC (permalink / raw)
  To: Luca Coelho
  Cc: kvalo, johannes, emmanuel.grumbach, linuxwifi, linux-kernel,
	linux-wireless, Shahar S Matityahu, Luca Coelho

On Wed 17-04-19 10:35:16, Luca Coelho wrote:
> From: Shahar S Matityahu <shahar.s.matityahu@intel.com>
> 
> The driver uses msix causes-register to handle both msix and non msix
> interrupts when performing sync nmi.  On devices that do not support
> msix this register is unmapped and accessing it causes a kernel panic.
> 
> Solve this by differentiating the two cases and accessing the proper
> causes-register in each case.
> 
> Reported-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Shahar S Matityahu <shahar.s.matityahu@intel.com>
> Signed-off-by: Luca Coelho <luciano.coelho@intel.com>

$ dmesg | grep "Error sending SCAN_CFG_CMD:"
[49786.288548] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms.
[53457.166877] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms.

without the oops and with the iwlwifi internal dump IIUC which is the
previous behavior.
[53457.166877] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms.
[53457.166882] iwlwifi 0000:01:00.0: Current CMD queue read_ptr 224 write_ptr 225
[53457.414973] iwlwifi 0000:01:00.0: HW error, resetting before reading
[53457.421339] iwlwifi 0000:01:00.0: Start IWL Error Log Dump:
[53457.421345] iwlwifi 0000:01:00.0: Status: 0x00000100, count: 1269232956
[53457.421347] iwlwifi 0000:01:00.0: Loaded firmware version: 36.9f0a2d68.0
[53457.421350] iwlwifi 0000:01:00.0: 0x45E91306 | ADVANCED_SYSASSERT          
[53457.421352] iwlwifi 0000:01:00.0: 0x2F58D384 | trm_hw_status0
[53457.421353] iwlwifi 0000:01:00.0: 0x7F1A8CFD | trm_hw_status1
[53457.421355] iwlwifi 0000:01:00.0: 0x07E787FD | branchlink2
[53457.421357] iwlwifi 0000:01:00.0: 0xE9E54368 | interruptlink1
[53457.421359] iwlwifi 0000:01:00.0: 0x470D9BBF | interruptlink2
[53457.421361] iwlwifi 0000:01:00.0: 0xAF040E7E | data1
[53457.421362] iwlwifi 0000:01:00.0: 0xE7FBCA48 | data2
[53457.421364] iwlwifi 0000:01:00.0: 0x4E4A8288 | data3
[53457.421366] iwlwifi 0000:01:00.0: 0x861DEA98 | beacon time
[53457.421368] iwlwifi 0000:01:00.0: 0xE8F23466 | tsf low
[53457.421369] iwlwifi 0000:01:00.0: 0xD7B19307 | tsf hi
[53457.421371] iwlwifi 0000:01:00.0: 0xE58934E3 | time gp1
[53457.421373] iwlwifi 0000:01:00.0: 0xB013FEBE | time gp2
[53457.421375] iwlwifi 0000:01:00.0: 0x962DCC75 | uCode revision type
[53457.421376] iwlwifi 0000:01:00.0: 0xFF8FB30F | uCode version major
[53457.421378] iwlwifi 0000:01:00.0: 0x0DD08E17 | uCode version minor
[53457.421380] iwlwifi 0000:01:00.0: 0x87FD70DE | hw version
[53457.421382] iwlwifi 0000:01:00.0: 0x853F6851 | board version
[53457.421384] iwlwifi 0000:01:00.0: 0x08D7F330 | hcmd
[53457.421385] iwlwifi 0000:01:00.0: 0x6B7E5FEE | isr0
[53457.421387] iwlwifi 0000:01:00.0: 0x2B1E7CD4 | isr1
[53457.421389] iwlwifi 0000:01:00.0: 0x3F133B16 | isr2
[53457.421391] iwlwifi 0000:01:00.0: 0x5D480C5A | isr3
[53457.421392] iwlwifi 0000:01:00.0: 0x34E93EBA | isr4
[53457.421394] iwlwifi 0000:01:00.0: 0x42AD8E83 | last cmd Id
[53457.421396] iwlwifi 0000:01:00.0: 0x1F5BBCFF | wait_event
[53457.421398] iwlwifi 0000:01:00.0: 0x6808B2C1 | l2p_control
[53457.421399] iwlwifi 0000:01:00.0: 0x0D5B1F33 | l2p_duration
[53457.421401] iwlwifi 0000:01:00.0: 0xF4C94535 | l2p_mhvalid
[53457.421403] iwlwifi 0000:01:00.0: 0x3DCE6EBB | l2p_addr_match
[53457.421405] iwlwifi 0000:01:00.0: 0xFDDC41FE | lmpm_pmg_sel
[53457.421406] iwlwifi 0000:01:00.0: 0xB53A17F5 | timestamp
[53457.421408] iwlwifi 0000:01:00.0: 0x5A6A4113 | flow_handler
[53457.421474] iwlwifi 0000:01:00.0: Start IWL Error Log Dump:
[53457.421477] iwlwifi 0000:01:00.0: Status: 0x00000100, count: 1182976748
[53457.421478] iwlwifi 0000:01:00.0: 0x62D2BDB3 | ADVANCED_SYSASSERT
[53457.421480] iwlwifi 0000:01:00.0: 0x4D9E5019 | umac branchlink1
[53457.421482] iwlwifi 0000:01:00.0: 0x8CB69F6E | umac branchlink2
[53457.421484] iwlwifi 0000:01:00.0: 0x9868662D | umac interruptlink1
[53457.421486] iwlwifi 0000:01:00.0: 0x9800F8F7 | umac interruptlink2
[53457.421488] iwlwifi 0000:01:00.0: 0xC71449B8 | umac data1
[53457.421489] iwlwifi 0000:01:00.0: 0xAB0AB17F | umac data2
[53457.421491] iwlwifi 0000:01:00.0: 0x6C6F9753 | umac data3
[53457.421493] iwlwifi 0000:01:00.0: 0xFC49D724 | umac major
[53457.421495] iwlwifi 0000:01:00.0: 0xA61CC627 | umac minor
[53457.421496] iwlwifi 0000:01:00.0: 0x45BAA0B8 | frame pointer
[53457.421498] iwlwifi 0000:01:00.0: 0x319D112B | stack pointer
[53457.421500] iwlwifi 0000:01:00.0: 0xEFD9E2E9 | last host cmd
[53457.421502] iwlwifi 0000:01:00.0: 0x82640FF7 | isr status reg
[53457.421506] ieee80211 phy0: Hardware restart was requested
[53457.941685] iwlwifi 0000:01:00.0: Queue 0 is inactive on fifo 0 and stuck for 2500 ms. SW [224, 225] HW [0, 0] FH TRB=0x02b759ca1

Feel free to add
Tested-by: Michal Hocko <mhocko@suse.com>

Thanks for your quick patch and sorry it took so long from my side.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] iwlwifi: don't panic in error path on non-msix systems
  2019-04-22 18:07         ` Michal Hocko
@ 2019-04-22 23:56           ` Kirtika Ruchandani
  2019-04-23  2:34             ` Kirtika Ruchandani
  0 siblings, 1 reply; 13+ messages in thread
From: Kirtika Ruchandani @ 2019-04-22 23:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Luca Coelho, kvalo, Johannes Berg, Grumbach, Emmanuel, linuxwifi,
	linux-kernel, linux-wireless, Shahar S Matityahu, Luca Coelho

On Mon, Apr 22, 2019 at 11:07 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 17-04-19 10:35:16, Luca Coelho wrote:
> > From: Shahar S Matityahu <shahar.s.matityahu@intel.com>
> >
> > The driver uses msix causes-register to handle both msix and non msix
> > interrupts when performing sync nmi.  On devices that do not support
> > msix this register is unmapped and accessing it causes a kernel panic.
> >
> > Solve this by differentiating the two cases and accessing the proper
> > causes-register in each case.

Are you sure reading CSR_INT from trans.c without explicitly getting irq_lock.c
like rx.c does, is thread-safe? I don't claim to understand this fully, but this
smells wrong from past experience with this driver. I'll see if I can cook up
a test case with a race condition here.

> >
> > Reported-by: Michal Hocko <mhocko@kernel.org>
> > Signed-off-by: Shahar S Matityahu <shahar.s.matityahu@intel.com>
> > Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
>
> $ dmesg | grep "Error sending SCAN_CFG_CMD:"
> [49786.288548] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms.
> [53457.166877] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms.
>
> without the oops and with the iwlwifi internal dump IIUC which is the
> previous behavior.
> [53457.166877] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms.
> [53457.166882] iwlwifi 0000:01:00.0: Current CMD queue read_ptr 224 write_ptr 225
> [53457.414973] iwlwifi 0000:01:00.0: HW error, resetting before reading
> [53457.421339] iwlwifi 0000:01:00.0: Start IWL Error Log Dump:
> [53457.421345] iwlwifi 0000:01:00.0: Status: 0x00000100, count: 1269232956
> [53457.421347] iwlwifi 0000:01:00.0: Loaded firmware version: 36.9f0a2d68.0
> [53457.421350] iwlwifi 0000:01:00.0: 0x45E91306 | ADVANCED_SYSASSERT
> [53457.421352] iwlwifi 0000:01:00.0: 0x2F58D384 | trm_hw_status0
> [53457.421353] iwlwifi 0000:01:00.0: 0x7F1A8CFD | trm_hw_status1
> [53457.421355] iwlwifi 0000:01:00.0: 0x07E787FD | branchlink2
> [53457.421357] iwlwifi 0000:01:00.0: 0xE9E54368 | interruptlink1
> [53457.421359] iwlwifi 0000:01:00.0: 0x470D9BBF | interruptlink2
> [53457.421361] iwlwifi 0000:01:00.0: 0xAF040E7E | data1
> [53457.421362] iwlwifi 0000:01:00.0: 0xE7FBCA48 | data2
> [53457.421364] iwlwifi 0000:01:00.0: 0x4E4A8288 | data3
> [53457.421366] iwlwifi 0000:01:00.0: 0x861DEA98 | beacon time
> [53457.421368] iwlwifi 0000:01:00.0: 0xE8F23466 | tsf low
> [53457.421369] iwlwifi 0000:01:00.0: 0xD7B19307 | tsf hi
> [53457.421371] iwlwifi 0000:01:00.0: 0xE58934E3 | time gp1
> [53457.421373] iwlwifi 0000:01:00.0: 0xB013FEBE | time gp2
> [53457.421375] iwlwifi 0000:01:00.0: 0x962DCC75 | uCode revision type
> [53457.421376] iwlwifi 0000:01:00.0: 0xFF8FB30F | uCode version major
> [53457.421378] iwlwifi 0000:01:00.0: 0x0DD08E17 | uCode version minor
> [53457.421380] iwlwifi 0000:01:00.0: 0x87FD70DE | hw version
> [53457.421382] iwlwifi 0000:01:00.0: 0x853F6851 | board version
> [53457.421384] iwlwifi 0000:01:00.0: 0x08D7F330 | hcmd
> [53457.421385] iwlwifi 0000:01:00.0: 0x6B7E5FEE | isr0
> [53457.421387] iwlwifi 0000:01:00.0: 0x2B1E7CD4 | isr1
> [53457.421389] iwlwifi 0000:01:00.0: 0x3F133B16 | isr2
> [53457.421391] iwlwifi 0000:01:00.0: 0x5D480C5A | isr3
> [53457.421392] iwlwifi 0000:01:00.0: 0x34E93EBA | isr4
> [53457.421394] iwlwifi 0000:01:00.0: 0x42AD8E83 | last cmd Id
> [53457.421396] iwlwifi 0000:01:00.0: 0x1F5BBCFF | wait_event
> [53457.421398] iwlwifi 0000:01:00.0: 0x6808B2C1 | l2p_control
> [53457.421399] iwlwifi 0000:01:00.0: 0x0D5B1F33 | l2p_duration
> [53457.421401] iwlwifi 0000:01:00.0: 0xF4C94535 | l2p_mhvalid
> [53457.421403] iwlwifi 0000:01:00.0: 0x3DCE6EBB | l2p_addr_match
> [53457.421405] iwlwifi 0000:01:00.0: 0xFDDC41FE | lmpm_pmg_sel
> [53457.421406] iwlwifi 0000:01:00.0: 0xB53A17F5 | timestamp
> [53457.421408] iwlwifi 0000:01:00.0: 0x5A6A4113 | flow_handler
> [53457.421474] iwlwifi 0000:01:00.0: Start IWL Error Log Dump:
> [53457.421477] iwlwifi 0000:01:00.0: Status: 0x00000100, count: 1182976748
> [53457.421478] iwlwifi 0000:01:00.0: 0x62D2BDB3 | ADVANCED_SYSASSERT
> [53457.421480] iwlwifi 0000:01:00.0: 0x4D9E5019 | umac branchlink1
> [53457.421482] iwlwifi 0000:01:00.0: 0x8CB69F6E | umac branchlink2
> [53457.421484] iwlwifi 0000:01:00.0: 0x9868662D | umac interruptlink1
> [53457.421486] iwlwifi 0000:01:00.0: 0x9800F8F7 | umac interruptlink2
> [53457.421488] iwlwifi 0000:01:00.0: 0xC71449B8 | umac data1
> [53457.421489] iwlwifi 0000:01:00.0: 0xAB0AB17F | umac data2
> [53457.421491] iwlwifi 0000:01:00.0: 0x6C6F9753 | umac data3
> [53457.421493] iwlwifi 0000:01:00.0: 0xFC49D724 | umac major
> [53457.421495] iwlwifi 0000:01:00.0: 0xA61CC627 | umac minor
> [53457.421496] iwlwifi 0000:01:00.0: 0x45BAA0B8 | frame pointer
> [53457.421498] iwlwifi 0000:01:00.0: 0x319D112B | stack pointer
> [53457.421500] iwlwifi 0000:01:00.0: 0xEFD9E2E9 | last host cmd
> [53457.421502] iwlwifi 0000:01:00.0: 0x82640FF7 | isr status reg
> [53457.421506] ieee80211 phy0: Hardware restart was requested
> [53457.941685] iwlwifi 0000:01:00.0: Queue 0 is inactive on fifo 0 and stuck for 2500 ms. SW [224, 225] HW [0, 0] FH TRB=0x02b759ca1
>
> Feel free to add
> Tested-by: Michal Hocko <mhocko@suse.com>
>
> Thanks for your quick patch and sorry it took so long from my side.
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] iwlwifi: don't panic in error path on non-msix systems
  2019-04-22 23:56           ` Kirtika Ruchandani
@ 2019-04-23  2:34             ` Kirtika Ruchandani
  2019-04-23 13:27               ` Luca Coelho
  0 siblings, 1 reply; 13+ messages in thread
From: Kirtika Ruchandani @ 2019-04-23  2:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Luca Coelho, kvalo, Johannes Berg, Grumbach, Emmanuel, linuxwifi,
	linux-kernel, linux-wireless, Shahar S Matityahu, Luca Coelho

On Mon, Apr 22, 2019 at 4:56 PM Kirtika Ruchandani <kirtika@google.com> wrote:
>
> On Mon, Apr 22, 2019 at 11:07 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 17-04-19 10:35:16, Luca Coelho wrote:
> > > From: Shahar S Matityahu <shahar.s.matityahu@intel.com>
> > >
> > > The driver uses msix causes-register to handle both msix and non msix
> > > interrupts when performing sync nmi.  On devices that do not support
> > > msix this register is unmapped and accessing it causes a kernel panic.
> > >
> > > Solve this by differentiating the two cases and accessing the proper
> > > causes-register in each case.
>
> Are you sure reading CSR_INT from trans.c without explicitly getting irq_lock.c
> like rx.c does, is thread-safe? I don't claim to understand this fully, but this
> smells wrong from past experience with this driver. I'll see if I can cook up
> a test case with a race condition here.

Sorry for the typos. I meant "writing (not reading) to CSR_INT in
iwl_trans_pcie_sync_nmi
without explicitly getting trans_pcie->irq_lock like iwl_pcie_irq_handler does".
I spent some time playing around this, and while I don't have a black-and-white
test-case to show this patch has side-effects (at the very least), I
have some notes:

1. Repeatedly sending fw_nmi is a good test-case for the problem this patch
is trying to solve. i.e.  I had
$ cd /sys/kernel/debug/iwlwifi/${PCI_ID}/iwlmvm/
$ while true; do echo 1 > fw_nmi ; done  # this is likely too harsh
and needs a sleep in b/w

With the current ToT driver in wireless-drivers-next, this ramoops-es
super quickly after hitting
the problematic MSI-X read.

2. With this patch applied, I hit an ADVANCED_SYSASSERT 0x0 and it took ~800ms
and dozens of kernel warnings before the driver "recovered".
I've sent the full dmesg to Luca off-list.
It didn't get any better with adding spin_lock(&trans_pcie->irq_lock)
/ spin_unlock(&trans_pcie->irq_lock)
around the problematic CSR_INT write in iwl_trans_sync_nmi.

So I still don't have any concrete reasoning or proof, except a dmesg
after a test that doesn't look right.
Apologies if this discussion is moot - FWIW, the thing that caused a
code smell in the first place
was that historically, nothing in trans.c had mucked with writing to
CSR_INT, except when enabling interrupts again
with a 0xFFFFFFFF at _initialization time_ (and we've seen race
conditions there in the past).
You can check this with `git grep -p -w CSR_INT trans.c`.

> > >
> > > Reported-by: Michal Hocko <mhocko@kernel.org>
> > > Signed-off-by: Shahar S Matityahu <shahar.s.matityahu@intel.com>
> > > Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
> >
> > $ dmesg | grep "Error sending SCAN_CFG_CMD:"
> > [49786.288548] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms.
> > [53457.166877] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms.
> >
> > without the oops and with the iwlwifi internal dump IIUC which is the
> > previous behavior.
> > [53457.166877] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms.
> > [53457.166882] iwlwifi 0000:01:00.0: Current CMD queue read_ptr 224 write_ptr 225
> > [53457.414973] iwlwifi 0000:01:00.0: HW error, resetting before reading
> > [53457.421339] iwlwifi 0000:01:00.0: Start IWL Error Log Dump:
> > [53457.421345] iwlwifi 0000:01:00.0: Status: 0x00000100, count: 1269232956
> > [53457.421347] iwlwifi 0000:01:00.0: Loaded firmware version: 36.9f0a2d68.0
> > [53457.421350] iwlwifi 0000:01:00.0: 0x45E91306 | ADVANCED_SYSASSERT
> > [53457.421352] iwlwifi 0000:01:00.0: 0x2F58D384 | trm_hw_status0
> > [53457.421353] iwlwifi 0000:01:00.0: 0x7F1A8CFD | trm_hw_status1
> > [53457.421355] iwlwifi 0000:01:00.0: 0x07E787FD | branchlink2
> > [53457.421357] iwlwifi 0000:01:00.0: 0xE9E54368 | interruptlink1
> > [53457.421359] iwlwifi 0000:01:00.0: 0x470D9BBF | interruptlink2
> > [53457.421361] iwlwifi 0000:01:00.0: 0xAF040E7E | data1
> > [53457.421362] iwlwifi 0000:01:00.0: 0xE7FBCA48 | data2
> > [53457.421364] iwlwifi 0000:01:00.0: 0x4E4A8288 | data3
> > [53457.421366] iwlwifi 0000:01:00.0: 0x861DEA98 | beacon time
> > [53457.421368] iwlwifi 0000:01:00.0: 0xE8F23466 | tsf low
> > [53457.421369] iwlwifi 0000:01:00.0: 0xD7B19307 | tsf hi
> > [53457.421371] iwlwifi 0000:01:00.0: 0xE58934E3 | time gp1
> > [53457.421373] iwlwifi 0000:01:00.0: 0xB013FEBE | time gp2
> > [53457.421375] iwlwifi 0000:01:00.0: 0x962DCC75 | uCode revision type
> > [53457.421376] iwlwifi 0000:01:00.0: 0xFF8FB30F | uCode version major
> > [53457.421378] iwlwifi 0000:01:00.0: 0x0DD08E17 | uCode version minor
> > [53457.421380] iwlwifi 0000:01:00.0: 0x87FD70DE | hw version
> > [53457.421382] iwlwifi 0000:01:00.0: 0x853F6851 | board version
> > [53457.421384] iwlwifi 0000:01:00.0: 0x08D7F330 | hcmd
> > [53457.421385] iwlwifi 0000:01:00.0: 0x6B7E5FEE | isr0
> > [53457.421387] iwlwifi 0000:01:00.0: 0x2B1E7CD4 | isr1
> > [53457.421389] iwlwifi 0000:01:00.0: 0x3F133B16 | isr2
> > [53457.421391] iwlwifi 0000:01:00.0: 0x5D480C5A | isr3
> > [53457.421392] iwlwifi 0000:01:00.0: 0x34E93EBA | isr4
> > [53457.421394] iwlwifi 0000:01:00.0: 0x42AD8E83 | last cmd Id
> > [53457.421396] iwlwifi 0000:01:00.0: 0x1F5BBCFF | wait_event
> > [53457.421398] iwlwifi 0000:01:00.0: 0x6808B2C1 | l2p_control
> > [53457.421399] iwlwifi 0000:01:00.0: 0x0D5B1F33 | l2p_duration
> > [53457.421401] iwlwifi 0000:01:00.0: 0xF4C94535 | l2p_mhvalid
> > [53457.421403] iwlwifi 0000:01:00.0: 0x3DCE6EBB | l2p_addr_match
> > [53457.421405] iwlwifi 0000:01:00.0: 0xFDDC41FE | lmpm_pmg_sel
> > [53457.421406] iwlwifi 0000:01:00.0: 0xB53A17F5 | timestamp
> > [53457.421408] iwlwifi 0000:01:00.0: 0x5A6A4113 | flow_handler
> > [53457.421474] iwlwifi 0000:01:00.0: Start IWL Error Log Dump:
> > [53457.421477] iwlwifi 0000:01:00.0: Status: 0x00000100, count: 1182976748
> > [53457.421478] iwlwifi 0000:01:00.0: 0x62D2BDB3 | ADVANCED_SYSASSERT
> > [53457.421480] iwlwifi 0000:01:00.0: 0x4D9E5019 | umac branchlink1
> > [53457.421482] iwlwifi 0000:01:00.0: 0x8CB69F6E | umac branchlink2
> > [53457.421484] iwlwifi 0000:01:00.0: 0x9868662D | umac interruptlink1
> > [53457.421486] iwlwifi 0000:01:00.0: 0x9800F8F7 | umac interruptlink2
> > [53457.421488] iwlwifi 0000:01:00.0: 0xC71449B8 | umac data1
> > [53457.421489] iwlwifi 0000:01:00.0: 0xAB0AB17F | umac data2
> > [53457.421491] iwlwifi 0000:01:00.0: 0x6C6F9753 | umac data3
> > [53457.421493] iwlwifi 0000:01:00.0: 0xFC49D724 | umac major
> > [53457.421495] iwlwifi 0000:01:00.0: 0xA61CC627 | umac minor
> > [53457.421496] iwlwifi 0000:01:00.0: 0x45BAA0B8 | frame pointer
> > [53457.421498] iwlwifi 0000:01:00.0: 0x319D112B | stack pointer
> > [53457.421500] iwlwifi 0000:01:00.0: 0xEFD9E2E9 | last host cmd
> > [53457.421502] iwlwifi 0000:01:00.0: 0x82640FF7 | isr status reg
> > [53457.421506] ieee80211 phy0: Hardware restart was requested
> > [53457.941685] iwlwifi 0000:01:00.0: Queue 0 is inactive on fifo 0 and stuck for 2500 ms. SW [224, 225] HW [0, 0] FH TRB=0x02b759ca1
> >
> > Feel free to add
> > Tested-by: Michal Hocko <mhocko@suse.com>
> >
> > Thanks for your quick patch and sorry it took so long from my side.
> >
> > --
> > Michal Hocko
> > SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] iwlwifi: don't panic in error path on non-msix systems
  2019-04-23  2:34             ` Kirtika Ruchandani
@ 2019-04-23 13:27               ` Luca Coelho
  0 siblings, 0 replies; 13+ messages in thread
From: Luca Coelho @ 2019-04-23 13:27 UTC (permalink / raw)
  To: Kirtika Ruchandani, Michal Hocko
  Cc: kvalo, Johannes Berg, Grumbach, Emmanuel, linuxwifi,
	linux-kernel, linux-wireless, Shahar S Matityahu

On Mon, 2019-04-22 at 19:34 -0700, Kirtika Ruchandani wrote:
> On Mon, Apr 22, 2019 at 4:56 PM Kirtika Ruchandani <
> kirtika@google.com> wrote:
> > On Mon, Apr 22, 2019 at 11:07 AM Michal Hocko <mhocko@kernel.org>
> > wrote:
> > > On Wed 17-04-19 10:35:16, Luca Coelho wrote:
> > > > From: Shahar S Matityahu <shahar.s.matityahu@intel.com>
> > > > 
> > > > The driver uses msix causes-register to handle both msix and
> > > > non msix
> > > > interrupts when performing sync nmi.  On devices that do not
> > > > support
> > > > msix this register is unmapped and accessing it causes a kernel
> > > > panic.
> > > > 
> > > > Solve this by differentiating the two cases and accessing the
> > > > proper
> > > > causes-register in each case.
> > 
> > Are you sure reading CSR_INT from trans.c without explicitly
> > getting irq_lock.c
> > like rx.c does, is thread-safe? I don't claim to understand this
> > fully, but this
> > smells wrong from past experience with this driver. I'll see if I
> > can cook up
> > a test case with a race condition here.
> 
> Sorry for the typos. I meant "writing (not reading) to CSR_INT in
> iwl_trans_pcie_sync_nmi
> without explicitly getting trans_pcie->irq_lock like
> iwl_pcie_irq_handler does".
> I spent some time playing around this, and while I don't have a
> black-and-white
> test-case to show this patch has side-effects (at the very least), I
> have some notes:
> 
> 1. Repeatedly sending fw_nmi is a good test-case for the problem this
> patch
> is trying to solve. i.e.  I had
> $ cd /sys/kernel/debug/iwlwifi/${PCI_ID}/iwlmvm/
> $ while true; do echo 1 > fw_nmi ; done  # this is likely too harsh
> and needs a sleep in b/w
> 
> With the current ToT driver in wireless-drivers-next, this ramoops-es
> super quickly after hitting
> the problematic MSI-X read.
> 
> 2. With this patch applied, I hit an ADVANCED_SYSASSERT 0x0 and it
> took ~800ms
> and dozens of kernel warnings before the driver "recovered".
> I've sent the full dmesg to Luca off-list.
> It didn't get any better with adding spin_lock(&trans_pcie->irq_lock)
> / spin_unlock(&trans_pcie->irq_lock)
> around the problematic CSR_INT write in iwl_trans_sync_nmi.
> 
> So I still don't have any concrete reasoning or proof, except a dmesg
> after a test that doesn't look right.
> Apologies if this discussion is moot - FWIW, the thing that caused a
> code smell in the first place
> was that historically, nothing in trans.c had mucked with writing to
> CSR_INT, except when enabling interrupts again
> with a 0xFFFFFFFF at _initialization time_ (and we've seen race
> conditions there in the past).
> You can check this with `git grep -p -w CSR_INT trans.c`.


Hi Kirtika,

I think the SYSASSERT issue you found is unrelated.  Can we track that
separately? Again, it seems that we're doing something wrong during
recovery, which causes the many warnings and long time to recover...

Regarding the spinlock, I'm not sure.  It seems that we don't need it,
because we're just writing to the register.  Maybe some of the other
blocks that are spinlocked must have a guarantee that no one will
change the register in the middle of it.  But before we add the
spinlocks, we should check if that really is the case.

We'll check the dmesg you sent us, but I think it's better to create a
new issue in the tracker for it.

--
Cheers,
Luca.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-04-23 13:27 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-11 21:07 iwlwifi: BUG: unable to handle kernel paging request at ffffc90000e1c808 Michal Hocko
2019-04-12  9:49 ` Michal Hocko
2019-04-16 18:36   ` Michal Hocko
2019-04-17  7:33     ` Luca Coelho
2019-04-17  7:35       ` [PATCH] iwlwifi: don't panic in error path on non-msix systems Luca Coelho
2019-04-17 10:11         ` Michal Hocko
2019-04-17 10:17           ` Luca Coelho
2019-04-17 10:53             ` Luca Coelho
2019-04-17 10:59             ` Michal Hocko
2019-04-22 18:07         ` Michal Hocko
2019-04-22 23:56           ` Kirtika Ruchandani
2019-04-23  2:34             ` Kirtika Ruchandani
2019-04-23 13:27               ` Luca Coelho

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).