linux-wireless.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* wilc1000 kernel crash
@ 2022-10-24 13:54 Michael Walle
  2022-10-25 20:26 ` Ajay.Kathat
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Walle @ 2022-10-24 13:54 UTC (permalink / raw)
  To: Ajay Singh, Claudiu Beznea; +Cc: linux-wireless, Kalle Valo

Hi,

I'm using the WILC1000 wifi chip in SDIO mode and with NetworkManager
which seems to be probing the network in the background. What I am
seeing is a kernel oops while processing the workqueue.

This is on a kernel 5.15.74, but it also happens with the latest next,
but not that often - I guess due to a different timing.

My reduced steps to reproduce are the following:
  $ while true; do ifconfig wlan0 up; iw dev wlan0 scan & \
      ifconfig wlan0 down; done

After a while I'll get the following splash:

[  487.955326] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  487.955363] Mem abort info:
[  487.955366]   ESR = 0x96000004
[  487.955370]   EC = 0x25: DABT (current EL), IL = 32 bits
[  487.965939] FW not responding
[  487.971033]   SET = 0, FnV = 0
[  487.971039]   EA = 0, S1PTW = 0
[  487.971043]   FSC = 0x04: level 0 translation fault
[  487.971047] Data abort info:
[  487.971050]   ISV = 0, ISS = 0x00000004
[  487.971053]   CM = 0, WnR = 0
[  487.971059] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000497b0000
[  487.971066] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[  487.971085] Internal error: Oops: 96000004 [#1] SMP
[  487.971094] Modules linked in:
[  487.971104] CPU: 1 PID: 9 Comm: kworker/u8:0 Not tainted 5.15.74-00013-g2d5897cb12ef #130
[  487.971113] Hardware name: NXP i.MX8MNano DDR3L EVK board (DT)
[  487.971122] Workqueue: WILC_wq handle_rcvd_ntwrk_info
[  488.035377] wilc1000_sdio mmc1:0001:1: chipid (001003a0)
[  487.971085] Internal error: Oops:
[  488.041180] 96000004 [#1] SMP
[  488.041186] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  488.041196] pc : handle_rcvd_ntwrk_info+0x7c/0xc4
[  488.041208] lr : handle_rcvd_ntwrk_info+0x70/0xc4
[  488.049128] wilc1000_sdio mmc1:0001:1: has_thrpt_enh3 = 1...
[  488.057273] sp : ffff80000a20bd70
[  488.057277] x29: ffff80000a20bd70 x28: 0000000000000000 x27: 0000000000000000
[  488.057289] x26: ffff000000118470 x25: ffff000005059d05 x24: ffff00000de94d30
[  488.057299] x23: 0000000000000000
[  488.062670] wilc1000_sdio mmc1:0001:1 wlan0: ChipID [1003a0] loading firmware [atmel/wilc1000_wifi_firmware-1.bin]
[  488.070418]  x22: ffff000005059d00 x21: 0000000000000000
[  488.070428] x20: ffff00000de94d00 x19: ffff00000de94d28 x18: 0000000000000000
[  488.070440] x17: 0000000000000000 x16: 0000000000000000 x15: a4270000a4030001
[  488.070450] x14: 010102f2500018dd x13: 0018dd0000010002 x12: 0546000000000000
[  488.070461] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800008ad92a0
[  488.150644] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff0000063e88c0
[  488.157799] x5 : 0000000000000000 x4 : 0000000000000003 x3 : 0000000000000000
[  488.164947] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000000001
[  488.172095] Call trace:
[  488.174548]  handle_rcvd_ntwrk_info+0x7c/0xc4
[  488.175400] FW not responding
[  488.178927]  process_one_work+0x1ec/0x48c
[  488.178941]  worker_thread+0x170/0x564
[  488.178948]  kthread+0x128/0x13c
[  488.178959]  ret_from_fork+0x10/0x20
[  488.185280] FW not responding
[  488.185958] Code: 9415ea8e b4000060 39400401 35000201 (f94002a3)
[  488.192042] FW not responding
[  488.192957] ---[ end trace fa915dc840cf0355 ]---
[  488.199700] FW not responding
[  488.205601] Kernel panic - not syncing: Oops: Fatal exception
[  488.205608] SMP: stopping secondary CPUs
[  488.205630] Kernel Offset: disabled
[  488.205634] CPU features: 0x00002001,20000846
[  488.205642] Memory Limit: none

In handle_rcvd_ntwrk_info() scan_req->scan_result isn't valid anymore,
although it doesn't contain NULL. Thus the driver is calling into a
bogus function pointer. There seems to be no locking between the
asynchronous calls within the workqueue (wilc_enqueue_work()) and when
the interface is disabled (wilc_deinit()). wilc_deinit() will free the
host_if_drv object which might still be used within the workqueue
context.

BTW, ignore the "FW not repsonding" for now, that seems to be a
different problem.

-michael

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wilc1000 kernel crash
  2022-10-24 13:54 wilc1000 kernel crash Michael Walle
@ 2022-10-25 20:26 ` Ajay.Kathat
  2022-10-26  8:54   ` Michael Walle
  0 siblings, 1 reply; 9+ messages in thread
From: Ajay.Kathat @ 2022-10-25 20:26 UTC (permalink / raw)
  To: mwalle, Claudiu.Beznea; +Cc: linux-wireless, kvalo

Hi Michael,

On 24/10/22 19:24, Michael Walle wrote:
> Hi,
>
> I'm using the WILC1000 wifi chip in SDIO mode and with NetworkManager
> which seems to be probing the network in the background. What I am
> seeing is a kernel oops while processing the workqueue.
>
> This is on a kernel 5.15.74, but it also happens with the latest next,
> but not that often - I guess due to a different timing.
>
> My reduced steps to reproduce are the following:
>    $ while true; do ifconfig wlan0 up; iw dev wlan0 scan & \
>        ifconfig wlan0 down; done

It looks like a timing issue. During the execution of above steps, I 
observed that most of the time, the interface down command is executed 
before the scan command. A few times, the scan command is executed 
between interface up and down.

>
> After a while I'll get the following splash:
>
> [  487.955326] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
> [  487.955363] Mem abort info:
> [  487.955366]   ESR = 0x96000004
> [  487.955370]   EC = 0x25: DABT (current EL), IL = 32 bits
> [  487.965939] FW not responding
> [  487.971033]   SET = 0, FnV = 0
> [  487.971039]   EA = 0, S1PTW = 0
> [  487.971043]   FSC = 0x04: level 0 translation fault
> [  487.971047] Data abort info:
> [  487.971050]   ISV = 0, ISS = 0x00000004
> [  487.971053]   CM = 0, WnR = 0
> [  487.971059] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000497b0000
> [  487.971066] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
> [  487.971085] Internal error: Oops: 96000004 [#1] SMP
> [  487.971094] Modules linked in:
> [  487.971104] CPU: 1 PID: 9 Comm: kworker/u8:0 Not tainted 5.15.74-00013-g2d5897cb12ef #130
> [  487.971113] Hardware name: NXP i.MX8MNano DDR3L EVK board (DT)
> [  487.971122] Workqueue: WILC_wq handle_rcvd_ntwrk_info
> [  488.035377] wilc1000_sdio mmc1:0001:1: chipid (001003a0)
> [  487.971085] Internal error: Oops:
> [  488.041180] 96000004 [#1] SMP
> [  488.041186] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  488.041196] pc : handle_rcvd_ntwrk_info+0x7c/0xc4
> [  488.041208] lr : handle_rcvd_ntwrk_info+0x70/0xc4
> [  488.049128] wilc1000_sdio mmc1:0001:1: has_thrpt_enh3 = 1...
> [  488.057273] sp : ffff80000a20bd70
> [  488.057277] x29: ffff80000a20bd70 x28: 0000000000000000 x27: 0000000000000000
> [  488.057289] x26: ffff000000118470 x25: ffff000005059d05 x24: ffff00000de94d30
> [  488.057299] x23: 0000000000000000
> [  488.062670] wilc1000_sdio mmc1:0001:1 wlan0: ChipID [1003a0] loading firmware [atmel/wilc1000_wifi_firmware-1.bin]
> [  488.070418]  x22: ffff000005059d00 x21: 0000000000000000
> [  488.070428] x20: ffff00000de94d00 x19: ffff00000de94d28 x18: 0000000000000000
> [  488.070440] x17: 0000000000000000 x16: 0000000000000000 x15: a4270000a4030001
> [  488.070450] x14: 010102f2500018dd x13: 0018dd0000010002 x12: 0546000000000000
> [  488.070461] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800008ad92a0
> [  488.150644] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff0000063e88c0
> [  488.157799] x5 : 0000000000000000 x4 : 0000000000000003 x3 : 0000000000000000
> [  488.164947] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000000001
> [  488.172095] Call trace:
> [  488.174548]  handle_rcvd_ntwrk_info+0x7c/0xc4
> [  488.175400] FW not responding
> [  488.178927]  process_one_work+0x1ec/0x48c
> [  488.178941]  worker_thread+0x170/0x564
> [  488.178948]  kthread+0x128/0x13c
> [  488.178959]  ret_from_fork+0x10/0x20
> [  488.185280] FW not responding
> [  488.185958] Code: 9415ea8e b4000060 39400401 35000201 (f94002a3)
> [  488.192042] FW not responding
> [  488.192957] ---[ end trace fa915dc840cf0355 ]---
> [  488.199700] FW not responding
> [  488.205601] Kernel panic - not syncing: Oops: Fatal exception
> [  488.205608] SMP: stopping secondary CPUs
> [  488.205630] Kernel Offset: disabled
> [  488.205634] CPU features: 0x00002001,20000846
> [  488.205642] Memory Limit: none
>
> In handle_rcvd_ntwrk_info() scan_req->scan_result isn't valid anymore,
> although it doesn't contain NULL. Thus the driver is calling into a
> bogus function pointer. There seems to be no locking between the
> asynchronous calls within the workqueue (wilc_enqueue_work()) and when
> the interface is disabled (wilc_deinit()). wilc_deinit() will free the
> host_if_drv object which might still be used within the workqueue
> context.


Please try the below code changes with your test setup environment.


--- a/drivers/net/wireless/microchip/wilc1000/hif.c
+++ b/drivers/net/wireless/microchip/wilc1000/hif.c
@@ -495,12 +495,18 @@ static void handle_rcvd_ntwrk_info(struct 
work_struct *work)
  {
         struct host_if_msg *msg = container_of(work, struct 
host_if_msg, work);
         struct wilc_rcvd_net_info *rcvd_info = &msg->body.net_info;
-       struct wilc_user_scan_req *scan_req = 
&msg->vif->hif_drv->usr_scan_req;
+       struct host_if_drv *hif_drv = msg->vif->hif_drv;
+       struct wilc_user_scan_req *scan_req;
         const u8 *ch_elm;
         u8 *ies;
         int ies_len;
         size_t offset;

+       if (!hif_drv || !hif_drv->usr_scan_req.scan_result)
+               goto done;
+
+       scan_req = &hif_drv->usr_scan_req;
+
         if (ieee80211_is_probe_resp(rcvd_info->mgmt->frame_control))
                 offset = offsetof(struct ieee80211_mgmt, 
u.probe_resp.variable);
         else if (ieee80211_is_beacon(rcvd_info->mgmt->frame_control))
@@ -1574,6 +1580,9 @@ void wilc_network_info_received(struct wilc *wilc, 
u8 *buffer, u32 length)
                 return;
         }

+       if (!hif_drv->usr_scan_req.scan_result)
+               return;
+
         msg = wilc_alloc_work(vif, handle_rcvd_ntwrk_info, false);
         if (IS_ERR(msg))
                 return;

The above changes should avoid the kernel crash exception.


> BTW, ignore the "FW not repsonding" for now, that seems to be a
> different problem.

"FW not responding" log indicates the chip sleep command failure from 
Host to the FW. It's a temporary failure log for specific command. 
During the de-init process, this logs is often observed. IIRC, there was 
a change in the latest driver that reduced its frequency but I am unable 
to recall the exact change.

Regards,
Ajay


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wilc1000 kernel crash
  2022-10-25 20:26 ` Ajay.Kathat
@ 2022-10-26  8:54   ` Michael Walle
  2022-12-09 12:03     ` Michael Walle
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Walle @ 2022-10-26  8:54 UTC (permalink / raw)
  To: Ajay.Kathat; +Cc: Claudiu.Beznea, linux-wireless, kvalo

Hi Ajay,

On 22/10/25 08:26, Ajay.Kathat@microchip.com wrote:
> > In handle_rcvd_ntwrk_info() scan_req->scan_result isn't valid anymore,
> > although it doesn't contain NULL. Thus the driver is calling into a
> > bogus function pointer. There seems to be no locking between the
> > asynchronous calls within the workqueue (wilc_enqueue_work()) and when
> > the interface is disabled (wilc_deinit()). wilc_deinit() will free the
> > host_if_drv object which might still be used within the workqueue
> > context.
> 
> 
> Please try the below code changes with your test setup environment.

The workqueue handling and wilc_deinit() run in parallel, correct? ..

> --- a/drivers/net/wireless/microchip/wilc1000/hif.c
> +++ b/drivers/net/wireless/microchip/wilc1000/hif.c
> @@ -495,12 +495,18 @@ static void handle_rcvd_ntwrk_info(struct 
> work_struct *work)
>   {
>          struct host_if_msg *msg = container_of(work, struct 
> host_if_msg, work);
>          struct wilc_rcvd_net_info *rcvd_info = &msg->body.net_info;
> -       struct wilc_user_scan_req *scan_req = 
> &msg->vif->hif_drv->usr_scan_req;
> +       struct host_if_drv *hif_drv = msg->vif->hif_drv;
> +       struct wilc_user_scan_req *scan_req;
>          const u8 *ch_elm;
>          u8 *ies;
>          int ies_len;
>          size_t offset;
> 
> +       if (!hif_drv || !hif_drv->usr_scan_req.scan_result)
> +               goto done;

.. So what if hif_drv will be set to NULL right after this check?

I don't think you'll get around using proper locking here.

> +
> +       scan_req = &hif_drv->usr_scan_req;
> +
>          if (ieee80211_is_probe_resp(rcvd_info->mgmt->frame_control))
>                  offset = offsetof(struct ieee80211_mgmt, 
> u.probe_resp.variable);
>          else if (ieee80211_is_beacon(rcvd_info->mgmt->frame_control))
> @@ -1574,6 +1580,9 @@ void wilc_network_info_received(struct wilc *wilc, 
> u8 *buffer, u32 length)
>                  return;
>          }
> 
> +       if (!hif_drv->usr_scan_req.scan_result)
> +               return;
> +

This is also racy. What if scan_result is cleared right after this
check? Then the work item will still get added to the work queue.

Here is the call tree:

isr_bh_routine() [interrupt thread ctx]
  wilc_handle_isr()
    mutex_lock(hif_cs)
    wilc_wlan_handle_isr_ext()
	  wilc_wlan_handle_rxq()
	    wilc_wlan_handle_rx_buff()
          wilc_wlan_cfg_indicate_rx()
            wilc_network_info_received()
			  wilc_enqueue_work(handle_rcvd_ntwrk_info)
    mutex_unlock(hif_cs)

handle_rcvd_ntwrk_info [hif_workqueue ctx]
  if (scan_result)
    scan_result()

wilc_mac_close() [ioctl ctx?]
  wilc_deinit_host_int()
    wilc_deinit()
	mutex_lock(deinit_lock)
	if (scan_result)
	  scan_result()
	  scan_result = NULL
	kfree(hif_drv)
	hif_drv = NULL
	mutex_unlock(deinit_lock)

I don't see any synchronization mechanisms, between these threads.

>          msg = wilc_alloc_work(vif, handle_rcvd_ntwrk_info, false);
>          if (IS_ERR(msg))
>                  return;
> 
> The above changes should avoid the kernel crash exception.

As mentioned above, I think this will just decrease the chance that
it is happening. Nonetheless, I've tried your changes but it doesn't
fix the crash.

> > BTW, ignore the "FW not repsonding" for now, that seems to be a
> > different problem.
> 
> "FW not responding" log indicates the chip sleep command failure from 
> Host to the FW. It's a temporary failure log for specific command. 
> During the de-init process, this logs is often observed. IIRC, there was 
> a change in the latest driver that reduced its frequency but I am unable 
> to recall the exact change.

So what is a "temporary failure"? It is a pr_warn() and customers
(rightfully) complains about them. Apart from that, it only happens on
the second "ifconfig wlan0 up". There are no messages during the first
one. So I suspect there is still something fishy.

-michael

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wilc1000 kernel crash
  2022-10-26  8:54   ` Michael Walle
@ 2022-12-09 12:03     ` Michael Walle
  2022-12-09 14:14       ` Ajay.Kathat
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Walle @ 2022-12-09 12:03 UTC (permalink / raw)
  To: Ajay.Kathat; +Cc: Claudiu.Beznea, linux-wireless, kvalo

On 22/10/26 10:54, Michael Walle wrote:
> Hi Ajay,
> 
> On 22/10/25 08:26, Ajay.Kathat@microchip.com wrote:
> > > In handle_rcvd_ntwrk_info() scan_req->scan_result isn't valid anymore,
> > > although it doesn't contain NULL. Thus the driver is calling into a
> > > bogus function pointer. There seems to be no locking between the
> > > asynchronous calls within the workqueue (wilc_enqueue_work()) and when
> > > the interface is disabled (wilc_deinit()). wilc_deinit() will free the
> > > host_if_drv object which might still be used within the workqueue
> > > context.
> > 
> > 
> > Please try the below code changes with your test setup environment.
> 
> The workqueue handling and wilc_deinit() run in parallel, correct? ..
> 
> > --- a/drivers/net/wireless/microchip/wilc1000/hif.c
> > +++ b/drivers/net/wireless/microchip/wilc1000/hif.c
> > @@ -495,12 +495,18 @@ static void handle_rcvd_ntwrk_info(struct 
> > work_struct *work)
> >   {
> >          struct host_if_msg *msg = container_of(work, struct 
> > host_if_msg, work);
> >          struct wilc_rcvd_net_info *rcvd_info = &msg->body.net_info;
> > -       struct wilc_user_scan_req *scan_req = 
> > &msg->vif->hif_drv->usr_scan_req;
> > +       struct host_if_drv *hif_drv = msg->vif->hif_drv;
> > +       struct wilc_user_scan_req *scan_req;
> >          const u8 *ch_elm;
> >          u8 *ies;
> >          int ies_len;
> >          size_t offset;
> > 
> > +       if (!hif_drv || !hif_drv->usr_scan_req.scan_result)
> > +               goto done;
> 
> .. So what if hif_drv will be set to NULL right after this check?
> 
> I don't think you'll get around using proper locking here.
> 
> > +
> > +       scan_req = &hif_drv->usr_scan_req;
> > +
> >          if (ieee80211_is_probe_resp(rcvd_info->mgmt->frame_control))
> >                  offset = offsetof(struct ieee80211_mgmt, 
> > u.probe_resp.variable);
> >          else if (ieee80211_is_beacon(rcvd_info->mgmt->frame_control))
> > @@ -1574,6 +1580,9 @@ void wilc_network_info_received(struct wilc *wilc, 
> > u8 *buffer, u32 length)
> >                  return;
> >          }
> > 
> > +       if (!hif_drv->usr_scan_req.scan_result)
> > +               return;
> > +
> 
> This is also racy. What if scan_result is cleared right after this
> check? Then the work item will still get added to the work queue.
> 
> Here is the call tree:
> 
> isr_bh_routine() [interrupt thread ctx]
>   wilc_handle_isr()
>     mutex_lock(hif_cs)
>     wilc_wlan_handle_isr_ext()
> 	  wilc_wlan_handle_rxq()
> 	    wilc_wlan_handle_rx_buff()
>           wilc_wlan_cfg_indicate_rx()
>             wilc_network_info_received()
> 			  wilc_enqueue_work(handle_rcvd_ntwrk_info)
>     mutex_unlock(hif_cs)
> 
> handle_rcvd_ntwrk_info [hif_workqueue ctx]
>   if (scan_result)
>     scan_result()
> 
> wilc_mac_close() [ioctl ctx?]
>   wilc_deinit_host_int()
>     wilc_deinit()
> 	mutex_lock(deinit_lock)
> 	if (scan_result)
> 	  scan_result()
> 	  scan_result = NULL
> 	kfree(hif_drv)
> 	hif_drv = NULL
> 	mutex_unlock(deinit_lock)
> 
> I don't see any synchronization mechanisms, between these threads.
> 
> >          msg = wilc_alloc_work(vif, handle_rcvd_ntwrk_info, false);
> >          if (IS_ERR(msg))
> >                  return;
> > 
> > The above changes should avoid the kernel crash exception.
> 
> As mentioned above, I think this will just decrease the chance that
> it is happening. Nonetheless, I've tried your changes but it doesn't
> fix the crash.

Any news here?

Thanks,
-michael

> > > BTW, ignore the "FW not repsonding" for now, that seems to be a
> > > different problem.
> > 
> > "FW not responding" log indicates the chip sleep command failure from 
> > Host to the FW. It's a temporary failure log for specific command. 
> > During the de-init process, this logs is often observed. IIRC, there was 
> > a change in the latest driver that reduced its frequency but I am unable 
> > to recall the exact change.
> 
> So what is a "temporary failure"? It is a pr_warn() and customers
> (rightfully) complains about them. Apart from that, it only happens on
> the second "ifconfig wlan0 up". There are no messages during the first
> one. So I suspect there is still something fishy.
> 
> -michael

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wilc1000 kernel crash
  2022-12-09 12:03     ` Michael Walle
@ 2022-12-09 14:14       ` Ajay.Kathat
  2022-12-16 10:18         ` Michael Walle
  0 siblings, 1 reply; 9+ messages in thread
From: Ajay.Kathat @ 2022-12-09 14:14 UTC (permalink / raw)
  To: mwalle; +Cc: Claudiu.Beznea, linux-wireless, kvalo

On 09/12/22 17:33, Michael Walle wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>
> On 22/10/26 10:54, Michael Walle wrote:
>> Hi Ajay,
>>
>> On 22/10/25 08:26, Ajay.Kathat@microchip.com wrote:
>>>> In handle_rcvd_ntwrk_info() scan_req->scan_result isn't valid anymore,
>>>> although it doesn't contain NULL. Thus the driver is calling into a
>>>> bogus function pointer. There seems to be no locking between the
>>>> asynchronous calls within the workqueue (wilc_enqueue_work()) and when
>>>> the interface is disabled (wilc_deinit()). wilc_deinit() will free the
>>>> host_if_drv object which might still be used within the workqueue
>>>> context.
>>>
>>> Please try the below code changes with your test setup environment.
>> The workqueue handling and wilc_deinit() run in parallel, correct? ..
>>
>>> --- a/drivers/net/wireless/microchip/wilc1000/hif.c
>>> +++ b/drivers/net/wireless/microchip/wilc1000/hif.c
>>> @@ -495,12 +495,18 @@ static void handle_rcvd_ntwrk_info(struct
>>> work_struct *work)
>>>    {
>>>           struct host_if_msg *msg = container_of(work, struct
>>> host_if_msg, work);
>>>           struct wilc_rcvd_net_info *rcvd_info = &msg->body.net_info;
>>> -       struct wilc_user_scan_req *scan_req =
>>> &msg->vif->hif_drv->usr_scan_req;
>>> +       struct host_if_drv *hif_drv = msg->vif->hif_drv;
>>> +       struct wilc_user_scan_req *scan_req;
>>>           const u8 *ch_elm;
>>>           u8 *ies;
>>>           int ies_len;
>>>           size_t offset;
>>>
>>> +       if (!hif_drv || !hif_drv->usr_scan_req.scan_result)
>>> +               goto done;
>> .. So what if hif_drv will be set to NULL right after this check?
>>
>> I don't think you'll get around using proper locking here.
>>
>>> +
>>> +       scan_req = &hif_drv->usr_scan_req;
>>> +
>>>           if (ieee80211_is_probe_resp(rcvd_info->mgmt->frame_control))
>>>                   offset = offsetof(struct ieee80211_mgmt,
>>> u.probe_resp.variable);
>>>           else if (ieee80211_is_beacon(rcvd_info->mgmt->frame_control))
>>> @@ -1574,6 +1580,9 @@ void wilc_network_info_received(struct wilc *wilc,
>>> u8 *buffer, u32 length)
>>>                   return;
>>>           }
>>>
>>> +       if (!hif_drv->usr_scan_req.scan_result)
>>> +               return;
>>> +
>> This is also racy. What if scan_result is cleared right after this
>> check? Then the work item will still get added to the work queue.
>>
>> Here is the call tree:
>>
>> isr_bh_routine() [interrupt thread ctx]
>>    wilc_handle_isr()
>>      mutex_lock(hif_cs)
>>      wilc_wlan_handle_isr_ext()
>>          wilc_wlan_handle_rxq()
>>            wilc_wlan_handle_rx_buff()
>>            wilc_wlan_cfg_indicate_rx()
>>              wilc_network_info_received()
>>                          wilc_enqueue_work(handle_rcvd_ntwrk_info)
>>      mutex_unlock(hif_cs)
>>
>> handle_rcvd_ntwrk_info [hif_workqueue ctx]
>>    if (scan_result)
>>      scan_result()
>>
>> wilc_mac_close() [ioctl ctx?]
>>    wilc_deinit_host_int()
>>      wilc_deinit()
>>        mutex_lock(deinit_lock)
>>        if (scan_result)
>>          scan_result()
>>          scan_result = NULL
>>        kfree(hif_drv)
>>        hif_drv = NULL
>>        mutex_unlock(deinit_lock)
>>
>> I don't see any synchronization mechanisms, between these threads.
>>
>>>           msg = wilc_alloc_work(vif, handle_rcvd_ntwrk_info, false);
>>>           if (IS_ERR(msg))
>>>                   return;
>>>
>>> The above changes should avoid the kernel crash exception.
>> As mentioned above, I think this will just decrease the chance that
>> it is happening. Nonetheless, I've tried your changes but it doesn't
>> fix the crash.
> Any news here?


Hi Michael,


No progress yet. I tried to simulate the condition a few times but was 
unable to see the exact failure in my setup so I need to try more. For 
the other "FW not responding" continuous logs, I got some clue. 
Probably, will try to send that patch first.

Regards,
Ajay

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wilc1000 kernel crash
  2022-12-09 14:14       ` Ajay.Kathat
@ 2022-12-16 10:18         ` Michael Walle
  2023-04-03 14:24           ` Kirill Buksha
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Walle @ 2022-12-16 10:18 UTC (permalink / raw)
  To: Ajay.Kathat; +Cc: Claudiu.Beznea, linux-wireless, kvalo

Hi,

On 22/12/09 02:14, Ajay.Kathat@microchip.com wrote:
> No progress yet. I tried to simulate the condition a few times but was 
> unable to see the exact failure in my setup so I need to try more.

Shouldn't it also be possible to see the issue by code reading? I've
provided the call tree in my previous mail and my concerns regarding
the locking. Either I'm missing something there or there is no
locking between these threads which could cause this issue.

> For the other "FW not responding" continuous logs, I got some clue. 
> Probably, will try to send that patch first.

Ok, let me know if you have some patches, I'm happy to test them.

-michael

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wilc1000 kernel crash
  2022-12-16 10:18         ` Michael Walle
@ 2023-04-03 14:24           ` Kirill Buksha
  2023-04-04  1:30             ` Ajay.Kathat
  0 siblings, 1 reply; 9+ messages in thread
From: Kirill Buksha @ 2023-04-03 14:24 UTC (permalink / raw)
  To: Michael Walle, Ajay.Kathat; +Cc: Claudiu.Beznea, linux-wireless, kvalo

On 16.12.22. 11:18, Michael Walle wrote:
> Hi,
>
> On 22/12/09 02:14, Ajay.Kathat@microchip.com wrote:
>> No progress yet. I tried to simulate the condition a few times but was 
>> unable to see the exact failure in my setup so I need to try more.
> Shouldn't it also be possible to see the issue by code reading? I've
> provided the call tree in my previous mail and my concerns regarding
> the locking. Either I'm missing something there or there is no
> locking between these threads which could cause this issue.
>
>> For the other "FW not responding" continuous logs, I got some clue. 
>> Probably, will try to send that patch first.
> Ok, let me know if you have some patches, I'm happy to test them.
>
> -michael
>
>

Hello,

I faced the same kernel oops issue. After analyzing my logs and brief
debugging, I agree with Mikhail: the problem seems to be accessing the
scan_result pointer after it has been nulled.

Regarding the solution: if there is a race between two threads (as
Michael described earlier), then I think that the locking mechanism will
be the most reliable solution. We ran into problems during
deinitialization, but driver contains two more places
(handle_scan_done() and wilc_disconnect() functions in wilc1000/hif.c),
where scan_result is set to NULL.

I use NetworkManager to manage networks and I have experienced the same
failure multiple times when switching from one WiFi network to another.
Keep in mind that switching between networks calls wilc_disconnect() and
wilc_deinit() functions and it is not yet clear which one is causing a
core dump. I think it's worth at least taking a look at these areas of
the code. What do you think?

Best regards,
Kirill Buksha.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wilc1000 kernel crash
  2023-04-03 14:24           ` Kirill Buksha
@ 2023-04-04  1:30             ` Ajay.Kathat
  2023-04-04 16:20               ` Kirill Buksha
  0 siblings, 1 reply; 9+ messages in thread
From: Ajay.Kathat @ 2023-04-04  1:30 UTC (permalink / raw)
  To: kirbuk200, mwalle; +Cc: Claudiu.Beznea, linux-wireless, kvalo

On 4/3/23 07:24, Kirill Buksha wrote:
> [Some people who received this message don't often get email from kirbuk200@gmail.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> 
> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
> 
> On 16.12.22. 11:18, Michael Walle wrote:
>> Hi,
>>
>> On 22/12/09 02:14, Ajay.Kathat@microchip.com wrote:
>>> No progress yet. I tried to simulate the condition a few times but was
>>> unable to see the exact failure in my setup so I need to try more.
>> Shouldn't it also be possible to see the issue by code reading? I've
>> provided the call tree in my previous mail and my concerns regarding
>> the locking. Either I'm missing something there or there is no
>> locking between these threads which could cause this issue.
>>
>>> For the other "FW not responding" continuous logs, I got some clue.
>>> Probably, will try to send that patch first.
>> Ok, let me know if you have some patches, I'm happy to test them.
>>
>> -michael
>>
>>
> 
> Hello,
> 
> I faced the same kernel oops issue. After analyzing my logs and brief
> debugging, I agree with Mikhail: the problem seems to be accessing the
> scan_result pointer after it has been nulled.

I have submitted a patch [1] which has fix for scan_result NULL pointer
exception issue. The submitted patch handles the synchronization between
mac_close() and asynchronous interrupts from firmware. Basically, it
takes care of blocking the execution of mac_close() till all pending
works are completed and afterward no new work addition is allowed since
the close is in progress. It is worth to try with that patch once and
check it's behavior.

1.
https://lore.kernel.org/linux-wireless/20230404012010.15261-1-ajay.kathat@microchip.com/T/#u

> 
> Regarding the solution: if there is a race between two threads (as
> Michael described earlier), then I think that the locking mechanism will
> be the most reliable solution. We ran into problems during
> deinitialization, but driver contains two more places
> (handle_scan_done() and wilc_disconnect() functions in wilc1000/hif.c),
> where scan_result is set to NULL.
> 
> I use NetworkManager to manage networks and I have experienced the same
> failure multiple times when switching from one WiFi network to another.
> Keep in mind that switching between networks calls wilc_disconnect() and
> wilc_deinit() functions and it is not yet clear which one is causing a
> core dump. I think it's worth at least taking a look at these areas of
> the code. What do you think?

If possible, please share the sequence(commands) for Wifi network
switching scenario. It looks like both functions(mac_close & disconnect)
are getting called from user context. mac_close() is a netdevice
callback whereas wilc_disconnect() is a cfg80211 callback. Generally,
wilc_disconnect() should be enough to disconnect from current Wifi
network without bringing the complete interface down. Is NetworkManager
closing the interface(mac_close()) before switching the WiFi network.


Regards,
Ajay

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: wilc1000 kernel crash
  2023-04-04  1:30             ` Ajay.Kathat
@ 2023-04-04 16:20               ` Kirill Buksha
  0 siblings, 0 replies; 9+ messages in thread
From: Kirill Buksha @ 2023-04-04 16:20 UTC (permalink / raw)
  To: Ajay.Kathat; +Cc: Claudiu.Beznea, linux-wireless, kvalo, mwalle

On 4.4.23. 03:30, Ajay.Kathat@microchip.com wrote:
> On 4/3/23 07:24, Kirill Buksha wrote:
>> [Some people who received this message don't often get email from kirbuk200@gmail.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>>
>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>
>> On 16.12.22. 11:18, Michael Walle wrote:
>>> Hi,
>>>
>>> On 22/12/09 02:14, Ajay.Kathat@microchip.com wrote:
>>>> No progress yet. I tried to simulate the condition a few times but was
>>>> unable to see the exact failure in my setup so I need to try more.
>>> Shouldn't it also be possible to see the issue by code reading? I've
>>> provided the call tree in my previous mail and my concerns regarding
>>> the locking. Either I'm missing something there or there is no
>>> locking between these threads which could cause this issue.
>>>
>>>> For the other "FW not responding" continuous logs, I got some clue.
>>>> Probably, will try to send that patch first.
>>> Ok, let me know if you have some patches, I'm happy to test them.
>>>
>>> -michael
>>>
>>>
>> Hello,
>>
>> I faced the same kernel oops issue. After analyzing my logs and brief
>> debugging, I agree with Mikhail: the problem seems to be accessing the
>> scan_result pointer after it has been nulled.
> I have submitted a patch [1] which has fix for scan_result NULL pointer
> exception issue. The submitted patch handles the synchronization between
> mac_close() and asynchronous interrupts from firmware. Basically, it
> takes care of blocking the execution of mac_close() till all pending
> works are completed and afterward no new work addition is allowed since
> the close is in progress. It is worth to try with that patch once and
> check it's behavior.
>
> 1.
> https://lore.kernel.org/linux-wireless/20230404012010.15261-1-ajay.kathat@microchip.com/T/#u

Thank you for the patch. I will take a look/test it when I have time.

>> Regarding the solution: if there is a race between two threads (as
>> Michael described earlier), then I think that the locking mechanism will
>> be the most reliable solution. We ran into problems during
>> deinitialization, but driver contains two more places
>> (handle_scan_done() and wilc_disconnect() functions in wilc1000/hif.c),
>> where scan_result is set to NULL.
>>
>> I use NetworkManager to manage networks and I have experienced the same
>> failure multiple times when switching from one WiFi network to another.
>> Keep in mind that switching between networks calls wilc_disconnect() and
>> wilc_deinit() functions and it is not yet clear which one is causing a
>> core dump. I think it's worth at least taking a look at these areas of
>> the code. What do you think?
> If possible, please share the sequence(commands) for Wifi network
> switching scenario. It looks like both functions(mac_close & disconnect)
> are getting called from user context. mac_close() is a netdevice
> callback whereas wilc_disconnect() is a cfg80211 callback. Generally,
> wilc_disconnect() should be enough to disconnect from current Wifi
> network without bringing the complete interface down. Is NetworkManager
> closing the interface(mac_close()) before switching the WiFi network.
>
>
> Regards,
> Ajay

The commands are as follows:
while true; do nmcli c up wlan0-client; nmcli c up wlan0-client-2; done

It takes about 5 minutes until I see the core dump.
I see following message after every command:
...
wilc1000_sdio mmc0:0001:1 wlan0: Deinitializing wilc1000...
...
Message above comes from wilc_wlan_deinitialize() function which is called from wilc_mac_close(). It seems that interface is closed between connections.

Best regards,
Kirill Buksha.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-04-04 16:20 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-24 13:54 wilc1000 kernel crash Michael Walle
2022-10-25 20:26 ` Ajay.Kathat
2022-10-26  8:54   ` Michael Walle
2022-12-09 12:03     ` Michael Walle
2022-12-09 14:14       ` Ajay.Kathat
2022-12-16 10:18         ` Michael Walle
2023-04-03 14:24           ` Kirill Buksha
2023-04-04  1:30             ` Ajay.Kathat
2023-04-04 16:20               ` Kirill Buksha

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).