Linux-Wireless Archive on lore.kernel.org
 help / color / Atom feed
* WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
@ 2019-09-11 10:05 Linus Torvalds
  2019-09-11 10:26 ` Johannes Berg
  2019-09-11 18:10 ` Kalle Valo
  0 siblings, 2 replies; 12+ messages in thread
From: Linus Torvalds @ 2019-09-11 10:05 UTC (permalink / raw)
  To: Johannes Berg, David S. Miller
  Cc: linux-wireless, Netdev, Linux List Kernel Mailing

So I'm at LCA, reading email, using my laptop more than I normally do,
and with different networking than I normally do.

And I just had a 802.11 WARN_ON() trigger, followed by essentially a
dead machine due to some lock held (maybe rtnl_lock).

It's possible that the lock held thing happened before, and is the
_reason_ for the delay, I don't know. I had to reboot the machine, but
I gathered as much information as made sense and was obvious before I
did so. That's appended.

                 Linus

---
Selected dmesg from the WARN_ON_ONCE() and just before, followed by
info from "sysrq-w":

Previous resume looks normal:

   PM: suspend exit
   ath10k_pci 0000:02:00.0: UART prints enabled
   ath10k_pci 0000:02:00.0: unsupported HTC service id: 1536
   wlp2s0: authenticate with 54:ec:2f:05:70:2c
   wlp2s0: send auth to 54:ec:2f:05:70:2c (try 1/3)
   wlp2s0: send auth to 54:ec:2f:05:70:2c (try 2/3)
   wlp2s0: send auth to 54:ec:2f:05:70:2c (try 3/3)
   wlp2s0: authentication with 54:ec:2f:05:70:2c timed out
   wlp2s0: authenticate with 54:ec:2f:05:70:28
   wlp2s0: send auth to 54:ec:2f:05:70:28 (try 1/3)
   wlp2s0: authenticated
   wlp2s0: associate with 54:ec:2f:05:70:28 (try 1/3)
   wlp2s0: RX AssocResp from 54:ec:2f:05:70:28 (capab=0x1431 status=0 aid=1)
   wlp2s0: associated
   wlp2s0: Limiting TX power to 20 (20 - 0) dBm as advertised by
54:ec:2f:05:70:28
   ath: EEPROM regdomain: 0x826c
   ath: EEPROM indicates we should expect a country code
   ath: doing EEPROM country->regdmn map search
   ath: country maps to regdmn code: 0x37
   ath: Country alpha2 being used: PT
   ath: Regpair used: 0x37
   ath: regdomain 0x826c dynamically updated by country element
   IPv6: ADDRCONF(NETDEV_CHANGE): wlp2s0: link becomes ready
   wlp2s0: disconnect from AP 54:ec:2f:05:70:28 for new auth to
54:ec:2f:05:70:2c
   wlp2s0: authenticate with 54:ec:2f:05:70:2c
   wlp2s0: send auth to 54:ec:2f:05:70:2c (try 1/3)
   wlp2s0: authenticated
   wlp2s0: associate with 54:ec:2f:05:70:2c (try 1/3)
   wlp2s0: RX ReassocResp from 54:ec:2f:05:70:2c (capab=0x1011 status=0 aid=2)
   wlp2s0: associated
   ath: EEPROM regdomain: 0x826c
   ath: EEPROM indicates we should expect a country code
   ath: doing EEPROM country->regdmn map search
   ath: country maps to regdmn code: 0x37
   ath: Country alpha2 being used: PT
   ath: Regpair used: 0x37
   ath: regdomain 0x826c dynamically updated by country element
   wlp2s0: Limiting TX power to 23 (23 - 0) dBm as advertised by
54:ec:2f:05:70:2c

Another _almost_ successful suspend/resume:

   PM: suspend entry (s2idle)
   Filesystems sync: 0.028 seconds
   Freezing user space processes ... (elapsed 0.126 seconds) done.
   OOM killer disabled.
   Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
   printk: Suspending console(s) (use no_console_suspend to debug)
   wlp2s0: deauthenticating from 54:ec:2f:05:70:2c by local choice
(Reason: 3=DEAUTH_LEAVING)
   ath10k_pci 0000:02:00.0: UART prints enabled
   ath10k_pci 0000:02:00.0: unsupported HTC service id: 1536
   OOM killer enabled.
   Restarting tasks ... done.
   PM: suspend exit
   ath10k_pci 0000:02:00.0: UART prints enabled
   ath10k_pci 0000:02:00.0: unsupported HTC service id: 1536
   wlp2s0: authenticate with 54:ec:2f:05:70:2c
   wlp2s0: send auth to 54:ec:2f:05:70:2c (try 1/3)
   wlp2s0: authenticated
   wlp2s0: associate with 54:ec:2f:05:70:2c (try 1/3)
   wlp2s0: RX AssocResp from 54:ec:2f:05:70:2c (capab=0x1011 status=0 aid=2)
   wlp2s0: associated
   wlp2s0: Limiting TX power to 23 (23 - 0) dBm as advertised by
54:ec:2f:05:70:2c
   ath: EEPROM regdomain: 0x826c
   ath: EEPROM indicates we should expect a country code
   ath: doing EEPROM country->regdmn map search
   ath: country maps to regdmn code: 0x37
   ath: Country alpha2 being used: PT
   ath: Regpair used: 0x37
   ath: regdomain 0x826c dynamically updated by country element
   IPv6: ADDRCONF(NETDEV_CHANGE): wlp2s0: link becomes ready

I say _almost_, because I don't see the "No TX power limit" for the
country lookup (yes, Portugal) this time?

But wait!

... then 10+ minutes later:

   ath10k_pci 0000:02:00.0: wmi command 16387 timeout, restarting hardware
   ath10k_pci 0000:02:00.0: failed to set 5g txpower 23: -11
   ath10k_pci 0000:02:00.0: failed to setup tx power 23: -11
   ath10k_pci 0000:02:00.0: failed to recalc tx power: -11
   ath10k_pci 0000:02:00.0: failed to set inactivity time for vdev 0: -108
   ath10k_pci 0000:02:00.0: failed to setup powersave: -108

That certainly looks like something did try to set a power limit, but
eventually failed.

Immediately after that:

   wlp2s0: deauthenticating from 54:ec:2f:05:70:2c by local choice
(Reason: 3=DEAUTH_LEAVING)
   ath10k_pci 0000:02:00.0: failed to read hi_board_data address: -16
   ath10k_pci 0000:02:00.0: failed to receive initialized event from
target: 00000000
   ath10k_pci 0000:02:00.0: failed to receive initialized event from
target: 00000000
   ath10k_pci 0000:02:00.0: failed to wait for target init: -110
   ieee80211 phy0: Hardware restart was requested
   ath10k_pci 0000:02:00.0: failed to set inactivity time for vdev 0: -108
   ath10k_pci 0000:02:00.0: failed to setup powersave: -108
   ath10k_pci 0000:02:00.0: failed to set PS Mode 0 for vdev 0: -108
   ath10k_pci 0000:02:00.0: failed to setup powersave: -108
   ath10k_pci 0000:02:00.0: failed to setup ps on vdev 0: -108
   ath10k_pci 0000:02:00.0: failed to flush transmit queue (skip 1
ar-state 2): 5000
   ath10k_pci 0000:02:00.0: failed to delete peer 54:ec:2f:05:70:2c
for vdev 0: -108

and this then results in:

   WARNING: CPU: 4 PID: 1246 at net/mac80211/sta_info.c:1057
__sta_info_destroy_part2+0x147/0x150 [mac80211]
   Modules linked in: ccm fuse rfcomm xt_CHECKSUM xt_MASQUERADE tun
bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT
ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat
ip6table_nat ip6table_mangle ip6table_raw ip6table_security
iptable_nat nf_nat iptable_mangle iptable_raw iptable_security
nf_conntrack nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ip_set nfnetlink
ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep vfat fat
hid_multitouch iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi
snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core snd_hda_codec_realtek
snd_soc_core snd_hda_codec_generic snd_soc_acpi_intel_match
snd_soc_acpi snd_soc_skl_ipc snd_soc_sst_ipc snd_soc_sst_dsp
dell_laptop snd_hda_intel ledtrig_audio dell_wmi dell_smbios
x86_pkg_temp_thermal snd_hda_codec ath10k_pci intel_powerclamp
wmi_bmof dell_wmi_descriptor intel_wmi_thunderbolt snd_hwdep coretemp
ath10k_core snd_hda_core kvm_intel snd_seq ath snd_seq_device kvm
btusb btrtl
    intel_rapl_msr btbcm btintel mac80211 irqbypass snd_pcm bluetooth
intel_cstate dcdbas intel_uncore intel_rapl_perf joydev snd_timer snd
ecdh_generic i2c_i801 cfg80211 ecc idma64 soundcore rtsx_pci_ms
processor_thermal_device mei_me memstick rfkill libarc4 intel_lpss_pci
ucsi_acpi intel_soc_dts_iosf mei typec_ucsi intel_lpss
intel_pch_thermal intel_rapl_common typec wmi int3403_thermal
int340x_thermal_zone intel_hid sparse_keymap int3400_thermal acpi_pad
acpi_thermal_rel dm_crypt i915 crct10dif_pclmul crc32_pclmul
crc32c_intel rtsx_pci_sdmmc mmc_core i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt nvme fb_sys_fops ghash_clmulni_intel
drm serio_raw nvme_core rtsx_pci i2c_hid pinctrl_cannonlake video
pinctrl_intel
   CPU: 4 PID: 1246 Comm: NetworkManager Not tainted
5.3.0-rc8-00004-g56037cadf604-dirty #5
   Hardware name: Dell Inc. XPS 13 9380/0KTW76, BIOS 1.5.0 06/03/2019
   RIP: 0010:__sta_info_destroy_part2+0x147/0x150 [mac80211]
   Code: bd 0c 01 00 00 00 0f 84 4e ff ff ff 45 31 c0 b9 01 00 00 00
48 89 ea 48 89 de 4c 89 e7 e8 e1 aa ff ff 85 c0 0f 84 30 ff ff ff <0f>
0b e9 29 ff ff ff 66 90 0f 1f 44 00 00 41 54 55 48 89 fd e8 40
   RSP: 0018:ffff98fc00daba90 EFLAGS: 00010282
   RAX: 00000000ffffff94 RBX: ffff8e1355ad2880 RCX: 0000000000000000
   RDX: ffff8e1355901ec0 RSI: 0000000000000000 RDI: ffff8e1357902e38
   RBP: ffff8e135bb08000 R08: 0000000000000000 R09: 0000000000000574
   R10: 000000000001cc5c R11: 0000000000000000 R12: ffff8e13579007a0
   R13: ffff8e1355ad2880 R14: 0000000000000001 R15: ffff8e1357900d58
   FS:  00007fbd99372bc0(0000) GS:ffff8e135e500000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 00001cdbd7408000 CR3: 00000004708a2001 CR4: 00000000003606e0
   Call Trace:
    __sta_info_flush+0x125/0x170 [mac80211]
    ieee80211_set_disassoc+0xc2/0x590 [mac80211]
    ieee80211_mgd_deauth.cold+0x4a/0x1b8 [mac80211]
    cfg80211_mlme_deauth+0xb3/0x1d0 [cfg80211]
    ? startup_64+0x3/0x30
    cfg80211_mlme_down+0x66/0x90 [cfg80211]
    cfg80211_disconnect+0x129/0x1e0 [cfg80211]
    cfg80211_leave+0x27/0x40 [cfg80211]
    cfg80211_netdev_notifier_call+0x1a7/0x4e0 [cfg80211]
    ? ieee80211_iter_chan_contexts_atomic+0x3e/0x50 [mac80211]
    ? ath10k_monitor_recalc+0x38f/0x5d0 [ath10k_core]
    ? __schedule+0x143/0x660
    ? ath10k_config_ps+0x4e/0x70 [ath10k_core]
    ? inetdev_event+0x46/0x560
    notifier_call_chain+0x4c/0x70
    __dev_close_many+0x57/0x100
    dev_close_many+0x8d/0x140
    dev_close.part.0+0x44/0x70
    cfg80211_shutdown_all_interfaces+0x71/0xd0 [cfg80211]
    cfg80211_rfkill_set_block+0x22/0x30 [cfg80211]
    rfkill_set_block+0x92/0x140 [rfkill]
    rfkill_fop_write+0x11f/0x1c0 [rfkill]
    vfs_write+0xb6/0x1a0
    ksys_write+0xa7/0xe0
    do_syscall_64+0x59/0x1c0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
   RIP: 0033:0x7fbd9a36c92f
   Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 19 fd ff ff 48 8b
54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48>
3d 00 f0 ff ff 77 2d 44 89 c7 48 89 44 24 08 e8 4c fd ff ff 48
   RSP: 002b:00007fff9a1ee220 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
   RAX: ffffffffffffffda RBX: 000055f18ef5eaad RCX: 00007fbd9a36c92f
   RDX: 0000000000000008 RSI: 00007fff9a1ee258 RDI: 0000000000000019
   RBP: 0000000000000019 R08: 0000000000000000 R09: 0000000000000001
   R10: 0000000000000000 R11: 0000000000000293 R12: 000055f190502090
   R13: 0000000000000000 R14: 0000000000000000 R15: 000055f1904ffbb0
   ---[ end trace 4d7c87877e20badd ]---
   ath10k_pci 0000:02:00.0: failed to recalculate rts/cts prot for vdev 0: -108
   ath10k_pci 0000:02:00.0: failed to set cts protection for vdev 0: -108

and it looks like it leaves some lock held.

Because after this, nothing network-related stops working, and the
machine eventually dies because workqueues don't make any progress (due
to rtnl_lock, I suspect):

   Workqueue: events_freezable ieee80211_restart_work [mac80211]
   Call Trace:
    schedule+0x39/0xa0
    schedule_preempt_disabled+0xa/0x10
    __mutex_lock.isra.0+0x263/0x4b0
    ieee80211_restart_work+0x54/0xe0 [mac80211]
    process_one_work+0x1cf/0x370
    worker_thread+0x4a/0x3c0
    kthread+0xfb/0x130
    ret_from_fork+0x35/0x40

   Workqueue: events disconnect_work [cfg80211]
   Call Trace:
    schedule+0x39/0xa0
    schedule_preempt_disabled+0xa/0x10
    __mutex_lock.isra.0+0x263/0x4b0
    disconnect_work+0x12/0xd0 [cfg80211]
    process_one_work+0x1cf/0x370
    worker_thread+0x4a/0x3c0
    kthread+0xfb/0x130
    ret_from_fork+0x35/0x40
   kworker/3:0     D    0  6313      2 0x80004000


   Workqueue: events linkwatch_event
   Call Trace:
    schedule+0x39/0xa0
    schedule_preempt_disabled+0xa/0x10
    __mutex_lock.isra.0+0x263/0x4b0
    linkwatch_event+0xa/0x30
    process_one_work+0x1cf/0x370
    worker_thread+0x4a/0x3c0
    kthread+0xfb/0x130
    ret_from_fork+0x35/0x40

   ping            D    0  6619   6568 0x80000002
   Call Trace:
    schedule+0x39/0xa0
    schedule_preempt_disabled+0xa/0x10
    __mutex_lock.isra.0+0x263/0x4b0
    ip6mr_sk_done+0x2e/0xd0
    rawv6_close+0x19/0x30
    inet_release+0x34/0x60
    __sock_release+0x3d/0xa0
    sock_close+0x11/0x20
    __fput+0xc1/0x250
    task_work_run+0x81/0xa0
    do_exit+0x2e5/0xaa0
    do_group_exit+0x3a/0x90
    __x64_sys_exit_group+0x14/0x20
    do_syscall_64+0x59/0x1c0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

   gdbus           D    0  6629   1949 0x80004000
   Call Trace:
    schedule+0x39/0xa0
    do_exit+0x1da/0xaa0
    do_group_exit+0x3a/0x90
    get_signal+0x14f/0x830
    do_signal+0x36/0x620
    exit_to_usermode_loop+0x6f/0xf0
    do_syscall_64+0x17d/0x1c0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
  2019-09-11 10:05 WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2()) Linus Torvalds
@ 2019-09-11 10:26 ` Johannes Berg
  2019-09-11 11:58   ` Linus Torvalds
  2019-09-11 18:19   ` Kalle Valo
  2019-09-11 18:10 ` Kalle Valo
  1 sibling, 2 replies; 12+ messages in thread
From: Johannes Berg @ 2019-09-11 10:26 UTC (permalink / raw)
  To: Linus Torvalds, David S. Miller, Kalle Valo
  Cc: linux-wireless, Netdev, Linux List Kernel Mailing

Hi,

> So I'm at LCA

When did LCA move to Portugal? ;-))

> , reading email, using my laptop more than I normally do,
> and with different networking than I normally do.
> 
> And I just had a 802.11 WARN_ON() trigger, followed by essentially a
> dead machine due to some lock held (maybe rtnl_lock).

yes, it's definitely stuck on the RTNL, all the lot of the workqueues
are trying to acquire it (linkwatch, ipv6, but oddly enough even the
mac80211 restart work).

> It's possible that the lock held thing happened before, and is the
> _reason_ for the delay, I don't know.

No, we do have the lock in the WARN_ON(), somewhere around
dev_close_many() it is acquired.

> Previous resume looks normal:
> [snip]
>    wlp2s0: Limiting TX power to 23 (23 - 0) dBm as advertised by
> 54:ec:2f:05:70:2c

Is that the message you meant?

> Another _almost_ successful suspend/resume:
> [snip
>    wlp2s0: RX AssocResp from 54:ec:2f:05:70:2c (capab=0x1011 status=0 aid=2)
>    wlp2s0: associated
>    wlp2s0: Limiting TX power to 23 (23 - 0) dBm as advertised by
> 54:ec:2f:05:70:2c
>    ath: EEPROM regdomain: 0x826c
>    ath: EEPROM indicates we should expect a country code
>    ath: doing EEPROM country->regdmn map search
>    ath: country maps to regdmn code: 0x37
>    ath: Country alpha2 being used: PT
>    ath: Regpair used: 0x37
>    ath: regdomain 0x826c dynamically updated by country element
>    IPv6: ADDRCONF(NETDEV_CHANGE): wlp2s0: link becomes ready
> 
> I say _almost_, because I don't see the "No TX power limit" for the
> country lookup (yes, Portugal) this time?

because here you had it too, just a bit earlier. It usually comes when a
beacon is received the first time, which depends on the AP timing.

>    ath10k_pci 0000:02:00.0: wmi command 16387 timeout, restarting hardware
>    ath10k_pci 0000:02:00.0: failed to set 5g txpower 23: -11
>    ath10k_pci 0000:02:00.0: failed to setup tx power 23: -11
>    ath10k_pci 0000:02:00.0: failed to recalc tx power: -11
>    ath10k_pci 0000:02:00.0: failed to set inactivity time for vdev 0: -108
>    ath10k_pci 0000:02:00.0: failed to setup powersave: -108
> 
> That certainly looks like something did try to set a power limit, but
> eventually failed.

Yeah, that does seem a bit fishy. Kalle would have to comment for
ath10k.

> Immediately after that:
> 
>    wlp2s0: deauthenticating from 54:ec:2f:05:70:2c by local choice
> (Reason: 3=DEAUTH_LEAVING)

I don't _think_ any of the above would be a reason to disconnect, but it
clearly looks like the device got stuck at this point, since everything
just fails afterwards.

>    ath10k_pci 0000:02:00.0: failed to delete peer 54:ec:2f:05:70:2c
> for vdev 0: -108
> 
> and this then results in:
> 
>    WARNING: CPU: 4 PID: 1246 at net/mac80211/sta_info.c:1057
> __sta_info_destroy_part2+0x147/0x150 [mac80211]

Not really a surprise. Perhaps we shouldn't even WARN_ON() this, if the
driver is stuck completely and returning errors to everything that
doesn't help so much.

Then again, the stack trace was helpful this time:

>     ieee80211_set_disassoc+0xc2/0x590 [mac80211]
>     ieee80211_mgd_deauth.cold+0x4a/0x1b8 [mac80211]
>     cfg80211_mlme_deauth+0xb3/0x1d0 [cfg80211]
>     cfg80211_mlme_down+0x66/0x90 [cfg80211]
>     cfg80211_disconnect+0x129/0x1e0 [cfg80211]
>     cfg80211_leave+0x27/0x40 [cfg80211]
>     cfg80211_netdev_notifier_call+0x1a7/0x4e0 [cfg80211]
>     notifier_call_chain+0x4c/0x70
>     __dev_close_many+0x57/0x100
>     dev_close_many+0x8d/0x140
>     dev_close.part.0+0x44/0x70
>     cfg80211_shutdown_all_interfaces+0x71/0xd0 [cfg80211]
>     cfg80211_rfkill_set_block+0x22/0x30 [cfg80211]
>     rfkill_set_block+0x92/0x140 [rfkill]
>     rfkill_fop_write+0x11f/0x1c0 [rfkill]
>     vfs_write+0xb6/0x1a0


Since we see that something actually did an rfkill operation. Did you
push a button there?

Like I said above, the fact that we get into notifier_call_chain() from
rfkill_set_block() means that we acquired the RTNL here somewhere
between these.

>    ath10k_pci 0000:02:00.0: failed to recalculate rts/cts prot for vdev 0: -108
>    ath10k_pci 0000:02:00.0: failed to set cts protection for vdev 0: -108
> 
> and it looks like it leaves some lock held.

Yeah, the RTNL.

You don't happen to have timing information on these logs, perhaps
recorded in the logfile/journal?

It seems odd to me, since the RTNL is acquired by
cfg80211_rfkill_set_block() and that doesn't even have an error path, it
just does
        rtnl_lock();
        cfg80211_shutdown_all_interfaces(&rdev->wiphy);
        rtnl_unlock();

The only explanation I therefore have is that something is just taking
*forever* in that code path, hence my question about timing information
on the logs.

Looks like indeed the driver gives the device at least *3 seconds* for
every command, see ath10k_wmi_cmd_send(), so most likely this would
eventually have finished, but who knows how many firmware commands it
would still have attempted to send...

Perhaps the driver should mark the device as dead and fail quickly once
it timed out once, or so, but I'll let Kalle comment on that.

johannes


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
  2019-09-11 10:26 ` Johannes Berg
@ 2019-09-11 11:58   ` Linus Torvalds
  2019-09-11 12:04     ` Johannes Berg
  2019-09-11 18:19   ` Kalle Valo
  1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2019-09-11 11:58 UTC (permalink / raw)
  To: Johannes Berg
  Cc: David S. Miller, Kalle Valo, linux-wireless, Netdev,
	Linux List Kernel Mailing

On Wed, Sep 11, 2019 at 11:26 AM Johannes Berg
<johannes@sipsolutions.net> wrote:
>
> Hi,
>
> > So I'm at LCA
>
> When did LCA move to Portugal? ;-))

Heh. I may be jetlagged and not thinking straight.  LCA/LPC ;)

> > Previous resume looks normal:
> > [snip]
> >    wlp2s0: Limiting TX power to 23 (23 - 0) dBm as advertised by
> > 54:ec:2f:05:70:2c
>
> Is that the message you meant?
>
> > I say _almost_, because I don't see the "No TX power limit" for the
> > country lookup (yes, Portugal) this time?
>
> because here you had it too, just a bit earlier. It usually comes when a
> beacon is received the first time, which depends on the AP timing.

Duh. I'm blind and didn't notice, because I was expecting it in the same order.

And I didn't think about it or double-check, because the errors that
then followed later _looked_ like that TX power failing that I thought
hadn't happened.

> I don't _think_ any of the above would be a reason to disconnect, but it
> clearly looks like the device got stuck at this point, since everything
> just fails afterwards.

Yeah, maybe the power stuff was just another effect of things getting
stuck, rather than the reason for it getting stuck.

So I probably mis-attributed the cause.

> >    WARNING: CPU: 4 PID: 1246 at net/mac80211/sta_info.c:1057
> > __sta_info_destroy_part2+0x147/0x150 [mac80211]
>
> Not really a surprise. Perhaps we shouldn't even WARN_ON() this, if the
> driver is stuck completely and returning errors to everything that
> doesn't help so much.
>
> Then again, the stack trace was helpful this time:
>
> >     ieee80211_set_disassoc+0xc2/0x590 [mac80211]
> >     ieee80211_mgd_deauth.cold+0x4a/0x1b8 [mac80211]
> >     cfg80211_mlme_deauth+0xb3/0x1d0 [cfg80211]
> >     cfg80211_mlme_down+0x66/0x90 [cfg80211]
> >     cfg80211_disconnect+0x129/0x1e0 [cfg80211]
> >     cfg80211_leave+0x27/0x40 [cfg80211]
> >     cfg80211_netdev_notifier_call+0x1a7/0x4e0 [cfg80211]
> >     notifier_call_chain+0x4c/0x70
> >     __dev_close_many+0x57/0x100
> >     dev_close_many+0x8d/0x140
> >     dev_close.part.0+0x44/0x70
> >     cfg80211_shutdown_all_interfaces+0x71/0xd0 [cfg80211]
> >     cfg80211_rfkill_set_block+0x22/0x30 [cfg80211]
> >     rfkill_set_block+0x92/0x140 [rfkill]
> >     rfkill_fop_write+0x11f/0x1c0 [rfkill]
> >     vfs_write+0xb6/0x1a0
>
>
> Since we see that something actually did an rfkill operation. Did you
> push a button there?

No, I tried to turn off and turn on Wifi manually (no button, just the
settings panel).

I didn't notice the WARN_ON(), I just noticed that there was no
networking, and "turn it off and on again" is obviously the first
thing to try ;)

> You don't happen to have timing information on these logs, perhaps
> recorded in the logfile/journal?

Sure. I cleaned up the logs to not spam people with lots of illegible
data, but it's all in the journal log.

Rough timeline:

Sep 11 03:40:00 xps13 kernel: PM: suspend entry (s2idle)
Sep 11 03:40:00 xps13 kernel: Filesystems sync: 0.028 seconds
...
Sep 11 10:13:14 xps13 kernel: Restarting tasks ... done.
Sep 11 10:13:14 xps13 kernel: PM: suspend exit
Sep 11 10:13:14 xps13 kernel: ath10k_pci 0000:02:00.0: UART prints enabled
Sep 11 10:13:14 xps13 kernel: ath10k_pci 0000:02:00.0: unsupported HTC
service id: 1536
Sep 11 10:13:23 xps13 kernel: wlp2s0: authenticate with 54:ec:2f:05:70:2c
Sep 11 10:13:23 xps13 kernel: wlp2s0: send auth to 54:ec:2f:05:70:2c (try 1/3)
Sep 11 10:13:23 xps13 kernel: wlp2s0: authenticated
Sep 11 10:13:23 xps13 kernel: wlp2s0: Limiting TX power to 23 (23 - 0)
dBm as advertised by 54:ec:2f:05:70:2c
...
Sep 11 10:13:23 xps13 kernel: ath: regdomain 0x826c dynamically
updated by country element
Sep 11 10:13:24 xps13 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): wlp2s0:
link becomes ready
Sep 11 10:27:07 xps13 kernel: ath10k_pci 0000:02:00.0: wmi command
16387 timeout, restarting hardware
...
Sep 11 10:27:07 xps13 kernel: ath10k_pci 0000:02:00.0: failed to read
hi_board_data address: -16
Sep 11 10:27:10 xps13 kernel: ath10k_pci 0000:02:00.0: failed to
receive initialized event from target: 00000000
Sep 11 10:27:13 xps13 kernel: ath10k_pci 0000:02:00.0: failed to
receive initialized event from target: 00000000
...
Sep 11 10:27:13 xps13 kernel: WARNING: CPU: 4 PID: 1246 at
net/mac80211/sta_info.c:1057 __sta_info_destroy_part2+0x147/0x150
[mac80211]

but if you want full logs I can send them in private to you.

I do suspect it's atheros and suspend/resume or something. The
wireless clearly worked for a while after the resume, but then at some
point it stopped.

> It seems odd to me, since the RTNL is acquired by
> cfg80211_rfkill_set_block() and that doesn't even have an error path, it
> just does
>         rtnl_lock();
>         cfg80211_shutdown_all_interfaces(&rdev->wiphy);
>         rtnl_unlock();
>
> The only explanation I therefore have is that something is just taking
> *forever* in that code path, hence my question about timing information
> on the logs.

Yeah, maybe it would time out everything eventually. But not for a
long time. It hadn't cleared up by

  Sep 11 10:36:21 xps13 gnome-session-f[6837]: gnome-session-failed:
Fatal IO error 0 (Success) on X server :0.

which is when I shut down the machine (and had to then force a hard
power-off because the shutdown wanted things that needed the rtnl_lock
to go away)

                 Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
  2019-09-11 11:58   ` Linus Torvalds
@ 2019-09-11 12:04     ` Johannes Berg
  2019-09-11 13:03       ` Ben Greear
  2019-09-11 13:32       ` Kalle Valo
  0 siblings, 2 replies; 12+ messages in thread
From: Johannes Berg @ 2019-09-11 12:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, Kalle Valo, linux-wireless, Netdev,
	Linux List Kernel Mailing

On Wed, 2019-09-11 at 12:58 +0100, Linus Torvalds wrote:
> 
> And I didn't think about it or double-check, because the errors that
> then followed later _looked_ like that TX power failing that I thought
> hadn't happened.

Yeah, it could be something already got stuck there, hard to say.

> > Since we see that something actually did an rfkill operation. Did you
> > push a button there?
> 
> No, I tried to turn off and turn on Wifi manually (no button, just the
> settings panel).

That does usually also cause rfkill, so that explains how we got down
this particular code path.

> I didn't notice the WARN_ON(), I just noticed that there was no
> networking, and "turn it off and on again" is obviously the first
> thing to try ;)

:-)

> Sep 11 10:27:13 xps13 kernel: WARNING: CPU: 4 PID: 1246 at
> net/mac80211/sta_info.c:1057 __sta_info_destroy_part2+0x147/0x150
> [mac80211]
> 
> but if you want full logs I can send them in private to you.

No, it's fine, though maybe Kalle does - he was stepping out for a while
but said he'd look later.

This is the interesting time - 10:27:13 we get one of the first
failures. Really the first one was this:

> Sep 11 10:27:07 xps13 kernel: ath10k_pci 0000:02:00.0: wmi command 16387 timeout, restarting hardware


> I do suspect it's atheros and suspend/resume or something. The
> wireless clearly worked for a while after the resume, but then at some
> point it stopped.

I'm not really sure it's related to suspend/resume at all, the firmware
seems to just have gotten stuck, and the device and firmware most likely
got reset over the suspend/resume anyway.

> > The only explanation I therefore have is that something is just taking
> > *forever* in that code path, hence my question about timing information
> > on the logs.
> 
> Yeah, maybe it would time out everything eventually. But not for a
> long time. It hadn't cleared up by
> 
>   Sep 11 10:36:21 xps13 gnome-session-f[6837]: gnome-session-failed:
> Fatal IO error 0 (Success) on X server :0.

Ok, that's way longer than I would have guessed even! That's over 9
minutes, that'd be close to 200 commands having to be issued and timing
out ...

I don't know. What I wrote before is basically all I can say, I think
the driver gets stuck somewhere waiting for the device "forever", and
the stack just doesn't get to release the lock, causing all the follow-
up problems.

johannes


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
  2019-09-11 12:04     ` Johannes Berg
@ 2019-09-11 13:03       ` Ben Greear
  2019-09-11 13:21         ` Linus Torvalds
  2019-09-11 13:32       ` Kalle Valo
  1 sibling, 1 reply; 12+ messages in thread
From: Ben Greear @ 2019-09-11 13:03 UTC (permalink / raw)
  To: Johannes Berg, Linus Torvalds
  Cc: David S. Miller, Kalle Valo, linux-wireless, Netdev,
	Linux List Kernel Mailing



On 09/11/2019 05:04 AM, Johannes Berg wrote:
> On Wed, 2019-09-11 at 12:58 +0100, Linus Torvalds wrote:
>>
>> And I didn't think about it or double-check, because the errors that
>> then followed later _looked_ like that TX power failing that I thought
>> hadn't happened.
>
> Yeah, it could be something already got stuck there, hard to say.
>
>>> Since we see that something actually did an rfkill operation. Did you
>>> push a button there?
>>
>> No, I tried to turn off and turn on Wifi manually (no button, just the
>> settings panel).
>
> That does usually also cause rfkill, so that explains how we got down
> this particular code path.
>
>> I didn't notice the WARN_ON(), I just noticed that there was no
>> networking, and "turn it off and on again" is obviously the first
>> thing to try ;)
>
> :-)
>
>> Sep 11 10:27:13 xps13 kernel: WARNING: CPU: 4 PID: 1246 at
>> net/mac80211/sta_info.c:1057 __sta_info_destroy_part2+0x147/0x150
>> [mac80211]
>>
>> but if you want full logs I can send them in private to you.
>
> No, it's fine, though maybe Kalle does - he was stepping out for a while
> but said he'd look later.
>
> This is the interesting time - 10:27:13 we get one of the first
> failures. Really the first one was this:
>
>> Sep 11 10:27:07 xps13 kernel: ath10k_pci 0000:02:00.0: wmi command 16387 timeout, restarting hardware
>
>
>> I do suspect it's atheros and suspend/resume or something. The
>> wireless clearly worked for a while after the resume, but then at some
>> point it stopped.
>
> I'm not really sure it's related to suspend/resume at all, the firmware
> seems to just have gotten stuck, and the device and firmware most likely
> got reset over the suspend/resume anyway.
>
>>> The only explanation I therefore have is that something is just taking
>>> *forever* in that code path, hence my question about timing information
>>> on the logs.
>>
>> Yeah, maybe it would time out everything eventually. But not for a
>> long time. It hadn't cleared up by
>>
>>   Sep 11 10:36:21 xps13 gnome-session-f[6837]: gnome-session-failed:
>> Fatal IO error 0 (Success) on X server :0.
>
> Ok, that's way longer than I would have guessed even! That's over 9
> minutes, that'd be close to 200 commands having to be issued and timing
> out ...
>
> I don't know. What I wrote before is basically all I can say, I think
> the driver gets stuck somewhere waiting for the device "forever", and
> the stack just doesn't get to release the lock, causing all the follow-
> up problems.

It looks to me like the ath10k firmware is not responding to commands and/or
is out of its WMI tx credits.  The code often takes a lock and then blocks for up to 3
or so seconds waiting for a response from the firmware, and the mac80211 calling
code is often already holding rtnl.  Pretty much every mac80211 call will cause a
WMI message and thus potentially hit this timeout.

This can easily cause rtnl to be held for 3 seconds, but after that, I believe
upstream ath10k will now time out and kill the firmware and restart.  (I run
a significantly modified ath10k driver, and that is how mine works, at least.)

In this case, it looks like restarting the firmware/NIC failed, and I guess
that must get it in a state where it is still blocking and trying to talk
to the firmware?  Or maybe deadlock down inside ath10k driver.

For what it's worth, we see that WARN_ON often when ath10k firmware crashes, but it
seems to not be a big deal and the system normally recovers fine.

Out of curiosity, I'm interested to know what ath10k NIC chipset this is from.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
  2019-09-11 13:03       ` Ben Greear
@ 2019-09-11 13:21         ` Linus Torvalds
  2019-09-11 13:31           ` Ben Greear
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2019-09-11 13:21 UTC (permalink / raw)
  To: Ben Greear
  Cc: Johannes Berg, David S. Miller, Kalle Valo, linux-wireless,
	Netdev, Linux List Kernel Mailing

On Wed, Sep 11, 2019 at 2:03 PM Ben Greear <greearb@candelatech.com> wrote:
>
> Out of curiosity, I'm interested to know what ath10k NIC chipset this is from.

It's a Dell XPS 13 9380, with

  02:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac
Wireless Network Adapter (rev 32)
        Subsystem: Bigfoot Networks, Inc. Killer 1435 Wireless-AC

(numeric PCI ID 168c:003e, subsystem 1a56:143a).

The ath10k driver says

    qca6174 hw3.2 target 0x05030000 chip_id 0x00340aff sub 1a56:143a
    firmware ver WLAN.RM.4.4.1-00140-QCARMSWPZ-1 api 6 features
wowlan,ignore-otp,mfp crc32 29eb8ca1
    board_file api 2 bmi_id N/A crc32 4ed3569e

if that tells you anything more.

              Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
  2019-09-11 13:21         ` Linus Torvalds
@ 2019-09-11 13:31           ` Ben Greear
  0 siblings, 0 replies; 12+ messages in thread
From: Ben Greear @ 2019-09-11 13:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Berg, David S. Miller, Kalle Valo, linux-wireless,
	Netdev, Linux List Kernel Mailing



On 09/11/2019 06:21 AM, Linus Torvalds wrote:
> On Wed, Sep 11, 2019 at 2:03 PM Ben Greear <greearb@candelatech.com> wrote:
>>
>> Out of curiosity, I'm interested to know what ath10k NIC chipset this is from.
>
> It's a Dell XPS 13 9380, with
>
>   02:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac
> Wireless Network Adapter (rev 32)
>         Subsystem: Bigfoot Networks, Inc. Killer 1435 Wireless-AC
>
> (numeric PCI ID 168c:003e, subsystem 1a56:143a).
>
> The ath10k driver says
>
>     qca6174 hw3.2 target 0x05030000 chip_id 0x00340aff sub 1a56:143a
>     firmware ver WLAN.RM.4.4.1-00140-QCARMSWPZ-1 api 6 features
> wowlan,ignore-otp,mfp crc32 29eb8ca1
>     board_file api 2 bmi_id N/A crc32 4ed3569e
>
> if that tells you anything more.

That means it is something I have never used nor have firmware for, but
the WMI logic should be similar to what I described and have experienced
with other chips.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
  2019-09-11 12:04     ` Johannes Berg
  2019-09-11 13:03       ` Ben Greear
@ 2019-09-11 13:32       ` Kalle Valo
  1 sibling, 0 replies; 12+ messages in thread
From: Kalle Valo @ 2019-09-11 13:32 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linus Torvalds, David S. Miller, linux-wireless, Netdev,
	Linux List Kernel Mailing

Johannes Berg <johannes@sipsolutions.net> writes:

>> Sep 11 10:27:13 xps13 kernel: WARNING: CPU: 4 PID: 1246 at
>> net/mac80211/sta_info.c:1057 __sta_info_destroy_part2+0x147/0x150
>> [mac80211]
>> 
>> but if you want full logs I can send them in private to you.
>
> No, it's fine, though maybe Kalle does - he was stepping out for a while
> but said he'd look later.

Linus, it would help if you could send me full logs with timestamps.
Also if you can, please grep your logs to see if these wmi timeouts have
happened before.

-- 
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
  2019-09-11 10:05 WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2()) Linus Torvalds
  2019-09-11 10:26 ` Johannes Berg
@ 2019-09-11 18:10 ` Kalle Valo
  1 sibling, 0 replies; 12+ messages in thread
From: Kalle Valo @ 2019-09-11 18:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Berg, David S. Miller, linux-wireless, Netdev,
	Linux List Kernel Mailing, ath10k

+ ath10k list

Linus Torvalds <torvalds@linux-foundation.org> writes:

> So I'm at LCA, reading email, using my laptop more than I normally do,
> and with different networking than I normally do.
>
> And I just had a 802.11 WARN_ON() trigger, followed by essentially a
> dead machine due to some lock held (maybe rtnl_lock).
>
> It's possible that the lock held thing happened before, and is the
> _reason_ for the delay, I don't know. I had to reboot the machine, but
> I gathered as much information as made sense and was obvious before I
> did so. That's appended.

Some notes while investigating this:

> But wait!
>
> ... then 10+ minutes later:
>
>    ath10k_pci 0000:02:00.0: wmi command 16387 timeout, restarting hardware
>    ath10k_pci 0000:02:00.0: failed to set 5g txpower 23: -11
>    ath10k_pci 0000:02:00.0: failed to setup tx power 23: -11
>    ath10k_pci 0000:02:00.0: failed to recalc tx power: -11
>    ath10k_pci 0000:02:00.0: failed to set inactivity time for vdev 0: -108
>    ath10k_pci 0000:02:00.0: failed to setup powersave: -108
>
> That certainly looks like something did try to set a power limit, but
> eventually failed.

I suspect the failing WMI command is called from:

ath10k_bss_info_changed()
ath10k_mac_txpower_recalc()
ath10k_mac_txpower_setup()
ath10k_wmi_pdev_set_param()
ath10k_wmi_cmd_send()
ath10k_wmi_cmd_send_nowait()
ath10k_htc_send()

-11 is -EAGAIN which would mean that the HTC credits have run out some
 reason for the WMI command:

if (ep->tx_credits < credits) {
        ath10k_dbg(ar, ATH10K_DBG_HTC,
                "htc insufficient credits ep %d required %d available %d\n",
                eid, credits, ep->tx_credits);
        spin_unlock_bh(&htc->tx_lock);
        ret = -EAGAIN;
        goto err_pull;
}

Credits can run out, for example, if there's a lot of WMI command/event
activity and are not returned during the 3s wait, firmware crashed or
problems with the PCI bus. But when the WMI command timeout happens
ath10k is supposed to restart the firmware and everything should be
usable again.
                                             
> Immediately after that:
>
>    wlp2s0: deauthenticating from 54:ec:2f:05:70:2c by local choice
> (Reason: 3=DEAUTH_LEAVING)
>    ath10k_pci 0000:02:00.0: failed to read hi_board_data address: -16
>    ath10k_pci 0000:02:00.0: failed to receive initialized event from
> target: 00000000
>    ath10k_pci 0000:02:00.0: failed to receive initialized event from
> target: 00000000
>    ath10k_pci 0000:02:00.0: failed to wait for target init: -110

I suspect here ath10k tries to reset the target during stop operation,
"failed to receive initialized event from target" comes from:

ath10k_pci_hif_stop()
ath10k_pci_safe_chip_reset()
ath10k_pci_warm_reset()
ath10k_pci_wait_for_target_init()

It shouldn't fail like that, which makes me suspect either a low level
problem or a bug in qca6174 firmware restart code. To check the latter,
could you please try to force a firmware crash and see if firmware
restart is working for you?

To crash the firmware you need to write either "hard" or "assert" (I
forgot which one QCA6174 firmware supports) to
/sys/kernel/debug/ieee80211/phy*/ath10k/simulate_fw_crash. And what
should happen is that the firmware crashes, ath10k prints a big pile of
warnings, restarts it and in few seconds everything resumes to normal
without user space even noticing it.

-- 
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
  2019-09-11 10:26 ` Johannes Berg
  2019-09-11 11:58   ` Linus Torvalds
@ 2019-09-11 18:19   ` Kalle Valo
  2019-09-11 18:23     ` Johannes Berg
  1 sibling, 1 reply; 12+ messages in thread
From: Kalle Valo @ 2019-09-11 18:19 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linus Torvalds, David S. Miller, Kalle Valo, linux-wireless,
	Netdev, Linux List Kernel Mailing, ath10k

Johannes Berg <johannes@sipsolutions.net> writes:

>>    ath10k_pci 0000:02:00.0: wmi command 16387 timeout, restarting hardware
>>    ath10k_pci 0000:02:00.0: failed to set 5g txpower 23: -11
>>    ath10k_pci 0000:02:00.0: failed to setup tx power 23: -11
>>    ath10k_pci 0000:02:00.0: failed to recalc tx power: -11
>>    ath10k_pci 0000:02:00.0: failed to set inactivity time for vdev 0: -108
>>    ath10k_pci 0000:02:00.0: failed to setup powersave: -108
>> 
>> That certainly looks like something did try to set a power limit, but
>> eventually failed.
>
> Yeah, that does seem a bit fishy. Kalle would have to comment for
> ath10k.
>
>> Immediately after that:
>> 
>>    wlp2s0: deauthenticating from 54:ec:2f:05:70:2c by local choice
>> (Reason: 3=DEAUTH_LEAVING)
>
> I don't _think_ any of the above would be a reason to disconnect, but it
> clearly looks like the device got stuck at this point, since everything
> just fails afterwards.

Yeah, to me it looks anything ath10k tries to do with the devie fails,
even resetting the device.

> Looks like indeed the driver gives the device at least *3 seconds* for
> every command, see ath10k_wmi_cmd_send(), so most likely this would
> eventually have finished, but who knows how many firmware commands it
> would still have attempted to send...

3 seconds is a bit short but in normal cases it should be enough. Of
course we could increase the delay but I'm skeptic it would help here.

> Perhaps the driver should mark the device as dead and fail quickly once
> it timed out once, or so, but I'll let Kalle comment on that.

Actually we do try to restart the device when a timeout happens in
ath10k_wmi_cmd_send():

        if (ret == -EAGAIN) {
                ath10k_warn(ar, "wmi command %d timeout, restarting hardware\n",
                            cmd_id);
                queue_work(ar->workqueue, &ar->restart_work);
        }
                        

-- 
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
  2019-09-11 18:19   ` Kalle Valo
@ 2019-09-11 18:23     ` Johannes Berg
  2019-09-11 18:48       ` Kalle Valo
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Berg @ 2019-09-11 18:23 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Linus Torvalds, David S. Miller, linux-wireless, Netdev,
	Linux List Kernel Mailing, ath10k

On Wed, 2019-09-11 at 21:19 +0300, Kalle Valo wrote:
> > Looks like indeed the driver gives the device at least *3 seconds* for
> > every command, see ath10k_wmi_cmd_send(), so most likely this would
> > eventually have finished, but who knows how many firmware commands it
> > would still have attempted to send...
> 
> 3 seconds is a bit short but in normal cases it should be enough. Of
> course we could increase the delay but I'm skeptic it would help here.

I was thinking 3 seconds is way too long :-)

> > Perhaps the driver should mark the device as dead and fail quickly once
> > it timed out once, or so, but I'll let Kalle comment on that.
> 
> Actually we do try to restart the device when a timeout happens in
> ath10k_wmi_cmd_send():
> 
>         if (ret == -EAGAIN) {
>                 ath10k_warn(ar, "wmi command %d timeout, restarting hardware\n",
>                             cmd_id);
>                 queue_work(ar->workqueue, &ar->restart_work);
>         }

Yeah, and this is the problem, in a sense, I'd think. It seems to me
that at this point the code needs to tag the device as "dead" and
immediately return from any further commands submitted to it with an
error (e.g. -EIO). You can can actually see in the initial report that
while the restart was triggered, it too is waiting to acquire the RTNL:

>    Workqueue: events_freezable ieee80211_restart_work [mac80211]
>    Call Trace:
>     schedule+0x39/0xa0
>     schedule_preempt_disabled+0xa/0x10
>     __mutex_lock.isra.0+0x263/0x4b0
>     ieee80211_restart_work+0x54/0xe0 [mac80211]
>     process_one_work+0x1cf/0x370
>     worker_thread+0x4a/0x3c0
>     kthread+0xfb/0x130
>     ret_from_fork+0x35/0x40


So basically all this delay is mac80211 and the driver doing stuff with
the device, but every single thing has to time out and probably some
stuff loops etc., and then it just takes long enough with the RTNL held
that everything goes south.

johannes


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())
  2019-09-11 18:23     ` Johannes Berg
@ 2019-09-11 18:48       ` Kalle Valo
  0 siblings, 0 replies; 12+ messages in thread
From: Kalle Valo @ 2019-09-11 18:48 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linus Torvalds, David S. Miller, linux-wireless, Netdev,
	Linux List Kernel Mailing, ath10k

Johannes Berg <johannes@sipsolutions.net> writes:

> On Wed, 2019-09-11 at 21:19 +0300, Kalle Valo wrote:
>> > Looks like indeed the driver gives the device at least *3 seconds* for
>> > every command, see ath10k_wmi_cmd_send(), so most likely this would
>> > eventually have finished, but who knows how many firmware commands it
>> > would still have attempted to send...
>> 
>> 3 seconds is a bit short but in normal cases it should be enough. Of
>> course we could increase the delay but I'm skeptic it would help here.
>
> I was thinking 3 seconds is way too long :-)

Heh :)

>> > Perhaps the driver should mark the device as dead and fail quickly once
>> > it timed out once, or so, but I'll let Kalle comment on that.
>> 
>> Actually we do try to restart the device when a timeout happens in
>> ath10k_wmi_cmd_send():
>> 
>>         if (ret == -EAGAIN) {
>>                 ath10k_warn(ar, "wmi command %d timeout, restarting hardware\n",
>>                             cmd_id);
>>                 queue_work(ar->workqueue, &ar->restart_work);
>>         }
>
> Yeah, and this is the problem, in a sense, I'd think. It seems to me
> that at this point the code needs to tag the device as "dead" and
> immediately return from any further commands submitted to it with an
> error (e.g. -EIO).

Yeah, ath10k_core_restart() is supposed change to state
ATH10K_STATE_RESTARTING but very few mac80211 ops in ath10k_ops are
checking for it, and to me it looks like quite late even. I think a
proper fix for ops which can sleep is to check ar->state is
ATH10K_STATE_ON and for ops which cannot sleep check
ATH10K_FLAG_CRASH_FLUSH.

But of course this just fixes the symptoms, the root cause for timeouts
needs to be found as well.

-- 
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, back to index

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-11 10:05 WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2()) Linus Torvalds
2019-09-11 10:26 ` Johannes Berg
2019-09-11 11:58   ` Linus Torvalds
2019-09-11 12:04     ` Johannes Berg
2019-09-11 13:03       ` Ben Greear
2019-09-11 13:21         ` Linus Torvalds
2019-09-11 13:31           ` Ben Greear
2019-09-11 13:32       ` Kalle Valo
2019-09-11 18:19   ` Kalle Valo
2019-09-11 18:23     ` Johannes Berg
2019-09-11 18:48       ` Kalle Valo
2019-09-11 18:10 ` Kalle Valo

Linux-Wireless Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-wireless/0 linux-wireless/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-wireless linux-wireless/ https://lore.kernel.org/linux-wireless \
		linux-wireless@vger.kernel.org linux-wireless@archiver.kernel.org
	public-inbox-index linux-wireless

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-wireless


AGPL code for this site: git clone https://public-inbox.org/ public-inbox