* Hard lockup during vif restart tests.
@ 2014-09-16 18:42 Ben Greear
  2014-09-17  6:34 ` Michal Kazior
  0 siblings, 1 reply; 6+ messages in thread
From: Ben Greear @ 2014-09-16 18:42 UTC (permalink / raw)
  To: ath10k

This is on a 3.14.14+ hacked kernel, with CT firmware.

Test case is to restart stations (and the AP
on the other side) every 10-30 seconds.
After a bit, the station machine locked up hard.

I have no idea how to trouble-shoot this better, so this is
just FYI.

wlan1: deauthenticated from 04:f0:21:03:38:99 (Reason: 7)
ath10k: mac vdev 0 peer delete 04:f0:21:03:38:99 (sta gone)
ath10k: mac vdev 0 stop (disassociated
ath10k: mac vdev 0 down
ath10k: mac vdev 0 cts_prot 0
ath10k: mac vdev 0 slot_time 1
ath10k: mac vdev 0 preamble 1n
ath10k: mac config channel 5180 mhz flags 0x120
ath10k: mac radar config update: chan 5180MHz radar 0 chan radar 0 chan state USABLE
ath10k: mac config channel to 5180MHz (cf1 5180MHz cf2 0MHz width 20 (noht))
ath10k: mac vdev 0 delete (remove interface)
ath10k: boot suspend complete
ath10k: boot hif stop
ath10k: boot warm reset
ath10k: boot host cpu intr cause: 0x00007800
ath10k: boot target cpu intr cause: 0x02001000
ath10k: boot host cpu intr cause: 0x00000000
ath10k: boot target cpu intr cause: 0x02000000
ath10k: boot target reset state: 0x00000804
ath10k: boot warm reset complete
ath10k: boot hif power down
ath10k: boot warm reset
ath10k: boot host cpu intr cause: 0x00000000
ath10k: boot target cpu intr cause: 0x02000000
ath10k: boot host cpu intr cause: 0x00000000
ath10k: boot target cpu intr cause: 0x02000000
ath10k: boot target reset state: 0x00000800
ath10k: boot warm reset complete
ath10k: stop, state OFF
ath10k: boot hif power up
ath10k: boot warm reset
ath10k: boot host cpu intr cause: 0x00000000
ath10k: boot target cpu intr cause: 0x02000000
ath10k: boot host cpu intr cause: 0x00000000
ath10k: boot target cpu intr cause: 0x02000000
ath10k: boot target reset state: 0x00000800
ath10k: boot warm reset complete
ath10k: boot init ce src ring id 0 entries 16 base_addr ffff8800d206a000
ath10k: boot ce dest ring id 1 entries 512 base_addr ffff8800d5420000
ath10k: boot ce dest ring id 2 entries 32 base_addr ffff8800d2069000
ath10k: boot init ce src ring id 3 entries 32 base_addr ffff8800d2068000
ath10k: boot init ce src ring id 4 entries 4096 base_addr ffff8800d5570000
ath10k: boot init ce src ring id 7 entries 2 base_addr ffff8800d2067000
ath10k: boot ce dest ring id 7 entries 2 base_addr ffff8800d2066000
ath10k_pci 0000:04:00.0: irq 52 for MSI/MSI-X
ath10k: boot waiting target to initialise
ath10k: boot target indicator 0
[previous line repeated 226 more times]
ath10k: failed to receive initialized event from target: 00000000
ath10k: failed to wait for target to init: -110
ath10k: boot warm reset
ath10k: boot host cpu intr cause: 0x00000000
ath10k: boot target cpu intr cause: 0x02000000
ath10k: boot host cpu intr cause: 0x00000000
ath10k: boot target cpu intr cause: 0x02000000
ath10k: boot target reset state: 0x00000800
ath10k: boot warm reset complete
ath10k: failed to power up target using warm reset: -110
ath10k: trying cold reset
ath10k: boot cold reset
ath10k: boot cold reset complete
[hang, even sysrq will not work]
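
A minimal sketch for reading a log like the one above, assuming Python is available on the analysis machine. It counts repeated debug lines and decodes trailing negative error codes. The helper name `summarize` and the embedded sample are illustrative only; the one hard-coded value is the Linux errno 110 (ETIMEDOUT), which is what the -110 in the log corresponds to.

```python
import re
from collections import Counter

# Linux errno value (asm-generic), hard-coded so the sketch gives the
# same answer on any host OS. Only the code seen in this log is listed.
LINUX_ERRNO = {110: "ETIMEDOUT"}


def summarize(log_text):
    """Return (line counts, decoded errors) for a kernel log excerpt."""
    lines = [ln.strip() for ln in log_text.splitlines() if ln.strip()]
    counts = Counter(lines)
    errors = []
    for ln in lines:
        # Match a negative decimal code at the very end of the line.
        m = re.search(r": (-\d+)$", ln)
        if m:
            code = -int(m.group(1))
            errors.append((ln, LINUX_ERRNO.get(code, "unknown")))
    return counts, errors


# Shortened, illustrative excerpt (three polls instead of the full run).
SAMPLE = """\
ath10k: boot waiting target to initialise
ath10k: boot target indicator 0
ath10k: boot target indicator 0
ath10k: boot target indicator 0
ath10k: failed to receive initialized event from target: 00000000
ath10k: failed to wait for target to init: -110
"""

counts, errors = summarize(SAMPLE)
print(counts["ath10k: boot target indicator 0"])
for line, name in errors:
    print(name, "<-", line)
```

Read this way, the long run of "boot target indicator 0" lines is the driver polling a target that never reports itself initialised, and the wait ends in a timeout before the warm and cold reset attempts.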


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

* Re: Hard lockup during vif restart tests.
  2014-09-16 18:42 Hard lockup during vif restart tests Ben Greear
@ 2014-09-17  6:34 ` Michal Kazior
  2014-09-17 15:52   ` Ben Greear
  0 siblings, 1 reply; 6+ messages in thread
From: Michal Kazior @ 2014-09-17  6:34 UTC (permalink / raw)
  To: Ben Greear; +Cc: ath10k

On 16 September 2014 20:42, Ben Greear <greearb@candelatech.com> wrote:
> This is on a 3.14.14+ hacked kernel, with CT firmware.
>
> Test case is to restart stations (and the AP
> on the other side) every 10-30 seconds.
> After a bit, the station machine locked up hard.
>
> I have no idea how to trouble-shoot this better, so this is
> just FYI.
>
[...]
> ath10k: boot warm reset complete
> ath10k: failed to power up target using warm reset: -110
> ath10k: trying cold reset
> ath10k: boot cold reset
> ath10k: boot cold reset complete
> [hang, even sysrq will not work]

There's a known problem with cold reset being capable of locking up
entire system (depends on the pci-e controller, e.g. AP135 splats a
Data Bus Error instead).

Actually warm reset can do the same in some corner cases: try running
Rx traffic and just start the recovery sequence (without actually
crashing the fw). My x86 locks up very easily with this.

I strongly suggest you use reset_mode=1 when you load ath10k_pci so
cold reset isn't used. This may result in ath10k being unable to bring
up the device in some rare cases (e.g. after an IOMMU fault if your
system supports it) but I believe it's far better than having the
whole system lock up.
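
A minimal sketch of making that setting persistent and checking it, under stated assumptions: the `options <module> <param>=<value>` line is standard modprobe.d syntax, but the helper names are hypothetical, and the sysfs path only exists if the module is loaded and exports the parameter as readable (assumed here, not confirmed in this thread).

```python
from pathlib import Path


def modprobe_options_line(module, **params):
    """Build an 'options' line suitable for a file under /etc/modprobe.d/."""
    opts = " ".join(f"{key}={value}" for key, value in sorted(params.items()))
    return f"options {module} {opts}"


def read_module_param(module, name, sysfs_root="/sys/module"):
    """Return the current parameter value, or None if it is not readable."""
    path = Path(sysfs_root) / module / "parameters" / name
    try:
        return path.read_text().strip()
    except OSError:
        # Module not loaded, parameter not exported, or no sysfs on this host.
        return None


line = modprobe_options_line("ath10k_pci", reset_mode=1)
print(line)
print(read_module_param("ath10k_pci", "reset_mode"))
```

The printed line can be placed in a modprobe.d configuration file (the file name is up to the administrator) so the parameter applies on every load of ath10k_pci.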

My suspicion is tx/rx rings, dma transfer engines, internal irqs
aren't stopped properly. I have a prototype patch for the warm reset
problem but it's incomplete and I'm not sure if I can share it yet.


Michał


* Re: Hard lockup during vif restart tests.
  2014-09-17  6:34 ` Michal Kazior
@ 2014-09-17 15:52   ` Ben Greear
  2014-09-18  6:23     ` Michal Kazior
  0 siblings, 1 reply; 6+ messages in thread
From: Ben Greear @ 2014-09-17 15:52 UTC (permalink / raw)
  To: Michal Kazior; +Cc: ath10k

On 09/16/2014 11:34 PM, Michal Kazior wrote:
> On 16 September 2014 20:42, Ben Greear <greearb@candelatech.com> wrote:
>> This is on a 3.14.14+ hacked kernel, with CT firmware.
>>
>> Test case is to restart stations (and the AP
>> on the other side) every 10-30 seconds.
>> After a bit, the station machine locked up hard.
>>
>> I have no idea how to trouble-shoot this better, so this is
>> just FYI.
>>
> [...]
>> ath10k: boot warm reset complete
>> ath10k: failed to power up target using warm reset: -110
>> ath10k: trying cold reset
>> ath10k: boot cold reset
>> ath10k: boot cold reset complete
>> [hang, even sysrq will not work]
> 
> There's a known problem with cold reset being capable of locking up
> entire system (depends on the pci-e controller, e.g. AP135 splats a
> Data Bus Error instead).
> 
> Actually warm reset can do the same in some corner cases: try running
> Rx traffic and just start the recovery sequence (without actually
> crashing the fw). My x86 locks up very easily with this.
> 
> I strongly suggest you use reset_mode=1 when you load ath10k_pci so
> cold reset isn't used. This may result in ath10k being unable to bring
> up the device in some rare cases (e.g. after an IOMMU fault if your
> system supports it) but I believe it's far better than having the
> whole system lock up.
> 
> My suspicion is tx/rx rings, dma transfer engines, internal irqs
> aren't stopped properly. I have a prototype patch for the warm reset
> problem but it's incomplete and I'm not sure if I can share it yet.

I will try the warm-reset-only flag, and I do hope you have success
with the warm/cold reset fixes.

But, I still wonder if we could just reset less often and maybe
make it a bit harder to hit these problems?

Why do we reset the firmware/NIC when we admin down/up the
vif (when a single vif is active)?  Couldn't we just keep
the firmware active in this state and not risk lockup due
to reset?

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: Hard lockup during vif restart tests.
  2014-09-17 15:52   ` Ben Greear
@ 2014-09-18  6:23     ` Michal Kazior
  2014-09-18  7:31       ` Kalle Valo
  0 siblings, 1 reply; 6+ messages in thread
From: Michal Kazior @ 2014-09-18  6:23 UTC (permalink / raw)
  To: Ben Greear; +Cc: ath10k

On 17 September 2014 17:52, Ben Greear <greearb@candelatech.com> wrote:
> On 09/16/2014 11:34 PM, Michal Kazior wrote:
>> On 16 September 2014 20:42, Ben Greear <greearb@candelatech.com> wrote:
>>> This is on a 3.14.14+ hacked kernel, with CT firmware.
>>>
>>> Test case is to restart stations (and the AP
>>> on the other side) every 10-30 seconds.
>>> After a bit, the station machine locked up hard.
>>>
>>> I have no idea how to trouble-shoot this better, so this is
>>> just FYI.
>>>
>> [...]
>>> ath10k: boot warm reset complete
>>> ath10k: failed to power up target using warm reset: -110
>>> ath10k: trying cold reset
>>> ath10k: boot cold reset
>>> ath10k: boot cold reset complete
>>> [hang, even sysrq will not work]
>>
>> There's a known problem with cold reset being capable of locking up
>> entire system (depends on the pci-e controller, e.g. AP135 splats a
>> Data Bus Error instead).
>>
>> Actually warm reset can do the same in some corner cases: try running
>> Rx traffic and just start the recovery sequence (without actually
>> crashing the fw). My x86 locks up very easily with this.
>>
>> I strongly suggest you use reset_mode=1 when you load ath10k_pci so
>> cold reset isn't used. This may result in ath10k being unable to bring
>> up the device in some rare cases (e.g. after an IOMMU fault if your
>> system supports it) but I believe it's far better than having the
>> whole system lock up.
>>
>> My suspicion is tx/rx rings, dma transfer engines, internal irqs
>> aren't stopped properly. I have a prototype patch for the warm reset
>> problem but it's incomplete and I'm not sure if I can share it yet.
>
> I will try the warm-reset-only flag, and I do hope you have success
> with the warm/cold reset fixes.

It sort of works as it is now but it's ugly.


> But, I still wonder if we could just reset less often and maybe
> make it a bit harder to hit these problems?
>
> Why do we reset the firmware/NIC when we admin down/up the
> vif (when a single vif is active)?  Couldn't we just keep
> the firmware active in this state and not risk lockup due
> to reset?

If you put down last interface mac80211 calls drv_stop(). There isn't
any real need to keep the device up and running after that other than
trying to work around the reset issue. But then you need to deal with
firmware quirks. I recall it could report Rx indications after all
vdevs had been removed (and this is now also observable with 10.2
during probing/bootup). It's just simpler to reboot firmware on
drv_stop/start().
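
The stop-on-last-interface behaviour described above can be sketched as a toy model. This is not mac80211 code: the class `ToyDevice` and its method names are hypothetical, and the only behaviour modelled is that the device is started when the first interface comes up and stopped (so the firmware is rebooted on the next start) when the last one goes down.

```python
class ToyDevice:
    """Toy model of a device shared by several virtual interfaces."""

    def __init__(self):
        self.open_count = 0   # number of interfaces currently up
        self.events = []      # ordered record of driver-level calls

    def interface_up(self, name):
        if self.open_count == 0:
            # First interface: the device (and its firmware) is started.
            self.events.append("start")
        self.open_count += 1
        self.events.append(f"add {name}")

    def interface_down(self, name):
        if self.open_count == 0:
            raise RuntimeError("no interface is up")
        self.events.append(f"remove {name}")
        self.open_count -= 1
        if self.open_count == 0:
            # Last interface gone: the device is stopped.
            self.events.append("stop")


dev = ToyDevice()
dev.interface_up("wlan1")
dev.interface_down("wlan1")   # single vif: every down/up cycle is a full restart
dev.interface_up("wlan1")
print(dev.events)
```

With a single vif, each admin down/up pair in the test produces a stop followed by a start, which is why the restart test exercises the reset path so often.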


Michał


* Re: Hard lockup during vif restart tests.
  2014-09-18  6:23     ` Michal Kazior
@ 2014-09-18  7:31       ` Kalle Valo
  2014-09-18 16:06         ` Ben Greear
  0 siblings, 1 reply; 6+ messages in thread
From: Kalle Valo @ 2014-09-18  7:31 UTC (permalink / raw)
  To: Michal Kazior; +Cc: Ben Greear, ath10k

Michal Kazior <michal.kazior@tieto.com> writes:

>> Why do we reset the firmware/NIC when we admin down/up the
>> vif (when a single vif is active)?  Couldn't we just keep
>> the firmware active in this state and not risk lockup due
>> to reset?
>
> If you put down last interface mac80211 calls drv_stop(). There isn't
> any real need to keep the device up and running after that other than
> trying to work around the reset issue. But then you need to deal with
> firmware quirks. I recall it could report Rx indications after all
> vdevs had been removed (and this is now also observable with 10.2
> during probing/bootup). It's just simpler to reboot firmware on
> drv_stop/start().

And there's the reliability issue. Being able to reset the firmware with
interface down/up sequence is pretty useful and AFAIK almost all
upstream drivers do that. And let's not forget power consumption either.

-- 
Kalle Valo


* Re: Hard lockup during vif restart tests.
  2014-09-18  7:31       ` Kalle Valo
@ 2014-09-18 16:06         ` Ben Greear
  0 siblings, 0 replies; 6+ messages in thread
From: Ben Greear @ 2014-09-18 16:06 UTC (permalink / raw)
  To: Kalle Valo; +Cc: Michal Kazior, ath10k

On 09/18/2014 12:31 AM, Kalle Valo wrote:
> Michal Kazior <michal.kazior@tieto.com> writes:
> 
>>> Why do we reset the firmware/NIC when we admin down/up the
>>> vif (when a single vif is active)?  Couldn't we just keep
>>> the firmware active in this state and not risk lockup due
>>> to reset?
>>
>> If you put down last interface mac80211 calls drv_stop(). There isn't
>> any real need to keep the device up and running after that other than
>> trying to work around the reset issue. But then you need to deal with
>> firmware quirks. I recall it could report Rx indications after all
>> vdevs had been removed (and this is now also observable with 10.2
>> during probing/bootup). It's just simpler to reboot firmware on
>> drv_stop/start().
> 
> And there's the reliability issue. Being able to reset the firmware with
> interface down/up sequence is pretty useful and AFAIK almost all
> upstream drivers do that. And let's not forget power consumption either.

The fact that restarting firmware can hang the machine or require
reboot to recover is so serious that I think a work-around is
worth looking into.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

