* ath10k driver crashes whenever firmware crashes on ARM SoC
@ 2014-01-28 17:18 Avery Pennarun
  2014-01-28 18:20 ` Ben Greear
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Avery Pennarun @ 2014-01-28 17:18 UTC (permalink / raw)
  To: ath10k

Hi all,

When the ath10k firmware crashes on my device (let's not worry about
why the firmware crashes right now; one problem at a time), my host
CPU (ARMv7 based) can't recover.  I get some variant of this error:

[  780.116977] Unhandled fault: imprecise external abort (0x1406) at 0x2ac3706c
[  780.124336] Internal error: : 1406 [#1] SMP

I've narrowed this down to this code in ath10k/pci.c, ath10k_pci_device_reset:

        /* Put Target, including PCIe, into RESET. */
        val = ath10k_pci_reg_read32(ar, SOC_GLOBAL_RESET_ADDRESS);
        val |= 1;
        ath10k_pci_reg_write32(ar, SOC_GLOBAL_RESET_ADDRESS, val);
        for (i = 0; i < ATH_PCI_RESET_WAIT_MAX; i++) {
                if (ath10k_pci_reg_read32(ar, RTC_STATE_ADDRESS) &
                                          RTC_STATE_COLD_RESET_MASK)
                        break;
                msleep(1);
        }

Specifically, the pci_reg_read32().  I can insert as much time as I
want between the write32 and the read32; it always performs the read,
then crashes with the PC pointing a few instructions later, inside the
msleep(), with the imprecise external abort.  I think this means the
PCI read operation has encountered a PCI target abort, which suggests
that the SOC_GLOBAL_RESET_ADDRESS line has not successfully reset the
device.  From what I understand, on x86 processors PCI target aborts
are not fatal, so you might not notice this problem on those
platforms, but it's bad on ARM.
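
To illustrate the x86 point: on hosts where a failed PCI read is not fatal, the read commonly completes with all bits set (0xffffffff). A loop like the one above would then see the mask bit as set and break out as if the reset had succeeded. The following is a minimal userspace sketch of a poll loop that checks for that case. The mask value, the retry count, the fake_dev helper, and all function names are assumptions for illustration; they are not the ath10k register layout or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values for illustration; not the real ath10k definitions. */
#define SKETCH_COLD_RESET_MASK 0x00000004u
#define SKETCH_RESET_WAIT_MAX  10

/* Stand-in for a register read; a real driver would read MMIO here. */
typedef uint32_t (*read_reg_fn)(void *ctx);

enum poll_result { POLL_OK, POLL_TIMEOUT, POLL_DEVICE_GONE };

/* Fake device that replays a fixed sequence of register values,
 * repeating the last one once the sequence is exhausted. */
struct fake_dev {
    const uint32_t *values;
    size_t n;
    size_t pos;
};

uint32_t fake_read(void *ctx)
{
    struct fake_dev *d = ctx;
    uint32_t v = d->values[d->pos];
    if (d->pos + 1 < d->n)
        d->pos++;
    return v;
}

/* Poll until the reset bit is seen, the device stops answering,
 * or the retry budget runs out. */
enum poll_result poll_cold_reset(read_reg_fn read_reg, void *ctx)
{
    assert(read_reg != NULL);
    for (int i = 0; i < SKETCH_RESET_WAIT_MAX; i++) {
        uint32_t val = read_reg(ctx);
        /* All-ones must be checked first: it would also satisfy the mask test. */
        if (val == 0xffffffffu)
            return POLL_DEVICE_GONE;
        if (val & SKETCH_COLD_RESET_MASK)
            return POLL_OK;
        /* A driver would sleep here, as the original loop does with msleep(1). */
    }
    return POLL_TIMEOUT;
}
```

This only helps on platforms where the failed read returns at all; it does nothing for the imprecise external abort described above.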

I'm using the ath10k driver from linux-next 20140117, but I had the
same problem with 3.13-rc2 so I don't think this has changed.

Are other people seeing this?  Is there something I can try to resolve it?

Thanks,

Avery

_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-01-28 17:18 ath10k driver crashes whenever firmware crashes on ARM SoC Avery Pennarun
@ 2014-01-28 18:20 ` Ben Greear
  2014-01-28 18:34   ` Avery Pennarun
       [not found] ` <CAJ-VmokorbJ2iU4rGNYdRj+A22NR9cV-5h-tDN0pD2FCurZDpA@mail.gmail.com>
  2014-01-29 16:41 ` Kalle Valo
  2 siblings, 1 reply; 32+ messages in thread
From: Ben Greear @ 2014-01-28 18:20 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: ath10k

On 01/28/2014 09:18 AM, Avery Pennarun wrote:
> Hi all,
> 
> When the ath10k firmware crashes on my device (let's not worry about
> why the firmware crashes right now; one problem at a time), my host
> CPU (ARMv7 based) can't recover.  I get some variant of this error:

I don't know about your pci bus problem, but I'm interested in knowing
about firmware crashes (if you are at liberty to share the details).

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-01-28 18:20 ` Ben Greear
@ 2014-01-28 18:34   ` Avery Pennarun
  2014-01-28 19:01     ` Ben Greear
                       ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Avery Pennarun @ 2014-01-28 18:34 UTC (permalink / raw)
  To: Ben Greear; +Cc: ath10k

On Tue, Jan 28, 2014 at 1:20 PM, Ben Greear <greearb@candelatech.com> wrote:
> On 01/28/2014 09:18 AM, Avery Pennarun wrote:
>> When the ath10k firmware crashes on my device (let's not worry about
>> why the firmware crashes right now; one problem at a time), my host
>> CPU (ARMv7 based) can't recover.  I get some variant of this error:
>
> I don't know about your pci bus problem, but I'm interested in knowing
> about firmware crashes (if you are at liberty to share the details).

Well, since you asked... :)

I'm trying to build an especially robust system here, so when I
noticed that the driver will bring the entire system crashing down
upon a firmware crash, I've actually gone out of my way to make more
firmware crashes.  So I'm using the ath10k (not ap) firmware from a
month or so ago, in AP mode.  It's pretty easy to crash the firmware
with a sequence something like this:

- start hostapd (I'm using channel 36, HT20, no encryption)
# note that hostapd already adds a mon.wlan0 monitor interface
- iw wlan0 interface add mon0 type monitor
- ip link set mon0 up
- tcpdump -ni mon0 | head

This doesn't *always* work, but it kills the firmware maybe half the
time for me.  It may or may not be worse if there are clients
connected and pushing traffic.  I've noticed that once the firmware
has crashed once and recovered, it's hard to crash it again using the
same trick without unloading and reloading the driver.  Note that in
this case, the firmware crash doesn't always kill my host SoC with a
bus error (although sometimes it does).  Even if it doesn't die
completely, the driver generally comes out confused about the
monitoring interface(s): it prints "ath10k: Only one monitor interface
allowed", which is actually totally untrue, since before the crash I
was able to create and use two at a time.  (I think this error is a
side effect of getting out of sync with the firmware when it restarts,
and thus getting confused about "pmon" vs "vmon" monitor interfaces.)

Also, if I leave the ath10k driver running and pushing traffic for,
say, 10 minutes, the probability that the firmware will crash *and*
take my SoC with it, if I try to kill hostapd or unload the driver,
approaches 100%.

These are all problems worth worrying about, of course, but
fundamentally I really want to get the resets working.  The driver
resets in about one second when it *doesn't* crash, which is pretty
gross, but at least it means we can recover when the firmware is
crappy.  The especially crappy firmware right now makes it easier to
test the recovery process in the driver, which I want to fix first if
possible.  Once I feel good that it can recover from crashes, I will
be happier to complain about the actual crashes themselves :)

Have fun,

Avery


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-01-28 18:34   ` Avery Pennarun
@ 2014-01-28 19:01     ` Ben Greear
  2014-01-28 19:11       ` Avery Pennarun
  2014-01-28 20:10     ` Janusz Dziedzic
  2014-01-29 16:44     ` Kalle Valo
  2 siblings, 1 reply; 32+ messages in thread
From: Ben Greear @ 2014-01-28 19:01 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: ath10k

On 01/28/2014 10:34 AM, Avery Pennarun wrote:
> On Tue, Jan 28, 2014 at 1:20 PM, Ben Greear <greearb@candelatech.com> wrote:
>> On 01/28/2014 09:18 AM, Avery Pennarun wrote:
>>> When the ath10k firmware crashes on my device (let's not worry about
>>> why the firmware crashes right now; one problem at a time), my host
>>> CPU (ARMv7 based) can't recover.  I get some variant of this error:
>>
>> I don't know about your pci bus problem, but I'm interested in knowing
>> about firmware crashes (if you are at liberty to share the details).
> 
> Well, since you asked... :)
> 
> I'm trying to build an especially robust system here, so when I
> noticed that the driver will bring the entire system crashing down
> upon a firmware crash, I've actually gone out of my way to make more
> firmware crashes.  So I'm using the ath10k (not ap) firmware from a
> month or so ago, in AP mode.  It's pretty easy to crash the firmware
> with a sequence something like this:
> 
> - start hostapd (I'm using channel 36, HT20, no encryption)
> # note that hostapd already adds a mon.wlan0 monitor interface
> - iw wlan0 interface add mon0 type monitor
> - ip link set mon0 up
> - tcpdump -ni mon0 | head
> 
> This doesn't *always* work, but it kills the firmware maybe half the
> time for me.  It may or may not be worse if there are clients
> connected and pushing traffic.  I've noticed that once the firmware
> has crashed once and recovered, it's hard to crash it again using the
> same trick without unloading and reloading the driver.  Note that in
> this case, the firmware crash doesn't always kill my host SoC with a
> bus error (although sometimes it does).  Even if it doesn't die
> completely, the driver generally comes out confused about the
> monitoring interface(s): it prints "ath10k: Only one monitor interface
> allowed", which is actually totally untrue, since before the crash I
> was able to create and use two at a time.  (I think this error is a
> side effect of getting out of sync with the firmware when it restarts,
> and thus getting confused about "pmon" vs "vmon" monitor interfaces.)
> 
> Also, if I leave the ath10k driver running and pushing traffic for,
> say, 10 minutes, the probability that the firmware will crash *and*
> take my SoC with it, if I try to kill hostapd or unload the driver,
> approaches 100%.

I see similar issues (with the reset killing the PC) on x86-64
(core-i7 CPU).  Kalle mentioned a few days ago that at least some of the
NICs had issues with cold reset and that they hoped to
have a fix that uses warm reset in a week or two.

Interestingly, I also see hard PC lockup on longer runs, but
perhaps that is related to the cold-reset issue somehow.

I'm using the 10.x AP firmware, and my method of crashing firmware
is different at the moment :)

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-01-28 19:01     ` Ben Greear
@ 2014-01-28 19:11       ` Avery Pennarun
  0 siblings, 0 replies; 32+ messages in thread
From: Avery Pennarun @ 2014-01-28 19:11 UTC (permalink / raw)
  To: Ben Greear; +Cc: ath10k

On Tue, Jan 28, 2014 at 2:01 PM, Ben Greear <greearb@candelatech.com> wrote:
> I see similar issues (with the reset killing the PC) on x86-64
> (core-i7 CPU).  Kalle mentioned a few days ago that at least some of the
> NICs had issues with cold reset and that they hoped to
> have a fix that uses warm reset in a week or two.

I saw some messages about this on the list, including a patch that
looked promising:
http://lists.infradead.org/pipermail/ath10k/2013-December/000888.html
but it didn't help.  I'm pretty sure stuck firmware is not something
you can warm reboot to fix.

On the other hand, when my box reboots itself, it always comes back
okay and the driver loads fine.  So clearly there is *some* reset line
in the system that's working...

> Interestingly, I also see hard PC lockup on longer runs, but
> perhaps that is related to the cold-reset issue somehow.

Longer runs?  Lucky you! :)

> I'm using the 10.x AP firmware, and my method of crashing firmware
> is different at the moment :)

Yeah, I tried the AP firmware and it lasts longer.  I'm pretty sad
that there are two different firmwares with two different sets of bugs
to choose between.


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-01-28 18:34   ` Avery Pennarun
  2014-01-28 19:01     ` Ben Greear
@ 2014-01-28 20:10     ` Janusz Dziedzic
  2014-01-28 20:51       ` Avery Pennarun
  2014-01-29 16:44     ` Kalle Valo
  2 siblings, 1 reply; 32+ messages in thread
From: Janusz Dziedzic @ 2014-01-28 20:10 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Ben Greear, ath10k

2014-01-28 Avery Pennarun <apenwarr@gmail.com>:
> On Tue, Jan 28, 2014 at 1:20 PM, Ben Greear <greearb@candelatech.com> wrote:
>> On 01/28/2014 09:18 AM, Avery Pennarun wrote:
>>> When the ath10k firmware crashes on my device (let's not worry about
>>> why the firmware crashes right now; one problem at a time), my host
>>> CPU (ARMv7 based) can't recover.  I get some variant of this error:
>>
>> I don't know about your pci bus problem, but I'm interested in knowing
>> about firmware crashes (if you are at liberty to share the details).
>
> Well, since you asked... :)
>
> I'm trying to build an especially robust system here, so when I
> noticed that the driver will bring the entire system crashing down
> upon a firmware crash, I've actually gone out of my way to make more
> firmware crashes.  So I'm using the ath10k (not ap) firmware from a
> month or so ago, in AP mode.  It's pretty easy to crash the firmware
> with a sequence something like this:
>
> - start hostapd (I'm using channel 36, HT20, no encryption)
> # note that hostapd already adds a mon.wlan0 monitor interface
> - iw wlan0 interface add mon0 type monitor
> - ip link set mon0 up
> - tcpdump -ni mon0 | head
>
FW636 has problems with the monitor iface:
iw wlan0 set type monitor
ifconfig wlan0 up
tcpdump -i wlan0
This will always crash the firmware after "entered promiscuous mode".

Generally, whenever only one monitor interface is left with an active
tcpdump (in your case, after killing hostapd), you will always get
this crash. So using a monitor interface with FW636 is not a good idea.

BTW monitor works fine with 10.x FW.

BR
Janusz


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-01-28 20:10     ` Janusz Dziedzic
@ 2014-01-28 20:51       ` Avery Pennarun
  0 siblings, 0 replies; 32+ messages in thread
From: Avery Pennarun @ 2014-01-28 20:51 UTC (permalink / raw)
  To: Janusz Dziedzic; +Cc: Ben Greear, ath10k

On Tue, Jan 28, 2014 at 3:10 PM, Janusz Dziedzic
<janusz.dziedzic@gmail.com> wrote:
> FW636 has problems with the monitor iface:
> iw wlan0 set type monitor
> ifconfig wlan0 up
> tcpdump -i wlan0
> This will always crash the firmware after "entered promiscuous mode".
>
> Generally, whenever only one monitor interface is left with an active
> tcpdump (in your case, after killing hostapd), you will always get
> this crash. So using a monitor interface with FW636 is not a good idea.
>
> BTW monitor works fine with 10.x FW.

Well, in this case it gives a mostly-reproducible test case for the
driver's failure to recover from firmware crashes.  So it's useful in
that respect.  I was using the 10.x firmware until I wanted an easier
way to reproduce the problem :)

Avery


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
       [not found] ` <CAJ-VmokorbJ2iU4rGNYdRj+A22NR9cV-5h-tDN0pD2FCurZDpA@mail.gmail.com>
@ 2014-01-28 20:55   ` Avery Pennarun
  0 siblings, 0 replies; 32+ messages in thread
From: Avery Pennarun @ 2014-01-28 20:55 UTC (permalink / raw)
  To: Adrian Chadd; +Cc: ath10k

On Tue, Jan 28, 2014 at 3:28 PM, Adrian Chadd <adrian@freebsd.org> wrote:
> Can you put a sleep in before the initial write?

I've tried adding printks and various things before the write, and it
didn't seem to help.

I also noticed that iowrite32() contains a write barrier, and
ioread32() contains a read barrier, but they're structured like this:

iowrite32() { wmb(); write32(); }
ioread32() { read32(); rmb(); }

Which means that actually the write32() and read32() do not have
barriers between them.  I think that might be a problem, but adding
the barriers didn't help either.

> There are hardware bugs that are.. delicate. What you should do is find what
> kind of reporting you can pull out of the pcie endpoint the nic is connected
> to in order to see why it threw a fatal error. Then at least we can poke it
> further.

Ok.  I'm going to see if I can hunt down a PCIe bus analyzer and find
out what the deal is.

> It's also worth asking the qca hardware team about this. They'll likely want
> to know what the pcie errors are so please figure that out.

Thanks for the advice!

Are other people not seeing this?  Or are they just trying not to
crash the firmware so they don't have to care? :)

Thanks,

Avery


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-01-28 17:18 ath10k driver crashes whenever firmware crashes on ARM SoC Avery Pennarun
  2014-01-28 18:20 ` Ben Greear
       [not found] ` <CAJ-VmokorbJ2iU4rGNYdRj+A22NR9cV-5h-tDN0pD2FCurZDpA@mail.gmail.com>
@ 2014-01-29 16:41 ` Kalle Valo
  2014-01-29 18:44   ` Adrian Chadd
  2 siblings, 1 reply; 32+ messages in thread
From: Kalle Valo @ 2014-01-29 16:41 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: ath10k

Hi,

Avery Pennarun <apenwarr@gmail.com> writes:

> When the ath10k firmware crashes on my device (let's not worry about
> why the firmware crashes right now; one problem at a time), my host
> CPU (ARMv7 based) can't recover.  I get some variant of this error:
>
> [  780.116977] Unhandled fault: imprecise external abort (0x1406) at 0x2ac3706c
> [  780.124336] Internal error: : 1406 [#1] SMP
>
> I've narrowed this down to this code in ath10k/pci.c, ath10k_pci_device_reset:
>
>         /* Put Target, including PCIe, into RESET. */
>         val = ath10k_pci_reg_read32(ar, SOC_GLOBAL_RESET_ADDRESS);
>         val |= 1;
>         ath10k_pci_reg_write32(ar, SOC_GLOBAL_RESET_ADDRESS, val);
>         for (i = 0; i < ATH_PCI_RESET_WAIT_MAX; i++) {
>                 if (ath10k_pci_reg_read32(ar, RTC_STATE_ADDRESS) &
>                                           RTC_STATE_COLD_RESET_MASK)
>                         break;
>                 msleep(1);
>         }

Are you using a CUS223 board? I was told that it has a problem with the
cold reset. When you issue the cold reset, some voltage on the board
goes too low and there's a chance that it breaks PCI communication.

> Specifically, the pci_reg_read32().  I can insert as much time as I
> want between the write32 and the read32; it always performs the read,
> then crashes with the PC pointing a few instructions later, inside the
> msleep(), with the imprecise external abort.  I think this means the
> PCI read operation has encountered a PCI target abort, which suggests
> that the SOC_GLOBAL_RESET_ADDRESS line has not successfully reset the
> device.  From what I understand, on x86 processors PCI target aborts
> are not fatal, so you might not notice this problem on those
> platforms, but it's bad on ARM.

FWIW the same problem also happens on MIPS.

> I'm using the ath10k driver from linux-next 20140117, but I had the
> same problem with 3.13-rc2 so I don't think this has changed.
>
> Are other people seeing this?  Is there something I can try to resolve it?

Yes, we see it as well. We also see it when bringing the interface down,
for example when shutting down hostapd. Soon we will post patches to
work around the interface down issue, but for firmware crashes we haven't
yet found a reliable solution. I hope there's a way to fix warm reset to
properly recover from a firmware crash.

-- 
Kalle Valo


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-01-28 18:34   ` Avery Pennarun
  2014-01-28 19:01     ` Ben Greear
  2014-01-28 20:10     ` Janusz Dziedzic
@ 2014-01-29 16:44     ` Kalle Valo
  2 siblings, 0 replies; 32+ messages in thread
From: Kalle Valo @ 2014-01-29 16:44 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Ben Greear, ath10k

Avery Pennarun <apenwarr@gmail.com> writes:

> On Tue, Jan 28, 2014 at 1:20 PM, Ben Greear <greearb@candelatech.com> wrote:
>> On 01/28/2014 09:18 AM, Avery Pennarun wrote:
>>> When the ath10k firmware crashes on my device (let's not worry about
>>> why the firmware crashes right now; one problem at a time), my host
>>> CPU (ARMv7 based) can't recover.  I get some variant of this error:
>>
>> I don't know about your pci bus problem, but I'm interested in knowing
>> about firmware crashes (if you are at liberty to share the details).
>
> Well, since you asked... :)
>
> I'm trying to build an especially robust system here, so when I
> noticed that the driver will bring the entire system crashing down
> upon a firmware crash, I've actually gone out of my way to make more
> firmware crashes.

Do you know that we have a simulate_fw_crash debugfs file just for testing
this? Of course it's good to test various firmware crash scenarios,
but in some cases this debugfs file helps a lot.
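
Driving such a debugfs trigger from a test program amounts to writing a string to a file. Below is a minimal sketch in which the caller supplies both the path and the payload. The thread does not give the full debugfs path or say what the file expects, so both are assumptions to check against the driver version in use. The function name write_trigger is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write `payload` to the file at `path`.
 * Returns 0 on success and -1 on any failure. */
int write_trigger(const char *path, const char *payload)
{
    assert(path != NULL && payload != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    size_t len = strlen(payload);
    int ok = (fwrite(payload, 1, len, f) == len);

    /* stdio buffers the data, so the kernel may only see the write
     * when the stream is closed; a failure can surface here. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

A test harness would call this with the debugfs file of the PHY under test and then watch the kernel log to see whether the driver recovers.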

-- 
Kalle Valo


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-01-29 16:41 ` Kalle Valo
@ 2014-01-29 18:44   ` Adrian Chadd
  2014-01-30  2:41     ` Avery Pennarun
  0 siblings, 1 reply; 32+ messages in thread
From: Adrian Chadd @ 2014-01-29 18:44 UTC (permalink / raw)
  To: Kalle Valo; +Cc: ath10k, Avery Pennarun

Hi,

Well, the problem is more likely that the PCIe bus doesn't come back
correctly, and the next IO write hits a PCI bus error.

What about seeing if you can detect the PCIe error before it's a fatal
one (hence my email earlier about trying to decode this stuff) and
then reset the PCIe port from the PCI side?


-a


On 29 January 2014 08:41, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
> Hi,
>
> Avery Pennarun <apenwarr@gmail.com> writes:
>
>> When the ath10k firmware crashes on my device (let's not worry about
>> why the firmware crashes right now; one problem at a time), my host
>> CPU (ARMv7 based) can't recover.  I get some variant of this error:
>>
>> [  780.116977] Unhandled fault: imprecise external abort (0x1406) at 0x2ac3706c
>> [  780.124336] Internal error: : 1406 [#1] SMP
>>
>> I've narrowed this down to this code in ath10k/pci.c, ath10k_pci_device_reset:
>>
>>         /* Put Target, including PCIe, into RESET. */
>>         val = ath10k_pci_reg_read32(ar, SOC_GLOBAL_RESET_ADDRESS);
>>         val |= 1;
>>         ath10k_pci_reg_write32(ar, SOC_GLOBAL_RESET_ADDRESS, val);
>>         for (i = 0; i < ATH_PCI_RESET_WAIT_MAX; i++) {
>>                 if (ath10k_pci_reg_read32(ar, RTC_STATE_ADDRESS) &
>>                                           RTC_STATE_COLD_RESET_MASK)
>>                         break;
>>                 msleep(1);
>>         }
>
> Are you using a CUS223 board? I was told that it has a problem with the
> cold reset. When you issue the cold reset, some voltage on the board
> goes too low and there's a chance that it breaks PCI communication.
>
>> Specifically, the pci_reg_read32().  I can insert as much time as I
>> want between the write32 and the read32; it always performs the read,
>> then crashes with the PC pointing a few instructions later, inside the
>> msleep(), with the imprecise external abort.  I think this means the
>> PCI read operation has encountered a PCI target abort, which suggests
>> that the SOC_GLOBAL_RESET_ADDRESS line has not successfully reset the
>> device.  From what I understand, on x86 processors PCI target aborts
>> are not fatal, so you might not notice this problem on those
>> platforms, but it's bad on ARM.
>
> FWIW the same problem also happens on MIPS.
>
>> I'm using the ath10k driver from linux-next 20140117, but I had the
>> same problem with 3.13-rc2 so I don't think this has changed.
>>
>> Are other people seeing this?  Is there something I can try to resolve it?
>
> Yes, we see it as well. We also see it when bringing the interface down,
> for example when shutting down hostapd. Soon we will post patches to
> work around the interface down issue, but for firmware crashes we haven't
> yet found a reliable solution. I hope there's a way to fix warm reset to
> properly recover from a firmware crash.
>
> --
> Kalle Valo


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-01-29 18:44   ` Adrian Chadd
@ 2014-01-30  2:41     ` Avery Pennarun
  2014-02-09  8:00       ` Avery Pennarun
  0 siblings, 1 reply; 32+ messages in thread
From: Avery Pennarun @ 2014-01-30  2:41 UTC (permalink / raw)
  To: Adrian Chadd; +Cc: Kalle Valo, ath10k

On Wed, Jan 29, 2014 at 1:44 PM, Adrian Chadd <adrian@freebsd.org> wrote:
> Well, the problem is more likely that the PCIe bus doesn't come back
> correctly, and the next IO write hits a PCI bus error.
>
> What about seeing if you can detect the PCIe error before it's a fatal
> one (hence my email earlier about trying to decode this stuff) and
> then reset the PCIe port from the PCI side?

Still chasing around some people to get a PCIe bus analyzer set up.

I did try a bit with resetting the PCI bus, but there are a lot of ways
that can get messy (turns out PCI bus controllers have a lot of
registers that get reset when you reset them, who knew?), and I didn't
manage to work around all of them.

My guess however is that the PCIe bus itself is probably not broken...
but the ath10k device may not be responding to it anymore when this
problem happens.

One thing I noticed is that if I reboot my device (which completely
resets the PCI bus), it comes back and works.  So *somehow* things are
getting cleaned up without actually cutting power.


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-01-30  2:41     ` Avery Pennarun
@ 2014-02-09  8:00       ` Avery Pennarun
  2014-02-27 15:48         ` Missing memory barriers Kalle Valo
  2014-03-11  7:33         ` ath10k driver crashes whenever firmware crashes on ARM SoC Kalle Valo
  0 siblings, 2 replies; 32+ messages in thread
From: Avery Pennarun @ 2014-02-09  8:00 UTC (permalink / raw)
  To: Adrian Chadd; +Cc: Kalle Valo, ath10k

On Wed, Jan 29, 2014 at 9:41 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
> Still chasing around some people to get a PCIe bus analyzer set up.

Okay, I finally managed to get enough parts put together to look at
the PCIe bus.  To make things a little more clear, I added a macro
that does essentially:

   pci_write_config_dword(0, 0x80000000 | __LINE__);
   mdelay(1);
   pci_write_config_dword(0, __LINE__);

...at various points in the code.  This way I can see precisely what
was the most recent PCIe transaction before the crash.
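
The marker technique can be sketched in userspace as follows. Here an in-memory log stands in for the PCIe bus, and emit_marker() stands in for the config-space write. TRACE_MARK, emit_marker, the log array, and the two decoding helpers are all hypothetical names for illustration; in the kernel, pci_write_config_dword() also takes the device as its first argument, which the pseudocode above omits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MARK_LOG_MAX 64

/* In-memory log standing in for what a bus analyzer would capture. */
uint32_t mark_log[MARK_LOG_MAX];
size_t mark_count;

/* Hypothetical stand-in for the config-space write used as a marker. */
void emit_marker(uint32_t val)
{
    assert(mark_count < MARK_LOG_MAX);
    mark_log[mark_count++] = val;
}

/* Emit the current source line twice: first with bit 31 set, then
 * without. The kernel version had an mdelay(1) between the two writes,
 * which is left out of this sketch. */
#define TRACE_MARK()                                  \
    do {                                              \
        uint32_t line_ = (uint32_t)__LINE__;          \
        emit_marker(0x80000000u | line_);             \
        emit_marker(line_);                           \
    } while (0)

/* Recover the source line number from a captured marker. */
uint32_t marker_line(uint32_t val)
{
    return val & 0x7fffffffu;
}

/* Nonzero if this is the first marker of a pair (bit 31 set). */
int marker_is_start(uint32_t val)
{
    return (val & 0x80000000u) != 0;
}
```

If a trace shows a start marker with no matching second marker, the failure happened during the delay between the two writes.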

I'm not super familiar with PCIe, but what I think I'm seeing is:

- the firmware does not need to be loaded yet; sometimes I can crash
it just by doing a cold reset right at driver load time.  So the good
news is, the firmware code is not related.

- the crash is always in ath10k_pci_device_reset

- there are definitely some missing memory barriers in here; in a few
cases you can clearly see a write getting done before the read that
came before it.  Looking at the definitions for iowrite32 and
ioread32, and for rmb() and wmb(), we can see that the use of rmb()
and wmb() does not work properly (at least on ARM) when you care about
the ordering between reads and writes.  However, I don't think this
actually causes the problem.

- the crash happens after writing the 1 to SOC_GLOBAL_RESET_ADDRESS.
The write gets an ACK from the device, so there are no interrupted
PCIe transactions.

- after writing that 1, the PCI bus is fine for ~272 usec.  I can see
the first pci_write_config_dword in my macro above, but it crashes
during the mdelay(1) and the second pci_write_config_dword doesn't go
through.

- ~272 usec after the write, I see TS1 packets getting transmitted at
maximum speed in both directions.  Does this mean the connection is
retraining?

- 50 usec after the first TS1 packet (a surprisingly precise number),
I see an EIOS packet sent in the downstream direction.  After that,
they appear every 25 usec.  However, they *all* show invalid parity
bits according to the PCI analyzer.

Does this ring a bell for anyone?  I think I can also export the
traces as csv in case someone wants to look at them.

Thanks!


* Missing memory barriers
  2014-02-09  8:00       ` Avery Pennarun
@ 2014-02-27 15:48         ` Kalle Valo
  2014-02-28  6:10           ` Avery Pennarun
  2014-03-11  7:33         ` ath10k driver crashes whenever firmware crashes on ARM SoC Kalle Valo
  1 sibling, 1 reply; 32+ messages in thread
From: Kalle Valo @ 2014-02-27 15:48 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Adrian Chadd, ath10k

Hi Avery,

starting a new thread about memory barriers:

Avery Pennarun <apenwarr@gmail.com> writes:

> On Wed, Jan 29, 2014 at 9:41 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
>
> - there are definitely some missing memory barriers in here; in a few
> cases you can clearly see a write getting done before the read that
> came before it.  Looking at the definitions for iowrite32 and
> ioread32, and for rmb() and wmb(), we can see that the use of rmb()
> and wmb() does not work properly (at least on ARM) when you care about
> the ordering between reads and writes.  However, I don't think this
> actually causes the problem.

Can you tell more about this, please? Did you find out where we are
actually doing it wrong?

-- 
Kalle Valo


* Re: Missing memory barriers
  2014-02-27 15:48         ` Missing memory barriers Kalle Valo
@ 2014-02-28  6:10           ` Avery Pennarun
  2014-03-06 13:34             ` Kalle Valo
  0 siblings, 1 reply; 32+ messages in thread
From: Avery Pennarun @ 2014-02-28  6:10 UTC (permalink / raw)
  To: Kalle Valo; +Cc: Adrian Chadd, ath10k

On Thu, Feb 27, 2014 at 7:48 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
> Avery Pennarun <apenwarr@gmail.com> writes:
>> - there are definitely some missing memory barriers in here; in a few
>> cases you can clearly see a write getting done before the read that
>> came before it.  Looking at the definitions for iowrite32 and
>> ioread32, and for rmb() and wmb(), we can see that the use of rmb()
>> and wmb() does not work properly (at least on ARM) when you care about
>> the ordering between reads and writes.  However, I don't think this
>> actually causes the problem.
>
> Can you tell more about this, please? Did you find out where we are
> actually doing it wrong?

Sure.  It's been a while since I wrote the above and it was with an
older version of the ath10k driver, but basically what happens is as
follows.

First look in arch/arm/include/asm/system.h.  Note that in various
situations, rmb() and wmb() may be defined identically or may do
slightly different things.  On the architecture I'm using, I'm fairly
sure they are defined identically.  Results may be slightly different
if they are not.

Next look at arch/arm/include/asm/io.h at the definitions for
iowrite32 and ioread32.  In pseudocode they are roughly like this:

iowrite32:
   wmb();
   write();
ioread32:
   read();
   rmb();

So for example in ath10k_pci_device_reset (or ath10k_pci_cold_reset if
you have that set of patches), there is a code sequence that looks
something like this:

- write the reset register
- loop:
  - read the status

I noticed that when the device crashes the PCIe bus due to voltage
problems, writing the reset register is not what causes the PCIe host
to notice an abort - it's reading the status afterward.  While
investigating this, I hooked up a PCIe bus analyzer and found that it
was reading the status *before* writing the reset.  That's because the
above expands out to:

   wmb()
   write(reset)
   read(status)
   rmb()

which the CPU or compiler is free to reorder as:

  wmb()
  read(status)
  write(reset)
  rmb()

And in fact, it *wants* to do this reordering because there is a
conditional that depends on the result of read(status), so with the
reordering, the CPU pipeline can think about that conditional while
executing write(reset) in parallel.  And this is indeed what happens,
according to my PCIe bus trace.

The way the ARM iowrite/ioread barriers are set up, it works as
expected when you read before writing, but the first read after a
write can be reordered.  If you want to be careful about this, I think
you'd have to insert an extra barrier by hand.
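
As a rough illustration of that extra barrier, here is a user-space
model rather than driver code.  Everything in it is an assumption made
for the example: fake_dev, reg_write, reg_read, full_barrier and the
register layout are hypothetical names, and the C11 fences only
approximate wmb(), rmb() and mb().  The point is just where the full
barrier goes: between the reset write and the first status read.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical register model; none of these names come from ath10k. */
enum { REG_RESET, REG_STATUS, NUM_REGS };
#define STATUS_COLD_RESET 0x1u

struct fake_dev {
	_Atomic uint32_t regs[NUM_REGS];
};

static void fake_dev_init(struct fake_dev *d)
{
	for (int i = 0; i < NUM_REGS; i++)
		atomic_init(&d->regs[i], 0);
}

/* Like the iowrite32 pseudocode: barrier first, then the store. */
static void reg_write(struct fake_dev *d, int reg, uint32_t val)
{
	atomic_thread_fence(memory_order_release);	/* approximates wmb() */
	atomic_store_explicit(&d->regs[reg], val, memory_order_relaxed);

	/* Model the hardware: asserting reset sets the status bit. */
	if (reg == REG_RESET && (val & 1))
		atomic_fetch_or_explicit(&d->regs[REG_STATUS],
					 STATUS_COLD_RESET,
					 memory_order_relaxed);
}

/* Like the ioread32 pseudocode: the load first, then a barrier. */
static uint32_t reg_read(struct fake_dev *d, int reg)
{
	uint32_t val = atomic_load_explicit(&d->regs[reg],
					    memory_order_relaxed);

	atomic_thread_fence(memory_order_acquire);	/* approximates rmb() */
	return val;
}

/* Approximates mb(): orders earlier stores against later loads. */
static void full_barrier(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/*
 * Returns the index of the poll that saw the status bit (0 for the
 * first poll), or -1 if it never appeared within max_polls.
 */
static int cold_reset(struct fake_dev *d, int max_polls)
{
	uint32_t val = reg_read(d, REG_RESET);

	reg_write(d, REG_RESET, val | 1);
	full_barrier();	/* hand-inserted: reset write before first read */

	for (int i = 0; i < max_polls; i++) {
		if (reg_read(d, REG_STATUS) & STATUS_COLD_RESET)
			return i;
	}
	return -1;
}
```

Whether a real driver wants mb() at that spot, or something more
specific to the bus, depends on the platform; the sketch only shows the
placement.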

The bad news is that, while inserting the extra barrier did clean up
my bus trace, it didn't fix the underlying problem.  When the chip
dies due to this cold reset operation, the inability to read the
status register is only a symptom, not the cause.  In the end it's
harmless that we end up doing the first read before the write
operation finishes.  What happens isn't what the code says, but I
don't think that matters in this case.

(I think my main line of investigation at the time was in the first
read after the command was sent to take the chip *out* of reset.  I
thought maybe reading while it was in reset was the underlying cause
of the abort.  No such luck.)

Hope this helps.

Have fun,

Avery


* Re: Missing memory barriers
  2014-02-28  6:10           ` Avery Pennarun
@ 2014-03-06 13:34             ` Kalle Valo
  0 siblings, 0 replies; 32+ messages in thread
From: Kalle Valo @ 2014-03-06 13:34 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Adrian Chadd, ath10k

Avery Pennarun <apenwarr@gmail.com> writes:

> On Thu, Feb 27, 2014 at 7:48 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
>> Avery Pennarun <apenwarr@gmail.com> writes:
>>> - there are definitely some missing memory barriers in here; in a few
>>> cases you can clearly see a write getting done before the read that
>>> came before it.  Looking at the definitions for iowrite32 and
>>> ioread32, and for rmb() and wmb(), we can see that the use of rmb()
>>> and wmb() do not work properly (at least on ARM) when you care about
>>> the ordering between reads and writes.  However, I don't think this
>>> actually causes the problem.
>>
>> Can you tell more about this, please? Did you find out where we are
>> actually doing it wrong?
>
> Sure.  It's been a while since I wrote the above and it was with an
> older version of the ath10k driver, but basically what happens is as
> follows.

[...]

> The bad news is that, while inserting the extra barrier did clean up
> my bus trace, it didn't fix the underlying problem.  When the chip
> dies due to this cold reset operation, the inability to read the
> status register is only a symptom, not the cause.  In the end it's
> harmless that we end up doing the first read before the write
> operation finishes.  What happens isn't what the code says, but I
> don't think that matters in this case.

Thanks for the excellent write-up, I understand this better now. I
wasn't expecting that this would fix the cold reset issue, but these
kinds of issues would be good to fix anyway. You never know what kind of
bugs they might cause in the future.

-- 
Kalle Valo


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-02-09  8:00       ` Avery Pennarun
  2014-02-27 15:48         ` Missing memory barriers Kalle Valo
@ 2014-03-11  7:33         ` Kalle Valo
  2014-03-11  7:40           ` Avery Pennarun
  2014-03-11 19:01           ` Ben Greear
  1 sibling, 2 replies; 32+ messages in thread
From: Kalle Valo @ 2014-03-11  7:33 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Adrian Chadd, ath10k

Avery Pennarun <apenwarr@gmail.com> writes:

> On Wed, Jan 29, 2014 at 9:41 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
>> Still chasing around some people to get a PCIe bus analyzer set up.
>
> Okay, I finally managed to get enough parts put together to look at
> the PCIe bus.  To make things a little more clear, I added a macro
> that does essentially:
>
>    pci_write_config_dword(0, 0x80000000 | __LINE__)
>    mdelay(1);
>    pci_write_config_dword(0, __LINE__)
>
> ...at various points in the code.  This way I can see precisely what
> was the most recent PCIe transaction before the crash.
>
> I'm not super familiar with PCIe, but what I think I'm seeing is:
>
> - the firmware does not need to be loaded yet; sometimes I can crash
> it just by doing a cold reset right at driver load time.  So the good
> news is, the firmware code is not related.
>
> - the crash is always in ath10k_pci_device_reset

[...]

> Does this ring a bell for anyone?  I think I can also export the
> traces as csv in case someone wants to look at them.

I showed your analysis to an HW engineer and the response I got was
"don't do that" (= don't use the cold reset). As you know, we now have a
workaround using the warm reset:

00f5482bcd94 ath10k: suspend hardware before reset
9042e17df834 ath10k: refactor suspend/resume functions
fc36e3ffcdd0 ath10k: fix device initialization routine

Have you tested these? Did they help at all?

-- 
Kalle Valo


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-03-11  7:33         ` ath10k driver crashes whenever firmware crashes on ARM SoC Kalle Valo
@ 2014-03-11  7:40           ` Avery Pennarun
  2014-03-11  7:52             ` Adrian Chadd
  2014-03-11  8:10             ` Kalle Valo
  2014-03-11 19:01           ` Ben Greear
  1 sibling, 2 replies; 32+ messages in thread
From: Avery Pennarun @ 2014-03-11  7:40 UTC (permalink / raw)
  To: Kalle Valo; +Cc: Adrian Chadd, ath10k

On Tue, Mar 11, 2014 at 2:33 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
> Avery Pennarun <apenwarr@gmail.com> writes:
>> On Wed, Jan 29, 2014 at 9:41 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
>>> Still chasing around some people to get a PCIe bus analyzer set up.
>>
>> Okay, I finally managed to get enough parts put together to look at
>> the PCIe bus.  To make things a little more clear, I added a macro
>> that does essentially:
>>
>>    pci_write_config_dword(0, 0x80000000 | __LINE__)
>>    mdelay(1);
>>    pci_write_config_dword(0, __LINE__)
>>
>> ...at various points in the code.  This way I can see precisely what
>> was the most recent PCIe transaction before the crash.
>>
>> I'm not super familiar with PCIe, but what I think I'm seeing is:
>>
>> - the firmware does not need to be loaded yet; sometimes I can crash
>> it just by doing a cold reset right at driver load time.  So the good
>> news is, the firmware code is not related.
>>
>> - the crash is always in ath10k_pci_device_reset
>
> [...]
>
>> Does this ring a bell for anyone?  I think I can also export the
>> traces as csv in case someone wants to look at them.
>
> I showed your analysis to an HW engineer and the response I got was
> "don't do that" (= don't use the cold reset). As you know, we now have a
> workaround using the warm reset:
>
> 00f5482bcd94 ath10k: suspend hardware before reset
> 9042e17df834 ath10k: refactor suspend/resume functions
> fc36e3ffcdd0 ath10k: fix device initialization routine
>
> Have you tested these? Did they help at all?

Yes, I've tested them and they help, mainly by doing the cold reset
less often.  However, when the firmware hard crashes in certain ways
(for example, using my original test case), it looks like warm reset
can't fix that.  The driver then still must fall back to cold reset
and, some fairly large percentage of the time (1/3rd?), crashes the
bus.

We do have a separate reset line controlled by a GPIO.  Using that
crashes the SoC's PCIe host implementation (whee!).  But I got help
from the SoC manufacturer and was able to get some instructions for
resetting their PCIe host controller.  When I do all the magic
incantations in the right order, the system can recover, albeit with a
fully reset ath10k chip.  This workaround is unfortunately specific to
the host device platform so it won't do you much good.

Of course, a good way to avoid the problem is "don't crash the
firmware then," but that's not as robust as I'd like.  This box is
doing quite a few things, so rebooting to fix a problem on one of the
wireless cards is pretty expensive.

Nevertheless, the warm reset changes really do reduce the frequency of
this a lot, to the point where my workaround is almost never needed.
Thanks for that!

Have fun,

Avery


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-03-11  7:40           ` Avery Pennarun
@ 2014-03-11  7:52             ` Adrian Chadd
  2014-03-11  7:59               ` Avery Pennarun
  2014-03-11  8:13               ` Kalle Valo
  2014-03-11  8:10             ` Kalle Valo
  1 sibling, 2 replies; 32+ messages in thread
From: Adrian Chadd @ 2014-03-11  7:52 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Kalle Valo, ath10k

... it's not a complete loss!

This to me says "we need a hook from the driver to call the host
"reset the bus" thing".

We also kinda need it for ath9k/ath5k (if it's not there) so AHB
attached things can be reset by actually poking an SoC reset register.
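
Such a hook might look something like the sketch below. It is plain C,
not kernel code, and every name in it (reset_ops, recover_device, the
stub callbacks) is made up for illustration; as far as this thread
shows, nothing like it exists in ath10k or the PCI core. The idea is
only that the driver escalates from warm reset to cold reset to a
platform-supplied bus reset, and gives up if none of them work.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: which recovery step finally worked, if any. */
enum reset_level { RESET_WARM, RESET_COLD, RESET_HOST_BUS, RESET_FAILED };

/* Hypothetical callback table; each callback returns true on success. */
struct reset_ops {
	bool (*warm_reset)(void *ctx);
	bool (*cold_reset)(void *ctx);
	bool (*host_bus_reset)(void *ctx);	/* platform hook, may be NULL */
};

/* Try the cheapest reset first and escalate only when it fails. */
static enum reset_level recover_device(const struct reset_ops *ops, void *ctx)
{
	if (ops->warm_reset && ops->warm_reset(ctx))
		return RESET_WARM;
	if (ops->cold_reset && ops->cold_reset(ctx))
		return RESET_COLD;
	if (ops->host_bus_reset && ops->host_bus_reset(ctx))
		return RESET_HOST_BUS;
	return RESET_FAILED;
}

/* Stubs for trying the sketch out; ctx points at an int call counter. */
static bool stub_fail(void *ctx) { ++*(int *)ctx; return false; }
static bool stub_ok(void *ctx)   { ++*(int *)ctx; return true; }
```

A platform without a way to reset its host controller would just leave
host_bus_reset as NULL and get RESET_FAILED back.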


-a


On 11 March 2014 00:40, Avery Pennarun <apenwarr@gmail.com> wrote:
> On Tue, Mar 11, 2014 at 2:33 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
>> Avery Pennarun <apenwarr@gmail.com> writes:
>>> On Wed, Jan 29, 2014 at 9:41 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
>>>> Still chasing around some people to get a PCIe bus analyzer set up.
>>>
>>> Okay, I finally managed to get enough parts put together to look at
>>> the PCIe bus.  To make things a little more clear, I added a macro
>>> that does essentially:
>>>
>>>    pci_write_config_dword(0, 0x80000000 | __LINE__)
>>>    mdelay(1);
>>>    pci_write_config_dword(0, __LINE__)
>>>
>>> ...at various points in the code.  This way I can see precisely what
>>> was the most recent PCIe transaction before the crash.
>>>
>>> I'm not super familiar with PCIe, but what I think I'm seeing is:
>>>
>>> - the firmware does not need to be loaded yet; sometimes I can crash
>>> it just by doing a cold reset right at driver load time.  So the good
>>> news is, the firmware code is not related.
>>>
>>> - the crash is always in ath10k_pci_device_reset
>>
>> [...]
>>
>>> Does this ring a bell for anyone?  I think I can also export the
>>> traces as csv in case someone wants to look at them.
>>
>> I showed your analysis to an HW engineer and the response I got was
>> "don't do that" (= don't use the cold reset). As you know, we now have a
>> workaround using the warm reset:
>>
>> 00f5482bcd94 ath10k: suspend hardware before reset
>> 9042e17df834 ath10k: refactor suspend/resume functions
>> fc36e3ffcdd0 ath10k: fix device initialization routine
>>
>> Have you tested these? Did they help at all?
>
> Yes, I've tested them and they help, mainly by doing the cold reset
> less often.  However, when the firmware hard crashes in certain ways
> (for example, using my original test case), it looks like warm reset
> can't fix that.  The driver then still must fall back to cold reset
> and, some fairly large percentage of the time (1/3rd?), crashes the
> bus.
>
> We do have a separate reset line controlled by a GPIO.  Using that
> crashes the SoC's PCIe host implementation (whee!).  But I got help
> from the SoC manufacturer and was able to get some instructions for
> resetting their PCIe host controller.  When I do all the magic
> incantations in the right order, the system can recover, albeit with a
> fully reset ath10k chip.  This workaround is unfortunately specific to
> the host device platform so it won't do you much good.
>
> Of course, a good way to avoid the problem is "don't crash the
> firmware then," but that's not as robust as I'd like.  This box is
> doing quite a few things, so rebooting to fix a problem on one of the
> wireless cards is pretty expensive.
>
> Nevertheless, the warm reset changes really do reduce the frequency of
> this a lot, to the point where my workaround is almost never needed.
> Thanks for that!
>
> Have fun,
>
> Avery


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-03-11  7:52             ` Adrian Chadd
@ 2014-03-11  7:59               ` Avery Pennarun
  2014-03-11  8:13               ` Kalle Valo
  1 sibling, 0 replies; 32+ messages in thread
From: Avery Pennarun @ 2014-03-11  7:59 UTC (permalink / raw)
  To: Adrian Chadd; +Cc: Kalle Valo, ath10k

On Tue, Mar 11, 2014 at 2:52 AM, Adrian Chadd <adrian@freebsd.org> wrote:
> ... it's not a complete loss!
>
> This to me says "we need a hook from the driver to call the host
> "reset the bus" thing".
>
> We also kinda need it for ath9k/ath5k (if it's not there) so AHB
> attached things can be reset by actually poking an SoC reset register.

That would be awesome.  It would make my hack feel much less hacky :)

Have fun,

Avery


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-03-11  7:40           ` Avery Pennarun
  2014-03-11  7:52             ` Adrian Chadd
@ 2014-03-11  8:10             ` Kalle Valo
  1 sibling, 0 replies; 32+ messages in thread
From: Kalle Valo @ 2014-03-11  8:10 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Adrian Chadd, ath10k

Avery Pennarun <apenwarr@gmail.com> writes:

> On Tue, Mar 11, 2014 at 2:33 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
>
>> I showed your analysis to an HW engineer and the response I got was
>> "don't do that" (= don't use the cold reset). As you know, we now have a
>> workaround using the warm reset:
>>
>> 00f5482bcd94 ath10k: suspend hardware before reset
>> 9042e17df834 ath10k: refactor suspend/resume functions
>> fc36e3ffcdd0 ath10k: fix device initialization routine
>>
>> Have you tested these? Did they help at all?
>
> Yes, I've tested them and they help, mainly by doing the cold reset
> less often.  However, when the firmware hard crashes in certain ways
> (for example, using my original test case), it looks like warm reset
> can't fix that.  The driver then still must fall back to cold reset
> and, some fairly large percentage of the time (1/3rd?), crashes the
> bus.

Ok, thanks. I'll investigate more about the warm reset problems and try
to find ways to make it more reliable.

> We do have a separate reset line controlled by a GPIO.  Using that
> crashes the SoC's PCIe host implementation (whee!).  But I got help
> from the SoC manufacturer and was able to get some instructions for
> resetting their PCIe host controller.  When I do all the magic
> incantations in the right order, the system can recover, albeit with a
> fully reset ath10k chip.  This workaround is unfortunately specific to
> the host device platform so it won't do you much good.
>
> Of course, a good way to avoid the problem is "don't crash the
> firmware then," but that's not as robust as I'd like.

I never trust the firmware, in any device, and that's why I would like
ath10k to have a 100% reliable way to restart it from the host.

> This box is doing quite a few things, so rebooting to fix a problem on
> one of the wireless cards is pretty expensive.

Yeah, that would be really bad. Restarting the firmware will take
something like 1-2 seconds and the user would only notice a small pause
in data traffic, a much better solution than rebooting the whole box.

> Nevertheless, the warm reset changes really do reduce the frequency of
> this a lot, to the point where my workaround is almost never needed.
> Thanks for that!

Great, thanks for the feedback.

-- 
Kalle Valo


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-03-11  7:52             ` Adrian Chadd
  2014-03-11  7:59               ` Avery Pennarun
@ 2014-03-11  8:13               ` Kalle Valo
  2014-03-11  8:37                 ` Michal Kazior
  1 sibling, 1 reply; 32+ messages in thread
From: Kalle Valo @ 2014-03-11  8:13 UTC (permalink / raw)
  To: Adrian Chadd; +Cc: ath10k, Avery Pennarun

(Fixing top posting)

Adrian Chadd <adrian@freebsd.org> writes:

> On 11 March 2014 00:40, Avery Pennarun <apenwarr@gmail.com> wrote:
>
>> We do have a separate reset line controlled by a GPIO.  Using that
>> crashes the SoC's PCIe host implementation (whee!).  But I got help
>> from the SoC manufacturer and was able to get some instructions for
>> resetting their PCIe host controller.  When I do all the magic
>> incantations in the right order, the system can recover, albeit with a
>> fully reset ath10k chip.  This workaround is unfortunately specific to
>> the host device platform so it won't do you much good.
>
> ... it's not a complete loss!
>
> This to me says "we need a hook from the driver to call the host
> "reset the bus" thing".
>
> We also kinda need it for ath9k/ath5k (if it's not there) so AHB
> attached things can be reset by actually poking an SoC reset register.

Yeah, that kind of hook would be good to have.

-- 
Kalle Valo


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-03-11  8:13               ` Kalle Valo
@ 2014-03-11  8:37                 ` Michal Kazior
  0 siblings, 0 replies; 32+ messages in thread
From: Michal Kazior @ 2014-03-11  8:37 UTC (permalink / raw)
  To: Kalle Valo; +Cc: Adrian Chadd, ath10k, Avery Pennarun

On 11 March 2014 09:13, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
> (Fixing top posting)
>
> Adrian Chadd <adrian@freebsd.org> writes:
>
>> On 11 March 2014 00:40, Avery Pennarun <apenwarr@gmail.com> wrote:
>>
>>> We do have a separate reset line controlled by a GPIO.  Using that
>>> crashes the SoC's PCIe host implementation (whee!).  But I got help
>>> from the SoC manufacturer and was able to get some instructions for
>>> resetting their PCIe host controller.  When I do all the magic
>>> incantations in the right order, the system can recover, albeit with a
>>> fully reset ath10k chip.  This workaround is unfortunately specific to
>>> the host device platform so it won't do you much good.
>>
>> ... it's not a complete loss!
>>
>> This to me says "we need a hook from the driver to call the host
>> "reset the bus" thing".
>>
>> We also kinda need it for ath9k/ath5k (if it's not there) so AHB
>> attached things can be reset by actually poking an SoC reset register.
>
> Yeah, that kind of hook would be good to have.

There is PCI error recovery in kernel
(Documentation/PCI/pci-error-recovery.txt) but I think it's only
implemented on ppc. I wonder if you could try hooking up with that?


Michał


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-03-11  7:33         ` ath10k driver crashes whenever firmware crashes on ARM SoC Kalle Valo
  2014-03-11  7:40           ` Avery Pennarun
@ 2014-03-11 19:01           ` Ben Greear
  2014-03-12  8:22             ` Kalle Valo
  1 sibling, 1 reply; 32+ messages in thread
From: Ben Greear @ 2014-03-11 19:01 UTC (permalink / raw)
  To: Kalle Valo; +Cc: Adrian Chadd, ath10k, Avery Pennarun

On 03/11/2014 12:33 AM, Kalle Valo wrote:
>> Does this ring a bell for anyone?  I think I can also export the
>> traces as csv in case someone wants to look at them.
> 
> I showed your analysis to an HW engineer and the response I got was
> "don't do that" (= don't use the cold reset).

Earlier (1/22/14), you said a cold reset problem in CUS223 was due primarily to
a bad piece of hardware (or bad design, or something) in the CUS223.

I guess the problem must extend to other NICs as well, and
maybe other reasons for causing the hangs?

Anything we can do to help debug the problem with warm resets
failing to work properly in some cases?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-03-11 19:01           ` Ben Greear
@ 2014-03-12  8:22             ` Kalle Valo
  2014-03-12 16:01               ` Ben Greear
  0 siblings, 1 reply; 32+ messages in thread
From: Kalle Valo @ 2014-03-12  8:22 UTC (permalink / raw)
  To: Ben Greear; +Cc: Adrian Chadd, ath10k, Avery Pennarun

Ben Greear <greearb@candelatech.com> writes:

> On 03/11/2014 12:33 AM, Kalle Valo wrote:
>>> Does this ring a bell for anyone?  I think I can also export the
>>> traces as csv in case someone wants to look at them.
>> 
>> I showed your analysis to an HW engineer and the response I got was
>> "don't do that" (= don't use the cold reset).
>
> Earlier (1/22/14), you said a cold reset problem in CUS223 was due primarily to
> a bad piece of hardware (or bad design, or something) in the CUS223.
>
> I guess the problem must extend to other NICs as well, and
> maybe other reasons for causing the hangs?

I don't know the details, but I understood this only happens on CUS223.
If someone has seen these cold reset problems on other boards, like
XB140, I would be very keen to hear about it.

> Anything we can do to help debug the problem with warm resets
> failing to work properly in some cases?

Thanks, but right now there isn't anything. Once I have new ideas
for the warm reset problems, I'll post them here. Testing those would be
very useful.

-- 
Kalle Valo


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-03-12  8:22             ` Kalle Valo
@ 2014-03-12 16:01               ` Ben Greear
  2014-03-12 23:28                 ` Avery Pennarun
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Greear @ 2014-03-12 16:01 UTC (permalink / raw)
  To: Kalle Valo; +Cc: Adrian Chadd, ath10k, Avery Pennarun

On 03/12/2014 01:22 AM, Kalle Valo wrote:
> Ben Greear <greearb@candelatech.com> writes:
> 
>> On 03/11/2014 12:33 AM, Kalle Valo wrote:
>>>> Does this ring a bell for anyone?  I think I can also export the
>>>> traces as csv in case someone wants to look at them.
>>>
>>> I showed your analysis to an HW engineer and the response I got was
>>> "don't do that" (= don't use the cold reset).
>>
>> Earlier (1/22/14), you said a cold reset problem in CUS223 was due primarily to
>> a bad piece of hardware (or bad design, or something) in the CUS223.
>>
>> I guess the problem must extend to other NICs as well, and
>> maybe other reasons for causing the hangs?
> 
> I don't know the details, but I understood this only happens on CUS223.
> If someone has seen these cold reset problems on other boards, like
> XB140, I would be very keen to hear about it.

Come to think of it, I'm not sure I've seen a hard lockup of the machine
on non-CUS223 machines, but I have seen cases where a non-CUS223 NIC wedges
and the existing restart does not recover it.

This requires a reboot to recover from.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-03-12 16:01               ` Ben Greear
@ 2014-03-12 23:28                 ` Avery Pennarun
  2014-03-13  5:09                   ` Kalle Valo
  0 siblings, 1 reply; 32+ messages in thread
From: Avery Pennarun @ 2014-03-12 23:28 UTC (permalink / raw)
  To: Ben Greear; +Cc: Kalle Valo, Adrian Chadd, ath10k

On Wed, Mar 12, 2014 at 11:01 AM, Ben Greear <greearb@candelatech.com> wrote:
> Come to think of it, I'm not sure I've seen a hard lockup of the machine
> on non CUS223 machines, but I have seen cases where non CUS223 NIC wedges
> and the existing restart does not recover it.
>
> This requires a reboot to recover from.

That's not so surprising; the current driver doesn't do the cold reset,
which is the only thing that can recover from a wedge, I guess because
of the CUS223 bug.

Stupid question: can I honestly just buy a different module and make
my PCIe crashiness problems go away?  Is there a reason to prefer the
CUS223?

Another question: is there perhaps anything the firmware can do to eg.
set a watchdog timer, so that the internal CPU will restart (go back
to "waiting for firmware" mode) if it doesn't answer for a while?  The
idea would be for the device to un-wedge itself even if there's
nothing we can do to fix it from outside.

Thanks,

Avery


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-03-12 23:28                 ` Avery Pennarun
@ 2014-03-13  5:09                   ` Kalle Valo
  2014-03-13 17:34                     ` Adrian Chadd
  0 siblings, 1 reply; 32+ messages in thread
From: Kalle Valo @ 2014-03-13  5:09 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Ben Greear, Adrian Chadd, ath10k

Avery Pennarun <apenwarr@gmail.com> writes:

> On Wed, Mar 12, 2014 at 11:01 AM, Ben Greear <greearb@candelatech.com> wrote:
>> Come to think of it, I'm not sure I've seen a hard lockup of the machine
>> on non CUS223 machines, but I have seen cases where non CUS223 NIC wedges
>> and the existing restart does not recover it.
>>
>> This requires a reboot to recover from.
>
> That's not so surprising; the current driver doesn't do the cold reset,
> which is the only thing that can recover from a wedge, I guess because
> of the CUS223 bug.
>
> Stupid question: can I honestly just buy a different module and make
> my PCIe crashiness problems go away?

That's my theory, but I would like to confirm that somehow.

> Is there a reason to prefer the CUS223?

Good question, I haven't figured that out. The CUS223 board is physically
larger than the XB140, for example, and I would assume there's a reason for
that.

> Another question: is there perhaps anything the firmware can do to eg.
> set a watchdog timer, so that the internal CPU will restart (go back
> to "waiting for firmware" mode) if it doesn't answer for a while?  The
> idea would be for the device to un-wedge itself even if there's
> nothing we can do to fix it from outside.

That's something a firmware engineer should comment on, which I'm not.

-- 
Kalle Valo


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-03-13  5:09                   ` Kalle Valo
@ 2014-03-13 17:34                     ` Adrian Chadd
  2014-03-13 17:39                       ` Kalle Valo
  2014-03-13 17:42                       ` Ben Greear
  0 siblings, 2 replies; 32+ messages in thread
From: Adrian Chadd @ 2014-03-13 17:34 UTC (permalink / raw)
  To: Kalle Valo; +Cc: Ben Greear, ath10k, Avery Pennarun

I think the CUS223 has higher transmit power, right?


-a


On 12 March 2014 22:09, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
> Avery Pennarun <apenwarr@gmail.com> writes:
>
>> On Wed, Mar 12, 2014 at 11:01 AM, Ben Greear <greearb@candelatech.com> wrote:
>>> Come to think of it, I'm not sure I've seen a hard lockup of the machine
>>> on non CUS223 machines, but I have seen cases where non CUS223 NIC wedges
>>> and the existing restart does not recover it.
>>>
>>> This requires a reboot to recover from.
>>
>> That's not so surprising; the current driver doesn't do the cold reset,
>> which is the only thing that can recover from a wedge, I guess because
>> of the CUS223 bug.
>>
>> Stupid question: can I honestly just buy a different module and make
>> my PCIe crashiness problems go away?
>
> That's my theory, but I would like to confirm that somehow.
>
>> Is there a reason to prefer the CUS223?
>
> Good question, I haven't figured that out. The CUS223 board is physically
> larger than the XB140, for example, and I would assume there's a reason for
> that.
>
>> Another question: is there perhaps anything the firmware can do to eg.
>> set a watchdog timer, so that the internal CPU will restart (go back
>> to "waiting for firmware" mode) if it doesn't answer for a while?  The
>> idea would be for the device to un-wedge itself even if there's
>> nothing we can do to fix it from outside.
>
> That's something a firmware engineer should comment on, which I'm not.
>
> --
> Kalle Valo

_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-03-13 17:34                     ` Adrian Chadd
@ 2014-03-13 17:39                       ` Kalle Valo
  2014-03-13 17:42                       ` Ben Greear
  1 sibling, 0 replies; 32+ messages in thread
From: Kalle Valo @ 2014-03-13 17:39 UTC (permalink / raw)
  To: Adrian Chadd; +Cc: Ben Greear, ath10k, Avery Pennarun

Adrian Chadd <adrian@freebsd.org> writes:

> I think the CUS223 has higher transmit power, right?

That's true. I was told CUS223 uses an external PA, which makes it possible
to use higher transmit power. XB143 uses a FEM, and hence its power is lower
than on CUS223.

-- 
Kalle Valo


* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-03-13 17:34                     ` Adrian Chadd
  2014-03-13 17:39                       ` Kalle Valo
@ 2014-03-13 17:42                       ` Ben Greear
  2014-03-14  6:26                         ` Kalle Valo
  1 sibling, 1 reply; 32+ messages in thread
From: Ben Greear @ 2014-03-13 17:42 UTC (permalink / raw)
  To: Adrian Chadd; +Cc: Kalle Valo, ath10k, Avery Pennarun

On 03/13/2014 10:34 AM, Adrian Chadd wrote:
> I think the CUS223 has higher transmit power, right?
> 

Yes.

Aside from that, I have not noticed any significant differences.
Throughput is generally the same as the WLE900VX I have been testing.

Also, I am not sure CUS223 is commercially available for purchase in small-ish
quantities, but perhaps I just didn't look in the right place at the right time.


>>> Another question: is there perhaps anything the firmware can do to eg.
>>> set a watchdog timer, so that the internal CPU will restart (go back
>>> to "waiting for firmware" mode) if it doesn't answer for a while?  The
>>> idea would be for the device to un-wedge itself even if there's
>>> nothing we can do to fix it from outside.
>>
>> That's something a firmware engineer should comment on, which I'm not.

Kalle:  Can you feed that question to your firmware contacts
at QCA?  I'm not sure I should be publicly speculating on such
matters :)

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
  2014-03-13 17:42                       ` Ben Greear
@ 2014-03-14  6:26                         ` Kalle Valo
  0 siblings, 0 replies; 32+ messages in thread
From: Kalle Valo @ 2014-03-14  6:26 UTC (permalink / raw)
  To: Ben Greear; +Cc: Adrian Chadd, ath10k, Avery Pennarun

Ben Greear <greearb@candelatech.com> writes:

>>>> Another question: is there perhaps anything the firmware can do to eg.
>>>> set a watchdog timer, so that the internal CPU will restart (go back
>>>> to "waiting for firmware" mode) if it doesn't answer for a while?  The
>>>> idea would be for the device to un-wedge itself even if there's
>>>> nothing we can do to fix it from outside.
>>>
>>> That's something a firmware engineer should comment on, which I'm not.
>
> Kalle:  Can you feed that question to your firmware contacts
> at QCA?

Sure, I'm trying to solve this warm reset problem and talking with them
anyway. Any ideas on how to solve this are very welcome.

-- 
Kalle Valo


Thread overview: 32+ messages
2014-01-28 17:18 ath10k driver crashes whenever firmware crashes on ARM SoC Avery Pennarun
2014-01-28 18:20 ` Ben Greear
2014-01-28 18:34   ` Avery Pennarun
2014-01-28 19:01     ` Ben Greear
2014-01-28 19:11       ` Avery Pennarun
2014-01-28 20:10     ` Janusz Dziedzic
2014-01-28 20:51       ` Avery Pennarun
2014-01-29 16:44     ` Kalle Valo
     [not found] ` <CAJ-VmokorbJ2iU4rGNYdRj+A22NR9cV-5h-tDN0pD2FCurZDpA@mail.gmail.com>
2014-01-28 20:55   ` Avery Pennarun
2014-01-29 16:41 ` Kalle Valo
2014-01-29 18:44   ` Adrian Chadd
2014-01-30  2:41     ` Avery Pennarun
2014-02-09  8:00       ` Avery Pennarun
2014-02-27 15:48         ` Missing memory barriers Kalle Valo
2014-02-28  6:10           ` Avery Pennarun
2014-03-06 13:34             ` Kalle Valo
2014-03-11  7:33         ` ath10k driver crashes whenever firmware crashes on ARM SoC Kalle Valo
2014-03-11  7:40           ` Avery Pennarun
2014-03-11  7:52             ` Adrian Chadd
2014-03-11  7:59               ` Avery Pennarun
2014-03-11  8:13               ` Kalle Valo
2014-03-11  8:37                 ` Michal Kazior
2014-03-11  8:10             ` Kalle Valo
2014-03-11 19:01           ` Ben Greear
2014-03-12  8:22             ` Kalle Valo
2014-03-12 16:01               ` Ben Greear
2014-03-12 23:28                 ` Avery Pennarun
2014-03-13  5:09                   ` Kalle Valo
2014-03-13 17:34                     ` Adrian Chadd
2014-03-13 17:39                       ` Kalle Valo
2014-03-13 17:42                       ` Ben Greear
2014-03-14  6:26                         ` Kalle Valo
