* ath10k driver crashes whenever firmware crashes on ARM SoC
From: Avery Pennarun @ 2014-01-28 17:18 UTC
To: ath10k

Hi all,

When the ath10k firmware crashes on my device (let's not worry about
why the firmware crashes right now; one problem at a time), my host
CPU (ARMv7 based) can't recover. I get some variant of this error:

[ 780.116977] Unhandled fault: imprecise external abort (0x1406) at 0x2ac3706c
[ 780.124336] Internal error: : 1406 [#1] SMP

I've narrowed this down to this code in ath10k/pci.c, ath10k_pci_device_reset:

    /* Put Target, including PCIe, into RESET. */
    val = ath10k_pci_reg_read32(ar, SOC_GLOBAL_RESET_ADDRESS);
    val |= 1;
    ath10k_pci_reg_write32(ar, SOC_GLOBAL_RESET_ADDRESS, val);

    for (i = 0; i < ATH_PCI_RESET_WAIT_MAX; i++) {
        if (ath10k_pci_reg_read32(ar, RTC_STATE_ADDRESS) &
            RTC_STATE_COLD_RESET_MASK)
            break;
        msleep(1);
    }

Specifically, the pci_reg_read32(). I can insert as much time as I
want between the write32 and the read32; it always performs the read,
then crashes with the PC pointing a few instructions later, inside the
msleep(), with the imprecise external abort. I think this means the
PCI read operation has encountered a PCI target abort, which suggests
that the SOC_GLOBAL_RESET_ADDRESS line has not successfully reset the
device. From what I understand, on x86 processors PCI target aborts
are not fatal, so you might not notice this problem on those
platforms, but it's bad on ARM.

I'm using the ath10k driver from linux-next 20140117, but I had the
same problem with 3.13-rc2 so I don't think this has changed.

Are other people seeing this? Is there something I can try to resolve it?

Thanks,

Avery

_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
From: Ben Greear @ 2014-01-28 18:20 UTC
To: Avery Pennarun; +Cc: ath10k

On 01/28/2014 09:18 AM, Avery Pennarun wrote:
> Hi all,
>
> When the ath10k firmware crashes on my device (let's not worry about
> why the firmware crashes right now; one problem at a time), my host
> CPU (ARMv7 based) can't recover. I get some variant of this error:

I don't know about your pci bus problem, but I'm interested in knowing
about firmware crashes (if you are at liberty to share the details).

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com

* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
From: Avery Pennarun @ 2014-01-28 18:34 UTC
To: Ben Greear; +Cc: ath10k

On Tue, Jan 28, 2014 at 1:20 PM, Ben Greear <greearb@candelatech.com> wrote:
> On 01/28/2014 09:18 AM, Avery Pennarun wrote:
>> When the ath10k firmware crashes on my device (let's not worry about
>> why the firmware crashes right now; one problem at a time), my host
>> CPU (ARMv7 based) can't recover. I get some variant of this error:
>
> I don't know about your pci bus problem, but I'm interested in knowing
> about firmware crashes (if you are at liberty to share the details).

Well, since you asked... :)

I'm trying to build an especially robust system here, so when I
noticed that the driver will bring the entire system crashing down
upon a firmware crash, I've actually gone out of my way to make more
firmware crashes. So I'm using the ath10k (not ap) firmware from a
month or so ago, in AP mode. It's pretty easy to crash the firmware
with a sequence something like this:

- start hostapd (I'm using channel 36, HT20, no encryption)
  # note that hostapd already adds a mon.wlan0 monitor interface
- iw wlan0 interface add mon0 type monitor
- ip link set mon0 up
- tcpdump -ni mon0 | head

This doesn't *always* work, but it kills the firmware maybe half the
time for me. It may or may not be worse if there are clients
connected and pushing traffic. I've noticed that once the firmware
has crashed once and recovered, it's hard to crash it again using the
same trick without unloading and reloading the driver. Note that in
this case, the firmware crash doesn't always kill my host SoC with a
bus error (although sometimes it does). Even if it doesn't die
completely, the driver generally comes out confused about the
monitoring interface(s): it prints "ath10k: Only one monitor interface
allowed", which is actually totally untrue, since before the crash I
was able to create and use two at a time. (I think this error is a
side effect of getting out of sync with the firmware when it restarts,
and thus getting confused about "pmon" vs "vmon" monitor interfaces.)

Also, if I leave the ath10k driver running and pushing traffic for,
say, 10 minutes, the probability that the firmware will crash *and*
take my SoC with it, if I try to kill hostapd or unload the driver,
approaches 100%.

These are all problems worth worrying about, of course, but
fundamentally I really want to get the resets working. The driver
resets in about one second when it *doesn't* crash, which is pretty
gross, but at least it means we can recover when the firmware is
crappy. The especially crappy firmware right now makes it easier to
test the recovery process in the driver, which I want to fix first if
possible. Once I feel good that it can recover from crashes, I will
be happier to complain about the actual crashes themselves :)

Have fun,

Avery

* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
From: Ben Greear @ 2014-01-28 19:01 UTC
To: Avery Pennarun; +Cc: ath10k

On 01/28/2014 10:34 AM, Avery Pennarun wrote:
> [...]
> Also, if I leave the ath10k driver running and pushing traffic for,
> say, 10 minutes, the probability that the firmware will crash *and*
> take my SoC with it, if I try to kill hostapd or unload the driver,
> approaches 100%.

I see similar issues (with the reset killing the PC) on x86-64
(core-i7 CPU). Kalle mentioned a few days ago that at least some of the
NICs had issues with cold reset and that they hoped to
have a fix that uses warm reset in a week or two.

Interestingly, I also see hard PC lockup on longer runs, but
perhaps that is related to the cold-reset issue somehow.

I'm using the 10.x AP firmware, and my method of crashing firmware
is different at the moment :)

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com

* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
From: Avery Pennarun @ 2014-01-28 19:11 UTC
To: Ben Greear; +Cc: ath10k

On Tue, Jan 28, 2014 at 2:01 PM, Ben Greear <greearb@candelatech.com> wrote:
> I see similar issues (with the reset killing the PC) on x86-64
> (core-i7 CPU). Kalle mentioned a few days ago that at least some of the
> NICs had issues with cold reset and that they hoped to
> have a fix that uses warm reset in a week or two.

I saw some messages about this on the list, including a patch that
looked promising:
  http://lists.infradead.org/pipermail/ath10k/2013-December/000888.html
but it didn't help. I'm pretty sure stuck firmware is not something
you can warm reboot to fix. On the other hand, when my box reboots
itself, it always comes back okay and the driver loads fine. So
clearly there is *some* reset line in the system that's working...

> Interestingly, I also see hard PC lockup on longer runs, but
> perhaps that is related to the cold-reset issue somehow.

Longer runs? Lucky you! :)

> I'm using the 10.x AP firmware, and my method of crashing firmware
> is different at the moment :)

Yeah, I tried the AP firmware and it lasts longer. I'm pretty sad
that there are two different firmwares with two different sets of bugs
to choose between.

* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
From: Janusz Dziedzic @ 2014-01-28 20:10 UTC
To: Avery Pennarun; +Cc: Ben Greear, ath10k

2014-01-28 Avery Pennarun <apenwarr@gmail.com>:
> [...]
> It's pretty easy to crash the firmware
> with a sequence something like this:
>
> - start hostapd (I'm using channel 36, HT20, no encryption)
>   # note that hostapd already adds a mon.wlan0 monitor interface
> - iw wlan0 interface add mon0 type monitor
> - ip link set mon0 up
> - tcpdump -ni mon0 | head
>

FW636 has problems with the monitor interface:

iw wlan0 set type monitor
ifconfig wlan0 up
tcpdump -i wlan0

This will always crash the firmware after "entered promiscuous mode".

Generally, if you have (or are left with) only one monitor interface
and an active tcpdump (in your case, after hostapd is killed), you
will always get this crash. So using a monitor interface with FW636 is
not a good idea.

BTW, monitor works fine with the 10.x FW.

BR
Janusz

* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
From: Avery Pennarun @ 2014-01-28 20:51 UTC
To: Janusz Dziedzic; +Cc: Ben Greear, ath10k

On Tue, Jan 28, 2014 at 3:10 PM, Janusz Dziedzic <janusz.dziedzic@gmail.com> wrote:
> FW636 has problems with the monitor interface:
> iw wlan0 set type monitor
> ifconfig wlan0 up
> tcpdump -i wlan0
> This will always crash the firmware after "entered promiscuous mode".
>
> Generally, if you have (or are left with) only one monitor interface
> and an active tcpdump (in your case, after hostapd is killed), you
> will always get this crash. So using a monitor interface with FW636 is
> not a good idea.
>
> BTW, monitor works fine with the 10.x FW.

Well, in this case it gives a mostly-reproducible test case for the
driver's failure to recover from firmware crashes. So it's useful in
that respect. I was using the 10.x firmware until I wanted an easier
way to reproduce the problem :)

Avery

* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
From: Kalle Valo @ 2014-01-29 16:44 UTC
To: Avery Pennarun; +Cc: Ben Greear, ath10k

Avery Pennarun <apenwarr@gmail.com> writes:

> On Tue, Jan 28, 2014 at 1:20 PM, Ben Greear <greearb@candelatech.com> wrote:
>> On 01/28/2014 09:18 AM, Avery Pennarun wrote:
>>> When the ath10k firmware crashes on my device (let's not worry about
>>> why the firmware crashes right now; one problem at a time), my host
>>> CPU (ARMv7 based) can't recover. I get some variant of this error:
>>
>> I don't know about your pci bus problem, but I'm interested in knowing
>> about firmware crashes (if you are at liberty to share the details).
>
> Well, since you asked... :)
>
> I'm trying to build an especially robust system here, so when I
> noticed that the driver will bring the entire system crashing down
> upon a firmware crash, I've actually gone out of my way to make more
> firmware crashes.

Do you know that we have a simulate_fw_crash debugfs file just for
testing this? Of course it's good to test various firmware crash
scenarios, but in some cases this debugfs file helps a lot.

-- 
Kalle Valo

* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
From: Avery Pennarun @ 2014-01-28 20:55 UTC
To: Adrian Chadd; +Cc: ath10k

On Tue, Jan 28, 2014 at 3:28 PM, Adrian Chadd <adrian@freebsd.org> wrote:
> Can you put a sleep in before the initial write?

I've tried adding printks and various things before the write, and it
didn't seem to help. I also noticed that iowrite32() contains a write
barrier, and ioread32() contains a read barrier, but they're
structured like this:

    iowrite32() { wmb(); write32(); }
    ioread32() { read32(); rmb(); }

This means that the write32() and the read32() that follows it have no
barrier between them. I think that might be a problem, but adding
the barriers didn't help either.

> There are hardware bugs that are.. delicate. What you should do is find what
> kind of reporting you can pull out of the pcie endpoint the nic is connected
> to in order to see why it threw a fatal error. Then at least we can poke it
> further.

Ok. I'm going to see if I can hunt down a PCIe bus analyzer and find
out what the deal is.

> It's also worth asking the qca hardware team about this. They'll likely want
> to know what the pcie errors are so please figure that out.

Thanks for the advice!

Are other people not seeing this? Or are they just trying not to
crash the firmware so they don't have to care? :)

Thanks,

Avery

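The barrier gap described in the message above can be modelled in plain userspace C. This is only a minimal sketch, not kernel code: the model_* helpers, struct trace and the one-letter event codes are made up for illustration, and they record nothing more than the order in which a barrier, a write or a read would be issued according to the structure quoted in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace model only: each helper appends one letter to a trace
 * instead of touching hardware.
 *   'w' = wmb(), 'W' = register write, 'R' = register read,
 *   'r' = rmb(), 'M' = explicit full barrier inserted by hand. */
#define TRACE_MAX 32

struct trace {
    char ev[TRACE_MAX];
    size_t len;
};

static void trace_add(struct trace *t, char c)
{
    assert(t->len + 1 < TRACE_MAX);   /* keep room for the terminator */
    t->ev[t->len++] = c;
    t->ev[t->len] = '\0';
}

/* Structure as described in the message: barrier first, then the write. */
static void model_iowrite32(struct trace *t)
{
    trace_add(t, 'w');
    trace_add(t, 'W');
}

/* Structure as described in the message: read first, then the barrier. */
static void model_ioread32(struct trace *t)
{
    trace_add(t, 'R');
    trace_add(t, 'r');
}

/* Hypothetical helper: a full barrier placed by hand between the two. */
static void model_mb(struct trace *t)
{
    trace_add(t, 'M');
}

/* Returns 1 if any barrier event sits between the first 'W' and the
 * first 'R' that follows it, 0 otherwise. */
static int barrier_between_write_and_read(const struct trace *t)
{
    const char *w = strchr(t->ev, 'W');
    const char *r = w ? strchr(w, 'R') : NULL;
    const char *p;

    if (!w || !r)
        return 0;
    for (p = w + 1; p < r; p++)
        if (*p == 'w' || *p == 'r' || *p == 'M')
            return 1;
    return 0;
}
```

Calling model_iowrite32() and then model_ioread32() on an empty trace leaves the write and the read adjacent, with the two barriers on the outside; calling model_mb() between them is the by-hand fix the message says was tried (and which, per the message, did not cure the crash).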
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
From: Kalle Valo @ 2014-01-29 16:41 UTC
To: Avery Pennarun; +Cc: ath10k

Hi,

Avery Pennarun <apenwarr@gmail.com> writes:

> When the ath10k firmware crashes on my device (let's not worry about
> why the firmware crashes right now; one problem at a time), my host
> CPU (ARMv7 based) can't recover. I get some variant of this error:
>
> [ 780.116977] Unhandled fault: imprecise external abort (0x1406) at 0x2ac3706c
> [ 780.124336] Internal error: : 1406 [#1] SMP
>
> I've narrowed this down to this code in ath10k/pci.c, ath10k_pci_device_reset:
>
>     /* Put Target, including PCIe, into RESET. */
>     val = ath10k_pci_reg_read32(ar, SOC_GLOBAL_RESET_ADDRESS);
>     val |= 1;
>     ath10k_pci_reg_write32(ar, SOC_GLOBAL_RESET_ADDRESS, val);
>
>     for (i = 0; i < ATH_PCI_RESET_WAIT_MAX; i++) {
>         if (ath10k_pci_reg_read32(ar, RTC_STATE_ADDRESS) &
>             RTC_STATE_COLD_RESET_MASK)
>             break;
>         msleep(1);
>     }

Are you using a CUS223 board? I was told that it has a problem with the
cold reset. When you issue the cold reset, some voltage on the board
goes too low and there's a chance that it breaks PCI communication.

> Specifically, the pci_reg_read32(). I can insert as much time as I
> want between the write32 and the read32; it always performs the read,
> then crashes with the PC pointing a few instructions later, inside the
> msleep(), with the imprecise external abort. I think this means the
> PCI read operation has encountered a PCI target abort, which suggests
> that the SOC_GLOBAL_RESET_ADDRESS line has not successfully reset the
> device. From what I understand, on x86 processors PCI target aborts
> are not fatal, so you might not notice this problem on those
> platforms, but it's bad on ARM.

FWIW the same problem also happens on MIPS.

> I'm using the ath10k driver from linux-next 20140117, but I had the
> same problem with 3.13-rc2 so I don't think this has changed.
>
> Are other people seeing this? Is there something I can try to resolve it?

Yes, we see it as well. We also see it when bringing an interface down,
for example when shutting down hostapd. Soon we will post patches to
work around the interface down issue, but for firmware crashes we haven't
yet found a reliable solution. I hope there's a way to fix warm reset to
properly recover from a firmware crash.

-- 
Kalle Valo

* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
From: Adrian Chadd @ 2014-01-29 18:44 UTC
To: Kalle Valo; +Cc: ath10k, Avery Pennarun

Hi,

Well, the problem is more likely that the PCIe bus doesn't come back
correctly, and the next IO write hits a PCI bus error.

What about seeing if you can detect the PCIe error before it's a fatal
one (hence my email earlier about trying to decode this stuff) and
then reset the PCIe port from the PCI side?

-a

* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
From: Avery Pennarun @ 2014-01-30 2:41 UTC
To: Adrian Chadd; +Cc: Kalle Valo, ath10k

On Wed, Jan 29, 2014 at 1:44 PM, Adrian Chadd <adrian@freebsd.org> wrote:
> Well, the problem is more likely that the PCIe bus doesn't come back
> correctly, and the next IO write hits a PCI bus error.
>
> What about seeing if you can detect the PCIe error before it's a fatal
> one (hence my email earlier about trying to decode this stuff) and
> then reset the PCIe port from the PCI side?

Still chasing around some people to get a PCIe bus analyzer set up.

I did try a bit with resetting the PCI bus, but there are a lot of
ways that can get messy (turns out PCI bus controllers have a lot of
registers that get reset when you reset them, who knew?), and I didn't
manage to work around all of them.

My guess however is that the PCIe bus itself is probably not broken...
but the ath10k device may not be responding to it anymore when this
problem happens.

One thing I noticed is that if I reboot my device (which completely
resets the PCI bus), it comes back and works. So *somehow* things are
getting cleaned up without actually cutting power.

* Re: ath10k driver crashes whenever firmware crashes on ARM SoC
From: Avery Pennarun @ 2014-02-09 8:00 UTC
To: Adrian Chadd; +Cc: Kalle Valo, ath10k

On Wed, Jan 29, 2014 at 9:41 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
> Still chasing around some people to get a PCIe bus analyzer set up.

Okay, I finally managed to get enough parts put together to look at
the PCIe bus. To make things a little more clear, I added a macro
that does essentially:

    pci_write_config_dword(0, 0x80000000 | __LINE__);
    mdelay(1);
    pci_write_config_dword(0, __LINE__);

...at various points in the code. This way I can see precisely what
was the most recent PCIe transaction before the crash.

I'm not super familiar with PCIe, but what I think I'm seeing is:

- the firmware does not need to be loaded yet; sometimes I can crash
  it just by doing a cold reset right at driver load time. So the good
  news is, the firmware code is not related.

- the crash is always in ath10k_pci_device_reset

- there are definitely some missing memory barriers in here; in a few
  cases you can clearly see a write getting done before the read that
  came before it. Looking at the definitions for iowrite32 and
  ioread32, and for rmb() and wmb(), we can see that the use of rmb()
  and wmb() do not work properly (at least on ARM) when you care about
  the ordering between reads and writes. However, I don't think this
  actually causes the problem.

- the crash happens after writing the 1 to SOC_GLOBAL_RESET_ADDRESS.
  The write gets an ACK from the device, so there are no interrupted
  PCIe transactions.

- after writing that 1, the PCI bus is fine for ~272 usec. I can see
  the first pci_write_config_dword in my macro above, but it crashes
  during the mdelay(1) and the second pci_write_config_dword doesn't
  go through.

- ~272 usec after the write, I see TS1 packets getting transmitted at
  maximum speed in both directions. Does this mean the connection is
  retraining?

- 50 usec after the first TS1 packet (a surprisingly precise number),
  I see an EIOS packet sent in the downstream direction. After that,
  they appear every 25 usec. However, they *all* show invalid parity
  bits according to the PCI analyzer.

Does this ring a bell for anyone? I think I can also export the
traces as csv in case someone wants to look at them.

Thanks!

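The begin/end marker trick from the message above can be sketched in userspace C to show how a capture would be decoded. This is an illustration under assumptions, not the macro that was actually used: mock_config_write() and mock_delay() are hypothetical stand-ins for pci_write_config_dword() and mdelay(1), struct bus_log stands in for the analyzer capture, and TRACE_MARK and last_marker() are names invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MARK_BEGIN 0x80000000u   /* top bit set: "about to pass this line" */
#define LOG_MAX 16

/* Stand-in for the bus analyzer capture: every "config write" lands here. */
struct bus_log {
    uint32_t val[LOG_MAX];
    size_t count;
};

/* Hypothetical stand-in for pci_write_config_dword(); it only records. */
static void mock_config_write(struct bus_log *log, uint32_t val)
{
    assert(log->count < LOG_MAX);
    log->val[log->count++] = val;
}

/* Hypothetical stand-in for mdelay(1); a real driver would wait here. */
static void mock_delay(void)
{
}

/* Emit a begin marker, wait, then emit the matching end marker. */
#define TRACE_MARK(log)                                             \
    do {                                                            \
        mock_config_write((log), MARK_BEGIN | (uint32_t)__LINE__);  \
        mock_delay();                                               \
        mock_config_write((log), (uint32_t)__LINE__);               \
    } while (0)

/* Decode a capture: returns the source line of the last marker seen and
 * sets *completed to 1 if its end half was captured too, 0 if the capture
 * stops after the begin half. Returns 0 for an empty capture. */
static uint32_t last_marker(const struct bus_log *log, int *completed)
{
    uint32_t last;

    *completed = 0;
    if (log->count == 0)
        return 0;
    last = log->val[log->count - 1];
    if (last & MARK_BEGIN)
        return last & ~MARK_BEGIN;
    *completed = 1;
    return last;
}
```

A capture that ends on a value with the top bit set corresponds to the situation reported in the message: the first write of the pair made it onto the bus, and the link went down before the second one.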
* Missing memory barriers
From: Kalle Valo @ 2014-02-27 15:48 UTC
To: Avery Pennarun; +Cc: Adrian Chadd, ath10k

Hi Avery,

starting a new thread about memory barriers:

Avery Pennarun <apenwarr@gmail.com> writes:

> On Wed, Jan 29, 2014 at 9:41 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
>
> - there are definitely some missing memory barriers in here; in a few
> cases you can clearly see a write getting done before the read that
> came before it. Looking at the definitions for iowrite32 and
> ioread32, and for rmb() and wmb(), we can see that the use of rmb()
> and wmb() do not work properly (at least on ARM) when you care about
> the ordering between reads and writes. However, I don't think this
> actually causes the problem.

Can you tell more about this, please? Did you find out where we are
actually doing it wrong?

-- 
Kalle Valo

* Re: Missing memory barriers 2014-02-27 15:48 ` Missing memory barriers Kalle Valo @ 2014-02-28 6:10 ` Avery Pennarun 2014-03-06 13:34 ` Kalle Valo 0 siblings, 1 reply; 32+ messages in thread From: Avery Pennarun @ 2014-02-28 6:10 UTC (permalink / raw) To: Kalle Valo; +Cc: Adrian Chadd, ath10k On Thu, Feb 27, 2014 at 7:48 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote: > Avery Pennarun <apenwarr@gmail.com> writes: >> - there are definitely some missing memory barriers in here; in a few >> cases you can clearly see a write getting done before the read that >> came before it. Looking at the definitions for iowrite32 and >> ioread32, and for rmb() and wmb(), we can see that the use of rmb() >> and wmb() do not work properly (at least on ARM) when you care about >> the ordering between reads and writes. However, I don't think this >> actually causes the problem. > > Can you tell more about this, please? Did you find out where we are > actually doing it wrong? Sure. It's been a while since I wrote the above and it was with an older version of the ath10k driver, but basically what happens is as follows. First look in arch/arm/include/asm/system.h. Note that in various situations, rmb() and wmb() may be defined identically or may do slightly different things. On the architecture I'm using, I'm fairly sure they are defined identically. Results may be slightly different if they are not. Next look at arch/arm/include/asm/io.h at the definitions for iowrite32 and ioread32. In pseudocode they are roughly like this: iowrite32: wmb(); write(); ioread32: read(); rmb(); So for example in ath10k_pci_device_reset (or ath10k_pci_cold_reset if you have that set of patches), there is a code sequence that looks something like this: - write the reset register - loop: - read the status I noticed that when the device crashes the PCIe bus due to voltage problems, writing the reset register is not what causes the PCIe host to notice an abort - it's reading the status afterward. 
While investigating this, I hooked up a PCIe bus analyzer and found that it was reading the status *before* writing the reset. That's because the above expands out to: wmb() write(reset) read(status) rmb() which the CPU or compiler is free to reorder as: wmb() read(status) write(reset) rmb() And in fact, it *wants* to do this reordering because there is a conditional that depends on the result of read(status), so with the reordering, the CPU pipeline can think about that conditional while executing write(reset) in parallel. And this is indeed what happens, according to my PCIe bus trace. The way the ARM iowrite/ioread barriers are set up, it works as expected when you read before writing, but the first read after a write can be reordered. If you want to be careful about this, I think you'd have to insert an extra barrier by hand. The bad news is that, while inserting the extra barrier did clean up my bus trace, it didn't fix the underlying problem. When the chip dies due to this cold reset operation, the inability to read the status register is only a symptom, not the cause. In the end it's harmless that we end up doing the first read before the write operation finishes. What happens isn't what the code says, but I don't think that matters in this case. (I think my main line of investigation at the time was in the first read after the command was sent to take the chip *out* of reset. I thought maybe reading while it was in reset was the underlying cause of the abort. No such luck.) Hope this helps. Have fun, Avery _______________________________________________ ath10k mailing list ath10k@lists.infradead.org http://lists.infradead.org/mailman/listinfo/ath10k ^ permalink raw reply [flat|nested] 32+ messages in thread
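The ordering problem described in the message above can be modeled in plain C. This is a userspace sketch, not driver code: `fake_iowrite32`, `fake_ioread32`, `fake_regs`, `REG_RESET` and `REG_STATUS` are hypothetical stand-ins, and C11 `atomic_thread_fence` is assumed as a rough substitute for the kernel's `wmb()`, `rmb()` and `mb()`. It only shows where an extra full barrier would sit between the reset write and the first status read; it cannot reproduce the hardware reordering itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical register file standing in for device MMIO space. */
enum { REG_RESET = 0, REG_STATUS = 1, REG_COUNT = 2 };
static volatile uint32_t fake_regs[REG_COUNT];

/* Stand-ins for the kernel barriers (assumption: one fence per barrier). */
static void fake_wmb(void) { atomic_thread_fence(memory_order_release); }
static void fake_rmb(void) { atomic_thread_fence(memory_order_acquire); }
static void fake_mb(void)  { atomic_thread_fence(memory_order_seq_cst); }

/* Mirrors the pseudocode in the message: barrier first, then the write. */
static void fake_iowrite32(uint32_t val, int reg)
{
	assert(reg >= 0 && reg < REG_COUNT);
	fake_wmb();
	fake_regs[reg] = val;
}

/* Mirrors the pseudocode in the message: the read first, then the barrier. */
static uint32_t fake_ioread32(int reg)
{
	uint32_t val;

	assert(reg >= 0 && reg < REG_COUNT);
	val = fake_regs[reg];
	fake_rmb();
	return val;
}

/*
 * Set the reset bit, then read the status register. Neither accessor
 * places a barrier between the write and the read that follows it, so
 * a full barrier is inserted by hand.
 */
static uint32_t reset_then_read_status(void)
{
	uint32_t val = fake_ioread32(REG_RESET);

	fake_iowrite32(val | 1, REG_RESET);
	fake_mb();	/* the extra barrier: order the write before the read */
	return fake_ioread32(REG_STATUS);
}
```

Calling `reset_then_read_status()` sets bit 0 of the fake reset register and returns whatever the fake status register holds. As the message notes, adding such a barrier cleaned up the bus trace but did not fix the underlying cold reset failure.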
* Re: Missing memory barriers 2014-02-28 6:10 ` Avery Pennarun @ 2014-03-06 13:34 ` Kalle Valo 0 siblings, 0 replies; 32+ messages in thread From: Kalle Valo @ 2014-03-06 13:34 UTC (permalink / raw) To: Avery Pennarun; +Cc: Adrian Chadd, ath10k

Avery Pennarun <apenwarr@gmail.com> writes:

> On Thu, Feb 27, 2014 at 7:48 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote:
>> Avery Pennarun <apenwarr@gmail.com> writes:
>>> - there are definitely some missing memory barriers in here; in a few
>>> cases you can clearly see a write getting done before the read that
>>> came before it. Looking at the definitions for iowrite32 and
>>> ioread32, and for rmb() and wmb(), we can see that the use of rmb()
>>> and wmb() do not work properly (at least on ARM) when you care about
>>> the ordering between reads and writes. However, I don't think this
>>> actually causes the problem.
>>
>> Can you tell more about this, please? Did you find out where we are
>> actually doing it wrong?
>
> Sure. It's been a while since I wrote the above and it was with an
> older version of the ath10k driver, but basically what happens is as
> follows.

[...]

> The bad news is that, while inserting the extra barrier did clean up
> my bus trace, it didn't fix the underlying problem. When the chip
> dies due to this cold reset operation, the inability to read the
> status register is only a symptom, not the cause. In the end it's
> harmless that we end up doing the first read before the write
> operation finishes. What happens isn't what the code says, but I
> don't think that matters in this case.

Thanks for the excellent write-up, I understand this better now. I wasn't expecting that this would fix the cold reset issue, but these kinds of issues are worth fixing anyway. You never know what kind of bugs they might cause in the future.
--
Kalle Valo
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-02-09 8:00 ` Avery Pennarun 2014-02-27 15:48 ` Missing memory barriers Kalle Valo @ 2014-03-11 7:33 ` Kalle Valo 2014-03-11 7:40 ` Avery Pennarun 2014-03-11 19:01 ` Ben Greear 1 sibling, 2 replies; 32+ messages in thread From: Kalle Valo @ 2014-03-11 7:33 UTC (permalink / raw) To: Avery Pennarun; +Cc: Adrian Chadd, ath10k Avery Pennarun <apenwarr@gmail.com> writes: > On Wed, Jan 29, 2014 at 9:41 PM, Avery Pennarun <apenwarr@gmail.com> wrote: >> Still chasing around some people to get a PCIe bus analyzer set up. > > Okay, I finally managed to get enough parts put together to look at > the PCIe bus. To make things a little more clear, I added a macro > that does essentially: > > pci_write_config_dword(0, 0x80000000 | __LINE__) > mdelay(1); > pci_write_config_dword(0, __LINE__) > > ...at various points in the code. This way I can see precisely what > was the most recent PCIe transaction before the crash. > > I'm not super familiar with PCIe, but what I think I'm seeing is: > > - the firmware does not need to be loaded yet; sometimes I can crash > it just by doing a cold reset right at driver load time. So the good > news is, the firmware code is not related. > > - the crash is always in ath10k_pci_device_reset [...] > Does this ring a bell for anyone? I think I can also export the > traces as csv in case someone wants to look at them. I showed your analysis to an HW engineer and the response I got was "don't do that" (= don't use the cold reset). As you know, we now have a workaround using the warm reset: 00f5482bcd94 ath10k: suspend hardware before reset 9042e17df834 ath10k: refactor suspend/resume functions fc36e3ffcdd0 ath10k: fix device initialization routine Have you tested these? Did they help at all? 
--
Kalle Valo
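The line-number marker trick quoted above can be sketched as a small C macro. This is a userspace illustration only: `fake_cfg_write_dword`, `fake_mdelay`, `trace_log` and `TRACE_MARK` are hypothetical names, the write is recorded in an array instead of going out on a PCIe bus, and the first argument of the quoted pseudocode is assumed to be the config-space offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log standing in for what a bus analyzer would capture. */
#define TRACE_LOG_MAX 16
static uint32_t trace_log[TRACE_LOG_MAX];
static size_t trace_len;

/* Stand-in for a PCI config-space write; here it only records the value. */
static void fake_cfg_write_dword(int where, uint32_t val)
{
	(void)where;
	assert(trace_len < TRACE_LOG_MAX);
	trace_log[trace_len++] = val;
}

/* Stand-in for mdelay(); a real marker would wait about a millisecond. */
static void fake_mdelay(unsigned int ms)
{
	(void)ms;
}

/*
 * Emit two config writes tagged with the current source line: one with
 * bit 31 set and one without, so the pair is easy to spot in a trace.
 */
#define TRACE_MARK() do {						\
	fake_cfg_write_dword(0, 0x80000000u | (uint32_t)__LINE__);	\
	fake_mdelay(1);							\
	fake_cfg_write_dword(0, (uint32_t)__LINE__);			\
} while (0)

/* Recover the source line number from a logged marker value. */
static uint32_t trace_line(uint32_t marker)
{
	return marker & 0x7fffffffu;
}
```

Dropping `TRACE_MARK();` between suspect statements leaves a pair of recognizable values in the log, and `trace_line()` maps either value back to the source line that emitted it.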
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-03-11 7:33 ` ath10k driver crashes whenever firmware crashes on ARM SoC Kalle Valo @ 2014-03-11 7:40 ` Avery Pennarun 2014-03-11 7:52 ` Adrian Chadd 2014-03-11 8:10 ` Kalle Valo 2014-03-11 19:01 ` Ben Greear 1 sibling, 2 replies; 32+ messages in thread From: Avery Pennarun @ 2014-03-11 7:40 UTC (permalink / raw) To: Kalle Valo; +Cc: Adrian Chadd, ath10k On Tue, Mar 11, 2014 at 2:33 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote: > Avery Pennarun <apenwarr@gmail.com> writes: >> On Wed, Jan 29, 2014 at 9:41 PM, Avery Pennarun <apenwarr@gmail.com> wrote: >>> Still chasing around some people to get a PCIe bus analyzer set up. >> >> Okay, I finally managed to get enough parts put together to look at >> the PCIe bus. To make things a little more clear, I added a macro >> that does essentially: >> >> pci_write_config_dword(0, 0x80000000 | __LINE__) >> mdelay(1); >> pci_write_config_dword(0, __LINE__) >> >> ...at various points in the code. This way I can see precisely what >> was the most recent PCIe transaction before the crash. >> >> I'm not super familiar with PCIe, but what I think I'm seeing is: >> >> - the firmware does not need to be loaded yet; sometimes I can crash >> it just by doing a cold reset right at driver load time. So the good >> news is, the firmware code is not related. >> >> - the crash is always in ath10k_pci_device_reset > > [...] > >> Does this ring a bell for anyone? I think I can also export the >> traces as csv in case someone wants to look at them. > > I showed your analysis to an HW engineer and the response I got was > "don't do that" (= don't use the cold reset). As you know, we now have a > workaround using the warm reset: > > 00f5482bcd94 ath10k: suspend hardware before reset > 9042e17df834 ath10k: refactor suspend/resume functions > fc36e3ffcdd0 ath10k: fix device initialization routine > > Have you tested these? Did they help at all? 
Yes, I've tested them and they help, mainly by doing the cold reset less often. However, when the firmware hard crashes in certain ways (for example, using my original test case), it looks like warm reset can't fix that. The driver then still must fall back to cold reset and, some fairly large percentage of the time (1/3rd?), crashes the bus. We do have a separate reset line controlled by a GPIO. Using that crashes the SoC's PCIe host implementation (whee!). But I got help from the SoC manufacturer and was able to get some instructions for resetting their PCIe host controller. When I do all the magic incantations in the right order, the system can recover, albeit with a fully reset ath10k chip. This workaround is unfortunately specific to the host device platform so it won't do you much good. Of course, a good way to avoid the problem is "don't crash the firmware then," but that's not as robust as I'd like. This box is doing quite a few things, so rebooting to fix a problem on one of the wireless cards is pretty expensive. Nevertheless, the warm reset changes really do reduce the frequency of this a lot, to the point where my workaround is almost never needed. Thanks for that! Have fun, Avery _______________________________________________ ath10k mailing list ath10k@lists.infradead.org http://lists.infradead.org/mailman/listinfo/ath10k ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-03-11 7:40 ` Avery Pennarun @ 2014-03-11 7:52 ` Adrian Chadd 2014-03-11 7:59 ` Avery Pennarun 2014-03-11 8:13 ` Kalle Valo 2014-03-11 8:10 ` Kalle Valo 1 sibling, 2 replies; 32+ messages in thread From: Adrian Chadd @ 2014-03-11 7:52 UTC (permalink / raw) To: Avery Pennarun; +Cc: Kalle Valo, ath10k ... it's not a complete loss! This to me says "we need a hook from the driver to call the host "reset the bus" thing". We also kinda need it for ath9k/ath5k (if it's not there) so AHB attached things can be reset by actually poking an SoC reset register. -a On 11 March 2014 00:40, Avery Pennarun <apenwarr@gmail.com> wrote: > On Tue, Mar 11, 2014 at 2:33 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote: >> Avery Pennarun <apenwarr@gmail.com> writes: >>> On Wed, Jan 29, 2014 at 9:41 PM, Avery Pennarun <apenwarr@gmail.com> wrote: >>>> Still chasing around some people to get a PCIe bus analyzer set up. >>> >>> Okay, I finally managed to get enough parts put together to look at >>> the PCIe bus. To make things a little more clear, I added a macro >>> that does essentially: >>> >>> pci_write_config_dword(0, 0x80000000 | __LINE__) >>> mdelay(1); >>> pci_write_config_dword(0, __LINE__) >>> >>> ...at various points in the code. This way I can see precisely what >>> was the most recent PCIe transaction before the crash. >>> >>> I'm not super familiar with PCIe, but what I think I'm seeing is: >>> >>> - the firmware does not need to be loaded yet; sometimes I can crash >>> it just by doing a cold reset right at driver load time. So the good >>> news is, the firmware code is not related. >>> >>> - the crash is always in ath10k_pci_device_reset >> >> [...] >> >>> Does this ring a bell for anyone? I think I can also export the >>> traces as csv in case someone wants to look at them. >> >> I showed your analysis to an HW engineer and the response I got was >> "don't do that" (= don't use the cold reset). 
As you know, we now have a >> workaround using the warm reset: >> >> 00f5482bcd94 ath10k: suspend hardware before reset >> 9042e17df834 ath10k: refactor suspend/resume functions >> fc36e3ffcdd0 ath10k: fix device initialization routine >> >> Have you tested these? Did they help at all? > > Yes, I've tested them and they help, mainly by doing the cold reset > less often. However, when the firmware hard crashes in certain ways > (for example, using my original test case), it looks like warm reset > can't fix that. The driver then still must fall back to cold reset > and, some fairly large percentage of the time (1/3rd?), crashes the > bus. > > We do have a separate reset line controlled by a GPIO. Using that > crashes the SoC's PCIe host implementation (whee!). But I got help > from the SoC manufacturer and was able to get some instructions for > resetting their PCIe host controller. When I do all the magic > incantations in the right order, the system can recover, albeit with a > fully reset ath10k chip. This workaround is unfortunately specific to > the host device platform so it won't do you much good. > > Of course, a good way to avoid the problem is "don't crash the > firmware then," but that's not as robust as I'd like. This box is > doing quite a few things, so rebooting to fix a problem on one of the > wireless cards is pretty expensive. > > Nevertheless, the warm reset changes really do reduce the frequency of > this a lot, to the point where my workaround is almost never needed. > Thanks for that! > > Have fun, > > Avery _______________________________________________ ath10k mailing list ath10k@lists.infradead.org http://lists.infradead.org/mailman/listinfo/ath10k ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-03-11 7:52 ` Adrian Chadd @ 2014-03-11 7:59 ` Avery Pennarun 2014-03-11 8:13 ` Kalle Valo 1 sibling, 0 replies; 32+ messages in thread From: Avery Pennarun @ 2014-03-11 7:59 UTC (permalink / raw) To: Adrian Chadd; +Cc: Kalle Valo, ath10k On Tue, Mar 11, 2014 at 2:52 AM, Adrian Chadd <adrian@freebsd.org> wrote: > ... it's not a complete loss! > > This to me says "we need a hook from the driver to call the host > "reset the bus" thing". > > We also kinda need it for ath9k/ath5k (if it's not there) so AHB > attached things can be reset by actually poking an SoC reset register. That would be awesome. It would make my hack feel much less hacky :) Have fun, Avery _______________________________________________ ath10k mailing list ath10k@lists.infradead.org http://lists.infradead.org/mailman/listinfo/ath10k ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-03-11 7:52 ` Adrian Chadd 2014-03-11 7:59 ` Avery Pennarun @ 2014-03-11 8:13 ` Kalle Valo 2014-03-11 8:37 ` Michal Kazior 1 sibling, 1 reply; 32+ messages in thread From: Kalle Valo @ 2014-03-11 8:13 UTC (permalink / raw) To: Adrian Chadd; +Cc: ath10k, Avery Pennarun (Fixing top posting) Adrian Chadd <adrian@freebsd.org> writes: > On 11 March 2014 00:40, Avery Pennarun <apenwarr@gmail.com> wrote: > >> We do have a separate reset line controlled by a GPIO. Using that >> crashes the SoC's PCIe host implementation (whee!). But I got help >> from the SoC manufacturer and was able to get some instructions for >> resetting their PCIe host controller. When I do all the magic >> incantations in the right order, the system can recover, albeit with a >> fully reset ath10k chip. This workaround is unfortunately specific to >> the host device platform so it won't do you much good. > > ... it's not a complete loss! > > This to me says "we need a hook from the driver to call the host > "reset the bus" thing". > > We also kinda need it for ath9k/ath5k (if it's not there) so AHB > attached things can be reset by actually poking an SoC reset register. Yeah, that kind of hook would be good to have. -- Kalle Valo _______________________________________________ ath10k mailing list ath10k@lists.infradead.org http://lists.infradead.org/mailman/listinfo/ath10k ^ permalink raw reply [flat|nested] 32+ messages in thread
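The "hook from the driver to the host" idea discussed above could look roughly like the following. This is a sketch under stated assumptions, not an existing ath10k or PCI core interface: `struct reset_ops`, `reset_escalate` and the stub callbacks are all hypothetical. It also assumes each reset step can report failure cleanly, whereas the thread describes the cold reset taking the bus down rather than returning an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device reset operations; none of these exist in ath10k. */
struct reset_ops {
	bool (*warm_reset)(void *ctx);
	bool (*cold_reset)(void *ctx);
	/* Optional platform hook: reset the host bus or controller itself. */
	bool (*host_bus_reset)(void *ctx);
};

enum reset_result {
	RESET_WARM_OK,
	RESET_COLD_OK,
	RESET_HOST_OK,
	RESET_FAILED,
};

/* Try the least disruptive reset first and escalate only on failure. */
static enum reset_result reset_escalate(const struct reset_ops *ops, void *ctx)
{
	assert(ops != NULL);

	if (ops->warm_reset && ops->warm_reset(ctx))
		return RESET_WARM_OK;
	if (ops->cold_reset && ops->cold_reset(ctx))
		return RESET_COLD_OK;
	if (ops->host_bus_reset && ops->host_bus_reset(ctx))
		return RESET_HOST_OK;
	return RESET_FAILED;
}

/* Example callbacks for illustration: count calls, then fail or succeed. */
static bool stub_fail(void *ctx)
{
	(*(int *)ctx)++;
	return false;
}

static bool stub_ok(void *ctx)
{
	(*(int *)ctx)++;
	return true;
}
```

Leaving `host_bus_reset` as `NULL` models a platform with no way to reset its PCIe host controller; in that case the escalation stops after the cold reset and reports `RESET_FAILED`.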
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-03-11 8:13 ` Kalle Valo @ 2014-03-11 8:37 ` Michal Kazior 0 siblings, 0 replies; 32+ messages in thread From: Michal Kazior @ 2014-03-11 8:37 UTC (permalink / raw) To: Kalle Valo; +Cc: Adrian Chadd, ath10k, Avery Pennarun On 11 March 2014 09:13, Kalle Valo <kvalo@qca.qualcomm.com> wrote: > (Fixing top posting) > > Adrian Chadd <adrian@freebsd.org> writes: > >> On 11 March 2014 00:40, Avery Pennarun <apenwarr@gmail.com> wrote: >> >>> We do have a separate reset line controlled by a GPIO. Using that >>> crashes the SoC's PCIe host implementation (whee!). But I got help >>> from the SoC manufacturer and was able to get some instructions for >>> resetting their PCIe host controller. When I do all the magic >>> incantations in the right order, the system can recover, albeit with a >>> fully reset ath10k chip. This workaround is unfortunately specific to >>> the host device platform so it won't do you much good. >> >> ... it's not a complete loss! >> >> This to me says "we need a hook from the driver to call the host >> "reset the bus" thing". >> >> We also kinda need it for ath9k/ath5k (if it's not there) so AHB >> attached things can be reset by actually poking an SoC reset register. > > Yeah, that kind of hook would be good to have. There is PCI error recovery in kernel (Documentation/PCI/pci-error-recovery.txt) but I think it's only implemented on ppc. I wonder if you could try hooking up with that? Michał _______________________________________________ ath10k mailing list ath10k@lists.infradead.org http://lists.infradead.org/mailman/listinfo/ath10k ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-03-11 7:40 ` Avery Pennarun 2014-03-11 7:52 ` Adrian Chadd @ 2014-03-11 8:10 ` Kalle Valo 1 sibling, 0 replies; 32+ messages in thread From: Kalle Valo @ 2014-03-11 8:10 UTC (permalink / raw) To: Avery Pennarun; +Cc: Adrian Chadd, ath10k Avery Pennarun <apenwarr@gmail.com> writes: > On Tue, Mar 11, 2014 at 2:33 AM, Kalle Valo <kvalo@qca.qualcomm.com> wrote: > >> I showed your analysis to an HW engineer and the response I got was >> "don't do that" (= don't use the cold reset). As you know, we now have a >> workaround using the warm reset: >> >> 00f5482bcd94 ath10k: suspend hardware before reset >> 9042e17df834 ath10k: refactor suspend/resume functions >> fc36e3ffcdd0 ath10k: fix device initialization routine >> >> Have you tested these? Did they help at all? > > Yes, I've tested them and they help, mainly by doing the cold reset > less often. However, when the firmware hard crashes in certain ways > (for example, using my original test case), it looks like warm reset > can't fix that. The driver then still must fall back to cold reset > and, some fairly large percentage of the time (1/3rd?), crashes the > bus. Ok, thanks. I'll investigate more about the warm reset problems and try to find ways to make it more reliable. > We do have a separate reset line controlled by a GPIO. Using that > crashes the SoC's PCIe host implementation (whee!). But I got help > from the SoC manufacturer and was able to get some instructions for > resetting their PCIe host controller. When I do all the magic > incantations in the right order, the system can recover, albeit with a > fully reset ath10k chip. This workaround is unfortunately specific to > the host device platform so it won't do you much good. > > Of course, a good way to avoid the problem is "don't crash the > firmware then," but that's not as robust as I'd like. 
I never trust the firmware, in any device, and that's why I would like ath10k to have a 100% reliable way to restart it from the host.

> This box is doing quite a few things, so rebooting to fix a problem on
> one of the wireless cards is pretty expensive.

Yeah, that would be really bad. Restarting the firmware takes something like 1-2 seconds and the user would only notice a short pause in data traffic, which is much better than rebooting the whole box.

> Nevertheless, the warm reset changes really do reduce the frequency of
> this a lot, to the point where my workaround is almost never needed.
> Thanks for that!

Great, thanks for the feedback.

--
Kalle Valo
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-03-11 7:33 ` ath10k driver crashes whenever firmware crashes on ARM SoC Kalle Valo 2014-03-11 7:40 ` Avery Pennarun @ 2014-03-11 19:01 ` Ben Greear 2014-03-12 8:22 ` Kalle Valo 1 sibling, 1 reply; 32+ messages in thread From: Ben Greear @ 2014-03-11 19:01 UTC (permalink / raw) To: Kalle Valo; +Cc: Adrian Chadd, ath10k, Avery Pennarun

On 03/11/2014 12:33 AM, Kalle Valo wrote:
>> Does this ring a bell for anyone? I think I can also export the
>> traces as csv in case someone wants to look at them.
>
> I showed your analysis to an HW engineer and the response I got was
> "don't do that" (= don't use the cold reset).

Earlier (1/22/14), you said a cold reset problem in CUS223 was due primarily to a bad piece of hardware (or bad design, or something) in the CUS223.

I guess the problem must extend to other NICs as well, and maybe there are other causes for the hangs?

Is there anything we can do to help debug the problem with warm resets failing to work properly in some cases?

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-03-11 19:01 ` Ben Greear @ 2014-03-12 8:22 ` Kalle Valo 2014-03-12 16:01 ` Ben Greear 0 siblings, 1 reply; 32+ messages in thread From: Kalle Valo @ 2014-03-12 8:22 UTC (permalink / raw) To: Ben Greear; +Cc: Adrian Chadd, ath10k, Avery Pennarun

Ben Greear <greearb@candelatech.com> writes:

> On 03/11/2014 12:33 AM, Kalle Valo wrote:
>>> Does this ring a bell for anyone? I think I can also export the
>>> traces as csv in case someone wants to look at them.
>>
>> I showed your analysis to an HW engineer and the response I got was
>> "don't do that" (= don't use the cold reset).
>
> Earlier (1/22/14), you said a cold reset problem in CUS223 was due primarily to
> a bad piece of hardware (or bad design, or something) in the CUS223.
>
> I guess the problem must extend to other NICs as well, and
> maybe other reasons for causing the hangs?

I don't know the details, but I understood this only happens on CUS223. If someone has seen these cold reset problems on other boards, like XB140, I would be very keen to hear about it.

> Anything we can do to help debug the problem with warm resets
> failing to work properly in some cases?

Thanks, but right now there isn't anything. Once I have new ideas for the warm reset problems, I'll post them here. Testing those would be very useful.

--
Kalle Valo
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-03-12 8:22 ` Kalle Valo @ 2014-03-12 16:01 ` Ben Greear 2014-03-12 23:28 ` Avery Pennarun 0 siblings, 1 reply; 32+ messages in thread From: Ben Greear @ 2014-03-12 16:01 UTC (permalink / raw) To: Kalle Valo; +Cc: Adrian Chadd, ath10k, Avery Pennarun

On 03/12/2014 01:22 AM, Kalle Valo wrote:
> Ben Greear <greearb@candelatech.com> writes:
>
>> On 03/11/2014 12:33 AM, Kalle Valo wrote:
>>>> Does this ring a bell for anyone? I think I can also export the
>>>> traces as csv in case someone wants to look at them.
>>>
>>> I showed your analysis to an HW engineer and the response I got was
>>> "don't do that" (= don't use the cold reset).
>>
>> Earlier (1/22/14), you said a cold reset problem in CUS223 was due primarily to
>> a bad piece of hardware (or bad design, or something) in the CUS223.
>>
>> I guess the problem must extend to other NICs as well, and
>> maybe other reasons for causing the hangs?
>
> I don't know the details, but I understood this only happens on CUS223.
> If someone has seen these cold reset problems on other boards, like
> XB140, I would be very keen to hear about it.

Come to think of it, I'm not sure I've seen a hard lockup of the machine on non-CUS223 machines, but I have seen cases where a non-CUS223 NIC wedges and the existing restart does not recover it.

This requires a reboot to recover from.

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-03-12 16:01 ` Ben Greear @ 2014-03-12 23:28 ` Avery Pennarun 2014-03-13 5:09 ` Kalle Valo 0 siblings, 1 reply; 32+ messages in thread From: Avery Pennarun @ 2014-03-12 23:28 UTC (permalink / raw) To: Ben Greear; +Cc: Kalle Valo, Adrian Chadd, ath10k

On Wed, Mar 12, 2014 at 11:01 AM, Ben Greear <greearb@candelatech.com> wrote:
> Come to think of it, I'm not sure I've seen a hard lockup of the machine
> on non CUS223 machines, but I have seen cases where non CUS223 NIC wedges
> and the existing restart does not recover it.
>
> This requires a reboot to recover from.

That's not so surprising; the current driver doesn't do the cold reset, which is the only thing that can recover from a wedge, I guess because of the CUS223 bug.

Stupid question: can I honestly just buy a different module and make my PCIe crashiness problems go away? Is there a reason to prefer the CUS223?

Another question: is there perhaps anything the firmware can do to, e.g., set a watchdog timer, so that the internal CPU will restart (go back to "waiting for firmware" mode) if it doesn't answer for a while? The idea would be for the device to un-wedge itself even if there's nothing we can do to fix it from outside.

Thanks,

Avery
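The watchdog idea raised above can be illustrated with a small state machine. This is purely a sketch of the concept: `struct fw_watchdog`, `wd_start`, `wd_kick` and `wd_tick` are hypothetical, and nothing here describes how the real firmware or hardware behaves or whether such a timer is available on the chip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum fw_state { FW_WAITING_FOR_FIRMWARE, FW_RUNNING };

/* Hypothetical device-side watchdog; not a description of real firmware. */
struct fw_watchdog {
	enum fw_state state;
	uint32_t timeout_ticks;	/* ticks allowed between responses */
	uint32_t idle_ticks;	/* ticks since the firmware last responded */
};

/* Arm the watchdog when the firmware starts running. */
static void wd_start(struct fw_watchdog *wd, uint32_t timeout_ticks)
{
	assert(timeout_ticks > 0);
	wd->state = FW_RUNNING;
	wd->timeout_ticks = timeout_ticks;
	wd->idle_ticks = 0;
}

/* Called whenever the firmware answers the host. */
static void wd_kick(struct fw_watchdog *wd)
{
	wd->idle_ticks = 0;
}

/*
 * Called once per timer tick. Returns true if the watchdog fired and
 * put the device back into the "waiting for firmware" state.
 */
static bool wd_tick(struct fw_watchdog *wd)
{
	if (wd->state != FW_RUNNING)
		return false;
	if (++wd->idle_ticks < wd->timeout_ticks)
		return false;
	wd->state = FW_WAITING_FOR_FIRMWARE;
	wd->idle_ticks = 0;
	return true;
}
```

As long as `wd_kick()` is called more often than the timeout, `wd_tick()` never fires; once the firmware stops answering, the device drops back to a state where the host could load firmware again without a cold reset.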
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-03-12 23:28 ` Avery Pennarun @ 2014-03-13 5:09 ` Kalle Valo 2014-03-13 17:34 ` Adrian Chadd 0 siblings, 1 reply; 32+ messages in thread From: Kalle Valo @ 2014-03-13 5:09 UTC (permalink / raw) To: Avery Pennarun; +Cc: Ben Greear, Adrian Chadd, ath10k Avery Pennarun <apenwarr@gmail.com> writes: > On Wed, Mar 12, 2014 at 11:01 AM, Ben Greear <greearb@candelatech.com> wrote: >> Come to think of it, I'm not sure I've seen a hard lockup of the machine >> on non CUS223 machines, but I have seen cases where non CUS223 NIC wedges >> and the existing restart does not recover it. >> >> This requires a reboot to recover from. > > That's not so surprising; the current driver doesn't do the cold reset > that is the only thing you can recover from a wedge, I guess because > of the CUS223 bug. > > Stupid question: can I honestly just buy a different module and make > my PCIe crashiness problems go away? That's my theory, but I would like to confirm that somehow. > Is there a reason do prefer the CUS223? Good question, I haven't figured out that. CUS223 board is physically larger than XB140, for example, and I would assume there's a reason for that. > Another question: is there perhaps anything the firmware can do to eg. > set a watchdog timer, so that the internal CPU will restart (go back > to "waiting for firmware" mode) if it doesn't answer for a while? The > idea would be for the device to un-wedge itself even if there's > nothing we can do to fix it from outside. That's something a firmware engineer should comment on, which I'm not. -- Kalle Valo _______________________________________________ ath10k mailing list ath10k@lists.infradead.org http://lists.infradead.org/mailman/listinfo/ath10k ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-03-13 5:09 ` Kalle Valo @ 2014-03-13 17:34 ` Adrian Chadd 2014-03-13 17:39 ` Kalle Valo 2014-03-13 17:42 ` Ben Greear 0 siblings, 2 replies; 32+ messages in thread From: Adrian Chadd @ 2014-03-13 17:34 UTC (permalink / raw) To: Kalle Valo; +Cc: Ben Greear, ath10k, Avery Pennarun I think the CUS223 has higher transmit power, right? -a On 12 March 2014 22:09, Kalle Valo <kvalo@qca.qualcomm.com> wrote: > Avery Pennarun <apenwarr@gmail.com> writes: > >> On Wed, Mar 12, 2014 at 11:01 AM, Ben Greear <greearb@candelatech.com> wrote: >>> Come to think of it, I'm not sure I've seen a hard lockup of the machine >>> on non CUS223 machines, but I have seen cases where non CUS223 NIC wedges >>> and the existing restart does not recover it. >>> >>> This requires a reboot to recover from. >> >> That's not so surprising; the current driver doesn't do the cold reset >> that is the only thing you can recover from a wedge, I guess because >> of the CUS223 bug. >> >> Stupid question: can I honestly just buy a different module and make >> my PCIe crashiness problems go away? > > That's my theory, but I would like to confirm that somehow. > >> Is there a reason do prefer the CUS223? > > Good question, I haven't figured out that. CUS223 board is physically > larger than XB140, for example, and I would assume there's a reason for > that. > >> Another question: is there perhaps anything the firmware can do to eg. >> set a watchdog timer, so that the internal CPU will restart (go back >> to "waiting for firmware" mode) if it doesn't answer for a while? The >> idea would be for the device to un-wedge itself even if there's >> nothing we can do to fix it from outside. > > That's something a firmware engineer should comment on, which I'm not. 
>
> --
> Kalle Valo
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-03-13 17:34 ` Adrian Chadd @ 2014-03-13 17:39 ` Kalle Valo 2014-03-13 17:42 ` Ben Greear 1 sibling, 0 replies; 32+ messages in thread From: Kalle Valo @ 2014-03-13 17:39 UTC (permalink / raw) To: Adrian Chadd; +Cc: Ben Greear, ath10k, Avery Pennarun Adrian Chadd <adrian@freebsd.org> writes: > I think the CUS223 has higher transmit power, right? That's true. I was told CUS223 uses external PA which makes it possible to use higher transmit power. XB143 uses FEM and hence the power is less than on CUS223. -- Kalle Valo _______________________________________________ ath10k mailing list ath10k@lists.infradead.org http://lists.infradead.org/mailman/listinfo/ath10k ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-03-13 17:34 ` Adrian Chadd 2014-03-13 17:39 ` Kalle Valo @ 2014-03-13 17:42 ` Ben Greear 2014-03-14 6:26 ` Kalle Valo 1 sibling, 1 reply; 32+ messages in thread From: Ben Greear @ 2014-03-13 17:42 UTC (permalink / raw) To: Adrian Chadd; +Cc: Kalle Valo, ath10k, Avery Pennarun

On 03/13/2014 10:34 AM, Adrian Chadd wrote:
> I think the CUS223 has higher transmit power, right?
>

Yes. Aside from that, I have not noticed any significant differences. Throughput is generally the same as the WLE900VX I have been testing.

Also, I am not sure CUS223 is commercially available for purchase in small-ish quantities, but perhaps I just didn't look in the right place at the right time.

>>> Another question: is there perhaps anything the firmware can do to eg.
>>> set a watchdog timer, so that the internal CPU will restart (go back
>>> to "waiting for firmware" mode) if it doesn't answer for a while? The
>>> idea would be for the device to un-wedge itself even if there's
>>> nothing we can do to fix it from outside.
>>
>> That's something a firmware engineer should comment on, which I'm not.

Kalle: Can you feed that question to your firmware contacts at QCA? I'm not sure I should be publicly speculating on such matters :)

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: ath10k driver crashes whenever firmware crashes on ARM SoC 2014-03-13 17:42 ` Ben Greear @ 2014-03-14 6:26 ` Kalle Valo 0 siblings, 0 replies; 32+ messages in thread From: Kalle Valo @ 2014-03-14 6:26 UTC (permalink / raw) To: Ben Greear; +Cc: Adrian Chadd, ath10k, Avery Pennarun Ben Greear <greearb@candelatech.com> writes: >>>> Another question: is there perhaps anything the firmware can do to eg. >>>> set a watchdog timer, so that the internal CPU will restart (go back >>>> to "waiting for firmware" mode) if it doesn't answer for a while? The >>>> idea would be for the device to un-wedge itself even if there's >>>> nothing we can do to fix it from outside. >>> >>> That's something a firmware engineer should comment on, which I'm not. > > Kalle: Can you feed that question to your firmware contacts > at QCA? Sure, I'm trying to solve this warm reset problem and am talking with them anyway. Any ideas on how to solve this are very welcome. -- Kalle Valo
end of thread, other threads:[~2014-03-14 6:27 UTC | newest] Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2014-01-28 17:18 ath10k driver crashes whenever firmware crashes on ARM SoC Avery Pennarun 2014-01-28 18:20 ` Ben Greear 2014-01-28 18:34 ` Avery Pennarun 2014-01-28 19:01 ` Ben Greear 2014-01-28 19:11 ` Avery Pennarun 2014-01-28 20:10 ` Janusz Dziedzic 2014-01-28 20:51 ` Avery Pennarun 2014-01-29 16:44 ` Kalle Valo [not found] ` <CAJ-VmokorbJ2iU4rGNYdRj+A22NR9cV-5h-tDN0pD2FCurZDpA@mail.gmail.com> 2014-01-28 20:55 ` Avery Pennarun 2014-01-29 16:41 ` Kalle Valo 2014-01-29 18:44 ` Adrian Chadd 2014-01-30 2:41 ` Avery Pennarun 2014-02-09 8:00 ` Avery Pennarun 2014-02-27 15:48 ` Missing memory barriers Kalle Valo 2014-02-28 6:10 ` Avery Pennarun 2014-03-06 13:34 ` Kalle Valo 2014-03-11 7:33 ` ath10k driver crashes whenever firmware crashes on ARM SoC Kalle Valo 2014-03-11 7:40 ` Avery Pennarun 2014-03-11 7:52 ` Adrian Chadd 2014-03-11 7:59 ` Avery Pennarun 2014-03-11 8:13 ` Kalle Valo 2014-03-11 8:37 ` Michal Kazior 2014-03-11 8:10 ` Kalle Valo 2014-03-11 19:01 ` Ben Greear 2014-03-12 8:22 ` Kalle Valo 2014-03-12 16:01 ` Ben Greear 2014-03-12 23:28 ` Avery Pennarun 2014-03-13 5:09 ` Kalle Valo 2014-03-13 17:34 ` Adrian Chadd 2014-03-13 17:39 ` Kalle Valo 2014-03-13 17:42 ` Ben Greear 2014-03-14 6:26 ` Kalle Valo