* question about arm64 exception handling
@ 2018-06-28 11:58 Arend van Spriel
  2018-06-28 16:13 ` James Morse
  0 siblings, 1 reply; 4+ messages in thread
From: Arend van Spriel @ 2018-06-28 11:58 UTC
  To: linux-arm-kernel

Hi Catalin,

I am looking at some issues we are seeing on a custom arm platform and I 
had some questions regarding exception handling and hope you can help me 
with that.

The platform has 4 cortex a53 cores running 4.1.51 kernel. On the 
platform, upon accessing our wifi devices over PCIe, we observe two issues 
intermittently.

1) synchronous external abort
2) NMI watchdog

With the latter I notice el1_irq in the exception stack trace. Is that 
due to the NMI watchdog or is this caused by another issue like our 
crappy hardware ;-) ?

Regarding the SEA I noticed stuff has been added for that in 4.13. Is it 
worth backporting that to obtain more info about that?

Regards,
Arend


* question about arm64 exception handling
  2018-06-28 11:58 question about arm64 exception handling Arend van Spriel
@ 2018-06-28 16:13 ` James Morse
  2018-07-02 10:34   ` Arend van Spriel
  0 siblings, 1 reply; 4+ messages in thread
From: James Morse @ 2018-06-28 16:13 UTC
  To: linux-arm-kernel

Hi Arend,

On 28/06/18 12:58, Arend van Spriel wrote:
> I am looking at some issues we are seeing on a custom arm platform and I had
> some questions regarding exception handling and hope you can help me with that.
> 
> The platform has 4 cortex a53 cores running 4.1.51 kernel.

v4.1 is circa 2015, any chance you could try v4.17?


> On the platform upon
> accessing our wifi devices over PCIe, we observe two issues intermittently.
> 
> 1) synchronous external abort
> 2) NMI watchdog
> 
> With the latter I notice el1_irq in the exception stack trace. Is that due to
> the NMI watchdog or is this caused by another issue like our crappy hardware ;-) ?


> Regarding the SEA I noticed stuff has been added for that in 4.13. Is it worth
> backporting that to obtain more info about that?

Unlikely. Those changes are related to firmware-first RAS. You would see this on
server platforms that boot with ACPI and have a 'HEST' table.

Can you share the backtrace for the Synchronous External Abort? It should point
at the instruction in the (probably) driver that causes the abort.


Thanks,

James


* question about arm64 exception handling
  2018-06-28 16:13 ` James Morse
@ 2018-07-02 10:34   ` Arend van Spriel
  2018-07-02 10:53     ` James Morse
  0 siblings, 1 reply; 4+ messages in thread
From: Arend van Spriel @ 2018-07-02 10:34 UTC
  To: linux-arm-kernel

On 6/28/2018 6:13 PM, James Morse wrote:
> Hi Arend,
>
> On 28/06/18 12:58, Arend van Spriel wrote:
>> I am looking at some issues we are seeing on a custom arm platform and I had
>> some questions regarding exception handling and hope you can help me with that.
>>
>> The platform has 4 cortex a53 cores running 4.1.51 kernel.
>
> v4.1 is circa 2015, any chance you could try v4.17?

Hi James,

If it were up to me, sure. Unfortunately it is not :-(

>> On the platform upon
>> accessing our wifi devices over PCIe, we observe two issues intermittently.
>>
>> 1) synchronous external abort
>> 2) NMI watchdog
>>
>> With the latter I notice el1_irq in the exception stack trace. Is that due to
>> the NMI watchdog or is this caused by another issue like our crappy hardware ;-) ?
>
>
>> Regarding the SEA I noticed stuff has been added for that in 4.13. Is it worth
>> backporting that to obtain more info about that?
>
> Unlikely. Those changes are related to firmware-first RAS. You would see this on
> server platforms that boot with ACPI and have a 'HEST' table.

I see. That is not really what I am working with here.

> Can you share the backtrace for the Synchronous External Abort? It should point
> at the instruction in the (probably) driver that causes the abort.

15:05:24.284  LOG   4908REF2   Unhandled fault: synchronous external 
abort (0x96000210) at 0xffffff80010a4154

So between brackets is the ESR. Looking in the TRM I can only conclude 
it is exception class 0x25, whatever that is. Need to dive into the ARMv8-A 
arch doc. So the address 0xffffff80010a4154 is the fault address. It is not 
in the range of my wireless driver, but it is probably in the PCIe host 
driver accessing the device.
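
As a quick cross-check of the exception class read out of the ESR above, 
here is a minimal sketch that splits the value into its top-level fields. 
It assumes the standard ARMv8-A ESR_ELx layout (EC in bits [31:26], IL in 
bit 25, ISS in bits [24:0]); the function name split_esr is made up for 
illustration and is not a kernel helper.

```python
# Split an AArch64 ESR_ELx value into its top-level fields.
# Assumed layout (ARMv8-A): EC = bits [31:26], IL = bit 25, ISS = bits [24:0].
# split_esr is an illustrative name, not an existing kernel or library API.

def split_esr(esr: int) -> dict:
    return {
        "EC": (esr >> 26) & 0x3F,   # exception class
        "IL": (esr >> 25) & 0x1,    # 1 = 32-bit instruction trapped
        "ISS": esr & 0x1FFFFFF,     # instruction specific syndrome
    }

# Value taken from the log line in this thread.
fields = split_esr(0x96000210)
print({name: hex(value) for name, value in fields.items()})
```

For the logged value this gives EC 0x25, which matches the exception class 
mentioned above, and leaves an ISS of 0x210 to decode further.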

Thanks for the pointers. Digging deeper.

Regards,
Arend


* question about arm64 exception handling
  2018-07-02 10:34   ` Arend van Spriel
@ 2018-07-02 10:53     ` James Morse
  0 siblings, 0 replies; 4+ messages in thread
From: James Morse @ 2018-07-02 10:53 UTC
  To: linux-arm-kernel

Hi Arend,

On 02/07/18 11:34, Arend van Spriel wrote:
> On 6/28/2018 6:13 PM, James Morse wrote:
>> On 28/06/18 12:58, Arend van Spriel wrote:
>>> I am looking at some issues we are seeing on a custom arm platform and I had
>>> some questions regarding exception handling and hope you can help me with that.
>>>
>>> The platform has 4 cortex a53 cores running 4.1.51 kernel.

>>> On the platform upon
>>> accessing our wifi devices over PCIe, we observe two issues intermittently.
>>>
>>> 1) synchronous external abort

>> Can you share the backtrace for the Synchronous External Abort? It should point
>> at the instruction in the (probably) driver that causes the abort.
> 
> 15:05:24.284  LOG   4908REF2   Unhandled fault: synchronous external abort
> (0x96000210) at 0xffffff80010a4154

I'm not sure where this 'LOG 4908REF2' is coming from, but from the kernel you
should get a stack trace after the external abort message. Looks a bit like [0]
(also a v4.1 kernel).


> So between brackets is the ESR. Looking in the TRM I can only conclude it is
> exception class 0x25 whatever that is.

"Data abort taken without a change in exception level."
This happened because EL1 made a load/store that the device on the other end
rejected with an abort.

Your ISS decodes as a load, not on a translation table walk, with the 'EA' bit set.

(The EA bit is used by the CPU to distinguish some external aborts from others.
It's implementation-defined, and not likely to be helpful here).
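
The ISS decode above can be reproduced with a short sketch. It assumes the
ARMv8-A data abort ISS layout (ISV bit 24, FnV bit 10, EA bit 9, CM bit 8,
S1PTW bit 7, WnR bit 6, DFSC bits [5:0]); the function names are made up for
illustration, and only the external-abort fault status codes are named.

```python
# Decode the ISS of an AArch64 data abort (EC 0x24 / 0x25).
# Assumed layout (ARMv8-A): ISV=24, FnV=10, EA=9, CM=8, S1PTW=7, WnR=6,
# DFSC=[5:0]. Function names are illustrative, not kernel APIs.

def decode_data_abort_iss(iss: int) -> dict:
    return {
        "ISV": (iss >> 24) & 1,    # syndrome valid (access size etc.)
        "FnV": (iss >> 10) & 1,    # 1 = FAR not valid
        "EA": (iss >> 9) & 1,      # implementation-defined abort type
        "CM": (iss >> 8) & 1,      # cache maintenance operation
        "S1PTW": (iss >> 7) & 1,   # stage 2 fault on a stage 1 walk
        "WnR": (iss >> 6) & 1,     # 1 = write, 0 = read
        "DFSC": iss & 0x3F,        # data fault status code
    }

def describe_dfsc(dfsc: int) -> str:
    if dfsc == 0b010000:
        return "synchronous external abort, not on translation table walk"
    if 0b010100 <= dfsc <= 0b010111:
        return f"synchronous external abort on table walk, level {dfsc & 0x3}"
    return f"other fault status code {dfsc:#04x}"

# ISS from ESR 0x96000210 in this thread.
decoded = decode_data_abort_iss(0x210)
access = "store" if decoded["WnR"] else "load"
print(access, "/", describe_dfsc(decoded["DFSC"]), "/ EA =", decoded["EA"])
```

With ISS 0x210 this reports a load, a DFSC of 0x10 and EA set, and FnV is
clear, so the reported fault address should be valid.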


> Need to dive into the ARMv8-A arch doc. So the address 0xffffff80010a4154 is the fault address.

That is a 32-bit aligned address. It may be that the device on the other end
wanted 64-bit alignment. (I'm guessing)
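
The alignment observation is simple arithmetic on the low bits of the address;
a minimal sketch, with an illustrative helper name. Whether the device really
needs 64-bit alignment is still only a guess and is not something this check
can confirm.

```python
# Report the largest power-of-two boundary an address sits on.
# natural_alignment is an illustrative helper name.

def natural_alignment(addr: int) -> int:
    # addr & -addr isolates the lowest set bit of a positive integer.
    return addr & -addr

fault_addr = 0xFFFFFF80010A4154  # fault address from the log line

print("alignment in bytes:", natural_alignment(fault_addr))
print("4-byte aligned:", fault_addr % 4 == 0)
print("8-byte aligned:", fault_addr % 8 == 0)
```

The address ends in 0x4, so it is 4-byte aligned but sits 4 bytes past an
8-byte boundary.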


> It is not in the range of my
> wireless driver, but it is probably in the PCIe host driver accessing the device.

Another guess: whatever mapped that address space used the wrong memory
attributes.


> Thanks for the pointers. Digging deeper.

Good luck!


Thanks,

James

[0] https://lists.gt.net/linux/kernel/2223260
