linux-kernel.vger.kernel.org archive mirror
* Resend: How to handle the SMMU RAS Error in the kernel
@ 2018-11-17 15:41 gengdongjiu
  2018-11-19 18:05 ` James Morse
  0 siblings, 1 reply; 4+ messages in thread
From: gengdongjiu @ 2018-11-17 15:41 UTC
  To: robin.murphy, James Morse, arm-mail-list
  Cc: xuqiang36, Linux Kernel Mailing List, gengdongjiu

Hi Robin/Will/James,


The kernel currently handles only three kinds of error: memory errors,
PCIe device errors and ARM processor errors. The SMMU now supports RAS
as well, so how should SMMU RAS errors be handled in the kernel?

I checked UEFI spec 2.7: the ACPI CPER definitions include an IOMMU
section type, but it appears to be specific to AMD’s IOMMU specification;
there is no section type for the ARM SMMU. Can we reuse this IOMMU
section type for the ARM SMMU?



N.2.11.3 IOMMU specific DMAr Error Section

Type: {0x036F84E1, 0x7F37, 0x428c, {0xA7, 0x9E, 0x57, 0x5F, 0xDF,
0xAA, 0x84, 0xEC}}

All fields in this error record are specific to AMD’s IOMMU
specification. This error section has a fixed size.
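
For illustration only, checking a CPER section type against this GUID
could look like the sketch below. The type and function names
(cper_guid, cper_guid_equal, section_is_amd_iommu) are made up for this
example and are not the kernel's definitions; only the GUID value is
taken from the spec text above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* GUID split into the same fields the UEFI spec prints (hypothetical type name). */
struct cper_guid {
	uint32_t data1;
	uint16_t data2;
	uint16_t data3;
	uint8_t  data4[8];
};

/* N.2.11.3 "IOMMU specific DMAr Error Section" type, copied from the text above. */
static const struct cper_guid sec_dmar_iommu = {
	0x036F84E1, 0x7F37, 0x428c,
	{ 0xA7, 0x9E, 0x57, 0x5F, 0xDF, 0xAA, 0x84, 0xEC }
};

/* Compare field by field so struct padding never affects the result. */
static bool cper_guid_equal(const struct cper_guid *a, const struct cper_guid *b)
{
	return a->data1 == b->data1 &&
	       a->data2 == b->data2 &&
	       a->data3 == b->data3 &&
	       memcmp(a->data4, b->data4, sizeof(a->data4)) == 0;
}

/* True only for the AMD IOMMU section type; any other GUID is rejected. */
static bool section_is_amd_iommu(const struct cper_guid *section_type)
{
	return cper_guid_equal(section_type, &sec_dmar_iommu);
}
```

A record carrying this type identifies itself as an AMD IOMMU record,
so an SMMU record would presumably need a section type of its own.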


* Re: Resend: How to handle the SMMU RAS Error in the kernel
  2018-11-17 15:41 Resend: How to handle the SMMU RAS Error in the kernel gengdongjiu
@ 2018-11-19 18:05 ` James Morse
  2018-11-21  8:10   ` gengdongjiu
  0 siblings, 1 reply; 4+ messages in thread
From: James Morse @ 2018-11-19 18:05 UTC
  To: gengdongjiu
  Cc: robin.murphy, arm-mail-list, xuqiang36,
	Linux Kernel Mailing List, gengdongjiu

Hi gengdongjiu,

On 17/11/2018 15:41, gengdongjiu wrote:
> The kernel currently handles only three kinds of error: memory errors,
> PCIe device errors and ARM processor errors. The SMMU now supports RAS
> as well, so how should SMMU RAS errors be handled in the kernel?

What errors are being detected here?

I don't know much about the SMMU, but I think we should start with a list of
errors that we want to handle.


Is this a v8.2 fault handling interrupt from the SMMU taken to EL3?
Or a CPU access that was returned as an external abort? Or a device access that
received an external abort?

What do we intend to do with this error information? Does the DMA layer have
error handling we can hook this into?

Is this just another interface for memory-errors? (e.g. the SMMU provides a
device/address pair and the kernel works out the physical page to run
memory_failure() on)
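
To make that last option concrete, here is a rough user-space sketch of
the flow: take a device/address pair, translate it to a physical
address, and hand the page frame number to the memory-failure path.
Everything in it is an assumption for illustration: the 4 KiB page
size, the report structure, the translation hook and its example
implementation, and the stub that stands in for memory_failure().

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical report: which device (StreamID) touched which I/O virtual address. */
struct smmu_mem_err {
	uint32_t stream_id;
	uint64_t iova;
};

/* Hypothetical translation hook: IOVA -> physical address; false if unmapped. */
typedef bool (*iova_to_phys_fn)(uint32_t stream_id, uint64_t iova, uint64_t *phys);

/* Example translator: only stream 1 is mapped, at a fixed made-up offset. */
static bool example_xlate(uint32_t stream_id, uint64_t iova, uint64_t *phys)
{
	if (stream_id != 1)
		return false;
	*phys = iova + 0x80000000ULL;
	return true;
}

/* Stand-in for the kernel's memory_failure(); here it only records the pfn. */
static uint64_t last_poisoned_pfn;

static int memory_failure_stub(uint64_t pfn)
{
	last_poisoned_pfn = pfn;
	return 0;
}

/* Returns 0 if the page was handed to the memory-failure path, -1 otherwise. */
static int handle_smmu_mem_err(const struct smmu_mem_err *err, iova_to_phys_fn xlate)
{
	uint64_t phys;

	if (!xlate(err->stream_id, err->iova, &phys))
		return -1;	/* no mapping: nothing to poison */

	return memory_failure_stub(phys >> SKETCH_PAGE_SHIFT);
}
```

Whether the SMMU can actually report such a pair is exactly the open
question; the sketch only shows what the kernel side would need.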


> I checked UEFI spec 2.7: the ACPI CPER definitions include an IOMMU
> section type, but it appears to be specific to AMD’s IOMMU specification;

... and Intel VT-d. It looks like UEFI generalises all these as types of 'DMAr'.


> there is no section type for the ARM SMMU. Can we reuse this IOMMU
> section type for the ARM SMMU?

The architecture specific records for AMD? No. Even if the information was the
same, the presence of this record tells you it's an AMD IOMMU, which it's not.

The generic error section? Maybe.

Assuming the 'fault reason' list in Table 285 is sufficient to cover our list of
errors, we can use the 'DMAr Generic Errors' section, (N.2.11.1), to describe
the generic bits of the error ... but SMMU doesn't have an 'Architecture Type',
so we at least need to get one allocated.

We will probably need an architecture specific 'DMAr Error Section'.


I think adding the UEFI bits is probably the 'easy' bit. We should start with a
list of errors, and the error handling code. This way we know what we need to
add to the spec.



Thanks,

James


* Re: Resend: How to handle the SMMU RAS Error in the kernel
  2018-11-19 18:05 ` James Morse
@ 2018-11-21  8:10   ` gengdongjiu
  2018-11-30 18:37     ` James Morse
  0 siblings, 1 reply; 4+ messages in thread
From: gengdongjiu @ 2018-11-21  8:10 UTC
  To: James Morse, gengdongjiu
  Cc: robin.murphy, arm-mail-list, xuqiang36, Linux Kernel Mailing List

Hi James,
   Thanks for the mail.

On 2018/11/20 2:05, James Morse wrote:
> Hi gengdongjiu,
> 
> On 17/11/2018 15:41, gengdongjiu wrote:
>> The kernel currently handles only three kinds of error: memory errors,
>> PCIe device errors and ARM processor errors. The SMMU now supports RAS
>> as well, so how should SMMU RAS errors be handled in the kernel?
> 
> What errors are being detected here?
> 
> I don't know much about the SMMU, but I think we should start with a list of
> errors that we want to handle.

On our platform, the SMMU RAS errors are mainly the following, which follow the SMMU spec:
1. One-bit ECC error, reported as CE.
2. Two-bit ECC error, reported as UEU.
3. Fetch error as defined in the SMMUv3 spec, reported as UER.

Errors 2 and 3 should be handled, but I do not know how to recover from them.
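
As a starting point for discussion, the three classes could be mapped
to actions roughly as in the sketch below. The enum names and the
actions chosen for each class are placeholders that I made up; this is
not an agreed policy and nothing like it exists in the kernel today.

```c
/* Error classes from the list above. */
enum smmu_ras_class {
	SMMU_RAS_CE,	/* one-bit ECC error, corrected */
	SMMU_RAS_UEU,	/* two-bit ECC error, uncorrected and unrecoverable */
	SMMU_RAS_UER,	/* fetch error, uncorrected but recoverable */
};

/* Hypothetical actions; none of these correspond to existing kernel code. */
enum smmu_ras_action {
	SMMU_ACT_LOG,		/* count and report only */
	SMMU_ACT_ABORT_STREAM,	/* fail the affected transaction/stream */
	SMMU_ACT_DISABLE,	/* stop using this SMMU */
};

/* Placeholder policy: the more severe the class, the more drastic the action. */
static enum smmu_ras_action smmu_ras_policy(enum smmu_ras_class cls)
{
	switch (cls) {
	case SMMU_RAS_CE:
		return SMMU_ACT_LOG;
	case SMMU_RAS_UER:
		return SMMU_ACT_ABORT_STREAM;
	case SMMU_RAS_UEU:
	default:
		/* unknown classes are treated like the worst case */
		return SMMU_ACT_DISABLE;
	}
}
```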

> 
> 
> Is this a v8.2 fault handling interrupt from the SMMU taken to EL3?
> Or a CPU access that was returned as an external abort? Or a device access that
> received an external abort?
It follows the v8.2 RAS spec: it is a v8.2 fault handling interrupt from the SMMU taken to EL3.

> 
> What do we intend to do with this error information? Does the DMA layer have
> error handling we can hook this into?

We can get DMA-layer errors, such as DMA read errors, from the RAS registers.
One way to handle them may be to disable the affected SMMU to avoid propagation.

> 
> Is this just another interface for memory-errors? (e.g. the SMMU provides a
> device/address pair and the kernel works out the physical page to run
> memory_failure() on)
I need to check more.

> 
> 
>> I checked UEFI spec 2.7: the ACPI CPER definitions include an IOMMU
>> section type, but it appears to be specific to AMD’s IOMMU specification;
> 
> ... and Intel VT-d. It looks like UEFI generalises all these as types of 'DMAr'.
Yes, it is.

> 
> 
>> there is no section type for the ARM SMMU. Can we reuse this IOMMU
>> section type for the ARM SMMU?
> 
> The architecture specific records for AMD? No. Even if the information was the
> same, the presence of this record tells you it's an AMD IOMMU, which it's not.
> 
> The generic error section? Maybe.
> 
> Assuming the 'fault reason' list in Table 285 is sufficient to cover our list of
> errors, we can use the 'DMAr Generic Errors' section, (N.2.11.1), to describe
> the generic bits of the error ... but SMMU doesn't have an 'Architecture Type',
> so we at least need to get one allocated.
> 
> We will probably need an architecture specific 'DMAr Error Section'.
> 
> 
> I think adding the UEFI bits is probably the 'easy' bit. We should start with a
> list of errors, and the error handling code. This way we know what we need to
> add to the spec.

The SMMU RAS errors are listed below, but I still do not know how to handle them.
1. One-bit ECC error
2. Two-bit ECC error
3. Fetch error as defined in the SMMUv3 spec

> 
> 
> 
> Thanks,
> 
> James
> 



* Re: Resend: How to handle the SMMU RAS Error in the kernel
  2018-11-21  8:10   ` gengdongjiu
@ 2018-11-30 18:37     ` James Morse
  0 siblings, 0 replies; 4+ messages in thread
From: James Morse @ 2018-11-30 18:37 UTC
  To: gengdongjiu, gengdongjiu
  Cc: robin.murphy, arm-mail-list, xuqiang36, Linux Kernel Mailing List

Hi gengdongjiu,

On 21/11/2018 08:10, gengdongjiu wrote:
> On 2018/11/20 2:05, James Morse wrote:
>> On 17/11/2018 15:41, gengdongjiu wrote:
>>> The kernel currently handles only three kinds of error: memory errors,
>>> PCIe device errors and ARM processor errors. The SMMU now supports RAS
>>> as well, so how should SMMU RAS errors be handled in the kernel?
>>
>> What errors are being detected here?
>>
>> I don't know much about the SMMU, but I think we should start with a list of
>> errors that we want to handle.
> 
> On our platform, the SMMU RAS errors are mainly the following, which follow the SMMU spec:
> 1. One-bit ECC error, reported as CE.
> 2. Two-bit ECC error, reported as UEU.
> 3. Fetch error as defined in the SMMUv3 spec, reported as UER.

These are faults, but this isn't enough information for software to act on.

Are these faults in the device, host-memory, or memory that is part of the SMMU?
Was the error discovered during a read/write by the device (which one?), or
during an access to the SMMU's page tables or command queue?
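
Put differently, software would need something like the record below
before it could act. This is only a sketch of the questions above as a
data structure; all names are hypothetical, and the rule for what
counts as "actionable" is my assumption, not something taken from the
SMMU or RAS specs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Where the faulty data lives. */
enum fault_location {
	LOC_UNKNOWN,
	LOC_DEVICE,
	LOC_HOST_MEMORY,
	LOC_SMMU_INTERNAL,	/* memory that is part of the SMMU */
};

/* What the SMMU was doing when the error was discovered. */
enum fault_access {
	ACC_UNKNOWN,
	ACC_DEVICE_READ,
	ACC_DEVICE_WRITE,
	ACC_PAGE_TABLE_WALK,
	ACC_COMMAND_QUEUE,
};

/* Hypothetical report handed from firmware to the kernel. */
struct smmu_fault_report {
	enum fault_location where;
	enum fault_access during;
	bool have_stream_id;	/* do we know which device? */
	uint32_t stream_id;
	bool have_addr;		/* do we know which address? */
	uint64_t addr;
};

/* Assumed rule: a report is only useful if it says where, when and to whom. */
static bool report_is_actionable(const struct smmu_fault_report *r)
{
	if (r->where == LOC_UNKNOWN || r->during == ACC_UNKNOWN)
		return false;

	/* A device access must name the device, or nothing can be contained. */
	if ((r->during == ACC_DEVICE_READ || r->during == ACC_DEVICE_WRITE) &&
	    !r->have_stream_id)
		return false;

	/* A host-memory error must name the address to find the page. */
	if (r->where == LOC_HOST_MEMORY && !r->have_addr)
		return false;

	return true;
}
```

A bare "two-bit ECC error, UEU" would map to LOC_UNKNOWN/ACC_UNKNOWN
here and be rejected.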


> Errors 2 and 3 should be handled, but I do not know how to recover from them.

Me neither. If we can come up with the errors that can be detected, we can work
out which ones can be handled.


Thanks,

James

