Re: Question about SEA handling process happened in user space

From: Xiaofei Tan <tanxiaofei@huawei.com>
To: James Morse <james.morse@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Linuxarm <linuxarm@huawei.com>, Will Deacon <will@kernel.org>,
	Dave Martin <Dave.Martin@arm.com>,
	linux-arm-kernel@lists.infradead.org,
	Shiju Jose <shiju.jose@huawei.com>
Subject: Re: Question about SEA handling process happened in user space
Date: Thu, 9 Apr 2020 16:42:30 +0800
Message-ID: <5E8EDFF6.4050903@huawei.com>
In-Reply-To: <7d6668d6-ec4a-e362-94a3-c31950651c02@arm.com>

Hi James,

On 2020/4/8 0:37, James Morse wrote:
> On 01/04/2020 04:49, Xiaofei Tan wrote:
>> On 2020/4/1 1:00, James Morse wrote:
>>> On 3/31/20 10:41 AM, Xiaofei Tan wrote:
>>>> 1.memory_failure() is only called for "memory error section" record. Then
>>>> should we use this memory record for ghes sea report? Our platform is
>>>> using "ARM processor error section".
> 
>>> For what classes of error?
> 
>> Both processor cache ecc error and memory error (marked by poison) can lead to SEA.
> 
> These are the errors that can cause the hardware to notify software via external abort.
> For which classes of error does your firmware then use a 'processor error'?
> 

For all errors that the processor either generates or consumes.

> It sounds like you assume everything reported in the CPU records must be a processor
> error, and everything reported by the memory error must be a memory error.
> 

Not exactly. We use the processor error section for everything reported in the CPU records,
and the memory records are handled the same way, even when the node only consumed an error
from another node. So the question is which kind of error section we should use for the
error consumption case.

> 
> (digression: this isn't really true!
> The CPU could report that it read poison from memory. Is this a memory error, or a CPU
> error? Equally the memory controller could report that a PCIe device wrote to a
> not-present DIMM. Is this a memory error?)
> 

Yes, this is the error consumption case.

> 
>>> If memory has become corrupted, you should tell the OS about the memory error.
>>>
>>> From (my) memory: linux will just print out 'processor errors', and panic() if
>>> they are marked as fatal. I don't think you can use these to convey a memory
>>> error...
>>>
>>
>> OK. Then firmware should detect error source. If it is processor cache error,
>> we use "ARM processor error section", else if it is memory error, we use "memory error section".
> 
>> Normally, we report memory error only from RAS node of DDRC or HHA module. For SEA,
> 
> Do you have patches to get linux to do something useful with the processor error nodes?
> 
> We'd need it to handle uncorrected cache errors with a physical address, as if they were
> memory errors...
> 

Yes, we have some internal patches that do this. With them, memory_failure() is called for
an ARM processor error section when a physical address is available.
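
A minimal user-space sketch of that idea, not the actual patches: walk the
error-information entries of an ARM processor error section and hand every
entry with a valid physical fault address to a page-poisoning hook. The struct
is a simplified stand-in rather than the kernel's definition, the validation
bit position is assumed to follow the UEFI CPER layout, a 4K page size is
assumed, and record_pfn() is a hypothetical stub standing in for
memory_failure().

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one error-information entry of an ARM processor
 * error section. This is NOT the kernel's struct definition. */
struct arm_err_info {
    uint16_t validation_bits;
    uint64_t virt_fault_addr;
    uint64_t physical_fault_addr;
};

/* Assumed: bit 4 of the validation bits marks the physical address valid. */
#define ERR_INFO_VALID_PHYS_ADDR (1u << 4)
/* Assumed page size of 4K for the address-to-PFN conversion. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical hook standing in for memory_failure(pfn, flags). */
typedef int (*poison_fn)(uint64_t pfn);

/* Example stub: remembers the last PFN it was asked to poison. */
static uint64_t last_pfn;
static int record_pfn(uint64_t pfn)
{
    last_pfn = pfn;
    return 0;
}

/* Pass each entry that carries a valid physical address to the hook.
 * Entries with only a virtual address are skipped, because the mapping may
 * have changed since the error was recorded.
 * Returns the number of entries the hook accepted. */
static size_t handle_arm_err_infos(const struct arm_err_info *info, size_t n,
                                   poison_fn poison)
{
    size_t handled = 0;

    for (size_t i = 0; i < n; i++) {
        if (!(info[i].validation_bits & ERR_INFO_VALID_PHYS_ADDR))
            continue;
        if (poison(info[i].physical_fault_addr >> SKETCH_PAGE_SHIFT) == 0)
            handled++;
    }
    return handled;
}
```

Calling handle_arm_err_infos(infos, count, record_pfn) on a decoded section
would poison only the pages whose physical address the firmware reported.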

> A virtual address is no-use as the memory may have been re-mapped in the meantime.
> 

Right.

> 
>> It is a little strange to report as memory error when collect errors from processor
>> RAS node.
> 
> It's pragmatic: today linux ignores the processor errors.
> If you suffer a cache error, the memory that backed that cached location is now also
> corrupt, as you've lost the writes that made the cache-line dirty.
> 
> If you can describe this memory corruption, without treating it as 'the error' then an OS
> that doesn't know about the processor error sections will still do the right thing. (i.e.
> leave out the device/row/rank stuff to avoid it being attributed to a DIMM)
> 
> The downside is you have fake memory errors when nothing bad happened to the DIMM. These
> should be uniform, and smaller than the errors actually occurring at the DIMM.
> 

Agreed.

> I've no idea if patches adding support for the processor error nodes would be considered
> for stable.
> 

I think this part is worth improving.
BTW, should an ARM processor record the physical address when it consumes a memory poison
error and takes an SEA? That would help with error recovery. Is it mandatory in the Arm spec?

> 
>>>> 2.Should we define an error source structure for each cpu core in HEST table?
>>>> If not, there may be conflict if more than one cpu core fall into SEA.
>>>
>>> This is a question for the people who wrote your firmware.
>>> For firmware first, you must have set SCR_EL3.EA. What does your firmware do if
>>> two CPUs take an external abort at the same time?
>>
>> Will block the second one until first SEA finished and error source of HEST table free.
> 
> Okay, so one 'SEA' entry in the HEST describes the single region that CPER will be written to.
> 
> 
>>> Each CPU having its own area to read/write CPER would mean you need one
>>> NOTIFY_SEA entry in the HEST for each area ... but how does the OS know which
>>> CPU is which?
>>
>> Yes, OS don't know this.
>> So, it is ok to share the only one area for all CPUs.
> 
> Yes, as there is no way to pair the memory with the CPUs.
> 
> If there is more than one region, then each CPU taking an external abort will walk the
> list, checking each one.
> 
> It's up to firmware to ensure this is serialised. Sounds like you've got this sorted.
> 

Yes
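
For illustration only, the serialisation could look roughly like the sketch
below: a single busy flag guards the one CPER region described by the SEA
entry in the HEST, and a second CPU spins until the first one releases it.
This is not our firmware code. The function names are hypothetical, and the
point at which the region is released (here, whenever the caller decides the
record has been consumed) is an assumption.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Busy flag for the single shared CPER region (hypothetical name). */
static atomic_flag cper_region_busy = ATOMIC_FLAG_INIT;

/* Try to take ownership of the region; returns true on success. */
static bool cper_region_try_claim(void)
{
    return !atomic_flag_test_and_set_explicit(&cper_region_busy,
                                              memory_order_acquire);
}

/* Block (spin) until the region is free, then own it. A second CPU that
 * takes an SEA waits here until the first one is finished. */
static void cper_region_claim(void)
{
    while (!cper_region_try_claim())
        ; /* spin */
}

/* Give the region back, assumed to happen once the record was consumed. */
static void cper_region_release(void)
{
    atomic_flag_clear_explicit(&cper_region_busy, memory_order_release);
}
```

Each CPU would call cper_region_claim() before writing its record and
cper_region_release() after the record is no longer needed.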

> 
> Thanks,
> 
> James
> 

-- 
 thanks
tanxiaofei
