Re: [PATCH] acpi/nfit: badrange report spill over to clean range

From: Jane Chu <jane.chu@oracle.com>
To: Dan Williams <dan.j.williams@intel.com>,
	"hch@infradead.org" <hch@infradead.org>,
	"vishal.l.verma@intel.com" <vishal.l.verma@intel.com>,
	"dave.jiang@intel.com" <dave.jiang@intel.com>,
	"ira.weiny@intel.com" <ira.weiny@intel.com>,
	"nvdimm@lists.linux.dev" <nvdimm@lists.linux.dev>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] acpi/nfit: badrange report spill over to clean range
Date: Wed, 13 Jul 2022 23:52:13 +0000
Message-ID: <09df842d-d8e4-0594-56b0-b4bb9ea37b67@oracle.com>
In-Reply-To: <62ce16518e7d3_6070c29447@dwillia2-xfh.jf.intel.com.notmuch>

On 7/12/2022 5:48 PM, Dan Williams wrote:
> Jane Chu wrote:
>> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
>> granularity") changed nfit_handle_mce() callback to report badrange for
>> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
>> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
>> populated with Intel DCPMEM v2 dimms, it appears that
>> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out to be 4KiB, or 8 512-byte blocks.
>> Consequently, injecting 2 back-to-back poisons via ndctl results in
>> 8 poisons being reported.
>>
>> [29076.590281] {3}[Hardware Error]:   physical_address: 0x00000040a0602400
>> [..]
>> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
>> [29076.627519] mce: [Hardware Error]: Machine check events logged
>> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
>> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>> [..]
>> [29078.634817] {4}[Hardware Error]:   physical_address: 0x00000040a0602600
>> [..]
>> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
>> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>> [..]
>> {
>>    "dev":"namespace0.0",
>>    "mode":"fsdax",
>>    "map":"dev",
>>    "size":33820770304,
>>    "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
>>    "sector_size":512,
>>    "align":2097152,
>>    "blockdev":"pmem0",
>>    "badblock_count":8,
>>    "badblocks":[
>>      {
>>        "offset":8208,
>>        "length":8,
>>        "dimms":[
>>          "nmem0"
>>        ]
>>      }
>>    ]
>> }
>>
>> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
>> radius and shouldn't be used.  Moreover, as each injected poison is being
>> reported independently, any alignment of 512 bytes or less appears to work:
>> L1_CACHE_BYTES (though inaccurate), or 256 bytes (as ars->length reports),
>> or 512 bytes.
>>
>> To get around this issue, 512 bytes is chosen as the alignment because
>>    a. it happens to be the badblock granularity,
>>    b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
>>    c. it is architecture agnostic.
> 
> I am failing to see the kernel bug? Yes, you injected less than 8
> "badblocks" of poison and the hardware reported 8 blocks of poison, but
> that's not the kernel's fault, that's the hardware. What happens when
> hardware really does detect 8 blocks of consecutive poison and this
> implementation decides to only record 1 at a time?

In that case, there will be 8 poison reports from APEI GHES, and the
ARS scan will also report 8 poisons; each will be added to the
bad range via nvdimm_bus_add_badrange(), so none of them will be missed.

In the 2-poison example above, the poisons at 0x00000040a0602400 and at
0x00000040a0602600 were reported separately.
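
To make the arithmetic concrete, here is a minimal standalone sketch (not
the nfit code itself) of how the alignment choice changes what gets
recorded for those two addresses. The struct and helper names are
hypothetical, and the LSB value of 12 is assumed from the 4KiB ranges
seen in the log above.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512ULL /* badblock granularity */

/* Hypothetical type for illustration; not a kernel structure. */
struct bad_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Round a reported poison address down to a power-of-two alignment and
 * describe the range that would be handed to the badrange list.
 */
static struct bad_range align_poison(uint64_t addr, uint64_t align)
{
	struct bad_range r;

	/* The mask trick below only works for power-of-two alignments. */
	assert(align != 0 && (align & (align - 1)) == 0);

	r.start = addr & ~(align - 1);
	r.len = align;
	return r;
}

/* Number of 512-byte badblocks covered by a range. */
static uint64_t range_sectors(struct bad_range r)
{
	return r.len / SECTOR_SIZE;
}
```

With align = 1ULL << 12, both 0x40a0602400 and 0x40a0602600 collapse to
the same (0x40a0602000, 0x1000) range, i.e. 8 badblocks. With align = 512,
they yield two distinct single-block ranges.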

> 
> It seems the fix you want is for the hardware to report the precise
> error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
> that precision in this case.

That field describes a 4K range even for a single poison, which confuses
people unnecessarily.

> 
> However, the ARS engine likely can return the precise error ranges so I
> think the fix is to just use the address range indicated by 1UL <<
> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
> scrub request to ask the device for the precise error list.

You mean for the nfit_handle_mce() callback to issue a short ARS for each
poison report over a 4K range in order to determine the precise range, as a
workaround for the hardware issue?  If 8 poisons are detected,
there will be 8 short ARS requests; are we sure we want to do that?  Also, for
now, is it possible to log more than 1 poison per 512-byte block?
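
For reference, my understanding of the filtering step is roughly the
following. This is only a sketch under assumptions: struct err_record is a
hypothetical simplification of an ARS status record (address plus length),
filter_records() is not an existing kernel function, and issuing the short
ARS itself is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified error record: address and length in bytes. */
struct err_record {
	uint64_t addr;
	uint64_t len;
};

/*
 * Keep only the records that intersect [win_start, win_start + win_len),
 * clipping each kept record to the window. 'out' must have room for 'n'
 * entries. Returns the number of records written to 'out'.
 */
static size_t filter_records(const struct err_record *in, size_t n,
			     uint64_t win_start, uint64_t win_len,
			     struct err_record *out)
{
	uint64_t win_end = win_start + win_len;
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t start = in[i].addr;
		uint64_t end = in[i].addr + in[i].len;

		/* Skip records entirely outside the window. */
		if (end <= win_start || start >= win_end)
			continue;

		/* Clip the record to the window bounds. */
		if (start < win_start)
			start = win_start;
		if (end > win_end)
			end = win_end;

		out[kept].addr = start;
		out[kept].len = end - start;
		kept++;
	}
	return kept;
}
```

The window here would be the 4K range derived from
MCI_MISC_ADDR_LSB(mce->misc), and the records would be whatever the short
ARS returns, e.g. 256-byte entries as ars->length reports.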

thanks!
-jane