linux-kernel.vger.kernel.org archive mirror
* [PATCH] acpi/nfit: badrange report spill over to clean range
@ 2022-07-11 23:26 Jane Chu
  2022-07-13  0:48 ` Dan Williams
From: Jane Chu @ 2022-07-11 23:26 UTC
  To: dan.j.williams, hch, vishal.l.verma, dave.jiang, ira.weiny,
	nvdimm, linux-kernel

Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
granularity") changed the nfit_handle_mce() callback to report a badrange for
each poison at the alignment indicated by 1UL << MCI_MISC_ADDR_LSB(mce->misc)
instead of the hardcoded L1_CACHE_BYTES. However, recently, on a server
populated with Intel DCPMEM v2 DIMMs, 1UL << MCI_MISC_ADDR_LSB(mce->misc)
turned out to be 4KiB, i.e. eight 512-byte blocks. Consequently, injecting
2 back-to-back poisons via ndctl results in 8 poisons being reported.

[29076.590281] {3}[Hardware Error]:   physical_address: 0x00000040a0602400
[..]
[29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
[29076.627519] mce: [Hardware Error]: Machine check events logged
[29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
[29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
[..]
[29078.634817] {4}[Hardware Error]:   physical_address: 0x00000040a0602600
[..]
[29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
[29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
[..]
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":33820770304,
  "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0",
  "badblock_count":8,
  "badblocks":[
    {
      "offset":8208,
      "length":8,
      "dimms":[
        "nmem0"
      ]
    }
  ]
}

So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator of the poison
radius and shouldn't be used.  Moreover, as each injected poison is
reported independently, any alignment of up to 512 bytes appears to work:
L1_CACHE_BYTES (though inaccurate), 256 bytes (as ars->length reports),
or 512 bytes.

To get around this issue, 512 bytes is chosen as the alignment because
  a. it happens to be the badblock granularity,
  b. ndctl inject-error cannot inject more than one poison into a 512-byte block,
  c. it is architecture agnostic.

Fixes: 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison granularity")
Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 drivers/acpi/nfit/mce.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index d48a388b796e..eeacc8eb807f 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -32,7 +32,6 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
 	 */
 	mutex_lock(&acpi_desc_lock);
 	list_for_each_entry(acpi_desc, &acpi_descs, list) {
-		unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);
 		struct device *dev = acpi_desc->dev;
 		int found_match = 0;
 
@@ -64,7 +63,8 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
 
 		/* If this fails due to an -ENOMEM, there is little we can do */
 		nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
-				ALIGN_DOWN(mce->addr, align), align);
+				ALIGN_DOWN(mce->addr, SECTOR_SIZE),
+				SECTOR_SIZE);
 		nvdimm_region_notify(nfit_spa->nd_region,
 				NVDIMM_REVALIDATE_POISON);
 

base-commit: e35e5b6f695d241ffb1d223207da58a1fbcdff4b
-- 
2.18.4



Thread overview: 12+ messages
2022-07-11 23:26 [PATCH] acpi/nfit: badrange report spill over to clean range Jane Chu
2022-07-13  0:48 ` Dan Williams
2022-07-13 23:52   ` Jane Chu
2022-07-14  0:24     ` Dan Williams
2022-07-14 23:22       ` Jane Chu
2022-07-15  0:58         ` Dan Williams
2022-07-15 17:38           ` Jane Chu
2022-07-15  1:19         ` Dan Williams
2022-07-15 17:26           ` Jane Chu
2022-07-15 19:17             ` Dan Williams
2022-07-15 22:46               ` Jane Chu
2022-08-29  8:11                 ` [tip: ras/core] x86/mce: Retrieve poison range from hardware tip-bot2 for Jane Chu