* [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"
@ 2022-06-07 21:20 Tony Luck
  2022-06-27 14:40 ` Borislav Petkov
  2022-08-22 17:41 ` [tip: ras/core] " tip-bot2 for Tony Luck
  0 siblings, 2 replies; 13+ messages in thread
From: Tony Luck @ 2022-06-07 21:20 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: x86, linux-kernel, patches, Tony Luck

A large scale study of memory errors in data centers showed that it is
best to aggressively take pages with corrected errors offline. This is
the best strategy for using corrected errors as a predictor of future
uncorrected errors.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Here's the link to the study. I thought of putting it into the code
comment, or the commit message. But these links sometimes change as
the website is re-organised, making the link stale.

https://www.intel.com/content/dam/www/public/us/en/documents/intel-and-samsung-mrt-improving-memory-reliability-at-data-centers.pdf

The paper has two recommendations:
1) Change threshold to "2".
2) Do very smart platform dependent things

This commit only addresses the first :-)
---
 drivers/ras/cec.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 42f2fc0bc8a9..5d614c383ccf 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -125,8 +125,11 @@ static struct ce_array {
 static DEFINE_MUTEX(ce_mutex);
 static u64 dfs_pfn;
 
-/* Amount of errors after which we offline */
-static u64 action_threshold = COUNT_MASK;
+/*
+ * Number of errors after which we offline. Default is to aggressively
+ * offline the page when a second error is seen.
+ */
+static u64 action_threshold = 2;
 
 /* Each element "decays" each decay_interval which is 24hrs by default. */
 #define CEC_DECAY_DEFAULT_INTERVAL	24 * 60 * 60	/* 24 hrs */
-- 
2.35.3



* Re: [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"
  2022-06-07 21:20 [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2" Tony Luck
@ 2022-06-27 14:40 ` Borislav Petkov
  2022-06-27 17:27   ` Luck, Tony
  2022-08-22 17:41 ` [tip: ras/core] " tip-bot2 for Tony Luck
  1 sibling, 1 reply; 13+ messages in thread
From: Borislav Petkov @ 2022-06-27 14:40 UTC (permalink / raw)
  To: Tony Luck; +Cc: x86, linux-kernel, patches, Yazen Ghannam

On Tue, Jun 07, 2022 at 02:20:15PM -0700, Tony Luck wrote:
> A large scale study of memory errors in data centers showed that it is
> best to aggressively take pages with corrected errors offline. This is
> the best strategy for using corrected errors as a predictor of future
> uncorrected errors.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> 
> ---
> Here's the link to the study. I thought of putting it into the code
> comment, or the commit message. But these links sometimes change as
> the website is re-organised, making the link stale.
> 
> https://www.intel.com/content/dam/www/public/us/en/documents/intel-and-samsung-mrt-improving-memory-reliability-at-data-centers.pdf
> 
> The paper has two recommendations:
> 1) Change threshold to "2".

Kinda unconditional that... we haven't talked to other vendors even.

> 2) Do very smart platform dependent things

If you mean AI, that probably won't happen in the kernel.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* RE: [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"
  2022-06-27 14:40 ` Borislav Petkov
@ 2022-06-27 17:27   ` Luck, Tony
  2022-06-28 15:59     ` Borislav Petkov
  0 siblings, 1 reply; 13+ messages in thread
From: Luck, Tony @ 2022-06-27 17:27 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: x86, linux-kernel, patches, Yazen Ghannam

>> 1) Change threshold to "2".
>
> Kinda unconditional that... we haven't talked to other vendors even.

Existing default is 1023 ... which is not a good choice for anyone (except
perhaps ostriches that want to bury their heads in the sand and ignore marginal
DIMMs for as long as possible).

So changing the threshold to "2" would be an improvement in at least being right for
one vendor, instead of wrong for all.

If someone comes up with a different value for another CPU or DIMM vendor
combination ... would we have the RAS_CEC driver check boot_cpu_data.x86_vendor
and SMBIOS to set a different default?
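
For illustration only (nothing like this is in the patch;
cec_default_threshold() and "ExampleVendor" are made-up names, and the
SMBIOS lookup assumes <linux/dmi.h>), a sketch of that idea:

	/*
	 * Hypothetical sketch: pick a default action_threshold per
	 * CPU vendor, with room for an SMBIOS-based platform override.
	 */
	static u64 __init cec_default_threshold(void)
	{
		const char *sys_vendor = dmi_get_system_info(DMI_SYS_VENDOR);

		if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
			return 2;

		/* Made-up example of a platform-specific override. */
		if (sys_vendor && !strcmp(sys_vendor, "ExampleVendor"))
			return 4;

		return COUNT_MASK;	/* today's default */
	}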

>> 2) Do very smart platform dependent things
>
> If you mean AI, that probably won't happen in the kernel.

Agreed. You don't even need the "probably". This isn't kernel material.

Linux already has a hook in the GHES code to take an error record from
the platform and offline a page. So this "smart" code could be done
by BIOS or BMC just providing the resulting list of pages that should
be taken offline to Linux.
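
Reduced to a sketch (offline_page_from_firmware() is a made-up name;
memory_failure_queue() and MF_SOFT_OFFLINE from <linux/mm.h> are the
pieces the GHES path uses today):

	/* Firmware (BIOS or BMC) hands the OS a physical address it
	 * has decided is bad; the kernel queues the containing page
	 * for soft offline.
	 */
	static void offline_page_from_firmware(u64 phys_addr)
	{
		unsigned long pfn = phys_addr >> PAGE_SHIFT;

		memory_failure_queue(pfn, MF_SOFT_OFFLINE);
	}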

-Tony


* Re: [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"
  2022-06-27 17:27   ` Luck, Tony
@ 2022-06-28 15:59     ` Borislav Petkov
  2022-06-28 16:51       ` Luck, Tony
  0 siblings, 1 reply; 13+ messages in thread
From: Borislav Petkov @ 2022-06-28 15:59 UTC (permalink / raw)
  To: Luck, Tony; +Cc: x86, linux-kernel, patches, Yazen Ghannam

On Mon, Jun 27, 2022 at 05:27:57PM +0000, Luck, Tony wrote:
> Existing default is 1023 ... which is not a good choice for anyone (except
> perhaps ostriches that want to bury their heads in the sand and ignore marginal
> DIMMs for as long as possible).

Why isn't that a good choice?

I'm sure there are error rates where this fits just fine.

> So changing the threshold to "2" would be an improvement in at least
> being right for one vendor, instead of wrong for all.

So I'm pretty sure that is not needed on AMD at all.

> Linux already has a hook in the GHES code to take an error record from
> the platform and offline a page. So this "smart" code could be done
> by BIOS or BMC just providing the resulting list of pages that should
> be taken offline to Linux.

So my worry is some firmware agent interfering with our recovery
strategy. And reportedly, there are people who don't like the firmware
recovery at all and prefer that it all be done in the OS.

Which then makes it a problem of how to synchronize with the firmware
about who does what in RAS. And we don't have any API here...

Anyway, this is just a worry I have from watching where it's all
going.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* RE: [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"
  2022-06-28 15:59     ` Borislav Petkov
@ 2022-06-28 16:51       ` Luck, Tony
  2022-06-30  7:11         ` Borislav Petkov
  0 siblings, 1 reply; 13+ messages in thread
From: Luck, Tony @ 2022-06-28 16:51 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: x86, linux-kernel, patches, Yazen Ghannam

>> Existing default is 1023 ... which is not a good choice for anyone (except
>> perhaps ostriches that want to bury their heads in the sand and ignore marginal
>> DIMMs for as long as possible).
>
>Why isn't that a good choice?

It fails to use the capabilities of h/w and Linux to avoid a fatal error in the future.
Corrected errors are (sometimes) a predictor of marginal/aging memory. Copying
data out of a failing page while there are just corrected errors can avoid losing
that whole page later.

A single error is plausibly a particle strike causing a bit flip. But a second error
in the same page is a long shot (my desktop has 64G of memory, so 16 million
pages ... that's an awful lot of other targets for a second particle strike).
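
A quick back-of-envelope check of those numbers (plain userspace C,
just to spell out the arithmetic):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long mem = 64ULL << 30;	/* 64 GiB */
		unsigned long long pages = mem / 4096;	/* 4 KiB pages */

		printf("pages: %llu\n", pages);		/* 16777216 */
		/* chance a second independent flip hits the same page */
		printf("P(same page): %.1e\n", 1.0 / pages); /* ~6.0e-08 */
		return 0;
	}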

>I'm sure there are error rates where this fits just fine.

Explain further. Apart from the "ostrich" case I'm not sure what they are.

>> So changing the threshold to "2" would be an improvement in at least
>> being right for one vendor, instead of wrong for all.
>
>So I'm pretty sure that is not needed on AMD at all.

It's far more a property of DIMMs than of the CPU. Unless AMD are using
some DECTED or better level of ECC for memory.

-Tony


* Re: [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"
  2022-06-28 16:51       ` Luck, Tony
@ 2022-06-30  7:11         ` Borislav Petkov
  2022-06-30 17:02           ` Luck, Tony
  0 siblings, 1 reply; 13+ messages in thread
From: Borislav Petkov @ 2022-06-30  7:11 UTC (permalink / raw)
  To: Luck, Tony; +Cc: x86, linux-kernel, patches, Yazen Ghannam

On Tue, Jun 28, 2022 at 04:51:49PM +0000, Luck, Tony wrote:
> It fails to use the capabilities of h/w an Linux to avoid a fatal
> error in the future. Corrected errors are (sometimes) a predictor of
> marginal/aging memory. Copying data out of a failing page while there
> are just corrected errors can avoid losing that whole page later.

Hm, for some reason you're trying to persuade me that 2 correctable
errors per page mean that that location is going to turn into
uncorrectable and thus all pages which get two CEs per 24h should
immediately be offlined.

It might, and it is commonly accepted that CEs in a DIMM could likely
lead to UEs in the future but not necessarily. That DIMM could trigger
those CEs for years and if the ECC function in the memory controller is
good enough, it could handle those CEs and keep on going like nothing's
happened.

I.e., I'm not buying this unconditional 2 CEs/24h without any sensible
proof. That "study" simply says that someone has done some evaluation
and here's our short-term solution and you should accept it - no
questions asked.

Hell, that study is even advocating the opposite:

"not all the faults (or the pages with the CE rate satisfying a certain
condition) are equally prone to future UEs. The CE rate in the past
period is not a good predictive indicator of future UEs."

So what you're doing is punishing DIMMs which can "wobble" this way with
a couple of CEs for years without causing any issues otherwise.

> Explain further. Apart from the "ostrich" case I'm not sure what they
> are.

Actually, you should explain why this drastic measure of only two
correctable errors, all of a sudden?

The most common failure in DIMMs is single-device failure; modern ECC
schemes can handle those just fine. So what's up?

> It's far more a property of DIMMs than of the CPU. Unless AMD are
> using some DECTED or better level of ECC for memory.

Well, it does the usual ECC recovery from any number of bit flips in a
single DRAM device:

https://www.amd.com/system/files/documents/advanced-memory-device-correction.pdf

And the papers quoted there basically say that the majority of failures
are to single DRAM devices which the ECC scheme can handle just fine.

And the multiple DRAM devices failures are a very small percentage of
all the failures.

Which makes me wonder even more why your change is needed at all?

I'd understand if this were some very paranoid HPC system doing very
important computations, where it can't allow itself to suffer UEs, so
it'll go and proactively offline pages at the very first sign of trouble.
But the data says that the ECC scheme can handle single device failure
just fine and those devices fail only very seldomly and after a loooong
time.

So, if anything, your change should be Intel-only.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"
  2022-06-30  7:11         ` Borislav Petkov
@ 2022-06-30 17:02           ` Luck, Tony
  2022-07-01  8:49             ` Borislav Petkov
  0 siblings, 1 reply; 13+ messages in thread
From: Luck, Tony @ 2022-06-30 17:02 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: x86, linux-kernel, patches, Yazen Ghannam

On Thu, Jun 30, 2022 at 09:11:26AM +0200, Borislav Petkov wrote:
> On Tue, Jun 28, 2022 at 04:51:49PM +0000, Luck, Tony wrote:
> > It fails to use the capabilities of h/w and Linux to avoid a fatal
> > error in the future. Corrected errors are (sometimes) a predictor of
> > marginal/aging memory. Copying data out of a failing page while there
> > are just corrected errors can avoid losing that whole page later.
> 
> Hm, for some reason you're trying to persuade me that 2 correctable
> errors per page mean that that location is going to turn into
> uncorrectable and thus all pages which get two CEs per 24h should
> immediately be offlined.

Yes. The cost to offline a page is low (4KB reduction in system capacity
on a system with 10's or 100's of GB memory). The risk to the system
if the page does develop an uncorrected error is high (process is killed,
or system crashes).

> It might, and it is commonly accepted that CEs in a DIMM could likely
> lead to UEs in the future but not necessarily. That DIMM could trigger
> those CEs for years and if the ECC function in the memory controller is
> good enough, it could handle those CEs and keep on going like nothing's
> happened.

The question is whether the default threshold should be "do I feel
lucky?" and those corrected errors are nothing to worry about. Or
"do I want to take the safe path?" and premptively offline pages
at the first sign of trouble.

> I.e., I'm not buying this unconditional 2 CEs/24h without any sensible
> proof. That "study" simply says that someone has done some evaluation
> and here's our short-term solution and you should accept it - no
> questions asked.
> 
> Hell, that study is even advocating the opposite:
> 
> "not all the faults (or the pages with the CE rate satisfying a certain
> condition) are equally prone to future UEs. The CE rate in the past
> period is not a good predictive indicator of future UEs."

It's a cost/risk tradeoff. I think the costs are so low and the risks
are so high that a low threshold is the right choice.

> So what you're doing is punishing DIMMs which can "wobble" this way with
> a couple of CEs for years without causing any issues otherwise.

Is there a study about "wobbly" DIMMs?

> > Explain further. Apart from the "ostrich" case I'm not sure what they
> > are.
> 
> Actually, you should explain why this drastic measure of only two
> correctable errors, all of a sudden?

We now have some real data, instead of the "finger in the air" guess that
was made (on a different generation of DIMM technology ... the AMD paper
you reference below says DDR4 is 5.5x worse than DDR3).

> The most common failure in DIMMs is single-device failure; modern ECC
> schemes can handle those just fine. So what's up?

Second most common on DDR4 DIMMs is "row failure", which current ECC
systems don't handle well.

> > It's far more a property of DIMMs than of the CPU. Unless AMD are
> > using some DECTED or better level of ECC for memory.
> 
> Well, it does the usual ECC recovery from any number of bit flips in
> a single DRAM device:
> 
> https://www.amd.com/system/files/documents/advanced-memory-device-correction.pdf
> 
> And the papers quoted there basically say that the majority of failures
> are to single DRAM devices which the ECC scheme can handle just fine.
> 
> And the multiple DRAM devices failures are a very small percentage of
> all the failures.
> 
> Which makes me wonder even more why your change is needed at all?
> 
> I'd understand if this were some very paranoid HPC system doing very
> important computations, where it can't allow itself to suffer UEs, so
> it'll go and proactively offline pages at the very first sign of trouble.
> But the data says that the ECC scheme can handle single device failure
> just fine and those devices fail only very seldomly and after a loooong
> time.
> 
> So, if anything, your change should be Intel-only.

What AMD names "AMDC" looks the same as IBM's trademarked "Chipkill"
feature, also implemented by Intel under various (less catchy) names
like ADDDC and DDDDC.  So everyone has some form of "advanced RAS" to
handle DRAM device failure.

But let's talk about "fail only very seldomly". For you and me, with only
a handful of machines to worry about, "very seldom" translates into "there
are many other more important things to worry about".

But look at the error rate for memory from the perspective of a medium
sized cloud service provider with 100,000 systems across a few data
centers. Say just 8 DIMMs per server and 18 DRAM devices per DIMM:
that's 14.4 million devices. Run 24x7 for a week (168 hours) and you
have clocked 2.4 billion device hours. The AMD paper says average
FIT rate for DDR4 DRAM is 248. So the expectation should be nearly 600
DRAM faults per week across all 100K systems.
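
Spelling the arithmetic out (FIT = failures per billion device-hours):

	#include <stdio.h>

	int main(void)
	{
		double devices = 100000.0 * 8 * 18; /* 14.4M DRAM devices */
		double dev_hours = devices * 168;   /* one week: ~2.42e9  */
		double fit = 248;                   /* avg DDR4 FIT rate  */

		printf("faults/week: %.0f\n", dev_hours / 1e9 * fit); /* ~600 */
		return 0;
	}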

While that's low from one perspective (0.6% of servers affected) it's high
enough to be interesting to the CSP - because they lose revenue and
reputation when they have to tell their customers: "sorry the VM you
rented from us just crashed". Note that one physical system crashing
may take down dozens of VMs.

While anyone can tune the RAS_CEC threshold, the default value should
be something reasonable. I'm sticking with "2" being a much more
reasonable default than 1023.

-Tony


* Re: [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"
  2022-06-30 17:02           ` Luck, Tony
@ 2022-07-01  8:49             ` Borislav Petkov
  2022-07-01 16:44               ` Luck, Tony
  0 siblings, 1 reply; 13+ messages in thread
From: Borislav Petkov @ 2022-07-01  8:49 UTC (permalink / raw)
  To: Luck, Tony; +Cc: x86, linux-kernel, patches, Yazen Ghannam

On Thu, Jun 30, 2022 at 10:02:36AM -0700, Luck, Tony wrote:
> Yes. The cost to offline a page is low (4KB reduction in system capacity
> on a system with 10's or 100's of GB memory).

*If* that page is going to go bad at all.

> The risk to the system if the page does develop an uncorrected error is
> high (process is killed, or system crashes).

That's not what the papers say.

> The question is whether the default threshold should be "do I feel
> lucky?" and those corrected errors are nothing to worry about. Or
> "do I want to take the safe path?" and premptively offline pages
> at the first sign of trouble.

Well, we can't decide that for every possible situation, so if Intel's
recommendation is to do that on Intel systems, then users can set that.

/sys/kernel/debug/ras/cec/action_threshold is perhaps not the perfect
interface for that but we can make something more user-friendly.
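
For now, tuning it looks something like this from userspace (a sketch;
it assumes debugfs is mounted at /sys/kernel/debug and CONFIG_RAS_CEC=y):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/sys/kernel/debug/ras/cec/action_threshold",
			      O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (write(fd, "2\n", 2) != 2)	/* new threshold */
			perror("write");
		close(fd);
		return 0;
	}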

> Is there a study about "wobbly" DIMMs?

Most of the papers I looked at say that the majority of errors are CE
and that there's a likelihood that those errors can turn UE but none is
quantifying that likelihood. One paper says that a huge number of the
errors are transient. If you offline such a page just because two alpha
particles flew through it, you're offlining a perfectly good page.

The DRAM vendor is also important, as different DRAM vendors show
different error stats. And so on and so on.

So you can't simply go and decide for all and say, the answer is 2.

> We now have some real data, instead of the "finger in the air" guess
> that was made (on a different generation of DIMM technology ... the
> AMD paper you reference below says DDR4 is 5.5x worse than DDR3).

In the next sentence it says that the hardware handles those errors just
fine!

> Second most common on DDR4 DIMMs is "row failure", which current ECC
> systems don't handle well.

This is not what we're talking about here - we're talking about
offlining pages after 2 CEs.

As to the row offlining - yes, no question there, we need to address
that.

> While that's low from one perspective (0.6% of servers affected) it's high
> enough to be interesting to the CSP - because they lose revenue and
> reputation when they have to tell their customers: "sorry the VM you
> rented from us just crashed". Note that one physical system crashing
> may take down dozens of VMs.

So that whitepaper doesn't specify what they call "fault". Because
in one of the papers in the Reference section, they explain the
terminology:

"A fault is the underlying cause of an error, such as a stuck-at bit or
high-energy particle strike. Faults can be active (causing errors), or
dormant (not causing errors).

An error is an incorrect portion of state resulting from an active
fault, such as an incorrect value in memory. Errors may be detected and
possibly corrected by higher level mechanisms such as parity or error
correcting codes (ECC). They may also go uncorrected, or in the worst
case, completely undetected (i.e., silent)."

So even if we put on the most pessimistic glasses and say that 0.6%
of the faults result in system crashes, then a CSP can go and set the
threshold to something lower for their use case after following
recommendations by DRAM and CPU vendor and so on.

> While anyone can tune the RAS_CEC threshold, the default value should
> be something reasonable. I'm sticking with "2" being a much more
> reasonable default than 1023.

You can make that configurable or Intel-only or whatever - but not
unconditional for everyone.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* RE: [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"
  2022-07-01  8:49             ` Borislav Petkov
@ 2022-07-01 16:44               ` Luck, Tony
  2022-07-01 19:12                 ` [PATCH] RAS/CEC: Reduce offline page threshold for Intel systems Tony Luck
  0 siblings, 1 reply; 13+ messages in thread
From: Luck, Tony @ 2022-07-01 16:44 UTC (permalink / raw)
  To: Yazen Ghannam; +Cc: x86, linux-kernel, patches, Borislav Petkov

> You can make that configurable or Intel-only or whatever - but not
> unconditional for everyone.

I'm nervous about making a change for Intel that excludes AMD. It doesn't
look like good community spirit.

Yazen: Boris added you to the thread. Would this change hurt/help/do_nothing
for AMD systems?

If I post a patch that follows Boris's suggestion above to change the threshold
to "2" only for Intel systems, could I get a Reviewed-by: tag from you, or someone
else at AMD?

-Tony


* [PATCH] RAS/CEC: Reduce offline page threshold for Intel systems
  2022-07-01 16:44               ` Luck, Tony
@ 2022-07-01 19:12                 ` Tony Luck
  2022-08-02 12:07                   ` Yazen Ghannam
  0 siblings, 1 reply; 13+ messages in thread
From: Tony Luck @ 2022-07-01 19:12 UTC (permalink / raw)
  To: yazen.ghannam; +Cc: tony.luck, bp, linux-kernel, patches, x86

A large scale study of memory errors on Intel systems in data centers
showed that aggressively taking pages with corrected errors offline is
the best strategy for using corrected errors as a predictor of future
uncorrected errors.

It is unknown whether this would help other vendors. There are some
indicators that it would not.

Set the threshold to "2" on Intel systems.

Do-not-apply-without-agreement-from-AMD
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 drivers/ras/cec.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 42f2fc0bc8a9..b1fc193b2036 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -556,6 +556,14 @@ static int __init cec_init(void)
 	if (ce_arr.disabled)
 		return -ENODEV;
 
+	/*
+	 * Intel systems may avoid uncorreectable errors
+	 * if pages with corrected errors are aggresively
+	 * taken offline.
+	 */
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+		action_threshold = 2;
+
 	ce_arr.array = (void *)get_zeroed_page(GFP_KERNEL);
 	if (!ce_arr.array) {
 		pr_err("Error allocating CE array page!\n");
-- 
2.35.3



* Re: [PATCH] RAS/CEC: Reduce offline page threshold for Intel systems
  2022-07-01 19:12                 ` [PATCH] RAS/CEC: Reduce offline page threshold for Intel systems Tony Luck
@ 2022-08-02 12:07                   ` Yazen Ghannam
  2022-08-02 16:18                     ` [PATCH v2] " Tony Luck
  0 siblings, 1 reply; 13+ messages in thread
From: Yazen Ghannam @ 2022-08-02 12:07 UTC (permalink / raw)
  To: Tony Luck; +Cc: bp, linux-kernel, patches, x86

On Fri, Jul 01, 2022 at 12:12:39PM -0700, Tony Luck wrote:
> A large scale study of memory errors on Intel systems in data centers
> showed that aggressively taking pages with corrected errors offline is
> the best strategy for using corrected errors as a predictor of future
> uncorrected errors.
> 
> It is unknown whether this would help other vendors. There are some
> indicators that it would not.
> 
> Set the threshold to "2" on Intel systems.
> 
> Do-not-apply-without-agreement-from-AMD
> Signed-off-by: Tony Luck <tony.luck@intel.com>

Hi Tony,
The guidance from our hardware folks is that this isn't necessary for our
systems. So I think restricting this to Intel systems is okay.

> ---
>  drivers/ras/cec.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
> index 42f2fc0bc8a9..b1fc193b2036 100644
> --- a/drivers/ras/cec.c
> +++ b/drivers/ras/cec.c
> @@ -556,6 +556,14 @@ static int __init cec_init(void)
>  	if (ce_arr.disabled)
>  		return -ENODEV;
>  
> +	/*
> +	 * Intel systems may avoid uncorreectable errors
> +	 * if pages with corrected errors are aggresively
> +	 * taken offline.
> +	 */

s/uncorreectable/uncorrectable/
s/aggresively/aggressively/

> +	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
> +		action_threshold = 2;
> +
>  	ce_arr.array = (void *)get_zeroed_page(GFP_KERNEL);
>  	if (!ce_arr.array) {
>  		pr_err("Error allocating CE array page!\n");
> --

Looks good to me overall.

Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>

Thanks,
Yazen


* [PATCH v2] RAS/CEC: Reduce offline page threshold for Intel systems
  2022-08-02 12:07                   ` Yazen Ghannam
@ 2022-08-02 16:18                     ` Tony Luck
  0 siblings, 0 replies; 13+ messages in thread
From: Tony Luck @ 2022-08-02 16:18 UTC (permalink / raw)
  To: Yazen Ghannam; +Cc: bp, linux-kernel, patches, x86

A large scale study of memory errors on Intel systems in data centers
showed that aggressively taking pages with corrected errors offline is
the best strategy for using corrected errors as a predictor of future
uncorrected errors.

Set the threshold to "2" on Intel systems. AMD guidance is that this is
not necessary for their systems.

Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---

V2:
	Fix some spelling errors. 
	Add note to commit that AMD systems do not need this.
	Add Yazen's Reviewed-by tag.

 drivers/ras/cec.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 42f2fc0bc8a9..321af498ee11 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -556,6 +556,14 @@ static int __init cec_init(void)
 	if (ce_arr.disabled)
 		return -ENODEV;
 
+	/*
+	 * Intel systems may avoid uncorrectable errors
+	 * if pages with corrected errors are aggressively
+	 * taken offline.
+	 */
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+		action_threshold = 2;
+
 	ce_arr.array = (void *)get_zeroed_page(GFP_KERNEL);
 	if (!ce_arr.array) {
 		pr_err("Error allocating CE array page!\n");
-- 
2.35.3



* [tip: ras/core] RAS/CEC: Reduce offline page threshold for Intel systems
  2022-06-07 21:20 [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2" Tony Luck
  2022-06-27 14:40 ` Borislav Petkov
@ 2022-08-22 17:41 ` tip-bot2 for Tony Luck
  1 sibling, 0 replies; 13+ messages in thread
From: tip-bot2 for Tony Luck @ 2022-08-22 17:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Tony Luck, Borislav Petkov, Yazen Ghannam, x86, linux-kernel

The following commit has been merged into the ras/core branch of tip:

Commit-ID:     d25c6948a6aad787d9fd64de6b5362c3f23cc8d0
Gitweb:        https://git.kernel.org/tip/d25c6948a6aad787d9fd64de6b5362c3f23cc8d0
Author:        Tony Luck <tony.luck@intel.com>
AuthorDate:    Tue, 02 Aug 2022 09:18:47 -07:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Mon, 22 Aug 2022 19:30:02 +02:00

RAS/CEC: Reduce offline page threshold for Intel systems

A large scale study of memory errors on Intel systems in data centers
showed that aggressively taking pages with corrected errors offline is
the best strategy for using corrected errors as a predictor of future
uncorrected errors.

Set the threshold to "2" on Intel systems. AMD guidance is that this is
not necessary for their systems.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20220607212015.175591-1-tony.luck@intel.com
Link: https://lore.kernel.org/r/YulOZ/Eso0bwUcC4@agluck-desk3.sc.intel.com
---
 drivers/ras/cec.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 42f2fc0..321af49 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -556,6 +556,14 @@ static int __init cec_init(void)
 	if (ce_arr.disabled)
 		return -ENODEV;
 
+	/*
+	 * Intel systems may avoid uncorrectable errors
+	 * if pages with corrected errors are aggressively
+	 * taken offline.
+	 */
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+		action_threshold = 2;
+
 	ce_arr.array = (void *)get_zeroed_page(GFP_KERNEL);
 	if (!ce_arr.array) {
 		pr_err("Error allocating CE array page!\n");
