* [PATCH] x86/mce: correct cpu_missing reporting in mce_timed_out
@ 2021-11-04  7:44 Zhaolong Zhang
  2021-11-04  9:13 ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: Zhaolong Zhang @ 2021-11-04  7:44 UTC (permalink / raw)
  To: Zhaolong Zhang, Tony Luck, Borislav Petkov; +Cc: x86, linux-edac, linux-kernel

Set cpu_missing before calling mce_panic() so that the panic path prints the correct message.

Signed-off-by: Zhaolong Zhang <zhangzl2013@126.com>
---
 arch/x86/kernel/cpu/mce/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 50a3e455cded..ccefe131ab55 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -903,13 +903,13 @@ static int mce_timed_out(u64 *t, const char *msg)
 	if (!mca_cfg.monarch_timeout)
 		goto out;
 	if ((s64)*t < SPINUNIT) {
+		cpu_missing = 1;
 		if (mca_cfg.tolerant <= 1) {
 			if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
 				pr_emerg("CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
 					 cpumask_pr_args(&mce_missing_cpus));
 			mce_panic(msg, NULL, NULL);
 		}
-		cpu_missing = 1;
 		return 1;
 	}
 	*t -= SPINUNIT;
-- 
2.27.0



* Re: [PATCH] x86/mce: correct cpu_missing reporting in mce_timed_out
  2021-11-04  7:44 [PATCH] x86/mce: correct cpu_missing reporting in mce_timed_out Zhaolong Zhang
@ 2021-11-04  9:13 ` Borislav Petkov
  2021-11-04 15:47   ` Luck, Tony
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2021-11-04  9:13 UTC (permalink / raw)
  To: Zhaolong Zhang, Tony Luck; +Cc: x86, linux-edac, linux-kernel

On Thu, Nov 04, 2021 at 03:44:31PM +0800, Zhaolong Zhang wrote:
> Set cpu_missing before calling mce_panic() so that the panic path prints the correct message.
> 
> Signed-off-by: Zhaolong Zhang <zhangzl2013@126.com>
> ---
>  arch/x86/kernel/cpu/mce/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 50a3e455cded..ccefe131ab55 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -903,13 +903,13 @@ static int mce_timed_out(u64 *t, const char *msg)
>  	if (!mca_cfg.monarch_timeout)
>  		goto out;
>  	if ((s64)*t < SPINUNIT) {
> +		cpu_missing = 1;
>  		if (mca_cfg.tolerant <= 1) {
>  			if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
>  				pr_emerg("CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
>  					 cpumask_pr_args(&mce_missing_cpus));
>  			mce_panic(msg, NULL, NULL);
>  		}
> -		cpu_missing = 1;
>  		return 1;
>  	}
>  	*t -= SPINUNIT;
> -- 

Frankly, we might just as well kill that cpu_missing thing because we
already say that some CPUs are not responding.

And that "Some CPUs didn't answer in synchronization" is not really
telling me a whole lot.

Tony, do you see any real need to keep it?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* RE: [PATCH] x86/mce: correct cpu_missing reporting in mce_timed_out
  2021-11-04  9:13 ` Borislav Petkov
@ 2021-11-04 15:47   ` Luck, Tony
  2021-11-04 18:02     ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: Luck, Tony @ 2021-11-04 15:47 UTC (permalink / raw)
  To: Borislav Petkov, Zhaolong Zhang; +Cc: x86, linux-edac, linux-kernel

> Frankly, we might just as well kill that cpu_missing thing because we
> already say that some CPUs are not responding.

Yes. The more recent commit:

7bb39313cd62 ("x86/mce: Make mce_timed_out() identify holdout CPUs")

tries to provide the more detailed message about *which* CPUs are missing

> Tony, do you see any real need to keep it?

I think cpu_missing can be dropped.

-Tony
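
A minimal userspace sketch of the holdout tracking discussed above. It is not the kernel code: a `uint64_t` stands in for the `mce_missing_cpus` cpumask, and the names `missing_cpus`, `cpu_checks_in`, `holdouts` and `reset_missing` are hypothetical. It also assumes that each CPU clears its own bit when it joins the rendezvous, which the hunks quoted in this thread do not show.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUS 64

/* Hypothetical stand-in for mce_missing_cpus: every bit starts set. */
static uint64_t missing_cpus = UINT64_MAX;

/* Assumed behaviour: a CPU entering the rendezvous clears its own bit. */
static void cpu_checks_in(unsigned int cpu)
{
	assert(cpu < MAX_CPUS);
	missing_cpus &= ~(UINT64_C(1) << cpu);
}

/*
 * On timeout, restrict the mask to online CPUs, mirroring
 * cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus).
 * A nonzero result names the CPUs that never checked in.
 */
static uint64_t holdouts(uint64_t online_mask)
{
	missing_cpus &= online_mask;
	return missing_cpus;
}

/* Mirrors cpumask_setall(&mce_missing_cpus) at the end of a rendezvous. */
static void reset_missing(void)
{
	missing_cpus = UINT64_MAX;
}
```

With CPUs 0-3 online and only CPUs 0 and 2 checking in, `holdouts(0xF)` leaves bits 1 and 3 set, which is the per-CPU detail the single `cpu_missing` flag cannot carry.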


* Re: [PATCH] x86/mce: correct cpu_missing reporting in mce_timed_out
  2021-11-04 15:47   ` Luck, Tony
@ 2021-11-04 18:02     ` Borislav Petkov
  2021-11-05  2:19       ` Zhaolong Zhang
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2021-11-04 18:02 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Zhaolong Zhang, x86, linux-edac, linux-kernel

On Thu, Nov 04, 2021 at 03:47:36PM +0000, Luck, Tony wrote:
> > Frankly, we might just as well kill that cpu_missing thing because we
> > already say that some CPUs are not responding.
> 
> Yes. The more recent commit:
> 
> 7bb39313cd62 ("x86/mce: Make mce_timed_out() identify holdout CPUs")
> 
> tries to provide the more detailed message about *which* CPUs are missing

Exactly.

> I think cpu_missing can be dropped.

Zhaolong, you could send a patch doing that, instead.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re:Re: [PATCH] x86/mce: correct cpu_missing reporting in mce_timed_out
  2021-11-04 18:02     ` Borislav Petkov
@ 2021-11-05  2:19       ` Zhaolong Zhang
  2021-11-08  8:28         ` [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus Zhaolong Zhang
  0 siblings, 1 reply; 22+ messages in thread
From: Zhaolong Zhang @ 2021-11-05  2:19 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, x86, linux-edac, linux-kernel


At 2021-11-05 02:02:36, "Borislav Petkov" <bp@alien8.de> wrote:
>On Thu, Nov 04, 2021 at 03:47:36PM +0000, Luck, Tony wrote:
>> > Frankly, we might just as well kill that cpu_missing thing because we
>> > already say that some CPUs are not responding.
>> 
>> Yes. The more recent commit:
>> 
>> 7bb39313cd62 ("x86/mce: Make mce_timed_out() identify holdout CPUs")
>> 
>> tries to provide the more detailed message about *which* CPUs are missing
>
>Exactly.
>
>> I think cpu_missing can be dropped.
>
>Zhaolong, you could send a patch doing that, instead.

Thanks for the reply. Let me see how to do it properly.

Regards,
Zhaolong

>
>Thx.
>
>-- 
>Regards/Gruss,
>    Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette


* [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus
  2021-11-05  2:19       ` Zhaolong Zhang
@ 2021-11-08  8:28         ` Zhaolong Zhang
  2021-11-08  9:31           ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: Zhaolong Zhang @ 2021-11-08  8:28 UTC (permalink / raw)
  To: Tony Luck, Borislav Petkov, Zhaolong Zhang
  Cc: x86, linux-edac, linux-kernel, Paul E . McKenney

Move the mce_missing_cpus check into mce_panic() as well, so that the missing-CPU
information is not lost when mca_cfg.tolerant > 1 and no_way_out is set.

Signed-off-by: Zhaolong Zhang <zhangzl2013@126.com>
---
 arch/x86/kernel/cpu/mce/core.c | 38 ++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 50a3e455cded..0bb59e68a457 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -99,7 +99,6 @@ struct mca_config mca_cfg __read_mostly = {
 
 static DEFINE_PER_CPU(struct mce, mces_seen);
 static unsigned long mce_need_notify;
-static int cpu_missing;
 
 /*
  * MCA banks polled by the period polling timer for corrected events.
@@ -253,6 +252,12 @@ static atomic_t mce_panicked;
 static int fake_panic;
 static atomic_t mce_fake_panicked;
 
+/*
+ * Track which CPUs entered the MCA broadcast synchronization and which not in
+ * order to print holdouts.
+ */
+static cpumask_t mce_missing_cpus = CPU_MASK_ALL;
+
 /* Panic in progress. Enable interrupts and wait for final IPI */
 static void wait_for_panic(void)
 {
@@ -314,8 +319,13 @@ static void mce_panic(const char *msg, struct mce *final, char *exp)
 		if (!apei_err)
 			apei_err = apei_write_mce(final);
 	}
-	if (cpu_missing)
-		pr_emerg(HW_ERR "Some CPUs didn't answer in synchronization\n");
+	/*
+	 * cpu_online_mask == &mce_missing_cpus means it is reset and no timeout happens.
+	 */
+	if (!cpumask_equal(cpu_online_mask, &mce_missing_cpus) &&
+	    cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
+		pr_emerg(HW_ERR "CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
+			 cpumask_pr_args(&mce_missing_cpus));
 	if (exp)
 		pr_emerg(HW_ERR "Machine check: %s\n", exp);
 	if (!fake_panic) {
@@ -880,12 +890,6 @@ static atomic_t mce_executing;
  */
 static atomic_t mce_callin;
 
-/*
- * Track which CPUs entered the MCA broadcast synchronization and which not in
- * order to print holdouts.
- */
-static cpumask_t mce_missing_cpus = CPU_MASK_ALL;
-
 /*
  * Check if a timeout waiting for other CPUs happened.
  */
@@ -904,12 +908,8 @@ static int mce_timed_out(u64 *t, const char *msg)
 		goto out;
 	if ((s64)*t < SPINUNIT) {
 		if (mca_cfg.tolerant <= 1) {
-			if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
-				pr_emerg("CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
-					 cpumask_pr_args(&mce_missing_cpus));
 			mce_panic(msg, NULL, NULL);
 		}
-		cpu_missing = 1;
 		return 1;
 	}
 	*t -= SPINUNIT;
@@ -1079,8 +1079,10 @@ static int mce_end(int order)
 
 	if (!timeout)
 		goto reset;
-	if (order < 0)
+	if (order < 0) {
+		timeout = 0;
 		goto reset;
+	}
 
 	/*
 	 * Allow others to run.
@@ -1128,7 +1130,12 @@ static int mce_end(int order)
 reset:
 	atomic_set(&global_nwo, 0);
 	atomic_set(&mce_callin, 0);
-	cpumask_setall(&mce_missing_cpus);
+	/*
+	 * Don't reset mce_missing_cpus after mce_timed_out() has fired, so
+	 * that mce_panic() can report the holdout CPUs.
+	 */
+	if (!((s64)timeout < SPINUNIT))
+		cpumask_setall(&mce_missing_cpus);
 	barrier();
 
 	/*
@@ -2720,7 +2727,6 @@ struct dentry *mce_get_debugfs_dir(void)
 
 static void mce_reset(void)
 {
-	cpu_missing = 0;
 	atomic_set(&mce_fake_panicked, 0);
 	atomic_set(&mce_executing, 0);
 	atomic_set(&mce_callin, 0);
-- 
2.27.0



* Re: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus
  2021-11-08  8:28         ` [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus Zhaolong Zhang
@ 2021-11-08  9:31           ` Borislav Petkov
  2021-11-08 10:13             ` Zhaolong Zhang
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2021-11-08  9:31 UTC (permalink / raw)
  To: Zhaolong Zhang
  Cc: Tony Luck, x86, linux-edac, linux-kernel, Paul E . McKenney

On Mon, Nov 08, 2021 at 04:28:32PM +0800, Zhaolong Zhang wrote:
> Move the mce_missing_cpus check into mce_panic() as well, so that the missing-CPU
> information is not lost when mca_cfg.tolerant > 1 and no_way_out is set.
> 
> Signed-off-by: Zhaolong Zhang <zhangzl2013@126.com>
> ---
>  arch/x86/kernel/cpu/mce/core.c | 38 ++++++++++++++++++++--------------
>  1 file changed, 22 insertions(+), 16 deletions(-)

I was actually expecting to see something like this:

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6ed365337a3b..30de00fe0d7a 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -99,7 +99,6 @@ struct mca_config mca_cfg __read_mostly = {
 
 static DEFINE_PER_CPU(struct mce, mces_seen);
 static unsigned long mce_need_notify;
-static int cpu_missing;
 
 /*
  * MCA banks polled by the period polling timer for corrected events.
@@ -314,8 +313,6 @@ static void mce_panic(const char *msg, struct mce *final, char *exp)
 		if (!apei_err)
 			apei_err = apei_write_mce(final);
 	}
-	if (cpu_missing)
-		pr_emerg(HW_ERR "Some CPUs didn't answer in synchronization\n");
 	if (exp)
 		pr_emerg(HW_ERR "Machine check: %s\n", exp);
 	if (!fake_panic) {
@@ -891,7 +888,6 @@ static int mce_timed_out(u64 *t, const char *msg)
 					 cpumask_pr_args(&mce_missing_cpus));
 			mce_panic(msg, NULL, NULL);
 		}
-		cpu_missing = 1;
 		return 1;
 	}
 	*t -= SPINUNIT;
@@ -2702,7 +2698,6 @@ struct dentry *mce_get_debugfs_dir(void)
 
 static void mce_reset(void)
 {
-	cpu_missing = 0;
 	atomic_set(&mce_fake_panicked, 0);
 	atomic_set(&mce_executing, 0);
 	atomic_set(&mce_callin, 0);

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus
  2021-11-08  9:31           ` Borislav Petkov
@ 2021-11-08 10:13             ` Zhaolong Zhang
  2021-11-08 10:31               ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: Zhaolong Zhang @ 2021-11-08 10:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, linux-edac, linux-kernel, Paul E . McKenney

At 2021-11-08 17:31:52, "Borislav Petkov" <bp@alien8.de> wrote:
>On Mon, Nov 08, 2021 at 04:28:32PM +0800, Zhaolong Zhang wrote:
>> Move the mce_missing_cpus check into mce_panic() as well, so that the missing-CPU
>> information is not lost when mca_cfg.tolerant > 1 and no_way_out is set.
>> 
>> Signed-off-by: Zhaolong Zhang <zhangzl2013@126.com>
>> ---
>>  arch/x86/kernel/cpu/mce/core.c | 38 ++++++++++++++++++++--------------
>>  1 file changed, 22 insertions(+), 16 deletions(-)
>
>I was actually expecting to see something like this:

Hi Boris,

I was concerned that if I simply removed the cpu_missing code, we would lose the log in the
situation where mca_cfg.tolerant > 1 and no_way_out is set afterwards.

Do you think we can safely ignore that situation?

Regards,
Zhaolong


>
>diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>index 6ed365337a3b..30de00fe0d7a 100644
>--- a/arch/x86/kernel/cpu/mce/core.c
>+++ b/arch/x86/kernel/cpu/mce/core.c
>@@ -99,7 +99,6 @@ struct mca_config mca_cfg __read_mostly = {
> 
> static DEFINE_PER_CPU(struct mce, mces_seen);
> static unsigned long mce_need_notify;
>-static int cpu_missing;
> 
> /*
>  * MCA banks polled by the period polling timer for corrected events.
>@@ -314,8 +313,6 @@ static void mce_panic(const char *msg, struct mce *final, char *exp)
> 		if (!apei_err)
> 			apei_err = apei_write_mce(final);
> 	}
>-	if (cpu_missing)
>-		pr_emerg(HW_ERR "Some CPUs didn't answer in synchronization\n");
> 	if (exp)
> 		pr_emerg(HW_ERR "Machine check: %s\n", exp);
> 	if (!fake_panic) {
>@@ -891,7 +888,6 @@ static int mce_timed_out(u64 *t, const char *msg)
> 					 cpumask_pr_args(&mce_missing_cpus));
> 			mce_panic(msg, NULL, NULL);
> 		}
>-		cpu_missing = 1;
> 		return 1;
> 	}
> 	*t -= SPINUNIT;
>@@ -2702,7 +2698,6 @@ struct dentry *mce_get_debugfs_dir(void)
> 
> static void mce_reset(void)
> {
>-	cpu_missing = 0;
> 	atomic_set(&mce_fake_panicked, 0);
> 	atomic_set(&mce_executing, 0);
> 	atomic_set(&mce_callin, 0);
>
>-- 
>Regards/Gruss,
>    Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus
  2021-11-08 10:13             ` Zhaolong Zhang
@ 2021-11-08 10:31               ` Borislav Petkov
  2021-11-08 12:47                 ` Zhaolong Zhang
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2021-11-08 10:31 UTC (permalink / raw)
  To: Zhaolong Zhang
  Cc: Tony Luck, x86, linux-edac, linux-kernel, Paul E . McKenney

On Mon, Nov 08, 2021 at 06:13:04PM +0800, Zhaolong Zhang wrote:
> I was concerned that if I simply removed the cpu_missing code, we would lose the log in the
> situation where mca_cfg.tolerant > 1 and no_way_out is set afterwards.
> 
> Do you think we can safely ignore that situation?

Well, how likely is it to have such a situation in practice?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus
  2021-11-08 10:31               ` Borislav Petkov
@ 2021-11-08 12:47                 ` Zhaolong Zhang
  2021-11-09  8:31                   ` Zhaolong Zhang
  0 siblings, 1 reply; 22+ messages in thread
From: Zhaolong Zhang @ 2021-11-08 12:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, linux-edac, linux-kernel, Paul E . McKenney

At 2021-11-08 18:31:38, "Borislav Petkov" <bp@alien8.de> wrote:
>On Mon, Nov 08, 2021 at 06:13:04PM +0800, Zhaolong Zhang wrote:
>> I was concerned that if I simply removed the cpu_missing code, we would lose the log in the
>> situation where mca_cfg.tolerant > 1 and no_way_out is set afterwards.
>> 
>> Do you think we can safely ignore that situation?
>
>Well, how likely is it to have such a situation in practice?

It is difficult to answer...
But since the current code deals with this situation, I think I should cover it too,
even though it is only a log message.

Regards,
Zhaolong


* Re: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus
  2021-11-08 12:47                 ` Zhaolong Zhang
@ 2021-11-09  8:31                   ` Zhaolong Zhang
  2021-11-09  8:35                     ` [PATCH] x86/mce: Get rid of cpu_missing Zhaolong Zhang
  2021-11-09  9:07                     ` [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus Borislav Petkov
  0 siblings, 2 replies; 22+ messages in thread
From: Zhaolong Zhang @ 2021-11-09  8:31 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, linux-edac, linux-kernel, Paul E . McKenney

At 2021-11-08 20:47:59, "Zhaolong Zhang" <zhangzl2013@126.com> wrote:
>At 2021-11-08 18:31:38, "Borislav Petkov" <bp@alien8.de> wrote:
>>On Mon, Nov 08, 2021 at 06:13:04PM +0800, Zhaolong Zhang wrote:
>>> I was concerned that if I simply removed the cpu_missing code, we would lose the log in the
>>> situation where mca_cfg.tolerant > 1 and no_way_out is set afterwards.
>>> 
>>> Do you think we can safely ignore that situation?
>>
>>Well, how likely is it to have such a situation in practice?
>
>It is difficult to answer...
>But since the current code deals with this situation, I think I should cover it too,
>even though it is only a log message.

Hi Boris,

I reconsidered the situation.
If there is a non-recoverable MCE as well, just let it print that reason. There is indeed no
need to bring in the timeout message: since tolerant was set to a high level in order to ignore
the timeout, the timeout message can be ignored as well.

So simply dropping the cpu_missing variable, as you suggested, should work.

I am not sure whether it should be authored by you or suggested by you.
Anyway, I will post a new patch exactly as you suggested. Please pick it or ignore it as appropriate :)

Thanks,
Zhaolong


* [PATCH] x86/mce: Get rid of cpu_missing
  2021-11-09  8:31                   ` Zhaolong Zhang
@ 2021-11-09  8:35                     ` Zhaolong Zhang
  2021-11-09  9:15                       ` Borislav Petkov
  2021-11-09  9:07                     ` [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus Borislav Petkov
  1 sibling, 1 reply; 22+ messages in thread
From: Zhaolong Zhang @ 2021-11-09  8:35 UTC (permalink / raw)
  To: Tony Luck, Borislav Petkov, Zhaolong Zhang
  Cc: x86, linux-edac, linux-kernel, Paul E . McKenney

Drop cpu_missing since we have more capable mce_missing_cpus.

Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Zhaolong Zhang <zhangzl2013@126.com>
---
 arch/x86/kernel/cpu/mce/core.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 50a3e455cded..51aefffe39f1 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -99,7 +99,6 @@ struct mca_config mca_cfg __read_mostly = {
 
 static DEFINE_PER_CPU(struct mce, mces_seen);
 static unsigned long mce_need_notify;
-static int cpu_missing;
 
 /*
  * MCA banks polled by the period polling timer for corrected events.
@@ -314,8 +313,6 @@ static void mce_panic(const char *msg, struct mce *final, char *exp)
 		if (!apei_err)
 			apei_err = apei_write_mce(final);
 	}
-	if (cpu_missing)
-		pr_emerg(HW_ERR "Some CPUs didn't answer in synchronization\n");
 	if (exp)
 		pr_emerg(HW_ERR "Machine check: %s\n", exp);
 	if (!fake_panic) {
@@ -909,7 +906,6 @@ static int mce_timed_out(u64 *t, const char *msg)
 					 cpumask_pr_args(&mce_missing_cpus));
 			mce_panic(msg, NULL, NULL);
 		}
-		cpu_missing = 1;
 		return 1;
 	}
 	*t -= SPINUNIT;
@@ -2720,7 +2716,6 @@ struct dentry *mce_get_debugfs_dir(void)
 
 static void mce_reset(void)
 {
-	cpu_missing = 0;
 	atomic_set(&mce_fake_panicked, 0);
 	atomic_set(&mce_executing, 0);
 	atomic_set(&mce_callin, 0);
-- 
2.27.0



* Re: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus
  2021-11-09  8:31                   ` Zhaolong Zhang
  2021-11-09  8:35                     ` [PATCH] x86/mce: Get rid of cpu_missing Zhaolong Zhang
@ 2021-11-09  9:07                     ` Borislav Petkov
  2021-11-09 16:06                       ` Luck, Tony
  1 sibling, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2021-11-09  9:07 UTC (permalink / raw)
  To: Zhaolong Zhang
  Cc: Tony Luck, x86, linux-edac, linux-kernel, Paul E . McKenney

On Tue, Nov 09, 2021 at 04:31:23PM +0800, Zhaolong Zhang wrote:
> If there is a non-recoverable MCE as well, just let it print that
> reason. There is indeed no need to bring in the timeout message: since
> tolerant was set to a high level in order to ignore the timeout, the
> timeout message can be ignored as well.

Here's how I see it:

	/*
	 * Tolerant levels:
	 * 0: always panic on uncorrected errors, log corrected errors
	 * 1: panic or SIGBUS on uncorrected errors, log corrected errors
	 * 2: SIGBUS or log uncorrected errors (if possible), log corr. errors
	 * 3: never panic or SIGBUS, log all errors (for testing only)
	 */

So on normal deployments, no one should fiddle with tolerant levels - so
you'll be running at tolerance level 0 by default and all should print
out. Same for level 1.

Levels 2 and 3 are, to me at least, purely for testing *only*. And,
actually, that error message should be issued regardless of the
tolerance level - only the panicking should be controlled by that. IOW,
that code should do:

        if ((s64)*t < SPINUNIT) {
                if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
                        pr_emerg("CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
                                 cpumask_pr_args(&mce_missing_cpus));
                if (mca_cfg.tolerant <= 1)
                        mce_panic(msg, NULL, NULL);
                return 1;
        }

because, regardless of tolerance level, saying that some cores didn't
respond is important info.

You could do that as a separate patch, on top, if you feel like it.
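
The control flow proposed above can be sketched in userspace as follows. This is only an illustration under stated assumptions: `handle_timeout` and `struct timeout_result` are hypothetical names, plain 64-bit masks replace the kernel cpumasks, and the call to mce_panic() is modelled as a flag because a real panic does not return. The range check on `tolerant` assumes the four levels 0-3 listed in the comment quoted earlier in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record of what the timeout path did. */
struct timeout_result {
	bool reported;	/* the holdout message was printed */
	bool panicked;	/* the kernel would have called mce_panic() */
};

static struct timeout_result handle_timeout(int tolerant, uint64_t missing,
					    uint64_t online)
{
	struct timeout_result r = { false, false };
	uint64_t holdouts = missing & online;

	assert(tolerant >= 0 && tolerant <= 3);

	/* Report holdouts regardless of the tolerance level. */
	if (holdouts) {
		printf("CPUs not responding to MCE broadcast: mask 0x%llx\n",
		       (unsigned long long)holdouts);
		r.reported = true;
	}

	/* Only the panic decision depends on the tolerance level. */
	if (tolerant <= 1)
		r.panicked = true;

	return r;
}
```

At tolerance level 3 with CPU 1 missing, the sketch reports the holdout but does not panic; at level 0 it does both.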

> I am not sure whether it should be authored by you or suggested by
> you.

Suggested is fine.

> Anyway, I will post a new patch exactly as you suggested. Please pick
> it or ignore it as appropriate :)

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH] x86/mce: Get rid of cpu_missing
  2021-11-09  8:35                     ` [PATCH] x86/mce: Get rid of cpu_missing Zhaolong Zhang
@ 2021-11-09  9:15                       ` Borislav Petkov
  2021-11-09 14:19                         ` Zhaolong Zhang
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2021-11-09  9:15 UTC (permalink / raw)
  To: Zhaolong Zhang
  Cc: Tony Luck, x86, linux-edac, linux-kernel, Paul E . McKenney

On Tue, Nov 09, 2021 at 04:35:47PM +0800, Zhaolong Zhang wrote:
> Drop cpu_missing since we have more capable mce_missing_cpus.

Who is "we"?

Also, you need to try harder with that commit message - mce_missing_cpus
is a cpumask and I don't see how a cpumask can be "more capable"...

Some more hints on a possible way to structure a commit message - those
are just hints - not necessarily rules - but it should help you get an
idea:

Problem is A.

It happens because of B.

Fix it by doing C.

(Potentially do D).

For more detailed info, see
Documentation/process/submitting-patches.rst, Section "2) Describe your
changes".

Also, to the tone, from Documentation/process/submitting-patches.rst:

 "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
  instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
  to do frotz", as if you are giving orders to the codebase to change
  its behaviour."

Also, do not talk about what your patch does - that should hopefully be
visible in the diff itself. Rather, talk about *why* you're doing what
you're doing.

Also, please use passive voice in your commit message: no "we" or "I", etc,
and describe your changes in imperative mood.

Bottom line is: personal pronouns are ambiguous in text, especially with
so many parties/companies/etc developing the kernel so let's avoid them
please.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH] x86/mce: Get rid of cpu_missing
  2021-11-09  9:15                       ` Borislav Petkov
@ 2021-11-09 14:19                         ` Zhaolong Zhang
  0 siblings, 0 replies; 22+ messages in thread
From: Zhaolong Zhang @ 2021-11-09 14:19 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, x86, linux-edac, linux-kernel, Paul E . McKenney

At 2021-11-09 17:15:11, "Borislav Petkov" <bp@alien8.de> wrote:
>On Tue, Nov 09, 2021 at 04:35:47PM +0800, Zhaolong Zhang wrote:
>> Drop cpu_missing since we have more capable mce_missing_cpus.
>
>Who is "we"?
>
>Also, you need to try harder with that commit message - mce_missing_cpus
>is a cpumask and I don't see how a cpumask can be "more capable"...
>
>Some more hints on a possible way to structure a commit message - those
>are just hints - not necessarily rules - but it should help you get an
>idea:
>
>Problem is A.
>
>It happens because of B.
>
>Fix it by doing C.
>
>(Potentially do D).
>
>For more detailed info, see
>Documentation/process/submitting-patches.rst, Section "2) Describe your
>changes".
>
>Also, to the tone, from Documentation/process/submitting-patches.rst:
>
> "Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
>  instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
>  to do frotz", as if you are giving orders to the codebase to change
>  its behaviour."
>
>Also, do not talk about what your patch does - that should hopefully be
>visible in the diff itself. Rather, talk about *why* you're doing what
>you're doing.
>
>Also, please use passive voice in your commit message: no "we" or "I", etc,
>and describe your changes in imperative mood.
>
>Bottom line is: personal pronouns are ambiguous in text, especially with
>so many parties/companies/etc developing the kernel so let's avoid them
>please.

Hi Boris,

Thank you so much for your kind reply. I really appreciate your detailed guidance.
I have sent a v2 patch with a new description that tries to be both useful and brief.
I hope it is acceptable.

Regards,
Zhaolong


* RE: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus
  2021-11-09  9:07                     ` [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus Borislav Petkov
@ 2021-11-09 16:06                       ` Luck, Tony
  2021-11-09 19:48                         ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: Luck, Tony @ 2021-11-09 16:06 UTC (permalink / raw)
  To: Borislav Petkov, Zhaolong Zhang
  Cc: x86, linux-edac, linux-kernel, Paul E . McKenney

>        if ((s64)*t < SPINUNIT) {
>                if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
>                        pr_emerg("CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
>                                 cpumask_pr_args(&mce_missing_cpus));
>                if (mca_cfg.tolerant <= 1)
>                        mce_panic(msg, NULL, NULL);
>                return 1;
>        }

Just a note that skipping the mce_panic() here isn't going to help much. With some CPUs
stuck not responding to #MC the system is going to lock up or crash for other timeouts in
the next few seconds.

-Tony



* Re: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus
  2021-11-09 16:06                       ` Luck, Tony
@ 2021-11-09 19:48                         ` Borislav Petkov
  2021-11-09 19:50                           ` Luck, Tony
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2021-11-09 19:48 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Zhaolong Zhang, x86, linux-edac, linux-kernel, Paul E . McKenney

On Tue, Nov 09, 2021 at 04:06:48PM +0000, Luck, Tony wrote:
> Just a note that skipping the mce_panic() here isn't going to help
> much. With some CPUs stuck not responding to #MC the system is going
> to lock up or crash for other timeouts in the next few seconds.

Yeh, I spent a couple of minutes today staring at this ->tolerant
thing and wondering why we need it at all. I wouldn't mind ripping it
altogether unless you're using it for testing or so.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* RE: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus
  2021-11-09 19:48                         ` Borislav Petkov
@ 2021-11-09 19:50                           ` Luck, Tony
  2021-11-09 20:21                             ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: Luck, Tony @ 2021-11-09 19:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Zhaolong Zhang, x86, linux-edac, linux-kernel, Paul E . McKenney

>> Just a note that skipping the mce_panic() here isn't going to help
>> much. With some CPUs stuck not responding to #MC the system is going
>> to lock up or crash for other timeouts in the next few seconds.
>
> Yeh, I spent a couple of minutes today staring at this ->tolerant
> thing and wondering why we need it at all. I wouldn't mind ripping it
> altogether unless you're using it for testing or so.

I think it might have been useful before recoverable machine checks. But
now it just seems to cause confusion. I do not ever use it. I would not be
sad to see it go.

-Tony


* Re: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus
  2021-11-09 19:50                           ` Luck, Tony
@ 2021-11-09 20:21                             ` Borislav Petkov
  2021-11-09 20:44                               ` Luck, Tony
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2021-11-09 20:21 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Zhaolong Zhang, x86, linux-edac, linux-kernel, Paul E . McKenney

On Tue, Nov 09, 2021 at 07:50:57PM +0000, Luck, Tony wrote:
> I think it might have been useful before recoverable machine checks. But
> now it just seems to cause confusion. I do not ever use it. I would not be
> sad to see it go.

Yeah,

what do we do with the sysfs knob? It probably is an ABI:

/sys/devices/system/machinecheck/machinecheck1/tolerant
/sys/devices/system/machinecheck/machinecheck2/tolerant
...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* RE: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus
  2021-11-09 20:21                             ` Borislav Petkov
@ 2021-11-09 20:44                               ` Luck, Tony
  2021-11-09 21:30                                 ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: Luck, Tony @ 2021-11-09 20:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Zhaolong Zhang, x86, linux-edac, linux-kernel, Paul E . McKenney

> what do we do with the sysfs knob? It probably is an ABI:
>
> /sys/devices/system/machinecheck/machinecheck1/tolerant
> /sys/devices/system/machinecheck/machinecheck2/tolerant

$ git grep tolerant -- Documentation/ABI/
$

An undocumented ABI! Well, not documented with all the other sysfs bits.

It does appear in:
Documentation/x86/x86_64/machinecheck.rst

Of course, like a lot of documentation, it isn't accurate. It wasn't
updated to describe what happens with recoverable errors.
Final paragraph says:

        Note this only makes a difference if the CPU allows recovery
        from a machine check exception. Current x86 CPUs generally do not.

Recovery was first introduced in the Nehalem generation which ark.intel.com
says was launched in Q1'2010. So over a decade.

Choices:
1) Leave the file there, but remove the code that uses the value
2) Delete the file too

Option 1 doesn't break any scripts that look for the file, but may make
people shout louder when they find it no longer does anything.

Option 2 is the more honest approach.
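
For user-space tools that read the knob today, the practical difference between the two options is whether the open fails. A minimal sketch, assuming a hypothetical helper named read_tolerant() (not part of any existing tool), of how such a tool could tell "file removed" apart from other failures:

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper, for illustration only: read the integer stored in
 * the sysfs file at `path`.
 *
 * Returns 0 and stores the value in *out on success, -ENOENT when the
 * file does not exist (what option 2 would produce), or another negative
 * errno value if the file exists but cannot be read or parsed.
 */
int read_tolerant(const char *path, int *out)
{
	FILE *f = fopen(path, "r");
	int err = 0;

	if (!f)
		return -errno;	/* -ENOENT once the knob is deleted */

	if (fscanf(f, "%d", out) != 1)
		err = -EINVAL;	/* file present but not an integer */

	fclose(f);
	return err;
}
```

A caller would pass one of the paths quoted above, for example /sys/devices/system/machinecheck/machinecheck1/tolerant, and treat -ENOENT as "this kernel no longer has the knob" rather than as a hard error. Under option 1 the same call would keep succeeding even though the value no longer influences anything.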


-Tony


* Re: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus
  2021-11-09 20:44                               ` Luck, Tony
@ 2021-11-09 21:30                                 ` Borislav Petkov
  2021-12-20 20:43                                   ` [PATCH] x86/mce: Remove the tolerance level control Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2021-11-09 21:30 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Zhaolong Zhang, x86, linux-edac, linux-kernel, Paul E . McKenney

On Tue, Nov 09, 2021 at 08:44:41PM +0000, Luck, Tony wrote:
> > what do we do with the sysfs knob? It probably is an ABI:
> >
> > /sys/devices/system/machinecheck/machinecheck1/tolerant
> > /sys/devices/system/machinecheck/machinecheck2/tolerant
> 
> $ git grep tolerant -- Documentation/ABI/
> $
> 
> An undocumented ABI! Well, not documented with all the other sysfs bits.
> 
> It does appear in:
> Documentation/x86/x86_64/machinecheck.rst

Yeah, we have some spreading of documentation which is not necessarily
helpful.

> Of course, like a lot of documentation, it isn't accurate. It wasn't
> updated to describe what happens with recoverable errors. Final
> paragraph says:
>
>         Note this only makes a difference if the CPU allows recovery
>         from a machine check exception. Current x86 CPUs generally do
>         not.
>
> Recovery was first introduced in the Nehalem generation which
> ark.intel.com says was launched in Q1'2010. So over a decade.
>
> Choices:
> 1) Leave the file there, but remove the code that uses the value
> 2) Delete the file too
>
> Option 1 doesn't break any scripts that look for the file, but may
> make people shout louder when they find it no longer does anything.
>
> Option 2 is the more honest approach.

Ack, we can try 2 and see who cries.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* [PATCH] x86/mce: Remove the tolerance level control
  2021-11-09 21:30                                 ` Borislav Petkov
@ 2021-12-20 20:43                                   ` Borislav Petkov
  0 siblings, 0 replies; 22+ messages in thread
From: Borislav Petkov @ 2021-12-20 20:43 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Zhaolong Zhang, x86, linux-edac, linux-kernel, Paul E . McKenney

I guess something like this:

---
From: Borislav Petkov <bp@suse.de>

The tolerance level control is pretty much unused and not really useful.
What is more, all relevant MCA hardware supports recoverable machine
checks, so there's no real need to tweak MCA tolerance levels in order
to *maybe* extend machine lifetime.

So rip it out.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 Documentation/ABI/removed/sysfs-mce       | 37 ++++++++++++++++
 Documentation/ABI/testing/sysfs-mce       | 32 --------------
 Documentation/vm/hwpoison.rst             |  2 -
 Documentation/x86/x86_64/boot-options.rst |  9 +---
 arch/x86/kernel/cpu/mce/core.c            | 53 +++++++++--------------
 arch/x86/kernel/cpu/mce/internal.h        |  3 +-
 arch/x86/kernel/cpu/mce/severity.c        | 21 ++++-----
 7 files changed, 68 insertions(+), 89 deletions(-)
 create mode 100644 Documentation/ABI/removed/sysfs-mce

diff --git a/Documentation/ABI/removed/sysfs-mce b/Documentation/ABI/removed/sysfs-mce
new file mode 100644
index 000000000000..ef5dd2a80918
--- /dev/null
+++ b/Documentation/ABI/removed/sysfs-mce
@@ -0,0 +1,37 @@
+What:		/sys/devices/system/machinecheck/machinecheckX/tolerant
+Contact:	Borislav Petkov <bp@suse.de>
+Date:		Dec, 2021
+Description:
+		Unused and obsolete after the advent of recoverable machine
+		checks (see last sentence below) and those are present since
+		2010 (Nehalem).
+
+		Original description:
+
+		The entries appear for each CPU, but they are truly shared
+		between all CPUs.
+
+		Tolerance level. When a machine check exception occurs for a
+		non corrected machine check the kernel can take different
+		actions.
+
+		Since machine check exceptions can happen any time it is
+		sometimes risky for the kernel to kill a process because it
+		defies normal kernel locking rules. The tolerance level
+		configures how hard the kernel tries to recover even at some
+		risk of	deadlock. Higher tolerant values trade potentially
+		better uptime with the risk of a crash or even corruption
+		(for tolerant >= 3).
+
+		==  ===========================================================
+		 0  always panic on uncorrected errors, log corrected errors
+		 1  panic or SIGBUS on uncorrected errors, log corrected errors
+		 2  SIGBUS or log uncorrected errors, log corrected errors
+		 3  never panic or SIGBUS, log all errors (for testing only)
+		==  ===========================================================
+
+		Default: 1
+
+		Note this only makes a difference if the CPU allows recovery
+		from a machine check exception. Current x86 CPUs generally
+		do not.
diff --git a/Documentation/ABI/testing/sysfs-mce b/Documentation/ABI/testing/sysfs-mce
index c8cd989034b4..83172f50e27c 100644
--- a/Documentation/ABI/testing/sysfs-mce
+++ b/Documentation/ABI/testing/sysfs-mce
@@ -53,38 +53,6 @@ Description:
 		(but some corrected errors might be still reported
 		in other ways)
 
-What:		/sys/devices/system/machinecheck/machinecheckX/tolerant
-Contact:	Andi Kleen <ak@linux.intel.com>
-Date:		Feb, 2007
-Description:
-		The entries appear for each CPU, but they are truly shared
-		between all CPUs.
-
-		Tolerance level. When a machine check exception occurs for a
-		non corrected machine check the kernel can take different
-		actions.
-
-		Since machine check exceptions can happen any time it is
-		sometimes risky for the kernel to kill a process because it
-		defies normal kernel locking rules. The tolerance level
-		configures how hard the kernel tries to recover even at some
-		risk of	deadlock. Higher tolerant values trade potentially
-		better uptime with the risk of a crash or even corruption
-		(for tolerant >= 3).
-
-		==  ===========================================================
-		 0  always panic on uncorrected errors, log corrected errors
-		 1  panic or SIGBUS on uncorrected errors, log corrected errors
-		 2  SIGBUS or log uncorrected errors, log corrected errors
-		 3  never panic or SIGBUS, log all errors (for testing only)
-		==  ===========================================================
-
-		Default: 1
-
-		Note this only makes a difference if the CPU allows recovery
-		from a machine check exception. Current x86 CPUs generally
-		do not.
-
 What:		/sys/devices/system/machinecheck/machinecheckX/trigger
 Contact:	Andi Kleen <ak@linux.intel.com>
 Date:		Feb, 2007
diff --git a/Documentation/vm/hwpoison.rst b/Documentation/vm/hwpoison.rst
index 89b5f7a52077..c742de1769d1 100644
--- a/Documentation/vm/hwpoison.rst
+++ b/Documentation/vm/hwpoison.rst
@@ -60,8 +60,6 @@ There are two (actually three) modes memory failure recovery can be in:
 
 vm.memory_failure_recovery sysctl set to zero:
 	All memory failures cause a panic. Do not attempt recovery.
-	(on x86 this can be also affected by the tolerant level of the
-	MCE subsystem)
 
 early kill
 	(can be controlled globally and per process)
diff --git a/Documentation/x86/x86_64/boot-options.rst b/Documentation/x86/x86_64/boot-options.rst
index ccb7e86bf8d9..07aa0007f346 100644
--- a/Documentation/x86/x86_64/boot-options.rst
+++ b/Documentation/x86/x86_64/boot-options.rst
@@ -47,14 +47,7 @@ Please see Documentation/x86/x86_64/machinecheck.rst for sysfs runtime tunables.
 		in a reboot. On Intel systems it is enabled by default.
    mce=nobootlog
 		Disable boot machine check logging.
-   mce=tolerancelevel[,monarchtimeout] (number,number)
-		tolerance levels:
-		0: always panic on uncorrected errors, log corrected errors
-		1: panic or SIGBUS on uncorrected errors, log corrected errors
-		2: SIGBUS or log uncorrected errors, log corrected errors
-		3: never panic or SIGBUS, log all errors (for testing only)
-		Default is 1
-		Can be also set using sysfs which is preferable.
+   mce=monarchtimeout (number)
 		monarchtimeout:
 		Sets the time in us to wait for other CPUs on machine checks. 0
 		to disable.
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 5818b837fd4d..8d30469ab38c 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -86,14 +86,6 @@ struct mce_vendor_flags mce_flags __read_mostly;
 
 struct mca_config mca_cfg __read_mostly = {
 	.bootlog  = -1,
-	/*
-	 * Tolerant levels:
-	 * 0: always panic on uncorrected errors, log corrected errors
-	 * 1: panic or SIGBUS on uncorrected errors, log corrected errors
-	 * 2: SIGBUS or log uncorrected errors (if possible), log corr. errors
-	 * 3: never panic or SIGBUS, log all errors (for testing only)
-	 */
-	.tolerant = 1,
 	.monarch_timeout = -1
 };
 
@@ -774,7 +766,7 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 			goto clear_it;
 
 		mce_read_aux(&m, i);
-		m.severity = mce_severity(&m, NULL, mca_cfg.tolerant, NULL, false);
+		m.severity = mce_severity(&m, NULL, NULL, false);
 		/*
 		 * Don't get the IP here because it's unlikely to
 		 * have anything to do with the actual error location.
@@ -854,7 +846,7 @@ static int mce_no_way_out(struct mce *m, char **msg, unsigned long *validp,
 			quirk_sandybridge_ifu(i, m, regs);
 
 		m->bank = i;
-		if (mce_severity(m, regs, mca_cfg.tolerant, &tmp, true) >= MCE_PANIC_SEVERITY) {
+		if (mce_severity(m, regs, &tmp, true) >= MCE_PANIC_SEVERITY) {
 			mce_read_aux(m, i);
 			*msg = tmp;
 			return 1;
@@ -902,12 +894,11 @@ static noinstr int mce_timed_out(u64 *t, const char *msg)
 	if (!mca_cfg.monarch_timeout)
 		goto out;
 	if ((s64)*t < SPINUNIT) {
-		if (mca_cfg.tolerant <= 1) {
-			if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
-				pr_emerg("CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
-					 cpumask_pr_args(&mce_missing_cpus));
-			mce_panic(msg, NULL, NULL);
-		}
+		if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
+			pr_emerg("CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
+				 cpumask_pr_args(&mce_missing_cpus));
+		mce_panic(msg, NULL, NULL);
+
 		ret = 1;
 		goto out;
 	}
@@ -971,9 +962,9 @@ static void mce_reign(void)
 	 * This dumps all the mces in the log buffer and stops the
 	 * other CPUs.
 	 */
-	if (m && global_worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
+	if (m && global_worst >= MCE_PANIC_SEVERITY) {
 		/* call mce_severity() to get "msg" for panic */
-		mce_severity(m, NULL, mca_cfg.tolerant, &msg, true);
+		mce_severity(m, NULL, &msg, true);
 		mce_panic("Fatal machine check", m, msg);
 	}
 
@@ -987,7 +978,7 @@ static void mce_reign(void)
 	 * No machine check event found. Must be some external
 	 * source or one CPU is hung. Panic.
 	 */
-	if (global_worst <= MCE_KEEP_SEVERITY && mca_cfg.tolerant < 3)
+	if (global_worst <= MCE_KEEP_SEVERITY)
 		mce_panic("Fatal machine check from unknown source", NULL, NULL);
 
 	/*
@@ -1234,7 +1225,7 @@ __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *final,
 		/* Set taint even when machine check was not enabled. */
 		taint++;
 
-		severity = mce_severity(m, regs, cfg->tolerant, NULL, true);
+		severity = mce_severity(m, regs, NULL, true);
 
 		/*
 		 * When machine check was for corrected/deferred handler don't
@@ -1392,7 +1383,6 @@ noinstr void do_machine_check(struct pt_regs *regs)
 	int worst = 0, order, no_way_out, kill_current_task, lmce, taint = 0;
 	DECLARE_BITMAP(valid_banks, MAX_NR_BANKS) = { 0 };
 	DECLARE_BITMAP(toclear, MAX_NR_BANKS) = { 0 };
-	struct mca_config *cfg = &mca_cfg;
 	struct mce m, *final;
 	char *msg = NULL;
 
@@ -1411,7 +1401,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 
 	/*
 	 * If no_way_out gets set, there is no safe way to recover from this
-	 * MCE.  If mca_cfg.tolerant is cranked up, we'll try anyway.
+	 * MCE.
 	 */
 	no_way_out = 0;
 
@@ -1445,7 +1435,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 	 * severity is MCE_AR_SEVERITY we have other options.
 	 */
 	if (!(m.mcgstatus & MCG_STATUS_RIPV))
-		kill_current_task = (cfg->tolerant == 3) ? 0 : 1;
+		kill_current_task = 1;
 	/*
 	 * Check if this MCE is signaled to only this logical processor,
 	 * on Intel, Zhaoxin only.
@@ -1462,7 +1452,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 	 * to see it will clear it.
 	 */
 	if (lmce) {
-		if (no_way_out && cfg->tolerant < 3)
+		if (no_way_out)
 			mce_panic("Fatal local machine check", &m, msg);
 	} else {
 		order = mce_start(&no_way_out);
@@ -1482,7 +1472,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
 			if (!no_way_out)
 				no_way_out = worst >= MCE_PANIC_SEVERITY;
 
-			if (no_way_out && cfg->tolerant < 3)
+			if (no_way_out)
 				mce_panic("Fatal machine check on current CPU", &m, msg);
 		}
 	} else {
@@ -1494,8 +1484,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
 		 * fatal error. We call "mce_severity()" again to
 		 * make sure we have the right "msg".
 		 */
-		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
-			mce_severity(&m, regs, cfg->tolerant, &msg, true);
+		if (worst >= MCE_PANIC_SEVERITY) {
+			mce_severity(&m, regs, &msg, true);
 			mce_panic("Local fatal machine check!", &m, msg);
 		}
 	}
@@ -2223,10 +2213,9 @@ static int __init mcheck_enable(char *str)
 		cfg->bios_cmci_threshold = 1;
 	else if (!strcmp(str, "recovery"))
 		cfg->recovery = 1;
-	else if (isdigit(str[0])) {
-		if (get_option(&str, &cfg->tolerant) == 2)
-			get_option(&str, &(cfg->monarch_timeout));
-	} else {
+	else if (isdigit(str[0]))
+		get_option(&str, &(cfg->monarch_timeout));
+	else {
 		pr_info("mce argument %s ignored. Please use /sys\n", str);
 		return 0;
 	}
@@ -2476,7 +2465,6 @@ static ssize_t store_int_with_restart(struct device *s,
 	return ret;
 }
 
-static DEVICE_INT_ATTR(tolerant, 0644, mca_cfg.tolerant);
 static DEVICE_INT_ATTR(monarch_timeout, 0644, mca_cfg.monarch_timeout);
 static DEVICE_BOOL_ATTR(dont_log_ce, 0644, mca_cfg.dont_log_ce);
 static DEVICE_BOOL_ATTR(print_all, 0644, mca_cfg.print_all);
@@ -2497,7 +2485,6 @@ static struct dev_ext_attribute dev_attr_cmci_disabled = {
 };
 
 static struct device_attribute *mce_device_attrs[] = {
-	&dev_attr_tolerant.attr,
 	&dev_attr_check_interval.attr,
 #ifdef CONFIG_X86_MCELOG_LEGACY
 	&dev_attr_trigger,
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 52c633950b38..831d2e2c6c3b 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -35,7 +35,7 @@ int mce_gen_pool_add(struct mce *mce);
 int mce_gen_pool_init(void);
 struct llist_node *mce_gen_pool_prepare_records(void);
 
-int mce_severity(struct mce *a, struct pt_regs *regs, int tolerant, char **msg, bool is_excp);
+int mce_severity(struct mce *a, struct pt_regs *regs, char **msg, bool is_excp);
 struct dentry *mce_get_debugfs_dir(void);
 
 extern mce_banks_t mce_banks_ce_disabled;
@@ -127,7 +127,6 @@ struct mca_config {
 	bool ignore_ce;
 	bool print_all;
 
-	int tolerant;
 	int monarch_timeout;
 	int panic_timeout;
 	u32 rip_msr;
diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index 7aa2bda93cbb..b9f29d0434db 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -330,8 +330,7 @@ static int mce_severity_amd_smca(struct mce *m, enum context err_ctx)
  * See AMD Error Scope Hierarchy table in a newer BKDG. For example
  * 49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features"
  */
-static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, int tolerant,
-				    char **msg, bool is_excp)
+static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
 {
 	enum context ctx = error_context(m, regs);
 
@@ -383,8 +382,7 @@ static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, int tol
 	return MCE_KEEP_SEVERITY;
 }
 
-static noinstr int mce_severity_intel(struct mce *m, struct pt_regs *regs,
-				      int tolerant, char **msg, bool is_excp)
+static noinstr int mce_severity_intel(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
 {
 	enum exception excp = (is_excp ? EXCP_CONTEXT : NO_EXCP);
 	enum context ctx = error_context(m, regs);
@@ -412,22 +410,21 @@ static noinstr int mce_severity_intel(struct mce *m, struct pt_regs *regs,
 		if (msg)
 			*msg = s->msg;
 		s->covered = 1;
-		if (s->sev >= MCE_UC_SEVERITY && ctx == IN_KERNEL) {
-			if (tolerant < 1)
-				return MCE_PANIC_SEVERITY;
-		}
+
+		if (s->sev >= MCE_UC_SEVERITY && ctx == IN_KERNEL)
+			return MCE_PANIC_SEVERITY;
+
 		return s->sev;
 	}
 }
 
-int noinstr mce_severity(struct mce *m, struct pt_regs *regs, int tolerant, char **msg,
-			 bool is_excp)
+int noinstr mce_severity(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
 {
 	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
 	    boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
-		return mce_severity_amd(m, regs, tolerant, msg, is_excp);
+		return mce_severity_amd(m, regs, msg, is_excp);
 	else
-		return mce_severity_intel(m, regs, tolerant, msg, is_excp);
+		return mce_severity_intel(m, regs, msg, is_excp);
 }
 
 #ifdef CONFIG_DEBUG_FS
-- 
2.29.2

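The severity.c hunk above is the one place where the result depends on more than "tolerant < 3". A small user-space model of that single check, as a sketch only: the enum values and the names grade_old() and grade_new() are made up for illustration and do not match the kernel's definitions.

```c
#include <stdbool.h>

/*
 * Illustrative severity levels. The kernel's enum has more entries and
 * different numbering; only the ordering KEEP < UC < PANIC matters here.
 */
enum sev { SEV_KEEP, SEV_UC, SEV_PANIC };

/*
 * Model of the check before the patch: an uncorrected error hit in
 * kernel context is escalated to panic only when tolerant < 1.
 */
int grade_old(enum sev table_sev, bool in_kernel, int tolerant)
{
	if (table_sev >= SEV_UC && in_kernel && tolerant < 1)
		return SEV_PANIC;
	return table_sev;
}

/*
 * Model of the check after the patch: the same case is always
 * escalated, since there is no tolerant value left to consult.
 */
int grade_new(enum sev table_sev, bool in_kernel)
{
	if (table_sev >= SEV_UC && in_kernel)
		return SEV_PANIC;
	return table_sev;
}
```

Read this way, grade_new() matches grade_old() with tolerant == 0, not with the old default of 1. The other hunks (tolerant <= 1 in mce_timed_out(), tolerant < 3 elsewhere) keep the behaviour the old default produced, so this check looks like the one spot worth a second look during review.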

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Thread overview: 22 messages
2021-11-04  7:44 [PATCH] x86/mce: correct cpu_missing reporting in mce_timed_out Zhaolong Zhang
2021-11-04  9:13 ` Borislav Petkov
2021-11-04 15:47   ` Luck, Tony
2021-11-04 18:02     ` Borislav Petkov
2021-11-05  2:19       ` Zhaolong Zhang
2021-11-08  8:28         ` [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus Zhaolong Zhang
2021-11-08  9:31           ` Borislav Petkov
2021-11-08 10:13             ` Zhaolong Zhang
2021-11-08 10:31               ` Borislav Petkov
2021-11-08 12:47                 ` Zhaolong Zhang
2021-11-09  8:31                   ` Zhaolong Zhang
2021-11-09  8:35                     ` [PATCH] x86/mce: Get rid of cpu_missing Zhaolong Zhang
2021-11-09  9:15                       ` Borislav Petkov
2021-11-09 14:19                         ` Zhaolong Zhang
2021-11-09  9:07                     ` [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus Borislav Petkov
2021-11-09 16:06                       ` Luck, Tony
2021-11-09 19:48                         ` Borislav Petkov
2021-11-09 19:50                           ` Luck, Tony
2021-11-09 20:21                             ` Borislav Petkov
2021-11-09 20:44                               ` Luck, Tony
2021-11-09 21:30                                 ` Borislav Petkov
2021-12-20 20:43                                   ` [PATCH] x86/mce: Remove the tolerance level control Borislav Petkov
