* [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts
@ 2015-09-24  5:48 Ashok Raj
  2015-09-24  5:48 ` [Patch V1 2/3] x86, mce: Refactor parts of mce_log() to reuse when logging from offline CPUs Ashok Raj
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Ashok Raj @ 2015-09-24  5:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ashok Raj, Boris Petkov, linux-edac, Tony Luck

MCE_LOG_LEN appears to be too short for high core count parts, especially when
handling fatal errors, where we don't clear MCE banks. Socket-level MC banks
are visible to all CPUs that share them.

Assuming an 18-core part with 2 threads per core, 2 banks per thread, and a
couple of uncore MSRs, round up to 128 with some fudge to grow in the future.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Suggested-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/mce.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 2dbc0bf..4293ae7 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -88,7 +88,7 @@
 #define MCE_EXTENDED_BANK	128
 #define MCE_THERMAL_BANK	(MCE_EXTENDED_BANK + 0)
 
-#define MCE_LOG_LEN 32
+#define MCE_LOG_LEN 		128
 #define MCE_LOG_SIGNATURE	"MACHINECHECK"
 
 /*
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [Patch V1 2/3] x86, mce: Refactor parts of mce_log() to reuse when logging from offline CPUs
  2015-09-24  5:48 [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts Ashok Raj
@ 2015-09-24  5:48 ` Ashok Raj
  2015-09-24  5:48 ` [Patch V1 3/3] x86, mce: Account for offline CPUs during MCE rendezvous Ashok Raj
  2015-09-24 15:47 ` [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts Borislav Petkov
  2 siblings, 0 replies; 13+ messages in thread
From: Ashok Raj @ 2015-09-24  5:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ashok Raj, Boris Petkov, linux-edac, Tony Luck

Refactor part of mce_log() into a new helper, mce_log_add(), so it can be
reused when logging from offline CPUs.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 48bd244..2df073d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -152,16 +152,10 @@ static struct mce_log mcelog = {
 	.recordlen	= sizeof(struct mce),
 };
 
-void mce_log(struct mce *mce)
+static void mce_log_add(struct mce *mce)
 {
 	unsigned next, entry;
 
-	/* Emit the trace record: */
-	trace_mce_record(mce);
-
-	if (!mce_gen_pool_add(mce))
-		irq_work_queue(&mce_irq_work);
-
 	mce->finished = 0;
 	wmb();
 	for (;;) {
@@ -199,6 +193,19 @@ void mce_log(struct mce *mce)
 	set_bit(0, &mce_need_notify);
 }
 
+void mce_log(struct mce *mce)
+{
+	unsigned next, entry;
+
+	/* Emit the trace record: */
+	trace_mce_record(mce);
+
+	if (!mce_gen_pool_add(mce))
+		irq_work_queue(&mce_irq_work);
+
+	mce_log_add(mce);
+}
+
 void mce_inject_log(struct mce *m)
 {
 	mutex_lock(&mce_chrdev_read_mutex);
-- 
2.4.3



* [Patch V1 3/3] x86, mce: Account for offline CPUs during MCE rendezvous.
  2015-09-24  5:48 [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts Ashok Raj
  2015-09-24  5:48 ` [Patch V1 2/3] x86, mce: Refactor parts of mce_log() to reuse when logging from offline CPUs Ashok Raj
@ 2015-09-24  5:48 ` Ashok Raj
  2015-09-24 15:47 ` [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts Borislav Petkov
  2 siblings, 0 replies; 13+ messages in thread
From: Ashok Raj @ 2015-09-24  5:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ashok Raj, Boris Petkov, linux-edac, Tony Luck

Linux supports logical CPU offlining, as shown below.

# echo 0 > /sys/devices/system/cpu/cpuX/online

Hardware doesn't know about CPUs offlined by the OS; it will continue to
broadcast any MCE to all CPUs in the system, including offlined CPUs.
Hence mce_start() and mce_end() should use cpu_present_map to count CPUs
in the rendezvous. CPUs offlined by the OS are still in the MCE broadcast
domain, so they also have to process the int18 handler. Since the current
code only accounts for online CPUs, mce_callin will end up higher than
expected by the number of offlined CPUs.

The main benefit is that in the odd case where an offline CPU is the
source of the MCE, the kernel will be able to capture logs properly even
for offline CPUs.

This patch does the following:

- Allow offline CPUs to participate in the MCE rendezvous process.
- Ensure an offline CPU will not be chosen as the rendezvous master CPU.
- Collect logs from offline CPUs and report them via the rendezvous master.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 2df073d..080eefe 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -195,8 +195,6 @@ static void mce_log_add(struct mce *mce)
 
 void mce_log(struct mce *mce)
 {
-	unsigned next, entry;
-
 	/* Emit the trace record: */
 	trace_mce_record(mce);
 
@@ -756,8 +754,11 @@ static void mce_reign(void)
 	 * This CPU is the Monarch and the other CPUs have run
 	 * through their handlers.
 	 * Grade the severity of the errors of all the CPUs.
+	 * Intel CPUs broadcast MCEs to all CPUs booted, even if they are
+	 * parked in idle due to logical CPU offline. Hence we should count
+	 * all CPUs to process MCEs.
 	 */
-	for_each_possible_cpu(cpu) {
+	for_each_present_cpu(cpu) {
 		int severity = mce_severity(&per_cpu(mces_seen, cpu),
 					    mca_cfg.tolerant,
 					    &nmsg, true);
@@ -809,8 +813,9 @@ static atomic_t global_nwo;
 static int mce_start(int *no_way_out)
 {
 	int order;
-	int cpus = num_online_cpus();
+	int cpus = num_present_cpus();
 	u64 timeout = (u64)mca_cfg.monarch_timeout * NSEC_PER_USEC;
+	unsigned int this_cpu = smp_processor_id();
 
 	if (!timeout)
 		return -1;
@@ -820,6 +825,16 @@ static int mce_start(int *no_way_out)
 	 * global_nwo should be updated before mce_callin
 	 */
 	smp_wmb();
+
+	/*
+	 * If this cpu is offline, make sure it won't be elected as
+	 * the rendezvous master
+	 */
+	if (cpu_is_offline(this_cpu)) {
+		while (!atomic_read(&mce_callin))
+			ndelay(SPINUNIT);
+	}
+
 	order = atomic_inc_return(&mce_callin);
 
 	/*
@@ -890,7 +905,7 @@ static int mce_end(int order)
 
 	if (order == 1) {
 		/* CHECKME: Can this race with a parallel hotplug? */
-		int cpus = num_online_cpus();
+		int cpus = num_present_cpus();
 
 		/*
 		 * Monarch: Wait for everyone to go through their scanning
@@ -984,6 +999,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	int i;
 	int worst = 0;
 	int severity;
+	unsigned int cpu = smp_processor_id();
+
 	/*
 	 * Establish sequential order between the CPUs entering the machine
 	 * check handler.
@@ -1098,7 +1115,10 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		m.severity = severity;
 		m.usable_addr = mce_usable_address(&m);
 
-		mce_log(&m);
+		if (cpu_is_offline(cpu))
+			mce_log_add(&m);
+		else
+			mce_log(&m);
 
 		if (severity > worst) {
 			*final = m;
-- 
2.4.3



* Re: [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts
  2015-09-24  5:48 [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts Ashok Raj
  2015-09-24  5:48 ` [Patch V1 2/3] x86, mce: Refactor parts of mce_log() to reuse when logging from offline CPUs Ashok Raj
  2015-09-24  5:48 ` [Patch V1 3/3] x86, mce: Account for offline CPUs during MCE rendezvous Ashok Raj
@ 2015-09-24 15:47 ` Borislav Petkov
  2015-09-24 18:44   ` Luck, Tony
  2 siblings, 1 reply; 13+ messages in thread
From: Borislav Petkov @ 2015-09-24 15:47 UTC (permalink / raw)
  To: Ashok Raj; +Cc: linux-kernel, linux-edac, Tony Luck

On Thu, Sep 24, 2015 at 01:48:38AM -0400, Ashok Raj wrote:
> MCE_LOG_LEN appears to be too short for high core count parts, especially when
> handling fatal errors, where we don't clear MCE banks. Socket-level MC banks
> are visible to all CPUs that share them.
> 
> Assuming an 18-core part with 2 threads per core, 2 banks per thread, and a
> couple of uncore MSRs, round up to 128 with some fudge to grow in the future.
> 
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Suggested-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/include/asm/mce.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index 2dbc0bf..4293ae7 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -88,7 +88,7 @@
>  #define MCE_EXTENDED_BANK	128
>  #define MCE_THERMAL_BANK	(MCE_EXTENDED_BANK + 0)
>  
> -#define MCE_LOG_LEN 32
> +#define MCE_LOG_LEN 		128
>  #define MCE_LOG_SIGNATURE	"MACHINECHECK"

Hmm, I don't think this is what I meant when we talked about it
previously. So let me try again:

Now that we have this shiny 2-pages sized lockless gen_pool, why are we
still dealing with struct mce_log mcelog? Why can't we rip it out and
kill it finally? And switch to the gen_pool?

All code that reads from mcelog - /dev/mcelog chrdev - should switch to
the lockless buffer and will iterate through the logged MCEs there.

I think this way we're much better prepared for future machine sizes.
We can even use memblock to allocate appropriate memory at boot for the
gen_pool if the 2 pages are not enough.

Hmmm?

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* RE: [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts
  2015-09-24 15:47 ` [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts Borislav Petkov
@ 2015-09-24 18:44   ` Luck, Tony
  2015-09-24 18:52     ` Borislav Petkov
  0 siblings, 1 reply; 13+ messages in thread
From: Luck, Tony @ 2015-09-24 18:44 UTC (permalink / raw)
  To: Borislav Petkov, Raj, Ashok; +Cc: linux-kernel, linux-edac


> Now that we have this shiny 2-pages sized lockless gen_pool, why are we
> still dealing with struct mce_log mcelog? Why can't we rip it out and
> kill it finally? And switch to the gen_pool?
>
> All code that reads from mcelog - /dev/mcelog chrdev - should switch to
> the lockless buffer and will iterate through the logged MCEs there.

I think we have a problem of when to delete entries ... we can only do that
when all the interested consumers of logs have seen an entry. But we have
no control in the kernel on consumption from /dev/mcelog.

Historic semantics was that the first MCE_LOG_LEN errors would sit
in the buffer waiting for userspace to begin running a daemon to read
them.

-Tony



* Re: [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts
  2015-09-24 18:44   ` Luck, Tony
@ 2015-09-24 18:52     ` Borislav Petkov
  2015-09-24 19:00       ` Luck, Tony
  0 siblings, 1 reply; 13+ messages in thread
From: Borislav Petkov @ 2015-09-24 18:52 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Raj, Ashok, linux-kernel, linux-edac

On Thu, Sep 24, 2015 at 06:44:25PM +0000, Luck, Tony wrote:
> > Now that we have this shiny 2-pages sized lockless gen_pool, why are we
> > still dealing with struct mce_log mcelog? Why can't we rip it out and
> > kill it finally? And switch to the gen_pool?
> >
> > All code that reads from mcelog - /dev/mcelog chrdev - should switch to
> > the lockless buffer and will iterate through the logged MCEs there.
> 
> I think we have a problem of when to delete entries ... we can only do that
> when all the interested consumers of logs have seen an entry. But we have
> no control in the kernel on consumption from /dev/mcelog.
> 
> Historic semantics was that the first MCE_LOG_LEN errors would sit
> in the buffer waiting for userspace to begin running a daemon to read
> them.

Right, we can tag them with various flags when iterating over them in
the gen_pool. The in-kernel consumers can look at them, modify, update
the information, etc.

Userspace can then consume them and delete them.

If we get new ones logged in the meantime and userspace hasn't managed
to consume and delete the present ones yet, we overwrite the oldest ones
and set MCE_OVERFLOW like mce_log does now for mcelog. And that's no
difference in functionality than what we have now.

The advantage is that we get in-kernel consumers to look at them first
and we keep all MCE records concentrated in one place.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* RE: [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts
  2015-09-24 18:52     ` Borislav Petkov
@ 2015-09-24 19:00       ` Luck, Tony
  2015-09-24 19:22         ` Borislav Petkov
  0 siblings, 1 reply; 13+ messages in thread
From: Luck, Tony @ 2015-09-24 19:00 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Raj, Ashok, linux-kernel, linux-edac


> If we get new ones logged in the meantime and userspace hasn't managed
> to consume and delete the present ones yet, we overwrite the oldest ones
> and set MCE_OVERFLOW like mce_log does now for mcelog. And that's no
> difference in functionality than what we have now.

Ummmm. No.

                for (;;) {

                        /*
                         * When the buffer fills up discard new entries.
                         * Assume that the earlier errors are the more
                         * interesting ones:
                         */
                        if (entry >= MCE_LOG_LEN) {
                                set_bit(MCE_OVERFLOW,
                                        (unsigned long *)&mcelog.flags);
                                return;
                        }

-Tony


* Re: [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts
  2015-09-24 19:00       ` Luck, Tony
@ 2015-09-24 19:22         ` Borislav Petkov
  2015-09-24 20:22           ` Raj, Ashok
  0 siblings, 1 reply; 13+ messages in thread
From: Borislav Petkov @ 2015-09-24 19:22 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Raj, Ashok, linux-kernel, linux-edac

On Thu, Sep 24, 2015 at 07:00:46PM +0000, Luck, Tony wrote:
> > If we get new ones logged in the meantime and userspace hasn't managed
> > to consume and delete the present ones yet, we overwrite the oldest ones
> > and set MCE_OVERFLOW like mce_log does now for mcelog. And that's no
> > difference in functionality than what we have now.
> 
> Ummmm. No.
> 
>                 for (;;) {
> 
>                         /*
>                          * When the buffer fills up discard new entries.
>                          * Assume that the earlier errors are the more
>                          * interesting ones:
>                          */
>                         if (entry >= MCE_LOG_LEN) {
>                                 set_bit(MCE_OVERFLOW,
>                                         (unsigned long *)&mcelog.flags);
>                                 return;
>                         }

Ah, we return. But we shouldn't return - we should overwrite. I believe
we've talked about the policy of overwriting old errors with new ones.

TBH, I don't think there's a 100%-correct policy to act according to
when our error logging buffers are full:

- we can overwrite old errors with new but then this way we might lose
the one important error record with which it all started.

- if we don't overwrite, we might fill up with "unimportant" correctable
error records and miss other, more important ones which happen now

- ...

We could try to implement some cheap heuristics which decide what and
when to overwrite but I'm sceptical it'll be always correct...

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* Re: [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts
  2015-09-24 19:22         ` Borislav Petkov
@ 2015-09-24 20:22           ` Raj, Ashok
  2015-09-24 21:07             ` Borislav Petkov
  0 siblings, 1 reply; 13+ messages in thread
From: Raj, Ashok @ 2015-09-24 20:22 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, linux-kernel, linux-edac, Ashok Raj

Hi Boris

On Thu, Sep 24, 2015 at 09:22:24PM +0200, Borislav Petkov wrote:
> 
> Ah, we return. But we shouldn't return - we should overwrite. I believe
> we've talked about the policy of overwriting old errors with new ones.
> 

Another reason I had a separate buffer in my earlier patch was to avoid
calling RCU functions from the offline CPU. I had an offline discussion
with Paul McKenney; he said don't do that...

mce_gen_pool_add()->gen_pool_alloc() calls rcu_read_lock() and such,
so it didn't seem appropriate.

Also, the function doesn't seem safe to be called in NMI context. Although
MCE is different, for all intents and purposes we should treat both with the same
priority. The old-style log is simple and tested in those cases.

I like everything you say below... something we could do as our next phase
of improving logging, though it might need more careful work to build it right.

Just like how MC banks have overwrite rules, we can possibly do something
like that if the buffer fills up.

> TBH, I don't think there's a 100%-correct policy to act according to
> when our error logging buffers are full:
> 
> - we can overwrite old errors with new but then this way we might lose
> the one important error record with which it all started.
> 
> - if we don't overwrite, we might fill up with "unimportant" correctable
> error records and miss other, more important ones which happen now
> 
> - ...
> 
> We could try to implement some cheap heuristics which decide what and
> when to overwrite but I'm sceptical it'll be always correct...
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> ECO tip #101: Trim your mails when you reply.


* Re: [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts
  2015-09-24 20:22           ` Raj, Ashok
@ 2015-09-24 21:07             ` Borislav Petkov
  2015-09-24 21:25               ` Raj, Ashok
  0 siblings, 1 reply; 13+ messages in thread
From: Borislav Petkov @ 2015-09-24 21:07 UTC (permalink / raw)
  To: Raj, Ashok; +Cc: Luck, Tony, linux-kernel, linux-edac

On Thu, Sep 24, 2015 at 01:22:12PM -0700, Raj, Ashok wrote:
> Another reason I had a separate buffer in my earlier patch was to avoid
> calling RCU functions from the offline CPU. I had an offline discussion
> with Paul McKenney; he said don't do that...
> 
> mce_gen_pool_add()->gen_pool_alloc() calls rcu_read_lock() and such,
> so it didn't seem appropriate.

How are you ever going to call into those from an offlined CPU?!

And that's easy:

	if (!cpu_online(cpu))
		return;

> Also, the function doesn't seem safe to be called in NMI context. Although

That's why it is a lockless buffer - we added it *exactly* because we didn't
want to call printk in an NMI context. So please expand...

> MCE is different, for all intents and purposes we should treat both with the same
> priority. The old-style log is simple and tested in those cases.
> 
> I like everything you say below... something we could do as our next phase
> of improving logging, though it might need more careful work to build it right.
> 
> Just like how MC banks have overwrite rules, we can possibly do something
> like that if the buffer fills up.

Right.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* Re: [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts
  2015-09-24 21:07             ` Borislav Petkov
@ 2015-09-24 21:25               ` Raj, Ashok
  2015-09-25  8:29                 ` Borislav Petkov
  0 siblings, 1 reply; 13+ messages in thread
From: Raj, Ashok @ 2015-09-24 21:25 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, linux-kernel, linux-edac, Ashok Raj

Hi Boris

I should have expanded on it earlier.

On Thu, Sep 24, 2015 at 11:07:33PM +0200, Borislav Petkov wrote:
> 
> How are you ever going to call into those from an offlined CPU?!
> 
> And that's easy:
> 
> 	if (!cpu_online(cpu))
> 		return;
> 

The last patch of that series had 2 changes.

1. Allow offline CPUs to participate in the rendezvous, since in the odd
chance the offline CPUs have collected any errors we can still report them.
(We changed mce_start()/mce_end() to use cpu_present_mask instead of just
the online map.)

Without this change today, if I were to inject a broadcast MCE,
it ends up hanging, since the offline CPUs also increment mce_callin.
It will always exceed num_online_cpus() by the number of CPUs
logically offlined.

Consider, for example, if 2 threads of the core are offline and the MLC picks up
an error. Other CPUs in the socket can't access them. The only way is to let those
CPUs read and report their own banks, as they are core-scoped. In upcoming CPUs
we have some banks that can be thread-scoped as well.

It's understood the OS doesn't execute any code on those CPUs. But SMI can still
run on them, and could collect errors that can be logged.

2. If the CPU is offline, we copy the records to the mce_log buffer, and then copy
those out from the rendezvous master during mce_reign().

If we were to replace this mce_log_add() with gen_pool_add(), then I would
have to call mce_gen_pool_add() from the offline CPU. This will end up calling
RCU functions.

We don't want to lose any errors reported by the offline CPU for purposes
of logging. It is rare, but we are still interested in capturing those errors if they
were to happen.

Does this help?

Cheers,
Ashok


> > Also, the function doesn't seem safe to be called in NMI context. Although
> 
> That's why it is a lockless buffer - we added it *exactly* because we didn't
> want to call printk in an NMI context. So please expand...
> 



* Re: [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts
  2015-09-24 21:25               ` Raj, Ashok
@ 2015-09-25  8:29                 ` Borislav Petkov
  2015-09-25 16:29                   ` Raj, Ashok
  0 siblings, 1 reply; 13+ messages in thread
From: Borislav Petkov @ 2015-09-25  8:29 UTC (permalink / raw)
  To: Raj, Ashok, x86-ml; +Cc: Luck, Tony, linux-kernel, linux-edac

+ x86@kernel.org

On Thu, Sep 24, 2015 at 02:25:41PM -0700, Raj, Ashok wrote:
> Hi Boris
> 
> I should have expanded on it earlier.
> 
> On Thu, Sep 24, 2015 at 11:07:33PM +0200, Borislav Petkov wrote:
> > 
> > How are you ever going to call into those from an offlined CPU?!
> > 
> > And that's easy:
> > 
> > 	if (!cpu_online(cpu))
> > 		return;
> > 
> 
> The last patch of that series had 2 changes.
> 
> 1. Allow offline CPUs to participate in the rendezvous, since in the odd
> chance the offline CPUs have collected any errors we can still report them.
> (We changed mce_start()/mce_end() to use cpu_present_mask instead of just
> the online map.)

This is not necessarily wrong - it is just unusual.

> Without this change today, if I were to inject a broadcast MCE,
> it ends up hanging, since the offline CPUs also increment mce_callin.
> It will always exceed num_online_cpus() by the number of CPUs
> logically offlined.

Yeah, I'd like to have a bit somewhere which says "don't report MCEs on this
core." But we talked about this already.

> Consider, for example, if 2 threads of the core are offline and the MLC picks up

What is MLC?

> an error. Other CPUs in the socket can't access them. The only way is to let those
> CPUs read and report their own banks, as they are core-scoped. In upcoming CPUs
> we have some banks that can be thread-scoped as well.
> 
> It's understood the OS doesn't execute any code on those CPUs. But SMI can still
> run on them, and could collect errors that can be logged.

Well, that is not our problem, is it?

I mean, SMM wants to stay undetected. When all of a sudden offlined
cores start reporting MCEs, that's going to raise some brows.

Regardless, there are other reasons why offlined cores might report MCEs
- the fact that logical cores share functional units and data flow goes
through them might trip the reporting on those cores. Yadda yadda...

> 2. If the CPU is offline, we copy the records to the mce_log buffer, and then copy
> those out from the rendezvous master during mce_reign().
> 
> If we were to replace this mce_log_add() with gen_pool_add(), then I would
> have to call mce_gen_pool_add() from the offline CPU. This will end up calling
> RCU functions.
> 
> We don't want to lose any errors reported by the offline CPU for purposes
> of logging. It is rare, but we are still interested in capturing those errors if they
> were to happen.
> 
> Does this help?

So first of all, we need to hold this down somewhere, maybe in
Documentation/ to explain why we're running on offlined cores. This is
certainly unusual code and people will ask WTF is going on there.

Then, I really really don't like a static buffer which we will have
to increase with each new bigger machine configuration. This is just
clumsy.

It'd be probably much better to make that MCE buffer per CPU. We can
say, we're allowed to log 2-3, hell, 5 errors in it and when we're done
with the rendezvous, an online core goes and flushes out the error
records to gen_pool.

This scales much better than any artificial MCE_LOG_LEN size.

Oh, and we either overwrite old errors when we fill up the percpu buffer
or we return. that's something we can discuss later. Or we come up with
a bit smarter strategy of selecting which ones to overwrite.

Just artificially increasing a static buffer is not good design IMO.

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* Re: [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts
  2015-09-25  8:29                 ` Borislav Petkov
@ 2015-09-25 16:29                   ` Raj, Ashok
  0 siblings, 0 replies; 13+ messages in thread
From: Raj, Ashok @ 2015-09-25 16:29 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: x86-ml, Luck, Tony, linux-kernel, linux-edac, ashok.raj

On Fri, Sep 25, 2015 at 10:29:01AM +0200, Borislav Petkov wrote:
> > > 
> > 
> > The last patch of that series had 2 changes.
> > 
> > 1. Allow offline CPUs to participate in the rendezvous, since in the odd
> > chance the offline CPUs have collected any errors we can still report them.
> > (We changed mce_start()/mce_end() to use cpu_present_mask instead of just
> > the online map.)
> 
> This is not necessarily wrong - it is just unusual.

Correct!
> 
> > Without this change today, if I were to inject a broadcast MCE,
> > it ends up hanging, since the offline CPUs also increment mce_callin.
> > It will always exceed num_online_cpus() by the number of CPUs
> > logically offlined.
> 
> Yeah, I'd like to have a bit somewhere which says "don't report MCEs on this
> core." But we talked about this already.
> 
> > Consider, for example, if 2 threads of the core are offline and the MLC picks up
> 
> What is MLC?

Mid Level Cache. This is shared between the 2 threads in that core. As opposed
to the Last Level Cache (LLC) which is shared between all the threads in the
socket.

> 
> > an error. Other CPUs in the socket can't access them. The only way is to let those
> > CPUs read and report their own banks, as they are core-scoped. In upcoming CPUs
> > we have some banks that can be thread-scoped as well.
> > 
> > It's understood the OS doesn't execute any code on those CPUs. But SMI can still
> > run on them, and could collect errors that can be logged.
> 
> Well, that is not our problem, is it?
> 
> I mean, SMM wants to stay undetected. When all of a sudden offlined
> cores start reporting MCEs, that's going to raise some brows.

You are right. I was simply trying to state how a CPU that is offline from the
OS perspective could still be collecting errors, only trying to highlight
what happens at the platform level.

> 
> Regardless, there are other reasons why offlined cores might report MCEs
> - the fact that logical cores share functional units and data flow goes
> through them might trip the reporting on those cores. Yadda yadda...

Yep!
> 
> > 2. If the CPU is offline, we copy the records to the mce_log buffer, and then copy
> > those out from the rendezvous master during mce_reign().
> > 
> > If we were to replace this mce_log_add() with gen_pool_add(), then I would
> > have to call mce_gen_pool_add() from the offline CPU. This will end up calling
> > RCU functions.
> > 
> > We don't want to lose any errors reported by the offline CPU for purposes
> > of logging. It is rare, but we are still interested in capturing those errors if they
> > were to happen.
> > 
> > Does this help?
> 
> So first of all, we need to hold this down somewhere, maybe in
> Documentation/ to explain why we're running on offlined cores. This is
> certainly unusual code and people will ask WTF is going on there.

Good idea to document these weird cases. I can do it in either
cpu-hotplug.txt or the x86/x86_64/machinecheck file, as appropriate.
Will add that in my next update.

> 
> Then, I really really don't like a static buffer which we will have
> to increase with each new bigger machine configuration. This is just
> clumsy.
> 
> It'd be probably much better to make that MCE buffer per CPU. We can
> say, we're allowed to log 2-3, hell, 5 errors in it and when we're done
> with the rendezvous, an online core goes and flushes out the error
> records to gen_pool.

That makes good sense. I will make these per-CPU and also look at
clearing them during mce_panic() to make sure we flush them in fatal cases.

Per-CPU certainly scales better than a static number.

> 
> This scales much better than any artificial MCE_LOG_LEN size.
> 
> Oh, and we either overwrite old errors when we fill up the percpu buffer
> or we return. that's something we can discuss later. Or we come up with
> a bit smarter strategy of selecting which ones to overwrite.
> 
> Just artificially increasing a static buffer is not good design IMO.

Agreed. Thanks, I will get another rev rolling.

Cheers,
Ashok



