linux-kernel.vger.kernel.org archive mirror
* [PATCH] x86: Prevent oops with >16 memory controllers
From: Daniel J Blueman @ 2015-02-14  3:18 UTC
  To: Doug Thompson, Borislav Petkov
  Cc: Daniel J Blueman, Mauro Carvalho Chehab, linux-edac,
	linux-kernel, Steffen Persvold

When ECC interrupts occur on memory controllers after EDAC_MAX_MCS (16), the
kernel fatally dereferences unallocated structures [1]; this occurs on at
least NumaConnect systems.

Minimally fix by checking if a memory controller info structure is allocated;
candidate for stable.

Signed-off-by: Daniel J Blueman <daniel@numascale.com>

-- [1]

BUG: unable to handle kernel NULL pointer dereference at 0000000000000320
IP: [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
PGD 2f8b5a3067 PUD 2f8b5a2067 PMD 0
Oops: 0000 [#2] SMP
Modules linked in:
CPU: 224 PID: 11930 Comm: stream_c.exe.gn Tainted: G   D    3.19.0 #1
Hardware name: Supermicro H8QGL/H8QGL, BIOS 3.5b    01/28/2015
task: ffff8807dbfb8c00 ti: ffff8807dd16c000 task.ti: ffff8807dd16c000
RIP: 0010:[<ffffffff819f714f>] [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
RSP: 0000:ffff8907dfc03c48 EFLAGS: 00010297
RAX: 0000000000000001 RBX: 9c67400010080a13 RCX: 0000000000001dc6
RDX: 000000001dc61dc6 RSI: ffff8907dfc03df0 RDI: 000000000000001c
RBP: ffff8907dfc03ce8 R08: 0000000000000000 R09: 0000000000000022
R10: ffff891fffa30380 R11: 00000000001cfc90 R12: 0000000000000008
R13: 0000000000000000 R14: 000000000000001c R15: 00009c6740001000
FS: 00007fa97ee18700(0000) GS:ffff8907dfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000320 CR3: 0000003f889b8000 CR4: 00000000000407e0
Stack:
 0000000000000000 ffff8907dfc03df0 0000000000000008 9c67400010080a13
 000000000000001c 00009c6740001000 ffff8907dfc03c88 ffffffff810e4f9a
 ffff8907dfc03ce8 ffffffff81b375b9 0000000000000000 0000000000000010
Call Trace:
 <IRQ>
 [<ffffffff810e4f9a>] ? vprintk_default+0x1a/0x20
 [<ffffffff81b375b9>] ? printk+0x41/0x43
 [<ffffffff819f65ff>] amd_decode_mce+0x58f/0x8e0
 [<ffffffff810c79ed>] notifier_call_chain+0x4d/0x80
 [<ffffffff810c7a45>] atomic_notifier_call_chain+0x15/0x20
 [<ffffffff8105352d>] mce_log+0x1d/0x130
 [<ffffffff810537d4>] machine_check_poll+0x194/0x260
 [<ffffffff81053ae6>] mce_timer_fn+0x116/0x140
 [<ffffffff810539d0>] ? mce_cpu_restart+0x40/0x40
 [<ffffffff810f17e7>] call_timer_fn.isra.29+0x17/0x80
 [<ffffffff810f1a9b>] run_timer_softirq+0x18b/0x220
 [<ffffffff810afbb9>] __do_softirq+0xf9/0x200
 [<ffffffff810afde6>] irq_exit+0x76/0xa0
 [<ffffffff8105dfc1>] smp_apic_timer_interrupt+0x41/0x50
 [<ffffffff81b49067>] apic_timer_interrupt+0x67/0x70
 <EOI>
 [<ffffffff810dfd05>] ? down_read_trylock+0x15/0x20
 [<ffffffff8106901b>] __do_page_fault+0xbb/0x4c0
 [<ffffffff81b4447a>] ? __schedule+0x25a/0x850
 [<ffffffff8106945c>] do_page_fault+0xc/0x10
 [<ffffffff81b4977f>] page_fault+0x1f/0x30
---
 drivers/edac/amd64_edac.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 17638d7..baccc0e 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2175,7 +2175,7 @@ static void __log_bus_error(struct mem_ctl_info *mci, struct err_info *err,
 static inline void decode_bus_error(int node_id, struct mce *m)
 {
 	struct mem_ctl_info *mci = mcis[node_id];
-	struct amd64_pvt *pvt = mci->pvt_info;
+	struct amd64_pvt *pvt;
 	u8 ecc_type = (m->status >> 45) & 0x3;
 	u8 xec = XEC(m->status, 0x1f);
 	u16 ec = EC(m->status);
@@ -2190,6 +2190,11 @@ static inline void decode_bus_error(int node_id, struct mce *m)
 	if (xec && xec != F10_NBSL_EXT_ERR_ECC)
 		return;
 
+	/* Unable to decode on memory controllers after EDAC_MAX_MCS, as no mci is allocated */
+	if (!mci)
+		return;
+	pvt = mci->pvt_info;
+
 	memset(&err, 0, sizeof(err));
 
 	sys_addr = get_error_address(pvt, m);
-- 
1.9.1
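
The guard added by the patch can be illustrated outside the kernel. The following is a minimal user-space sketch, not the driver code: `MAX_MCS`, `NUM_NODES`, `struct mc_info`, `struct pvt`, `register_nodes()` and `decode_error()` are hypothetical stand-ins for `EDAC_MAX_MCS`, the node count, `struct mem_ctl_info`, `struct amd64_pvt` and `decode_bus_error()`. It assumes a per-node pointer table in which only the first `MAX_MCS` slots are ever populated, so a lookup for a higher node yields NULL and must be checked before `pvt_info` is read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_MCS   16 /* hypothetical stand-in for EDAC_MAX_MCS */
#define NUM_NODES 32 /* assumption: more nodes than controller slots */

struct pvt { int node; };                 /* stand-in for struct amd64_pvt */
struct mc_info { struct pvt *pvt_info; }; /* stand-in for struct mem_ctl_info */

static struct pvt pvts[MAX_MCS];
static struct mc_info infos[MAX_MCS];
static struct mc_info *mcis[NUM_NODES];   /* slots >= MAX_MCS stay NULL */

/* Populate only the first MAX_MCS slots, mimicking the allocation limit. */
static void register_nodes(void)
{
	assert(MAX_MCS <= NUM_NODES);
	for (int i = 0; i < MAX_MCS; i++) {
		pvts[i].node = i;
		infos[i].pvt_info = &pvts[i];
		mcis[i] = &infos[i];
	}
}

/* Returns 0 if the error was decoded, -1 if the node has no controller info. */
static int decode_error(int node_id)
{
	struct mc_info *mci;
	struct pvt *pvt; /* deliberately not initialised from mci here */

	if (node_id < 0 || node_id >= NUM_NODES)
		return -1;

	mci = mcis[node_id];
	if (!mci)
		return -1; /* nothing allocated for this node: skip decoding */

	pvt = mci->pvt_info; /* safe only after the NULL check above */
	printf("decoded error on node %d\n", pvt->node);
	return 0;
}
```

After `register_nodes()`, calling `decode_error(3)` decodes normally, while `decode_error(28)` returns -1 instead of faulting. The trade-off is the same as in the patch: errors on nodes without an allocated structure are silently dropped rather than decoded.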



* Re: [PATCH] x86: Prevent oops with >16 memory controllers
From: Borislav Petkov @ 2015-02-16 11:40 UTC
  To: Daniel J Blueman
  Cc: Doug Thompson, Mauro Carvalho Chehab, linux-edac, linux-kernel,
	Steffen Persvold

On Sat, Feb 14, 2015 at 11:18:40AM +0800, Daniel J Blueman wrote:
> When ECC interrupts occur on memory controllers after EDAC_MAX_MCS (16), the

I knew this artificial limit would come back to bite us someday :-\

> kernel fatally dereferences unallocated structures [1]; this occurs on at
> least NumaConnect systems.
> 
> Minimally fix by checking if a memory controller info structure is allocated;
> candidate for stable.
> 
> Signed-off-by: Daniel J Blueman <daniel@numascale.com>
> 
> -- [1]
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000320
> IP: [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
> PGD 2f8b5a3067 PUD 2f8b5a2067 PMD 0
> Oops: 0000 [#2] SMP
> Modules linked in:
> CPU: 224 PID: 11930 Comm: stream_c.exe.gn Tainted: G   D    3.19.0 #1

CPU 224?! What node is that? :)

> ---
>  drivers/edac/amd64_edac.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 17638d7..baccc0e 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -2175,7 +2175,7 @@ static void __log_bus_error(struct mem_ctl_info *mci, struct err_info *err,
>  static inline void decode_bus_error(int node_id, struct mce *m)
>  {
>  	struct mem_ctl_info *mci = mcis[node_id];
> -	struct amd64_pvt *pvt = mci->pvt_info;
> +	struct amd64_pvt *pvt;
>  	u8 ecc_type = (m->status >> 45) & 0x3;
>  	u8 xec = XEC(m->status, 0x1f);
>  	u16 ec = EC(m->status);
> @@ -2190,6 +2190,11 @@ static inline void decode_bus_error(int node_id, struct mce *m)
>  	if (xec && xec != F10_NBSL_EXT_ERR_ECC)
>  		return;
>  
> +	/* Unable to decode on memory controllers after EDAC_MAX_MCS, as no mci is allocated */
> +	if (!mci)
> +		return;
> +	pvt = mci->pvt_info;

Hmm, so we have all the facilities to fix that properly, IINM:
edac_mc_find(), add_mc_to_global_list() and so on.

Would looking through the list of the memory controllers help instead,
i.e. if you do:

static inline void decode_bus_error(int node_id, struct mce *m)
{
	struct mem_ctl_info *mci = edac_mc_find(node_id);
	if (!mci)
		return;

?

Then we can get rid of that local mcis dumbness and do it properly...
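
For illustration, the list-based lookup suggested above might behave roughly like the user-space sketch below. This is only a model under stated assumptions: `struct mc_info`, `mc_register()`, `mc_find()` and `decode_error_by_list()` are hypothetical names, and the real `edac_mc_find()` and `add_mc_to_global_list()` may differ in locking and list handling. The point is that a controller is found by its index in a list of registered controllers, so no fixed-size table is needed and an unregistered index simply yields NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_ctl_info, linked into a global list. */
struct mc_info {
	int mc_idx;            /* controller index, here equal to the node ID */
	struct mc_info *next;
};

static struct mc_info *mc_list; /* head of the list of registered controllers */

/* Add a controller to the front of the global list. */
static void mc_register(struct mc_info *mci)
{
	assert(mci != NULL);
	mci->next = mc_list;
	mc_list = mci;
}

/* Walk the list and return the controller with index idx, or NULL. */
static struct mc_info *mc_find(int idx)
{
	for (struct mc_info *mci = mc_list; mci; mci = mci->next)
		if (mci->mc_idx == idx)
			return mci;
	return NULL;
}

/* Returns the controller index on success, -1 if none is registered. */
static int decode_error_by_list(int node_id)
{
	struct mc_info *mci = mc_find(node_id);

	if (!mci)
		return -1;
	return mci->mc_idx;
}
```

With controllers 2 and 28 registered, `decode_error_by_list(28)` succeeds and `decode_error_by_list(5)` returns -1. The lookup costs a list walk per error instead of an array index.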

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
