Re: [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context

From: Borislav Petkov <bp@suse.de>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: "Chen, Gong" <gong.chen@linux.intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context
Date: Thu, 11 Jun 2015 10:27:47 +0200	[thread overview]
Message-ID: <20150611082747.GA30391@pd.tnic> (raw)
In-Reply-To: <20150608202658.GF5877@pd.tnic>

On Mon, Jun 08, 2015 at 10:26:58PM +0200, Borislav Petkov wrote:
> On Mon, Jun 08, 2015 at 08:03:08PM +0000, Luck, Tony wrote:
> > @@ -156,7 +156,8 @@ void mce_log(struct mce *mce)
> >  	/* Emit the trace record: */
> >  	trace_mce_record(mce);
> >  
> > -	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
> > +	mce_genpool_add(mce);
> > +	irq_work_queue(&mce_irq_work);
> > 
> > Is it safe to call irq_work_queue() from MCE context?
> 
> Yeah, we're using it in contexts like:

Yeah, so that is safe but there's another issue which I triggered
yesterday. I hope the commit message explains it thoroughly. If you see
holes, please shout. I haven't run this on the EINJ box yet, that's
still outstanding.

In any case, I'm thinking of not rushing those changes to tip yet and
have them simmer in linux-next for a whole round and have them ready for
4.3. Just to be on the safe side and have some more time hammering on
them.

---
From: Borislav Petkov <bp@suse.de>
Date: Wed, 10 Jun 2015 18:49:18 +0200
Subject: [PATCH] x86/mce: Fix logging of early boot machine checks

With moving the irq_work queuing to mce_log() which simplified other
code, we're faced with the problem of logging MCEs during early boot
which might be there from before. In particular, facilities which we
rely on, like workqueues, for example, are not initialized yet. See
splat below.

As a first step, move the workqueue init before the first invocation of
mce_log() in

__mcheck_cpu_init_generic
|-> machine_check_poll
    |-> mce_log

But that's not enough because the normal system workqueue on which stuff
gets queued with schedule_work() is not initialized yet either.

Therefore, we still queue the work in the genpool but delay the
scheduling of that work to a late initcall where everything should be
initialized.

While at it, make sure we queue irq_work only if the addition to the
genpool has been successful.

  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: [<          (null)>]           (null)
  PGD 0
  Oops: 0010 [#1] PREEMPT SMP
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.0-rc7+ #12
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
  task: ffffffff8197d580 ti: ffffffff8196c000 task.ti: ffffffff8196c000
  RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)

  <registers snipped>

  Call Trace:
   <IRQ>
    ? irq_work_run_list
    irq_work_run
    smp_irq_work_interrupt
    irq_work_interrupt
   <EOI>
    ? apic_send_IPI_self
    arch_irq_work_raise
    irq_work_queue
    mce_log
    machine_check_poll
    __mcheck_cpu_init_generic
    mcheck_cpu_init
    identify_cpu
    identify_boot_cpu
    check_bugs
    start_kernel
    x86_64_start_reservations
    x86_64_start_kernel
  Code:  Bad RIP value.
  RIP  [<          (null)>]           (null)
   RSP <ffff88007b603f40>
  CR2: 0000000000000000
  ---[ end trace 6d6dd507e2636bd4 ]---
  Kernel panic - not syncing: Fatal exception in interrupt
  ---[ end Kernel panic - not syncing: Fatal exception in interrupt

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index a93a150c3583..17cef061c24a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -159,8 +159,8 @@ void mce_log(struct mce *mce)
 	/* Emit the trace record: */
 	trace_mce_record(mce);
 
-	mce_genpool_add(mce);
-	irq_work_queue(&mce_irq_work);
+	if (mce_genpool_add(mce))
+		irq_work_queue(&mce_irq_work);
 
 	mce->finished = 0;
 	wmb();
@@ -480,7 +480,7 @@ int mce_available(struct cpuinfo_x86 *c)
 
 static void mce_schedule_work(void)
 {
-	if (!mce_genpool_empty())
+	if (!mce_genpool_empty() && keventd_up())
 		schedule_work(&mce_work);
 }
 
@@ -648,8 +648,9 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 			if (m.status & MCI_STATUS_ADDRV) {
 				m.severity = severity;
 				m.usable_addr = mce_usable_address(&m);
-				mce_genpool_add(&m);
-				mce_schedule_work();
+
+				if (mce_genpool_add(&m))
+					mce_schedule_work();
 			}
 		}
 
@@ -1704,13 +1705,14 @@ void mcheck_cpu_init(struct cpuinfo_x86 *c)
 		return;
 	}
 
+	INIT_WORK(&mce_work, mce_process_work);
+	init_irq_work(&mce_irq_work, mce_irq_work_cb);
+
 	machine_check_vector = do_machine_check;
 
 	__mcheck_cpu_init_generic();
 	__mcheck_cpu_init_vendor(c);
 	__mcheck_cpu_init_timer();
-	INIT_WORK(&mce_work, mce_process_work);
-	init_irq_work(&mce_irq_work, mce_irq_work_cb);
 }
 
 /*
@@ -2565,5 +2567,20 @@ static int __init mcheck_debugfs_init(void)
 
 	return 0;
 }
-late_initcall(mcheck_debugfs_init);
+#else
+static int __init mcheck_debugfs_init(void) {}
 #endif
+
+static int __init mcheck_late_init(void)
+{
+	mcheck_debugfs_init();
+
+	/*
+	 * Flush out everything that has been logged during early boot, now that
+	 * everything has been initialized (workqueues, decoders, ...).
+	 */
+	mce_schedule_work();
+
+	return 0;
+}
+late_initcall(mcheck_late_init);
-- 
2.3.5

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--