From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752944AbaKLPsk (ORCPT ); Wed, 12 Nov 2014 10:48:40 -0500 Received: from mail-la0-f47.google.com ([209.85.215.47]:56483 "EHLO mail-la0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752377AbaKLPsj (ORCPT ); Wed, 12 Nov 2014 10:48:39 -0500 MIME-Version: 1.0 In-Reply-To: <20141112103011.GA16807@pd.tnic> References: <20141111213628.GP31490@pd.tnic> <20141111223316.GQ31490@pd.tnic> <20141111230926.GR31490@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F3292A03B@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F3292A157@ORSMSX114.amr.corp.intel.com> <20141112103011.GA16807@pd.tnic> From: Andy Lutomirski Date: Wed, 12 Nov 2014 07:48:15 -0800 Message-ID: Subject: Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace To: Borislav Petkov Cc: Andi Kleen , "linux-kernel@vger.kernel.org" , X86 ML , Peter Zijlstra , Tony Luck , Oleg Nesterov Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Nov 12, 2014 2:30 AM, "Borislav Petkov" wrote: > > On Tue, Nov 11, 2014 at 06:06:48PM -0800, Tony Luck wrote: > > > Innocent bystanders have RIPV=1, EIPV=0 in MCG_STATUS ... so they > > > are quite easy to spot. > > > > Bother ... except for the SRAO cases where *everyone* is an innocent > > bystander - but someone should go look for the error and queue up > > a page offline event. Perhaps for this we'd do the self-ipi trick and > > then scan the banks to look for SRAO signatures to process as well > > as clearing MCIP > > First of all, good morning :-) > > So I'm reading all those notes again and I can't say I'm convinced about > the change. And all those "but but, what might happen if" cases are what > bother me. > > First, the stack: switching to a possibly almost full kernel stack > where we have other stuff already allocated is a bad idea IMO. There's > a good reason to have a known-good, solely-for-our-purpose stack which > we get to use. It is always there and it gives us the optimal conditions > possible for recovery. Switching back to the possibly crowded kernel > stack and causing stack overflows while handling an MCE is not an > improvement in my book. I only switch stacks on entry from userspace, and the kernel stack is completely empty if that happens. > > Tony: > > The current code has an ugly hole at the moment. End of > > do_machine_check() clears MCG_STATUS. At that point we are still > > running on the magic stack for machine check exceptions ... if we take > > a machine check in the small window from there until we get off this > > stack (iret) ... then we will enter the handler back on the same stack > > that we haven't finished using yet. Bad things ensue. > > I'm not crazy about it either but how often does this actually happen in > practice? And if it does happen, then, well, we die, so be it. It can > happen with other MCEs coming in back-to-back like poisoned data (1st > MCE) being consumed on another core (2nd MCE). I know, Intel does the > broadcasting but that is going to change, I hear and AMD raises an MCE > on the core which detects it only. > > Another big issue I see with this is verifying the locking works and > generally testing this serious change. MCA is core kernel infrastructure > and we cannot break this left and right. And changing stacks we're > running on and the whole dynamics of the code might open a whole other > can of worms. One nice thing for testing is that my patch applies to int3 from userspace as well, and that's easy to test. > > Oh, and we shouldn't forget why we're doing this: just to save two > members to task_struct and getting rid of paranoid_userspace. Yep, still > not convinced the effort is worth it. I think I want to make this change anyway, though, since it may simplify fsgsbase support enough to justify it solely on that account. I don't think that the machine check code needs to change at all to accommodate a stack switch, but I think it makes some simplifications possible. > > Now, I think Andy had a much better/simpler/cleaner suggestion > yesterday: task_work_add(). It is basically what _TIF_MCE_NOTIFY does > and it is called in the same path - do_notify_resume() - but generic and > we can drop _TIF_MCE_NOTIFY then and use the generic thing. > > Much less intrusive change. Less intrusive is certainly true. --Andy > > -- > Regards/Gruss, > Boris. > > Sent from a fat crate under my desk. Formatting is fine. > --