From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 17 Jun 2016 09:27:37 +0200
From: Heiko Carstens <heiko.carstens@de.ibm.com>
To: Andy Lutomirski
Cc: Andy Lutomirski, "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	X86 ML, Borislav Petkov, Nadav Amit, Kees Cook, Brian Gerst,
	"kernel-hardening@lists.openwall.com" <kernel-hardening@lists.openwall.com>,
	Linus Torvalds, Josh Poimboeuf
Subject: Re: [PATCH 00/13] Virtually mapped stacks with guard pages (x86, core)
Message-Id: <20160617072737.GA3960@osiris>
References: <20160616060538.GA3923@osiris>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
User-Agent: Mutt/1.5.21 (2010-09-15)

On Thu, Jun 16, 2016 at 08:58:07PM -0700, Andy Lutomirski wrote:
> On Wed, Jun 15, 2016 at 11:05 PM, Heiko Carstens
> <heiko.carstens@de.ibm.com> wrote:
> > On Wed, Jun 15, 2016 at 05:28:22PM -0700, Andy Lutomirski wrote:
> >> Since the dawn of time, a kernel stack overflow has been a real PITA
> >> to debug, has caused nondeterministic crashes some time after the
> >> actual overflow, and has generally been easy to exploit for root.
> >>
> >> With this series, arches can enable HAVE_ARCH_VMAP_STACK.  Arches
> >> that enable it (just x86 for now) get virtually mapped stacks with
> >> guard pages.  This causes reliable faults when the stack overflows.
> >>
> >> If the arch implements it well, we get a nice OOPS on stack overflow
> >> (as opposed to panicking directly or otherwise exploding badly).  On
> >> x86, the OOPS is nice, has a usable call trace, and the overflowing
> >> task is killed cleanly.
> >
> > Do you have numbers which reflect the performance impact of this change?
>
> It seems to add ~1.5µs per thread creation/join pair, which is around
> 15% overhead.  I *think* the major cost is that vmalloc calls
> alloc_kmem_pages_node once per page rather than using a higher-order
> block if available.
>
> Anyway, if anyone wants this to become faster, I think the way to do
> it would be to ask some friendly mm folks to see if they can speed up
> vmalloc.  I don't really want to dig into the guts of the page
> allocator.  My instinct would be to add a new interface
> (GFP_SMALLER_OK?) to ask the page allocator for a high-order
> allocation such that, if a high-order block is not immediately
> available (on the freelist) then it should fall back to a smaller
> allocation rather than working hard to get a high-order allocation.
> Then vmalloc could use this, and vfree could free pages in blocks
> corresponding to whatever orders it got the pages in, thus avoiding
> the need to merge all the pages back together.
>
> There's another speedup available: actually reuse allocations.  We
> could keep a very small freelist of vmap_areas with their associated
> pages so we could reuse them.  (We can't efficiently reuse a vmap_area
> without its backing pages because we need to flush the TLB in the
> middle if we do that.)

That's rather expensive.

Just for the record: on s390 we use gcc's architecture-specific compile
options (kernel: CONFIG_STACK_GUARD)

  -mstack-guard=stack-guard
  -mstack-size=stack-size

These generate two additional instructions at the beginning of each
function prologue, which verify that the remaining stack space is not
below a specified number of bytes. If it is, an illegal instruction is
executed.

A disassembly looks like this (r15 is the stack pointer):

0000000000000670 :
 670:	eb 6f f0 48 00 24	stmg	%r6,%r15,72(%r15)
 676:	c0 d0 00 00 00 00	larl	%r13,676
 67c:	a7 f1 3f 80		tmll	%r15,16256	<--- test if enough space left
 680:	b9 04 00 ef		lgr	%r14,%r15
 684:	a7 84 00 01		je	686		<--- branch to illegal op
 688:	e3 f0 ff 90 ff 71	lay	%r15,-112(%r15)

The branch actually jumps into the branch instruction itself, since the
0001 part of the "je" instruction is an illegal opcode.

This catches at least wild stack overflows caused by too many nested
function calls. Of course it doesn't catch wild accesses outside the
stack, e.g. when the index into an array on the stack is wrong.

The runtime overhead is within the noise, therefore we have this always
enabled.
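
To make the semantics of those two prologue instructions concrete, here
is a minimal C sketch of the check they implement, assuming
-mstack-size=16384 and -mstack-guard=128 (which would match the
16256 == 0x3f80 mask in the tmll above); the names are illustrative,
not actual kernel code:

/* Minimal sketch, not actual kernel code.  Assumes power-of-two values
 * for -mstack-size (16384) and -mstack-guard (128), so that
 * STACK_SIZE - STACK_GUARD == 0x3f80 == 16256, the tmll mask above.
 */
#define STACK_SIZE	16384UL		/* -mstack-size  */
#define STACK_GUARD	128UL		/* -mstack-guard */

static inline void stack_guard_check(unsigned long sp)
{
	/* The stack grows down and is STACK_SIZE aligned, so the low
	 * bits of sp are the space still left.  If none of the bits in
	 * the mask are set, fewer than STACK_GUARD bytes remain.
	 */
	if ((sp & (STACK_SIZE - STACK_GUARD)) == 0)	/* tmll %r15,16256 */
		__builtin_trap();			/* the "je" into 0001 */
}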
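
For reference, a micro-benchmark for the "per thread creation/join
pair" figure quoted above could look roughly like this (a userspace
sketch; not the test Andy actually ran):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000

static void *empty(void *arg) { return arg; }

int main(void)
{
	struct timespec t0, t1;
	pthread_t tid;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++) {
		/* Each iteration allocates and frees one kernel stack. */
		pthread_create(&tid, NULL, empty, NULL);
		pthread_join(tid, NULL);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.0f ns per create/join pair\n", ns / ITERS);
	return 0;
}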
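
The opportunistic high-order allocation Andy proposes above could be
sketched like this (GFP_SMALLER_OK does not exist; __GFP_NORETRY |
__GFP_NOWARN is used below as an approximation of "take a big block
only if it is cheaply available", and the function name is made up):

#include <linux/gfp.h>

/* Sketch only.  Try progressively smaller orders, taking a block only
 * if one is readily available, instead of compacting/reclaiming to
 * satisfy the highest order.  vfree() would then have to remember the
 * order of each block to free it without merging pages back together.
 */
static struct page *alloc_stack_block(gfp_t gfp, unsigned int *order)
{
	unsigned int o;
	struct page *page;

	for (o = *order; o > 0; o--) {
		page = alloc_pages(gfp | __GFP_NORETRY | __GFP_NOWARN, o);
		if (page) {
			*order = o;
			return page;
		}
	}
	*order = 0;
	return alloc_pages(gfp, 0);	/* single page, normal effort */
}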
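
And the reuse idea, as a hedged sketch: a tiny per-CPU cache that keeps
a vmap area together with its backing pages, so reuse needs no TLB
flush (names and sizes are illustrative, not an actual implementation):

#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/vmalloc.h>

#define NR_CACHED_STACKS 2
static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);

/* Reuse a cached mapping, pages included, if one is parked here. */
static void *alloc_thread_stack(void)
{
	int i;

	for (i = 0; i < NR_CACHED_STACKS; i++) {
		struct vm_struct *vm = this_cpu_xchg(cached_stacks[i], NULL);

		if (vm)
			return vm->addr;
	}
	return __vmalloc(THREAD_SIZE, GFP_KERNEL, PAGE_KERNEL);
}

/* Park the whole mapping for reuse; only vfree() when the cache is full. */
static void free_thread_stack(struct vm_struct *vm)
{
	int i;

	for (i = 0; i < NR_CACHED_STACKS; i++) {
		if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm) == NULL)
			return;
	}
	vfree(vm->addr);
}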