From: Andy Lutomirski
Date: Wed, 22 Jun 2016 18:22:17 -0700
Subject: Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)
To: Linus Torvalds
Cc: Andy Lutomirski, the arch/x86 maintainers, Linux Kernel Mailing List,
    linux-arch@vger.kernel.org, Borislav Petkov, Nadav Amit, Kees Cook,
    Brian Gerst, kernel-hardening@lists.openwall.com, Josh Poimboeuf,
    Jann Horn, Heiko Carstens

On Mon, Jun 20, 2016 at 9:01 PM, Linus Torvalds wrote:
> On Mon, Jun 20, 2016 at 4:43 PM, Andy Lutomirski wrote:
>>
>> On my laptop, this adds about 1.5µs of overhead to task creation,
>> which seems to be mainly caused by vmalloc inefficiently allocating
>> individual pages even when a higher-order page is available on the
>> freelist.
>
> I really think that problem needs to be fixed before this should be merged.
>
> The easy fix may be to just have a very limited re-use of these stacks
> in generic code, rather than try to do anything fancy with multi-page
> allocations. Just a few of these allocations held in reserve (perhaps
> make the allocations percpu to avoid new locks).

I implemented a percpu cache, and it's useless (rough sketch of what I
mean below).

When a task goes away, one reference is held until the next RCU grace
period so that task_struct can be used under RCU (look for
delayed_put_task_struct).  This means that free_task gets called in
giant batches under heavy clone() load, which is the only time that
any of this matters, which means that we only get to refill the cache
once per RCU batch, which means that there's very little benefit.

Once thread_info stops living in the stack, we could, in principle,
exempt the stack itself from RCU protection, thus saving a bit of
memory under load and making the cache work.  I've started working on
(optionally, per-arch) getting rid of on-stack thread_info, but that's
not ready yet.

FWIW, the same issue quite possibly hurts non-vmap-stack performance
as well, as it makes it much less likely that a cache-hot stack gets
immediately reused under heavy fork load.

So may I skip this for now?  I think that the performance hit is
unlikely to matter on most workloads, and I also expect the speedup
from not using higher-order allocations to be a decent win on some
workloads.
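A minimal sketch of the shape of cache in question, to make the problem
concrete (identifiers such as cached_stacks, NR_CACHED_STACKS, and
tsk->stack_vm_area are illustrative assumptions, not the actual patch):
the free path parks a dying stack in a small per-CPU array, and the
alloc path tries that array before falling back to vmalloc.  The catch
is that the free path only runs from free_task, which
delayed_put_task_struct defers to the next RCU grace period, so the
cache refills in bursts instead of tracking the fork rate.

#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/vmalloc.h>

/* Illustrative sketch only; names and sizing are not from the patch. */
#define NR_CACHED_STACKS 2

static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);

static void *alloc_thread_stack(void)
{
	int i;

	/* Reuse a cache-hot stack freed earlier on this CPU, if any. */
	for (i = 0; i < NR_CACHED_STACKS; i++) {
		struct vm_struct *s = this_cpu_xchg(cached_stacks[i], NULL);

		if (s)
			return s->addr;
	}

	/* Fall back to a fresh, page-by-page vmalloc allocation. */
	return __vmalloc(THREAD_SIZE, THREADINFO_GFP, PAGE_KERNEL);
}

static void free_thread_stack(struct task_struct *tsk)
{
	int i;

	/*
	 * Park the stack for reuse.  This is only reached from
	 * free_task(), which delayed_put_task_struct() defers until
	 * the next RCU grace period; that deferral is what makes the
	 * cache refill in giant batches rather than steadily.
	 */
	for (i = 0; i < NR_CACHED_STACKS; i++) {
		if (this_cpu_cmpxchg(cached_stacks[i], NULL,
				     tsk->stack_vm_area) != NULL)
			continue;
		return;
	}

	vfree(tsk->stack);
}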
--Andy