From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752778AbdDJAKq (ORCPT ); Sun, 9 Apr 2017 20:10:46 -0400
Received: from mail.kernel.org ([198.145.29.136]:59884 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752678AbdDJAKl (ORCPT ); Sun, 9 Apr 2017 20:10:41 -0400
MIME-Version: 1.0
In-Reply-To: <58EA2D58.17782.6ADE22BD@pageexec.freemail.hu>
References: <1490811363-93944-1-git-send-email-keescook@chromium.org> <58E7EF70.30766.621C4F44@pageexec.freemail.hu> <58EA2D58.17782.6ADE22BD@pageexec.freemail.hu>
From: Andy Lutomirski
Date: Sun, 9 Apr 2017 17:10:16 -0700
X-Gmail-Original-Message-ID:
Message-ID:
Subject: Re: [kernel-hardening] Re: [RFC v2][PATCH 04/11] x86: Implement __arch_rare_write_begin/unmap()
To: PaX Team
Cc: Andy Lutomirski, Mathias Krause, Thomas Gleixner, Kees Cook, "kernel-hardening@lists.openwall.com", Mark Rutland, Hoeun Ryu, Emese Revfy, Russell King, X86 ML, "linux-kernel@vger.kernel.org", "linux-arm-kernel@lists.infradead.org", Peter Zijlstra
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Apr 9, 2017 at 5:47 AM, PaX Team wrote:
> On 7 Apr 2017 at 21:58, Andy Lutomirski wrote:
>
>> On Fri, Apr 7, 2017 at 12:58 PM, PaX Team wrote:
>> > On 7 Apr 2017 at 9:14, Andy Lutomirski wrote:
>> >> Then someone who cares about performance can benchmark the CR0.WP
>> >> approach against it and try to argue that it's a good idea. This
>> >> benchmark should wait until I'm done with my PCID work, because PCID
>> >> is going to make use_mm() a whole heck of a lot faster.
>> >
>> > in my measurements switching PCID hovers around 230 cycles for snb-ivb
>> > and 200-220 for hsw-skl whereas cr0 writes are around 230-240 cycles. there's
>> > of course a whole lot more impact for switching address spaces so it'll never
>> > be fast enough to beat cr0.wp.
>> >
>>
>> If I'm reading this right, you're saying that a non-flushing CR3 write
>> is about the same cost as a CR0.WP write. If so, then why should CR0
>> be preferred over the (arch-neutral) CR3 approach?
>
> cr3 (page table switching) isn't arch neutral at all ;). you probably meant
> the higher level primitives except they're not enough to implement the scheme
> as discussed before since the enter/exit paths are very much arch dependent.

Yes.

>
> on x86 the cost of the pax_open/close_kernel primitives comes from the cr0
> writes and nothing else, use_mm suffers not only from the cr3 writes but
> also locking/atomic ops and cr4 writes on its path and the inevitable TLB
> entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp
> a fast path in the microcode and reduce its overhead by an order of magnitude.
>

If CR4 writes happen in this use case, that's a bug.

>> And why would switching address spaces obviously be much slower?
>> There'll be a very small number of TLB fills needed for the actual
>> protected access.
>
> you'll be duplicating TLB entries in the alternative PCID for both code
> and data, where they will accumulate (=take room away from the normal PCID
> and expose unwanted memory for access) unless you also flush them when
> switching back (which then will cost even more cycles). also i'm not sure
> that processors implement all the 12 PCID bits so depending on how many PCIDs
> you plan to use, you could be causing even more unnecessary TLB replacements.
>

Unless the CPU is rather dumber than I expect, the only duplicated
entries should be for the writable aliases of pages that are written.
The rest of the pages are global and should be shared for all PCIDs.
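
To make the global-page argument above concrete, here is a toy model, not
kernel code. It assumes kernel text and data are mapped with the global bit
set in both address spaces, that a global TLB entry matches under any PCID,
and that the temporary mm adds one non-global writable alias. Every name in
it (ToyTLB, Mapping, the page numbers, the PCID values) is made up for
illustration, and it ignores TLB capacity, replacement and flushes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    is_global: bool  # models the G bit in the page-table entry
    writable: bool


class ToyTLB:
    """Toy PCID-tagged TLB: global entries match regardless of PCID."""

    def __init__(self):
        # Key is (tag, vpn); tag is None for a global entry, else the PCID.
        self.entries = {}
        self.fills = 0

    def access(self, pcid, page_table, vpn):
        # A global entry or an entry tagged with the current PCID is a hit.
        if (None, vpn) in self.entries or (pcid, vpn) in self.entries:
            return "hit"
        # Miss: walk the (toy) page table and cache the translation.
        mapping = page_table[vpn]
        tag = None if mapping.is_global else pcid
        self.entries[(tag, vpn)] = mapping
        self.fills += 1
        return "fill"

    def entries_for(self, tag):
        # Page numbers cached under one tag (None selects global entries).
        return sorted(vpn for (t, vpn) in self.entries if t == tag)


# Hypothetical virtual page numbers and PCIDs, chosen only for the example.
TEXT, RODATA, ALIAS = 0x100, 0x200, 0x900
NORMAL_PCID, WRITE_PCID = 0, 1

# Normal kernel view: text and read-only data, both global.
normal_pt = {TEXT: Mapping(True, False), RODATA: Mapping(True, False)}
# Temporary mm: same global pages plus one non-global writable alias.
write_pt = dict(normal_pt)
write_pt[ALIAS] = Mapping(False, True)


def run_scenario():
    tlb = ToyTLB()
    # Warm the TLB in the normal address space.
    tlb.access(NORMAL_PCID, normal_pt, TEXT)
    tlb.access(NORMAL_PCID, normal_pt, RODATA)
    # Switch to the temporary mm: run code, then write through the alias.
    results = [
        tlb.access(WRITE_PCID, write_pt, TEXT),
        tlb.access(WRITE_PCID, write_pt, ALIAS),
    ]
    return tlb, results
```

In this model the code fetch after the switch hits the existing global
entry, and the only entry tagged with the second PCID is the writable alias.
Whether real hardware behaves this cheaply is exactly what the benchmark
would have to show.
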
--Andy