From: Laura Abbott <labbott@redhat.com>
To: Khalid Aziz <khalid.aziz@oracle.com>,
	juergh@gmail.com, tycho@tycho.ws, jsteckli@amazon.de,
	ak@linux.intel.com, torvalds@linux-foundation.org,
	liran.alon@oracle.com, keescook@google.com,
	konrad.wilk@oracle.com
Cc: deepa.srinivasan@oracle.com, chris.hyser@oracle.com,
	tyhicks@canonical.com, dwmw@amazon.co.uk,
	andrew.cooper3@citrix.com, jcm@redhat.com,
	boris.ostrovsky@oracle.com, kanth.ghatraju@oracle.com,
	joao.m.martins@oracle.com, jmattson@google.com,
	pradeep.vincent@oracle.com, john.haxby@oracle.com,
	tglx@linutronix.de, kirill.shutemov@linux.intel.com, hch@lst.de,
	steven.sistare@oracle.com, kernel-hardening@lists.openwall.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v7 00/16] Add support for eXclusive Page Frame Ownership
Date: Thu, 17 Jan 2019 15:38:44 -0800
Message-ID: <38897e25-c07f-9da1-1671-ae31ac64c864@redhat.com>
In-Reply-To: <cover.1547153058.git.khalid.aziz@oracle.com>

On 1/10/19 1:09 PM, Khalid Aziz wrote:
> I am continuing to build on the work Juerg, Tycho and Julian have done
> on XPFO. After the last round of updates, we were seeing very
> significant performance penalties when stale TLB entries were flushed
> actively after an XPFO TLB update. Benchmark for measuring performance
> is kernel build using parallel make. To get full protection from
> ret2dir attacks, we must flush stale TLB entries. Performance
> penalty from flushing stale TLB entries goes up as the number of
> cores goes up. On a desktop class machine with only 4 cores,
> enabling TLB flush for stale entries causes system time for "make
> -j4" to go up by a factor of 2.614x but on a larger machine with 96
> cores, system time with "make -j60" goes up by a factor of 26.366x!
> I have been working on reducing this performance penalty.
> 
> I implemented a solution to reduce performance penalty and
> that has had large impact. When XPFO code flushes stale TLB entries,
> it does so for all CPUs on the system which may include CPUs that
> may not have any matching TLB entries or may never be scheduled to
> run the userspace task causing TLB flush. Problem is made worse by
> the fact that if number of entries being flushed exceeds
> tlb_single_page_flush_ceiling, it results in a full TLB flush on
> every CPU. A rogue process can launch a ret2dir attack only from a
> CPU that has dual mapping for its pages in physmap in its TLB. We
> can hence defer TLB flush on a CPU until a process that would have
> caused a TLB flush is scheduled on that CPU. I have added a cpumask
> to task_struct which is then used to post pending TLB flush on CPUs
> other than the one a process is running on. This cpumask is checked
> when a process migrates to a new CPU and TLB is flushed at that
> time. I measured system time for parallel make with unmodified 4.20
> kernel, 4.20 with XPFO patches before this optimization and then
> again after applying this optimization. Here are the results:
> 
> Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
> make -j60 all
> 
> 4.20				915.183s
> 4.20+XPFO			24129.354s	26.366x
> 4.20+XPFO+Deferred flush	1216.987s	 1.330x
> 
> 
> Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
> make -j4 all
> 
> 4.20				607.671s
> 4.20+XPFO			1588.646s	2.614x
> 4.20+XPFO+Deferred flush	794.473s	1.307x
> 
> 30+% overhead is still very high and there is room for improvement.
> Dave Hansen had suggested batch updating TLB entries and Tycho had
> created an initial implementation but I have not been able to get
> that to work correctly. I am still working on it and I suspect we
> will see a noticeable improvement in performance with that. In the
> code I added, I post a pending full TLB flush to all other CPUs even
> when number of TLB entries being flushed on current CPU does not
> exceed tlb_single_page_flush_ceiling. There has to be a better way
> to do this. I just haven't found an efficient way to implement a
> delayed, limited TLB flush on other CPUs.
> 
> I am not entirely sure if switch_mm_irqs_off() is indeed the right
> place to perform the pending TLB flush for a CPU. Any feedback on
> that will be very helpful. Delaying full TLB flushes on other CPUs
> seems to help tremendously, so if there is a better way to implement
> the same thing than what I have done in patch 16, I am open to
> ideas.
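
For anyone skimming the thread, the deferral described above can be
modelled in user space roughly as below. This is only a toy sketch, not
the code from patch 16: struct toy_task, toy_xpfo_flush(),
toy_switch_in(), the flush_count[] array and the 64-CPU limit are all
made-up names and assumptions for illustration. It flushes the current
CPU right away, posts a pending full flush for every other CPU in a
per-task mask, and performs the flush only when the task is later
switched in on one of those CPUs.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_CPUS 64	/* assumption: all CPUs fit in one 64-bit mask */

/* Hypothetical stand-in for the cpumask added to task_struct. */
struct toy_task {
	uint64_t pending_flush;	/* bit n set => CPU n still owes a flush */
};

/* Simulated count of full TLB flushes performed on each CPU. */
static unsigned long flush_count[TOY_NR_CPUS];

static void toy_local_flush(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	flush_count[cpu]++;
}

/*
 * An XPFO mapping change happened on behalf of @t while it runs on @cpu:
 * flush locally now, and only record that every other CPU owes a flush.
 */
static void toy_xpfo_flush(struct toy_task *t, int cpu)
{
	toy_local_flush(cpu);
	t->pending_flush = ~(UINT64_C(1) << cpu);
}

/* @t is being switched in on @cpu: pay the deferred flush if one is owed. */
static void toy_switch_in(struct toy_task *t, int cpu)
{
	uint64_t bit = UINT64_C(1) << cpu;

	if (t->pending_flush & bit) {
		toy_local_flush(cpu);
		t->pending_flush &= ~bit;
	}
}
```

In this model a CPU the task never runs on never pays for a flush at
all, which is where the saving on the 96-core machine would come from.
It also shows the limitation mentioned above: the deferred flush is
always a full one, since the mask does not remember which entries went
stale.
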
> 
> Performance with this patch set is good enough to use these as
> starting point for further refinement before we merge it into main
> kernel, hence RFC.
> 
> Since not flushing stale TLB entries creates a false sense of
> security, I would recommend making TLB flush mandatory and eliminate
> the "xpfotlbflush" kernel parameter (patch "mm, x86: omit TLB
> flushing by default for XPFO page table modifications").
> 
> What remains to be done beyond this patch series:
> 
> 1. Performance improvements
> 2. Remove xpfotlbflush parameter
> 3. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
>     from Juerg. I dropped it for now since swiotlb code for ARM has
>     changed a lot in 4.20.
> 4. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
>     CPUs" to other architectures besides x86.
> 
> 
> ---------------------------------------------------------
> 
> Juerg Haefliger (5):
>    mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
>    swiotlb: Map the buffer if it was unmapped by XPFO
>    arm64/mm: Add support for XPFO
>    arm64/mm, xpfo: temporarily map dcache regions
>    lkdtm: Add test for XPFO
> 
> Julian Stecklina (4):
>    mm, x86: omit TLB flushing by default for XPFO page table
>      modifications
>    xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION
>    xpfo, mm: optimize spinlock usage in xpfo_kunmap
>    EXPERIMENTAL: xpfo, mm: optimize spin lock usage in xpfo_kmap
> 
> Khalid Aziz (2):
>    xpfo, mm: Fix hang when booting with "xpfotlbflush"
>    xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
> 
> Tycho Andersen (5):
>    mm: add MAP_HUGETLB support to vm_mmap
>    x86: always set IF before oopsing from page fault
>    xpfo: add primitives for mapping underlying memory
>    arm64/mm: disable section/contiguous mappings if XPFO is enabled
>    mm: add a user_virt_to_phys symbol
> 
>   .../admin-guide/kernel-parameters.txt         |   2 +
>   arch/arm64/Kconfig                            |   1 +
>   arch/arm64/mm/Makefile                        |   2 +
>   arch/arm64/mm/flush.c                         |   7 +
>   arch/arm64/mm/mmu.c                           |   2 +-
>   arch/arm64/mm/xpfo.c                          |  58 ++++
>   arch/x86/Kconfig                              |   1 +
>   arch/x86/include/asm/pgtable.h                |  26 ++
>   arch/x86/include/asm/tlbflush.h               |   1 +
>   arch/x86/mm/Makefile                          |   2 +
>   arch/x86/mm/fault.c                           |  10 +
>   arch/x86/mm/pageattr.c                        |  23 +-
>   arch/x86/mm/tlb.c                             |  27 ++
>   arch/x86/mm/xpfo.c                            | 171 ++++++++++++
>   drivers/misc/lkdtm/Makefile                   |   1 +
>   drivers/misc/lkdtm/core.c                     |   3 +
>   drivers/misc/lkdtm/lkdtm.h                    |   5 +
>   drivers/misc/lkdtm/xpfo.c                     | 194 ++++++++++++++
>   include/linux/highmem.h                       |  15 +-
>   include/linux/mm.h                            |   2 +
>   include/linux/mm_types.h                      |   8 +
>   include/linux/page-flags.h                    |  13 +
>   include/linux/sched.h                         |   9 +
>   include/linux/xpfo.h                          |  90 +++++++
>   include/trace/events/mmflags.h                |  10 +-
>   kernel/dma/swiotlb.c                          |   3 +-
>   mm/Makefile                                   |   1 +
>   mm/mmap.c                                     |  19 +-
>   mm/page_alloc.c                               |   3 +
>   mm/util.c                                     |  32 +++
>   mm/xpfo.c                                     | 247 ++++++++++++++++++
>   security/Kconfig                              |  29 ++
>   32 files changed, 974 insertions(+), 43 deletions(-)
>   create mode 100644 arch/arm64/mm/xpfo.c
>   create mode 100644 arch/x86/mm/xpfo.c
>   create mode 100644 drivers/misc/lkdtm/xpfo.c
>   create mode 100644 include/linux/xpfo.h
>   create mode 100644 mm/xpfo.c
> 

Also gave this a boot on my X1 Carbon and got a lockdep splat:

[   16.863110] ================================
[   16.863119] WARNING: inconsistent lock state
[   16.863128] 4.20.0-xpfo+ #6 Not tainted
[   16.863136] --------------------------------
[   16.863145] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[   16.863157] swapper/5/0 [HC1[1]:SC1[1]:HE0:SE0] takes:
[   16.863168] 00000000301e129a (&(&page->xpfo_lock)->rlock){?.+.}, at: xpfo_do_map+0x1b/0x90
[   16.863188] {HARDIRQ-ON-W} state was registered at:
[   16.863200]   _raw_spin_lock+0x30/0x70
[   16.863208]   xpfo_do_map+0x1b/0x90
[   16.863217]   simple_write_begin+0xc7/0x240
[   16.863227]   generic_perform_write+0xf7/0x1c0
[   16.863237]   __generic_file_write_iter+0xfa/0x1c0
[   16.863247]   generic_file_write_iter+0xab/0x150
[   16.863257]   __vfs_write+0x139/0x1a0
[   16.863264]   vfs_write+0xba/0x1c0
[   16.863272]   ksys_write+0x52/0xc0
[   16.863281]   xwrite+0x29/0x5a
[   16.863288]   do_copy+0x2b/0xc8
[   16.863296]   write_buffer+0x2a/0x3a
[   16.863304]   unpack_to_rootfs+0x107/0x2c8
[   16.863312]   populate_rootfs+0x5d/0x10a
[   16.863322]   do_one_initcall+0x5d/0x2be
[   16.863541]   kernel_init_freeable+0x21b/0x2c9
[   16.863764]   kernel_init+0xa/0x109
[   16.863988]   ret_from_fork+0x3a/0x50
[   16.864220] irq event stamp: 337503
[   16.864456] hardirqs last  enabled at (337502): [<ffffffff8ce000a7>] __do_softirq+0xa7/0x47c
[   16.864715] hardirqs last disabled at (337503): [<ffffffff8c0037e8>] trace_hardirqs_off_thunk+0x1a/0x1c
[   16.864985] softirqs last  enabled at (337500): [<ffffffff8c0c6d88>] irq_enter+0x68/0x70
[   16.865263] softirqs last disabled at (337501): [<ffffffff8c0c6ea9>] irq_exit+0x119/0x120
[   16.865546]
                other info that might help us debug this:
[   16.866128]  Possible unsafe locking scenario:

[   16.866733]        CPU0
[   16.867039]        ----
[   16.867370]   lock(&(&page->xpfo_lock)->rlock);
[   16.867693]   <Interrupt>
[   16.868019]     lock(&(&page->xpfo_lock)->rlock);
[   16.868354]
                 *** DEADLOCK ***

[   16.869373] 1 lock held by swapper/5/0:
[   16.869727]  #0: 00000000800b2c51 (&(&ctx->completion_lock)->rlock){-.-.}, at: aio_complete+0x3c/0x460
[   16.870106]
                stack backtrace:
[   16.870868] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.20.0-xpfo+ #6
[   16.871270] Hardware name: LENOVO 20KGS23S00/20KGS23S00, BIOS N23ET40W (1.15 ) 04/13/2018
[   16.871686] Call Trace:
[   16.872106]  <IRQ>
[   16.872531]  dump_stack+0x85/0xc0
[   16.872962]  print_usage_bug.cold.60+0x1a8/0x1e2
[   16.873407]  ? print_shortest_lock_dependencies+0x40/0x40
[   16.873856]  mark_lock+0x502/0x600
[   16.874308]  ? check_usage_backwards+0x120/0x120
[   16.874769]  __lock_acquire+0x6e2/0x1650
[   16.875236]  ? find_held_lock+0x34/0xa0
[   16.875710]  ? sched_clock_cpu+0xc/0xb0
[   16.876185]  lock_acquire+0x9e/0x180
[   16.876668]  ? xpfo_do_map+0x1b/0x90
[   16.877154]  _raw_spin_lock+0x30/0x70
[   16.877649]  ? xpfo_do_map+0x1b/0x90
[   16.878144]  xpfo_do_map+0x1b/0x90
[   16.878647]  aio_complete+0xb2/0x460
[   16.879154]  blkdev_bio_end_io+0x71/0x150
[   16.879665]  blk_update_request+0xd7/0x2e0
[   16.880170]  blk_mq_end_request+0x1a/0x100
[   16.880669]  blk_mq_complete_request+0x98/0x120
[   16.881175]  nvme_irq+0x192/0x210 [nvme]
[   16.881675]  __handle_irq_event_percpu+0x46/0x2a0
[   16.882174]  handle_irq_event_percpu+0x30/0x80
[   16.882670]  handle_irq_event+0x34/0x51
[   16.883252]  handle_edge_irq+0x7b/0x190
[   16.883772]  handle_irq+0xbf/0x100
[   16.883774]  do_IRQ+0x5f/0x120
[   16.883776]  common_interrupt+0xf/0xf
[   16.885469] RIP: 0010:__do_softirq+0xae/0x47c
[   16.885470] Code: 0c 00 00 01 c7 44 24 24 0a 00 00 00 44 89 7c 24 04 48 c7 c0 c0 1e 1e 00 65 66 c7 00 00 00 e8 69 3d 3e ff fb 66 0f 1f 44 00 00 <48> c7 44 24 08 80 51 60 8d b8 ff ff ff ff 0f bc 44 24 04 83 c0 01
[   16.885471] RSP: 0018:ffff8bde5e003f68 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffdd
[   16.887291] RAX: ffff8bde5b303740 RBX: ffff8bde5b303740 RCX: 0000000000000000
[   16.887291] RDX: ffff8bde5b303740 RSI: 0000000000000000 RDI: ffff8bde5b303740
[   16.887292] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[   16.887293] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   16.887294] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000202
[   16.887296]  ? common_interrupt+0xa/0xf
[   16.890885]  ? __do_softirq+0xa7/0x47c
[   16.890887]  ? hrtimer_interrupt+0x12e/0x220
[   16.890889]  irq_exit+0x119/0x120
[   16.890920]  smp_apic_timer_interrupt+0xa2/0x230
[   16.890921]  apic_timer_interrupt+0xf/0x20
[   16.890922]  </IRQ>
[   16.890955] RIP: 0010:cpuidle_enter_state+0xbe/0x350
[   16.890956] Code: 80 7c 24 0b 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 6d 02 00 00 31 ff e8 8e 61 91 ff e8 19 77 98 ff fb 66 0f 1f 44 00 00 <85> ed 0f 88 36 02 00 00 48 b8 ff ff ff ff f3 01 00 00 48 2b 1c 24
[   16.890957] RSP: 0018:ffffa91a41997ea0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
[   16.891025] RAX: ffff8bde5b303740 RBX: 00000003ed1dca4d RCX: 0000000000000000
[   16.891026] RDX: ffff8bde5b303740 RSI: 0000000000000001 RDI: ffff8bde5b303740
[   16.891027] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
[   16.891028] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8d7f8898
[   16.891028] R13: ffffc91a3f800a00 R14: 0000000000000004 R15: 0000000000000000
[   16.891032]  do_idle+0x23e/0x280
[   16.891119]  cpu_startup_entry+0x19/0x20
[   16.891122]  start_secondary+0x1b3/0x200
[   16.891124]  secondary_startup_64+0xa4/0xb0

This was 4.20 + this series. config was based on what Fedora has.
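
To spell out what lockdep is complaining about: xpfo_lock is taken with
a plain spin lock in process context with interrupts enabled (the
simple_write_begin path) and is also taken from hard interrupt context
(nvme_irq -> aio_complete -> xpfo_do_map). If the interrupt lands on a
CPU that already holds the lock, it spins forever. The usual way out is
the spin_lock_irqsave()/spin_unlock_irqrestore() pattern. Whether that
is the right fix for xpfo_do_map is an assumption on my part, not
something I have tested. The toy user-space model below only
illustrates the rule; struct toy_cpu, struct toy_lock and the toy_*
helpers are made up and are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_cpu  { bool irqs_enabled; };	/* local interrupt state */
struct toy_lock { bool held; };		/* single, non-recursive lock */

/* Like spin_lock(): takes the lock, leaves interrupts untouched. */
static void toy_lock_plain(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void toy_unlock_plain(struct toy_lock *l)
{
	l->held = false;
}

/* Like spin_lock_irqsave(): disables interrupts first, returns old state. */
static bool toy_lock_irqsave(struct toy_cpu *c, struct toy_lock *l)
{
	bool flags = c->irqs_enabled;

	c->irqs_enabled = false;
	assert(!l->held);
	l->held = true;
	return flags;
}

/* Like spin_unlock_irqrestore(): drops the lock, restores interrupt state. */
static void toy_unlock_irqrestore(struct toy_cpu *c, struct toy_lock *l,
				  bool flags)
{
	l->held = false;
	c->irqs_enabled = flags;
}

/*
 * Would an interrupt handler that takes the same lock deadlock if it
 * fired on this CPU right now? Only if it can fire while the lock is held.
 */
static bool toy_irq_would_deadlock(const struct toy_cpu *c,
				   const struct toy_lock *l)
{
	return c->irqs_enabled && l->held;
}
```

With the plain variant the model reports a possible deadlock while the
lock is held; with the irqsave variant it does not, because the
interrupt cannot be delivered until the lock has been dropped.
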

Thanks,
Laura
