All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nadav Amit <nadav.amit@gmail.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Tue, 11 Jul 2017 03:40:02 -0700	[thread overview]
Message-ID: <E37E0D40-821A-4C82-B924-F1CE6DF97719@gmail.com> (raw)
In-Reply-To: <20170711092935.bogdb4oja6v7kilq@suse.de>

Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Jul 11, 2017 at 12:30:28AM -0700, Nadav Amit wrote:
>> Mel Gorman <mgorman@suse.de> wrote:
>> 
>>> On Mon, Jul 10, 2017 at 05:52:25PM -0700, Nadav Amit wrote:
>>>> Something bothers me about the TLB flushes batching mechanism that Linux
>>>> uses on x86 and I would appreciate your opinion regarding it.
>>>> 
>>>> As you know, try_to_unmap_one() can batch TLB invalidations. While doing so,
>>>> however, the page-table lock(s) are not held, and I see no indication of the
>>>> pending flush saved (and regarded) in the relevant mm-structs.
>>>> 
>>>> So, my question: what prevents, at least in theory, the following scenario:
>>>> 
>>>> 	CPU0 				CPU1
>>>> 	----				----
>>>> 					user accesses memory using RW PTE 
>>>> 					[PTE now cached in TLB]
>>>> 	try_to_unmap_one()
>>>> 	==> ptep_get_and_clear()
>>>> 	==> set_tlb_ubc_flush_pending()
>>>> 					mprotect(addr, PROT_READ)
>>>> 					==> change_pte_range()
>>>> 					==> [ PTE non-present - no flush ]
>>>> 
>>>> 					user writes using cached RW PTE
>>>> 	...
>>>> 
>>>> 	try_to_unmap_flush()
>>>> 
>>>> 
>>>> As you see CPU1 write should have failed, but may succeed. 
>>>> 
>>>> Now I don???t have a PoC since in practice it seems hard to create such a
>>>> scenario: try_to_unmap_one() is likely to find the PTE accessed and the PTE
>>>> would not be reclaimed.
>>> 
>>> That is the same to a race whereby there is no batching mechanism and the
>>> racing operation happens between a pte clear and a flush as ptep_clear_flush
>>> is not atomic. All that differs is that the race window is a different size.
>>> The application on CPU1 is buggy in that it may or may not succeed the write
>>> but it is buggy regardless of whether a batching mechanism is used or not.
>> 
>> Thanks for your quick and detailed response, but I fail to see how it can
>> happen without batching. Indeed, the PTE clear and flush are not ???atomic???,
>> but without batching they are both performed under the page table lock
>> (which is acquired in page_vma_mapped_walk and released in
>> page_vma_mapped_walk_done). Since the lock is taken, other cores should not
>> be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
>> and change_pte_range, acquire the lock before accessing the PTEs.
> 
> I was primarily thinking in terms of memory corruption or data loss.
> However, we are still protected although it's not particularly obvious why.
> 
> On the reclaim side, we are either reclaiming clean pages (which ignore
> the accessed bit) or normal reclaim. If it's clean pages then any parallel
> write must update the dirty bit at minimum. If it's normal reclaim then
> the accessed bit is checked and if cleared in try_to_unmap_one, it uses a
> ptep_clear_flush_young_notify so the TLB gets flushed. We don't reclaim
> the page in either as part of page_referenced or try_to_unmap_one but
> clearing the accessed bit flushes the TLB.

Wait. Are you looking at the x86 arch function? The TLB is not flushed when
the access bit is cleared:

int ptep_clear_flush_young(struct vm_area_struct *vma,
                           unsigned long address, pte_t *ptep)
{
        /*
         * On x86 CPUs, clearing the accessed bit without a TLB flush
         * doesn't cause data corruption. [ It could cause incorrect
         * page aging and the (mistaken) reclaim of hot pages, but the
         * chance of that should be relatively low. ]
         *                 
         * So as a performance optimization don't flush the TLB when
         * clearing the accessed bit, it will eventually be flushed by
         * a context switch or a VM operation anyway. [ In the rare
         * event of it not getting flushed for a long time the delay
         * shouldn't really matter because there's no real memory
         * pressure for swapout to react to. ]
         */
        return ptep_test_and_clear_young(vma, address, ptep);
}

> 
> On the mprotect side then, as the page was first accessed, clearing the
> accessed bit incurs a TLB flush on the reclaim side before the second write.
> That means any TLB entry that exists cannot have the accessed bit set so
> a second write needs to update it.
> 
> While it's not clearly documented, I checked with hardware engineers
> at the time that an update of the accessed or dirty bit even with a TLB
> entry will check the underlying page tables and trap if it's not present
> and the subsequent fault will then fail on sigsegv if the VMA protections
> no longer allow the write.
> 
> So, on one side if ignoring the accessed bit during reclaim, the pages
> are clean so any access will set the dirty bit and trap if unmapped in
> parallel. On the other side, the accessed bit if set cleared the TLB and
> if not set, then the hardware needs to update and again will trap if
> unmapped in parallel.


Yet, even regardless to the TLB flush it seems there is still a possible
race:

CPU0				CPU1
----				----
ptep_clear_flush_young_notify
==> PTE.A==0
				access PTE
				==> PTE.A=1
prep_get_and_clear
				change mapping (and PTE)
				Use stale TLB entry


> If this guarantee from hardware was every shown to be wrong or another
> architecture wanted to add batching without the same guarantee then mprotect
> would need to do a local_flush_tlb if no pages were updated by the mprotect
> but right now, this should not be necessary.
> 
>> Can you please explain why you consider the application to be buggy?
> 
> I considered it a bit dumb to mprotect for READ/NONE and then try writing
> the same mapping. However, it will behave as expected.

I don’t think that this is the only scenario. For example, the application
may create a new memory mapping of a different file using mmap at the same
memory address that was used before, just as that memory is reclaimed. The
application can (inadvertently) cause such a scenario by using MAP_FIXED.
But even without MAP_FIXED, running mmap->munmap->mmap can reuse the same
virtual address.

>> AFAIU
>> an application can wish to trap certain memory accesses using userfaultfd or
>> SIGSEGV. For example, it may do it for garbage collection or sandboxing. To
>> do so, it can use mprotect with PROT_NONE and expect to be able to trap
>> future accesses to that memory. This use-case is described in usefaultfd
>> documentation.
> 
> Such applications are safe due to how the accessed bit is handled by the
> software (flushes TLB if clearing young) and hardware (traps if updating
> the accessed or dirty bit and the underlying PTE was unmapped even if
> there is a TLB entry).

I don’t think it is so. And I also think there are many additional
potentially problematic scenarios.

Thanks for your patience,
Nadav
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2017-07-11 10:40 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-11  0:52 Potential race in TLB flush batching? Nadav Amit
2017-07-11  6:41 ` Mel Gorman
2017-07-11  7:30   ` Nadav Amit
2017-07-11  9:29     ` Mel Gorman
2017-07-11 10:40       ` Nadav Amit [this message]
2017-07-11 13:20         ` Mel Gorman
2017-07-11 14:58           ` Andy Lutomirski
2017-07-11 15:53             ` Mel Gorman
2017-07-11 17:23               ` Andy Lutomirski
2017-07-11 19:18                 ` Mel Gorman
2017-07-11 20:06                   ` Nadav Amit
2017-07-11 21:09                     ` Mel Gorman
2017-07-11 20:09                   ` Mel Gorman
2017-07-11 21:52                     ` Mel Gorman
2017-07-11 22:27                       ` Nadav Amit
2017-07-11 22:34                         ` Nadav Amit
2017-07-12  8:27                         ` Mel Gorman
2017-07-12 23:27                           ` Nadav Amit
2017-07-12 23:36                             ` Andy Lutomirski
2017-07-12 23:42                               ` Nadav Amit
2017-07-13  5:38                                 ` Andy Lutomirski
2017-07-13 16:05                                   ` Nadav Amit
2017-07-13 16:06                                     ` Andy Lutomirski
2017-07-13  6:07                             ` Mel Gorman
2017-07-13 16:08                               ` Andy Lutomirski
2017-07-13 17:07                                 ` Mel Gorman
2017-07-13 17:15                                   ` Andy Lutomirski
2017-07-13 18:23                                     ` Mel Gorman
2017-07-14 23:16                               ` Nadav Amit
2017-07-15 15:55                                 ` Mel Gorman
2017-07-15 16:41                                   ` Andy Lutomirski
2017-07-17  7:49                                     ` Mel Gorman
2017-07-18 21:28                                   ` Nadav Amit
2017-07-19  7:41                                     ` Mel Gorman
2017-07-19 19:41                                       ` Nadav Amit
2017-07-19 19:58                                         ` Mel Gorman
2017-07-19 20:20                                           ` Nadav Amit
2017-07-19 21:47                                             ` Mel Gorman
2017-07-19 22:19                                               ` Nadav Amit
2017-07-19 22:59                                                 ` Mel Gorman
2017-07-19 23:39                                                   ` Nadav Amit
2017-07-20  7:43                                                     ` Mel Gorman
2017-07-22  1:19                                                       ` Nadav Amit
2017-07-24  9:58                                                         ` Mel Gorman
2017-07-24 19:46                                                           ` Nadav Amit
2017-07-25  7:37                                                           ` Minchan Kim
2017-07-25  8:51                                                             ` Mel Gorman
2017-07-25  9:11                                                               ` Minchan Kim
2017-07-25 10:10                                                                 ` Mel Gorman
2017-07-26  5:43                                                                   ` Minchan Kim
2017-07-26  9:22                                                                     ` Mel Gorman
2017-07-26 19:18                                                                       ` Nadav Amit
2017-07-26 23:40                                                                         ` Minchan Kim
2017-07-27  0:09                                                                           ` Nadav Amit
2017-07-27  0:34                                                                             ` Minchan Kim
2017-07-27  0:48                                                                               ` Nadav Amit
2017-07-27  1:13                                                                                 ` Nadav Amit
2017-07-27  7:04                                                                                   ` Minchan Kim
2017-07-27  7:21                                                                                     ` Mel Gorman
2017-07-27 16:04                                                                                       ` Nadav Amit
2017-07-27 17:36                                                                                         ` Mel Gorman
2017-07-26 23:44                                                                       ` Minchan Kim
2017-07-11 22:07                   ` Andy Lutomirski
2017-07-11 22:33                     ` Mel Gorman
2017-07-14  7:00                     ` Benjamin Herrenschmidt
2017-07-14  8:31                       ` Mel Gorman
2017-07-14  9:02                         ` Benjamin Herrenschmidt
2017-07-14  9:27                           ` Mel Gorman
2017-07-14 22:21                             ` Andy Lutomirski
2017-07-11 16:22           ` Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=E37E0D40-821A-4C82-B924-F1CE6DF97719@gmail.com \
    --to=nadav.amit@gmail.com \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mgorman@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.