From: Nadav Amit <nadav.amit@gmail.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Tue, 18 Jul 2017 14:28:27 -0700
Message-ID: <F7E154AB-5C1D-477F-A6BF-EFCAE5381B2D@gmail.com>
In-Reply-To: <20170715155518.ok2q62efc2vurqk5@suse.de>

Mel Gorman <mgorman@suse.de> wrote:

> On Fri, Jul 14, 2017 at 04:16:44PM -0700, Nadav Amit wrote:
>> Mel Gorman <mgorman@suse.de> wrote:
>> 
>>> On Wed, Jul 12, 2017 at 04:27:23PM -0700, Nadav Amit wrote:
>>>>> If reclaim is first, it'll take the PTL, set batched while a racing
>>>>> mprotect/munmap/etc spins. On release, the racing mprotect/munmap
>>>>> immediately calls flush_tlb_batched_pending() before proceeding as normal,
>>>>> finding pte_none with the TLB flushed.
>>>> 
>>>> This is the scenario I was referring to in my example. Notice that when
>>>> the first flush_tlb_batched_pending() is called, CPU0 and CPU1 hold
>>>> different page-table locks - allowing them to run concurrently. As a
>>>> result, flush_tlb_batched_pending() is executed before the PTE is cleared,
>>>> and mm->tlb_flush_batched is cleared. Later, after CPU0 runs
>>>> ptep_get_and_clear(), mm->tlb_flush_batched remains clear, and CPU1 can
>>>> use the stale PTE.
>>> 
>>> If they hold different PTL locks, it means that reclaim and the parallel
>>> munmap/mprotect/madvise/mremap operation are operating on different regions
>>> of an mm or separate mm's and the race should not apply or at the very
>>> least is equivalent to not batching the flushes. For multiple parallel
>>> operations, munmap/mprotect/mremap are serialised by mmap_sem so there
>>> is only one risky operation at a time. For multiple madvise, there is a
>>> small window when a page is accessible after madvise returns but it is an
>>> advisory call so it's primarily a data integrity concern and the TLB is
>>> flushed before the page is either freed or IO starts on the reclaim side.
>> 
>> I think there is some miscommunication. Perhaps one detail was missing:
>> 
>> CPU0				CPU1
>> ---- 				----
>> should_defer_flush
>> => mm->tlb_flush_batched=true		
>> 				flush_tlb_batched_pending (another PT)
>> 				=> flush TLB
>> 				=> mm->tlb_flush_batched=false
>> 
>> 				Access PTE (and cache in TLB)
>> ptep_get_and_clear(PTE)
>> ...
>> 
>> 				flush_tlb_batched_pending (batched PT)
>> 				[ no flush since tlb_flush_batched=false ]
>> 				use the stale PTE
>> ...
>> try_to_unmap_flush
>> 
>> There are only 2 CPUs and both operate on the same address-space. CPU0
>> reclaims a page from this address-space. Just between setting
>> tlb_flush_batched and the actual clearing of the PTE, the process on CPU1
>> runs munmap and calls flush_tlb_batched_pending(). This can happen if CPU1
>> works on a different page-table (and therefore takes a different PTL).
> 
> If both regard the same address-space then they have the same page table so
> there is a disconnect between the first and last sentence in your paragraph
> above. On CPU 0, the setting of tlb_flush_batched and ptep_get_and_clear
> is also reversed as the sequence is
> 
>                        pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>                        set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> 
> Additional barriers should not be needed as within the critical section
> that can race, it's protected by the lock and with Andy's code, there is
> a full barrier before the setting of tlb_flush_batched. With Andy's code,
> there may be a need for a compiler barrier but I can rethink that
> and add it during the backport to -stable if necessary.
> 
> So the setting happens after the clear and if they share the same address
> space and collide then they both share the same PTL so are protected from
> each other.
> 
> If there are separate address spaces using a shared mapping then the
> same race does not occur.

I missed the fact that you reversed the order of the two operations since the
previous version of the patch. This specific scenario should indeed be solved
by this patch.
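
Just to spell out my current understanding of the two sides (I am
paraphrasing the code from memory, so I may be looking at a slightly
different version of the patch):

    /* reclaim side - try_to_unmap_one(), with the PTL held */
    pteval = ptep_get_and_clear(mm, address, pvmw.pte);
    set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
        /* ... which, IIUC, also sets mm->tlb_flush_batched = true,
         * i.e. only after the PTE was already cleared */

    /* munmap/mprotect/madvise side, after taking the PTL */
    void flush_tlb_batched_pending(struct mm_struct *mm)
    {
        if (mm->tlb_flush_batched) {
            flush_tlb_mm(mm);
            mm->tlb_flush_batched = false;
        }
    }

So, IIUC, the flag can only be observed set once the PTE is already clear,
which is what closes the specific window I described above.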

But in general, I think there is a need for a simple locking scheme.
Otherwise, people (like me) will be afraid to make any changes to this code,
and additional missed TLB flushes will creep in. For example, I suspect that
a user may trigger insert_pfn() or insert_page() and rely on their return
value. While it makes little sense, the user can try to insert a page at an
address that is already mapped by another page. If the other page was already
reclaimed, the operation should succeed, and otherwise it should fail. But it
may succeed while the other page is still going through reclamation,
resulting in:

CPU0					CPU1
----					----				
					ptep_clear_flush_notify()
- access memory using a PTE
[ PTE cached in TLB ]
					try_to_unmap_one()
					==> ptep_get_and_clear() == false
insert_page()
==> pte_none() = true
    [retval = 0]

- access memory using a stale PTE

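For reference, the check insert_page() performs is roughly the following
(abbreviated from mm/memory.c, so I may be missing a detail), and as far as
I can see nothing on this path calls flush_tlb_batched_pending():

    static int insert_page(struct vm_area_struct *vma, unsigned long addr,
                struct page *page, pgprot_t prot)
    {
        ...
        pte = get_locked_pte(mm, addr, &ptl);   /* takes the PTL */
        if (!pte)
            goto out;
        retval = -EBUSY;
        if (!pte_none(*pte))                    /* observes the cleared PTE */
            goto out_unlock;
        /* the page is mapped and 0 is returned, even though the
         * reclaim-side TLB flush may still be pending */
        ...
    }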

Additional, similar situations can arise, IIUC, in mcopy_atomic_pte(),
mfill_zeropage_pte(), and shmem_mcopy_atomic_pte().

Even more importantly, I suspect there is an additional, similar but
unrelated problem: clear_refs_write() can be used with CLEAR_REFS_SOFT_DIRTY
to write-protect PTEs. However, it batches the TLB flushes while only holding
mmap_sem for read, and without leaving any indication in the mm that TLB
flushes are pending.
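
The flow I am worried about looks roughly like this (abbreviated from
fs/proc/task_mmu.c, from memory, so the details may be off):

    /* clear_refs_write(), type == CLEAR_REFS_SOFT_DIRTY */
    down_read(&mm->mmap_sem);
    tlb_gather_mmu(&tlb, mm, 0, -1);
    ...
    walk_page_range(0, mm->highest_vm_end, &clear_refs_walk);
        /* clear_soft_dirty(): pte_wrprotect() on each present PTE,
         * with no per-PTE TLB flush */
    ...
    tlb_finish_mmu(&tlb, 0, -1);    /* single deferred flush at the end */
    up_read(&mm->mmap_sem);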

As a result, concurrent operations such as KSM’s write_protect_page() or
page_mkclean_one() can consider the page write-protected while in fact it is
still writable - since the TLB flush was deferred. They may therefore
mishandle the PTE without flushing the stale TLB entry. In the case of
page_mkclean_one(), I suspect this may even lead to memory corruption. I
admit that on x86 there are some mitigating factors that would make such an
“attack” complicated, but it still seems wrong to me, no?

Thanks,
Nadav

