From: Nadav Amit <nadav.amit@gmail.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Wed, 19 Jul 2017 13:20:01 -0700
Message-ID: <4BD983A1-724B-4FD7-B502-55351717BC5F@gmail.com>
In-Reply-To: <20170719195820.drtfmweuhdc4eca6@suse.de>

Mel Gorman <mgorman@suse.de> wrote:

> On Wed, Jul 19, 2017 at 12:41:01PM -0700, Nadav Amit wrote:
>> Mel Gorman <mgorman@suse.de> wrote:
>> 
>>> On Tue, Jul 18, 2017 at 02:28:27PM -0700, Nadav Amit wrote:
>>>>> If there are separate address spaces using a shared mapping then the
>>>>> same race does not occur.
>>>> 
>>>> I missed the fact that you reversed the order of the two operations since
>>>> the previous version of the patch. This specific scenario should be solved
>>>> with this patch.
>>>> 
>>>> But in general, I think there is a need for a simple locking scheme.
>>> 
>>> Such as?
>> 
>> Something like:
>> 
>> bool is_potentially_stale_pte(pte_t pte, pgprot_t prot, int lock_state);
>> 
>> which would take the current PTE, the protection bits that the user is
>> interested in, and whether mmap_sem is held for read, for write, or not at all.
> 
> From a PTE you cannot know the state of mmap_sem because you can rmap
> back to multiple mm's for shared mappings. It's also fairly heavy handed.
> Technically, you could lock on the basis of the VMA but that has other
> consequences for scalability. The staleness is also a factor because
> it's a case of "does the staleness matter". Sometimes it does, sometimes
> it doesn't.  mmap_sem, even if it could be used, does not always tell us
> the right information either, because it can matter whether we are racing
> against a userspace reference or a kernel operation.
> 
> It's possible your idea could be made to work, but right now I'm not seeing a
> solution that handles every corner case. I asked to hear what your ideas
> were because anything I thought of that could batch TLB flushing in the
> general case had flaws and did not improve over what is already there.

I don’t disagree with what you say - perhaps my scheme is too simplistic.
But the bottom line is: if you cannot form simple rules for when the TLB
needs to be flushed, what are the chances others would get it right?
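
For what it's worth, here is a toy sketch of the kind of rule I have in mind.
It is userspace model code with invented stand-in types and semantics - not
an existing or proposed kernel API - and the point is only that a caller
should be able to ask one question and learn whether it must flush (or know
who owns the pending batched flush) before trusting the protection it just
set:

    /* Toy model only -- invented names, no real kernel types or API. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef unsigned long pte_t;     /* stand-in for the kernel's pte_t */
    typedef unsigned long pgprot_t;  /* stand-in for the kernel's pgprot_t */

    enum mmap_sem_state { MMAP_SEM_NONE, MMAP_SEM_READ, MMAP_SEM_WRITE };

    #define PTE_WRITE 0x2UL          /* pretend "writable" bit */

    /*
     * "Could some CPU still hold a TLB entry granting a right that 'prot'
     * is meant to revoke?"  If so, the caller must flush before relying
     * on the new protection.
     */
    static bool is_potentially_stale_pte(pte_t pte, pgprot_t prot,
                                         enum mmap_sem_state lock_state)
    {
            /*
             * Reclaim batches flushes without holding mmap_sem, so this
             * toy deliberately does not use lock_state to relax anything.
             */
            (void)lock_state;

            /* A stale entry only matters if the PTE currently grants a
             * right that the new protection removes. */
            return (pte & PTE_WRITE) && !(prot & PTE_WRITE);
    }

    int main(void)
    {
            /* write-protecting a writable PTE without mmap_sem for write */
            printf("must flush: %d\n",
                   is_potentially_stale_pte(PTE_WRITE, 0, MMAP_SEM_READ));
            return 0;
    }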

>> [snip]
>> 
>>>> As a result, concurrent operations such as KSM's write_protect_page() or
>>> 
>>> write_protect_page operates under the page lock and cannot race with reclaim.
>> 
>> I still do not understand this claim. IIUC, reclaim can unmap the page in
>> some page table, decide not to reclaim the page and release the page-lock
>> before flush.
> 
> shrink_page_list is the caller of try_to_unmap in reclaim context. It
> has this check
> 
>                if (!trylock_page(page))
>                        goto keep;
> 
> For pages it cannot lock, they get put back on the LRU and recycled instead
> of reclaimed. Hence, if KSM or anything else holds the page lock, reclaim
> can't unmap it.

Yes, of course, since KSM does not batch TLB flushes. I was referring to the
other direction: first, try_to_unmap() removes the PTE (but still does not
flush), unlocks the page, and then KSM acquires the page lock and calls
write_protect_page(). It finds that the PTE is not present and therefore does
not flush the TLB.
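
To make the interleaving concrete, here is a toy, single-threaded userspace
model of the ordering. It is plain C with invented variable names, obviously
not kernel code; it simply executes the steps in the problematic order:

    /* Toy model of the ordering above -- not kernel code. */
    #include <stdbool.h>
    #include <stdio.h>

    static bool pte_present = true;        /* the page table entry */
    static bool stale_tlb_writable = true; /* what another CPU may still cache */

    int main(void)
    {
            /* Reclaim: try_to_unmap() clears the PTE, the flush is batched. */
            pte_present = false;
            /* stale_tlb_writable stays true: the flush is deferred.         */
            /* Reclaim drops the page lock here, before flushing.            */

            /* KSM: write_protect_page() runs under the page lock.           */
            if (!pte_present)
                    printf("KSM: PTE not present -> nothing to do, no flush\n");

            /* Until the deferred flush runs, another CPU can still write.   */
            if (stale_tlb_writable)
                    printf("stale writable TLB entry is still live\n");

            return 0;
    }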

>>>> page_mkclean_one() can consider the page write-protected while in fact it is
>>>> still accessible - since the TLB flush was deferred.
>>> 
>>> As long as it's flushed before any IO occurs that would lose a data update,
>>> it's not a data integrity issue.
>>> 
>>>> As a result, they may
>>>> mishandle the PTE without flushing the page. In the case of
>>>> page_mkclean_one(), I suspect it may even lead to memory corruption. I admit
>>>> that on x86 there are some mitigating factors that would make such an
>>>> "attack" complicated, but it still seems wrong to me, no?
>>> 
>>> I worry that you're beginning to see races everywhere. I admit that the
>>> rules and protections here are varied and complex but it's worth keeping
>>> in mind that data integrity is the key concern (no false reads to wrong
>>> data, no lost writes) and the first race you identified found some problems
>>> here. However, with or without batching, there is always a delay between
>>> when a PTE is cleared and when the TLB entries are removed.
>> 
>> Sure, but usually the delay occurs while the page-table lock is held, so
>> there is no race.
>> 
>> Now, it is not fair to call me paranoid, considering that these races are
>> real - I confirmed that at least two can happen in practice.
> 
> It's less an accusation of paranoia and more a caution: because
> pte_clear_flush is not atomic, it can be difficult to tell which races
> matter and which don't.
> 
>> As for "data integrity is the key concern" - violating the memory management
>> API can cause data integrity issues for programs.
> 
> The madvise one should be fixed too. It could also be "fixed" by
> removing all batching but the performance cost will be sufficiently high
> that there will be pressure to find an alternative.
> 
>> It may not cause the OS to
>> crash, but it should not be acceptable either, and may potentially raise
>> security concerns. If you think that the current behavior is ok, let the
>> documentation and man pages clarify that mprotect may not protect, madvise
>> may not advise and so on.
> 
> The madvise one should be fixed, not least because it allows a case
> whereby userspace thinks it has initialised a structure that is actually
> in a page that is freed after the TLB is flushed, resulting in a lost
> write. It wouldn't cause any issues with shared or file-backed mappings
> but it is a problem for anonymous.
> 
>> And although you would use it against me, I would say: Nobody knew that TLB
>> flushing could be so complicated.
> 
> There is no question that the area is complicated.

My comment was actually an unfunny joke... Never mind.

Thanks,
Nadav

p.s.: Thanks for your patience.

