* Potential race in TLB flush batching?
@ 2017-07-11  0:52 Nadav Amit
  2017-07-11  6:41 ` Mel Gorman
  0 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-11  0:52 UTC (permalink / raw)
  To: Mel Gorman, Andy Lutomirski; +Cc: open list:MEMORY MANAGEMENT

Something bothers me about the TLB flushes batching mechanism that Linux
uses on x86 and I would appreciate your opinion regarding it.

As you know, try_to_unmap_one() can batch TLB invalidations. While doing so,
however, the page-table lock(s) are not held, and I see no indication of the
pending flush saved (and regarded) in the relevant mm-structs.

So, my question: what prevents, at least in theory, the following scenario:

	CPU0 				CPU1
	----				----
					user accesses memory using RW PTE 
					[PTE now cached in TLB]
	try_to_unmap_one()
	==> ptep_get_and_clear()
	==> set_tlb_ubc_flush_pending()
					mprotect(addr, PROT_READ)
					==> change_pte_range()
					==> [ PTE non-present - no flush ]

					user writes using cached RW PTE
	...

	try_to_unmap_flush()


As you can see, CPU1's write should have failed, but it may succeed.

Now I don’t have a PoC since in practice it seems hard to create such a
scenario: try_to_unmap_one() is likely to find the PTE accessed and the PTE
would not be reclaimed.

Yet, isn’t it a problem? Am I missing something?

Thanks,
Nadav
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-11  0:52 Potential race in TLB flush batching? Nadav Amit
@ 2017-07-11  6:41 ` Mel Gorman
  2017-07-11  7:30   ` Nadav Amit
  0 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-11  6:41 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

On Mon, Jul 10, 2017 at 05:52:25PM -0700, Nadav Amit wrote:
> Something bothers me about the TLB flushes batching mechanism that Linux
> uses on x86 and I would appreciate your opinion regarding it.
> 
> As you know, try_to_unmap_one() can batch TLB invalidations. While doing so,
> however, the page-table lock(s) are not held, and I see no indication of the
> pending flush saved (and regarded) in the relevant mm-structs.
> 
> So, my question: what prevents, at least in theory, the following scenario:
> 
> 	CPU0 				CPU1
> 	----				----
> 					user accesses memory using RW PTE 
> 					[PTE now cached in TLB]
> 	try_to_unmap_one()
> 	==> ptep_get_and_clear()
> 	==> set_tlb_ubc_flush_pending()
> 					mprotect(addr, PROT_READ)
> 					==> change_pte_range()
> 					==> [ PTE non-present - no flush ]
> 
> 					user writes using cached RW PTE
> 	...
> 
> 	try_to_unmap_flush()
> 
> 
> As you see CPU1 write should have failed, but may succeed. 
> 
> Now I don't have a PoC since in practice it seems hard to create such a
> scenario: try_to_unmap_one() is likely to find the PTE accessed and the PTE
> would not be reclaimed.
> 

That is the same as a race whereby there is no batching mechanism and the
racing operation happens between a pte clear and a flush, as ptep_clear_flush
is not atomic. All that differs is that the race window is a different size.
The application on CPU1 is buggy in that the write may or may not succeed,
but it is buggy regardless of whether a batching mechanism is used or not.

The user accessed the PTE before the mprotect so, at the time of mprotect,
the PTE is either clean or dirty. If it is clean, then any subsequent write
would transition the PTE from clean to dirty, and an architecture enabling
the batching mechanism must trap a clean->dirty transition for unmapped
entries, as commented upon in try_to_unmap_one (and it was verified that
this is true for x86 at least). This avoids data corruption due to a lost
update.

If the previous access was a write then the batching flushes the page if
any IO is required to avoid any writes after the IO has been initiated
using try_to_unmap_flush_dirty so again there is no data corruption. There
is a window where the TLB entry exists after the unmapping but this exists
regardless of whether we batch or not.

In either case, before a page is freed and potentially allocated to another
process, the TLB is flushed.

> Yet, isn't it a problem? Am I missing something?
> 

It's not a problem as such, as it's basically a buggy application that
can only hurt itself. I cannot see a path whereby the cached PTE can be
used to corrupt data, either by accessing it after IO has been initiated
(lost data update) or by accessing a physical page that has been allocated
to another process (arbitrary corruption).

-- 
Mel Gorman
SUSE Labs



* Re: Potential race in TLB flush batching?
  2017-07-11  6:41 ` Mel Gorman
@ 2017-07-11  7:30   ` Nadav Amit
  2017-07-11  9:29     ` Mel Gorman
  0 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-11  7:30 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Mon, Jul 10, 2017 at 05:52:25PM -0700, Nadav Amit wrote:
>> Something bothers me about the TLB flushes batching mechanism that Linux
>> uses on x86 and I would appreciate your opinion regarding it.
>> 
>> As you know, try_to_unmap_one() can batch TLB invalidations. While doing so,
>> however, the page-table lock(s) are not held, and I see no indication of the
>> pending flush saved (and regarded) in the relevant mm-structs.
>> 
>> So, my question: what prevents, at least in theory, the following scenario:
>> 
>> 	CPU0 				CPU1
>> 	----				----
>> 					user accesses memory using RW PTE 
>> 					[PTE now cached in TLB]
>> 	try_to_unmap_one()
>> 	==> ptep_get_and_clear()
>> 	==> set_tlb_ubc_flush_pending()
>> 					mprotect(addr, PROT_READ)
>> 					==> change_pte_range()
>> 					==> [ PTE non-present - no flush ]
>> 
>> 					user writes using cached RW PTE
>> 	...
>> 
>> 	try_to_unmap_flush()
>> 
>> 
>> As you see CPU1 write should have failed, but may succeed. 
>> 
>> Now I don't have a PoC since in practice it seems hard to create such a
>> scenario: try_to_unmap_one() is likely to find the PTE accessed and the PTE
>> would not be reclaimed.
> 
> That is the same to a race whereby there is no batching mechanism and the
> racing operation happens between a pte clear and a flush as ptep_clear_flush
> is not atomic. All that differs is that the race window is a different size.
> The application on CPU1 is buggy in that it may or may not succeed the write
> but it is buggy regardless of whether a batching mechanism is used or not.

Thanks for your quick and detailed response, but I fail to see how it can
happen without batching. Indeed, the PTE clear and flush are not “atomic”,
but without batching they are both performed under the page table lock
(which is acquired in page_vma_mapped_walk and released in
page_vma_mapped_walk_done). Since the lock is taken, other cores should not
be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
and change_pte_range, acquire the lock before accessing the PTEs.

Can you please explain why you consider the application to be buggy? AFAIU
an application can wish to trap certain memory accesses using userfaultfd or
SIGSEGV. For example, it may do it for garbage collection or sandboxing. To
do so, it can use mprotect with PROT_NONE and expect to be able to trap
future accesses to that memory. This use-case is described in the userfaultfd
documentation.

> The user accessed the PTE before the mprotect so, at the time of mprotect,
> the PTE is either clean or dirty. If it is clean then any subsequent write
> would transition the PTE from clean to dirty and an architecture enabling
> the batching mechanism must trap a clean->dirty transition for unmapped
> entries as commented upon in try_to_unmap_one (and was checked that this
> is true for x86 at least). This avoids data corruption due to a lost update.
> 
> If the previous access was a write then the batching flushes the page if
> any IO is required to avoid any writes after the IO has been initiated
> using try_to_unmap_flush_dirty so again there is no data corruption. There
> is a window where the TLB entry exists after the unmapping but this exists
> regardless of whether we batch or not.
> 
> In either case, before a page is freed and potentially allocated to another
> process, the TLB is flushed.

To clarify my concern again: I am not referring to memory corruption, as
you are, but to situations in which the application wishes to trap certain
memory accesses but fails to do so. Having said that, I would add that even
if an application has a bug, it may expect the bug not to affect memory that
was previously unmapped (and may be written to permanent storage).

Thanks (again),
Nadav



* Re: Potential race in TLB flush batching?
  2017-07-11  7:30   ` Nadav Amit
@ 2017-07-11  9:29     ` Mel Gorman
  2017-07-11 10:40       ` Nadav Amit
  0 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-11  9:29 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

On Tue, Jul 11, 2017 at 12:30:28AM -0700, Nadav Amit wrote:
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Mon, Jul 10, 2017 at 05:52:25PM -0700, Nadav Amit wrote:
> >> Something bothers me about the TLB flushes batching mechanism that Linux
> >> uses on x86 and I would appreciate your opinion regarding it.
> >> 
> >> As you know, try_to_unmap_one() can batch TLB invalidations. While doing so,
> >> however, the page-table lock(s) are not held, and I see no indication of the
> >> pending flush saved (and regarded) in the relevant mm-structs.
> >> 
> >> So, my question: what prevents, at least in theory, the following scenario:
> >> 
> >> 	CPU0 				CPU1
> >> 	----				----
> >> 					user accesses memory using RW PTE 
> >> 					[PTE now cached in TLB]
> >> 	try_to_unmap_one()
> >> 	==> ptep_get_and_clear()
> >> 	==> set_tlb_ubc_flush_pending()
> >> 					mprotect(addr, PROT_READ)
> >> 					==> change_pte_range()
> >> 					==> [ PTE non-present - no flush ]
> >> 
> >> 					user writes using cached RW PTE
> >> 	...
> >> 
> >> 	try_to_unmap_flush()
> >> 
> >> 
> >> As you see CPU1 write should have failed, but may succeed. 
> >> 
> >> Now I don't have a PoC since in practice it seems hard to create such a
> >> scenario: try_to_unmap_one() is likely to find the PTE accessed and the PTE
> >> would not be reclaimed.
> > 
> > That is the same to a race whereby there is no batching mechanism and the
> > racing operation happens between a pte clear and a flush as ptep_clear_flush
> > is not atomic. All that differs is that the race window is a different size.
> > The application on CPU1 is buggy in that it may or may not succeed the write
> > but it is buggy regardless of whether a batching mechanism is used or not.
> 
> Thanks for your quick and detailed response, but I fail to see how it can
> happen without batching. Indeed, the PTE clear and flush are not “atomic”,
> but without batching they are both performed under the page table lock
> (which is acquired in page_vma_mapped_walk and released in
> page_vma_mapped_walk_done). Since the lock is taken, other cores should not
> be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
> and change_pte_range, acquire the lock before accessing the PTEs.
> 

I was primarily thinking in terms of memory corruption or data loss.
However, we are still protected although it's not particularly obvious why.

On the reclaim side, we are either reclaiming clean pages (which ignore
the accessed bit) or doing normal reclaim. If it's clean pages, then any
parallel write must update the dirty bit at minimum. If it's normal reclaim,
then the accessed bit is checked and, if cleared in try_to_unmap_one,
ptep_clear_flush_young_notify is used so the TLB gets flushed. In neither
case do we reclaim the page as part of page_referenced or try_to_unmap_one,
but clearing the accessed bit flushes the TLB.

On the mprotect side then, as the page was first accessed, clearing the
accessed bit incurs a TLB flush on the reclaim side before the second write.
That means any TLB entry that exists cannot have the accessed bit set so
a second write needs to update it.

While it's not clearly documented, I checked with hardware engineers
at the time that an update of the accessed or dirty bit even with a TLB
entry will check the underlying page tables and trap if it's not present
and the subsequent fault will then fail on sigsegv if the VMA protections
no longer allow the write.

So, on one side, if the accessed bit is ignored during reclaim, the pages
are clean, so any access will set the dirty bit and trap if the page was
unmapped in parallel. On the other side, if the accessed bit was set,
clearing it flushed the TLB; if it was not set, then the hardware needs to
update it and again will trap if the page was unmapped in parallel.

If this guarantee from hardware was ever shown to be wrong, or if another
architecture wanted to add batching without the same guarantee, then mprotect
would need to do a local_flush_tlb if no pages were updated by the mprotect,
but right now this should not be necessary.

> Can you please explain why you consider the application to be buggy?

I considered it a bit dumb to mprotect for READ/NONE and then try writing
the same mapping. However, it will behave as expected.

> AFAIU
> an application can wish to trap certain memory accesses using userfaultfd or
> SIGSEGV. For example, it may do it for garbage collection or sandboxing. To
> do so, it can use mprotect with PROT_NONE and expect to be able to trap
> future accesses to that memory. This use-case is described in the userfaultfd
> documentation.
> 

Such applications are safe due to how the accessed bit is handled by the
software (flushes TLB if clearing young) and hardware (traps if updating
the accessed or dirty bit and the underlying PTE was unmapped even if
there is a TLB entry).

-- 
Mel Gorman
SUSE Labs



* Re: Potential race in TLB flush batching?
  2017-07-11  9:29     ` Mel Gorman
@ 2017-07-11 10:40       ` Nadav Amit
  2017-07-11 13:20         ` Mel Gorman
  0 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-11 10:40 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Jul 11, 2017 at 12:30:28AM -0700, Nadav Amit wrote:
>> Mel Gorman <mgorman@suse.de> wrote:
>> 
>>> On Mon, Jul 10, 2017 at 05:52:25PM -0700, Nadav Amit wrote:
>>>> Something bothers me about the TLB flushes batching mechanism that Linux
>>>> uses on x86 and I would appreciate your opinion regarding it.
>>>> 
>>>> As you know, try_to_unmap_one() can batch TLB invalidations. While doing so,
>>>> however, the page-table lock(s) are not held, and I see no indication of the
>>>> pending flush saved (and regarded) in the relevant mm-structs.
>>>> 
>>>> So, my question: what prevents, at least in theory, the following scenario:
>>>> 
>>>> 	CPU0 				CPU1
>>>> 	----				----
>>>> 					user accesses memory using RW PTE 
>>>> 					[PTE now cached in TLB]
>>>> 	try_to_unmap_one()
>>>> 	==> ptep_get_and_clear()
>>>> 	==> set_tlb_ubc_flush_pending()
>>>> 					mprotect(addr, PROT_READ)
>>>> 					==> change_pte_range()
>>>> 					==> [ PTE non-present - no flush ]
>>>> 
>>>> 					user writes using cached RW PTE
>>>> 	...
>>>> 
>>>> 	try_to_unmap_flush()
>>>> 
>>>> 
>>>> As you see CPU1 write should have failed, but may succeed. 
>>>> 
>>>> Now I don't have a PoC since in practice it seems hard to create such a
>>>> scenario: try_to_unmap_one() is likely to find the PTE accessed and the PTE
>>>> would not be reclaimed.
>>> 
>>> That is the same to a race whereby there is no batching mechanism and the
>>> racing operation happens between a pte clear and a flush as ptep_clear_flush
>>> is not atomic. All that differs is that the race window is a different size.
>>> The application on CPU1 is buggy in that it may or may not succeed the write
>>> but it is buggy regardless of whether a batching mechanism is used or not.
>> 
>> Thanks for your quick and detailed response, but I fail to see how it can
>> happen without batching. Indeed, the PTE clear and flush are not “atomic”,
>> but without batching they are both performed under the page table lock
>> (which is acquired in page_vma_mapped_walk and released in
>> page_vma_mapped_walk_done). Since the lock is taken, other cores should not
>> be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
>> and change_pte_range, acquire the lock before accessing the PTEs.
> 
> I was primarily thinking in terms of memory corruption or data loss.
> However, we are still protected although it's not particularly obvious why.
> 
> On the reclaim side, we are either reclaiming clean pages (which ignore
> the accessed bit) or normal reclaim. If it's clean pages then any parallel
> write must update the dirty bit at minimum. If it's normal reclaim then
> the accessed bit is checked and if cleared in try_to_unmap_one, it uses a
> ptep_clear_flush_young_notify so the TLB gets flushed. We don't reclaim
> the page in either as part of page_referenced or try_to_unmap_one but
> clearing the accessed bit flushes the TLB.

Wait. Are you looking at the x86 arch function? The TLB is not flushed when
the accessed bit is cleared:

int ptep_clear_flush_young(struct vm_area_struct *vma,
                           unsigned long address, pte_t *ptep)
{
        /*
         * On x86 CPUs, clearing the accessed bit without a TLB flush
         * doesn't cause data corruption. [ It could cause incorrect
         * page aging and the (mistaken) reclaim of hot pages, but the
         * chance of that should be relatively low. ]
         *                 
         * So as a performance optimization don't flush the TLB when
         * clearing the accessed bit, it will eventually be flushed by
         * a context switch or a VM operation anyway. [ In the rare
         * event of it not getting flushed for a long time the delay
         * shouldn't really matter because there's no real memory
         * pressure for swapout to react to. ]
         */
        return ptep_test_and_clear_young(vma, address, ptep);
}

> 
> On the mprotect side then, as the page was first accessed, clearing the
> accessed bit incurs a TLB flush on the reclaim side before the second write.
> That means any TLB entry that exists cannot have the accessed bit set so
> a second write needs to update it.
> 
> While it's not clearly documented, I checked with hardware engineers
> at the time that an update of the accessed or dirty bit even with a TLB
> entry will check the underlying page tables and trap if it's not present
> and the subsequent fault will then fail on sigsegv if the VMA protections
> no longer allow the write.
> 
> So, on one side if ignoring the accessed bit during reclaim, the pages
> are clean so any access will set the dirty bit and trap if unmapped in
> parallel. On the other side, the accessed bit if set cleared the TLB and
> if not set, then the hardware needs to update and again will trap if
> unmapped in parallel.


Yet, even regardless of the TLB flush, it seems there is still a possible
race:

CPU0				CPU1
----				----
ptep_clear_flush_young_notify
==> PTE.A==0
				access PTE
				==> PTE.A=1
ptep_get_and_clear
				change mapping (and PTE)
				Use stale TLB entry


> If this guarantee from hardware was ever shown to be wrong or another
> architecture wanted to add batching without the same guarantee then mprotect
> would need to do a local_flush_tlb if no pages were updated by the mprotect
> but right now, this should not be necessary.
> 
>> Can you please explain why you consider the application to be buggy?
> 
> I considered it a bit dumb to mprotect for READ/NONE and then try writing
> the same mapping. However, it will behave as expected.

I don’t think that this is the only scenario. For example, the application
may create a new memory mapping of a different file using mmap at the same
memory address that was used before, just as that memory is reclaimed. The
application can (inadvertently) cause such a scenario by using MAP_FIXED.
But even without MAP_FIXED, running mmap->munmap->mmap can reuse the same
virtual address.

>> AFAIU
>> an application can wish to trap certain memory accesses using userfaultfd or
>> SIGSEGV. For example, it may do it for garbage collection or sandboxing. To
>> do so, it can use mprotect with PROT_NONE and expect to be able to trap
>> future accesses to that memory. This use-case is described in the userfaultfd
>> documentation.
> 
> Such applications are safe due to how the accessed bit is handled by the
> software (flushes TLB if clearing young) and hardware (traps if updating
> the accessed or dirty bit and the underlying PTE was unmapped even if
> there is a TLB entry).

I don’t think it is so. And I also think there are many additional
potentially problematic scenarios.

Thanks for your patience,
Nadav


* Re: Potential race in TLB flush batching?
  2017-07-11 10:40       ` Nadav Amit
@ 2017-07-11 13:20         ` Mel Gorman
  2017-07-11 14:58           ` Andy Lutomirski
  2017-07-11 16:22           ` Nadav Amit
  0 siblings, 2 replies; 70+ messages in thread
From: Mel Gorman @ 2017-07-11 13:20 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

On Tue, Jul 11, 2017 at 03:40:02AM -0700, Nadav Amit wrote:
> Mel Gorman <mgorman@suse.de> wrote:
> 
> >>> That is the same to a race whereby there is no batching mechanism and the
> >>> racing operation happens between a pte clear and a flush as ptep_clear_flush
> >>> is not atomic. All that differs is that the race window is a different size.
> >>> The application on CPU1 is buggy in that it may or may not succeed the write
> >>> but it is buggy regardless of whether a batching mechanism is used or not.
> >> 
> >> Thanks for your quick and detailed response, but I fail to see how it can
> >> happen without batching. Indeed, the PTE clear and flush are not “atomic”,
> >> but without batching they are both performed under the page table lock
> >> (which is acquired in page_vma_mapped_walk and released in
> >> page_vma_mapped_walk_done). Since the lock is taken, other cores should not
> >> be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
> >> and change_pte_range, acquire the lock before accessing the PTEs.
> > 
> > I was primarily thinking in terms of memory corruption or data loss.
> > However, we are still protected although it's not particularly obvious why.
> > 
> > On the reclaim side, we are either reclaiming clean pages (which ignore
> > the accessed bit) or normal reclaim. If it's clean pages then any parallel
> > write must update the dirty bit at minimum. If it's normal reclaim then
> > the accessed bit is checked and if cleared in try_to_unmap_one, it uses a
> > ptep_clear_flush_young_notify so the TLB gets flushed. We don't reclaim
> > the page in either as part of page_referenced or try_to_unmap_one but
> > clearing the accessed bit flushes the TLB.
> 
> Wait. Are you looking at the x86 arch function? The TLB is not flushed when
> the access bit is cleared:
> 
> int ptep_clear_flush_young(struct vm_area_struct *vma,
>                            unsigned long address, pte_t *ptep)
> {
>         /*
>          * On x86 CPUs, clearing the accessed bit without a TLB flush
>          * doesn't cause data corruption. [ It could cause incorrect
>          * page aging and the (mistaken) reclaim of hot pages, but the
>          * chance of that should be relatively low. ]
>          *                 
>          * So as a performance optimization don't flush the TLB when
>          * clearing the accessed bit, it will eventually be flushed by
>          * a context switch or a VM operation anyway. [ In the rare
>          * event of it not getting flushed for a long time the delay
>          * shouldn't really matter because there's no real memory
>          * pressure for swapout to react to. ]
>          */
>         return ptep_test_and_clear_young(vma, address, ptep);
> }
> 

I forgot this detail, thanks for correcting me.

> > 
> > On the mprotect side then, as the page was first accessed, clearing the
> > accessed bit incurs a TLB flush on the reclaim side before the second write.
> > That means any TLB entry that exists cannot have the accessed bit set so
> > a second write needs to update it.
> > 
> > While it's not clearly documented, I checked with hardware engineers
> > at the time that an update of the accessed or dirty bit even with a TLB
> > entry will check the underlying page tables and trap if it's not present
> > and the subsequent fault will then fail on sigsegv if the VMA protections
> > no longer allow the write.
> > 
> > So, on one side if ignoring the accessed bit during reclaim, the pages
> > are clean so any access will set the dirty bit and trap if unmapped in
> > parallel. On the other side, the accessed bit if set cleared the TLB and
> > if not set, then the hardware needs to update and again will trap if
> > unmapped in parallel.
> 
> 
> Yet, even regardless to the TLB flush it seems there is still a possible
> race:
> 
> CPU0				CPU1
> ----				----
> ptep_clear_flush_young_notify
> ==> PTE.A==0
> 				access PTE
> 				==> PTE.A=1
> ptep_get_and_clear
> 				change mapping (and PTE)
> 				Use stale TLB entry

So I think you're right and this is a potential race. The first access can
be a read or a write as it's a problem if the mprotect call restricts
access.

> > If this guarantee from hardware was ever shown to be wrong or another
> > architecture wanted to add batching without the same guarantee then mprotect
> > would need to do a local_flush_tlb if no pages were updated by the mprotect
> > but right now, this should not be necessary.
> > 
> >> Can you please explain why you consider the application to be buggy?
> > 
> > I considered it a bit dumb to mprotect for READ/NONE and then try writing
> > the same mapping. However, it will behave as expected.
> 
> I don't think that this is the only scenario. For example, the application
> may create a new memory mapping of a different file using mmap at the same
> memory address that was used before, just as that memory is reclaimed.

That requires the existing mapping to be unmapped which will flush the
TLB and parallel mmap/munmap serialises on mmap_sem. The race appears to
be specific to mprotect which avoids the TLB flush if no pages were updated.

> The
> application can (inadvertently) cause such a scenario by using MAP_FIXED.
> But even without MAP_FIXED, running mmap->munmap->mmap can reuse the same
> virtual address.
> 

With flushes in between.

> > Such applications are safe due to how the accessed bit is handled by the
> > software (flushes TLB if clearing young) and hardware (traps if updating
> > the accessed or dirty bit and the underlying PTE was unmapped even if
> > there is a TLB entry).
> 
> I don't think it is so. And I also think there are many additional
> potentially problematic scenarios.
> 

I believe it's specific to mprotect but can be handled by flushing the
local TLB when mprotect updates no pages. Something like this:

---8<---
mm, mprotect: Flush the local TLB if mprotect potentially raced with a parallel reclaim

Nadav Amit identified a theoretical race between page reclaim and mprotect
due to TLB flushes being batched outside of the PTL being held. He described
the race as follows

        CPU0                            CPU1
        ----                            ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
        try_to_unmap_one()
        ==> ptep_get_and_clear()
        ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
        ...

        try_to_unmap_flush()

The same type of race exists for reads when protecting for PROT_NONE.
This is not a data integrity issue as the TLB is always flushed before any
IO is queued or a page is freed but it is a correctness issue as a process
restricting access with mprotect() may still be able to access the data
after the syscall returns due to a stale TLB entry. Handle this issue by
flushing the local TLB if reclaim is potentially batching TLB flushes and
mprotect altered no pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: stable@vger.kernel.org # v4.4+
---
 mm/internal.h |  5 ++++-
 mm/mprotect.c | 12 ++++++++++--
 mm/rmap.c     | 20 ++++++++++++++++++++
 3 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558412fb..9b7d1a597816 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 void try_to_unmap_flush(void);
 void try_to_unmap_flush_dirty(void);
+void batched_unmap_protection_update(void);
 #else
 static inline void try_to_unmap_flush(void)
 {
@@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
 static inline void try_to_unmap_flush_dirty(void)
 {
 }
-
+static inline void batched_unmap_protection_update()
+{
+}
 #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
 
 extern const struct trace_print_flags pageflag_names[];
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..3de353d4b5fb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -254,9 +254,17 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 				 dirty_accountable, prot_numa);
 	} while (pgd++, addr = next, addr != end);
 
-	/* Only flush the TLB if we actually modified any entries: */
-	if (pages)
+	/*
+	 * Only flush all TLBs if we actually modified any entries. If no
+	 * pages are modified, then call batched_unmap_protection_update
+	 * if the context is a mprotect() syscall.
+	 */
+	if (pages) {
 		flush_tlb_range(vma, start, end);
+	} else {
+		if (!prot_numa)
+			batched_unmap_protection_update();
+	}
 	clear_tlb_flush_pending(mm);
 
 	return pages;
diff --git a/mm/rmap.c b/mm/rmap.c
index d405f0e0ee96..02cb035e4ce6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -643,6 +643,26 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 
 	return should_defer;
 }
+
+/*
+ * This is called after an mprotect update that altered no pages. Batched
+ * unmap releases the PTL before a flush occurs leaving a window where
+ * an mprotect that reduces access rights can still access the page after
+ * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
+ * the local TLB if mprotect updates no pages so that the caller of
+ * mprotect always gets expected behaviour. It's overkill and unnecessary to
+ * flush all TLBs as a separate thread accessing the data that raced with
+ * both reclaim and mprotect as there is no risk of data corruption and
+ * the exact timing of a parallel thread seeing a protection update without
+ * any serialisation on the application side is always uncertain.
+ */
+void batched_unmap_protection_update(void)
+{
+	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+	local_flush_tlb();
+	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
+}
+
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-11 13:20         ` Mel Gorman
@ 2017-07-11 14:58           ` Andy Lutomirski
  2017-07-11 15:53             ` Mel Gorman
  2017-07-11 16:22           ` Nadav Amit
  1 sibling, 1 reply; 70+ messages in thread
From: Andy Lutomirski @ 2017-07-11 14:58 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Nadav Amit, Andy Lutomirski, open list:MEMORY MANAGEMENT

On Tue, Jul 11, 2017 at 6:20 AM, Mel Gorman <mgorman@suse.de> wrote:
> +
> +/*
> + * This is called after an mprotect update that altered no pages. Batched
> + * unmap releases the PTL before a flush occurs leaving a window where
> + * an mprotect that reduces access rights can still access the page after
> + * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
> + * the local TLB if mprotect updates no pages so that the caller of
> + * mprotect always gets expected behaviour. It's overkill and unnecessary to
> + * flush all TLBs as a separate thread accessing the data that raced with
> + * both reclaim and mprotect as there is no risk of data corruption and
> + * the exact timing of a parallel thread seeing a protection update without
> + * any serialisation on the application side is always uncertain.
> + */
> +void batched_unmap_protection_update(void)
> +{
> +       count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> +       local_flush_tlb();
> +       trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
> +}
> +

What about remote CPUs?  You could get migrated right after mprotect()
or the inconsistency could be observed on another CPU.  I also really
don't like bypassing arch code like this.  The implementation of
flush_tlb_mm_range() in tip:x86/mm (and slated for this merge window!)
is *very* different from what's there now, and it is not written in
the expectation that some generic code might call local_tlb_flush()
and expect any kind of coherency at all.

I'm also still nervous about situations in which, while a batched
flush is active, a user calls mprotect() and then does something else
that gets confused by the fact that there's an RO PTE and doesn't
flush out the RW TLB entry.  COWing a page, perhaps?

Would a better fix perhaps be to find a way to figure out whether a
batched flush is pending on the mm in question and flush it out if you
do any optimizations based on assuming that the TLB is in any respect
consistent with the page tables?  With the changes in -tip, x86 could,
in principle, supply a function to sync up its TLB state.  That would
require cross-CPU poking at state or an unconditional IPI (that might
end up not flushing anything), but either is doable.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-11 14:58           ` Andy Lutomirski
@ 2017-07-11 15:53             ` Mel Gorman
  2017-07-11 17:23               ` Andy Lutomirski
  0 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-11 15:53 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Nadav Amit, open list:MEMORY MANAGEMENT

On Tue, Jul 11, 2017 at 07:58:04AM -0700, Andrew Lutomirski wrote:
> On Tue, Jul 11, 2017 at 6:20 AM, Mel Gorman <mgorman@suse.de> wrote:
> > +
> > +/*
> > + * This is called after an mprotect update that altered no pages. Batched
> > + * unmap releases the PTL before a flush occurs leaving a window where
> > + * an mprotect that reduces access rights can still access the page after
> > + * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
> > + * the local TLB if mprotect updates no pages so that the the caller of
> > + * mprotect always gets expected behaviour. It's overkill and unnecessary to
> > + * flush all TLBs as a separate thread accessing the data that raced with
> > + * both reclaim and mprotect as there is no risk of data corruption and
> > + * the exact timing of a parallel thread seeing a protection update without
> > + * any serialisation on the application side is always uncertain.
> > + */
> > +void batched_unmap_protection_update(void)
> > +{
> > +       count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> > +       local_flush_tlb();
> > +       trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
> > +}
> > +
> 
> What about remote CPUs?  You could get migrated right after mprotect()
> or the inconsistency could be observed on another CPU. 

If it's migrated then it has also context switched so the TLB entry will
be read for the first time. If the entry is inconsistent for another CPU
accessing the data then it'll potentially successfully access a page that
was just mprotected but this is similar to simply racing with the call
to mprotect itself. The timing isn't exact, nor does it need to be. One
thread accessing data racing with another thread doing mprotect without
any synchronisation in the application is always going to be unreliable.
I'm less certain once PCID tracking is in place and whether it's possible for
a process to be context switching fast enough to allow an access. If it's
possible then batching would require an unconditional flush on mprotect
even if no pages are updated if access is being limited by the mprotect
which would be unfortunate.

> I also really
> don't like bypassing arch code like this.  The implementation of
> flush_tlb_mm_range() in tip:x86/mm (and slated for this merge window!)
> is *very* different from what's there now, and it is not written in
> the expectation that some generic code might call local_tlb_flush()
> and expect any kind of coherency at all.
> 

Assuming that gets merged first then the most straight-forward approach
would be to setup a arch_tlbflush_unmap_batch with just the local CPU set
in the mask or something similar.

> I'm also still nervous about situations in which, while a batched
> flush is active, a user calls mprotect() and then does something else
> that gets confused by the fact that there's an RO PTE and doesn't
> flush out the RW TLB entry.  COWing a page, perhaps?
> 

The race in question only applies if mprotect had no PTEs to update. If
any page was updated then the TLB is flushed before mprotect returns.
With the patch (or a variant on top of your work), at least the local TLB
will be flushed even if no PTEs were updated. This might be more expensive
than it has to be but I expect that mprotects on range with no PTEs to
update are fairly rare.

> Would a better fix perhaps be to find a way to figure out whether a
> batched flush is pending on the mm in question and flush it out if you
> do any optimizations based on assuming that the TLB is in any respect
> consistent with the page tables?  With the changes in -tip, x86 could,
> in principle, supply a function to sync up its TLB state.  That would
>> require cross-CPU poking at state or an unconditional IPI (that might
> end up not flushing anything), but either is doable.

It's potentially doable if a field like tlb_flush_pending was added
to mm_struct that is set when batching starts. I don't think there is
a logical place where it can be cleared as when the TLB gets flushed by
reclaim, it can't rmap again to clear the flag. What would happen is that
the first mprotect after any batching happened at any point in the past
would have to unconditionally flush the TLB and then clear the flag. That
would be a relatively minor hit and cover all the possibilities and should
work unmodified with or without your series applied.

Would that be preferable to you?

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-11 13:20         ` Mel Gorman
  2017-07-11 14:58           ` Andy Lutomirski
@ 2017-07-11 16:22           ` Nadav Amit
  1 sibling, 0 replies; 70+ messages in thread
From: Nadav Amit @ 2017-07-11 16:22 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Jul 11, 2017 at 03:40:02AM -0700, Nadav Amit wrote:
>> Mel Gorman <mgorman@suse.de> wrote:
>> 
>>>>> That is the same as a race whereby there is no batching mechanism and the
>>>>> racing operation happens between a pte clear and a flush as ptep_clear_flush
>>>>> is not atomic. All that differs is that the race window is a different size.
>>>>> The application on CPU1 is buggy in that it may or may not succeed the write
>>>>> but it is buggy regardless of whether a batching mechanism is used or not.
>>>> 
>>>> Thanks for your quick and detailed response, but I fail to see how it can
>>>> happen without batching. Indeed, the PTE clear and flush are not "atomic",
>>>> but without batching they are both performed under the page table lock
>>>> (which is acquired in page_vma_mapped_walk and released in
>>>> page_vma_mapped_walk_done). Since the lock is taken, other cores should not
>>>> be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
>>>> and change_pte_range, acquire the lock before accessing the PTEs.
>>> 
>>> I was primarily thinking in terms of memory corruption or data loss.
>>> However, we are still protected although it's not particularly obvious why.
>>> 
>>> On the reclaim side, we are either reclaiming clean pages (which ignore
>>> the accessed bit) or normal reclaim. If it's clean pages then any parallel
>>> write must update the dirty bit at minimum. If it's normal reclaim then
>>> the accessed bit is checked and if cleared in try_to_unmap_one, it uses a
>>> ptep_clear_flush_young_notify so the TLB gets flushed. We don't reclaim
>>> the page in either as part of page_referenced or try_to_unmap_one but
>>> clearing the accessed bit flushes the TLB.
>> 
>> Wait. Are you looking at the x86 arch function? The TLB is not flushed when
>> the access bit is cleared:
>> 
>> int ptep_clear_flush_young(struct vm_area_struct *vma,
>>                           unsigned long address, pte_t *ptep)
>> {
>>        /*
>>         * On x86 CPUs, clearing the accessed bit without a TLB flush
>>         * doesn't cause data corruption. [ It could cause incorrect
>>         * page aging and the (mistaken) reclaim of hot pages, but the
>>         * chance of that should be relatively low. ]
>>         *                 
>>         * So as a performance optimization don't flush the TLB when
>>         * clearing the accessed bit, it will eventually be flushed by
>>         * a context switch or a VM operation anyway. [ In the rare
>>         * event of it not getting flushed for a long time the delay
>>         * shouldn't really matter because there's no real memory
>>         * pressure for swapout to react to. ]
>>         */
>>        return ptep_test_and_clear_young(vma, address, ptep);
>> }
> 
> I forgot this detail, thanks for correcting me.
> 
>>> On the mprotect side then, as the page was first accessed, clearing the
>>> accessed bit incurs a TLB flush on the reclaim side before the second write.
>>> That means any TLB entry that exists cannot have the accessed bit set so
>>> a second write needs to update it.
>>> 
>>> While it's not clearly documented, I checked with hardware engineers
>>> at the time that an update of the accessed or dirty bit even with a TLB
>>> entry will check the underlying page tables and trap if it's not present
>>> and the subsequent fault will then fail on sigsegv if the VMA protections
>>> no longer allow the write.
>>> 
>>> So, on one side if ignoring the accessed bit during reclaim, the pages
>>> are clean so any access will set the dirty bit and trap if unmapped in
>>> parallel. On the other side, the accessed bit if set cleared the TLB and
>>> if not set, then the hardware needs to update and again will trap if
>>> unmapped in parallel.
>> 
>> 
>> Yet, even regardless to the TLB flush it seems there is still a possible
>> race:
>> 
>> CPU0				CPU1
>> ----				----
>> ptep_clear_flush_young_notify
>> ==> PTE.A==0
>> 				access PTE
>> 				==> PTE.A=1
>> ptep_get_and_clear
>> 				change mapping (and PTE)
>> 				Use stale TLB entry
> 
> So I think you're right and this is a potential race. The first access can
> be a read or a write as it's a problem if the mprotect call restricts
> access.
> 
>>> If this guarantee from hardware was every shown to be wrong or another
>>> architecture wanted to add batching without the same guarantee then mprotect
>>> would need to do a local_flush_tlb if no pages were updated by the mprotect
>>> but right now, this should not be necessary.
>>> 
>>>> Can you please explain why you consider the application to be buggy?
>>> 
>>> I considered it a bit dumb to mprotect for READ/NONE and then try writing
>>> the same mapping. However, it will behave as expected.
>> 
>> I don't think that this is the only scenario. For example, the application
>> may create a new memory mapping of a different file using mmap at the same
>> memory address that was used before, just as that memory is reclaimed.
> 
> That requires the existing mapping to be unmapped which will flush the
> TLB and parallel mmap/munmap serialises on mmap_sem. The race appears to
> be specific to mprotect which avoids the TLB flush if no pages were updated.

Why? As far as I see the chain of calls during munmap is somewhat like:

do_munmap
=>unmap_region
==>tlb_gather_mmu
===>unmap_vmas
====>unmap_page_range
...
=====>zap_pte_range 	- this one batches only present PTEs
===>free_pgtables	- this one is only if page-tables are removed
===>pte_free_tlb
==>tlb_finish_mmu
===>tlb_flush_mmu
====>tlb_flush_mmu_tlbonly

zap_pte_range will check if pte_none and can find it is - if a concurrent
try_to_unmap_one already cleared the PTE. In this case it will not update
the range of the mmu_gather and would not indicate that a flush of the PTE
is needed. Then, tlb_flush_mmu_tlbonly will find that no PTE was cleared
(tlb->end == 0) and avoid flush, or may just flush fewer PTEs than actually
needed.

Due to this behavior, it raises a concern that in other cases as well, when
mmu_gather is used, a PTE flush may be missed.

>> The
>> application can (inadvertently) cause such a scenario by using MAP_FIXED.
>> But even without MAP_FIXED, running mmap->munmap->mmap can reuse the same
>> virtual address.
> 
> With flushes in between.
> 
>>> Such applications are safe due to how the accessed bit is handled by the
>>> software (flushes TLB if clearing young) and hardware (traps if updating
>>> the accessed or dirty bit and the underlying PTE was unmapped even if
>>> there is a TLB entry).
>> 
>> I don't think it is so. And I also think there are many additional
>> potentially problematic scenarios.
> 
> I believe it's specific to mprotect but can be handled by flushing the
> local TLB when mprotect updates no pages. Something like this;
> 
> ---8<---
> mm, mprotect: Flush the local TLB if mprotect potentially raced with a parallel reclaim
> 
> Nadav Amit identified a theoretical race between page reclaim and mprotect
> due to TLB flushes being batched outside of the PTL being held. He described
> the race as follows
> 
>        CPU0                            CPU1
>        ----                            ----
>                                        user accesses memory using RW PTE
>                                        [PTE now cached in TLB]
>        try_to_unmap_one()
>        ==> ptep_get_and_clear()
>        ==> set_tlb_ubc_flush_pending()
>                                        mprotect(addr, PROT_READ)
>                                        ==> change_pte_range()
>                                        ==> [ PTE non-present - no flush ]
> 
>                                        user writes using cached RW PTE
>        ...
> 
>        try_to_unmap_flush()
> 
> The same type of race exists for reads when protecting for PROT_NONE.
> This is not a data integrity issue as the TLB is always flushed before any
> IO is queued or a page is freed but it is a correctness issue as a process
> restricting access with mprotect() may still be able to access the data
> after the syscall returns due to a stale TLB entry. Handle this issue by
> flushing the local TLB if reclaim is potentially batching TLB flushes and
> mprotect altered no pages.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Cc: stable@vger.kernel.org # v4.4+
> ---
> mm/internal.h |  5 ++++-
> mm/mprotect.c | 12 ++++++++++--
> mm/rmap.c     | 20 ++++++++++++++++++++
> 3 files changed, 34 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 0e4f558412fb..9b7d1a597816 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
> #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> void try_to_unmap_flush(void);
> void try_to_unmap_flush_dirty(void);
> +void batched_unmap_protection_update(void);
> #else
> static inline void try_to_unmap_flush(void)
> {
> @@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
> static inline void try_to_unmap_flush_dirty(void)
> {
> }
> -
> +static inline void batched_unmap_protection_update(void)
> +{
> +}
> #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
> 
> extern const struct trace_print_flags pageflag_names[];
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8edd0d576254..3de353d4b5fb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -254,9 +254,17 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
> 				 dirty_accountable, prot_numa);
> 	} while (pgd++, addr = next, addr != end);
> 
> -	/* Only flush the TLB if we actually modified any entries: */
> -	if (pages)
> +	/*
> +	 * Only flush all TLBs if we actually modified any entries. If no
> +	 * pages are modified, then call batched_unmap_protection_update
> +	 * if the context is a mprotect() syscall.
> +	 */
> +	if (pages) {
> 		flush_tlb_range(vma, start, end);
> +	} else {
> +		if (!prot_numa)
> +			batched_unmap_protection_update();
> +	}
> 	clear_tlb_flush_pending(mm);
> 
> 	return pages;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d405f0e0ee96..02cb035e4ce6 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -643,6 +643,26 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
> 
> 	return should_defer;
> }
> +
> +/*
> + * This is called after an mprotect update that altered no pages. Batched
> + * unmap releases the PTL before a flush occurs leaving a window where
> + * an mprotect that reduces access rights can still access the page after
> + * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
> + * the local TLB if mprotect updates no pages so that the caller of
> + * mprotect always gets expected behaviour. It's overkill and unnecessary to
> + * flush all TLBs as a separate thread accessing the data that raced with
> + * both reclaim and mprotect as there is no risk of data corruption and
> + * the exact timing of a parallel thread seeing a protection update without
> + * any serialisation on the application side is always uncertain.
> + */
> +void batched_unmap_protection_update(void)
> +{
> +	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> +	local_flush_tlb();
> +	trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
> +}
> +
> #else
> static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> {

I don’t think this solution is enough. I am sorry for not providing a
solution, but I don’t see an easy one.

Thanks,
Nadav


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-11 15:53             ` Mel Gorman
@ 2017-07-11 17:23               ` Andy Lutomirski
  2017-07-11 19:18                 ` Mel Gorman
  0 siblings, 1 reply; 70+ messages in thread
From: Andy Lutomirski @ 2017-07-11 17:23 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, Nadav Amit, open list:MEMORY MANAGEMENT

On Tue, Jul 11, 2017 at 8:53 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Tue, Jul 11, 2017 at 07:58:04AM -0700, Andrew Lutomirski wrote:
>> On Tue, Jul 11, 2017 at 6:20 AM, Mel Gorman <mgorman@suse.de> wrote:
>> > +
>> > +/*
>> > + * This is called after an mprotect update that altered no pages. Batched
>> > + * unmap releases the PTL before a flush occurs leaving a window where
>> > + * an mprotect that reduces access rights can still access the page after
>> > + * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
>> > + * the local TLB if mprotect updates no pages so that the caller of
>> > + * mprotect always gets expected behaviour. It's overkill and unnecessary to
>> > + * flush all TLBs as a separate thread accessing the data that raced with
>> > + * both reclaim and mprotect as there is no risk of data corruption and
>> > + * the exact timing of a parallel thread seeing a protection update without
>> > + * any serialisation on the application side is always uncertain.
>> > + */
>> > +void batched_unmap_protection_update(void)
>> > +{
>> > +       count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
>> > +       local_flush_tlb();
>> > +       trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>> > +}
>> > +
>>
>> What about remote CPUs?  You could get migrated right after mprotect()
>> or the inconsistency could be observed on another CPU.
>
> If it's migrated then it has also context switched so the TLB entry will
> be read for the first time.

I don't think this is true.  On current kernels, if the other CPU is
running a thread in the same process, then there won't be a flush if
we migrate there.  In -tip, slated for 4.13, if the other CPU is lazy
and is using the current process's page tables, it won't flush if we
migrate there and it's not stale (as determined by the real flush
APIs, not local_tlb_flush()).  With PCID, the kernel will aggressively
try to avoid the flush no matter what.

> If the entry is inconsistent for another CPU
> accessing the data then it'll potentially successfully access a page that
> was just mprotected but this is similar to simply racing with the call
> to mprotect itself. The timing isn't exact, nor does it need to be.

Thread A:
mprotect(..., PROT_READ);
pthread_mutex_unlock();

Thread B:
pthread_mutex_lock();
write to the mprotected address;

I think it's unlikely that this exact scenario will affect a
conventional C program, but I can see various GC systems and sandboxes
being very surprised.

> One
> thread accessing data racing with another thread doing mprotect without
> any synchronisation in the application is always going to be unreliable.

As above, there can be synchronization that's entirely invisible to the kernel.

>> I also really
>> don't like bypassing arch code like this.  The implementation of
>> flush_tlb_mm_range() in tip:x86/mm (and slated for this merge window!)
>> is *very* different from what's there now, and it is not written in
>> the expectation that some generic code might call local_tlb_flush()
>> and expect any kind of coherency at all.
>>
>
> Assuming that gets merged first then the most straight-forward approach
> would be to setup a arch_tlbflush_unmap_batch with just the local CPU set
> in the mask or something similar.

With what semantics?

>> Would a better fix perhaps be to find a way to figure out whether a
>> batched flush is pending on the mm in question and flush it out if you
>> do any optimizations based on assuming that the TLB is in any respect
>> consistent with the page tables?  With the changes in -tip, x86 could,
>> in principle, supply a function to sync up its TLB state.  That would
> >> require cross-CPU poking at state or an unconditional IPI (that might
>> end up not flushing anything), but either is doable.
>
> It's potentially doable if a field like tlb_flush_pending was added
> to mm_struct that is set when batching starts. I don't think there is
> a logical place where it can be cleared as when the TLB gets flushed by
> reclaim, it can't rmap again to clear the flag. What would happen is that
> the first mprotect after any batching happened at any point in the past
> would have to unconditionally flush the TLB and then clear the flag. That
> would be a relatively minor hit and cover all the possibilities and should
> work unmodified with or without your series applied.
>
> Would that be preferable to you?

I'm not sure I understand it well enough to know whether I like it.
I'm imagining an API that says "I'm about to rely on TLBs being
coherent for this mm -- make it so".  On x86, this would be roughly
equivalent to a flush on the mm minus the mandatory flush part, at
least with my patches applied.  It would be considerably messier
without my patches.

But I'd like to make sure that the full extent of the problem is
understood before getting too excited about solving it.

--Andy


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-11 17:23               ` Andy Lutomirski
@ 2017-07-11 19:18                 ` Mel Gorman
  2017-07-11 20:06                   ` Nadav Amit
                                     ` (2 more replies)
  0 siblings, 3 replies; 70+ messages in thread
From: Mel Gorman @ 2017-07-11 19:18 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Nadav Amit, open list:MEMORY MANAGEMENT

On Tue, Jul 11, 2017 at 10:23:50AM -0700, Andrew Lutomirski wrote:
> On Tue, Jul 11, 2017 at 8:53 AM, Mel Gorman <mgorman@suse.de> wrote:
> > On Tue, Jul 11, 2017 at 07:58:04AM -0700, Andrew Lutomirski wrote:
> >> On Tue, Jul 11, 2017 at 6:20 AM, Mel Gorman <mgorman@suse.de> wrote:
> >> > +
> >> > +/*
> >> > + * This is called after an mprotect update that altered no pages. Batched
> >> > + * unmap releases the PTL before a flush occurs leaving a window where
> >> > + * an mprotect that reduces access rights can still access the page after
> >> > + * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
> >> > + * the local TLB if mprotect updates no pages so that the caller of
> >> > + * mprotect always gets expected behaviour. It's overkill and unnecessary to
> >> > + * flush all TLBs as a separate thread accessing the data that raced with
> >> > + * both reclaim and mprotect as there is no risk of data corruption and
> >> > + * the exact timing of a parallel thread seeing a protection update without
> >> > + * any serialisation on the application side is always uncertain.
> >> > + */
> >> > +void batched_unmap_protection_update(void)
> >> > +{
> >> > +       count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> >> > +       local_flush_tlb();
> >> > +       trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
> >> > +}
> >> > +
> >>
> >> What about remote CPUs?  You could get migrated right after mprotect()
> >> or the inconsistency could be observed on another CPU.
> >
> > If it's migrated then it has also context switched so the TLB entry will
> > be read for the first time.
> 
> I don't think this is true.  On current kernels, if the other CPU is
> running a thread in the same process, then there won't be a flush if
> we migrate there. 

True, although that would also be covered by flushing unconditionally on
mprotect (and arguably munmap) if a batched TLB flush took place in the
past. It's heavier than it needs to be but it would be trivial to track
and only incur a cost if reclaim touched any pages belonging to the process
in the past so a relatively rare operation in the normal case. It could be
forced by continually keeping a system under memory pressure while looping
around mprotect but the worst-case would be similar costs to never batching
the flushing at all.

> In -tip, slated for 4.13, if the other CPU is lazy
> and is using the current process's page tables, it won't flush if we
> migrate there and it's not stale (as determined by the real flush
> APIs, not local_tlb_flush()).  With PCID, the kernel will aggressively
> try to avoid the flush no matter what.
> 

I agree that PCID means that flushing needs to be more aggressive and there
is not much point working on two solutions and assume PCID is merged.

> > If the entry is inconsistent for another CPU
> > accessing the data then it'll potentially successfully access a page that
> > was just mprotected but this is similar to simply racing with the call
> > to mprotect itself. The timing isn't exact, nor does it need to be.
> 
> Thread A:
> mprotect(..., PROT_READ);
> pthread_mutex_unlock();
> 
> Thread B:
> pthread_mutex_lock();
> write to the mprotected address;
> 
> I think it's unlikely that this exact scenario will affect a
> conventional C program, but I can see various GC systems and sandboxes
> being very surprised.
> 

Maybe. The window is massively wide as the mprotect, unlock, remote wakeup
and write all need to complete between the unmap releasing the PTL and
the flush taking place. Still, it is theoretically possible.

> 
> >> I also really
> >> don't like bypassing arch code like this.  The implementation of
> >> flush_tlb_mm_range() in tip:x86/mm (and slated for this merge window!)
> >> is *very* different from what's there now, and it is not written in
> >> the expectation that some generic code might call local_tlb_flush()
> >> and expect any kind of coherency at all.
> >>
> >
> > Assuming that gets merged first then the most straight-forward approach
> > would be to setup a arch_tlbflush_unmap_batch with just the local CPU set
> > in the mask or something similar.
> 
> With what semantics?
> 

I'm dropping this idea because the more I think about it, the more I think
that a more general flush is needed if TLB batching was used in the past.
We could keep active track of mm's with flushes pending but it would be
fairly complex, cost in terms of keeping track of mm's needing flushing
and ultimately might be more expensive than just flushing immediately.

If it's actually unfixable then, even though it's theoretical given the
massive amount of activity that has to happen in a very short window, there
would be no choice but to remove the TLB batching entirely which would be
very unfortunate given that IPIs during reclaim will be very high once again.

> >> Would a better fix perhaps be to find a way to figure out whether a
> >> batched flush is pending on the mm in question and flush it out if you
> >> do any optimizations based on assuming that the TLB is in any respect
> >> consistent with the page tables?  With the changes in -tip, x86 could,
> >> in principle, supply a function to sync up its TLB state.  That would
> > require cross-CPU poking at state or an unconditional IPI (that might
> >> end up not flushing anything), but either is doable.
> >
> > It's potentially doable if a field like tlb_flush_pending was added
> > to mm_struct that is set when batching starts. I don't think there is
> > a logical place where it can be cleared as when the TLB gets flushed by
> > reclaim, it can't rmap again to clear the flag. What would happen is that
> > the first mprotect after any batching happened at any point in the past
> > would have to unconditionally flush the TLB and then clear the flag. That
> > would be a relatively minor hit and cover all the possibilities and should
> > work unmodified with or without your series applied.
> >
> > Would that be preferable to you?
> 
> I'm not sure I understand it well enough to know whether I like it.
> I'm imagining an API that says "I'm about to rely on TLBs being
> coherent for this mm -- make it so". 

I don't think we should be particularly clever about this and instead just
flush the full mm if there is a risk that a parallel batched flush is
in progress, resulting in a stale TLB entry being used. I think tracking mms
that are currently batching would end up being costly in terms of memory,
fairly complex, or both. Something like this?

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45cdb27791a3..ab8f7e11c160 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -495,6 +495,10 @@ struct mm_struct {
 	 */
 	bool tlb_flush_pending;
 #endif
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+	/* See flush_tlb_batched_pending() */
+	bool tlb_flush_batched;
+#endif
 	struct uprobes_state uprobes_state;
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558412fb..bf835a5a9854 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 void try_to_unmap_flush(void);
 void try_to_unmap_flush_dirty(void);
+void flush_tlb_batched_pending(struct mm_struct *mm);
 #else
 static inline void try_to_unmap_flush(void)
 {
@@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
 static inline void try_to_unmap_flush_dirty(void)
 {
 }
-
 +static inline void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+}
 #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
 
 extern const struct trace_print_flags pageflag_names[];
diff --git a/mm/memory.c b/mm/memory.c
index bb11c474857e..b0c3d1556a94 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1197,6 +1197,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	init_rss_vec(rss);
 	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	pte = start_pte;
+	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = *pte;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..27135b91a4b4 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -61,6 +61,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	if (!pte)
 		return 0;
 
+	/* Guard against parallel reclaim batching a TLB flush without PTL */
+	flush_tlb_batched_pending(vma->vm_mm);
+
 	/* Get target node for single threaded private VMAs */
 	if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
 	    atomic_read(&vma->vm_mm->mm_users) == 1)
diff --git a/mm/rmap.c b/mm/rmap.c
index d405f0e0ee96..52633a124a4e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 		return false;
 
 	/* If remote CPUs need to be flushed then defer batch the flush */
-	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) {
 		should_defer = true;
+		mm->tlb_flush_batched = true;
+	}
 	put_cpu();
 
 	return should_defer;
 }
+
+/*
 + * Reclaim unmaps pages in batches under the PTL but does not flush the
+ * TLB prior to releasing the PTL. It's possible a parallel mprotect or
+ * munmap can race between reclaim unmapping the page and flushing the
+ * page. If this race occurs, it potentially allows access to data via
+ * a stale TLB entry. Tracking all mm's that have TLB batching pending
+ * would be expensive during reclaim so instead track whether TLB batching
 + * occurred in the past and if so then do a full mm flush here. This will
+ * cost one additional flush per reclaim cycle paid by the first munmap or
+ * mprotect. This assumes it's called under the PTL to synchronise access
+ * to mm->tlb_flush_batched.
+ */
+void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+	if (mm->tlb_flush_batched) {
+		flush_tlb_mm(mm);
+		mm->tlb_flush_batched = false;
+	}
+}
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-11 19:18                 ` Mel Gorman
@ 2017-07-11 20:06                   ` Nadav Amit
  2017-07-11 21:09                     ` Mel Gorman
  2017-07-11 20:09                   ` Mel Gorman
  2017-07-11 22:07                   ` Andy Lutomirski
  2 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-11 20:06 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Jul 11, 2017 at 10:23:50AM -0700, Andrew Lutomirski wrote:
>> On Tue, Jul 11, 2017 at 8:53 AM, Mel Gorman <mgorman@suse.de> wrote:
>>> On Tue, Jul 11, 2017 at 07:58:04AM -0700, Andrew Lutomirski wrote:
>>>> On Tue, Jul 11, 2017 at 6:20 AM, Mel Gorman <mgorman@suse.de> wrote:
>>>>> +
>>>>> +/*
>>>>> + * This is called after an mprotect update that altered no pages. Batched
>>>>> + * unmap releases the PTL before a flush occurs leaving a window where
>>>>> + * an mprotect that reduces access rights can still access the page after
>>>>> + * mprotect returns via a stale TLB entry. Avoid this possibility by flushing
> >>>> + * the local TLB if mprotect updates no pages so that the caller of
>>>>> + * mprotect always gets expected behaviour. It's overkill and unnecessary to
>>>>> + * flush all TLBs as a separate thread accessing the data that raced with
>>>>> + * both reclaim and mprotect as there is no risk of data corruption and
>>>>> + * the exact timing of a parallel thread seeing a protection update without
>>>>> + * any serialisation on the application side is always uncertain.
>>>>> + */
>>>>> +void batched_unmap_protection_update(void)
>>>>> +{
>>>>> +       count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
>>>>> +       local_flush_tlb();
>>>>> +       trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
>>>>> +}
>>>>> +
>>>> 
>>>> What about remote CPUs?  You could get migrated right after mprotect()
>>>> or the inconsistency could be observed on another CPU.
>>> 
>>> If it's migrated then it has also context switched so the TLB entry will
>>> be read for the first time.
>> 
>> I don't think this is true.  On current kernels, if the other CPU is
>> running a thread in the same process, then there won't be a flush if
>> we migrate there.
> 
> True although that would also be covered by a flush happening unconditionally
> on mprotect (and arguably munmap) if a batched TLB flush took place in the
> past. It's heavier than it needs to be but it would be trivial to track
> and only incur a cost if reclaim touched any pages belonging to the process
> in the past so a relatively rare operation in the normal case. It could be
> forced by continually keeping a system under memory pressure while looping
> around mprotect but the worst-case would be similar costs to never batching
> the flushing at all.
> 
>> In -tip, slated for 4.13, if the other CPU is lazy
>> and is using the current process's page tables, it won't flush if we
>> migrate there and it's not stale (as determined by the real flush
>> APIs, not local_tlb_flush()).  With PCID, the kernel will aggressively
>> try to avoid the flush no matter what.
> 
> I agree that PCID means that flushing needs to be more aggressive and there
> is not much point working on two solutions, so assume PCID is merged.
> 
>>> If the entry is inconsistent for another CPU
>>> accessing the data then it'll potentially successfully access a page that
>>> was just mprotected but this is similar to simply racing with the call
>>> to mprotect itself. The timing isn't exact, nor does it need to be.
>> 
>> Thread A:
>> mprotect(..., PROT_READ);
>> pthread_mutex_unlock();
>> 
>> Thread B:
>> pthread_mutex_lock();
>> write to the mprotected address;
>> 
>> I think it's unlikely that this exact scenario will affect a
>> conventional C program, but I can see various GC systems and sandboxes
>> being very surprised.
> 
> Maybe. The window is massively wide as the mprotect, unlock, remote wakeup
> and write all need to complete between the unmap releasing the PTL and
> the flush taking place. Still, it is theoretically possible.

Consider also virtual machines. A VCPU may be preempted by the hypervisor
right after a PTE change and before the flush - so the time between the two
can be rather large.

>>>> I also really
>>>> don't like bypassing arch code like this.  The implementation of
>>>> flush_tlb_mm_range() in tip:x86/mm (and slated for this merge window!)
>>>> is *very* different from what's there now, and it is not written in
>>>> the expectation that some generic code might call local_tlb_flush()
>>>> and expect any kind of coherency at all.
>>> 
>>> Assuming that gets merged first then the most straight-forward approach
>>> would be to setup a arch_tlbflush_unmap_batch with just the local CPU set
>>> in the mask or something similar.
>> 
>> With what semantics?
> 
> I'm dropping this idea because the more I think about it, the more I think
> that a more general flush is needed if TLB batching was used in the past.
> We could actively track mm's with pending flushes but it would be
> fairly complex, would cost memory to keep track of the mm's needing flushing,
> and ultimately might be more expensive than just flushing immediately.
> 
> If it's actually unfixable then, even though it's theoretical given the
> massive amount of activity that has to happen in a very short window, there
> would be no choice but to remove the TLB batching entirely which would be
> very unfortunate given that IPIs during reclaim will be very high once again.
> 
>>>> Would a better fix perhaps be to find a way to figure out whether a
>>>> batched flush is pending on the mm in question and flush it out if you
>>>> do any optimizations based on assuming that the TLB is in any respect
>>>> consistent with the page tables?  With the changes in -tip, x86 could,
>>>> in principle, supply a function to sync up its TLB state.  That would
> >> require cross-CPU poking at state or an unconditional IPI (that might
>>>> end up not flushing anything), but either is doable.
>>> 
>>> It's potentially doable if a field like tlb_flush_pending was added
>>> to mm_struct that is set when batching starts. I don't think there is
>>> a logical place where it can be cleared as when the TLB gets flushed by
>>> reclaim, it can't rmap again to clear the flag. What would happen is that
>>> the first mprotect after any batching happened at any point in the past
>>> would have to unconditionally flush the TLB and then clear the flag. That
>>> would be a relatively minor hit and cover all the possibilities and should
>>> work unmodified with or without your series applied.
>>> 
>>> Would that be preferable to you?
>> 
>> I'm not sure I understand it well enough to know whether I like it.
>> I'm imagining an API that says "I'm about to rely on TLBs being
>> coherent for this mm -- make it so".
> 
> I don't think we should be particularly clever about this and instead just
> flush the full mm if there is a risk that a parallel batched flush is
> in progress, resulting in a stale TLB entry being used. I think tracking mms
> that are currently batching would end up being costly in terms of memory,
> fairly complex, or both. Something like this?
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 45cdb27791a3..ab8f7e11c160 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -495,6 +495,10 @@ struct mm_struct {
> 	 */
> 	bool tlb_flush_pending;
> #endif
> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> +	/* See flush_tlb_batched_pending() */
> +	bool tlb_flush_batched;
> +#endif
> 	struct uprobes_state uprobes_state;
> #ifdef CONFIG_HUGETLB_PAGE
> 	atomic_long_t hugetlb_usage;
> diff --git a/mm/internal.h b/mm/internal.h
> index 0e4f558412fb..bf835a5a9854 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
> #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> void try_to_unmap_flush(void);
> void try_to_unmap_flush_dirty(void);
> +void flush_tlb_batched_pending(struct mm_struct *mm);
> #else
> static inline void try_to_unmap_flush(void)
> {
> @@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
> static inline void try_to_unmap_flush_dirty(void)
> {
> }
> -
> +static inline void flush_tlb_batched_pending(struct mm_struct *mm)
> +{
> +}
> #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
> 
> extern const struct trace_print_flags pageflag_names[];
> diff --git a/mm/memory.c b/mm/memory.c
> index bb11c474857e..b0c3d1556a94 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1197,6 +1197,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> 	init_rss_vec(rss);
> 	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> 	pte = start_pte;
> +	flush_tlb_batched_pending(mm);
> 	arch_enter_lazy_mmu_mode();
> 	do {
> 		pte_t ptent = *pte;
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8edd0d576254..27135b91a4b4 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -61,6 +61,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> 	if (!pte)
> 		return 0;
> 
> +	/* Guard against parallel reclaim batching a TLB flush without PTL */
> +	flush_tlb_batched_pending(vma->vm_mm);
> +
> 	/* Get target node for single threaded private VMAs */
> 	if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
> 	    atomic_read(&vma->vm_mm->mm_users) == 1)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d405f0e0ee96..52633a124a4e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
> 		return false;
> 
> 	/* If remote CPUs need to be flushed then defer batch the flush */
> -	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) {
> 		should_defer = true;
> +		mm->tlb_flush_batched = true;
> +	}
> 	put_cpu();
> 
> 	return should_defer;
> }
> +
> +/*
> > + * Reclaim unmaps pages in batches under the PTL but does not flush the
> + * TLB prior to releasing the PTL. It's possible a parallel mprotect or
> + * munmap can race between reclaim unmapping the page and flushing the
> + * page. If this race occurs, it potentially allows access to data via
> + * a stale TLB entry. Tracking all mm's that have TLB batching pending
> + * would be expensive during reclaim so instead track whether TLB batching
> > + * occurred in the past and if so then do a full mm flush here. This will
> + * cost one additional flush per reclaim cycle paid by the first munmap or
> + * mprotect. This assumes it's called under the PTL to synchronise access
> + * to mm->tlb_flush_batched.
> + */
> +void flush_tlb_batched_pending(struct mm_struct *mm)
> +{
> +	if (mm->tlb_flush_batched) {
> +		flush_tlb_mm(mm);
> +		mm->tlb_flush_batched = false;
> +	}
> +}
> #else
> static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> {

I don't know exactly what invariant is being kept, so it is hard for
me to answer all sorts of questions:

Should pte_accessible() return true if mm->tlb_flush_batched == true?

Does madvise_free_pte_range need to be modified as well?

How will future code not break anything?



* Re: Potential race in TLB flush batching?
  2017-07-11 19:18                 ` Mel Gorman
  2017-07-11 20:06                   ` Nadav Amit
@ 2017-07-11 20:09                   ` Mel Gorman
  2017-07-11 21:52                     ` Mel Gorman
  2017-07-11 22:07                   ` Andy Lutomirski
  2 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-11 20:09 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Nadav Amit, open list:MEMORY MANAGEMENT

On Tue, Jul 11, 2017 at 08:18:23PM +0100, Mel Gorman wrote:
> I don't think we should be particularly clever about this and instead just
> flush the full mm if there is a risk that a parallel batched flush is
> in progress, resulting in a stale TLB entry being used. I think tracking mms
> that are currently batching would end up being costly in terms of memory,
> fairly complex, or both. Something like this?
> 

mremap and madvise(DONTNEED) would also need to flush. Memory policies are
fine as a move_pages call that hits the race will simply fail to migrate
a page that is being freed and once migration starts, it'll be flushed so
a stale access has no further risk. copy_page_range should also be ok as
the old mm is flushed and the new mm cannot have entries yet.

-- 
Mel Gorman
SUSE Labs



* Re: Potential race in TLB flush batching?
  2017-07-11 20:06                   ` Nadav Amit
@ 2017-07-11 21:09                     ` Mel Gorman
  0 siblings, 0 replies; 70+ messages in thread
From: Mel Gorman @ 2017-07-11 21:09 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

On Tue, Jul 11, 2017 at 01:06:48PM -0700, Nadav Amit wrote:
> > +/*
> > + * Reclaim unmaps pages in batches under the PTL but does not flush the
> > + * TLB prior to releasing the PTL. It's possible a parallel mprotect or
> > + * munmap can race between reclaim unmapping the page and flushing the
> > + * page. If this race occurs, it potentially allows access to data via
> > + * a stale TLB entry. Tracking all mm's that have TLB batching pending
> > + * would be expensive during reclaim so instead track whether TLB batching
> > + * occurred in the past and if so then do a full mm flush here. This will
> > + * cost one additional flush per reclaim cycle paid by the first munmap or
> > + * mprotect. This assumes it's called under the PTL to synchronise access
> > + * to mm->tlb_flush_batched.
> > + */
> > +void flush_tlb_batched_pending(struct mm_struct *mm)
> > +{
> > +	if (mm->tlb_flush_batched) {
> > +		flush_tlb_mm(mm);
> > +		mm->tlb_flush_batched = false;
> > +	}
> > +}
> > #else
> > static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
> > {
> 
> I don't know exactly what invariant is being kept, so it is hard for
> me to answer all sorts of questions:
> 
> Should pte_accessible() return true if mm->tlb_flush_batched == true?
> 

It shouldn't be necessary. The contexts where we hit the path are

uprobes: elevated page count so no parallel reclaim
dax: PTEs are not mapping that would be reclaimed
hugetlbfs: Not reclaimed
ksm: holds page lock and elevates count so cannot race with reclaim
cow: at the time of the flush, the page count is elevated so cannot race with reclaim
page_mkclean: only concerned with marking existing ptes clean but in any
	case, the batching flushes the TLB before issuing any IO so there
	isn't space for a stale TLB entry to be used for something bad.

> Does madvise_free_pte_range need to be modified as well?
> 

Yes, I noticed that shortly after sending the first version and
commented on it.

> How will future code not break anything?
> 

I can't really answer that without a crystal ball. Code dealing with page
table updates would need to take some care if it can race with parallel
reclaim.

-- 
Mel Gorman
SUSE Labs



* Re: Potential race in TLB flush batching?
  2017-07-11 20:09                   ` Mel Gorman
@ 2017-07-11 21:52                     ` Mel Gorman
  2017-07-11 22:27                       ` Nadav Amit
  0 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-11 21:52 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Nadav Amit, open list:MEMORY MANAGEMENT

On Tue, Jul 11, 2017 at 09:09:23PM +0100, Mel Gorman wrote:
> On Tue, Jul 11, 2017 at 08:18:23PM +0100, Mel Gorman wrote:
> > I don't think we should be particularly clever about this and instead just
> > flush the full mm if there is a risk that a parallel batched flush is
> > in progress, resulting in a stale TLB entry being used. I think tracking mms
> > that are currently batching would end up being costly in terms of memory,
> > fairly complex, or both. Something like this?
> > 
> 
> mremap and madvise(DONTNEED) would also need to flush. Memory policies are
> fine as a move_pages call that hits the race will simply fail to migrate
> a page that is being freed and once migration starts, it'll be flushed so
> a stale access has no further risk. copy_page_range should also be ok as
> the old mm is flushed and the new mm cannot have entries yet.
> 

Adding those results in

---8<---
mm, mprotect: Flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries

Nadav Amit identified a theoretical race between page reclaim and mprotect
due to TLB flushes being batched outside of the PTL being held. He described
the race as follows

        CPU0                            CPU1
        ----                            ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
        try_to_unmap_one()
        ==> ptep_get_and_clear()
        ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
        ...

        try_to_unmap_flush()

The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind such as
munmap, mremap and madvise.

For some operations like mprotect, it's not a data integrity issue but it
is a correctness issue. For munmap, it's potentially a data integrity issue
although hitting the race requires an munmap, mmap and return to userspace
to all complete in the window between reclaim dropping the PTL and flushing
the TLB. However, it's theoretically possible so handle this issue by
flushing the mm if reclaim is potentially currently batching TLB flushes.

Other instances where a flush is required for a present pte should be ok
as either page reference counts are elevated preventing parallel reclaim
or in the case of page_mkclean there isn't an obvious path that userspace
could take advantage of without using the operations that are guarded by
this patch. Other users such as gup are ok as a race with reclaim just
looks at PTEs. Huge page variants should be ok as they don't race with reclaim.
mincore only looks at PTEs. userfault also should be ok as if a parallel
reclaim takes place, it will either fault the page back in or read some
of the data before the flush occurs, triggering a fault.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: stable@vger.kernel.org # v4.4+
---
 include/linux/mm_types.h |  4 ++++
 mm/internal.h            |  5 ++++-
 mm/madvise.c             |  1 +
 mm/memory.c              |  1 +
 mm/mprotect.c            |  3 +++
 mm/mremap.c              |  1 +
 mm/rmap.c                | 24 +++++++++++++++++++++++-
 7 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45cdb27791a3..ab8f7e11c160 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -495,6 +495,10 @@ struct mm_struct {
 	 */
 	bool tlb_flush_pending;
 #endif
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+	/* See flush_tlb_batched_pending() */
+	bool tlb_flush_batched;
+#endif
 	struct uprobes_state uprobes_state;
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558412fb..bf835a5a9854 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 void try_to_unmap_flush(void);
 void try_to_unmap_flush_dirty(void);
+void flush_tlb_batched_pending(struct mm_struct *mm);
 #else
 static inline void try_to_unmap_flush(void)
 {
@@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
 static inline void try_to_unmap_flush_dirty(void)
 {
 }
-
 +static inline void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+}
 #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
 
 extern const struct trace_print_flags pageflag_names[];
diff --git a/mm/madvise.c b/mm/madvise.c
index 25b78ee4fc2c..75d2cffbe61d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -320,6 +320,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 
 	tlb_remove_check_page_size_change(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
diff --git a/mm/memory.c b/mm/memory.c
index bb11c474857e..b0c3d1556a94 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1197,6 +1197,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	init_rss_vec(rss);
 	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	pte = start_pte;
+	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = *pte;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..27135b91a4b4 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -61,6 +61,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	if (!pte)
 		return 0;
 
+	/* Guard against parallel reclaim batching a TLB flush without PTL */
+	flush_tlb_batched_pending(vma->vm_mm);
+
 	/* Get target node for single threaded private VMAs */
 	if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
 	    atomic_read(&vma->vm_mm->mm_users) == 1)
diff --git a/mm/mremap.c b/mm/mremap.c
index cd8a1b199ef9..6e3d857458de 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -152,6 +152,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	new_ptl = pte_lockptr(mm, new_pmd);
 	if (new_ptl != old_ptl)
 		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 
 	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
diff --git a/mm/rmap.c b/mm/rmap.c
index d405f0e0ee96..5a3e4ff9c4a0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 		return false;
 
 	/* If remote CPUs need to be flushed then defer batch the flush */
-	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) {
 		should_defer = true;
+		mm->tlb_flush_batched = true;
+	}
 	put_cpu();
 
 	return should_defer;
 }
+
+/*
+ * Reclaim unmaps pages under the PTL but does not flush the TLB prior to
 + * releasing the PTL if TLB flushes are batched. It's possible for a parallel
 + * operation such as mprotect or munmap to race between reclaim unmapping
 + * the page and flushing the page. If this race occurs, it potentially allows
+ * access to data via a stale TLB entry. Tracking all mm's that have TLB
+ * batching pending would be expensive during reclaim so instead track
 + * whether TLB batching occurred in the past and if so then do a full mm
+ * flush here. This will cost one additional flush per reclaim cycle paid
 + * by the first operation at risk such as mprotect and munmap. This assumes
+ * it's called under the PTL to synchronise access to mm->tlb_flush_batched.
+ */
+void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+	if (mm->tlb_flush_batched) {
+		flush_tlb_mm(mm);
+		mm->tlb_flush_batched = false;
+	}
+}
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {



* Re: Potential race in TLB flush batching?
  2017-07-11 19:18                 ` Mel Gorman
  2017-07-11 20:06                   ` Nadav Amit
  2017-07-11 20:09                   ` Mel Gorman
@ 2017-07-11 22:07                   ` Andy Lutomirski
  2017-07-11 22:33                     ` Mel Gorman
  2017-07-14  7:00                     ` Benjamin Herrenschmidt
  2 siblings, 2 replies; 70+ messages in thread
From: Andy Lutomirski @ 2017-07-11 22:07 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, Nadav Amit, open list:MEMORY MANAGEMENT

On Tue, Jul 11, 2017 at 12:18 PM, Mel Gorman <mgorman@suse.de> wrote:

I would change this slightly:

> +void flush_tlb_batched_pending(struct mm_struct *mm)
> +{
> +       if (mm->tlb_flush_batched) {
> +               flush_tlb_mm(mm);

How about making this a new helper, arch_tlbbatch_flush_one_mm(mm)?
The idea is that this could be implemented as flush_tlb_mm(mm), but
the actual semantics needed are weaker.  All that's really needed
AFAICS is to make sure that any arch_tlbbatch_add_mm() calls on this
mm that have already happened become effective by the time that
arch_tlbbatch_flush_one_mm() returns.

The initial implementation would be this:

struct flush_tlb_info info = {
  .mm = mm,
  .new_tlb_gen = atomic64_read(&mm->context.tlb_gen),
  .start = 0,
  .end = TLB_FLUSH_ALL,
};

and the rest is like flush_tlb_mm_range().  flush_tlb_func_common()
will already do the right thing, but the comments should probably be
updated, too.  The benefit would be that, if you just call this on an
mm when everything is already flushed, it will still do the IPIs but
it won't do the actual flush.

A better future implementation could iterate over each cpu in
mm_cpumask(), and, using either a new lock or very careful atomics,
check whether that CPU really needs flushing.  In -tip, all the
information needed to figure this out is already there in the percpu
state -- it's just not currently set up for remote access.

For backports, it would just be flush_tlb_mm().

--Andy


* Re: Potential race in TLB flush batching?
  2017-07-11 21:52                     ` Mel Gorman
@ 2017-07-11 22:27                       ` Nadav Amit
  2017-07-11 22:34                         ` Nadav Amit
  2017-07-12  8:27                         ` Mel Gorman
  0 siblings, 2 replies; 70+ messages in thread
From: Nadav Amit @ 2017-07-11 22:27 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Jul 11, 2017 at 09:09:23PM +0100, Mel Gorman wrote:
>> On Tue, Jul 11, 2017 at 08:18:23PM +0100, Mel Gorman wrote:
>>> I don't think we should be particularly clever about this and instead just
>>> flush the full mm if there is a risk that a parallel batching of flushes is
>>> in progress, resulting in a stale TLB entry being used. I think tracking mms
>>> that are currently batching would end up being costly in terms of memory,
>>> fairly complex, or both. Something like this?
>> 
>> mremap and madvise(DONTNEED) would also need to flush. Memory policies are
>> fine as a move_pages call that hits the race will simply fail to migrate
>> a page that is being freed and once migration starts, it'll be flushed so
>> a stale access has no further risk. copy_page_range should also be ok as
>> the old mm is flushed and the new mm cannot have entries yet.
> 
> Adding those results in

You are way too fast for me.

> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
> 		return false;
> 
> 	/* If remote CPUs need to be flushed then defer batch the flush */
> -	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) {
> 		should_defer = true;
> +		mm->tlb_flush_batched = true;
> +	}

Since mm->tlb_flush_batched is set before the PTE is actually cleared, it
still seems to leave a short window for a race.

CPU0				CPU1
---- 				----
should_defer_flush
=> mm->tlb_flush_batched=true		
				flush_tlb_batched_pending (another PT)
				=> flush TLB
				=> mm->tlb_flush_batched=false
ptep_get_and_clear
...

				flush_tlb_batched_pending (batched PT)
				use the stale PTE
...
try_to_unmap_flush


IOW it seems that mm->tlb_flush_batched should be set after the PTE is
cleared (and have some compiler barrier to be on the safe side).

Just to clarify - I don’t mean to be annoying, but I considered building and
submitting a patch based on some artifacts of a study I conducted, and this
issue drove me crazy.

One more question, please: how does elevated page count or even locking the
page help (as you mention in regard to uprobes and ksm)? Yes, the page will
not be reclaimed, but IIUC try_to_unmap is called before the reference count
is frozen, and the page lock is dropped on each iteration of the loop in
shrink_page_list. In this case, it seems to me that uprobes or ksm may still
not flush the TLB.

Thanks,
Nadav

* Re: Potential race in TLB flush batching?
  2017-07-11 22:07                   ` Andy Lutomirski
@ 2017-07-11 22:33                     ` Mel Gorman
  2017-07-14  7:00                     ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 70+ messages in thread
From: Mel Gorman @ 2017-07-11 22:33 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Nadav Amit, open list:MEMORY MANAGEMENT

On Tue, Jul 11, 2017 at 03:07:57PM -0700, Andrew Lutomirski wrote:
> On Tue, Jul 11, 2017 at 12:18 PM, Mel Gorman <mgorman@suse.de> wrote:
> 
> I would change this slightly:
> 
> > +void flush_tlb_batched_pending(struct mm_struct *mm)
> > +{
> > +       if (mm->tlb_flush_batched) {
> > +               flush_tlb_mm(mm);
> 
> How about making this a new helper arch_tlbbatch_flush_one_mm(mm);
> The idea is that this could be implemented as flush_tlb_mm(mm), but
> the actual semantics needed are weaker.  All that's really needed
> AFAICS is to make sure that any arch_tlbbatch_add_mm() calls on this
> mm that have already happened become effective by the time that
> arch_tlbbatch_flush_one_mm() returns.
> 
> The initial implementation would be this:
> 
> struct flush_tlb_info info = {
>   .mm = mm,
>   .new_tlb_gen = atomic64_read(&mm->context.tlb_gen),
>   .start = 0,
>   .end = TLB_FLUSH_ALL,
> };
> 
> and the rest is like flush_tlb_mm_range().  flush_tlb_func_common()
> will already do the right thing, but the comments should probably be
> updated, too. 

Yes, from what I remember from your patches and a quick recheck, that should
be fine. I'll be leaving it until the morning to actually do the work. It
requires that your stuff be upstream first but last time I checked, they
were expected in this merge window.

> The benefit would be that, if you just call this on an
> mm when everything is already flushed, it will still do the IPIs but
> it won't do the actual flush.
> 

The benefit is somewhat marginal given that a process that has been
partially reclaimed already has taken a hit on any latency requirements
it has. However, it's a much better fit with your work in general.

> A better future implementation could iterate over each cpu in
> mm_cpumask(), and, using either a new lock or very careful atomics,
> check whether that CPU really needs flushing.  In -tip, all the
> information needed to figure this out is already there in the percpu
> state -- it's just not currently set up for remote access.
> 

Potentially yes although I'm somewhat wary of adding too much complexity
in that path. It'll either be very rare in which case the maintenance
cost isn't worth it or the process is being continually thrashed by
reclaim in which case saving a few TLB flushes isn't going to prevent
performance falling through the floor.

> For backports, it would just be flush_tlb_mm().
> 

Agreed.

-- 
Mel Gorman
SUSE Labs


* Re: Potential race in TLB flush batching?
  2017-07-11 22:27                       ` Nadav Amit
@ 2017-07-11 22:34                         ` Nadav Amit
  2017-07-12  8:27                         ` Mel Gorman
  1 sibling, 0 replies; 70+ messages in thread
From: Nadav Amit @ 2017-07-11 22:34 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

Nadav Amit <nadav.amit@gmail.com> wrote:

> Mel Gorman <mgorman@suse.de> wrote:
> 
>> On Tue, Jul 11, 2017 at 09:09:23PM +0100, Mel Gorman wrote:
>>> On Tue, Jul 11, 2017 at 08:18:23PM +0100, Mel Gorman wrote:
>>>> I don't think we should be particularly clever about this and instead just
>>>> flush the full mm if there is a risk that a parallel batching of flushes is
>>>> in progress, resulting in a stale TLB entry being used. I think tracking mms
>>>> that are currently batching would end up being costly in terms of memory,
>>>> fairly complex, or both. Something like this?
>>> 
>>> mremap and madvise(DONTNEED) would also need to flush. Memory policies are
>>> fine as a move_pages call that hits the race will simply fail to migrate
>>> a page that is being freed and once migration starts, it'll be flushed so
>>> a stale access has no further risk. copy_page_range should also be ok as
>>> the old mm is flushed and the new mm cannot have entries yet.
>> 
>> Adding those results in
> 
> You are way too fast for me.
> 
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>> 		return false;
>> 
>> 	/* If remote CPUs need to be flushed then defer batch the flush */
>> -	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
>> +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) {
>> 		should_defer = true;
>> +		mm->tlb_flush_batched = true;
>> +	}
> 
> Since mm->tlb_flush_batched is set before the PTE is actually cleared, it
> still seems to leave a short window for a race.
> 
> CPU0				CPU1
> ---- 				----
> should_defer_flush
> => mm->tlb_flush_batched=true		
> 				flush_tlb_batched_pending (another PT)
> 				=> flush TLB
> 				=> mm->tlb_flush_batched=false
> ptep_get_and_clear
> ...
> 
> 				flush_tlb_batched_pending (batched PT)
> 				use the stale PTE
> ...
> try_to_unmap_flush
> 
> 
> IOW it seems that mm->tlb_flush_batched should be set after the PTE is
> cleared (and have some compiler barrier to be on the safe side).

I’m actually not sure about that. Without a lock that other order may be
racy as well.


* Re: Potential race in TLB flush batching?
  2017-07-11 22:27                       ` Nadav Amit
  2017-07-11 22:34                         ` Nadav Amit
@ 2017-07-12  8:27                         ` Mel Gorman
  2017-07-12 23:27                           ` Nadav Amit
  1 sibling, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-12  8:27 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

On Tue, Jul 11, 2017 at 03:27:55PM -0700, Nadav Amit wrote:
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Tue, Jul 11, 2017 at 09:09:23PM +0100, Mel Gorman wrote:
> >> On Tue, Jul 11, 2017 at 08:18:23PM +0100, Mel Gorman wrote:
> >>> I don't think we should be particularly clever about this and instead just
> >>> flush the full mm if there is a risk that a parallel batching of flushes is
> >>> in progress, resulting in a stale TLB entry being used. I think tracking mms
> >>> that are currently batching would end up being costly in terms of memory,
> >>> fairly complex, or both. Something like this?
> >> 
> >> mremap and madvise(DONTNEED) would also need to flush. Memory policies are
> >> fine as a move_pages call that hits the race will simply fail to migrate
> >> a page that is being freed and once migration starts, it'll be flushed so
> >> a stale access has no further risk. copy_page_range should also be ok as
> >> the old mm is flushed and the new mm cannot have entries yet.
> > 
> > Adding those results in
> 
> You are way too fast for me.
> 
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
> > 		return false;
> > 
> > 	/* If remote CPUs need to be flushed then defer batch the flush */
> > -	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> > +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) {
> > 		should_defer = true;
> > +		mm->tlb_flush_batched = true;
> > +	}
> 
> Since mm->tlb_flush_batched is set before the PTE is actually cleared, it
> still seems to leave a short window for a race.
> 
> CPU0				CPU1
> ---- 				----
> should_defer_flush
> => mm->tlb_flush_batched=true		
> 				flush_tlb_batched_pending (another PT)
> 				=> flush TLB
> 				=> mm->tlb_flush_batched=false
> ptep_get_and_clear
> ...
> 
> 				flush_tlb_batched_pending (batched PT)
> 				use the stale PTE
> ...
> try_to_unmap_flush
> 
> IOW it seems that mm->tlb_flush_batched should be set after the PTE is
> cleared (and have some compiler barrier to be on the safe side).

I'm relying on setting and clearing of tlb_flush_batched being under a PTL
that is contended if the race is active.

If reclaim is first, it'll take the PTL, set batched while a racing
mprotect/munmap/etc spins. On release, the racing mprotect/munmap
immediately calls flush_tlb_batched_pending() before proceeding as normal,
finding pte_none with the TLB flushed.

If the mprotect/munmap/etc is first, it'll take the PTL, observe that
pte_present and handle the flushing itself while reclaim potentially
spins. When reclaim acquires the lock, it'll still set tlb_flush_batched.

As it's PTL that is taken for that field, it is possible for the accesses
to be re-ordered but only in the case where a race is not occurring.
I'll think some more about whether barriers are necessary but concluded
they weren't needed in this instance. Doing the setting/clear+flush under
the PTL, the protection is similar to normal page table operations that
do not batch the flush.

> One more question, please: how does elevated page count or even locking the
> page help (as you mention in regard to uprobes and ksm)? Yes, the page will
> not be reclaimed, but IIUC try_to_unmap is called before the reference count
> is frozen, and the page lock is dropped on each iteration of the loop in
> shrink_page_list. In this case, it seems to me that uprobes or ksm may still
> not flush the TLB.
> 

If page lock is held then reclaim skips the page entirely and uprobe,
ksm and cow hold the page lock for pages that could potentially be observed
by reclaim.  That is the primary protection for those paths.

The elevated page count is less relevant but I was keeping it in mind
trying to think of cases where a stale TLB entry existed and pointed to
the wrong page.

-- 
Mel Gorman
SUSE Labs


* Re: Potential race in TLB flush batching?
  2017-07-12  8:27                         ` Mel Gorman
@ 2017-07-12 23:27                           ` Nadav Amit
  2017-07-12 23:36                             ` Andy Lutomirski
  2017-07-13  6:07                             ` Mel Gorman
  0 siblings, 2 replies; 70+ messages in thread
From: Nadav Amit @ 2017-07-12 23:27 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Jul 11, 2017 at 03:27:55PM -0700, Nadav Amit wrote:
>> Mel Gorman <mgorman@suse.de> wrote:
>> 
>>> On Tue, Jul 11, 2017 at 09:09:23PM +0100, Mel Gorman wrote:
>>>> On Tue, Jul 11, 2017 at 08:18:23PM +0100, Mel Gorman wrote:
>>>>> I don't think we should be particularly clever about this and instead just
>>>>> flush the full mm if there is a risk that a parallel batching of flushes is
>>>>> in progress, resulting in a stale TLB entry being used. I think tracking mms
>>>>> that are currently batching would end up being costly in terms of memory,
>>>>> fairly complex, or both. Something like this?
>>>> 
>>>> mremap and madvise(DONTNEED) would also need to flush. Memory policies are
>>>> fine as a move_pages call that hits the race will simply fail to migrate
>>>> a page that is being freed and once migration starts, it'll be flushed so
>>>> a stale access has no further risk. copy_page_range should also be ok as
>>>> the old mm is flushed and the new mm cannot have entries yet.
>>> 
>>> Adding those results in
>> 
>> You are way too fast for me.
>> 
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>>> 		return false;
>>> 
>>> 	/* If remote CPUs need to be flushed then defer batch the flush */
>>> -	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
>>> +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) {
>>> 		should_defer = true;
>>> +		mm->tlb_flush_batched = true;
>>> +	}
>> 
>> Since mm->tlb_flush_batched is set before the PTE is actually cleared, it
>> still seems to leave a short window for a race.
>> 
>> CPU0				CPU1
>> ---- 				----
>> should_defer_flush
>> => mm->tlb_flush_batched=true		
>> 				flush_tlb_batched_pending (another PT)
>> 				=> flush TLB
>> 				=> mm->tlb_flush_batched=false
>> ptep_get_and_clear
>> ...
>> 
>> 				flush_tlb_batched_pending (batched PT)
>> 				use the stale PTE
>> ...
>> try_to_unmap_flush
>> 
>> IOW it seems that mm->tlb_flush_batched should be set after the PTE is
>> cleared (and have some compiler barrier to be on the safe side).
> 
> I'm relying on setting and clearing of tlb_flush_batched being under a PTL
> that is contended if the race is active.
> 
> If reclaim is first, it'll take the PTL, set batched while a racing
> mprotect/munmap/etc spins. On release, the racing mprotect/munmap
> immediately calls flush_tlb_batched_pending() before proceeding as normal,
> finding pte_none with the TLB flushed.

This is the scenario I considered in my example. Notice that when the first
flush_tlb_batched_pending is called, CPU0 and CPU1 hold different page-table
locks - allowing them to run concurrently. As a result
flush_tlb_batched_pending is executed before the PTE was cleared and
mm->tlb_flush_batched is cleared. Later, after CPU0 runs ptep_get_and_clear
mm->tlb_flush_batched remains clear, and CPU1 can use the stale PTE.

> If the mprotect/munmap/etc is first, it'll take the PTL, observe that
> pte_present and handle the flushing itself while reclaim potentially
> spins. When reclaim acquires the lock, it'll still set tlb_flush_batched.
> 
> As it's PTL that is taken for that field, it is possible for the accesses
> to be re-ordered but only in the case where a race is not occurring.
> I'll think some more about whether barriers are necessary but concluded
> they weren't needed in this instance. Doing the setting/clear+flush under
> the PTL, the protection is similar to normal page table operations that
> do not batch the flush.
> 
>> One more question, please: how does elevated page count or even locking the
>> page help (as you mention in regard to uprobes and ksm)? Yes, the page will
>> not be reclaimed, but IIUC try_to_unmap is called before the reference count
>> is frozen, and the page lock is dropped on each iteration of the loop in
>> shrink_page_list. In this case, it seems to me that uprobes or ksm may still
>> not flush the TLB.
> 
> If page lock is held then reclaim skips the page entirely and uprobe,
> ksm and cow hold the page lock for pages that could potentially be observed
> by reclaim.  That is the primary protection for those paths.

It is really hard, at least for me, to track this synchronization scheme, as
each path is protected by different means. I still don’t understand why it
is true, since the loop in shrink_page_list calls __ClearPageLocked(page) on
each iteration, before the actual flush takes place.

Actually, I think that based on Andy’s patches there is a relatively
reasonable solution. For each mm we will hold both a “pending_tlb_gen”
(increased under the PT-lock) and an “executed_tlb_gen”. Once
flush_tlb_mm_range finishes flushing it will use cmpxchg to update the
executed_tlb_gen to the pending_tlb_gen that was prior the flush (the
cmpxchg will ensure the TLB gen only goes forward). Then, whenever
pending_tlb_gen is different than executed_tlb_gen - a flush is needed.

Nadav 


* Re: Potential race in TLB flush batching?
  2017-07-12 23:27                           ` Nadav Amit
@ 2017-07-12 23:36                             ` Andy Lutomirski
  2017-07-12 23:42                               ` Nadav Amit
  2017-07-13  6:07                             ` Mel Gorman
  1 sibling, 1 reply; 70+ messages in thread
From: Andy Lutomirski @ 2017-07-12 23:36 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Mel Gorman, Andy Lutomirski, open list:MEMORY MANAGEMENT

On Wed, Jul 12, 2017 at 4:27 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>
> Actually, I think that based on Andy’s patches there is a relatively
> reasonable solution. For each mm we will hold both a “pending_tlb_gen”
> (increased under the PT-lock) and an “executed_tlb_gen”. Once
> flush_tlb_mm_range finishes flushing it will use cmpxchg to update the
> executed_tlb_gen to the pending_tlb_gen that was prior the flush (the
> cmpxchg will ensure the TLB gen only goes forward). Then, whenever
> pending_tlb_gen is different than executed_tlb_gen - a flush is needed.
>

Why do we need executed_tlb_gen?  We already have
cpu_tlbstate.ctxs[...].tlb_gen.  Or is the idea that executed_tlb_gen
guarantees that all cpus in mm_cpumask are at least up to date to
executed_tlb_gen?


* Re: Potential race in TLB flush batching?
  2017-07-12 23:36                             ` Andy Lutomirski
@ 2017-07-12 23:42                               ` Nadav Amit
  2017-07-13  5:38                                 ` Andy Lutomirski
  0 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-12 23:42 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Mel Gorman, open list:MEMORY MANAGEMENT

Andy Lutomirski <luto@kernel.org> wrote:

> On Wed, Jul 12, 2017 at 4:27 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>> Actually, I think that based on Andy’s patches there is a relatively
>> reasonable solution. For each mm we will hold both a “pending_tlb_gen”
>> (increased under the PT-lock) and an “executed_tlb_gen”. Once
>> flush_tlb_mm_range finishes flushing it will use cmpxchg to update the
>> executed_tlb_gen to the pending_tlb_gen that was prior the flush (the
>> cmpxchg will ensure the TLB gen only goes forward). Then, whenever
>> pending_tlb_gen is different than executed_tlb_gen - a flush is needed.
> 
> Why do we need executed_tlb_gen?  We already have
> cpu_tlbstate.ctxs[...].tlb_gen.  Or is the idea that executed_tlb_gen
> guarantees that all cpus in mm_cpumask are at least up to date to
> executed_tlb_gen?

Hm... So actually it may be enough, no? Just compare mm->context.tlb_gen
with cpu_tlbstate.ctxs[...].tlb_gen and flush if they are different?


* Re: Potential race in TLB flush batching?
  2017-07-12 23:42                               ` Nadav Amit
@ 2017-07-13  5:38                                 ` Andy Lutomirski
  2017-07-13 16:05                                   ` Nadav Amit
  0 siblings, 1 reply; 70+ messages in thread
From: Andy Lutomirski @ 2017-07-13  5:38 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Andy Lutomirski, Mel Gorman, open list:MEMORY MANAGEMENT

On Wed, Jul 12, 2017 at 4:42 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> Andy Lutomirski <luto@kernel.org> wrote:
>
>> On Wed, Jul 12, 2017 at 4:27 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>>> Actually, I think that based on Andy’s patches there is a relatively
>>> reasonable solution. For each mm we will hold both a “pending_tlb_gen”
>>> (increased under the PT-lock) and an “executed_tlb_gen”. Once
>>> flush_tlb_mm_range finishes flushing it will use cmpxchg to update the
>>> executed_tlb_gen to the pending_tlb_gen that was prior the flush (the
>>> cmpxchg will ensure the TLB gen only goes forward). Then, whenever
>>> pending_tlb_gen is different than executed_tlb_gen - a flush is needed.
>>
>> Why do we need executed_tlb_gen?  We already have
>> cpu_tlbstate.ctxs[...].tlb_gen.  Or is the idea that executed_tlb_gen
>> guarantees that all cpus in mm_cpumask are at least up to date to
>> executed_tlb_gen?
>
> Hm... So actually it may be enough, no? Just compare mm->context.tlb_gen
> with cpu_tlbstate.ctxs[...].tlb_gen and flush if they are different?
>

Wouldn't that still leave the races where the CPU observing the stale
TLB entry isn't the CPU that did munmap/mprotect/whatever?  I think
executed_tlb_gen or similar may really be needed for your approach.


* Re: Potential race in TLB flush batching?
  2017-07-12 23:27                           ` Nadav Amit
  2017-07-12 23:36                             ` Andy Lutomirski
@ 2017-07-13  6:07                             ` Mel Gorman
  2017-07-13 16:08                               ` Andy Lutomirski
  2017-07-14 23:16                               ` Nadav Amit
  1 sibling, 2 replies; 70+ messages in thread
From: Mel Gorman @ 2017-07-13  6:07 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

On Wed, Jul 12, 2017 at 04:27:23PM -0700, Nadav Amit wrote:
> > If reclaim is first, it'll take the PTL, set batched while a racing
> > mprotect/munmap/etc spins. On release, the racing mprotect/munmap
> > immediately calls flush_tlb_batched_pending() before proceeding as normal,
> > finding pte_none with the TLB flushed.
> 
> This is the scenario I considered in my example. Notice that when the first
> flush_tlb_batched_pending is called, CPU0 and CPU1 hold different page-table
> locks - allowing them to run concurrently. As a result
> flush_tlb_batched_pending is executed before the PTE was cleared and
> mm->tlb_flush_batched is cleared. Later, after CPU0 runs ptep_get_and_clear
> mm->tlb_flush_batched remains clear, and CPU1 can use the stale PTE.
> 

If they hold different PTL locks, it means that reclaim and the parallel
munmap/mprotect/madvise/mremap operation are operating on different regions
of an mm or separate mm's and the race should not apply or at the very
least is equivalent to not batching the flushes. For multiple parallel
operations, munmap/mprotect/mremap are serialised by mmap_sem so there
is only one risky operation at a time. For multiple madvise, there is a
small window when a page is accessible after madvise returns but it is an
advisory call so it's primarily a data integrity concern and the TLB is
flushed before the page is either freed or IO starts on the reclaim side.

> > If the mprotect/munmap/etc is first, it'll take the PTL, observe that
> > pte_present and handle the flushing itself while reclaim potentially
> > spins. When reclaim acquires the lock, it'll still set tlb_flush_batched.
> > 
> > As it's PTL that is taken for that field, it is possible for the accesses
> > to be re-ordered but only in the case where a race is not occurring.
> > I'll think some more about whether barriers are necessary but concluded
> > they weren't needed in this instance. Doing the setting/clear+flush under
> > the PTL, the protection is similar to normal page table operations that
> > do not batch the flush.
> > 
> >> One more question, please: how does elevated page count or even locking the
> >> page help (as you mention in regard to uprobes and ksm)? Yes, the page will
> >> not be reclaimed, but IIUC try_to_unmap is called before the reference count
> >> is frozen, and the page lock is dropped on each iteration of the loop in
> >> shrink_page_list. In this case, it seems to me that uprobes or ksm may still
> >> not flush the TLB.
> > 
> > If page lock is held then reclaim skips the page entirely and uprobe,
> > ksm and cow hold the page lock for pages that could potentially be observed
> > by reclaim.  That is the primary protection for those paths.
> 
> It is really hard, at least for me, to track this synchronization scheme, as
> each path is protected by different means. I still don't understand why it
> is true, since the loop in shrink_page_list calls __ClearPageLocked(page) on
> each iteration, before the actual flush takes place.
> 

At the point of __ClearPageLocked, reclaim was holding the page lock and
had reached the point where there cannot be any other references to it and
the page is definitely clean. Any hypothetical TLB entry that exists at this
point is for read-only which would trap if a write was attempted and the
TLB is flushed before the page is freed so there is no possibility the
page is reallocated and the TLB entry now points to unrelated data.

> Actually, I think that based on Andy's patches there is a relatively
> reasonable solution.

On top of Andy's work, the patch currently is below. Andy, is that roughly
what you had in mind? I didn't think the comments in flush_tlb_func_common
needed updating.

I would have test results but the test against the tip tree without the patch
failed overnight with what looks like filesystem corruption that happened
*after* tests completed and it was untarring and building the next kernel to
test with the patch applied. I'm not sure why yet or how reproducible it is.

---8<---
mm, mprotect: Flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries

Nadav Amit identified a theoretical race between page reclaim and mprotect
due to TLB flushes being batched outside of the PTL being held. He described
the race as follows

        CPU0                            CPU1
        ----                            ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
        try_to_unmap_one()
        ==> ptep_get_and_clear()
        ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
        ...

        try_to_unmap_flush()

The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind such as
munmap, mremap and madvise.

For some operations like mprotect, it's not a data integrity issue but it
is a correctness issue. For munmap, it's potentially a data integrity issue
although the race is unlikely as an munmap, mmap and return to userspace must
all complete within the window when reclaim drops the PTL and flushes the
TLB. However, it's theoretically possible so handle this issue by flushing
the mm if reclaim is potentially currently batching TLB flushes.

Other instances where a flush is required for a present pte should be
ok as either the page lock is held preventing parallel reclaim or a
page reference count is elevated preventing a parallel free leading to
corruption. In the case of page_mkclean there isn't an obvious path that
userspace could take advantage of without using the operations that are
guarded by this patch. Other users such as gup racing with reclaim look
just at PTEs. Huge page variants should be ok as they don't race with
reclaim. mincore only looks at PTEs. userfault also should be ok as
if a parallel reclaim takes place, it will either fault the page back in
or read some of the data before the flush occurs triggering a fault.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: stable@vger.kernel.org # v4.4+

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6397275008db..1ad93cf26826 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -325,6 +325,7 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 }
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_flush_one_mm(struct mm_struct *mm);
 
 #ifndef CONFIG_PARAVIRT
 #define flush_tlb_others(mask, info)	\
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2c1b8881e9d3..a72975a517a1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -455,6 +455,39 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	put_cpu();
 }
 
+/*
+ * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when
+ * this returns. Using the current mm tlb_gen means the TLB will be up to date
+ * with respect to the tlb_gen set at arch_tlbbatch_add_mm. If a flush has
+ * happened since then the IPIs will still be sent but the actual flush is
+ * avoided. Unfortunately the IPIs are necessary as the per-cpu context
+ * tlb_gens cannot be safely accessed.
+ */
+void arch_tlbbatch_flush_one_mm(struct mm_struct *mm)
+{
+	int cpu;
+	struct flush_tlb_info info = {
+		.mm = mm,
+		.new_tlb_gen = atomic64_read(&mm->context.tlb_gen),
+		.start = 0,
+		.end = TLB_FLUSH_ALL,
+	};
+
+	cpu = get_cpu();
+
+	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+		VM_WARN_ON(irqs_disabled());
+		local_irq_disable();
+		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
+		local_irq_enable();
+	}
+
+	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
+		flush_tlb_others(mm_cpumask(mm), &info);
+
+	put_cpu();
+}
+
 static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
 			     size_t count, loff_t *ppos)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45cdb27791a3..ab8f7e11c160 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -495,6 +495,10 @@ struct mm_struct {
 	 */
 	bool tlb_flush_pending;
 #endif
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+	/* See flush_tlb_batched_pending() */
+	bool tlb_flush_batched;
+#endif
 	struct uprobes_state uprobes_state;
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558412fb..9c8a2bfb975c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 void try_to_unmap_flush(void);
 void try_to_unmap_flush_dirty(void);
+void flush_tlb_batched_pending(struct mm_struct *mm);
 #else
 static inline void try_to_unmap_flush(void)
 {
@@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
 static inline void try_to_unmap_flush_dirty(void)
 {
 }
-
+static inline void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+}
 #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
 
 extern const struct trace_print_flags pageflag_names[];
diff --git a/mm/madvise.c b/mm/madvise.c
index 25b78ee4fc2c..75d2cffbe61d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -320,6 +320,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 
 	tlb_remove_check_page_size_change(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
diff --git a/mm/memory.c b/mm/memory.c
index bb11c474857e..b0c3d1556a94 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1197,6 +1197,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	init_rss_vec(rss);
 	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	pte = start_pte;
+	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = *pte;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..f42749e6bf4e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -66,6 +66,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	    atomic_read(&vma->vm_mm->mm_users) == 1)
 		target_node = numa_node_id();
 
+	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 	do {
 		oldpte = *pte;
diff --git a/mm/mremap.c b/mm/mremap.c
index cd8a1b199ef9..6e3d857458de 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -152,6 +152,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	new_ptl = pte_lockptr(mm, new_pmd);
 	if (new_ptl != old_ptl)
 		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 
 	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
diff --git a/mm/rmap.c b/mm/rmap.c
index 130c238fe384..7c5c8ef583fa 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -603,6 +603,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 
 	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
 	tlb_ubc->flush_required = true;
+	mm->tlb_flush_batched = true;
 
 	/*
 	 * If the PTE was dirty then it's best to assume it's writable. The
@@ -631,6 +632,29 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 
 	return should_defer;
 }
+
+/*
+ * Reclaim unmaps pages under the PTL but does not flush the TLB prior to
+ * releasing the PTL if TLB flushes are batched. It's possible for a parallel
+ * operation such as mprotect or munmap to race between reclaim unmapping
+ * the page and flushing the page. If this race occurs, it potentially allows
+ * access to data via a stale TLB entry. Tracking all mm's that have TLB
+ * batching in flight would be expensive during reclaim so instead track
+ * whether TLB batching occurred in the past and if so then do a flush here
+ * if required. This will cost one additional flush per reclaim cycle paid
+ * by the first operation at risk such as mprotect and munmap.
+ *
+ * This must be called under the PTL so that an access to tlb_flush_batched
+ * that is potentially part of a "reclaim vs mprotect/munmap/etc" race will
+ * synchronise via the PTL.
+ */
+void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+	if (mm->tlb_flush_batched) {
+		arch_tlbbatch_flush_one_mm(mm);
+		mm->tlb_flush_batched = false;
+	}
+}
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-13  5:38                                 ` Andy Lutomirski
@ 2017-07-13 16:05                                   ` Nadav Amit
  2017-07-13 16:06                                     ` Andy Lutomirski
  0 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-13 16:05 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Mel Gorman, open list:MEMORY MANAGEMENT

Andy Lutomirski <luto@kernel.org> wrote:

> On Wed, Jul 12, 2017 at 4:42 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>> Andy Lutomirski <luto@kernel.org> wrote:
>> 
>>> On Wed, Jul 12, 2017 at 4:27 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>>>> Actually, I think that based on Andy’s patches there is a relatively
>>>> reasonable solution. For each mm we will hold both a “pending_tlb_gen”
>>>> (increased under the PT-lock) and an “executed_tlb_gen”. Once
>>>> flush_tlb_mm_range finishes flushing it will use cmpxchg to update the
>>>> executed_tlb_gen to the pending_tlb_gen that was prior the flush (the
>>>> cmpxchg will ensure the TLB gen only goes forward). Then, whenever
>>>> pending_tlb_gen is different than executed_tlb_gen - a flush is needed.
>>> 
>>> Why do we need executed_tlb_gen?  We already have
>>> cpu_tlbstate.ctxs[...].tlb_gen.  Or is the idea that executed_tlb_gen
>>> guarantees that all cpus in mm_cpumask are at least up to date to
>>> executed_tlb_gen?
>> 
>> Hm... So actually it may be enough, no? Just compare mm->context.tlb_gen
>> with cpu_tlbstate.ctxs[...].tlb_gen and flush if they are different?
> 
> Wouldn't that still leave the races where the CPU observing the stale
> TLB entry isn't the CPU that did munmap/mprotect/whatever?  I think
> executed_tlb_gen or similar may really be needed for your approach.

Yes, you are right.

This approach requires a counter that is only updated after the flush is
completed by all cores. This way you ensure there is no CPU that did not
complete the flush.

Does it make sense?

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-13 16:05                                   ` Nadav Amit
@ 2017-07-13 16:06                                     ` Andy Lutomirski
  0 siblings, 0 replies; 70+ messages in thread
From: Andy Lutomirski @ 2017-07-13 16:06 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Andy Lutomirski, Mel Gorman, open list:MEMORY MANAGEMENT

On Thu, Jul 13, 2017 at 9:05 AM, Nadav Amit <nadav.amit@gmail.com> wrote:
> Andy Lutomirski <luto@kernel.org> wrote:
>
>> On Wed, Jul 12, 2017 at 4:42 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>>> Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>>> On Wed, Jul 12, 2017 at 4:27 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>>>>> Actually, I think that based on Andy’s patches there is a relatively
>>>>> reasonable solution. For each mm we will hold both a “pending_tlb_gen”
>>>>> (increased under the PT-lock) and an “executed_tlb_gen”. Once
>>>>> flush_tlb_mm_range finishes flushing it will use cmpxchg to update the
>>>>> executed_tlb_gen to the pending_tlb_gen that was prior the flush (the
>>>>> cmpxchg will ensure the TLB gen only goes forward). Then, whenever
>>>>> pending_tlb_gen is different than executed_tlb_gen - a flush is needed.
>>>>
>>>> Why do we need executed_tlb_gen?  We already have
>>>> cpu_tlbstate.ctxs[...].tlb_gen.  Or is the idea that executed_tlb_gen
>>>> guarantees that all cpus in mm_cpumask are at least up to date to
>>>> executed_tlb_gen?
>>>
>>> Hm... So actually it may be enough, no? Just compare mm->context.tlb_gen
>>> with cpu_tlbstate.ctxs[...].tlb_gen and flush if they are different?
>>
>> Wouldn't that still leave the races where the CPU observing the stale
>> TLB entry isn't the CPU that did munmap/mprotect/whatever?  I think
>> executed_tlb_gen or similar may really be needed for your approach.
>
> Yes, you are right.
>
> This approach requires a counter that is only updated after the flush is
> completed by all cores. This way you ensure there is no CPU that did not
> complete the flush.
>
> Does it make sense?

Yes.  It could be a delta on top of Mel's patch.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-13  6:07                             ` Mel Gorman
@ 2017-07-13 16:08                               ` Andy Lutomirski
  2017-07-13 17:07                                 ` Mel Gorman
  2017-07-14 23:16                               ` Nadav Amit
  1 sibling, 1 reply; 70+ messages in thread
From: Andy Lutomirski @ 2017-07-13 16:08 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Nadav Amit, Andy Lutomirski, open list:MEMORY MANAGEMENT

On Wed, Jul 12, 2017 at 11:07 PM, Mel Gorman <mgorman@suse.de> wrote:
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -455,6 +455,39 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>         put_cpu();
>  }
>
> +/*
> + * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when

s/are up to date/have flushed the TLBs/ perhaps?


Can you update this comment in arch/x86/include/asm/tlbflush.h:

         * - Fully flush a single mm.  .mm will be set, .end will be
         *   TLB_FLUSH_ALL, and .new_tlb_gen will be the tlb_gen to
         *   which the IPI sender is trying to catch us up.

by adding something like: This can also happen due to
arch_tlbflush_flush_one_mm(), in which case it's quite likely that
most or all CPUs are already up to date.

Thanks,
Andy


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-13 16:08                               ` Andy Lutomirski
@ 2017-07-13 17:07                                 ` Mel Gorman
  2017-07-13 17:15                                   ` Andy Lutomirski
  0 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-13 17:07 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Nadav Amit, open list:MEMORY MANAGEMENT

On Thu, Jul 13, 2017 at 09:08:21AM -0700, Andrew Lutomirski wrote:
> On Wed, Jul 12, 2017 at 11:07 PM, Mel Gorman <mgorman@suse.de> wrote:
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -455,6 +455,39 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> >         put_cpu();
> >  }
> >
> > +/*
> > + * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when
> 
> s/are up to date/have flushed the TLBs/ perhaps?
> 
> 
> Can you update this comment in arch/x86/include/asm/tlbflush.h:
> 
>          * - Fully flush a single mm.  .mm will be set, .end will be
>          *   TLB_FLUSH_ALL, and .new_tlb_gen will be the tlb_gen to
>          *   which the IPI sender is trying to catch us up.
> 
> by adding something like: This can also happen due to
> arch_tlbflush_flush_one_mm(), in which case it's quite likely that
> most or all CPUs are already up to date.
> 

No problem, thanks. Care to ack the patch below? If so, I'll send it
to Ingo with x86 and linux-mm cc'd after some tests complete (hopefully
successfully). It's fairly x86 specific and makes sense to go in with the
rest of the pcid and mm tlb_gen stuff rather than via Andrew's tree even
though it touches core mm.

---8<---
mm, mprotect: Flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries

Nadav Amit identified a theoretical race between page reclaim and mprotect
due to TLB flushes being batched outside of the PTL being held. He described
the race as follows

        CPU0                            CPU1
        ----                            ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
        try_to_unmap_one()
        ==> ptep_get_and_clear()
        ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
        ...

        try_to_unmap_flush()

The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind such as
munmap, mremap and madvise.

For some operations like mprotect, it's not necessarily a data integrity
issue but it is a correctness issue as there is a window where an mprotect
that limits access still allows access. For munmap, it's potentially a data
integrity issue although the race is massive as an munmap, mmap and return to
userspace must all complete between the window when reclaim drops the PTL and
flushes the TLB. However, it's theoritically possible so handle this issue
by flushing the mm if reclaim is potentially currently batching TLB flushes.

Other instances where a flush is required for a present pte should be
ok as either the page lock is held preventing parallel reclaim or a
page reference count is elevated preventing a parallel free leading to
corruption. In the case of page_mkclean there isn't an obvious path that
userspace could take advantage of without using the operations that are
guarded by this patch. Other users such as gup racing with reclaim look
just at PTEs. Huge page variants should be ok as they don't race with
reclaim. mincore only looks at PTEs. userfault should also be ok as,
if a parallel reclaim takes place, it will either fault the page back in
or read some of the data before the flush occurs triggering a fault.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: stable@vger.kernel.org # v4.4+
---
 arch/x86/include/asm/tlbflush.h |  6 +++++-
 arch/x86/mm/tlb.c               | 33 +++++++++++++++++++++++++++++++++
 include/linux/mm_types.h        |  4 ++++
 mm/internal.h                   |  5 ++++-
 mm/madvise.c                    |  1 +
 mm/memory.c                     |  1 +
 mm/mprotect.c                   |  1 +
 mm/mremap.c                     |  1 +
 mm/rmap.c                       | 24 ++++++++++++++++++++++++
 9 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d23e61dc0640..1849e8da7a27 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -294,7 +294,10 @@ struct flush_tlb_info {
 	 *
 	 * - Fully flush a single mm.  .mm will be set, .end will be
 	 *   TLB_FLUSH_ALL, and .new_tlb_gen will be the tlb_gen to
-	 *   which the IPI sender is trying to catch us up.
+	 *   which the IPI sender is trying to catch us up. This can
+	 *   also happen due to arch_tlbbatch_flush_one_mm(), in which
+	 *   case it's quite likely that most or all CPUs are already
+	 *   up to date.
 	 *
 	 * - Partially flush a single mm.  .mm will be set, .start and
 	 *   .end will indicate the range, and .new_tlb_gen will be set
@@ -339,6 +342,7 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 }
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_flush_one_mm(struct mm_struct *mm);
 
 #ifndef CONFIG_PARAVIRT
 #define flush_tlb_others(mask, info)	\
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 63a5b451c128..248063dc5be8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -505,6 +505,39 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	put_cpu();
 }
 
+/*
+ * Ensure that any arch_tlbbatch_add_mm calls on this mm have flushed the TLB
+ * when this returns. Using the current mm tlb_gen means the TLB will be up
+ * to date with respect to the tlb_gen set at arch_tlbbatch_add_mm. If a
+ * flush has happened since then the IPIs will still be sent but the actual
+ * flush is avoided. Unfortunately the IPIs are necessary as the per-cpu
+ * context tlb_gens cannot be safely accessed.
+ */
+void arch_tlbbatch_flush_one_mm(struct mm_struct *mm)
+{
+	int cpu;
+	struct flush_tlb_info info = {
+		.mm = mm,
+		.new_tlb_gen = atomic64_read(&mm->context.tlb_gen),
+		.start = 0,
+		.end = TLB_FLUSH_ALL,
+	};
+
+	cpu = get_cpu();
+
+	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+		VM_WARN_ON(irqs_disabled());
+		local_irq_disable();
+		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
+		local_irq_enable();
+	}
+
+	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
+		flush_tlb_others(mm_cpumask(mm), &info);
+
+	put_cpu();
+}
+
 static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
 			     size_t count, loff_t *ppos)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45cdb27791a3..ab8f7e11c160 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -495,6 +495,10 @@ struct mm_struct {
 	 */
 	bool tlb_flush_pending;
 #endif
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+	/* See flush_tlb_batched_pending() */
+	bool tlb_flush_batched;
+#endif
 	struct uprobes_state uprobes_state;
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558412fb..9c8a2bfb975c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -498,6 +498,7 @@ extern struct workqueue_struct *mm_percpu_wq;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 void try_to_unmap_flush(void);
 void try_to_unmap_flush_dirty(void);
+void flush_tlb_batched_pending(struct mm_struct *mm);
 #else
 static inline void try_to_unmap_flush(void)
 {
@@ -505,7 +506,9 @@ static inline void try_to_unmap_flush(void)
 static inline void try_to_unmap_flush_dirty(void)
 {
 }
-
+static inline void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+}
 #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
 
 extern const struct trace_print_flags pageflag_names[];
diff --git a/mm/madvise.c b/mm/madvise.c
index 25b78ee4fc2c..75d2cffbe61d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -320,6 +320,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 
 	tlb_remove_check_page_size_change(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
diff --git a/mm/memory.c b/mm/memory.c
index bb11c474857e..b0c3d1556a94 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1197,6 +1197,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	init_rss_vec(rss);
 	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	pte = start_pte;
+	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = *pte;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..f42749e6bf4e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -66,6 +66,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	    atomic_read(&vma->vm_mm->mm_users) == 1)
 		target_node = numa_node_id();
 
+	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 	do {
 		oldpte = *pte;
diff --git a/mm/mremap.c b/mm/mremap.c
index cd8a1b199ef9..6e3d857458de 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -152,6 +152,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	new_ptl = pte_lockptr(mm, new_pmd);
 	if (new_ptl != old_ptl)
 		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 
 	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
diff --git a/mm/rmap.c b/mm/rmap.c
index 130c238fe384..7c5c8ef583fa 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -603,6 +603,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 
 	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
 	tlb_ubc->flush_required = true;
+	mm->tlb_flush_batched = true;
 
 	/*
 	 * If the PTE was dirty then it's best to assume it's writable. The
@@ -631,6 +632,29 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 
 	return should_defer;
 }
+
+/*
+ * Reclaim unmaps pages under the PTL but does not flush the TLB prior to
+ * releasing the PTL if TLB flushes are batched. It's possible for a parallel
+ * operation such as mprotect or munmap to race between reclaim unmapping
+ * the page and flushing the page. If this race occurs, it potentially allows
+ * access to data via a stale TLB entry. Tracking all mm's that have TLB
+ * batching in flight would be expensive during reclaim so instead track
+ * whether TLB batching occurred in the past and if so then do a flush here
+ * if required. This will cost one additional flush per reclaim cycle paid
+ * by the first operation at risk such as mprotect and munmap.
+ *
+ * This must be called under the PTL so that an access to tlb_flush_batched
+ * that is potentially part of a "reclaim vs mprotect/munmap/etc" race will
+ * synchronise via the PTL.
+ */
+void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+	if (mm->tlb_flush_batched) {
+		arch_tlbbatch_flush_one_mm(mm);
+		mm->tlb_flush_batched = false;
+	}
+}
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-13 17:07                                 ` Mel Gorman
@ 2017-07-13 17:15                                   ` Andy Lutomirski
  2017-07-13 18:23                                     ` Mel Gorman
  0 siblings, 1 reply; 70+ messages in thread
From: Andy Lutomirski @ 2017-07-13 17:15 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, Nadav Amit, open list:MEMORY MANAGEMENT

On Thu, Jul 13, 2017 at 10:07 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Thu, Jul 13, 2017 at 09:08:21AM -0700, Andrew Lutomirski wrote:
>> On Wed, Jul 12, 2017 at 11:07 PM, Mel Gorman <mgorman@suse.de> wrote:
>> > --- a/arch/x86/mm/tlb.c
>> > +++ b/arch/x86/mm/tlb.c
>> > @@ -455,6 +455,39 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>> >         put_cpu();
>> >  }
>> >
>> > +/*
>> > + * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when
>>
>> s/are up to date/have flushed the TLBs/ perhaps?
>>
>>
>> Can you update this comment in arch/x86/include/asm/tlbflush.h:
>>
>>          * - Fully flush a single mm.  .mm will be set, .end will be
>>          *   TLB_FLUSH_ALL, and .new_tlb_gen will be the tlb_gen to
>>          *   which the IPI sender is trying to catch us up.
>>
>> by adding something like: This can also happen due to
>> arch_tlbflush_flush_one_mm(), in which case it's quite likely that
>> most or all CPUs are already up to date.
>>
>
> No problem, thanks. Care to ack the patch below? If so, I'll send it
> to Ingo with x86 and linux-mm cc'd after some tests complete (hopefully
> successfully). It's fairly x86 specific and makes sense to go in with the
> rest of the pcid and mm tlb_gen stuff rather than via Andrew's tree even
> though it touches core mm.

Acked-by: Andy Lutomirski <luto@kernel.org> # for the x86 parts

When you send to Ingo, you might want to change
arch_tlbbatch_flush_one_mm to arch_tlbbatch_flush_one_mm(), because
otherwise he'll probably do it for you :)


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-13 17:15                                   ` Andy Lutomirski
@ 2017-07-13 18:23                                     ` Mel Gorman
  0 siblings, 0 replies; 70+ messages in thread
From: Mel Gorman @ 2017-07-13 18:23 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Nadav Amit, open list:MEMORY MANAGEMENT

On Thu, Jul 13, 2017 at 10:15:15AM -0700, Andrew Lutomirski wrote:
> On Thu, Jul 13, 2017 at 10:07 AM, Mel Gorman <mgorman@suse.de> wrote:
> > On Thu, Jul 13, 2017 at 09:08:21AM -0700, Andrew Lutomirski wrote:
> >> On Wed, Jul 12, 2017 at 11:07 PM, Mel Gorman <mgorman@suse.de> wrote:
> >> > --- a/arch/x86/mm/tlb.c
> >> > +++ b/arch/x86/mm/tlb.c
> >> > @@ -455,6 +455,39 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> >> >         put_cpu();
> >> >  }
> >> >
> >> > +/*
> >> > + * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when
> >>
> >> s/are up to date/have flushed the TLBs/ perhaps?
> >>
> >>
> >> Can you update this comment in arch/x86/include/asm/tlbflush.h:
> >>
> >>          * - Fully flush a single mm.  .mm will be set, .end will be
> >>          *   TLB_FLUSH_ALL, and .new_tlb_gen will be the tlb_gen to
> >>          *   which the IPI sender is trying to catch us up.
> >>
> >> by adding something like: This can also happen due to
> >> arch_tlbflush_flush_one_mm(), in which case it's quite likely that
> >> most or all CPUs are already up to date.
> >>
> >
> > No problem, thanks. Care to ack the patch below? If so, I'll send it
> > to Ingo with x86 and linux-mm cc'd after some tests complete (hopefully
> > successfully). It's fairly x86 specific and makes sense to go in with the
> > rest of the pcid and mm tlb_gen stuff rather than via Andrew's tree even
> > though it touches core mm.
> 
> Acked-by: Andy Lutomirski <luto@kernel.org> # for the x86 parts
> 
> When you send to Ingo, you might want to change
> arch_tlbbatch_flush_one_mm to arch_tlbbatch_flush_one_mm(), because
> otherwise he'll probably do it for you :)

*cringe*. I fixed it up.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-11 22:07                   ` Andy Lutomirski
  2017-07-11 22:33                     ` Mel Gorman
@ 2017-07-14  7:00                     ` Benjamin Herrenschmidt
  2017-07-14  8:31                       ` Mel Gorman
  1 sibling, 1 reply; 70+ messages in thread
From: Benjamin Herrenschmidt @ 2017-07-14  7:00 UTC (permalink / raw)
  To: Andy Lutomirski, Mel Gorman; +Cc: Nadav Amit, linux-mm

On Tue, 2017-07-11 at 15:07 -0700, Andy Lutomirski wrote:
> On Tue, Jul 11, 2017 at 12:18 PM, Mel Gorman <mgorman@suse.de> wrote:
> 
> I would change this slightly:
> 
> > +void flush_tlb_batched_pending(struct mm_struct *mm)
> > +{
> > +       if (mm->tlb_flush_batched) {
> > +               flush_tlb_mm(mm);
> 
> How about making this a new helper arch_tlbbatch_flush_one_mm(mm);
> The idea is that this could be implemented as flush_tlb_mm(mm), but
> the actual semantics needed are weaker.  All that's really needed
> AFAICS is to make sure that any arch_tlbbatch_add_mm() calls on this
> mm that have already happened become effective by the time that
> arch_tlbbatch_flush_one_mm() returns.

Jumping in ... I just discovered that 'new' batching stuff... is it
documented anywhere ?

We already had some form of batching via the mmu_gather, now there's a
different somewhat orthogonal and it's completely unclear what it's
about and why we couldn't use what we already had. Also what
assumptions it makes if I want to port it to my arch....

The page table management code was messy enough without yet another
undocumented batching mechanism that isn't quite the one we already
had...
 
Ben.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-14  7:00                     ` Benjamin Herrenschmidt
@ 2017-07-14  8:31                       ` Mel Gorman
  2017-07-14  9:02                         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-14  8:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Andy Lutomirski, Nadav Amit, linux-mm

On Fri, Jul 14, 2017 at 05:00:41PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2017-07-11 at 15:07 -0700, Andy Lutomirski wrote:
> > On Tue, Jul 11, 2017 at 12:18 PM, Mel Gorman <mgorman@suse.de> wrote:
> > 
> > I would change this slightly:
> > 
> > > +void flush_tlb_batched_pending(struct mm_struct *mm)
> > > +{
> > > +       if (mm->tlb_flush_batched) {
> > > +               flush_tlb_mm(mm);
> > 
> > How about making this a new helper arch_tlbbatch_flush_one_mm(mm);
> > The idea is that this could be implemented as flush_tlb_mm(mm), but
> > the actual semantics needed are weaker.  All that's really needed
> > AFAICS is to make sure that any arch_tlbbatch_add_mm() calls on this
> > mm that have already happened become effective by the time that
> > arch_tlbbatch_flush_one_mm() returns.
> 
> Jumping in ... I just discovered that 'new' batching stuff... is it
> documented anywhere ?
> 

This should be a new thread.

The original commit log has many of the details and the comments have
others. It's clearer what the boundaries are and what is needed from an
architecture with Andy's work on top, which right now is easier to see
in tip/x86/mm.

> We already had some form of batching via the mmu_gather, now there's a
> different, somewhat orthogonal one, and it's completely unclear what it's
> about and why we couldn't use what we already had. Also what
> assumptions it makes if I want to port it to my arch....
> 

The batching in this context is more about mm's than individual pages
and was done this way as the number of mm's to track was potentially
unbound. At the time of implementation, tracking individual pages and the
extra bits for mmu_gather was overkill and fairly complex due to the need
to potentially restart when the gather structure filled.

It may also be only a gain on a limited number of architectures depending
on exactly how an architecture handles flushing. At the time, batching
this for x86 in the worst-case scenario where all pages being reclaimed
were mapped from multiple threads knocked 24.4% off elapsed run time and
29% off system CPU but only on multi-socket NUMA machines. On UMA, it was
barely noticeable. For some workloads where only a few pages are mapped or
the mapped pages on the LRU are relatively sparse, it'll make no difference.

The worst-case situation is extremely IPI intensive on x86 where many
IPIs were being sent for each unmap. It's only worth even considering if
you see that the time spent sending IPIs for flushes is a large portion
of reclaim.
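As a rough illustration, the mm-level bookkeeping described above can be
modelled in a few lines of userspace C. All the names and counters here
(the cpumask array, `ipi_rounds`, `reclaim_pages`) are simplified stand-ins
for the kernel structures, not the real implementation:

```c
#include <stdbool.h>

/* Userspace model of the reclaim-side batching: the unmap path only
 * records which CPUs may hold stale entries and a single flush is
 * issued once the whole batch of pages has been unmapped. */
#define NCPUS 4

static bool pending_cpu[NCPUS];  /* cpumask accumulated by the batch */
static int ipi_rounds;           /* simulated rounds of IPIs sent */

/* per-page step: just accumulate bits, no flush yet */
static void tlbbatch_add_mm(const bool mm_cpumask[NCPUS])
{
    for (int i = 0; i < NCPUS; i++)
        if (mm_cpumask[i])
            pending_cpu[i] = true;
}

/* batched step: one IPI round covers every page unmapped so far */
static void try_to_unmap_flush(void)
{
    bool any = false;

    for (int i = 0; i < NCPUS; i++) {
        any |= pending_cpu[i];
        pending_cpu[i] = false;
    }
    if (any)
        ipi_rounds++;
}

/* Unmapping npages pages mapped on the same CPUs costs one IPI
 * round instead of npages rounds. */
static int reclaim_pages(int npages)
{
    bool mask[NCPUS] = { true, true, false, false };

    for (int i = 0; i < npages; i++)
        tlbbatch_add_mm(mask);
    try_to_unmap_flush();
    return ipi_rounds;
}
```

The point of the sketch is only the shape of the saving: the per-page cost
collapses to setting bits, and the expensive cross-CPU work is paid once
per batch.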

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-14  8:31                       ` Mel Gorman
@ 2017-07-14  9:02                         ` Benjamin Herrenschmidt
  2017-07-14  9:27                           ` Mel Gorman
  0 siblings, 1 reply; 70+ messages in thread
From: Benjamin Herrenschmidt @ 2017-07-14  9:02 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, Nadav Amit, linux-mm, Aneesh Kumar K.V

On Fri, 2017-07-14 at 09:31 +0100, Mel Gorman wrote:
> It may also be only a gain on a limited number of architectures depending
> on exactly how an architecture handles flushing. At the time, batching
> this for x86 in the worst-case scenario where all pages being reclaimed
> were mapped from multiple threads knocked 24.4% off elapsed run time and
> 29% off system CPU but only on multi-socket NUMA machines. On UMA, it was
> barely noticeable. For some workloads where only a few pages are mapped or
> the mapped pages on the LRU are relatively sparse, it'll make no difference.
> 
> The worst-case situation is extremely IPI intensive on x86 where many
> IPIs were being sent for each unmap. It's only worth even considering if
> you see that the time spent sending IPIs for flushes is a large portion
> of reclaim.

Ok, it would be interesting to see how that compares to powerpc with
its HW tlb invalidation broadcasts. We tend to hate them and prefer
IPIs in most cases but maybe not *this* case .. (mostly we find that
IPI + local inval is better for large scale invals, such as full mm on
exit/fork etc...).

In the meantime I found the original commits, we'll dig and see if it's
useful for us.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-14  9:02                         ` Benjamin Herrenschmidt
@ 2017-07-14  9:27                           ` Mel Gorman
  2017-07-14 22:21                             ` Andy Lutomirski
  0 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-14  9:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andy Lutomirski, Nadav Amit, linux-mm, Aneesh Kumar K.V

On Fri, Jul 14, 2017 at 07:02:57PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2017-07-14 at 09:31 +0100, Mel Gorman wrote:
> > It may also be only a gain on a limited number of architectures depending
> > on exactly how an architecture handles flushing. At the time, batching
> > this for x86 in the worst-case scenario where all pages being reclaimed
> > were mapped from multiple threads knocked 24.4% off elapsed run time and
> > 29% off system CPU but only on multi-socket NUMA machines. On UMA, it was
> > barely noticeable. For some workloads where only a few pages are mapped or
> > the mapped pages on the LRU are relatively sparse, it'll make no difference.
> > 
> > The worst-case situation is extremely IPI intensive on x86 where many
> > IPIs were being sent for each unmap. It's only worth even considering if
> > you see that the time spent sending IPIs for flushes is a large portion
> > of reclaim.
> 
> Ok, it would be interesting to see how that compares to powerpc with
> its HW tlb invalidation broadcasts. We tend to hate them and prefer
> IPIs in most cases but maybe not *this* case .. (mostly we find that
> IPI + local inval is better for large scale invals, such as full mm on
> exit/fork etc...).
> 
> In the meantime I found the original commits, we'll dig and see if it's
> useful for us.
> 

I would suggest that it is based on top of Andy's work that is currently in
Linus' tree for 4.13-rc1 as the core/arch boundary is a lot clearer. While
there is other work pending on top related to mm and generation counters,
that is primarily important for addressing the race which ppc64 may not
need if you always flush to clear the accessed bit (or equivalent). The
main thing to watch for is that if an accessed or young bit is being set
for the first time, the arch checks the underlying PTE and traps if it's
invalid. If that holds and there is a flush when the young bit is cleared
then you probably do not need the arch hook that closes the race.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-14  9:27                           ` Mel Gorman
@ 2017-07-14 22:21                             ` Andy Lutomirski
  0 siblings, 0 replies; 70+ messages in thread
From: Andy Lutomirski @ 2017-07-14 22:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Benjamin Herrenschmidt, Andy Lutomirski, Nadav Amit, linux-mm,
	Aneesh Kumar K.V

On Fri, Jul 14, 2017 at 2:27 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Fri, Jul 14, 2017 at 07:02:57PM +1000, Benjamin Herrenschmidt wrote:
>> On Fri, 2017-07-14 at 09:31 +0100, Mel Gorman wrote:
>> > It may also be only a gain on a limited number of architectures depending
>> > on exactly how an architecture handles flushing. At the time, batching
>> > this for x86 in the worst-case scenario where all pages being reclaimed
>> > were mapped from multiple threads knocked 24.4% off elapsed run time and
>> > 29% off system CPU but only on multi-socket NUMA machines. On UMA, it was
>> > barely noticeable. For some workloads where only a few pages are mapped or
>> > the mapped pages on the LRU are relatively sparse, it'll make no difference.
>> >
>> > The worst-case situation is extremely IPI intensive on x86 where many
>> > IPIs were being sent for each unmap. It's only worth even considering if
>> > you see that the time spent sending IPIs for flushes is a large portion
>> > of reclaim.
>>
>> Ok, it would be interesting to see how that compares to powerpc with
>> its HW tlb invalidation broadcasts. We tend to hate them and prefer
>> IPIs in most cases but maybe not *this* case .. (mostly we find that
>> IPI + local inval is better for large scale invals, such as full mm on
>> exit/fork etc...).
>>
>> In the meantime I found the original commits, we'll dig and see if it's
>> useful for us.
>>
>
> I would suggest that it is based on top of Andy's work that is currently in
> Linus' tree for 4.13-rc1 as the core/arch boundary is a lot clearer. While
> there is other work pending on top related to mm and generation counters,
> that is primarily important for addressing the race which ppc64 may not
> need if you always flush to clear the accessed bit (or equivalent). The
> main thing to watch for is that if an accessed or young bit is being set
> for the first time, the arch checks the underlying PTE and traps if it's
> invalid. If that holds and there is a flush when the young bit is cleared
> then you probably do not need the arch hook that closes the race.
>

Ben, if you could read the API in tip:x86/mm + Mel's patch, it would
be fantastic.  I'd like to know whether a non-x86 non-mm person can
understand the API (arch_tlbbatch_add_mm, arch_tlbbatch_flush, and
arch_tlbbatch_flush_one_mm) well enough to implement it.  I'd also
like to know for real that it makes sense outside of x86.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-13  6:07                             ` Mel Gorman
  2017-07-13 16:08                               ` Andy Lutomirski
@ 2017-07-14 23:16                               ` Nadav Amit
  2017-07-15 15:55                                 ` Mel Gorman
  1 sibling, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-14 23:16 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Wed, Jul 12, 2017 at 04:27:23PM -0700, Nadav Amit wrote:
>>> If reclaim is first, it'll take the PTL, set batched while a racing
>>> mprotect/munmap/etc spins. On release, the racing mprotect/munmap
>>> immediately calls flush_tlb_batched_pending() before proceeding as normal,
>>> finding pte_none with the TLB flushed.
>> 
>> This is the scenario I regarded in my example. Notice that when the first
>> flush_tlb_batched_pending is called, CPU0 and CPU1 hold different page-table
>> locks - allowing them to run concurrently. As a result
>> flush_tlb_batched_pending is executed before the PTE was cleared and
>> mm->tlb_flush_batched is cleared. Later, after CPU0 runs ptep_get_and_clear
>> mm->tlb_flush_batched remains clear, and CPU1 can use the stale PTE.
> 
> If they hold different PTL locks, it means that reclaim and the parallel
> munmap/mprotect/madvise/mremap operation are operating on different regions
> of an mm or separate mm's and the race should not apply or at the very
> least is equivalent to not batching the flushes. For multiple parallel
> operations, munmap/mprotect/mremap are serialised by mmap_sem so there
> is only one risky operation at a time. For multiple madvise, there is a
> small window when a page is accessible after madvise returns but it is an
> advisory call so it's primarily a data integrity concern and the TLB is
> flushed before the page is either freed or IO starts on the reclaim side.

I think there is some miscommunication. Perhaps one detail was missing:

CPU0				CPU1
---- 				----
should_defer_flush
=> mm->tlb_flush_batched=true		
				flush_tlb_batched_pending (another PT)
				=> flush TLB
				=> mm->tlb_flush_batched=false

				Access PTE (and cache in TLB)
ptep_get_and_clear(PTE)
...

				flush_tlb_batched_pending (batched PT)
				[ no flush since tlb_flush_batched=false ]
				use the stale PTE
...
try_to_unmap_flush

There are only 2 CPUs and both regard the same address-space. CPU0 reclaims a
page from this address-space. Just between setting tlb_flush_batched and the
actual clearing of the PTE, the process on CPU1 runs munmap and calls
flush_tlb_batched_pending. This can happen if CPU1 operates on a different
page-table.

So CPU1 flushes the TLB and clears the tlb_flush_batched indication. Note,
however, that CPU0 still did not clear the PTE so CPU1 can access this PTE
and cache it. Then, after CPU0 clears the PTE, the process on CPU1 can try
to munmap the region that includes the cleared PTE. However, now it does not
flush the TLB.
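The interleaving above can be replayed step by step in a small
single-threaded model. This is purely illustrative userspace C: the three
booleans stand in for the PTE, CPU1's cached TLB entry and
mm->tlb_flush_batched, and the function names mirror the kernel's but none
of this is kernel code:

```c
#include <stdbool.h>

static bool pte_present;         /* the PTE reclaim will clear */
static bool tlb_cached;          /* CPU1's cached translation */
static bool tlb_flush_batched;   /* mm->tlb_flush_batched */

static void flush_tlb_batched_pending(void)
{
    if (tlb_flush_batched) {
        tlb_cached = false;      /* flush_tlb_mm() */
        tlb_flush_batched = false;
    }
}

/* Replays the scenario where the flag is set before the PTE is
 * cleared and an unrelated flush_tlb_batched_pending() call consumes
 * it in between. Returns true if CPU1 can still use a translation
 * for a PTE that reclaim already cleared. */
static bool stale_pte_usable(void)
{
    pte_present = true;
    tlb_cached = false;
    tlb_flush_batched = false;

    tlb_flush_batched = true;    /* CPU0: flag set before pte clear */
    flush_tlb_batched_pending(); /* CPU1: munmap on another page-table */
    tlb_cached = true;           /* CPU1: access caches the RW PTE */
    pte_present = false;         /* CPU0: ptep_get_and_clear() */
    flush_tlb_batched_pending(); /* CPU1: flag already clear - no flush */

    return tlb_cached && !pte_present;
}
```

With this ordering the function returns true: the second
flush_tlb_batched_pending() finds the flag already consumed, so the stale
entry survives past the unmap.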

> +/*
> + * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when
> + * this returns. Using the current mm tlb_gen means the TLB will be up to date
> + * with respect to the tlb_gen set at arch_tlbbatch_add_mm. If a flush has
> + * happened since then the IPIs will still be sent but the actual flush is
> + * avoided. Unfortunately the IPIs are necessary as the per-cpu context
> + * tlb_gens cannot be safely accessed.
> + */
> +void arch_tlbbatch_flush_one_mm(struct mm_struct *mm)
> +{
> +	int cpu;
> +	struct flush_tlb_info info = {
> +		.mm = mm,
> +		.new_tlb_gen = atomic64_read(&mm->context.tlb_gen),
> +		.start = 0,
> +		.end = TLB_FLUSH_ALL,
> +	};
> +
> +	cpu = get_cpu();
> +
> +	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> +		VM_WARN_ON(irqs_disabled());
> +		local_irq_disable();
> +		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
> +		local_irq_enable();
> +	}
> +
> +	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> +		flush_tlb_others(mm_cpumask(mm), &info);
> +
> +	put_cpu();
> +}
> +

It is a shame that after Andy collapsed all the different flushing flows,
you create a new one. How about squashing this untested one to yours?

-- >8 --

Subject: x86/mm: refactor flush_tlb_mm_range and arch_tlbbatch_flush_one_mm

flush_tlb_mm_range() and arch_tlbbatch_flush_one_mm() share a lot of common
code. After the recent work on combining the x86 TLB userspace entries
flushes, it is a shame to break them into different code-paths again.

Refactor the common code into perform_tlb_flush().

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 48 +++++++++++++++++++-----------------------------
 1 file changed, 19 insertions(+), 29 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 248063dc5be8..56e00443a6cf 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -404,17 +404,30 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
  */
 static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 
+static void perform_tlb_flush(struct mm_struct *mm, struct flush_tlb_info *info)
+{
+	int cpu = get_cpu();
+
+	if (info->mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+		VM_WARN_ON(irqs_disabled());
+		local_irq_disable();
+		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
+		local_irq_enable();
+	}
+
+	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
+		flush_tlb_others(mm_cpumask(mm), info);
+
+	put_cpu();
+}
+
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
 {
-	int cpu;
-
 	struct flush_tlb_info info = {
 		.mm = mm,
 	};
 
-	cpu = get_cpu();
-
 	/* This is also a barrier that synchronizes with switch_mm(). */
 	info.new_tlb_gen = inc_mm_tlb_gen(mm);
 
@@ -429,17 +442,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 		info.end = TLB_FLUSH_ALL;
 	}
 
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
-		VM_WARN_ON(irqs_disabled());
-		local_irq_disable();
-		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
-		local_irq_enable();
-	}
-
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), &info);
-
-	put_cpu();
+	perform_tlb_flush(mm, &info);
 }
 
 
@@ -515,7 +518,6 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
  */
 void arch_tlbbatch_flush_one_mm(struct mm_struct *mm)
 {
-	int cpu;
 	struct flush_tlb_info info = {
 		.mm = mm,
 		.new_tlb_gen = atomic64_read(&mm->context.tlb_gen),
@@ -523,19 +525,7 @@ void arch_tlbbatch_flush_one_mm(struct mm_struct *mm)
 		.end = TLB_FLUSH_ALL,
 	};
 
-	cpu = get_cpu();
-
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
-		VM_WARN_ON(irqs_disabled());
-		local_irq_disable();
-		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
-		local_irq_enable();
-	}
-
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), &info);
-
-	put_cpu();
+	perform_tlb_flush(mm, &info);
 }
 
 static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-14 23:16                               ` Nadav Amit
@ 2017-07-15 15:55                                 ` Mel Gorman
  2017-07-15 16:41                                   ` Andy Lutomirski
  2017-07-18 21:28                                   ` Nadav Amit
  0 siblings, 2 replies; 70+ messages in thread
From: Mel Gorman @ 2017-07-15 15:55 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

On Fri, Jul 14, 2017 at 04:16:44PM -0700, Nadav Amit wrote:
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Wed, Jul 12, 2017 at 04:27:23PM -0700, Nadav Amit wrote:
> >>> If reclaim is first, it'll take the PTL, set batched while a racing
> >>> mprotect/munmap/etc spins. On release, the racing mprotect/munmap
> >>> immediately calls flush_tlb_batched_pending() before proceeding as normal,
> >>> finding pte_none with the TLB flushed.
> >> 
> >> This is the scenario I regarded in my example. Notice that when the first
> >> flush_tlb_batched_pending is called, CPU0 and CPU1 hold different page-table
> >> locks - allowing them to run concurrently. As a result
> >> flush_tlb_batched_pending is executed before the PTE was cleared and
> >> mm->tlb_flush_batched is cleared. Later, after CPU0 runs ptep_get_and_clear
> >> mm->tlb_flush_batched remains clear, and CPU1 can use the stale PTE.
> > 
> > If they hold different PTL locks, it means that reclaim and the parallel
> > munmap/mprotect/madvise/mremap operation are operating on different regions
> > of an mm or separate mm's and the race should not apply or at the very
> > least is equivalent to not batching the flushes. For multiple parallel
> > operations, munmap/mprotect/mremap are serialised by mmap_sem so there
> > is only one risky operation at a time. For multiple madvise, there is a
> > small window when a page is accessible after madvise returns but it is an
> > advisory call so it's primarily a data integrity concern and the TLB is
> > flushed before the page is either freed or IO starts on the reclaim side.
> 
> I think there is some miscommunication. Perhaps one detail was missing:
> 
> CPU0				CPU1
> ---- 				----
> should_defer_flush
> => mm->tlb_flush_batched=true		
> 				flush_tlb_batched_pending (another PT)
> 				=> flush TLB
> 				=> mm->tlb_flush_batched=false
> 
> 				Access PTE (and cache in TLB)
> ptep_get_and_clear(PTE)
> ...
> 
> 				flush_tlb_batched_pending (batched PT)
> 				[ no flush since tlb_flush_batched=false ]
> 				use the stale PTE
> ...
> try_to_unmap_flush
> 
> There are only 2 CPUs and both regard the same address-space. CPU0 reclaims a
> page from this address-space. Just between setting tlb_flush_batched and the
> actual clearing of the PTE, the process on CPU1 runs munmap and calls
> flush_tlb_batched_pending. This can happen if CPU1 operates on a different
> page-table.
> 

If both regard the same address-space then they have the same page table so
there is a disconnect between the first and last sentence in your paragraph
above. On CPU 0, the setting of tlb_flush_batched and ptep_get_and_clear
is also reversed as the sequence is

                        pteval = ptep_get_and_clear(mm, address, pvmw.pte);
                        set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));

Additional barriers should not be needed as within the critical section
that can race, it's protected by the lock and with Andy's code, there is
a full barrier before the setting of tlb_flush_batched. With Andy's code,
there may be a need for a compiler barrier but I can rethink about that
and add it during the backport to -stable if necessary.

So the setting happens after the clear and if they share the same address
space and collide then they both share the same PTL so are protected from
each other.

If there are separate address spaces using a shared mapping then the
same race does not occur.
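The effect of that ordering can be checked with a small single-threaded
model (illustrative userspace C only; the booleans stand in for the PTE,
the cached TLB entry and mm->tlb_flush_batched, and the names mirror the
kernel's but this is not the kernel code):

```c
#include <stdbool.h>

static bool pte_present = true;
static bool tlb_cached = true;   /* CPU1 cached the RW PTE earlier */
static bool tlb_flush_batched;

static void flush_tlb_batched_pending(void)
{
    if (tlb_flush_batched) {
        tlb_cached = false;      /* flush_tlb_mm() */
        tlb_flush_batched = false;
    }
}

/* With the actual ordering - ptep_get_and_clear() before
 * set_tlb_ubc_flush_pending() - any flush_tlb_batched_pending()
 * that runs after reclaim sets the flag will flush the stale entry.
 * Returns true if CPU1 can still use the stale translation. */
static bool stale_pte_usable(void)
{
    pte_present = false;         /* CPU0: ptep_get_and_clear() */
    tlb_flush_batched = true;    /* CPU0: set_tlb_ubc_flush_pending() */
    flush_tlb_batched_pending(); /* CPU1: sees the flag, flushes */

    return tlb_cached;
}
```

Here the function returns false: once the flag is guaranteed to be set
after the PTE is cleared, the munmap-side check cannot miss the pending
flush for that PTE.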

> > +/*
> > + * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when
> > + * this returns. Using the current mm tlb_gen means the TLB will be up to date
> > + * with respect to the tlb_gen set at arch_tlbbatch_add_mm. If a flush has
> > + * happened since then the IPIs will still be sent but the actual flush is
> > + * avoided. Unfortunately the IPIs are necessary as the per-cpu context
> > + * tlb_gens cannot be safely accessed.
> > + */
> > +void arch_tlbbatch_flush_one_mm(struct mm_struct *mm)
> > +{
> > +	int cpu;
> > +	struct flush_tlb_info info = {
> > +		.mm = mm,
> > +		.new_tlb_gen = atomic64_read(&mm->context.tlb_gen),
> > +		.start = 0,
> > +		.end = TLB_FLUSH_ALL,
> > +	};
> > +
> > +	cpu = get_cpu();
> > +
> > +	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> > +		VM_WARN_ON(irqs_disabled());
> > +		local_irq_disable();
> > +		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
> > +		local_irq_enable();
> > +	}
> > +
> > +	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> > +		flush_tlb_others(mm_cpumask(mm), &info);
> > +
> > +	put_cpu();
> > +}
> > +
> 
> It is a shame that after Andy collapsed all the different flushing flows,
> you create a new one. How about squashing this untested one to yours?
> 

The patch looks fine to me but when writing the patch, I wondered why the
original code disabled preemption before inc_mm_tlb_gen. I didn't spot
the reason for it but given the importance of properly synchronising with
switch_mm, I played it safe. However, this should be ok on top and
maintain the existing sequences

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 248063dc5be8..cbd8621a0bee 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -404,6 +404,21 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
  */
 static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 
+static void flush_tlb_mm_common(struct flush_tlb_info *info, int cpu)
+{
+	struct mm_struct *mm = info->mm;
+
+	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+		VM_WARN_ON(irqs_disabled());
+		local_irq_disable();
+		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
+		local_irq_enable();
+	}
+
+	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
+		flush_tlb_others(mm_cpumask(mm), info);
+}
+
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
 {
@@ -429,15 +444,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 		info.end = TLB_FLUSH_ALL;
 	}
 
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
-		VM_WARN_ON(irqs_disabled());
-		local_irq_disable();
-		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
-		local_irq_enable();
-	}
-
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), &info);
+	flush_tlb_mm_common(&info, cpu);
 
 	put_cpu();
 }
@@ -515,7 +522,6 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
  */
 void arch_tlbbatch_flush_one_mm(struct mm_struct *mm)
 {
-	int cpu;
 	struct flush_tlb_info info = {
 		.mm = mm,
 		.new_tlb_gen = atomic64_read(&mm->context.tlb_gen),
@@ -523,17 +529,7 @@ void arch_tlbbatch_flush_one_mm(struct mm_struct *mm)
 		.end = TLB_FLUSH_ALL,
 	};
 
-	cpu = get_cpu();
-
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
-		VM_WARN_ON(irqs_disabled());
-		local_irq_disable();
-		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
-		local_irq_enable();
-	}
-
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), &info);
+	flush_tlb_mm_common(&info, get_cpu());
 
 	put_cpu();
 }

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-15 15:55                                 ` Mel Gorman
@ 2017-07-15 16:41                                   ` Andy Lutomirski
  2017-07-17  7:49                                     ` Mel Gorman
  2017-07-18 21:28                                   ` Nadav Amit
  1 sibling, 1 reply; 70+ messages in thread
From: Andy Lutomirski @ 2017-07-15 16:41 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Nadav Amit, Andy Lutomirski, open list:MEMORY MANAGEMENT

On Sat, Jul 15, 2017 at 8:55 AM, Mel Gorman <mgorman@suse.de> wrote:
> The patch looks fine to me but when writing the patch, I wondered why the
> original code disabled preemption before inc_mm_tlb_gen. I didn't spot
> the reason for it but given the importance of properly synchronising with
> switch_mm, I played it safe. However, this should be ok on top and
> maintain the existing sequences

LGTM.  You could also fold it into your patch or even put it before
your patch, too.

FWIW, I didn't have any real reason to inc_mm_tlb_gen() with
preemption disabled.  I think I did it because the code it replaced
was also called with preemption off.  That being said, it's
effectively a single instruction, so it barely matters latency-wise.
(Hmm.  Would there be a performance downside if a thread got preempted
between inc_mm_tlb_gen() and doing the flush?  It could arbitrarily
delay the IPIs, which would give a big window for something else to
flush and maybe make our IPIs unnecessary.  Whether that's a win or a
loss isn't so clear to me.)
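The reason a delayed or duplicate flush is cheap is the tlb_gen comparison:
a flush request snapshots the mm's generation and the target CPU only does
real work if its local copy is behind. A rough userspace sketch, where
`mm_tlb_gen` and `cpu_tlb_gen` are simplified stand-ins for the atomic
mm->context.tlb_gen and the per-cpu context generation:

```c
#include <stdint.h>

static uint64_t mm_tlb_gen;      /* atomic64 in the kernel */
static uint64_t cpu_tlb_gen;     /* per-cpu context tlb_gen */
static int real_flushes;         /* actual invalidations performed */

/* Model of the flush callback: only flush if this CPU is behind
 * the generation the request was issued against. */
static void flush_tlb_func(uint64_t new_tlb_gen)
{
    if (cpu_tlb_gen < new_tlb_gen) {
        real_flushes++;
        cpu_tlb_gen = new_tlb_gen;
    }
    /* else: the IPI still arrived, but the flush itself is avoided */
}

/* A delayed or duplicate request for the same generation is a no-op:
 * only the first callback does the invalidation. */
static int duplicate_flush_is_noop(void)
{
    uint64_t gen = ++mm_tlb_gen; /* inc_mm_tlb_gen() */

    flush_tlb_func(gen);         /* first IPI does the work */
    flush_tlb_func(gen);         /* later one finds nothing to do */

    return real_flushes;
}
```

So if the thread is preempted between bumping the generation and sending
the IPIs, and something else flushes first, the eventual IPIs still cost
their delivery but the invalidation itself is skipped.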


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-15 16:41                                   ` Andy Lutomirski
@ 2017-07-17  7:49                                     ` Mel Gorman
  0 siblings, 0 replies; 70+ messages in thread
From: Mel Gorman @ 2017-07-17  7:49 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Nadav Amit, open list:MEMORY MANAGEMENT

On Sat, Jul 15, 2017 at 09:41:35AM -0700, Andrew Lutomirski wrote:
> On Sat, Jul 15, 2017 at 8:55 AM, Mel Gorman <mgorman@suse.de> wrote:
> > The patch looks fine to me but when writing the patch, I wondered why the
> > original code disabled preemption before inc_mm_tlb_gen. I didn't spot
> > the reason for it but given the importance of properly synchronising with
> > switch_mm, I played it safe. However, this should be ok on top and
> > maintain the existing sequences
> 
> LGTM.  You could also fold it into your patch or even put it before
> your patch, too.
> 

Thanks.

> FWIW, I didn't have any real reason to inc_mm_tlb_gen() with
> preemption disabled.  I think I did it because the code it replaced
> was also called with preemption off.  That being said, it's
> effectively a single instruction, so it barely matters latency-wise.
> (Hmm.  Would there be a performance downside if a thread got preempted
> between inc_mm_tlb_gen() and doing the flush? 

There isn't a preemption point until irqs are disabled/enabled for the
local TLB flush so it doesn't really matter.
It can still be preempted by an interrupt but that's not surprising. I
don't think it matters that much either way so I'll leave it at it is.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-15 15:55                                 ` Mel Gorman
  2017-07-15 16:41                                   ` Andy Lutomirski
@ 2017-07-18 21:28                                   ` Nadav Amit
  2017-07-19  7:41                                     ` Mel Gorman
  1 sibling, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-18 21:28 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Fri, Jul 14, 2017 at 04:16:44PM -0700, Nadav Amit wrote:
>> Mel Gorman <mgorman@suse.de> wrote:
>> 
>>> On Wed, Jul 12, 2017 at 04:27:23PM -0700, Nadav Amit wrote:
>>>>> If reclaim is first, it'll take the PTL, set batched while a racing
>>>>> mprotect/munmap/etc spins. On release, the racing mprotect/munmap
>>>>> immediately calls flush_tlb_batched_pending() before proceeding as normal,
>>>>> finding pte_none with the TLB flushed.
>>>> 
>>>> This is the scenario I regarded in my example. Notice that when the first
>>>> flush_tlb_batched_pending is called, CPU0 and CPU1 hold different page-table
>>>> locks - allowing them to run concurrently. As a result
>>>> flush_tlb_batched_pending is executed before the PTE was cleared and
>>>> mm->tlb_flush_batched is cleared. Later, after CPU0 runs ptep_get_and_clear
>>>> mm->tlb_flush_batched remains clear, and CPU1 can use the stale PTE.
>>> 
>>> If they hold different PTL locks, it means that reclaim and the parallel
>>> munmap/mprotect/madvise/mremap operation are operating on different regions
>>> of an mm or separate mm's and the race should not apply or at the very
>>> least is equivalent to not batching the flushes. For multiple parallel
>>> operations, munmap/mprotect/mremap are serialised by mmap_sem so there
>>> is only one risky operation at a time. For multiple madvise, there is a
>>> small window when a page is accessible after madvise returns but it is an
>>> advisory call so it's primarily a data integrity concern and the TLB is
>>> flushed before the page is either freed or IO starts on the reclaim side.
>> 
>> I think there is some miscommunication. Perhaps one detail was missing:
>> 
>> CPU0				CPU1
>> ---- 				----
>> should_defer_flush
>> => mm->tlb_flush_batched=true		
>> 				flush_tlb_batched_pending (another PT)
>> 				=> flush TLB
>> 				=> mm->tlb_flush_batched=false
>> 
>> 				Access PTE (and cache in TLB)
>> ptep_get_and_clear(PTE)
>> ...
>> 
>> 				flush_tlb_batched_pending (batched PT)
>> 				[ no flush since tlb_flush_batched=false ]
>> 				use the stale PTE
>> ...
>> try_to_unmap_flush
>> 
>> There are only 2 CPUs and both regard the same address-space. CPU0 reclaims a
>> page from this address-space. Just between setting tlb_flush_batched and the
>> actual clearing of the PTE, the process on CPU1 runs munmap and calls
>> flush_tlb_batched_pending. This can happen if CPU1 operates on a different
>> page-table.
> 
> If both regard the same address-space then they have the same page table so
> there is a disconnect between the first and last sentence in your paragraph
> above. On CPU 0, the setting of tlb_flush_batched and ptep_get_and_clear
> is also reversed as the sequence is
> 
>                        pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>                        set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> 
> Additional barriers should not be needed as within the critical section
> that can race, it's protected by the lock and with Andy's code, there is
> a full barrier before the setting of tlb_flush_batched. With Andy's code,
> there may be a need for a compiler barrier but I can rethink about that
> and add it during the backport to -stable if necessary.
> 
> So the setting happens after the clear and if they share the same address
> space and collide then they both share the same PTL so are protected from
> each other.
> 
> If there are separate address spaces using a shared mapping then the
> same race does not occur.

I missed the fact you reverted the two operations since the previous version
of the patch. This specific scenario should be solved with this patch.

But in general, I think there is a need for a simple locking scheme.
Otherwise, people (like me) would be afraid to make any changes to the code,
and additional missing TLB flushes would exist. For example, I suspect that
a user may trigger insert_pfn() or insert_page(), and rely on their output.
While it makes little sense, the user can try to insert the page on the same
address of another page. If the other page was already reclaimed the
operation should succeed and otherwise fail. But it may succeed while the
other page is going through reclamation, resulting in:

CPU0					CPU1
----					----				
					ptep_clear_flush_notify()
- access memory using a PTE
[ PTE cached in TLB ]
					try_to_unmap_one()
					==> ptep_get_and_clear() == false
insert_page()
==> pte_none() = true
    [retval = 0]

- access memory using a stale PTE


Additional potential situations can be caused, IIUC, by mcopy_atomic_pte(),
mfill_zeropage_pte(), shmem_mcopy_atomic_pte().

Even more importantly, I suspect there is an additional similar but
unrelated problem. clear_refs_write() can be used with CLEAR_REFS_SOFT_DIRTY
to write-protect PTEs. However, it batches TLB flushes, while only holding
mmap_sem for read, and without any indication in mm that TLB flushes are
pending.

As a result, concurrent operation such as KSM’s write_protect_page() or
page_mkclean_one() can consider the page write-protected while in fact it is
still accessible - since the TLB flush was deferred. As a result, they may
mishandle the PTE without flushing the page. In the case of
page_mkclean_one(), I suspect it may even lead to memory corruption. I admit
that in x86 there are some mitigating factors that would make such “attack”
complicated, but it still seems wrong to me, no?

Thanks,
Nadav

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-18 21:28                                   ` Nadav Amit
@ 2017-07-19  7:41                                     ` Mel Gorman
  2017-07-19 19:41                                       ` Nadav Amit
  0 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-19  7:41 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

On Tue, Jul 18, 2017 at 02:28:27PM -0700, Nadav Amit wrote:
> > If there are separate address spaces using a shared mapping then the
> > same race does not occur.
> 
> I missed the fact you reverted the two operations since the previous version
> of the patch. This specific scenario should be solved with this patch.
> 
> But in general, I think there is a need for a simple locking scheme.

Such as?

> Otherwise, people (like me) would be afraid to make any changes to the code,
> and additional missing TLB flushes would exist. For example, I suspect that
> a user may trigger insert_pfn() or insert_page(), and rely on their output.

That API is for device drivers to insert pages (which may not be RAM)
directly into userspace and the pages are not on the LRU so not subject
to the same races.

> While it makes little sense, the user can try to insert the page on the same
> address of another page.

Even if a driver was dumb enough to do so, the second insert should fail
on a !pte_none() test.

> If the other page was already reclaimed the
> operation should succeed and otherwise fail. But it may succeed while the
> other page is going through reclamation, resulting in:
>  

It doesn't go through reclaim as the page isn't on the LRU until the last
mmap or the driver frees the page.

> CPU0					CPU1
> ----					----				
> 					ptep_clear_flush_notify()
> - access memory using a PTE
> [ PTE cached in TLB ]
> 					try_to_unmap_one()
> 					==> ptep_get_and_clear() == false
> insert_page()
> ==> pte_none() = true
>     [retval = 0]
> 
> - access memory using a stale PTE

That race assumes that the page was on the LRU and the VMAs in question
are VM_MIXEDMAP or VM_PFNMAP. If the region is unmapped and a new mapping
put in place, the last patch ensures the region is flushed.

> Additional potential situations can be caused, IIUC, by mcopy_atomic_pte(),
> mfill_zeropage_pte(), shmem_mcopy_atomic_pte().
> 

I didn't dig into the exact locking for userfaultfd because largely it
doesn't matter. The operations are copy operations which means that any
stale TLB is being used to read data only. If the page is reclaimed then a
fault is raised. If data is read for a short duration before the TLB flush
then it still doesn't matter because there is no data integrity issue. The
TLB will be flushed if an operation occurs that could leak the wrong data.

> Even more importantly, I suspect there is an additional similar but
> unrelated problem. clear_refs_write() can be used with CLEAR_REFS_SOFT_DIRTY
> to write-protect PTEs. However, it batches TLB flushes, while only holding
> mmap_sem for read, and without any indication in mm that TLB flushes are
> pending.
> 

Again, consider whether there is a data integrity issue. A TLB entry existing
after an unmap is not in itself dangerous. There is always some degree of
race between when a PTE is unmapped and the IPIs for the flush are delivered.

> As a result, concurrent operation such as KSM’s write_protect_page() or

write_protect_page operates under the page lock and cannot race with reclaim.

> page_mkclean_one() can consider the page write-protected while in fact it is
> still accessible - since the TLB flush was deferred.

As long as it's flushed before any IO occurs that would lose a data update,
it's not a data integrity issue.

> As a result, they may
> mishandle the PTE without flushing the page. In the case of
> page_mkclean_one(), I suspect it may even lead to memory corruption. I admit
> that in x86 there are some mitigating factors that would make such “attack”
> complicated, but it still seems wrong to me, no?
> 

I worry that you're beginning to see races everywhere. I admit that the
rules and protections here are varied and complex but it's worth keeping
in mind that data integrity is the key concern (no false reads to wrong
data, no lost writes) and the first race you identified found some problems
here. However, with or without batching, there is always a delay between
when a PTE is cleared and when the TLB entries are removed.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-19  7:41                                     ` Mel Gorman
@ 2017-07-19 19:41                                       ` Nadav Amit
  2017-07-19 19:58                                         ` Mel Gorman
  0 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-19 19:41 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Jul 18, 2017 at 02:28:27PM -0700, Nadav Amit wrote:
>>> If there are separate address spaces using a shared mapping then the
>>> same race does not occur.
>> 
>> I missed the fact you reverted the two operations since the previous version
>> of the patch. This specific scenario should be solved with this patch.
>> 
>> But in general, I think there is a need for a simple locking scheme.
> 
> Such as?

Something like:

bool is_potentially_stale_pte(pte_t pte, pgprot_t prot, int lock_state);

which would get the current PTE, the protection bits that the user is
interested in, and whether mmap_sem is taken read/write/none. 

It would return whether this PTE may be potentially stale and needs to be
invalidated. Obviously, any code that removes protection or unmaps need to
be updated for this information to be correct.

[snip]

>> As a result, concurrent operation such as KSM’s write_protect_page() or
> 
> write_protect_page operates under the page lock and cannot race with reclaim.

I still do not understand this claim. IIUC, reclaim can unmap the page in
some page table, decide not to reclaim the page and release the page-lock
before flush.

>> page_mkclean_one() can consider the page write-protected while in fact it is
>> still accessible - since the TLB flush was deferred.
> 
> As long as it's flushed before any IO occurs that would lose a data update,
> it's not a data integrity issue.
> 
>> As a result, they may
>> mishandle the PTE without flushing the page. In the case of
>> page_mkclean_one(), I suspect it may even lead to memory corruption. I admit
>> that in x86 there are some mitigating factors that would make such “attack”
>> complicated, but it still seems wrong to me, no?
> 
> I worry that you're beginning to see races everywhere. I admit that the
> rules and protections here are varied and complex but it's worth keeping
> in mind that data integrity is the key concern (no false reads to wrong
> data, no lost writes) and the first race you identified found some problems
> here. However, with or without batching, there is always a delay between
> when a PTE is cleared and when the TLB entries are removed.

Sure, but usually the delay occurs while the page-table lock is taken so
there is no race.

Now, it is not fair to call me a paranoid, considering that these races are
real - I confirmed that at least two can happen in practice. There are many
possibilities for concurrent TLB batching and you cannot expect developers
to consider all of them. I don’t think many people are capable of doing the
voodoo tricks of avoiding a TLB flush if the page-lock is taken or the VMA
is anonymous. I doubt that these tricks work and anyhow IMHO they are likely
to fail in the future since they are undocumented and complicated.

As for “data integrity is the key concern” - violating the memory management
API can cause data integrity issues for programs. It may not cause the OS to
crash, but it should not be acceptable either, and may potentially raise
security concerns. If you think that the current behavior is ok, let the
documentation and man pages clarify that mprotect may not protect, madvise
may not advise and so on.

And although you would use it against me, I would say: Nobody knew that TLB
flushing could be so complicated.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-19 19:41                                       ` Nadav Amit
@ 2017-07-19 19:58                                         ` Mel Gorman
  2017-07-19 20:20                                           ` Nadav Amit
  0 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-19 19:58 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

On Wed, Jul 19, 2017 at 12:41:01PM -0700, Nadav Amit wrote:
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Tue, Jul 18, 2017 at 02:28:27PM -0700, Nadav Amit wrote:
> >>> If there are separate address spaces using a shared mapping then the
> >>> same race does not occur.
> >> 
> >> I missed the fact you reverted the two operations since the previous version
> >> of the patch. This specific scenario should be solved with this patch.
> >> 
> >> But in general, I think there is a need for a simple locking scheme.
> > 
> > Such as?
> 
> Something like:
> 
> bool is_potentially_stale_pte(pte_t pte, pgprot_t prot, int lock_state);
> 
> which would get the current PTE, the protection bits that the user is
> interested in, and whether mmap_sem is taken read/write/none. 
> 

>From a PTE you cannot know the state of mmap_sem because you can rmap
back to multiple mm's for shared mappings. It's also fairly heavy handed.
Technically, you could lock on the basis of the VMA but that has other
consequences for scalability. The staleness is also a factor because
it's a case of "does the staleness matter". Sometimes it does, sometimes
it doesn't.  mmap_sem even if it could be used does not always tell us
the right information either because it can matter whether we are racing
against a userspace reference or a kernel operation.

It's possible your idea could be made work, but right now I'm not seeing a
solution that handles every corner case. I asked to hear what your ideas
were because anything I thought of that could batch TLB flushing in the
general case had flaws that did not improve over what is already there.

> [snip]
> 
> >> As a result, concurrent operation such as KSM’s write_protect_page() or
> > 
> > write_protect_page operates under the page lock and cannot race with reclaim.
> 
> I still do not understand this claim. IIUC, reclaim can unmap the page in
> some page table, decide not to reclaim the page and release the page-lock
> before flush.
> 

shrink_page_list is the caller of try_to_unmap in reclaim context. It
has this check

                if (!trylock_page(page))
                        goto keep;

For pages it cannot lock, they get put back on the LRU and recycled instead
of reclaimed. Hence, if KSM or anything else holds the page lock, reclaim
can't unmap it.

> >> page_mkclean_one() can consider the page write-protected while in fact it is
> >> still accessible - since the TLB flush was deferred.
> > 
> > As long as it's flushed before any IO occurs that would lose a data update,
> > it's not a data integrity issue.
> > 
> >> As a result, they may
> >> mishandle the PTE without flushing the page. In the case of
> >> page_mkclean_one(), I suspect it may even lead to memory corruption. I admit
> >> that in x86 there are some mitigating factors that would make such “attack”
> >> complicated, but it still seems wrong to me, no?
> > 
> > I worry that you're beginning to see races everywhere. I admit that the
> > rules and protections here are varied and complex but it's worth keeping
> > in mind that data integrity is the key concern (no false reads to wrong
> > data, no lost writes) and the first race you identified found some problems
> > here. However, with or without batching, there is always a delay between
> > when a PTE is cleared and when the TLB entries are removed.
> 
> Sure, but usually the delay occurs while the page-table lock is taken so
> there is no race.
> 
> Now, it is not fair to call me a paranoid, considering that these races are
> real - I confirmed that at least two can happen in practice.

It's less an accusation of paranoia and more a caution that the fact that
pte_clear_flush is not atomic means that it can be difficult to find what
races matter and what ones don't.

> As for “data integrity is the key concern” - violating the memory management
> API can cause data integrity issues for programs.

The madvise one should be fixed too. It could also be "fixed" by
removing all batching but the performance cost will be sufficiently high
that there will be pressure to find an alternative.

> It may not cause the OS to
> crash, but it should not be acceptable either, and may potentially raise
> security concerns. If you think that the current behavior is ok, let the
> documentation and man pages clarify that mprotect may not protect, madvise
> may not advise and so on.
> 

The madvise one should be fixed, not least because it allows a case
whereby userspace thinks it has initialised a structure that is actually
in a page that is freed after a TLB is flushed resulting in a lost
write. It wouldn't cause any issues with shared or file-backed mappings
but it is a problem for anonymous.

> And although you would use it against me, I would say: Nobody knew that TLB
> flushing could be so complicated.
> 

There is no question that the area is complicated.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-19 19:58                                         ` Mel Gorman
@ 2017-07-19 20:20                                           ` Nadav Amit
  2017-07-19 21:47                                             ` Mel Gorman
  0 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-19 20:20 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Wed, Jul 19, 2017 at 12:41:01PM -0700, Nadav Amit wrote:
>> Mel Gorman <mgorman@suse.de> wrote:
>> 
>>> On Tue, Jul 18, 2017 at 02:28:27PM -0700, Nadav Amit wrote:
>>>>> If there are separate address spaces using a shared mapping then the
>>>>> same race does not occur.
>>>> 
>>>> I missed the fact you reverted the two operations since the previous version
>>>> of the patch. This specific scenario should be solved with this patch.
>>>> 
>>>> But in general, I think there is a need for a simple locking scheme.
>>> 
>>> Such as?
>> 
>> Something like:
>> 
>> bool is_potentially_stale_pte(pte_t pte, pgprot_t prot, int lock_state);
>> 
>> which would get the current PTE, the protection bits that the user is
>> interested in, and whether mmap_sem is taken read/write/none.
> 
> From a PTE you cannot know the state of mmap_sem because you can rmap
> back to multiple mm's for shared mappings. It's also fairly heavy handed.
> Technically, you could lock on the basis of the VMA but that has other
> consequences for scalability. The staleness is also a factor because
> it's a case of "does the staleness matter". Sometimes it does, sometimes
> it doesn't.  mmap_sem even if it could be used does not always tell us
> the right information either because it can matter whether we are racing
> against a userspace reference or a kernel operation.
> 
> It's possible your idea could be made work, but right now I'm not seeing a
> solution that handles every corner case. I asked to hear what your ideas
> were because anything I thought of that could batch TLB flushing in the
> general case had flaws that did not improve over what is already there.

I don’t disagree with what you say - perhaps my scheme is too simplistic.
But the bottom line, if you cannot form simple rules for when TLB needs to
be flushed, what are the chances others would get it right?

>> [snip]
>> 
>>>> As a result, concurrent operation such as KSM’s write_protect_page() or
>>> 
>>> write_protect_page operates under the page lock and cannot race with reclaim.
>> 
>> I still do not understand this claim. IIUC, reclaim can unmap the page in
>> some page table, decide not to reclaim the page and release the page-lock
>> before flush.
> 
> shrink_page_list is the caller of try_to_unmap in reclaim context. It
> has this check
> 
>                if (!trylock_page(page))
>                        goto keep;
> 
> For pages it cannot lock, they get put back on the LRU and recycled instead
> of reclaimed. Hence, if KSM or anything else holds the page lock, reclaim
> can't unmap it.

Yes, of course, since KSM does not batch TLB flushes. I regarded the other
direction - first try_to_unmap() removes the PTE (but still does not flush),
unlocks the page, and then KSM acquires the page lock and calls
write_protect_page(). It finds out the PTE is not present and does not flush
the TLB.

>>>> page_mkclean_one() can consider the page write-protected while in fact it is
>>>> still accessible - since the TLB flush was deferred.
>>> 
>>> As long as it's flushed before any IO occurs that would lose a data update,
>>> it's not a data integrity issue.
>>> 
>>>> As a result, they may
>>>> mishandle the PTE without flushing the page. In the case of
>>>> page_mkclean_one(), I suspect it may even lead to memory corruption. I admit
>>>> that in x86 there are some mitigating factors that would make such “attack”
>>>> complicated, but it still seems wrong to me, no?
>>> 
>>> I worry that you're beginning to see races everywhere. I admit that the
>>> rules and protections here are varied and complex but it's worth keeping
>>> in mind that data integrity is the key concern (no false reads to wrong
>>> data, no lost writes) and the first race you identified found some problems
>>> here. However, with or without batching, there is always a delay between
>>> when a PTE is cleared and when the TLB entries are removed.
>> 
>> Sure, but usually the delay occurs while the page-table lock is taken so
>> there is no race.
>> 
>> Now, it is not fair to call me a paranoid, considering that these races are
>> real - I confirmed that at least two can happen in practice.
> 
> It's less an accusation of paranoia and more a caution that the fact that
> pte_clear_flush is not atomic means that it can be difficult to find what
> races matter and what ones don't.
> 
>> As for “data integrity is the key concern” - violating the memory management
>> API can cause data integrity issues for programs.
> 
> The madvise one should be fixed too. It could also be "fixed" by
> removing all batching but the performance cost will be sufficiently high
> that there will be pressure to find an alternative.
> 
>> It may not cause the OS to
>> crash, but it should not be acceptable either, and may potentially raise
>> security concerns. If you think that the current behavior is ok, let the
>> documentation and man pages clarify that mprotect may not protect, madvise
>> may not advise and so on.
> 
> The madvise one should be fixed, not least because it allows a case
> whereby userspace thinks it has initialised a structure that is actually
> in a page that is freed after a TLB is flushed resulting in a lost
> write. It wouldn't cause any issues with shared or file-backed mappings
> but it is a problem for anonymous.
> 
>> And although you would use it against me, I would say: Nobody knew that TLB
>> flushing could be so complicated.
> 
> There is no question that the area is complicated.

My comment was actually an unfunny joke... Never mind.

Thanks,
Nadav

p.s.: Thanks for your patience.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-19 20:20                                           ` Nadav Amit
@ 2017-07-19 21:47                                             ` Mel Gorman
  2017-07-19 22:19                                               ` Nadav Amit
  0 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-19 21:47 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

On Wed, Jul 19, 2017 at 01:20:01PM -0700, Nadav Amit wrote:
> > From a PTE you cannot know the state of mmap_sem because you can rmap
> > back to multiple mm's for shared mappings. It's also fairly heavy handed.
> > Technically, you could lock on the basis of the VMA but that has other
> > consequences for scalability. The staleness is also a factor because
> > it's a case of "does the staleness matter". Sometimes it does, sometimes
> > it doesn't.  mmap_sem even if it could be used does not always tell us
> > the right information either because it can matter whether we are racing
> > against a userspace reference or a kernel operation.
> > 
> > It's possible your idea could be made work, but right now I'm not seeing a
> > solution that handles every corner case. I asked to hear what your ideas
> > were because anything I thought of that could batch TLB flushing in the
> > general case had flaws that did not improve over what is already there.
> 
> I don’t disagree with what you say - perhaps my scheme is too simplistic.
> But the bottom line, if you cannot form simple rules for when TLB needs to
> be flushed, what are the chances others would get it right?
> 

Broad rule is "flush before the page is freed/reallocated for clean pages
or any IO is initiated for dirty pages" with a lot of details that are not
documented. Often it's the PTL and flush with it held that protects the
majority of cases but it's not universal as the page lock and mmap_sem
play important roles depending on the context and AFAIK, that's also
not documented.

> > shrink_page_list is the caller of try_to_unmap in reclaim context. It
> > has this check
> > 
> >                if (!trylock_page(page))
> >                        goto keep;
> > 
> > For pages it cannot lock, they get put back on the LRU and recycled instead
> > of reclaimed. Hence, if KSM or anything else holds the page lock, reclaim
> > can't unmap it.
> 
> Yes, of course, since KSM does not batch TLB flushes. I regarded the other
> direction - first try_to_unmap() removes the PTE (but still does not flush),
> unlocks the page, and then KSM acquires the page lock and calls
> write_protect_page(). It finds out the PTE is not present and does not flush
> the TLB.
> 

When KSM acquires the page lock, it then acquires the PTL where the
cleared PTE is observed directly and skipped.

> > There is no question that the area is complicated.
> 
> My comment was actually an unfunny joke... Never mind.
> 
> Thanks,
> Nadav
> 
> p.s.: Thanks for your patience.
> 

No need for thanks. As you pointed out yourself, you have been
identifying races.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-19 21:47                                             ` Mel Gorman
@ 2017-07-19 22:19                                               ` Nadav Amit
  2017-07-19 22:59                                                 ` Mel Gorman
  0 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-19 22:19 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Wed, Jul 19, 2017 at 01:20:01PM -0700, Nadav Amit wrote:
>>> From a PTE you cannot know the state of mmap_sem because you can rmap
>>> back to multiple mm's for shared mappings. It's also fairly heavy handed.
>>> Technically, you could lock on the basis of the VMA but that has other
>>> consequences for scalability. The staleness is also a factor because
>>> it's a case of "does the staleness matter". Sometimes it does, sometimes
>>> it doesn't.  mmap_sem even if it could be used does not always tell us
>>> the right information either because it can matter whether we are racing
>>> against a userspace reference or a kernel operation.
>>> 
>>> It's possible your idea could be made work, but right now I'm not seeing a
>>> solution that handles every corner case. I asked to hear what your ideas
>>> were because anything I thought of that could batch TLB flushing in the
>>> general case had flaws that did not improve over what is already there.
>> 
>> I don’t disagree with what you say - perhaps my scheme is too simplistic.
>> But the bottom line, if you cannot form simple rules for when TLB needs to
>> be flushed, what are the chances others would get it right?
> 
> Broad rule is "flush before the page is freed/reallocated for clean pages
> or any IO is initiated for dirty pages" with a lot of details that are not
> documented. Often it's the PTL and flush with it held that protects the
> majority of cases but it's not universal as the page lock and mmap_sem
> play important roles depending on the context and AFAIK, that's also
> not documented.
> 
>>> shrink_page_list is the caller of try_to_unmap in reclaim context. It
>>> has this check
>>> 
>>>               if (!trylock_page(page))
>>>                       goto keep;
>>> 
>>> For pages it cannot lock, they get put back on the LRU and recycled instead
>>> of reclaimed. Hence, if KSM or anything else holds the page lock, reclaim
>>> can't unmap it.
>> 
>> Yes, of course, since KSM does not batch TLB flushes. I regarded the other
>> direction - first try_to_unmap() removes the PTE (but still does not flush),
>> unlocks the page, and then KSM acquires the page lock and calls
>> write_protect_page(). It finds out the PTE is not present and does not flush
>> the TLB.
> 
> When KSM acquires the page lock, it then acquires the PTL where the
> cleared PTE is observed directly and skipped.

I don’t see why. Let’s try again - CPU0 reclaims while CPU1 deduplicates:

CPU0				CPU1
----				----
shrink_page_list()

=> try_to_unmap()
==> try_to_unmap_one()
[ unmaps from some page-tables ]

[ try_to_unmap returns false;
  page not reclaimed ]

=> keep_locked: unlock_page()

[ TLB flush deferred ]
				try_to_merge_one_page()
				=> trylock_page()
				=> write_protect_page()
				==> acquire ptl
				  [ PTE non-present —> no PTE change
				    and no flush ]
				==> release ptl
				==> replace_page()


At this point, while replace_page() is running, CPU0 may still not have
flushed the TLBs. Another CPU (CPU2) may hold a stale PTE, which is not
write-protected. It can therefore write to that page while replace_page() is
running, resulting in memory corruption.

No?


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-19 22:19                                               ` Nadav Amit
@ 2017-07-19 22:59                                                 ` Mel Gorman
  2017-07-19 23:39                                                   ` Nadav Amit
  0 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-19 22:59 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

On Wed, Jul 19, 2017 at 03:19:00PM -0700, Nadav Amit wrote:
> >> Yes, of course, since KSM does not batch TLB flushes. I regarded the other
> >> direction - first try_to_unmap() removes the PTE (but still does not flush),
> >> unlocks the page, and then KSM acquires the page lock and calls
> >> write_protect_page(). It finds out the PTE is not present and does not flush
> >> the TLB.
> > 
> > When KSM acquires the page lock, it then acquires the PTL where the
> > cleared PTE is observed directly and skipped.
> 
> I don’t see why. Let’s try again - CPU0 reclaims while CPU1 deduplicates:
> 
> CPU0				CPU1
> ----				----
> shrink_page_list()
> 
> => try_to_unmap()
> ==> try_to_unmap_one()
> [ unmaps from some page-tables ]
> 
> [ try_to_unmap returns false;
>   page not reclaimed ]
> 
> => keep_locked: unlock_page()
> 
> [ TLB flush deferred ]
> 				try_to_merge_one_page()
> 				=> trylock_page()
> 				=> write_protect_page()
> 				==> acquire ptl
> 				  [ PTE non-present —> no PTE change
> 				    and no flush ]
> 				==> release ptl
> 				==> replace_page()
> 
> 
> At this point, while replace_page() is running, CPU0 may still not have
> flushed the TLBs. Another CPU (CPU2) may hold a stale PTE, which is not
> write-protected. It can therefore write to that page while replace_page() is
> running, resulting in memory corruption.
> 
> No?
> 

KSM is not my strong point so it's reaching the point where others more
familiar with that code need to be involved.

If try_to_unmap returns false on CPU0 then at least one unmap attempt
failed and the page is not reclaimed. For those that were unmapped, they
will get flushed in the near future. When KSM operates on CPU1, it'll skip
the unmapped pages under the PTL, so stale TLB entries are not relevant:
the mapped entries are still pointing to a valid page and ksm misses a
merge opportunity. If it write-protects a page, ksm unconditionally flushes
the TLB on clearing the PTE, so again, there is no stale entry anywhere.
CPU2 will either reference a PTE that was unmapped, in which case it'll
fault once CPU0 flushes the TLB; until then it's safe to read and write as
long as the TLB is flushed before the page is freed or IO is initiated,
which reclaim already handles. If CPU2 references a page that was still
mapped, then it'll be fine until KSM unmaps and flushes the page before
going further, so any reference after KSM starts the critical operation
will trap a fault.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-19 22:59                                                 ` Mel Gorman
@ 2017-07-19 23:39                                                   ` Nadav Amit
  2017-07-20  7:43                                                     ` Mel Gorman
  0 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-19 23:39 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Wed, Jul 19, 2017 at 03:19:00PM -0700, Nadav Amit wrote:
>>>> Yes, of course, since KSM does not batch TLB flushes. I regarded the other
>>>> direction - first try_to_unmap() removes the PTE (but still does not flush),
>>>> unlocks the page, and then KSM acquires the page lock and calls
>>>> write_protect_page(). It finds out the PTE is not present and does not flush
>>>> the TLB.
>>> 
>>> When KSM acquires the page lock, it then acquires the PTL where the
>>> cleared PTE is observed directly and skipped.
>> 
>> I don’t see why. Let’s try again - CPU0 reclaims while CPU1 deduplicates:
>> 
>> CPU0				CPU1
>> ----				----
>> shrink_page_list()
>> 
>> => try_to_unmap()
>> ==> try_to_unmap_one()
>> [ unmaps from some page-tables ]
>> 
>> [ try_to_unmap returns false;
>>  page not reclaimed ]
>> 
>> => keep_locked: unlock_page()
>> 
>> [ TLB flush deferred ]
>> 				try_to_merge_one_page()
>> 				=> trylock_page()
>> 				=> write_protect_page()
>> 				==> acquire ptl
>> 				  [ PTE non-present —> no PTE change
>> 				    and no flush ]
>> 				==> release ptl
>> 				==> replace_page()
>> 
>> 
>> At this point, while replace_page() is running, CPU0 may still not have
>> flushed the TLBs. Another CPU (CPU2) may hold a stale PTE, which is not
>> write-protected. It can therefore write to that page while replace_page() is
>> running, resulting in memory corruption.
>> 
>> No?
> 
> KSM is not my strong point so it's reaching the point where others more
> familiar with that code need to be involved.

Do not assume for a second that I really know what is going on over there.

> If try_to_unmap returns false on CPU0 then at least one unmap attempt
> failed and the page is not reclaimed.

Actually, try_to_unmap() may even return true, and the page would still not
be reclaimed - for example if page_has_private() and freeing the buffers
fails. In this case, the page would be unlocked as well.

> For those that were unmapped, they
> will get flushed in the near future. When KSM operates on CPU1, it'll skip
> the unmapped pages under the PTL so stale TLB entries are not relevant as
> the mapped entries are still pointing to a valid page and ksm misses a merge
> opportunity.

This is the case I regarded, but I do not understand your point. The whole
problem is that CPU1 would skip the unmapped pages under the PTL. As it
skips them it does not flush them from the TLB. And as a result,
replace_page() may happen before the TLB is flushed by CPU0.

> If it write protects a page, ksm unconditionally flushes the PTE
> on clearing the PTE so again, there is no stale entry anywhere. For CPU2,
> it'll either reference a PTE that was unmapped in which case it'll fault
> once CPU0 flushes the TLB and until then it's safe to read and write as
> long as the TLB is flushed before the page is freed or IO is initiated which
> reclaim already handles.

In my scenario the page is not freed and there is no I/O in the reclaim
path. The TLB flush of CPU0 in my scenario is just deferred while the
page-table lock is not held. As I mentioned before, this time-period can be
potentially very long in a virtual machine. CPU2 referenced a PTE that
was unmapped by CPU0 (reclaim path) but not CPU1 (ksm path).

ksm, IIUC, would not expect modifications of the page during replace_page.
Eventually it would flush the TLB (after changing the PTE to point to the
deduplicated page). But in the meantime, another CPU may use stale PTEs for
writes, and those writes would be lost after the page is deduplicated.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-19 23:39                                                   ` Nadav Amit
@ 2017-07-20  7:43                                                     ` Mel Gorman
  2017-07-22  1:19                                                       ` Nadav Amit
  0 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-20  7:43 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

On Wed, Jul 19, 2017 at 04:39:07PM -0700, Nadav Amit wrote:
> > If try_to_unmap returns false on CPU0 then at least one unmap attempt
> > failed and the page is not reclaimed.
> 
> Actually, try_to_unmap() may even return true, and the page would still not
> be reclaimed - for example if page_has_private() and freeing the buffers
> fails. In this case, the page would be unlocked as well.
> 

I'm not seeing the relevance from the perspective of a stale TLB being
used to corrupt memory or access the wrong data.

> > For those that were unmapped, they
> > will get flushed in the near future. When KSM operates on CPU1, it'll skip
> > the unmapped pages under the PTL so stale TLB entries are not relevant as
> > the mapped entries are still pointing to a valid page and ksm misses a merge
> > opportunity.
> 
> This is the case I regarded, but I do not understand your point. The whole
> problem is that CPU1 would skip the unmapped pages under the PTL. As it
> skips them it does not flush them from the TLB. And as a result,
> replace_page() may happen before the TLB is flushed by CPU0.
> 

At the time of the unlock_page on the reclaim side, any unmapping that is
going to happen has already taken place; only the flush is deferred. If KSM
starts between the unlock_page and the TLB flush, then it'll skip any of
the PTEs that were previously unmapped with stale entries, so there is no
relevant stale TLB entry to work with.

> > If it write protects a page, ksm unconditionally flushes the PTE
> > on clearing the PTE so again, there is no stale entry anywhere. For CPU2,
> > it'll either reference a PTE that was unmapped in which case it'll fault
> > once CPU0 flushes the TLB and until then it's safe to read and write as
> > long as the TLB is flushed before the page is freed or IO is initiated which
> > reclaim already handles.
> 
> In my scenario the page is not freed and there is no I/O in the reclaim
> path. The TLB flush of CPU0 in my scenario is just deferred while the
> page-table lock is not held. As I mentioned before, this time-period can be
> potentially very long in a virtual machine. CPU2 referenced a PTE that
> was unmapped by CPU0 (reclaim path) but not CPU1 (ksm path).
> 
> ksm, IIUC, would not expect modifications of the page during replace_page.

Indeed not, but it'll either find no PTE, in which case it won't allow a
stale PTE entry to exist, or, when it finds a PTE, it flushes the TLB
unconditionally to avoid any writes taking place. It holds the page lock
while setting up the sharing, so no parallel fault can reinsert the page
and no parallel writes can take place that would result in false sharing.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-20  7:43                                                     ` Mel Gorman
@ 2017-07-22  1:19                                                       ` Nadav Amit
  2017-07-24  9:58                                                         ` Mel Gorman
  0 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-22  1:19 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Wed, Jul 19, 2017 at 04:39:07PM -0700, Nadav Amit wrote:
>>> If try_to_unmap returns false on CPU0 then at least one unmap attempt
>>> failed and the page is not reclaimed.
>> 
>> Actually, try_to_unmap() may even return true, and the page would still not
>> be reclaimed - for example if page_has_private() and freeing the buffers
>> fails. In this case, the page would be unlocked as well.
> 
> I'm not seeing the relevance from the perspective of a stale TLB being
> used to corrupt memory or access the wrong data.
> 
>>> For those that were unmapped, they
>>> will get flushed in the near future. When KSM operates on CPU1, it'll skip
>>> the unmapped pages under the PTL so stale TLB entries are not relevant as
>>> the mapped entries are still pointing to a valid page and ksm misses a merge
>>> opportunity.
>> 
>> This is the case I regarded, but I do not understand your point. The whole
>> problem is that CPU1 would skip the unmapped pages under the PTL. As it
>> skips them it does not flush them from the TLB. And as a result,
>> replace_page() may happen before the TLB is flushed by CPU0.
> 
> At the time of the unlock_page on the reclaim side, any unmapping that
> will happen before the flush has taken place. If KSM starts between the
> unlock_page and the tlb flush then it'll skip any of the PTEs that were
> previously unmapped with stale entries so there is no relevant stale TLB
> entry to work with.

I don’t see where this skipping happens, but let’s put this scenario aside
for a second. Here is a similar scenario that causes memory corruption. I
actually created and tested it (although I needed to hack the kernel to add
some artificial latency before the actual flushes and before the actual
deduplication by KSM).

We are going to cause KSM to deduplicate a page, and after page comparison
but before the page is actually replaced, to use a stale PTE entry to 
overwrite the page. As a result KSM will lose a write, causing memory
corruption.

For this race we need 4 CPUs:

CPU0: Caches a writable and dirty PTE entry, and uses the stale value for
write later.

CPU1: Runs madvise_free on the range that includes the PTE. It would clear
the dirty-bit. It batches TLB flushes.

CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty. We
care about the fact that it clears the PTE write-bit, and of course, batches
TLB flushes.

CPU3: Runs KSM. Our purpose is to pass the following test in
write_protect_page():

	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))

Since it will avoid TLB flush. And we want to do it while the PTE is stale.
Later, and before replacing the page, we would be able to change the page.

Note that all the operations CPU1-3 perform can happen in parallel, since
they only acquire mmap_sem for read.

We start with two identical pages. Everything below regards the same
page/PTE.

CPU0		CPU1		CPU2		CPU3
----		----		----		----
Write the same
value on page

[cache PTE as
 dirty in TLB]

		MADV_FREE
		pte_mkclean()
							
				4 > clear_refs
				pte_wrprotect()

						write_protect_page()
						[ success, no flush ]

						pages_identical()
						[ ok ]

Write to page
different value

[Ok, using stale
 PTE]

						replace_page()


Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0
already wrote on the page, but KSM ignored this write, and it got lost.

Now to reiterate my point: it is really hard to get TLB batching right
without some clear policy. And it matters, since such issues can cause
memory corruption and have security implications (if somebody manages to
get the timing right).

Regards,
Nadav

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-22  1:19                                                       ` Nadav Amit
@ 2017-07-24  9:58                                                         ` Mel Gorman
  2017-07-24 19:46                                                           ` Nadav Amit
  2017-07-25  7:37                                                           ` Minchan Kim
  0 siblings, 2 replies; 70+ messages in thread
From: Mel Gorman @ 2017-07-24  9:58 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Andy Lutomirski, Minchan Kim, open list:MEMORY MANAGEMENT

On Fri, Jul 21, 2017 at 06:19:22PM -0700, Nadav Amit wrote:
> > At the time of the unlock_page on the reclaim side, any unmapping that
> > will happen before the flush has taken place. If KSM starts between the
> > unlock_page and the tlb flush then it'll skip any of the PTEs that were
> > previously unmapped with stale entries so there is no relevant stale TLB
> > entry to work with.
> 
> I don’t see where this skipping happens, but let’s put this scenario aside
> for a second. Here is a similar scenario that causes memory corruption. I
> actually created and tested it (although I needed to hack the kernel to add
> some artificial latency before the actual flushes and before the actual
> deduplication by KSM).
> 
> We are going to cause KSM to deduplicate a page, and after page comparison
> but before the page is actually replaced, to use a stale PTE entry to 
> overwrite the page. As a result KSM will lose a write, causing memory
> corruption.
> 
> For this race we need 4 CPUs:
> 
> CPU0: Caches a writable and dirty PTE entry, and uses the stale value for
> write later.
> 
> CPU1: Runs madvise_free on the range that includes the PTE. It would clear
> the dirty-bit. It batches TLB flushes.
> 
> CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty. We
> care about the fact that it clears the PTE write-bit, and of course, batches
> TLB flushes.
> 
> CPU3: Runs KSM. Our purpose is to pass the following test in
> write_protect_page():
> 
> 	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
> 	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))
> 
> Since it will avoid TLB flush. And we want to do it while the PTE is stale.
> Later, and before replacing the page, we would be able to change the page.
> 
> Note that all the operations CPU1-3 perform can happen in parallel since
> they only acquire mmap_sem for read.
> 
> We start with two identical pages. Everything below regards the same
> page/PTE.
> 
> CPU0		CPU1		CPU2		CPU3
> ----		----		----		----
> Write the same
> value on page
> 
> [cache PTE as
>  dirty in TLB]
> 
> 		MADV_FREE
> 		pte_mkclean()
> 							
> 				4 > clear_refs
> 				pte_wrprotect()
> 
> 						write_protect_page()
> 						[ success, no flush ]
> 
> 						pages_identical()
> 						[ ok ]
> 
> Write to page
> different value
> 
> [Ok, using stale
>  PTE]
> 
> 						replace_page()
> 
> 
> Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0
> already wrote on the page, but KSM ignored this write, and it got lost.
> 

Ok, as you say you have reproduced this with corruption, I would suggest
one path for dealing with it although you'll need to pass it by the
original authors.

When unmapping ranges, there is a check for dirty PTEs in
zap_pte_range() that forces a flush for dirty PTEs which aims to avoid
writable stale PTEs from CPU0 in a scenario like you laid out above.

madvise_free misses a similar class of check so I'm adding Minchan Kim
to the cc as the original author of much of that code. Minchan Kim will
need to confirm but it appears that two modifications would be required.
The first should pass in the mmu_gather structure to
madvise_free_pte_range (at minimum) and force flush the TLB under the
PTL if a dirty PTE is encountered. The second is that it should consider
flushing the full affected range as madvise_free holds mmap_sem for
read-only to avoid problems with two parallel madv_free operations. The
second is optional because there are other ways it could also be handled
that may have lower overhead.

Soft dirty page handling may need similar protections.

> Now to reiterate my point: It is really hard to get TLB batching right
> without some clear policy. And it should be important, since such issues can
> cause memory corruption and have security implications (if somebody manages
> to get the timing right).
> 

Basically it comes down to when batching TLB flushes, care must be taken
when dealing with dirty PTEs that writable TLB entries do not leak data. The
reclaim TLB batching *should* still be ok as it allows stale entries to exist
but only up until the point where IO is queued to prevent data being
lost. I'm not aware of this being formally documented in the past. It's
possible that you could extend the mmu_gather API to track that state
and handle it properly in the general case so as long as someone uses
that API properly that they'll be protected.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-24  9:58                                                         ` Mel Gorman
@ 2017-07-24 19:46                                                           ` Nadav Amit
  2017-07-25  7:37                                                           ` Minchan Kim
  1 sibling, 0 replies; 70+ messages in thread
From: Nadav Amit @ 2017-07-24 19:46 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andy Lutomirski, Minchan Kim, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Fri, Jul 21, 2017 at 06:19:22PM -0700, Nadav Amit wrote:
>>> At the time of the unlock_page on the reclaim side, any unmapping that
>>> will happen before the flush has taken place. If KSM starts between the
>>> unlock_page and the tlb flush then it'll skip any of the PTEs that were
>>> previously unmapped with stale entries so there is no relevant stale TLB
>>> entry to work with.
>> 
>> I don’t see where this skipping happens, but let’s put this scenario aside
>> for a second. Here is a similar scenario that causes memory corruption. I
>> actually created and tested it (although I needed to hack the kernel to add
>> some artificial latency before the actual flushes and before the actual
>> deduplication by KSM).
>> 
>> We are going to cause KSM to deduplicate a page, and after page comparison
>> but before the page is actually replaced, to use a stale PTE entry to 
>> overwrite the page. As a result KSM will lose a write, causing memory
>> corruption.
>> 
>> For this race we need 4 CPUs:
>> 
>> CPU0: Caches a writable and dirty PTE entry, and uses the stale value for
>> write later.
>> 
>> CPU1: Runs madvise_free on the range that includes the PTE. It would clear
>> the dirty-bit. It batches TLB flushes.
>> 
>> CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty. We
>> care about the fact that it clears the PTE write-bit, and of course, batches
>> TLB flushes.
>> 
>> CPU3: Runs KSM. Our purpose is to pass the following test in
>> write_protect_page():
>> 
>> 	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
>> 	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))
>> 
>> Since it will avoid TLB flush. And we want to do it while the PTE is stale.
>> Later, and before replacing the page, we would be able to change the page.
>> 
>> Note that all the operations CPU1-3 perform can happen in parallel since
>> they only acquire mmap_sem for read.
>> 
>> We start with two identical pages. Everything below regards the same
>> page/PTE.
>> 
>> CPU0		CPU1		CPU2		CPU3
>> ----		----		----		----
>> Write the same
>> value on page
>> 
>> [cache PTE as
>> dirty in TLB]
>> 
>> 		MADV_FREE
>> 		pte_mkclean()
>> 							
>> 				4 > clear_refs
>> 				pte_wrprotect()
>> 
>> 						write_protect_page()
>> 						[ success, no flush ]
>> 
>> 						pages_identical()
>> 						[ ok ]
>> 
>> Write to page
>> different value
>> 
>> [Ok, using stale
>> PTE]
>> 
>> 						replace_page()
>> 
>> 
>> Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0
>> already wrote on the page, but KSM ignored this write, and it got lost.
> 
> Ok, as you say you have reproduced this with corruption, I would suggest
> one path for dealing with it although you'll need to pass it by the
> original authors.
> 
> When unmapping ranges, there is a check for dirty PTEs in
> zap_pte_range() that forces a flush for dirty PTEs which aims to avoid
> writable stale PTEs from CPU0 in a scenario like you laid out above.
> 
> madvise_free misses a similar class of check so I'm adding Minchan Kim
> to the cc as the original author of much of that code. Minchan Kim will
> need to confirm but it appears that two modifications would be required.
> The first should pass in the mmu_gather structure to
> madvise_free_pte_range (at minimum) and force flush the TLB under the
> PTL if a dirty PTE is encountered. The second is that it should consider
> flushing the full affected range as madvise_free holds mmap_sem for
> read-only to avoid problems with two parallel madv_free operations. The
> second is optional because there are other ways it could also be handled
> that may have lower overhead.
> 
> Soft dirty page handling may need similar protections.

The problem, in my mind, is that KSM conditionally invalidates the PTEs
despite potentially pending flushes. Forcing flushes under the ptl instead
of batching may have some significant performance impact.

BTW: let me know if you need my PoC.

> 
>> Now to reiterate my point: It is really hard to get TLB batching right
>> without some clear policy. And it should be important, since such issues can
>> cause memory corruption and have security implications (if somebody manages
>> to get the timing right).
> 
> Basically it comes down to when batching TLB flushes, care must be taken
> when dealing with dirty PTEs that writable TLB entries do not leak data. The
> reclaim TLB batching *should* still be ok as it allows stale entries to exist
> but only up until the point where IO is queued to prevent data being
> lost. I'm not aware of this being formally documented in the past. It's
> possible that you could extend the mmu_gather API to track that state
> and handle it properly in the general case so as long as someone uses
> that API properly that they'll be protected.

I had a brief look at FreeBSD. Basically, AFAIU, the scheme is that if there
are any pending invalidations of the address space, they must be carried
out before related operations finish. It is similar to what I proposed
before: increase a “pending flush” counter for the mm when updating the
entries, and update a “done flush” counter once the invalidation is done.
When the kernel makes a decision or a conditional flush based on a PTE
value, it needs to wait for the pending flushes to finish. Obviously, such
a scheme can be further refined.

Thanks again,
Nadav

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-24  9:58                                                         ` Mel Gorman
  2017-07-24 19:46                                                           ` Nadav Amit
@ 2017-07-25  7:37                                                           ` Minchan Kim
  2017-07-25  8:51                                                             ` Mel Gorman
  1 sibling, 1 reply; 70+ messages in thread
From: Minchan Kim @ 2017-07-25  7:37 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Nadav Amit, Andy Lutomirski, open list:MEMORY MANAGEMENT

Hi Mel,

On Mon, Jul 24, 2017 at 10:58:32AM +0100, Mel Gorman wrote:
> On Fri, Jul 21, 2017 at 06:19:22PM -0700, Nadav Amit wrote:
> > > At the time of the unlock_page on the reclaim side, any unmapping that
> > > will happen before the flush has taken place. If KSM starts between the
> > > unlock_page and the tlb flush then it'll skip any of the PTEs that were
> > > previously unmapped with stale entries so there is no relevant stale TLB
> > > entry to work with.
> > 
> > I don’t see where this skipping happens, but let’s put this scenario aside
> > for a second. Here is a similar scenario that causes memory corruption. I
> > actually created and tested it (although I needed to hack the kernel to add
> > some artificial latency before the actual flushes and before the actual
> > deduplication by KSM).
> > 
> > We are going to cause KSM to deduplicate a page, and after page comparison
> > but before the page is actually replaced, to use a stale PTE entry to 
> > overwrite the page. As a result KSM will lose a write, causing memory
> > corruption.
> > 
> > For this race we need 4 CPUs:
> > 
> > CPU0: Caches a writable and dirty PTE entry, and uses the stale value for
> > write later.
> > 
> > CPU1: Runs madvise_free on the range that includes the PTE. It would clear
> > the dirty-bit. It batches TLB flushes.
> > 
> > CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty. We
> > care about the fact that it clears the PTE write-bit, and of course, batches
> > TLB flushes.
> > 
> > CPU3: Runs KSM. Our purpose is to pass the following test in
> > write_protect_page():
> > 
> > 	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
> > 	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))
> > 
> > Since it will avoid TLB flush. And we want to do it while the PTE is stale.
> > Later, and before replacing the page, we would be able to change the page.
> > 
> > Note that all the operations CPU1-3 perform can happen in parallel since
> > they only acquire mmap_sem for read.
> > 
> > We start with two identical pages. Everything below regards the same
> > page/PTE.
> > 
> > CPU0		CPU1		CPU2		CPU3
> > ----		----		----		----
> > Write the same
> > value on page
> > 
> > [cache PTE as
> >  dirty in TLB]
> > 
> > 		MADV_FREE
> > 		pte_mkclean()
> > 							
> > 				4 > clear_refs
> > 				pte_wrprotect()
> > 
> > 						write_protect_page()
> > 						[ success, no flush ]
> > 
> > 						pages_identical()
> > 						[ ok ]
> > 
> > Write to page
> > different value
> > 
> > [Ok, using stale
> >  PTE]
> > 
> > 						replace_page()
> > 
> > 
> > Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0
> > already wrote on the page, but KSM ignored this write, and it got lost.
> > 
> 
> Ok, as you say you have reproduced this with corruption, I would suggest
> one path for dealing with it although you'll need to pass it by the
> original authors.
> 
> When unmapping ranges, there is a check for dirty PTEs in
> zap_pte_range() that forces a flush for dirty PTEs which aims to avoid
> writable stale PTEs from CPU0 in a scenario like you laid out above.
> 
> madvise_free misses a similar class of check so I'm adding Minchan Kim
> to the cc as the original author of much of that code. Minchan Kim will
> need to confirm but it appears that two modifications would be required.
> The first should pass in the mmu_gather structure to
> madvise_free_pte_range (at minimum) and force flush the TLB under the
> PTL if a dirty PTE is encountered. The second is that it should consider

OTL: I couldn't read this lengthy discussion, so I may have missed something.

About MADV_FREE, I do not understand why it should flush the TLB in the
MADV_FREE context. MADV_FREE's semantics allow "write (i.e., dirty)", so if
another thread running in parallel with a stale PTE does a "store" to make
the PTE dirty, it's okay, since try_to_unmap_one in shrink_page_list
catches the dirty bit.

In the above example, I think KSM should flush the TLB, not MADV_FREE and
the soft-dirty page handler.

Maybe, I miss something clear, Could you explain it in detail?

> flushing the full affected range as madvise_free holds mmap_sem for
> read-only to avoid problems with two parallel madv_free operations. The
> second is optional because there are other ways it could also be handled
> that may have lower overhead.

Ditto. I do not understand: why do two parallel MADV_FREE operations have a problem?


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-25  7:37                                                           ` Minchan Kim
@ 2017-07-25  8:51                                                             ` Mel Gorman
  2017-07-25  9:11                                                               ` Minchan Kim
  0 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-25  8:51 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Nadav Amit, Andy Lutomirski, open list:MEMORY MANAGEMENT

On Tue, Jul 25, 2017 at 04:37:48PM +0900, Minchan Kim wrote:
> > Ok, as you say you have reproduced this with corruption, I would suggest
> > one path for dealing with it although you'll need to pass it by the
> > original authors.
> > 
> > When unmapping ranges, there is a check for dirty PTEs in
> > zap_pte_range() that forces a flush for dirty PTEs which aims to avoid
> > writable stale PTEs from CPU0 in a scenario like you laid out above.
> > 
> > madvise_free misses a similar class of check so I'm adding Minchan Kim
> > to the cc as the original author of much of that code. Minchan Kim will
> > need to confirm but it appears that two modifications would be required.
> > The first should pass in the mmu_gather structure to
> > madvise_free_pte_range (at minimum) and force flush the TLB under the
> > PTL if a dirty PTE is encountered. The second is that it should consider
> 
> OTL: I couldn't read this lengthy discussion so I may miss something.
> 
> About MADV_FREE, I do not understand why it should flush TLB in MADV_FREE
> context. MADV_FREE's semantic allows "write(ie, dirty)" so if other thread
> in parallel which has stale pte does "store" to make the pte dirty,
> it's okay since try_to_unmap_one in shrink_page_list catches the dirty.
> 

In try_to_unmap_one it's fine. It's not necessarily fine in KSM. Given
that the key is that data corruption is avoided, you could argue with a
comment that madv_free doesn't necessarily have to flush it as long as
KSM does even if it's clean due to batching.
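A minimal sketch of what that KSM-side condition could look like (a toy user-space model, not kernel code: `mm_tlb_flush_pending()` is a real kernel helper, but the structures and the `ksm_needs_flush()` function below are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the kernel types (invented for illustration). */
struct mm_struct { bool tlb_flush_pending; };
struct pte { bool writable, dirty; };

static bool mm_tlb_flush_pending(const struct mm_struct *mm)
{
	return mm->tlb_flush_pending;
}

/*
 * write_protect_page() would flush not only for writable or dirty PTEs,
 * but also whenever a batched flush is pending: a PTE that looks clean
 * and read-only may still be backed by a stale writable TLB entry left
 * behind by a deferred (e.g. madv_free) flush.
 */
static bool ksm_needs_flush(const struct pte *pte, const struct mm_struct *mm)
{
	return pte->writable || pte->dirty || mm_tlb_flush_pending(mm);
}
```

With this, a clean read-only PTE still forces a flush while a batch is in flight, which is exactly the madv_free window discussed here.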

> In above example, I think KSM should flush the TLB, not MADV_FREE and
> soft dirty page handler.
> 

That would also be acceptable.

> > flushing the full affected range as madvise_free holds mmap_sem for
> > read-only to avoid problems with two parallel madv_free operations. The
> > second is optional because there are other ways it could also be handled
> > that may have lower overhead.
> 
> Ditto. I cannot understand. Why do two parallel MADV_FREE operations have a problem?
> 

Like madvise(), madv_free can potentially return with a stale PTE visible
to the caller that observed a pte_none at the time of madv_free and uses
a stale PTE that potentially allows a lost write. It's debatable whether
this matters considering that madv_free to a region means that parallel
writers can lose their update anyway. It's less of a concern than the
KSM angle outlined in Nadav's example which he was able to artificially
reproduce by slowing operations to increase the race window.

-- 
Mel Gorman
SUSE Labs



* Re: Potential race in TLB flush batching?
  2017-07-25  8:51                                                             ` Mel Gorman
@ 2017-07-25  9:11                                                               ` Minchan Kim
  2017-07-25 10:10                                                                 ` Mel Gorman
  0 siblings, 1 reply; 70+ messages in thread
From: Minchan Kim @ 2017-07-25  9:11 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Nadav Amit, Andy Lutomirski, open list:MEMORY MANAGEMENT

On Tue, Jul 25, 2017 at 09:51:32AM +0100, Mel Gorman wrote:
> On Tue, Jul 25, 2017 at 04:37:48PM +0900, Minchan Kim wrote:
> > > Ok, as you say you have reproduced this with corruption, I would suggest
> > > one path for dealing with it although you'll need to pass it by the
> > > original authors.
> > > 
> > > When unmapping ranges, there is a check for dirty PTEs in
> > > zap_pte_range() that forces a flush for dirty PTEs which aims to avoid
> > > writable stale PTEs from CPU0 in a scenario like you laid out above.
> > > 
> > > madvise_free misses a similar class of check so I'm adding Minchan Kim
> > > to the cc as the original author of much of that code. Minchan Kim will
> > > need to confirm but it appears that two modifications would be required.
> > > The first should pass in the mmu_gather structure to
> > > madvise_free_pte_range (at minimum) and force flush the TLB under the
> > > PTL if a dirty PTE is encountered. The second is that it should consider
> > 
> > OTL: I couldn't read this lengthy discussion so I may miss something.
> > 
> > About MADV_FREE, I do not understand why it should flush TLB in MADV_FREE
> > context. MADV_FREE's semantic allows "write(ie, dirty)" so if other thread
> > in parallel which has stale pte does "store" to make the pte dirty,
> > it's okay since try_to_unmap_one in shrink_page_list catches the dirty.
> > 
> 
> In try_to_unmap_one it's fine. It's not necessarily fine in KSM. Given
> that the key is that data corruption is avoided, you could argue with a
> comment that madv_free doesn't necessarily have to flush it as long as
> KSM does even if it's clean due to batching.

Yes, I think it should be done on the side that has the concern.
Maybe mm_struct can carry a flag which indicates that someone is
doing TLB batching, and then the KSM side can flush based on that flag.
It would reduce unnecessary flushing.

> 
> > In above example, I think KSM should flush the TLB, not MADV_FREE and
> > soft dirty page handler.
> > 
> 
> That would also be acceptable.
> 
> > > flushing the full affected range as madvise_free holds mmap_sem for
> > > read-only to avoid problems with two parallel madv_free operations. The
> > > second is optional because there are other ways it could also be handled
> > > that may have lower overhead.
> > 
> > Ditto. I cannot understand. Why do two parallel MADV_FREE operations have a problem?
> > 
> 
> Like madvise(), madv_free can potentially return with a stale PTE visible
> to the caller that observed a pte_none at the time of madv_free and uses
> a stale PTE that potentially allows a lost write. It's debatable whether

That is the part I cannot understand.
How does it lose "the write"? MADV_FREE doesn't discard the memory, so
the write should eventually be done sometime.
Could you tell me more?

Thanks.

> this matters considering that madv_free to a region means that parallel
> writers can lose their update anyway. It's less of a concern than the
> KSM angle outlined in Nadav's example which he was able to artificially
> reproduce by slowing operations to increase the race window.
> 
> -- 
> Mel Gorman
> SUSE Labs



* Re: Potential race in TLB flush batching?
  2017-07-25  9:11                                                               ` Minchan Kim
@ 2017-07-25 10:10                                                                 ` Mel Gorman
  2017-07-26  5:43                                                                   ` Minchan Kim
  0 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-25 10:10 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Nadav Amit, Andy Lutomirski, open list:MEMORY MANAGEMENT

On Tue, Jul 25, 2017 at 06:11:15PM +0900, Minchan Kim wrote:
> On Tue, Jul 25, 2017 at 09:51:32AM +0100, Mel Gorman wrote:
> > On Tue, Jul 25, 2017 at 04:37:48PM +0900, Minchan Kim wrote:
> > > > Ok, as you say you have reproduced this with corruption, I would suggest
> > > > one path for dealing with it although you'll need to pass it by the
> > > > original authors.
> > > > 
> > > > When unmapping ranges, there is a check for dirty PTEs in
> > > > zap_pte_range() that forces a flush for dirty PTEs which aims to avoid
> > > > writable stale PTEs from CPU0 in a scenario like you laid out above.
> > > > 
> > > > madvise_free misses a similar class of check so I'm adding Minchan Kim
> > > > to the cc as the original author of much of that code. Minchan Kim will
> > > > need to confirm but it appears that two modifications would be required.
> > > > The first should pass in the mmu_gather structure to
> > > > madvise_free_pte_range (at minimum) and force flush the TLB under the
> > > > PTL if a dirty PTE is encountered. The second is that it should consider
> > > 
> > > OTL: I couldn't read this lengthy discussion so I may miss something.
> > > 
> > > About MADV_FREE, I do not understand why it should flush TLB in MADV_FREE
> > > context. MADV_FREE's semantic allows "write(ie, dirty)" so if other thread
> > > in parallel which has stale pte does "store" to make the pte dirty,
> > > it's okay since try_to_unmap_one in shrink_page_list catches the dirty.
> > > 
> > 
> > In try_to_unmap_one it's fine. It's not necessarily fine in KSM. Given
> > that the key is that data corruption is avoided, you could argue with a
> > comment that madv_free doesn't necessarily have to flush it as long as
> > KSM does even if it's clean due to batching.
> 
> Yes, I think it should be done on the side that has the concern.
> Maybe, mm_struct can carry a flag which indicates someone is
> doing the TLB batching and then KSM side can flush it by the flag.
> It would reduce unnecessary flushing.
> 

If you're confident that it's only necessary on the KSM side to avoid the
problem then I'm ok with that. Update KSM in that case with a comment
explaining the madv_free race and why the flush is unconditionally
necessary. madv_free only came up because it was a critical part of having
KSM miss a TLB flush.

> > Like madvise(), madv_free can potentially return with a stale PTE visible
> > to the caller that observed a pte_none at the time of madv_free and uses
> > a stale PTE that potentially allows a lost write. It's debatable whether
> 
> That is the part I cannot understand.
> How does it lose "the write"? MADV_FREE doesn't discard the memory so
> finally, the write should be done sometime.
> Could you tell me more?
> 

I'm relying on the fact you are the madv_free author to determine if
it's really necessary. The race in question is CPU 0 running madv_free
and updating some PTEs while CPU 1 is also running madv_free and looking
at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
the pte_dirty check (because CPU 0 has updated it already) and potentially
fail to flush. Hence, when madv_free on CPU 1 returns, there are still
potentially writable TLB entries and the underlying PTE is still present
so that a subsequent write does not necessarily propagate the dirty bit
to the underlying PTE any more. Reclaim at some unknown time at the future
may then see that the PTE is still clean and discard the page even though
a write has happened in the meantime. I think this is possible but I could
have missed some protection in madv_free that prevents it happening.
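The interleaving can be modeled in user space (a toy sketch: the structures, helper names, and the cached-dirty-bit behavior below are simplified stand-ins for the real PTE/TLB hardware state, invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy models of a PTE and CPU 1's cached TLB entry. "dirty_cached"
 * stands for the dirty bit the hardware keeps in the TLB entry, which
 * lets later writes skip the page-table walk entirely. */
struct pte { bool present, writable, dirty; };
struct tlb_entry { bool valid, writable, dirty_cached; };

/* Returns true if the interleaving ends with a modified page behind a
 * clean PTE, i.e. reclaim would discard the write. */
static bool write_lost(bool cpu1_flushes_anyway)
{
	struct pte pte = { true, true, true };		/* page written earlier */
	struct tlb_entry cpu1 = { true, true, true };
	bool page_modified = false;

	/* CPU 0: madv_free sees pte_dirty(), clears it, batches a flush. */
	pte.dirty = false;

	/* CPU 1: madv_free on the same PTE. pte_dirty() is now false, so
	 * the dirty check does not force a flush of CPU 1's stale entry. */
	if (cpu1_flushes_anyway)
		cpu1.valid = false;

	/* CPU 1's caller writes to the page after madv_free returns. */
	if (cpu1.valid) {
		page_modified = true;
		if (!cpu1.dirty_cached)
			pte.dirty = true;	/* a walk would re-dirty */
	} else {
		page_modified = true;		/* fault path walks the PTE */
		pte.dirty = true;
	}

	/* Reclaim later: a clean PTE makes the page look unmodified. */
	return page_modified && !pte.dirty;
}
```

Without the flush, the write is silently discardable; forcing the flush makes the subsequent write take the fault path and re-dirty the PTE.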

-- 
Mel Gorman
SUSE Labs



* Re: Potential race in TLB flush batching?
  2017-07-25 10:10                                                                 ` Mel Gorman
@ 2017-07-26  5:43                                                                   ` Minchan Kim
  2017-07-26  9:22                                                                     ` Mel Gorman
  0 siblings, 1 reply; 70+ messages in thread
From: Minchan Kim @ 2017-07-26  5:43 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Nadav Amit, Andy Lutomirski, open list:MEMORY MANAGEMENT

On Tue, Jul 25, 2017 at 11:10:06AM +0100, Mel Gorman wrote:
> On Tue, Jul 25, 2017 at 06:11:15PM +0900, Minchan Kim wrote:
> > On Tue, Jul 25, 2017 at 09:51:32AM +0100, Mel Gorman wrote:
> > > On Tue, Jul 25, 2017 at 04:37:48PM +0900, Minchan Kim wrote:
> > > > > Ok, as you say you have reproduced this with corruption, I would suggest
> > > > > one path for dealing with it although you'll need to pass it by the
> > > > > original authors.
> > > > > 
> > > > > When unmapping ranges, there is a check for dirty PTEs in
> > > > > zap_pte_range() that forces a flush for dirty PTEs which aims to avoid
> > > > > writable stale PTEs from CPU0 in a scenario like you laid out above.
> > > > > 
> > > > > madvise_free misses a similar class of check so I'm adding Minchan Kim
> > > > > to the cc as the original author of much of that code. Minchan Kim will
> > > > > need to confirm but it appears that two modifications would be required.
> > > > > The first should pass in the mmu_gather structure to
> > > > > madvise_free_pte_range (at minimum) and force flush the TLB under the
> > > > > PTL if a dirty PTE is encountered. The second is that it should consider
> > > > 
> > > > OTL: I couldn't read this lengthy discussion so I may miss something.
> > > > 
> > > > About MADV_FREE, I do not understand why it should flush TLB in MADV_FREE
> > > > context. MADV_FREE's semantic allows "write(ie, dirty)" so if other thread
> > > > in parallel which has stale pte does "store" to make the pte dirty,
> > > > it's okay since try_to_unmap_one in shrink_page_list catches the dirty.
> > > > 
> > > 
> > > In try_to_unmap_one it's fine. It's not necessarily fine in KSM. Given
> > > that the key is that data corruption is avoided, you could argue with a
> > > comment that madv_free doesn't necessarily have to flush it as long as
> > > KSM does even if it's clean due to batching.
> > 
> > Yes, I think it should be done on the side that has the concern.
> > Maybe, mm_struct can carry a flag which indicates someone is
> > doing the TLB batching and then KSM side can flush it by the flag.
> > It would reduce unnecessary flushing.
> > 
> 
> If you're confident that it's only necessary on the KSM side to avoid the
> problem then I'm ok with that. Update KSM in that case with a comment
> explaining the madv_free race and why the flush is unconditionally
> necessary. madv_free only came up because it was a critical part of having
> KSM miss a TLB flush.
> 
> > > Like madvise(), madv_free can potentially return with a stale PTE visible
> > > to the caller that observed a pte_none at the time of madv_free and uses
> > > a stale PTE that potentially allows a lost write. It's debatable whether
> > 
> > That is the part I cannot understand.
> > How does it lose "the write"? MADV_FREE doesn't discard the memory so
> > finally, the write should be done sometime.
> > Could you tell me more?
> > 
> 
> I'm relying on the fact you are the madv_free author to determine if
> it's really necessary. The race in question is CPU 0 running madv_free
> and updating some PTEs while CPU 1 is also running madv_free and looking
> at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
> the pte_dirty check (because CPU 0 has updated it already) and potentially
> fail to flush. Hence, when madv_free on CPU 1 returns, there are still
> potentially writable TLB entries and the underlying PTE is still present
> so that a subsequent write does not necessarily propagate the dirty bit
> to the underlying PTE any more. Reclaim at some unknown time at the future
> may then see that the PTE is still clean and discard the page even though
> a write has happened in the meantime. I think this is possible but I could
> have missed some protection in madv_free that prevents it happening.

Thanks for the detail. You didn't miss anything. It can happen, and then
it's really a bug. IOW, if the application does write something after madv_free,
it must see the written value, not zero.

How about adding [set|clear]_tlb_flush_pending to the TLB batching interface?
With it, when tlb_finish_mmu is called, we can know that we skipped the flush
while a flush is pending, so we flush forcefully to avoid the madv_dontneed
as well as the madv_free scenario.

Also, KSM can know about it through mm_tlb_flush_pending?
If it's acceptable, we need to look into soft-dirty to use
[set|clear]_tlb_flush_pending or the TLB gathering API.

To show my intention:

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 8afa4335e5b2..fffd4d86d0c4 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -113,7 +113,7 @@ struct mmu_gather {
 #define HAVE_GENERIC_MMU_GATHER
 
 void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end);
-void tlb_flush_mmu(struct mmu_gather *tlb);
+bool tlb_flush_mmu(struct mmu_gather *tlb);
 void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start,
 							unsigned long end);
 extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
diff --git a/mm/ksm.c b/mm/ksm.c
index 4dc92f138786..0fbbd5d234d5 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1037,8 +1037,9 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 	if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?"))
 		goto out_unlock;
 
-	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
-	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) {
+	if ((pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
+	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) ||
+		mm_tlb_flush_pending(mm)) {
 		pte_t entry;
 
 		swapped = PageSwapCache(page);
diff --git a/mm/memory.c b/mm/memory.c
index ea9f28e44b81..d5c5e6497c70 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -239,12 +239,13 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 	tlb->page_size = 0;
 
 	__tlb_reset_range(tlb);
+	set_tlb_flush_pending(tlb->mm);
 }
 
-static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
+static bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
 	if (!tlb->end)
-		return;
+		return false;
 
 	tlb_flush(tlb);
 	mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
@@ -252,6 +253,7 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 	tlb_table_flush(tlb);
 #endif
 	__tlb_reset_range(tlb);
+	return true;
 }
 
 static void tlb_flush_mmu_free(struct mmu_gather *tlb)
@@ -265,10 +267,16 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 	tlb->active = &tlb->local;
 }
 
-void tlb_flush_mmu(struct mmu_gather *tlb)
+/*
+ * returns true if tlb flush really happens
+ */
+bool tlb_flush_mmu(struct mmu_gather *tlb)
 {
-	tlb_flush_mmu_tlbonly(tlb);
+	bool ret;
+
+	ret = tlb_flush_mmu_tlbonly(tlb);
 	tlb_flush_mmu_free(tlb);
+	return ret;
 }
 
 /* tlb_finish_mmu
@@ -278,8 +286,11 @@ void tlb_flush_mmu(struct mmu_gather *tlb)
 void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
 	struct mmu_gather_batch *batch, *next;
+	bool flushed = tlb_flush_mmu(tlb);
 
-	tlb_flush_mmu(tlb);
+	clear_tlb_flush_pending(tlb->mm);
+	if (!flushed && mm_tlb_flush_pending(tlb->mm))
+		flush_tlb_mm_range(tlb->mm, start, end, 0UL);
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();


> 
> -- 
> Mel Gorman
> SUSE Labs



* Re: Potential race in TLB flush batching?
  2017-07-26  5:43                                                                   ` Minchan Kim
@ 2017-07-26  9:22                                                                     ` Mel Gorman
  2017-07-26 19:18                                                                       ` Nadav Amit
  2017-07-26 23:44                                                                       ` Minchan Kim
  0 siblings, 2 replies; 70+ messages in thread
From: Mel Gorman @ 2017-07-26  9:22 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Nadav Amit, Andy Lutomirski, open list:MEMORY MANAGEMENT

On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote:
> > I'm relying on the fact you are the madv_free author to determine if
> > it's really necessary. The race in question is CPU 0 running madv_free
> > and updating some PTEs while CPU 1 is also running madv_free and looking
> > at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
> > the pte_dirty check (because CPU 0 has updated it already) and potentially
> > fail to flush. Hence, when madv_free on CPU 1 returns, there are still
> > potentially writable TLB entries and the underlying PTE is still present
> > so that a subsequent write does not necessarily propagate the dirty bit
> > to the underlying PTE any more. Reclaim at some unknown time at the future
> > may then see that the PTE is still clean and discard the page even though
> > a write has happened in the meantime. I think this is possible but I could
> > have missed some protection in madv_free that prevents it happening.
> 
> Thanks for the detail. You didn't miss anything. It can happen and then
> it's really a bug. IOW, if application does write something after madv_free,
> it must see the written value, not zero.
> 
> How about adding [set|clear]_tlb_flush_pending in the TLB batching interface?
> With it, when tlb_finish_mmu is called, we can know we skip the flush
> but there is a pending flush, so we flush forcefully to avoid madv_dontneed
> as well as the madv_free scenario.
> 

I *think* this is ok as it's simply more expensive on the KSM side in
the event of a race, but no other harmful change is made, assuming that
KSM is the only race-prone path. The check for mm_tlb_flush_pending also
happens under the PTL so there should be sufficient protection from the
mm struct update being visible at the right time.

Check using the test program from "mm: Always flush VMA ranges affected
by zap_page_range v2" if it handles the madvise case as well as that
would give some degree of safety. Make sure it's tested against 4.13-rc2
instead of mmotm which already includes the madv_dontneed fix. If yours
works for both then it supersedes the mmotm patch.

It would also be interesting if Nadav would use his slowdown hack to see
if he can still force the corruption.

-- 
Mel Gorman
SUSE Labs



* Re: Potential race in TLB flush batching?
  2017-07-26  9:22                                                                     ` Mel Gorman
@ 2017-07-26 19:18                                                                       ` Nadav Amit
  2017-07-26 23:40                                                                         ` Minchan Kim
  2017-07-26 23:44                                                                       ` Minchan Kim
  1 sibling, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-26 19:18 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Minchan Kim, Andy Lutomirski, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote:
>>> I'm relying on the fact you are the madv_free author to determine if
>>> it's really necessary. The race in question is CPU 0 running madv_free
>>> and updating some PTEs while CPU 1 is also running madv_free and looking
>>> at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
>>> the pte_dirty check (because CPU 0 has updated it already) and potentially
>>> fail to flush. Hence, when madv_free on CPU 1 returns, there are still
>>> potentially writable TLB entries and the underlying PTE is still present
>>> so that a subsequent write does not necessarily propagate the dirty bit
>>> to the underlying PTE any more. Reclaim at some unknown time at the future
>>> may then see that the PTE is still clean and discard the page even though
>>> a write has happened in the meantime. I think this is possible but I could
>>> have missed some protection in madv_free that prevents it happening.
>> 
>> Thanks for the detail. You didn't miss anything. It can happen and then
>> it's really a bug. IOW, if application does write something after madv_free,
>> it must see the written value, not zero.
>> 
>> How about adding [set|clear]_tlb_flush_pending in the TLB batching interface?
>> With it, when tlb_finish_mmu is called, we can know we skip the flush
>> but there is a pending flush, so we flush forcefully to avoid madv_dontneed
>> as well as the madv_free scenario.
> 
> I *think* this is ok as it's simply more expensive on the KSM side in
> the event of a race, but no other harmful change is made, assuming that
> KSM is the only race-prone path. The check for mm_tlb_flush_pending also
> happens under the PTL so there should be sufficient protection from the
> mm struct update being visible at the right time.
> 
> Check using the test program from "mm: Always flush VMA ranges affected
> by zap_page_range v2" if it handles the madvise case as well as that
> would give some degree of safety. Make sure it's tested against 4.13-rc2
> instead of mmotm which already includes the madv_dontneed fix. If yours
> works for both then it supersedes the mmotm patch.
> 
> It would also be interesting if Nadav would use his slowdown hack to see
> if he can still force the corruption.

The proposed fix for the KSM side is likely to work (I will try later), but
on the tlb_finish_mmu() side, I think there is a problem, since if any TLB
flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not be
executed. This means that tlb_finish_mmu() may flush one TLB entry, leave
another one stale and not flush it.
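A toy model of that control flow (names invented; `flushed` mirrors the return value added in the proposed patch) makes the gap concrete: the batched flush covers only the gathered range, so the `!flushed` test can skip the forced flush while an entry outside that range is still stale:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one stale TLB entry inside the mmu_gather range and one
 * outside it that only the pending batched flush would cover. */
struct tlb_state { bool stale_in_range, stale_outside_range; };

/* Returns true if a stale entry survives the modeled tlb_finish_mmu(). */
static bool stale_survives(bool force_flush_whenever_pending)
{
	struct tlb_state tlb = { true, true };
	bool pending = true;	/* set by a concurrent batched unmap */

	/* tlb_flush_mmu(): flushes only the gathered range, reports true. */
	tlb.stale_in_range = false;
	bool flushed = true;

	/* Patch logic: force-flush only if nothing was flushed. Since
	 * flushed is true, the pending (wider) flush is skipped. */
	bool force = force_flush_whenever_pending ? pending
						  : (!flushed && pending);
	if (force)
		tlb.stale_outside_range = false;  /* flush_tlb_mm_range() */

	return tlb.stale_outside_range;
}
```

With the proposed `!flushed` test the out-of-range entry stays stale; flushing whenever a flush is pending, regardless of `flushed`, closes the gap.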

Note also that the use of set/clear_tlb_flush_pending() is only applicable
following my pending fix that changes the pending indication from bool to
atomic_t.

For the record here is my test, followed by the patch to add latency. There
are some magic numbers that may not apply to your system (I got tired of
trying to time the system). If you run the test in a VM, the pause-loop
exiting can potentially prevent the issue from appearing.

--

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <string.h>
#include <assert.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdbool.h>
#include <sys/mman.h>
#include <sys/types.h>

#define PAGE_SIZE		(4096)
#define N_PAGES			(65536ull * 16)

#define CHANGED_VAL		(7)
#define BASE_VAL		(9)

#define max(a,b) \
	({ __typeof__ (a) _a = (a); \
	  __typeof__ (b) _b = (b); \
	  _a > _b ? _a : _b; })

#define STEP_HELPERS_RUN	(1)
#define STEP_DONTNEED_DONE	(2)
#define STEP_ACCESS_PAUSED	(4)

volatile int sync_step = STEP_ACCESS_PAUSED;
volatile char *p;
int dirty_fd, ksm_sharing_fd, ksm_run_fd;
uint64_t soft_dirty_time, madvise_time, soft_dirty_delta, madvise_delta;

static inline unsigned long rdtsc()
{
	unsigned long hi, lo;

	__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
	 return lo | (hi << 32);
}

static inline void wait_rdtsc(unsigned long cycles)
{
	unsigned long tsc = rdtsc();

	while (rdtsc() - tsc < cycles)
		__asm__ __volatile__ ("rep nop" ::: "memory");
}

static void break_sharing(void)
{
	char buf[20];

	pwrite(ksm_run_fd, "2", 1, 0);

	printf("waiting for page sharing to be broken\n");
	do {
		pread(ksm_sharing_fd, buf, sizeof(buf), 0);
	} while (strtoul(buf, NULL, 10));
}


static inline void wait_step(unsigned int step)
{
	while (!(sync_step & step))
		asm volatile ("rep nop":::"memory");
}

static void *big_madvise_thread(void *ign)
{
	while (1) {
		uint64_t tsc;

		wait_step(STEP_HELPERS_RUN);
		wait_rdtsc(madvise_delta);
		tsc = rdtsc();
		madvise((void*)p, PAGE_SIZE * N_PAGES, MADV_FREE);
		madvise_time = rdtsc() - tsc;
		sync_step = STEP_DONTNEED_DONE;
	}
}

static void *soft_dirty_thread(void *ign)
{
	while (1) {
		int r;
		uint64_t tsc;

		wait_step(STEP_HELPERS_RUN | STEP_DONTNEED_DONE);
		wait_rdtsc(soft_dirty_delta);

		tsc = rdtsc();
		r = pwrite(dirty_fd, "4", 1, 0);
		assert(r == 1);
		soft_dirty_time = rdtsc() - tsc;
		wait_step(STEP_DONTNEED_DONE);
		sync_step = STEP_ACCESS_PAUSED;
	}
}

int main(void)
{
	pthread_t aux_thread, aux_thread2;
	char pathname[256];
	long i;
	volatile char c;

	sprintf(pathname, "/proc/%d/clear_refs", getpid());
	dirty_fd = open(pathname, O_RDWR);

	ksm_sharing_fd = open("/sys/kernel/mm/ksm/pages_sharing", O_RDONLY);
	assert(ksm_sharing_fd >= 0);

	ksm_run_fd = open("/sys/kernel/mm/ksm/run", O_RDWR);
	assert(ksm_run_fd >= 0);

	pwrite(ksm_run_fd, "0", 1, 0);

	p = mmap(0, PAGE_SIZE * N_PAGES, PROT_READ|PROT_WRITE,
		 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	assert(p != MAP_FAILED);
	madvise((void*)p, PAGE_SIZE * N_PAGES, MADV_MERGEABLE);

	memset((void*)p, BASE_VAL, PAGE_SIZE * 2);
	for (i = 2; i < N_PAGES; i++)
		c = p[PAGE_SIZE * i];

	pthread_create(&aux_thread, NULL, big_madvise_thread, NULL);
	pthread_create(&aux_thread2, NULL, soft_dirty_thread, NULL);

	while (1) {
		break_sharing();
		*(p + 64) = BASE_VAL;		// cache in TLB and break KSM
		pwrite(ksm_run_fd, "1", 1, 0);

		wait_rdtsc(0x8000000ull);
		sync_step = STEP_HELPERS_RUN;
		wait_rdtsc(0x4000000ull);

		*(p+64) = CHANGED_VAL;

		wait_step(STEP_ACCESS_PAUSED);		// wait for TLB to be flushed
		if (*(p+64) != CHANGED_VAL ||
		    *(p + PAGE_SIZE + 64) == CHANGED_VAL) {
			printf("KSM error\n");
			exit(EXIT_FAILURE);
		}

		printf("No failure yet\n");

		soft_dirty_delta = max(0, (long)madvise_time - (long)soft_dirty_time);
		madvise_delta = max(0, (long)soft_dirty_time - (long)madvise_time);
	}
}

-- 8< --

Subject: [PATCH] TLB flush delay to trigger failure

---
 fs/proc/task_mmu.c | 2 ++
 mm/ksm.c           | 2 ++
 mm/madvise.c       | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 520802da059c..c13259251210 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -16,6 +16,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/page_idle.h>
 #include <linux/shmem_fs.h>
+#include <linux/delay.h>
 
 #include <asm/elf.h>
 #include <linux/uaccess.h>
@@ -1076,6 +1077,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		walk_page_range(0, mm->highest_vm_end, &clear_refs_walk);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
 			mmu_notifier_invalidate_range_end(mm, 0, -1);
+		msleep(5);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 out_mm:
diff --git a/mm/ksm.c b/mm/ksm.c
index 216184af0e19..317adbb48b0f 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -39,6 +39,7 @@
 #include <linux/freezer.h>
 #include <linux/oom.h>
 #include <linux/numa.h>
+#include <linux/delay.h>
 
 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -960,6 +961,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	mmun_end   = addr + PAGE_SIZE;
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 
+	msleep(5);
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
 		pte_unmap_unlock(ptep, ptl);
diff --git a/mm/madvise.c b/mm/madvise.c
index 25b78ee4fc2c..e4c852360f2c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -23,6 +23,7 @@
 #include <linux/swapops.h>
 #include <linux/shmem_fs.h>
 #include <linux/mmu_notifier.h>
+#include <linux/delay.h>
 
 #include <asm/tlb.h>
 
@@ -472,6 +473,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(mm, start, end);
 	madvise_free_page_range(&tlb, vma, start, end);
 	mmu_notifier_invalidate_range_end(mm, start, end);
+	msleep(5);
 	tlb_finish_mmu(&tlb, start, end);
 
 	return 0;

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-26 19:18                                                                       ` Nadav Amit
@ 2017-07-26 23:40                                                                         ` Minchan Kim
  2017-07-27  0:09                                                                           ` Nadav Amit
  0 siblings, 1 reply; 70+ messages in thread
From: Minchan Kim @ 2017-07-26 23:40 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Mel Gorman, Andy Lutomirski, open list:MEMORY MANAGEMENT

Hello Nadav,

On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote:
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote:
> >>> I'm relying on the fact you are the madv_free author to determine if
> >>> it's really necessary. The race in question is CPU 0 running madv_free
> >>> and updating some PTEs while CPU 1 is also running madv_free and looking
> >>> at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
> >>> the pte_dirty check (because CPU 0 has updated it already) and potentially
> >>> fail to flush. Hence, when madv_free on CPU 1 returns, there are still
> >>> potentially writable TLB entries and the underlying PTE is still present
> >>> so that a subsequent write does not necessarily propagate the dirty bit
> >>> to the underlying PTE any more. Reclaim at some unknown time at the future
> >>> may then see that the PTE is still clean and discard the page even though
> >>> a write has happened in the meantime. I think this is possible but I could
> >>> have missed some protection in madv_free that prevents it happening.
> >> 
> >> Thanks for the detail. You didn't miss anything. It can happen and then
> >> it's really bug. IOW, if application does write something after madv_free,
> >> it must see the written value, not zero.
> >> 
> >> How about adding [set|clear]_tlb_flush_pending in tlb batchin interface?
> >> With it, when tlb_finish_mmu is called, we can know we skip the flush
> >> but there is pending flush, so flush focefully to avoid madv_dontneed
> >> as well as madv_free scenario.
> > 
> > I *think* this is ok as it's simply more expensive on the KSM side in
> > the event of a race but no other harmful change is made assuming that
> > KSM is the only race-prone. The check for mm_tlb_flush_pending also
> > happens under the PTL so there should be sufficient protection from the
> > mm struct update being visible at teh right time.
> > 
> > Check using the test program from "mm: Always flush VMA ranges affected
> > by zap_page_range v2" if it handles the madvise case as well as that
> > would give some degree of safety. Make sure it's tested against 4.13-rc2
> > instead of mmotm which already includes the madv_dontneed fix. If yours
> > works for both then it supersedes the mmotm patch.
> > 
> > It would also be interesting if Nadav would use his slowdown hack to see
> > if he can still force the corruption.
> 
> The proposed fix for the KSM side is likely to work (I will try later), but
> on the tlb_finish_mmu() side, I think there is a problem, since if any TLB
> flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not be
> executed. This means that tlb_finish_mmu() may flush one TLB entry, leave
> another one stale and not flush it.

Okay, I will change that part like this to avoid the partial flush problem.

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1c42d69490e4..87d0ebac6605 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
  * The barriers below prevent the compiler from re-ordering the instructions
  * around the memory barriers that are already present in the code.
  */
-static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
+static inline int mm_tlb_flush_pending(struct mm_struct *mm)
 {
+	int nr_pending;
+
 	barrier();
-	return atomic_read(&mm->tlb_flush_pending) > 0;
+	nr_pending = atomic_read(&mm->tlb_flush_pending);
+	return nr_pending;
 }
 static inline void set_tlb_flush_pending(struct mm_struct *mm)
 {
diff --git a/mm/memory.c b/mm/memory.c
index d5c5e6497c70..b5320e96ec51 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb)
 void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
 	struct mmu_gather_batch *batch, *next;
-	bool flushed = tlb_flush_mmu(tlb);
 
+	if (!tlb->fullmm && !tlb->need_flush_all &&
+			mm_tlb_flush_pending(tlb->mm) > 1) {
+		tlb->start = min(start, tlb->start);
+		tlb->end = max(end, tlb->end);
+	}
+
+	tlb_flush_mmu(tlb);
 	clear_tlb_flush_pending(tlb->mm);
-	if (!flushed && mm_tlb_flush_pending(tlb->mm))
-		flush_tlb_mm_range(tlb->mm, start, end, 0UL);
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
> 
> Note also that the use of set/clear_tlb_flush_pending() is only applicable
> following my pending fix that changes the pending indication from bool to
> atomic_t.

Sure, I saw it in the current mmots. Without your good work, my patch would never work. :)
Thanks for the heads-up.

> 
> For the record here is my test, followed by the patch to add latency. There
> are some magic numbers that may not apply to your system (I got tired of
> trying to time the system). If you run the test in a VM, the pause-loop
> exiting can potentially prevent the issue from appearing.

Thanks for the sharing. I will try it, too.

> 
> --
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <pthread.h>
> #include <string.h>
> #include <assert.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <pthread.h>
> #include <stdint.h>
> #include <stdbool.h>
> #include <sys/mman.h>
> #include <sys/types.h>
> 
> #define PAGE_SIZE		(4096)
> #define N_PAGES			(65536ull * 16)
> 
> #define CHANGED_VAL		(7)
> #define BASE_VAL		(9)
> 
> #define max(a,b) \
> 	({ __typeof__ (a) _a = (a); \
> 	  __typeof__ (b) _b = (b); \
> 	  _a > _b ? _a : _b; })
> 
> #define STEP_HELPERS_RUN	(1)
> #define STEP_DONTNEED_DONE	(2)
> #define STEP_ACCESS_PAUSED	(4)
> 
> volatile int sync_step = STEP_ACCESS_PAUSED;
> volatile char *p;
> int dirty_fd, ksm_sharing_fd, ksm_run_fd;
> uint64_t soft_dirty_time, madvise_time, soft_dirty_delta, madvise_delta;
> 
> static inline unsigned long rdtsc()
> {
> 	unsigned long hi, lo;
> 
> 	__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
> 	 return lo | (hi << 32);
> }
> 
> static inline void wait_rdtsc(unsigned long cycles)
> {
> 	unsigned long tsc = rdtsc();
> 
> 	while (rdtsc() - tsc < cycles)
> 		__asm__ __volatile__ ("rep nop" ::: "memory");
> }
> 
> static void break_sharing(void)
> {
> 	char buf[20];
> 
> 	pwrite(ksm_run_fd, "2", 1, 0);
> 
> 	printf("waiting for page sharing to be broken\n");
> 	do {
> 		pread(ksm_sharing_fd, buf, sizeof(buf), 0);
> 	} while (strtoul(buf, NULL, sizeof(buf)));
> }
> 
> 
> static inline void wait_step(unsigned int step)
> {
> 	while (!(sync_step & step))
> 		asm volatile ("rep nop":::"memory");
> }
> 
> static void *big_madvise_thread(void *ign)
> {
> 	while (1) {
> 		uint64_t tsc;
> 
> 		wait_step(STEP_HELPERS_RUN);
> 		wait_rdtsc(madvise_delta);
> 		tsc = rdtsc();
> 		madvise((void*)p, PAGE_SIZE * N_PAGES, MADV_FREE);
> 		madvise_time = rdtsc() - tsc;
> 		sync_step = STEP_DONTNEED_DONE;
> 	}
> }
> 
> static void *soft_dirty_thread(void *ign)
> {
> 	while (1) {
> 		int r;
> 		uint64_t tsc;
> 
> 		wait_step(STEP_HELPERS_RUN | STEP_DONTNEED_DONE);
> 		wait_rdtsc(soft_dirty_delta);
> 
> 		tsc = rdtsc();
> 		r = pwrite(dirty_fd, "4", 1, 0);
> 		assert(r == 1);
> 		soft_dirty_time = rdtsc() - tsc;
> 		wait_step(STEP_DONTNEED_DONE);
> 		sync_step = STEP_ACCESS_PAUSED;
> 	}
> }
> 
> void main(void)
> {
> 	pthread_t aux_thread, aux_thread2;
> 	char pathname[256];
> 	long i;
> 	volatile char c;
> 
> 	sprintf(pathname, "/proc/%d/clear_refs", getpid());
> 	dirty_fd = open(pathname, O_RDWR);
> 
> 	ksm_sharing_fd = open("/sys/kernel/mm/ksm/pages_sharing", O_RDONLY);
> 	assert(ksm_sharing_fd >= 0);
> 
> 	ksm_run_fd = open("/sys/kernel/mm/ksm/run", O_RDWR);
> 	assert(ksm_run_fd >= 0);
> 
> 	pwrite(ksm_run_fd, "0", 1, 0);
> 
> 	p = mmap(0, PAGE_SIZE * N_PAGES, PROT_READ|PROT_WRITE,
> 		 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> 	assert(p != MAP_FAILED);
> 	madvise((void*)p, PAGE_SIZE * N_PAGES, MADV_MERGEABLE);
> 
> 	memset((void*)p, BASE_VAL, PAGE_SIZE * 2);
> 	for (i = 2; i < N_PAGES; i++)
> 		c = p[PAGE_SIZE * i];
> 
> 	pthread_create(&aux_thread, NULL, big_madvise_thread, NULL);
> 	pthread_create(&aux_thread2, NULL, soft_dirty_thread, NULL);
> 
> 	while (1) {
> 		break_sharing();
> 		*(p + 64) = BASE_VAL;		// cache in TLB and break KSM
> 		pwrite(ksm_run_fd, "1", 1, 0);
> 
> 		wait_rdtsc(0x8000000ull);
> 		sync_step = STEP_HELPERS_RUN;
> 		wait_rdtsc(0x4000000ull);
> 
> 		*(p+64) = CHANGED_VAL;
> 
> 		wait_step(STEP_ACCESS_PAUSED);		// wait for TLB to be flushed
> 		if (*(p+64) != CHANGED_VAL ||
> 		    *(p + PAGE_SIZE + 64) == CHANGED_VAL) {
> 			printf("KSM error\n");
> 			exit(EXIT_FAILURE);
> 		}
> 
> 		printf("No failure yet\n");
> 
> 		soft_dirty_delta = max(0, (long)madvise_time - (long)soft_dirty_time);
> 		madvise_delta = max(0, (long)soft_dirty_time - (long)madvise_time);
> 	}
> }
> 
> -- 8< --
> 
> Subject: [PATCH] TLB flush delay to trigger failure
> 
> ---
>  fs/proc/task_mmu.c | 2 ++
>  mm/ksm.c           | 2 ++
>  mm/madvise.c       | 2 ++
>  3 files changed, 6 insertions(+)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 520802da059c..c13259251210 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -16,6 +16,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/page_idle.h>
>  #include <linux/shmem_fs.h>
> +#include <linux/delay.h>
>  
>  #include <asm/elf.h>
>  #include <linux/uaccess.h>
> @@ -1076,6 +1077,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  		walk_page_range(0, mm->highest_vm_end, &clear_refs_walk);
>  		if (type == CLEAR_REFS_SOFT_DIRTY)
>  			mmu_notifier_invalidate_range_end(mm, 0, -1);
> +		msleep(5);
>  		flush_tlb_mm(mm);
>  		up_read(&mm->mmap_sem);
>  out_mm:
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 216184af0e19..317adbb48b0f 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -39,6 +39,7 @@
>  #include <linux/freezer.h>
>  #include <linux/oom.h>
>  #include <linux/numa.h>
> +#include <linux/delay.h>
>  
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> @@ -960,6 +961,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  	mmun_end   = addr + PAGE_SIZE;
>  	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
>  
> +	msleep(5);
>  	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
>  	if (!pte_same(*ptep, orig_pte)) {
>  		pte_unmap_unlock(ptep, ptl);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 25b78ee4fc2c..e4c852360f2c 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -23,6 +23,7 @@
>  #include <linux/swapops.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/delay.h>
>  
>  #include <asm/tlb.h>
>  
> @@ -472,6 +473,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  	mmu_notifier_invalidate_range_start(mm, start, end);
>  	madvise_free_page_range(&tlb, vma, start, end);
>  	mmu_notifier_invalidate_range_end(mm, start, end);
> +	msleep(5);
>  	tlb_finish_mmu(&tlb, start, end);
>  
>  	return 0;


* Re: Potential race in TLB flush batching?
  2017-07-26  9:22                                                                     ` Mel Gorman
  2017-07-26 19:18                                                                       ` Nadav Amit
@ 2017-07-26 23:44                                                                       ` Minchan Kim
  1 sibling, 0 replies; 70+ messages in thread
From: Minchan Kim @ 2017-07-26 23:44 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Nadav Amit, Andy Lutomirski, open list:MEMORY MANAGEMENT

Hi Mel,

On Wed, Jul 26, 2017 at 10:22:28AM +0100, Mel Gorman wrote:
> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote:
> > > I'm relying on the fact you are the madv_free author to determine if
> > > it's really necessary. The race in question is CPU 0 running madv_free
> > > and updating some PTEs while CPU 1 is also running madv_free and looking
> > > at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
> > > the pte_dirty check (because CPU 0 has updated it already) and potentially
> > > fail to flush. Hence, when madv_free on CPU 1 returns, there are still
> > > potentially writable TLB entries and the underlying PTE is still present
> > > so that a subsequent write does not necessarily propagate the dirty bit
> > > to the underlying PTE any more. Reclaim at some unknown time at the future
> > > may then see that the PTE is still clean and discard the page even though
> > > a write has happened in the meantime. I think this is possible but I could
> > > have missed some protection in madv_free that prevents it happening.
> > 
> > Thanks for the detail. You didn't miss anything. It can happen and then
> > it's really bug. IOW, if application does write something after madv_free,
> > it must see the written value, not zero.
> > 
> > How about adding [set|clear]_tlb_flush_pending in tlb batchin interface?
> > With it, when tlb_finish_mmu is called, we can know we skip the flush
> > but there is pending flush, so flush focefully to avoid madv_dontneed
> > as well as madv_free scenario.
> > 
> 
> I *think* this is ok as it's simply more expensive on the KSM side in
> the event of a race but no other harmful change is made assuming that
> KSM is the only race-prone. The check for mm_tlb_flush_pending also
> happens under the PTL so there should be sufficient protection from the
> mm struct update being visible at teh right time.
> 
> Check using the test program from "mm: Always flush VMA ranges affected
> by zap_page_range v2" if it handles the madvise case as well as that
> would give some degree of safety. Make sure it's tested against 4.13-rc2
> instead of mmotm which already includes the madv_dontneed fix. If yours
> works for both then it supersedes the mmotm patch.

Okay, I will test it on 4.13-rc2 + Nadav's atomic tlb_flush_pending
+ my patch that fixes the partial flush problem pointed out by Nadav.


* Re: Potential race in TLB flush batching?
  2017-07-26 23:40                                                                         ` Minchan Kim
@ 2017-07-27  0:09                                                                           ` Nadav Amit
  2017-07-27  0:34                                                                             ` Minchan Kim
  0 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-27  0:09 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Mel Gorman, Andy Lutomirski, open list:MEMORY MANAGEMENT

Minchan Kim <minchan@kernel.org> wrote:

> Hello Nadav,
> 
> On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote:
>> Mel Gorman <mgorman@suse.de> wrote:
>> 
>>> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote:
>>>>> I'm relying on the fact you are the madv_free author to determine if
>>>>> it's really necessary. The race in question is CPU 0 running madv_free
>>>>> and updating some PTEs while CPU 1 is also running madv_free and looking
>>>>> at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
>>>>> the pte_dirty check (because CPU 0 has updated it already) and potentially
>>>>> fail to flush. Hence, when madv_free on CPU 1 returns, there are still
>>>>> potentially writable TLB entries and the underlying PTE is still present
>>>>> so that a subsequent write does not necessarily propagate the dirty bit
>>>>> to the underlying PTE any more. Reclaim at some unknown time at the future
>>>>> may then see that the PTE is still clean and discard the page even though
>>>>> a write has happened in the meantime. I think this is possible but I could
>>>>> have missed some protection in madv_free that prevents it happening.
>>>> 
>>>> Thanks for the detail. You didn't miss anything. It can happen and then
>>>> it's really bug. IOW, if application does write something after madv_free,
>>>> it must see the written value, not zero.
>>>> 
>>>> How about adding [set|clear]_tlb_flush_pending in tlb batchin interface?
>>>> With it, when tlb_finish_mmu is called, we can know we skip the flush
>>>> but there is pending flush, so flush focefully to avoid madv_dontneed
>>>> as well as madv_free scenario.
>>> 
>>> I *think* this is ok as it's simply more expensive on the KSM side in
>>> the event of a race but no other harmful change is made assuming that
>>> KSM is the only race-prone. The check for mm_tlb_flush_pending also
>>> happens under the PTL so there should be sufficient protection from the
>>> mm struct update being visible at teh right time.
>>> 
>>> Check using the test program from "mm: Always flush VMA ranges affected
>>> by zap_page_range v2" if it handles the madvise case as well as that
>>> would give some degree of safety. Make sure it's tested against 4.13-rc2
>>> instead of mmotm which already includes the madv_dontneed fix. If yours
>>> works for both then it supersedes the mmotm patch.
>>> 
>>> It would also be interesting if Nadav would use his slowdown hack to see
>>> if he can still force the corruption.
>> 
>> The proposed fix for the KSM side is likely to work (I will try later), but
>> on the tlb_finish_mmu() side, I think there is a problem, since if any TLB
>> flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not be
>> executed. This means that tlb_finish_mmu() may flush one TLB entry, leave
>> another one stale and not flush it.
> 
> Okay, I will change that part like this to avoid partial flush problem.
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 1c42d69490e4..87d0ebac6605 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
>  * The barriers below prevent the compiler from re-ordering the instructions
>  * around the memory barriers that are already present in the code.
>  */
> -static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> +static inline int mm_tlb_flush_pending(struct mm_struct *mm)
> {
> +	int nr_pending;
> +
> 	barrier();
> -	return atomic_read(&mm->tlb_flush_pending) > 0;
> +	nr_pending = atomic_read(&mm->tlb_flush_pending);
> +	return nr_pending;
> }
> static inline void set_tlb_flush_pending(struct mm_struct *mm)
> {
> diff --git a/mm/memory.c b/mm/memory.c
> index d5c5e6497c70..b5320e96ec51 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb)
> void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
> {
> 	struct mmu_gather_batch *batch, *next;
> -	bool flushed = tlb_flush_mmu(tlb);
> 
> +	if (!tlb->fullmm && !tlb->need_flush_all &&
> +			mm_tlb_flush_pending(tlb->mm) > 1) {

I saw you noticed my comment about the access of the flag without a lock. I
must say it feels strange that a memory barrier would be needed here, but
that is what I understood from the documentation.

> +		tlb->start = min(start, tlb->start);
> +		tlb->end = max(end, tlb->end);

Err… You open-code mmu_gather which is arch-specific. It appears that all of
them have start and end members, but not need_flush_all. Besides, I am not
sure whether they regard start and end the same way.

> +	}
> +
> +	tlb_flush_mmu(tlb);
> 	clear_tlb_flush_pending(tlb->mm);
> -	if (!flushed && mm_tlb_flush_pending(tlb->mm))
> -		flush_tlb_mm_range(tlb->mm, start, end, 0UL);
> 
> 	/* keep the page table cache within bounds */
> 	check_pgt_cache();
>> Note also that the use of set/clear_tlb_flush_pending() is only applicable
>> following my pending fix that changes the pending indication from bool to
>> atomic_t.
> 
> Sure, I saw it in current mmots. Without your good job, my patch never work. :)
> Thanks for the head up.

Thanks, I really appreciate it.



* Re: Potential race in TLB flush batching?
  2017-07-27  0:09                                                                           ` Nadav Amit
@ 2017-07-27  0:34                                                                             ` Minchan Kim
  2017-07-27  0:48                                                                               ` Nadav Amit
  0 siblings, 1 reply; 70+ messages in thread
From: Minchan Kim @ 2017-07-27  0:34 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Mel Gorman, Andy Lutomirski, open list:MEMORY MANAGEMENT

On Wed, Jul 26, 2017 at 05:09:09PM -0700, Nadav Amit wrote:
> Minchan Kim <minchan@kernel.org> wrote:
> 
> > Hello Nadav,
> > 
> > On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote:
> >> Mel Gorman <mgorman@suse.de> wrote:
> >> 
> >>> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote:
> >>>>> I'm relying on the fact you are the madv_free author to determine if
> >>>>> it's really necessary. The race in question is CPU 0 running madv_free
> >>>>> and updating some PTEs while CPU 1 is also running madv_free and looking
> >>>>> at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
> >>>>> the pte_dirty check (because CPU 0 has updated it already) and potentially
> >>>>> fail to flush. Hence, when madv_free on CPU 1 returns, there are still
> >>>>> potentially writable TLB entries and the underlying PTE is still present
> >>>>> so that a subsequent write does not necessarily propagate the dirty bit
> >>>>> to the underlying PTE any more. Reclaim at some unknown time at the future
> >>>>> may then see that the PTE is still clean and discard the page even though
> >>>>> a write has happened in the meantime. I think this is possible but I could
> >>>>> have missed some protection in madv_free that prevents it happening.
> >>>> 
> >>>> Thanks for the detail. You didn't miss anything. It can happen and then
> >>>> it's really bug. IOW, if application does write something after madv_free,
> >>>> it must see the written value, not zero.
> >>>> 
> >>>> How about adding [set|clear]_tlb_flush_pending in tlb batchin interface?
> >>>> With it, when tlb_finish_mmu is called, we can know we skip the flush
> >>>> but there is pending flush, so flush focefully to avoid madv_dontneed
> >>>> as well as madv_free scenario.
> >>> 
> >>> I *think* this is ok as it's simply more expensive on the KSM side in
> >>> the event of a race but no other harmful change is made assuming that
> >>> KSM is the only race-prone. The check for mm_tlb_flush_pending also
> >>> happens under the PTL so there should be sufficient protection from the
> >>> mm struct update being visible at teh right time.
> >>> 
> >>> Check using the test program from "mm: Always flush VMA ranges affected
> >>> by zap_page_range v2" if it handles the madvise case as well as that
> >>> would give some degree of safety. Make sure it's tested against 4.13-rc2
> >>> instead of mmotm which already includes the madv_dontneed fix. If yours
> >>> works for both then it supersedes the mmotm patch.
> >>> 
> >>> It would also be interesting if Nadav would use his slowdown hack to see
> >>> if he can still force the corruption.
> >> 
> >> The proposed fix for the KSM side is likely to work (I will try later), but
> >> on the tlb_finish_mmu() side, I think there is a problem, since if any TLB
> >> flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not be
> >> executed. This means that tlb_finish_mmu() may flush one TLB entry, leave
> >> another one stale and not flush it.
> > 
> > Okay, I will change that part like this to avoid partial flush problem.
> > 
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 1c42d69490e4..87d0ebac6605 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
> >  * The barriers below prevent the compiler from re-ordering the instructions
> >  * around the memory barriers that are already present in the code.
> >  */
> > -static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> > +static inline int mm_tlb_flush_pending(struct mm_struct *mm)
> > {
> > +	int nr_pending;
> > +
> > 	barrier();
> > -	return atomic_read(&mm->tlb_flush_pending) > 0;
> > +	nr_pending = atomic_read(&mm->tlb_flush_pending);
> > +	return nr_pending;
> > }
> > static inline void set_tlb_flush_pending(struct mm_struct *mm)
> > {
> > diff --git a/mm/memory.c b/mm/memory.c
> > index d5c5e6497c70..b5320e96ec51 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb)
> > void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
> > {
> > 	struct mmu_gather_batch *batch, *next;
> > -	bool flushed = tlb_flush_mmu(tlb);
> > 
> > +	if (!tlb->fullmm && !tlb->need_flush_all &&
> > +			mm_tlb_flush_pending(tlb->mm) > 1) {
> 
> I saw you noticed my comment about the access of the flag without a lock. I
> must say it feels strange that a memory barrier would be needed here, but
> that what I understood from the documentation.

I saw your recent barriers fix patch, too.
[PATCH v2 2/2] mm: migrate: fix barriers around tlb_flush_pending

As I commented there, I hope to use the below here without having to be
aware of complex barrier details. Instead, mm_tlb_flush_pending should
issue the right barrier internally.

        mm_tlb_flush_pending(tlb->mm, false:no-pte-locked) > 1

> 
> > +		tlb->start = min(start, tlb->start);
> > +		tlb->end = max(end, tlb->end);
> 
> > Err… You open-code mmu_gather which is arch-specific. It appears that all of
> them have start and end members, but not need_flush_all. Besides, I am not

Looking at tlb_gather_mmu, which is not arch-specific, it initializes
need_flush_all to zero, so it should be harmless even though some
architectures don't set the flag.
Please correct me if I am missing something.

> sure whether they regard start and end the same way.

I understand your worry, but my patch takes the larger range via min/max,
so I cannot see how it could break. While looking at the code, I found
__tlb_adjust_range, so it is better to use it rather than open-coding.


diff --git a/mm/memory.c b/mm/memory.c
index b5320e96ec51..b23188daa396 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -288,10 +288,8 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long e
 	struct mmu_gather_batch *batch, *next;
 
 	if (!tlb->fullmm && !tlb->need_flush_all &&
-			mm_tlb_flush_pending(tlb->mm) > 1) {
-		tlb->start = min(start, tlb->start);
-		tlb->end = max(end, tlb->end);
-	}
+			mm_tlb_flush_pending(tlb->mm) > 1)
+		__tlb_adjust_range(tlb->mm, start, end - start);
 
 	tlb_flush_mmu(tlb);
 	clear_tlb_flush_pending(tlb->mm);


* Re: Potential race in TLB flush batching?
  2017-07-27  0:34                                                                             ` Minchan Kim
@ 2017-07-27  0:48                                                                               ` Nadav Amit
  2017-07-27  1:13                                                                                 ` Nadav Amit
  0 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-27  0:48 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Mel Gorman, Andy Lutomirski, open list:MEMORY MANAGEMENT

Minchan Kim <minchan@kernel.org> wrote:

> On Wed, Jul 26, 2017 at 05:09:09PM -0700, Nadav Amit wrote:
>> Minchan Kim <minchan@kernel.org> wrote:
>> 
>>> Hello Nadav,
>>> 
>>> On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote:
>>>> Mel Gorman <mgorman@suse.de> wrote:
>>>> 
>>>>> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote:
>>>>>>> I'm relying on the fact you are the madv_free author to determine if
>>>>>>> it's really necessary. The race in question is CPU 0 running madv_free
>>>>>>> and updating some PTEs while CPU 1 is also running madv_free and looking
>>>>>>> at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
>>>>>>> the pte_dirty check (because CPU 0 has updated it already) and potentially
>>>>>>> fail to flush. Hence, when madv_free on CPU 1 returns, there are still
>>>>>>> potentially writable TLB entries and the underlying PTE is still present
>>>>>>> so that a subsequent write does not necessarily propagate the dirty bit
>>>>>>> to the underlying PTE any more. Reclaim at some unknown time at the future
>>>>>>> may then see that the PTE is still clean and discard the page even though
>>>>>>> a write has happened in the meantime. I think this is possible but I could
>>>>>>> have missed some protection in madv_free that prevents it happening.
>>>>>> 
>>>>>> Thanks for the detail. You didn't miss anything. It can happen and then
>>>>>> it's really bug. IOW, if application does write something after madv_free,
>>>>>> it must see the written value, not zero.
>>>>>> 
>>>>>> How about adding [set|clear]_tlb_flush_pending in tlb batchin interface?
>>>>>> With it, when tlb_finish_mmu is called, we can know we skip the flush
>>>>>> but there is pending flush, so flush focefully to avoid madv_dontneed
>>>>>> as well as madv_free scenario.
>>>>> 
>>>>> I *think* this is ok as it's simply more expensive on the KSM side in
>>>>> the event of a race but no other harmful change is made assuming that
>>>>> KSM is the only race-prone. The check for mm_tlb_flush_pending also
>>>>> happens under the PTL so there should be sufficient protection from the
>>>>> mm struct update being visible at teh right time.
>>>>> 
>>>>> Check using the test program from "mm: Always flush VMA ranges affected
>>>>> by zap_page_range v2" if it handles the madvise case as well as that
>>>>> would give some degree of safety. Make sure it's tested against 4.13-rc2
>>>>> instead of mmotm which already includes the madv_dontneed fix. If yours
>>>>> works for both then it supersedes the mmotm patch.
>>>>> 
>>>>> It would also be interesting if Nadav would use his slowdown hack to see
>>>>> if he can still force the corruption.
>>>> 
>>>> The proposed fix for the KSM side is likely to work (I will try later), but
>>>> on the tlb_finish_mmu() side, I think there is a problem, since if any TLB
>>>> flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not be
>>>> executed. This means that tlb_finish_mmu() may flush one TLB entry, leave
>>>> another one stale and not flush it.
>>> 
>>> Okay, I will change that part like this to avoid partial flush problem.
>>> 
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 1c42d69490e4..87d0ebac6605 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
>>> * The barriers below prevent the compiler from re-ordering the instructions
>>> * around the memory barriers that are already present in the code.
>>> */
>>> -static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
>>> +static inline int mm_tlb_flush_pending(struct mm_struct *mm)
>>> {
>>> +	int nr_pending;
>>> +
>>> 	barrier();
>>> -	return atomic_read(&mm->tlb_flush_pending) > 0;
>>> +	nr_pending = atomic_read(&mm->tlb_flush_pending);
>>> +	return nr_pending;
>>> }
>>> static inline void set_tlb_flush_pending(struct mm_struct *mm)
>>> {
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index d5c5e6497c70..b5320e96ec51 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb)
>>> void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
>>> {
>>> 	struct mmu_gather_batch *batch, *next;
>>> -	bool flushed = tlb_flush_mmu(tlb);
>>> 
>>> +	if (!tlb->fullmm && !tlb->need_flush_all &&
>>> +			mm_tlb_flush_pending(tlb->mm) > 1) {
>> 
>> I saw you noticed my comment about the access of the flag without a lock. I
>> must say it feels strange that a memory barrier would be needed here, but
>> that's what I understood from the documentation.
> 
> I saw your recent barriers fix patch, too.
> [PATCH v2 2/2] mm: migrate: fix barriers around tlb_flush_pending
> 
> As I commented out in there, I hope to use below here without being
> aware of complex barrier stuff. Instead, mm_tlb_flush_pending should
> call the right barrier inside.
> 
>        mm_tlb_flush_pending(tlb->mm, false:no-pte-locked) > 1

I will address it in v3.


> 
>>> +		tlb->start = min(start, tlb->start);
>>> +		tlb->end = max(end, tlb->end);
>> 
>> Err… You open-code mmu_gather which is arch-specific. It appears that all of
>> them have start and end members, but not need_flush_all. Besides, I am not
> 
> When I see tlb_gather_mmu, which is not arch-specific, it initializes
> need_flush_all to zero, so it would not be harmful although some
> architectures don't set the flag.
> Please correct me if I missed something.

Oh.. my bad. I missed the fact that this code is under “#ifdef
HAVE_GENERIC_MMU_GATHER”. But that means that arch-specific tlb_finish_mmu()
implementations (s390, arm) may need to be modified as well.

>> sure whether they regard start and end the same way.
> 
> I understand your worry but my patch takes longer range by min/max
> so I cannot imagine how it breaks. While looking at the code, I found
> __tlb_adjust_range so better to use it rather than open-code.
> 
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index b5320e96ec51..b23188daa396 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -288,10 +288,8 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long e
> 	struct mmu_gather_batch *batch, *next;
> 
> 	if (!tlb->fullmm && !tlb->need_flush_all &&
> -			mm_tlb_flush_pending(tlb->mm) > 1) {
> -		tlb->start = min(start, tlb->start);
> -		tlb->end = max(end, tlb->end);
> -	}
> +			mm_tlb_flush_pending(tlb->mm) > 1)
> +		__tlb_adjust_range(tlb->mm, start, end - start);
> 
> 	tlb_flush_mmu(tlb);
> 	clear_tlb_flush_pending(tlb->mm);

This one is better, especially as I now understand it is only for the
generic MMU gather (which I missed before).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-27  0:48                                                                               ` Nadav Amit
@ 2017-07-27  1:13                                                                                 ` Nadav Amit
  2017-07-27  7:04                                                                                   ` Minchan Kim
  0 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-27  1:13 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Mel Gorman, Andy Lutomirski, open list:MEMORY MANAGEMENT

Nadav Amit <nadav.amit@gmail.com> wrote:

> Minchan Kim <minchan@kernel.org> wrote:
> 
>> On Wed, Jul 26, 2017 at 05:09:09PM -0700, Nadav Amit wrote:
>>> Minchan Kim <minchan@kernel.org> wrote:
>>> 
>>>> Hello Nadav,
>>>> 
>>>> On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote:
>>>>> Mel Gorman <mgorman@suse.de> wrote:
>>>>> 
>>>>>> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote:
>>>>>>>> I'm relying on the fact you are the madv_free author to determine if
>>>>>>>> it's really necessary. The race in question is CPU 0 running madv_free
>>>>>>>> and updating some PTEs while CPU 1 is also running madv_free and looking
>>>>>>>> at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
>>>>>>>> the pte_dirty check (because CPU 0 has updated it already) and potentially
>>>>>>>> fail to flush. Hence, when madv_free on CPU 1 returns, there are still
>>>>>>>> potentially writable TLB entries and the underlying PTE is still present
>>>>>>>> so that a subsequent write does not necessarily propagate the dirty bit
>>>>>>>> to the underlying PTE any more. Reclaim at some unknown time in the future
>>>>>>>> may then see that the PTE is still clean and discard the page even though
>>>>>>>> a write has happened in the meantime. I think this is possible but I could
>>>>>>>> have missed some protection in madv_free that prevents it happening.
>>>>>>> 
>>>>>>> Thanks for the detail. You didn't miss anything. It can happen and then
>>>>>>> it's really a bug. IOW, if the application does write something after madv_free,
>>>>>>> it must see the written value, not zero.
>>>>>>> 
>>>>>>> How about adding [set|clear]_tlb_flush_pending in the tlb batching interface?
>>>>>>> With it, when tlb_finish_mmu is called, we can know we skipped the flush
>>>>>>> but there is a pending flush, so flush forcefully to avoid the madv_dontneed
>>>>>>> as well as the madv_free scenario.
>>>>>> 
>>>>>> I *think* this is ok as it's simply more expensive on the KSM side in
>>>>>> the event of a race but no other harmful change is made assuming that
>>>>>> KSM is the only race-prone case. The check for mm_tlb_flush_pending also
>>>>>> happens under the PTL so there should be sufficient protection from the
>>>>>> mm struct update being visible at the right time.
>>>>>> 
>>>>>> Check using the test program from "mm: Always flush VMA ranges affected
>>>>>> by zap_page_range v2" if it handles the madvise case as well as that
>>>>>> would give some degree of safety. Make sure it's tested against 4.13-rc2
>>>>>> instead of mmotm which already includes the madv_dontneed fix. If yours
>>>>>> works for both then it supersedes the mmotm patch.
>>>>>> 
>>>>>> It would also be interesting if Nadav would use his slowdown hack to see
>>>>>> if he can still force the corruption.
>>>>> 
>>>>> The proposed fix for the KSM side is likely to work (I will try later), but
>>>>> on the tlb_finish_mmu() side, I think there is a problem, since if any TLB
>>>>> flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not be
>>>>> executed. This means that tlb_finish_mmu() may flush one TLB entry, leave
>>>>> another one stale and not flush it.
>>>> 
>>>> Okay, I will change that part like this to avoid partial flush problem.
>>>> 
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 1c42d69490e4..87d0ebac6605 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
>>>> * The barriers below prevent the compiler from re-ordering the instructions
>>>> * around the memory barriers that are already present in the code.
>>>> */
>>>> -static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
>>>> +static inline int mm_tlb_flush_pending(struct mm_struct *mm)
>>>> {
>>>> +	int nr_pending;
>>>> +
>>>> 	barrier();
>>>> -	return atomic_read(&mm->tlb_flush_pending) > 0;
>>>> +	nr_pending = atomic_read(&mm->tlb_flush_pending);
>>>> +	return nr_pending;
>>>> }
>>>> static inline void set_tlb_flush_pending(struct mm_struct *mm)
>>>> {
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index d5c5e6497c70..b5320e96ec51 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb)
>>>> void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
>>>> {
>>>> 	struct mmu_gather_batch *batch, *next;
>>>> -	bool flushed = tlb_flush_mmu(tlb);
>>>> 
>>>> +	if (!tlb->fullmm && !tlb->need_flush_all &&
>>>> +			mm_tlb_flush_pending(tlb->mm) > 1) {
>>> 
>>> I saw you noticed my comment about the access of the flag without a lock. I
>>> must say it feels strange that a memory barrier would be needed here, but
>>> that's what I understood from the documentation.
>> 
>> I saw your recent barriers fix patch, too.
>> [PATCH v2 2/2] mm: migrate: fix barriers around tlb_flush_pending
>> 
>> As I commented out in there, I hope to use below here without being
>> aware of complex barrier stuff. Instead, mm_tlb_flush_pending should
>> call the right barrier inside.
>> 
>>       mm_tlb_flush_pending(tlb->mm, false:no-pte-locked) > 1
> 
> I will address it in v3.
> 
> 
>>>> +		tlb->start = min(start, tlb->start);
>>>> +		tlb->end = max(end, tlb->end);
>>> 
>>> Err… You open-code mmu_gather which is arch-specific. It appears that all of
>>> them have start and end members, but not need_flush_all. Besides, I am not
>> 
>> When I see tlb_gather_mmu, which is not arch-specific, it initializes
>> need_flush_all to zero, so it would not be harmful although some
>> architectures don't set the flag.
>> Please correct me if I missed something.
> 
> Oh.. my bad. I missed the fact that this code is under “#ifdef
> HAVE_GENERIC_MMU_GATHER”. But that means that arch-specific tlb_finish_mmu()
> implementations (s390, arm) may need to be modified as well.
> 
>>> sure whether they regard start and end the same way.
>> 
>> I understand your worry but my patch takes longer range by min/max
>> so I cannot imagine how it breaks. While looking at the code, I found
>> __tlb_adjust_range so better to use it rather than open-code.
>> 
>> 
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b5320e96ec51..b23188daa396 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -288,10 +288,8 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long e
>> 	struct mmu_gather_batch *batch, *next;
>> 
>> 	if (!tlb->fullmm && !tlb->need_flush_all &&
>> -			mm_tlb_flush_pending(tlb->mm) > 1) {
>> -		tlb->start = min(start, tlb->start);
>> -		tlb->end = max(end, tlb->end);
>> -	}
>> +			mm_tlb_flush_pending(tlb->mm) > 1)
>> +		__tlb_adjust_range(tlb->mm, start, end - start);
>> 
>> 	tlb_flush_mmu(tlb);
>> 	clear_tlb_flush_pending(tlb->mm);
> 
> This one is better, especially as I now understand it is only for the
> generic MMU gather (which I missed before).

There is one issue I forgot: pte_accessible() on x86 regards
mm_tlb_flush_pending() as an indication for NUMA migration. But now the code
does not make too much sense:

        if ((pte_flags(a) & _PAGE_PROTNONE) &&
                        mm_tlb_flush_pending(mm))

Either we remove the _PAGE_PROTNONE check or we need to use the atomic field
to count separately pending flushes due to migration and due to other
reasons. The first option is safer, but Mel objected to it, because of the
performance implications. The second one requires some thought on how to
build a single counter for multiple reasons and avoid a potential overflow.

Thoughts?


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-27  1:13                                                                                 ` Nadav Amit
@ 2017-07-27  7:04                                                                                   ` Minchan Kim
  2017-07-27  7:21                                                                                     ` Mel Gorman
  0 siblings, 1 reply; 70+ messages in thread
From: Minchan Kim @ 2017-07-27  7:04 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Mel Gorman, Andy Lutomirski, open list:MEMORY MANAGEMENT

On Wed, Jul 26, 2017 at 06:13:15PM -0700, Nadav Amit wrote:
> Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> > Minchan Kim <minchan@kernel.org> wrote:
> > 
> >> On Wed, Jul 26, 2017 at 05:09:09PM -0700, Nadav Amit wrote:
> >>> Minchan Kim <minchan@kernel.org> wrote:
> >>> 
> >>>> Hello Nadav,
> >>>> 
> >>>> On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote:
> >>>>> Mel Gorman <mgorman@suse.de> wrote:
> >>>>> 
> >>>>>> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote:
> >>>>>>>> I'm relying on the fact you are the madv_free author to determine if
> >>>>>>>> it's really necessary. The race in question is CPU 0 running madv_free
> >>>>>>>> and updating some PTEs while CPU 1 is also running madv_free and looking
> >>>>>>>> at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
> >>>>>>>> the pte_dirty check (because CPU 0 has updated it already) and potentially
> >>>>>>>> fail to flush. Hence, when madv_free on CPU 1 returns, there are still
> >>>>>>>> potentially writable TLB entries and the underlying PTE is still present
> >>>>>>>> so that a subsequent write does not necessarily propagate the dirty bit
> >>>>>>>> to the underlying PTE any more. Reclaim at some unknown time in the future
> >>>>>>>> may then see that the PTE is still clean and discard the page even though
> >>>>>>>> a write has happened in the meantime. I think this is possible but I could
> >>>>>>>> have missed some protection in madv_free that prevents it happening.
> >>>>>>> 
> >>>>>>> Thanks for the detail. You didn't miss anything. It can happen and then
> >>>>>>> it's really a bug. IOW, if the application does write something after madv_free,
> >>>>>>> it must see the written value, not zero.
> >>>>>>> 
> >>>>>>> How about adding [set|clear]_tlb_flush_pending in the tlb batching interface?
> >>>>>>> With it, when tlb_finish_mmu is called, we can know we skipped the flush
> >>>>>>> but there is a pending flush, so flush forcefully to avoid the madv_dontneed
> >>>>>>> as well as the madv_free scenario.
> >>>>>> 
> >>>>>> I *think* this is ok as it's simply more expensive on the KSM side in
> >>>>>> the event of a race but no other harmful change is made assuming that
> >>>>>> KSM is the only race-prone case. The check for mm_tlb_flush_pending also
> >>>>>> happens under the PTL so there should be sufficient protection from the
> >>>>>> mm struct update being visible at the right time.
> >>>>>> 
> >>>>>> Check using the test program from "mm: Always flush VMA ranges affected
> >>>>>> by zap_page_range v2" if it handles the madvise case as well as that
> >>>>>> would give some degree of safety. Make sure it's tested against 4.13-rc2
> >>>>>> instead of mmotm which already includes the madv_dontneed fix. If yours
> >>>>>> works for both then it supersedes the mmotm patch.
> >>>>>> 
> >>>>>> It would also be interesting if Nadav would use his slowdown hack to see
> >>>>>> if he can still force the corruption.
> >>>>> 
> >>>>> The proposed fix for the KSM side is likely to work (I will try later), but
> >>>>> on the tlb_finish_mmu() side, I think there is a problem, since if any TLB
> >>>>> flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not be
> >>>>> executed. This means that tlb_finish_mmu() may flush one TLB entry, leave
> >>>>> another one stale and not flush it.
> >>>> 
> >>>> Okay, I will change that part like this to avoid partial flush problem.
> >>>> 
> >>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >>>> index 1c42d69490e4..87d0ebac6605 100644
> >>>> --- a/include/linux/mm_types.h
> >>>> +++ b/include/linux/mm_types.h
> >>>> @@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
> >>>> * The barriers below prevent the compiler from re-ordering the instructions
> >>>> * around the memory barriers that are already present in the code.
> >>>> */
> >>>> -static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> >>>> +static inline int mm_tlb_flush_pending(struct mm_struct *mm)
> >>>> {
> >>>> +	int nr_pending;
> >>>> +
> >>>> 	barrier();
> >>>> -	return atomic_read(&mm->tlb_flush_pending) > 0;
> >>>> +	nr_pending = atomic_read(&mm->tlb_flush_pending);
> >>>> +	return nr_pending;
> >>>> }
> >>>> static inline void set_tlb_flush_pending(struct mm_struct *mm)
> >>>> {
> >>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>> index d5c5e6497c70..b5320e96ec51 100644
> >>>> --- a/mm/memory.c
> >>>> +++ b/mm/memory.c
> >>>> @@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb)
> >>>> void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
> >>>> {
> >>>> 	struct mmu_gather_batch *batch, *next;
> >>>> -	bool flushed = tlb_flush_mmu(tlb);
> >>>> 
> >>>> +	if (!tlb->fullmm && !tlb->need_flush_all &&
> >>>> +			mm_tlb_flush_pending(tlb->mm) > 1) {
> >>> 
> >>> I saw you noticed my comment about the access of the flag without a lock. I
> >>> must say it feels strange that a memory barrier would be needed here, but
> >>> that's what I understood from the documentation.
> >> 
> >> I saw your recent barriers fix patch, too.
> >> [PATCH v2 2/2] mm: migrate: fix barriers around tlb_flush_pending
> >> 
> >> As I commented out in there, I hope to use below here without being
> >> aware of complex barrier stuff. Instead, mm_tlb_flush_pending should
> >> call the right barrier inside.
> >> 
> >>       mm_tlb_flush_pending(tlb->mm, false:no-pte-locked) > 1
> > 
> > I will address it in v3.
> > 
> > 
> >>>> +		tlb->start = min(start, tlb->start);
> >>>> +		tlb->end = max(end, tlb->end);
> >>> 
> >>> Err… You open-code mmu_gather which is arch-specific. It appears that all of
> >>> them have start and end members, but not need_flush_all. Besides, I am not
> >> 
> >> When I see tlb_gather_mmu, which is not arch-specific, it initializes
> >> need_flush_all to zero, so it would not be harmful although some
> >> architectures don't set the flag.
> >> Please correct me if I missed something.
> > 
> > Oh.. my bad. I missed the fact that this code is under “#ifdef
> > HAVE_GENERIC_MMU_GATHER”. But that means that arch-specific tlb_finish_mmu()
> > implementations (s390, arm) may need to be modified as well.
> > 
> >>> sure whether they regard start and end the same way.
> >> 
> >> I understand your worry but my patch takes longer range by min/max
> >> so I cannot imagine how it breaks. While looking at the code, I found
> >> __tlb_adjust_range so better to use it rather than open-code.
> >> 
> >> 
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index b5320e96ec51..b23188daa396 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -288,10 +288,8 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long e
> >> 	struct mmu_gather_batch *batch, *next;
> >> 
> >> 	if (!tlb->fullmm && !tlb->need_flush_all &&
> >> -			mm_tlb_flush_pending(tlb->mm) > 1) {
> >> -		tlb->start = min(start, tlb->start);
> >> -		tlb->end = max(end, tlb->end);
> >> -	}
> >> +			mm_tlb_flush_pending(tlb->mm) > 1)
> >> +		__tlb_adjust_range(tlb->mm, start, end - start);
> >> 
> >> 	tlb_flush_mmu(tlb);
> >> 	clear_tlb_flush_pending(tlb->mm);
> > 
> > This one is better, especially as I now understand it is only for the
> > generic MMU gather (which I missed before).
> 
> There is one issue I forgot: pte_accessible() on x86 regards
> mm_tlb_flush_pending() as an indication for NUMA migration. But now the code
> does not make too much sense:
> 
>         if ((pte_flags(a) & _PAGE_PROTNONE) &&
>                         mm_tlb_flush_pending(mm))
> 
> Either we remove the _PAGE_PROTNONE check or we need to use the atomic field
> to count separately pending flushes due to migration and due to other
> reasons. The first option is safer, but Mel objected to it, because of the
> performance implications. The second one requires some thought on how to
> build a single counter for multiple reasons and avoid a potential overflow.
> 
> Thoughts?
> 

I'm really new to autoNUMA so I'm not sure I understand your concern.
Is your concern that increasing the number of places that add to the pending
count might hurt autoNUMA performance?
If so, wouldn't the above _PAGE_PROTNONE check filter out most of the cases?
Maybe Mel could answer.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-27  7:04                                                                                   ` Minchan Kim
@ 2017-07-27  7:21                                                                                     ` Mel Gorman
  2017-07-27 16:04                                                                                       ` Nadav Amit
  0 siblings, 1 reply; 70+ messages in thread
From: Mel Gorman @ 2017-07-27  7:21 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Nadav Amit, Andy Lutomirski, open list:MEMORY MANAGEMENT

On Thu, Jul 27, 2017 at 04:04:20PM +0900, Minchan Kim wrote:
> > There is one issue I forgot: pte_accessible() on x86 regards
> > mm_tlb_flush_pending() as an indication for NUMA migration. But now the code
> > does not make too much sense:
> > 
> >         if ((pte_flags(a) & _PAGE_PROTNONE) &&
> >                         mm_tlb_flush_pending(mm))
> > 
> > Either we remove the _PAGE_PROTNONE check or we need to use the atomic field
> > to count separately pending flushes due to migration and due to other
> > reasons. The first option is safer, but Mel objected to it, because of the
> > performance implications. The second one requires some thought on how to
> > build a single counter for multiple reasons and avoid a potential overflow.
> > 
> > Thoughts?
> > 
> 
> I'm really new to autoNUMA so I'm not sure I understand your concern.
> Is your concern that increasing the number of places that add to the pending
> count might hurt autoNUMA performance?
> If so, wouldn't the above _PAGE_PROTNONE check filter out most of the cases?
> Maybe Mel could answer.

I'm not sure what I'm being asked. In the case above, the TLB flush pending
is only relevant against autonuma-related races so only those PTEs are
checked to limit overhead. It could be checked on every PTE but it's
adding more compiler barriers or more atomic reads which do not appear
necessary. If the check is removed, a comment should be added explaining
why every PTE has to be checked.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-27  7:21                                                                                     ` Mel Gorman
@ 2017-07-27 16:04                                                                                       ` Nadav Amit
  2017-07-27 17:36                                                                                         ` Mel Gorman
  0 siblings, 1 reply; 70+ messages in thread
From: Nadav Amit @ 2017-07-27 16:04 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Minchan Kim, Andy Lutomirski, open list:MEMORY MANAGEMENT

Mel Gorman <mgorman@suse.de> wrote:

> On Thu, Jul 27, 2017 at 04:04:20PM +0900, Minchan Kim wrote:
>>> There is one issue I forgot: pte_accessible() on x86 regards
>>> mm_tlb_flush_pending() as an indication for NUMA migration. But now the code
>>> does not make too much sense:
>>> 
>>>        if ((pte_flags(a) & _PAGE_PROTNONE) &&
>>>                        mm_tlb_flush_pending(mm))
>>> 
>>> Either we remove the _PAGE_PROTNONE check or we need to use the atomic field
>>> to count separately pending flushes due to migration and due to other
>>> reasons. The first option is safer, but Mel objected to it, because of the
>>> performance implications. The second one requires some thought on how to
>>> build a single counter for multiple reasons and avoid a potential overflow.
>>> 
>>> Thoughts?
>> 
>> I'm really new to autoNUMA so I'm not sure I understand your concern.
>> Is your concern that increasing the number of places that add to the pending
>> count might hurt autoNUMA performance?
>> If so, wouldn't the above _PAGE_PROTNONE check filter out most of the cases?
>> Maybe Mel could answer.
> 
> I'm not sure what I'm being asked. In the case above, the TLB flush pending
> is only relevant against autonuma-related races so only those PTEs are
> checked to limit overhead. It could be checked on every PTE but it's
> adding more compiler barriers or more atomic reads which do not appear
> necessary. If the check is removed, a comment should be added explaining
> why every PTE has to be checked.

I considered breaking tlb_flush_pending into two: tlb_flush_pending_numa and
tlb_flush_pending_other (they can share one atomic64_t field). This way,
pte_accessible() would only consider “tlb_flush_pending_numa”, and the
changes that Minchan proposed would not increase the number of unnecessary TLB
flushes.

However, considering the complexity of the TLB flushes scheme, and the fact
I am not fully convinced all of these TLB flushes are indeed unnecessary, I
will put it aside.

Nadav

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Potential race in TLB flush batching?
  2017-07-27 16:04                                                                                       ` Nadav Amit
@ 2017-07-27 17:36                                                                                         ` Mel Gorman
  0 siblings, 0 replies; 70+ messages in thread
From: Mel Gorman @ 2017-07-27 17:36 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Minchan Kim, Andy Lutomirski, open list:MEMORY MANAGEMENT

On Thu, Jul 27, 2017 at 09:04:11AM -0700, Nadav Amit wrote:
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Thu, Jul 27, 2017 at 04:04:20PM +0900, Minchan Kim wrote:
> >>> There is one issue I forgot: pte_accessible() on x86 regards
> >>> mm_tlb_flush_pending() as an indication for NUMA migration. But now the code
> >>> does not make too much sense:
> >>> 
> >>>        if ((pte_flags(a) & _PAGE_PROTNONE) &&
> >>>                        mm_tlb_flush_pending(mm))
> >>> 
> >>> Either we remove the _PAGE_PROTNONE check or we need to use the atomic field
> >>> to count separately pending flushes due to migration and due to other
> >>> reasons. The first option is safer, but Mel objected to it, because of the
> >>> performance implications. The second one requires some thought on how to
> >>> build a single counter for multiple reasons and avoid a potential overflow.
> >>> 
> >>> Thoughts?
> >> 
> >> I'm really new to autoNUMA so I'm not sure I understand your concern.
> >> Is your concern that increasing the number of places that add to the pending
> >> count might hurt autoNUMA performance?
> >> If so, wouldn't the above _PAGE_PROTNONE check filter out most of the cases?
> >> Maybe Mel could answer.
> > 
> > I'm not sure what I'm being asked. In the case above, the TLB flush pending
> > is only relevant against autonuma-related races so only those PTEs are
> > checked to limit overhead. It could be checked on every PTE but it's
> > adding more compiler barriers or more atomic reads which do not appear
> > necessary. If the check is removed, a comment should be added explaining
> > why every PTE has to be checked.
> 
> I considered breaking tlb_flush_pending into two: tlb_flush_pending_numa and
> tlb_flush_pending_other (they can share one atomic64_t field). This way,
> pte_accessible() would only consider “tlb_flush_pending_numa”, and the
> changes that Minchan proposed would not increase the number of unnecessary TLB
> flushes.
> 
> However, considering the complexity of the TLB flushes scheme, and the fact
> I am not fully convinced all of these TLB flushes are indeed unnecessary, I
> will put it aside.
> 

Ok, I understand now. With a second set/clear of mm_tlb_flush_pending,
it is necessary to remove the PROT_NUMA check from pte_accessible because
it's no longer change_prot_range that is the only user of concern. At
this time, I do not see value in adding two pending fields because it's
a maintenance headache and an API that would be harder to get right. It's
also not clear it would add any performance advantage and even if it did,
it's the type of complexity that would need hard data supporting it.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2017-07-27 17:36 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-11  0:52 Potential race in TLB flush batching? Nadav Amit
2017-07-11  6:41 ` Mel Gorman
2017-07-11  7:30   ` Nadav Amit
2017-07-11  9:29     ` Mel Gorman
2017-07-11 10:40       ` Nadav Amit
2017-07-11 13:20         ` Mel Gorman
2017-07-11 14:58           ` Andy Lutomirski
2017-07-11 15:53             ` Mel Gorman
2017-07-11 17:23               ` Andy Lutomirski
2017-07-11 19:18                 ` Mel Gorman
2017-07-11 20:06                   ` Nadav Amit
2017-07-11 21:09                     ` Mel Gorman
2017-07-11 20:09                   ` Mel Gorman
2017-07-11 21:52                     ` Mel Gorman
2017-07-11 22:27                       ` Nadav Amit
2017-07-11 22:34                         ` Nadav Amit
2017-07-12  8:27                         ` Mel Gorman
2017-07-12 23:27                           ` Nadav Amit
2017-07-12 23:36                             ` Andy Lutomirski
2017-07-12 23:42                               ` Nadav Amit
2017-07-13  5:38                                 ` Andy Lutomirski
2017-07-13 16:05                                   ` Nadav Amit
2017-07-13 16:06                                     ` Andy Lutomirski
2017-07-13  6:07                             ` Mel Gorman
2017-07-13 16:08                               ` Andy Lutomirski
2017-07-13 17:07                                 ` Mel Gorman
2017-07-13 17:15                                   ` Andy Lutomirski
2017-07-13 18:23                                     ` Mel Gorman
2017-07-14 23:16                               ` Nadav Amit
2017-07-15 15:55                                 ` Mel Gorman
2017-07-15 16:41                                   ` Andy Lutomirski
2017-07-17  7:49                                     ` Mel Gorman
2017-07-18 21:28                                   ` Nadav Amit
2017-07-19  7:41                                     ` Mel Gorman
2017-07-19 19:41                                       ` Nadav Amit
2017-07-19 19:58                                         ` Mel Gorman
2017-07-19 20:20                                           ` Nadav Amit
2017-07-19 21:47                                             ` Mel Gorman
2017-07-19 22:19                                               ` Nadav Amit
2017-07-19 22:59                                                 ` Mel Gorman
2017-07-19 23:39                                                   ` Nadav Amit
2017-07-20  7:43                                                     ` Mel Gorman
2017-07-22  1:19                                                       ` Nadav Amit
2017-07-24  9:58                                                         ` Mel Gorman
2017-07-24 19:46                                                           ` Nadav Amit
2017-07-25  7:37                                                           ` Minchan Kim
2017-07-25  8:51                                                             ` Mel Gorman
2017-07-25  9:11                                                               ` Minchan Kim
2017-07-25 10:10                                                                 ` Mel Gorman
2017-07-26  5:43                                                                   ` Minchan Kim
2017-07-26  9:22                                                                     ` Mel Gorman
2017-07-26 19:18                                                                       ` Nadav Amit
2017-07-26 23:40                                                                         ` Minchan Kim
2017-07-27  0:09                                                                           ` Nadav Amit
2017-07-27  0:34                                                                             ` Minchan Kim
2017-07-27  0:48                                                                               ` Nadav Amit
2017-07-27  1:13                                                                                 ` Nadav Amit
2017-07-27  7:04                                                                                   ` Minchan Kim
2017-07-27  7:21                                                                                     ` Mel Gorman
2017-07-27 16:04                                                                                       ` Nadav Amit
2017-07-27 17:36                                                                                         ` Mel Gorman
2017-07-26 23:44                                                                       ` Minchan Kim
2017-07-11 22:07                   ` Andy Lutomirski
2017-07-11 22:33                     ` Mel Gorman
2017-07-14  7:00                     ` Benjamin Herrenschmidt
2017-07-14  8:31                       ` Mel Gorman
2017-07-14  9:02                         ` Benjamin Herrenschmidt
2017-07-14  9:27                           ` Mel Gorman
2017-07-14 22:21                             ` Andy Lutomirski
2017-07-11 16:22           ` Nadav Amit