linux-kernel.vger.kernel.org archive mirror
* [PATCH] mremap: Avoid TLB flushing anonymous pages that are not in swap cache
@ 2018-06-05 17:13 Mel Gorman
  2018-06-05 18:18 ` Dave Hansen
  2018-06-05 19:53 ` Nadav Amit
  0 siblings, 2 replies; 11+ messages in thread
From: Mel Gorman @ 2018-06-05 17:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mhocko, vbabka, Aaron Lu, Dave Hansen, linux-kernel, linux-mm

Commit 5d1904204c99 ("mremap: fix race between mremap() and page cleanning")
fixed races between mremap and other operations for both file-backed and
anonymous mappings. The file-backed case was the most critical as it allowed
the possibility that data could be changed on a physical page after
page_mkclean returned, which could trigger data loss or data integrity
issues. A customer reported that the cost of the TLB flushes for anonymous
mappings was excessive, resulting in a 30-50% drop in performance overall
since this commit on a microbenchmark. Unfortunately I neither have access
to the test-case nor can I describe what it does other than saying that
mremap operations dominate heavily.

The anonymous page race fix is overkill for two reasons. Pages that are not
in the swap cache are not going to be issued for IO and if a stale TLB entry
is used, the write still occurs on the same physical page. Any race with
mmap replacing the address space is handled by mmap_sem. As anonymous pages
are often dirty, it can mean that mremap always has to flush even when it is
not necessary.

This patch special cases anonymous pages to only flush if the page is in
swap cache and can be potentially queued for IO. It uses the page lock to
serialise against any potential reclaim. If the page is added to the swap
cache on the reclaim side after the page lock is dropped on the mremap
side then reclaim will call try_to_unmap_flush_dirty() before issuing
any IO so there is no data integrity issue. This means that in the common
case where a workload avoids swap entirely that mremap is a much cheaper
operation due to the lack of TLB flushes.
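
For reference, the reclaim-side ordering this relies on looks roughly like
the following (a heavily abridged, illustrative sketch of shrink_page_list()
in mm/vmscan.c -- not part of this patch):

	if (PageDirty(page)) {
		...
		/*
		 * Force any TLB flushes deferred by the batched unmap code
		 * before the dirty page can be queued for writeback, so a
		 * stale mremap TLB entry cannot dirty the page after IO
		 * has been issued.
		 */
		try_to_unmap_flush_dirty();
		switch (pageout(page, mapping, sc)) {
		...
		}
	}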

Using another testcase that simply calls mremap heavily with a varying
number of threads, it was found that, very broadly speaking, TLB shootdowns
were reduced by 31% on average throughout the entire test case, but your
mileage will vary.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/mremap.c | 42 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 049470aa1e3e..d26c5a00fd9d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -24,6 +24,7 @@
 #include <linux/uaccess.h>
 #include <linux/mm-arch-hooks.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>
 
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
@@ -112,6 +113,41 @@ static pte_t move_soft_dirty_pte(pte_t pte)
 	return pte;
 }
 
+/* Returns true if a TLB must be flushed before PTL is dropped */
+static bool should_force_flush(pte_t *pte)
+{
+	bool is_swapcache;
+	struct page *page;
+
+	if (!pte_present(*pte) || !pte_dirty(*pte))
+		return false;
+
+	/*
+	 * If we are remapping a dirty file PTE, make sure to flush TLB
+	 * before we drop the PTL for the old PTE or we may race with
+	 * page_mkclean().
+	 */
+	page = pte_page(*pte);
+	if (page_is_file_cache(page))
+		return true;
+
+	/*
+	 * For anonymous pages, only flush swap cache pages that could
+	 * be unmapped and queued for swap since flush_tlb_batched_pending was
+	 * last called. Reclaim itself takes care that the TLB is flushed
+	 * before IO is queued. If a page is not in swap cache and a stale TLB
+	 * is used before mremap is complete then the write hits the same
+	 * physical page and there is no data loss. Check under the
+	 * page lock to avoid any potential race with reclaim.
+	 */
+	if (!trylock_page(page))
+		return true;
+	is_swapcache = PageSwapCache(page);
+	unlock_page(page);
+
+	return is_swapcache;
+}
+
 static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		unsigned long old_addr, unsigned long old_end,
 		struct vm_area_struct *new_vma, pmd_t *new_pmd,
@@ -163,15 +199,11 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 
 		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		/*
-		 * If we are remapping a dirty PTE, make sure
-		 * to flush TLB before we drop the PTL for the
-		 * old PTE or we may race with page_mkclean().
-		 *
 		 * This check has to be done after we removed the
 		 * old PTE from page tables or another thread may
 		 * dirty it after the check and before the removal.
 		 */
-		if (pte_present(pte) && pte_dirty(pte))
+		if (should_force_flush(&pte))
 			force_flush = true;
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		pte = move_soft_dirty_pte(pte);

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] mremap: Avoid TLB flushing anonymous pages that are not in swap cache
  2018-06-05 17:13 [PATCH] mremap: Avoid TLB flushing anonymous pages that are not in swap cache Mel Gorman
@ 2018-06-05 18:18 ` Dave Hansen
  2018-06-05 19:12   ` Mel Gorman
  2018-06-05 19:53 ` Nadav Amit
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Hansen @ 2018-06-05 18:18 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: mhocko, vbabka, Aaron Lu, linux-kernel, linux-mm

On 06/05/2018 10:13 AM, Mel Gorman wrote:
> The anonymous page race fix is overkill for two reasons. Pages that are not
> in the swap cache are not going to be issued for IO and if a stale TLB entry
> is used, the write still occurs on the same physical page. Any race with
> mmap replacing the address space is handled by mmap_sem. As anonymous pages
> are often dirty, it can mean that mremap always has to flush even when it is
> not necessary.

This looks fine to me.  One nit on the description: I found myself
wondering if we skip the flush under the ptl where the flush is
eventually done.  That code is a bit out of the context, so we don't see
it in the patch.

We have two modes of flushing during move_ptes():
1. The flush_tlb_range() while holding the ptl in move_ptes().
2. A flush_tlb_range() at the end of move_page_tables(), driven by
  'need_flush' which will be set any time move_ptes() does *not* flush.

This patch broadens the scope where move_ptes() does not flush and
shifts the burden to the flush inside move_page_tables().

Right?
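
(Roughly, for anyone reading along without the source handy -- a simplified
sketch of the two paths, not the literal code in mm/mremap.c:)

	static void move_ptes(..., bool *need_flush)
	{
		bool force_flush = false;

		/* for each PTE in this chunk, under the PTL: */
		pte = ptep_get_and_clear(mm, old_addr, old_pte);
		if (should_force_flush(pte))	/* was: pte_present() && pte_dirty() */
			force_flush = true;
		/* ... install the PTE at the new address ... */

		if (force_flush)
			flush_tlb_range(vma, old_end - len, old_end);	/* mode 1 */
		else
			*need_flush = true;	/* defer to the caller */
	}

	unsigned long move_page_tables(...)
	{
		bool need_flush = false;

		/* for each LATENCY_LIMIT-sized chunk of the region: */
		move_ptes(..., &need_flush);

		if (need_flush)
			flush_tlb_range(vma, old_end - len, old_addr);	/* mode 2 */
	}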

Other minor nits:

> +/* Returns true if a TLB must be flushed before PTL is dropped */
> +static bool should_force_flush(pte_t *pte)
> +{

I usually try to make the non-pte-modifying functions take a pte_t
instead of 'pte_t *' to make it obvious that there is no modification going
on.  Any reason not to do that here?

> +	if (!trylock_page(page))
> +		return true;
> +	is_swapcache = PageSwapCache(page);
> +	unlock_page(page);
> +
> +	return is_swapcache;
> +}

I was hoping we didn't have to go as far as taking the page lock, but I
guess the proof is in the pudding that this tradeoff is worth it.

BTW, do you want to add a tiny comment about why we do the
trylock_page()?  I assume it's because we don't want to wait on finding
an exact answer: we just assume it is in the swap cache if the page is
locked and flush regardless.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mremap: Avoid TLB flushing anonymous pages that are not in swap cache
  2018-06-05 18:18 ` Dave Hansen
@ 2018-06-05 19:12   ` Mel Gorman
  2018-06-05 19:49     ` Dave Hansen
  0 siblings, 1 reply; 11+ messages in thread
From: Mel Gorman @ 2018-06-05 19:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, mhocko, vbabka, Aaron Lu, linux-kernel, linux-mm

On Tue, Jun 05, 2018 at 11:18:18AM -0700, Dave Hansen wrote:
> On 06/05/2018 10:13 AM, Mel Gorman wrote:
> > The anonymous page race fix is overkill for two reasons. Pages that are not
> > in the swap cache are not going to be issued for IO and if a stale TLB entry
> > is used, the write still occurs on the same physical page. Any race with
> > mmap replacing the address space is handled by mmap_sem. As anonymous pages
> > are often dirty, it can mean that mremap always has to flush even when it is
> > not necessary.
> 
> This looks fine to me.  One nit on the description: I found myself
> wondering if we skip the flush under the ptl where the flush is
> eventually done.  That code is a bit out of the context, so we don't see
> it in the patch.
> 

That's fair enough. I updated part of the changelog to read

This patch special cases anonymous pages to only flush ranges under the
page table lock if the page is in swap cache and can be potentially queued
for IO. Note that the flush of the full range being mremapped still takes
place, so TLB flushes are not eliminated entirely.

Does that work for you?

> We have two modes of flushing during move_ptes():
> 1. The flush_tlb_range() while holding the ptl in move_ptes().
> 2. A flush_tlb_range() at the end of move_page_tables(), driven by
>   'need_flush' which will be set any time move_ptes() does *not* flush.
> 
> This patch broadens the scope where move_ptes() does not flush and
> shifts the burden to the flush inside move_page_tables().
> 
> Right?
> 

Yes. While this does not eliminate TLB flushes, it reduces the number
considerably as we potentially are replacing one-flush-per-LATENCY_LIMIT
with one flush.
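
As a rough illustration of the arithmetic (assuming 4K pages; LATENCY_LIMIT
is defined in mm/mremap.c as 64 * PAGE_SIZE, i.e. 256K per move_ptes() chunk):

	/*
	 * Illustrative numbers only: remapping a 64M anonymous region is
	 * 64M / 256K = 256 move_ptes() chunks.
	 *
	 * Before: a dirty anonymous PTE in each chunk forces a flush under
	 *         the PTL, so up to 256 flush_tlb_range() calls plus the
	 *         associated shootdown IPIs.
	 * After:  chunks whose pages are not in swap cache only set
	 *         need_flush, so a single flush_tlb_range() at the end of
	 *         move_page_tables().
	 */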

> Other minor nits:
> 
> > +/* Returns true if a TLB must be flushed before PTL is dropped */
> > +static bool should_force_flush(pte_t *pte)
> > +{
> 
> I usually try to make the non-pte-modifying functions take a pte_t
> instead of 'pte_t *' to make it obvious that there is no modification going
> on.  Any reason not to do that here?
> 

No, it was just a minor saving on stack usage.

> > +	if (!trylock_page(page))
> > +		return true;
> > +	is_swapcache = PageSwapCache(page);
> > +	unlock_page(page);
> > +
> > +	return is_swapcache;
> > +}
> 
> I was hoping we didn't have to go as far as taking the page lock, but I
> guess the proof is in the pudding that this tradeoff is worth it.
> 

In the interest of full disclosure, the feedback I have from the customer is
based on a patch that modifies the LATENCY_LIMIT. This helped but did not
eliminate the problem. This was potentially a better solution but I wanted
review first. I'm waiting on confirmation this definitely behaves better.

> BTW, do you want to add a tiny comment about why we do the
> trylock_page()?  I assume it's because we don't want to wait on finding
> an exact answer: we just assume it is in the swap cache if the page is
> locked and flush regardless.

It's really because calling lock_page while holding a spinlock is
eventually going to ruin your day.

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mremap: Avoid TLB flushing anonymous pages that are not in swap cache
  2018-06-05 19:12   ` Mel Gorman
@ 2018-06-05 19:49     ` Dave Hansen
  2018-06-05 19:51       ` Mel Gorman
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Hansen @ 2018-06-05 19:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, mhocko, vbabka, Aaron Lu, linux-kernel, linux-mm

On 06/05/2018 12:12 PM, Mel Gorman wrote:
> That's fair enough. I updated part of the changelog to read
> 
> This patch special cases anonymous pages to only flush ranges under the
> page table lock if the page is in swap cache and can be potentially queued
> for IO. Note that the flush of the full range being mremapped still takes
> place, so TLB flushes are not eliminated entirely.
> 
> Does that work for you?

Looks good, thanks.

>> I usually try to make the non-pte-modifying functions take a pte_t
>> instead of 'pte_t *' to make it obvious that there is no modification going
>> on.  Any reason not to do that here?
> 
> No, it was just a minor saving on stack usage.

We're just splitting hairs now :) but, realistically, this little helper
will get inlined anyway, so it probably all generates the same code.

...
>> BTW, do you want to add a tiny comment about why we do the
>> trylock_page()?  I assume it's because we don't want to wait on finding
>> an exact answer: we just assume it is in the swap cache if the page is
>> locked and flush regardless.
> 
> It's really because calling lock_page while holding a spinlock is
> eventually going to ruin your day.

Hah, yeah, that'll do it.  Could you comment this, too?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mremap: Avoid TLB flushing anonymous pages that are not in swap cache
  2018-06-05 19:49     ` Dave Hansen
@ 2018-06-05 19:51       ` Mel Gorman
  2018-06-05 19:54         ` Dave Hansen
  2018-06-06  8:22         ` Michal Hocko
  0 siblings, 2 replies; 11+ messages in thread
From: Mel Gorman @ 2018-06-05 19:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, mhocko, vbabka, Aaron Lu, linux-kernel, linux-mm

On Tue, Jun 05, 2018 at 12:49:18PM -0700, Dave Hansen wrote:
> >> I usually try to make the non-pte-modifying functions take a pte_t
> >> instead of 'pte_t *' to make it obvious that there is no modification going
> >> on.  Any reason not to do that here?
> > 
> > No, it was just a minor saving on stack usage.
> 
> We're just splitting hairs now :) but, realistically, this little helper
> will get inlined anyway, so it probably all generates the same code.
> 

I expect so and your point is valid that it should be obvious that no
modifications can take place.

> ...
> >> BTW, do you want to add a tiny comment about why we do the
> >> trylock_page()?  I assume it's because we don't want to wait on finding
> >> an exact answer: we just assume it is in the swap cache if the page is
> >> locked and flush regardless.
> > 
> > It's really because calling lock_page while holding a spinlock is
> > eventually going to ruin your day.
> 
> Hah, yeah, that'll do it.  Could you comment this, too?
> 

Yep, I made the change after reading your mail. This is how it currently
stands

---8<---
mremap: Avoid excessive TLB flushing for anonymous pages that are not in swap cache

Commit 5d1904204c99 ("mremap: fix race between mremap() and page cleanning")
fixed races between mremap and other operations for both file-backed and
anonymous mappings. The file-backed case was the most critical as it allowed
the possibility that data could be changed on a physical page after
page_mkclean returned, which could trigger data loss or data integrity
issues. A customer reported that the cost of the TLB flushes for anonymous
mappings was excessive, resulting in a 30-50% drop in performance overall
since this commit on a microbenchmark. Unfortunately I neither have access
to the test-case nor can I describe what it does other than saying that
mremap operations dominate heavily.

The anonymous page race fix is overkill for two reasons. Pages that are
not in the swap cache are not going to be issued for IO and if a stale TLB
entry is used, the write still occurs on the same physical page. Any race
with mmap replacing the address space is handled by mmap_sem. As anonymous
pages are often dirty, it can mean that mremap always has to flush even
when it is not necessary.

This patch special cases anonymous pages to only flush ranges under the
page table lock if the page is in swap cache and can be potentially queued
for IO. Note that the flush of the full range being mremapped still takes
place, so TLB flushes are not eliminated entirely.

It uses the page lock to serialise against any potential reclaim. If the
page is added to the swap cache on the reclaim side after the page lock is
dropped on the mremap side then reclaim will call try_to_unmap_flush_dirty()
before issuing any IO so there is no data integrity issue. This means that
in the common case where a workload avoids swap entirely that mremap is
a much cheaper operation due to the lack of TLB flushes.

Using another testcase that simply calls mremap heavily with a varying
number of threads, it was found that, very broadly speaking, TLB shootdowns
were reduced by 31% on average throughout the entire test case, but your
mileage will vary.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/mremap.c | 45 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 40 insertions(+), 5 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 049470aa1e3e..5b9767b0446e 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -24,6 +24,7 @@
 #include <linux/uaccess.h>
 #include <linux/mm-arch-hooks.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>
 
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
@@ -112,6 +113,44 @@ static pte_t move_soft_dirty_pte(pte_t pte)
 	return pte;
 }
 
+/* Returns true if a TLB must be flushed before PTL is dropped */
+static bool should_force_flush(pte_t pte)
+{
+	bool is_swapcache;
+	struct page *page;
+
+	if (!pte_present(pte) || !pte_dirty(pte))
+		return false;
+
+	/*
+	 * If we are remapping a dirty file PTE, make sure to flush TLB
+	 * before we drop the PTL for the old PTE or we may race with
+	 * page_mkclean().
+	 */
+	page = pte_page(pte);
+	if (page_is_file_cache(page))
+		return true;
+
+	/*
+	 * For anonymous pages, only flush swap cache pages that could
+	 * be unmapped and queued for swap since flush_tlb_batched_pending was
+	 * last called. Reclaim itself takes care that the TLB is flushed
+	 * before IO is queued. If a page is not in swap cache and a stale TLB
+	 * is used before mremap is complete then the write hits the same
+	 * physical page and there is no data loss.
+	 *
+	 * Check under the page lock to avoid any potential race with reclaim.
+	 * trylock is necessary as spinlocks are currently held. In the unlikely
+	 * event of contention, flush the TLB to be safe.
+	 */
+	if (!trylock_page(page))
+		return true;
+	is_swapcache = PageSwapCache(page);
+	unlock_page(page);
+
+	return is_swapcache;
+}
+
 static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		unsigned long old_addr, unsigned long old_end,
 		struct vm_area_struct *new_vma, pmd_t *new_pmd,
@@ -163,15 +202,11 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 
 		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		/*
-		 * If we are remapping a dirty PTE, make sure
-		 * to flush TLB before we drop the PTL for the
-		 * old PTE or we may race with page_mkclean().
-		 *
 		 * This check has to be done after we removed the
 		 * old PTE from page tables or another thread may
 		 * dirty it after the check and before the removal.
 		 */
-		if (pte_present(pte) && pte_dirty(pte))
+		if (should_force_flush(pte))
 			force_flush = true;
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		pte = move_soft_dirty_pte(pte);

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] mremap: Avoid TLB flushing anonymous pages that are not in swap cache
  2018-06-05 17:13 [PATCH] mremap: Avoid TLB flushing anonymous pages that are not in swap cache Mel Gorman
  2018-06-05 18:18 ` Dave Hansen
@ 2018-06-05 19:53 ` Nadav Amit
  2018-06-05 20:08   ` Mel Gorman
  1 sibling, 1 reply; 11+ messages in thread
From: Nadav Amit @ 2018-06-05 19:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Aaron Lu,
	Dave Hansen, Linux Kernel Mailing List, linux-mm

Mel Gorman <mgorman@techsingularity.net> wrote:

> Commit 5d1904204c99 ("mremap: fix race between mremap() and page cleanning")
> fixed races between mremap and other operations for both file-backed and
> anonymous mappings. The file-backed case was the most critical as it allowed
> the possibility that data could be changed on a physical page after
> page_mkclean returned, which could trigger data loss or data integrity
> issues. A customer reported that the cost of the TLB flushes for anonymous
> mappings was excessive, resulting in a 30-50% drop in performance overall
> since this commit on a microbenchmark. Unfortunately I neither have access
> to the test-case nor can I describe what it does other than saying that
> mremap operations dominate heavily.
> 
> The anonymous page race fix is overkill for two reasons. Pages that are not
> in the swap cache are not going to be issued for IO and if a stale TLB entry
> is used, the write still occurs on the same physical page. Any race with
> mmap replacing the address space is handled by mmap_sem. As anonymous pages
> are often dirty, it can mean that mremap always has to flush even when it is
> not necessary.
> 
> This patch special cases anonymous pages to only flush if the page is in
> swap cache and can be potentially queued for IO. It uses the page lock to
> serialise against any potential reclaim. If the page is added to the swap
> cache on the reclaim side after the page lock is dropped on the mremap
> side then reclaim will call try_to_unmap_flush_dirty() before issuing
> any IO so there is no data integrity issue. This means that in the common
> case where a workload avoids swap entirely that mremap is a much cheaper
> operation due to the lack of TLB flushes.
> 
> Using another testcase that simply calls mremap heavily with a varying
> number of threads, it was found that, very broadly speaking, TLB shootdowns
> were reduced by 31% on average throughout the entire test case, but your
> mileage will vary.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> mm/mremap.c | 42 +++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 37 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 049470aa1e3e..d26c5a00fd9d 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -24,6 +24,7 @@
> #include <linux/uaccess.h>
> #include <linux/mm-arch-hooks.h>
> #include <linux/userfaultfd_k.h>
> +#include <linux/mm_inline.h>
> 
> #include <asm/cacheflush.h>
> #include <asm/tlbflush.h>
> @@ -112,6 +113,41 @@ static pte_t move_soft_dirty_pte(pte_t pte)
> 	return pte;
> }
> 
> +/* Returns true if a TLB must be flushed before PTL is dropped */
> +static bool should_force_flush(pte_t *pte)
> +{
> +	bool is_swapcache;
> +	struct page *page;
> +
> +	if (!pte_present(*pte) || !pte_dirty(*pte))
> +		return false;
> +
> +	/*
> +	 * If we are remapping a dirty file PTE, make sure to flush TLB
> +	 * before we drop the PTL for the old PTE or we may race with
> +	 * page_mkclean().
> +	 */
> +	page = pte_page(*pte);
> +	if (page_is_file_cache(page))
> +		return true;
> +
> +	/*
> +	 * For anonymous pages, only flush swap cache pages that could
> +	 * be unmapped and queued for swap since flush_tlb_batched_pending was
> +	 * last called. Reclaim itself takes care that the TLB is flushed
> +	 * before IO is queued. If a page is not in swap cache and a stale TLB
> +	 * is used before mremap is complete then the write hits the same
> +	 * physical page and there is no data loss. Check under the
> +	 * page lock to avoid any potential race with reclaim.
> +	 */
> +	if (!trylock_page(page))
> +		return true;
> +	is_swapcache = PageSwapCache(page);
> +	unlock_page(page);
> +
> +	return is_swapcache;
> +}

While I do not have a specific reservation regarding the logic, I find the
current TLB invalidation scheme hard to follow and inconsistent. I guess
should_force_flush() can be extended and used more commonly to make things
clearer.

To be more specific and to give an example: Can should_force_flush() be used
in zap_pte_range() to set the force_flush instead of the current code?

  if (!PageAnon(page)) {
	if (pte_dirty(ptent)) {
		force_flush = 1;
		...
  	}

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mremap: Avoid TLB flushing anonymous pages that are not in swap cache
  2018-06-05 19:51       ` Mel Gorman
@ 2018-06-05 19:54         ` Dave Hansen
  2018-06-05 20:00           ` Mel Gorman
  2018-06-06  8:22         ` Michal Hocko
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Hansen @ 2018-06-05 19:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, mhocko, vbabka, Aaron Lu, linux-kernel, linux-mm

On 06/05/2018 12:51 PM, Mel Gorman wrote:
> Using another testcase that simply calls mremap heavily with a varying
> number of threads, it was found that, very broadly speaking, TLB shootdowns
> were reduced by 31% on average throughout the entire test case, but your
> mileage will vary.

Looks good to me.  Feel free to add my Reviewed-by.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mremap: Avoid TLB flushing anonymous pages that are not in swap cache
  2018-06-05 19:54         ` Dave Hansen
@ 2018-06-05 20:00           ` Mel Gorman
  0 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2018-06-05 20:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, mhocko, vbabka, Aaron Lu, linux-kernel, linux-mm

On Tue, Jun 05, 2018 at 12:54:27PM -0700, Dave Hansen wrote:
> On 06/05/2018 12:51 PM, Mel Gorman wrote:
> > Using another testcase that simply calls mremap heavily with a varying
> > number of threads, it was found that, very broadly speaking, TLB shootdowns
> > were reduced by 31% on average throughout the entire test case, but your
> > mileage will vary.
> 
> Looks good to me.  Feel free to add my Reviewed-by.

Thanks, I'll send a proper v2 when I hear back from the customer on what
the impact on the real workload is.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mremap: Avoid TLB flushing anonymous pages that are not in swap cache
  2018-06-05 19:53 ` Nadav Amit
@ 2018-06-05 20:08   ` Mel Gorman
  2018-06-05 22:53     ` Nadav Amit
  0 siblings, 1 reply; 11+ messages in thread
From: Mel Gorman @ 2018-06-05 20:08 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Aaron Lu,
	Dave Hansen, Linux Kernel Mailing List, linux-mm

On Tue, Jun 05, 2018 at 12:53:57PM -0700, Nadav Amit wrote:
> While I do not have a specific reservation regarding the logic, I find the
> current TLB invalidation scheme hard to follow and inconsistent. I guess
> should_force_flush() can be extended and used more commonly to make things
> clearer.
> 
> To be more specific and to give an example: Can should_force_flush() be used
> in zap_pte_range() to set the force_flush instead of the current code?
> 
>   if (!PageAnon(page)) {
> 	if (pte_dirty(ptent)) {
> 		force_flush = 1;
> 		...
>   	}
> 

That check is against !PageAnon pages where it's potentially critical
that the dirty PTE bit be propagated to the page. You could separate the
TLB flush from the dirty page setting but it's not the same
class of problem and without perf data, it's not clear it's worthwhile.
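
Purely as a sketch of what that separation might look like (hypothetical,
not something this patch does):

	/* The other !PageAnon handling (mark_page_accessed etc.) elided. */
	if (pte_dirty(ptent)) {
		if (should_force_flush(ptent))
			force_flush = 1;
		if (!PageAnon(page))
			set_page_dirty(page);
	}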

Note that I also didn't handle the huge page moving because it's already
naturally batching a larger range with a lower potential factor of TLB
flushing and has different potential race conditions.

I agree that the TLB handling would benefit from being simpler but it's
not a simple search/replace job to deal with the different cases that apply.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mremap: Avoid TLB flushing anonymous pages that are not in swap cache
  2018-06-05 20:08   ` Mel Gorman
@ 2018-06-05 22:53     ` Nadav Amit
  0 siblings, 0 replies; 11+ messages in thread
From: Nadav Amit @ 2018-06-05 22:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Aaron Lu,
	Dave Hansen, Linux Kernel Mailing List, linux-mm

Mel Gorman <mgorman@techsingularity.net> wrote:

> On Tue, Jun 05, 2018 at 12:53:57PM -0700, Nadav Amit wrote:
>> While I do not have a specific reservation regarding the logic, I find the
>> current TLB invalidation scheme hard to follow and inconsistent. I guess
>> should_force_flush() can be extended and used more commonly to make things
>> clearer.
>> 
>> To be more specific and to give an example: Can should_force_flush() be used
>> in zap_pte_range() to set the force_flush instead of the current code?
>> 
>>  if (!PageAnon(page)) {
>> 	if (pte_dirty(ptent)) {
>> 		force_flush = 1;
>> 		...
>>  	}
> 
> That check is against !PageAnon pages where it's potentially critical
> that the dirty PTE bit be propagated to the page. You could separate the
> TLB flush from the dirty page setting but it's not the same
> class of problem and without perf data, it's not clear it's worthwhile.
> 
> Note that I also didn't handle the huge page moving because it's already
> naturally batching a larger range with a lower potential factor of TLB
> flushing and has different potential race conditions.

I noticed.

> 
> I agree that the TLB handling would benefit from being simpler but it's
> not a simple search/replace job to deal with the different cases that apply.

I understand. It’s not just a matter of performance: having a consistent
implementation can prevent bugs and allow auditing of the invalidation
scheme.

Anyhow, if I find some free time, I’ll give it a shot.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mremap: Avoid TLB flushing anonymous pages that are not in swap cache
  2018-06-05 19:51       ` Mel Gorman
  2018-06-05 19:54         ` Dave Hansen
@ 2018-06-06  8:22         ` Michal Hocko
  1 sibling, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2018-06-06  8:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Hansen, Andrew Morton, vbabka, Aaron Lu, linux-kernel, linux-mm

On Tue 05-06-18 20:51:40, Mel Gorman wrote:
[...]
> mremap: Avoid excessive TLB flushing for anonymous pages that are not in swap cache
> 
> Commit 5d1904204c99 ("mremap: fix race between mremap() and page cleanning")
> fixed races between mremap and other operations for both file-backed and
> anonymous mappings. The file-backed case was the most critical as it allowed
> the possibility that data could be changed on a physical page after
> page_mkclean returned, which could trigger data loss or data integrity
> issues. A customer reported that the cost of the TLB flushes for anonymous
> mappings was excessive, resulting in a 30-50% drop in performance overall
> since this commit on a microbenchmark. Unfortunately I neither have access
> to the test-case nor can I describe what it does other than saying that
> mremap operations dominate heavily.
> 
> The anonymous page race fix is overkill for two reasons. Pages that are
> not in the swap cache are not going to be issued for IO and if a stale TLB
> entry is used, the write still occurs on the same physical page. Any race
> with mmap replacing the address space is handled by mmap_sem. As anonymous
> pages are often dirty, it can mean that mremap always has to flush even
> when it is not necessary.
> 
> This patch special cases anonymous pages to only flush ranges under the
> page table lock if the page is in swap cache and can be potentially queued
> for IO. Note that the flush of the full range being mremapped still takes
> place, so TLB flushes are not eliminated entirely.
> 
> It uses the page lock to serialise against any potential reclaim. If the
> page is added to the swap cache on the reclaim side after the page lock is
> dropped on the mremap side then reclaim will call try_to_unmap_flush_dirty()
> before issuing any IO so there is no data integrity issue. This means that
> in the common case where a workload avoids swap entirely that mremap is
> a much cheaper operation due to the lack of TLB flushes.
> 
> Using another testcase that simply calls mremap heavily with a varying
> number of threads, it was found that, very broadly speaking, TLB shootdowns
> were reduced by 31% on average throughout the entire test case, but your
> mileage will vary.

LGTM and it would be great to add some perf numbers for the specific
workload which triggered this (a mremap heavy workload which is real,
unfortunately).

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/mremap.c | 45 ++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 40 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 049470aa1e3e..5b9767b0446e 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -24,6 +24,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/mm-arch-hooks.h>
>  #include <linux/userfaultfd_k.h>
> +#include <linux/mm_inline.h>
>  
>  #include <asm/cacheflush.h>
>  #include <asm/tlbflush.h>
> @@ -112,6 +113,44 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>  	return pte;
>  }
>  
> +/* Returns true if a TLB must be flushed before PTL is dropped */
> +static bool should_force_flush(pte_t pte)
> +{
> +	bool is_swapcache;
> +	struct page *page;
> +
> +	if (!pte_present(pte) || !pte_dirty(pte))
> +		return false;
> +
> +	/*
> +	 * If we are remapping a dirty file PTE, make sure to flush TLB
> +	 * before we drop the PTL for the old PTE or we may race with
> +	 * page_mkclean().
> +	 */
> +	page = pte_page(pte);
> +	if (page_is_file_cache(page))
> +		return true;
> +
> +	/*
> +	 * For anonymous pages, only flush swap cache pages that could
> +	 * be unmapped and queued for swap since flush_tlb_batched_pending was
> +	 * last called. Reclaim itself takes care that the TLB is flushed
> +	 * before IO is queued. If a page is not in swap cache and a stale TLB
> +	 * is used before mremap is complete then the write hits the same
> +	 * physical page and there is no data loss.
> +	 *
> +	 * Check under the page lock to avoid any potential race with reclaim.
> +	 * trylock is necessary as spinlocks are currently held. In the unlikely
> +	 * event of contention, flush the TLB to be safe.
> +	 */
> +	if (!trylock_page(page))
> +		return true;
> +	is_swapcache = PageSwapCache(page);
> +	unlock_page(page);
> +
> +	return is_swapcache;
> +}
> +
>  static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>  		unsigned long old_addr, unsigned long old_end,
>  		struct vm_area_struct *new_vma, pmd_t *new_pmd,
> @@ -163,15 +202,11 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>  
>  		pte = ptep_get_and_clear(mm, old_addr, old_pte);
>  		/*
> -		 * If we are remapping a dirty PTE, make sure
> -		 * to flush TLB before we drop the PTL for the
> -		 * old PTE or we may race with page_mkclean().
> -		 *
>  		 * This check has to be done after we removed the
>  		 * old PTE from page tables or another thread may
>  		 * dirty it after the check and before the removal.
>  		 */
> -		if (pte_present(pte) && pte_dirty(pte))
> +		if (should_force_flush(pte))
>  			force_flush = true;
>  		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
>  		pte = move_soft_dirty_pte(pte);
> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-06-06  8:22 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-05 17:13 [PATCH] mremap: Avoid TLB flushing anonymous pages that are not in swap cache Mel Gorman
2018-06-05 18:18 ` Dave Hansen
2018-06-05 19:12   ` Mel Gorman
2018-06-05 19:49     ` Dave Hansen
2018-06-05 19:51       ` Mel Gorman
2018-06-05 19:54         ` Dave Hansen
2018-06-05 20:00           ` Mel Gorman
2018-06-06  8:22         ` Michal Hocko
2018-06-05 19:53 ` Nadav Amit
2018-06-05 20:08   ` Mel Gorman
2018-06-05 22:53     ` Nadav Amit
