* [patch]x86: clearing access bit don't flush tlb
@ 2014-03-26 22:30 Shaohua Li
  2014-03-26 23:55 ` Rik van Riel
  2014-04-02 13:01 ` Mel Gorman
  0 siblings, 2 replies; 21+ messages in thread
From: Shaohua Li @ 2014-03-26 22:30 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, hughd, riel, mel


I posted this patch about a year ago, but it got lost. I'm reposting it here to
see if we can make progress this time.

We use the access bit to age pages during page reclaim, and on x86 we can skip
the TLB flush when clearing a pte's access bit. The side effect: if the pte is
still cached in the TLB while its access bit is clear in the page table, the
cpu will not set the access bit again the next time it touches the page. Page
reclaim will then wrongly treat this hot page as old and reclaim it, but this
doesn't corrupt data.

And according to the Intel manual, the TLB has fewer than 1K entries, which
covers less than 4MB of memory (roughly 1K entries x 4KB pages). In today's
systems several gigabytes of memory is normal, so after page reclaim clears a
pte's access bit and before the cpu touches the page again it's quite unlikely
this page's pte is still in the TLB. Context switches flush the TLB too. The
chance of the skipped TLB flush affecting page reclaim should be very rare.

Originally (in the 2.5 kernel, maybe) we didn't flush the TLB after clearing
the access bit; Hugh added the flush to fix some ARM and sparc issues. Since I
only change this for x86, there should be no risk.

And in some workloads TLB flush overhead is very heavy. In my simple
multithreaded app that swaps heavily to several pcie SSDs, removing the TLB
flush gives about a 20% ~ 30% swapout speedup.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 arch/x86/mm/pgtable.c |   13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

Index: linux/arch/x86/mm/pgtable.c
===================================================================
--- linux.orig/arch/x86/mm/pgtable.c	2014-03-27 05:22:08.572100549 +0800
+++ linux/arch/x86/mm/pgtable.c	2014-03-27 05:46:12.456131121 +0800
@@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_
 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
-	int young;
-
-	young = ptep_test_and_clear_young(vma, address, ptep);
-	if (young)
-		flush_tlb_page(vma, address);
-
-	return young;
+	/*
+	 * In X86, clearing access bit without TLB flush doesn't cause data
+	 * corruption. Doing this could cause wrong page aging and so hot pages
+	 * are reclaimed, but the chance should be very rare.
+	 */
+	return ptep_test_and_clear_young(vma, address, ptep);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
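
For context, ptep_test_and_clear_young() -- which the patched function now
calls directly -- lives in this same file and only clears _PAGE_BIT_ACCESSED
in the page-table entry; it never touches the TLB, which is why skipping the
flush can at worst produce stale aging information. Page reclaim reaches
ptep_clear_flush_young() through the page_referenced() walk in mm/rmap.c and
counts a page as referenced if the hardware set the bit since the last scan.
Below is a sketch paraphrasing the 3.x-era x86 implementation; details such
as pte_update() may differ between kernel versions:

int ptep_test_and_clear_young(struct vm_area_struct *vma,
			      unsigned long addr, pte_t *ptep)
{
	int ret = 0;

	/* Clear the accessed bit in the page table only; any TLB entry
	 * the cpu still holds for this address is left untouched. */
	if (pte_young(*ptep))
		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
					 (unsigned long *) &ptep->pte);

	if (ret)
		pte_update(vma->vm_mm, addr, ptep);

	return ret;
}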


* Re: [patch]x86: clearing access bit don't flush tlb
  2014-03-26 22:30 [patch]x86: clearing access bit don't flush tlb Shaohua Li
@ 2014-03-26 23:55 ` Rik van Riel
  2014-03-27 17:12   ` Shaohua Li
  2014-04-02 13:01 ` Mel Gorman
  1 sibling, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2014-03-26 23:55 UTC (permalink / raw)
  To: Shaohua Li, linux-mm; +Cc: akpm, hughd, mel

On 03/26/2014 06:30 PM, Shaohua Li wrote:
>
> I posted this patch about a year ago, but it got lost. I'm reposting it here
> to see if we can make progress this time.

I believe we can make progress. However, I also
believe the code could be enhanced to address a
concern that Hugh raised last time this was
proposed...

> And according to the Intel manual, the TLB has fewer than 1K entries, which
> covers less than 4MB of memory (roughly 1K entries x 4KB pages). In today's
> systems several gigabytes of memory is normal, so after page reclaim clears a
> pte's access bit and before the cpu touches the page again it's quite unlikely
> this page's pte is still in the TLB. Context switches flush the TLB too. The
> chance of the skipped TLB flush affecting page reclaim should be very rare.

Context switch to a kernel thread does not result in a
TLB flush, due to the lazy TLB code.
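
(A sketch of the lazy TLB logic in question, paraphrasing context_switch() in
kernel/sched/core.c of roughly that era; exact code varies between versions.
A kernel thread has no mm of its own, so the scheduler lends it the previous
task's mm and skips switch_mm() entirely, leaving the old TLB contents in
place:)

	if (!mm) {				/* next is a kernel thread */
		next->active_mm = oldmm;	/* borrow the previous mm */
		atomic_inc(&oldmm->mm_count);
		enter_lazy_tlb(oldmm, next);	/* no CR3 reload, no TLB flush */
	} else
		switch_mm(oldmm, mm, next);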

While I agree with you that flushing the TLB right at
the moment the accessed bit is cleared in a PTE is
not necessary, I believe it would be good to flush
the TLB on the affected CPUs relatively soon, maybe
the next time schedule is called?

> --- linux.orig/arch/x86/mm/pgtable.c	2014-03-27 05:22:08.572100549 +0800
> +++ linux/arch/x86/mm/pgtable.c	2014-03-27 05:46:12.456131121 +0800
> @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_
>   int ptep_clear_flush_young(struct vm_area_struct *vma,
>   			   unsigned long address, pte_t *ptep)
>   {
> -	int young;
> -
> -	young = ptep_test_and_clear_young(vma, address, ptep);
> -	if (young)
> -		flush_tlb_page(vma, address);
> -
> -	return young;
> +	/*
> +	 * In X86, clearing access bit without TLB flush doesn't cause data
> +	 * corruption. Doing this could cause wrong page aging and so hot pages
> +	 * are reclaimed, but the chance should be very rare.
> +	 */
> +	return ptep_test_and_clear_young(vma, address, ptep);
>   }


At this point, we could use vma->vm_mm->cpu_vm_mask_var to
set (or clear) some bit in the per-cpu data of each CPU that
has active/valid tlb state for the mm in question.

I could see using cpu_tlbstate.state for this, or maybe
another variable in cpu_tlbstate, so switch_mm will load
both items with the same cache line.

At schedule time, the function switch_mm() can examine that
variable (it already touches that data, anyway), and flush
the TLB even if prev==next.

I suspect that would be both low overhead enough to get you
the performance gains you want, and address the concern that
we do want to flush the TLB at some point.

Does that sound reasonable?


* Re: [patch]x86: clearing access bit don't flush tlb
  2014-03-26 23:55 ` Rik van Riel
@ 2014-03-27 17:12   ` Shaohua Li
  2014-03-27 18:41     ` Rik van Riel
  0 siblings, 1 reply; 21+ messages in thread
From: Shaohua Li @ 2014-03-27 17:12 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, akpm, hughd, mel

On Wed, Mar 26, 2014 at 07:55:51PM -0400, Rik van Riel wrote:
> Context switch to a kernel thread does not result in a
> TLB flush, due to the lazy TLB code.
> 
> While I agree with you that flushing the TLB right at
> the moment the accessed bit is cleared in a PTE is
> not necessary, I believe it would be good to flush
> the TLB on the affected CPUs relatively soon, maybe
> the next time schedule is called?
> 
> At this point, we could use vma->vm_mm->cpu_vm_mask_var to
> set (or clear) some bit in the per-cpu data of each CPU that
> has active/valid tlb state for the mm in question.
> 
> I could see using cpu_tlbstate.state for this, or maybe
> another variable in cpu_tlbstate, so switch_mm will load
> both items with the same cache line.
> 
> At schedule time, the function switch_mm() can examine that
> variable (it already touches that data, anyway), and flush
> the TLB even if prev==next.
> 
> I suspect that would be both low overhead enough to get you
> the performance gains you want, and address the concern that
> we do want to flush the TLB at some point.
> 
> Does that sound reasonable?

So it looks like what you suggest is to force a TLB flush for an mm whose
access bit was cleared, in two corner cases:
1. lazy tlb mode
2. context switch between threads of one process (see the switch_mm() sketch
below)

Am I missing anything? I'm wondering if we should care about these corner
cases. On the other hand, a thread might run for a long time without
scheduling; if the corner cases are an issue, the long-running thread is the
more severe one. My point is that context switches do provide a safeguard, but
we don't depend on it. The whole theory behind this patch is that a page whose
access bit has been cleared is unlikely to be accessed again while its pte
entry is still in the TLB cache.
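
(A sketch of the prev == next path that both corner cases funnel through,
paraphrasing the unpatched switch_mm() in arch/x86/include/asm/mmu_context.h
of that era; exact code varies between versions. Unless this cpu was dropped
from mm_cpumask() while lazy, CR3 is not reloaded, so stale TLB entries
survive the switch:)

static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
			     struct task_struct *tsk)
{
	unsigned cpu = smp_processor_id();

	if (likely(prev != next)) {
		/* real mm switch: load_cr3(next->pgd) flushes the TLB */
		/* ... update cpumasks, load CR3 and the LDT ... */
	} else {
		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
			/* we were lazy and the bit was cleared: reload CR3
			 * so freed page tables cannot be used */
			load_cr3(next->pgd);
		}
		/* otherwise no CR3 reload: the TLB keeps its old entries,
		 * including ptes whose accessed bit was just cleared */
	}
}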

Thanks,
Shaohua


* Re: [patch]x86: clearing access bit don't flush tlb
  2014-03-27 17:12   ` Shaohua Li
@ 2014-03-27 18:41     ` Rik van Riel
  2014-03-28 19:02       ` Shaohua Li
  0 siblings, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2014-03-27 18:41 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, akpm, hughd, mel

[-- Attachment #1: Type: text/plain, Size: 3850 bytes --]

On 03/27/2014 01:12 PM, Shaohua Li wrote:
> So it looks like what you suggest is to force a TLB flush for an mm whose
> access bit was cleared, in two corner cases:
> 1. lazy tlb mode
> 2. context switch between threads of one process
>
> Am I missing anything? I'm wondering if we should care about these corner cases.

I believe the corner case is relatively rare, but I also
suspect that your patch could fail pretty badly in some
of those cases, and the fix is easy...

> On the other hand, a thread might run for a long time without scheduling; if
> the corner cases are an issue, the long-running thread is the more severe one.
> My point is that context switches do provide a safeguard, but we don't depend
> on it. The whole theory behind this patch is that a page whose access bit has
> been cleared is unlikely to be accessed again while its pte entry is still in
> the TLB cache.

On the contrary, a TLB with a good cache policy should
retain the most actively used entries in favor of
less actively used ones.

That means the pages we care most about keeping are
the ones most in danger of not having the accessed
bit flushed to memory.

Does the attached (untested) patch look reasonable?




[-- Attachment #2: flush_young_lazy_tlb.patch --]
[-- Type: text/x-patch, Size: 3356 bytes --]


Signed-off-by: Rik van Riel <riel@redhat.com>
---
 arch/x86/include/asm/mmu_context.h |  5 ++++-
 arch/x86/include/asm/tlbflush.h    | 12 ++++++++++++
 arch/x86/mm/pgtable.c              |  9 ++++++---
 3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index be12c53..665d98b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -39,6 +39,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 #ifdef CONFIG_SMP
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		this_cpu_write(cpu_tlbstate.active_mm, next);
+		this_cpu_write(cpu_tlbstate.force_flush, false);
 #endif
 		cpumask_set_cpu(cpu, mm_cpumask(next));
 
@@ -57,7 +58,8 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+		if (!cpumask_test_cpu(cpu, mm_cpumask(next)) ||
+				this_cpu_read(cpu_tlbstate.force_flush)) {
 			/*
 			 * On established mms, the mm_cpumask is only changed
 			 * from irq context, from ptep_clear_flush() while in
@@ -70,6 +72,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 			 * tlb flush IPI delivery. We must reload CR3
 			 * to make sure to use no freed page tables.
 			 */
+			this_cpu_write(cpu_tlbstate.force_flush, false);
 			load_cr3(next->pgd);
 			load_LDT_nolock(&next->context);
 		}
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 04905bf..f2cda2c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -151,6 +151,10 @@ static inline void reset_lazy_tlbstate(void)
 {
 }
 
+static inline void tlb_set_force_flush(int cpu)
+{
+}
+
 static inline void flush_tlb_kernel_range(unsigned long start,
 					  unsigned long end)
 {
@@ -187,6 +191,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 struct tlb_state {
 	struct mm_struct *active_mm;
 	int state;
+	bool force_flush;
 };
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
 
@@ -196,6 +201,13 @@ static inline void reset_lazy_tlbstate(void)
 	this_cpu_write(cpu_tlbstate.active_mm, &init_mm);
 }
 
+static inline void tlb_set_force_flush(int cpu)
+{
+	struct tlb_state *percputlb = &per_cpu(cpu_tlbstate, cpu);
+	if (percputlb->force_flush == false)
+		percputlb->force_flush = true;
+}
+
 #endif	/* SMP */
 
 #ifndef CONFIG_PARAVIRT
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index c96314a..dcd26e9 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -4,6 +4,7 @@
 #include <asm/pgtable.h>
 #include <asm/tlb.h>
 #include <asm/fixmap.h>
+#include <asm/tlbflush.h>
 
 #define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO
 
@@ -399,11 +400,13 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
-	int young;
+	int young, cpu;
 
 	young = ptep_test_and_clear_young(vma, address, ptep);
-	if (young)
-		flush_tlb_page(vma, address);
+	if (young) {
+		for_each_cpu(cpu, vma->vm_mm->cpu_vm_mask_var)
+			tlb_set_force_flush(cpu);
+	}
 
 	return young;
 }


* Re: [patch]x86: clearing access bit don't flush tlb
  2014-03-27 18:41     ` Rik van Riel
@ 2014-03-28 19:02       ` Shaohua Li
  2014-03-30 12:58         ` Rik van Riel
  0 siblings, 1 reply; 21+ messages in thread
From: Shaohua Li @ 2014-03-28 19:02 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, akpm, hughd, mel

On Thu, Mar 27, 2014 at 02:41:59PM -0400, Rik van Riel wrote:
> I believe the corner case is relatively rare, but I also
> suspect that your patch could fail pretty badly in some
> of those cases, and the fix is easy...
> 
> >On the other hand, a thread might run for a long time without scheduling; if
> >the corner cases are an issue, the long-running thread is the more severe one.
> >My point is that context switches do provide a safeguard, but we don't depend
> >on it. The whole theory behind this patch is that a page whose access bit has
> >been cleared is unlikely to be accessed again while its pte entry is still in
> >the TLB cache.
> 
> On the contrary, a TLB with a good cache policy should
> retain the most actively used entries in favor of
> less actively used ones.
>
> That means the pages we care most about keeping are
> the ones most in danger of not having the accessed
> bit flushed to memory.
> 
> Does the attached (untested) patch look reasonable?

It obviously works. Testing shows there is no extra cost compared to just
skipping the TLB flush, so I have no objection if you insist on a safeguard
like this. Should we also avoid entering lazy tlb mode (in context_switch)
when force_flush is set? You talked about that, but I didn't see it in the
patch. Should I push this, or will you do it?

Thanks,
Shaohua


* Re: [patch]x86: clearing access bit don't flush tlb
  2014-03-28 19:02       ` Shaohua Li
@ 2014-03-30 12:58         ` Rik van Riel
  2014-03-31  2:16           ` Shaohua Li
  0 siblings, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2014-03-30 12:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, akpm, hughd, mel

On 03/28/2014 03:02 PM, Shaohua Li wrote:
>> Does the attached (untested) patch look reasonable?
> 
> It obviously works. Testing shows there is no extra cost compared to just
> skipping the TLB flush, so I have no objection if you insist on a safeguard
> like this. Should we also avoid entering lazy tlb mode (in context_switch)
> when force_flush is set? You talked about that, but I didn't see it in the
> patch. Should I push this, or will you do it?

Thank you for testing the patch.

I think we should be fine not adding any code to the lazy tlb
entry path, because a kernel thread in lazy tlb mode should
not be accessing user space memory, except for the vhost-net
thread, which will then wake up userspace and cause the flush.

Since the performance numbers were identical, can I use your
numbers when submitting my patch? :)

-- 
All rights reversed


* Re: [patch]x86: clearing access bit don't flush tlb
  2014-03-30 12:58         ` Rik van Riel
@ 2014-03-31  2:16           ` Shaohua Li
  0 siblings, 0 replies; 21+ messages in thread
From: Shaohua Li @ 2014-03-31  2:16 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, akpm, hughd, mel

On Sun, Mar 30, 2014 at 08:58:46AM -0400, Rik van Riel wrote:
> Thank you for testing the patch.
> 
> I think we should be fine not adding any code to the lazy tlb
> entry path, because a kernel thread in lazy tlb mode should
> not be accessing user space memory, except for the vhost-net
> thread, which will then wake up userspace and cause the flush.
> 
> Since the performance numbers were identical, can I use your
> numbers when submitting my patch? :)

Sure. I tested 10 threads sequentially accessing memory, swapping to 2 pcie
SSDs. Without the patch, the swapout speed is around 1.5GB/s; with the patch,
it is around 1.85GB/s, i.e. about a 23% speedup. As I said before, the patch
can provide about a 20% ~ 30% swapout speedup for SSD swap.

Thanks,
Shaohua


* Re: [patch]x86: clearing access bit don't flush tlb
  2014-03-26 22:30 [patch]x86: clearing access bit don't flush tlb Shaohua Li
  2014-03-26 23:55 ` Rik van Riel
@ 2014-04-02 13:01 ` Mel Gorman
  2014-04-02 15:42   ` Hugh Dickins
  1 sibling, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2014-04-02 13:01 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, akpm, hughd, riel, mel

On Thu, Mar 27, 2014 at 06:30:34AM +0800, Shaohua Li wrote:
> 
> I posted this patch about a year ago, but it got lost. I'm reposting it here
> to see if we can make progress this time.
> Signed-off-by: Shaohua Li <shli@fusionio.com>

I'm aware of the discussion on the more complex version and its outcome.
While I think the corner case is real, I also think it's very unlikely, and
this is an x86-only change that is at least safe from corruption;

Acked-by: Mel Gorman <mgorman@suse.de>

Shaohua, you almost certainly should resend this to Andrew with the
acks you collected so that he does not have to dig into the history
trying to figure out what the exact story is.

Thanks.

-- 
Mel Gorman
SUSE Labs


* Re: [patch]x86: clearing access bit don't flush tlb
  2014-04-02 13:01 ` Mel Gorman
@ 2014-04-02 15:42   ` Hugh Dickins
  0 siblings, 0 replies; 21+ messages in thread
From: Hugh Dickins @ 2014-04-02 15:42 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Mel Gorman, linux-mm, akpm, hughd, riel, mel

On Wed, 2 Apr 2014, Mel Gorman wrote:
> I'm aware of the discussion on the more complex version and its outcome.
> While I think the corner case is real, I also think it's very unlikely, and
> this is an x86-only change that is at least safe from corruption;
> 
> Acked-by: Mel Gorman <mgorman@suse.de>
> 
> Shaohua, you almost certainly should resend this to Andrew with the
> acks you collected so that he does not have to dig into the history
> trying to figure out what the exact story is.

And you can add my

Acked-by: Hugh Dickins <hughd@google.com>

to your collection too: you and I discussed this at LSF/MM, and nowadays
I agree that the corner case that originally worried me (highly-accessed
page not getting its accessed bit updated and then temporarily unmapped)
is too unlikely a case to refuse the optimization: it might happen
occasionally, but I doubt anybody will notice.

Hugh


* Re: [patch] x86: clearing access bit don't flush tlb
  2014-04-08  7:58     ` Shaohua Li
@ 2014-04-14 11:36       ` Ingo Molnar
  -1 siblings, 0 replies; 21+ messages in thread
From: Ingo Molnar @ 2014-04-14 11:36 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, akpm, riel, hughd, mgorman, torvalds,
	Peter Zijlstra, Thomas Gleixner


* Shaohua Li <shli@kernel.org> wrote:

> On Thu, Apr 03, 2014 at 01:35:37PM +0200, Ingo Molnar wrote:
> > 
> > * Shaohua Li <shli@kernel.org> wrote:
> > 
> > > +	/*
> > > +	 * In X86, clearing access bit without TLB flush doesn't cause data
> > > +	 * corruption. Doing this could cause wrong page aging and so hot pages
> > > +	 * are reclaimed, but the chance should be very rare.
> > 
> > So, beyond the spelling mistakes, I guess this explanation should also 
> > be a bit more explanatory - how about something like:
> > 
> > 	/*
> > 	 * On x86 CPUs, clearing the accessed bit without a TLB flush 
> > 	 * doesn't cause data corruption. [ It could cause incorrect
> > 	 * page aging and the (mistaken) reclaim of hot pages, but the
> > 	 * chance of that should be relatively low. ]
> > 	 *
> > 	 * So as a performance optimization don't flush the TLB when 
> > 	 * clearing the accessed bit, it will eventually be flushed by 
> > 	 * a context switch or a VM operation anyway. [ In the rare 
> > 	 * event of it not getting flushed for a long time the delay 
> > 	 * shouldn't really matter because there's no real memory 
> > 	 * pressure for swapout to react to. ]
> > 	 */
> > 
> > Agreed?
> 
> I've changed the comments and added Johannes' ACK, so you can pick this up directly.
>  
> Subject: x86: clearing access bit don't flush tlb
> 
> Updated the comment text as suggested by Ingo.

I fixed this changelog as well.

> Signed-off-by: Shaohua Li <shli@fusionio.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Hugh Dickins <hughd@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  arch/x86/mm/pgtable.c |   21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> Index: linux/arch/x86/mm/pgtable.c
> ===================================================================
> --- linux.orig/arch/x86/mm/pgtable.c	2014-04-07 08:36:02.843221074 +0800
> +++ linux/arch/x86/mm/pgtable.c	2014-04-07 08:37:26.438170140 +0800
> @@ -399,13 +399,20 @@ int pmdp_test_and_clear_young(struct vm_
>  int ptep_clear_flush_young(struct vm_area_struct *vma,
>  			   unsigned long address, pte_t *ptep)
>  {
> -	int young;
> -
> -	young = ptep_test_and_clear_young(vma, address, ptep);
> -	if (young)
> -		flush_tlb_page(vma, address);
> -
> -	return young;
> +	/*
> +         * On x86 CPUs, clearing the accessed bit without a TLB flush
> +         * doesn't cause data corruption. [ It could cause incorrect
> +         * page aging and the (mistaken) reclaim of hot pages, but the
> +         * chance of that should be relatively low. ]
> +         *
> +         * So as a performance optimization don't flush the TLB when
> +         * clearing the accessed bit, it will eventually be flushed by
> +         * a context switch or a VM operation anyway. [ In the rare
> +         * event of it not getting flushed for a long time the delay
> +         * shouldn't really matter because there's no real memory
> +         * pressure for swapout to react to. ]
> +         */

There's whitespace damage here - I fixed that up as well.

Please use scripts/checkpatch.pl before submitting patches, to make
sure there are no fixable problems in them.

Thanks,

	Ingo



* Re: [patch] x86: clearing access bit don't flush tlb
  2014-04-03 11:35   ` Ingo Molnar
@ 2014-04-08  7:58     ` Shaohua Li
  -1 siblings, 0 replies; 21+ messages in thread
From: Shaohua Li @ 2014-04-08  7:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, akpm, riel, hughd, mgorman, torvalds,
	Peter Zijlstra, Thomas Gleixner

On Thu, Apr 03, 2014 at 01:35:37PM +0200, Ingo Molnar wrote:
> > +	/*
> > +	 * In X86, clearing access bit without TLB flush doesn't cause data
> > +	 * corruption. Doing this could cause wrong page aging and so hot pages
> > +	 * are reclaimed, but the chance should be very rare.
> 
> So, I guess this explanation should be a bit more thorough - how about
> something like:
> 
> 	/*
> 	 * On x86 CPUs, clearing the accessed bit without a TLB flush 
> 	 * doesn't cause data corruption. [ It could cause incorrect
> 	 * page aging and the (mistaken) reclaim of hot pages, but the
> 	 * chance of that should be relatively low. ]
> 	 *
> 	 * So as a performance optimization don't flush the TLB when 
> 	 * clearing the accessed bit, it will eventually be flushed by 
> 	 * a context switch or a VM operation anyway. [ In the rare 
> 	 * event of it not getting flushed for a long time the delay 
> 	 * shouldn't really matter because there's no real memory 
> 	 * pressure for swapout to react to. ]
> 	 */
> 
> Agreed?

Changed the comment and added Johannes's ACK, so you can pick this up directly.
 
Subject: x86: clearing access bit don't flush tlb

We use the access bit to age a page at page reclaim time. When clearing the
pte access bit, we can skip the TLB flush on x86. The side effect is that if
the pte is in the TLB and its access bit is unset in the page table, then
when the cpu accesses the page again it will not set the page table pte's
access bit. The next time, page reclaim will think this hot page is old and
reclaim it wrongly, but this doesn't corrupt data.

And according to the Intel manual, the TLB has less than 1k entries, which
covers less than 4M of memory (1024 entries x 4k pages = 4M). In today's
systems, several gigabytes of memory are normal. After page reclaim clears a
pte access bit and before the cpu accesses the page again, it's quite
unlikely that this page's pte is still in the TLB. And a context switch will
flush the TLB too. The chance of the skipped TLB flush impacting page
reclaim should be very rare.

Originally (in the 2.5 kernel maybe), we didn't do a TLB flush after
clearing the access bit. Hugh added it to fix some ARM and sparc issues.
Since I only change this for x86, there should be no risk.

And in some workloads, TLB flush overhead is very heavy. In my simple
multithreaded app with a lot of swap to several PCIe SSDs, removing the TLB
flush gives about a 20% ~ 30% swapout speedup.

Updated the comment text per Ingo's suggestion.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 arch/x86/mm/pgtable.c |   21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

Index: linux/arch/x86/mm/pgtable.c
===================================================================
--- linux.orig/arch/x86/mm/pgtable.c	2014-04-07 08:36:02.843221074 +0800
+++ linux/arch/x86/mm/pgtable.c	2014-04-07 08:37:26.438170140 +0800
@@ -399,13 +399,20 @@ int pmdp_test_and_clear_young(struct vm_
 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
-	int young;
-
-	young = ptep_test_and_clear_young(vma, address, ptep);
-	if (young)
-		flush_tlb_page(vma, address);
-
-	return young;
+	/*
+         * On x86 CPUs, clearing the accessed bit without a TLB flush
+         * doesn't cause data corruption. [ It could cause incorrect
+         * page aging and the (mistaken) reclaim of hot pages, but the
+         * chance of that should be relatively low. ]
+         *
+         * So as a performance optimization don't flush the TLB when
+         * clearing the accessed bit, it will eventually be flushed by
+         * a context switch or a VM operation anyway. [ In the rare
+         * event of it not getting flushed for a long time the delay
+         * shouldn't really matter because there's no real memory
+         * pressure for swapout to react to. ]
+         */
+	return ptep_test_and_clear_young(vma, address, ptep);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
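
For context, the x86 ptep_test_and_clear_young() that the function above now
tail-calls looks roughly like the sketch below (paraphrased from memory from
the same file, arch/x86/mm/pgtable.c; exact details may differ between
kernel versions). The point is that the accessed bit is cleared atomically
in the page table itself, so skipping the flush can only leave a stale TLB
entry behind, never a stale page table:

int ptep_test_and_clear_young(struct vm_area_struct *vma,
			      unsigned long addr, pte_t *ptep)
{
	int ret = 0;

	/* Only do the atomic read-modify-write if the bit is set. */
	if (pte_young(*ptep))
		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
					 (unsigned long *) &ptep->pte);

	/* Paravirt PTE-update hook; a no-op on bare metal. */
	if (ret)
		pte_update(vma->vm_mm, addr, ptep);

	return ret;
}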

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] x86: clearing access bit don't flush tlb
  2014-04-03 13:45     ` Shaohua Li
@ 2014-04-04 15:01       ` Johannes Weiner
  -1 siblings, 0 replies; 21+ messages in thread
From: Johannes Weiner @ 2014-04-04 15:01 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Ingo Molnar, linux-kernel, linux-mm, akpm, riel, hughd, mgorman,
	torvalds, Peter Zijlstra, Thomas Gleixner

On Thu, Apr 03, 2014 at 09:45:42PM +0800, Shaohua Li wrote:
> On Thu, Apr 03, 2014 at 01:35:37PM +0200, Ingo Molnar wrote:
> > 
> > * Shaohua Li <shli@kernel.org> wrote:
> > 
> > > Add a few acks and resend this patch.
> > > 
> > > We use the access bit to age a page at page reclaim time. When clearing the
> > > pte access bit, we can skip the TLB flush on x86. The side effect is that if
> > > the pte is in the TLB and its access bit is unset in the page table, then
> > > when the cpu accesses the page again it will not set the page table pte's
> > > access bit. The next time, page reclaim will think this hot page is old and
> > > reclaim it wrongly, but this doesn't corrupt data.
> > > 
> > > And according to the Intel manual, the TLB has less than 1k entries, which
> > > covers less than 4M of memory (1024 entries x 4k pages = 4M). In today's
> > > systems, several gigabytes of memory are normal. After page reclaim clears a
> > > pte access bit and before the cpu accesses the page again, it's quite
> > > unlikely that this page's pte is still in the TLB. And a context switch will
> > > flush the TLB too. The chance of the skipped TLB flush impacting page
> > > reclaim should be very rare.
> > > 
> > > Originally (in the 2.5 kernel maybe), we didn't do a TLB flush after
> > > clearing the access bit. Hugh added it to fix some ARM and sparc issues.
> > > Since I only change this for x86, there should be no risk.
> > > 
> > > And in some workloads, TLB flush overhead is very heavy. In my simple
> > > multithreaded app with a lot of swap to several PCIe SSDs, removing the TLB
> > > flush gives about a 20% ~ 30% swapout speedup.
> > > 
> > > Signed-off-by: Shaohua Li <shli@fusionio.com>
> > > Acked-by: Rik van Riel <riel@redhat.com>
> > > Acked-by: Mel Gorman <mgorman@suse.de>
> > > Acked-by: Hugh Dickins <hughd@google.com>
> > > ---
> > >  arch/x86/mm/pgtable.c |   13 ++++++-------
> > >  1 file changed, 6 insertions(+), 7 deletions(-)
> > > 
> > > Index: linux/arch/x86/mm/pgtable.c
> > > ===================================================================
> > > --- linux.orig/arch/x86/mm/pgtable.c	2014-03-27 05:22:08.572100549 +0800
> > > +++ linux/arch/x86/mm/pgtable.c	2014-03-27 05:46:12.456131121 +0800
> > > @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_
> > >  int ptep_clear_flush_young(struct vm_area_struct *vma,
> > >  			   unsigned long address, pte_t *ptep)
> > >  {
> > > -	int young;
> > > -
> > > -	young = ptep_test_and_clear_young(vma, address, ptep);
> > > -	if (young)
> > > -		flush_tlb_page(vma, address);
> > > -
> > > -	return young;
> > > +	/*
> > > +	 * In X86, clearing access bit without TLB flush doesn't cause data
> > > +	 * corruption. Doing this could cause wrong page aging and so hot pages
> > > +	 * are reclaimed, but the chance should be very rare.
> > 
> > So, I guess this explanation should be a bit more thorough - how about
> > something like:
> > 
> > 	/*
> > 	 * On x86 CPUs, clearing the accessed bit without a TLB flush 
> > 	 * doesn't cause data corruption. [ It could cause incorrect
> > 	 * page aging and the (mistaken) reclaim of hot pages, but the
> > 	 * chance of that should be relatively low. ]
> > 	 *
> > 	 * So as a performance optimization don't flush the TLB when 
> > 	 * clearing the accessed bit, it will eventually be flushed by 
> > 	 * a context switch or a VM operation anyway. [ In the rare 
> > 	 * event of it not getting flushed for a long time the delay 
> > 	 * shouldn't really matter because there's no real memory 
> > 	 * pressure for swapout to react to. ]
> > 	 */
> > 
> > Agreed?
> 
> Sure, that's better, thanks!

With Ingo's updated comment:

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] x86: clearing access bit don't flush tlb
  2014-04-03 11:35   ` Ingo Molnar
@ 2014-04-03 13:45     ` Shaohua Li
  -1 siblings, 0 replies; 21+ messages in thread
From: Shaohua Li @ 2014-04-03 13:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, akpm, riel, hughd, mgorman, torvalds,
	Peter Zijlstra, Thomas Gleixner

On Thu, Apr 03, 2014 at 01:35:37PM +0200, Ingo Molnar wrote:
> 
> * Shaohua Li <shli@kernel.org> wrote:
> 
> > Add a few acks and resend this patch.
> > 
> > We use the access bit to age a page at page reclaim time. When clearing the
> > pte access bit, we can skip the TLB flush on x86. The side effect is that if
> > the pte is in the TLB and its access bit is unset in the page table, then
> > when the cpu accesses the page again it will not set the page table pte's
> > access bit. The next time, page reclaim will think this hot page is old and
> > reclaim it wrongly, but this doesn't corrupt data.
> > 
> > And according to the Intel manual, the TLB has less than 1k entries, which
> > covers less than 4M of memory (1024 entries x 4k pages = 4M). In today's
> > systems, several gigabytes of memory are normal. After page reclaim clears a
> > pte access bit and before the cpu accesses the page again, it's quite
> > unlikely that this page's pte is still in the TLB. And a context switch will
> > flush the TLB too. The chance of the skipped TLB flush impacting page
> > reclaim should be very rare.
> > 
> > Originally (in the 2.5 kernel maybe), we didn't do a TLB flush after
> > clearing the access bit. Hugh added it to fix some ARM and sparc issues.
> > Since I only change this for x86, there should be no risk.
> > 
> > And in some workloads, TLB flush overhead is very heavy. In my simple
> > multithreaded app with a lot of swap to several PCIe SSDs, removing the TLB
> > flush gives about a 20% ~ 30% swapout speedup.
> > 
> > Signed-off-by: Shaohua Li <shli@fusionio.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > Acked-by: Mel Gorman <mgorman@suse.de>
> > Acked-by: Hugh Dickins <hughd@google.com>
> > ---
> >  arch/x86/mm/pgtable.c |   13 ++++++-------
> >  1 file changed, 6 insertions(+), 7 deletions(-)
> > 
> > Index: linux/arch/x86/mm/pgtable.c
> > ===================================================================
> > --- linux.orig/arch/x86/mm/pgtable.c	2014-03-27 05:22:08.572100549 +0800
> > +++ linux/arch/x86/mm/pgtable.c	2014-03-27 05:46:12.456131121 +0800
> > @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_
> >  int ptep_clear_flush_young(struct vm_area_struct *vma,
> >  			   unsigned long address, pte_t *ptep)
> >  {
> > -	int young;
> > -
> > -	young = ptep_test_and_clear_young(vma, address, ptep);
> > -	if (young)
> > -		flush_tlb_page(vma, address);
> > -
> > -	return young;
> > +	/*
> > +	 * In X86, clearing access bit without TLB flush doesn't cause data
> > +	 * corruption. Doing this could cause wrong page aging and so hot pages
> > +	 * are reclaimed, but the chance should be very rare.
> 
> So, I guess this explanation should be a bit more thorough - how about
> something like:
> 
> 	/*
> 	 * On x86 CPUs, clearing the accessed bit without a TLB flush 
> 	 * doesn't cause data corruption. [ It could cause incorrect
> 	 * page aging and the (mistaken) reclaim of hot pages, but the
> 	 * chance of that should be relatively low. ]
> 	 *
> 	 * So as a performance optimization don't flush the TLB when 
> 	 * clearing the accessed bit, it will eventually be flushed by 
> 	 * a context switch or a VM operation anyway. [ In the rare 
> 	 * event of it not getting flushed for a long time the delay 
> 	 * shouldn't really matter because there's no real memory 
> 	 * pressure for swapout to react to. ]
> 	 */
> 
> Agreed?

Sure, that's better, thanks!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] x86: clearing access bit don't flush tlb
  2014-04-03  0:42 ` Shaohua Li
@ 2014-04-03 11:35   ` Ingo Molnar
  -1 siblings, 0 replies; 21+ messages in thread
From: Ingo Molnar @ 2014-04-03 11:35 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, akpm, riel, hughd, mgorman, torvalds,
	Peter Zijlstra, Thomas Gleixner


* Shaohua Li <shli@kernel.org> wrote:

> Add a few acks and resend this patch.
> 
> We use the access bit to age a page at page reclaim time. When clearing the
> pte access bit, we can skip the TLB flush on x86. The side effect is that if
> the pte is in the TLB and its access bit is unset in the page table, then
> when the cpu accesses the page again it will not set the page table pte's
> access bit. The next time, page reclaim will think this hot page is old and
> reclaim it wrongly, but this doesn't corrupt data.
> 
> And according to the Intel manual, the TLB has less than 1k entries, which
> covers less than 4M of memory (1024 entries x 4k pages = 4M). In today's
> systems, several gigabytes of memory are normal. After page reclaim clears a
> pte access bit and before the cpu accesses the page again, it's quite
> unlikely that this page's pte is still in the TLB. And a context switch will
> flush the TLB too. The chance of the skipped TLB flush impacting page
> reclaim should be very rare.
> 
> Originally (in the 2.5 kernel maybe), we didn't do a TLB flush after
> clearing the access bit. Hugh added it to fix some ARM and sparc issues.
> Since I only change this for x86, there should be no risk.
> 
> And in some workloads, TLB flush overhead is very heavy. In my simple
> multithreaded app with a lot of swap to several PCIe SSDs, removing the TLB
> flush gives about a 20% ~ 30% swapout speedup.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Hugh Dickins <hughd@google.com>
> ---
>  arch/x86/mm/pgtable.c |   13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> Index: linux/arch/x86/mm/pgtable.c
> ===================================================================
> --- linux.orig/arch/x86/mm/pgtable.c	2014-03-27 05:22:08.572100549 +0800
> +++ linux/arch/x86/mm/pgtable.c	2014-03-27 05:46:12.456131121 +0800
> @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_
>  int ptep_clear_flush_young(struct vm_area_struct *vma,
>  			   unsigned long address, pte_t *ptep)
>  {
> -	int young;
> -
> -	young = ptep_test_and_clear_young(vma, address, ptep);
> -	if (young)
> -		flush_tlb_page(vma, address);
> -
> -	return young;
> +	/*
> +	 * In X86, clearing access bit without TLB flush doesn't cause data
> +	 * corruption. Doing this could cause wrong page aging and so hot pages
> +	 * are reclaimed, but the chance should be very rare.

So, I guess this explanation should be a bit more thorough - how about
something like:

	/*
	 * On x86 CPUs, clearing the accessed bit without a TLB flush 
	 * doesn't cause data corruption. [ It could cause incorrect
	 * page aging and the (mistaken) reclaim of hot pages, but the
	 * chance of that should be relatively low. ]
	 *
	 * So as a performance optimization don't flush the TLB when 
	 * clearing the accessed bit, it will eventually be flushed by 
	 * a context switch or a VM operation anyway. [ In the rare 
	 * event of it not getting flushed for a long time the delay 
	 * shouldn't really matter because there's no real memory 
	 * pressure for swapout to react to. ]
	 */

Agreed?

Thanks,

	Ingo
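
P.S.: on the "flushed by a context switch anyway" part above: loading CR3 on
x86 invalidates all non-global TLB entries, and user pages are mapped
non-global (kernel mappings are the global ones), so stale accessed-bit
state for user mappings is dropped at the next address-space switch. A
minimal sketch of that step, with an invented helper name and ignoring PCID,
which Linux does not use here:

static inline void switch_mm_load_cr3(pgd_t *pgdir)
{
	/*
	 * Writing CR3 implicitly invalidates all non-global TLB
	 * entries, i.e. every user-space translation, including any
	 * whose accessed bit was cleared without an explicit flush.
	 */
	write_cr3(__pa(pgdir));
}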

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch]x86: clearing access bit don't flush tlb
@ 2014-04-03  0:42 ` Shaohua Li
  0 siblings, 0 replies; 21+ messages in thread
From: Shaohua Li @ 2014-04-03  0:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, akpm, mingo, riel, hughd, mgorman, torvalds

Add a few acks and resend this patch.

We use the access bit to age a page at page reclaim time. When clearing the
pte access bit, we can skip the TLB flush on x86. The side effect is that if
the pte is in the TLB and its access bit is unset in the page table, then
when the cpu accesses the page again it will not set the page table pte's
access bit. The next time, page reclaim will think this hot page is old and
reclaim it wrongly, but this doesn't corrupt data.

And according to the Intel manual, the TLB has less than 1k entries, which
covers less than 4M of memory (1024 entries x 4k pages = 4M). In today's
systems, several gigabytes of memory are normal. After page reclaim clears a
pte access bit and before the cpu accesses the page again, it's quite
unlikely that this page's pte is still in the TLB. And a context switch will
flush the TLB too. The chance of the skipped TLB flush impacting page
reclaim should be very rare.

Originally (in the 2.5 kernel maybe), we didn't do a TLB flush after
clearing the access bit. Hugh added it to fix some ARM and sparc issues.
Since I only change this for x86, there should be no risk.

And in some workloads, TLB flush overhead is very heavy. In my simple
multithreaded app with a lot of swap to several PCIe SSDs, removing the TLB
flush gives about a 20% ~ 30% swapout speedup.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Hugh Dickins <hughd@google.com>
---
 arch/x86/mm/pgtable.c |   13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

Index: linux/arch/x86/mm/pgtable.c
===================================================================
--- linux.orig/arch/x86/mm/pgtable.c	2014-03-27 05:22:08.572100549 +0800
+++ linux/arch/x86/mm/pgtable.c	2014-03-27 05:46:12.456131121 +0800
@@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_
 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
-	int young;
-
-	young = ptep_test_and_clear_young(vma, address, ptep);
-	if (young)
-		flush_tlb_page(vma, address);
-
-	return young;
+	/*
+	 * In X86, clearing access bit without TLB flush doesn't cause data
+	 * corruption. Doing this could cause wrong page aging and so hot pages
+	 * are reclaimed, but the chance should be very rare.
+	 */
+	return ptep_test_and_clear_young(vma, address, ptep);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
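
To make the reclaim side concrete, the rmap walk drives this function along
the lines of the sketch below. This is illustrative only, not the actual
mm/rmap.c code: the pte lock and the walk itself are elided, and the helper
name is invented. It shows why a stale TLB entry can only mis-age a page,
never corrupt it:

/* Illustrative sketch: how reclaim samples the accessed bit for one
 * mapping of a page while aging it. */
static int page_referenced_one_sketch(struct vm_area_struct *vma,
				      unsigned long address, pte_t *ptep)
{
	int referenced = 0;

	/*
	 * Atomically test-and-clear the hardware accessed bit. With
	 * this patch there is no TLB flush on x86, so a CPU holding a
	 * stale TLB entry may re-touch the page without re-setting the
	 * bit; the page then merely looks cold on the next scan.
	 */
	if (ptep_clear_flush_young(vma, address, ptep))
		referenced++;

	return referenced;
}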

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-04-14 11:36 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-26 22:30 [patch]x86: clearing access bit don't flush tlb Shaohua Li
2014-03-26 23:55 ` Rik van Riel
2014-03-27 17:12   ` Shaohua Li
2014-03-27 18:41     ` Rik van Riel
2014-03-28 19:02       ` Shaohua Li
2014-03-30 12:58         ` Rik van Riel
2014-03-31  2:16           ` Shaohua Li
2014-04-02 13:01 ` Mel Gorman
2014-04-02 15:42   ` Hugh Dickins
2014-04-03  0:42 Shaohua Li
2014-04-03 11:35 ` [patch] x86: " Ingo Molnar
2014-04-03 13:45   ` Shaohua Li
2014-04-04 15:01     ` Johannes Weiner
2014-04-08  7:58   ` Shaohua Li
2014-04-14 11:36     ` Ingo Molnar
