linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] mm: make faultaround produce old ptes
@ 2017-11-28  5:07 Vinayak Menon
  2017-11-28  5:07 ` [PATCH 2/2] arm64: add faultaround mm hook Vinayak Menon
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Vinayak Menon @ 2017-11-28  5:07 UTC (permalink / raw)
  To: linux-arm-kernel, linux-mm
  Cc: riel, jack, catalin.marinas, dave.hansen, will.deacon, minchan,
	Vinayak Menon, mgorman, ying.huang, akpm, torvalds,
	kirill.shutemov

Based on Kirill's patch [1].

Currently, faultaround code produces young pte.  This can screw up
vmscan behaviour[2], as it makes vmscan think that these pages are hot
and not push them out on first round.

During sparse file access faultaround gets more pages mapped and all of
them are young.  Under memory pressure, this makes vmscan swap out anon
pages instead, or to drop other page cache pages which otherwise stay
resident.

Modify faultaround to produce old ptes, so they can easily be reclaimed
under memory pressure.

This can to some extend defeat the purpose of faultaround on machines
without hardware accessed bit as it will not help us with reducing the
number of minor page faults.

Making the faultaround ptes old results in a unixbench regression for some
architectures [3][4]. But on some architectures it is not found to cause
any regression. So by default produce young ptes and provide an option for
architectures to make the ptes old.

[1] http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com
[2] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org
[3] https://marc.info/?l=linux-kernel&m=146582237922378&w=2
[4] https://marc.info/?l=linux-mm&m=146589376909424&w=2

Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
---
 include/linux/mm-arch-hooks.h | 7 +++++++
 include/linux/mm.h            | 2 ++
 mm/filemap.c                  | 4 ++++
 mm/memory.c                   | 5 +++++
 4 files changed, 18 insertions(+)

diff --git a/include/linux/mm-arch-hooks.h b/include/linux/mm-arch-hooks.h
index 4efc3f56..0322b98 100644
--- a/include/linux/mm-arch-hooks.h
+++ b/include/linux/mm-arch-hooks.h
@@ -22,4 +22,11 @@ static inline void arch_remap(struct mm_struct *mm,
 #define arch_remap arch_remap
 #endif
 
+#ifndef arch_faultaround_pte_mkold
+static inline void arch_faultaround_pte_mkold(struct vm_fault *vmf)
+{
+}
+#define arch_faultaround_pte_mkold arch_faultaround_pte_mkold
+#endif
+
 #endif /* _LINUX_MM_ARCH_HOOKS_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7661156..be689a0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -302,6 +302,7 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
 #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
 #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
+#define FAULT_FLAG_MKOLD	0x200	/* Make faultaround ptes old */
 
 #define FAULT_FLAG_TRACE \
 	{ FAULT_FLAG_WRITE,		"WRITE" }, \
@@ -330,6 +331,7 @@ struct vm_fault {
 	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	unsigned long address;		/* Faulting virtual address */
+	unsigned long fault_address;    /* Saved faulting virtual address */
 	pmd_t *pmd;			/* Pointer to pmd entry matching
 					 * the 'address' */
 	pud_t *pud;			/* Pointer to pud entry matching
diff --git a/mm/filemap.c b/mm/filemap.c
index 693f622..63c7bf4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -36,6 +36,7 @@
 #include <linux/cleancache.h>
 #include <linux/shmem_fs.h>
 #include <linux/rmap.h>
+#include <linux/mm-arch-hooks.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -2677,6 +2678,9 @@ void filemap_map_pages(struct vm_fault *vmf,
 		if (vmf->pte)
 			vmf->pte += iter.index - last_pgoff;
 		last_pgoff = iter.index;
+
+		arch_faultaround_pte_mkold(vmf);
+
 		if (alloc_set_pte(vmf, NULL, page))
 			goto unlock;
 		unlock_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 24e9e1d..210dea3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3398,6 +3398,10 @@ int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 	entry = mk_pte(page, vma->vm_page_prot);
 	if (write)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+
+	if (vmf->flags & FAULT_FLAG_MKOLD)
+		entry = pte_mkold(entry);
+
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -3527,6 +3531,7 @@ static int do_fault_around(struct vm_fault *vmf)
 	pgoff_t end_pgoff;
 	int off, ret = 0;
 
+	vmf->fault_address = address;
 	nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
 	mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
 
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] arm64: add faultaround mm hook
  2017-11-28  5:07 [PATCH 1/2] mm: make faultaround produce old ptes Vinayak Menon
@ 2017-11-28  5:07 ` Vinayak Menon
  2017-11-28  9:12 ` [PATCH 1/2] mm: make faultaround produce old ptes Jan Kara
  2017-11-28 19:45 ` Linus Torvalds
  2 siblings, 0 replies; 12+ messages in thread
From: Vinayak Menon @ 2017-11-28  5:07 UTC (permalink / raw)
  To: linux-arm-kernel, linux-mm
  Cc: riel, jack, catalin.marinas, dave.hansen, will.deacon, minchan,
	Vinayak Menon, mgorman, ying.huang, akpm, torvalds,
	kirill.shutemov

The ptes produced by faultaround feature are by default young
and that is found to cause page reclaim issues [1]. But making
the ptes old results in a unixbench regression for some
architectures [2]. But arm64 doesn't show the regression.

unixbench shell8 scores (5 runs min, max, avg):
Base: (741,748,744)
With this patch: (739,748,743)

Add a faultaround mm hook to make the faultaround ptes old only for arm64.

[1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org
[2] https://marc.info/?l=linux-kernel&m=146582237922378&w=2

Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
---
 arch/arm64/include/asm/Kbuild          |  1 -
 arch/arm64/include/asm/mm-arch-hooks.h | 20 ++++++++++++++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/include/asm/mm-arch-hooks.h

diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index e63d0a8..0043f7c 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -13,7 +13,6 @@ generic-y += kmap_types.h
 generic-y += local.h
 generic-y += local64.h
 generic-y += mcs_spinlock.h
-generic-y += mm-arch-hooks.h
 generic-y += msi.h
 generic-y += preempt.h
 generic-y += qrwlock.h
diff --git a/arch/arm64/include/asm/mm-arch-hooks.h b/arch/arm64/include/asm/mm-arch-hooks.h
new file mode 100644
index 0000000..b34d730
--- /dev/null
+++ b/arch/arm64/include/asm/mm-arch-hooks.h
@@ -0,0 +1,20 @@
+#ifndef _ASM_MM_ARCH_HOOKS_H
+#define _ASM_MM_ARCH_HOOKS_H
+
+#ifdef CONFIG_ARM64_HW_AFDBM
+static inline void arch_faultaround_pte_mkold(struct vm_fault *vmf)
+{
+	if (vmf->address != vmf->fault_address)
+		vmf->flags |= FAULT_FLAG_MKOLD;
+	else
+		vmf->flags &= ~FAULT_FLAG_MKOLD;
+}
+#else
+static inline void arch_faultaround_pte_mkold(struct vm_fault *vmf)
+{
+}
+#endif
+
+#define arch_faultaround_pte_mkold arch_faultaround_pte_mkold
+
+#endif /* _ASM_MM_ARCH_HOOKS_H */
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm: make faultaround produce old ptes
  2017-11-28  5:07 [PATCH 1/2] mm: make faultaround produce old ptes Vinayak Menon
  2017-11-28  5:07 ` [PATCH 2/2] arm64: add faultaround mm hook Vinayak Menon
@ 2017-11-28  9:12 ` Jan Kara
  2017-11-29  5:03   ` Vinayak Menon
  2017-11-28 19:45 ` Linus Torvalds
  2 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2017-11-28  9:12 UTC (permalink / raw)
  To: Vinayak Menon
  Cc: linux-arm-kernel, linux-mm, kirill.shutemov, akpm, jack, minchan,
	catalin.marinas, will.deacon, ying.huang, riel, dave.hansen,
	mgorman, torvalds

On Tue 28-11-17 10:37:49, Vinayak Menon wrote:
> Based on Kirill's patch [1].
> 
> Currently, faultaround code produces young pte.  This can screw up
> vmscan behaviour[2], as it makes vmscan think that these pages are hot
> and not push them out on first round.
> 
> During sparse file access faultaround gets more pages mapped and all of
> them are young.  Under memory pressure, this makes vmscan swap out anon
> pages instead, or to drop other page cache pages which otherwise stay
> resident.
> 
> Modify faultaround to produce old ptes, so they can easily be reclaimed
> under memory pressure.
> 
> This can to some extend defeat the purpose of faultaround on machines
> without hardware accessed bit as it will not help us with reducing the
> number of minor page faults.
> 
> Making the faultaround ptes old results in a unixbench regression for some
> architectures [3][4]. But on some architectures it is not found to cause
> any regression. So by default produce young ptes and provide an option for
> architectures to make the ptes old.
> 
> [1] http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com
> [2] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org
> [3] https://marc.info/?l=linux-kernel&m=146582237922378&w=2
> [4] https://marc.info/?l=linux-mm&m=146589376909424&w=2
> 
> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
> ---
>  include/linux/mm-arch-hooks.h | 7 +++++++
>  include/linux/mm.h            | 2 ++
>  mm/filemap.c                  | 4 ++++
>  mm/memory.c                   | 5 +++++
>  4 files changed, 18 insertions(+)
> 
> diff --git a/include/linux/mm-arch-hooks.h b/include/linux/mm-arch-hooks.h
> index 4efc3f56..0322b98 100644
> --- a/include/linux/mm-arch-hooks.h
> +++ b/include/linux/mm-arch-hooks.h
> @@ -22,4 +22,11 @@ static inline void arch_remap(struct mm_struct *mm,
>  #define arch_remap arch_remap
>  #endif
>  
> +#ifndef arch_faultaround_pte_mkold
> +static inline void arch_faultaround_pte_mkold(struct vm_fault *vmf)
> +{
> +}
> +#define arch_faultaround_pte_mkold arch_faultaround_pte_mkold
> +#endif
> +
>  #endif /* _LINUX_MM_ARCH_HOOKS_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7661156..be689a0 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -302,6 +302,7 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
>  #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
>  #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
>  #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
> +#define FAULT_FLAG_MKOLD	0x200	/* Make faultaround ptes old */

Nit: Can we make this FAULT_FLAG_PREFAULT_OLD or something like that so
that it is clear from the flag name that this is about prefaulting of
pages?

>  #define FAULT_FLAG_TRACE \
>  	{ FAULT_FLAG_WRITE,		"WRITE" }, \
> @@ -330,6 +331,7 @@ struct vm_fault {
>  	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
>  	pgoff_t pgoff;			/* Logical page offset based on vma */
>  	unsigned long address;		/* Faulting virtual address */
> +	unsigned long fault_address;    /* Saved faulting virtual address */

Ugh, so I dislike how you hide the decision about whether the *particular*
PTE should be old or young in the arch code. Sure the arch wants to decide
whether the prefaulted PTEs should be old or young and that it has to tell
us but the arch code has no business in checking whether this is prefault
or a normal fault - that decision belongs to filemap_map_pages(). So I'd do
in filemap_map_pages() something like:

	if (iter.index > start_pgoff && arch_wants_old_faultaround_pte())
		vmf->flags |= FAULT_FLAG_PREFAULT_OLD;

And then there's no need for new fault_address in vm_fault or messing with
addresses and fault flags in arch specific code.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm: make faultaround produce old ptes
  2017-11-28  5:07 [PATCH 1/2] mm: make faultaround produce old ptes Vinayak Menon
  2017-11-28  5:07 ` [PATCH 2/2] arm64: add faultaround mm hook Vinayak Menon
  2017-11-28  9:12 ` [PATCH 1/2] mm: make faultaround produce old ptes Jan Kara
@ 2017-11-28 19:45 ` Linus Torvalds
  2017-11-29  6:05   ` Vinayak Menon
  2017-12-05 12:16   ` Catalin Marinas
  2 siblings, 2 replies; 12+ messages in thread
From: Linus Torvalds @ 2017-11-28 19:45 UTC (permalink / raw)
  To: Vinayak Menon
  Cc: linux-arm-kernel, linux-mm, kirill.shutemov, Andrew Morton, jack,
	minchan, catalin.marinas, Will Deacon, ying.huang, riel,
	dave.hansen, mgorman

On Mon, Nov 27, 2017 at 9:07 PM, Vinayak Menon <vinmenon@codeaurora.org> wrote:
>
> Making the faultaround ptes old results in a unixbench regression for some
> architectures [3][4]. But on some architectures it is not found to cause
> any regression. So by default produce young ptes and provide an option for
> architectures to make the ptes old.

Ugh. This hidden random behavior difference annoys me.

It should also be better documented in the code if we end up doing it.

The reason x86 seems to prefer young pte's is simply that a TLB lookup
of an old entry basically causes a micro-fault that then sets the
accessed bit (using a locked cycle) and then a restart.

Those microfaults are not visible to software, but they are pretty
expensive in hardware, probably because they basically serialize
execution as if a real page fault had happened.

HOWEVER - and this is the part that annoys me most about the hidden
behavior - I suspect it ends up being very dependent on
microarchitectural details in addition to the actual load. So it might
be more true on some cores than others, and it might be very
load-dependent. So hiding it as some architectural helper function
really feels wrong to me. It would likely be better off as a real
flag, and then maybe we could make the default behavior be set by
architecture (or even dynamically by the architecture bootup code if
it turns out to be enough of an issue).

And I'm actually somewhat suspicious of your claim that it's not
noticeable on arm64. It's entirely possible that the serialization
cost of the hardware access flag is much lower, but I thought that in
virtualization you actually end up taking a SW fault, which in turn
would be much more expensive. In fact, I don't even find that
"Hardware Accessed" bit in my armv8 docs at all, so I'm guessing it's
new to 8.1? So this is very much not about architectures at all, but
about small details in microarchitectural behavior.

Maybe I'm wrong. Will/Catalin?

               Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm: make faultaround produce old ptes
  2017-11-28  9:12 ` [PATCH 1/2] mm: make faultaround produce old ptes Jan Kara
@ 2017-11-29  5:03   ` Vinayak Menon
  0 siblings, 0 replies; 12+ messages in thread
From: Vinayak Menon @ 2017-11-29  5:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: riel, minchan, catalin.marinas, dave.hansen, will.deacon,
	linux-mm, linux-arm-kernel, ying.huang, akpm, torvalds,
	kirill.shutemov, mgorman

On 11/28/2017 2:42 PM, Jan Kara wrote:
> On Tue 28-11-17 10:37:49, Vinayak Menon wrote:
>> Based on Kirill's patch [1].
>>
>> Currently, faultaround code produces young pte.  This can screw up
>> vmscan behaviour[2], as it makes vmscan think that these pages are hot
>> and not push them out on first round.
>>
>> During sparse file access faultaround gets more pages mapped and all of
>> them are young.  Under memory pressure, this makes vmscan swap out anon
>> pages instead, or to drop other page cache pages which otherwise stay
>> resident.
>>
>> Modify faultaround to produce old ptes, so they can easily be reclaimed
>> under memory pressure.
>>
>> This can to some extend defeat the purpose of faultaround on machines
>> without hardware accessed bit as it will not help us with reducing the
>> number of minor page faults.
>>
>> Making the faultaround ptes old results in a unixbench regression for some
>> architectures [3][4]. But on some architectures it is not found to cause
>> any regression. So by default produce young ptes and provide an option for
>> architectures to make the ptes old.
>>
>> [1] http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com
>> [2] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org
>> [3] https://marc.info/?l=linux-kernel&m=146582237922378&w=2
>> [4] https://marc.info/?l=linux-mm&m=146589376909424&w=2
>>
>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
>> ---
>>  include/linux/mm-arch-hooks.h | 7 +++++++
>>  include/linux/mm.h            | 2 ++
>>  mm/filemap.c                  | 4 ++++
>>  mm/memory.c                   | 5 +++++
>>  4 files changed, 18 insertions(+)
>>
>> diff --git a/include/linux/mm-arch-hooks.h b/include/linux/mm-arch-hooks.h
>> index 4efc3f56..0322b98 100644
>> --- a/include/linux/mm-arch-hooks.h
>> +++ b/include/linux/mm-arch-hooks.h
>> @@ -22,4 +22,11 @@ static inline void arch_remap(struct mm_struct *mm,
>>  #define arch_remap arch_remap
>>  #endif
>>  
>> +#ifndef arch_faultaround_pte_mkold
>> +static inline void arch_faultaround_pte_mkold(struct vm_fault *vmf)
>> +{
>> +}
>> +#define arch_faultaround_pte_mkold arch_faultaround_pte_mkold
>> +#endif
>> +
>>  #endif /* _LINUX_MM_ARCH_HOOKS_H */
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 7661156..be689a0 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -302,6 +302,7 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
>>  #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
>>  #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
>>  #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
>> +#define FAULT_FLAG_MKOLD	0x200	/* Make faultaround ptes old */
> Nit: Can we make this FAULT_FLAG_PREFAULT_OLD or something like that so
> that it is clear from the flag name that this is about prefaulting of
> pages?
Okay, will change the name.

>>  #define FAULT_FLAG_TRACE \
>>  	{ FAULT_FLAG_WRITE,		"WRITE" }, \
>> @@ -330,6 +331,7 @@ struct vm_fault {
>>  	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
>>  	pgoff_t pgoff;			/* Logical page offset based on vma */
>>  	unsigned long address;		/* Faulting virtual address */
>> +	unsigned long fault_address;    /* Saved faulting virtual address */
> Ugh, so I dislike how you hide the decision about whether the *particular*
> PTE should be old or young in the arch code. Sure the arch wants to decide
> whether the prefaulted PTEs should be old or young and that it has to tell
> us but the arch code has no business in checking whether this is prefault
> or a normal fault - that decision belongs to filemap_map_pages(). So I'd do
> in filemap_map_pages() something like:
>
> 	if (iter.index > start_pgoff && arch_wants_old_faultaround_pte())
> 		vmf->flags |= FAULT_FLAG_PREFAULT_OLD;
Okay, I will fix it.

Thanks,
Vinayak

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm: make faultaround produce old ptes
  2017-11-28 19:45 ` Linus Torvalds
@ 2017-11-29  6:05   ` Vinayak Menon
  2017-11-29 10:51     ` Will Deacon
  2017-12-05 12:16   ` Catalin Marinas
  1 sibling, 1 reply; 12+ messages in thread
From: Vinayak Menon @ 2017-11-29  6:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: riel, jack, minchan, catalin.marinas, dave.hansen, Will Deacon,
	linux-mm, linux-arm-kernel, ying.huang, Andrew Morton,
	kirill.shutemov, mgorman

On 11/29/2017 1:15 AM, Linus Torvalds wrote:
> On Mon, Nov 27, 2017 at 9:07 PM, Vinayak Menon <vinmenon@codeaurora.org> wrote:
>> Making the faultaround ptes old results in a unixbench regression for some
>> architectures [3][4]. But on some architectures it is not found to cause
>> any regression. So by default produce young ptes and provide an option for
>> architectures to make the ptes old.
> Ugh. This hidden random behavior difference annoys me.
>
> It should also be better documented in the code if we end up doing it.
Okay.
> The reason x86 seems to prefer young pte's is simply that a TLB lookup
> of an old entry basically causes a micro-fault that then sets the
> accessed bit (using a locked cycle) and then a restart.
>
> Those microfaults are not visible to software, but they are pretty
> expensive in hardware, probably because they basically serialize
> execution as if a real page fault had happened.
>
> HOWEVER - and this is the part that annoys me most about the hidden
> behavior - I suspect it ends up being very dependent on
> microarchitectural details in addition to the actual load. So it might
> be more true on some cores than others, and it might be very
> load-dependent. So hiding it as some architectural helper function
> really feels wrong to me. It would likely be better off as a real
> flag, and then maybe we could make the default behavior be set by
> architecture (or even dynamically by the architecture bootup code if
> it turns out to be enough of an issue).
>
> And I'm actually somewhat suspicious of your claim that it's not
> noticeable on arm64. It's entirely possible that the serialization
> cost of the hardware access flag is much lower, but I thought that in
> virtualization you actually end up taking a SW fault, which in turn
> would be much more expensive. In fact, I don't even find that
> "Hardware Accessed" bit in my armv8 docs at all, so I'm guessing it's
> new to 8.1? So this is very much not about architectures at all, but
> about small details in microarchitectural behavior.
The experiments were done on v8.2 hardware with CONFIG_ARM64_HW_AFDBM enabled.
I have tried with CONFIG_ARM64_HW_AFDBM "disabled", and the unixbench score drops down,
probably due to the SW faults.

Thanks,
Vinayak

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm: make faultaround produce old ptes
  2017-11-29  6:05   ` Vinayak Menon
@ 2017-11-29 10:51     ` Will Deacon
  2017-11-29 11:24       ` Vinayak Menon
  2017-11-29 17:37       ` Linus Torvalds
  0 siblings, 2 replies; 12+ messages in thread
From: Will Deacon @ 2017-11-29 10:51 UTC (permalink / raw)
  To: Vinayak Menon
  Cc: Linus Torvalds, linux-arm-kernel, linux-mm, kirill.shutemov,
	Andrew Morton, jack, minchan, catalin.marinas, ying.huang, riel,
	dave.hansen, mgorman

On Wed, Nov 29, 2017 at 11:35:28AM +0530, Vinayak Menon wrote:
> On 11/29/2017 1:15 AM, Linus Torvalds wrote:
> > On Mon, Nov 27, 2017 at 9:07 PM, Vinayak Menon <vinmenon@codeaurora.org> wrote:
> >> Making the faultaround ptes old results in a unixbench regression for some
> >> architectures [3][4]. But on some architectures it is not found to cause
> >> any regression. So by default produce young ptes and provide an option for
> >> architectures to make the ptes old.
> > Ugh. This hidden random behavior difference annoys me.
> >
> > It should also be better documented in the code if we end up doing it.
> Okay.
> > The reason x86 seems to prefer young pte's is simply that a TLB lookup
> > of an old entry basically causes a micro-fault that then sets the
> > accessed bit (using a locked cycle) and then a restart.
> >
> > Those microfaults are not visible to software, but they are pretty
> > expensive in hardware, probably because they basically serialize
> > execution as if a real page fault had happened.
> >
> > HOWEVER - and this is the part that annoys me most about the hidden
> > behavior - I suspect it ends up being very dependent on
> > microarchitectural details in addition to the actual load. So it might
> > be more true on some cores than others, and it might be very
> > load-dependent. So hiding it as some architectural helper function
> > really feels wrong to me. It would likely be better off as a real
> > flag, and then maybe we could make the default behavior be set by
> > architecture (or even dynamically by the architecture bootup code if
> > it turns out to be enough of an issue).
> >
> > And I'm actually somewhat suspicious of your claim that it's not
> > noticeable on arm64. It's entirely possible that the serialization
> > cost of the hardware access flag is much lower, but I thought that in
> > virtualization you actually end up taking a SW fault, which in turn
> > would be much more expensive. In fact, I don't even find that
> > "Hardware Accessed" bit in my armv8 docs at all, so I'm guessing it's
> > new to 8.1? So this is very much not about architectures at all, but
> > about small details in microarchitectural behavior.
> The experiments were done on v8.2 hardware with CONFIG_ARM64_HW_AFDBM enabled.
> I have tried with CONFIG_ARM64_HW_AFDBM "disabled", and the unixbench score drops down,
> probably due to the SW faults.

Sure, but I think the point is that just because a CPU implements hardware
access/dirty management (DBM -- added in 8.1), it doesn't mean it's going
to be efficient on all implementations, and so having this keyed off the
architecture isn't the right thing to do.

If we had a flag, as suggested, then we could set that by default on CPUs
that implement hardware DBM and clear it on a case-by-case basis if
implementations pop up where it's a performance issue, although I think
it's more likely that setting the dirty bit is the expensive one since
it's not allowed to be performed speculatively.

Linus -- if you want the latest architecture document, it's now available
here without a click-through:

https://developer.arm.com/products/architecture/a-profile/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile

p2109 has stuff about DBM. It is also available at Stage-2, but nobody's
done the KVM work yet by the looks of it.

Cheers,

Will

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm: make faultaround produce old ptes
  2017-11-29 10:51     ` Will Deacon
@ 2017-11-29 11:24       ` Vinayak Menon
  2017-11-29 17:37       ` Linus Torvalds
  1 sibling, 0 replies; 12+ messages in thread
From: Vinayak Menon @ 2017-11-29 11:24 UTC (permalink / raw)
  To: Will Deacon
  Cc: riel, jack, minchan, catalin.marinas, dave.hansen, linux-mm,
	linux-arm-kernel, ying.huang, Andrew Morton, Linus Torvalds,
	kirill.shutemov, mgorman



On 11/29/2017 4:21 PM, Will Deacon wrote:
> On Wed, Nov 29, 2017 at 11:35:28AM +0530, Vinayak Menon wrote:
>> On 11/29/2017 1:15 AM, Linus Torvalds wrote:
>>> On Mon, Nov 27, 2017 at 9:07 PM, Vinayak Menon <vinmenon@codeaurora.org> wrote:
>>>> Making the faultaround ptes old results in a unixbench regression for some
>>>> architectures [3][4]. But on some architectures it is not found to cause
>>>> any regression. So by default produce young ptes and provide an option for
>>>> architectures to make the ptes old.
>>> Ugh. This hidden random behavior difference annoys me.
>>>
>>> It should also be better documented in the code if we end up doing it.
>> Okay.
>>> The reason x86 seems to prefer young pte's is simply that a TLB lookup
>>> of an old entry basically causes a micro-fault that then sets the
>>> accessed bit (using a locked cycle) and then a restart.
>>>
>>> Those microfaults are not visible to software, but they are pretty
>>> expensive in hardware, probably because they basically serialize
>>> execution as if a real page fault had happened.
>>>
>>> HOWEVER - and this is the part that annoys me most about the hidden
>>> behavior - I suspect it ends up being very dependent on
>>> microarchitectural details in addition to the actual load. So it might
>>> be more true on some cores than others, and it might be very
>>> load-dependent. So hiding it as some architectural helper function
>>> really feels wrong to me. It would likely be better off as a real
>>> flag, and then maybe we could make the default behavior be set by
>>> architecture (or even dynamically by the architecture bootup code if
>>> it turns out to be enough of an issue).
>>>
>>> And I'm actually somewhat suspicious of your claim that it's not
>>> noticeable on arm64. It's entirely possible that the serialization
>>> cost of the hardware access flag is much lower, but I thought that in
>>> virtualization you actually end up taking a SW fault, which in turn
>>> would be much more expensive. In fact, I don't even find that
>>> "Hardware Accessed" bit in my armv8 docs at all, so I'm guessing it's
>>> new to 8.1? So this is very much not about architectures at all, but
>>> about small details in microarchitectural behavior.
>> The experiments were done on v8.2 hardware with CONFIG_ARM64_HW_AFDBM enabled.
>> I have tried with CONFIG_ARM64_HW_AFDBM "disabled", and the unixbench score drops down,
>> probably due to the SW faults.
> Sure, but I think the point is that just because a CPU implements hardware
> access/dirty management (DBM -- added in 8.1), it doesn't mean it's going
> to be efficient on all implementations, and so having this keyed off the
> architecture isn't the right thing to do.
>
> If we had a flag, as suggested, then we could set that by default on CPUs
> that implement hardware DBM and clear it on a case-by-case basis if
> implementations pop up where it's a performance issue, although I think
> it's more likely that setting the dirty bit is the expensive one since
> it's not allowed to be performed speculatively.

Yes, I agree that a flag will be better. I will send a v2 with these changes.

Thanks,
Vinayak

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm: make faultaround produce old ptes
  2017-11-29 10:51     ` Will Deacon
  2017-11-29 11:24       ` Vinayak Menon
@ 2017-11-29 17:37       ` Linus Torvalds
  1 sibling, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2017-11-29 17:37 UTC (permalink / raw)
  To: Will Deacon
  Cc: Vinayak Menon, linux-arm-kernel, linux-mm, Kirill A. Shutemov,
	Andrew Morton, Jan Kara, Minchan Kim, Catalin Marinas,
	Huang Ying, Rik van Riel, Dave Hansen, Mel Gorman

On Wed, Nov 29, 2017 at 2:51 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> Linus -- if you want the latest architecture document, it's now available
> here without a click-through:

Thanks. I was sure there was something newer available than the ARMv8
pdf I had, but my google-fu failed miserably.

                 Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm: make faultaround produce old ptes
  2017-11-28 19:45 ` Linus Torvalds
  2017-11-29  6:05   ` Vinayak Menon
@ 2017-12-05 12:16   ` Catalin Marinas
  2017-12-05 16:39     ` Linus Torvalds
  2017-12-11  7:01     ` Vinayak Menon
  1 sibling, 2 replies; 12+ messages in thread
From: Catalin Marinas @ 2017-12-05 12:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vinayak Menon, riel, jack, minchan, dave.hansen, Will Deacon,
	linux-mm, linux-arm-kernel, ying.huang, Andrew Morton,
	kirill.shutemov, mgorman

On Tue, Nov 28, 2017 at 11:45:27AM -0800, Linus Torvalds wrote:
> On Mon, Nov 27, 2017 at 9:07 PM, Vinayak Menon <vinmenon@codeaurora.org> wrote:
> > Making the faultaround ptes old results in a unixbench regression for some
> > architectures [3][4]. But on some architectures it is not found to cause
> > any regression. So by default produce young ptes and provide an option for
> > architectures to make the ptes old.
> 
> Ugh. This hidden random behavior difference annoys me.
> 
> It should also be better documented in the code if we end up doing it.
> 
> The reason x86 seems to prefer young pte's is simply that a TLB lookup
> of an old entry basically causes a micro-fault that then sets the
> accessed bit (using a locked cycle) and then a restart.
> 
> Those microfaults are not visible to software, but they are pretty
> expensive in hardware, probably because they basically serialize
> execution as if a real page fault had happened.

In principle it's not that different for ARMv8.1+ but it highly depends
on the microarchitecture details (and we have a lot of variation on
ARM). From a programmer's perspective, old ptes (access flag cleared)
are not allowed to be cached in the TLB, otherwise ptep_clear_flush()
would break. Marking fault-around ptes as young allows the hardware to
speculatively populate the TLB but, again, it's highly microarchitecture
specific and I'm not sure we have a general answer covering the ARM
architecture. Of course, faulting on old ptes is much slower without
hardware AF.

> HOWEVER - and this is the part that annoys me most about the hidden
> behavior - I suspect it ends up being very dependent on
> microarchitectural details in addition to the actual load. So it might
> be more true on some cores than others, and it might be very
> load-dependent. So hiding it as some architectural helper function
> really feels wrong to me. It would likely be better off as a real
> flag, and then maybe we could make the default behavior be set by
> architecture (or even dynamically by the architecture bootup code if
> it turns out to be enough of an issue).

It looks to me like we are trying to work around a vmscan behaviour
visible under memory pressure [1]. The original report doesn't state
whether hardware AF is available (it seems to be tested on a 3.18
Android kernel; hardware AF on arm64 went in 4.6).

In this case there is a trade-off between swapping out potentially hot
pages vs page table walk (either in hardware or via software fault) for
fault-around ptes. This trade-off further depends on whether the
architecture can do hardware access flag or not.

I would be more in favour of some heuristics to dynamically reduce the
fault-around bytes based on the memory pressure rather than choosing
between young or old ptes. Or, if we are to go with old vs young ptes,
make this choice dependent on the memory pressure regardless of whether
the CPU supports hardware accessed bit.

[1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org

-- 
Catalin

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm: make faultaround produce old ptes
  2017-12-05 12:16   ` Catalin Marinas
@ 2017-12-05 16:39     ` Linus Torvalds
  2017-12-11  7:01     ` Vinayak Menon
  1 sibling, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2017-12-05 16:39 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Vinayak Menon, Rik van Riel, Jan Kara, Minchan Kim, Dave Hansen,
	Will Deacon, linux-mm, linux-arm-kernel, Huang Ying,
	Andrew Morton, Kirill A. Shutemov, Mel Gorman

On Tue, Dec 5, 2017 at 4:16 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> I would be more in favour of some heuristics to dynamically reduce the
> fault-around bytes based on the memory pressure rather than choosing
> between young or old ptes. Or, if we are to go with old vs young ptes,
> make this choice dependent on the memory pressure regardless of whether
> the CPU supports hardware accessed bit.

That sounds like a good idea, but possibly a bit _too_ smart for
something that likely isn't a big deal.

The current behavior definitely is based on a "swapping is not a big
deal" mindset, and that getting the best LRU isn't worth it. That's
probably true in most circumstances, but if you really do have low
memory, and you really do have fairly random access behavior that
where the actual working set size is close to the actual memory size,
then a "get rid of faultaround pages earlier" mode would be a good
thing.

So I'm not at all against your idea - it sounds like the
RightThing(tm) to do - I just wonder how painful it is to generate a
sane heuristic that actually works in practice..

                  Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] mm: make faultaround produce old ptes
  2017-12-05 12:16   ` Catalin Marinas
  2017-12-05 16:39     ` Linus Torvalds
@ 2017-12-11  7:01     ` Vinayak Menon
  1 sibling, 0 replies; 12+ messages in thread
From: Vinayak Menon @ 2017-12-11  7:01 UTC (permalink / raw)
  To: Catalin Marinas, Linus Torvalds
  Cc: riel, jack, minchan, dave.hansen, Will Deacon, linux-mm,
	linux-arm-kernel, ying.huang, Andrew Morton, kirill.shutemov,
	mgorman


On 12/5/2017 5:46 PM, Catalin Marinas wrote:
> On Tue, Nov 28, 2017 at 11:45:27AM -0800, Linus Torvalds wrote:
>> On Mon, Nov 27, 2017 at 9:07 PM, Vinayak Menon <vinmenon@codeaurora.org> wrote:
>>> Making the faultaround ptes old results in a unixbench regression for some
>>> architectures [3][4]. But on some architectures it is not found to cause
>>> any regression. So by default produce young ptes and provide an option for
>>> architectures to make the ptes old.
>> Ugh. This hidden random behavior difference annoys me.
>>
>> It should also be better documented in the code if we end up doing it.
>>
>> The reason x86 seems to prefer young pte's is simply that a TLB lookup
>> of an old entry basically causes a micro-fault that then sets the
>> accessed bit (using a locked cycle) and then a restart.
>>
>> Those microfaults are not visible to software, but they are pretty
>> expensive in hardware, probably because they basically serialize
>> execution as if a real page fault had happened.
> In principle it's not that different for ARMv8.1+ but it highly depends
> on the microarchitecture details (and we have a lot of variation on
> ARM). From a programmer's perspective, old ptes (access flag cleared)
> are not allowed to be cached in the TLB, otherwise ptep_clear_flush()
> would break. Marking fault-around ptes as young allows the hardware to
> speculatively populate the TLB but, again, it's highly microarchitecture
> specific and I'm not sure we have a general answer covering the ARM
> architecture. Of course, faulting on old ptes is much slower without
> hardware AF.
>
>> HOWEVER - and this is the part that annoys me most about the hidden
>> behavior - I suspect it ends up being very dependent on
>> microarchitectural details in addition to the actual load. So it might
>> be more true on some cores than others, and it might be very
>> load-dependent. So hiding it as some architectural helper function
>> really feels wrong to me. It would likely be better off as a real
>> flag, and then maybe we could make the default behavior be set by
>> architecture (or even dynamically by the architecture bootup code if
>> it turns out to be enough of an issue).
> It looks to me like we are trying to work around a vmscan behaviour
> visible under memory pressure [1]. The original report doesn't state
> whether hardware AF is available (it seems to be tested on a 3.18
> Android kernel; hardware AF on arm64 went in 4.6).
Sorry for the delayed response.
The original report was based on a target without HW AF support.
The issue can be seen even on a 8GB machine with multibuild test. Yes, I agree that the problem
is triggered with reclaim, but looks like it doesn't need a very low memory condition for the issue
to be visible.
IIUC, vmscan is doing the right thing in trying to keep the hot pages. But by making the faultaround
pages young, we are saying that they are referenced which may not be true. When the speculation
that faultaround pages are hot goes wrong, vmscan ends up evicting more hot anon and file pages.
Multibuild and apps launch test on android showing a regression I think means that this speculation
is going wrong often.
Even on non-HW-AF targets, where making ptes old can result in more minor faults, performance is
better with faultaorund disabled (or making ptes old), with the above mentioned tests.
I think making faultaround pages young was not intentionally done by the original faultaround code [1],
and the original idea was to gain benefit of reduced minor faults.

> In this case there is a trade-off between swapping out potentially hot
> pages vs page table walk (either in hardware or via software fault) for
> fault-around ptes. This trade-off further depends on whether the
> architecture can do hardware access flag or not.
>
> I would be more in favour of some heuristics to dynamically reduce the
> fault-around bytes based on the memory pressure rather than choosing
> between young or old ptes. 

Even with minimal or moderateA  memory pressure, if the speculation that faultaround pages are hot is
wrong, it can result in wrong evictions. I am not sure, but it could be similar reason why we skip VM_SEQ_READ
pages in page_referenced_one.
Since the arm64 tests doesn't show any visible impact of page table walk, there doesn't seem to be a case of
trade-off. I agree that this may not be a case for all arm64 cores, but it would be advantageous to make
faultaround ptes old on targets which doesn't show page table access issues. And shouldn't faultaround ptes
logically be old until they are really accessed ?
Also, even during memory pressure, it can be beneficial to have faultaround pages which reduces the page
faults (even if a quarter of the pages are actually accessed), but the issue is only that they are made young
which confuses vmscan. So I think faultaround bytes may not be related to memory pressure.

> Or, if we are to go with old vs young ptes,
> make this choice dependent on the memory pressure regardless of whether
> the CPU supports hardware accessed bit.

Will it be okay then to make this flag part of /proc/sys/vm and let the vmpressure clients take care of
setting it based on the vmpressure values ? Those who always want faultaround ptes to be old can do
that too with this interface.

[1] https://lkml.org/lkml/2016/4/22/568

Thanks,
Vinayak

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-12-11  7:01 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-28  5:07 [PATCH 1/2] mm: make faultaround produce old ptes Vinayak Menon
2017-11-28  5:07 ` [PATCH 2/2] arm64: add faultaround mm hook Vinayak Menon
2017-11-28  9:12 ` [PATCH 1/2] mm: make faultaround produce old ptes Jan Kara
2017-11-29  5:03   ` Vinayak Menon
2017-11-28 19:45 ` Linus Torvalds
2017-11-29  6:05   ` Vinayak Menon
2017-11-29 10:51     ` Will Deacon
2017-11-29 11:24       ` Vinayak Menon
2017-11-29 17:37       ` Linus Torvalds
2017-12-05 12:16   ` Catalin Marinas
2017-12-05 16:39     ` Linus Torvalds
2017-12-11  7:01     ` Vinayak Menon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).