[RFC 0/4] mm: Introduce lazy exec permission setting on a page

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC 0/4] mm: Introduce lazy exec permission setting on a page
@ 2019-02-13  8:06 Anshuman Khandual
  2019-02-13  8:06 ` [RFC 1/4] " Anshuman Khandual
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-02-13  8:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: mhocko, kirill, kirill.shutemov, vbabka, will.deacon,
	catalin.marinas, dave.hansen

Setting an exec permission on a page normally triggers I-cache invalidation
which might be expensive. I-cache invalidation is not mandatory on a given
page if there is no immediate exec access on it. Non-fault modification of
user page table from generic memory paths like migration can be improved if
setting of the exec permission on the page can be deferred till actual use.
There was a performance report [1] which highlighted the problem.

This introduces [pte|pmd]_mklazyexec() which clears the exec permission on
a page during migration. This exec permission deferral must be enabled back
with maybe_[pmd]_mkexec() during exec page fault (FAULT_FLAG_INSTRUCTION)
if the corresponding VMA contains exec flag (VM_EXEC).

This framework is encapsulated under CONFIG_ARCH_SUPPORTS_LAZY_EXEC so that
non-subscribing architectures don't take any performance hit. For now only
generic memory migration path will be using this framework but later it can
be extended to other generic memory paths as well.

This enables CONFIG_ARCH_SUPPORTS_LAZY_EXEC on arm64 and defines required
helper functions in this regard while changing ptep_set_access_flags() to
allow non-exec to exec transition.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html

Anshuman Khandual (4):
  mm: Introduce lazy exec permission setting on a page
  arm64/mm: Identify user level instruction faults
  arm64/mm: Allow non-exec to exec transition in ptep_set_access_flags()
  arm64/mm: Enable ARCH_SUPPORTS_LAZY_EXEC

 arch/arm64/Kconfig               |  1 +
 arch/arm64/include/asm/pgtable.h | 17 +++++++++++++++++
 arch/arm64/mm/fault.c            | 22 ++++++++++++++--------
 include/asm-generic/pgtable.h    | 12 ++++++++++++
 include/linux/mm.h               | 26 ++++++++++++++++++++++++++
 mm/Kconfig                       |  9 +++++++++
 mm/huge_memory.c                 |  5 +++++
 mm/hugetlb.c                     |  2 ++
 mm/memory.c                      |  4 ++++
 mm/migrate.c                     |  2 ++
 10 files changed, 92 insertions(+), 8 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC 1/4] mm: Introduce lazy exec permission setting on a page
  2019-02-13  8:06 [RFC 0/4] mm: Introduce lazy exec permission setting on a page Anshuman Khandual
@ 2019-02-13  8:06 ` Anshuman Khandual
  2019-02-13 13:17   ` Matthew Wilcox
  2019-02-13  8:06 ` [RFC 2/4] arm64/mm: Identify user level instruction faults Anshuman Khandual
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Anshuman Khandual @ 2019-02-13  8:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: mhocko, kirill, kirill.shutemov, vbabka, will.deacon,
	catalin.marinas, dave.hansen

Setting an exec permission on a page normally triggers I-cache invalidation
which might be expensive. I-cache invalidation is not mandatory on a given
page if there is no immediate exec access on it. Non-fault modification of
user page table from generic memory paths like migration can be improved if
setting of the exec permission on the page can be deferred till actual use.

This introduces [pte|pmd]_mklazyexec() which clears the exec permission on
a page during migration. This exec permission deferral must be enabled back
with maybe_[pmd]_mkexec() during exec page fault (FAULT_FLAG_INSTRUCTION)
if the corresponding VMA contains exec flag (VM_EXEC).

This framework is encapsulated under CONFIG_ARCH_SUPPORTS_LAZY_EXEC so that
non-subscribing architectures don't take any performance hit. For now only
generic memory migration path will be using this framework but later it can
be extended to other generic memory paths as well.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 include/asm-generic/pgtable.h | 12 ++++++++++++
 include/linux/mm.h            | 26 ++++++++++++++++++++++++++
 mm/Kconfig                    |  9 +++++++++
 mm/huge_memory.c              |  5 +++++
 mm/hugetlb.c                  |  2 ++
 mm/memory.c                   |  4 ++++
 mm/migrate.c                  |  2 ++
 7 files changed, 60 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 05e61e6..d35d129 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -26,6 +26,18 @@
 #define USER_PGTABLES_CEILING	0UL
 #endif
 
+#ifndef CONFIG_ARCH_SUPPORTS_LAZY_EXEC
+static inline pte_t pte_mklazyexec(pte_t entry)
+{
+	return entry;
+}
+
+static inline pmd_t pmd_mklazyexec(pmd_t entry)
+{
+	return entry;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 extern int ptep_set_access_flags(struct vm_area_struct *vma,
 				 unsigned long address, pte_t *ptep,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb640..04d7a0a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -755,6 +755,32 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 	return pte;
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_LAZY_EXEC
+static inline pte_t maybe_mkexec(pte_t entry, struct vm_area_struct *vma)
+{
+	if (unlikely(vma->vm_flags & VM_EXEC))
+		return pte_mkexec(entry);
+	return entry;
+}
+
+static inline pmd_t maybe_pmd_mkexec(pmd_t entry, struct vm_area_struct *vma)
+{
+	if (unlikely(vma->vm_flags & VM_EXEC))
+		return pmd_mkexec(entry);
+	return entry;
+}
+#else
+static inline pte_t maybe_mkexec(pte_t entry, struct vm_area_struct *vma)
+{
+	return entry;
+}
+
+static inline pmd_t maybe_pmd_mkexec(pmd_t entry, struct vm_area_struct *vma)
+{
+	return entry;
+}
+#endif
+
 vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 		struct page *page);
 vm_fault_t finish_fault(struct vm_fault *vmf);
diff --git a/mm/Kconfig b/mm/Kconfig
index 25c71eb..5c046cb 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -322,6 +322,15 @@ config DEFAULT_MMAP_MIN_ADDR
 	  This value can be changed after boot using the
 	  /proc/sys/vm/mmap_min_addr tunable.
 
+config ARCH_SUPPORTS_LAZY_EXEC
+	bool "Architecture supports deferred exec permission setting"
+	help
+	  Some architectures can improve performance during non-fault page
+	  table modifications paths with deferred exec permission setting
+	  which helps in avoiding expensive I-cache invalidations. This
+	  requires arch implementation of ptep_set_access_flags() to allow
+	  non-exec to exec transition.
+
 config ARCH_SUPPORTS_MEMORY_FAILURE
 	bool
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index faf357e..9ef7662 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1126,6 +1126,8 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
 	if (write)
 		entry = pmd_mkdirty(entry);
 	haddr = vmf->address & HPAGE_PMD_MASK;
+	if (vmf->flags & FAULT_FLAG_INSTRUCTION)
+		entry = maybe_pmd_mkexec(entry, vmf->vma);
 	if (pmdp_set_access_flags(vmf->vma, haddr, vmf->pmd, entry, write))
 		update_mmu_cache_pmd(vmf->vma, vmf->address, vmf->pmd);
 
@@ -1290,6 +1292,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		if (vmf->flags & FAULT_FLAG_INSTRUCTION)
+			entry = maybe_pmd_mkexec(entry, vma);
 		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry,  1))
 			update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		ret |= VM_FAULT_WRITE;
@@ -2944,6 +2948,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 		pmde = pmd_mksoft_dirty(pmde);
 	if (is_write_migration_entry(entry))
 		pmde = maybe_pmd_mkwrite(pmde, vma);
+	pmde = pmd_mklazyexec(pmde);
 
 	flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
 	if (PageAnon(new))
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index afef616..ea41832 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4018,6 +4018,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		entry = huge_pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
+	if (flags & FAULT_FLAG_INSTRUCTION)
+		entry = maybe_mkexec(entry, vma);
 	if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
 						flags & FAULT_FLAG_WRITE))
 		update_mmu_cache(vma, haddr, ptep);
diff --git a/mm/memory.c b/mm/memory.c
index e11ca9d..74c406b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2218,6 +2218,8 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 	entry = pte_mkyoung(vmf->orig_pte);
 	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	if (vmf->flags & FAULT_FLAG_INSTRUCTION)
+		entry = maybe_mkexec(entry, vma);
 	if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
 		update_mmu_cache(vma, vmf->address, vmf->pte);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -3804,6 +3806,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
+	if (vmf->flags & FAULT_FLAG_INSTRUCTION)
+		entry = maybe_mkexec(entry, vmf->vma);
 	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
 				vmf->flags & FAULT_FLAG_WRITE)) {
 		update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
diff --git a/mm/migrate.c b/mm/migrate.c
index d4fd680..7587717 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -257,6 +257,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		if (PageHuge(new)) {
 			pte = pte_mkhuge(pte);
 			pte = arch_make_huge_pte(pte, vma, new, 0);
+			pte = pte_mklazyexec(pte);
 			set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 			if (PageAnon(new))
 				hugepage_add_anon_rmap(new, vma, pvmw.address);
@@ -265,6 +266,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		} else
 #endif
 		{
+			pte = pte_mklazyexec(pte);
 			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 
 			if (PageAnon(new))
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 2/4] arm64/mm: Identify user level instruction faults
  2019-02-13  8:06 [RFC 0/4] mm: Introduce lazy exec permission setting on a page Anshuman Khandual
  2019-02-13  8:06 ` [RFC 1/4] " Anshuman Khandual
@ 2019-02-13  8:06 ` Anshuman Khandual
  2019-02-13  8:06 ` [RFC 3/4] arm64/mm: Allow non-exec to exec transition in ptep_set_access_flags() Anshuman Khandual
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-02-13  8:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: mhocko, kirill, kirill.shutemov, vbabka, will.deacon,
	catalin.marinas, dave.hansen

Page fault flags (FAULT_FLAG_XXX) need to be passed down fault handling
path for appropriate action and reporting. Identify user instruction
fetch faults and mark them with FAULT_FLAG_INSTRUCTION.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/mm/fault.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index efb7b2c..591670d 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -468,6 +468,9 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 		mm_flags |= FAULT_FLAG_WRITE;
 	}
 
+	if (is_el0_instruction_abort(esr))
+		mm_flags |= FAULT_FLAG_INSTRUCTION;
+
 	if (is_ttbr0_addr(addr) && is_el1_permission_fault(addr, esr, regs)) {
 		/* regs->orig_addr_limit may be 0 if we entered from EL0 */
 		if (regs->orig_addr_limit == KERNEL_DS)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 3/4] arm64/mm: Allow non-exec to exec transition in ptep_set_access_flags()
  2019-02-13  8:06 [RFC 0/4] mm: Introduce lazy exec permission setting on a page Anshuman Khandual
  2019-02-13  8:06 ` [RFC 1/4] " Anshuman Khandual
  2019-02-13  8:06 ` [RFC 2/4] arm64/mm: Identify user level instruction faults Anshuman Khandual
@ 2019-02-13  8:06 ` Anshuman Khandual
  2019-02-13  8:06 ` [RFC 4/4] arm64/mm: Enable ARCH_SUPPORTS_LAZY_EXEC Anshuman Khandual
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-02-13  8:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: mhocko, kirill, kirill.shutemov, vbabka, will.deacon,
	catalin.marinas, dave.hansen

ptep_set_access_flags() updates page table for a mapped page entry which
still got a fault probably because of a different permission than what
it is mapped with. Previously an exec enabled page always gets required
permission in the page table entry. Hence ptep_set_access_flags() never
had to move an entry from non-exec to exec. This is going to change with
deferred exec permission setting with later patches. Hence allow non-exec
to exec transition here and do the required I-cache invalidation.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/mm/fault.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 591670d..1540fc1 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -227,22 +227,25 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (pte_same(pte, entry))
 		return 0;
 
-	/* only preserve the access flags and write permission */
-	pte_val(entry) &= PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
+	/* only preserve the access flags, write and exec permission */
+	pte_val(entry) &= PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY | PTE_UXN;
+
+	if (pte_user_exec(entry))
+		__sync_icache_dcache(pte);
 
 	/*
 	 * Setting the flags must be done atomically to avoid racing with the
-	 * hardware update of the access/dirty state. The PTE_RDONLY bit must
-	 * be set to the most permissive (lowest value) of *ptep and entry
-	 * (calculated as: a & b == ~(~a | ~b)).
+	 * hardware update of the access/dirty state. The PTE_RDONLY bit and
+	 * PTE_UXN must be set to the most permissive (lowest value) of *ptep
+	 * and entry (calculated as: a & b == ~(~a | ~b)).
 	 */
-	pte_val(entry) ^= PTE_RDONLY;
+	pte_val(entry) ^= PTE_RDONLY | PTE_UXN;
 	pteval = pte_val(pte);
 	do {
 		old_pteval = pteval;
-		pteval ^= PTE_RDONLY;
+		pteval ^= PTE_RDONLY | PTE_UXN;
 		pteval |= pte_val(entry);
-		pteval ^= PTE_RDONLY;
+		pteval ^= PTE_RDONLY | PTE_UXN;
 		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
 	} while (pteval != old_pteval);
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC 4/4] arm64/mm: Enable ARCH_SUPPORTS_LAZY_EXEC
  2019-02-13  8:06 [RFC 0/4] mm: Introduce lazy exec permission setting on a page Anshuman Khandual
                   ` (2 preceding siblings ...)
  2019-02-13  8:06 ` [RFC 3/4] arm64/mm: Allow non-exec to exec transition in ptep_set_access_flags() Anshuman Khandual
@ 2019-02-13  8:06 ` Anshuman Khandual
  2019-02-13 11:21 ` [RFC 0/4] mm: Introduce lazy exec permission setting on a page Catalin Marinas
  2019-02-13 15:44 ` Dave Hansen
  5 siblings, 0 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-02-13  8:06 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: mhocko, kirill, kirill.shutemov, vbabka, will.deacon,
	catalin.marinas, dave.hansen

Make arm64 subscribe to ARCH_SUPPORTS_LAZY_EXEC framework and provided all
required helpers for this purpose. This moves away execution cost from the
migration path to exec fault path as expected.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/Kconfig               |  1 +
 arch/arm64/include/asm/pgtable.h | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a4168d3..3cdb3e4 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -59,6 +59,7 @@ config ARM64
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
+	select ARCH_SUPPORTS_LAZY_EXEC
 	select ARCH_SUPPORTS_MEMORY_FAILURE
 	select ARCH_SUPPORTS_ATOMIC_RMW
 	select ARCH_SUPPORTS_INT128 if GCC_VERSION >= 50000 || CC_IS_CLANG
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index de70c1e..f2a5716 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -217,6 +217,18 @@ static inline pmd_t pmd_mkcont(pmd_t pmd)
 	return __pmd(pmd_val(pmd) | PMD_SECT_CONT);
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_LAZY_EXEC
+static inline pte_t pte_mkexec(pte_t pte)
+{
+	return clear_pte_bit(pte, __pgprot(PTE_UXN));
+}
+
+static inline pte_t pte_mklazyexec(pte_t pte)
+{
+	return set_pte_bit(pte, __pgprot(PTE_UXN));
+}
+#endif
+
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
 	WRITE_ONCE(*ptep, pte);
@@ -355,6 +367,11 @@ static inline int pmd_protnone(pmd_t pmd)
 }
 #endif
 
+#ifdef CONFIG_ARCH_SUPPORTS_LAZY_EXEC
+#define pmd_mkexec(pmd)		pte_pmd(pte_mkexec(pmd_pte(pmd)))
+#define pmd_mklazyexec(pmd)	pte_pmd(pte_mklazyexec(pmd_pte(pmd)))
+#endif
+
 /*
  * THP definitions.
  */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-13  8:06 [RFC 0/4] mm: Introduce lazy exec permission setting on a page Anshuman Khandual
                   ` (3 preceding siblings ...)
  2019-02-13  8:06 ` [RFC 4/4] arm64/mm: Enable ARCH_SUPPORTS_LAZY_EXEC Anshuman Khandual
@ 2019-02-13 11:21 ` Catalin Marinas
  2019-02-13 15:38   ` Michal Hocko
  2019-02-13 15:44 ` Dave Hansen
  5 siblings, 1 reply; 28+ messages in thread
From: Catalin Marinas @ 2019-02-13 11:21 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, akpm, mhocko, kirill, kirill.shutemov, vbabka,
	will.deacon, dave.hansen

On Wed, Feb 13, 2019 at 01:36:27PM +0530, Anshuman Khandual wrote:
> Setting an exec permission on a page normally triggers I-cache invalidation
> which might be expensive. I-cache invalidation is not mandatory on a given
> page if there is no immediate exec access on it. Non-fault modification of
> user page table from generic memory paths like migration can be improved if
> setting of the exec permission on the page can be deferred till actual use.
> There was a performance report [1] which highlighted the problem.
[...]
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html

FTR, this performance regression has been addressed by commit
132fdc379eb1 ("arm64: Do not issue IPIs for user executable ptes"). That
said, I still think this patch series is valuable for further optimising
the page migration path on arm64 (and can be extended to other
architectures that currently require I/D cache maintenance for
executable pages).

BTW, if you are going to post new versions of this series, please
include linux-arch and linux-arm-kernel.

-- 
Catalin


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 1/4] mm: Introduce lazy exec permission setting on a page
  2019-02-13  8:06 ` [RFC 1/4] " Anshuman Khandual
@ 2019-02-13 13:17   ` Matthew Wilcox
  2019-02-13 13:53     ` Anshuman Khandual
  0 siblings, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2019-02-13 13:17 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, akpm, mhocko, kirill, kirill.shutemov, vbabka,
	will.deacon, catalin.marinas, dave.hansen

On Wed, Feb 13, 2019 at 01:36:28PM +0530, Anshuman Khandual wrote:
> +#ifdef CONFIG_ARCH_SUPPORTS_LAZY_EXEC
> +static inline pte_t maybe_mkexec(pte_t entry, struct vm_area_struct *vma)
> +{
> +	if (unlikely(vma->vm_flags & VM_EXEC))
> +		return pte_mkexec(entry);
> +	return entry;
> +}
> +#else
> +static inline pte_t maybe_mkexec(pte_t entry, struct vm_area_struct *vma)
> +{
> +	return entry;
> +}
> +#endif

> +++ b/mm/memory.c
> @@ -2218,6 +2218,8 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
>  	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>  	entry = pte_mkyoung(vmf->orig_pte);
>  	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +	if (vmf->flags & FAULT_FLAG_INSTRUCTION)
> +		entry = maybe_mkexec(entry, vma);

I don't understand this bit.  We have a fault based on an instruction
fetch.  But we're only going to _maybe_ set the exec bit?  Why not call
pte_mkexec() unconditionally?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 1/4] mm: Introduce lazy exec permission setting on a page
  2019-02-13 13:17   ` Matthew Wilcox
@ 2019-02-13 13:53     ` Anshuman Khandual
  2019-02-14  9:06       ` Mike Rapoport
  0 siblings, 1 reply; 28+ messages in thread
From: Anshuman Khandual @ 2019-02-13 13:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, akpm, mhocko, kirill, kirill.shutemov, vbabka,
	will.deacon, catalin.marinas, dave.hansen



On 02/13/2019 06:47 PM, Matthew Wilcox wrote:
> On Wed, Feb 13, 2019 at 01:36:28PM +0530, Anshuman Khandual wrote:
>> +#ifdef CONFIG_ARCH_SUPPORTS_LAZY_EXEC
>> +static inline pte_t maybe_mkexec(pte_t entry, struct vm_area_struct *vma)
>> +{
>> +	if (unlikely(vma->vm_flags & VM_EXEC))
>> +		return pte_mkexec(entry);
>> +	return entry;
>> +}
>> +#else
>> +static inline pte_t maybe_mkexec(pte_t entry, struct vm_area_struct *vma)
>> +{
>> +	return entry;
>> +}
>> +#endif
> 
>> +++ b/mm/memory.c
>> @@ -2218,6 +2218,8 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
>>  	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>>  	entry = pte_mkyoung(vmf->orig_pte);
>>  	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>> +	if (vmf->flags & FAULT_FLAG_INSTRUCTION)
>> +		entry = maybe_mkexec(entry, vma);
> 
> I don't understand this bit.  We have a fault based on an instruction
> fetch.  But we're only going to _maybe_ set the exec bit?  Why not call
> pte_mkexec() unconditionally?

Because the arch might not have subscribed to this in which case the fall
back function does nothing and return the same entry. But in case this is
enabled it also checks for VMA exec flag (VM_EXEC) before calling into
pte_mkexec() something similar to existing maybe_mkwrite().


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-13 11:21 ` [RFC 0/4] mm: Introduce lazy exec permission setting on a page Catalin Marinas
@ 2019-02-13 15:38   ` Michal Hocko
  2019-02-14  6:04     ` Anshuman Khandual
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-02-13 15:38 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Anshuman Khandual, linux-mm, akpm, kirill, kirill.shutemov,
	vbabka, will.deacon, dave.hansen

On Wed 13-02-19 11:21:36, Catalin Marinas wrote:
> On Wed, Feb 13, 2019 at 01:36:27PM +0530, Anshuman Khandual wrote:
> > Setting an exec permission on a page normally triggers I-cache invalidation
> > which might be expensive. I-cache invalidation is not mandatory on a given
> > page if there is no immediate exec access on it. Non-fault modification of
> > user page table from generic memory paths like migration can be improved if
> > setting of the exec permission on the page can be deferred till actual use.
> > There was a performance report [1] which highlighted the problem.
> [...]
> > [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html
> 
> FTR, this performance regression has been addressed by commit
> 132fdc379eb1 ("arm64: Do not issue IPIs for user executable ptes"). That
> said, I still think this patch series is valuable for further optimising
> the page migration path on arm64 (and can be extended to other
> architectures that currently require I/D cache maintenance for
> executable pages).

Are there any numbers to show the optimization impact?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-13  8:06 [RFC 0/4] mm: Introduce lazy exec permission setting on a page Anshuman Khandual
                   ` (4 preceding siblings ...)
  2019-02-13 11:21 ` [RFC 0/4] mm: Introduce lazy exec permission setting on a page Catalin Marinas
@ 2019-02-13 15:44 ` Dave Hansen
  2019-02-14  4:12   ` Anshuman Khandual
  5 siblings, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2019-02-13 15:44 UTC (permalink / raw)
  To: Anshuman Khandual, linux-mm, akpm
  Cc: mhocko, kirill, kirill.shutemov, vbabka, will.deacon, catalin.marinas

On 2/13/19 12:06 AM, Anshuman Khandual wrote:
> Setting an exec permission on a page normally triggers I-cache invalidation
> which might be expensive. I-cache invalidation is not mandatory on a given
> page if there is no immediate exec access on it. Non-fault modification of
> user page table from generic memory paths like migration can be improved if
> setting of the exec permission on the page can be deferred till actual use.
> There was a performance report [1] which highlighted the problem.

How does this happen?  If the page was not executed, then it'll
(presumably) be non-present which won't require icache invalidation.
So, this would only be for pages that have been executed (and won't
again before the next migration), *or* for pages that were mapped
executable but never executed.

Any idea which one it is?

If it's pages that got mapped in but were never executed, how did that
happen?  Was it fault-around?  If so, maybe it would just be simpler to
not do fault-around for executable pages on these platforms.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-13 15:44 ` Dave Hansen
@ 2019-02-14  4:12   ` Anshuman Khandual
  2019-02-14 16:55     ` Dave Hansen
  0 siblings, 1 reply; 28+ messages in thread
From: Anshuman Khandual @ 2019-02-14  4:12 UTC (permalink / raw)
  To: Dave Hansen, linux-mm, akpm
  Cc: mhocko, kirill, kirill.shutemov, vbabka, will.deacon, catalin.marinas



On 02/13/2019 09:14 PM, Dave Hansen wrote:
> On 2/13/19 12:06 AM, Anshuman Khandual wrote:
>> Setting an exec permission on a page normally triggers I-cache invalidation
>> which might be expensive. I-cache invalidation is not mandatory on a given
>> page if there is no immediate exec access on it. Non-fault modification of
>> user page table from generic memory paths like migration can be improved if
>> setting of the exec permission on the page can be deferred till actual use.
>> There was a performance report [1] which highlighted the problem.
> 
> How does this happen?  If the page was not executed, then it'll
> (presumably) be non-present which won't require icache invalidation.
> So, this would only be for pages that have been executed (and won't
> again before the next migration), *or* for pages that were mapped
> executable but never executed.
I-cache invalidation happens while migrating a 'mapped and executable' page
irrespective whether that page was really executed for being mapped there
in the first place.

> 
> Any idea which one it is?
> 

I am not sure about this particular reported case. But was able to reproduce
the problem through a test case where a buffer was mapped with R|W|X, get it
faulted/mapped through write, migrate and then execute from it.

> If it's pages that got mapped in but were never executed, how did that
> happen?  Was it fault-around?  If so, maybe it would just be simpler to
> not do fault-around for executable pages on these platforms.
Page can get mapped through a different access (write) without being executed.
Even if it got mapped through execution and subsequent invalidation, the
invalidation does not have to be repeated again after migration without first
getting an exec access subsequently. This series just tries to hold off the
invalidation after migration till subsequent exec access.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-13 15:38   ` Michal Hocko
@ 2019-02-14  6:04     ` Anshuman Khandual
  2019-02-14  8:38       ` Michal Hocko
  2019-02-14 15:38       ` Dave Hansen
  0 siblings, 2 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-02-14  6:04 UTC (permalink / raw)
  To: Michal Hocko, Catalin Marinas
  Cc: linux-mm, akpm, kirill, kirill.shutemov, vbabka, will.deacon,
	dave.hansen



On 02/13/2019 09:08 PM, Michal Hocko wrote:
> On Wed 13-02-19 11:21:36, Catalin Marinas wrote:
>> On Wed, Feb 13, 2019 at 01:36:27PM +0530, Anshuman Khandual wrote:
>>> Setting an exec permission on a page normally triggers I-cache invalidation
>>> which might be expensive. I-cache invalidation is not mandatory on a given
>>> page if there is no immediate exec access on it. Non-fault modification of
>>> user page table from generic memory paths like migration can be improved if
>>> setting of the exec permission on the page can be deferred till actual use.
>>> There was a performance report [1] which highlighted the problem.
>> [...]
>>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html
>>
>> FTR, this performance regression has been addressed by commit
>> 132fdc379eb1 ("arm64: Do not issue IPIs for user executable ptes"). That
>> said, I still think this patch series is valuable for further optimising
>> the page migration path on arm64 (and can be extended to other
>> architectures that currently require I/D cache maintenance for
>> executable pages).
> 
> Are there any numbers to show the optimization impact?

This series transfers execution cost linearly with nr_pages from migration path
to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
HugeTLB and THP migration enablement on arm64 platform.

A. [Normal Pages]

nr_pages	migration1 	migration2	execfault1	execfault2	

1000 		7.000000	3.000000	24.000000	31.000000
5000 		38.000000 	18.000000	127.000000	153.000000
10000 		80.000000 	40.000000	289.000000	343.000000
15000		120.000000	60.000000	435.000000	514.000000
19900 		159.000000	79.000000	576.000000	681.000000

B. [THP Pages]

nr_pages	migration1 	migration2	execfault1	execfault2

10 		22.000000	3.000000	131.000000	146.000000
30 		72.000000	15.000000	443.000000	503.000000
50 		121.000000	24.000000	739.000000	837.000000
100 		242.000000	49.000000	1485.000000	1673.000000
199 		473.000000 	98.000000	2685.000000	3327.000000

C. [HugeTLB Pages]

nr_pages	migration1 	migration2	execfault1	execfault2

10		97.000000 	79.000000	125.000000	144.000000
30 		292.000000 	235.000000	408.000000	463.000000
50 		487.000000 	392.000000	674.000000	777.000000
100 		995.000000 	802.000000	1480.000000	1671.000000
130 		1300.000000 	1048.000000	1925.000000	2172.000000

NOTE:

migration1: Execution time (ms) for migrating nr_pages without patches
migration2: Execution time (ms) for migrating nr_pages with patches
execfault1: Execution time (ms) for executing nr_pages without patches
execfault2: Execution time (ms) for executing nr_pages with patches


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-14  6:04     ` Anshuman Khandual
@ 2019-02-14  8:38       ` Michal Hocko
  2019-02-14 10:19         ` Catalin Marinas
  2019-02-14 15:38       ` Dave Hansen
  1 sibling, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-02-14  8:38 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Catalin Marinas, linux-mm, akpm, kirill, kirill.shutemov, vbabka,
	will.deacon, dave.hansen

On Thu 14-02-19 11:34:09, Anshuman Khandual wrote:
> 
> 
> On 02/13/2019 09:08 PM, Michal Hocko wrote:
> > On Wed 13-02-19 11:21:36, Catalin Marinas wrote:
> >> On Wed, Feb 13, 2019 at 01:36:27PM +0530, Anshuman Khandual wrote:
> >>> Setting an exec permission on a page normally triggers I-cache invalidation
> >>> which might be expensive. I-cache invalidation is not mandatory on a given
> >>> page if there is no immediate exec access on it. Non-fault modification of
> >>> user page table from generic memory paths like migration can be improved if
> >>> setting of the exec permission on the page can be deferred till actual use.
> >>> There was a performance report [1] which highlighted the problem.
> >> [...]
> >>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html
> >>
> >> FTR, this performance regression has been addressed by commit
> >> 132fdc379eb1 ("arm64: Do not issue IPIs for user executable ptes"). That
> >> said, I still think this patch series is valuable for further optimising
> >> the page migration path on arm64 (and can be extended to other
> >> architectures that currently require I/D cache maintenance for
> >> executable pages).
> > 
> > Are there any numbers to show the optimization impact?
> 
> This series transfers execution cost linearly with nr_pages from migration path
> to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
> is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
> HugeTLB and THP migration enablement on arm64 platform.

Please make sure that these numbers are in the changelog. I am also
missing an explanation why this is an overal win. Why should we pay
on the later access rather than the migration which is arguably a slower
path. What is the usecase that benefits from the cost shift?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 1/4] mm: Introduce lazy exec permission setting on a page
  2019-02-13 13:53     ` Anshuman Khandual
@ 2019-02-14  9:06       ` Mike Rapoport
  2019-02-15  8:11         ` Anshuman Khandual
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Rapoport @ 2019-02-14  9:06 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Matthew Wilcox, linux-mm, akpm, mhocko, kirill, kirill.shutemov,
	vbabka, will.deacon, catalin.marinas, dave.hansen

On Wed, Feb 13, 2019 at 07:23:18PM +0530, Anshuman Khandual wrote:
> 
> 
> On 02/13/2019 06:47 PM, Matthew Wilcox wrote:
> > On Wed, Feb 13, 2019 at 01:36:28PM +0530, Anshuman Khandual wrote:
> >> +#ifdef CONFIG_ARCH_SUPPORTS_LAZY_EXEC
> >> +static inline pte_t maybe_mkexec(pte_t entry, struct vm_area_struct *vma)
> >> +{
> >> +	if (unlikely(vma->vm_flags & VM_EXEC))
> >> +		return pte_mkexec(entry);
> >> +	return entry;
> >> +}
> >> +#else
> >> +static inline pte_t maybe_mkexec(pte_t entry, struct vm_area_struct *vma)
> >> +{
> >> +	return entry;
> >> +}
> >> +#endif
> > 
> >> +++ b/mm/memory.c
> >> @@ -2218,6 +2218,8 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
> >>  	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> >>  	entry = pte_mkyoung(vmf->orig_pte);
> >>  	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> >> +	if (vmf->flags & FAULT_FLAG_INSTRUCTION)
> >> +		entry = maybe_mkexec(entry, vma);
> > 
> > I don't understand this bit.  We have a fault based on an instruction
> > fetch.  But we're only going to _maybe_ set the exec bit?  Why not call
> > pte_mkexec() unconditionally?
> 
> Because the arch might not have subscribed to this in which case the fall
> back function does nothing and return the same entry. But in case this is
> enabled it also checks for VMA exec flag (VM_EXEC) before calling into
> pte_mkexec() something similar to existing maybe_mkwrite().

Than why not pass vmf->flags to maybe_mkexec() so that only arches
subscribed to this will have the check for 'flags & FAULT_FLAG_INSTRUCTION' ?

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-14  8:38       ` Michal Hocko
@ 2019-02-14 10:19         ` Catalin Marinas
  2019-02-14 12:28           ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Catalin Marinas @ 2019-02-14 10:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Anshuman Khandual, linux-mm, akpm, kirill, kirill.shutemov,
	vbabka, will.deacon, dave.hansen

On Thu, Feb 14, 2019 at 09:38:44AM +0100, Michal Hocko wrote:
> On Thu 14-02-19 11:34:09, Anshuman Khandual wrote:
> > On 02/13/2019 09:08 PM, Michal Hocko wrote:
> > > Are there any numbers to show the optimization impact?
> > 
> > This series transfers execution cost linearly with nr_pages from migration path
> > to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
> > is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
> > HugeTLB and THP migration enablement on arm64 platform.
> 
> Please make sure that these numbers are in the changelog. I am also
> missing an explanation why this is an overal win. Why should we pay
> on the later access rather than the migration which is arguably a slower
> path. What is the usecase that benefits from the cost shift?

Originally the investigation started because of a regression we had
sending IPIs on each set_pte_at(PROT_EXEC). This has been fixed
separately, so the original value of this patchset has been diminished.

Trying to frame the problem, let's analyse the overall cost of migration
+ execute. Removing other invariants like cost of the initial mapping of
the pages or the mapping of new pages after migration, we have:

M - number of mapped executable pages just before migration
N - number of previously mapped pages that will be executed after
    migration (N <= M)
D - cost of migrating page data
I - cost of I-cache maintenance for a page
F - cost of an instruction fault (handle_mm_fault() + set_pte_at()
    without the actual I-cache maintenance)

Tc - total migration cost current kernel (including executing)
Tp - total migration cost patched kernel (including executing)

  Tc = M * (D + I)
  Tp = M * D + N * (F + I)

To be useful, we want this patchset to lead to:

  Tp < Tc

Simplifying:

  M * D + N * (F + I) < M * (D + I)
  ...
  F < I * (M - N) / N

So the question is, in a *real-world* scenario, what proportion of the
mapped executable pages would still be executed from after migration.
I'd leave this as a task for Anshuman to investigate and come up with
some numbers (and it's fine if it's just in the noise, we won't need
this patchset).

Also note that there are ARM CPU implementations that don't need I-cache
maintenance (the I side can snoop the D side), so for those this
patchset introducing an additional cost. But we can make the decision in
the arch code via pte_mklazyexec().

We implemented something similar in arm64 KVM (d0e22b4ac3ba "KVM:
arm/arm64: Limit icache invalidation to prefetch aborts") but the
use-case was different: previously KVM considered all pages executable
though the vast majority were only data pages in guests.

-- 
Catalin

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-14 10:19         ` Catalin Marinas
@ 2019-02-14 12:28           ` Michal Hocko
  2019-02-15  8:45             ` Anshuman Khandual
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-02-14 12:28 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Anshuman Khandual, linux-mm, akpm, kirill, kirill.shutemov,
	vbabka, will.deacon, dave.hansen

On Thu 14-02-19 10:19:37, Catalin Marinas wrote:
> On Thu, Feb 14, 2019 at 09:38:44AM +0100, Michal Hocko wrote:
> > On Thu 14-02-19 11:34:09, Anshuman Khandual wrote:
> > > On 02/13/2019 09:08 PM, Michal Hocko wrote:
> > > > Are there any numbers to show the optimization impact?
> > > 
> > > This series transfers execution cost linearly with nr_pages from migration path
> > > to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
> > > is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
> > > HugeTLB and THP migration enablement on arm64 platform.
> > 
> > Please make sure that these numbers are in the changelog. I am also
> > missing an explanation why this is an overal win. Why should we pay
> > on the later access rather than the migration which is arguably a slower
> > path. What is the usecase that benefits from the cost shift?
> 
> Originally the investigation started because of a regression we had
> sending IPIs on each set_pte_at(PROT_EXEC). This has been fixed
> separately, so the original value of this patchset has been diminished.
> 
> Trying to frame the problem, let's analyse the overall cost of migration
> + execute. Removing other invariants like cost of the initial mapping of
> the pages or the mapping of new pages after migration, we have:
> 
> M - number of mapped executable pages just before migration
> N - number of previously mapped pages that will be executed after
>     migration (N <= M)
> D - cost of migrating page data
> I - cost of I-cache maintenance for a page
> F - cost of an instruction fault (handle_mm_fault() + set_pte_at()
>     without the actual I-cache maintenance)
> 
> Tc - total migration cost current kernel (including executing)
> Tp - total migration cost patched kernel (including executing)
> 
>   Tc = M * (D + I)
>   Tp = M * D + N * (F + I)
> 
> To be useful, we want this patchset to lead to:
> 
>   Tp < Tc
> 
> Simplifying:
> 
>   M * D + N * (F + I) < M * (D + I)
>   ...
>   F < I * (M - N) / N
> 
> So the question is, in a *real-world* scenario, what proportion of the
> mapped executable pages would still be executed from after migration.
> I'd leave this as a task for Anshuman to investigate and come up with
> some numbers (and it's fine if it's just in the noise, we won't need
> this patchset).

Yeah, betting on accessing only a smaller subset of the migrated memory
is something I figured out. But I am really missing a usecase or a
larger set of them to actually benefit from it. We have different
triggers for a migration. E.g. numa balancing. I would expect that
migrated pages are likely to be accessed after migration because
the primary reason to migrate them is that they are accessed from a
remote node. Then we a compaction which is a completely different story.
It is hard to assume any further access for migrated pages here. Then we
have an explicit move_pages syscall and I would expect this to be
somewhere in the middle. One would expect that the caller knows why the
memory is migrated and it will be used but again, we cannot really
assume anything.

This would suggest that this depends on the migration reason quite a
lot. So I would really like to see a more comprehensive analysis of
different workloads to see whether this is really worth it.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-14  6:04     ` Anshuman Khandual
  2019-02-14  8:38       ` Michal Hocko
@ 2019-02-14 15:38       ` Dave Hansen
  2019-02-18  3:19         ` Anshuman Khandual
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2019-02-14 15:38 UTC (permalink / raw)
  To: Anshuman Khandual, Michal Hocko, Catalin Marinas
  Cc: linux-mm, akpm, kirill, kirill.shutemov, vbabka, will.deacon

On 2/13/19 10:04 PM, Anshuman Khandual wrote:
>> Are there any numbers to show the optimization impact?
> This series transfers execution cost linearly with nr_pages from migration path
> to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
> is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
> HugeTLB and THP migration enablement on arm64 platform.
> 
> A. [Normal Pages]
> 
> nr_pages	migration1 	migration2	execfault1	execfault2	
> 
> 1000 		7.000000	3.000000	24.000000	31.000000
> 5000 		38.000000 	18.000000	127.000000	153.000000
> 10000 		80.000000 	40.000000	289.000000	343.000000
> 15000		120.000000	60.000000	435.000000	514.000000
> 19900 		159.000000	79.000000	576.000000	681.000000

Do these numbers comprehend the increased fault costs or just the
decreased migration costs?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-14  4:12   ` Anshuman Khandual
@ 2019-02-14 16:55     ` Dave Hansen
  2019-02-18  8:31       ` Anshuman Khandual
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2019-02-14 16:55 UTC (permalink / raw)
  To: Anshuman Khandual, linux-mm, akpm
  Cc: mhocko, kirill, kirill.shutemov, vbabka, will.deacon, catalin.marinas

On 2/13/19 8:12 PM, Anshuman Khandual wrote:
> On 02/13/2019 09:14 PM, Dave Hansen wrote:
>> On 2/13/19 12:06 AM, Anshuman Khandual wrote:
>>> Setting an exec permission on a page normally triggers I-cache invalidation
>>> which might be expensive. I-cache invalidation is not mandatory on a given
>>> page if there is no immediate exec access on it. Non-fault modification of
>>> user page table from generic memory paths like migration can be improved if
>>> setting of the exec permission on the page can be deferred till actual use.
>>> There was a performance report [1] which highlighted the problem.
>>
>> How does this happen?  If the page was not executed, then it'll
>> (presumably) be non-present which won't require icache invalidation.
>> So, this would only be for pages that have been executed (and won't
>> again before the next migration), *or* for pages that were mapped
>> executable but never executed.
> I-cache invalidation happens while migrating a 'mapped and executable' page
> irrespective whether that page was really executed for being mapped there
> in the first place.

Ahh, got it.  I also assume that the Accessed bit on these platforms is
also managed similar to how we do it on x86 such that it can't be used
to drive invalidation decisions?

>> Any idea which one it is?
> 
> I am not sure about this particular reported case. But was able to reproduce
> the problem through a test case where a buffer was mapped with R|W|X, get it
> faulted/mapped through write, migrate and then execute from it.

Could you make sure, please?

Write and Execute at the same time are generally a "bad idea".  Given
the hardware, I'm not surprised that this problem pops up, but it would
be great to find out if this is a real application, or a "doctor it
hurts when I do this."

>> If it's pages that got mapped in but were never executed, how did that
>> happen?  Was it fault-around?  If so, maybe it would just be simpler to
>> not do fault-around for executable pages on these platforms.
> Page can get mapped through a different access (write) without being executed.
> Even if it got mapped through execution and subsequent invalidation, the
> invalidation does not have to be repeated again after migration without first
> getting an exec access subsequently. This series just tries to hold off the
> invalidation after migration till subsequent exec access.

This set generally seems to be assuming an environment with "lots of
migration, and not much execution".  That seems like a kinda odd
situation to me.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 1/4] mm: Introduce lazy exec permission setting on a page
  2019-02-14  9:06       ` Mike Rapoport
@ 2019-02-15  8:11         ` Anshuman Khandual
  2019-02-15  9:49           ` Catalin Marinas
  0 siblings, 1 reply; 28+ messages in thread
From: Anshuman Khandual @ 2019-02-15  8:11 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Matthew Wilcox, linux-mm, akpm, mhocko, kirill, kirill.shutemov,
	vbabka, will.deacon, catalin.marinas, dave.hansen



On 02/14/2019 02:36 PM, Mike Rapoport wrote:
> On Wed, Feb 13, 2019 at 07:23:18PM +0530, Anshuman Khandual wrote:
>>
>>
>> On 02/13/2019 06:47 PM, Matthew Wilcox wrote:
>>> On Wed, Feb 13, 2019 at 01:36:28PM +0530, Anshuman Khandual wrote:
>>>> +#ifdef CONFIG_ARCH_SUPPORTS_LAZY_EXEC
>>>> +static inline pte_t maybe_mkexec(pte_t entry, struct vm_area_struct *vma)
>>>> +{
>>>> +	if (unlikely(vma->vm_flags & VM_EXEC))
>>>> +		return pte_mkexec(entry);
>>>> +	return entry;
>>>> +}
>>>> +#else
>>>> +static inline pte_t maybe_mkexec(pte_t entry, struct vm_area_struct *vma)
>>>> +{
>>>> +	return entry;
>>>> +}
>>>> +#endif
>>>
>>>> +++ b/mm/memory.c
>>>> @@ -2218,6 +2218,8 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
>>>>  	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>>>>  	entry = pte_mkyoung(vmf->orig_pte);
>>>>  	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>>>> +	if (vmf->flags & FAULT_FLAG_INSTRUCTION)
>>>> +		entry = maybe_mkexec(entry, vma);
>>>
>>> I don't understand this bit.  We have a fault based on an instruction
>>> fetch.  But we're only going to _maybe_ set the exec bit?  Why not call
>>> pte_mkexec() unconditionally?
>>
>> Because the arch might not have subscribed to this in which case the fall
>> back function does nothing and return the same entry. But in case this is
>> enabled it also checks for VMA exec flag (VM_EXEC) before calling into
>> pte_mkexec() something similar to existing maybe_mkwrite().
> 
> Than why not pass vmf->flags to maybe_mkexec() so that only arches
> subscribed to this will have the check for 'flags & FAULT_FLAG_INSTRUCTION' ?

Right it can help remove couple of instructions from un-subscribing archs. 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-14 12:28           ` Michal Hocko
@ 2019-02-15  8:45             ` Anshuman Khandual
  2019-02-15  9:27               ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Anshuman Khandual @ 2019-02-15  8:45 UTC (permalink / raw)
  To: Michal Hocko, Catalin Marinas
  Cc: linux-mm, akpm, kirill, kirill.shutemov, vbabka, will.deacon,
	dave.hansen

a

On 02/14/2019 05:58 PM, Michal Hocko wrote:
> On Thu 14-02-19 10:19:37, Catalin Marinas wrote:
>> On Thu, Feb 14, 2019 at 09:38:44AM +0100, Michal Hocko wrote:
>>> On Thu 14-02-19 11:34:09, Anshuman Khandual wrote:
>>>> On 02/13/2019 09:08 PM, Michal Hocko wrote:
>>>>> Are there any numbers to show the optimization impact?
>>>>
>>>> This series transfers execution cost linearly with nr_pages from migration path
>>>> to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
>>>> is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
>>>> HugeTLB and THP migration enablement on arm64 platform.
>>>
>>> Please make sure that these numbers are in the changelog. I am also
>>> missing an explanation why this is an overal win. Why should we pay
>>> on the later access rather than the migration which is arguably a slower
>>> path. What is the usecase that benefits from the cost shift?
>>
>> Originally the investigation started because of a regression we had
>> sending IPIs on each set_pte_at(PROT_EXEC). This has been fixed
>> separately, so the original value of this patchset has been diminished.
>>
>> Trying to frame the problem, let's analyse the overall cost of migration
>> + execute. Removing other invariants like cost of the initial mapping of
>> the pages or the mapping of new pages after migration, we have:
>>
>> M - number of mapped executable pages just before migration
>> N - number of previously mapped pages that will be executed after
>>     migration (N <= M)
>> D - cost of migrating page data
>> I - cost of I-cache maintenance for a page
>> F - cost of an instruction fault (handle_mm_fault() + set_pte_at()
>>     without the actual I-cache maintenance)
>>
>> Tc - total migration cost current kernel (including executing)
>> Tp - total migration cost patched kernel (including executing)
>>
>>   Tc = M * (D + I)
>>   Tp = M * D + N * (F + I)
>>
>> To be useful, we want this patchset to lead to:
>>
>>   Tp < Tc
>>
>> Simplifying:
>>
>>   M * D + N * (F + I) < M * (D + I)
>>   ...
>>   F < I * (M - N) / N
>>
>> So the question is, in a *real-world* scenario, what proportion of the
>> mapped executable pages would still be executed from after migration.
>> I'd leave this as a task for Anshuman to investigate and come up with
>> some numbers (and it's fine if it's just in the noise, we won't need
>> this patchset).
> 
> Yeah, betting on accessing only a smaller subset of the migrated memory
> is something I figured out. But I am really missing a usecase or a
> larger set of them to actually benefit from it. We have different
> triggers for a migration. E.g. numa balancing. I would expect that
> migrated pages are likely to be accessed after migration because
> the primary reason to migrate them is that they are accessed from a
> remote node. Then we a compaction which is a completely different story.

That access might not have been an exec fault it could have been bunch of
write faults which triggered NUMA migration. So NUMA triggered migration
does not necessarily mean continuing exec faults before and after migration.

Compaction might move around mapped pages with exec permission which might
not have any recent history of exec accesses before compaction or might not
even see any future exec access as well.

> It is hard to assume any further access for migrated pages here. Then we
> have an explicit move_pages syscall and I would expect this to be
> somewhere in the middle. One would expect that the caller knows why the
> memory is migrated and it will be used but again, we cannot really
> assume anything.

What if the caller knows that it wont be used ever again or in near future
and hence trying to migrate to a different node which has less expensive and
slower memory. Kernel should not assume either way on it but can decide to
be conservative in spending time in preparing for future exec faults.

But being conservative during migration risks additional exec faults which
would have been avoided if exec permission should have stayed on followed
by an I-cache invalidation. Deferral of the I-cache invalidation requires
removing the exec permission completely (unless there is some magic which
I am not aware about) i.e unmapping page for exec permission and risking
an exec fault next time around.

This problem gets particularly amplified for mixed permission (WRITE | EXEC)
user space mappings where things like NUMA migration, compaction etc probably
gets triggered by write faults and additional exec permission there never
really gets used.

> 
> This would suggest that this depends on the migration reason quite a
> lot. So I would really like to see a more comprehensive analysis of
> different workloads to see whether this is really worth it.

Sure. Could you please give some more details on how to go about this and
what specifically you are looking for ? User initiated migration through
systems calls seems bit tricky as an application can be written primarily
to benefit from this series. If real world applications can help give
some better insights then which ones I wonder. Or do we need to understand
more about compaction and NUMA triggered migration which are kernel
driven. Statistics from compaction/NUMA migration can reveal what ratio
of the exec enabled mapping gets exec faulted again later on after kernel
driven migrations (compaction/NUMA) which are more or less random without
depending too much on application behavior.

- Anshuman

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-15  8:45             ` Anshuman Khandual
@ 2019-02-15  9:27               ` Michal Hocko
  2019-02-18  3:07                 ` Anshuman Khandual
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-02-15  9:27 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Catalin Marinas, linux-mm, akpm, kirill, kirill.shutemov, vbabka,
	will.deacon, dave.hansen

On Fri 15-02-19 14:15:58, Anshuman Khandual wrote:
> On 02/14/2019 05:58 PM, Michal Hocko wrote:
> > It is hard to assume any further access for migrated pages here. Then we
> > have an explicit move_pages syscall and I would expect this to be
> > somewhere in the middle. One would expect that the caller knows why the
> > memory is migrated and it will be used but again, we cannot really
> > assume anything.
> 
> What if the caller knows that it wont be used ever again or in near future
> and hence trying to migrate to a different node which has less expensive and
> slower memory. Kernel should not assume either way on it but can decide to
> be conservative in spending time in preparing for future exec faults.
> 
> But being conservative during migration risks additional exec faults which
> would have been avoided if exec permission should have stayed on followed
> by an I-cache invalidation. Deferral of the I-cache invalidation requires
> removing the exec permission completely (unless there is some magic which
> I am not aware about) i.e unmapping page for exec permission and risking
> an exec fault next time around.
> 
> This problem gets particularly amplified for mixed permission (WRITE | EXEC)
> user space mappings where things like NUMA migration, compaction etc probably
> gets triggered by write faults and additional exec permission there never
> really gets used.

Please quantify that and provide us with some _data_

> > This would suggest that this depends on the migration reason quite a
> > lot. So I would really like to see a more comprehensive analysis of
> > different workloads to see whether this is really worth it.
> 
> Sure. Could you please give some more details on how to go about this and
> what specifically you are looking for ?

You are proposing an optimization without actually providing any
justification. The overhead is not removed it is just shifted from one
path to another. So you should have some pretty convincing arguments
to back that shift as a general win. You can go an test on wider range
of workloads and isolate the worst/best case behavior. I fully realize
that this is tedious. Another option would be to define conditions when
the optimization is going to be a huge win and have some convincing
arguments that many/most workloads are falling into that category while
pathological ones are not suffering much.

This is no different from any other optimizations/heuristics we have.

Btw. have you considered to have this optimization conditional based on
the migration reason or vma flags?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 1/4] mm: Introduce lazy exec permission setting on a page
  2019-02-15  8:11         ` Anshuman Khandual
@ 2019-02-15  9:49           ` Catalin Marinas
  0 siblings, 0 replies; 28+ messages in thread
From: Catalin Marinas @ 2019-02-15  9:49 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Mike Rapoport, Matthew Wilcox, linux-mm, akpm, mhocko, kirill,
	kirill.shutemov, vbabka, will.deacon, dave.hansen

On Fri, Feb 15, 2019 at 01:41:16PM +0530, Anshuman Khandual wrote:
> On 02/14/2019 02:36 PM, Mike Rapoport wrote:
> > On Wed, Feb 13, 2019 at 07:23:18PM +0530, Anshuman Khandual wrote:
> >> On 02/13/2019 06:47 PM, Matthew Wilcox wrote:
> >>> On Wed, Feb 13, 2019 at 01:36:28PM +0530, Anshuman Khandual wrote:
> >>>> +#ifdef CONFIG_ARCH_SUPPORTS_LAZY_EXEC
> >>>> +static inline pte_t maybe_mkexec(pte_t entry, struct vm_area_struct *vma)
> >>>> +{
> >>>> +	if (unlikely(vma->vm_flags & VM_EXEC))
> >>>> +		return pte_mkexec(entry);
> >>>> +	return entry;
> >>>> +}
> >>>> +#else
> >>>> +static inline pte_t maybe_mkexec(pte_t entry, struct vm_area_struct *vma)
> >>>> +{
> >>>> +	return entry;
> >>>> +}
> >>>> +#endif
> >>>
> >>>> +++ b/mm/memory.c
> >>>> @@ -2218,6 +2218,8 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
> >>>>  	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> >>>>  	entry = pte_mkyoung(vmf->orig_pte);
> >>>>  	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> >>>> +	if (vmf->flags & FAULT_FLAG_INSTRUCTION)
> >>>> +		entry = maybe_mkexec(entry, vma);
> >>>
> >>> I don't understand this bit.  We have a fault based on an instruction
> >>> fetch.  But we're only going to _maybe_ set the exec bit?  Why not call
> >>> pte_mkexec() unconditionally?
> >>
> >> Because the arch might not have subscribed to this in which case the fall
> >> back function does nothing and return the same entry. But in case this is
> >> enabled it also checks for VMA exec flag (VM_EXEC) before calling into
> >> pte_mkexec() something similar to existing maybe_mkwrite().
> > 
> > Than why not pass vmf->flags to maybe_mkexec() so that only arches
> > subscribed to this will have the check for 'flags & FAULT_FLAG_INSTRUCTION' ?
> 
> Right it can help remove couple of instructions from un-subscribing archs. 

If the arch does not enable CONFIG_ARCH_SUPPORTS_LAZY_EXEC, wouldn't the
compiler eliminate the FAULT_FLAG_INSTRUCTION check anyway? The current
maybe_mkexec() proposal here looks slightly nicer as it matches the
maybe_mkwrite() prototype.

-- 
Catalin


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-15  9:27               ` Michal Hocko
@ 2019-02-18  3:07                 ` Anshuman Khandual
  0 siblings, 0 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-02-18  3:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Catalin Marinas, linux-mm, akpm, kirill, kirill.shutemov, vbabka,
	will.deacon, dave.hansen

On 02/15/2019 02:57 PM, Michal Hocko wrote:
> On Fri 15-02-19 14:15:58, Anshuman Khandual wrote:
>> On 02/14/2019 05:58 PM, Michal Hocko wrote:
>>> It is hard to assume any further access for migrated pages here. Then we
>>> have an explicit move_pages syscall and I would expect this to be
>>> somewhere in the middle. One would expect that the caller knows why the
>>> memory is migrated and it will be used but again, we cannot really
>>> assume anything.
>>
>> What if the caller knows that it wont be used ever again or in near future
>> and hence trying to migrate to a different node which has less expensive and
>> slower memory. Kernel should not assume either way on it but can decide to
>> be conservative in spending time in preparing for future exec faults.
>>
>> But being conservative during migration risks additional exec faults which
>> would have been avoided if exec permission should have stayed on followed
>> by an I-cache invalidation. Deferral of the I-cache invalidation requires
>> removing the exec permission completely (unless there is some magic which
>> I am not aware about) i.e unmapping page for exec permission and risking
>> an exec fault next time around.
>>
>> This problem gets particularly amplified for mixed permission (WRITE | EXEC)
>> user space mappings where things like NUMA migration, compaction etc probably
>> gets triggered by write faults and additional exec permission there never
>> really gets used.
> 
> Please quantify that and provide us with some _data_> 
>>> This would suggest that this depends on the migration reason quite a
>>> lot. So I would really like to see a more comprehensive analysis of
>>> different workloads to see whether this is really worth it.
>>
>> Sure. Could you please give some more details on how to go about this and
>> what specifically you are looking for ?
> 
> You are proposing an optimization without actually providing any
> justification. The overhead is not removed it is just shifted from one
> path to another. So you should have some pretty convincing arguments
> to back that shift as a general win. You can go an test on wider range
> of workloads and isolate the worst/best case behavior. I fully realize
> that this is tedious. Another option would be to define conditions when
> the optimization is going to be a huge win and have some convincing

Yeah conditional approach might narrow down the field and provide better
probability for a general win. The system call (move_pages/mbind) based
migrations from the user space are better placed for an win because the
user might just want to put those pages aside for rare exec accesses in
the future and the worst case cost for those deferral is not too high as
well. A hint regarding probable rare exec access in the future for the
kernel would have been better but I am afraid it would then require a new
user interface. But I think lazy exec decision can be taken right away
for MR_SYSCALL triggered migrations for VMAs with mixed permission
([VM_READ]|VM_WRITE|VM_EXEC) knowing the fact that in worst case the
cost is just getting migrated.

MR_NUMA_MISPLACED triggered migrations requires explicit tracking of fault
type (exec/write/[read]) per VMA along with it's applicable permission to
determine if exec permission deferral would be helpful or not. These stats
can also be used for all other kernel or user initiated migrations like
MR_COMPACTION, MR_MEMORY_FAILURE, MR_MEMORY_HOTPLUG and MR_CONTIG_RANGE.

Would it be worth adding explicit fault type tracking per VMA ? Can it be
used for some other purpose as well.

> arguments that many/most workloads are falling into that category while
> pathological ones are not suffering much.
> 
> This is no different from any other optimizations/heuristics we have.

Sure. Will think about this further.

> 
> Btw. have you considered to have this optimization conditional based on
> the migration reason or vma flags?

Started considering it after our discussions here. It makes sense to look
into the migration reason and the VMA flags right away but as I mentioned
earlier VMA fault type stats can really help on this as well.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-14 15:38       ` Dave Hansen
@ 2019-02-18  3:19         ` Anshuman Khandual
  0 siblings, 0 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-02-18  3:19 UTC (permalink / raw)
  To: Dave Hansen, Michal Hocko, Catalin Marinas
  Cc: linux-mm, akpm, kirill, kirill.shutemov, vbabka, will.deacon



On 02/14/2019 09:08 PM, Dave Hansen wrote:
> On 2/13/19 10:04 PM, Anshuman Khandual wrote:
>>> Are there any numbers to show the optimization impact?
>> This series transfers execution cost linearly with nr_pages from migration path
>> to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
>> is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
>> HugeTLB and THP migration enablement on arm64 platform.
>>
>> A. [Normal Pages]
>>
>> nr_pages	migration1 	migration2	execfault1	execfault2	
>>
>> 1000 		7.000000	3.000000	24.000000	31.000000
>> 5000 		38.000000 	18.000000	127.000000	153.000000
>> 10000 		80.000000 	40.000000	289.000000	343.000000
>> 15000		120.000000	60.000000	435.000000	514.000000
>> 19900 		159.000000	79.000000	576.000000	681.000000
> 
> Do these numbers comprehend the increased fault costs or just the
> decreased migration costs?

Both. It transfers cost from migration path to exec fault path.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-14 16:55     ` Dave Hansen
@ 2019-02-18  8:31       ` Anshuman Khandual
  2019-02-18  9:04         ` Catalin Marinas
  2019-02-18 18:20         ` Dave Hansen
  0 siblings, 2 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-02-18  8:31 UTC (permalink / raw)
  To: Dave Hansen, linux-mm, akpm
  Cc: mhocko, kirill, kirill.shutemov, vbabka, will.deacon, catalin.marinas

On 02/14/2019 10:25 PM, Dave Hansen wrote:
> On 2/13/19 8:12 PM, Anshuman Khandual wrote:
>> On 02/13/2019 09:14 PM, Dave Hansen wrote:
>>> On 2/13/19 12:06 AM, Anshuman Khandual wrote:
>>>> Setting an exec permission on a page normally triggers I-cache invalidation
>>>> which might be expensive. I-cache invalidation is not mandatory on a given
>>>> page if there is no immediate exec access on it. Non-fault modification of
>>>> user page table from generic memory paths like migration can be improved if
>>>> setting of the exec permission on the page can be deferred till actual use.
>>>> There was a performance report [1] which highlighted the problem.
>>>
>>> How does this happen?  If the page was not executed, then it'll
>>> (presumably) be non-present which won't require icache invalidation.
>>> So, this would only be for pages that have been executed (and won't
>>> again before the next migration), *or* for pages that were mapped
>>> executable but never executed.
>> I-cache invalidation happens while migrating a 'mapped and executable' page
>> irrespective whether that page was really executed for being mapped there
>> in the first place.
> 
> Ahh, got it.  I also assume that the Accessed bit on these platforms is
> also managed similar to how we do it on x86 such that it can't be used
> to drive invalidation decisions?

Drive I-cache invalidation ? Could you please elaborate on this. Is not that
the access bit mechanism is to identify dirty pages after write faults when
it is SW updated or write accesses when HW updated. In SW updated method, given
PTE goes through pte_young() during page fault. Then how to differentiate exec
fault/access from an write fault/access and decide to invalidate the I-cache.
Just being curious.

> 
>>> Any idea which one it is?
>>
>> I am not sure about this particular reported case. But was able to reproduce
>> the problem through a test case where a buffer was mapped with R|W|X, get it
>> faulted/mapped through write, migrate and then execute from it.
> 
> Could you make sure, please?

The test in the report [1] does not create any explicit PROT_EXEC maps and just
attempts to migrate all pages of the process (which has 10 child processes)
including the exec pages. So the only exec mappings would be the primary text
segment and the mapped shared glibc segment. Looks like the shared libraries
have some mapped pages.

$cat /proc/[PID]/numa_maps  | grep libc

ffffaa4c9000 default file=/lib/aarch64-linux-gnu/libc-2.28.so mapped=150 mapmax=57 N0=150 kernelpagesize_kB=4
ffffaa621000 default file=/lib/aarch64-linux-gnu/libc-2.28.so
ffffaa630000 default file=/lib/aarch64-linux-gnu/libc-2.28.so anon=4 dirty=4 mapmax=11 N0=4 kernelpagesize_kB=4
ffffaa634000 default file=/lib/aarch64-linux-gnu/libc-2.28.so anon=2 dirty=2 mapmax=11 N0=2 kernelpagesize_kB=4

Will keep looking into this.

> 
> Write and Execute at the same time are generally a "bad idea".  Given

But wont this be the case for all run-time generate code which gets written to a
buffer before being executed from there.

> the hardware, I'm not surprised that this problem pops up, but it would
> be great to find out if this is a real application, or a "doctor it
> hurts when I do this."

Is not that a problem though :)

> 
>>> If it's pages that got mapped in but were never executed, how did that
>>> happen?  Was it fault-around?  If so, maybe it would just be simpler to
>>> not do fault-around for executable pages on these platforms.
>> Page can get mapped through a different access (write) without being executed.
>> Even if it got mapped through execution and subsequent invalidation, the
>> invalidation does not have to be repeated again after migration without first
>> getting an exec access subsequently. This series just tries to hold off the
>> invalidation after migration till subsequent exec access.
> 
> This set generally seems to be assuming an environment with "lots of
> migration, and not much execution".  That seems like a kinda odd
> situation to me.

Irrespective of the reported problem which is user driven, there are many kernel
triggered migrations which can accumulate I-cache invalidation cost over time on
a memory heavy system with high number of exec enabled user pages. Will that be
such a rare situation !

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-18  8:31       ` Anshuman Khandual
@ 2019-02-18  9:04         ` Catalin Marinas
  2019-02-18  9:16           ` Anshuman Khandual
  2019-02-18 18:20         ` Dave Hansen
  1 sibling, 1 reply; 28+ messages in thread
From: Catalin Marinas @ 2019-02-18  9:04 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Dave Hansen, linux-mm, akpm, mhocko, kirill, kirill.shutemov,
	vbabka, will.deacon

On Mon, Feb 18, 2019 at 02:01:55PM +0530, Anshuman Khandual wrote:
> On 02/14/2019 10:25 PM, Dave Hansen wrote:
> > On 2/13/19 8:12 PM, Anshuman Khandual wrote:
> >> On 02/13/2019 09:14 PM, Dave Hansen wrote:
> >>> On 2/13/19 12:06 AM, Anshuman Khandual wrote:
> >>>> Setting an exec permission on a page normally triggers I-cache invalidation
> >>>> which might be expensive. I-cache invalidation is not mandatory on a given
> >>>> page if there is no immediate exec access on it. Non-fault modification of
> >>>> user page table from generic memory paths like migration can be improved if
> >>>> setting of the exec permission on the page can be deferred till actual use.
> >>>> There was a performance report [1] which highlighted the problem.
> >>>
> >>> How does this happen?  If the page was not executed, then it'll
> >>> (presumably) be non-present which won't require icache invalidation.
> >>> So, this would only be for pages that have been executed (and won't
> >>> again before the next migration), *or* for pages that were mapped
> >>> executable but never executed.
> >> I-cache invalidation happens while migrating a 'mapped and executable' page
> >> irrespective whether that page was really executed for being mapped there
> >> in the first place.
> > 
> > Ahh, got it.  I also assume that the Accessed bit on these platforms is
> > also managed similar to how we do it on x86 such that it can't be used
> > to drive invalidation decisions?
> 
> Drive I-cache invalidation ? Could you please elaborate on this. Is not that
> the access bit mechanism is to identify dirty pages after write faults when
> it is SW updated or write accesses when HW updated. In SW updated method, given
> PTE goes through pte_young() during page fault. Then how to differentiate exec
> fault/access from an write fault/access and decide to invalidate the I-cache.
> Just being curious.

The access flag is used to identify young/old pages only (the dirty bit
is used to track writes to a page). Depending on the Arm implementation,
the access bit/flag could be managed by hardware transparently, so no
fault taken to the kernel on accessing through an 'old' pte.

-- 
Catalin


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-18  9:04         ` Catalin Marinas
@ 2019-02-18  9:16           ` Anshuman Khandual
  0 siblings, 0 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-02-18  9:16 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Dave Hansen, linux-mm, akpm, mhocko, kirill, kirill.shutemov,
	vbabka, will.deacon



On 02/18/2019 02:34 PM, Catalin Marinas wrote:
> On Mon, Feb 18, 2019 at 02:01:55PM +0530, Anshuman Khandual wrote:
>> On 02/14/2019 10:25 PM, Dave Hansen wrote:
>>> On 2/13/19 8:12 PM, Anshuman Khandual wrote:
>>>> On 02/13/2019 09:14 PM, Dave Hansen wrote:
>>>>> On 2/13/19 12:06 AM, Anshuman Khandual wrote:
>>>>>> Setting an exec permission on a page normally triggers I-cache invalidation
>>>>>> which might be expensive. I-cache invalidation is not mandatory on a given
>>>>>> page if there is no immediate exec access on it. Non-fault modification of
>>>>>> user page table from generic memory paths like migration can be improved if
>>>>>> setting of the exec permission on the page can be deferred till actual use.
>>>>>> There was a performance report [1] which highlighted the problem.
>>>>>
>>>>> How does this happen?  If the page was not executed, then it'll
>>>>> (presumably) be non-present which won't require icache invalidation.
>>>>> So, this would only be for pages that have been executed (and won't
>>>>> again before the next migration), *or* for pages that were mapped
>>>>> executable but never executed.
>>>> I-cache invalidation happens while migrating a 'mapped and executable' page
>>>> irrespective whether that page was really executed for being mapped there
>>>> in the first place.
>>>
>>> Ahh, got it.  I also assume that the Accessed bit on these platforms is
>>> also managed similar to how we do it on x86 such that it can't be used
>>> to drive invalidation decisions?
>>
>> Drive I-cache invalidation ? Could you please elaborate on this. Is not that
>> the access bit mechanism is to identify dirty pages after write faults when
>> it is SW updated or write accesses when HW updated. In SW updated method, given
>> PTE goes through pte_young() during page fault. Then how to differentiate exec
>> fault/access from an write fault/access and decide to invalidate the I-cache.
>> Just being curious.
> 
> The access flag is used to identify young/old pages only (the dirty bit
> is used to track writes to a page). Depending on the Arm implementation,
> the access bit/flag could be managed by hardware transparently, so no
> fault taken to the kernel on accessing through an 'old' pte.

Then there is no way to identify an exec fault with either of the facilities of
access/reference bit or dirty bit whether managed by SW or HW. Still wondering about
previous comment where Dave mentioned how it can be used for I-cache invalidation.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/4] mm: Introduce lazy exec permission setting on a page
  2019-02-18  8:31       ` Anshuman Khandual
  2019-02-18  9:04         ` Catalin Marinas
@ 2019-02-18 18:20         ` Dave Hansen
  1 sibling, 0 replies; 28+ messages in thread
From: Dave Hansen @ 2019-02-18 18:20 UTC (permalink / raw)
  To: Anshuman Khandual, linux-mm, akpm
  Cc: mhocko, kirill, kirill.shutemov, vbabka, will.deacon, catalin.marinas

On 2/18/19 12:31 AM, Anshuman Khandual wrote:
>> Ahh, got it.  I also assume that the Accessed bit on these platforms is
>> also managed similar to how we do it on x86 such that it can't be used
>> to drive invalidation decisions?
> 
> Drive I-cache invalidation ? Could you please elaborate on this. Is not that
> the access bit mechanism is to identify dirty pages after write faults when
> it is SW updated or write accesses when HW updated. In SW updated method, given
> PTE goes through pte_young() during page fault. Then how to differentiate exec
> fault/access from an write fault/access and decide to invalidate the I-cache.
> Just being curious.

Let's say this was on x86 where the Accessed bit is set by the hardware
on any access.  Let's also say that Linux invalidated the TLB any time
that bit was cleared in software (it doesn't, but let's pretend it did).

In that case, if we needed to do icache invalidation, we could optimize
it by only invalidating the icache when we see the Accessed bit set.
That's because any execution would first set the Accessed bit before the
icache was populated.

So, my question

>>>> Any idea which one it is?
>>>
>>> I am not sure about this particular reported case. But was able to reproduce
>>> the problem through a test case where a buffer was mapped with R|W|X, get it
>>> faulted/mapped through write, migrate and then execute from it.
>>
>> Could you make sure, please?
> 
> The test in the report [1] does not create any explicit PROT_EXEC maps and just
> attempts to migrate all pages of the process (which has 10 child processes)
> including the exec pages. So the only exec mappings would be the primary text
> segment and the mapped shared glibc segment. Looks like the shared libraries
> have some mapped pages.

Yeah, but the executable ones are also read-only in your example:

> $cat /proc/[PID]/numa_maps  | grep libc
> 
> ffffaa4c9000 default file=/lib/aarch64-linux-gnu/libc-2.28.so mapped=150 mapmax=57 N0=150 kernelpagesize_kB=4

^ These are all page-cache, executable and read-only.

> ffffaa621000 default file=/lib/aarch64-linux-gnu/libc-2.28.so
> ffffaa630000 default file=/lib/aarch64-linux-gnu/libc-2.28.so anon=4 dirty=4 mapmax=11 N0=4 kernelpagesize_kB=4
> ffffaa634000 default file=/lib/aarch64-linux-gnu/libc-2.28.so anon=2 dirty=2 mapmax=11 N0=2 kernelpagesize_kB=4

This last one is the only read-write one and it's not executable.

>> Write and Execute at the same time are generally a "bad idea".  Given
> 
> But wont this be the case for all run-time generate code which gets written to a
> buffer before being executed from there.

No.  They usually are r=1,w=1,x=0, then transition to r=1,w=0,x=1.  It's
never simultaneously executable and writable.

>> the hardware, I'm not surprised that this problem pops up, but it would
>> be great to find out if this is a real application, or a "doctor it
>> hurts when I do this."
> 
> Is not that a problem though :)

The point is that it's not a real-world problem.  You can certainly
expose this, but do *real* apps do this rather than something entirely
synthetic?

>> This set generally seems to be assuming an environment with "lots of
>> migration, and not much execution".  That seems like a kinda odd
>> situation to me.
> 
> Irrespective of the reported problem which is user driven, there are many kernel
> triggered migrations which can accumulate I-cache invalidation cost over time on
> a memory heavy system with high number of exec enabled user pages. Will that be
> such a rare situation !
> 
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html

I translate "trivial C application" to "highly synthetic microbenchmark".

I suspect what's happening here is that somebody wrote a micro that
worked well on x86, although it was being rather stupid.  Somebody got
an arm system, and voila: it's slower.  Someone says "Oh, this arm
system is slower than x86!"

Again, the big questions you have real-world applications with writable,
executable pages?  The kernel essentially has *zero* of these because
they're such a massive security risk.  Adding this feature will
encourage folks to replicate this massive security risk in userspace.

Seems like a bad idea.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2019-02-18 18:20 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-13  8:06 [RFC 0/4] mm: Introduce lazy exec permission setting on a page Anshuman Khandual
2019-02-13  8:06 ` [RFC 1/4] " Anshuman Khandual
2019-02-13 13:17   ` Matthew Wilcox
2019-02-13 13:53     ` Anshuman Khandual
2019-02-14  9:06       ` Mike Rapoport
2019-02-15  8:11         ` Anshuman Khandual
2019-02-15  9:49           ` Catalin Marinas
2019-02-13  8:06 ` [RFC 2/4] arm64/mm: Identify user level instruction faults Anshuman Khandual
2019-02-13  8:06 ` [RFC 3/4] arm64/mm: Allow non-exec to exec transition in ptep_set_access_flags() Anshuman Khandual
2019-02-13  8:06 ` [RFC 4/4] arm64/mm: Enable ARCH_SUPPORTS_LAZY_EXEC Anshuman Khandual
2019-02-13 11:21 ` [RFC 0/4] mm: Introduce lazy exec permission setting on a page Catalin Marinas
2019-02-13 15:38   ` Michal Hocko
2019-02-14  6:04     ` Anshuman Khandual
2019-02-14  8:38       ` Michal Hocko
2019-02-14 10:19         ` Catalin Marinas
2019-02-14 12:28           ` Michal Hocko
2019-02-15  8:45             ` Anshuman Khandual
2019-02-15  9:27               ` Michal Hocko
2019-02-18  3:07                 ` Anshuman Khandual
2019-02-14 15:38       ` Dave Hansen
2019-02-18  3:19         ` Anshuman Khandual
2019-02-13 15:44 ` Dave Hansen
2019-02-14  4:12   ` Anshuman Khandual
2019-02-14 16:55     ` Dave Hansen
2019-02-18  8:31       ` Anshuman Khandual
2019-02-18  9:04         ` Catalin Marinas
2019-02-18  9:16           ` Anshuman Khandual
2019-02-18 18:20         ` Dave Hansen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.