* [RFC 0/3] Slight improvements for OOM/Futex
@ 2022-04-21 19:05 Nico Pache
  2022-04-21 19:05 ` [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon Nico Pache
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Nico Pache @ 2022-04-21 19:05 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Alexander Viro, Andrew Morton, Nicholas Piggin, Logan Gunthorpe,
	Hari Bathini, Aneesh Kumar K.V, Matthew Wilcox (Oracle),
	Yang Shi, Miaohe Lin, William Kucharski, Hugh Dickins,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, André Almeida, Arjan van de Ven,
	Ulrich Drepper, Michal Hocko

The following 3 patches were developed while working on the OOM/Futex
fix. 

Patch 1: This is a nonfunctional change. The vma_is_anonymous function
name originally led to some confusion about what the oom reaper checks
for in __oom_reap_task_mm. This patch gives the function a more
descriptive name.

Patch 2: The futex cleanup code was designed to fail silently. Printing
the failure would have made this issue much quicker to find. This patch
introduces a pr_info when the exit path has any issue walking the
userspace robust list.

Patch 3: During the debug process I noticed that the oom_reap_task_mm
function always ran twice for the same mm_struct: once in the exit
path and once in the oom reaper. By checking MMF_OOM_SKIP we can
prevent these unnecessary double calls. I'd like some input from
David Rientjes here with regard to CVE-2018-1000200; I want to make
sure we are not reintroducing that CVE.

Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "André Almeida" <andrealmeid@collabora.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Nico Pache <npache@redhat.com>

Nico Pache (3):
  mm: change vma_is_anonymous to vma_is_private_anon
  futex: exit: Print a warning when futex_cleanup fails
  exit: Check for MMF_OOM_SKIP in exit_mmap

 arch/powerpc/mm/book3s64/pgtable.c |  2 +-
 fs/userfaultfd.c                   |  2 +-
 include/linux/huge_mm.h            |  2 +-
 include/linux/mm.h                 |  2 +-
 kernel/futex/core.c                | 44 ++++++++++++++++++------------
 mm/gup.c                           |  4 +--
 mm/huge_memory.c                   | 10 +++----
 mm/madvise.c                       |  4 +--
 mm/memory.c                        | 10 +++----
 mm/migrate_device.c                |  6 ++--
 mm/mincore.c                       |  2 +-
 mm/mmap.c                          |  7 +++--
 mm/oom_kill.c                      |  2 +-
 mm/userfaultfd.c                   |  8 +++---
 14 files changed, 57 insertions(+), 48 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon
  2022-04-21 19:05 [RFC 0/3] Slight improvements for OOM/Futex Nico Pache
@ 2022-04-21 19:05 ` Nico Pache
  2022-04-21 19:28   ` Matthew Wilcox
  2022-04-21 19:05 ` [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails Nico Pache
  2022-04-21 19:05 ` [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap Nico Pache
  2 siblings, 1 reply; 16+ messages in thread
From: Nico Pache @ 2022-04-21 19:05 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Alexander Viro, Andrew Morton, Nicholas Piggin, Logan Gunthorpe,
	Hari Bathini, Aneesh Kumar K.V, Matthew Wilcox (Oracle),
	Yang Shi, Miaohe Lin, William Kucharski, Hugh Dickins

The vma_is_anonymous function's name isn't fully indicative of what it
checks.

Without full knowledge of the mmap process, one may incorrectly assume
it covers all types of anonymous memory, which is not the case: it only
identifies private anonymous mappings.

No functional changes.

Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 arch/powerpc/mm/book3s64/pgtable.c |  2 +-
 fs/userfaultfd.c                   |  2 +-
 include/linux/huge_mm.h            |  2 +-
 include/linux/mm.h                 |  2 +-
 mm/gup.c                           |  4 ++--
 mm/huge_memory.c                   | 10 +++++-----
 mm/madvise.c                       |  4 ++--
 mm/memory.c                        | 10 +++++-----
 mm/migrate_device.c                |  6 +++---
 mm/mincore.c                       |  2 +-
 mm/mmap.c                          |  4 ++--
 mm/oom_kill.c                      |  2 +-
 mm/userfaultfd.c                   |  8 ++++----
 13 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 052e6590f84f..c8fd68a7965d 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -479,7 +479,7 @@ int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
 			   struct vm_area_struct *vma)
 {
 	if (radix_enabled())
-		return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
+		return (new_pmd_ptl != old_pmd_ptl) && vma_is_private_anon(vma);
 
 	return true;
 }
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index aa0c47cb0d16..d3fe2c26874a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1269,7 +1269,7 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
 			return false;
 	}
 
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+	return vma_is_private_anon(vma) || is_vm_hugetlb_page(vma) ||
 	       vma_is_shmem(vma);
 }
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2999190adc22..40493f53562c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -119,7 +119,7 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
 		unsigned long haddr)
 {
 	/* Don't have to check pgoff for anonymous vma */
-	if (!vma_is_anonymous(vma)) {
+	if (!vma_is_private_anon(vma)) {
 		if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
 				HPAGE_PMD_NR))
 			return false;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e34edb775334..9a01d053853b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -623,7 +623,7 @@ static inline void vma_set_anonymous(struct vm_area_struct *vma)
 	vma->vm_ops = NULL;
 }
 
-static inline bool vma_is_anonymous(struct vm_area_struct *vma)
+static inline bool vma_is_private_anon(struct vm_area_struct *vma)
 {
 	return !vma->vm_ops;
 }
diff --git a/mm/gup.c b/mm/gup.c
index f598a037eb04..7cd3ef329ea2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -398,7 +398,7 @@ static struct page *no_page_table(struct vm_area_struct *vma,
 	 * be zero-filled if handle_mm_fault() actually did handle it.
 	 */
 	if ((flags & FOLL_DUMP) &&
-			(vma_is_anonymous(vma) || !vma->vm_ops->fault))
+			(vma_is_private_anon(vma) || !vma->vm_ops->fault))
 		return ERR_PTR(-EFAULT);
 	return NULL;
 }
@@ -913,7 +913,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if (vm_flags & (VM_IO | VM_PFNMAP))
 		return -EFAULT;
 
-	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
+	if (gup_flags & FOLL_ANON && !vma_is_private_anon(vma))
 		return -EFAULT;
 
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c468fee595ff..81c2123aeffe 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -82,7 +82,7 @@ bool transparent_hugepage_active(struct vm_area_struct *vma)
 
 	if (!transhuge_vma_suitable(vma, addr))
 		return false;
-	if (vma_is_anonymous(vma))
+	if (vma_is_private_anon(vma))
 		return __transparent_hugepage_enabled(vma);
 	if (vma_is_shmem(vma))
 		return shmem_huge_enabled(vma);
@@ -1035,7 +1035,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	int ret = -ENOMEM;
 
 	/* Skip if can be re-fill on fault */
-	if (!vma_is_anonymous(dst_vma))
+	if (!vma_is_private_anon(dst_vma))
 		return 0;
 
 	pgtable = pte_alloc_one(dst_mm);
@@ -1622,7 +1622,7 @@ static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
 	 *
 	 * We also don't deposit and withdraw tables for file pages.
 	 */
-	return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
+	return (new_pmd_ptl != old_pmd_ptl) && vma_is_private_anon(vma);
 }
 #endif
 
@@ -1799,7 +1799,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	}
 	ret = HPAGE_PMD_NR;
 	set_pmd_at(mm, addr, pmd, entry);
-	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
+	BUG_ON(vma_is_private_anon(vma) && !preserve_write && pmd_write(entry));
 unlock:
 	spin_unlock(ptl);
 	return ret;
@@ -1957,7 +1957,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	count_vm_event(THP_SPLIT_PMD);
 
-	if (!vma_is_anonymous(vma)) {
+	if (!vma_is_private_anon(vma)) {
 		old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
 		/*
 		 * We are going to unmap this huge page. So
diff --git a/mm/madvise.c b/mm/madvise.c
index 1873616a37d2..23771875ae9d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -543,7 +543,7 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
 
 static inline bool can_do_pageout(struct vm_area_struct *vma)
 {
-	if (vma_is_anonymous(vma))
+	if (vma_is_private_anon(vma))
 		return true;
 	if (!vma->vm_file)
 		return false;
@@ -725,7 +725,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 	struct mmu_gather tlb;
 
 	/* MADV_FREE works for only anon vma at the moment */
-	if (!vma_is_anonymous(vma))
+	if (!vma_is_private_anon(vma))
 		return -EINVAL;
 
 	range.start = max(vma->vm_start, start_addr);
diff --git a/mm/memory.c b/mm/memory.c
index 76e3af9639d9..fcb599bc5c33 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4494,7 +4494,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 
 static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 {
-	if (vma_is_anonymous(vmf->vma))
+	if (vma_is_private_anon(vmf->vma))
 		return do_huge_pmd_anonymous_page(vmf);
 	if (vmf->vma->vm_ops->huge_fault)
 		return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
@@ -4504,7 +4504,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 /* `inline' is required to avoid gcc 4.1.2 build error */
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 {
-	if (vma_is_anonymous(vmf->vma)) {
+	if (vma_is_private_anon(vmf->vma)) {
 		if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
 			return handle_userfault(vmf, VM_UFFD_WP);
 		return do_huge_pmd_wp_page(vmf);
@@ -4527,7 +4527,7 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) &&			\
 	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 	/* No support for anonymous transparent PUD pages yet */
-	if (vma_is_anonymous(vmf->vma))
+	if (vma_is_private_anon(vmf->vma))
 		goto split;
 	if (vmf->vma->vm_ops->huge_fault) {
 		vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
@@ -4546,7 +4546,7 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	/* No support for anonymous transparent PUD pages yet */
-	if (vma_is_anonymous(vmf->vma))
+	if (vma_is_private_anon(vmf->vma))
 		return VM_FAULT_FALLBACK;
 	if (vmf->vma->vm_ops->huge_fault)
 		return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
@@ -4621,7 +4621,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	}
 
 	if (!vmf->pte) {
-		if (vma_is_anonymous(vmf->vma))
+		if (vma_is_private_anon(vmf->vma))
 			return do_anonymous_page(vmf);
 		else
 			return do_fault(vmf);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 70c7dc05bbfc..4a4068c410f7 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -40,7 +40,7 @@ static int migrate_vma_collect_hole(unsigned long start,
 	unsigned long addr;
 
 	/* Only allow populating anonymous memory. */
-	if (!vma_is_anonymous(walk->vma))
+	if (!vma_is_private_anon(walk->vma))
 		return migrate_vma_collect_skip(start, end, walk);
 
 	for (addr = start; addr < end; addr += PAGE_SIZE) {
@@ -120,7 +120,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		pte = *ptep;
 
 		if (pte_none(pte)) {
-			if (vma_is_anonymous(vma)) {
+			if (vma_is_private_anon(vma)) {
 				mpfn = MIGRATE_PFN_MIGRATE;
 				migrate->cpages++;
 			}
@@ -518,7 +518,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	pte_t *ptep;
 
 	/* Only allow populating anonymous memory */
-	if (!vma_is_anonymous(vma))
+	if (!vma_is_private_anon(vma))
 		goto abort;
 
 	pgdp = pgd_offset(mm, addr);
diff --git a/mm/mincore.c b/mm/mincore.c
index 9122676b54d6..5aa4a1358ac1 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -156,7 +156,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 static inline bool can_do_mincore(struct vm_area_struct *vma)
 {
-	if (vma_is_anonymous(vma))
+	if (vma_is_private_anon(vma))
 		return true;
 	if (!vma->vm_file)
 		return false;
diff --git a/mm/mmap.c b/mm/mmap.c
index 3aa839f81e63..a2968669fd4e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3189,7 +3189,7 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 	 * using the existing file pgoff checks and manipulations.
 	 * Similarly in do_mmap and in do_brk_flags.
 	 */
-	if (vma_is_anonymous(vma)) {
+	if (vma_is_private_anon(vma)) {
 		BUG_ON(vma->anon_vma);
 		vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
 	}
@@ -3217,7 +3217,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	 * If anonymous vma has not yet been faulted, update new pgoff
 	 * to match new location, to increase its chance of merging.
 	 */
-	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
+	if (unlikely(vma_is_private_anon(vma) && !vma->anon_vma)) {
 		pgoff = addr >> PAGE_SHIFT;
 		faulted_in_anon_vma = false;
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ec38194f8e1..aa3e68934fcb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -536,7 +536,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 		 * we do not want to block exit_mmap by keeping mm ref
 		 * count elevated without a good reason.
 		 */
-		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+		if (vma_is_private_anon(vma) || !(vma->vm_flags & VM_SHARED)) {
 			struct mmu_notifier_range range;
 			struct mmu_gather tlb;
 
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0cb8e5ef1713..c94f81bd109b 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -540,9 +540,9 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	err = -EINVAL;
 	/*
 	 * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but
-	 * it will overwrite vm_ops, so vma_is_anonymous must return false.
+	 * it will overwrite vm_ops, so vma_is_private_anon must return false.
 	 */
-	if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) &&
+	if (WARN_ON_ONCE(vma_is_private_anon(dst_vma) &&
 	    dst_vma->vm_flags & VM_SHARED))
 		goto out_unlock;
 
@@ -561,7 +561,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
 						src_start, len, mcopy_mode);
 
-	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
+	if (!vma_is_private_anon(dst_vma) && !vma_is_shmem(dst_vma))
 		goto out_unlock;
 	if (!vma_is_shmem(dst_vma) && mcopy_mode == MCOPY_ATOMIC_CONTINUE)
 		goto out_unlock;
@@ -717,7 +717,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 		goto out_unlock;
 	if (!userfaultfd_wp(dst_vma))
 		goto out_unlock;
-	if (!vma_is_anonymous(dst_vma))
+	if (!vma_is_private_anon(dst_vma))
 		goto out_unlock;
 
 	if (enable_wp)
-- 
2.35.1



* [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails
  2022-04-21 19:05 [RFC 0/3] Slight improvements for OOM/Futex Nico Pache
  2022-04-21 19:05 ` [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon Nico Pache
@ 2022-04-21 19:05 ` Nico Pache
  2022-04-21 19:30   ` Matthew Wilcox
                     ` (2 more replies)
  2022-04-21 19:05 ` [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap Nico Pache
  2 siblings, 3 replies; 16+ messages in thread
From: Nico Pache @ 2022-04-21 19:05 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Darren Hart, Davidlohr Bueso, André Almeida,
	Arjan van de Ven, Ulrich Drepper

The futex_cleanup routines currently fail silently.

Allow the futex_cleanup routines to fail more verbosely so we can
leave a message behind for easier debugging/error detection.

Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core")
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "André Almeida" <andrealmeid@collabora.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 kernel/futex/core.c | 44 ++++++++++++++++++++++++++------------------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index b22ef1efe751..8a62eb1b5d77 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -760,9 +760,9 @@ static inline int fetch_robust_entry(struct robust_list __user **entry,
  * Walk curr->robust_list (very carefully, it's a userspace list!)
  * and mark any locks found there dead, and notify any waiters.
  *
- * We silently return on any sign of list-walking problem.
+ * We return false on any sign of list-walking problem.
  */
-static void exit_robust_list(struct task_struct *curr)
+static bool exit_robust_list(struct task_struct *curr)
 {
 	struct robust_list_head __user *head = curr->robust_list;
 	struct robust_list __user *entry, *next_entry, *pending;
@@ -771,23 +771,25 @@ static void exit_robust_list(struct task_struct *curr)
 	unsigned long futex_offset;
 	int rc;
 
+	if (!futex_cmpxchg_enabled)
+		return false;
 	/*
 	 * Fetch the list head (which was registered earlier, via
 	 * sys_set_robust_list()):
 	 */
 	if (fetch_robust_entry(&entry, &head->list.next, &pi))
-		return;
+		return false;
 	/*
 	 * Fetch the relative futex offset:
 	 */
 	if (get_user(futex_offset, &head->futex_offset))
-		return;
+		return false;
 	/*
 	 * Fetch any possibly pending lock-add first, and handle it
 	 * if it exists:
 	 */
 	if (fetch_robust_entry(&pending, &head->list_op_pending, &pip))
-		return;
+		return false;
 
 	next_entry = NULL;	/* avoid warning with gcc */
 	while (entry != &head->list) {
@@ -803,10 +805,10 @@ static void exit_robust_list(struct task_struct *curr)
 		if (entry != pending) {
 			if (handle_futex_death((void __user *)entry + futex_offset,
 						curr, pi, HANDLE_DEATH_LIST))
-				return;
+				return false;
 		}
 		if (rc)
-			return;
+			return false;
 		entry = next_entry;
 		pi = next_pi;
 		/*
@@ -819,9 +821,10 @@ static void exit_robust_list(struct task_struct *curr)
 	}
 
 	if (pending) {
-		handle_futex_death((void __user *)pending + futex_offset,
+		return handle_futex_death((void __user *)pending + futex_offset,
 				   curr, pip, HANDLE_DEATH_PENDING);
 	}
+	return true;
 }
 
 #ifdef CONFIG_COMPAT
@@ -854,9 +857,9 @@ compat_fetch_robust_entry(compat_uptr_t *uentry, struct robust_list __user **ent
  * Walk curr->robust_list (very carefully, it's a userspace list!)
  * and mark any locks found there dead, and notify any waiters.
  *
- * We silently return on any sign of list-walking problem.
+ * We return false on any sign of list-walking problem.
  */
-static void compat_exit_robust_list(struct task_struct *curr)
+static bool compat_exit_robust_list(struct task_struct *curr)
 {
 	struct compat_robust_list_head __user *head = curr->compat_robust_list;
 	struct robust_list __user *entry, *next_entry, *pending;
@@ -866,24 +869,26 @@ static void compat_exit_robust_list(struct task_struct *curr)
 	compat_long_t futex_offset;
 	int rc;
 
+	if (!futex_cmpxchg_enabled)
+		return false;
 	/*
 	 * Fetch the list head (which was registered earlier, via
 	 * sys_set_robust_list()):
 	 */
 	if (compat_fetch_robust_entry(&uentry, &entry, &head->list.next, &pi))
-		return;
+		return false;
 	/*
 	 * Fetch the relative futex offset:
 	 */
 	if (get_user(futex_offset, &head->futex_offset))
-		return;
+		return false;
 	/*
 	 * Fetch any possibly pending lock-add first, and handle it
 	 * if it exists:
 	 */
 	if (compat_fetch_robust_entry(&upending, &pending,
 			       &head->list_op_pending, &pip))
-		return;
+		return false;
 
 	next_entry = NULL;	/* avoid warning with gcc */
 	while (entry != (struct robust_list __user *) &head->list) {
@@ -902,10 +907,10 @@ static void compat_exit_robust_list(struct task_struct *curr)
 
 			if (handle_futex_death(uaddr, curr, pi,
 					       HANDLE_DEATH_LIST))
-				return;
+				return false;
 		}
 		if (rc)
-			return;
+			return false;
 		uentry = next_uentry;
 		entry = next_entry;
 		pi = next_pi;
@@ -920,8 +925,9 @@ static void compat_exit_robust_list(struct task_struct *curr)
 	if (pending) {
 		void __user *uaddr = futex_uaddr(pending, futex_offset);
 
-		handle_futex_death(uaddr, curr, pip, HANDLE_DEATH_PENDING);
+		return handle_futex_death(uaddr, curr, pip, HANDLE_DEATH_PENDING);
 	}
+	return true;
 }
 #endif
 
@@ -1007,13 +1013,15 @@ static inline void exit_pi_state_list(struct task_struct *curr) { }
 static void futex_cleanup(struct task_struct *tsk)
 {
 	if (unlikely(tsk->robust_list)) {
-		exit_robust_list(tsk);
+		if (!exit_robust_list(tsk))
+			pr_info("futex: exit_robust_list failed");
 		tsk->robust_list = NULL;
 	}
 
 #ifdef CONFIG_COMPAT
 	if (unlikely(tsk->compat_robust_list)) {
-		compat_exit_robust_list(tsk);
+		if (!compat_exit_robust_list(tsk))
+			pr_info("futex: compat_exit_robust_list failed");
 		tsk->compat_robust_list = NULL;
 	}
 #endif
-- 
2.35.1



* [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap
  2022-04-21 19:05 [RFC 0/3] Slight improvements for OOM/Futex Nico Pache
  2022-04-21 19:05 ` [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon Nico Pache
  2022-04-21 19:05 ` [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails Nico Pache
@ 2022-04-21 19:05 ` Nico Pache
  2022-04-22 15:38   ` Michal Hocko
  2 siblings, 1 reply; 16+ messages in thread
From: Nico Pache @ 2022-04-21 19:05 UTC (permalink / raw)
  To: linux-kernel, linux-mm; +Cc: David Rientjes, Michal Hocko, Andrea Arcangeli

The MMF_OOM_SKIP bit is used to indicate whether an mm_struct cannot be
invalidated or has already been invalidated. exit_mmap currently calls
__oom_reap_task_mm unconditionally despite the fact that the oom reaper
may have already called it.

Add a check for the MMF_OOM_SKIP bit being set in exit_mmap to avoid
unnecessary calls to the invalidate code.

A slight race can occur on the MMF_OOM_SKIP bit that will still allow
this to run twice. My testing has shown an ~66% decrease in double calls
to __oom_reap_task_mm.

Fixes: 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/mmap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index a2968669fd4e..b867f408dacd 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3113,7 +3113,8 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
-	if (unlikely(mm_is_oom_victim(mm))) {
+	if (unlikely(mm_is_oom_victim(mm)) &&
+			!test_bit(MMF_OOM_SKIP, &mm->flags)) {
 		/*
 		 * Manually reap the mm to free as much memory as possible.
 		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-- 
2.35.1



* Re: [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon
  2022-04-21 19:05 ` [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon Nico Pache
@ 2022-04-21 19:28   ` Matthew Wilcox
  2022-04-22 14:00     ` Nico Pache
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2022-04-21 19:28 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Alexander Viro, Andrew Morton, Nicholas Piggin,
	Logan Gunthorpe, Hari Bathini, Aneesh Kumar K.V, Yang Shi,
	Miaohe Lin, William Kucharski, Hugh Dickins

On Thu, Apr 21, 2022 at 03:05:31PM -0400, Nico Pache wrote:
> The vma_is_anonymous function isn't fully indicative of what it checks.
> 
> Without having full knowledge of the mmap process, one may incorrectly
> assume this covers all types of anonymous memory; which is not the case.

Is your complaint that anonymous memory can also be found in file VMAs
that were mapped with MAP_PRIVATE?  ie COWed pages?

I don't think renaming this function is appropriate.  It's whether
the VMA is anonymous, not whether the VMA can contain anonymous
pages.


* Re: [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails
  2022-04-21 19:05 ` [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails Nico Pache
@ 2022-04-21 19:30   ` Matthew Wilcox
  2022-04-22 14:12     ` Nico Pache
  2022-04-21 20:53   ` Thomas Gleixner
  2022-04-22  0:18   ` kernel test robot
  2 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2022-04-21 19:30 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Darren Hart, Davidlohr Bueso,
	André Almeida, Arjan van de Ven, Ulrich Drepper

On Thu, Apr 21, 2022 at 03:05:32PM -0400, Nico Pache wrote:
> @@ -1007,13 +1013,15 @@ static inline void exit_pi_state_list(struct task_struct *curr) { }
>  static void futex_cleanup(struct task_struct *tsk)
>  {
>  	if (unlikely(tsk->robust_list)) {
> -		exit_robust_list(tsk);
> +		if (!exit_robust_list(tsk))
> +			pr_info("futex: exit_robust_list failed");

Doesn't this allow a malicious user process to spam the kernel logs
with messages?  There needs to be a ratelimit on this, at least.
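
One way to address this (a sketch of a possible rework, not something
posted in the thread) is the kernel's rate-limited printk variant:

```c
/* Hypothetical rework of the hunk above: pr_info_ratelimited() goes
 * through the printk ratelimit state, so a task repeatedly exiting
 * with a corrupt robust list cannot flood the kernel log. */
if (unlikely(tsk->robust_list)) {
	if (!exit_robust_list(tsk))
		pr_info_ratelimited("futex: exit_robust_list failed\n");
	tsk->robust_list = NULL;
}
```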



* Re: [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails
  2022-04-21 19:05 ` [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails Nico Pache
  2022-04-21 19:30   ` Matthew Wilcox
@ 2022-04-21 20:53   ` Thomas Gleixner
  2022-04-22 14:23     ` Nico Pache
  2022-04-22  0:18   ` kernel test robot
  2 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2022-04-21 20:53 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: Andrew Morton, Ingo Molnar, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, André Almeida, Arjan van de Ven

On Thu, Apr 21 2022 at 15:05, Nico Pache wrote:

> The futex_cleanup routines currently fails silently.
>
> Allow the futex_cleanup routines to fail more verbosely so we can
> leave a message behind for easier debugging/error detecting.

This is a free ticket to spam dmesg for any unprivileged user.

> Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core")

There is nothing to fix, really.

Robust futexes are best effort as I explained you before and spamming
dmesg won't help anything.

Thanks,

        tglx



* Re: [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails
  2022-04-21 19:05 ` [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails Nico Pache
  2022-04-21 19:30   ` Matthew Wilcox
  2022-04-21 20:53   ` Thomas Gleixner
@ 2022-04-22  0:18   ` kernel test robot
  2 siblings, 0 replies; 16+ messages in thread
From: kernel test robot @ 2022-04-22  0:18 UTC (permalink / raw)
  To: Nico Pache; +Cc: llvm, kbuild-all

Hi Nico,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on powerpc/next]
[also build test ERROR on tip/locking/core linus/master v5.18-rc3]
[cannot apply to hnaz-mm/master linux/master next-20220421]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Nico-Pache/Slight-improvements-for-OOM-Futex/20220422-030740
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: hexagon-randconfig-r025-20220421 (https://download.01.org/0day-ci/archive/20220422/202204220841.5b2XtG9h-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 5bd87350a5ae429baf8f373cb226a57b62f87280)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/cd9fef152153fe60c67bb633118dd8adef51cadc
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Nico-Pache/Slight-improvements-for-OOM-Futex/20220422-030740
        git checkout cd9fef152153fe60c67bb633118dd8adef51cadc
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> kernel/futex/core.c:774:7: error: use of undeclared identifier 'futex_cmpxchg_enabled'
           if (!futex_cmpxchg_enabled)
                ^
   1 error generated.


vim +/futex_cmpxchg_enabled +774 kernel/futex/core.c

   758	
   759	/*
   760	 * Walk curr->robust_list (very carefully, it's a userspace list!)
   761	 * and mark any locks found there dead, and notify any waiters.
   762	 *
   763	 * We return false on any sign of list-walking problem.
   764	 */
   765	static bool exit_robust_list(struct task_struct *curr)
   766	{
   767		struct robust_list_head __user *head = curr->robust_list;
   768		struct robust_list __user *entry, *next_entry, *pending;
   769		unsigned int limit = ROBUST_LIST_LIMIT, pi, pip;
   770		unsigned int next_pi;
   771		unsigned long futex_offset;
   772		int rc;
   773	
 > 774		if (!futex_cmpxchg_enabled)
   775			return false;
   776		/*
   777		 * Fetch the list head (which was registered earlier, via
   778		 * sys_set_robust_list()):
   779		 */
   780		if (fetch_robust_entry(&entry, &head->list.next, &pi))
   781			return false;
   782		/*
   783		 * Fetch the relative futex offset:
   784		 */
   785		if (get_user(futex_offset, &head->futex_offset))
   786			return false;
   787		/*
   788		 * Fetch any possibly pending lock-add first, and handle it
   789		 * if it exists:
   790		 */
   791		if (fetch_robust_entry(&pending, &head->list_op_pending, &pip))
   792			return false;
   793	
   794		next_entry = NULL;	/* avoid warning with gcc */
   795		while (entry != &head->list) {
   796			/*
   797			 * Fetch the next entry in the list before calling
   798			 * handle_futex_death:
   799			 */
   800			rc = fetch_robust_entry(&next_entry, &entry->next, &next_pi);
   801			/*
   802			 * A pending lock might already be on the list, so
   803			 * don't process it twice:
   804			 */
   805			if (entry != pending) {
   806				if (handle_futex_death((void __user *)entry + futex_offset,
   807							curr, pi, HANDLE_DEATH_LIST))
   808					return false;
   809			}
   810			if (rc)
   811				return false;
   812			entry = next_entry;
   813			pi = next_pi;
   814			/*
   815			 * Avoid excessively long or circular lists:
   816			 */
   817			if (!--limit)
   818				break;
   819	
   820			cond_resched();
   821		}
   822	
   823		if (pending) {
   824			return handle_futex_death((void __user *)pending + futex_offset,
   825					   curr, pip, HANDLE_DEATH_PENDING);
   826		}
   827		return true;
   828	}
   829	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon
  2022-04-21 19:28   ` Matthew Wilcox
@ 2022-04-22 14:00     ` Nico Pache
  2022-04-28 16:14       ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Nico Pache @ 2022-04-22 14:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Alexander Viro, Andrew Morton, Nicholas Piggin,
	Logan Gunthorpe, Hari Bathini, Aneesh Kumar K.V, Yang Shi,
	Miaohe Lin, William Kucharski, Hugh Dickins



On 4/21/22 15:28, Matthew Wilcox wrote:
> On Thu, Apr 21, 2022 at 03:05:31PM -0400, Nico Pache wrote:
>> The vma_is_anonymous function isn't fully indicative of what it checks.
>>
>> Without having full knowledge of the mmap process, one may incorrectly
>> assume this covers all types of anonymous memory; which is not the case.
> 
> Is your complaint that anonymous memory can also be found in file VMAs
> that were mapped with MAP_PRIVATE?  ie COWed pages?
I should have been more descriptive in my commit msg about how I came to this
conclusion.

From my understanding of the mmap process, a vma->vm_ops field is only NULL when
mmapped as !file and !shared:

	if (file){
		...
	} else if (vm_flags & VM_SHARED) { 	//ANON SHARED
		error = shmem_zero_setup(vma);
	        if (error)
        		goto free_vma;
	} else { 				//ANON PRIVATE
		vma_set_anonymous(vma);		//set vma->vm_ops= NULL
	}

To me this means that the VMA is PRIVATE ANON memory. The vma_is_anonymous
function returns true when vm_ops == NULL. So my intentions were to more
accurately describe what we are checking for. I could be wrong though; that's why
I started with an RFC :)

There could be some aspect of COW that I don't fully understand. It is not
something I've looked into much.

Cheers,
-- Nico



* Re: [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails
  2022-04-21 19:30   ` Matthew Wilcox
@ 2022-04-22 14:12     ` Nico Pache
  0 siblings, 0 replies; 16+ messages in thread
From: Nico Pache @ 2022-04-22 14:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Darren Hart, Davidlohr Bueso,
	André Almeida, Arjan van de Ven, Ulrich Drepper



On 4/21/22 15:30, Matthew Wilcox wrote:
> On Thu, Apr 21, 2022 at 03:05:32PM -0400, Nico Pache wrote:
>> @@ -1007,13 +1013,15 @@ static inline void exit_pi_state_list(struct task_struct *curr) { }
>>  static void futex_cleanup(struct task_struct *tsk)
>>  {
>>  	if (unlikely(tsk->robust_list)) {
>> -		exit_robust_list(tsk);
>> +		if (!exit_robust_list(tsk))
>> +			pr_info("futex: exit_robust_list failed");
> 
> Doesn't this allow a malicious user process to spam the kernel logs
> with messages?  There needs to be a ratelimit on this, at least.
Fair point, we'd need a ratelimited print if we want to continue forward with
this. Additionally, we may want to limit this print to debug kernels, but that's
just a thought.

Thanks for the review :)
-- Nico



* Re: [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails
  2022-04-21 20:53   ` Thomas Gleixner
@ 2022-04-22 14:23     ` Nico Pache
  2022-04-22 14:42       ` Thomas Gleixner
  0 siblings, 1 reply; 16+ messages in thread
From: Nico Pache @ 2022-04-22 14:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, Ingo Molnar, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, André Almeida, Arjan van de Ven,
	linux-kernel, linux-mm, Matthew Wilcox



On 4/21/22 16:53, Thomas Gleixner wrote:
> On Thu, Apr 21 2022 at 15:05, Nico Pache wrote:
> 
>> The futex_cleanup routines currently fail silently.
>>
>> Allow the futex_cleanup routines to fail more verbosely so we can
>> leave a message behind for easier debugging/error detecting.
> 
> This is a free ticket to spam dmesg for any unprivileged user.
As Matthew pointed out, we'd want to rate-limit this. I was also thinking we could
limit this to debug kernels so unprivileged users cannot cause dmesg spamming
in production configs.
> 
>> Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core")
> 
> There is nothing to fix, really.
Fair enough, it's been part of the design since inception.
> 
> Robust futexes are best effort as I explained to you before and spamming
> dmesg won't help anything.
It would have helped find the OOM/Robust Futex issue more quickly. It may also
help detect breakages in the pthread code (or any other users of robust futex)
if someone is doing something incorrectly.

Just a thought though. If you think it's pointless to proceed down this path I
won't push forward.

Cheers,
-- Nico



* Re: [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails
  2022-04-22 14:23     ` Nico Pache
@ 2022-04-22 14:42       ` Thomas Gleixner
  0 siblings, 0 replies; 16+ messages in thread
From: Thomas Gleixner @ 2022-04-22 14:42 UTC (permalink / raw)
  To: Nico Pache
  Cc: Andrew Morton, Ingo Molnar, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, André Almeida, Arjan van de Ven,
	linux-kernel, linux-mm, Matthew Wilcox

On Fri, Apr 22 2022 at 10:23, Nico Pache wrote:
> On 4/21/22 16:53, Thomas Gleixner wrote:
>> On Thu, Apr 21 2022 at 15:05, Nico Pache wrote:
>> Robust futexes are best effort as I explained to you before and spamming
>> dmesg won't help anything.
> It would have helped find the OOM/Robust Futex issue more quickly. It may also
> help detect breakages in the pthread code (or any other users of robust futex)
> if someone is doing something incorrectly.

If we follow that line of argumentation then we have to sprinkle printks
en masse all over the place as there are a gazillion ways to create hard
to debug issues.

Thanks,

        tglx




* Re: [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap
  2022-04-21 19:05 ` [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap Nico Pache
@ 2022-04-22 15:38   ` Michal Hocko
  2022-04-25 19:00     ` Nico Pache
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2022-04-22 15:38 UTC (permalink / raw)
  To: Nico Pache; +Cc: linux-kernel, linux-mm, David Rientjes, Andrea Arcangeli

On Thu 21-04-22 15:05:33, Nico Pache wrote:
> The MMF_OOM_SKIP bit is used to indicate whether a mm_struct cannot be
> invalidated or has already been invalidated. exit_mmap currently calls
> __oom_reap_task_mm unconditionally despite the fact that the oom reaper
> may have already called this.
> 
> Add a check for the MMF_OOM_SKIP bit being set in exit_mmap to avoid
> unnecessary calls to the invalidate code.

Why do we care about this?
 
> A slight race can occur on the MMF_OOM_SKIP bit that will still allow
> this to run twice. My testing has shown an ~66% decrease in double calls
> to __oom_reap_task_mm.
> 
> Fixes: 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")

I do not see this would be fixing anything.

> Cc: David Rientjes <rientjes@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/mmap.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index a2968669fd4e..b867f408dacd 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3113,7 +3113,8 @@ void exit_mmap(struct mm_struct *mm)
>  	/* mm's last user has gone, and its about to be pulled down */
>  	mmu_notifier_release(mm);
>  
> -	if (unlikely(mm_is_oom_victim(mm))) {
> +	if (unlikely(mm_is_oom_victim(mm)) &&
> +			!test_bit(MMF_OOM_SKIP, &mm->flags)) {
>  		/*
>  		 * Manually reap the mm to free as much memory as possible.
>  		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> -- 
> 2.35.1

-- 
Michal Hocko
SUSE Labs


* Re: [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap
  2022-04-22 15:38   ` Michal Hocko
@ 2022-04-25 19:00     ` Nico Pache
  2022-04-26  6:59       ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Nico Pache @ 2022-04-25 19:00 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, David Rientjes, Andrea Arcangeli



On 4/22/22 11:38, Michal Hocko wrote:
> On Thu 21-04-22 15:05:33, Nico Pache wrote:
>> The MMF_OOM_SKIP bit is used to indicate whether a mm_struct cannot be
>> invalidated or has already been invalidated. exit_mmap currently calls
>> __oom_reap_task_mm unconditionally despite the fact that the oom reaper
>> may have already called this.
>>
>> Add a check for the MMF_OOM_SKIP bit being set in exit_mmap to avoid
>> unnecessary calls to the invalidate code.
> 
> Why do we care about this?
Is there no cost to the MMU/TLB invalidation? The MMU notifier contains a lock
too, so perhaps we can also avoid some unnecessary MMU notifier lock contention.
>  
>> A slight race can occur on the MMF_OOM_SKIP bit that will still allow
>> this to run twice. My testing has shown an ~66% decrease in double calls
>> to __oom_reap_task_mm.
>>
>> Fixes: 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> 
> I do not see this would be fixing anything.
OK, I'm just trying to make sure we keep an eye on what introduced this
double call. David's commit above is what introduced the oom_reap_task_mm call in
the exit_mmap code. It goes along with some other changes that I don't fully
understand without more studying, so that's why I was hoping he could provide
some input around that CVE (the main thing I'm concerned about re-introducing).
> 
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>  mm/mmap.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index a2968669fd4e..b867f408dacd 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -3113,7 +3113,8 @@ void exit_mmap(struct mm_struct *mm)
>>  	/* mm's last user has gone, and its about to be pulled down */
>>  	mmu_notifier_release(mm);
>>  
>> -	if (unlikely(mm_is_oom_victim(mm))) {
>> +	if (unlikely(mm_is_oom_victim(mm)) &&
>> +			!test_bit(MMF_OOM_SKIP, &mm->flags)) {
>>  		/*
>>  		 * Manually reap the mm to free as much memory as possible.
>>  		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
>> -- 
>> 2.35.1
> 



* Re: [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap
  2022-04-25 19:00     ` Nico Pache
@ 2022-04-26  6:59       ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2022-04-26  6:59 UTC (permalink / raw)
  To: Nico Pache; +Cc: linux-kernel, linux-mm, David Rientjes, Andrea Arcangeli

On Mon 25-04-22 15:00:24, Nico Pache wrote:
> 
> 
> On 4/22/22 11:38, Michal Hocko wrote:
> > On Thu 21-04-22 15:05:33, Nico Pache wrote:
> >> The MMF_OOM_SKIP bit is used to indicate whether a mm_struct cannot be
> >> invalidated or has already been invalidated. exit_mmap currently calls
> >> __oom_reap_task_mm unconditionally despite the fact that the oom reaper
> >> may have already called this.
> >>
> >> Add a check for the MMF_OOM_SKIP bit being set in exit_mmap to avoid
> >> unnecessary calls to the invalidate code.
> > 
> > Why do we care about this?
> Is there no cost to the MMU/TLB invalidation? The MMU notifier contains a lock
> too, so perhaps we can also avoid some unnecessary MMU notifier lock contention.

I am pretty sure that this area can be micro optimized but I do not
really see a strong reason for that. OOM victims are/should be really
rare so I do not think that any performance optimization would be really
visible.

If you want to improve the code then I think a much better plan would be
to get rid of the whole oom special case altogether. This might be much
closer than ever after Hugh's recent m{un}lock changes. I didn't have
time to think that through though. I believe Suren Baghdasaryan has been
looking into that as well.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon
  2022-04-22 14:00     ` Nico Pache
@ 2022-04-28 16:14       ` David Hildenbrand
  0 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2022-04-28 16:14 UTC (permalink / raw)
  To: Nico Pache, Matthew Wilcox
  Cc: linux-kernel, linux-mm, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Alexander Viro, Andrew Morton, Nicholas Piggin,
	Logan Gunthorpe, Hari Bathini, Aneesh Kumar K.V, Yang Shi,
	Miaohe Lin, William Kucharski, Hugh Dickins

On 22.04.22 16:00, Nico Pache wrote:
> 
> 
> On 4/21/22 15:28, Matthew Wilcox wrote:
>> On Thu, Apr 21, 2022 at 03:05:31PM -0400, Nico Pache wrote:
>>> The vma_is_anonymous function isn't fully indicative of what it checks.
>>>
>>> Without having full knowledge of the mmap process, one may incorrectly
>>> assume this covers all types of anonymous memory; which is not the case.
>>
>> Is your complaint that anonymous memory can also be found in file VMAs
>> that were mapped with MAP_PRIVATE?  ie COWed pages?
> I should have been more descriptive in my commit msg about how I came to this
> conclusion.
> 
> From my understanding of the mmap process, a vma->vm_ops field is only NULL when
> mmapped as !file and !shared:
> 
> 	if (file){
> 		...
> 	} else if (vm_flags & VM_SHARED) { 	//ANON SHARED
> 		error = shmem_zero_setup(vma);
> 	        if (error)
>         		goto free_vma;
> 	} else { 				//ANON PRIVATE
> 		vma_set_anonymous(vma);		//set vma->vm_ops= NULL
> 	}
> 
> To me this means that the VMA is PRIVATE ANON memory. The vma_is_anonymous
> function returns true when vm_ops == NULL. So my intentions were to more
> accurately describe what we are checking for. I could be wrong though; that's why
> I started with an RFC :)

Shared anon in the kernel is really just shmem.  The user space notion
is MAP_ANON|MAP_SHARED, but that really just maps to shmem and the
kernel doesn't really call that thing anonymous memory.

So I agree, renaming this is not appropriate.

-- 
Thanks,

David / dhildenb



end of thread, other threads:[~2022-04-28 16:14 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-21 19:05 [RFC 0/3] Slight improvements for OOM/Futex Nico Pache
2022-04-21 19:05 ` [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon Nico Pache
2022-04-21 19:28   ` Matthew Wilcox
2022-04-22 14:00     ` Nico Pache
2022-04-28 16:14       ` David Hildenbrand
2022-04-21 19:05 ` [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails Nico Pache
2022-04-21 19:30   ` Matthew Wilcox
2022-04-22 14:12     ` Nico Pache
2022-04-21 20:53   ` Thomas Gleixner
2022-04-22 14:23     ` Nico Pache
2022-04-22 14:42       ` Thomas Gleixner
2022-04-22  0:18   ` kernel test robot
2022-04-21 19:05 ` [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap Nico Pache
2022-04-22 15:38   ` Michal Hocko
2022-04-25 19:00     ` Nico Pache
2022-04-26  6:59       ` Michal Hocko
