* [RFC 0/3] Slight improvements for OOM/Futex
@ 2022-04-21 19:05 Nico Pache
2022-04-21 19:05 ` [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon Nico Pache
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Nico Pache @ 2022-04-21 19:05 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
Alexander Viro, Andrew Morton, Nicholas Piggin, Logan Gunthorpe,
Hari Bathini, Aneesh Kumar K.V, Matthew Wilcox (Oracle),
Yang Shi, Miaohe Lin, William Kucharski, Hugh Dickins,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, André Almeida, Arjan van de Ven,
Ulrich Drepper, Michal Hocko
The following 3 patches were developed while working on the OOM/Futex
fix.
Patch 1: This is a nonfunctional change. The vma_is_anonymous function
name led to some confusion about what the oom reaper is checking
for in __oom_reap_task_mm. This patch makes that function name more
descriptive.
Patch 2: The futex cleanup code was designed to fail silently. Printing
this failure would have made the issue much quicker to find. This patch
introduces a pr_info if the exit path has any issues walking the
userspace robust list.
Patch 3: During the debug process I noticed that the oom_reap_task_mm
function was always running twice for the same mm_struct: once in the
exit path and once in the oom reaper. By checking MMF_OOM_SKIP we
can prevent these unnecessary double calls. I'd like some input from
David Rientjes here with regard to CVE-2018-1000200; I want to make
sure we are not reintroducing that CVE.
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "André Almeida" <andrealmeid@collabora.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Nico Pache <npache@redhat.com>
Nico Pache (3):
mm: change vma_is_anonymous to vma_is_private_anon
futex: exit: Print a warning when futex_cleanup fails
exit: Check for MMF_OOM_SKIP in exit_mmap
arch/powerpc/mm/book3s64/pgtable.c | 2 +-
fs/userfaultfd.c | 2 +-
include/linux/huge_mm.h | 2 +-
include/linux/mm.h | 2 +-
kernel/futex/core.c | 44 ++++++++++++++++++------------
mm/gup.c | 4 +--
mm/huge_memory.c | 10 +++----
mm/madvise.c | 4 +--
mm/memory.c | 10 +++----
mm/migrate_device.c | 6 ++--
mm/mincore.c | 2 +-
mm/mmap.c | 7 +++--
mm/oom_kill.c | 2 +-
mm/userfaultfd.c | 8 +++---
14 files changed, 57 insertions(+), 48 deletions(-)
--
2.35.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon
2022-04-21 19:05 [RFC 0/3] Slight improvements for OOM/Futex Nico Pache
@ 2022-04-21 19:05 ` Nico Pache
2022-04-21 19:28 ` Matthew Wilcox
2022-04-21 19:05 ` [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails Nico Pache
2022-04-21 19:05 ` [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap Nico Pache
2 siblings, 1 reply; 16+ messages in thread
From: Nico Pache @ 2022-04-21 19:05 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
Alexander Viro, Andrew Morton, Nicholas Piggin, Logan Gunthorpe,
Hari Bathini, Aneesh Kumar K.V, Matthew Wilcox (Oracle),
Yang Shi, Miaohe Lin, William Kucharski, Hugh Dickins
The vma_is_anonymous function name isn't fully indicative of what it
checks. Without full knowledge of the mmap process, one may incorrectly
assume it covers all types of anonymous memory, which is not the case.
No functional changes.
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
arch/powerpc/mm/book3s64/pgtable.c | 2 +-
fs/userfaultfd.c | 2 +-
include/linux/huge_mm.h | 2 +-
include/linux/mm.h | 2 +-
mm/gup.c | 4 ++--
mm/huge_memory.c | 10 +++++-----
mm/madvise.c | 4 ++--
mm/memory.c | 10 +++++-----
mm/migrate_device.c | 6 +++---
mm/mincore.c | 2 +-
mm/mmap.c | 4 ++--
mm/oom_kill.c | 2 +-
mm/userfaultfd.c | 8 ++++----
13 files changed, 29 insertions(+), 29 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 052e6590f84f..c8fd68a7965d 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -479,7 +479,7 @@ int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
struct vm_area_struct *vma)
{
if (radix_enabled())
- return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
+ return (new_pmd_ptl != old_pmd_ptl) && vma_is_private_anon(vma);
return true;
}
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index aa0c47cb0d16..d3fe2c26874a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1269,7 +1269,7 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
return false;
}
- return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+ return vma_is_private_anon(vma) || is_vm_hugetlb_page(vma) ||
vma_is_shmem(vma);
}
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2999190adc22..40493f53562c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -119,7 +119,7 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
unsigned long haddr)
{
/* Don't have to check pgoff for anonymous vma */
- if (!vma_is_anonymous(vma)) {
+ if (!vma_is_private_anon(vma)) {
if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
HPAGE_PMD_NR))
return false;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e34edb775334..9a01d053853b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -623,7 +623,7 @@ static inline void vma_set_anonymous(struct vm_area_struct *vma)
vma->vm_ops = NULL;
}
-static inline bool vma_is_anonymous(struct vm_area_struct *vma)
+static inline bool vma_is_private_anon(struct vm_area_struct *vma)
{
return !vma->vm_ops;
}
diff --git a/mm/gup.c b/mm/gup.c
index f598a037eb04..7cd3ef329ea2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -398,7 +398,7 @@ static struct page *no_page_table(struct vm_area_struct *vma,
* be zero-filled if handle_mm_fault() actually did handle it.
*/
if ((flags & FOLL_DUMP) &&
- (vma_is_anonymous(vma) || !vma->vm_ops->fault))
+ (vma_is_private_anon(vma) || !vma->vm_ops->fault))
return ERR_PTR(-EFAULT);
return NULL;
}
@@ -913,7 +913,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
if (vm_flags & (VM_IO | VM_PFNMAP))
return -EFAULT;
- if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
+ if (gup_flags & FOLL_ANON && !vma_is_private_anon(vma))
return -EFAULT;
if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c468fee595ff..81c2123aeffe 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -82,7 +82,7 @@ bool transparent_hugepage_active(struct vm_area_struct *vma)
if (!transhuge_vma_suitable(vma, addr))
return false;
- if (vma_is_anonymous(vma))
+ if (vma_is_private_anon(vma))
return __transparent_hugepage_enabled(vma);
if (vma_is_shmem(vma))
return shmem_huge_enabled(vma);
@@ -1035,7 +1035,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
int ret = -ENOMEM;
/* Skip if can be re-fill on fault */
- if (!vma_is_anonymous(dst_vma))
+ if (!vma_is_private_anon(dst_vma))
return 0;
pgtable = pte_alloc_one(dst_mm);
@@ -1622,7 +1622,7 @@ static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
*
* We also don't deposit and withdraw tables for file pages.
*/
- return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
+ return (new_pmd_ptl != old_pmd_ptl) && vma_is_private_anon(vma);
}
#endif
@@ -1799,7 +1799,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
}
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
- BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
+ BUG_ON(vma_is_private_anon(vma) && !preserve_write && pmd_write(entry));
unlock:
spin_unlock(ptl);
return ret;
@@ -1957,7 +1957,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
count_vm_event(THP_SPLIT_PMD);
- if (!vma_is_anonymous(vma)) {
+ if (!vma_is_private_anon(vma)) {
old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
/*
* We are going to unmap this huge page. So
diff --git a/mm/madvise.c b/mm/madvise.c
index 1873616a37d2..23771875ae9d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -543,7 +543,7 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
static inline bool can_do_pageout(struct vm_area_struct *vma)
{
- if (vma_is_anonymous(vma))
+ if (vma_is_private_anon(vma))
return true;
if (!vma->vm_file)
return false;
@@ -725,7 +725,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
struct mmu_gather tlb;
/* MADV_FREE works for only anon vma at the moment */
- if (!vma_is_anonymous(vma))
+ if (!vma_is_private_anon(vma))
return -EINVAL;
range.start = max(vma->vm_start, start_addr);
diff --git a/mm/memory.c b/mm/memory.c
index 76e3af9639d9..fcb599bc5c33 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4494,7 +4494,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
{
- if (vma_is_anonymous(vmf->vma))
+ if (vma_is_private_anon(vmf->vma))
return do_huge_pmd_anonymous_page(vmf);
if (vmf->vma->vm_ops->huge_fault)
return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
@@ -4504,7 +4504,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
/* `inline' is required to avoid gcc 4.1.2 build error */
static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
{
- if (vma_is_anonymous(vmf->vma)) {
+ if (vma_is_private_anon(vmf->vma)) {
if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
return handle_userfault(vmf, VM_UFFD_WP);
return do_huge_pmd_wp_page(vmf);
@@ -4527,7 +4527,7 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
/* No support for anonymous transparent PUD pages yet */
- if (vma_is_anonymous(vmf->vma))
+ if (vma_is_private_anon(vmf->vma))
goto split;
if (vmf->vma->vm_ops->huge_fault) {
vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
@@ -4546,7 +4546,7 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* No support for anonymous transparent PUD pages yet */
- if (vma_is_anonymous(vmf->vma))
+ if (vma_is_private_anon(vmf->vma))
return VM_FAULT_FALLBACK;
if (vmf->vma->vm_ops->huge_fault)
return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
@@ -4621,7 +4621,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
}
if (!vmf->pte) {
- if (vma_is_anonymous(vmf->vma))
+ if (vma_is_private_anon(vmf->vma))
return do_anonymous_page(vmf);
else
return do_fault(vmf);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 70c7dc05bbfc..4a4068c410f7 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -40,7 +40,7 @@ static int migrate_vma_collect_hole(unsigned long start,
unsigned long addr;
/* Only allow populating anonymous memory. */
- if (!vma_is_anonymous(walk->vma))
+ if (!vma_is_private_anon(walk->vma))
return migrate_vma_collect_skip(start, end, walk);
for (addr = start; addr < end; addr += PAGE_SIZE) {
@@ -120,7 +120,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
pte = *ptep;
if (pte_none(pte)) {
- if (vma_is_anonymous(vma)) {
+ if (vma_is_private_anon(vma)) {
mpfn = MIGRATE_PFN_MIGRATE;
migrate->cpages++;
}
@@ -518,7 +518,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
pte_t *ptep;
/* Only allow populating anonymous memory */
- if (!vma_is_anonymous(vma))
+ if (!vma_is_private_anon(vma))
goto abort;
pgdp = pgd_offset(mm, addr);
diff --git a/mm/mincore.c b/mm/mincore.c
index 9122676b54d6..5aa4a1358ac1 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -156,7 +156,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
static inline bool can_do_mincore(struct vm_area_struct *vma)
{
- if (vma_is_anonymous(vma))
+ if (vma_is_private_anon(vma))
return true;
if (!vma->vm_file)
return false;
diff --git a/mm/mmap.c b/mm/mmap.c
index 3aa839f81e63..a2968669fd4e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3189,7 +3189,7 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
* using the existing file pgoff checks and manipulations.
* Similarly in do_mmap and in do_brk_flags.
*/
- if (vma_is_anonymous(vma)) {
+ if (vma_is_private_anon(vma)) {
BUG_ON(vma->anon_vma);
vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
}
@@ -3217,7 +3217,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
* If anonymous vma has not yet been faulted, update new pgoff
* to match new location, to increase its chance of merging.
*/
- if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
+ if (unlikely(vma_is_private_anon(vma) && !vma->anon_vma)) {
pgoff = addr >> PAGE_SHIFT;
faulted_in_anon_vma = false;
}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ec38194f8e1..aa3e68934fcb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -536,7 +536,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
* we do not want to block exit_mmap by keeping mm ref
* count elevated without a good reason.
*/
- if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+ if (vma_is_private_anon(vma) || !(vma->vm_flags & VM_SHARED)) {
struct mmu_notifier_range range;
struct mmu_gather tlb;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0cb8e5ef1713..c94f81bd109b 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -540,9 +540,9 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
err = -EINVAL;
/*
* shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but
- * it will overwrite vm_ops, so vma_is_anonymous must return false.
+ * it will overwrite vm_ops, so vma_is_private_anon must return false.
*/
- if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) &&
+ if (WARN_ON_ONCE(vma_is_private_anon(dst_vma) &&
dst_vma->vm_flags & VM_SHARED))
goto out_unlock;
@@ -561,7 +561,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
return __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
src_start, len, mcopy_mode);
- if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
+ if (!vma_is_private_anon(dst_vma) && !vma_is_shmem(dst_vma))
goto out_unlock;
if (!vma_is_shmem(dst_vma) && mcopy_mode == MCOPY_ATOMIC_CONTINUE)
goto out_unlock;
@@ -717,7 +717,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
goto out_unlock;
if (!userfaultfd_wp(dst_vma))
goto out_unlock;
- if (!vma_is_anonymous(dst_vma))
+ if (!vma_is_private_anon(dst_vma))
goto out_unlock;
if (enable_wp)
--
2.35.1
* [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails
2022-04-21 19:05 [RFC 0/3] Slight improvements for OOM/Futex Nico Pache
2022-04-21 19:05 ` [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon Nico Pache
@ 2022-04-21 19:05 ` Nico Pache
2022-04-21 19:30 ` Matthew Wilcox
` (2 more replies)
2022-04-21 19:05 ` [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap Nico Pache
2 siblings, 3 replies; 16+ messages in thread
From: Nico Pache @ 2022-04-21 19:05 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Andrew Morton, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
Darren Hart, Davidlohr Bueso, André Almeida,
Arjan van de Ven, Ulrich Drepper
The futex_cleanup routines currently fail silently.
Allow the futex_cleanup routines to fail more verbosely so we can
leave a message behind for easier debugging and error detection.
Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core")
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "André Almeida" <andrealmeid@collabora.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
kernel/futex/core.c | 44 ++++++++++++++++++++++++++------------------
1 file changed, 26 insertions(+), 18 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index b22ef1efe751..8a62eb1b5d77 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -760,9 +760,9 @@ static inline int fetch_robust_entry(struct robust_list __user **entry,
* Walk curr->robust_list (very carefully, it's a userspace list!)
* and mark any locks found there dead, and notify any waiters.
*
- * We silently return on any sign of list-walking problem.
+ * We return false on any sign of list-walking problem.
*/
-static void exit_robust_list(struct task_struct *curr)
+static bool exit_robust_list(struct task_struct *curr)
{
struct robust_list_head __user *head = curr->robust_list;
struct robust_list __user *entry, *next_entry, *pending;
@@ -771,23 +771,25 @@ static void exit_robust_list(struct task_struct *curr)
unsigned long futex_offset;
int rc;
+ if (!futex_cmpxchg_enabled)
+ return false;
/*
* Fetch the list head (which was registered earlier, via
* sys_set_robust_list()):
*/
if (fetch_robust_entry(&entry, &head->list.next, &pi))
- return;
+ return false;
/*
* Fetch the relative futex offset:
*/
if (get_user(futex_offset, &head->futex_offset))
- return;
+ return false;
/*
* Fetch any possibly pending lock-add first, and handle it
* if it exists:
*/
if (fetch_robust_entry(&pending, &head->list_op_pending, &pip))
- return;
+ return false;
next_entry = NULL; /* avoid warning with gcc */
while (entry != &head->list) {
@@ -803,10 +805,10 @@ static void exit_robust_list(struct task_struct *curr)
if (entry != pending) {
if (handle_futex_death((void __user *)entry + futex_offset,
curr, pi, HANDLE_DEATH_LIST))
- return;
+ return false;
}
if (rc)
- return;
+ return false;
entry = next_entry;
pi = next_pi;
/*
@@ -819,9 +821,10 @@ static void exit_robust_list(struct task_struct *curr)
}
if (pending) {
- handle_futex_death((void __user *)pending + futex_offset,
+ return handle_futex_death((void __user *)pending + futex_offset,
curr, pip, HANDLE_DEATH_PENDING);
}
+ return true;
}
#ifdef CONFIG_COMPAT
@@ -854,9 +857,9 @@ compat_fetch_robust_entry(compat_uptr_t *uentry, struct robust_list __user **ent
* Walk curr->robust_list (very carefully, it's a userspace list!)
* and mark any locks found there dead, and notify any waiters.
*
- * We silently return on any sign of list-walking problem.
+ * We return false on any sign of list-walking problem.
*/
-static void compat_exit_robust_list(struct task_struct *curr)
+static bool compat_exit_robust_list(struct task_struct *curr)
{
struct compat_robust_list_head __user *head = curr->compat_robust_list;
struct robust_list __user *entry, *next_entry, *pending;
@@ -866,24 +869,26 @@ static void compat_exit_robust_list(struct task_struct *curr)
compat_long_t futex_offset;
int rc;
+ if (!futex_cmpxchg_enabled)
+ return false;
/*
* Fetch the list head (which was registered earlier, via
* sys_set_robust_list()):
*/
if (compat_fetch_robust_entry(&uentry, &entry, &head->list.next, &pi))
- return;
+ return false;
/*
* Fetch the relative futex offset:
*/
if (get_user(futex_offset, &head->futex_offset))
- return;
+ return false;
/*
* Fetch any possibly pending lock-add first, and handle it
* if it exists:
*/
if (compat_fetch_robust_entry(&upending, &pending,
&head->list_op_pending, &pip))
- return;
+ return false;
next_entry = NULL; /* avoid warning with gcc */
while (entry != (struct robust_list __user *) &head->list) {
@@ -902,10 +907,10 @@ static void compat_exit_robust_list(struct task_struct *curr)
if (handle_futex_death(uaddr, curr, pi,
HANDLE_DEATH_LIST))
- return;
+ return false;
}
if (rc)
- return;
+ return false;
uentry = next_uentry;
entry = next_entry;
pi = next_pi;
@@ -920,8 +925,9 @@ static void compat_exit_robust_list(struct task_struct *curr)
if (pending) {
void __user *uaddr = futex_uaddr(pending, futex_offset);
- handle_futex_death(uaddr, curr, pip, HANDLE_DEATH_PENDING);
+ return handle_futex_death(uaddr, curr, pip, HANDLE_DEATH_PENDING);
}
+ return true;
}
#endif
@@ -1007,13 +1013,15 @@ static inline void exit_pi_state_list(struct task_struct *curr) { }
static void futex_cleanup(struct task_struct *tsk)
{
if (unlikely(tsk->robust_list)) {
- exit_robust_list(tsk);
+ if (!exit_robust_list(tsk))
+ pr_info("futex: exit_robust_list failed\n");
tsk->robust_list = NULL;
}
#ifdef CONFIG_COMPAT
if (unlikely(tsk->compat_robust_list)) {
- compat_exit_robust_list(tsk);
+ if (!compat_exit_robust_list(tsk))
+ pr_info("futex: compat_exit_robust_list failed\n");
tsk->compat_robust_list = NULL;
}
#endif
--
2.35.1
* [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap
2022-04-21 19:05 [RFC 0/3] Slight improvements for OOM/Futex Nico Pache
2022-04-21 19:05 ` [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon Nico Pache
2022-04-21 19:05 ` [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails Nico Pache
@ 2022-04-21 19:05 ` Nico Pache
2022-04-22 15:38 ` Michal Hocko
2 siblings, 1 reply; 16+ messages in thread
From: Nico Pache @ 2022-04-21 19:05 UTC (permalink / raw)
To: linux-kernel, linux-mm; +Cc: David Rientjes, Michal Hocko, Andrea Arcangeli
The MMF_OOM_SKIP bit is used to indicate whether a mm_struct cannot be
invalidated or has already been invalidated. exit_mmap currently calls
__oom_reap_task_mm unconditionally despite the fact that the oom reaper
may have already called it.
Add a check for the MMF_OOM_SKIP bit being set in exit_mmap to avoid
unnecessary calls to the invalidation code.
A slight race on the MMF_OOM_SKIP bit can still allow this to run
twice. My testing has shown an ~66% decrease in double calls to
__oom_reap_task_mm.
Fixes: 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/mmap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index a2968669fd4e..b867f408dacd 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3113,7 +3113,8 @@ void exit_mmap(struct mm_struct *mm)
/* mm's last user has gone, and its about to be pulled down */
mmu_notifier_release(mm);
- if (unlikely(mm_is_oom_victim(mm))) {
+ if (unlikely(mm_is_oom_victim(mm)) &&
+ !test_bit(MMF_OOM_SKIP, &mm->flags)) {
/*
* Manually reap the mm to free as much memory as possible.
* Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
--
2.35.1
* Re: [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon
2022-04-21 19:05 ` [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon Nico Pache
@ 2022-04-21 19:28 ` Matthew Wilcox
2022-04-22 14:00 ` Nico Pache
0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2022-04-21 19:28 UTC (permalink / raw)
To: Nico Pache
Cc: linux-kernel, linux-mm, Michael Ellerman, Benjamin Herrenschmidt,
Paul Mackerras, Alexander Viro, Andrew Morton, Nicholas Piggin,
Logan Gunthorpe, Hari Bathini, Aneesh Kumar K.V, Yang Shi,
Miaohe Lin, William Kucharski, Hugh Dickins
On Thu, Apr 21, 2022 at 03:05:31PM -0400, Nico Pache wrote:
> The vma_is_anonymous function isn't fully indicative of what it checks.
>
> Without having full knowledge of the mmap process, one may incorrectly
> assume this covers all types of anonymous memory; which is not the case.
Is your complaint that anonymous memory can also be found in file VMAs
that were mapped with MAP_PRIVATE? ie COWed pages?
I don't think renaming this function is appropriate. It's whether
the VMA is anonymous, not whether the VMA can contain anonymous
pages.
* Re: [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails
2022-04-21 19:05 ` [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails Nico Pache
@ 2022-04-21 19:30 ` Matthew Wilcox
2022-04-22 14:12 ` Nico Pache
2022-04-21 20:53 ` Thomas Gleixner
2022-04-22 0:18 ` kernel test robot
2 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2022-04-21 19:30 UTC (permalink / raw)
To: Nico Pache
Cc: linux-kernel, linux-mm, Andrew Morton, Thomas Gleixner,
Ingo Molnar, Peter Zijlstra, Darren Hart, Davidlohr Bueso,
André Almeida, Arjan van de Ven, Ulrich Drepper
On Thu, Apr 21, 2022 at 03:05:32PM -0400, Nico Pache wrote:
> @@ -1007,13 +1013,15 @@ static inline void exit_pi_state_list(struct task_struct *curr) { }
> static void futex_cleanup(struct task_struct *tsk)
> {
> if (unlikely(tsk->robust_list)) {
> - exit_robust_list(tsk);
> + if (!exit_robust_list(tsk))
> + pr_info("futex: exit_robust_list failed");
Doesn't this allow a malicious user process to spam the kernel logs
with messages? There needs to be a ratelimit on this, at least.
* Re: [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails
2022-04-21 19:05 ` [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails Nico Pache
2022-04-21 19:30 ` Matthew Wilcox
@ 2022-04-21 20:53 ` Thomas Gleixner
2022-04-22 14:23 ` Nico Pache
2022-04-22 0:18 ` kernel test robot
2 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2022-04-21 20:53 UTC (permalink / raw)
To: Nico Pache, linux-kernel, linux-mm
Cc: Andrew Morton, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, André Almeida, Arjan van de Ven
On Thu, Apr 21 2022 at 15:05, Nico Pache wrote:
> The futex_cleanup routines currently fails silently.
>
> Allow the futex_cleanup routines to fail more verbosely so we can
> leave a message behind for easier debugging/error detecting.
This is a free ticket to spam dmesg for any unprivileged user.
> Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core")
There is nothing to fix, really.
Robust futexes are best effort as I explained you before and spamming
dmesg won't help anything.
Thanks,
tglx
* Re: [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails
2022-04-21 19:05 ` [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails Nico Pache
2022-04-21 19:30 ` Matthew Wilcox
2022-04-21 20:53 ` Thomas Gleixner
@ 2022-04-22 0:18 ` kernel test robot
2 siblings, 0 replies; 16+ messages in thread
From: kernel test robot @ 2022-04-22 0:18 UTC (permalink / raw)
To: Nico Pache; +Cc: llvm, kbuild-all
Hi Nico,
[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on powerpc/next]
[also build test ERROR on tip/locking/core linus/master v5.18-rc3]
[cannot apply to hnaz-mm/master linux/master next-20220421]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/intel-lab-lkp/linux/commits/Nico-Pache/Slight-improvements-for-OOM-Futex/20220422-030740
base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: hexagon-randconfig-r025-20220421 (https://download.01.org/0day-ci/archive/20220422/202204220841.5b2XtG9h-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 5bd87350a5ae429baf8f373cb226a57b62f87280)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/cd9fef152153fe60c67bb633118dd8adef51cadc
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Nico-Pache/Slight-improvements-for-OOM-Futex/20220422-030740
git checkout cd9fef152153fe60c67bb633118dd8adef51cadc
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
>> kernel/futex/core.c:774:7: error: use of undeclared identifier 'futex_cmpxchg_enabled'
if (!futex_cmpxchg_enabled)
^
1 error generated.
vim +/futex_cmpxchg_enabled +774 kernel/futex/core.c
758
759 /*
760 * Walk curr->robust_list (very carefully, it's a userspace list!)
761 * and mark any locks found there dead, and notify any waiters.
762 *
763 * We return false on any sign of list-walking problem.
764 */
765 static bool exit_robust_list(struct task_struct *curr)
766 {
767 struct robust_list_head __user *head = curr->robust_list;
768 struct robust_list __user *entry, *next_entry, *pending;
769 unsigned int limit = ROBUST_LIST_LIMIT, pi, pip;
770 unsigned int next_pi;
771 unsigned long futex_offset;
772 int rc;
773
> 774 if (!futex_cmpxchg_enabled)
775 return false;
776 /*
777 * Fetch the list head (which was registered earlier, via
778 * sys_set_robust_list()):
779 */
780 if (fetch_robust_entry(&entry, &head->list.next, &pi))
781 return false;
782 /*
783 * Fetch the relative futex offset:
784 */
785 if (get_user(futex_offset, &head->futex_offset))
786 return false;
787 /*
788 * Fetch any possibly pending lock-add first, and handle it
789 * if it exists:
790 */
791 if (fetch_robust_entry(&pending, &head->list_op_pending, &pip))
792 return false;
793
794 next_entry = NULL; /* avoid warning with gcc */
795 while (entry != &head->list) {
796 /*
797 * Fetch the next entry in the list before calling
798 * handle_futex_death:
799 */
800 rc = fetch_robust_entry(&next_entry, &entry->next, &next_pi);
801 /*
802 * A pending lock might already be on the list, so
803 * don't process it twice:
804 */
805 if (entry != pending) {
806 if (handle_futex_death((void __user *)entry + futex_offset,
807 curr, pi, HANDLE_DEATH_LIST))
808 return false;
809 }
810 if (rc)
811 return false;
812 entry = next_entry;
813 pi = next_pi;
814 /*
815 * Avoid excessively long or circular lists:
816 */
817 if (!--limit)
818 break;
819
820 cond_resched();
821 }
822
823 if (pending) {
824 return handle_futex_death((void __user *)pending + futex_offset,
825 curr, pip, HANDLE_DEATH_PENDING);
826 }
827 return true;
828 }
829
--
0-DAY CI Kernel Test Service
https://01.org/lkp
* Re: [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon
2022-04-21 19:28 ` Matthew Wilcox
@ 2022-04-22 14:00 ` Nico Pache
2022-04-28 16:14 ` David Hildenbrand
0 siblings, 1 reply; 16+ messages in thread
From: Nico Pache @ 2022-04-22 14:00 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-kernel, linux-mm, Michael Ellerman, Benjamin Herrenschmidt,
Paul Mackerras, Alexander Viro, Andrew Morton, Nicholas Piggin,
Logan Gunthorpe, Hari Bathini, Aneesh Kumar K.V, Yang Shi,
Miaohe Lin, William Kucharski, Hugh Dickins
On 4/21/22 15:28, Matthew Wilcox wrote:
> On Thu, Apr 21, 2022 at 03:05:31PM -0400, Nico Pache wrote:
>> The vma_is_anonymous function isn't fully indicative of what it checks.
>>
>> Without having full knowledge of the mmap process, one may incorrectly
>> assume this covers all types of anonymous memory, which is not the case.
>
> Is your complaint that anonymous memory can also be found in file VMAs
> that were mapped with MAP_PRIVATE? ie COWed pages?
I should have been more descriptive in my commit msg about how I came to this
conclusion.
From my understanding of the mmap process, a vma->vm_ops field is only NULL when
mmapped as !file and !shared:
if (file) {
    ...
} else if (vm_flags & VM_SHARED) { // ANON SHARED
    error = shmem_zero_setup(vma);
    if (error)
        goto free_vma;
} else { // ANON PRIVATE
    vma_set_anonymous(vma); // set vma->vm_ops = NULL
}
To me this means that the VMA is PRIVATE ANON memory. The vma_is_anonymous
function returns true when vm_ops == NULL. So my intention was to more
accurately describe what we are checking for. I could be wrong, though; that's
why I started with an RFC :)
There could be some aspect of COW that I don't fully understand. It is not
something I've looked into much.
Cheers,
-- Nico
* Re: [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails
2022-04-21 19:30 ` Matthew Wilcox
@ 2022-04-22 14:12 ` Nico Pache
0 siblings, 0 replies; 16+ messages in thread
From: Nico Pache @ 2022-04-22 14:12 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-kernel, linux-mm, Andrew Morton, Thomas Gleixner,
Ingo Molnar, Peter Zijlstra, Darren Hart, Davidlohr Bueso,
André Almeida, Arjan van de Ven, Ulrich Drepper
On 4/21/22 15:30, Matthew Wilcox wrote:
> On Thu, Apr 21, 2022 at 03:05:32PM -0400, Nico Pache wrote:
>> @@ -1007,13 +1013,15 @@ static inline void exit_pi_state_list(struct task_struct *curr) { }
>> static void futex_cleanup(struct task_struct *tsk)
>> {
>> if (unlikely(tsk->robust_list)) {
>> - exit_robust_list(tsk);
>> + if (!exit_robust_list(tsk))
>> + pr_info("futex: exit_robust_list failed\n");
>
> Doesn't this allow a malicious user process to spam the kernel logs
> with messages? There needs to be a ratelimit on this, at least.
Fair point, we'd need a ratelimited print if we want to continue forward with
this. Additionally we may want to limit this print to debug kernels, but that's
just a thought.
Thanks for the review :)
-- Nico
* Re: [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails
2022-04-21 20:53 ` Thomas Gleixner
@ 2022-04-22 14:23 ` Nico Pache
2022-04-22 14:42 ` Thomas Gleixner
0 siblings, 1 reply; 16+ messages in thread
From: Nico Pache @ 2022-04-22 14:23 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Andrew Morton, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, André Almeida, Arjan van de Ven,
linux-kernel, linux-mm, Matthew Wilcox
On 4/21/22 16:53, Thomas Gleixner wrote:
> On Thu, Apr 21 2022 at 15:05, Nico Pache wrote:
>
>> The futex_cleanup routines currently fail silently.
>>
>> Allow the futex_cleanup routines to fail more verbosely so we can
>> leave a message behind for easier debugging/error detecting.
>
> This is a free ticket to spam dmesg for any unprivileged user.
As Matthew pointed out, we'd want to rate limit this. I was also thinking we could
limit this to debug kernels so unprivileged users cannot cause dmesg spamming
in production configs.
>
>> Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core")
>
> There is nothing to fix, really.
Fair enough, it's been part of the design since inception.
>
> Robust futexes are best effort as I explained to you before, and spamming
> dmesg won't help anything.
It would have helped find the OOM/Robust Futex issue more quickly. It may also
help detect breakages in the pthread code (or any other users of robust futex)
if someone is doing something incorrectly.
Just a thought though. If you think it's pointless to proceed down this path I
won't push forward.
Cheers,
-- Nico
* Re: [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails
2022-04-22 14:23 ` Nico Pache
@ 2022-04-22 14:42 ` Thomas Gleixner
0 siblings, 0 replies; 16+ messages in thread
From: Thomas Gleixner @ 2022-04-22 14:42 UTC (permalink / raw)
To: Nico Pache
Cc: Andrew Morton, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, André Almeida, Arjan van de Ven,
linux-kernel, linux-mm, Matthew Wilcox
On Fri, Apr 22 2022 at 10:23, Nico Pache wrote:
> On 4/21/22 16:53, Thomas Gleixner wrote:
>> On Thu, Apr 21 2022 at 15:05, Nico Pache wrote:
>> Robust futexes are best effort as I explained to you before, and spamming
>> dmesg won't help anything.
> It would have helped find the OOM/Robust Futex issue more quickly. It may also
> help detect breakages in the pthread code (or any other users of robust futex)
> if someone is doing something incorrectly.
If we follow that line of argumentation, then we have to sprinkle printks
en masse all over the place, as there are a gazillion ways to create
hard-to-debug issues.
Thanks,
tglx
* Re: [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap
2022-04-21 19:05 ` [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap Nico Pache
@ 2022-04-22 15:38 ` Michal Hocko
2022-04-25 19:00 ` Nico Pache
0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2022-04-22 15:38 UTC (permalink / raw)
To: Nico Pache; +Cc: linux-kernel, linux-mm, David Rientjes, Andrea Arcangeli
On Thu 21-04-22 15:05:33, Nico Pache wrote:
> The MMF_OOM_SKIP bit is used to indicate whether a mm_struct cannot be
> invalidated or has already been invalidated. exit_mmap currently calls
> __oom_reap_task_mm unconditionally despite the fact that the oom reaper
> may have already called this.
>
> Add a check for the MMF_OOM_SKIP bit being set in exit_mmap to avoid
> unnecessary calls to the invalidate code.
Why do we care about this?
> A slight race can occur on the MMF_OOM_SKIP bit that will still allow
> this to run twice. My testing has shown an ~66% decrease in double calls
> to __oom_reap_task_mm.
>
> Fixes: 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
I do not see this would be fixing anything.
> Cc: David Rientjes <rientjes@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/mmap.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index a2968669fd4e..b867f408dacd 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3113,7 +3113,8 @@ void exit_mmap(struct mm_struct *mm)
> /* mm's last user has gone, and its about to be pulled down */
> mmu_notifier_release(mm);
>
> - if (unlikely(mm_is_oom_victim(mm))) {
> + if (unlikely(mm_is_oom_victim(mm)) &&
> + !test_bit(MMF_OOM_SKIP, &mm->flags)) {
> /*
> * Manually reap the mm to free as much memory as possible.
> * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> --
> 2.35.1
--
Michal Hocko
SUSE Labs
* Re: [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap
2022-04-22 15:38 ` Michal Hocko
@ 2022-04-25 19:00 ` Nico Pache
2022-04-26 6:59 ` Michal Hocko
0 siblings, 1 reply; 16+ messages in thread
From: Nico Pache @ 2022-04-25 19:00 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-kernel, linux-mm, David Rientjes, Andrea Arcangeli
On 4/22/22 11:38, Michal Hocko wrote:
> On Thu 21-04-22 15:05:33, Nico Pache wrote:
>> The MMF_OOM_SKIP bit is used to indicate whether a mm_struct cannot be
>> invalidated or has already been invalidated. exit_mmap currently calls
>> __oom_reap_task_mm unconditionally despite the fact that the oom reaper
>> may have already called this.
>>
>> Add a check for the MMF_OOM_SKIP bit being set in exit_mmap to avoid
>> unnecessary calls to the invalidate code.
>
> Why do we care about this?
Is there no cost to the MMU/TLB invalidation? The MMU notifier contains a lock
too, so perhaps we can also avoid some unnecessary MMU notifier lock contention.
>
>> A slight race can occur on the MMF_OOM_SKIP bit that will still allow
>> this to run twice. My testing has shown an ~66% decrease in double calls
>> to __oom_reap_task_mm.
>>
>> Fixes: 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
>
> I do not see this would be fixing anything.
OK, I'm just trying to make sure we are keeping an eye on what introduced this
double call. David's commit above is what introduced the oom_reap_task_mm call in
the exit_mmap code. It goes along with some other changes that I don't fully
understand without more studying, so that's why I was hoping he could provide
some input around that CVE (the main thing I'm concerned about re-introducing).
>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>> mm/mmap.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index a2968669fd4e..b867f408dacd 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -3113,7 +3113,8 @@ void exit_mmap(struct mm_struct *mm)
>> /* mm's last user has gone, and its about to be pulled down */
>> mmu_notifier_release(mm);
>>
>> - if (unlikely(mm_is_oom_victim(mm))) {
>> + if (unlikely(mm_is_oom_victim(mm)) &&
>> + !test_bit(MMF_OOM_SKIP, &mm->flags)) {
>> /*
>> * Manually reap the mm to free as much memory as possible.
>> * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
>> --
>> 2.35.1
>
* Re: [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap
2022-04-25 19:00 ` Nico Pache
@ 2022-04-26 6:59 ` Michal Hocko
0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2022-04-26 6:59 UTC (permalink / raw)
To: Nico Pache; +Cc: linux-kernel, linux-mm, David Rientjes, Andrea Arcangeli
On Mon 25-04-22 15:00:24, Nico Pache wrote:
>
>
> On 4/22/22 11:38, Michal Hocko wrote:
> > On Thu 21-04-22 15:05:33, Nico Pache wrote:
> >> The MMF_OOM_SKIP bit is used to indicate whether a mm_struct cannot be
> >> invalidated or has already been invalidated. exit_mmap currently calls
> >> __oom_reap_task_mm unconditionally despite the fact that the oom reaper
> >> may have already called this.
> >>
> >> Add a check for the MMF_OOM_SKIP bit being set in exit_mmap to avoid
> >> unnecessary calls to the invalidate code.
> >
> > Why do we care about this?
> Is there no cost to the MMU/TLB invalidation? The MMU notifier contains a lock
> too so perhaps we can also avoids some unnecessary MMU notifier lock contention.
I am pretty sure that this area can be micro-optimized, but I do not
really see a strong reason for that. OOM victims are/should be really
rare, so I do not think that any performance optimization would be really
visible.
If you want to improve the code then I think a much better plan would be
to get rid of the whole oom special case altogether. This might be much
closer than ever after Hugh's recent m{un}lock changes. I didn't have
time to think that through though. I believe Suren Baghdasaryan has been
looking into that as well.
--
Michal Hocko
SUSE Labs
* Re: [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon
2022-04-22 14:00 ` Nico Pache
@ 2022-04-28 16:14 ` David Hildenbrand
0 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2022-04-28 16:14 UTC (permalink / raw)
To: Nico Pache, Matthew Wilcox
Cc: linux-kernel, linux-mm, Michael Ellerman, Benjamin Herrenschmidt,
Paul Mackerras, Alexander Viro, Andrew Morton, Nicholas Piggin,
Logan Gunthorpe, Hari Bathini, Aneesh Kumar K.V, Yang Shi,
Miaohe Lin, William Kucharski, Hugh Dickins
On 22.04.22 16:00, Nico Pache wrote:
>
>
> On 4/21/22 15:28, Matthew Wilcox wrote:
>> On Thu, Apr 21, 2022 at 03:05:31PM -0400, Nico Pache wrote:
>>> The vma_is_anonymous function isn't fully indicative of what it checks.
>>>
>>> Without having full knowledge of the mmap process, one may incorrectly
>>> assume this covers all types of anonymous memory, which is not the case.
>>
>> Is your complaint that anonymous memory can also be found in file VMAs
>> that were mapped with MAP_PRIVATE? ie COWed pages?
> I should have been more descriptive in my commit msg about how I came to this
> conclusion.
>
> From my understanding of the mmap process, a vma->vm_ops field is only NULL when
> mmapped as !file and !shared:
>
> if (file) {
>     ...
> } else if (vm_flags & VM_SHARED) { // ANON SHARED
>     error = shmem_zero_setup(vma);
>     if (error)
>         goto free_vma;
> } else { // ANON PRIVATE
>     vma_set_anonymous(vma); // set vma->vm_ops = NULL
> }
>
> To me this means that the VMA is PRIVATE ANON memory. The vma_is_anonymous
> function returns true when vm_ops == NULL. So my intention was to more
> accurately describe what we are checking for. I could be wrong, though; that's
> why I started with an RFC :)
Shared anon in the kernel is really just shmem. The user space notion
is MAP_ANON|MAP_SHARED, but that really just maps to shmem and the
kernel doesn't really call that thing anonymous memory.
So I agree, renaming this is not appropriate.
--
Thanks,
David / dhildenb
end of thread, other threads:[~2022-04-28 16:14 UTC | newest]
Thread overview: 16+ messages
2022-04-21 19:05 [RFC 0/3] Slight improvements for OOM/Futex Nico Pache
2022-04-21 19:05 ` [RFC 1/3] mm: change vma_is_anonymous to vma_is_private_anon Nico Pache
2022-04-21 19:28 ` Matthew Wilcox
2022-04-22 14:00 ` Nico Pache
2022-04-28 16:14 ` David Hildenbrand
2022-04-21 19:05 ` [RFC 2/3] futex: exit: Print a warning when futex_cleanup fails Nico Pache
2022-04-21 19:30 ` Matthew Wilcox
2022-04-22 14:12 ` Nico Pache
2022-04-21 20:53 ` Thomas Gleixner
2022-04-22 14:23 ` Nico Pache
2022-04-22 14:42 ` Thomas Gleixner
2022-04-22 0:18 ` kernel test robot
2022-04-21 19:05 ` [RFC 3/3] exit: Check for MMF_OOM_SKIP in exit_mmap Nico Pache
2022-04-22 15:38 ` Michal Hocko
2022-04-25 19:00 ` Nico Pache
2022-04-26 6:59 ` Michal Hocko