linux-kernel.vger.kernel.org archive mirror
* [RFC v1 0/4] Two alternatives for mm async teardown
@ 2021-11-11  9:50 Claudio Imbrenda
  2021-11-11  9:50 ` [RFC v1 1/4] add arch mmput hook in exit.c Claudio Imbrenda
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Claudio Imbrenda @ 2021-11-11  9:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: thuth, frankja, borntraeger, Ulrich.Weigand, heiko.carstens,
	david, ultrachin, akpm, vbabka, brookxu.cn, xiaoggchen,
	linuszeng, yihuilu, mhocko, daniel.m.jordan, axboe, legion,
	peterz, aarcange, christian, ebiederm, tglx

This RFC series proposes two possible ways for enabling asynchronous mm
teardown.

The first approach, in patch 1, is simply to provide an arch hook in
exit_mm. This has no functional change for archs that don't explicitly
use the hook, and leaves the hard part to arch code (including
accounting, if any).

The second approach, in patches 2 to 4, adds a new syscall to allow an
mm to be asynchronously torn down in the context of another process
(similarly to how process_mrelease works). It also adds an OOM notifier
to prevent the OOM killer from killing processes while the teardown is
in progress.


Claudio Imbrenda (4):
  exit: add arch mmput hook in exit_mm
  kernel/fork.c: implement new process_mmput_async syscall
  mm: wire up the process_mmput_async syscall
  kernel/fork.c: process_mmput_async: stop OOM while freeing memory

 arch/alpha/kernel/syscalls/syscall.tbl      |   2 +
 arch/arm/tools/syscall.tbl                  |   1 +
 arch/arm64/include/asm/unistd.h             |   2 +-
 arch/arm64/include/asm/unistd32.h           |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |   2 +
 arch/m68k/kernel/syscalls/syscall.tbl       |   2 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   2 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   2 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   2 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |   2 +
 arch/parisc/kernel/syscalls/syscall.tbl     |   2 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |   2 +
 arch/s390/kernel/syscalls/syscall.tbl       |   2 +
 arch/sh/kernel/syscalls/syscall.tbl         |   2 +
 arch/sparc/kernel/syscalls/syscall.tbl      |   2 +
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |   2 +
 include/asm-generic/mmu_context.h           |   4 +
 include/linux/mm_types.h                    |   1 +
 include/linux/syscalls.h                    |   1 +
 include/uapi/asm-generic/unistd.h           |   5 +-
 kernel/exit.c                               |   2 +-
 kernel/fork.c                               | 131 +++++++++++++++++++-
 kernel/sys_ni.c                             |   1 +
 25 files changed, 173 insertions(+), 5 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC v1 1/4] exit: add arch mmput hook in exit_mm
  2021-11-11  9:50 [RFC v1 0/4] Two alternatives for mm async teardown Claudio Imbrenda
  2021-11-11  9:50 ` [RFC v1 1/4] add arch mmput hook in exit.c Claudio Imbrenda
@ 2021-11-11  9:50 ` Claudio Imbrenda
  2021-11-11 18:43   ` Eric W. Biederman
  2021-11-11  9:50 ` [RFC v1 2/4] kernel/fork.c: implement new process_mmput_async syscall Claudio Imbrenda
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Claudio Imbrenda @ 2021-11-11  9:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: thuth, frankja, borntraeger, Ulrich.Weigand, heiko.carstens,
	david, ultrachin, akpm, vbabka, brookxu.cn, xiaoggchen,
	linuszeng, yihuilu, mhocko, daniel.m.jordan, axboe, legion,
	peterz, aarcange, christian, ebiederm, tglx

This patch adds a hook for the mmput call in exit_mm, allowing archs
to perform the mmput in custom ways if desired (e.g. asynchronously).

No functional change intended.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 include/asm-generic/mmu_context.h | 4 ++++
 kernel/exit.c                     | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/mmu_context.h b/include/asm-generic/mmu_context.h
index 91727065bacb..900931a6a105 100644
--- a/include/asm-generic/mmu_context.h
+++ b/include/asm-generic/mmu_context.h
@@ -73,4 +73,8 @@ static inline void deactivate_mm(struct task_struct *tsk,
 }
 #endif
 
+#ifndef arch_exit_mm_mmput
+#define arch_exit_mm_mmput mmput
+#endif
+
 #endif /* __ASM_GENERIC_MMU_CONTEXT_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index f702a6a63686..6eb1fdcc434e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -504,7 +504,7 @@ static void exit_mm(void)
 	task_unlock(current);
 	mmap_read_unlock(mm);
 	mm_update_next_owner(mm);
-	mmput(mm);
+	arch_exit_mm_mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
 		exit_oom_victim();
 }
-- 
2.31.1



* [RFC v1 2/4] kernel/fork.c: implement new process_mmput_async syscall
  2021-11-11  9:50 [RFC v1 0/4] Two alternatives for mm async teardown Claudio Imbrenda
  2021-11-11  9:50 ` [RFC v1 1/4] add arch mmput hook in exit.c Claudio Imbrenda
  2021-11-11  9:50 ` [RFC v1 1/4] exit: add arch mmput hook in exit_mm Claudio Imbrenda
@ 2021-11-11  9:50 ` Claudio Imbrenda
  2021-11-11 19:20   ` Eric W. Biederman
  2021-11-11  9:50 ` [RFC v1 3/4] mm: wire up the " Claudio Imbrenda
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Claudio Imbrenda @ 2021-11-11  9:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: thuth, frankja, borntraeger, Ulrich.Weigand, heiko.carstens,
	david, ultrachin, akpm, vbabka, brookxu.cn, xiaoggchen,
	linuszeng, yihuilu, mhocko, daniel.m.jordan, axboe, legion,
	peterz, aarcange, christian, ebiederm, tglx

The goal of this new syscall is to be able to asynchronously free the
mm of a dying process. This is especially useful for processes that use
huge amounts of memory (e.g. databases or KVM guests). The process is
allowed to terminate immediately, while its mm is cleaned/reclaimed
asynchronously.

A separate process needs to use the process_mmput_async syscall to
attach itself to the mm of a running target process. That process will
then sleep until the last user of the target mm is gone.

When the last user of the mm is gone, instead of freeing the mm
synchronously, the attached process is woken. The syscall then
continues and cleans up the target mm.

This approach has the advantage that the cleanup of the target mm can
both happen asynchronously and be properly accounted for (e.g. in
cgroups).

Tested on s390x.

A separate patch will actually wire up the syscall.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 include/linux/mm_types.h |   1 +
 kernel/fork.c            | 103 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 102 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bb8c6f5f19bc..adc62cba3e91 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -631,6 +631,7 @@ struct mm_struct {
 		atomic_long_t hugetlb_usage;
 #endif
 		struct work_struct async_put_work;
+		struct task_struct *mmput_async_task;
 
 #ifdef CONFIG_IOMMU_SUPPORT
 		u32 pasid;
diff --git a/kernel/fork.c b/kernel/fork.c
index 5de23f3e08bf..0da39b76005c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1062,6 +1062,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 #endif
 	mm_init_uprobes_state(mm);
 	hugetlb_count_init(mm);
+	mm->mmput_async_task = NULL;
 
 	if (current->mm) {
 		mm->flags = current->mm->flags & MMF_INIT_MASK;
@@ -1130,7 +1131,12 @@ void mmput(struct mm_struct *mm)
 {
 	might_sleep();
 
-	if (atomic_dec_and_test(&mm->mm_users))
+	if (!atomic_dec_and_test(&mm->mm_users))
+		return;
+
+	if (READ_ONCE(mm->mmput_async_task))
+		wake_up_process(mm->mmput_async_task);
+	else
 		__mmput(mm);
 }
 EXPORT_SYMBOL_GPL(mmput);
@@ -1146,7 +1152,12 @@ static void mmput_async_fn(struct work_struct *work)
 
 void mmput_async(struct mm_struct *mm)
 {
-	if (atomic_dec_and_test(&mm->mm_users)) {
+	if (!atomic_dec_and_test(&mm->mm_users))
+		return;
+
+	if (READ_ONCE(mm->mmput_async_task)) {
+		wake_up_process(mm->mmput_async_task);
+	} else {
 		INIT_WORK(&mm->async_put_work, mmput_async_fn);
 		schedule_work(&mm->async_put_work);
 	}
@@ -3191,3 +3202,91 @@ int sysctl_max_threads(struct ctl_table *table, int write,
 
 	return 0;
 }
+
+SYSCALL_DEFINE2(process_mmput_async, int, pidfd, unsigned int, flags)
+{
+#ifdef CONFIG_MMU
+	struct mm_struct *mm = NULL;
+	struct task_struct *task;
+	unsigned int tmp;
+	struct pid *pid;
+
+	if (flags)
+		return -EINVAL;
+
+	pid = pidfd_get_pid(pidfd, &tmp);
+	if (IS_ERR(pid))
+		return PTR_ERR(pid);
+
+	task = get_pid_task(pid, PIDTYPE_TGID);
+	/* The PID is not needed once we have the task_struct */
+	put_pid(pid);
+	if (!task)
+		return -ESRCH;
+
+	/*
+	 * The mm_struct is guaranteed to be there as long as we are holding
+	 * a reference to the task_struct. This also guarantees that we
+	 * will not race with mmput, since we are now holding one additional
+	 * reference.
+	 */
+	if (mmget_not_zero(task->mm))
+		mm = task->mm;
+	/* The task_struct is not needed once we have the mm_struct */
+	put_task_struct(task);
+	/* If the target process no longer has an mm, there is nothing we can do. */
+	if (!mm)
+		return -ENXIO;
+
+	/* Use TASK_IDLE instead of TASK_UNINTERRUPTIBLE to avoid stall notifications. */
+	set_current_state(TASK_IDLE);
+	/*
+	 * Return an error if another task had already set itself as async
+	 * cleanup for the target mm.
+	 */
+	if (cmpxchg(&mm->mmput_async_task, NULL, current) != NULL) {
+		set_current_state(TASK_RUNNING);
+		return -EEXIST;
+	}
+
+	/*
+	 * The target mm now has an extra reference to current.
+	 * Is this useless?
+	 */
+	get_task_struct(current);
+	/*
+	 * We will now do mmput, and we no longer have a reference to the
+	 * task; we need mmgrab to be able to reference the mm_struct even
+	 * after its last user is gone.
+	 */
+	mmgrab(mm);
+	/*
+	 * If the target mm has been discarded after we got its reference,
+	 * then we are holding the last reference to it, and doing mmput
+	 * here will cause us to be schedulable again.
+	 */
+	mmput(mm);
+
+	/*
+	 * Go to sleep; we are set to TASK_IDLE, so nothing will wake us up
+	 * except an explicit wake_up_process from mmput.
+	 */
+	schedule();
+
+	/* Put the extra reference taken above */
+	put_task_struct(current);
+
+	/* Should this be a warning instead? */
+	if (atomic_read(&mm->mm_users))
+		panic("mm_users not 0 but trying to __mmput anyway!");
+
+	/* Do the actual work */
+	__mmput(mm);
+	/* And put the extra reference taken above */
+	mmdrop(mm);
+
+	return 0;
+#else
+	return -ENOSYS;
+#endif /* CONFIG_MMU */
+}
-- 
2.31.1



* [RFC v1 3/4] mm: wire up the process_mmput_async syscall
  2021-11-11  9:50 [RFC v1 0/4] Two alternatives for mm async teardown Claudio Imbrenda
                   ` (2 preceding siblings ...)
  2021-11-11  9:50 ` [RFC v1 2/4] kernel/fork.c: implement new process_mmput_async syscall Claudio Imbrenda
@ 2021-11-11  9:50 ` Claudio Imbrenda
  2021-11-11  9:50 ` [RFC v1 4/4] kernel/fork.c: process_mmput_async: stop OOM while freeing memory Claudio Imbrenda
  2021-11-12 10:15 ` [RFC v1 0/4] Two alternatives for mm async teardown Michal Hocko
  5 siblings, 0 replies; 14+ messages in thread
From: Claudio Imbrenda @ 2021-11-11  9:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: thuth, frankja, borntraeger, Ulrich.Weigand, heiko.carstens,
	david, ultrachin, akpm, vbabka, brookxu.cn, xiaoggchen,
	linuszeng, yihuilu, mhocko, daniel.m.jordan, axboe, legion,
	peterz, aarcange, christian, ebiederm, tglx

Wire up the process_mmput_async syscall.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/alpha/kernel/syscalls/syscall.tbl      | 2 ++
 arch/arm/tools/syscall.tbl                  | 1 +
 arch/arm64/include/asm/unistd.h             | 2 +-
 arch/arm64/include/asm/unistd32.h           | 2 ++
 arch/ia64/kernel/syscalls/syscall.tbl       | 2 ++
 arch/m68k/kernel/syscalls/syscall.tbl       | 2 ++
 arch/microblaze/kernel/syscalls/syscall.tbl | 2 ++
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 2 ++
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 2 ++
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 2 ++
 arch/parisc/kernel/syscalls/syscall.tbl     | 2 ++
 arch/powerpc/kernel/syscalls/syscall.tbl    | 2 ++
 arch/s390/kernel/syscalls/syscall.tbl       | 2 ++
 arch/sh/kernel/syscalls/syscall.tbl         | 2 ++
 arch/sparc/kernel/syscalls/syscall.tbl      | 2 ++
 arch/x86/entry/syscalls/syscall_32.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl      | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     | 2 ++
 include/linux/syscalls.h                    | 1 +
 include/uapi/asm-generic/unistd.h           | 5 ++++-
 kernel/sys_ni.c                             | 1 +
 21 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index e4a041cd5715..2ec84b67534e 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -488,3 +488,5 @@
 556	common	landlock_restrict_self		sys_landlock_restrict_self
 # 557 reserved for memfd_secret
 558	common	process_mrelease		sys_process_mrelease
+# 559 reserved for futex_waitv
+560	common	process_mmput_async		sys_process_mmput_async
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 543100151f2b..8aee1a20f4d5 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -463,3 +463,4 @@
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
 449	common	futex_waitv			sys_futex_waitv
+450	common	process_mmput_async		sys_process_mmput_async
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 6bdb5f5db438..4e65da3445c7 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		450
+#define __NR_compat_syscalls		451
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 41ea1195e44b..e3b75a2896d2 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -905,6 +905,8 @@ __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
 __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 #define __NR_futex_waitv 449
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
+#define __NR_process_mmput_async 450
+__SYSCALL(__NR_process_mmput_async, sys_process_mmput_async)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 6fea1844fb95..c1e454342e94 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -369,3 +369,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	process_mmput_async		sys_process_mmput_async
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7976dff8f879..b89bcb3b50d0 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -448,3 +448,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	process_mmput_async		sys_process_mmput_async
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 6b0e11362bd2..cf4df3a9f95d 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -454,3 +454,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	process_mmput_async		sys_process_mmput_async
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 70e32de2bcaa..6374c2acc370 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -387,3 +387,5 @@
 446	n32	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	n32	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	n32	process_mmput_async		sys_process_mmput_async
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 1ca7bc337932..f7d336bcdcad 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -363,3 +363,5 @@
 446	n64	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	n64	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	n64	process_mmput_async		sys_process_mmput_async
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index a61c35edaa74..1b4572b4c0f4 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -436,3 +436,5 @@
 446	o32	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	o32	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	o32	process_mmput_async		sys_process_mmput_async
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index bf751e0732b7..58fca00dd2c4 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -446,3 +446,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	process_mmput_async		sys_process_mmput_async
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 7bef917cc84e..9c62f9ca307f 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -528,3 +528,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	process_mmput_async		sys_process_mmput_async
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index df5261e5cfe1..aa6b50367a8b 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -451,3 +451,5 @@
 446  common	landlock_restrict_self	sys_landlock_restrict_self	sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448  common	process_mrelease	sys_process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450  common	process_mmput_async	sys_process_mmput_async		sys_process_mmput_async
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 208f131659c5..372122390fa2 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -451,3 +451,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	process_mmput_async		sys_process_mmput_async
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index c37764dc764d..5c1448ead433 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -494,3 +494,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	process_mmput_async		sys_process_mmput_async
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 7e25543693de..ad034bc349ec 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -454,3 +454,4 @@
 447	i386	memfd_secret		sys_memfd_secret
 448	i386	process_mrelease	sys_process_mrelease
 449	i386	futex_waitv		sys_futex_waitv
+450	i386	process_mmput_async	sys_process_mmput_async
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index fe8f8dd157b4..df659c967932 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -371,6 +371,7 @@
 447	common	memfd_secret		sys_memfd_secret
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
+450	common	process_mmput_async	sys_process_mmput_async
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 104b327f8ac9..7534f2969af6 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -419,3 +419,5 @@
 446	common	landlock_restrict_self		sys_landlock_restrict_self
 # 447 reserved for memfd_secret
 448	common	process_mrelease		sys_process_mrelease
+# 449 reserved for futex_waitv
+450	common	process_mmput_async		sys_process_mmput_async
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 528a478dbda8..e416f6f8a4b8 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -920,6 +920,7 @@ asmlinkage long sys_mincore(unsigned long start, size_t len,
 asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
 asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec,
 			size_t vlen, int behavior, unsigned int flags);
+asmlinkage long sys_process_mmput_async(int pidfd, unsigned int flags);
 asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags);
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
 			unsigned long prot, unsigned long pgoff,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 4557a8b6086f..bf6b9c3e4957 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -883,8 +883,11 @@ __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 #define __NR_futex_waitv 449
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 
+#define __NR_process_mmput_async 450
+__SYSCALL(__NR_process_mmput_async, sys_process_mmput_async)
+
 #undef __NR_syscalls
-#define __NR_syscalls 450
+#define __NR_syscalls 451
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d1944258cfc0..0b69802a3d62 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -290,6 +290,7 @@ COND_SYSCALL(munlockall);
 COND_SYSCALL(mincore);
 COND_SYSCALL(madvise);
 COND_SYSCALL(process_madvise);
+COND_SYSCALL(process_mmput_async);
 COND_SYSCALL(process_mrelease);
 COND_SYSCALL(remap_file_pages);
 COND_SYSCALL(mbind);
-- 
2.31.1



* [RFC v1 4/4] kernel/fork.c: process_mmput_async: stop OOM while freeing memory
  2021-11-11  9:50 [RFC v1 0/4] Two alternatives for mm async teardown Claudio Imbrenda
                   ` (3 preceding siblings ...)
  2021-11-11  9:50 ` [RFC v1 3/4] mm: wire up the " Claudio Imbrenda
@ 2021-11-11  9:50 ` Claudio Imbrenda
  2021-11-12 10:15 ` [RFC v1 0/4] Two alternatives for mm async teardown Michal Hocko
  5 siblings, 0 replies; 14+ messages in thread
From: Claudio Imbrenda @ 2021-11-11  9:50 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: thuth, frankja, borntraeger, Ulrich.Weigand, heiko.carstens,
	david, ultrachin, akpm, vbabka, brookxu.cn, xiaoggchen,
	linuszeng, yihuilu, mhocko, daniel.m.jordan, axboe, legion,
	peterz, aarcange, christian, ebiederm, tglx

This patch implements a simple OOM notifier to stop the OOM killer
while a mm is being reclaimed asynchronously using the
process_mmput_async syscall.

Tested on s390x.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 kernel/fork.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index 0da39b76005c..7279209eb69c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -118,6 +118,11 @@
  */
 #define MAX_THREADS FUTEX_TID_MASK
 
+/*
+ * Priority for the OOM notifier used in process_mmput_async
+ */
+#define PROCESS_MMPUT_ASYNC_OOM_NOTIFY_PRIORITY 70
+
 /*
  * Protected counters by write_lock_irq(&tasklist_lock)
  */
@@ -3203,13 +3208,27 @@ int sysctl_max_threads(struct ctl_table *table, int write,
 	return 0;
 }
 
+/* Prevent the OOM from being triggered while we are cleaning up asynchronously */
+static int mmput_async_oom_notifier(struct notifier_block *nb, unsigned long dummy, void *parm)
+{
+	/*
+	 * We cannot know the speed at which pages are being freed, so we
+	 * fake it and say it's at least one. This is already enough to
+	 * stop the OOM killer.
+	 */
+	*(unsigned long *)parm += PAGE_SIZE;
+	return NOTIFY_OK;
+}
+
 SYSCALL_DEFINE2(process_mmput_async, int, pidfd, unsigned int, flags)
 {
 #ifdef CONFIG_MMU
+	struct notifier_block oom_nb;
 	struct mm_struct *mm = NULL;
 	struct task_struct *task;
 	unsigned int tmp;
 	struct pid *pid;
+	int r;
 
 	if (flags)
 		return -EINVAL;
@@ -3280,8 +3299,17 @@ SYSCALL_DEFINE2(process_mmput_async, int, pidfd, unsigned int, flags)
 	if (atomic_read(&mm->mm_users))
 		panic("mm_users not 0 but trying to __mmput anyway!");
 
+	/*
+	 * Register an OOM notifier, to stop the OOM while we are
+	 * asynchronously freeing the mm.
+	 */
+	oom_nb.priority = PROCESS_MMPUT_ASYNC_OOM_NOTIFY_PRIORITY;
+	oom_nb.notifier_call = mmput_async_oom_notifier;
+	r = register_oom_notifier(&oom_nb);
 	/* Do the actual work */
 	__mmput(mm);
+	if (!r)
+		unregister_oom_notifier(&oom_nb);
 	/* And put the extra reference taken above */
 	mmdrop(mm);
 
-- 
2.31.1



* Re: [RFC v1 1/4] exit: add arch mmput hook in exit_mm
  2021-11-11  9:50 ` [RFC v1 1/4] exit: add arch mmput hook in exit_mm Claudio Imbrenda
@ 2021-11-11 18:43   ` Eric W. Biederman
  0 siblings, 0 replies; 14+ messages in thread
From: Eric W. Biederman @ 2021-11-11 18:43 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: linux-kernel, linux-mm, thuth, frankja, borntraeger,
	Ulrich.Weigand, heiko.carstens, david, ultrachin, akpm, vbabka,
	brookxu.cn, xiaoggchen, linuszeng, yihuilu, mhocko,
	daniel.m.jordan, axboe, legion, peterz, aarcange, christian,
	tglx

Claudio Imbrenda <imbrenda@linux.ibm.com> writes:

> This simple patch adds a hook for the mmput in exit_mm. This allows
> archs to perform the mmput in custom ways if desired (e.g.
> asynchronously)
>
> No functional change intended.

Ouch! No.

Your changes don't include an architecture taking advantage of this,
so there is no way to see how this is used in practice and maintain
the code.

Further, you are making the generic code much harder to read and
follow, replacing generic code with something random that some buggy
architecture implements and that no one understands.

Saying "some buggy architecture implements and no one understands"
is a bit of hyperbole, but that is how these hooks feel when digging
in to change the code.  It takes weeks to months to read through,
ask questions, etc., to clean hooks like this up and change the
code in an appropriate way.

As it stands, things that are ill-specified like this really do need
to change eventually.

So please use much more targeted routines for architecture code to call.
Especially when dealing with something as fundamental as the lifetime of
a core kernel object.

If you want an example of how silly some of these kinds of things
can get just look at arch_ptrace_stop and sigkill_pending.  Linus has
just merged my fixes for these things.  There are worse examples, I just
remember them off the top of my head.

If this is merged as is, this feels like code that will be subtly wrong
for a decade before someone figures it out and fixes it.

Eric


> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> ---
>  include/asm-generic/mmu_context.h | 4 ++++
>  kernel/exit.c                     | 2 +-
>  2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/asm-generic/mmu_context.h b/include/asm-generic/mmu_context.h
> index 91727065bacb..900931a6a105 100644
> --- a/include/asm-generic/mmu_context.h
> +++ b/include/asm-generic/mmu_context.h
> @@ -73,4 +73,8 @@ static inline void deactivate_mm(struct task_struct *tsk,
>  }
>  #endif
>  
> +#ifndef arch_exit_mm_mmput
> +#define arch_exit_mm_mmput mmput
> +#endif
> +
>  #endif /* __ASM_GENERIC_MMU_CONTEXT_H */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index f702a6a63686..6eb1fdcc434e 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -504,7 +504,7 @@ static void exit_mm(void)
>  	task_unlock(current);
>  	mmap_read_unlock(mm);
>  	mm_update_next_owner(mm);
> -	mmput(mm);
> +	arch_exit_mm_mmput(mm);
>  	if (test_thread_flag(TIF_MEMDIE))
>  		exit_oom_victim();
>  }


* Re: [RFC v1 2/4] kernel/fork.c: implement new process_mmput_async syscall
  2021-11-11  9:50 ` [RFC v1 2/4] kernel/fork.c: implement new process_mmput_async syscall Claudio Imbrenda
@ 2021-11-11 19:20   ` Eric W. Biederman
  2021-11-12  9:27     ` Claudio Imbrenda
  2021-11-12  9:34     ` Claudio Imbrenda
  0 siblings, 2 replies; 14+ messages in thread
From: Eric W. Biederman @ 2021-11-11 19:20 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: linux-kernel, linux-mm, thuth, frankja, borntraeger,
	Ulrich.Weigand, heiko.carstens, david, ultrachin, akpm, vbabka,
	brookxu.cn, xiaoggchen, linuszeng, yihuilu, mhocko,
	daniel.m.jordan, axboe, legion, peterz, aarcange, christian,
	tglx

Claudio Imbrenda <imbrenda@linux.ibm.com> writes:

> The goal of this new syscall is to be able to asynchronously free the
> mm of a dying process. This is especially useful for processes that use
> huge amounts of memory (e.g. databases or KVM guests). The process is
> allowed to terminate immediately, while its mm is cleaned/reclaimed
> asynchronously.
>
> A separate process needs to use the process_mmput_async syscall to attach
> itself to the mm of a running target process. The process will then
> sleep until the last user of the target mm has gone.
>
> When the last user of the mm has gone, instead of synchronously freeing
> the mm, the attached process is awoken. The syscall will then continue
> and clean up the target mm.
>
> This solution has the advantage that the cleanup of the target mm can
> both be asynchronous and properly accounted for (e.g. cgroups).
>
> Tested on s390x.
>
> A separate patch will actually wire up the syscall.

I am a bit confused.

You want the process to report that it has finished immediately,
and you want the cleanup work to continue on in the background.

Why do you need a separate process?

Why not just modify the process cleanup code to keep the task_struct
running while allowing waitpid to reap the process (aka allowing
release_task to run)?  All tasks can already be reaped after
exit_notify in do_exit.

I can see some reasons for wanting an opt-in.  It is nice to know all of
a process's resources have been freed when waitpid succeeds.

Still I don't see why this whole thing isn't exit_mm returning
the mm_struct when a flag is set, and then having an exit_mm_late
being called and passed the returned mm after exit_notify.
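
Roughly, the shape being suggested might look like this (hypothetical,
untested pseudocode against kernel/exit.c; `deferred_mmput_requested()`
is an invented name used purely for illustration):

```c
/* Hypothetical pseudocode for the shape described above: exit_mm()
 * keeps the mm alive when deferral was requested, and the final
 * mmput() runs only after exit_notify(), so the task can be reaped
 * before the expensive teardown happens. */
static struct mm_struct *exit_mm(void)
{
	struct mm_struct *mm = current->mm;
	/* ... existing exit_mm() body ... */
	if (deferred_mmput_requested(current))
		return mm;		/* defer the final mmput() */
	mmput(mm);
	return NULL;
}

void __noreturn do_exit(long code)
{
	struct mm_struct *mm;
	/* ... */
	mm = exit_mm();
	/* ... */
	exit_notify(tsk, group_dead);
	if (mm)
		mmput(mm);	/* "exit_mm_late": teardown after reaping */
	/* ... */
}
```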

Or maybe something with schedule_work or task_work, instead of an
exit_mm_late.  I don't see any practical difference.

I really don't see why this needs a whole other process to connect to
the process you care about asynchronously.

This whole thing seems an exercise in spending lots of resources to free
resources much later.

Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC v1 2/4] kernel/fork.c: implement new process_mmput_async syscall
  2021-11-11 19:20   ` Eric W. Biederman
@ 2021-11-12  9:27     ` Claudio Imbrenda
  2021-11-12  9:34     ` Claudio Imbrenda
  1 sibling, 0 replies; 14+ messages in thread
From: Claudio Imbrenda @ 2021-11-12  9:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-mm, thuth, frankja, borntraeger,
	Ulrich.Weigand, david, ultrachin, akpm, vbabka, brookxu.cn,
	xiaoggchen, linuszeng, yihuilu, mhocko, daniel.m.jordan, axboe,
	legion, peterz, aarcange, christian, tglx

On Thu, 11 Nov 2021 13:20:11 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Claudio Imbrenda <imbrenda@linux.ibm.com> writes:
> 
> > The goal of this new syscall is to be able to asynchronously free the
> > mm of a dying process. This is especially useful for processes that use
> > huge amounts of memory (e.g. databases or KVM guests). The process is
> > allowed to terminate immediately, while its mm is cleaned/reclaimed
> > asynchronously.
> >
> > A separate process needs to use the process_mmput_async syscall to attach
> > itself to the mm of a running target process. The process will then
> > sleep until the last user of the target mm has gone.
> >
> > When the last user of the mm has gone, instead of synchronously freeing
> > the mm, the attached process is awoken. The syscall will then continue
> > and clean up the target mm.
> >
> > This solution has the advantage that the cleanup of the target mm can
> > both be asynchronous and properly accounted for (e.g. cgroups).
> >
> > Tested on s390x.
> >
> > A separate patch will actually wire up the syscall.  
> 
> I am a bit confused.
> 
> You want the process to report that it has finished immediately,
> and you want the cleanup work to continue on in the background.

yes

> Why do you need a separate process?
> 
> Why not just modify the process cleanup code to keep the task_struct
> running while allowing waitpid to reap the process (aka allowing
> release_task to run)?  All tasks can already be reaped after
> exit_notify in do_exit.
> 
> I can see some reasons for wanting an opt-in.  It is nice to know all of
> a process's resources have been freed when waitpid succeeds.
> 
> Still I don't see why this whole thing isn't exit_mm returning
> the mm_struct when a flag is set, and then having an exit_mm_late
> being called and passed the returned mm after exit_notify.

so if I understand correctly you are saying exit_mm would skip the
mmput, set a flag, then I should introduce a new function
"exit_mm_late" after exit_notify, to check the flag and do the mmput if
needed

and that would mean that the cleanup would still be done in the context
of the exiting process, but without holding back anyone waiting for the
process to terminate (so the process appears to exit immediately)

sounds clean, I will do it

> Or maybe something with schedule_work or task_work, instead of an
> exit_mm_late.  I don't see any practical difference.
> 
> I really don't see why this needs a whole other process to connect to
> the process you care about asynchronously.

accounting: workqueues and kernel threads are not properly accounted to
the right cgroups; by using a userspace process, things get accounted
properly.

this was a major point that was made last month when a similar
discussion came up

> This whole thing seems an exercise in spending lots of resources to free
> resources much later.

there are some use cases for this (huge processes like databases, or huge
secure VMs where the teardown is significantly slower than normal
processes)

> 
> Eric


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC v1 2/4] kernel/fork.c: implement new process_mmput_async syscall
  2021-11-11 19:20   ` Eric W. Biederman
  2021-11-12  9:27     ` Claudio Imbrenda
@ 2021-11-12  9:34     ` Claudio Imbrenda
  2021-11-12 14:57       ` Eric W. Biederman
  1 sibling, 1 reply; 14+ messages in thread
From: Claudio Imbrenda @ 2021-11-12  9:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-mm, thuth, frankja, borntraeger,
	Ulrich.Weigand, david, ultrachin, akpm, vbabka, brookxu.cn,
	xiaoggchen, linuszeng, yihuilu, mhocko, daniel.m.jordan, axboe,
	legion, peterz, aarcange, christian, tglx

On Thu, 11 Nov 2021 13:20:11 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Claudio Imbrenda <imbrenda@linux.ibm.com> writes:
> 
> > The goal of this new syscall is to be able to asynchronously free the
> > mm of a dying process. This is especially useful for processes that use
> > huge amounts of memory (e.g. databases or KVM guests). The process is
> > allowed to terminate immediately, while its mm is cleaned/reclaimed
> > asynchronously.
> >
> > A separate process needs to use the process_mmput_async syscall to attach
> > itself to the mm of a running target process. The process will then
> > sleep until the last user of the target mm has gone.
> >
> > When the last user of the mm has gone, instead of synchronously freeing
> > the mm, the attached process is awoken. The syscall will then continue
> > and clean up the target mm.
> >
> > This solution has the advantage that the cleanup of the target mm can
> > both be asynchronous and properly accounted for (e.g. cgroups).
> >
> > Tested on s390x.
> >
> > A separate patch will actually wire up the syscall.  
> 
> I am a bit confused.
> 
> You want the process to report that it has finished immediately,
> and you want the cleanup work to continue on in the background.
> 
> Why do you need a separate process?
> 
> Why not just modify the process cleanup code to keep the task_struct
> running while allowing waitpid to reap the process (aka allowing
> release_task to run)?  All tasks can already be reaped after
> exit_notify in do_exit.
> 
> I can see some reasons for wanting an opt-in.  It is nice to know all of
> a process's resources have been freed when waitpid succeeds.
> 
> Still I don't see why this whole thing isn't exit_mm returning
> the mm_struct when a flag is set, and then having an exit_mm_late
> being called and passed the returned mm after exit_notify.

nevermind, exit_notify is done after cgroup_exit, the teardown would
then not be accounted properly

> 
> Or maybe something with schedule_work or task_work, instead of an
> exit_mm_late.  I don't see any practical difference.
> 
> I really don't see why this needs a whole other process to connect to
> the process you care about asynchronously.
> 
> This whole thing seems an exercise in spending lots of resources to free
> resources much later.
> 
> Eric


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC v1 0/4] Two alternatives for mm async teardown
  2021-11-11  9:50 [RFC v1 0/4] Two alternatives for mm async teardown Claudio Imbrenda
                   ` (4 preceding siblings ...)
  2021-11-11  9:50 ` [RFC v1 4/4] kernel/fork.c: process_mmput_async: stop OOM while freeing memory Claudio Imbrenda
@ 2021-11-12 10:15 ` Michal Hocko
  5 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2021-11-12 10:15 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: linux-kernel, linux-mm, thuth, frankja, borntraeger,
	Ulrich.Weigand, heiko.carstens, david, ultrachin, akpm, vbabka,
	brookxu.cn, xiaoggchen, linuszeng, yihuilu, daniel.m.jordan,
	axboe, legion, peterz, aarcange, christian, ebiederm, tglx

On Thu 11-11-21 10:50:03, Claudio Imbrenda wrote:
> This RFC series proposes two possible ways for enabling asynchronous mm
> teardown.

It would be great to describe an intended use case here and also explain
why the existing features do not allow the required functionality.

Please also make sure to cc linux-api when adding a new user visible
interface or changing a visible behavior of existing one.

E.g. why can't you simply create a process outside of the thread group
that shares the mm with your task? Once the other process exits, which
you can detect, you just exit that process and do the final cleanup
from that context.

> The first approach, in patch 1, is simply to provide an arch hook in
> exit_mm. This has no functional change for archs that don't explicitly
> use the hook, and leaves the hard part to arch code (including
> accounting, if any).

This is just too vague but I have to say I am not really a fan of hooks
that considerably change the existing behavior.

> The second approach, in patches 2 to 4, adds a new syscall to allow an
> mm to be asynchronously torn down in the context of another process
> (similarly to how process_mrelease works). It also adds an OOM notifier
> to prevent the OOM killer from killing processes while the teardown is
> in progress.

I have to say I do not like the OOM notifier part at all. You can have
different sources of the OOM (memcg, cpusets or global oom). It is
impossible to tell those apart in the notifier. Not to mention that
memcg oom is explicitly avoiding notifiers altogether.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC v1 2/4] kernel/fork.c: implement new process_mmput_async syscall
  2021-11-12  9:34     ` Claudio Imbrenda
@ 2021-11-12 14:57       ` Eric W. Biederman
  2021-11-12 16:53         ` Claudio Imbrenda
  0 siblings, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2021-11-12 14:57 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: linux-kernel, linux-mm, thuth, frankja, borntraeger,
	Ulrich.Weigand, david, ultrachin, akpm, vbabka, brookxu.cn,
	xiaoggchen, linuszeng, yihuilu, mhocko, daniel.m.jordan, axboe,
	legion, peterz, aarcange, christian, tglx

Claudio Imbrenda <imbrenda@linux.ibm.com> writes:

> On Thu, 11 Nov 2021 13:20:11 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> Claudio Imbrenda <imbrenda@linux.ibm.com> writes:
>> 
>> > The goal of this new syscall is to be able to asynchronously free the
>> > mm of a dying process. This is especially useful for processes that use
>> > huge amounts of memory (e.g. databases or KVM guests). The process is
>> > allowed to terminate immediately, while its mm is cleaned/reclaimed
>> > asynchronously.
>> >
>> > A separate process needs to use the process_mmput_async syscall to attach
>> > itself to the mm of a running target process. The process will then
>> > sleep until the last user of the target mm has gone.
>> >
>> > When the last user of the mm has gone, instead of synchronously freeing
>> > the mm, the attached process is awoken. The syscall will then continue
>> > and clean up the target mm.
>> >
>> > This solution has the advantage that the cleanup of the target mm can
>> > both be asynchronous and properly accounted for (e.g. cgroups).
>> >
>> > Tested on s390x.
>> >
>> > A separate patch will actually wire up the syscall.  
>> 
>> I am a bit confused.
>> 
>> You want the process to report that it has finished immediately,
>> and you want the cleanup work to continue on in the background.
>> 
>> Why do you need a separate process?
>> 
>> Why not just modify the process cleanup code to keep the task_struct
>> running while allowing waitpid to reap the process (aka allowing
>> release_task to run)?  All tasks can already be reaped after
>> exit_notify in do_exit.
>> 
>> I can see some reasons for wanting an opt-in.  It is nice to know all of
>> a process's resources have been freed when waitpid succeeds.
>> 
>> Still I don't see why this whole thing isn't exit_mm returning
>> the mm_struct when a flag is set, and then having an exit_mm_late
>> being called and passed the returned mm after exit_notify.
>
> nevermind, exit_notify is done after cgroup_exit, the teardown would
> then not be accounted properly

So you want this new mechanism so you can separate the cleanup from
the exit notification, and so that things are accounted properly.

It would have helped if you had included a link to the previous
conversation.

I think Michal Hocko has a point.  This looks like a job for
"clone(CLONE_VM)" and "prctl(PR_SET_PDEATHSIG)".  Maybe using a pidfd
instead of the prctl.

AKA just create a child that shares the parents memory, and waits for
the parent to exit and then cleans things up.

That should not need any new kernel mechanisms.



There is the other question: why is this disastrously slow on s390?
Is this an s390-specific issue?  Can the issue be fixed by optimizing
what is happening on s390?

Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC v1 2/4] kernel/fork.c: implement new process_mmput_async syscall
  2021-11-12 14:57       ` Eric W. Biederman
@ 2021-11-12 16:53         ` Claudio Imbrenda
  2021-11-15 10:43           ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Claudio Imbrenda @ 2021-11-12 16:53 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-mm, thuth, frankja, borntraeger,
	Ulrich.Weigand, david, ultrachin, akpm, vbabka, brookxu.cn,
	xiaoggchen, linuszeng, yihuilu, mhocko, daniel.m.jordan, axboe,
	legion, peterz, aarcange, christian, tglx

On Fri, 12 Nov 2021 08:57:13 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Claudio Imbrenda <imbrenda@linux.ibm.com> writes:
> 
> > On Thu, 11 Nov 2021 13:20:11 -0600
> > ebiederm@xmission.com (Eric W. Biederman) wrote:
> >  
> >> Claudio Imbrenda <imbrenda@linux.ibm.com> writes:
> >>   
> >> > The goal of this new syscall is to be able to asynchronously free the
> >> > mm of a dying process. This is especially useful for processes that use
> >> > huge amounts of memory (e.g. databases or KVM guests). The process is
> >> > allowed to terminate immediately, while its mm is cleaned/reclaimed
> >> > asynchronously.
> >> >
> >> > A separate process needs to use the process_mmput_async syscall to attach
> >> > itself to the mm of a running target process. The process will then
> >> > sleep until the last user of the target mm has gone.
> >> >
> >> > When the last user of the mm has gone, instead of synchronously freeing
> >> > the mm, the attached process is awoken. The syscall will then continue
> >> > and clean up the target mm.
> >> >
> >> > This solution has the advantage that the cleanup of the target mm can
> >> > both be asynchronous and properly accounted for (e.g. cgroups).
> >> >
> >> > Tested on s390x.
> >> >
> >> > A separate patch will actually wire up the syscall.    
> >> 
> >> I am a bit confused.
> >> 
> >> You want the process to report that it has finished immediately,
> >> and you want the cleanup work to continue on in the background.
> >> 
> >> Why do you need a separate process?
> >> 
> >> Why not just modify the process cleanup code to keep the task_struct
> >> running while allowing waitpid to reap the process (aka allowing
> >> release_task to run)?  All tasks can already be reaped after
> >> exit_notify in do_exit.
> >> 
> >> I can see some reasons for wanting an opt-in.  It is nice to know all of
> >> a process's resources have been freed when waitpid succeeds.
> >> 
> >> Still I don't see why this whole thing isn't exit_mm returning
> >> the mm_struct when a flag is set, and then having an exit_mm_late
> >> being called and passed the returned mm after exit_notify.  
> >
> > nevermind, exit_notify is done after cgroup_exit, the teardown would
> > then not be accounted properly  
> 
> So you want this new mechanism so you can separate the cleanup from
> the exit notification, and so that things are accounted properly.
> 
> It would have helped if you had included a link to the previous
> conversation.
> 
> I think Michal Hocko has a point.  This looks like a job for
> "clone(CLONE_VM)" and "prctl(PR_SET_PDEATHSIG)".  Maybe using a pidfd
> instead of the prctl.
> 
> AKA just create a child that shares the parents memory, and waits for
> the parent to exit and then cleans things up.
> 
> That should not need any new kernel mechanisms.

Of course, but this also means that it's not possible to stop the OOM
killer while the teardown is in progress, and it will require userspace
changes on a case-by-case basis.

Anyway, I will try to kludge something together with clone

> 
> 
> 
> There is the other question: why is this disastrously slow on s390?
> Is this an s390-specific issue?  Can the issue be fixed by optimizing

It's a hardware issue with protected VMs, which are achieved on s390x
without memory encryption. When a protected VM terminates, the
secure/trusted firmware needs to clear all protected memory, and change
the security properties to make it accessible. This last step in
particular takes more time than just clearing memory.

> what is happening on s390?
> 
> Eric


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC v1 2/4] kernel/fork.c: implement new process_mmput_async syscall
  2021-11-12 16:53         ` Claudio Imbrenda
@ 2021-11-15 10:43           ` Michal Hocko
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2021-11-15 10:43 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: Eric W. Biederman, linux-kernel, linux-mm, thuth, frankja,
	borntraeger, Ulrich.Weigand, david, ultrachin, akpm, vbabka,
	brookxu.cn, xiaoggchen, linuszeng, yihuilu, daniel.m.jordan,
	axboe, legion, peterz, aarcange, christian, tglx

On Fri 12-11-21 17:53:09, Claudio Imbrenda wrote:
[...]
> Of course, but this also means that it's not possible to stop the OOM
> killer while the teardown is in progress,

Blocking the OOM killer and depending on userspace to make forward
progress is not an acceptable solution.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-11-15 10:44 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-11  9:50 [RFC v1 0/4] Two alternatives for mm async teardown Claudio Imbrenda
2021-11-11  9:50 ` [RFC v1 1/4] add arch mmput hook in exit.c Claudio Imbrenda
2021-11-11  9:50 ` [RFC v1 1/4] exit: add arch mmput hook in exit_mm Claudio Imbrenda
2021-11-11 18:43   ` Eric W. Biederman
2021-11-11  9:50 ` [RFC v1 2/4] kernel/fork.c: implement new process_mmput_async syscall Claudio Imbrenda
2021-11-11 19:20   ` Eric W. Biederman
2021-11-12  9:27     ` Claudio Imbrenda
2021-11-12  9:34     ` Claudio Imbrenda
2021-11-12 14:57       ` Eric W. Biederman
2021-11-12 16:53         ` Claudio Imbrenda
2021-11-15 10:43           ` Michal Hocko
2021-11-11  9:50 ` [RFC v1 3/4] mm: wire up the " Claudio Imbrenda
2021-11-11  9:50 ` [RFC v1 4/4] kernel/fork.c: process_mmput_async: stop OOM while freeing memory Claudio Imbrenda
2021-11-12 10:15 ` [RFC v1 0/4] Two alternatives for mm async teardown Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).