linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] userspace memory reaping using process_madvise
@ 2020-11-24  5:39 Suren Baghdasaryan
  2020-11-24  5:39 ` [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range Suren Baghdasaryan
  2020-11-24  5:39 ` [PATCH 2/2] mm/madvise: add process_madvise MADV_DONTNEER support Suren Baghdasaryan
  0 siblings, 2 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2020-11-24  5:39 UTC (permalink / raw)
  To: surenb
  Cc: akpm, mhocko, mhocko, rientjes, willy, hannes, guro, riel,
	minchan, christian, oleg, timmurray, linux-api, linux-mm,
	linux-kernel, kernel-team

When a process is being killed it might be in an uninterruptible sleep
which leads to an unpredictable delay in its memory reclaim. In low memory
situations, when it's important to free up memory quickly, such delay is
problematic. Kernel solves this problem with oom-reaper thread which
performs memory reclaim even when the victim process is not runnable.
Userspace currently lacks such mechanisms and the need and potential
solutions were discussed before (see links below).
This patchset provides a mechanism to perform memory reclaim of an
external process using process_madvise(MADV_DONTNEED). The chosen
mechanism is the result of the latest discussion at [4].
The first patch adds PMADV_FLAG_RANGE flag for process_madvise to operate
on large address ranges spanning multiple VMAs. Currently it supports only
the entire memory of a process. This is done to keep things simple and
since it's the only real usecase we currently know of. In the future this
can be developed further to support other large ranges. One way to do that
is suggested in [5].
The second patch enables MADV_DONTNEED behavior for process_madvise to
perform memory reclaim of an external process.

1. https://patchwork.kernel.org/cover/10894999
2. https://lwn.net/Articles/787217
3. https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com
4. https://lkml.org/lkml/2020/11/13/849
5. https://lkml.org/lkml/2020/11/18/1076

Suren Baghdasaryan (2):
  mm/madvise: allow process_madvise operations on entire memory range
  mm/madvise: add process_madvise MADV_DONTNEER support

 arch/alpha/include/uapi/asm/mman.h           |  4 +
 arch/mips/include/uapi/asm/mman.h            |  4 +
 arch/parisc/include/uapi/asm/mman.h          |  4 +
 arch/xtensa/include/uapi/asm/mman.h          |  4 +
 fs/io_uring.c                                |  2 +-
 include/linux/mm.h                           |  3 +-
 include/uapi/asm-generic/mman-common.h       |  4 +
 mm/madvise.c                                 | 81 ++++++++++++++++++--
 tools/include/uapi/asm-generic/mman-common.h |  4 +
 9 files changed, 101 insertions(+), 9 deletions(-)

-- 
2.29.2.454.gaff20da3a2-goog


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range
  2020-11-24  5:39 [PATCH 0/2] userspace memory reaping using process_madvise Suren Baghdasaryan
@ 2020-11-24  5:39 ` Suren Baghdasaryan
  2020-11-25 23:13   ` Minchan Kim
  2020-11-24  5:39 ` [PATCH 2/2] mm/madvise: add process_madvise MADV_DONTNEER support Suren Baghdasaryan
  1 sibling, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2020-11-24  5:39 UTC (permalink / raw)
  To: surenb
  Cc: akpm, mhocko, mhocko, rientjes, willy, hannes, guro, riel,
	minchan, christian, oleg, timmurray, linux-api, linux-mm,
	linux-kernel, kernel-team

process_madvise requires a vector of address ranges to be provided for
its operations. When an advice should be applied to the entire process,
the caller process has to obtain the list of VMAs of the target process
by reading the /proc/pid/maps or some other way. The cost of this
operation grows linearly with increasing number of VMAs in the target
process. Even constructing the input vector can be non-trivial when
target process has several thousands of VMAs and the syscall is being
issued during high memory pressure period when new allocations for such
a vector would only worsen the situation.
In the case when advice is being applied to the entire memory space of
the target process, this creates an extra overhead.
Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
advise a memory range of the target process. For now, to keep it simple,
only the entire process memory range is supported, vec and vlen inputs
in this mode are ignored and can be NULL and 0.
Instead of returning the number of bytes that advice was successfully
applied to, the syscall in this mode returns 0 on success. This is due
to the fact that the number of bytes would not be useful for the caller
that does not know the amount of memory the call is supposed to affect.
Besides, the ssize_t return type can be too small to hold the number of
bytes affected when the operation is applied to a large memory range.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/alpha/include/uapi/asm/mman.h           |  4 ++
 arch/mips/include/uapi/asm/mman.h            |  4 ++
 arch/parisc/include/uapi/asm/mman.h          |  4 ++
 arch/xtensa/include/uapi/asm/mman.h          |  4 ++
 fs/io_uring.c                                |  2 +-
 include/linux/mm.h                           |  3 +-
 include/uapi/asm-generic/mman-common.h       |  4 ++
 mm/madvise.c                                 | 47 +++++++++++++++++---
 tools/include/uapi/asm-generic/mman-common.h |  4 ++
 9 files changed, 67 insertions(+), 9 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index a18ec7f63888..54588d2f5406 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -79,4 +79,8 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+/* process_madvise flags */
+#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
+#define PMADV_FLAG_MASK	(PMADV_FLAG_RANGE)
+
 #endif /* __ALPHA_MMAN_H__ */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 57dc2ac4f8bd..af94f38a3a9d 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -106,4 +106,8 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+/* process_madvise flags */
+#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
+#define PMADV_FLAG_MASK	(PMADV_FLAG_RANGE)
+
 #endif /* _ASM_MMAN_H */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index ab78cba446ed..ae644c493991 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -77,4 +77,8 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+/* process_madvise flags */
+#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
+#define PMADV_FLAG_MASK	(PMADV_FLAG_RANGE)
+
 #endif /* __PARISC_MMAN_H__ */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index e5e643752947..934cdd11abff 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -114,4 +114,8 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+/* process_madvise flags */
+#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
+#define PMADV_FLAG_MASK	(PMADV_FLAG_RANGE)
+
 #endif /* _XTENSA_MMAN_H */
diff --git a/fs/io_uring.c b/fs/io_uring.c
index a8c136a1cf4e..508c48b998ee 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4118,7 +4118,7 @@ static int io_madvise(struct io_kiocb *req, bool force_nonblock)
 	if (force_nonblock)
 		return -EAGAIN;
 
-	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice);
+	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice, 0);
 	if (ret < 0)
 		req_set_fail_links(req);
 	io_req_complete(req, ret);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..414c7639e394 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2579,7 +2579,8 @@ extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
 		       struct list_head *uf, bool downgrade);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
-extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
+extern int do_madvise(struct mm_struct *mm, unsigned long start, unsigned long len_in,
+		      int behavior, unsigned int flags);
 
 #ifdef CONFIG_MMU
 extern int __mm_populate(unsigned long addr, unsigned long len,
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d429be..4898034593ec 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -80,4 +80,8 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+/* process_madvise flags */
+#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
+#define PMADV_FLAG_MASK	(PMADV_FLAG_RANGE)
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index a8d8d48a57fe..1aa074a46524 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1001,6 +1001,14 @@ process_madvise_behavior_valid(int behavior)
 	}
 }
 
+static bool can_range_madv_lru_vma(struct vm_area_struct *vma, int behavior)
+{
+	if (!can_madv_lru_vma(vma))
+		return false;
+
+	return true;
+}
+
 /*
  * The madvise(2) system call.
  *
@@ -1067,15 +1075,21 @@ process_madvise_behavior_valid(int behavior)
  *  -EBADF  - map exists, but area maps something that isn't a file.
  *  -EAGAIN - a kernel resource was temporarily unavailable.
  */
-int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
+int do_madvise(struct mm_struct *mm, unsigned long start, unsigned long len_in,
+	       int behavior, unsigned int flags)
 {
 	unsigned long end, tmp;
 	struct vm_area_struct *vma, *prev;
 	int unmapped_error = 0;
 	int error = -EINVAL;
+	int error_on_gap;
 	int write;
-	size_t len;
+	unsigned long len;
 	struct blk_plug plug;
+	bool range_madvise = !!(flags & PMADV_FLAG_RANGE);
+
+	/* For range operations gap between VMAs is normal */
+	error_on_gap = range_madvise ? 0 : -ENOMEM;
 
 	start = untagged_addr(start);
 
@@ -1123,13 +1137,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	blk_start_plug(&plug);
 	for (;;) {
 		/* Still start < end. */
-		error = -ENOMEM;
+		error = error_on_gap;
+
 		if (!vma)
 			goto out;
 
 		/* Here start < (end|vma->vm_end). */
 		if (start < vma->vm_start) {
-			unmapped_error = -ENOMEM;
+			unmapped_error = error_on_gap;
 			start = vma->vm_start;
 			if (start >= end)
 				goto out;
@@ -1140,10 +1155,18 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 		if (end < tmp)
 			tmp = end;
 
+		/* For range operations skip VMAs ineligible for the behavior */
+		if (range_madvise && !can_range_madv_lru_vma(vma, behavior)) {
+			prev = vma;
+			goto skip_vma;
+		}
+
 		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
 		error = madvise_vma(vma, &prev, start, tmp, behavior);
 		if (error)
 			goto out;
+
+skip_vma:
 		start = tmp;
 		if (prev && start < prev->vm_end)
 			start = prev->vm_end;
@@ -1167,7 +1190,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 
 SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
-	return do_madvise(current->mm, start, len_in, behavior);
+	return do_madvise(current->mm, start, len_in, behavior, 0);
 }
 
 SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
@@ -1183,7 +1206,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 	size_t total_len;
 	unsigned int f_flags;
 
-	if (flags != 0) {
+	if (flags & ~PMADV_FLAG_MASK) {
 		ret = -EINVAL;
 		goto out;
 	}
@@ -1216,12 +1239,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto release_task;
 	}
 
+	/*
+	 * For range madvise only the entire address space is supported for now
+	 * and input iovec is ignored.
+	 */
+	if (flags & PMADV_FLAG_RANGE) {
+		ret = do_madvise(mm, 0, ULONG_MAX & PAGE_MASK, behavior, flags);
+		goto release_mm;
+	}
+
 	total_len = iov_iter_count(&iter);
 
 	while (iov_iter_count(&iter)) {
 		iovec = iov_iter_iovec(&iter);
 		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
-					iovec.iov_len, behavior);
+					iovec.iov_len, behavior, flags);
 		if (ret < 0)
 			break;
 		iov_iter_advance(&iter, iovec.iov_len);
@@ -1230,6 +1262,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 	if (ret == 0)
 		ret = total_len - iov_iter_count(&iter);
 
+release_mm:
 	mmput(mm);
 release_task:
 	put_task_struct(task);
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index f94f65d429be..4898034593ec 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -80,4 +80,8 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+/* process_madvise flags */
+#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
+#define PMADV_FLAG_MASK	(PMADV_FLAG_RANGE)
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
-- 
2.29.2.454.gaff20da3a2-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/2] mm/madvise: add process_madvise MADV_DONTNEER support
  2020-11-24  5:39 [PATCH 0/2] userspace memory reaping using process_madvise Suren Baghdasaryan
  2020-11-24  5:39 ` [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range Suren Baghdasaryan
@ 2020-11-24  5:39 ` Suren Baghdasaryan
  2020-11-24 13:42   ` Oleg Nesterov
  2020-12-08 23:40   ` Jann Horn
  1 sibling, 2 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2020-11-24  5:39 UTC (permalink / raw)
  To: surenb
  Cc: akpm, mhocko, mhocko, rientjes, willy, hannes, guro, riel,
	minchan, christian, oleg, timmurray, linux-api, linux-mm,
	linux-kernel, kernel-team

In modern systems it's not unusual to have a system component monitoring
memory conditions of the system and tasked with keeping system memory
pressure under control. One way to accomplish that is to kill
non-essential processes to free up memory for more important ones.
Examples of this are Facebook's OOM killer daemon called oomd and
Android's low memory killer daemon called lmkd.
For such system component it's important to be able to free memory
quickly and efficiently. Unfortunately the time process takes to free
up its memory after receiving a SIGKILL might vary based on the state
of the process (uninterruptible sleep), size and OPP level of the core
the process is running.
In such situation it is desirable to be able to free up the memory of the
process being killed in a more controlled way.
Enable MADV_DONTNEED to be used with process_madvise when applied to a
dying process to reclaim its memory. This would allow userspace system
components like oomd and lmkd to free memory of the target process in
a more predictable way.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/madvise.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 1aa074a46524..11306534369e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -29,6 +29,7 @@
 #include <linux/swapops.h>
 #include <linux/shmem_fs.h>
 #include <linux/mmu_notifier.h>
+#include <linux/oom.h>
 
 #include <asm/tlb.h>
 
@@ -995,6 +996,18 @@ process_madvise_behavior_valid(int behavior)
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
+	case MADV_DONTNEED:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static bool madvise_destructive(int behavior)
+{
+	switch (behavior) {
+	case MADV_DONTNEED:
+	case MADV_FREE:
 		return true;
 	default:
 		return false;
@@ -1006,6 +1019,10 @@ static bool can_range_madv_lru_vma(struct vm_area_struct *vma, int behavior)
 	if (!can_madv_lru_vma(vma))
 		return false;
 
+	/* For destructive madvise skip shared file-backed VMAs */
+	if (madvise_destructive(behavior))
+		return vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED);
+
 	return true;
 }
 
@@ -1239,6 +1256,23 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto release_task;
 	}
 
+	if (madvise_destructive(behavior)) {
+		/* Allow destructive madvise only on a dying processes */
+		if (!signal_group_exit(task->signal)) {
+			ret = -EINVAL;
+			goto release_mm;
+		}
+		/* Ensure no competition with OOM-killer to avoid contention */
+		if (unlikely(mm_is_oom_victim(mm)) ||
+		    unlikely(test_bit(MMF_OOM_SKIP, &mm->flags))) {
+			/* Already being reclaimed */
+			ret = 0;
+			goto release_mm;
+		}
+		/* Mark mm as unstable */
+		set_bit(MMF_UNSTABLE, &mm->flags);
+	}
+
 	/*
 	 * For range madvise only the entire address space is supported for now
 	 * and input iovec is ignored.
-- 
2.29.2.454.gaff20da3a2-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] mm/madvise: add process_madvise MADV_DONTNEER support
  2020-11-24  5:39 ` [PATCH 2/2] mm/madvise: add process_madvise MADV_DONTNEER support Suren Baghdasaryan
@ 2020-11-24 13:42   ` Oleg Nesterov
  2020-11-24 16:42     ` Suren Baghdasaryan
  2020-12-08 23:40   ` Jann Horn
  1 sibling, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2020-11-24 13:42 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, mhocko, mhocko, rientjes, willy, hannes, guro, riel,
	minchan, christian, timmurray, linux-api, linux-mm, linux-kernel,
	kernel-team

On 11/23, Suren Baghdasaryan wrote:
>
> +	if (madvise_destructive(behavior)) {
> +		/* Allow destructive madvise only on a dying processes */
> +		if (!signal_group_exit(task->signal)) {

signal_group_exit(task) is true if this task execs and kills other threads,
see the comment above this helper.

I think you need !(task->signal->flags & SIGNAL_GROUP_EXIT).

Oleg.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] mm/madvise: add process_madvise MADV_DONTNEER support
  2020-11-24 13:42   ` Oleg Nesterov
@ 2020-11-24 16:42     ` Suren Baghdasaryan
  0 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2020-11-24 16:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Tim Murray, linux-api, linux-mm,
	LKML, kernel-team

On Tue, Nov 24, 2020 at 5:42 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 11/23, Suren Baghdasaryan wrote:
> >
> > +     if (madvise_destructive(behavior)) {
> > +             /* Allow destructive madvise only on a dying processes */
> > +             if (!signal_group_exit(task->signal)) {
>
> signal_group_exit(task) is true if this task execs and kills other threads,
> see the comment above this helper.
>
> I think you need !(task->signal->flags & SIGNAL_GROUP_EXIT).

I see. Thanks for the feedback, Oleg. I'll test and fix it in the next version.

>
> Oleg.
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range
  2020-11-24  5:39 ` [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range Suren Baghdasaryan
@ 2020-11-25 23:13   ` Minchan Kim
  2020-11-25 23:23     ` Suren Baghdasaryan
  2020-12-11 20:27     ` Jann Horn
  0 siblings, 2 replies; 20+ messages in thread
From: Minchan Kim @ 2020-11-25 23:13 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, mhocko, mhocko, rientjes, willy, hannes, guro, riel,
	christian, oleg, timmurray, linux-api, linux-mm, linux-kernel,
	kernel-team

On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> process_madvise requires a vector of address ranges to be provided for
> its operations. When an advice should be applied to the entire process,
> the caller process has to obtain the list of VMAs of the target process
> by reading the /proc/pid/maps or some other way. The cost of this
> operation grows linearly with increasing number of VMAs in the target
> process. Even constructing the input vector can be non-trivial when
> target process has several thousands of VMAs and the syscall is being
> issued during high memory pressure period when new allocations for such
> a vector would only worsen the situation.
> In the case when advice is being applied to the entire memory space of
> the target process, this creates an extra overhead.
> Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> advise a memory range of the target process. For now, to keep it simple,
> only the entire process memory range is supported, vec and vlen inputs
> in this mode are ignored and can be NULL and 0.
> Instead of returning the number of bytes that advice was successfully
> applied to, the syscall in this mode returns 0 on success. This is due
> to the fact that the number of bytes would not be useful for the caller
> that does not know the amount of memory the call is supposed to affect.
> Besides, the ssize_t return type can be too small to hold the number of
> bytes affected when the operation is applied to a large memory range.

Can we just use one element in iovec to indicate entire address rather
than using up the reserved flags?

	struct iovec {
		.iov_base = NULL,
		.iov_len = (~(size_t)0),
	};

Furthermore, it would be applied for other syscalls where have support
iovec if we agree on it.

> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  arch/alpha/include/uapi/asm/mman.h           |  4 ++
>  arch/mips/include/uapi/asm/mman.h            |  4 ++
>  arch/parisc/include/uapi/asm/mman.h          |  4 ++
>  arch/xtensa/include/uapi/asm/mman.h          |  4 ++
>  fs/io_uring.c                                |  2 +-
>  include/linux/mm.h                           |  3 +-
>  include/uapi/asm-generic/mman-common.h       |  4 ++
>  mm/madvise.c                                 | 47 +++++++++++++++++---
>  tools/include/uapi/asm-generic/mman-common.h |  4 ++
>  9 files changed, 67 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index a18ec7f63888..54588d2f5406 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -79,4 +79,8 @@
>  #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
>  				 PKEY_DISABLE_WRITE)
>  
> +/* process_madvise flags */
> +#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
> +#define PMADV_FLAG_MASK	(PMADV_FLAG_RANGE)
> +
>  #endif /* __ALPHA_MMAN_H__ */
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index 57dc2ac4f8bd..af94f38a3a9d 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -106,4 +106,8 @@
>  #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
>  				 PKEY_DISABLE_WRITE)
>  
> +/* process_madvise flags */
> +#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
> +#define PMADV_FLAG_MASK	(PMADV_FLAG_RANGE)
> +
>  #endif /* _ASM_MMAN_H */
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index ab78cba446ed..ae644c493991 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -77,4 +77,8 @@
>  #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
>  				 PKEY_DISABLE_WRITE)
>  
> +/* process_madvise flags */
> +#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
> +#define PMADV_FLAG_MASK	(PMADV_FLAG_RANGE)
> +
>  #endif /* __PARISC_MMAN_H__ */
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index e5e643752947..934cdd11abff 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -114,4 +114,8 @@
>  #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
>  				 PKEY_DISABLE_WRITE)
>  
> +/* process_madvise flags */
> +#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
> +#define PMADV_FLAG_MASK	(PMADV_FLAG_RANGE)
> +
>  #endif /* _XTENSA_MMAN_H */
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index a8c136a1cf4e..508c48b998ee 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -4118,7 +4118,7 @@ static int io_madvise(struct io_kiocb *req, bool force_nonblock)
>  	if (force_nonblock)
>  		return -EAGAIN;
>  
> -	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice);
> +	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice, 0);
>  	if (ret < 0)
>  		req_set_fail_links(req);
>  	io_req_complete(req, ret);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index db6ae4d3fb4e..414c7639e394 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2579,7 +2579,8 @@ extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
>  		       struct list_head *uf, bool downgrade);
>  extern int do_munmap(struct mm_struct *, unsigned long, size_t,
>  		     struct list_head *uf);
> -extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
> +extern int do_madvise(struct mm_struct *mm, unsigned long start, unsigned long len_in,
> +		      int behavior, unsigned int flags);
>  
>  #ifdef CONFIG_MMU
>  extern int __mm_populate(unsigned long addr, unsigned long len,
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index f94f65d429be..4898034593ec 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -80,4 +80,8 @@
>  #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
>  				 PKEY_DISABLE_WRITE)
>  
> +/* process_madvise flags */
> +#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
> +#define PMADV_FLAG_MASK	(PMADV_FLAG_RANGE)
> +
>  #endif /* __ASM_GENERIC_MMAN_COMMON_H */
> diff --git a/mm/madvise.c b/mm/madvise.c
> index a8d8d48a57fe..1aa074a46524 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1001,6 +1001,14 @@ process_madvise_behavior_valid(int behavior)
>  	}
>  }
>  
> +static bool can_range_madv_lru_vma(struct vm_area_struct *vma, int behavior)
> +{
> +	if (!can_madv_lru_vma(vma))
> +		return false;
> +
> +	return true;
> +}
> +
>  /*
>   * The madvise(2) system call.
>   *
> @@ -1067,15 +1075,21 @@ process_madvise_behavior_valid(int behavior)
>   *  -EBADF  - map exists, but area maps something that isn't a file.
>   *  -EAGAIN - a kernel resource was temporarily unavailable.
>   */
> -int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> +int do_madvise(struct mm_struct *mm, unsigned long start, unsigned long len_in,
> +	       int behavior, unsigned int flags)
>  {
>  	unsigned long end, tmp;
>  	struct vm_area_struct *vma, *prev;
>  	int unmapped_error = 0;
>  	int error = -EINVAL;
> +	int error_on_gap;
>  	int write;
> -	size_t len;
> +	unsigned long len;
>  	struct blk_plug plug;
> +	bool range_madvise = !!(flags & PMADV_FLAG_RANGE);
> +
> +	/* For range operations gap between VMAs is normal */
> +	error_on_gap = range_madvise ? 0 : -ENOMEM;
>  
>  	start = untagged_addr(start);
>  
> @@ -1123,13 +1137,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>  	blk_start_plug(&plug);
>  	for (;;) {
>  		/* Still start < end. */
> -		error = -ENOMEM;
> +		error = error_on_gap;
> +
>  		if (!vma)
>  			goto out;
>  
>  		/* Here start < (end|vma->vm_end). */
>  		if (start < vma->vm_start) {
> -			unmapped_error = -ENOMEM;
> +			unmapped_error = error_on_gap;
>  			start = vma->vm_start;
>  			if (start >= end)
>  				goto out;
> @@ -1140,10 +1155,18 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>  		if (end < tmp)
>  			tmp = end;
>  
> +		/* For range operations skip VMAs ineligible for the behavior */
> +		if (range_madvise && !can_range_madv_lru_vma(vma, behavior)) {
> +			prev = vma;
> +			goto skip_vma;
> +		}
> +
>  		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
>  		error = madvise_vma(vma, &prev, start, tmp, behavior);
>  		if (error)
>  			goto out;
> +
> +skip_vma:
>  		start = tmp;
>  		if (prev && start < prev->vm_end)
>  			start = prev->vm_end;
> @@ -1167,7 +1190,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>  
>  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  {
> -	return do_madvise(current->mm, start, len_in, behavior);
> +	return do_madvise(current->mm, start, len_in, behavior, 0);
>  }
>  
>  SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> @@ -1183,7 +1206,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>  	size_t total_len;
>  	unsigned int f_flags;
>  
> -	if (flags != 0) {
> +	if (flags & ~PMADV_FLAG_MASK) {
>  		ret = -EINVAL;
>  		goto out;
>  	}
> @@ -1216,12 +1239,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>  		goto release_task;
>  	}
>  
> +	/*
> +	 * For range madvise only the entire address space is supported for now
> +	 * and input iovec is ignored.
> +	 */
> +	if (flags & PMADV_FLAG_RANGE) {
> +		ret = do_madvise(mm, 0, ULONG_MAX & PAGE_MASK, behavior, flags);
> +		goto release_mm;
> +	}
> +
>  	total_len = iov_iter_count(&iter);
>  
>  	while (iov_iter_count(&iter)) {
>  		iovec = iov_iter_iovec(&iter);
>  		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
> -					iovec.iov_len, behavior);
> +					iovec.iov_len, behavior, flags);
>  		if (ret < 0)
>  			break;
>  		iov_iter_advance(&iter, iovec.iov_len);
> @@ -1230,6 +1262,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>  	if (ret == 0)
>  		ret = total_len - iov_iter_count(&iter);
>  
> +release_mm:
>  	mmput(mm);
>  release_task:
>  	put_task_struct(task);
> diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> index f94f65d429be..4898034593ec 100644
> --- a/tools/include/uapi/asm-generic/mman-common.h
> +++ b/tools/include/uapi/asm-generic/mman-common.h
> @@ -80,4 +80,8 @@
>  #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
>  				 PKEY_DISABLE_WRITE)
>  
> +/* process_madvise flags */
> +#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
> +#define PMADV_FLAG_MASK	(PMADV_FLAG_RANGE)
> +
>  #endif /* __ASM_GENERIC_MMAN_COMMON_H */
> -- 
> 2.29.2.454.gaff20da3a2-goog
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range
  2020-11-25 23:13   ` Minchan Kim
@ 2020-11-25 23:23     ` Suren Baghdasaryan
  2020-11-25 23:43       ` Minchan Kim
  2020-12-11 20:27     ` Jann Horn
  1 sibling, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2020-11-25 23:23 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, linux-api,
	linux-mm, LKML, kernel-team

On Wed, Nov 25, 2020 at 3:13 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > process_madvise requires a vector of address ranges to be provided for
> > its operations. When an advice should be applied to the entire process,
> > the caller process has to obtain the list of VMAs of the target process
> > by reading the /proc/pid/maps or some other way. The cost of this
> > operation grows linearly with increasing number of VMAs in the target
> > process. Even constructing the input vector can be non-trivial when
> > target process has several thousands of VMAs and the syscall is being
> > issued during high memory pressure period when new allocations for such
> > a vector would only worsen the situation.
> > In the case when advice is being applied to the entire memory space of
> > the target process, this creates an extra overhead.
> > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > advise a memory range of the target process. For now, to keep it simple,
> > only the entire process memory range is supported, vec and vlen inputs
> > in this mode are ignored and can be NULL and 0.
> > Instead of returning the number of bytes that advice was successfully
> > applied to, the syscall in this mode returns 0 on success. This is due
> > to the fact that the number of bytes would not be useful for the caller
> > that does not know the amount of memory the call is supposed to affect.
> > Besides, the ssize_t return type can be too small to hold the number of
> > bytes affected when the operation is applied to a large memory range.
>
> Can we just use one element in iovec to indicate entire address rather
> than using up the reserved flags?
>
>         struct iovec {
>                 .iov_base = NULL,
>                 .iov_len = (~(size_t)0),
>         };
>
> Furthermore, it would be applied for other syscalls where have support
> iovec if we agree on it.
>

The flag also changes the return value semantics. If we follow your
suggestion we should also agree that in this mode the return value
will be 0 on success and negative otherwise instead of the number of
bytes madvise was applied to.

> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  arch/alpha/include/uapi/asm/mman.h           |  4 ++
> >  arch/mips/include/uapi/asm/mman.h            |  4 ++
> >  arch/parisc/include/uapi/asm/mman.h          |  4 ++
> >  arch/xtensa/include/uapi/asm/mman.h          |  4 ++
> >  fs/io_uring.c                                |  2 +-
> >  include/linux/mm.h                           |  3 +-
> >  include/uapi/asm-generic/mman-common.h       |  4 ++
> >  mm/madvise.c                                 | 47 +++++++++++++++++---
> >  tools/include/uapi/asm-generic/mman-common.h |  4 ++
> >  9 files changed, 67 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > index a18ec7f63888..54588d2f5406 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -79,4 +79,8 @@
> >  #define PKEY_ACCESS_MASK     (PKEY_DISABLE_ACCESS |\
> >                                PKEY_DISABLE_WRITE)
> >
> > +/* process_madvise flags */
> > +#define PMADV_FLAG_RANGE     0x1     /* advice for all VMAs in the range */
> > +#define PMADV_FLAG_MASK      (PMADV_FLAG_RANGE)
> > +
> >  #endif /* __ALPHA_MMAN_H__ */
> > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > index 57dc2ac4f8bd..af94f38a3a9d 100644
> > --- a/arch/mips/include/uapi/asm/mman.h
> > +++ b/arch/mips/include/uapi/asm/mman.h
> > @@ -106,4 +106,8 @@
> >  #define PKEY_ACCESS_MASK     (PKEY_DISABLE_ACCESS |\
> >                                PKEY_DISABLE_WRITE)
> >
> > +/* process_madvise flags */
> > +#define PMADV_FLAG_RANGE     0x1     /* advice for all VMAs in the range */
> > +#define PMADV_FLAG_MASK      (PMADV_FLAG_RANGE)
> > +
> >  #endif /* _ASM_MMAN_H */
> > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > index ab78cba446ed..ae644c493991 100644
> > --- a/arch/parisc/include/uapi/asm/mman.h
> > +++ b/arch/parisc/include/uapi/asm/mman.h
> > @@ -77,4 +77,8 @@
> >  #define PKEY_ACCESS_MASK     (PKEY_DISABLE_ACCESS |\
> >                                PKEY_DISABLE_WRITE)
> >
> > +/* process_madvise flags */
> > +#define PMADV_FLAG_RANGE     0x1     /* advice for all VMAs in the range */
> > +#define PMADV_FLAG_MASK      (PMADV_FLAG_RANGE)
> > +
> >  #endif /* __PARISC_MMAN_H__ */
> > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > index e5e643752947..934cdd11abff 100644
> > --- a/arch/xtensa/include/uapi/asm/mman.h
> > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > @@ -114,4 +114,8 @@
> >  #define PKEY_ACCESS_MASK     (PKEY_DISABLE_ACCESS |\
> >                                PKEY_DISABLE_WRITE)
> >
> > +/* process_madvise flags */
> > +#define PMADV_FLAG_RANGE     0x1     /* advice for all VMAs in the range */
> > +#define PMADV_FLAG_MASK      (PMADV_FLAG_RANGE)
> > +
> >  #endif /* _XTENSA_MMAN_H */
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index a8c136a1cf4e..508c48b998ee 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -4118,7 +4118,7 @@ static int io_madvise(struct io_kiocb *req, bool force_nonblock)
> >       if (force_nonblock)
> >               return -EAGAIN;
> >
> > -     ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice);
> > +     ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice, 0);
> >       if (ret < 0)
> >               req_set_fail_links(req);
> >       io_req_complete(req, ret);
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index db6ae4d3fb4e..414c7639e394 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2579,7 +2579,8 @@ extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
> >                      struct list_head *uf, bool downgrade);
> >  extern int do_munmap(struct mm_struct *, unsigned long, size_t,
> >                    struct list_head *uf);
> > -extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
> > +extern int do_madvise(struct mm_struct *mm, unsigned long start, unsigned long len_in,
> > +                   int behavior, unsigned int flags);
> >
> >  #ifdef CONFIG_MMU
> >  extern int __mm_populate(unsigned long addr, unsigned long len,
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index f94f65d429be..4898034593ec 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -80,4 +80,8 @@
> >  #define PKEY_ACCESS_MASK     (PKEY_DISABLE_ACCESS |\
> >                                PKEY_DISABLE_WRITE)
> >
> > +/* process_madvise flags */
> > +#define PMADV_FLAG_RANGE     0x1     /* advice for all VMAs in the range */
> > +#define PMADV_FLAG_MASK      (PMADV_FLAG_RANGE)
> > +
> >  #endif /* __ASM_GENERIC_MMAN_COMMON_H */
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index a8d8d48a57fe..1aa074a46524 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1001,6 +1001,14 @@ process_madvise_behavior_valid(int behavior)
> >       }
> >  }
> >
> > +static bool can_range_madv_lru_vma(struct vm_area_struct *vma, int behavior)
> > +{
> > +     if (!can_madv_lru_vma(vma))
> > +             return false;
> > +
> > +     return true;
> > +}
> > +
> >  /*
> >   * The madvise(2) system call.
> >   *
> > @@ -1067,15 +1075,21 @@ process_madvise_behavior_valid(int behavior)
> >   *  -EBADF  - map exists, but area maps something that isn't a file.
> >   *  -EAGAIN - a kernel resource was temporarily unavailable.
> >   */
> > -int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> > +int do_madvise(struct mm_struct *mm, unsigned long start, unsigned long len_in,
> > +            int behavior, unsigned int flags)
> >  {
> >       unsigned long end, tmp;
> >       struct vm_area_struct *vma, *prev;
> >       int unmapped_error = 0;
> >       int error = -EINVAL;
> > +     int error_on_gap;
> >       int write;
> > -     size_t len;
> > +     unsigned long len;
> >       struct blk_plug plug;
> > +     bool range_madvise = !!(flags & PMADV_FLAG_RANGE);
> > +
> > +     /* For range operations gap between VMAs is normal */
> > +     error_on_gap = range_madvise ? 0 : -ENOMEM;
> >
> >       start = untagged_addr(start);
> >
> > @@ -1123,13 +1137,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> >       blk_start_plug(&plug);
> >       for (;;) {
> >               /* Still start < end. */
> > -             error = -ENOMEM;
> > +             error = error_on_gap;
> > +
> >               if (!vma)
> >                       goto out;
> >
> >               /* Here start < (end|vma->vm_end). */
> >               if (start < vma->vm_start) {
> > -                     unmapped_error = -ENOMEM;
> > +                     unmapped_error = error_on_gap;
> >                       start = vma->vm_start;
> >                       if (start >= end)
> >                               goto out;
> > @@ -1140,10 +1155,18 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> >               if (end < tmp)
> >                       tmp = end;
> >
> > +             /* For range operations skip VMAs ineligible for the behavior */
> > +             if (range_madvise && !can_range_madv_lru_vma(vma, behavior)) {
> > +                     prev = vma;
> > +                     goto skip_vma;
> > +             }
> > +
> >               /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> >               error = madvise_vma(vma, &prev, start, tmp, behavior);
> >               if (error)
> >                       goto out;
> > +
> > +skip_vma:
> >               start = tmp;
> >               if (prev && start < prev->vm_end)
> >                       start = prev->vm_end;
> > @@ -1167,7 +1190,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> >
> >  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> >  {
> > -     return do_madvise(current->mm, start, len_in, behavior);
> > +     return do_madvise(current->mm, start, len_in, behavior, 0);
> >  }
> >
> >  SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > @@ -1183,7 +1206,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> >       size_t total_len;
> >       unsigned int f_flags;
> >
> > -     if (flags != 0) {
> > +     if (flags & ~PMADV_FLAG_MASK) {
> >               ret = -EINVAL;
> >               goto out;
> >       }
> > @@ -1216,12 +1239,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> >               goto release_task;
> >       }
> >
> > +     /*
> > +      * For range madvise only the entire address space is supported for now
> > +      * and input iovec is ignored.
> > +      */
> > +     if (flags & PMADV_FLAG_RANGE) {
> > +             ret = do_madvise(mm, 0, ULONG_MAX & PAGE_MASK, behavior, flags);
> > +             goto release_mm;
> > +     }
> > +
> >       total_len = iov_iter_count(&iter);
> >
> >       while (iov_iter_count(&iter)) {
> >               iovec = iov_iter_iovec(&iter);
> >               ret = do_madvise(mm, (unsigned long)iovec.iov_base,
> > -                                     iovec.iov_len, behavior);
> > +                                     iovec.iov_len, behavior, flags);
> >               if (ret < 0)
> >                       break;
> >               iov_iter_advance(&iter, iovec.iov_len);
> > @@ -1230,6 +1262,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> >       if (ret == 0)
> >               ret = total_len - iov_iter_count(&iter);
> >
> > +release_mm:
> >       mmput(mm);
> >  release_task:
> >       put_task_struct(task);
> > diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> > index f94f65d429be..4898034593ec 100644
> > --- a/tools/include/uapi/asm-generic/mman-common.h
> > +++ b/tools/include/uapi/asm-generic/mman-common.h
> > @@ -80,4 +80,8 @@
> >  #define PKEY_ACCESS_MASK     (PKEY_DISABLE_ACCESS |\
> >                                PKEY_DISABLE_WRITE)
> >
> > +/* process_madvise flags */
> > +#define PMADV_FLAG_RANGE     0x1     /* advice for all VMAs in the range */
> > +#define PMADV_FLAG_MASK      (PMADV_FLAG_RANGE)
> > +
> >  #endif /* __ASM_GENERIC_MMAN_COMMON_H */
> > --
> > 2.29.2.454.gaff20da3a2-goog
> >

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range
  2020-11-25 23:23     ` Suren Baghdasaryan
@ 2020-11-25 23:43       ` Minchan Kim
  2020-11-30 19:01         ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2020-11-25 23:43 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, linux-api,
	linux-mm, LKML, kernel-team

On Wed, Nov 25, 2020 at 03:23:40PM -0800, Suren Baghdasaryan wrote:
> On Wed, Nov 25, 2020 at 3:13 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > process_madvise requires a vector of address ranges to be provided for
> > > its operations. When an advice should be applied to the entire process,
> > > the caller process has to obtain the list of VMAs of the target process
> > > by reading the /proc/pid/maps or some other way. The cost of this
> > > operation grows linearly with increasing number of VMAs in the target
> > > process. Even constructing the input vector can be non-trivial when
> > > target process has several thousands of VMAs and the syscall is being
> > > issued during high memory pressure period when new allocations for such
> > > a vector would only worsen the situation.
> > > In the case when advice is being applied to the entire memory space of
> > > the target process, this creates an extra overhead.
> > > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > advise a memory range of the target process. For now, to keep it simple,
> > > only the entire process memory range is supported, vec and vlen inputs
> > > in this mode are ignored and can be NULL and 0.
> > > Instead of returning the number of bytes that advice was successfully
> > > applied to, the syscall in this mode returns 0 on success. This is due
> > > to the fact that the number of bytes would not be useful for the caller
> > > that does not know the amount of memory the call is supposed to affect.
> > > Besides, the ssize_t return type can be too small to hold the number of
> > > bytes affected when the operation is applied to a large memory range.
> >
> > Can we just use one element in iovec to indicate entire address rather
> > than using up the reserved flags?
> >
> >         struct iovec {
> >                 .iov_base = NULL,
> >                 .iov_len = (~(size_t)0),
> >         };
> >
> > Furthermore, it would be applied for other syscalls where have support
> > iovec if we agree on it.
> >
> 
> The flag also changes the return value semantics. If we follow your
> suggestion we should also agree that in this mode the return value
> will be 0 on success and negative otherwise instead of the number of
> bytes madvise was applied to.

Well, return value will depends on the each API. If the operation is
desruptive, it should return the right size affected by the API but
would be okay with 0 or error, otherwise.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range
  2020-11-25 23:43       ` Minchan Kim
@ 2020-11-30 19:01         ` Suren Baghdasaryan
  2020-12-08  7:23           ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2020-11-30 19:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, linux-api,
	linux-mm, LKML, kernel-team

On Wed, Nov 25, 2020 at 3:43 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Wed, Nov 25, 2020 at 03:23:40PM -0800, Suren Baghdasaryan wrote:
> > On Wed, Nov 25, 2020 at 3:13 PM Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > > process_madvise requires a vector of address ranges to be provided for
> > > > its operations. When an advice should be applied to the entire process,
> > > > the caller process has to obtain the list of VMAs of the target process
> > > > by reading the /proc/pid/maps or some other way. The cost of this
> > > > operation grows linearly with increasing number of VMAs in the target
> > > > process. Even constructing the input vector can be non-trivial when
> > > > target process has several thousands of VMAs and the syscall is being
> > > > issued during high memory pressure period when new allocations for such
> > > > a vector would only worsen the situation.
> > > > In the case when advice is being applied to the entire memory space of
> > > > the target process, this creates an extra overhead.
> > > > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > > advise a memory range of the target process. For now, to keep it simple,
> > > > only the entire process memory range is supported, vec and vlen inputs
> > > > in this mode are ignored and can be NULL and 0.
> > > > Instead of returning the number of bytes that advice was successfully
> > > > applied to, the syscall in this mode returns 0 on success. This is due
> > > > to the fact that the number of bytes would not be useful for the caller
> > > > that does not know the amount of memory the call is supposed to affect.
> > > > Besides, the ssize_t return type can be too small to hold the number of
> > > > bytes affected when the operation is applied to a large memory range.
> > >
> > > Can we just use one element in iovec to indicate entire address rather
> > > than using up the reserved flags?
> > >
> > >         struct iovec {
> > >                 .iov_base = NULL,
> > >                 .iov_len = (~(size_t)0),
> > >         };
> > >
> > > Furthermore, it would be applied for other syscalls where have support
> > > iovec if we agree on it.
> > >
> >
> > The flag also changes the return value semantics. If we follow your
> > suggestion we should also agree that in this mode the return value
> > will be 0 on success and negative otherwise instead of the number of
> > bytes madvise was applied to.
>
> Well, return value will depends on the each API. If the operation is
> desruptive, it should return the right size affected by the API but
> would be okay with 0 or error, otherwise.

I'm fine with dropping the flag, I just thought with the flag it would
be more explicit that this is a special mode operating on ranges. This
way the patch also becomes simpler.
Andrew, Michal, Christian, what do you think about such API? Should I
change the API this way / keep the flag / change it in some other way?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range
  2020-11-30 19:01         ` Suren Baghdasaryan
@ 2020-12-08  7:23           ` Suren Baghdasaryan
  0 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2020-12-08  7:23 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, linux-api,
	linux-mm, LKML, kernel-team

On Mon, Nov 30, 2020 at 11:01 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Nov 25, 2020 at 3:43 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Wed, Nov 25, 2020 at 03:23:40PM -0800, Suren Baghdasaryan wrote:
> > > On Wed, Nov 25, 2020 at 3:13 PM Minchan Kim <minchan@kernel.org> wrote:
> > > >
> > > > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > > > process_madvise requires a vector of address ranges to be provided for
> > > > > its operations. When an advice should be applied to the entire process,
> > > > > the caller process has to obtain the list of VMAs of the target process
> > > > > by reading the /proc/pid/maps or some other way. The cost of this
> > > > > operation grows linearly with increasing number of VMAs in the target
> > > > > process. Even constructing the input vector can be non-trivial when
> > > > > target process has several thousands of VMAs and the syscall is being
> > > > > issued during high memory pressure period when new allocations for such
> > > > > a vector would only worsen the situation.
> > > > > In the case when advice is being applied to the entire memory space of
> > > > > the target process, this creates an extra overhead.
> > > > > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > > > advise a memory range of the target process. For now, to keep it simple,
> > > > > only the entire process memory range is supported, vec and vlen inputs
> > > > > in this mode are ignored and can be NULL and 0.
> > > > > Instead of returning the number of bytes that advice was successfully
> > > > > applied to, the syscall in this mode returns 0 on success. This is due
> > > > > to the fact that the number of bytes would not be useful for the caller
> > > > > that does not know the amount of memory the call is supposed to affect.
> > > > > Besides, the ssize_t return type can be too small to hold the number of
> > > > > bytes affected when the operation is applied to a large memory range.
> > > >
> > > > Can we just use one element in iovec to indicate entire address rather
> > > > than using up the reserved flags?
> > > >
> > > >         struct iovec {
> > > >                 .iov_base = NULL,
> > > >                 .iov_len = (~(size_t)0),
> > > >         };
> > > >
> > > > Furthermore, it would be applied for other syscalls where have support
> > > > iovec if we agree on it.
> > > >
> > >
> > > The flag also changes the return value semantics. If we follow your
> > > suggestion we should also agree that in this mode the return value
> > > will be 0 on success and negative otherwise instead of the number of
> > > bytes madvise was applied to.
> >
> > Well, return value will depends on the each API. If the operation is
> > desruptive, it should return the right size affected by the API but
> > would be okay with 0 or error, otherwise.
>
> I'm fine with dropping the flag, I just thought with the flag it would
> be more explicit that this is a special mode operating on ranges. This
> way the patch also becomes simpler.
> Andrew, Michal, Christian, what do you think about such API? Should I
> change the API this way / keep the flag / change it in some other way?


Friendly ping to get some feedback on the proposed API please.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] mm/madvise: add process_madvise MADV_DONTNEER support
  2020-11-24  5:39 ` [PATCH 2/2] mm/madvise: add process_madvise MADV_DONTNEER support Suren Baghdasaryan
  2020-11-24 13:42   ` Oleg Nesterov
@ 2020-12-08 23:40   ` Jann Horn
  2020-12-08 23:59     ` Suren Baghdasaryan
  1 sibling, 1 reply; 20+ messages in thread
From: Jann Horn @ 2020-12-08 23:40 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Oleg Nesterov, Tim Murray,
	Linux API, Linux-MM, kernel list, kernel-team

On Tue, Nov 24, 2020 at 6:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> In modern systems it's not unusual to have a system component monitoring
> memory conditions of the system and tasked with keeping system memory
> pressure under control. One way to accomplish that is to kill
> non-essential processes to free up memory for more important ones.
> Examples of this are Facebook's OOM killer daemon called oomd and
> Android's low memory killer daemon called lmkd.
> For such system component it's important to be able to free memory
> quickly and efficiently. Unfortunately the time process takes to free
> up its memory after receiving a SIGKILL might vary based on the state
> of the process (uninterruptible sleep), size and OPP level of the core
> the process is running.
> In such situation it is desirable to be able to free up the memory of the
> process being killed in a more controlled way.
> Enable MADV_DONTNEED to be used with process_madvise when applied to a
> dying process to reclaim its memory. This would allow userspace system
> components like oomd and lmkd to free memory of the target process in
> a more predictable way.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
[...]
> @@ -1239,6 +1256,23 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>                 goto release_task;
>         }
>
> +       if (madvise_destructive(behavior)) {
> +               /* Allow destructive madvise only on a dying processes */
> +               if (!signal_group_exit(task->signal)) {
> +                       ret = -EINVAL;
> +                       goto release_mm;
> +               }

Technically Linux allows processes to share mm_struct without being in
the same thread group, so I'm not sure whether this check is good
enough? AFAICS the normal OOM killer deals with this case by letting
__oom_kill_process() always kill all tasks that share the mm_struct.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] mm/madvise: add process_madvise MADV_DONTNEER support
  2020-12-08 23:40   ` Jann Horn
@ 2020-12-08 23:59     ` Suren Baghdasaryan
  0 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2020-12-08 23:59 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Minchan Kim, Christian Brauner, Oleg Nesterov, Tim Murray,
	Linux API, Linux-MM, kernel list, kernel-team

On Tue, Dec 8, 2020 at 3:40 PM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Nov 24, 2020 at 6:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > In modern systems it's not unusual to have a system component monitoring
> > memory conditions of the system and tasked with keeping system memory
> > pressure under control. One way to accomplish that is to kill
> > non-essential processes to free up memory for more important ones.
> > Examples of this are Facebook's OOM killer daemon called oomd and
> > Android's low memory killer daemon called lmkd.
> > For such system component it's important to be able to free memory
> > quickly and efficiently. Unfortunately the time process takes to free
> > up its memory after receiving a SIGKILL might vary based on the state
> > of the process (uninterruptible sleep), size and OPP level of the core
> > the process is running.
> > In such situation it is desirable to be able to free up the memory of the
> > process being killed in a more controlled way.
> > Enable MADV_DONTNEED to be used with process_madvise when applied to a
> > dying process to reclaim its memory. This would allow userspace system
> > components like oomd and lmkd to free memory of the target process in
> > a more predictable way.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> [...]
> > @@ -1239,6 +1256,23 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> >                 goto release_task;
> >         }
> >
> > +       if (madvise_destructive(behavior)) {
> > +               /* Allow destructive madvise only on a dying processes */
> > +               if (!signal_group_exit(task->signal)) {
> > +                       ret = -EINVAL;
> > +                       goto release_mm;
> > +               }
>
> Technically Linux allows processes to share mm_struct without being in
> the same thread group, so I'm not sure whether this check is good
> enough? AFAICS the normal OOM killer deals with this case by letting
> __oom_kill_process() always kill all tasks that share the mm_struct.

Thanks for the comment Jann.
You are right. I think replacing !signal_group_exit(task->signal) with
task_will_free_mem(task) would address both your and Oleg's comments.
IIUC, task_will_free_mem() calls __task_will_free_mem() on the task
itself and on all processes sharing the mm_struct ensuring that they
are all dying.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range
  2020-11-25 23:13   ` Minchan Kim
  2020-11-25 23:23     ` Suren Baghdasaryan
@ 2020-12-11 20:27     ` Jann Horn
  2020-12-11 23:01       ` Minchan Kim
  2020-12-22 13:44       ` Christoph Hellwig
  1 sibling, 2 replies; 20+ messages in thread
From: Jann Horn @ 2020-12-11 20:27 UTC (permalink / raw)
  To: Minchan Kim, Christoph Hellwig
  Cc: Suren Baghdasaryan, Andrew Morton, Michal Hocko, Michal Hocko,
	David Rientjes, Matthew Wilcox, Johannes Weiner, Roman Gushchin,
	Rik van Riel, Christian Brauner, Oleg Nesterov, Tim Murray,
	Linux API, Linux-MM, kernel list, kernel-team

+CC Christoph Hellwig for opinions on compat

On Thu, Nov 26, 2020 at 12:22 AM Minchan Kim <minchan@kernel.org> wrote:
> On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > process_madvise requires a vector of address ranges to be provided for
> > its operations. When an advice should be applied to the entire process,
> > the caller process has to obtain the list of VMAs of the target process
> > by reading the /proc/pid/maps or some other way. The cost of this
> > operation grows linearly with increasing number of VMAs in the target
> > process. Even constructing the input vector can be non-trivial when
> > target process has several thousands of VMAs and the syscall is being
> > issued during high memory pressure period when new allocations for such
> > a vector would only worsen the situation.
> > In the case when advice is being applied to the entire memory space of
> > the target process, this creates an extra overhead.
> > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > advise a memory range of the target process. For now, to keep it simple,
> > only the entire process memory range is supported, vec and vlen inputs
> > in this mode are ignored and can be NULL and 0.
> > Instead of returning the number of bytes that advice was successfully
> > applied to, the syscall in this mode returns 0 on success. This is due
> > to the fact that the number of bytes would not be useful for the caller
> > that does not know the amount of memory the call is supposed to affect.
> > Besides, the ssize_t return type can be too small to hold the number of
> > bytes affected when the operation is applied to a large memory range.
>
> Can we just use one element in iovec to indicate entire address rather
> than using up the reserved flags?
>
>         struct iovec {
>                 .iov_base = NULL,
>                 .iov_len = (~(size_t)0),
>         };

In addition to Suren's objections, I think it's also worth considering
how this looks in terms of compat API. If a compat process does
process_madvise() on another compat process, it would be specifying
the maximum 32-bit number, rather than the maximum 64-bit number, so
you'd need special code to catch that case, which would be ugly.

And when a compat process uses this API on a non-compat process, it
semantically gets really weird: The actual address range covered would
be larger than the address range specified.

And if we want different access checks for the two flavors in the
future, gating that different behavior on special values in the iovec
would feel too magical to me.

And the length value SIZE_MAX doesn't really make sense anyway because
the length of the whole address space would be SIZE_MAX+1, which you
can't express.

So I'm in favor of a new flag, and strongly against using SIZE_MAX as
a magic number here.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range
  2020-12-11 20:27     ` Jann Horn
@ 2020-12-11 23:01       ` Minchan Kim
  2020-12-12  0:16         ` Jann Horn
  2020-12-22 13:44       ` Christoph Hellwig
  1 sibling, 1 reply; 20+ messages in thread
From: Minchan Kim @ 2020-12-11 23:01 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christoph Hellwig, Suren Baghdasaryan, Andrew Morton,
	Michal Hocko, Michal Hocko, David Rientjes, Matthew Wilcox,
	Johannes Weiner, Roman Gushchin, Rik van Riel, Christian Brauner,
	Oleg Nesterov, Tim Murray, Linux API, Linux-MM, kernel list,
	kernel-team

On Fri, Dec 11, 2020 at 09:27:46PM +0100, Jann Horn wrote:
> +CC Christoph Hellwig for opinions on compat
> 
> On Thu, Nov 26, 2020 at 12:22 AM Minchan Kim <minchan@kernel.org> wrote:
> > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > process_madvise requires a vector of address ranges to be provided for
> > > its operations. When an advice should be applied to the entire process,
> > > the caller process has to obtain the list of VMAs of the target process
> > > by reading the /proc/pid/maps or some other way. The cost of this
> > > operation grows linearly with increasing number of VMAs in the target
> > > process. Even constructing the input vector can be non-trivial when
> > > target process has several thousands of VMAs and the syscall is being
> > > issued during high memory pressure period when new allocations for such
> > > a vector would only worsen the situation.
> > > In the case when advice is being applied to the entire memory space of
> > > the target process, this creates an extra overhead.
> > > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > advise a memory range of the target process. For now, to keep it simple,
> > > only the entire process memory range is supported, vec and vlen inputs
> > > in this mode are ignored and can be NULL and 0.
> > > Instead of returning the number of bytes that advice was successfully
> > > applied to, the syscall in this mode returns 0 on success. This is due
> > > to the fact that the number of bytes would not be useful for the caller
> > > that does not know the amount of memory the call is supposed to affect.
> > > Besides, the ssize_t return type can be too small to hold the number of
> > > bytes affected when the operation is applied to a large memory range.
> >
> > Can we just use one element in iovec to indicate entire address rather
> > than using up the reserved flags?
> >
> >         struct iovec {
> >                 .iov_base = NULL,
> >                 .iov_len = (~(size_t)0),
> >         };
> 
> In addition to Suren's objections, I think it's also worth considering
> how this looks in terms of compat API. If a compat process does
> process_madvise() on another compat process, it would be specifying
> the maximum 32-bit number, rather than the maximum 64-bit number, so
> you'd need special code to catch that case, which would be ugly.
> 
> And when a compat process uses this API on a non-compat process, it
> semantically gets really weird: The actual address range covered would
> be larger than the address range specified.
> 
> And if we want different access checks for the two flavors in the
> future, gating that different behavior on special values in the iovec
> would feel too magical to me.
> 
> And the length value SIZE_MAX doesn't really make sense anyway because
> the length of the whole address space would be SIZE_MAX+1, which you
> can't express.
> 
> So I'm in favor of a new flag, and strongly against using SIZE_MAX as
> a magic number here.

Can't we simply pass NULL as iovec as special id, then?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range
  2020-12-11 23:01       ` Minchan Kim
@ 2020-12-12  0:16         ` Jann Horn
  0 siblings, 0 replies; 20+ messages in thread
From: Jann Horn @ 2020-12-12  0:16 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Christoph Hellwig, Suren Baghdasaryan, Andrew Morton,
	Michal Hocko, Michal Hocko, David Rientjes, Matthew Wilcox,
	Johannes Weiner, Roman Gushchin, Rik van Riel, Christian Brauner,
	Oleg Nesterov, Tim Murray, Linux API, Linux-MM, kernel list,
	kernel-team

On Sat, Dec 12, 2020 at 12:01 AM Minchan Kim <minchan@kernel.org> wrote:
> On Fri, Dec 11, 2020 at 09:27:46PM +0100, Jann Horn wrote:
> > +CC Christoph Hellwig for opinions on compat
> >
> > On Thu, Nov 26, 2020 at 12:22 AM Minchan Kim <minchan@kernel.org> wrote:
> > > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > > process_madvise requires a vector of address ranges to be provided for
> > > > its operations. When an advice should be applied to the entire process,
> > > > the caller process has to obtain the list of VMAs of the target process
> > > > by reading the /proc/pid/maps or some other way. The cost of this
> > > > operation grows linearly with increasing number of VMAs in the target
> > > > process. Even constructing the input vector can be non-trivial when
> > > > target process has several thousands of VMAs and the syscall is being
> > > > issued during high memory pressure period when new allocations for such
> > > > a vector would only worsen the situation.
> > > > In the case when advice is being applied to the entire memory space of
> > > > the target process, this creates an extra overhead.
> > > > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > > advise a memory range of the target process. For now, to keep it simple,
> > > > only the entire process memory range is supported, vec and vlen inputs
> > > > in this mode are ignored and can be NULL and 0.
> > > > Instead of returning the number of bytes that advice was successfully
> > > > applied to, the syscall in this mode returns 0 on success. This is due
> > > > to the fact that the number of bytes would not be useful for the caller
> > > > that does not know the amount of memory the call is supposed to affect.
> > > > Besides, the ssize_t return type can be too small to hold the number of
> > > > bytes affected when the operation is applied to a large memory range.
> > >
> > > Can we just use one element in iovec to indicate entire address rather
> > > than using up the reserved flags?
> > >
> > >         struct iovec {
> > >                 .iov_base = NULL,
> > >                 .iov_len = (~(size_t)0),
> > >         };
> >
> > In addition to Suren's objections, I think it's also worth considering
> > how this looks in terms of compat API. If a compat process does
> > process_madvise() on another compat process, it would be specifying
> > the maximum 32-bit number, rather than the maximum 64-bit number, so
> > you'd need special code to catch that case, which would be ugly.
> >
> > And when a compat process uses this API on a non-compat process, it
> > semantically gets really weird: The actual address range covered would
> > be larger than the address range specified.
> >
> > And if we want different access checks for the two flavors in the
> > future, gating that different behavior on special values in the iovec
> > would feel too magical to me.
> >
> > And the length value SIZE_MAX doesn't really make sense anyway because
> > the length of the whole address space would be SIZE_MAX+1, which you
> > can't express.
> >
> > So I'm in favor of a new flag, and strongly against using SIZE_MAX as
> > a magic number here.
>
> Can't we simply pass NULL as iovec as special id, then?

AFAIK in theory NULL can be a valid userspace pointer (although that
basically never happens and, on MMU systems, requires root to
explicitly do it). On the other hand, there are some parts of the UAPI
that already special-case NULL, so maybe that's considered acceptable?

Also, special-casing NULL slightly increases the chance that userspace
messes up and accidentally triggers completely different behavior
because an allocation failed or something like that.

So while I'm not strongly against using NULL here, I don't exactly
like the idea.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range
  2020-12-11 20:27     ` Jann Horn
  2020-12-11 23:01       ` Minchan Kim
@ 2020-12-22 13:44       ` Christoph Hellwig
  2020-12-22 17:48         ` Suren Baghdasaryan
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2020-12-22 13:44 UTC (permalink / raw)
  To: Jann Horn
  Cc: Minchan Kim, Christoph Hellwig, Suren Baghdasaryan,
	Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, Linux API,
	Linux-MM, kernel list, kernel-team

On Fri, Dec 11, 2020 at 09:27:46PM +0100, Jann Horn wrote:
> > Can we just use one element in iovec to indicate entire address rather
> > than using up the reserved flags?
> >
> >         struct iovec {
> >                 .iov_base = NULL,
> >                 .iov_len = (~(size_t)0),
> >         };
> 
> In addition to Suren's objections, I think it's also worth considering
> how this looks in terms of compat API. If a compat process does
> process_madvise() on another compat process, it would be specifying
> the maximum 32-bit number, rather than the maximum 64-bit number, so
> you'd need special code to catch that case, which would be ugly.
> 
> And when a compat process uses this API on a non-compat process, it
> semantically gets really weird: The actual address range covered would
> be larger than the address range specified.
> 
> And if we want different access checks for the two flavors in the
> future, gating that different behavior on special values in the iovec
> would feel too magical to me.
> 
> And the length value SIZE_MAX doesn't really make sense anyway because
> the length of the whole address space would be SIZE_MAX+1, which you
> can't express.
> 
> So I'm in favor of a new flag, and strongly against using SIZE_MAX as
> a magic number here.

Yes, using SIZE_MAX is a horrible interface in this case.  I'm not
a huge fan of a flag either.  What is the use case for the madvise
to all of a processes address space anyway?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range
  2020-12-22 13:44       ` Christoph Hellwig
@ 2020-12-22 17:48         ` Suren Baghdasaryan
  2020-12-23  4:09           ` Suren Baghdasaryan
  2020-12-23  7:57           ` Christoph Hellwig
  0 siblings, 2 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2020-12-22 17:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jann Horn, Minchan Kim, Christoph Hellwig, Andrew Morton,
	Michal Hocko, Michal Hocko, David Rientjes, Matthew Wilcox,
	Johannes Weiner, Roman Gushchin, Rik van Riel, Christian Brauner,
	Oleg Nesterov, Tim Murray, Linux API, Linux-MM, kernel list,
	kernel-team

On Tue, Dec 22, 2020 at 5:44 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Dec 11, 2020 at 09:27:46PM +0100, Jann Horn wrote:
> > > Can we just use one element in iovec to indicate entire address rather
> > > than using up the reserved flags?
> > >
> > >         struct iovec {
> > >                 .iov_base = NULL,
> > >                 .iov_len = (~(size_t)0),
> > >         };
> >
> > In addition to Suren's objections, I think it's also worth considering
> > how this looks in terms of compat API. If a compat process does
> > process_madvise() on another compat process, it would be specifying
> > the maximum 32-bit number, rather than the maximum 64-bit number, so
> > you'd need special code to catch that case, which would be ugly.
> >
> > And when a compat process uses this API on a non-compat process, it
> > semantically gets really weird: The actual address range covered would
> > be larger than the address range specified.
> >
> > And if we want different access checks for the two flavors in the
> > future, gating that different behavior on special values in the iovec
> > would feel too magical to me.
> >
> > And the length value SIZE_MAX doesn't really make sense anyway because
> > the length of the whole address space would be SIZE_MAX+1, which you
> > can't express.
> >
> > So I'm in favor of a new flag, and strongly against using SIZE_MAX as
> > a magic number here.
>
> Yes, using SIZE_MAX is a horrible interface in this case.  I'm not
> a huge fan of a flag either.  What is the use case for the madvise
> to all of a processes address space anyway?

Thanks for the feedback! The use case is userspace memory reaping
similar to oom-reaper. Detailed justification is here:
https://lore.kernel.org/linux-mm/20201124053943.1684874-1-surenb@google.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range
  2020-12-22 17:48         ` Suren Baghdasaryan
@ 2020-12-23  4:09           ` Suren Baghdasaryan
  2020-12-23  7:57           ` Christoph Hellwig
  1 sibling, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2020-12-23  4:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jann Horn, Minchan Kim, Christoph Hellwig, Andrew Morton,
	Michal Hocko, Michal Hocko, David Rientjes, Matthew Wilcox,
	Johannes Weiner, Roman Gushchin, Rik van Riel, Christian Brauner,
	Oleg Nesterov, Tim Murray, Linux API, Linux-MM, kernel list,
	kernel-team

On Tue, Dec 22, 2020 at 9:48 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Dec 22, 2020 at 5:44 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Fri, Dec 11, 2020 at 09:27:46PM +0100, Jann Horn wrote:
> > > > Can we just use one element in iovec to indicate entire address rather
> > > > than using up the reserved flags?
> > > >
> > > >         struct iovec {
> > > >                 .iov_base = NULL,
> > > >                 .iov_len = (~(size_t)0),
> > > >         };
> > >
> > > In addition to Suren's objections, I think it's also worth considering
> > > how this looks in terms of compat API. If a compat process does
> > > process_madvise() on another compat process, it would be specifying
> > > the maximum 32-bit number, rather than the maximum 64-bit number, so
> > > you'd need special code to catch that case, which would be ugly.
> > >
> > > And when a compat process uses this API on a non-compat process, it
> > > semantically gets really weird: The actual address range covered would
> > > be larger than the address range specified.
> > >
> > > And if we want different access checks for the two flavors in the
> > > future, gating that different behavior on special values in the iovec
> > > would feel too magical to me.
> > >
> > > And the length value SIZE_MAX doesn't really make sense anyway because
> > > the length of the whole address space would be SIZE_MAX+1, which you
> > > can't express.
> > >
> > > So I'm in favor of a new flag, and strongly against using SIZE_MAX as
> > > a magic number here.
> >
> > Yes, using SIZE_MAX is a horrible interface in this case.  I'm not
> > a huge fan of a flag either.  What is the use case for the madvise
> > to all of a processes address space anyway?
>
> Thanks for the feedback! The use case is userspace memory reaping
> similar to oom-reaper. Detailed justification is here:
> https://lore.kernel.org/linux-mm/20201124053943.1684874-1-surenb@google.com

Actually this post in the most informative and includes test results:
https://lore.kernel.org/linux-api/CAJuCfpGz1kPM3G1gZH+09Z7aoWKg05QSAMMisJ7H5MdmRrRhNQ@mail.gmail.com/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range
  2020-12-22 17:48         ` Suren Baghdasaryan
  2020-12-23  4:09           ` Suren Baghdasaryan
@ 2020-12-23  7:57           ` Christoph Hellwig
  2020-12-23 17:32             ` Suren Baghdasaryan
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2020-12-23  7:57 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Christoph Hellwig, Jann Horn, Minchan Kim, Christoph Hellwig,
	Andrew Morton, Michal Hocko, Michal Hocko, David Rientjes,
	Matthew Wilcox, Johannes Weiner, Roman Gushchin, Rik van Riel,
	Christian Brauner, Oleg Nesterov, Tim Murray, Linux API,
	Linux-MM, kernel list, kernel-team

On Tue, Dec 22, 2020 at 09:48:43AM -0800, Suren Baghdasaryan wrote:
> Thanks for the feedback! The use case is userspace memory reaping
> similar to oom-reaper. Detailed justification is here:
> https://lore.kernel.org/linux-mm/20201124053943.1684874-1-surenb@google.com

Given that this new variant of process_madvise

  a) does not work on an address range
  b) is destructive
  c) doesn't share much code at all with the rest of process_madvise

Why not add a proper separate syscall?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range
  2020-12-23  7:57           ` Christoph Hellwig
@ 2020-12-23 17:32             ` Suren Baghdasaryan
  0 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2020-12-23 17:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christoph Hellwig, Jann Horn, Minchan Kim, Andrew Morton,
	Michal Hocko, Michal Hocko, David Rientjes, Matthew Wilcox,
	Johannes Weiner, Roman Gushchin, Rik van Riel, Christian Brauner,
	Oleg Nesterov, Tim Murray, Linux API, Linux-MM, kernel list,
	kernel-team

On Tue, Dec 22, 2020 at 11:57 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Dec 22, 2020 at 09:48:43AM -0800, Suren Baghdasaryan wrote:
> > Thanks for the feedback! The use case is userspace memory reaping
> > similar to oom-reaper. Detailed justification is here:
> > https://lore.kernel.org/linux-mm/20201124053943.1684874-1-surenb@google.com
>
> Given that this new variant of process_madvise
>
>   a) does not work on an address range

True, however I can see other madvise flavors that could be used on
the entire process. For example process_madvise(MADV_PAGEOUT) could be
used to "shrink" an entire inactive background process.

>   b) is destructive

I agree that memory reaping might be the only case when a destructive
process_madvise() makes sense. Unless the target process is dying, a
destructive process_madvise() would need coordination with the target
process, and if it's coordinated then the target might as well call
normal madvise() itself.

>   c) doesn't share much code at all with the rest of process_madvise

It actually does reuse a considerable part of the code, but the same
code can be refactored and reused either way.

>
> Why not add a proper separate syscall?

I think my answer to (a) is one justification for allowing
process_madvise() to operate on the entire process. Also MADV_DONTNEED
seems quite suitable for this operation.
Considering the above answers, are you still leaning towards a separate syscall?

>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2020-12-23 17:33 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-24  5:39 [PATCH 0/2] userspace memory reaping using process_madvise Suren Baghdasaryan
2020-11-24  5:39 ` [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range Suren Baghdasaryan
2020-11-25 23:13   ` Minchan Kim
2020-11-25 23:23     ` Suren Baghdasaryan
2020-11-25 23:43       ` Minchan Kim
2020-11-30 19:01         ` Suren Baghdasaryan
2020-12-08  7:23           ` Suren Baghdasaryan
2020-12-11 20:27     ` Jann Horn
2020-12-11 23:01       ` Minchan Kim
2020-12-12  0:16         ` Jann Horn
2020-12-22 13:44       ` Christoph Hellwig
2020-12-22 17:48         ` Suren Baghdasaryan
2020-12-23  4:09           ` Suren Baghdasaryan
2020-12-23  7:57           ` Christoph Hellwig
2020-12-23 17:32             ` Suren Baghdasaryan
2020-11-24  5:39 ` [PATCH 2/2] mm/madvise: add process_madvise MADV_DONTNEER support Suren Baghdasaryan
2020-11-24 13:42   ` Oleg Nesterov
2020-11-24 16:42     ` Suren Baghdasaryan
2020-12-08 23:40   ` Jann Horn
2020-12-08 23:59     ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).