All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise
@ 2019-05-16  9:42 Oleksandr Natalenko
  2019-05-16  9:42 ` [PATCH RFC 1/5] proc: introduce madvise placeholder Oleksandr Natalenko
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Oleksandr Natalenko @ 2019-05-16  9:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kirill Tkhai, Hugh Dickins, Alexey Dobriyan, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api

It all began with the fact that KSM works only on memory that is marked
by madvise(). And the only way to get around that is to either:

  * use LD_PRELOAD; or
  * patch the kernel with something like UKSM or PKSM.

(i skip ptrace can of worms here intentionally)

To overcome this restriction, lets implement a per-process /proc knob,
which allows calling madvise remotely. This can be used manually on a
task in question or by some small userspace helper daemon that will do
auto-KSM job for us.

Also, following the discussions from the previous submissions [2] and
[3], make the interface more generic, so that it can be used for other
madvise hints in the future. At this point, I'd like Android people to
speak up, for instance, and clarify in which form they need page
granularity or other things I've missed or have never heard about.

So, I think of three major consumers of this interface:

  * hosts, that run containers, especially similar ones and especially in
    a trusted environment, sharing the same runtime like Node.js;

  * heavy applications, that can be run in multiple instances, not
    limited to opensource ones like Firefox, but also those that cannot be
	modified since they are binary-only and, maybe, statically linked;

  * Android environment that wants to do tricks with
    MADV_WILLNEED/DONTNEED or something similar.

On to the actual implementation. The per-process knob is named "madvise",
and it is write-only. It accepts a madvise hint name to be executed.
Currently, only KSM hints are implemented:

* to mark all the eligible VMAs as mergeable, use:

   # echo merge > /proc/<pid>/madvise

* to unmerge all the VMAs, use:

   # echo unmerge > /proc/<pid>/madvise

I've implemented address space level granularity instead of VMA/page
granularity intentionally for simplicity. If the discussion goes in
other directions, this can be re-implemented to act on a specific VMA
(via map_files?) or page-wise.

Speaking of statistics, more numbers can be found in the very first
submission, that is related to this one [1]. For my current setup with
two Firefox instances I get 100 to 200 MiB saved for the second instance
depending on the amount of tabs.

1 FF instance with 15 tabs:

   $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
   410

2 FF instances, second one has 12 tabs (all the tabs are different):

   $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
   592

At the very moment I do not have specific numbers for containerised
workload, but those should be comparable in case the containers share
similar/same runtime.

The history of this patchset:

  * [2] was based on Timofey's submission [1], but it didn't use a
    dedicated kthread to walk through the list of tasks/VMAs. Instead,
	do_anonymous_page() was amended to implement fully automatic mode,
	but this approach was incorrect due to improper locking and not
	desired due to excessive complexity and being KSM-specific;
  * [3] implemented KSM-specific madvise hints via sysfs, leaving
    traversing /proc to userspace if needed. The approach was not
	desired due to the fact that sysfs shouldn't implement any
	per-process API. Also, the interface was not generic enough to
	extend it for other users.

I drop all the "Reviewed-by" tags from previous submissions because of
code changes and because the objective of this series is now somewhat
different.

Please comment!

Thanks.

[1] https://lore.kernel.org/patchwork/patch/1012142/
[2] http://lkml.iu.edu/hypermail/linux/kernel/1905.1/02417.html
[3] http://lkml.iu.edu/hypermail/linux/kernel/1905.1/05076.html

Oleksandr Natalenko (5):
  proc: introduce madvise placeholder
  mm/ksm: introduce ksm_madvise_merge() helper
  mm/ksm: introduce ksm_madvise_unmerge() helper
  mm/ksm, proc: introduce remote merge
  mm/ksm, proc: add remote madvise documentation

 Documentation/filesystems/proc.txt | 13 +++++
 fs/proc/base.c                     | 70 +++++++++++++++++++++++
 include/linux/ksm.h                |  4 ++
 mm/ksm.c                           | 92 +++++++++++++++++++-----------
 4 files changed, 145 insertions(+), 34 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH RFC 1/5] proc: introduce madvise placeholder
  2019-05-16  9:42 [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise Oleksandr Natalenko
@ 2019-05-16  9:42 ` Oleksandr Natalenko
  2019-05-16  9:42 ` [PATCH RFC 2/5] mm/ksm: introduce ksm_madvise_merge() helper Oleksandr Natalenko
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Oleksandr Natalenko @ 2019-05-16  9:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kirill Tkhai, Hugh Dickins, Alexey Dobriyan, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api

Add a write-only /proc/<pid>/madvise file to handle remote madvise
operations.

For now, this is just a soother that does nothing.

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
---
 fs/proc/base.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6a803a0b75df..f69532d6b74f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2957,6 +2957,25 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
 }
 #endif /* CONFIG_STACKLEAK_METRICS */
 
+static ssize_t madvise_write(struct file *file, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+
+	task = get_proc_task(file_inode(file));
+	if (!task)
+		return -ESRCH;
+
+	put_task_struct(task);
+
+	return count;
+}
+
+static const struct file_operations proc_madvise_operations = {
+	.write		= madvise_write,
+	.llseek		= noop_llseek,
+};
+
 /*
  * Thread groups
  */
@@ -3061,6 +3080,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_STACKLEAK_METRICS
 	ONE("stack_depth", S_IRUGO, proc_stack_depth),
 #endif
+	REG("madvise", S_IRUGO|S_IWUSR, proc_madvise_operations),
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC 2/5] mm/ksm: introduce ksm_madvise_merge() helper
  2019-05-16  9:42 [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise Oleksandr Natalenko
  2019-05-16  9:42 ` [PATCH RFC 1/5] proc: introduce madvise placeholder Oleksandr Natalenko
@ 2019-05-16  9:42 ` Oleksandr Natalenko
  2019-05-16  9:42 ` [PATCH RFC 3/5] mm/ksm: introduce ksm_madvise_unmerge() helper Oleksandr Natalenko
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Oleksandr Natalenko @ 2019-05-16  9:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kirill Tkhai, Hugh Dickins, Alexey Dobriyan, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api

Move MADV_MERGEABLE part of ksm_madvise() into a dedicated helper since
it will be further used for marking VMAs to be merged forcibly.

This does not bring any functional changes.

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
---
 include/linux/ksm.h |  2 ++
 mm/ksm.c            | 60 +++++++++++++++++++++++++++------------------
 2 files changed, 38 insertions(+), 24 deletions(-)

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index e48b1e453ff5..e824b3141677 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -19,6 +19,8 @@ struct stable_node;
 struct mem_cgroup;
 
 #ifdef CONFIG_KSM
+int ksm_madvise_merge(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long *vm_flags);
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
diff --git a/mm/ksm.c b/mm/ksm.c
index fc64874dc6f4..1fdcf2fbd58d 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2442,41 +2442,53 @@ static int ksm_scan_thread(void *nothing)
 	return 0;
 }
 
-int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
-		unsigned long end, int advice, unsigned long *vm_flags)
+int ksm_madvise_merge(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long *vm_flags)
 {
-	struct mm_struct *mm = vma->vm_mm;
 	int err;
 
-	switch (advice) {
-	case MADV_MERGEABLE:
-		/*
-		 * Be somewhat over-protective for now!
-		 */
-		if (*vm_flags & (VM_MERGEABLE | VM_SHARED  | VM_MAYSHARE   |
-				 VM_PFNMAP    | VM_IO      | VM_DONTEXPAND |
-				 VM_HUGETLB | VM_MIXEDMAP))
-			return 0;		/* just ignore the advice */
+	/*
+	 * Be somewhat over-protective for now!
+	 */
+	if (*vm_flags & (VM_MERGEABLE | VM_SHARED  | VM_MAYSHARE   |
+			 VM_PFNMAP    | VM_IO      | VM_DONTEXPAND |
+			 VM_HUGETLB | VM_MIXEDMAP))
+		return 0;		/* just ignore the advice */
 
-		if (vma_is_dax(vma))
-			return 0;
+	if (vma_is_dax(vma))
+		return 0;
 
 #ifdef VM_SAO
-		if (*vm_flags & VM_SAO)
-			return 0;
+	if (*vm_flags & VM_SAO)
+		return 0;
 #endif
 #ifdef VM_SPARC_ADI
-		if (*vm_flags & VM_SPARC_ADI)
-			return 0;
+	if (*vm_flags & VM_SPARC_ADI)
+		return 0;
 #endif
 
-		if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
-			err = __ksm_enter(mm);
-			if (err)
-				return err;
-		}
+	if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
+		err = __ksm_enter(mm);
+		if (err)
+			return err;
+	}
+
+	*vm_flags |= VM_MERGEABLE;
+
+	return 0;
+}
 
-		*vm_flags |= VM_MERGEABLE;
+int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end, int advice, unsigned long *vm_flags)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int err;
+
+	switch (advice) {
+	case MADV_MERGEABLE:
+		err = ksm_madvise_merge(mm, vma, vm_flags);
+		if (err)
+			return err;
 		break;
 
 	case MADV_UNMERGEABLE:
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC 3/5] mm/ksm: introduce ksm_madvise_unmerge() helper
  2019-05-16  9:42 [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise Oleksandr Natalenko
  2019-05-16  9:42 ` [PATCH RFC 1/5] proc: introduce madvise placeholder Oleksandr Natalenko
  2019-05-16  9:42 ` [PATCH RFC 2/5] mm/ksm: introduce ksm_madvise_merge() helper Oleksandr Natalenko
@ 2019-05-16  9:42 ` Oleksandr Natalenko
  2019-05-16  9:42 ` [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge Oleksandr Natalenko
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Oleksandr Natalenko @ 2019-05-16  9:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kirill Tkhai, Hugh Dickins, Alexey Dobriyan, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api

Move MADV_UNMERGEABLE part of ksm_madvise() into a dedicated helper
since it will be further used for unmerging VMAs forcibly.

This does not bring any functional changes.

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
---
 include/linux/ksm.h |  2 ++
 mm/ksm.c            | 32 ++++++++++++++++++++++----------
 2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index e824b3141677..a91a7cfc87a1 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -21,6 +21,8 @@ struct mem_cgroup;
 #ifdef CONFIG_KSM
 int ksm_madvise_merge(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long *vm_flags);
+int ksm_madvise_unmerge(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end, unsigned long *vm_flags);
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
diff --git a/mm/ksm.c b/mm/ksm.c
index 1fdcf2fbd58d..e0357e25e09f 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2478,6 +2478,25 @@ int ksm_madvise_merge(struct mm_struct *mm, struct vm_area_struct *vma,
 	return 0;
 }
 
+int ksm_madvise_unmerge(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end, unsigned long *vm_flags)
+{
+	int err;
+
+	if (!(*vm_flags & VM_MERGEABLE))
+		return 0;		/* just ignore the advice */
+
+	if (vma->anon_vma) {
+		err = unmerge_ksm_pages(vma, start, end);
+		if (err)
+			return err;
+	}
+
+	*vm_flags &= ~VM_MERGEABLE;
+
+	return 0;
+}
+
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags)
 {
@@ -2492,16 +2511,9 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		break;
 
 	case MADV_UNMERGEABLE:
-		if (!(*vm_flags & VM_MERGEABLE))
-			return 0;		/* just ignore the advice */
-
-		if (vma->anon_vma) {
-			err = unmerge_ksm_pages(vma, start, end);
-			if (err)
-				return err;
-		}
-
-		*vm_flags &= ~VM_MERGEABLE;
+		err = ksm_madvise_unmerge(vma, start, end, vm_flags);
+		if (err)
+			return err;
 		break;
 	}
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
  2019-05-16  9:42 [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise Oleksandr Natalenko
                   ` (2 preceding siblings ...)
  2019-05-16  9:42 ` [PATCH RFC 3/5] mm/ksm: introduce ksm_madvise_unmerge() helper Oleksandr Natalenko
@ 2019-05-16  9:42 ` Oleksandr Natalenko
  2019-05-16 10:00     ` Jann Horn
  2019-05-16  9:42 ` [PATCH RFC 5/5] mm/ksm, proc: add remote madvise documentation Oleksandr Natalenko
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Oleksandr Natalenko @ 2019-05-16  9:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kirill Tkhai, Hugh Dickins, Alexey Dobriyan, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api

Use previously introduced remote madvise knob to mark task's
anonymous memory as mergeable.

To force merging task's VMAs, "merge" hint is used:

   # echo merge > /proc/<pid>/madvise

Force unmerging is done similarly:

   # echo unmerge > /proc/<pid>/madvise

To achieve this, previously introduced ksm_madvise_*() helpers
are used.

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
---
 fs/proc/base.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f69532d6b74f..6677580080ed 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -94,6 +94,8 @@
 #include <linux/sched/debug.h>
 #include <linux/sched/stat.h>
 #include <linux/posix-timers.h>
+#include <linux/mman.h>
+#include <linux/ksm.h>
 #include <trace/events/oom.h>
 #include "internal.h"
 #include "fd.h"
@@ -2960,15 +2962,63 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
 static ssize_t madvise_write(struct file *file, const char __user *buf,
 		size_t count, loff_t *ppos)
 {
+	/* For now, only KSM hints are implemented */
+#ifdef CONFIG_KSM
+	char buffer[PROC_NUMBUF];
+	int behaviour;
 	struct task_struct *task;
+	struct mm_struct *mm;
+	int err = 0;
+	struct vm_area_struct *vma;
+
+	memset(buffer, 0, sizeof(buffer));
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+
+	if (!memcmp("merge", buffer, min(sizeof("merge")-1, count)))
+		behaviour = MADV_MERGEABLE;
+	else if (!memcmp("unmerge", buffer, min(sizeof("unmerge")-1, count)))
+		behaviour = MADV_UNMERGEABLE;
+	else
+		return -EINVAL;
 
 	task = get_proc_task(file_inode(file));
 	if (!task)
 		return -ESRCH;
 
+	mm = get_task_mm(task);
+	if (!mm) {
+		err = -EINVAL;
+		goto out_put_task_struct;
+	}
+
+	down_write(&mm->mmap_sem);
+	switch (behaviour) {
+	case MADV_MERGEABLE:
+	case MADV_UNMERGEABLE:
+		vma = mm->mmap;
+		while (vma) {
+			if (behaviour == MADV_MERGEABLE)
+				ksm_madvise_merge(vma->vm_mm, vma, &vma->vm_flags);
+			else
+				ksm_madvise_unmerge(vma, vma->vm_start, vma->vm_end, &vma->vm_flags);
+			vma = vma->vm_next;
+		}
+		break;
+	}
+	up_write(&mm->mmap_sem);
+
+	mmput(mm);
+
+out_put_task_struct:
 	put_task_struct(task);
 
-	return count;
+	return err ? err : count;
+#else
+	return -EINVAL;
+#endif /* CONFIG_KSM */
 }
 
 static const struct file_operations proc_madvise_operations = {
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC 5/5] mm/ksm, proc: add remote madvise documentation
  2019-05-16  9:42 [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise Oleksandr Natalenko
                   ` (3 preceding siblings ...)
  2019-05-16  9:42 ` [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge Oleksandr Natalenko
@ 2019-05-16  9:42 ` Oleksandr Natalenko
  2019-05-16 10:44 ` [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise Michal Hocko
  2019-05-16 17:24 ` Alexey Dobriyan
  6 siblings, 0 replies; 18+ messages in thread
From: Oleksandr Natalenko @ 2019-05-16  9:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kirill Tkhai, Hugh Dickins, Alexey Dobriyan, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api

Document respective /proc/<pid>/madvise knob.

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
---
 Documentation/filesystems/proc.txt | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 66cad5c86171..17106e435bba 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -45,6 +45,7 @@ Table of Contents
   3.9   /proc/<pid>/map_files - Information about memory mapped files
   3.10  /proc/<pid>/timerslack_ns - Task timerslack value
   3.11	/proc/<pid>/patch_state - Livepatch patch operation state
+  3.12  /proc/<pid>/madvise - Remote madvise
 
   4	Configuring procfs
   4.1	Mount options
@@ -1948,6 +1949,18 @@ patched.  If the patch is being enabled, then the task has already been
 patched.  If the patch is being disabled, then the task hasn't been
 unpatched yet.
 
+3.12    /proc/<pid>/madvise - Remote madvise
+--------------------------------------------
+This write-only file allows executing madvise operation for another task.
+
+If CONFIG_KSM is enabled, the following actions are available:
+
+  * marking task's memory as mergeable:
+    # echo merge > /proc/<pid>/madvise
+
+  * unmerging all the task's memory:
+    # echo unmerge > /proc/<pid>/madvise
+
 
 ------------------------------------------------------------------------------
 Configuring procfs
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
  2019-05-16  9:42 ` [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge Oleksandr Natalenko
@ 2019-05-16 10:00     ` Jann Horn
  0 siblings, 0 replies; 18+ messages in thread
From: Jann Horn @ 2019-05-16 10:00 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: kernel list, Kirill Tkhai, Hugh Dickins, Alexey Dobriyan,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, Pavel Tatashin,
	Greg KH, Suren Baghdasaryan, Minchan Kim, Timofey Titovets,
	Aaron Tomlin, Grzegorz Halat, Linux-MM, Linux API

On Thu, May 16, 2019 at 11:43 AM Oleksandr Natalenko
<oleksandr@redhat.com> wrote:
> Use previously introduced remote madvise knob to mark task's
> anonymous memory as mergeable.
>
> To force merging task's VMAs, "merge" hint is used:
>
>    # echo merge > /proc/<pid>/madvise
>
> Force unmerging is done similarly:
>
>    # echo unmerge > /proc/<pid>/madvise
>
> To achieve this, previously introduced ksm_madvise_*() helpers
> are used.

Why does this not require PTRACE_MODE_ATTACH_FSCREDS to the target
process? Enabling KSM on another process is hazardous because it
significantly increases the attack surface for side channels.

(Note that if you change this to require PTRACE_MODE_ATTACH_FSCREDS,
you'll want to use mm_access() in the ->open handler and drop the mm
in ->release. mm_access() from a ->write handler is not permitted.)

[...]
> @@ -2960,15 +2962,63 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
>  static ssize_t madvise_write(struct file *file, const char __user *buf,
>                 size_t count, loff_t *ppos)
>  {
> +       /* For now, only KSM hints are implemented */
> +#ifdef CONFIG_KSM
> +       char buffer[PROC_NUMBUF];
> +       int behaviour;
>         struct task_struct *task;
> +       struct mm_struct *mm;
> +       int err = 0;
> +       struct vm_area_struct *vma;
> +
> +       memset(buffer, 0, sizeof(buffer));
> +       if (count > sizeof(buffer) - 1)
> +               count = sizeof(buffer) - 1;
> +       if (copy_from_user(buffer, buf, count))
> +               return -EFAULT;
> +
> +       if (!memcmp("merge", buffer, min(sizeof("merge")-1, count)))

This means that you also match on something like "mergeblah". Just use strcmp().

> +               behaviour = MADV_MERGEABLE;
> +       else if (!memcmp("unmerge", buffer, min(sizeof("unmerge")-1, count)))
> +               behaviour = MADV_UNMERGEABLE;
> +       else
> +               return -EINVAL;
>
>         task = get_proc_task(file_inode(file));
>         if (!task)
>                 return -ESRCH;
>
> +       mm = get_task_mm(task);
> +       if (!mm) {
> +               err = -EINVAL;
> +               goto out_put_task_struct;
> +       }
> +
> +       down_write(&mm->mmap_sem);

Should a check for mmget_still_valid(mm) be inserted here? See commit
04f5866e41fb70690e28397487d8bd8eea7d712a.

> +       switch (behaviour) {
> +       case MADV_MERGEABLE:
> +       case MADV_UNMERGEABLE:

This switch isn't actually necessary at this point, right?

> +               vma = mm->mmap;
> +               while (vma) {
> +                       if (behaviour == MADV_MERGEABLE)
> +                               ksm_madvise_merge(vma->vm_mm, vma, &vma->vm_flags);
> +                       else
> +                               ksm_madvise_unmerge(vma, vma->vm_start, vma->vm_end, &vma->vm_flags);
> +                       vma = vma->vm_next;
> +               }
> +               break;
> +       }
> +       up_write(&mm->mmap_sem);
> +
> +       mmput(mm);
> +
> +out_put_task_struct:
>         put_task_struct(task);
>
> -       return count;
> +       return err ? err : count;
> +#else
> +       return -EINVAL;
> +#endif /* CONFIG_KSM */
>  }

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
@ 2019-05-16 10:00     ` Jann Horn
  0 siblings, 0 replies; 18+ messages in thread
From: Jann Horn @ 2019-05-16 10:00 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: kernel list, Kirill Tkhai, Hugh Dickins, Alexey Dobriyan,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, Pavel Tatashin,
	Greg KH, Suren Baghdasaryan, Minchan Kim, Timofey Titovets,
	Aaron Tomlin, Grzegorz Halat, Linux-MM, Linux API

On Thu, May 16, 2019 at 11:43 AM Oleksandr Natalenko
<oleksandr@redhat.com> wrote:
> Use previously introduced remote madvise knob to mark task's
> anonymous memory as mergeable.
>
> To force merging task's VMAs, "merge" hint is used:
>
>    # echo merge > /proc/<pid>/madvise
>
> Force unmerging is done similarly:
>
>    # echo unmerge > /proc/<pid>/madvise
>
> To achieve this, previously introduced ksm_madvise_*() helpers
> are used.

Why does this not require PTRACE_MODE_ATTACH_FSCREDS to the target
process? Enabling KSM on another process is hazardous because it
significantly increases the attack surface for side channels.

(Note that if you change this to require PTRACE_MODE_ATTACH_FSCREDS,
you'll want to use mm_access() in the ->open handler and drop the mm
in ->release. mm_access() from a ->write handler is not permitted.)

[...]
> @@ -2960,15 +2962,63 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
>  static ssize_t madvise_write(struct file *file, const char __user *buf,
>                 size_t count, loff_t *ppos)
>  {
> +       /* For now, only KSM hints are implemented */
> +#ifdef CONFIG_KSM
> +       char buffer[PROC_NUMBUF];
> +       int behaviour;
>         struct task_struct *task;
> +       struct mm_struct *mm;
> +       int err = 0;
> +       struct vm_area_struct *vma;
> +
> +       memset(buffer, 0, sizeof(buffer));
> +       if (count > sizeof(buffer) - 1)
> +               count = sizeof(buffer) - 1;
> +       if (copy_from_user(buffer, buf, count))
> +               return -EFAULT;
> +
> +       if (!memcmp("merge", buffer, min(sizeof("merge")-1, count)))

This means that you also match on something like "mergeblah". Just use strcmp().

> +               behaviour = MADV_MERGEABLE;
> +       else if (!memcmp("unmerge", buffer, min(sizeof("unmerge")-1, count)))
> +               behaviour = MADV_UNMERGEABLE;
> +       else
> +               return -EINVAL;
>
>         task = get_proc_task(file_inode(file));
>         if (!task)
>                 return -ESRCH;
>
> +       mm = get_task_mm(task);
> +       if (!mm) {
> +               err = -EINVAL;
> +               goto out_put_task_struct;
> +       }
> +
> +       down_write(&mm->mmap_sem);

Should a check for mmget_still_valid(mm) be inserted here? See commit
04f5866e41fb70690e28397487d8bd8eea7d712a.

> +       switch (behaviour) {
> +       case MADV_MERGEABLE:
> +       case MADV_UNMERGEABLE:

This switch isn't actually necessary at this point, right?

> +               vma = mm->mmap;
> +               while (vma) {
> +                       if (behaviour == MADV_MERGEABLE)
> +                               ksm_madvise_merge(vma->vm_mm, vma, &vma->vm_flags);
> +                       else
> +                               ksm_madvise_unmerge(vma, vma->vm_start, vma->vm_end, &vma->vm_flags);
> +                       vma = vma->vm_next;
> +               }
> +               break;
> +       }
> +       up_write(&mm->mmap_sem);
> +
> +       mmput(mm);
> +
> +out_put_task_struct:
>         put_task_struct(task);
>
> -       return count;
> +       return err ? err : count;
> +#else
> +       return -EINVAL;
> +#endif /* CONFIG_KSM */
>  }


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise
  2019-05-16  9:42 [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise Oleksandr Natalenko
                   ` (4 preceding siblings ...)
  2019-05-16  9:42 ` [PATCH RFC 5/5] mm/ksm, proc: add remote madvise documentation Oleksandr Natalenko
@ 2019-05-16 10:44 ` Michal Hocko
  2019-05-16 14:21   ` Oleksandr Natalenko
  2019-05-16 17:24 ` Alexey Dobriyan
  6 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2019-05-16 10:44 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: linux-kernel, Kirill Tkhai, Hugh Dickins, Alexey Dobriyan,
	Vlastimil Babka, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api

On Thu 16-05-19 11:42:29, Oleksandr Natalenko wrote:
[...]
> * to mark all the eligible VMAs as mergeable, use:
> 
>    # echo merge > /proc/<pid>/madvise
> 
> * to unmerge all the VMAs, use:
> 
>    # echo unmerge > /proc/<pid>/madvise

Please do not open a new thread until a previous one reaches some
conclusion. I have outlined some ways to go forward in
http://lkml.kernel.org/r/20190515145151.GG16651@dhcp22.suse.cz.
I haven't heard any feedback on that, yet you open a 3rd way in a
different thread. This will not help to move on with the discussion.

Please follow up on that thread.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
  2019-05-16 10:00     ` Jann Horn
  (?)
@ 2019-05-16 14:20     ` Oleksandr Natalenko
  2019-05-16 14:43       ` Oleksandr Natalenko
  2019-05-16 16:06         ` Jann Horn
  -1 siblings, 2 replies; 18+ messages in thread
From: Oleksandr Natalenko @ 2019-05-16 14:20 UTC (permalink / raw)
  To: Jann Horn
  Cc: kernel list, Kirill Tkhai, Hugh Dickins, Alexey Dobriyan,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, Pavel Tatashin,
	Greg KH, Suren Baghdasaryan, Minchan Kim, Timofey Titovets,
	Aaron Tomlin, Grzegorz Halat, Linux-MM, Linux API

Hi.

On Thu, May 16, 2019 at 12:00:24PM +0200, Jann Horn wrote:
> On Thu, May 16, 2019 at 11:43 AM Oleksandr Natalenko
> <oleksandr@redhat.com> wrote:
> > Use previously introduced remote madvise knob to mark task's
> > anonymous memory as mergeable.
> >
> > To force merging task's VMAs, "merge" hint is used:
> >
> >    # echo merge > /proc/<pid>/madvise
> >
> > Force unmerging is done similarly:
> >
> >    # echo unmerge > /proc/<pid>/madvise
> >
> > To achieve this, previously introduced ksm_madvise_*() helpers
> > are used.
> 
> Why does this not require PTRACE_MODE_ATTACH_FSCREDS to the target
> process? Enabling KSM on another process is hazardous because it
> significantly increases the attack surface for side channels.
> 
> (Note that if you change this to require PTRACE_MODE_ATTACH_FSCREDS,
> you'll want to use mm_access() in the ->open handler and drop the mm
> in ->release. mm_access() from a ->write handler is not permitted.)

Sounds reasonable. So, something similar to what mem_open() & friends do
now:

static int madvise_open(...)
...
	struct task_struct *task = get_proc_task(inode);
...
	if (task) {
		mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
		put_task_struct(task);
		if (!IS_ERR_OR_NULL(mm)) {
			mmgrab(mm);
			mmput(mm);
...

Then:

static ssize_t madvise_write(...)
...
	if (!mmget_not_zero(mm))
		goto out;

	down_write(&mm->mmap_sem);
	if (!mmget_still_valid(mm))
		goto skip_mm;
...
skip_mm:
	up_write(&mm->mmap_sem);

	mmput(mm);
out:
	return ...;

And, finally:

static int madvise_release(...)
...
		mmdrop(mm);
...

Right?

> [...]
> > @@ -2960,15 +2962,63 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
> >  static ssize_t madvise_write(struct file *file, const char __user *buf,
> >                 size_t count, loff_t *ppos)
> >  {
> > +       /* For now, only KSM hints are implemented */
> > +#ifdef CONFIG_KSM
> > +       char buffer[PROC_NUMBUF];
> > +       int behaviour;
> >         struct task_struct *task;
> > +       struct mm_struct *mm;
> > +       int err = 0;
> > +       struct vm_area_struct *vma;
> > +
> > +       memset(buffer, 0, sizeof(buffer));
> > +       if (count > sizeof(buffer) - 1)
> > +               count = sizeof(buffer) - 1;
> > +       if (copy_from_user(buffer, buf, count))
> > +               return -EFAULT;
> > +
> > +       if (!memcmp("merge", buffer, min(sizeof("merge")-1, count)))
> 
> This means that you also match on something like "mergeblah". Just use strcmp().

I agree. Just to make it more interesting I must say that

   /sys/kernel/mm/transparent_hugepage/enabled

uses memcmp in the very same way, and thus echoing "alwaysssss" or
"madviseeee" works perfectly there, and it was like that from the very
beginning, it seems. Should we fix it, or it became (zomg) a public API?

> > +               behaviour = MADV_MERGEABLE;
> > +       else if (!memcmp("unmerge", buffer, min(sizeof("unmerge")-1, count)))
> > +               behaviour = MADV_UNMERGEABLE;
> > +       else
> > +               return -EINVAL;
> >
> >         task = get_proc_task(file_inode(file));
> >         if (!task)
> >                 return -ESRCH;
> >
> > +       mm = get_task_mm(task);
> > +       if (!mm) {
> > +               err = -EINVAL;
> > +               goto out_put_task_struct;
> > +       }
> > +
> > +       down_write(&mm->mmap_sem);
> 
> Should a check for mmget_still_valid(mm) be inserted here? See commit
> 04f5866e41fb70690e28397487d8bd8eea7d712a.

Yeah, it seems so :/. Thanks for the pointer. I've put it into the
madvise_write snippet above.

> > +       switch (behaviour) {
> > +       case MADV_MERGEABLE:
> > +       case MADV_UNMERGEABLE:
> 
> This switch isn't actually necessary at this point, right?

Yup, but it is there to highlight a possibility of adding other, non-KSM
options. So, let it be, and I'll just re-arrange CONFIG_KSM ifdef
instead.

Thank you.

> > +               vma = mm->mmap;
> > +               while (vma) {
> > +                       if (behaviour == MADV_MERGEABLE)
> > +                               ksm_madvise_merge(vma->vm_mm, vma, &vma->vm_flags);
> > +                       else
> > +                               ksm_madvise_unmerge(vma, vma->vm_start, vma->vm_end, &vma->vm_flags);
> > +                       vma = vma->vm_next;
> > +               }
> > +               break;
> > +       }
> > +       up_write(&mm->mmap_sem);
> > +
> > +       mmput(mm);
> > +
> > +out_put_task_struct:
> >         put_task_struct(task);
> >
> > -       return count;
> > +       return err ? err : count;
> > +#else
> > +       return -EINVAL;
> > +#endif /* CONFIG_KSM */
> >  }

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise
  2019-05-16 10:44 ` [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise Michal Hocko
@ 2019-05-16 14:21   ` Oleksandr Natalenko
  0 siblings, 0 replies; 18+ messages in thread
From: Oleksandr Natalenko @ 2019-05-16 14:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Kirill Tkhai, Hugh Dickins, Alexey Dobriyan,
	Vlastimil Babka, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api

Hi.

On Thu, May 16, 2019 at 12:44:12PM +0200, Michal Hocko wrote:
> On Thu 16-05-19 11:42:29, Oleksandr Natalenko wrote:
> [...]
> > * to mark all the eligible VMAs as mergeable, use:
> > 
> >    # echo merge > /proc/<pid>/madvise
> > 
> > * to unmerge all the VMAs, use:
> > 
> >    # echo unmerge > /proc/<pid>/madvise
> 
> Please do not open a new thread until a previous one reaches some
> conclusion. I have outlined some ways to go forward in
> http://lkml.kernel.org/r/20190515145151.GG16651@dhcp22.suse.cz.
> I haven't heard any feedback on that, yet you open a 3rd way in a
> different thread. This will not help to move on with the discussion.
> 
> Please follow up on that thread.

Sure, I will follow the thread once and if there are responses. Consider
this one to be an intermediate summary of current suggestions and also
an indication that it is better to have the code early for public eyes.

Thank you.

> -- 
> Michal Hocko
> SUSE Labs

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
  2019-05-16 14:20     ` Oleksandr Natalenko
@ 2019-05-16 14:43       ` Oleksandr Natalenko
  2019-05-16 16:09           ` Jann Horn
  2019-05-16 16:06         ` Jann Horn
  1 sibling, 1 reply; 18+ messages in thread
From: Oleksandr Natalenko @ 2019-05-16 14:43 UTC (permalink / raw)
  To: Jann Horn
  Cc: kernel list, Kirill Tkhai, Hugh Dickins, Alexey Dobriyan,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, Pavel Tatashin,
	Greg KH, Suren Baghdasaryan, Minchan Kim, Timofey Titovets,
	Aaron Tomlin, Grzegorz Halat, Linux-MM, Linux API

On Thu, May 16, 2019 at 04:20:13PM +0200, Oleksandr Natalenko wrote:
> > [...]
> > > @@ -2960,15 +2962,63 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
> > >  static ssize_t madvise_write(struct file *file, const char __user *buf,
> > >                 size_t count, loff_t *ppos)
> > >  {
> > > +       /* For now, only KSM hints are implemented */
> > > +#ifdef CONFIG_KSM
> > > +       char buffer[PROC_NUMBUF];
> > > +       int behaviour;
> > >         struct task_struct *task;
> > > +       struct mm_struct *mm;
> > > +       int err = 0;
> > > +       struct vm_area_struct *vma;
> > > +
> > > +       memset(buffer, 0, sizeof(buffer));
> > > +       if (count > sizeof(buffer) - 1)
> > > +               count = sizeof(buffer) - 1;
> > > +       if (copy_from_user(buffer, buf, count))
> > > +               return -EFAULT;
> > > +
> > > +       if (!memcmp("merge", buffer, min(sizeof("merge")-1, count)))
> > 
> > This means that you also match on something like "mergeblah". Just use strcmp().
> 
> I agree. Just to make it more interesting I must say that
> 
>    /sys/kernel/mm/transparent_hugepage/enabled
> 
> uses memcmp in the very same way, and thus echoing "alwaysssss" or
> "madviseeee" works perfectly there, and it was like that from the very
> beginning, it seems. Should we fix it, or it became (zomg) a public API?

Actually, maybe, the reason for using memcmp is to handle "echo"
properly: by default it puts a newline character at the end, so if we use
just strcmp, echo should be called with -n, otherwise strcmp won't match
the string.

Huh?

> [...]

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
  2019-05-16 14:20     ` Oleksandr Natalenko
@ 2019-05-16 16:06         ` Jann Horn
  2019-05-16 16:06         ` Jann Horn
  1 sibling, 0 replies; 18+ messages in thread
From: Jann Horn @ 2019-05-16 16:06 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: kernel list, Kirill Tkhai, Hugh Dickins, Alexey Dobriyan,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, Pavel Tatashin,
	Greg KH, Suren Baghdasaryan, Minchan Kim, Timofey Titovets,
	Aaron Tomlin, Grzegorz Halat, Linux-MM, Linux API

On Thu, May 16, 2019 at 4:20 PM Oleksandr Natalenko
<oleksandr@redhat.com> wrote:
> On Thu, May 16, 2019 at 12:00:24PM +0200, Jann Horn wrote:
> > On Thu, May 16, 2019 at 11:43 AM Oleksandr Natalenko
> > <oleksandr@redhat.com> wrote:
> > > Use previously introduced remote madvise knob to mark task's
> > > anonymous memory as mergeable.
> > >
> > > To force merging task's VMAs, "merge" hint is used:
> > >
> > >    # echo merge > /proc/<pid>/madvise
> > >
> > > Force unmerging is done similarly:
> > >
> > >    # echo unmerge > /proc/<pid>/madvise
> > >
> > > To achieve this, previously introduced ksm_madvise_*() helpers
> > > are used.
> >
> > Why does this not require PTRACE_MODE_ATTACH_FSCREDS to the target
> > process? Enabling KSM on another process is hazardous because it
> > significantly increases the attack surface for side channels.
> >
> > (Note that if you change this to require PTRACE_MODE_ATTACH_FSCREDS,
> > you'll want to use mm_access() in the ->open handler and drop the mm
> > in ->release. mm_access() from a ->write handler is not permitted.)
>
> Sounds reasonable. So, something similar to what mem_open() & friends do
> now:
>
> static int madvise_open(...)
> ...
>         struct task_struct *task = get_proc_task(inode);
> ...
>         if (task) {
>                 mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
>                 put_task_struct(task);
>                 if (!IS_ERR_OR_NULL(mm)) {
>                         mmgrab(mm);
>                         mmput(mm);
> ...
>
> Then:
>
> static ssize_t madvise_write(...)
> ...
>         if (!mmget_not_zero(mm))
>                 goto out;
>
>         down_write(&mm->mmap_sem);
>         if (!mmget_still_valid(mm))
>                 goto skip_mm;
> ...
> skip_mm:
>         up_write(&mm->mmap_sem);
>
>         mmput(mm);
> out:
>         return ...;
>
> And, finally:
>
> static int madvise_release(...)
> ...
>                 mmdrop(mm);
> ...
>
> Right?

Yeah, that looks reasonable.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
@ 2019-05-16 16:06         ` Jann Horn
  0 siblings, 0 replies; 18+ messages in thread
From: Jann Horn @ 2019-05-16 16:06 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: kernel list, Kirill Tkhai, Hugh Dickins, Alexey Dobriyan,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, Pavel Tatashin,
	Greg KH, Suren Baghdasaryan, Minchan Kim, Timofey Titovets,
	Aaron Tomlin, Grzegorz Halat, Linux-MM, Linux API

On Thu, May 16, 2019 at 4:20 PM Oleksandr Natalenko
<oleksandr@redhat.com> wrote:
> On Thu, May 16, 2019 at 12:00:24PM +0200, Jann Horn wrote:
> > On Thu, May 16, 2019 at 11:43 AM Oleksandr Natalenko
> > <oleksandr@redhat.com> wrote:
> > > Use previously introduced remote madvise knob to mark task's
> > > anonymous memory as mergeable.
> > >
> > > To force merging task's VMAs, "merge" hint is used:
> > >
> > >    # echo merge > /proc/<pid>/madvise
> > >
> > > Force unmerging is done similarly:
> > >
> > >    # echo unmerge > /proc/<pid>/madvise
> > >
> > > To achieve this, previously introduced ksm_madvise_*() helpers
> > > are used.
> >
> > Why does this not require PTRACE_MODE_ATTACH_FSCREDS to the target
> > process? Enabling KSM on another process is hazardous because it
> > significantly increases the attack surface for side channels.
> >
> > (Note that if you change this to require PTRACE_MODE_ATTACH_FSCREDS,
> > you'll want to use mm_access() in the ->open handler and drop the mm
> > in ->release. mm_access() from a ->write handler is not permitted.)
>
> Sounds reasonable. So, something similar to what mem_open() & friends do
> now:
>
> static int madvise_open(...)
> ...
>         struct task_struct *task = get_proc_task(inode);
> ...
>         if (task) {
>                 mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
>                 put_task_struct(task);
>                 if (!IS_ERR_OR_NULL(mm)) {
>                         mmgrab(mm);
>                         mmput(mm);
> ...
>
> Then:
>
> static ssize_t madvise_write(...)
> ...
>         if (!mmget_not_zero(mm))
>                 goto out;
>
>         down_write(&mm->mmap_sem);
>         if (!mmget_still_valid(mm))
>                 goto skip_mm;
> ...
> skip_mm:
>         up_write(&mm->mmap_sem);
>
>         mmput(mm);
> out:
>         return ...;
>
> And, finally:
>
> static int madvise_release(...)
> ...
>                 mmdrop(mm);
> ...
>
> Right?

Yeah, that looks reasonable.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
  2019-05-16 14:43       ` Oleksandr Natalenko
@ 2019-05-16 16:09           ` Jann Horn
  0 siblings, 0 replies; 18+ messages in thread
From: Jann Horn @ 2019-05-16 16:09 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: kernel list, Kirill Tkhai, Hugh Dickins, Alexey Dobriyan,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, Pavel Tatashin,
	Greg KH, Suren Baghdasaryan, Minchan Kim, Timofey Titovets,
	Aaron Tomlin, Grzegorz Halat, Linux-MM, Linux API

On Thu, May 16, 2019 at 4:43 PM Oleksandr Natalenko
<oleksandr@redhat.com> wrote:
> On Thu, May 16, 2019 at 04:20:13PM +0200, Oleksandr Natalenko wrote:
> > > [...]
> > > > @@ -2960,15 +2962,63 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
> > > >  static ssize_t madvise_write(struct file *file, const char __user *buf,
> > > >                 size_t count, loff_t *ppos)
> > > >  {
> > > > +       /* For now, only KSM hints are implemented */
> > > > +#ifdef CONFIG_KSM
> > > > +       char buffer[PROC_NUMBUF];
> > > > +       int behaviour;
> > > >         struct task_struct *task;
> > > > +       struct mm_struct *mm;
> > > > +       int err = 0;
> > > > +       struct vm_area_struct *vma;
> > > > +
> > > > +       memset(buffer, 0, sizeof(buffer));
> > > > +       if (count > sizeof(buffer) - 1)
> > > > +               count = sizeof(buffer) - 1;
> > > > +       if (copy_from_user(buffer, buf, count))
> > > > +               return -EFAULT;
> > > > +
> > > > +       if (!memcmp("merge", buffer, min(sizeof("merge")-1, count)))
> > >
> > > This means that you also match on something like "mergeblah". Just use strcmp().
> >
> > I agree. Just to make it more interesting I must say that
> >
> >    /sys/kernel/mm/transparent_hugepage/enabled
> >
> > uses memcmp in the very same way, and thus echoing "alwaysssss" or
> > "madviseeee" works perfectly there, and it was like that from the very
> > beginning, it seems. Should we fix it, or it became (zomg) a public API?
>
> Actually, maybe, the reason for using memcmp is to handle "echo"
> properly: by default it puts a newline character at the end, so if we use
> just strcmp, echo should be called with -n, otherwise strcmp won't match
> the string.
>
> Huh?

Ah, yes, other code like e.g. proc_setgroups_write() uses strncmp()
and then has an extra check to make sure everything trailing is
whitespace.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
@ 2019-05-16 16:09           ` Jann Horn
  0 siblings, 0 replies; 18+ messages in thread
From: Jann Horn @ 2019-05-16 16:09 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: kernel list, Kirill Tkhai, Hugh Dickins, Alexey Dobriyan,
	Vlastimil Babka, Michal Hocko, Matthew Wilcox, Pavel Tatashin,
	Greg KH, Suren Baghdasaryan, Minchan Kim, Timofey Titovets,
	Aaron Tomlin, Grzegorz Halat, Linux-MM, Linux API

On Thu, May 16, 2019 at 4:43 PM Oleksandr Natalenko
<oleksandr@redhat.com> wrote:
> On Thu, May 16, 2019 at 04:20:13PM +0200, Oleksandr Natalenko wrote:
> > > [...]
> > > > @@ -2960,15 +2962,63 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
> > > >  static ssize_t madvise_write(struct file *file, const char __user *buf,
> > > >                 size_t count, loff_t *ppos)
> > > >  {
> > > > +       /* For now, only KSM hints are implemented */
> > > > +#ifdef CONFIG_KSM
> > > > +       char buffer[PROC_NUMBUF];
> > > > +       int behaviour;
> > > >         struct task_struct *task;
> > > > +       struct mm_struct *mm;
> > > > +       int err = 0;
> > > > +       struct vm_area_struct *vma;
> > > > +
> > > > +       memset(buffer, 0, sizeof(buffer));
> > > > +       if (count > sizeof(buffer) - 1)
> > > > +               count = sizeof(buffer) - 1;
> > > > +       if (copy_from_user(buffer, buf, count))
> > > > +               return -EFAULT;
> > > > +
> > > > +       if (!memcmp("merge", buffer, min(sizeof("merge")-1, count)))
> > >
> > > This means that you also match on something like "mergeblah". Just use strcmp().
> >
> > I agree. Just to make it more interesting I must say that
> >
> >    /sys/kernel/mm/transparent_hugepage/enabled
> >
> > uses memcmp in the very same way, and thus echoing "alwaysssss" or
> > "madviseeee" works perfectly there, and it was like that from the very
> > beginning, it seems. Should we fix it, or it became (zomg) a public API?
>
> Actually, maybe, the reason for using memcmp is to handle "echo"
> properly: by default it puts a newline character at the end, so if we use
> just strcmp, echo should be called with -n, otherwise strcmp won't match
> the string.
>
> Huh?

Ah, yes, other code like e.g. proc_setgroups_write() uses strncmp()
and then has an extra check to make sure everything trailing is
whitespace.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
  2019-05-16 10:00     ` Jann Horn
  (?)
  (?)
@ 2019-05-16 16:29     ` Aaron Tomlin
  -1 siblings, 0 replies; 18+ messages in thread
From: Aaron Tomlin @ 2019-05-16 16:29 UTC (permalink / raw)
  To: Jann Horn
  Cc: Oleksandr Natalenko, kernel list, Kirill Tkhai, Hugh Dickins,
	Alexey Dobriyan, Vlastimil Babka, Michal Hocko, Matthew Wilcox,
	Pavel Tatashin, Greg KH, Suren Baghdasaryan, Minchan Kim,
	Timofey Titovets, Grzegorz Halat, Linux-MM, Linux API

On Thu 2019-05-16 12:00 +0200, Jann Horn wrote:
> On Thu, May 16, 2019 at 11:43 AM Oleksandr Natalenko
> <oleksandr@redhat.com> wrote:
[ ... ]
> > +       }
> > +
> > +       down_write(&mm->mmap_sem);
> 
> Should a check for mmget_still_valid(mm) be inserted here? See commit
> 04f5866e41fb70690e28397487d8bd8eea7d712a.

Yes - I'd say this is required here.

Thanks,

-- 
Aaron Tomlin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise
  2019-05-16  9:42 [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise Oleksandr Natalenko
                   ` (5 preceding siblings ...)
  2019-05-16 10:44 ` [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise Michal Hocko
@ 2019-05-16 17:24 ` Alexey Dobriyan
  6 siblings, 0 replies; 18+ messages in thread
From: Alexey Dobriyan @ 2019-05-16 17:24 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: linux-kernel, Kirill Tkhai, Hugh Dickins, Vlastimil Babka,
	Michal Hocko, Matthew Wilcox, Pavel Tatashin, Greg KH,
	Suren Baghdasaryan, Minchan Kim, Timofey Titovets, Aaron Tomlin,
	Grzegorz Halat, linux-mm, linux-api

On Thu, May 16, 2019 at 11:42:29AM +0200, Oleksandr Natalenko wrote:

> * to mark all the eligible VMAs as mergeable, use:
> 
>    # echo merge > /proc/<pid>/madvise
> 
> * to unmerge all the VMAs, use:
> 
>    # echo unmerge > /proc/<pid>/madvise

Please make a real system call (or abuse prctl(2) passing target's pid).

Your example automerge daemon could just call it and not bother with /proc.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-05-16 17:24 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-16  9:42 [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise Oleksandr Natalenko
2019-05-16  9:42 ` [PATCH RFC 1/5] proc: introduce madvise placeholder Oleksandr Natalenko
2019-05-16  9:42 ` [PATCH RFC 2/5] mm/ksm: introduce ksm_madvise_merge() helper Oleksandr Natalenko
2019-05-16  9:42 ` [PATCH RFC 3/5] mm/ksm: introduce ksm_madvise_unmerge() helper Oleksandr Natalenko
2019-05-16  9:42 ` [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge Oleksandr Natalenko
2019-05-16 10:00   ` Jann Horn
2019-05-16 10:00     ` Jann Horn
2019-05-16 14:20     ` Oleksandr Natalenko
2019-05-16 14:43       ` Oleksandr Natalenko
2019-05-16 16:09         ` Jann Horn
2019-05-16 16:09           ` Jann Horn
2019-05-16 16:06       ` Jann Horn
2019-05-16 16:06         ` Jann Horn
2019-05-16 16:29     ` Aaron Tomlin
2019-05-16  9:42 ` [PATCH RFC 5/5] mm/ksm, proc: add remote madvise documentation Oleksandr Natalenko
2019-05-16 10:44 ` [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise Michal Hocko
2019-05-16 14:21   ` Oleksandr Natalenko
2019-05-16 17:24 ` Alexey Dobriyan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.