linux-kernel.vger.kernel.org archive mirror
* [PATCH V3] KSM: allow dedup all tasks memory
@ 2018-11-12 23:13 Timofey Titovets
  2018-11-13  1:49 ` Matthew Wilcox
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Timofey Titovets @ 2018-11-12 23:13 UTC (permalink / raw)
  To: linux-kernel; +Cc: Timofey Titovets, Matthew Wilcox, linux-mm, linux-doc

From: Timofey Titovets <nefelim4ag@gmail.com>

KSM by default works only on memory that was registered via
madvise().

The only ways to get it working on other applications are:
  * using LD_PRELOAD and wrapper libraries
  * patching the kernel

Let's use the kernel task list and add logic to import VMAs from tasks.

This behaviour is controlled by two new attributes:
  * mode:
    I tried to mimic the hugepages attribute, so mode has two states:
      * madvise      - the old default behaviour
      * always [new] - allow KSM to pick up task VMAs and
                       try to work on them
  * seeker_sleep_millisecs:
    the pause between importing VMAs from tasks

For rate-limiting purposes and to bound tasklist locking time,
the KSM seeker thread imports VMAs from only one task per loop.

Some numbers from different non-madvised workloads.
Formulas:
  Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
  Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
  Memory used = free -h

  * Name: My working laptop
    Description: Many different chrome/electron apps + KDE
    Ratio: 5%
    Saved: ~100  MiB
    Used:  ~2000 MiB

  * Name: K8s test VM
    Description: Some small random running docker images
    Ratio: 40%
    Saved: ~160 MiB
    Used:  ~920 MiB

  * Name: Ceph test VM
    Description: Ceph Mon/OSD, some containers
    Ratio: 20%
    Saved: ~60 MiB
    Used:  ~600 MiB

  * Name: BareMetal K8s backend server
    Description: Different server apps in containers: C, Java, Go, etc.
    Ratio: 72%
    Saved: ~5800 MiB
    Used:  ~35.7 GiB

  * Name: BareMetal K8s processing server
    Description: Many instances of one CPU-intensive application
    Ratio: 55%
    Saved: ~2600 MiB
    Used:  ~28.0 GiB

  * Name: BareMetal Ceph node
    Description: Only OSD storage daemons running
    Ratio: 2%
    Saved: ~190 MiB
    Used:  ~11.7 GiB

Changes:
  v1 -> v2:
    * Rebase on v4.19.1 (should also apply to 4.20-rc2+)
  v2 -> v3:
    * Reformat patch description
    * Rename mode normal to madvise
    * Add some memory numbers
    * Fix checkpatch.pl warnings
    * Separate ksm vma seeker to another kthread
    * Fix: "BUG: scheduling while atomic: ksmd"
      by moving the seeker to a separate thread

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
CC: Matthew Wilcox <willy@infradead.org>
CC: linux-mm@kvack.org
CC: linux-doc@vger.kernel.org
---
 Documentation/admin-guide/mm/ksm.rst |  15 ++
 mm/ksm.c                             | 215 +++++++++++++++++++++++----
 2 files changed, 198 insertions(+), 32 deletions(-)

diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index 9303786632d1..7cffd47f9b38 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -116,6 +116,21 @@ run
         Default: 0 (must be changed to 1 to activate KSM, except if
         CONFIG_SYSFS is disabled)
 
+mode
+        * set always to allow ksm to deduplicate memory of every process
+        * set madvise to use only madvised memory
+
+        Default: madvise (deduplicate only madvised memory as in
+        earlier releases)
+
+seeker_sleep_millisecs
+        how many milliseconds the ksmd task seeker should sleep before
+        trying another task.
+        e.g. ``echo 1000 > /sys/kernel/mm/ksm/seeker_sleep_millisecs``
+
+        Default: 1000 (chosen for rate limit purposes)
+
+
 use_zero_pages
         specifies whether empty pages (i.e. allocated pages that only
         contain zeroes) should be treated specially.  When set to 1,
diff --git a/mm/ksm.c b/mm/ksm.c
index 5b0894b45ee5..1a03b28b6288 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -273,6 +273,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+/* Milliseconds ksmd seeker should sleep between runs */
+static unsigned int ksm_thread_seeker_sleep_millisecs = 1000;
+
 /* Checksum of an empty (zeroed) page */
 static unsigned int zero_checksum __read_mostly;
 
@@ -295,7 +298,12 @@ static int ksm_nr_node_ids = 1;
 static unsigned long ksm_run = KSM_RUN_STOP;
 static void wait_while_offlining(void);
 
+#define KSM_MODE_MADVISE 0
+#define KSM_MODE_ALWAYS	1
+static unsigned long ksm_mode = KSM_MODE_MADVISE;
+
 static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
+static DECLARE_WAIT_QUEUE_HEAD(ksm_seeker_thread_wait);
 static DEFINE_MUTEX(ksm_thread_mutex);
 static DEFINE_SPINLOCK(ksm_mmlist_lock);
 
@@ -303,6 +311,11 @@ static DEFINE_SPINLOCK(ksm_mmlist_lock);
 		sizeof(struct __struct), __alignof__(struct __struct),\
 		(__flags), NULL)
 
+static inline int ksm_mode_always(void)
+{
+	return (ksm_mode == KSM_MODE_ALWAYS);
+}
+
 static int __init ksm_slab_init(void)
 {
 	rmap_item_cache = KSM_KMEM_CACHE(rmap_item, 0);
@@ -2389,6 +2402,106 @@ static int ksmd_should_run(void)
 	return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
 }
 
+
+static int ksm_enter(struct mm_struct *mm, unsigned long *vm_flags)
+{
+	int err;
+
+	if (*vm_flags & (VM_MERGEABLE | VM_SHARED  | VM_MAYSHARE   |
+			 VM_PFNMAP    | VM_IO      | VM_DONTEXPAND |
+			 VM_HUGETLB | VM_MIXEDMAP))
+		return 0;
+
+#ifdef VM_SAO
+	if (*vm_flags & VM_SAO)
+		return 0;
+#endif
+#ifdef VM_SPARC_ADI
+	if (*vm_flags & VM_SPARC_ADI)
+		return 0;
+#endif
+	if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
+		err = __ksm_enter(mm);
+		if (err)
+			return err;
+	}
+
+	*vm_flags |= VM_MERGEABLE;
+
+	return 0;
+}
+
+/*
+ * Register all vmas for all processes in the system with KSM.
+ * Note that every call to ksm_enter(), for a given vma, after the first
+ * does nothing but set flags.
+ */
+void ksm_import_task_vma(struct task_struct *task)
+{
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	int error;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		return;
+	down_write(&mm->mmap_sem);
+	vma = mm->mmap;
+	while (vma) {
+		error = ksm_enter(vma->vm_mm, &vma->vm_flags);
+		vma = vma->vm_next;
+	}
+	up_write(&mm->mmap_sem);
+	mmput(mm);
+}
+
+static int ksm_seeker_thread(void *nothing)
+{
+	pid_t last_pid = 1;
+	pid_t curr_pid;
+	struct task_struct *task;
+
+	set_freezable();
+	set_user_nice(current, 5);
+
+	while (!kthread_should_stop()) {
+		wait_while_offlining();
+
+		try_to_freeze();
+
+		if (!ksm_mode_always()) {
+			wait_event_freezable(ksm_seeker_thread_wait,
+				ksm_mode_always() || kthread_should_stop());
+			continue;
+		}
+
+		/*
+		 * import one task's vma per run
+		 */
+		read_lock(&tasklist_lock);
+
+		/* Always try to get the next task */
+		for_each_process(task) {
+			curr_pid = task_pid_nr(task);
+			if (curr_pid == last_pid) {
+				task = next_task(task);
+				break;
+			}
+
+			if (curr_pid > last_pid)
+				break;
+		}
+
+		last_pid = task_pid_nr(task);
+		ksm_import_task_vma(task);
+		read_unlock(&tasklist_lock);
+
+		schedule_timeout_interruptible(
+			msecs_to_jiffies(ksm_thread_seeker_sleep_millisecs));
+	}
+	return 0;
+}
+
 static int ksm_scan_thread(void *nothing)
 {
 	set_freezable();
@@ -2422,33 +2535,9 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 
 	switch (advice) {
 	case MADV_MERGEABLE:
-		/*
-		 * Be somewhat over-protective for now!
-		 */
-		if (*vm_flags & (VM_MERGEABLE | VM_SHARED  | VM_MAYSHARE   |
-				 VM_PFNMAP    | VM_IO      | VM_DONTEXPAND |
-				 VM_HUGETLB | VM_MIXEDMAP))
-			return 0;		/* just ignore the advice */
-
-		if (vma_is_dax(vma))
-			return 0;
-
-#ifdef VM_SAO
-		if (*vm_flags & VM_SAO)
-			return 0;
-#endif
-#ifdef VM_SPARC_ADI
-		if (*vm_flags & VM_SPARC_ADI)
-			return 0;
-#endif
-
-		if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
-			err = __ksm_enter(mm);
-			if (err)
-				return err;
-		}
-
-		*vm_flags |= VM_MERGEABLE;
+		err = ksm_enter(mm, vm_flags);
+		if (err)
+			return err;
 		break;
 
 	case MADV_UNMERGEABLE:
@@ -2829,6 +2918,29 @@ static ssize_t sleep_millisecs_store(struct kobject *kobj,
 }
 KSM_ATTR(sleep_millisecs);
 
+static ssize_t seeker_sleep_millisecs_show(struct kobject *kobj,
+				    struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", ksm_thread_seeker_sleep_millisecs);
+}
+
+static ssize_t seeker_sleep_millisecs_store(struct kobject *kobj,
+				     struct kobj_attribute *attr,
+				     const char *buf, size_t count)
+{
+	unsigned long msecs;
+	int err;
+
+	err = kstrtoul(buf, 10, &msecs);
+	if (err || msecs > UINT_MAX)
+		return -EINVAL;
+
+	ksm_thread_seeker_sleep_millisecs = msecs;
+
+	return count;
+}
+KSM_ATTR(seeker_sleep_millisecs);
+
 static ssize_t pages_to_scan_show(struct kobject *kobj,
 				  struct kobj_attribute *attr, char *buf)
 {
@@ -2852,6 +2964,34 @@ static ssize_t pages_to_scan_store(struct kobject *kobj,
 }
 KSM_ATTR(pages_to_scan);
 
+static ssize_t mode_show(struct kobject *kobj, struct kobj_attribute *attr,
+			char *buf)
+{
+	switch (ksm_mode) {
+	case KSM_MODE_ALWAYS:
+		return sprintf(buf, "[always] madvise\n");
+	case KSM_MODE_MADVISE:
+		return sprintf(buf, "always [madvise]\n");
+	}
+
+	return sprintf(buf, "always [madvise]\n");
+}
+
+static ssize_t mode_store(struct kobject *kobj, struct kobj_attribute *attr,
+			 const char *buf, size_t count)
+{
+	if (!memcmp("always", buf, min(sizeof("always")-1, count))) {
+		ksm_mode = KSM_MODE_ALWAYS;
+		wake_up_interruptible(&ksm_seeker_thread_wait);
+	} else if (!memcmp("madvise", buf, min(sizeof("madvise")-1, count))) {
+		ksm_mode = KSM_MODE_MADVISE;
+	} else
+		return -EINVAL;
+
+	return count;
+}
+KSM_ATTR(mode);
+
 static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr,
 			char *buf)
 {
@@ -3108,7 +3248,9 @@ KSM_ATTR_RO(full_scans);
 
 static struct attribute *ksm_attrs[] = {
 	&sleep_millisecs_attr.attr,
+	&seeker_sleep_millisecs_attr.attr,
 	&pages_to_scan_attr.attr,
+	&mode_attr.attr,
 	&run_attr.attr,
 	&pages_shared_attr.attr,
 	&pages_sharing_attr.attr,
@@ -3134,7 +3276,7 @@ static const struct attribute_group ksm_attr_group = {
 
 static int __init ksm_init(void)
 {
-	struct task_struct *ksm_thread;
+	struct task_struct *ksm_thread[2];
 	int err;
 
 	/* The correct value depends on page size and endianness */
@@ -3146,10 +3288,18 @@ static int __init ksm_init(void)
 	if (err)
 		goto out;
 
-	ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
-	if (IS_ERR(ksm_thread)) {
+	ksm_thread[0] = kthread_run(ksm_scan_thread, NULL, "ksmd");
+	if (IS_ERR(ksm_thread[0])) {
 		pr_err("ksm: creating kthread failed\n");
-		err = PTR_ERR(ksm_thread);
+		err = PTR_ERR(ksm_thread[0]);
+		goto out_free;
+	}
+
+	ksm_thread[1] = kthread_run(ksm_seeker_thread, NULL, "ksmd_seeker");
+	if (IS_ERR(ksm_thread[1])) {
+		pr_err("ksm: creating seeker kthread failed\n");
+		err = PTR_ERR(ksm_thread[1]);
+		kthread_stop(ksm_thread[0]);
 		goto out_free;
 	}
 
@@ -3157,7 +3307,8 @@ static int __init ksm_init(void)
 	err = sysfs_create_group(mm_kobj, &ksm_attr_group);
 	if (err) {
 		pr_err("ksm: register sysfs failed\n");
-		kthread_stop(ksm_thread);
+		kthread_stop(ksm_thread[0]);
+		kthread_stop(ksm_thread[1]);
 		goto out_free;
 	}
 #else
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-12 23:13 [PATCH V3] KSM: allow dedup all tasks memory Timofey Titovets
@ 2018-11-13  1:49 ` Matthew Wilcox
  2018-11-13 11:25   ` Timofey Titovets
  2018-11-13  2:25 ` Pavel Tatashin
  2018-11-13 11:57 ` Jann Horn
  2 siblings, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2018-11-13  1:49 UTC (permalink / raw)
  To: Timofey Titovets; +Cc: linux-kernel, Timofey Titovets, linux-mm, linux-doc

On Tue, Nov 13, 2018 at 02:13:44AM +0300, Timofey Titovets wrote:
> Some numbers from different non-madvised workloads.
> Formulas:
>   Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
>   Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
>   Memory used = free -h
> 
>   * Name: My working laptop
>     Description: Many different chrome/electron apps + KDE
>     Ratio: 5%
>     Saved: ~100  MiB
>     Used:  ~2000 MiB

Your _laptop_ saves 100MB of RAM?  That's extraordinary.  Essentially
that's like getting an extra 100MB of page cache for free.  Is there
any observable slowdown?  I could even see there being a speedup (due
to your working set being allowed to be 5% larger).

I am now a big fan of this patch and shall try to give it the review
that it deserves.


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-12 23:13 [PATCH V3] KSM: allow dedup all tasks memory Timofey Titovets
  2018-11-13  1:49 ` Matthew Wilcox
@ 2018-11-13  2:25 ` Pavel Tatashin
  2018-11-13 11:40   ` Timofey Titovets
  2018-11-13 11:57 ` Jann Horn
  2 siblings, 1 reply; 28+ messages in thread
From: Pavel Tatashin @ 2018-11-13  2:25 UTC (permalink / raw)
  To: Timofey Titovets
  Cc: linux-kernel, Timofey Titovets, Matthew Wilcox, linux-mm, linux-doc

On 18-11-13 02:13:44, Timofey Titovets wrote:
> [...]
> +static int ksm_seeker_thread(void *nothing)

Is it really necessary to have an extra thread in KSM just to add VMAs
for scanning? Can we do it right from the scanner thread? Also, maybe
it is better to add VMAs at their creation time when KSM_MODE_ALWAYS is
enabled?

Thank you,
Pasha


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13  1:49 ` Matthew Wilcox
@ 2018-11-13 11:25   ` Timofey Titovets
  0 siblings, 0 replies; 28+ messages in thread
From: Timofey Titovets @ 2018-11-13 11:25 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Linux Kernel, linux-mm, linux-doc

On Tue, 13 Nov 2018 at 04:49, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Nov 13, 2018 at 02:13:44AM +0300, Timofey Titovets wrote:
> > Some numbers from different not madvised workloads.
> > Formulas:
> >   Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
> >   Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
> >   Memory used = free -h
> >
> >   * Name: My working laptop
> >     Description: Many different chrome/electron apps + KDE
> >     Ratio: 5%
> >     Saved: ~100  MiB
> >     Used:  ~2000 MiB
>
> Your _laptop_ saves 100MB of RAM?  That's extraordinary.  Essentially
> that's like getting an extra 100MB of page cache for free.  Is there
> any observable slowdown?  I could even see there being a speedup (due
> to your working set being allowed to be 5% larger)
>
> I am now a big fan of this patch and shall try to give it the review
> that it deserves.

I'm not sure if this is sarcasm;
anyway, I'll try to do my best to get this working.

On any x86 desktop with a mixed load (browser, docs, games, etc.)
you will always see something like 40-200 MiB of deduplicated pages,
depending on the type of load, of course.

I'm just not trying to use those numbers as an argument for general KSM
deduplication in the kernel,
because in the current generation, with several gigabytes of memory,
a few saved MiB don't look significant to most people.

Thanks!


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13  2:25 ` Pavel Tatashin
@ 2018-11-13 11:40   ` Timofey Titovets
  2018-11-13 18:42     ` Pavel Tatashin
  0 siblings, 1 reply; 28+ messages in thread
From: Timofey Titovets @ 2018-11-13 11:40 UTC (permalink / raw)
  To: pasha.tatashin; +Cc: Linux Kernel, Matthew Wilcox, linux-mm, linux-doc

On Tue, 13 Nov 2018 at 05:25, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On 18-11-13 02:13:44, Timofey Titovets wrote:
> > [...]
> > +static int ksm_seeker_thread(void *nothing)
>
> Is it really necessary to have an extra thread in KSM just to add VMAs
> for scanning? Can we do it right from the scanner thread? Also, maybe
> it is better to add VMAs at their creation time when KSM_MODE_ALWAYS is
> enabled?
>
> Thank you,
> Pasha

Oh, that's a long story, and my English is too bad to describe it all;
it's even hard to find the linux-mm conversation about this from several
years ago.

Anyway, so:
In V2 I used the scanner thread to add VMAs, but I think the scanner did
that at too high a rate,
i.e. walking the task list, picking a new task every 20 ms, waiting for
the write semaphore, to get the VMAs...
Too high a rate for a task list scanner; I think it's overkill.

About adding VMAs at creation time:
UKSM adds ksm_enter() hooks to the mm subsystem, and I ported that to KSM.
But some mm people said they don't like adding KSM hooks to other
subsystems, and want KSM to do that internally somehow.

Frankly speaking, I didn't have enough knowledge and skill to do it
another way at the time.
They also suggested I look at THP for that logic, but I can't find how
THP does it without hooks, or
where THP truly scans memory.

So, after all that, I implemented it this way:
in the first iteration as part of the KSM scan thread, and in the second,
as a separate thread,
because that allows adding VMAs in a fully independent way.

Thanks!


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-12 23:13 [PATCH V3] KSM: allow dedup all tasks memory Timofey Titovets
  2018-11-13  1:49 ` Matthew Wilcox
  2018-11-13  2:25 ` Pavel Tatashin
@ 2018-11-13 11:57 ` Jann Horn
  2018-11-13 12:58   ` Timofey Titovets
  2 siblings, 1 reply; 28+ messages in thread
From: Jann Horn @ 2018-11-13 11:57 UTC (permalink / raw)
  To: timofey.titovets
  Cc: kernel list, nefelim4ag, Matthew Wilcox, Linux-MM, linux-doc

On Tue, Nov 13, 2018 at 12:40 PM Timofey Titovets
<timofey.titovets@synesis.ru> wrote:
> KSM by default works only on memory that was registered via
> madvise().
>
> The only ways to get it working on other applications are:
>   * using LD_PRELOAD and wrapper libraries
>   * patching the kernel
>
> Let's use the kernel task list and add logic to import VMAs from tasks.
>
> This behaviour is controlled by two new attributes:
>   * mode:
>     I tried to mimic the hugepages attribute, so mode has two states:
>       * madvise      - the old default behaviour
>       * always [new] - allow KSM to pick up task VMAs and
>                        try to work on them

Please don't. And if you really have to for some reason, put some big
warnings on this, advising people that it's a security risk.

KSM is one of the favorite punching bags of side-channel and hardware
security researchers:

As a gigantic, problematic side channel:
http://staff.aist.go.jp/k.suzaki/EuroSec2011-suzaki.pdf
https://www.usenix.org/system/files/conference/woot15/woot15-paper-barresi.pdf
https://access.redhat.com/blogs/766093/posts/1976303
https://gruss.cc/files/dedup.pdf

In particular https://gruss.cc/files/dedup.pdf ("Practical Memory
Deduplication Attacks in Sandboxed JavaScript") shows that KSM makes
it possible to use malicious JavaScript to determine whether a given
page of memory exists elsewhere on your system.

And also as a way to target rowhammer-based faults:
https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_razavi.pdf
https://thisissecurity.stormshield.com/2017/10/19/attacking-co-hosted-vm-hacker-hammer-two-memory-modules/


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 11:57 ` Jann Horn
@ 2018-11-13 12:58   ` Timofey Titovets
  2018-11-13 13:25     ` Jann Horn
  0 siblings, 1 reply; 28+ messages in thread
From: Timofey Titovets @ 2018-11-13 12:58 UTC (permalink / raw)
  To: jannh; +Cc: Linux Kernel, Matthew Wilcox, linux-mm, linux-doc

On Tue, 13 Nov 2018 at 14:57, Jann Horn <jannh@google.com> wrote:
>
> On Tue, Nov 13, 2018 at 12:40 PM Timofey Titovets
> <timofey.titovets@synesis.ru> wrote:
> > ksm by default working only on memory that added by
> > madvise().
> >
> > And only way get that work on other applications:
> >   * Use LD_PRELOAD and libraries
> >   * Patch kernel
> >
> > Lets use kernel task list and add logic to import VMAs from tasks.
> >
> > That behaviour controlled by new attributes:
> >   * mode:
> >     I try mimic hugepages attribute, so mode have two states:
> >       * madvise      - old default behaviour
> >       * always [new] - allow ksm to get tasks vma and
> >                        try working on that.
>
> Please don't. And if you really have to for some reason, put some big
> warnings on this, advising people that it's a security risk.
>
> KSM is one of the favorite punching bags of side-channel and hardware
> security researchers:
>
> As a gigantic, problematic side channel:
> http://staff.aist.go.jp/k.suzaki/EuroSec2011-suzaki.pdf
> https://www.usenix.org/system/files/conference/woot15/woot15-paper-barresi.pdf
> https://access.redhat.com/blogs/766093/posts/1976303
> https://gruss.cc/files/dedup.pdf
>
> In particular https://gruss.cc/files/dedup.pdf ("Practical Memory
> Deduplication Attacks in Sandboxed JavaScript") shows that KSM makes
> it possible to use malicious JavaScript to determine whether a given
> page of memory exists elsewhere on your system.
>
> And also as a way to target rowhammer-based faults:
> https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_razavi.pdf
> https://thisissecurity.stormshield.com/2017/10/19/attacking-co-hosted-vm-hacker-hammer-two-memory-modules/

I'm very sorry, I'm not a security specialist.
But if I understood correctly, KSM has these security issues _without_
my patch set.
Moreover, it is not only KSM that has this type of issue; any memory
deduplication has these problems.
Anyone who cares about security must decide for themselves which tools
to use and how to defend against others.
It is also up to them to learn the tools they use and make an informed
decision, right?

So, if you really care about that problem, in general or only on the
KSM side, it is your initiative and your duty to warn people about it.

KSM has existed for 10+ years. You know about the security implications
of using memory deduplication.
It is your duty to send patches for the documentation and add the
appropriate warnings.

Sorry for being passive-aggressive;
I am not trying to hurt or humiliate anyone.

That is just my honest opinion, and my English is too limited
to put it more gently.

Thanks!


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 12:58   ` Timofey Titovets
@ 2018-11-13 13:25     ` Jann Horn
  0 siblings, 0 replies; 28+ messages in thread
From: Jann Horn @ 2018-11-13 13:25 UTC (permalink / raw)
  To: timofey.titovets; +Cc: kernel list, Matthew Wilcox, Linux-MM, linux-doc

On Tue, Nov 13, 2018 at 1:59 PM Timofey Titovets
<timofey.titovets@synesis.ru> wrote:
>
> On Tue, 13 Nov 2018 at 14:57, Jann Horn <jannh@google.com> wrote:
> >
> > On Tue, Nov 13, 2018 at 12:40 PM Timofey Titovets
> > <timofey.titovets@synesis.ru> wrote:
> > > ksm by default working only on memory that added by
> > > madvise().
> > >
> > > And only way get that work on other applications:
> > >   * Use LD_PRELOAD and libraries
> > >   * Patch kernel
> > >
> > > Lets use kernel task list and add logic to import VMAs from tasks.
> > >
> > > That behaviour controlled by new attributes:
> > >   * mode:
> > >     I try mimic hugepages attribute, so mode have two states:
> > >       * madvise      - old default behaviour
> > >       * always [new] - allow ksm to get tasks vma and
> > >                        try working on that.
> >
> > Please don't. And if you really have to for some reason, put some big
> > warnings on this, advising people that it's a security risk.
> >
> > KSM is one of the favorite punching bags of side-channel and hardware
> > security researchers:
> >
> > As a gigantic, problematic side channel:
> > http://staff.aist.go.jp/k.suzaki/EuroSec2011-suzaki.pdf
> > https://www.usenix.org/system/files/conference/woot15/woot15-paper-barresi.pdf
> > https://access.redhat.com/blogs/766093/posts/1976303
> > https://gruss.cc/files/dedup.pdf
> >
> > In particular https://gruss.cc/files/dedup.pdf ("Practical Memory
> > Deduplication Attacks in Sandboxed JavaScript") shows that KSM makes
> > it possible to use malicious JavaScript to determine whether a given
> > page of memory exists elsewhere on your system.
> >
> > And also as a way to target rowhammer-based faults:
> > https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_razavi.pdf
> > https://thisissecurity.stormshield.com/2017/10/19/attacking-co-hosted-vm-hacker-hammer-two-memory-modules/
>
> I'm very sorry, i'm not a security specialist.
> But if i understood correctly, ksm have that security issues _without_
> my patch set.

Yep. However, so far, it requires an application to explicitly opt in
to this behavior, so it's not all that bad. Your patch would remove
the requirement for application opt-in, which, in my opinion, makes
this way worse and reduces the number of applications for which this
is acceptable.

> Even more, not only KSM have that type of issue, any memory
> deduplication have that problems.

Yup.

> Any guy who care about security must decide on it self. Which things
> him use and how he will
> defend from others.

> Even more on it self he must learn tools, what he use and make some
> decision right?
>
> So, if you really care about that problem in general, or only on KSM side,
> that your initiative and your duty to warn people about that.
>
> KSM already exists for 10+ years. You know about security implication
> of use memory deduplication.
> That your duty to send a patches to documentation, and add appropriate warnings.

As far as I know, basically nobody is using KSM at this point. There
are blog posts from several cloud providers about these security risks
that explicitly state that they're not using memory deduplication.


* Re: [PATCH V3] KSM: allow dedup all tasks memory
       [not found] <<CAG48ez0ZprqUYGZFxcrY6U3Dnwt77q1NJXzzpsn1XNkRuXVppw@mail.gmail.com>
@ 2018-11-13 14:23 ` Oleksandr Natalenko
  2018-11-13 17:59   ` Pavel Tatashin
  0 siblings, 1 reply; 28+ messages in thread
From: Oleksandr Natalenko @ 2018-11-13 14:23 UTC (permalink / raw)
  To: jannh; +Cc: linux-doc, linux-kernel, linux-mm, timofey.titovets, willy

Hi.

> Yep. However, so far, it requires an application to explicitly opt in
> to this behavior, so it's not all that bad. Your patch would remove
> the requirement for application opt-in, which, in my opinion, makes
> this way worse and reduces the number of applications for which this
> is acceptable.

The default is to maintain the old behaviour, so unless the explicit 
decision is made by the administrator, no extra risk is imposed.

> As far as I know, basically nobody is using KSM at this point. There
> are blog posts from several cloud providers about these security risks
> that explicitly state that they're not using memory deduplication.

I tend to disagree here. Based on both what my company does and what 
UKSM users do, memory dedup is a desired option (note "option" word 
here, not the default choice).

Thanks.

-- 
   Oleksandr Natalenko (post-factum)


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 14:23 ` Oleksandr Natalenko
@ 2018-11-13 17:59   ` Pavel Tatashin
  2018-11-13 18:17     ` Timofey Titovets
  2018-11-13 20:26     ` Jann Horn
  0 siblings, 2 replies; 28+ messages in thread
From: Pavel Tatashin @ 2018-11-13 17:59 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: jannh, linux-doc, linux-kernel, linux-mm, timofey.titovets, willy

On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> Hi.
> 
> > Yep. However, so far, it requires an application to explicitly opt in
> > to this behavior, so it's not all that bad. Your patch would remove
> > the requirement for application opt-in, which, in my opinion, makes
> > this way worse and reduces the number of applications for which this
> > is acceptable.
> 
> The default is to maintain the old behaviour, so unless the explicit
> decision is made by the administrator, no extra risk is imposed.

The new interface would be more tolerable if it honored MADV_UNMERGEABLE:

KSM default on: merge everything except when MADV_UNMERGEABLE is
explicitly set.

KSM default off: merge only when MADV_MERGEABLE is set.

The proposed change won't honor MADV_UNMERGEABLE, meaning that
application programmers won't have a way to prevent sensitive data from
ever being merged. So, I think, we should keep an explicit opt-out
option for applications.

> 
> > As far as I know, basically nobody is using KSM at this point. There
> > are blog posts from several cloud providers about these security risks
> > that explicitly state that they're not using memory deduplication.
> 
> I tend to disagree here. Based on both what my company does and what UKSM
> users do, memory dedup is a desired option (note "option" word here, not the
> default choice).

Lightweight containers is a use case for KSM: when many VMs share the
same small kernel. KSM is used in production by large cloud vendors.

Thank you,
Pasha


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 17:59   ` Pavel Tatashin
@ 2018-11-13 18:17     ` Timofey Titovets
  2018-11-13 18:35       ` Pavel Tatashin
  2018-11-13 20:26     ` Jann Horn
  1 sibling, 1 reply; 28+ messages in thread
From: Timofey Titovets @ 2018-11-13 18:17 UTC (permalink / raw)
  To: pasha.tatashin
  Cc: oleksandr, Jann Horn, linux-doc, Linux Kernel, linux-mm, Matthew Wilcox

On Tue, 13 Nov 2018 at 20:59, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> > Hi.
> >
> > > Yep. However, so far, it requires an application to explicitly opt in
> > > to this behavior, so it's not all that bad. Your patch would remove
> > > the requirement for application opt-in, which, in my opinion, makes
> > > this way worse and reduces the number of applications for which this
> > > is acceptable.
> >
> > The default is to maintain the old behaviour, so unless the explicit
> > decision is made by the administrator, no extra risk is imposed.
>
> The new interface would be more tolerable if it honored MADV_UNMERGEABLE:
>
> KSM default on: merge everything except when MADV_UNMERGEABLE is
> excplicitly set.
>
> KSM default off: merge only when MADV_MERGEABLE is set.
>
> The proposed change won't honor MADV_UNMERGEABLE, meaning that
> application programmers won't have a way to prevent sensitive data to be
> every merged. So, I think, we should keep allow an explicit opt-out
> option for applications.
>

We currently just do not have a VM/madvise flag for that.
Same as THP: all the logic is written with the assumption that we have
exactly two states, allow/disallow (more precisely, "not allowed").

If we try to add one, it would have to be something like
MADV_FORBID_* to disallow something completely.

The same applies to THP (currently some apps simply refuse to start if
THP is enabled, because there is no way to forbid THP).

Thanks.

> >
> > > As far as I know, basically nobody is using KSM at this point. There
> > > are blog posts from several cloud providers about these security risks
> > > that explicitly state that they're not using memory deduplication.
> >
> > I tend to disagree here. Based on both what my company does and what UKSM
> > users do, memory dedup is a desired option (note "option" word here, not the
> > default choice).
>
> Lightweight containers is a use case for KSM: when many VMs share the
> same small kernel. KSM is used in production by large cloud vendors.
>
> Thank you,
> Pasha
>


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 18:17     ` Timofey Titovets
@ 2018-11-13 18:35       ` Pavel Tatashin
  2018-11-13 18:54         ` Timofey Titovets
  0 siblings, 1 reply; 28+ messages in thread
From: Pavel Tatashin @ 2018-11-13 18:35 UTC (permalink / raw)
  To: Timofey Titovets
  Cc: oleksandr, Jann Horn, linux-doc, Linux Kernel, linux-mm, Matthew Wilcox

On 18-11-13 21:17:42, Timofey Titovets wrote:
> On Tue, 13 Nov 2018 at 20:59, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> >
> > On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> > > Hi.
> > >
> > > > Yep. However, so far, it requires an application to explicitly opt in
> > > > to this behavior, so it's not all that bad. Your patch would remove
> > > > the requirement for application opt-in, which, in my opinion, makes
> > > > this way worse and reduces the number of applications for which this
> > > > is acceptable.
> > >
> > > The default is to maintain the old behaviour, so unless the explicit
> > > decision is made by the administrator, no extra risk is imposed.
> >
> > The new interface would be more tolerable if it honored MADV_UNMERGEABLE:
> >
> > KSM default on: merge everything except when MADV_UNMERGEABLE is
> > excplicitly set.
> >
> > KSM default off: merge only when MADV_MERGEABLE is set.
> >
> > The proposed change won't honor MADV_UNMERGEABLE, meaning that
> > application programmers won't have a way to prevent sensitive data to be
> > every merged. So, I think, we should keep allow an explicit opt-out
> > option for applications.
> >
> 
> We just did not have VM/Madvise flag for that currently.
> Same as THP.
> Because all logic written with assumption, what we have exactly 2 states.
> Allow/Disallow (More like not allow).
> 
> And if we try to add, that must be something like:
> MADV_FORBID_* to disallow something completely.

No need to add a new user flag MADV_FORBID; we should keep MADV_MERGEABLE
and MADV_UNMERGEABLE, but make them work so that when MADV_UNMERGEABLE is
set, the memory indeed always becomes unmergeable regardless of the KSM
mode of operation.

To do the above in ksm_madvise(), a new state should be added, for example
instead of: 

case MADV_UNMERGEABLE:
	*vm_flags &= ~VM_MERGEABLE;

A new flag should be used:
	*vm_flags |= VM_UNMERGEABLE;

I think that without honoring MADV_UNMERGEABLE correctly, this patch
won't be accepted.

Pasha


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 11:40   ` Timofey Titovets
@ 2018-11-13 18:42     ` Pavel Tatashin
  2018-11-13 22:55       ` Timofey Titovets
  0 siblings, 1 reply; 28+ messages in thread
From: Pavel Tatashin @ 2018-11-13 18:42 UTC (permalink / raw)
  To: Timofey Titovets; +Cc: LKML, Matthew Wilcox, linux-mm, linux-doc

> > Is it really necessary to have an extra thread in ksm just to add vma's
> > for scanning? Can we do it right from the scanner thread? Also, may be
> > it is better to add vma's at their creation time when KSM_MODE_ALWAYS is
> > enabled?
> >
> > Thank you,
> > Pasha
>
> Oh, thats a long story, and my english to bad for describe all things,
> even that hard to find linux-mm conversation several years ago about that.
>
> Anyway, so:
> In V2 - i use scanner thread to add VMA, but i think scanner do that
> with too high rate.
> i.e. walk on task list, and get new task every 20ms, to wait write semaphore,
> to get VMA...
> To high rate for task list scanner, i think it's overkill.
>
> About add VMA from creation time,
> UKSM add ksm_enter() hooks to mm subsystem, i port that to KSM.
> But some mm people say what they not like add KSM hooks to other subsystems.
> And want ksm do that internally by some way.
>
> Frankly speaking i didn't have enough knowledge and skills to do that
> another way in past time.
> They also suggest me look to THP for that logic, but i can't find how
> THP do that without hooks, and
> where THP truly scan memory.
>
> So, after all of that i implemented this in that way.
> In first iteration as part of ksm scan thread, and in second, by
> separate thread.
> Because that allow to add VMA in fully independent way.

It still feels like the wrong direction: a new thread that adds random
VMAs to scan, with no way to optimize queue fairness, for example.
It should really be done at creation time: when a VMA is created, it
should be added to the KSM scanning queue, or the main KSM scanner
thread should go through the VMA list in a coherent order.

The design of having a separate thread is bad. I plan to add per-node
thread support to KSM in the future, and this one odd thread would
break things: to which queue should this thread add VMAs if there are
multiple queues?

Thank you,
Pasha


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 18:35       ` Pavel Tatashin
@ 2018-11-13 18:54         ` Timofey Titovets
  2018-11-13 19:16           ` Pavel Tatashin
  0 siblings, 1 reply; 28+ messages in thread
From: Timofey Titovets @ 2018-11-13 18:54 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Oleksandr Natalenko, Jann Horn, linux-doc, Linux Kernel,
	linux-mm, Matthew Wilcox

On Tue, 13 Nov 2018 at 21:35, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On 18-11-13 21:17:42, Timofey Titovets wrote:
> > On Tue, 13 Nov 2018 at 20:59, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> > >
> > > On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> > > > Hi.
> > > >
> > > > > Yep. However, so far, it requires an application to explicitly opt in
> > > > > to this behavior, so it's not all that bad. Your patch would remove
> > > > > the requirement for application opt-in, which, in my opinion, makes
> > > > > this way worse and reduces the number of applications for which this
> > > > > is acceptable.
> > > >
> > > > The default is to maintain the old behaviour, so unless the explicit
> > > > decision is made by the administrator, no extra risk is imposed.
> > >
> > > The new interface would be more tolerable if it honored MADV_UNMERGEABLE:
> > >
> > > KSM default on: merge everything except when MADV_UNMERGEABLE is
> > > excplicitly set.
> > >
> > > KSM default off: merge only when MADV_MERGEABLE is set.
> > >
> > > The proposed change won't honor MADV_UNMERGEABLE, meaning that
> > > application programmers won't have a way to prevent sensitive data to be
> > > every merged. So, I think, we should keep allow an explicit opt-out
> > > option for applications.
> > >
> >
> > We just did not have VM/Madvise flag for that currently.
> > Same as THP.
> > Because all logic written with assumption, what we have exactly 2 states.
> > Allow/Disallow (More like not allow).
> >
> > And if we try to add, that must be something like:
> > MADV_FORBID_* to disallow something completely.
>
> No need to add new user flag MADV_FORBID, we should keep MADV_MERGEABLE
> and MADV_UNMERGEABLE, but make them work so when MADV_UNMERGEABLE is
> set, memory is indeed becomes always unmergeable regardless of ksm mode
> of operation.
>
> To do the above in ksm_madvise(), a new state should be added, for example
> instead of:
>
> case MADV_UNMERGEABLE:
>         *vm_flags &= ~VM_MERGEABLE;
>
> A new flag should be used:
>         *vm_flags |= VM_UNMERGEABLE;
>
> I think, without honoring MADV_UNMERGEABLE correctly, this patch won't
> be accepted.
>
> Pasha
>

That would work, but we are out of bit space in vm_flags [1],
i.e. the first 32 bits are already defined, and the others are
accessible only on 64-bit machines.

1. https://elixir.bootlin.com/linux/latest/source/include/linux/mm.h#L219


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 18:54         ` Timofey Titovets
@ 2018-11-13 19:16           ` Pavel Tatashin
  2018-11-13 22:40             ` Timofey Titovets
  0 siblings, 1 reply; 28+ messages in thread
From: Pavel Tatashin @ 2018-11-13 19:16 UTC (permalink / raw)
  To: Timofey Titovets
  Cc: Oleksandr Natalenko, Jann Horn, linux-doc, Linux Kernel,
	linux-mm, Matthew Wilcox

On 18-11-13 21:54:13, Timofey Titovets wrote:
> On Tue, 13 Nov 2018 at 21:35, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> >
> > On 18-11-13 21:17:42, Timofey Titovets wrote:
> > > On Tue, 13 Nov 2018 at 20:59, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> > > >
> > > > On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> > > > > Hi.
> > > > >
> > > > > > Yep. However, so far, it requires an application to explicitly opt in
> > > > > > to this behavior, so it's not all that bad. Your patch would remove
> > > > > > the requirement for application opt-in, which, in my opinion, makes
> > > > > > this way worse and reduces the number of applications for which this
> > > > > > is acceptable.
> > > > >
> > > > > The default is to maintain the old behaviour, so unless the explicit
> > > > > decision is made by the administrator, no extra risk is imposed.
> > > >
> > > > The new interface would be more tolerable if it honored MADV_UNMERGEABLE:
> > > >
> > > > KSM default on: merge everything except when MADV_UNMERGEABLE is
> > > > excplicitly set.
> > > >
> > > > KSM default off: merge only when MADV_MERGEABLE is set.
> > > >
> > > > The proposed change won't honor MADV_UNMERGEABLE, meaning that
> > > > application programmers won't have a way to prevent sensitive data to be
> > > > every merged. So, I think, we should keep allow an explicit opt-out
> > > > option for applications.
> > > >
> > >
> > > We just did not have VM/Madvise flag for that currently.
> > > Same as THP.
> > > Because all logic written with assumption, what we have exactly 2 states.
> > > Allow/Disallow (More like not allow).
> > >
> > > And if we try to add, that must be something like:
> > > MADV_FORBID_* to disallow something completely.
> >
> > No need to add new user flag MADV_FORBID, we should keep MADV_MERGEABLE
> > and MADV_UNMERGEABLE, but make them work so when MADV_UNMERGEABLE is
> > set, memory is indeed becomes always unmergeable regardless of ksm mode
> > of operation.
> >
> > To do the above in ksm_madvise(), a new state should be added, for example
> > instead of:
> >
> > case MADV_UNMERGEABLE:
> >         *vm_flags &= ~VM_MERGEABLE;
> >
> > A new flag should be used:
> >         *vm_flags |= VM_UNMERGEABLE;
> >
> > I think, without honoring MADV_UNMERGEABLE correctly, this patch won't
> > be accepted.
> >
> > Pasha
> >
> 
> That must work, but we out of bit space in vm_flags [1].
> i.e. first 32 bits already defined, and other only accessible only on
> 64-bit machines.

So, grow vm_flags_t to 64-bit, or enable this feature on 64-bit only.

> 
> 1. https://elixir.bootlin.com/linux/latest/source/include/linux/mm.h#L219


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 17:59   ` Pavel Tatashin
  2018-11-13 18:17     ` Timofey Titovets
@ 2018-11-13 20:26     ` Jann Horn
  2018-11-13 22:35       ` Pavel Tatashin
  1 sibling, 1 reply; 28+ messages in thread
From: Jann Horn @ 2018-11-13 20:26 UTC (permalink / raw)
  To: pasha.tatashin
  Cc: oleksandr, linux-doc, kernel list, Linux-MM, timofey.titovets,
	Matthew Wilcox, Daniel Gruss

+cc Daniel Gruss

On Tue, Nov 13, 2018 at 6:59 PM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
> On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> > Hi.
> >
> > > Yep. However, so far, it requires an application to explicitly opt in
> > > to this behavior, so it's not all that bad. Your patch would remove
> > > the requirement for application opt-in, which, in my opinion, makes
> > > this way worse and reduces the number of applications for which this
> > > is acceptable.
> >
> > The default is to maintain the old behaviour, so unless the explicit
> > decision is made by the administrator, no extra risk is imposed.
>
> The new interface would be more tolerable if it honored MADV_UNMERGEABLE:
>
> KSM default on: merge everything except when MADV_UNMERGEABLE is
> excplicitly set.
>
> KSM default off: merge only when MADV_MERGEABLE is set.
>
> The proposed change won't honor MADV_UNMERGEABLE, meaning that
> application programmers won't have a way to prevent sensitive data to be
> every merged. So, I think, we should keep allow an explicit opt-out
> option for applications.
>
> >
> > > As far as I know, basically nobody is using KSM at this point. There
> > > are blog posts from several cloud providers about these security risks
> > > that explicitly state that they're not using memory deduplication.
> >
> > I tend to disagree here. Based on both what my company does and what UKSM
> > users do, memory dedup is a desired option (note "option" word here, not the
> > default choice).
>
> Lightweight containers is a use case for KSM: when many VMs share the
> same small kernel. KSM is used in production by large cloud vendors.

Wait, what? Can you name specific ones? Nowadays, enabling KSM for
untrusted VMs seems like a terrible idea to me, security-wise.

Google says at <https://cloud.google.com/blog/products/gcp/7-ways-we-harden-our-kvm-hypervisor-at-google-cloud-security-in-plaintext>:
"Compute Engine and Container Engine are not vulnerable to this kind
of attack, since they do not use KSM."

An AWS employee says at
<https://forums.aws.amazon.com/thread.jspa?threadID=238519&tstart=0&messageID=739485#739485>:
"memory de-duplication is not enabled by Amazon EC2's hypervisor"

In my opinion, KSM is fundamentally insecure for systems hosting
multiple VMs that don't trust each other. I don't think anyone writes
cryptographic software under the assumption that an attacker will be
given the ability to query whether a given page of data exists
anywhere else on the system.


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 20:26     ` Jann Horn
@ 2018-11-13 22:35       ` Pavel Tatashin
  0 siblings, 0 replies; 28+ messages in thread
From: Pavel Tatashin @ 2018-11-13 22:35 UTC (permalink / raw)
  To: Jann Horn
  Cc: Oleksandr Natalenko, linux-doc, LKML, linux-mm, Timofey Titovets,
	Matthew Wilcox, daniel

> Wait, what? Can you name specific ones? Nowadays, enabling KSM for
> untrusted VMs seems like a terrible idea to me, security-wise.

Of course it is not used to share data among different
customers/tenants; as far as I know, it is used by Oracle Cloud to
merge identical pages in Clear Containers.

https://medium.com/cri-o/intel-clear-containers-and-cri-o-70824fb51811
"One performance-enhancing feature is the use of KSM, a KVM
optimization for memory sharing and boot speed. Another is the use of
an optimized Clear Containers mini-OS."

Pasha


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 19:16           ` Pavel Tatashin
@ 2018-11-13 22:40             ` Timofey Titovets
  2018-11-13 22:53               ` Pavel Tatashin
  0 siblings, 1 reply; 28+ messages in thread
From: Timofey Titovets @ 2018-11-13 22:40 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Oleksandr Natalenko, Jann Horn, linux-doc, Linux Kernel,
	linux-mm, Matthew Wilcox

On Tue, 13 Nov 2018 at 22:17, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On 18-11-13 21:54:13, Timofey Titovets wrote:
> > On Tue, 13 Nov 2018 at 21:35, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> > >
> > > On 18-11-13 21:17:42, Timofey Titovets wrote:
> > > > On Tue, 13 Nov 2018 at 20:59, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
> > > > >
> > > > > On 18-11-13 15:23:50, Oleksandr Natalenko wrote:
> > > > > > Hi.
> > > > > >
> > > > > > > Yep. However, so far, it requires an application to explicitly opt in
> > > > > > > to this behavior, so it's not all that bad. Your patch would remove
> > > > > > > the requirement for application opt-in, which, in my opinion, makes
> > > > > > > this way worse and reduces the number of applications for which this
> > > > > > > is acceptable.
> > > > > >
> > > > > > The default is to maintain the old behaviour, so unless the explicit
> > > > > > decision is made by the administrator, no extra risk is imposed.
> > > > >
> > > > > The new interface would be more tolerable if it honored MADV_UNMERGEABLE:
> > > > >
> > > > > KSM default on: merge everything except when MADV_UNMERGEABLE is
> > > > > excplicitly set.
> > > > >
> > > > > KSM default off: merge only when MADV_MERGEABLE is set.
> > > > >
> > > > > The proposed change won't honor MADV_UNMERGEABLE, meaning that
> > > > > application programmers won't have a way to prevent sensitive data to be
> > > > > every merged. So, I think, we should keep allow an explicit opt-out
> > > > > option for applications.
> > > > >
> > > >
> > > > We just did not have VM/Madvise flag for that currently.
> > > > Same as THP.
> > > > Because all logic written with assumption, what we have exactly 2 states.
> > > > Allow/Disallow (More like not allow).
> > > >
> > > > And if we try to add, that must be something like:
> > > > MADV_FORBID_* to disallow something completely.
> > >
> > > No need to add new user flag MADV_FORBID, we should keep MADV_MERGEABLE
> > > and MADV_UNMERGEABLE, but make them work so when MADV_UNMERGEABLE is
> > > set, memory is indeed becomes always unmergeable regardless of ksm mode
> > > of operation.
> > >
> > > To do the above in ksm_madvise(), a new state should be added, for example
> > > instead of:
> > >
> > > case MADV_UNMERGEABLE:
> > >         *vm_flags &= ~VM_MERGEABLE;
> > >
> > > A new flag should be used:
> > >         *vm_flags |= VM_UNMERGEABLE;
> > >
> > > I think, without honoring MADV_UNMERGEABLE correctly, this patch won't
> > > be accepted.
> > >
> > > Pasha
> > >
> >
> > That must work, but we out of bit space in vm_flags [1].
> > i.e. first 32 bits already defined, and other only accessible only on
> > 64-bit machines.
>
> So, grow vm_flags_t to 64-bit, or enable this feature on 64-bit only.

With all due respect, for that type of change we need an mm
maintainer's opinion.

I just don't want to end up in a situation where, after I touch other
subsystems, a maintainer refuses the work for some reason.

i.e. writing patches for upstream (from my point of view)
is more the art of communication and of making the resulting code
acceptable to the community, because code that is written correctly
from an engineering point of view can still easily be refused, simply
because someone does not find it useful.

Thanks.

> >
> > 1. https://elixir.bootlin.com/linux/latest/source/include/linux/mm.h#L219


* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 22:40             ` Timofey Titovets
@ 2018-11-13 22:53               ` Pavel Tatashin
  2018-11-13 23:07                 ` Timofey Titovets
  0 siblings, 1 reply; 28+ messages in thread
From: Pavel Tatashin @ 2018-11-13 22:53 UTC (permalink / raw)
  To: Timofey Titovets
  Cc: Oleksandr Natalenko, Jann Horn, linux-doc, Linux Kernel,
	linux-mm, Matthew Wilcox

> > > That must work, but we out of bit space in vm_flags [1].
> > > i.e. first 32 bits already defined, and other only accessible only on
> > > 64-bit machines.
> >
> > So, grow vm_flags_t to 64-bit, or enable this feature on 64-bit only.
> 
> With all due respect to you, for that type of things we need
> mm maintainer opinion.

As far as I understood, you already got directions from the maintainers
to do this similarly to the way THP is implemented, and THP uses two
flags:

VM_HUGEPAGE and VM_NOHUGEPAGE, which is the same as I am thinking ksm
should do if we honor MADV_UNMERGEABLE.

When VM_NOHUGEPAGE is set khugepaged ignores those VMAs.

There may be a way to add VM_UNMERGEABLE without extending the size of
vm_flags, but that would be a good starting point when looking at how
to add a new flag.

Again, you could simply enable this feature on 64-bit only.

Pasha

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 18:42     ` Pavel Tatashin
@ 2018-11-13 22:55       ` Timofey Titovets
  0 siblings, 0 replies; 28+ messages in thread
From: Timofey Titovets @ 2018-11-13 22:55 UTC (permalink / raw)
  To: Pavel Tatashin; +Cc: Linux Kernel, Matthew Wilcox, linux-mm, linux-doc

Tue, 13 Nov 2018 at 21:43, Pavel Tatashin <pasha.tatashin@soleen.com>:
>
> > > Is it really necessary to have an extra thread in ksm just to add vma's
> > > for scanning? Can we do it right from the scanner thread? Also, may be
> > > it is better to add vma's at their creation time when KSM_MODE_ALWAYS is
> > > enabled?
> > >
> > > Thank you,
> > > Pasha
> >
> > Oh, thats a long story, and my english to bad for describe all things,
> > even that hard to find linux-mm conversation several years ago about that.
> >
> > Anyway, so:
> > In V2 - i use scanner thread to add VMA, but i think scanner do that
> > with too high rate.
> > i.e. walk on task list, and get new task every 20ms, to wait write semaphore,
> > to get VMA...
> > To high rate for task list scanner, i think it's overkill.
> >
> > About add VMA from creation time,
> > UKSM add ksm_enter() hooks to mm subsystem, i port that to KSM.
> > But some mm people say what they not like add KSM hooks to other subsystems.
> > And want ksm do that internally by some way.
> >
> > Frankly speaking i didn't have enough knowledge and skills to do that
> > another way in past time.
> > They also suggest me look to THP for that logic, but i can't find how
> > THP do that without hooks, and
> > where THP truly scan memory.
> >
> > So, after all of that i implemented this in that way.
> > In first iteration as part of ksm scan thread, and in second, by
> > separate thread.
> > Because that allow to add VMA in fully independent way.
>
> It still feels as a wrong direction. A new thread that adds random
> VMA's to scan, and no way to optimize the queue fairness for example.
> It should really be done at creation time, when VMA is created it
> should be added to KSM scanning queue, or KSM main scanner thread
> should go through VMA list in a coherent order.

How do you see queue fairness working in that case?
i.e. if you are talking about moving from an old VMA to a new VMA,
IIRC there is no kernel-wide list of all VMAs.

i.e. I do understand what exactly you don't like,
but for that we would need to add hooks, as I already mentioned above
(and I already tried to get that into the kernel [1]).

So, as I wrote below, I need a maintainer's opinion on
which implementation the responsible mm person sees as 'right'.

> The design of having a separate thread is bad. I plan in the future to
> add thread per node support to KSM, and this one odd thread won't
> break things, to which queue should this thread add VMA if there are
> multiple queues?

That will be interesting to see :)
But IMHO:
I think you will need to add some code to ksm_enter(),
because madvise() internally calls ksm_enter().

So ksm_enter() will decide which thread must process that;
it does not depend on the caller.

Thanks.

> Thank you,
> Pasha
>
- - -
1. https://lkml.org/lkml/2014/11/8/206

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 22:53               ` Pavel Tatashin
@ 2018-11-13 23:07                 ` Timofey Titovets
  0 siblings, 0 replies; 28+ messages in thread
From: Timofey Titovets @ 2018-11-13 23:07 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Oleksandr Natalenko, Jann Horn, linux-doc, Linux Kernel,
	linux-mm, Matthew Wilcox

Wed, 14 Nov 2018 at 01:53, Pavel Tatashin <pasha.tatashin@soleen.com>:
>
> > > > That must work, but we out of bit space in vm_flags [1].
> > > > i.e. first 32 bits already defined, and other only accessible only on
> > > > 64-bit machines.
> > >
> > > So, grow vm_flags_t to 64-bit, or enable this feature on 64-bit only.
> >
> > With all due respect to you, for that type of things we need
> > mm maintainer opinion.
>
> As far as I understood, you already got directions from the maintainers
> to do similar to the way THP is implemented, and THP uses two flags:
>
> VM_HUGEPAGE VM_NOHUGEPAGE, the same as I am thinking ksm should do if we
> honor MADV_UNMERGEABLE.
>
> When VM_NOHUGEPAGE is set khugepaged ignores those VMAs.
>
> There may be a way to add VM_UNMERGEABLE without extending the size of
> vm_flags, but that would be a good start point in looking how to add a
> new flag.
>
> Again, you could simply enable this feature on 64-bit only.
>
> Pasha
>

Deal!
I will try enabling this on 64-bit machines only.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V3] KSM: allow dedup all tasks memory
@ 2018-11-13 18:20 Timofey Titovets
  0 siblings, 0 replies; 28+ messages in thread
From: Timofey Titovets @ 2018-11-13 18:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: Timofey Titovets, Matthew Wilcox, linux-mm, linux-doc

By default, KSM works only on memory that has been added with
madvise().

Currently the only ways to get it working on other applications are:
  * Use LD_PRELOAD and libraries
  * Patch the kernel

Let's use the kernel task list and add logic to import VMAs from tasks.

This behaviour is controlled by new attributes:
  * mode:
    I tried to mimic the hugepages attribute, so mode has two states:
      * madvise      - old default behaviour
      * always [new] - allow ksm to get each task's VMAs and
                       try working on them.
  * seeker_sleep_millisecs:
    Adds pauses between importing tasks' VMAs

For rate-limiting purposes and to bound tasklist locking time, the
ksm seeker thread only imports VMAs from one task per loop.

Some numbers from different non-madvised workloads.
Formulas:
  Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
  Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
  Memory used = free -h

  * Name: My working laptop
    Description: Many different chrome/electron apps + KDE
    Ratio: 5%
    Saved: ~100  MiB
    Used:  ~2000 MiB

  * Name: K8s test VM
    Description: Some small random running docker images
    Ratio: 40%
    Saved: ~160 MiB
    Used:  ~920 MiB

  * Name: Ceph test VM
    Description: Ceph Mon/OSD, some containers
    Ratio: 20%
    Saved: ~60 MiB
    Used:  ~600 MiB

  * Name: BareMetal K8s backend server
    Description: Different server apps in containers: C, Java, Go, etc.
    Ratio: 72%
    Saved: ~5800 MiB
    Used:  ~35.7 GiB

  * Name: BareMetal K8s processing server
    Description: Many instances of one CPU-intensive application
    Ratio: 55%
    Saved: ~2600 MiB
    Used:  ~28.0 GiB

  * Name: BareMetal Ceph node
    Description: Only OSD storage daemons running
    Ratio: 2%
    Saved: ~190 MiB
    Used:  ~11.7 GiB
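For reference, a sketch of how the numbers above were derived from the
ksm sysfs counters (the counter values below are made-up example
inputs; on a live system read them from
/sys/kernel/mm/ksm/pages_shared, pages_sharing and pages_unshared, and
note the formula assumes 4 KiB pages):

```shell
# Hypothetical example counters; substitute the real sysfs values.
pages_shared=1000
pages_sharing=42000
pages_unshared=102400

# Percentage ratio = (pages_sharing - pages_shared) / pages_unshared
ratio=$(( (pages_sharing - pages_shared) * 100 / pages_unshared ))
# Memory saved = (pages_sharing - pages_shared) * 4 / 1024 MiB
saved_mib=$(( (pages_sharing - pages_shared) * 4 / 1024 ))

echo "Ratio: ${ratio}%"         # -> Ratio: 40%
echo "Saved: ${saved_mib} MiB"  # -> Saved: 160 MiB
```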

Changes:
  v1 -> v2:
    * Rebase on v4.19.1 (must also apply on 4.20-rc2+)
  v2 -> v3:
    * Reformat patch description
    * Rename mode normal to madvise
    * Add some memory numbers
    * Fix checkpatch.pl warnings
    * Separate ksm vma seeker to another kthread
    * Fix: "BUG: scheduling while atomic: ksmd"
      by moving the seeker to another thread

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
CC: Matthew Wilcox <willy@infradead.org>
CC: linux-mm@kvack.org
CC: linux-doc@vger.kernel.org
---
 Documentation/admin-guide/mm/ksm.rst |  15 ++
 mm/ksm.c                             | 215 +++++++++++++++++++++++----
 2 files changed, 198 insertions(+), 32 deletions(-)

diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index 9303786632d1..7cffd47f9b38 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -116,6 +116,21 @@ run
         Default: 0 (must be changed to 1 to activate KSM, except if
         CONFIG_SYSFS is disabled)
 
+mode
+        * set always to allow ksm deduplicate memory of every process
+        * set madvise to use only madvised memory
+
+        Default: madvise (deduplicate only madvised memory as in
+        earlier releases)
+
+seeker_sleep_millisecs
+        how many milliseconds the ksmd task seeker should sleep before
+        trying another task.
+        e.g. ``echo 1000 > /sys/kernel/mm/ksm/seeker_sleep_millisecs``
+
+        Default: 1000 (chosen for rate limit purposes)
+
+
 use_zero_pages
         specifies whether empty pages (i.e. allocated pages that only
         contain zeroes) should be treated specially.  When set to 1,
diff --git a/mm/ksm.c b/mm/ksm.c
index 5b0894b45ee5..1a03b28b6288 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -273,6 +273,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+/* Milliseconds ksmd seeker should sleep between runs */
+static unsigned int ksm_thread_seeker_sleep_millisecs = 1000;
+
 /* Checksum of an empty (zeroed) page */
 static unsigned int zero_checksum __read_mostly;
 
@@ -295,7 +298,12 @@ static int ksm_nr_node_ids = 1;
 static unsigned long ksm_run = KSM_RUN_STOP;
 static void wait_while_offlining(void);
 
+#define KSM_MODE_MADVISE 0
+#define KSM_MODE_ALWAYS	1
+static unsigned long ksm_mode = KSM_MODE_MADVISE;
+
 static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
+static DECLARE_WAIT_QUEUE_HEAD(ksm_seeker_thread_wait);
 static DEFINE_MUTEX(ksm_thread_mutex);
 static DEFINE_SPINLOCK(ksm_mmlist_lock);
 
@@ -303,6 +311,11 @@ static DEFINE_SPINLOCK(ksm_mmlist_lock);
 		sizeof(struct __struct), __alignof__(struct __struct),\
 		(__flags), NULL)
 
+static inline int ksm_mode_always(void)
+{
+	return (ksm_mode == KSM_MODE_ALWAYS);
+}
+
 static int __init ksm_slab_init(void)
 {
 	rmap_item_cache = KSM_KMEM_CACHE(rmap_item, 0);
@@ -2389,6 +2402,106 @@ static int ksmd_should_run(void)
 	return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
 }
 
+
+static int ksm_enter(struct mm_struct *mm, unsigned long *vm_flags)
+{
+	int err;
+
+	if (*vm_flags & (VM_MERGEABLE | VM_SHARED  | VM_MAYSHARE   |
+			 VM_PFNMAP    | VM_IO      | VM_DONTEXPAND |
+			 VM_HUGETLB | VM_MIXEDMAP))
+		return 0;
+
+#ifdef VM_SAO
+	if (*vm_flags & VM_SAO)
+		return 0;
+#endif
+#ifdef VM_SPARC_ADI
+	if (*vm_flags & VM_SPARC_ADI)
+		return 0;
+#endif
+	if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
+		err = __ksm_enter(mm);
+		if (err)
+			return err;
+	}
+
+	*vm_flags |= VM_MERGEABLE;
+
+	return 0;
+}
+
+/*
+ * Register all vmas for all processes in the system with KSM.
+ * Note that every call to ksm_, for a given vma, after the first
+ * does nothing but set flags.
+ */
+void ksm_import_task_vma(struct task_struct *task)
+{
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	int error;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		return;
+	down_write(&mm->mmap_sem);
+	vma = mm->mmap;
+	while (vma) {
+		error = ksm_enter(vma->vm_mm, &vma->vm_flags);
+		vma = vma->vm_next;
+	}
+	up_write(&mm->mmap_sem);
+	mmput(mm);
+}
+
+static int ksm_seeker_thread(void *nothing)
+{
+	pid_t last_pid = 1;
+	pid_t curr_pid;
+	struct task_struct *task;
+
+	set_freezable();
+	set_user_nice(current, 5);
+
+	while (!kthread_should_stop()) {
+		wait_while_offlining();
+
+		try_to_freeze();
+
+		if (!ksm_mode_always()) {
+			wait_event_freezable(ksm_seeker_thread_wait,
+				ksm_mode_always() || kthread_should_stop());
+			continue;
+		}
+
+		/*
+		 * import one task's vma per run
+		 */
+		read_lock(&tasklist_lock);
+
+		/* Try always get next task */
+		for_each_process(task) {
+			curr_pid = task_pid_nr(task);
+			if (curr_pid == last_pid) {
+				task = next_task(task);
+				break;
+			}
+
+			if (curr_pid > last_pid)
+				break;
+		}
+
+		last_pid = task_pid_nr(task);
+		ksm_import_task_vma(task);
+		read_unlock(&tasklist_lock);
+
+		schedule_timeout_interruptible(
+			msecs_to_jiffies(ksm_thread_seeker_sleep_millisecs));
+	}
+	return 0;
+}
+
 static int ksm_scan_thread(void *nothing)
 {
 	set_freezable();
@@ -2422,33 +2535,9 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 
 	switch (advice) {
 	case MADV_MERGEABLE:
-		/*
-		 * Be somewhat over-protective for now!
-		 */
-		if (*vm_flags & (VM_MERGEABLE | VM_SHARED  | VM_MAYSHARE   |
-				 VM_PFNMAP    | VM_IO      | VM_DONTEXPAND |
-				 VM_HUGETLB | VM_MIXEDMAP))
-			return 0;		/* just ignore the advice */
-
-		if (vma_is_dax(vma))
-			return 0;
-
-#ifdef VM_SAO
-		if (*vm_flags & VM_SAO)
-			return 0;
-#endif
-#ifdef VM_SPARC_ADI
-		if (*vm_flags & VM_SPARC_ADI)
-			return 0;
-#endif
-
-		if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
-			err = __ksm_enter(mm);
-			if (err)
-				return err;
-		}
-
-		*vm_flags |= VM_MERGEABLE;
+		err = ksm_enter(mm, vm_flags);
+		if (err)
+			return err;
 		break;
 
 	case MADV_UNMERGEABLE:
@@ -2829,6 +2918,29 @@ static ssize_t sleep_millisecs_store(struct kobject *kobj,
 }
 KSM_ATTR(sleep_millisecs);
 
+static ssize_t seeker_sleep_millisecs_show(struct kobject *kobj,
+				    struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", ksm_thread_seeker_sleep_millisecs);
+}
+
+static ssize_t seeker_sleep_millisecs_store(struct kobject *kobj,
+				     struct kobj_attribute *attr,
+				     const char *buf, size_t count)
+{
+	unsigned long msecs;
+	int err;
+
+	err = kstrtoul(buf, 10, &msecs);
+	if (err || msecs > UINT_MAX)
+		return -EINVAL;
+
+	ksm_thread_seeker_sleep_millisecs = msecs;
+
+	return count;
+}
+KSM_ATTR(seeker_sleep_millisecs);
+
 static ssize_t pages_to_scan_show(struct kobject *kobj,
 				  struct kobj_attribute *attr, char *buf)
 {
@@ -2852,6 +2964,34 @@ static ssize_t pages_to_scan_store(struct kobject *kobj,
 }
 KSM_ATTR(pages_to_scan);
 
+static ssize_t mode_show(struct kobject *kobj, struct kobj_attribute *attr,
+			char *buf)
+{
+	switch (ksm_mode) {
+	case KSM_MODE_ALWAYS:
+		return sprintf(buf, "[always] madvise\n");
+	case KSM_MODE_MADVISE:
+		return sprintf(buf, "always [madvise]\n");
+	}
+
+	return sprintf(buf, "always [madvise]\n");
+}
+
+static ssize_t mode_store(struct kobject *kobj, struct kobj_attribute *attr,
+			 const char *buf, size_t count)
+{
+	if (!memcmp("always", buf, min(sizeof("always")-1, count))) {
+		ksm_mode = KSM_MODE_ALWAYS;
+		wake_up_interruptible(&ksm_seeker_thread_wait);
+	} else if (!memcmp("madvise", buf, min(sizeof("madvise")-1, count))) {
+		ksm_mode = KSM_MODE_MADVISE;
+	} else
+		return -EINVAL;
+
+	return count;
+}
+KSM_ATTR(mode);
+
 static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr,
 			char *buf)
 {
@@ -3108,7 +3248,9 @@ KSM_ATTR_RO(full_scans);
 
 static struct attribute *ksm_attrs[] = {
 	&sleep_millisecs_attr.attr,
+	&seeker_sleep_millisecs_attr.attr,
 	&pages_to_scan_attr.attr,
+	&mode_attr.attr,
 	&run_attr.attr,
 	&pages_shared_attr.attr,
 	&pages_sharing_attr.attr,
@@ -3134,7 +3276,7 @@ static const struct attribute_group ksm_attr_group = {
 
 static int __init ksm_init(void)
 {
-	struct task_struct *ksm_thread;
+	struct task_struct *ksm_thread[2];
 	int err;
 
 	/* The correct value depends on page size and endianness */
@@ -3146,10 +3288,18 @@ static int __init ksm_init(void)
 	if (err)
 		goto out;
 
-	ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
-	if (IS_ERR(ksm_thread)) {
+	ksm_thread[0] = kthread_run(ksm_scan_thread, NULL, "ksmd");
+	if (IS_ERR(ksm_thread[0])) {
 		pr_err("ksm: creating kthread failed\n");
-		err = PTR_ERR(ksm_thread);
+		err = PTR_ERR(ksm_thread[0]);
+		goto out_free;
+	}
+
+	ksm_thread[1] = kthread_run(ksm_seeker_thread, NULL, "ksmd_seeker");
+	if (IS_ERR(ksm_thread[1])) {
+		pr_err("ksm: creating seeker kthread failed\n");
+		err = PTR_ERR(ksm_thread[1]);
+		kthread_stop(ksm_thread[0]);
 		goto out_free;
 	}
 
@@ -3157,7 +3307,8 @@ static int __init ksm_init(void)
 	err = sysfs_create_group(mm_kobj, &ksm_attr_group);
 	if (err) {
 		pr_err("ksm: register sysfs failed\n");
-		kthread_stop(ksm_thread);
+		kthread_stop(ksm_thread[0]);
+		kthread_stop(ksm_thread[1]);
 		goto out_free;
 	}
 #else
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 17:27     ` Oleksandr Natalenko
@ 2018-11-13 17:44       ` Timofey Titovets
  0 siblings, 0 replies; 28+ messages in thread
From: Timofey Titovets @ 2018-11-13 17:44 UTC (permalink / raw)
  To: oleksandr; +Cc: linux-doc, Linux Kernel, linux-mm, Matthew Wilcox

Tue, 13 Nov 2018 at 20:27, Oleksandr Natalenko <oleksandr@natalenko.name>:
>
> On 13.11.2018 18:10, Timofey Titovets wrote:
> > You mean try do something, like that right?
> >
> > read_lock(&tasklist_lock);
> >   <get reference to task>
> >   task_lock(task);
> > read_unlock(&tasklist_lock);
> >     last_pid = task_pid_nr(task);
> >     ksm_import_task_vma(task);
> >   task_unlock(task);
>
> No, task_lock() uses spin_lock() under the bonnet, so this will be the
> same.
>
> Since the sole reason you have to lock/acquire/get a reference to
> task_struct here is to prevent it from disappearing, I was thinking
> about using get_task_struct(), which just increases atomic
> task_struct.usage value (IOW, takes a reference). I *hope* this will be
> enough to prevent task_struct from disappearing in the meantime.
>
> Someone, correct me if I'm wrong.

That's brilliant, I just missed that API.
It should do exactly what I want.

Thanks!
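Something like this, then (an untested sketch of the seeker loop body;
the key point is that only the atomic reference grab happens under
tasklist_lock, and the sleeping work runs after the lock is dropped):

```c
read_lock(&tasklist_lock);
/* ... walk the task list and pick the next task, as before ... */
get_task_struct(task);      /* pin the task_struct: atomic inc, no sleep */
read_unlock(&tasklist_lock);

last_pid = task_pid_nr(task);
ksm_import_task_vma(task);  /* may sleep: takes mmap_sem for write */
put_task_struct(task);
```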

> --
>    Oleksandr Natalenko (post-factum)
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 17:10   ` Timofey Titovets
@ 2018-11-13 17:27     ` Oleksandr Natalenko
  2018-11-13 17:44       ` Timofey Titovets
  0 siblings, 1 reply; 28+ messages in thread
From: Oleksandr Natalenko @ 2018-11-13 17:27 UTC (permalink / raw)
  To: Timofey Titovets; +Cc: linux-doc, Linux Kernel, linux-mm, Matthew Wilcox

On 13.11.2018 18:10, Timofey Titovets wrote:
> You mean try do something, like that right?
> 
> read_lock(&tasklist_lock);
>   <get reference to task>
>   task_lock(task);
> read_unlock(&tasklist_lock);
>     last_pid = task_pid_nr(task);
>     ksm_import_task_vma(task);
>   task_unlock(task);

No, task_lock() uses spin_lock() under the bonnet, so this will be the 
same.

Since the sole reason you have to lock/acquire/get a reference to 
task_struct here is to prevent it from disappearing, I was thinking 
about using get_task_struct(), which just increases atomic 
task_struct.usage value (IOW, takes a reference). I *hope* this will be 
enough to prevent task_struct from disappearing in the meantime.

Someone, correct me if I'm wrong.

-- 
   Oleksandr Natalenko (post-factum)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 16:33 ` Oleksandr Natalenko
@ 2018-11-13 17:10   ` Timofey Titovets
  2018-11-13 17:27     ` Oleksandr Natalenko
  0 siblings, 1 reply; 28+ messages in thread
From: Timofey Titovets @ 2018-11-13 17:10 UTC (permalink / raw)
  To: oleksandr; +Cc: linux-doc, Linux Kernel, linux-mm, Matthew Wilcox

Tue, 13 Nov 2018 at 19:33, Oleksandr Natalenko <oleksandr@natalenko.name>:
>
> So,
>
> > …snip…
> > +static int ksm_seeker_thread(void *nothing)
> > +{
> > +     pid_t last_pid = 1;
> > +     pid_t curr_pid;
> > +     struct task_struct *task;
> > +
> > +     set_freezable();
> > +     set_user_nice(current, 5);
> > +
> > +     while (!kthread_should_stop()) {
> > +             wait_while_offlining();
> > +
> > +             try_to_freeze();
> > +
> > +             if (!ksm_mode_always()) {
> > +                     wait_event_freezable(ksm_seeker_thread_wait,
> > +                             ksm_mode_always() || kthread_should_stop());
> > +                     continue;
> > +             }
> > +
> > +             /*
> > +              * import one task's vma per run
> > +              */
> > +             read_lock(&tasklist_lock);
> > +
> > +             /* Try always get next task */
> > +             for_each_process(task) {
> > +                     curr_pid = task_pid_nr(task);
> > +                     if (curr_pid == last_pid) {
> > +                             task = next_task(task);
> > +                             break;
> > +                     }
> > +
> > +                     if (curr_pid > last_pid)
> > +                             break;
> > +             }
> > +
> > +             last_pid = task_pid_nr(task);
> > +             ksm_import_task_vma(task);
>
> This seems to be a bad idea. ksm_import_task_vma() may sleep with
> tasklist_lock being held. Thus, IIUC, you'll get this:

Yep, that's one of the reasons why I moved the code out of the ksmd
thread; I hadn't fully understood how to fix that properly.
But I misunderstood the symptoms of the problem.

> [ 1754.410322] BUG: scheduling while atomic: ksmd_seeker/50/0x00000002
> …
> [ 1754.410444] Call Trace:
> [ 1754.410455]  dump_stack+0x5c/0x80
> [ 1754.410460]  __schedule_bug.cold.19+0x38/0x51
> [ 1754.410464]  __schedule+0x11dc/0x2080
> [ 1754.410483]  schedule+0x32/0xb0
> [ 1754.410487]  rwsem_down_write_failed+0x15d/0x240
> [ 1754.410496]  call_rwsem_down_write_failed+0x13/0x20
> [ 1754.410499]  down_write+0x20/0x30
> [ 1754.410502]  ksm_import_task_vma+0x22/0x70
> [ 1754.410505]  ksm_seeker_thread+0x134/0x1c0
> [ 1754.410512]  kthread+0x113/0x130
> [ 1754.410518]  ret_from_fork+0x35/0x40
>
> I think you may want to get a reference to task_struct before releasing
> tasklist_lock, and then put it after ksm_import_task_vma() does its job.

Maybe I misunderstood something, but currently I do exactly that.

> > +             read_unlock(&tasklist_lock);
> > +
> > +             schedule_timeout_interruptible(
> > +                     msecs_to_jiffies(ksm_thread_seeker_sleep_millisecs));
> > +     }
> > +     return 0;
> > +}
> > …snip…
>
> --
>    Oleksandr Natalenko (post-factum)


It's good that you caught that anyway (because I currently can't reproduce it).

You mean try to do something like this, right?

read_lock(&tasklist_lock);
  <get reference to task>
  task_lock(task);
read_unlock(&tasklist_lock);
    last_pid = task_pid_nr(task);
    ksm_import_task_vma(task);
  task_unlock(task);

Thanks!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] KSM: allow dedup all tasks memory
       [not found] <<20181112231344.7161-1-timofey.titovets@synesis.ru>
  2018-11-13 11:06 ` Oleksandr Natalenko
@ 2018-11-13 16:33 ` Oleksandr Natalenko
  2018-11-13 17:10   ` Timofey Titovets
  1 sibling, 1 reply; 28+ messages in thread
From: Oleksandr Natalenko @ 2018-11-13 16:33 UTC (permalink / raw)
  To: timofey.titovets; +Cc: linux-doc, linux-kernel, linux-mm, nefelim4ag, willy

So,

> …snip…
> +static int ksm_seeker_thread(void *nothing)
> +{
> +	pid_t last_pid = 1;
> +	pid_t curr_pid;
> +	struct task_struct *task;
> +
> +	set_freezable();
> +	set_user_nice(current, 5);
> +
> +	while (!kthread_should_stop()) {
> +		wait_while_offlining();
> +
> +		try_to_freeze();
> +
> +		if (!ksm_mode_always()) {
> +			wait_event_freezable(ksm_seeker_thread_wait,
> +				ksm_mode_always() || kthread_should_stop());
> +			continue;
> +		}
> +
> +		/*
> +		 * import one task's vma per run
> +		 */
> +		read_lock(&tasklist_lock);
> +
> +		/* Try always get next task */
> +		for_each_process(task) {
> +			curr_pid = task_pid_nr(task);
> +			if (curr_pid == last_pid) {
> +				task = next_task(task);
> +				break;
> +			}
> +
> +			if (curr_pid > last_pid)
> +				break;
> +		}
> +
> +		last_pid = task_pid_nr(task);
> +		ksm_import_task_vma(task);

This seems to be a bad idea. ksm_import_task_vma() may sleep with 
tasklist_lock being held. Thus, IIUC, you'll get this:

[ 1754.410322] BUG: scheduling while atomic: ksmd_seeker/50/0x00000002
…
[ 1754.410444] Call Trace:
[ 1754.410455]  dump_stack+0x5c/0x80
[ 1754.410460]  __schedule_bug.cold.19+0x38/0x51
[ 1754.410464]  __schedule+0x11dc/0x2080
[ 1754.410483]  schedule+0x32/0xb0
[ 1754.410487]  rwsem_down_write_failed+0x15d/0x240
[ 1754.410496]  call_rwsem_down_write_failed+0x13/0x20
[ 1754.410499]  down_write+0x20/0x30
[ 1754.410502]  ksm_import_task_vma+0x22/0x70
[ 1754.410505]  ksm_seeker_thread+0x134/0x1c0
[ 1754.410512]  kthread+0x113/0x130
[ 1754.410518]  ret_from_fork+0x35/0x40

I think you may want to get a reference to task_struct before releasing 
tasklist_lock, and then put it after ksm_import_task_vma() does its job.

> +		read_unlock(&tasklist_lock);
> +
> +		schedule_timeout_interruptible(
> +			msecs_to_jiffies(ksm_thread_seeker_sleep_millisecs));
> +	}
> +	return 0;
> +}
> …snip…

-- 
   Oleksandr Natalenko (post-factum)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] KSM: allow dedup all tasks memory
  2018-11-13 11:06 ` Oleksandr Natalenko
@ 2018-11-13 11:56   ` Timofey Titovets
  0 siblings, 0 replies; 28+ messages in thread
From: Timofey Titovets @ 2018-11-13 11:56 UTC (permalink / raw)
  To: oleksandr; +Cc: linux-doc, Linux Kernel, linux-mm, Matthew Wilcox

Tue, 13 Nov 2018 at 14:06, Oleksandr Natalenko <oleksandr@natalenko.name>:
>
> Hi.
>
> > ksm by default working only on memory that added by
> > madvise().
> >
> > And only way get that work on other applications:
> >   * Use LD_PRELOAD and libraries
> >   * Patch kernel
> >
> > Lets use kernel task list and add logic to import VMAs from tasks.
> >
> > That behaviour controlled by new attributes:
> >   * mode:
> >     I try mimic hugepages attribute, so mode have two states:
> >       * madvise      - old default behaviour
> >       * always [new] - allow ksm to get tasks vma and
> >                        try working on that.
> >   * seeker_sleep_millisecs:
> >     Add pauses between imports tasks VMA
> >
> > For rate limiting proporses and tasklist locking time,
> > ksm seeker thread only import VMAs from one task per loop.
> >
> > Some numbers from different not madvised workloads.
> > Formulas:
> >   Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
> >   Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
> >   Memory used = free -h
> >
> >   * Name: My working laptop
> >     Description: Many different chrome/electron apps + KDE
> >     Ratio: 5%
> >     Saved: ~100  MiB
> >     Used:  ~2000 MiB
> >
> >   * Name: K8s test VM
> >     Description: Some small random running docker images
> >     Ratio: 40%
> >     Saved: ~160 MiB
> >     Used:  ~920 MiB
> >
> >   * Name: Ceph test VM
> >     Description: Ceph Mon/OSD, some containers
> >     Ratio: 20%
> >     Saved: ~60 MiB
> >     Used:  ~600 MiB
> >
> >   * Name: BareMetal K8s backend server
> >     Description: Different server apps in containers C, Java, GO & etc
> >     Ratio: 72%
> >     Saved: ~5800 MiB
> >     Used:  ~35.7 GiB
> >
> >   * Name: BareMetal K8s processing server
> >     Description: Many instance of one CPU intensive application
> >     Ratio: 55%
> >     Saved: ~2600 MiB
> >     Used:  ~28.0 GiB
> >
> >   * Name: BareMetal Ceph node
> >     Description: Only OSD storage daemons running
> >     Ratio: 2%
> >     Saved: ~190 MiB
> >     Used:  ~11.7 GiB
>
> Out of curiosity, have you compared these results with UKSM [1]?
>
> Thanks.
>
> --
>    Oleksandr Natalenko (post-factum)
>
> [1] https://github.com/dolohow/uksm

Long story short:
I tried to get the UKSM code into the kernel, and I mostly know how
UKSM works.
Yep, UKSM implements the logic in a different way, but UKSM will
_always_ have the same or worse numbers (saved pages) compared to KSM.

Why?

Both scan VMA pages and filter VMAs by some flags (the same set of
flags), so both will see the same subset of pages that can be deduped.
But UKSM also tries not to dedup a VMA's pages if some of those pages
change too frequently - thrashing.
Because of that, UKSM will skip some pages, while KSM will try to dedup
all pages that have not changed between scans.

I am skipping over UKSM's internal logic, which of course can work
better in terms of resource usage
(different hash implementation, different page tree),
but that doesn't matter in this case (if we talk about saved memory).

The only thing UKSM has that KSM can't do:
UKSM can add VMA memory by itself, via hooks in mm.

KSM currently needs help from madvise() for that.
That's the reason why I wrote this patchset for KSM.
(I also wrote some info to Pavel Tatashin in the mail above.)

Thanks!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3] KSM: allow dedup all tasks memory
       [not found] <<20181112231344.7161-1-timofey.titovets@synesis.ru>
@ 2018-11-13 11:06 ` Oleksandr Natalenko
  2018-11-13 11:56   ` Timofey Titovets
  2018-11-13 16:33 ` Oleksandr Natalenko
  1 sibling, 1 reply; 28+ messages in thread
From: Oleksandr Natalenko @ 2018-11-13 11:06 UTC (permalink / raw)
  To: timofey.titovets; +Cc: linux-doc, linux-kernel, linux-mm, nefelim4ag, willy

Hi.

> ksm by default working only on memory that added by
> madvise().
> 
> And only way get that work on other applications:
>   * Use LD_PRELOAD and libraries
>   * Patch kernel
> 
> Lets use kernel task list and add logic to import VMAs from tasks.
> 
> That behaviour controlled by new attributes:
>   * mode:
>     I try mimic hugepages attribute, so mode have two states:
>       * madvise      - old default behaviour
>       * always [new] - allow ksm to get tasks vma and
>                        try working on that.
>   * seeker_sleep_millisecs:
>     Add pauses between imports tasks VMA
> 
> For rate limiting proporses and tasklist locking time,
> ksm seeker thread only import VMAs from one task per loop.
> 
> Some numbers from different not madvised workloads.
> Formulas:
>   Percentage ratio = (pages_sharing - pages_shared)/pages_unshared
>   Memory saved = (pages_sharing - pages_shared)*4/1024 MiB
>   Memory used = free -h
> 
>   * Name: My working laptop
>     Description: Many different chrome/electron apps + KDE
>     Ratio: 5%
>     Saved: ~100  MiB
>     Used:  ~2000 MiB
> 
>   * Name: K8s test VM
>     Description: Some small random running docker images
>     Ratio: 40%
>     Saved: ~160 MiB
>     Used:  ~920 MiB
> 
>   * Name: Ceph test VM
>     Description: Ceph Mon/OSD, some containers
>     Ratio: 20%
>     Saved: ~60 MiB
>     Used:  ~600 MiB
> 
>   * Name: BareMetal K8s backend server
>     Description: Different server apps in containers C, Java, GO & etc
>     Ratio: 72%
>     Saved: ~5800 MiB
>     Used:  ~35.7 GiB
> 
>   * Name: BareMetal K8s processing server
>     Description: Many instance of one CPU intensive application
>     Ratio: 55%
>     Saved: ~2600 MiB
>     Used:  ~28.0 GiB
> 
>   * Name: BareMetal Ceph node
>     Description: Only OSD storage daemons running
>     Ratio: 2%
>     Saved: ~190 MiB
>     Used:  ~11.7 GiB

Out of curiosity, have you compared these results with UKSM [1]?

Thanks.

-- 
   Oleksandr Natalenko (post-factum)

[1] https://github.com/dolohow/uksm

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2018-11-13 23:07 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-12 23:13 [PATCH V3] KSM: allow dedup all tasks memory Timofey Titovets
2018-11-13  1:49 ` Matthew Wilcox
2018-11-13 11:25   ` Timofey Titovets
2018-11-13  2:25 ` Pavel Tatashin
2018-11-13 11:40   ` Timofey Titovets
2018-11-13 18:42     ` Pavel Tatashin
2018-11-13 22:55       ` Timofey Titovets
2018-11-13 11:57 ` Jann Horn
2018-11-13 12:58   ` Timofey Titovets
2018-11-13 13:25     ` Jann Horn
     [not found] <<CAG48ez0ZprqUYGZFxcrY6U3Dnwt77q1NJXzzpsn1XNkRuXVppw@mail.gmail.com>
2018-11-13 14:23 ` Oleksandr Natalenko
2018-11-13 17:59   ` Pavel Tatashin
2018-11-13 18:17     ` Timofey Titovets
2018-11-13 18:35       ` Pavel Tatashin
2018-11-13 18:54         ` Timofey Titovets
2018-11-13 19:16           ` Pavel Tatashin
2018-11-13 22:40             ` Timofey Titovets
2018-11-13 22:53               ` Pavel Tatashin
2018-11-13 23:07                 ` Timofey Titovets
2018-11-13 20:26     ` Jann Horn
2018-11-13 22:35       ` Pavel Tatashin
     [not found] <<20181112231344.7161-1-timofey.titovets@synesis.ru>
2018-11-13 11:06 ` Oleksandr Natalenko
2018-11-13 11:56   ` Timofey Titovets
2018-11-13 16:33 ` Oleksandr Natalenko
2018-11-13 17:10   ` Timofey Titovets
2018-11-13 17:27     ` Oleksandr Natalenko
2018-11-13 17:44       ` Timofey Titovets
2018-11-13 18:20 Timofey Titovets
