All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 1/2] timer: provide an api for deferrable timeout
@ 2014-08-20 12:10 ` Chintan Pandya
  0 siblings, 0 replies; 29+ messages in thread
From: Chintan Pandya @ 2014-08-20 12:10 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-arm-msm, linux-kernel, Chintan Pandya,
	Thomas Gleixner, John Stultz, Peter Zijlstra, Ingo Molnar,
	Hugh Dickins

schedule_timeout wakes up the CPU from IDLE state. For some use cases it
is not desirable, hence introduce a convenient API
(schedule_timeout_deferrable_interruptible) on similar pattern which uses
a deferrable timer.

Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
---
Changes:

V3-->V4:
	- No change

V2-->V3:
	- Big comment moved from static function to exported function
	- Using __setup_timer_on_stack for better readability

V2:
	- this patch has been newly introduced in patch v2


 include/linux/sched.h |  2 ++
 kernel/time/timer.c   | 73 +++++++++++++++++++++++++++++++--------------------
 2 files changed, 47 insertions(+), 28 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5c2c885..13fe361 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -376,6 +376,8 @@ extern int in_sched_functions(unsigned long addr);
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long schedule_timeout(signed long timeout);
 extern signed long schedule_timeout_interruptible(signed long timeout);
+extern signed long
+schedule_timeout_deferrable_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
 asmlinkage void schedule(void);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index aca5dfe..f4c4082 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1431,33 +1431,8 @@ static void process_timeout(unsigned long __data)
 	wake_up_process((struct task_struct *)__data);
 }
 
-/**
- * schedule_timeout - sleep until timeout
- * @timeout: timeout value in jiffies
- *
- * Make the current task sleep until @timeout jiffies have
- * elapsed. The routine will return immediately unless
- * the current task state has been set (see set_current_state()).
- *
- * You can set the task state as follows -
- *
- * %TASK_UNINTERRUPTIBLE - at least @timeout jiffies are guaranteed to
- * pass before the routine returns. The routine will return 0
- *
- * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
- * delivered to the current task. In this case the remaining time
- * in jiffies will be returned, or 0 if the timer expired in time
- *
- * The current task state is guaranteed to be TASK_RUNNING when this
- * routine returns.
- *
- * Specifying a @timeout value of %MAX_SCHEDULE_TIMEOUT will schedule
- * the CPU away without a bound on the timeout. In this case the return
- * value will be %MAX_SCHEDULE_TIMEOUT.
- *
- * In all cases the return value is guaranteed to be non-negative.
- */
-signed long __sched schedule_timeout(signed long timeout)
+static signed long
+__sched __schedule_timeout(signed long timeout, unsigned long flag)
 {
 	struct timer_list timer;
 	unsigned long expire;
@@ -1493,7 +1468,9 @@ signed long __sched schedule_timeout(signed long timeout)
 
 	expire = timeout + jiffies;
 
-	setup_timer_on_stack(&timer, process_timeout, (unsigned long)current);
+	__setup_timer_on_stack(&timer, process_timeout, (unsigned long)current,
+				flag);
+
 	__mod_timer(&timer, expire, false, TIMER_NOT_PINNED);
 	schedule();
 	del_singleshot_timer_sync(&timer);
@@ -1506,12 +1483,52 @@ signed long __sched schedule_timeout(signed long timeout)
  out:
 	return timeout < 0 ? 0 : timeout;
 }
+
+/**
+ * schedule_timeout - sleep until timeout
+ * @timeout: timeout value in jiffies
+ *
+ * Make the current task sleep until @timeout jiffies have
+ * elapsed. The routine will return immediately unless
+ * the current task state has been set (see set_current_state()).
+ *
+ * You can set the task state as follows -
+ *
+ * %TASK_UNINTERRUPTIBLE - at least @timeout jiffies are guaranteed to
+ * pass before the routine returns. The routine will return 0
+ *
+ * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
+ * delivered to the current task. In this case the remaining time
+ * in jiffies will be returned, or 0 if the timer expired in time
+ *
+ * The current task state is guaranteed to be TASK_RUNNING when this
+ * routine returns.
+ *
+ * Specifying a @timeout value of %MAX_SCHEDULE_TIMEOUT will schedule
+ * the CPU away without a bound on the timeout. In this case the return
+ * value will be %MAX_SCHEDULE_TIMEOUT.
+ *
+ * In all cases the return value is guaranteed to be non-negative.
+ */
+signed long __sched schedule_timeout(signed long timeout)
+{
+	return __schedule_timeout(timeout, 0);
+}
 EXPORT_SYMBOL(schedule_timeout);
 
 /*
  * We can use __set_current_state() here because schedule_timeout() calls
  * schedule() unconditionally.
  */
+
+signed long
+__sched schedule_timeout_deferrable_interruptible(signed long timeout)
+{
+	__set_current_state(TASK_INTERRUPTIBLE);
+	return __schedule_timeout(timeout, TIMER_DEFERRABLE);
+}
+EXPORT_SYMBOL(schedule_timeout_deferrable_interruptible);
+
 signed long __sched schedule_timeout_interruptible(signed long timeout)
 {
 	__set_current_state(TASK_INTERRUPTIBLE);
-- 
Chintan Pandya

QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 1/2] timer: provide an api for deferrable timeout
@ 2014-08-20 12:10 ` Chintan Pandya
  0 siblings, 0 replies; 29+ messages in thread
From: Chintan Pandya @ 2014-08-20 12:10 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-arm-msm, linux-kernel, Chintan Pandya,
	Thomas Gleixner, John Stultz, Peter Zijlstra, Ingo Molnar,
	Hugh Dickins

schedule_timeout wakes up the CPU from IDLE state. For some use cases it
is not desirable, hence introduce a convenient API
(schedule_timeout_deferrable_interruptible) on similar pattern which uses
a deferrable timer.

Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
---
Changes:

V3-->V4:
	- No change

V2-->V3:
	- Big comment moved from static function to exported function
	- Using __setup_timer_on_stack for better readability

V2:
	- this patch has been newly introduced in patch v2


 include/linux/sched.h |  2 ++
 kernel/time/timer.c   | 73 +++++++++++++++++++++++++++++++--------------------
 2 files changed, 47 insertions(+), 28 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5c2c885..13fe361 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -376,6 +376,8 @@ extern int in_sched_functions(unsigned long addr);
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long schedule_timeout(signed long timeout);
 extern signed long schedule_timeout_interruptible(signed long timeout);
+extern signed long
+schedule_timeout_deferrable_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
 asmlinkage void schedule(void);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index aca5dfe..f4c4082 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1431,33 +1431,8 @@ static void process_timeout(unsigned long __data)
 	wake_up_process((struct task_struct *)__data);
 }
 
-/**
- * schedule_timeout - sleep until timeout
- * @timeout: timeout value in jiffies
- *
- * Make the current task sleep until @timeout jiffies have
- * elapsed. The routine will return immediately unless
- * the current task state has been set (see set_current_state()).
- *
- * You can set the task state as follows -
- *
- * %TASK_UNINTERRUPTIBLE - at least @timeout jiffies are guaranteed to
- * pass before the routine returns. The routine will return 0
- *
- * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
- * delivered to the current task. In this case the remaining time
- * in jiffies will be returned, or 0 if the timer expired in time
- *
- * The current task state is guaranteed to be TASK_RUNNING when this
- * routine returns.
- *
- * Specifying a @timeout value of %MAX_SCHEDULE_TIMEOUT will schedule
- * the CPU away without a bound on the timeout. In this case the return
- * value will be %MAX_SCHEDULE_TIMEOUT.
- *
- * In all cases the return value is guaranteed to be non-negative.
- */
-signed long __sched schedule_timeout(signed long timeout)
+static signed long
+__sched __schedule_timeout(signed long timeout, unsigned long flag)
 {
 	struct timer_list timer;
 	unsigned long expire;
@@ -1493,7 +1468,9 @@ signed long __sched schedule_timeout(signed long timeout)
 
 	expire = timeout + jiffies;
 
-	setup_timer_on_stack(&timer, process_timeout, (unsigned long)current);
+	__setup_timer_on_stack(&timer, process_timeout, (unsigned long)current,
+				flag);
+
 	__mod_timer(&timer, expire, false, TIMER_NOT_PINNED);
 	schedule();
 	del_singleshot_timer_sync(&timer);
@@ -1506,12 +1483,52 @@ signed long __sched schedule_timeout(signed long timeout)
  out:
 	return timeout < 0 ? 0 : timeout;
 }
+
+/**
+ * schedule_timeout - sleep until timeout
+ * @timeout: timeout value in jiffies
+ *
+ * Make the current task sleep until @timeout jiffies have
+ * elapsed. The routine will return immediately unless
+ * the current task state has been set (see set_current_state()).
+ *
+ * You can set the task state as follows -
+ *
+ * %TASK_UNINTERRUPTIBLE - at least @timeout jiffies are guaranteed to
+ * pass before the routine returns. The routine will return 0
+ *
+ * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
+ * delivered to the current task. In this case the remaining time
+ * in jiffies will be returned, or 0 if the timer expired in time
+ *
+ * The current task state is guaranteed to be TASK_RUNNING when this
+ * routine returns.
+ *
+ * Specifying a @timeout value of %MAX_SCHEDULE_TIMEOUT will schedule
+ * the CPU away without a bound on the timeout. In this case the return
+ * value will be %MAX_SCHEDULE_TIMEOUT.
+ *
+ * In all cases the return value is guaranteed to be non-negative.
+ */
+signed long __sched schedule_timeout(signed long timeout)
+{
+	return __schedule_timeout(timeout, 0);
+}
 EXPORT_SYMBOL(schedule_timeout);
 
 /*
  * We can use __set_current_state() here because schedule_timeout() calls
  * schedule() unconditionally.
  */
+
+signed long
+__sched schedule_timeout_deferrable_interruptible(signed long timeout)
+{
+	__set_current_state(TASK_INTERRUPTIBLE);
+	return __schedule_timeout(timeout, TIMER_DEFERRABLE);
+}
+EXPORT_SYMBOL(schedule_timeout_deferrable_interruptible);
+
 signed long __sched schedule_timeout_interruptible(signed long timeout)
 {
 	__set_current_state(TASK_INTERRUPTIBLE);
-- 
Chintan Pandya

QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
  2014-08-20 12:10 ` Chintan Pandya
@ 2014-08-20 12:10   ` Chintan Pandya
  -1 siblings, 0 replies; 29+ messages in thread
From: Chintan Pandya @ 2014-08-20 12:10 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-arm-msm, linux-kernel, Chintan Pandya,
	Thomas Gleixner, John Stultz, Peter Zijlstra, Ingo Molnar,
	Hugh Dickins

KSM thread to scan pages is scheduled on definite timeout. That wakes up
CPU from idle state and hence may affect the power consumption. Provide
an optional support to use deferrable timer which suites low-power
use-cases.

Typically, on our setup we observed, 10% less power consumption with some
use-cases in which CPU goes to power collapse frequently. For example,
playing audio on Soc which has HW based Audio encoder/decoder, CPU
remains idle for longer duration of time. This idle state will save
significant CPU power consumption if KSM don't wakes them up
periodically.

Note that, deferrable timers won't be deferred if any CPU is active and
not in IDLE state.

By default, deferrable timers is enabled. To disable deferrable timers,
$ echo 0 > /sys/kernel/mm/ksm/deferrable_timer

Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
---
Changes:

V3-->V4:
	- Use deferrable timers by default

V2-->V3:
	- Handled error case properly
	- Corrected indentation in Documentation
	- Fixed build failure
	- Removed left over process_timeout()
V1-->V2:
	- allowing only valid values to be updated as use_deferrable_timer
	- using only 'deferrable' and not 'deferred'
	- moved out schedule_timeout code for deferrable timer into timer.c


 Documentation/vm/ksm.txt |  6 ++++++
 mm/ksm.c                 | 36 ++++++++++++++++++++++++++++++++++--
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
index f34a8ee..23e26c3 100644
--- a/Documentation/vm/ksm.txt
+++ b/Documentation/vm/ksm.txt
@@ -87,6 +87,12 @@ pages_sharing    - how many more sites are sharing them i.e. how much saved
 pages_unshared   - how many pages unique but repeatedly checked for merging
 pages_volatile   - how many pages changing too fast to be placed in a tree
 full_scans       - how many times all mergeable areas have been scanned
+deferrable_timer - whether to use deferrable timers or not
+                   e.g. "echo 1 > /sys/kernel/mm/ksm/deferrable_timer"
+                   Default: 1 (means, we are using deferrable timers. Users
+		   might want to clear deferrable_timer option if they want
+		   ksm thread to wakeup CPU to carryout ksm activities thus
+		   loosing on battery while gaining on memory savings.)
 
 A high ratio of pages_sharing to pages_shared indicates good sharing, but
 a high ratio of pages_unshared to pages_sharing indicates wasted effort.
diff --git a/mm/ksm.c b/mm/ksm.c
index fb75902..af90e30 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -223,6 +223,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+/* Boolean to indicate whether to use deferrable timer or not */
+static bool use_deferrable_timer = 1;
+
 #ifdef CONFIG_NUMA
 /* Zeroed when merging across nodes is not allowed */
 static unsigned int ksm_merge_across_nodes = 1;
@@ -1725,8 +1728,13 @@ static int ksm_scan_thread(void *nothing)
 		try_to_freeze();
 
 		if (ksmd_should_run()) {
-			schedule_timeout_interruptible(
-				msecs_to_jiffies(ksm_thread_sleep_millisecs));
+			signed long to;
+
+			to = msecs_to_jiffies(ksm_thread_sleep_millisecs);
+			if (use_deferrable_timer)
+				schedule_timeout_deferrable_interruptible(to);
+			else
+				schedule_timeout_interruptible(to);
 		} else {
 			wait_event_freezable(ksm_thread_wait,
 				ksmd_should_run() || kthread_should_stop());
@@ -2175,6 +2183,29 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
 }
 KSM_ATTR(run);
 
+static ssize_t deferrable_timer_show(struct kobject *kobj,
+				    struct kobj_attribute *attr, char *buf)
+{
+	return snprintf(buf, 8, "%d\n", use_deferrable_timer);
+}
+
+static ssize_t deferrable_timer_store(struct kobject *kobj,
+				     struct kobj_attribute *attr,
+				     const char *buf, size_t count)
+{
+	unsigned long enable;
+	int err;
+
+	err = kstrtoul(buf, 10, &enable);
+	if (err < 0)
+		return err;
+	if (enable >= 1)
+		return -EINVAL;
+	use_deferrable_timer = enable;
+	return count;
+}
+KSM_ATTR(deferrable_timer);
+
 #ifdef CONFIG_NUMA
 static ssize_t merge_across_nodes_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf)
@@ -2287,6 +2318,7 @@ static struct attribute *ksm_attrs[] = {
 	&pages_unshared_attr.attr,
 	&pages_volatile_attr.attr,
 	&full_scans_attr.attr,
+	&deferrable_timer_attr.attr,
 #ifdef CONFIG_NUMA
 	&merge_across_nodes_attr.attr,
 #endif
-- 
Chintan Pandya

QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
@ 2014-08-20 12:10   ` Chintan Pandya
  0 siblings, 0 replies; 29+ messages in thread
From: Chintan Pandya @ 2014-08-20 12:10 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-arm-msm, linux-kernel, Chintan Pandya,
	Thomas Gleixner, John Stultz, Peter Zijlstra, Ingo Molnar,
	Hugh Dickins

KSM thread to scan pages is scheduled on definite timeout. That wakes up
CPU from idle state and hence may affect the power consumption. Provide
an optional support to use deferrable timer which suites low-power
use-cases.

Typically, on our setup we observed, 10% less power consumption with some
use-cases in which CPU goes to power collapse frequently. For example,
playing audio on Soc which has HW based Audio encoder/decoder, CPU
remains idle for longer duration of time. This idle state will save
significant CPU power consumption if KSM don't wakes them up
periodically.

Note that, deferrable timers won't be deferred if any CPU is active and
not in IDLE state.

By default, deferrable timers is enabled. To disable deferrable timers,
$ echo 0 > /sys/kernel/mm/ksm/deferrable_timer

Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
---
Changes:

V3-->V4:
	- Use deferrable timers by default

V2-->V3:
	- Handled error case properly
	- Corrected indentation in Documentation
	- Fixed build failure
	- Removed left over process_timeout()
V1-->V2:
	- allowing only valid values to be updated as use_deferrable_timer
	- using only 'deferrable' and not 'deferred'
	- moved out schedule_timeout code for deferrable timer into timer.c


 Documentation/vm/ksm.txt |  6 ++++++
 mm/ksm.c                 | 36 ++++++++++++++++++++++++++++++++++--
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
index f34a8ee..23e26c3 100644
--- a/Documentation/vm/ksm.txt
+++ b/Documentation/vm/ksm.txt
@@ -87,6 +87,12 @@ pages_sharing    - how many more sites are sharing them i.e. how much saved
 pages_unshared   - how many pages unique but repeatedly checked for merging
 pages_volatile   - how many pages changing too fast to be placed in a tree
 full_scans       - how many times all mergeable areas have been scanned
+deferrable_timer - whether to use deferrable timers or not
+                   e.g. "echo 1 > /sys/kernel/mm/ksm/deferrable_timer"
+                   Default: 1 (means, we are using deferrable timers. Users
+		   might want to clear deferrable_timer option if they want
+		   ksm thread to wakeup CPU to carryout ksm activities thus
+		   loosing on battery while gaining on memory savings.)
 
 A high ratio of pages_sharing to pages_shared indicates good sharing, but
 a high ratio of pages_unshared to pages_sharing indicates wasted effort.
diff --git a/mm/ksm.c b/mm/ksm.c
index fb75902..af90e30 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -223,6 +223,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
 /* Milliseconds ksmd should sleep between batches */
 static unsigned int ksm_thread_sleep_millisecs = 20;
 
+/* Boolean to indicate whether to use deferrable timer or not */
+static bool use_deferrable_timer = 1;
+
 #ifdef CONFIG_NUMA
 /* Zeroed when merging across nodes is not allowed */
 static unsigned int ksm_merge_across_nodes = 1;
@@ -1725,8 +1728,13 @@ static int ksm_scan_thread(void *nothing)
 		try_to_freeze();
 
 		if (ksmd_should_run()) {
-			schedule_timeout_interruptible(
-				msecs_to_jiffies(ksm_thread_sleep_millisecs));
+			signed long to;
+
+			to = msecs_to_jiffies(ksm_thread_sleep_millisecs);
+			if (use_deferrable_timer)
+				schedule_timeout_deferrable_interruptible(to);
+			else
+				schedule_timeout_interruptible(to);
 		} else {
 			wait_event_freezable(ksm_thread_wait,
 				ksmd_should_run() || kthread_should_stop());
@@ -2175,6 +2183,29 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
 }
 KSM_ATTR(run);
 
+static ssize_t deferrable_timer_show(struct kobject *kobj,
+				    struct kobj_attribute *attr, char *buf)
+{
+	return snprintf(buf, 8, "%d\n", use_deferrable_timer);
+}
+
+static ssize_t deferrable_timer_store(struct kobject *kobj,
+				     struct kobj_attribute *attr,
+				     const char *buf, size_t count)
+{
+	unsigned long enable;
+	int err;
+
+	err = kstrtoul(buf, 10, &enable);
+	if (err < 0)
+		return err;
+	if (enable >= 1)
+		return -EINVAL;
+	use_deferrable_timer = enable;
+	return count;
+}
+KSM_ATTR(deferrable_timer);
+
 #ifdef CONFIG_NUMA
 static ssize_t merge_across_nodes_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf)
@@ -2287,6 +2318,7 @@ static struct attribute *ksm_attrs[] = {
 	&pages_unshared_attr.attr,
 	&pages_volatile_attr.attr,
 	&full_scans_attr.attr,
+	&deferrable_timer_attr.attr,
 #ifdef CONFIG_NUMA
 	&merge_across_nodes_attr.attr,
 #endif
-- 
Chintan Pandya

QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
  2014-08-20 12:10   ` Chintan Pandya
@ 2014-08-28  6:02     ` Hugh Dickins
  -1 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2014-08-28  6:02 UTC (permalink / raw)
  To: Chintan Pandya
  Cc: akpm, linux-mm, linux-arm-msm, linux-kernel, Thomas Gleixner,
	John Stultz, Peter Zijlstra, Ingo Molnar, Hugh Dickins

Sorry for holding you up, I'm slow. and needed to think about this more,

On Wed, 20 Aug 2014, Chintan Pandya wrote:

> KSM thread to scan pages is scheduled on definite timeout. That wakes up
> CPU from idle state and hence may affect the power consumption. Provide
> an optional support to use deferrable timer which suites low-power
> use-cases.
> 
> Typically, on our setup we observed, 10% less power consumption with some
> use-cases in which CPU goes to power collapse frequently. For example,
> playing audio on Soc which has HW based Audio encoder/decoder, CPU
> remains idle for longer duration of time. This idle state will save
> significant CPU power consumption if KSM don't wakes them up
> periodically.
> 
> Note that, deferrable timers won't be deferred if any CPU is active and
> not in IDLE state.
> 
> By default, deferrable timers is enabled. To disable deferrable timers,
> $ echo 0 > /sys/kernel/mm/ksm/deferrable_timer

I have now experimented.  And, much as I wanted to eliminate the
tunable, and just have deferrable timers on, I have come right back
to your original position.

I was impressed by how quiet ksmd goes when there's nothing much
happening on the machine; but equally, disappointed in how slow
it then is to fulfil the outstanding merge work.  I agree with your
original assessment, that not everybody will want deferrable timer,
the way it is working at present.

I expect that can be fixed, partly by doing more work on wakeup from
a deferred timer, according to how long it has been deferred; and
partly by not deferring on idle until two passes of the list have been
completed.  But that's easier said than done, and might turn out to
defeat deferring the timer in too many cases: a balance to be found.

I hope that you or I or another will find time to do that work soon,
maybe before 3.18 but likely not; but I think the advantage of your
option is too important to delay it further.  Once we are satisfied
with later improvements, then I would like to remove the tunable, or
at least default it to on.  But for now, the default should be off.

It's unclear whether I should still worry about ksmd's gross activity,
when the system is not actually idle, but all the activity is in non-
mergeable processes.  I'm ashamed of that hyper-activity, and still
think that fixing it would be better than using a deferrable timer -
deferring work (particularly intensive work of the kind which ksmd
does) until the system is otherwise busy, is not necessarily a good
strategy (too much work to do all at once).

But fixing that might require ksm hooks in hot locations where nobody
else would want them: I'm rather hoping we can strike a good enough
balance with your deferrable timer, that nobody will need any better.

So, with a few changes here and below, please add my
Acked-by: Hugh Dickins <hughd@google.com>
to patches 1 and 2, and resend to akpm - thank you!

Here (above), it's restore the text to V3's
To enable deferrable timer,
$ echo 1 > /sys/kernel/mm/ksm/deferrable_timer

> 
> Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Hugh Dickins <hughd@google.com>
> ---
> Changes:

Sorry, I'm asking for a V4-->V5 reversing V3-->V4!

> 
> V3-->V4:
> 	- Use deferrable timers by default
> 
> V2-->V3:
> 	- Handled error case properly
> 	- Corrected indentation in Documentation
> 	- Fixed build failure
> 	- Removed left over process_timeout()
> V1-->V2:
> 	- allowing only valid values to be updated as use_deferrable_timer
> 	- using only 'deferrable' and not 'deferred'
> 	- moved out schedule_timeout code for deferrable timer into timer.c
> 
> 
>  Documentation/vm/ksm.txt |  6 ++++++
>  mm/ksm.c                 | 36 ++++++++++++++++++++++++++++++++++--
>  2 files changed, 40 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
> index f34a8ee..23e26c3 100644
> --- a/Documentation/vm/ksm.txt
> +++ b/Documentation/vm/ksm.txt
> @@ -87,6 +87,12 @@ pages_sharing    - how many more sites are sharing them i.e. how much saved
>  pages_unshared   - how many pages unique but repeatedly checked for merging
>  pages_volatile   - how many pages changing too fast to be placed in a tree
>  full_scans       - how many times all mergeable areas have been scanned
> +deferrable_timer - whether to use deferrable timers or not
> +                   e.g. "echo 1 > /sys/kernel/mm/ksm/deferrable_timer"
> +                   Default: 1 (means, we are using deferrable timers. Users
> +		   might want to clear deferrable_timer option if they want
> +		   ksm thread to wakeup CPU to carryout ksm activities thus
> +		   loosing on battery while gaining on memory savings.)

Please move this section to between the "sleep_millisecs" and
"merge_across_nodes" descriptions, separated from each by a blank line:

deferrable_timer - whether to save power by using a deferrable timer
                   e.g. "echo 1 > /sys/kernel/mm/ksm/deferrable_timer"
                   If set to 1, saves power by letting ksmd sleep for
                   longer than sleep_millisecs whenever the system is idle,
                   extending battery life but sometimes saving less memory.
                   Default: 0 (strict sleep timer as in earlier releases)
                   Warning: this default is likely to be changed to 1, and
                   deferrable_timer file then removed, in future releases.

>  
>  A high ratio of pages_sharing to pages_shared indicates good sharing, but
>  a high ratio of pages_unshared to pages_sharing indicates wasted effort.
> diff --git a/mm/ksm.c b/mm/ksm.c
> index fb75902..af90e30 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -223,6 +223,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
>  /* Milliseconds ksmd should sleep between batches */
>  static unsigned int ksm_thread_sleep_millisecs = 20;
>  
> +/* Boolean to indicate whether to use deferrable timer or not */
> +static bool use_deferrable_timer = 1;

s/ = 1//

> +
>  #ifdef CONFIG_NUMA
>  /* Zeroed when merging across nodes is not allowed */
>  static unsigned int ksm_merge_across_nodes = 1;
> @@ -1725,8 +1728,13 @@ static int ksm_scan_thread(void *nothing)
>  		try_to_freeze();
>  
>  		if (ksmd_should_run()) {
> -			schedule_timeout_interruptible(
> -				msecs_to_jiffies(ksm_thread_sleep_millisecs));
> +			signed long to;
> +
> +			to = msecs_to_jiffies(ksm_thread_sleep_millisecs);
> +			if (use_deferrable_timer)
> +				schedule_timeout_deferrable_interruptible(to);
> +			else
> +				schedule_timeout_interruptible(to);
>  		} else {
>  			wait_event_freezable(ksm_thread_wait,
>  				ksmd_should_run() || kthread_should_stop());
> @@ -2175,6 +2183,29 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
>  }
>  KSM_ATTR(run);
>  
> +static ssize_t deferrable_timer_show(struct kobject *kobj,
> +				    struct kobj_attribute *attr, char *buf)
> +{
> +	return snprintf(buf, 8, "%d\n", use_deferrable_timer);

I wonder what that "8" was for:
	return sprintf(buf, "%u\n", use_deferrable_timer);

> +}
> +
> +static ssize_t deferrable_timer_store(struct kobject *kobj,
> +				     struct kobj_attribute *attr,
> +				     const char *buf, size_t count)
> +{
> +	unsigned long enable;
> +	int err;
> +
> +	err = kstrtoul(buf, 10, &enable);
> +	if (err < 0)
> +		return err;
> +	if (enable >= 1)

It seems you misunderstood me, and never tested
$ echo 1 >/sys/kernel/mm/ksm/deferrable_timer
bash: echo: write error: Invalid argument

s/>=/>/

Or better, follow the rest of ksm.c's parsing, just
	if (err || enable > 1)
		return -EINVAL;

Thanks,
Hugh


> +		return -EINVAL;
> +	use_deferrable_timer = enable;
> +	return count;
> +}
> +KSM_ATTR(deferrable_timer);
> +
>  #ifdef CONFIG_NUMA
>  static ssize_t merge_across_nodes_show(struct kobject *kobj,
>  				struct kobj_attribute *attr, char *buf)
> @@ -2287,6 +2318,7 @@ static struct attribute *ksm_attrs[] = {
>  	&pages_unshared_attr.attr,
>  	&pages_volatile_attr.attr,
>  	&full_scans_attr.attr,
> +	&deferrable_timer_attr.attr,
>  #ifdef CONFIG_NUMA
>  	&merge_across_nodes_attr.attr,
>  #endif
> -- 
> Chintan Pandya
> 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member of the Code Aurora Forum, hosted by The Linux Foundation

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
@ 2014-08-28  6:02     ` Hugh Dickins
  0 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2014-08-28  6:02 UTC (permalink / raw)
  To: Chintan Pandya
  Cc: akpm, linux-mm, linux-arm-msm, linux-kernel, Thomas Gleixner,
	John Stultz, Peter Zijlstra, Ingo Molnar, Hugh Dickins

Sorry for holding you up, I'm slow. and needed to think about this more,

On Wed, 20 Aug 2014, Chintan Pandya wrote:

> KSM thread to scan pages is scheduled on definite timeout. That wakes up
> CPU from idle state and hence may affect the power consumption. Provide
> an optional support to use deferrable timer which suites low-power
> use-cases.
> 
> Typically, on our setup we observed, 10% less power consumption with some
> use-cases in which CPU goes to power collapse frequently. For example,
> playing audio on Soc which has HW based Audio encoder/decoder, CPU
> remains idle for longer duration of time. This idle state will save
> significant CPU power consumption if KSM don't wakes them up
> periodically.
> 
> Note that, deferrable timers won't be deferred if any CPU is active and
> not in IDLE state.
> 
> By default, deferrable timers is enabled. To disable deferrable timers,
> $ echo 0 > /sys/kernel/mm/ksm/deferrable_timer

I have now experimented.  And, much as I wanted to eliminate the
tunable, and just have deferrable timers on, I have come right back
to your original position.

I was impressed by how quiet ksmd goes when there's nothing much
happening on the machine; but equally, disappointed in how slow
it then is to fulfil the outstanding merge work.  I agree with your
original assessment, that not everybody will want deferrable timer,
the way it is working at present.

I expect that can be fixed, partly by doing more work on wakeup from
a deferred timer, according to how long it has been deferred; and
partly by not deferring on idle until two passes of the list have been
completed.  But that's easier said than done, and might turn out to
defeat deferring the timer in too many cases: a balance to be found.

I hope that you or I or another will find time to do that work soon,
maybe before 3.18 but likely not; but I think the advantage of your
option is too important to delay it further.  Once we are satisfied
with later improvements, then I would like to remove the tunable, or
at least default it to on.  But for now, the default should be off.

It's unclear whether I should still worry about ksmd's gross activity,
when the system is not actually idle, but all the activity is in non-
mergeable processes.  I'm ashamed of that hyper-activity, and still
think that fixing it would be better than using a deferrable timer -
deferring work (particularly intensive work of the kind which ksmd
does) until the system is otherwise busy, is not necessarily a good
strategy (too much work to do all at once).

But fixing that might require ksm hooks in hot locations where nobody
else would want them: I'm rather hoping we can strike a good enough
balance with your deferrable timer, that nobody will need any better.

So, with a few changes here and below, please add my
Acked-by: Hugh Dickins <hughd@google.com>
to patches 1 and 2, and resend to akpm - thank you!

Here (above), it's restore the text to V3's
To enable deferrable timer,
$ echo 1 > /sys/kernel/mm/ksm/deferrable_timer

> 
> Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Hugh Dickins <hughd@google.com>
> ---
> Changes:

Sorry, I'm asking for a V4-->V5 reversing V3-->V4!

> 
> V3-->V4:
> 	- Use deferrable timers by default
> 
> V2-->V3:
> 	- Handled error case properly
> 	- Corrected indentation in Documentation
> 	- Fixed build failure
> 	- Removed left over process_timeout()
> V1-->V2:
> 	- allowing only valid values to be updated as use_deferrable_timer
> 	- using only 'deferrable' and not 'deferred'
> 	- moved out schedule_timeout code for deferrable timer into timer.c
> 
> 
>  Documentation/vm/ksm.txt |  6 ++++++
>  mm/ksm.c                 | 36 ++++++++++++++++++++++++++++++++++--
>  2 files changed, 40 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
> index f34a8ee..23e26c3 100644
> --- a/Documentation/vm/ksm.txt
> +++ b/Documentation/vm/ksm.txt
> @@ -87,6 +87,12 @@ pages_sharing    - how many more sites are sharing them i.e. how much saved
>  pages_unshared   - how many pages unique but repeatedly checked for merging
>  pages_volatile   - how many pages changing too fast to be placed in a tree
>  full_scans       - how many times all mergeable areas have been scanned
> +deferrable_timer - whether to use deferrable timers or not
> +                   e.g. "echo 1 > /sys/kernel/mm/ksm/deferrable_timer"
> +                   Default: 1 (means, we are using deferrable timers. Users
> +		   might want to clear deferrable_timer option if they want
> +		   ksm thread to wakeup CPU to carryout ksm activities thus
> +		   loosing on battery while gaining on memory savings.)

Please move this section to between the "sleep_millisecs" and
"merge_across_nodes" descriptions, separated from each by a blank line:

deferrable_timer - whether to save power by using a deferrable timer
                   e.g. "echo 1 > /sys/kernel/mm/ksm/deferrable_timer"
                   If set to 1, saves power by letting ksmd sleep for
                   longer than sleep_millisecs whenever the system is idle,
                   extending battery life but sometimes saving less memory.
                   Default: 0 (strict sleep timer as in earlier releases)
                   Warning: this default is likely to be changed to 1, and
                   deferrable_timer file then removed, in future releases.

>  
>  A high ratio of pages_sharing to pages_shared indicates good sharing, but
>  a high ratio of pages_unshared to pages_sharing indicates wasted effort.
> diff --git a/mm/ksm.c b/mm/ksm.c
> index fb75902..af90e30 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -223,6 +223,9 @@ static unsigned int ksm_thread_pages_to_scan = 100;
>  /* Milliseconds ksmd should sleep between batches */
>  static unsigned int ksm_thread_sleep_millisecs = 20;
>  
> +/* Boolean to indicate whether to use deferrable timer or not */
> +static bool use_deferrable_timer = 1;

s/ = 1//

> +
>  #ifdef CONFIG_NUMA
>  /* Zeroed when merging across nodes is not allowed */
>  static unsigned int ksm_merge_across_nodes = 1;
> @@ -1725,8 +1728,13 @@ static int ksm_scan_thread(void *nothing)
>  		try_to_freeze();
>  
>  		if (ksmd_should_run()) {
> -			schedule_timeout_interruptible(
> -				msecs_to_jiffies(ksm_thread_sleep_millisecs));
> +			signed long to;
> +
> +			to = msecs_to_jiffies(ksm_thread_sleep_millisecs);
> +			if (use_deferrable_timer)
> +				schedule_timeout_deferrable_interruptible(to);
> +			else
> +				schedule_timeout_interruptible(to);
>  		} else {
>  			wait_event_freezable(ksm_thread_wait,
>  				ksmd_should_run() || kthread_should_stop());
> @@ -2175,6 +2183,29 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
>  }
>  KSM_ATTR(run);
>  
> +static ssize_t deferrable_timer_show(struct kobject *kobj,
> +				    struct kobj_attribute *attr, char *buf)
> +{
> +	return snprintf(buf, 8, "%d\n", use_deferrable_timer);

I wonder what that "8" was for:
	return sprintf(buf, "%u\n", use_deferrable_timer);

> +}
> +
> +static ssize_t deferrable_timer_store(struct kobject *kobj,
> +				     struct kobj_attribute *attr,
> +				     const char *buf, size_t count)
> +{
> +	unsigned long enable;
> +	int err;
> +
> +	err = kstrtoul(buf, 10, &enable);
> +	if (err < 0)
> +		return err;
> +	if (enable >= 1)

It seems you misunderstood me, and never tested
$ echo 1 >/sys/kernel/mm/ksm/deferrable_timer
bash: echo: write error: Invalid argument

s/>=/>/

Or better, follow the rest of ksm.c's parsing, just
	if (err || enable > 1)
		return -EINVAL;

Thanks,
Hugh


> +		return -EINVAL;
> +	use_deferrable_timer = enable;
> +	return count;
> +}
> +KSM_ATTR(deferrable_timer);
> +
>  #ifdef CONFIG_NUMA
>  static ssize_t merge_across_nodes_show(struct kobject *kobj,
>  				struct kobj_attribute *attr, char *buf)
> @@ -2287,6 +2318,7 @@ static struct attribute *ksm_attrs[] = {
>  	&pages_unshared_attr.attr,
>  	&pages_volatile_attr.attr,
>  	&full_scans_attr.attr,
> +	&deferrable_timer_attr.attr,
>  #ifdef CONFIG_NUMA
>  	&merge_across_nodes_attr.attr,
>  #endif
> -- 
> Chintan Pandya
> 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
  2014-08-28  6:02     ` Hugh Dickins
@ 2014-09-03  9:58       ` Peter Zijlstra
  -1 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2014-09-03  9:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Chintan Pandya, akpm, linux-mm, linux-arm-msm, linux-kernel,
	Thomas Gleixner, John Stultz, Ingo Molnar

On Wed, Aug 27, 2014 at 11:02:20PM -0700, Hugh Dickins wrote:
> Sorry for holding you up, I'm slow. and needed to think about this more,
> 
> On Wed, 20 Aug 2014, Chintan Pandya wrote:
> 
> > KSM thread to scan pages is scheduled on definite timeout. That wakes up
> > CPU from idle state and hence may affect the power consumption. Provide
> > an optional support to use deferrable timer which suites low-power
> > use-cases.
> > 
> > Typically, on our setup we observed, 10% less power consumption with some
> > use-cases in which CPU goes to power collapse frequently. For example,
> > playing audio on Soc which has HW based Audio encoder/decoder, CPU
> > remains idle for longer duration of time. This idle state will save
> > significant CPU power consumption if KSM don't wakes them up
> > periodically.
> > 
> > Note that, deferrable timers won't be deferred if any CPU is active and
> > not in IDLE state.
> > 
> > By default, deferrable timers is enabled. To disable deferrable timers,
> > $ echo 0 > /sys/kernel/mm/ksm/deferrable_timer
> 
> I have now experimented.  And, much as I wanted to eliminate the
> tunable, and just have deferrable timers on, I have come right back
> to your original position.
> 
> I was impressed by how quiet ksmd goes when there's nothing much
> happening on the machine; but equally, disappointed in how slow
> it then is to fulfil the outstanding merge work.  I agree with your
> original assessment, that not everybody will want deferrable timer,
> the way it is working at present.
> 
> I expect that can be fixed, partly by doing more work on wakeup from
> a deferred timer, according to how long it has been deferred; and
> partly by not deferring on idle until two passes of the list have been
> completed.  But that's easier said than done, and might turn out to

So why not have the timer cancel itself when there is no more work to do
and start itself up again when there's work added?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
@ 2014-09-03  9:58       ` Peter Zijlstra
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2014-09-03  9:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Chintan Pandya, akpm, linux-mm, linux-arm-msm, linux-kernel,
	Thomas Gleixner, John Stultz, Ingo Molnar

On Wed, Aug 27, 2014 at 11:02:20PM -0700, Hugh Dickins wrote:
> Sorry for holding you up, I'm slow. and needed to think about this more,
> 
> On Wed, 20 Aug 2014, Chintan Pandya wrote:
> 
> > KSM thread to scan pages is scheduled on definite timeout. That wakes up
> > CPU from idle state and hence may affect the power consumption. Provide
> > an optional support to use deferrable timer which suites low-power
> > use-cases.
> > 
> > Typically, on our setup we observed, 10% less power consumption with some
> > use-cases in which CPU goes to power collapse frequently. For example,
> > playing audio on Soc which has HW based Audio encoder/decoder, CPU
> > remains idle for longer duration of time. This idle state will save
> > significant CPU power consumption if KSM don't wakes them up
> > periodically.
> > 
> > Note that, deferrable timers won't be deferred if any CPU is active and
> > not in IDLE state.
> > 
> > By default, deferrable timers is enabled. To disable deferrable timers,
> > $ echo 0 > /sys/kernel/mm/ksm/deferrable_timer
> 
> I have now experimented.  And, much as I wanted to eliminate the
> tunable, and just have deferrable timers on, I have come right back
> to your original position.
> 
> I was impressed by how quiet ksmd goes when there's nothing much
> happening on the machine; but equally, disappointed in how slow
> it then is to fulfil the outstanding merge work.  I agree with your
> original assessment, that not everybody will want deferrable timer,
> the way it is working at present.
> 
> I expect that can be fixed, partly by doing more work on wakeup from
> a deferred timer, according to how long it has been deferred; and
> partly by not deferring on idle until two passes of the list have been
> completed.  But that's easier said than done, and might turn out to

So why not have the timer cancel itself when there is no more work to do
and start itself up again when there's work added?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
  2014-09-03  9:58       ` Peter Zijlstra
@ 2014-09-03 10:32         ` Thomas Gleixner
  -1 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2014-09-03 10:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Chintan Pandya, akpm, linux-mm, linux-arm-msm,
	linux-kernel, John Stultz, Ingo Molnar

On Wed, 3 Sep 2014, Peter Zijlstra wrote:
> On Wed, Aug 27, 2014 at 11:02:20PM -0700, Hugh Dickins wrote:
> > Sorry for holding you up, I'm slow. and needed to think about this more,
> > 
> > On Wed, 20 Aug 2014, Chintan Pandya wrote:
> > 
> > > KSM thread to scan pages is scheduled on definite timeout. That wakes up
> > > CPU from idle state and hence may affect the power consumption. Provide
> > > an optional support to use deferrable timer which suites low-power
> > > use-cases.
> > > 
> > > Typically, on our setup we observed, 10% less power consumption with some
> > > use-cases in which CPU goes to power collapse frequently. For example,
> > > playing audio on Soc which has HW based Audio encoder/decoder, CPU
> > > remains idle for longer duration of time. This idle state will save
> > > significant CPU power consumption if KSM don't wakes them up
> > > periodically.
> > > 
> > > Note that, deferrable timers won't be deferred if any CPU is active and
> > > not in IDLE state.

This is completely wrong. A deferrable timer enqueued on a given CPU
is deferred if that very CPU goes idle. The timer subsystem does not
care at all about the other CPUs.

And that very much explains Hughs observations. If the ksm thread
sleeps deferrable on a CPU which is idle for a very long time, it will
be deferred despite work accumulating on other CPUs.

> > > By default, deferrable timers is enabled. To disable deferrable timers,
> > > $ echo 0 > /sys/kernel/mm/ksm/deferrable_timer
> > 
> > I have now experimented.  And, much as I wanted to eliminate the
> > tunable, and just have deferrable timers on, I have come right back
> > to your original position.
> > 
> > I was impressed by how quiet ksmd goes when there's nothing much
> > happening on the machine; but equally, disappointed in how slow
> > it then is to fulfil the outstanding merge work.  I agree with your
> > original assessment, that not everybody will want deferrable timer,
> > the way it is working at present.
> > 
> > I expect that can be fixed, partly by doing more work on wakeup from
> > a deferred timer, according to how long it has been deferred; and
> > partly by not deferring on idle until two passes of the list have been
> > completed.  But that's easier said than done, and might turn out to
> 
> So why not have the timer cancel itself when there is no more work to do
> and start itself up again when there's work added?

Because that requires more work and thoughts than simply slapping a
deferrable timer at the problem and creating a sysfs variable to turn
it on/off.

So looking at Hughs test results I'm quite sure that the deferrable
timer is just another tunable bandaid with dubious value and the
potential of predictable bug/regresssion reports.

So no, I wont merge the schedule_timeout_deferrable() hackery unless
the whole mechanism is usable w/o tunables and regressions.

Thanks,

	tglx

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
@ 2014-09-03 10:32         ` Thomas Gleixner
  0 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2014-09-03 10:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Chintan Pandya, akpm, linux-mm, linux-arm-msm,
	linux-kernel, John Stultz, Ingo Molnar

On Wed, 3 Sep 2014, Peter Zijlstra wrote:
> On Wed, Aug 27, 2014 at 11:02:20PM -0700, Hugh Dickins wrote:
> > Sorry for holding you up, I'm slow. and needed to think about this more,
> > 
> > On Wed, 20 Aug 2014, Chintan Pandya wrote:
> > 
> > > KSM thread to scan pages is scheduled on definite timeout. That wakes up
> > > CPU from idle state and hence may affect the power consumption. Provide
> > > an optional support to use deferrable timer which suites low-power
> > > use-cases.
> > > 
> > > Typically, on our setup we observed, 10% less power consumption with some
> > > use-cases in which CPU goes to power collapse frequently. For example,
> > > playing audio on Soc which has HW based Audio encoder/decoder, CPU
> > > remains idle for longer duration of time. This idle state will save
> > > significant CPU power consumption if KSM don't wakes them up
> > > periodically.
> > > 
> > > Note that, deferrable timers won't be deferred if any CPU is active and
> > > not in IDLE state.

This is completely wrong. A deferrable timer enqueued on a given CPU
is deferred if that very CPU goes idle. The timer subsystem does not
care at all about the other CPUs.

And that very much explains Hughs observations. If the ksm thread
sleeps deferrable on a CPU which is idle for a very long time, it will
be deferred despite work accumulating on other CPUs.

> > > By default, deferrable timers is enabled. To disable deferrable timers,
> > > $ echo 0 > /sys/kernel/mm/ksm/deferrable_timer
> > 
> > I have now experimented.  And, much as I wanted to eliminate the
> > tunable, and just have deferrable timers on, I have come right back
> > to your original position.
> > 
> > I was impressed by how quiet ksmd goes when there's nothing much
> > happening on the machine; but equally, disappointed in how slow
> > it then is to fulfil the outstanding merge work.  I agree with your
> > original assessment, that not everybody will want deferrable timer,
> > the way it is working at present.
> > 
> > I expect that can be fixed, partly by doing more work on wakeup from
> > a deferred timer, according to how long it has been deferred; and
> > partly by not deferring on idle until two passes of the list have been
> > completed.  But that's easier said than done, and might turn out to
> 
> So why not have the timer cancel itself when there is no more work to do
> and start itself up again when there's work added?

Because that requires more work and thoughts than simply slapping a
deferrable timer at the problem and creating a sysfs variable to turn
it on/off.

So looking at Hughs test results I'm quite sure that the deferrable
timer is just another tunable bandaid with dubious value and the
potential of predictable bug/regresssion reports.

So no, I wont merge the schedule_timeout_deferrable() hackery unless
the whole mechanism is usable w/o tunables and regressions.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
  2014-09-03  9:58       ` Peter Zijlstra
@ 2014-09-08  8:25         ` Hugh Dickins
  -1 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2014-09-08  8:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Chintan Pandya, akpm, linux-mm, linux-arm-msm,
	linux-kernel, Thomas Gleixner, John Stultz, Ingo Molnar

On Wed, 3 Sep 2014, Peter Zijlstra wrote:
> On Wed, Aug 27, 2014 at 11:02:20PM -0700, Hugh Dickins wrote:
> > On Wed, 20 Aug 2014, Chintan Pandya wrote:
> > 
> > > KSM thread to scan pages is scheduled on definite timeout. That wakes up
> > > CPU from idle state and hence may affect the power consumption. Provide
> > > an optional support to use deferrable timer which suites low-power
> > > use-cases.
> > > 
> > > Typically, on our setup we observed, 10% less power consumption with some
> > > use-cases in which CPU goes to power collapse frequently. For example,
> > > playing audio on Soc which has HW based Audio encoder/decoder, CPU
> > > remains idle for longer duration of time. This idle state will save
> > > significant CPU power consumption if KSM don't wakes them up
> > > periodically.
> > > 
> > > Note that, deferrable timers won't be deferred if any CPU is active and
> > > not in IDLE state.
> > > 
> > > By default, deferrable timers is enabled. To disable deferrable timers,
> > > $ echo 0 > /sys/kernel/mm/ksm/deferrable_timer
> > 
> > I have now experimented.  And, much as I wanted to eliminate the
> > tunable, and just have deferrable timers on, I have come right back
> > to your original position.
> > 
> > I was impressed by how quiet ksmd goes when there's nothing much
> > happening on the machine; but equally, disappointed in how slow
> > it then is to fulfil the outstanding merge work.  I agree with your
> > original assessment, that not everybody will want deferrable timer,
> > the way it is working at present.
> > 
> > I expect that can be fixed, partly by doing more work on wakeup from
> > a deferred timer, according to how long it has been deferred; and
> > partly by not deferring on idle until two passes of the list have been
> > completed.  But that's easier said than done, and might turn out to
> 
> So why not have the timer cancel itself when there is no more work to do
> and start itself up again when there's work added?

Well, yes, but... how do we know when there is no more work to do?
Further down I said:

> > But fixing that might require ksm hooks in hot locations where nobody
> > else would want them: I'm rather hoping we can strike a good enough
> > balance with your deferrable timer, that nobody will need any better.

Thomas has given reason why KSM might simply fail to do its job if we
rely on the deferrable timer.  So I've tried another approach, patch
below; but I do not expect you to jump for joy at the sight of it!

I've tried to minimize the offensive KSM hook in context_switch().
Why place it there, rather than do something near profile_tick() or
account_process_tick()?  Because KSM is aware of mms not tasks, and
context_switch() should have the next mm cachelines hot (if not, a
slight regrouping in mm_struct should do it); whereas I can find
no reference whatever to mm_struct in kernel/time, so hooking to
KSM from there would drag in another few cachelines every tick.

(Another approach would be to set up KSM hint faulting, along the
lines of NUMA hint faulting.  Not a path I'm keen to go down.)

I'm not thrilled with this patch, I think it's somewhat defective
in several ways.  But maybe in practice it will prove good enough,
and if so then I'd rather not waste effort on complicating it.

My own testing is not realistic, nor representative of real KSM users;
and I have no idea what values of pages_to_scan and sleep_millisecs
people really use (and those may make quite a difference to how
well it works).

Chintan, even if the scheduler guys turn out to hate it, please would
you give the patch below a try, to see how well it works in your
environment, whether it seems to go better or worse than your own patch.

If it works well enough for you, maybe we can come up with ideas to
make it more palatable.  I do think your issue is an important one
to fix, one way or another.

Thanks,
Hugh

[PATCH] ksm: avoid periodic wakeup while mergeable mms are quiet

Description yet to be written!

Reported-by: Chintan Pandya <cpandya@codeaurora.org>
Not-Signed-off-by: Hugh Dickins <hughd@google.com>
---

 include/linux/ksm.h   |   14 +++++++++++
 include/linux/sched.h |    1 
 kernel/sched/core.c   |    9 ++++++-
 mm/ksm.c              |   50 ++++++++++++++++++++++++++++------------
 4 files changed, 58 insertions(+), 16 deletions(-)

--- 3.17-rc4/include/linux/ksm.h	2014-03-30 20:40:15.000000000 -0700
+++ linux/include/linux/ksm.h	2014-09-07 11:54:41.528003316 -0700
@@ -12,6 +12,7 @@
 #include <linux/pagemap.h>
 #include <linux/rmap.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 
 struct stable_node;
 struct mem_cgroup;
@@ -21,6 +22,7 @@ int ksm_madvise(struct vm_area_struct *v
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
 void __ksm_exit(struct mm_struct *mm);
+wait_queue_head_t *__ksm_switch(struct mm_struct *mm);
 
 static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
@@ -35,6 +37,13 @@ static inline void ksm_exit(struct mm_st
 		__ksm_exit(mm);
 }
 
+static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)
+{
+	if (unlikely(test_bit(MMF_SWITCH_TO_KSM, &mm->flags)))
+		return __ksm_switch(mm);
+	return NULL;
+}
+
 /*
  * A KSM page is one of those write-protected "shared pages" or "merged pages"
  * which KSM maps into multiple mms, wherever identical anonymous page content
@@ -87,6 +96,11 @@ static inline void ksm_exit(struct mm_st
 {
 }
 
+static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)
+{
+	return NULL;
+}
+
 static inline int PageKsm(struct page *page)
 {
 	return 0;
--- 3.17-rc4/include/linux/sched.h	2014-08-16 16:00:53.909189060 -0700
+++ linux/include/linux/sched.h	2014-09-07 11:54:41.528003316 -0700
@@ -453,6 +453,7 @@ static inline int get_dumpable(struct mm
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
+#define MMF_SWITCH_TO_KSM	21	/* notify KSM of switch to this mm */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
--- 3.17-rc4/kernel/sched/core.c	2014-08-16 16:00:54.062189063 -0700
+++ linux/kernel/sched/core.c	2014-09-07 11:54:41.528003316 -0700
@@ -61,6 +61,7 @@
 #include <linux/times.h>
 #include <linux/tsacct_kern.h>
 #include <linux/kprobes.h>
+#include <linux/ksm.h>
 #include <linux/delayacct.h>
 #include <linux/unistd.h>
 #include <linux/pagemap.h>
@@ -2304,6 +2305,7 @@ context_switch(struct rq *rq, struct tas
 	       struct task_struct *next)
 {
 	struct mm_struct *mm, *oldmm;
+	wait_queue_head_t *wake_ksm = NULL;
 
 	prepare_task_switch(rq, prev, next);
 
@@ -2320,8 +2322,10 @@ context_switch(struct rq *rq, struct tas
 		next->active_mm = oldmm;
 		atomic_inc(&oldmm->mm_count);
 		enter_lazy_tlb(oldmm, next);
-	} else
+	} else {
 		switch_mm(oldmm, mm, next);
+		wake_ksm = ksm_switch(mm);
+	}
 
 	if (!prev->mm) {
 		prev->active_mm = NULL;
@@ -2348,6 +2352,9 @@ context_switch(struct rq *rq, struct tas
 	 * frame will be invalid.
 	 */
 	finish_task_switch(this_rq(), prev);
+
+	if (wake_ksm)
+		wake_up_interruptible(wake_ksm);
 }
 
 /*
--- 3.17-rc4/mm/ksm.c	2014-08-16 16:00:54.132189065 -0700
+++ linux/mm/ksm.c	2014-09-07 11:54:41.528003316 -0700
@@ -205,6 +205,9 @@ static struct kmem_cache *rmap_item_cach
 static struct kmem_cache *stable_node_cache;
 static struct kmem_cache *mm_slot_cache;
 
+/* The number of mergeable mms which have recently run */
+static atomic_t active_mergeable_mms = ATOMIC_INIT(0);
+
 /* The number of nodes in the stable tree */
 static unsigned long ksm_pages_shared;
 
@@ -313,9 +316,13 @@ static inline struct mm_slot *alloc_mm_s
 	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
 }
 
-static inline void free_mm_slot(struct mm_slot *mm_slot)
+static void free_mm_slot(struct mm_struct *mm, struct mm_slot *mm_slot)
 {
 	kmem_cache_free(mm_slot_cache, mm_slot);
+
+	clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+	if (!test_and_clear_bit(MMF_SWITCH_TO_KSM, &mm->flags))
+		atomic_dec(&active_mergeable_mms);
 }
 
 static struct mm_slot *get_mm_slot(struct mm_struct *mm)
@@ -801,8 +808,7 @@ static int unmerge_and_remove_all_rmap_i
 			list_del(&mm_slot->mm_list);
 			spin_unlock(&ksm_mmlist_lock);
 
-			free_mm_slot(mm_slot);
-			clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+			free_mm_slot(mm, mm_slot);
 			up_read(&mm->mmap_sem);
 			mmdrop(mm);
 		} else {
@@ -1668,12 +1674,20 @@ next_mm:
 		list_del(&slot->mm_list);
 		spin_unlock(&ksm_mmlist_lock);
 
-		free_mm_slot(slot);
-		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		free_mm_slot(mm, slot);
 		up_read(&mm->mmap_sem);
 		mmdrop(mm);
 	} else {
 		spin_unlock(&ksm_mmlist_lock);
+		/*
+		 * After completing its scan, assume this mm to be inactive,
+		 * but set a flag for context_switch() to notify us as soon
+		 * as it is used again: see ksm_switch().  If the number of
+		 * active_mergeable_mms goes down to zero, ksmd will sleep
+		 * to save power, until awoken by mergeable context_switch().
+		 */
+		if (!test_and_set_bit(MMF_SWITCH_TO_KSM, &mm->flags))
+			atomic_dec(&active_mergeable_mms);
 		up_read(&mm->mmap_sem);
 	}
 
@@ -1707,7 +1721,7 @@ static void ksm_do_scan(unsigned int sca
 
 static int ksmd_should_run(void)
 {
-	return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
+	return (ksm_run & KSM_RUN_MERGE) && atomic_read(&active_mergeable_mms);
 }
 
 static int ksm_scan_thread(void *nothing)
@@ -1785,15 +1799,11 @@ int ksm_madvise(struct vm_area_struct *v
 int __ksm_enter(struct mm_struct *mm)
 {
 	struct mm_slot *mm_slot;
-	int needs_wakeup;
 
 	mm_slot = alloc_mm_slot();
 	if (!mm_slot)
 		return -ENOMEM;
 
-	/* Check ksm_run too?  Would need tighter locking */
-	needs_wakeup = list_empty(&ksm_mm_head.mm_list);
-
 	spin_lock(&ksm_mmlist_lock);
 	insert_to_mm_slots_hash(mm, mm_slot);
 	/*
@@ -1812,10 +1822,9 @@ int __ksm_enter(struct mm_struct *mm)
 		list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
 	spin_unlock(&ksm_mmlist_lock);
 
-	set_bit(MMF_VM_MERGEABLE, &mm->flags);
 	atomic_inc(&mm->mm_count);
-
-	if (needs_wakeup)
+	set_bit(MMF_VM_MERGEABLE, &mm->flags);
+	if (atomic_inc_return(&active_mergeable_mms) == 1)
 		wake_up_interruptible(&ksm_thread_wait);
 
 	return 0;
@@ -1850,8 +1859,7 @@ void __ksm_exit(struct mm_struct *mm)
 	spin_unlock(&ksm_mmlist_lock);
 
 	if (easy_to_free) {
-		free_mm_slot(mm_slot);
-		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		free_mm_slot(mm, mm_slot);
 		mmdrop(mm);
 	} else if (mm_slot) {
 		down_write(&mm->mmap_sem);
@@ -1859,6 +1867,18 @@ void __ksm_exit(struct mm_struct *mm)
 	}
 }
 
+wait_queue_head_t *__ksm_switch(struct mm_struct *mm)
+{
+	/*
+	 * Called by context_switch() to a hitherto inactive mergeable mm:
+	 * scheduler locks forbid immediate wakeup so leave that to caller.
+	 */
+	if (test_and_clear_bit(MMF_SWITCH_TO_KSM, &mm->flags) &&
+	    atomic_inc_return(&active_mergeable_mms) == 1)
+		return &ksm_thread_wait;
+	return NULL;
+}
+
 struct page *ksm_might_need_to_copy(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
@ 2014-09-08  8:25         ` Hugh Dickins
  0 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2014-09-08  8:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Chintan Pandya, akpm, linux-mm, linux-arm-msm,
	linux-kernel, Thomas Gleixner, John Stultz, Ingo Molnar

On Wed, 3 Sep 2014, Peter Zijlstra wrote:
> On Wed, Aug 27, 2014 at 11:02:20PM -0700, Hugh Dickins wrote:
> > On Wed, 20 Aug 2014, Chintan Pandya wrote:
> > 
> > > KSM thread to scan pages is scheduled on definite timeout. That wakes up
> > > CPU from idle state and hence may affect the power consumption. Provide
> > > an optional support to use deferrable timer which suites low-power
> > > use-cases.
> > > 
> > > Typically, on our setup we observed, 10% less power consumption with some
> > > use-cases in which CPU goes to power collapse frequently. For example,
> > > playing audio on Soc which has HW based Audio encoder/decoder, CPU
> > > remains idle for longer duration of time. This idle state will save
> > > significant CPU power consumption if KSM don't wakes them up
> > > periodically.
> > > 
> > > Note that, deferrable timers won't be deferred if any CPU is active and
> > > not in IDLE state.
> > > 
> > > By default, deferrable timers is enabled. To disable deferrable timers,
> > > $ echo 0 > /sys/kernel/mm/ksm/deferrable_timer
> > 
> > I have now experimented.  And, much as I wanted to eliminate the
> > tunable, and just have deferrable timers on, I have come right back
> > to your original position.
> > 
> > I was impressed by how quiet ksmd goes when there's nothing much
> > happening on the machine; but equally, disappointed in how slow
> > it then is to fulfil the outstanding merge work.  I agree with your
> > original assessment, that not everybody will want deferrable timer,
> > the way it is working at present.
> > 
> > I expect that can be fixed, partly by doing more work on wakeup from
> > a deferred timer, according to how long it has been deferred; and
> > partly by not deferring on idle until two passes of the list have been
> > completed.  But that's easier said than done, and might turn out to
> 
> So why not have the timer cancel itself when there is no more work to do
> and start itself up again when there's work added?

Well, yes, but... how do we know when there is no more work to do?
Further down I said:

> > But fixing that might require ksm hooks in hot locations where nobody
> > else would want them: I'm rather hoping we can strike a good enough
> > balance with your deferrable timer, that nobody will need any better.

Thomas has given reason why KSM might simply fail to do its job if we
rely on the deferrable timer.  So I've tried another approach, patch
below; but I do not expect you to jump for joy at the sight of it!

I've tried to minimize the offensive KSM hook in context_switch().
Why place it there, rather than do something near profile_tick() or
account_process_tick()?  Because KSM is aware of mms not tasks, and
context_switch() should have the next mm cachelines hot (if not, a
slight regrouping in mm_struct should do it); whereas I can find
no reference whatever to mm_struct in kernel/time, so hooking to
KSM from there would drag in another few cachelines every tick.

(Another approach would be to set up KSM hint faulting, along the
lines of NUMA hint faulting.  Not a path I'm keen to go down.)

I'm not thrilled with this patch, I think it's somewhat defective
in several ways.  But maybe in practice it will prove good enough,
and if so then I'd rather not waste effort on complicating it.

My own testing is not realistic, nor representative of real KSM users;
and I have no idea what values of pages_to_scan and sleep_millisecs
people really use (and those may make quite a difference to how
well it works).

Chintan, even if the scheduler guys turn out to hate it, please would
you give the patch below a try, to see how well it works in your
environment, whether it seems to go better or worse than your own patch.

If it works well enough for you, maybe we can come up with ideas to
make it more palatable.  I do think your issue is an important one
to fix, one way or another.

Thanks,
Hugh

[PATCH] ksm: avoid periodic wakeup while mergeable mms are quiet

Description yet to be written!

Reported-by: Chintan Pandya <cpandya@codeaurora.org>
Not-Signed-off-by: Hugh Dickins <hughd@google.com>
---

 include/linux/ksm.h   |   14 +++++++++++
 include/linux/sched.h |    1 
 kernel/sched/core.c   |    9 ++++++-
 mm/ksm.c              |   50 ++++++++++++++++++++++++++++------------
 4 files changed, 58 insertions(+), 16 deletions(-)

--- 3.17-rc4/include/linux/ksm.h	2014-03-30 20:40:15.000000000 -0700
+++ linux/include/linux/ksm.h	2014-09-07 11:54:41.528003316 -0700
@@ -12,6 +12,7 @@
 #include <linux/pagemap.h>
 #include <linux/rmap.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 
 struct stable_node;
 struct mem_cgroup;
@@ -21,6 +22,7 @@ int ksm_madvise(struct vm_area_struct *v
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
 void __ksm_exit(struct mm_struct *mm);
+wait_queue_head_t *__ksm_switch(struct mm_struct *mm);
 
 static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
@@ -35,6 +37,13 @@ static inline void ksm_exit(struct mm_st
 		__ksm_exit(mm);
 }
 
+static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)
+{
+	if (unlikely(test_bit(MMF_SWITCH_TO_KSM, &mm->flags)))
+		return __ksm_switch(mm);
+	return NULL;
+}
+
 /*
  * A KSM page is one of those write-protected "shared pages" or "merged pages"
  * which KSM maps into multiple mms, wherever identical anonymous page content
@@ -87,6 +96,11 @@ static inline void ksm_exit(struct mm_st
 {
 }
 
+static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)
+{
+	return NULL;
+}
+
 static inline int PageKsm(struct page *page)
 {
 	return 0;
--- 3.17-rc4/include/linux/sched.h	2014-08-16 16:00:53.909189060 -0700
+++ linux/include/linux/sched.h	2014-09-07 11:54:41.528003316 -0700
@@ -453,6 +453,7 @@ static inline int get_dumpable(struct mm
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
+#define MMF_SWITCH_TO_KSM	21	/* notify KSM of switch to this mm */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
--- 3.17-rc4/kernel/sched/core.c	2014-08-16 16:00:54.062189063 -0700
+++ linux/kernel/sched/core.c	2014-09-07 11:54:41.528003316 -0700
@@ -61,6 +61,7 @@
 #include <linux/times.h>
 #include <linux/tsacct_kern.h>
 #include <linux/kprobes.h>
+#include <linux/ksm.h>
 #include <linux/delayacct.h>
 #include <linux/unistd.h>
 #include <linux/pagemap.h>
@@ -2304,6 +2305,7 @@ context_switch(struct rq *rq, struct tas
 	       struct task_struct *next)
 {
 	struct mm_struct *mm, *oldmm;
+	wait_queue_head_t *wake_ksm = NULL;
 
 	prepare_task_switch(rq, prev, next);
 
@@ -2320,8 +2322,10 @@ context_switch(struct rq *rq, struct tas
 		next->active_mm = oldmm;
 		atomic_inc(&oldmm->mm_count);
 		enter_lazy_tlb(oldmm, next);
-	} else
+	} else {
 		switch_mm(oldmm, mm, next);
+		wake_ksm = ksm_switch(mm);
+	}
 
 	if (!prev->mm) {
 		prev->active_mm = NULL;
@@ -2348,6 +2352,9 @@ context_switch(struct rq *rq, struct tas
 	 * frame will be invalid.
 	 */
 	finish_task_switch(this_rq(), prev);
+
+	if (wake_ksm)
+		wake_up_interruptible(wake_ksm);
 }
 
 /*
--- 3.17-rc4/mm/ksm.c	2014-08-16 16:00:54.132189065 -0700
+++ linux/mm/ksm.c	2014-09-07 11:54:41.528003316 -0700
@@ -205,6 +205,9 @@ static struct kmem_cache *rmap_item_cach
 static struct kmem_cache *stable_node_cache;
 static struct kmem_cache *mm_slot_cache;
 
+/* The number of mergeable mms which have recently run */
+static atomic_t active_mergeable_mms = ATOMIC_INIT(0);
+
 /* The number of nodes in the stable tree */
 static unsigned long ksm_pages_shared;
 
@@ -313,9 +316,13 @@ static inline struct mm_slot *alloc_mm_s
 	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
 }
 
-static inline void free_mm_slot(struct mm_slot *mm_slot)
+static void free_mm_slot(struct mm_struct *mm, struct mm_slot *mm_slot)
 {
 	kmem_cache_free(mm_slot_cache, mm_slot);
+
+	clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+	if (!test_and_clear_bit(MMF_SWITCH_TO_KSM, &mm->flags))
+		atomic_dec(&active_mergeable_mms);
 }
 
 static struct mm_slot *get_mm_slot(struct mm_struct *mm)
@@ -801,8 +808,7 @@ static int unmerge_and_remove_all_rmap_i
 			list_del(&mm_slot->mm_list);
 			spin_unlock(&ksm_mmlist_lock);
 
-			free_mm_slot(mm_slot);
-			clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+			free_mm_slot(mm, mm_slot);
 			up_read(&mm->mmap_sem);
 			mmdrop(mm);
 		} else {
@@ -1668,12 +1674,20 @@ next_mm:
 		list_del(&slot->mm_list);
 		spin_unlock(&ksm_mmlist_lock);
 
-		free_mm_slot(slot);
-		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		free_mm_slot(mm, slot);
 		up_read(&mm->mmap_sem);
 		mmdrop(mm);
 	} else {
 		spin_unlock(&ksm_mmlist_lock);
+		/*
+		 * After completing its scan, assume this mm to be inactive,
+		 * but set a flag for context_switch() to notify us as soon
+		 * as it is used again: see ksm_switch().  If the number of
+		 * active_mergeable_mms goes down to zero, ksmd will sleep
+		 * to save power, until awoken by mergeable context_switch().
+		 */
+		if (!test_and_set_bit(MMF_SWITCH_TO_KSM, &mm->flags))
+			atomic_dec(&active_mergeable_mms);
 		up_read(&mm->mmap_sem);
 	}
 
@@ -1707,7 +1721,7 @@ static void ksm_do_scan(unsigned int sca
 
 static int ksmd_should_run(void)
 {
-	return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
+	return (ksm_run & KSM_RUN_MERGE) && atomic_read(&active_mergeable_mms);
 }
 
 static int ksm_scan_thread(void *nothing)
@@ -1785,15 +1799,11 @@ int ksm_madvise(struct vm_area_struct *v
 int __ksm_enter(struct mm_struct *mm)
 {
 	struct mm_slot *mm_slot;
-	int needs_wakeup;
 
 	mm_slot = alloc_mm_slot();
 	if (!mm_slot)
 		return -ENOMEM;
 
-	/* Check ksm_run too?  Would need tighter locking */
-	needs_wakeup = list_empty(&ksm_mm_head.mm_list);
-
 	spin_lock(&ksm_mmlist_lock);
 	insert_to_mm_slots_hash(mm, mm_slot);
 	/*
@@ -1812,10 +1822,9 @@ int __ksm_enter(struct mm_struct *mm)
 		list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
 	spin_unlock(&ksm_mmlist_lock);
 
-	set_bit(MMF_VM_MERGEABLE, &mm->flags);
 	atomic_inc(&mm->mm_count);
-
-	if (needs_wakeup)
+	set_bit(MMF_VM_MERGEABLE, &mm->flags);
+	if (atomic_inc_return(&active_mergeable_mms) == 1)
 		wake_up_interruptible(&ksm_thread_wait);
 
 	return 0;
@@ -1850,8 +1859,7 @@ void __ksm_exit(struct mm_struct *mm)
 	spin_unlock(&ksm_mmlist_lock);
 
 	if (easy_to_free) {
-		free_mm_slot(mm_slot);
-		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		free_mm_slot(mm, mm_slot);
 		mmdrop(mm);
 	} else if (mm_slot) {
 		down_write(&mm->mmap_sem);
@@ -1859,6 +1867,18 @@ void __ksm_exit(struct mm_struct *mm)
 	}
 }
 
+wait_queue_head_t *__ksm_switch(struct mm_struct *mm)
+{
+	/*
+	 * Called by context_switch() to a hitherto inactive mergeable mm:
+	 * scheduler locks forbid immediate wakeup so leave that to caller.
+	 */
+	if (test_and_clear_bit(MMF_SWITCH_TO_KSM, &mm->flags) &&
+	    atomic_inc_return(&active_mergeable_mms) == 1)
+		return &ksm_thread_wait;
+	return NULL;
+}
+
 struct page *ksm_might_need_to_copy(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
  2014-09-08  8:25         ` Hugh Dickins
  (?)
@ 2014-09-08  9:39         ` Peter Zijlstra
  2014-09-09 14:52             ` Chintan Pandya
  2014-09-09 20:14             ` Hugh Dickins
  -1 siblings, 2 replies; 29+ messages in thread
From: Peter Zijlstra @ 2014-09-08  9:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Chintan Pandya, akpm, linux-mm, linux-arm-msm, linux-kernel,
	Thomas Gleixner, John Stultz, Ingo Molnar, Frederic Weisbecker,
	Paul McKenney

[-- Attachment #1: Type: text/plain, Size: 4221 bytes --]

On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
> Well, yes, but... how do we know when there is no more work to do?

Yeah, I figured that out _after_ I send that email..

> Thomas has given reason why KSM might simply fail to do its job if we
> rely on the deferrable timer.  So I've tried another approach, patch
> below; but I do not expect you to jump for joy at the sight of it!

Indeed :/

> I've tried to minimize the offensive KSM hook in context_switch().
> Why place it there, rather than do something near profile_tick() or
> account_process_tick()?  Because KSM is aware of mms not tasks, and
> context_switch() should have the next mm cachelines hot (if not, a
> slight regrouping in mm_struct should do it); whereas I can find
> no reference whatever to mm_struct in kernel/time, so hooking to
> KSM from there would drag in another few cachelines every tick.
> 
> (Another approach would be to set up KSM hint faulting, along the
> lines of NUMA hint faulting.  Not a path I'm keen to go down.)
> 
> I'm not thrilled with this patch, I think it's somewhat defective
> in several ways.  But maybe in practice it will prove good enough,
> and if so then I'd rather not waste effort on complicating it.
> 
> My own testing is not realistic, nor representative of real KSM users;
> and I have no idea what values of pages_to_scan and sleep_millisecs
> people really use (and those may make quite a difference to how
> well it works).
> 
> Chintan, even if the scheduler guys turn out to hate it, please would
> you give the patch below a try, to see how well it works in your
> environment, whether it seems to go better or worse than your own patch.
> 
> If it works well enough for you, maybe we can come up with ideas to
> make it more palatable.  I do think your issue is an important one
> to fix, one way or another.
> 
> Thanks,
> Hugh
> 
> [PATCH] ksm: avoid periodic wakeup while mergeable mms are quiet
> 
> Description yet to be written!
> 
> Reported-by: Chintan Pandya <cpandya@codeaurora.org>
> Not-Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> 
>  include/linux/ksm.h   |   14 +++++++++++
>  include/linux/sched.h |    1 
>  kernel/sched/core.c   |    9 ++++++-
>  mm/ksm.c              |   50 ++++++++++++++++++++++++++++------------
>  4 files changed, 58 insertions(+), 16 deletions(-)
> 
> --- 3.17-rc4/include/linux/ksm.h	2014-03-30 20:40:15.000000000 -0700
> +++ linux/include/linux/ksm.h	2014-09-07 11:54:41.528003316 -0700

> @@ -87,6 +96,11 @@ static inline void ksm_exit(struct mm_st
>  {
>  }
>  
> +static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)

s/ksm_switch/__&/

> +{
> +	return NULL;
> +}
> +
>  static inline int PageKsm(struct page *page)
>  {
>  	return 0;

> --- 3.17-rc4/kernel/sched/core.c	2014-08-16 16:00:54.062189063 -0700
> +++ linux/kernel/sched/core.c	2014-09-07 11:54:41.528003316 -0700

> @@ -2304,6 +2305,7 @@ context_switch(struct rq *rq, struct tas
>  	       struct task_struct *next)
>  {
>  	struct mm_struct *mm, *oldmm;
> +	wait_queue_head_t *wake_ksm = NULL;
>  
>  	prepare_task_switch(rq, prev, next);
>  
> @@ -2320,8 +2322,10 @@ context_switch(struct rq *rq, struct tas
>  		next->active_mm = oldmm;
>  		atomic_inc(&oldmm->mm_count);
>  		enter_lazy_tlb(oldmm, next);
> -	} else
> +	} else {
>  		switch_mm(oldmm, mm, next);
> +		wake_ksm = ksm_switch(mm);

Is this the right mm? We've just switched the stack, so we're looking at
next->mm when we switched away from current. That might not exist
anymore.

> +	}
>  
>  	if (!prev->mm) {
>  		prev->active_mm = NULL;
> @@ -2348,6 +2352,9 @@ context_switch(struct rq *rq, struct tas
>  	 * frame will be invalid.
>  	 */
>  	finish_task_switch(this_rq(), prev);
> +
> +	if (wake_ksm)
> +		wake_up_interruptible(wake_ksm);
>  }

Quite horrible for sure. I really hate seeing KSM cruft all the way down
here. Can't we create a new (timer) infrastructure that does the right
thing? Surely this isn't the only such case.

I know both RCU and some NOHZ_FULL muck already track when the system is
completely idle. This is yet another case of that.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
  2014-09-08  9:39         ` Peter Zijlstra
@ 2014-09-09 14:52             ` Chintan Pandya
  2014-09-09 20:14             ` Hugh Dickins
  1 sibling, 0 replies; 29+ messages in thread
From: Chintan Pandya @ 2014-09-09 14:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, akpm, linux-mm, linux-arm-msm, linux-kernel,
	Thomas Gleixner, John Stultz, Ingo Molnar, Frederic Weisbecker,
	Paul McKenney

Hello All,

Before its too late to discuss this basic question, allow me to share my 
view on the deferrable timer approach.

I believe KSM at this point is tunable with predictable outcomes. When 
it will get triggered, how many pages it will scan etc. This aggression 
control is with user who can implement any complex logic based on its 
own flashy parameters. Along the same line, I was seeing this deferrable 
timer knob.

Here I was hoping the same level of predictability with this knob. 
Kernel still won't do smart work and user is free to play smart/complex 
with the knob. I believe that there are many use-cases where a single 
strategy of "to make KSM perform better while saving power at the same 
time" may not work. So,


> On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
>> Well, yes, but... how do we know when there is no more work to do?
>
> Yeah, I figured that out _after_ I send that email..
>
>> Thomas has given reason why KSM might simply fail to do its job if we
>> rely on the deferrable timer.

With deferrable timer, KSM thread will be scheduled on the 'active' CPU 
at that very same time. Yes, I understood from Thomas's clarification 
that if that very CPU goes IDLE, KSM task will get deferred even if at 
the timeout, we have some CPUs running. I think, this situation can be 
avoided potentially very small timeout value (?). But in totality, this 
is where KSM will be idle for sure, may be that is unwanted.

>>
>> Chintan, even if the scheduler guys turn out to hate it, please would
>> you give the patch below a try, to see how well it works in your
>> environment, whether it seems to go better or worse than your own patch.
>>
>> If it works well enough for you, maybe we can come up with ideas to
>> make it more palatable.  I do think your issue is an important one
>> to fix, one way or another.

It is taking a little more time for me to grasp your change :) So, after 
Peter's comment, do you want me to try this out or you are looking 
forward for even better idea ? BTW, if deferrable timer patch gets any 
green signal, I will publish new patch with your comments on v4.

>>
>> Thanks,
>> Hugh
>>
>> [PATCH] ksm: avoid periodic wakeup while mergeable mms are quiet
>>
>> Description yet to be written!
>>
>> Reported-by: Chintan Pandya<cpandya@codeaurora.org>
>> Not-Signed-off-by: Hugh Dickins<hughd@google.com>


 >>> So looking at Hughs test results I'm quite sure that the deferrable
 >>> timer is just another tunable bandaid with dubious value and the
 >>> potential of predictable bug/regresssion reports.

Here I am naive in understanding the obvious disadvantages of 'one more 
knob'. And hence was inclined towards deferrable timer knob. Thomas, 
could you explain what kind of bug/regression you foresee with such 
approach ?

-- 
Chintan Pandya

QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
@ 2014-09-09 14:52             ` Chintan Pandya
  0 siblings, 0 replies; 29+ messages in thread
From: Chintan Pandya @ 2014-09-09 14:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, akpm, linux-mm, linux-arm-msm, linux-kernel,
	Thomas Gleixner, John Stultz, Ingo Molnar, Frederic Weisbecker,
	Paul McKenney

Hello All,

Before its too late to discuss this basic question, allow me to share my 
view on the deferrable timer approach.

I believe KSM at this point is tunable with predictable outcomes. When 
it will get triggered, how many pages it will scan etc. This aggression 
control is with user who can implement any complex logic based on its 
own flashy parameters. Along the same line, I was seeing this deferrable 
timer knob.

Here I was hoping the same level of predictability with this knob. 
Kernel still won't do smart work and user is free to play smart/complex 
with the knob. I believe that there are many use-cases where a single 
strategy of "to make KSM perform better while saving power at the same 
time" may not work. So,


> On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
>> Well, yes, but... how do we know when there is no more work to do?
>
> Yeah, I figured that out _after_ I send that email..
>
>> Thomas has given reason why KSM might simply fail to do its job if we
>> rely on the deferrable timer.

With deferrable timer, KSM thread will be scheduled on the 'active' CPU 
at that very same time. Yes, I understood from Thomas's clarification 
that if that very CPU goes IDLE, KSM task will get deferred even if at 
the timeout, we have some CPUs running. I think, this situation can be 
avoided potentially very small timeout value (?). But in totality, this 
is where KSM will be idle for sure, may be that is unwanted.

>>
>> Chintan, even if the scheduler guys turn out to hate it, please would
>> you give the patch below a try, to see how well it works in your
>> environment, whether it seems to go better or worse than your own patch.
>>
>> If it works well enough for you, maybe we can come up with ideas to
>> make it more palatable.  I do think your issue is an important one
>> to fix, one way or another.

It is taking a little more time for me to grasp your change :) So, after 
Peter's comment, do you want me to try this out or you are looking 
forward for even better idea ? BTW, if deferrable timer patch gets any 
green signal, I will publish new patch with your comments on v4.

>>
>> Thanks,
>> Hugh
>>
>> [PATCH] ksm: avoid periodic wakeup while mergeable mms are quiet
>>
>> Description yet to be written!
>>
>> Reported-by: Chintan Pandya<cpandya@codeaurora.org>
>> Not-Signed-off-by: Hugh Dickins<hughd@google.com>


 >>> So looking at Hughs test results I'm quite sure that the deferrable
 >>> timer is just another tunable bandaid with dubious value and the
 >>> potential of predictable bug/regresssion reports.

Here I am naive in understanding the obvious disadvantages of 'one more 
knob'. And hence was inclined towards deferrable timer knob. Thomas, 
could you explain what kind of bug/regression you foresee with such 
approach ?

-- 
Chintan Pandya

QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
  2014-09-08  9:39         ` Peter Zijlstra
@ 2014-09-09 20:14             ` Hugh Dickins
  2014-09-09 20:14             ` Hugh Dickins
  1 sibling, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2014-09-09 20:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Chintan Pandya, akpm, linux-mm, linux-arm-msm,
	linux-kernel, Thomas Gleixner, John Stultz, Ingo Molnar,
	Frederic Weisbecker, Paul McKenney

On Mon, 8 Sep 2014, Peter Zijlstra wrote:
> On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
> > 
> > --- 3.17-rc4/include/linux/ksm.h	2014-03-30 20:40:15.000000000 -0700
> > +++ linux/include/linux/ksm.h	2014-09-07 11:54:41.528003316 -0700
> 
> > @@ -87,6 +96,11 @@ static inline void ksm_exit(struct mm_st
> >  {
> >  }
> >  
> > +static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)
> 
> s/ksm_switch/__&/

I don't think so (and I did try building with CONFIG_KSM off too).

> 
> > +{
> > +	return NULL;
> > +}
> > +
> >  static inline int PageKsm(struct page *page)
> >  {
> >  	return 0;
> 
> > --- 3.17-rc4/kernel/sched/core.c	2014-08-16 16:00:54.062189063 -0700
> > +++ linux/kernel/sched/core.c	2014-09-07 11:54:41.528003316 -0700
> 
> > @@ -2304,6 +2305,7 @@ context_switch(struct rq *rq, struct tas
> >  	       struct task_struct *next)
> >  {
> >  	struct mm_struct *mm, *oldmm;
> > +	wait_queue_head_t *wake_ksm = NULL;
> >  
> >  	prepare_task_switch(rq, prev, next);
> >  
> > @@ -2320,8 +2322,10 @@ context_switch(struct rq *rq, struct tas
> >  		next->active_mm = oldmm;
> >  		atomic_inc(&oldmm->mm_count);
> >  		enter_lazy_tlb(oldmm, next);
> > -	} else
> > +	} else {
> >  		switch_mm(oldmm, mm, next);
> > +		wake_ksm = ksm_switch(mm);
> 
> Is this the right mm?

It's next->mm, that's the one I intended (though the patch might
be equally workable using prev->mm instead: given free rein, I'd
have opted for hooking into both prev and next, but free rein is
definitely not what should be granted around here!).

> We've just switched the stack,

I thought that came in switch_to() a few lines further down,
but don't think it matters for this.

> so we're looing at next->mm when we switched away from current.
> That might not exist anymore.

I fail to see how that can be.  Looking at the x86 switch_mm(),
I can see it referencing (unsurprisingly!) both old and new mms
at this point, and no reference to an mm is dropped before the
ksm_switch().  oldmm (there called mm) is mmdropped later in
finish_task_switch().

> 
> > +	}
> >  
> >  	if (!prev->mm) {
> >  		prev->active_mm = NULL;
> > @@ -2348,6 +2352,9 @@ context_switch(struct rq *rq, struct tas
> >  	 * frame will be invalid.
> >  	 */
> >  	finish_task_switch(this_rq(), prev);
> > +
> > +	if (wake_ksm)
> > +		wake_up_interruptible(wake_ksm);
> >  }
> 
> Quite horrible for sure. I really hate seeing KSM cruft all the way down

Yes, I expected that, and I would certainly feel the same way.

And even worse, imagine if this were successful, we might come along
and ask to do something similar for khugepaged.  Though if it comes to
that, I'm sure we would generalize into one hook which does not say
"ksm" or "khugepaged" on it, but would still a present a single unlikely
flag to be tested at this level.  Maybe you would even prefer the
generalized version, but I don't want to complicate the prototype yet.

If it weren't for the "we already have the mm cachelines here" argument,
I by now feel fairly sure that I would be going for hooking into timer
tick instead (where Thomas could then express his horror!).

Do you think I should just forget about cacheline micro-optimizations
and go in that direction instead?

> here. Can't we create a new (timer) infrastructure that does the right
> thing? Surely this isn't the only such case.

A sleep-walking timer, that goes to sleep in one bed, but may wake in
another; and defers while beds are empty?  I'd be happy to try using
that for KSM if it already existed, and no doubt Chintan would too

But I don't think KSM presents a very good case for developing it.
I think KSM's use of a sleep_millisecs timer is really just an apology
for the amount of often wasted work that it does, and dates from before
we niced it down 5.  I prefer the idea of a KSM which waits on activity
amongst the restricted set of tasks it is tracking: as this patch tries.

But my preference may be naive: doing lots of unnecessary work doesn't
matter as much as waking cpus from deep sleep.

> 
> I know both RCU and some NOHZ_FULL muck already track when the system is
> completely idle. This is yet another case of that.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
@ 2014-09-09 20:14             ` Hugh Dickins
  0 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2014-09-09 20:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Chintan Pandya, akpm, linux-mm, linux-arm-msm,
	linux-kernel, Thomas Gleixner, John Stultz, Ingo Molnar,
	Frederic Weisbecker, Paul McKenney

On Mon, 8 Sep 2014, Peter Zijlstra wrote:
> On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
> > 
> > --- 3.17-rc4/include/linux/ksm.h	2014-03-30 20:40:15.000000000 -0700
> > +++ linux/include/linux/ksm.h	2014-09-07 11:54:41.528003316 -0700
> 
> > @@ -87,6 +96,11 @@ static inline void ksm_exit(struct mm_st
> >  {
> >  }
> >  
> > +static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)
> 
> s/ksm_switch/__&/

I don't think so (and I did try building with CONFIG_KSM off too).

> 
> > +{
> > +	return NULL;
> > +}
> > +
> >  static inline int PageKsm(struct page *page)
> >  {
> >  	return 0;
> 
> > --- 3.17-rc4/kernel/sched/core.c	2014-08-16 16:00:54.062189063 -0700
> > +++ linux/kernel/sched/core.c	2014-09-07 11:54:41.528003316 -0700
> 
> > @@ -2304,6 +2305,7 @@ context_switch(struct rq *rq, struct tas
> >  	       struct task_struct *next)
> >  {
> >  	struct mm_struct *mm, *oldmm;
> > +	wait_queue_head_t *wake_ksm = NULL;
> >  
> >  	prepare_task_switch(rq, prev, next);
> >  
> > @@ -2320,8 +2322,10 @@ context_switch(struct rq *rq, struct tas
> >  		next->active_mm = oldmm;
> >  		atomic_inc(&oldmm->mm_count);
> >  		enter_lazy_tlb(oldmm, next);
> > -	} else
> > +	} else {
> >  		switch_mm(oldmm, mm, next);
> > +		wake_ksm = ksm_switch(mm);
> 
> Is this the right mm?

It's next->mm, that's the one I intended (though the patch might
be equally workable using prev->mm instead: given free rein, I'd
have opted for hooking into both prev and next, but free rein is
definitely not what should be granted around here!).

> We've just switched the stack,

I thought that came in switch_to() a few lines further down,
but don't think it matters for this.

> so we're looing at next->mm when we switched away from current.
> That might not exist anymore.

I fail to see how that can be.  Looking at the x86 switch_mm(),
I can see it referencing (unsurprisingly!) both old and new mms
at this point, and no reference to an mm is dropped before the
ksm_switch().  oldmm (there called mm) is mmdropped later in
finish_task_switch().

> 
> > +	}
> >  
> >  	if (!prev->mm) {
> >  		prev->active_mm = NULL;
> > @@ -2348,6 +2352,9 @@ context_switch(struct rq *rq, struct tas
> >  	 * frame will be invalid.
> >  	 */
> >  	finish_task_switch(this_rq(), prev);
> > +
> > +	if (wake_ksm)
> > +		wake_up_interruptible(wake_ksm);
> >  }
> 
> Quite horrible for sure. I really hate seeing KSM cruft all the way down

Yes, I expected that, and I would certainly feel the same way.

And even worse, imagine if this were successful, we might come along
and ask to do something similar for khugepaged.  Though if it comes to
that, I'm sure we would generalize into one hook which does not say
"ksm" or "khugepaged" on it, but would still a present a single unlikely
flag to be tested at this level.  Maybe you would even prefer the
generalized version, but I don't want to complicate the prototype yet.

If it weren't for the "we already have the mm cachelines here" argument,
I by now feel fairly sure that I would be going for hooking into timer
tick instead (where Thomas could then express his horror!).

Do you think I should just forget about cacheline micro-optimizations
and go in that direction instead?

> here. Can't we create a new (timer) infrastructure that does the right
> thing? Surely this isn't the only such case.

A sleep-walking timer, that goes to sleep in one bed, but may wake in
another; and defers while beds are empty?  I'd be happy to try using
that for KSM if it already existed, and no doubt Chintan would too

But I don't think KSM presents a very good case for developing it.
I think KSM's use of a sleep_millisecs timer is really just an apology
for the amount of often wasted work that it does, and dates from before
we niced it down 5.  I prefer the idea of a KSM which waits on activity
amongst the restricted set of tasks it is tracking: as this patch tries.

But my preference may be naive: doing lots of unnecessary work doesn't
matter as much as waking cpus from deep sleep.

> 
> I know both RCU and some NOHZ_FULL muck already track when the system is
> completely idle. This is yet another case of that.

Hugh

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
  2014-09-09 14:52             ` Chintan Pandya
@ 2014-09-09 20:37               ` Hugh Dickins
  -1 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2014-09-09 20:37 UTC (permalink / raw)
  To: Chintan Pandya
  Cc: Peter Zijlstra, Hugh Dickins, akpm, linux-mm, linux-arm-msm,
	linux-kernel, Thomas Gleixner, John Stultz, Ingo Molnar,
	Frederic Weisbecker, Paul McKenney

On Tue, 9 Sep 2014, Chintan Pandya wrote:
> Hello All,
> 
> Before its too late to discuss this basic question, allow me to share my view
> on the deferrable timer approach.
> 
> I believe KSM at this point is tunable with predictable outcomes. When it
> will get triggered, how many pages it will scan etc. This aggression control
> is with user who can implement any complex logic based on its own flashy
> parameters. Along the same line, I was seeing this deferrable timer knob.
> 
> Here I was hoping the same level of predictability with this knob. Kernel
> still won't do smart work and user is free to play smart/complex with the
> knob. I believe that there are many use-cases where a single strategy of "to
> make KSM perform better while saving power at the same time" may not work.
> So,

Understood.  Whereas seasoned developers grow increasingly sick and
sceptical of such knobs, preferring to make an effort to get things
working well without them.  Both attitudes are valid.

> 
> > On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
> > > Well, yes, but... how do we know when there is no more work to do?
> > 
> > Yeah, I figured that out _after_ I send that email..
> > 
> > > Thomas has given reason why KSM might simply fail to do its job if we
> > > rely on the deferrable timer.
> 
> With deferrable timer, KSM thread will be scheduled on the 'active' CPU at
> that very same time. Yes, I understood from Thomas's clarification that if
> that very CPU goes IDLE, KSM task will get deferred even if at the timeout,
> we have some CPUs running. I think, this situation can be avoided potentially
> very small timeout value (?). But in totality, this is where KSM will be idle
> for sure, may be that is unwanted.

I don't get how a very small timeout value would solve that.  But I am
all for a KSM which works well even with timeout value 0: responsive,
but not power hungry.

> 
> > > 
> > > Chintan, even if the scheduler guys turn out to hate it, please would
> > > you give the patch below a try, to see how well it works in your
> > > environment, whether it seems to go better or worse than your own patch.
> > > 
> > > If it works well enough for you, maybe we can come up with ideas to
> > > make it more palatable.  I do think your issue is an important one
> > > to fix, one way or another.
> 
> It is taking a little more time for me to grasp your change :)

Grasping the intent should be easy: I thought that was in the title,
"ksm: avoid periodic wakeup while mergeable mms are quiet": just go
to sleep until at least one of the MMF_VM_MERGEABLE tasks gets run.

Checking the details, whether I'm accomplishing that intent, yes,
that needs more understanding of ksm.c, which your deferrable timer
approach did not need to get involved with at all (to its credit).

> So, after Peter's comment, do you want me to try this out

Certainly yes, please do.  I'm not saying that's a patch which will
ever go into the kernel itself, but we do want to know whether it does
the job for you or not, whether it's a worthwhile attempt in the right
direction.  Does it save as much as your patch?  Does it save more?

> or you are looking forward for even better idea ?

I'd love a better idea: assuming mine works, is there some other way
of accomplishing it, which does not pollute sched/core.c, but does
not drag in mm_struct cachelines for a frequent flags check?

> BTW, if deferrable timer patch gets any green signal,
> I will publish new patch with your comments on v4.

It seems to be a red light at present: but lights do change ;)

> 
> > > 
> > > Thanks,
> > > Hugh
> > > 
> > > [PATCH] ksm: avoid periodic wakeup while mergeable mms are quiet
> > > 
> > > Description yet to be written!
> > > 
> > > Reported-by: Chintan Pandya<cpandya@codeaurora.org>
> > > Not-Signed-off-by: Hugh Dickins<hughd@google.com>
> 
> 
> >>> So looking at Hughs test results I'm quite sure that the deferrable
> >>> timer is just another tunable bandaid with dubious value and the
> >>> potential of predictable bug/regresssion reports.
> 
> Here I am naive in understanding the obvious disadvantages of 'one more
> knob'. And hence was inclined towards deferrable timer knob. Thomas, could
> you explain what kind of bug/regression you foresee with such approach ?
> 
> -- 
> Chintan Pandya
> 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member of the Code Aurora Forum, hosted by The Linux Foundation

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
@ 2014-09-09 20:37               ` Hugh Dickins
  0 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2014-09-09 20:37 UTC (permalink / raw)
  To: Chintan Pandya
  Cc: Peter Zijlstra, Hugh Dickins, akpm, linux-mm, linux-arm-msm,
	linux-kernel, Thomas Gleixner, John Stultz, Ingo Molnar,
	Frederic Weisbecker, Paul McKenney

On Tue, 9 Sep 2014, Chintan Pandya wrote:
> Hello All,
> 
> Before its too late to discuss this basic question, allow me to share my view
> on the deferrable timer approach.
> 
> I believe KSM at this point is tunable with predictable outcomes. When it
> will get triggered, how many pages it will scan etc. This aggression control
> is with user who can implement any complex logic based on its own flashy
> parameters. Along the same line, I was seeing this deferrable timer knob.
> 
> Here I was hoping the same level of predictability with this knob. Kernel
> still won't do smart work and user is free to play smart/complex with the
> knob. I believe that there are many use-cases where a single strategy of "to
> make KSM perform better while saving power at the same time" may not work.
> So,

Understood.  Whereas seasoned developers grow increasingly sick and
sceptical of such knobs, preferring to make an effort to get things
working well without them.  Both attitudes are valid.

> 
> > On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
> > > Well, yes, but... how do we know when there is no more work to do?
> > 
> > Yeah, I figured that out _after_ I send that email..
> > 
> > > Thomas has given reason why KSM might simply fail to do its job if we
> > > rely on the deferrable timer.
> 
> With deferrable timer, KSM thread will be scheduled on the 'active' CPU at
> that very same time. Yes, I understood from Thomas's clarification that if
> that very CPU goes IDLE, KSM task will get deferred even if at the timeout,
> we have some CPUs running. I think, this situation can be avoided potentially
> very small timeout value (?). But in totality, this is where KSM will be idle
> for sure, may be that is unwanted.

I don't get how a very small timeout value would solve that.  But I am
all for a KSM which works well even with timeout value 0: responsive,
but not power hungry.

> 
> > > 
> > > Chintan, even if the scheduler guys turn out to hate it, please would
> > > you give the patch below a try, to see how well it works in your
> > > environment, whether it seems to go better or worse than your own patch.
> > > 
> > > If it works well enough for you, maybe we can come up with ideas to
> > > make it more palatable.  I do think your issue is an important one
> > > to fix, one way or another.
> 
> It is taking a little more time for me to grasp your change :)

Grasping the intent should be easy: I thought that was in the title,
"ksm: avoid periodic wakeup while mergeable mms are quiet": just go
to sleep until at least one of the MMF_VM_MERGEABLE tasks gets run.

Checking the details, whether I'm accomplishing that intent, yes,
that needs more understanding of ksm.c, which your deferrable timer
approach did not need to get involved with at all (to its credit).

> So, after Peter's comment, do you want me to try this out

Certainly yes, please do.  I'm not saying that's a patch which will
ever go into the kernel itself, but we do want to know whether it does
the job for you or not, whether it's a worthwhile attempt in the right
direction.  Does it save as much as your patch?  Does it save more?

> or you are looking forward for even better idea ?

I'd love a better idea: assuming mine works, is there some other way
of accomplishing it, which does not pollute sched/core.c, but does
not drag in mm_struct cachelines for a frequent flags check?

> BTW, if deferrable timer patch gets any green signal,
> I will publish new patch with your comments on v4.

It seems to be a red light at present: but lights do change ;)

> 
> > > 
> > > Thanks,
> > > Hugh
> > > 
> > > [PATCH] ksm: avoid periodic wakeup while mergeable mms are quiet
> > > 
> > > Description yet to be written!
> > > 
> > > Reported-by: Chintan Pandya<cpandya@codeaurora.org>
> > > Not-Signed-off-by: Hugh Dickins<hughd@google.com>
> 
> 
> >>> So looking at Hughs test results I'm quite sure that the deferrable
> >>> timer is just another tunable bandaid with dubious value and the
> >>> potential of predictable bug/regresssion reports.
> 
> Here I am naive in understanding the obvious disadvantages of 'one more
> knob'. And hence was inclined towards deferrable timer knob. Thomas, could
> you explain what kind of bug/regression you foresee with such approach ?
> 
> -- 
> Chintan Pandya
> 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
  2014-09-09 20:14             ` Hugh Dickins
  (?)
@ 2014-09-10  8:10             ` Peter Zijlstra
  2014-09-11 12:27                 ` Hugh Dickins
  -1 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2014-09-10  8:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Chintan Pandya, akpm, linux-mm, linux-arm-msm, linux-kernel,
	Thomas Gleixner, John Stultz, Ingo Molnar, Frederic Weisbecker,
	Paul McKenney

[-- Attachment #1: Type: text/plain, Size: 1345 bytes --]

On Tue, Sep 09, 2014 at 01:14:50PM -0700, Hugh Dickins wrote:
> On Mon, 8 Sep 2014, Peter Zijlstra wrote:
> > >  		switch_mm(oldmm, mm, next);
> > > +		wake_ksm = ksm_switch(mm);
> > 
> > Is this the right mm?
> 
> It's next->mm, that's the one I intended (though the patch might
> be equally workable using prev->mm instead: given free rein, I'd
> have opted for hooking into both prev and next, but free rein is
> definitely not what should be granted around here!).
> 
> > We've just switched the stack,
> 
> I thought that came in switch_to() a few lines further down,
> but don't think it matters for this.

Ah, yes. Got my task and mm separation messed up.

> > so we're looing at next->mm when we switched away from current.
> > That might not exist anymore.
> 
> I fail to see how that can be.  Looking at the x86 switch_mm(),
> I can see it referencing (unsurprisingly!) both old and new mms
> at this point, and no reference to an mm is dropped before the
> ksm_switch().  oldmm (there called mm) is mmdropped later in
> finish_task_switch().

Well, see the above confusion about switch_mm vs switch_to :-/

So if this were switch_to(), we'd see next->mm as before the last
context switch. And since that switch fully happened, it would also
already have done the finish_task_switch() -> mmdrop().



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
  2014-09-09 20:14             ` Hugh Dickins
  (?)
  (?)
@ 2014-09-10  8:27             ` Peter Zijlstra
  2014-09-11 12:59                 ` Hugh Dickins
  -1 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2014-09-10  8:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Chintan Pandya, akpm, linux-mm, linux-arm-msm, linux-kernel,
	Thomas Gleixner, John Stultz, Ingo Molnar, Frederic Weisbecker,
	Paul McKenney

[-- Attachment #1: Type: text/plain, Size: 2485 bytes --]

On Tue, Sep 09, 2014 at 01:14:50PM -0700, Hugh Dickins wrote:
> > Quite horrible for sure. I really hate seeing KSM cruft all the way down
> 
> Yes, I expected that, and I would certainly feel the same way.
> 
> And even worse, imagine if this were successful, we might come along
> and ask to do something similar for khugepaged.  Though if it comes to
> that, I'm sure we would generalize into one hook which does not say
> "ksm" or "khugepaged" on it, but would still a present a single unlikely
> flag to be tested at this level.  Maybe you would even prefer the
> generalized version, but I don't want to complicate the prototype yet.
> 
> If it weren't for the "we already have the mm cachelines here" argument,
> I by now feel fairly sure that I would be going for hooking into timer
> tick instead (where Thomas could then express his horror!).
> 
> Do you think I should just forget about cacheline micro-optimizations
> and go in that direction instead?

Not really either I'm afraid. Slimming down the tick has been on my todo
list like forever. There's a lot of low hanging fruit there.

> > here. Can't we create a new (timer) infrastructure that does the right
> > thing? Surely this isn't the only such case.
> 
> A sleep-walking timer, that goes to sleep in one bed, but may wake in
> another; and defers while beds are empty?  I'd be happy to try using
> that for KSM if it already existed, and no doubt Chintan would too
> 
> But I don't think KSM presents a very good case for developing it.
> I think KSM's use of a sleep_millisecs timer is really just an apology
> for the amount of often wasted work that it does, and dates from before
> we niced it down 5.  I prefer the idea of a KSM which waits on activity
> amongst the restricted set of tasks it is tracking: as this patch tries.
> 
> But my preference may be naive: doing lots of unnecessary work doesn't
> matter as much as waking cpus from deep sleep.

Ah yes, I see your point. So far the mentioned goal has been to not
unduly wake CPUs and waste energy, but your goal is better still. And
would indeed not be met by a globally deferred timer.

Does it make sense to drive both KSM and khugepage the same way we drive
the numa scanning? It has the benefit of getting rid of these threads,
which pushes the work into the right accountable context (the task its
doing the scanning for) and makes the scanning frequency depend on the
actual task activity.



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
  2014-09-09 20:14             ` Hugh Dickins
@ 2014-09-11  8:01               ` Chintan Pandya
  -1 siblings, 0 replies; 29+ messages in thread
From: Chintan Pandya @ 2014-09-11  8:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, akpm, linux-mm, linux-arm-msm, linux-kernel,
	Thomas Gleixner, John Stultz, Ingo Molnar, Frederic Weisbecker,
	Paul McKenney

I don't mean to divert the thread too much. But just one suggestion 
offered by Harshad.

Why can't we stop invoking more of a KSM scanner thread when we are 
saturating from savings ? But again, to check whether savings are 
saturated or not, we may still want to rely upon timers and we have to 
wake the CPUs up from IDLE state.

>> here. Can't we create a new (timer) infrastructure that does the right
>> thing? Surely this isn't the only such case.
>
> A sleep-walking timer, that goes to sleep in one bed, but may wake in
> another; and defers while beds are empty?  I'd be happy to try using
> that for KSM if it already existed, and no doubt Chintan would too

This is interesting for sure :)

>
> But I don't think KSM presents a very good case for developing it.
> I think KSM's use of a sleep_millisecs timer is really just an apology
> for the amount of often wasted work that it does, and dates from before
> we niced it down 5.  I prefer the idea of a KSM which waits on activity
> amongst the restricted set of tasks it is tracking: as this patch tries.
>
> But my preference may be naive: doing lots of unnecessary work doesn't
> matter as much as waking cpus from deep sleep.

This is exactly the preference we are looking for. But yes, cannot be 
generalized for all.

>
>>
>> I know both RCU and some NOHZ_FULL muck already track when the system is
>> completely idle. This is yet another case of that.
>
> Hugh


-- 
Chintan Pandya

QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
@ 2014-09-11  8:01               ` Chintan Pandya
  0 siblings, 0 replies; 29+ messages in thread
From: Chintan Pandya @ 2014-09-11  8:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, akpm, linux-mm, linux-arm-msm, linux-kernel,
	Thomas Gleixner, John Stultz, Ingo Molnar, Frederic Weisbecker,
	Paul McKenney

I don't mean to divert the thread too much. But just one suggestion 
offered by Harshad.

Why can't we stop invoking more of a KSM scanner thread when we are 
saturating from savings ? But again, to check whether savings are 
saturated or not, we may still want to rely upon timers and we have to 
wake the CPUs up from IDLE state.

>> here. Can't we create a new (timer) infrastructure that does the right
>> thing? Surely this isn't the only such case.
>
> A sleep-walking timer, that goes to sleep in one bed, but may wake in
> another; and defers while beds are empty?  I'd be happy to try using
> that for KSM if it already existed, and no doubt Chintan would too

This is interesting for sure :)

>
> But I don't think KSM presents a very good case for developing it.
> I think KSM's use of a sleep_millisecs timer is really just an apology
> for the amount of often wasted work that it does, and dates from before
> we niced it down 5.  I prefer the idea of a KSM which waits on activity
> amongst the restricted set of tasks it is tracking: as this patch tries.
>
> But my preference may be naive: doing lots of unnecessary work doesn't
> matter as much as waking cpus from deep sleep.

This is exactly the preference we are looking for. But yes, cannot be 
generalized for all.

>
>>
>> I know both RCU and some NOHZ_FULL muck already track when the system is
>> completely idle. This is yet another case of that.
>
> Hugh


-- 
Chintan Pandya

QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
  2014-09-10  8:10             ` Peter Zijlstra
@ 2014-09-11 12:27                 ` Hugh Dickins
  0 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2014-09-11 12:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Chintan Pandya, akpm, linux-mm, linux-arm-msm,
	linux-kernel, Thomas Gleixner, John Stultz, Ingo Molnar,
	Frederic Weisbecker, Paul McKenney

On Wed, 10 Sep 2014, Peter Zijlstra wrote:
> 
> So if this were switch_to(), we'd see next->mm as before the last
> context switch. And since that switch fully happened, it would also
> already have done the finish_task_switch() -> mmdrop().

Useful words of warning, thank you, which jolted me into realizing
an idiot defect in the patch I posted - and more reason why you're
right to keep me away from context_switch() - though it is still a
very handy place for a quick patch to see if going in this direction
(quiescent KSM rather than idle CPU) will be useful to Chintan.

The problem was my wake_ksm variable on the stack of context_switch():
when we emerge from switch_to() on a different stack, what's in wake_ksm
is whatever was placed there, perhaps years ago, by some earlier task.
It may work well enough most of the time (just as the deferrable timer
did despite its defect), but could sometimes delay KSM for years.

Here's a replacement patch, looks like a replacement will be more
useful to Chintan than an incremental.  No need to worry overmuch
about cachelines and placing mm references close together, if we
won't be going forward with a solution in context_switch() anyway.

[PATCH v2] ksm: avoid periodic wakeup while mergeable mms are quiet

Description yet to be written!

Reported-by: Chintan Pandya <cpandya@codeaurora.org>
Not-Signed-off-by: Hugh Dickins <hughd@google.com>
---

 include/linux/ksm.h   |   11 +++++++++
 include/linux/sched.h |    1 
 kernel/sched/core.c   |    5 ++++
 mm/ksm.c              |   48 +++++++++++++++++++++++++++-------------
 4 files changed, 50 insertions(+), 15 deletions(-)

--- 3.17-rc4/include/linux/ksm.h	2014-03-30 20:40:15.000000000 -0700
+++ linux/include/linux/ksm.h	2014-09-11 04:44:57.884104998 -0700
@@ -21,6 +21,7 @@ int ksm_madvise(struct vm_area_struct *v
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
 void __ksm_exit(struct mm_struct *mm);
+void __ksm_switch(struct mm_struct *mm);
 
 static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
@@ -35,6 +36,12 @@ static inline void ksm_exit(struct mm_st
 		__ksm_exit(mm);
 }
 
+static inline void ksm_switch(struct mm_struct *mm)
+{
+	if (unlikely(test_bit(MMF_SWITCH_TO_KSM, &mm->flags)))
+		__ksm_switch(mm);
+}
+
 /*
  * A KSM page is one of those write-protected "shared pages" or "merged pages"
  * which KSM maps into multiple mms, wherever identical anonymous page content
@@ -87,6 +94,10 @@ static inline void ksm_exit(struct mm_st
 {
 }
 
+static inline void ksm_switch(struct mm_struct *mm)
+{
+}
+
 static inline int PageKsm(struct page *page)
 {
 	return 0;
--- 3.17-rc4/include/linux/sched.h	2014-08-16 16:00:53.909189060 -0700
+++ linux/include/linux/sched.h	2014-09-11 04:44:57.888104998 -0700
@@ -453,6 +453,7 @@ static inline int get_dumpable(struct mm
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
+#define MMF_SWITCH_TO_KSM	21	/* notify KSM of switch to this mm */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
--- 3.17-rc4/kernel/sched/core.c	2014-08-16 16:00:54.062189063 -0700
+++ linux/kernel/sched/core.c	2014-09-11 04:44:57.896104998 -0700
@@ -61,6 +61,7 @@
 #include <linux/times.h>
 #include <linux/tsacct_kern.h>
 #include <linux/kprobes.h>
+#include <linux/ksm.h>
 #include <linux/delayacct.h>
 #include <linux/unistd.h>
 #include <linux/pagemap.h>
@@ -2348,6 +2349,10 @@ context_switch(struct rq *rq, struct tas
 	 * frame will be invalid.
 	 */
 	finish_task_switch(this_rq(), prev);
+
+	mm = current->mm;
+	if (mm)
+		ksm_switch(mm);
 }
 
 /*
--- 3.17-rc4/mm/ksm.c	2014-08-16 16:00:54.132189065 -0700
+++ linux/mm/ksm.c	2014-09-11 04:44:57.900104998 -0700
@@ -205,6 +205,9 @@ static struct kmem_cache *rmap_item_cach
 static struct kmem_cache *stable_node_cache;
 static struct kmem_cache *mm_slot_cache;
 
+/* The number of mergeable mms which have recently run */
+static atomic_t active_mergeable_mms = ATOMIC_INIT(0);
+
 /* The number of nodes in the stable tree */
 static unsigned long ksm_pages_shared;
 
@@ -313,9 +316,13 @@ static inline struct mm_slot *alloc_mm_s
 	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
 }
 
-static inline void free_mm_slot(struct mm_slot *mm_slot)
+static void free_mm_slot(struct mm_struct *mm, struct mm_slot *mm_slot)
 {
 	kmem_cache_free(mm_slot_cache, mm_slot);
+
+	clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+	if (!test_and_clear_bit(MMF_SWITCH_TO_KSM, &mm->flags))
+		atomic_dec(&active_mergeable_mms);
 }
 
 static struct mm_slot *get_mm_slot(struct mm_struct *mm)
@@ -801,8 +808,7 @@ static int unmerge_and_remove_all_rmap_i
 			list_del(&mm_slot->mm_list);
 			spin_unlock(&ksm_mmlist_lock);
 
-			free_mm_slot(mm_slot);
-			clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+			free_mm_slot(mm, mm_slot);
 			up_read(&mm->mmap_sem);
 			mmdrop(mm);
 		} else {
@@ -1668,12 +1674,20 @@ next_mm:
 		list_del(&slot->mm_list);
 		spin_unlock(&ksm_mmlist_lock);
 
-		free_mm_slot(slot);
-		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		free_mm_slot(mm, slot);
 		up_read(&mm->mmap_sem);
 		mmdrop(mm);
 	} else {
 		spin_unlock(&ksm_mmlist_lock);
+		/*
+		 * After completing its scan, assume this mm to be inactive,
+		 * but set a flag for context_switch() to notify us as soon
+		 * as it is used again: see ksm_switch().  If the number of
+		 * active_mergeable_mms goes down to zero, ksmd will sleep
+		 * to save power, until awoken by mergeable context_switch().
+		 */
+		if (!test_and_set_bit(MMF_SWITCH_TO_KSM, &mm->flags))
+			atomic_dec(&active_mergeable_mms);
 		up_read(&mm->mmap_sem);
 	}
 
@@ -1707,7 +1721,7 @@ static void ksm_do_scan(unsigned int sca
 
 static int ksmd_should_run(void)
 {
-	return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
+	return (ksm_run & KSM_RUN_MERGE) && atomic_read(&active_mergeable_mms);
 }
 
 static int ksm_scan_thread(void *nothing)
@@ -1785,15 +1799,11 @@ int ksm_madvise(struct vm_area_struct *v
 int __ksm_enter(struct mm_struct *mm)
 {
 	struct mm_slot *mm_slot;
-	int needs_wakeup;
 
 	mm_slot = alloc_mm_slot();
 	if (!mm_slot)
 		return -ENOMEM;
 
-	/* Check ksm_run too?  Would need tighter locking */
-	needs_wakeup = list_empty(&ksm_mm_head.mm_list);
-
 	spin_lock(&ksm_mmlist_lock);
 	insert_to_mm_slots_hash(mm, mm_slot);
 	/*
@@ -1812,10 +1822,9 @@ int __ksm_enter(struct mm_struct *mm)
 		list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
 	spin_unlock(&ksm_mmlist_lock);
 
-	set_bit(MMF_VM_MERGEABLE, &mm->flags);
 	atomic_inc(&mm->mm_count);
-
-	if (needs_wakeup)
+	set_bit(MMF_VM_MERGEABLE, &mm->flags);
+	if (atomic_inc_return(&active_mergeable_mms) == 1)
 		wake_up_interruptible(&ksm_thread_wait);
 
 	return 0;
@@ -1850,8 +1859,7 @@ void __ksm_exit(struct mm_struct *mm)
 	spin_unlock(&ksm_mmlist_lock);
 
 	if (easy_to_free) {
-		free_mm_slot(mm_slot);
-		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		free_mm_slot(mm, mm_slot);
 		mmdrop(mm);
 	} else if (mm_slot) {
 		down_write(&mm->mmap_sem);
@@ -1859,6 +1867,16 @@ void __ksm_exit(struct mm_struct *mm)
 	}
 }
 
+void __ksm_switch(struct mm_struct *mm)
+{
+	/*
+	 * Called by context_switch() to a hitherto inactive mergeable mm.
+	 */
+	if (test_and_clear_bit(MMF_SWITCH_TO_KSM, &mm->flags) &&
+	    atomic_inc_return(&active_mergeable_mms) == 1)
+		wake_up_interruptible(&ksm_thread_wait);
+}
+
 struct page *ksm_might_need_to_copy(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
@ 2014-09-11 12:27                 ` Hugh Dickins
  0 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2014-09-11 12:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Chintan Pandya, akpm, linux-mm, linux-arm-msm,
	linux-kernel, Thomas Gleixner, John Stultz, Ingo Molnar,
	Frederic Weisbecker, Paul McKenney

On Wed, 10 Sep 2014, Peter Zijlstra wrote:
> 
> So if this were switch_to(), we'd see next->mm as before the last
> context switch. And since that switch fully happened, it would also
> already have done the finish_task_switch() -> mmdrop().

Useful words of warning, thank you, which jolted me into realizing
an idiot defect in the patch I posted - and more reason why you're
right to keep me away from context_switch() - though it is still a
very handy place for a quick patch to see if going in this direction
(quiescent KSM rather than idle CPU) will be useful to Chintan.

The problem was my wake_ksm variable on the stack of context_switch():
when we emerge from switch_to() on a different stack, what's in wake_ksm
is whatever was placed there, perhaps years ago, by some earlier task.
It may work well enough most of the time (just as the deferrable timer
did despite its defect), but could sometimes delay KSM for years.

Here's a replacement patch, looks like a replacement will be more
useful to Chintan than an incremental.  No need to worry overmuch
about cachelines and placing mm references close together, if we
won't be going forward with a solution in context_switch() anyway.

[PATCH v2] ksm: avoid periodic wakeup while mergeable mms are quiet

Description yet to be written!

Reported-by: Chintan Pandya <cpandya@codeaurora.org>
Not-Signed-off-by: Hugh Dickins <hughd@google.com>
---

 include/linux/ksm.h   |   11 +++++++++
 include/linux/sched.h |    1 
 kernel/sched/core.c   |    5 ++++
 mm/ksm.c              |   48 +++++++++++++++++++++++++++-------------
 4 files changed, 50 insertions(+), 15 deletions(-)

--- 3.17-rc4/include/linux/ksm.h	2014-03-30 20:40:15.000000000 -0700
+++ linux/include/linux/ksm.h	2014-09-11 04:44:57.884104998 -0700
@@ -21,6 +21,7 @@ int ksm_madvise(struct vm_area_struct *v
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
 void __ksm_exit(struct mm_struct *mm);
+void __ksm_switch(struct mm_struct *mm);
 
 static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
@@ -35,6 +36,12 @@ static inline void ksm_exit(struct mm_st
 		__ksm_exit(mm);
 }
 
+static inline void ksm_switch(struct mm_struct *mm)
+{
+	if (unlikely(test_bit(MMF_SWITCH_TO_KSM, &mm->flags)))
+		__ksm_switch(mm);
+}
+
 /*
  * A KSM page is one of those write-protected "shared pages" or "merged pages"
  * which KSM maps into multiple mms, wherever identical anonymous page content
@@ -87,6 +94,10 @@ static inline void ksm_exit(struct mm_st
 {
 }
 
+static inline void ksm_switch(struct mm_struct *mm)
+{
+}
+
 static inline int PageKsm(struct page *page)
 {
 	return 0;
--- 3.17-rc4/include/linux/sched.h	2014-08-16 16:00:53.909189060 -0700
+++ linux/include/linux/sched.h	2014-09-11 04:44:57.888104998 -0700
@@ -453,6 +453,7 @@ static inline int get_dumpable(struct mm
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
+#define MMF_SWITCH_TO_KSM	21	/* notify KSM of switch to this mm */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
--- 3.17-rc4/kernel/sched/core.c	2014-08-16 16:00:54.062189063 -0700
+++ linux/kernel/sched/core.c	2014-09-11 04:44:57.896104998 -0700
@@ -61,6 +61,7 @@
 #include <linux/times.h>
 #include <linux/tsacct_kern.h>
 #include <linux/kprobes.h>
+#include <linux/ksm.h>
 #include <linux/delayacct.h>
 #include <linux/unistd.h>
 #include <linux/pagemap.h>
@@ -2348,6 +2349,10 @@ context_switch(struct rq *rq, struct tas
 	 * frame will be invalid.
 	 */
 	finish_task_switch(this_rq(), prev);
+
+	mm = current->mm;
+	if (mm)
+		ksm_switch(mm);
 }
 
 /*
--- 3.17-rc4/mm/ksm.c	2014-08-16 16:00:54.132189065 -0700
+++ linux/mm/ksm.c	2014-09-11 04:44:57.900104998 -0700
@@ -205,6 +205,9 @@ static struct kmem_cache *rmap_item_cach
 static struct kmem_cache *stable_node_cache;
 static struct kmem_cache *mm_slot_cache;
 
+/* The number of mergeable mms which have recently run */
+static atomic_t active_mergeable_mms = ATOMIC_INIT(0);
+
 /* The number of nodes in the stable tree */
 static unsigned long ksm_pages_shared;
 
@@ -313,9 +316,13 @@ static inline struct mm_slot *alloc_mm_s
 	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
 }
 
-static inline void free_mm_slot(struct mm_slot *mm_slot)
+static void free_mm_slot(struct mm_struct *mm, struct mm_slot *mm_slot)
 {
 	kmem_cache_free(mm_slot_cache, mm_slot);
+
+	clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+	if (!test_and_clear_bit(MMF_SWITCH_TO_KSM, &mm->flags))
+		atomic_dec(&active_mergeable_mms);
 }
 
 static struct mm_slot *get_mm_slot(struct mm_struct *mm)
@@ -801,8 +808,7 @@ static int unmerge_and_remove_all_rmap_i
 			list_del(&mm_slot->mm_list);
 			spin_unlock(&ksm_mmlist_lock);
 
-			free_mm_slot(mm_slot);
-			clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+			free_mm_slot(mm, mm_slot);
 			up_read(&mm->mmap_sem);
 			mmdrop(mm);
 		} else {
@@ -1668,12 +1674,20 @@ next_mm:
 		list_del(&slot->mm_list);
 		spin_unlock(&ksm_mmlist_lock);
 
-		free_mm_slot(slot);
-		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		free_mm_slot(mm, slot);
 		up_read(&mm->mmap_sem);
 		mmdrop(mm);
 	} else {
 		spin_unlock(&ksm_mmlist_lock);
+		/*
+		 * After completing its scan, assume this mm to be inactive,
+		 * but set a flag for context_switch() to notify us as soon
+		 * as it is used again: see ksm_switch().  If the number of
+		 * active_mergeable_mms goes down to zero, ksmd will sleep
+		 * to save power, until awoken by mergeable context_switch().
+		 */
+		if (!test_and_set_bit(MMF_SWITCH_TO_KSM, &mm->flags))
+			atomic_dec(&active_mergeable_mms);
 		up_read(&mm->mmap_sem);
 	}
 
@@ -1707,7 +1721,7 @@ static void ksm_do_scan(unsigned int sca
 
 static int ksmd_should_run(void)
 {
-	return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
+	return (ksm_run & KSM_RUN_MERGE) && atomic_read(&active_mergeable_mms);
 }
 
 static int ksm_scan_thread(void *nothing)
@@ -1785,15 +1799,11 @@ int ksm_madvise(struct vm_area_struct *v
 int __ksm_enter(struct mm_struct *mm)
 {
 	struct mm_slot *mm_slot;
-	int needs_wakeup;
 
 	mm_slot = alloc_mm_slot();
 	if (!mm_slot)
 		return -ENOMEM;
 
-	/* Check ksm_run too?  Would need tighter locking */
-	needs_wakeup = list_empty(&ksm_mm_head.mm_list);
-
 	spin_lock(&ksm_mmlist_lock);
 	insert_to_mm_slots_hash(mm, mm_slot);
 	/*
@@ -1812,10 +1822,9 @@ int __ksm_enter(struct mm_struct *mm)
 		list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
 	spin_unlock(&ksm_mmlist_lock);
 
-	set_bit(MMF_VM_MERGEABLE, &mm->flags);
 	atomic_inc(&mm->mm_count);
-
-	if (needs_wakeup)
+	set_bit(MMF_VM_MERGEABLE, &mm->flags);
+	if (atomic_inc_return(&active_mergeable_mms) == 1)
 		wake_up_interruptible(&ksm_thread_wait);
 
 	return 0;
@@ -1850,8 +1859,7 @@ void __ksm_exit(struct mm_struct *mm)
 	spin_unlock(&ksm_mmlist_lock);
 
 	if (easy_to_free) {
-		free_mm_slot(mm_slot);
-		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		free_mm_slot(mm, mm_slot);
 		mmdrop(mm);
 	} else if (mm_slot) {
 		down_write(&mm->mmap_sem);
@@ -1859,6 +1867,16 @@ void __ksm_exit(struct mm_struct *mm)
 	}
 }
 
+void __ksm_switch(struct mm_struct *mm)
+{
+	/*
+	 * Called by context_switch() to a hitherto inactive mergeable mm.
+	 */
+	if (test_and_clear_bit(MMF_SWITCH_TO_KSM, &mm->flags) &&
+	    atomic_inc_return(&active_mergeable_mms) == 1)
+		wake_up_interruptible(&ksm_thread_wait);
+}
+
 struct page *ksm_might_need_to_copy(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
  2014-09-10  8:27             ` Peter Zijlstra
@ 2014-09-11 12:59                 ` Hugh Dickins
  0 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2014-09-11 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Chintan Pandya, akpm, linux-mm, linux-arm-msm,
	linux-kernel, Thomas Gleixner, John Stultz, Ingo Molnar,
	Frederic Weisbecker, Paul McKenney

On Wed, 10 Sep 2014, Peter Zijlstra wrote:
> 
> Does it make sense to drive both KSM and khugepage the same way we drive
> the numa scanning? It has the benefit of getting rid of these threads,
> which pushes the work into the right accountable context (the task its
> doing the scanning for) and makes the scanning frequency depend on the
> actual task activity.

I expect it would be possible: but more work than I'd ever find time
to complete myself, with uncertain benefit.

khugepaged would probably be easier to convert, since it is dealing
with independent mms anyway.  Whereas ksmd is establishing sharing
between unrelated mms, so cannot deal with single mms in isolation.

But what's done by a single daemon today, could be passed from task
to task under mutex instead; with probably very different handling
of KSM's "unstable" tree (at present the old one is forgotten at the
start of each cycle, and the new one rebuilt from scratch: I expect
that would have to change, to removing rb entries one by one).

How well it would work out, I'm not confident to say.  And I think
we shall need an answer to the power question sooner than we can
turn the design of KSM on its head.  Vendors will go with what works
for them, never mind what our priniciples dictate.

Your suggestion of following the NUMA scanning did make me wonder
if I could use task_work: if there were already a re-arming task_work,
I could probably use that, and escape your gaze :)  But I don't think
it exists at present, and I don't think it's an extension that would
be welcomed, and I don't think it would present an efficient solution.

The most satisfying solution aesthetically, would be for KSM to write
protect the VM_MERGEABLE areas at some stage (when they "approach
stability", whatever I mean by that), and let itself be woken by the
faults (and if there are no write faults on any of the areas, then
there is no need for it to be awoken).

But I think that all those new faults would pose a very significant
regression in performance.

I don't have a good idea of where else to hook in at present.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
@ 2014-09-11 12:59                 ` Hugh Dickins
  0 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2014-09-11 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Chintan Pandya, akpm, linux-mm, linux-arm-msm,
	linux-kernel, Thomas Gleixner, John Stultz, Ingo Molnar,
	Frederic Weisbecker, Paul McKenney

On Wed, 10 Sep 2014, Peter Zijlstra wrote:
> 
> Does it make sense to drive both KSM and khugepage the same way we drive
> the numa scanning? It has the benefit of getting rid of these threads,
> which pushes the work into the right accountable context (the task its
> doing the scanning for) and makes the scanning frequency depend on the
> actual task activity.

I expect it would be possible: but more work than I'd ever find time
to complete myself, with uncertain benefit.

khugepaged would probably be easier to convert, since it is dealing
with independent mms anyway.  Whereas ksmd is establishing sharing
between unrelated mms, so cannot deal with single mms in isolation.

But what's done by a single daemon today, could be passed from task
to task under mutex instead; with probably very different handling
of KSM's "unstable" tree (at present the old one is forgotten at the
start of each cycle, and the new one rebuilt from scratch: I expect
that would have to change, to removing rb entries one by one).

How well it would work out, I'm not confident to say.  And I think
we shall need an answer to the power question sooner than we can
turn the design of KSM on its head.  Vendors will go with what works
for them, never mind what our priniciples dictate.

Your suggestion of following the NUMA scanning did make me wonder
if I could use task_work: if there were already a re-arming task_work,
I could probably use that, and escape your gaze :)  But I don't think
it exists at present, and I don't think it's an extension that would
be welcomed, and I don't think it would present an efficient solution.

The most satisfying solution aesthetically, would be for KSM to write
protect the VM_MERGEABLE areas at some stage (when they "approach
stability", whatever I mean by that), and let itself be woken by the
faults (and if there are no write faults on any of the areas, then
there is no need for it to be awoken).

But I think that all those new faults would pose a very significant
regression in performance.

I don't have a good idea of where else to hook in at present.

Hugh

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
  2014-09-11  8:01               ` Chintan Pandya
@ 2014-09-11 13:25                 ` Hugh Dickins
  -1 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2014-09-11 13:25 UTC (permalink / raw)
  To: Chintan Pandya
  Cc: Hugh Dickins, Peter Zijlstra, akpm, linux-mm, linux-arm-msm,
	linux-kernel, Thomas Gleixner, John Stultz, Ingo Molnar,
	Frederic Weisbecker, Paul McKenney

On Thu, 11 Sep 2014, Chintan Pandya wrote:

> I don't mean to divert the thread too much. But just one suggestion offered
> by Harshad.
> 
> Why can't we stop invoking more of a KSM scanner thread when we are
> saturating from savings ? But again, to check whether savings are saturated
> or not, we may still want to rely upon timers and we have to wake the CPUs up
> from IDLE state.

I agree that it should make sense for KSM to slow down when it sees it's
making no progress (though that would depart from the pages_to_scan and
sleep_millisecs prescription - perhaps could be tied to sleep_millisecs 0).

But not stop.  That's the problem we're mainly concerned with here:
to save power we need it to stop, but then how to wake up, without
putting nasty hooks in hot paths for a minority interest?
I don't see an answer to that above.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread
@ 2014-09-11 13:25                 ` Hugh Dickins
  0 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2014-09-11 13:25 UTC (permalink / raw)
  To: Chintan Pandya
  Cc: Hugh Dickins, Peter Zijlstra, akpm, linux-mm, linux-arm-msm,
	linux-kernel, Thomas Gleixner, John Stultz, Ingo Molnar,
	Frederic Weisbecker, Paul McKenney

On Thu, 11 Sep 2014, Chintan Pandya wrote:

> I don't mean to divert the thread too much. But just one suggestion offered
> by Harshad.
> 
> Why can't we stop invoking more of a KSM scanner thread when we are
> saturating from savings ? But again, to check whether savings are saturated
> or not, we may still want to rely upon timers and we have to wake the CPUs up
> from IDLE state.

I agree that it should make sense for KSM to slow down when it sees it's
making no progress (though that would depart from the pages_to_scan and
sleep_millisecs prescription - perhaps could be tied to sleep_millisecs 0).

But not stop.  That's the problem we're mainly concerned with here:
to save power we need it to stop, but then how to wake up, without
putting nasty hooks in hot paths for a minority interest?
I don't see an answer to that above.

Hugh

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2014-09-11 13:26 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-20 12:10 [PATCH v4 1/2] timer: provide an api for deferrable timeout Chintan Pandya
2014-08-20 12:10 ` Chintan Pandya
2014-08-20 12:10 ` [PATCH v4 2/2] ksm: provide support to use deferrable timers for scanner thread Chintan Pandya
2014-08-20 12:10   ` Chintan Pandya
2014-08-28  6:02   ` Hugh Dickins
2014-08-28  6:02     ` Hugh Dickins
2014-09-03  9:58     ` Peter Zijlstra
2014-09-03  9:58       ` Peter Zijlstra
2014-09-03 10:32       ` Thomas Gleixner
2014-09-03 10:32         ` Thomas Gleixner
2014-09-08  8:25       ` Hugh Dickins
2014-09-08  8:25         ` Hugh Dickins
2014-09-08  9:39         ` Peter Zijlstra
2014-09-09 14:52           ` Chintan Pandya
2014-09-09 14:52             ` Chintan Pandya
2014-09-09 20:37             ` Hugh Dickins
2014-09-09 20:37               ` Hugh Dickins
2014-09-09 20:14           ` Hugh Dickins
2014-09-09 20:14             ` Hugh Dickins
2014-09-10  8:10             ` Peter Zijlstra
2014-09-11 12:27               ` Hugh Dickins
2014-09-11 12:27                 ` Hugh Dickins
2014-09-10  8:27             ` Peter Zijlstra
2014-09-11 12:59               ` Hugh Dickins
2014-09-11 12:59                 ` Hugh Dickins
2014-09-11  8:01             ` Chintan Pandya
2014-09-11  8:01               ` Chintan Pandya
2014-09-11 13:25               ` Hugh Dickins
2014-09-11 13:25                 ` Hugh Dickins

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.