linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] Convert khugepaged to a task_work function
@ 2014-10-23  2:49 Alex Thorlton
  2014-10-23  2:49 ` [PATCH 1/4] Disable khugepaged thread Alex Thorlton
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Alex Thorlton @ 2014-10-23  2:49 UTC (permalink / raw)
  To: linux-mm, athorlton
  Cc: Andrew Morton, Bob Liu, David Rientjes, Eric W. Biederman,
	Hugh Dickins, Ingo Molnar, Kees Cook, Kirill A. Shutemov,
	Mel Gorman, Oleg Nesterov, Peter Zijlstra, Rik van Riel,
	Thomas Gleixner, Vladimir Davydov, linux-kernel

Hey everyone,

Last week, while discussing possible fixes for some unexpected/unwanted behavior
from khugepaged (see: https://lkml.org/lkml/2014/10/8/515), several people
mentioned changing khugepaged to work as a task_work function instead of a
kernel thread.  This will give us finer-grained control over the page collapse
scans, eliminate some unnecessary scans (since relatively inactive tasks will
not be scanned often), and eliminate the unwanted behavior described in the
email thread I mentioned.
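The throttling idea behind the series can be sketched in plain userspace C:
each task tracks when it last ran a collapse scan, and a new scan is only
queued once the task's scan interval has elapsed.  The struct and function
names below are invented for illustration; the real patches keep this state in
task_struct and queue the scan itself through the kernel's task_work mechanism.

```c
#include <stdbool.h>

/*
 * Illustrative per-task scan throttle.  A task that never runs never
 * reaches this check, which is what eliminates scans of inactive tasks.
 */
struct pgcollapse_state {
	unsigned long last_scan_ms;	/* when the previous scan ran */
	unsigned int scan_sleep_ms;	/* minimum gap between scans  */
};

/*
 * Called (conceptually) when a task returns to userspace: report whether
 * a collapse scan should be queued for it now.
 */
bool should_queue_scan(const struct pgcollapse_state *st, unsigned long now_ms)
{
	return now_ms - st->last_scan_ms >= st->scan_sleep_ms;
}
```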

This initial patch set is fully functional, but there are quite a few areas that
will need to be polished up before it's ready to be considered for a merge.  I
wanted to get this initial version out with some basic test results quickly, so
that people can give their opinions and let me know if there's anything they'd
like to see done differently (and there probably is :).  I'll give details on
the code in the individual patches.

I gathered some pretty rudimentary test data using a 48-thread NAMD simulation
pinned to a cpuset with 8 CPUs and about 60GB of memory.  I'm checking to see if
I'm allowed to publish the input data so that others can replicate the test.  In
the meantime, if somebody knows of a publicly available benchmark that stresses
khugepaged, that would be helpful.

The only data point I gathered was the number of pages collapsed, sampled every
ten seconds, for the lifetime of the job.  This one statistic gives a pretty
decent illustration of the difference in behavior between the two kernels, but I
intend to add some other counters to measure fully completed scans, failed
allocations, and possibly scans skipped due to timer constraints.

The data for the standard kernel (with a very small patch to add the stat
counter that I used to the task_struct) is available here:

http://oss.sgi.com/projects/memtests/pgcollapse/output-khpd

This was a fairly recent kernel (last Tuesday).  Commit ID:
2d65a9f48fcdf7866aab6457bc707ca233e0c791.  I'll send the patches I used for that
kernel as a reply to this message shortly.

The output from the modified kernel is stored here:

http://oss.sgi.com/projects/memtests/pgcollapse/output-pgcollapse

The output is stored in a pretty dumb format (*really* wide).  Best viewed in a
simple text editor with word wrap off, just fyi.

Quick summary of what I found: both kernels performed about the same in terms
of overall runtime; my kernel was 22 seconds faster, with a total runtime of
4:13:07.  That's not a significant difference, but it's important to note that
there was no apparent performance degradation.  The most interesting result is
that my kernel completed the majority of the necessary page collapses for this
job in 2:04, whereas the mainline kernel took 29:05 to get to the same point.

Let me know what you think.  Any suggestions are appreciated!

- Alex

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: linux-kernel@vger.kernel.org

Alex Thorlton (4):
  Disable khugepaged thread
  Add pgcollapse controls to task_struct
  Convert khugepaged scan functions to work with task_work
  Add /proc files to expose per-mm pgcollapse stats

 fs/proc/base.c             |  23 +++++++
 include/linux/khugepaged.h |  10 ++-
 include/linux/sched.h      |  16 +++++
 kernel/fork.c              |   7 ++
 kernel/sched/fair.c        |  18 +++++
 mm/huge_memory.c           | 162 +++++++++++++++------------------------------
 6 files changed, 123 insertions(+), 113 deletions(-)

-- 
1.7.12.4


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/4] Disable khugepaged thread
  2014-10-23  2:49 [PATCH 0/4] Convert khugepaged to a task_work function Alex Thorlton
@ 2014-10-23  2:49 ` Alex Thorlton
  2014-10-23  2:49 ` [PATCH] Add pgcollapse controls to task_struct Alex Thorlton
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Alex Thorlton @ 2014-10-23  2:49 UTC (permalink / raw)
  To: linux-mm, athorlton
  Cc: Andrew Morton, Bob Liu, David Rientjes, Eric W. Biederman,
	Hugh Dickins, Ingo Molnar, Kees Cook, Kirill A. Shutemov,
	Mel Gorman, Oleg Nesterov, Peter Zijlstra, Rik van Riel,
	Thomas Gleixner, Vladimir Davydov, linux-kernel

For now, this patch just removes the calls that start khugepaged.  If we decide
to go forward with this new approach, this patch will also grow to dismantle
the other bits of khugepaged that we'll no longer need.

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: linux-kernel@vger.kernel.org

---
 mm/huge_memory.c | 23 +++--------------------
 1 file changed, 3 insertions(+), 20 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 74c78aa..63abf52 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -295,24 +295,9 @@ static ssize_t enabled_store(struct kobject *kobj,
 			     struct kobj_attribute *attr,
 			     const char *buf, size_t count)
 {
-	ssize_t ret;
-
-	ret = double_flag_store(kobj, attr, buf, count,
-				TRANSPARENT_HUGEPAGE_FLAG,
-				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
-
-	if (ret > 0) {
-		int err;
-
-		mutex_lock(&khugepaged_mutex);
-		err = start_khugepaged();
-		mutex_unlock(&khugepaged_mutex);
-
-		if (err)
-			ret = err;
-	}
-
-	return ret;
+	return double_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_FLAG,
+				 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
 }
 static struct kobj_attribute enabled_attr =
 	__ATTR(enabled, 0644, enabled_show, enabled_store);
@@ -655,8 +640,6 @@ static int __init hugepage_init(void)
 	if (totalram_pages < (512 << (20 - PAGE_SHIFT)))
 		transparent_hugepage_flags = 0;
 
-	start_khugepaged();
-
 	return 0;
 out:
 	hugepage_exit_sysfs(hugepage_kobj);
-- 
1.7.12.4



* [PATCH] Add pgcollapse controls to task_struct
  2014-10-23  2:49 [PATCH 0/4] Convert khugepaged to a task_work function Alex Thorlton
  2014-10-23  2:49 ` [PATCH 1/4] Disable khugepaged thread Alex Thorlton
@ 2014-10-23  2:49 ` Alex Thorlton
  2014-10-23 15:29   ` Alex Thorlton
  2014-10-23  2:49 ` [PATCH 4/4] Add /proc files to expose per-mm pgcollapse stats Alex Thorlton
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Alex Thorlton @ 2014-10-23  2:49 UTC (permalink / raw)
  To: linux-mm, athorlton
  Cc: Andrew Morton, Bob Liu, David Rientjes, Eric W. Biederman,
	Hugh Dickins, Ingo Molnar, Kees Cook, Kirill A. Shutemov,
	Mel Gorman, Oleg Nesterov, Peter Zijlstra, Rik van Riel,
	Thomas Gleixner, Vladimir Davydov, linux-kernel

This patch just adds the necessary bits to the task_struct so that the scans can
eventually be controlled on a per-mm basis.  As I mentioned previously, we might
want to add some more counters here.

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: linux-kernel@vger.kernel.org

---
 include/linux/sched.h | 12 ++++++++++++
 kernel/fork.c         |  6 ++++++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..109be66 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1661,6 +1661,18 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	struct callback_head pgcollapse_work;
+	/* default: scan 8*512 ptes (or vmas) every 10 seconds */
+	unsigned int pgcollapse_pages_to_scan;
+	unsigned int pgcollapse_pages_collapsed;
+	unsigned int pgcollapse_full_scans;
+	unsigned int pgcollapse_scan_sleep_millisecs;
+	/* during fragmentation poll the hugepage allocator once every minute */
+	unsigned int pgcollapse_alloc_sleep_millisecs;
+	unsigned long pgcollapse_last_scan;
+	unsigned long pgcollapse_scan_address;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..8c55309 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1355,6 +1355,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->sequential_io	= 0;
 	p->sequential_io_avg	= 0;
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/* need to pull these values from sysctl or something */
+	p->pgcollapse_pages_to_scan = HPAGE_PMD_NR * 8;
+	p->pgcollapse_scan_sleep_millisecs = 10000;
+	p->pgcollapse_alloc_sleep_millisecs = 60000;
+#endif
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
-- 
1.7.12.4
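For reference, the defaults set in copy_process() above work out to scanning at
most 8 PMD-sized regions' worth of base pages per pass, once every 10 seconds.
A small sketch of that arithmetic, assuming x86_64 geometry (4K base pages, 2M
PMD-sized huge pages, so HPAGE_PMD_NR == 512); the EX_ macros below mirror the
kernel's definitions for illustration only:

```c
/* Assumed x86_64 geometry; these mirror PAGE_SHIFT/HPAGE_PMD_SHIFT. */
#define EX_PAGE_SHIFT      12	/* 4K base pages */
#define EX_HPAGE_PMD_SHIFT 21	/* 2M huge pages */
#define EX_HPAGE_PMD_NR    (1 << (EX_HPAGE_PMD_SHIFT - EX_PAGE_SHIFT))

/* Default pages scanned per pass: 8 * 512 = 4096 base pages (16MB). */
unsigned int default_pages_to_scan(void)
{
	return EX_HPAGE_PMD_NR * 8;
}

/* Worst-case scan throughput in MB/s given the scan interval. */
double scan_mb_per_sec(unsigned int pages_per_pass, unsigned int sleep_ms)
{
	double bytes = (double)pages_per_pass * (1 << EX_PAGE_SHIFT);

	return bytes / (1 << 20) / (sleep_ms / 1000.0);
}
```

With the 10000ms default interval, that's at most 1.6MB of address space
scanned per second per task.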



* [PATCH 4/4] Add /proc files to expose per-mm pgcollapse stats
  2014-10-23  2:49 [PATCH 0/4] Convert khugepaged to a task_work function Alex Thorlton
  2014-10-23  2:49 ` [PATCH 1/4] Disable khugepaged thread Alex Thorlton
  2014-10-23  2:49 ` [PATCH] Add pgcollapse controls to task_struct Alex Thorlton
@ 2014-10-23  2:49 ` Alex Thorlton
  2014-10-23  3:06 ` [PATCH 1/2] Add pgcollapse stat counter to task_struct Alex Thorlton
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Alex Thorlton @ 2014-10-23  2:49 UTC (permalink / raw)
  To: linux-mm, athorlton
  Cc: Andrew Morton, Bob Liu, David Rientjes, Eric W. Biederman,
	Hugh Dickins, Ingo Molnar, Kees Cook, Kirill A. Shutemov,
	Mel Gorman, Oleg Nesterov, Peter Zijlstra, Rik van Riel,
	Thomas Gleixner, Vladimir Davydov, linux-kernel

This patch adds a /proc file to read out the information that we've added to
the task_struct.  I'll eventually need to split the information out into
separate files, probably in a subdirectory, make some of the files writable so
their values can be modified, and add appropriate locking.

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: linux-kernel@vger.kernel.org

---
 fs/proc/base.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 772efa4..7c5aca2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2466,6 +2466,25 @@ static const struct file_operations proc_projid_map_operations = {
 };
 #endif /* CONFIG_USER_NS */
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int proc_pgcollapse_show(struct seq_file *m, struct pid_namespace *ns,
+		     struct pid *pid, struct task_struct *tsk)
+{
+	/* need locks here */
+	seq_printf(m, "pages_to_scan: %u\n", tsk->pgcollapse_pages_to_scan);
+	seq_printf(m, "pages_collapsed: %u\n", tsk->pgcollapse_pages_collapsed);
+	seq_printf(m, "full_scans: %u\n", tsk->pgcollapse_full_scans);
+	seq_printf(m, "scan_sleep_millisecs: %u\n",
+		   tsk->pgcollapse_scan_sleep_millisecs);
+	seq_printf(m, "alloc_sleep_millisecs: %u\n",
+		   tsk->pgcollapse_alloc_sleep_millisecs);
+	seq_printf(m, "last_scan: %lu\n", tsk->pgcollapse_last_scan);
+	seq_printf(m, "scan_address: 0x%0lx\n", tsk->pgcollapse_scan_address);
+
+	return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
 				struct pid *pid, struct task_struct *task)
 {
@@ -2576,6 +2595,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	REG("timers",	  S_IRUGO, proc_timers_operations),
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ONE("pgcollapse", S_IRUGO, proc_pgcollapse_show),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -2914,6 +2936,9 @@ static const struct pid_entry tid_base_stuff[] = {
 	REG("gid_map",    S_IRUGO|S_IWUSR, proc_gid_map_operations),
 	REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ONE("pgcollapse", S_IRUGO, proc_pgcollapse_show),
+#endif
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
-- 
1.7.12.4
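A consumer of the new /proc/<pid>/pgcollapse file would parse the key/value
format that proc_pgcollapse_show() emits ("pages_collapsed: 37\n", etc.).
This helper is illustrative only, not part of the series:

```c
#include <stdio.h>
#include <string.h>

/*
 * Pull one counter out of a buffer read from /proc/<pid>/pgcollapse.
 * Note the strstr() match is naive: a key that is a substring of
 * another key would need a proper line-by-line parse instead.
 *
 * Returns 0 and fills *val on success, -1 if the key isn't found.
 */
int parse_counter(const char *buf, const char *key, unsigned int *val)
{
	const char *p = strstr(buf, key);

	if (!p)
		return -1;
	return sscanf(p + strlen(key), ": %u", val) == 1 ? 0 : -1;
}
```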



* [PATCH 1/2] Add pgcollapse stat counter to task_struct
  2014-10-23  2:49 [PATCH 0/4] Convert khugepaged to a task_work function Alex Thorlton
                   ` (2 preceding siblings ...)
  2014-10-23  2:49 ` [PATCH 4/4] Add /proc files to expose per-mm pgcollapse stats Alex Thorlton
@ 2014-10-23  3:06 ` Alex Thorlton
  2014-10-23  3:06 ` [PATCH 2/2] Add /proc files to expose per-mm pgcollapse stats Alex Thorlton
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Alex Thorlton @ 2014-10-23  3:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Thorlton, Andrew Morton, Bob Liu, David Rientjes,
	Eric W . Biederman, Hugh Dickins, Ingo Molnar, Kees Cook,
	Kirill A . Shutemov, Mel Gorman, Oleg Nesterov, Peter Zijlstra,
	Rik van Riel, Thomas Gleixner, Vladimir Davydov, linux-kernel

Pretty self-explanatory.  This just adds one of the same counters that I used
to gather data for the other patches.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: linux-kernel@vger.kernel.org

---
 include/linux/sched.h | 3 +++
 mm/huge_memory.c      | 1 +
 2 files changed, 4 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..9b87d9a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1661,6 +1661,9 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned int pgcollapse_pages_collapsed;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 74c78aa..ca8a813 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2531,6 +2531,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	*hpage = NULL;
 
+	mm->owner->pgcollapse_pages_collapsed++;
 	khugepaged_pages_collapsed++;
 out_up_write:
 	up_write(&mm->mmap_sem);
-- 
1.7.12.4



* [PATCH 2/2] Add /proc files to expose per-mm pgcollapse stats
  2014-10-23  2:49 [PATCH 0/4] Convert khugepaged to a task_work function Alex Thorlton
                   ` (3 preceding siblings ...)
  2014-10-23  3:06 ` [PATCH 1/2] Add pgcollapse stat counter to task_struct Alex Thorlton
@ 2014-10-23  3:06 ` Alex Thorlton
  2014-10-23 17:55 ` [PATCH 0/4] Convert khugepaged to a task_work function Rik van Riel
  2014-10-28 12:12 ` Andi Kleen
  6 siblings, 0 replies; 21+ messages in thread
From: Alex Thorlton @ 2014-10-23  3:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Thorlton, Andrew Morton, Bob Liu, David Rientjes,
	Eric W . Biederman, Hugh Dickins, Ingo Molnar, Kees Cook,
	Kirill A . Shutemov, Mel Gorman, Oleg Nesterov, Peter Zijlstra,
	Rik van Riel, Thomas Gleixner, Vladimir Davydov, linux-kernel

Just add a proc file to expose the stat counter I added.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: linux-kernel@vger.kernel.org

---
 fs/proc/base.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 772efa4..7517bf4 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2466,6 +2466,16 @@ static const struct file_operations proc_projid_map_operations = {
 };
 #endif /* CONFIG_USER_NS */
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int proc_pgcollapse_show(struct seq_file *m, struct pid_namespace *ns,
+		     struct pid *pid, struct task_struct *tsk)
+{
+	seq_printf(m, "pages_collapsed: %u\n", tsk->pgcollapse_pages_collapsed);
+
+	return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
 				struct pid *pid, struct task_struct *task)
 {
@@ -2576,6 +2586,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	REG("timers",	  S_IRUGO, proc_timers_operations),
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ONE("pgcollapse", S_IRUGO, proc_pgcollapse_show),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -2914,6 +2927,9 @@ static const struct pid_entry tid_base_stuff[] = {
 	REG("gid_map",    S_IRUGO|S_IWUSR, proc_gid_map_operations),
 	REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ONE("pgcollapse", S_IRUGO, proc_pgcollapse_show),
+#endif
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
-- 
1.7.12.4



* Re: [PATCH] Add pgcollapse controls to task_struct
  2014-10-23  2:49 ` [PATCH] Add pgcollapse controls to task_struct Alex Thorlton
@ 2014-10-23 15:29   ` Alex Thorlton
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Thorlton @ 2014-10-23 15:29 UTC (permalink / raw)
  To: linux-mm, athorlton
  Cc: Andrew Morton, Bob Liu, David Rientjes, Eric W. Biederman,
	Hugh Dickins, Ingo Molnar, Kees Cook, Kirill A. Shutemov,
	Mel Gorman, Oleg Nesterov, Peter Zijlstra, Rik van Riel,
	Thomas Gleixner, Vladimir Davydov, linux-kernel

On Wed, Oct 22, 2014 at 09:49:25PM -0500, Alex Thorlton wrote:
> This patch just adds the necessary bits to the task_struct so that the scans can
> eventually be controlled on a per-mm basis.  As I mentioned previously, we might
> want to add some more counters here.

Just noticed that this one didn't get properly numbered when I split
them out.  This should be patch 2/4 for the first set that I sent. Sorry
about that!


* Re: [PATCH 0/4] Convert khugepaged to a task_work function
  2014-10-23  2:49 [PATCH 0/4] Convert khugepaged to a task_work function Alex Thorlton
                   ` (4 preceding siblings ...)
  2014-10-23  3:06 ` [PATCH 2/2] Add /proc files to expose per-mm pgcollapse stats Alex Thorlton
@ 2014-10-23 17:55 ` Rik van Riel
  2014-10-23 18:05   ` Alex Thorlton
  2014-10-28 12:12 ` Andi Kleen
  6 siblings, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2014-10-23 17:55 UTC (permalink / raw)
  To: Alex Thorlton, linux-mm
  Cc: Andrew Morton, Bob Liu, David Rientjes, Eric W. Biederman,
	Hugh Dickins, Ingo Molnar, Kees Cook, Kirill A. Shutemov,
	Mel Gorman, Oleg Nesterov, Peter Zijlstra, Thomas Gleixner,
	Vladimir Davydov, linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/22/2014 10:49 PM, Alex Thorlton wrote:

> Alex Thorlton (4): Disable khugepaged thread Add pgcollapse
> controls to task_struct Convert khugepaged scan functions to work
> with task_work Add /proc files to expose per-mm pgcollapse stats

Is it just me, or did the third patch never show up in other people's
email either?

I don't see it in my inbox, my lkml folder, my linux-mm folder, or
on lkml.org

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUSUEBAAoJEM553pKExN6DougH/0fPp5nZkQ4oF3zrbLO8Hsig
muVu1VU+fy9Hkrp5SeHCXBpZmr9d00A/U4mAIrcDPrdCKWOWfG76BF31Qf4mWKwP
IB+YSHL4X/3LDGylq06xEZ3x+dci24v8Bq+3CLjMFphKcqY7A/R2VFDF82+f25jh
AanF4V6RMRSYoUVYQYbwtyToanSGb4hUKAh6chCXHRNL9m/wNwj5tiItMVY/N852
Zk0NgofzMV9yThnPCXloEr6toRJm1NrQlLYg/q5LHxA4b62NPGlmsod7gzg88svP
Q6I8quUY72sqsvMO5NfJGMxqK11PYEfo41jO2Q3sLfaMiKg0lnSBcr+FogiovGQ=
=QNt6
-----END PGP SIGNATURE-----


* Re: [PATCH 0/4] Convert khugepaged to a task_work function
  2014-10-23 17:55 ` [PATCH 0/4] Convert khugepaged to a task_work function Rik van Riel
@ 2014-10-23 18:05   ` Alex Thorlton
  2014-10-23 18:52     ` Alex Thorlton
  0 siblings, 1 reply; 21+ messages in thread
From: Alex Thorlton @ 2014-10-23 18:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alex Thorlton, linux-mm, Andrew Morton, Bob Liu, David Rientjes,
	Eric W. Biederman, Hugh Dickins, Ingo Molnar, Kees Cook,
	Kirill A. Shutemov, Mel Gorman, Oleg Nesterov, Peter Zijlstra,
	Thomas Gleixner, Vladimir Davydov, linux-kernel

On Thu, Oct 23, 2014 at 01:55:13PM -0400, Rik van Riel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> On 10/22/2014 10:49 PM, Alex Thorlton wrote:
> 
> > Alex Thorlton (4): Disable khugepaged thread Add pgcollapse
> > controls to task_struct Convert khugepaged scan functions to work
> > with task_work Add /proc files to expose per-mm pgcollapse stats
> 
> Is it just me, or did the third patch never show up in other people's
> email either?
> 
> I don't see it in my inbox, my lkml folder, my linux-mm folder, or
> on lkml.org

That's more than likely my fault.  I seem to be a pro at messing up
git send-email :/  Everything showed up properly in my e-mail, but
it does look screwy on lkml.org.  I'll double check everything and do a
resend here shortly.

Sorry about that!

- Alex


* Re: [PATCH 0/4] Convert khugepaged to a task_work function
  2014-10-23 18:05   ` Alex Thorlton
@ 2014-10-23 18:52     ` Alex Thorlton
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Thorlton @ 2014-10-23 18:52 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alex Thorlton, linux-mm, Andrew Morton, Bob Liu, David Rientjes,
	Eric W. Biederman, Hugh Dickins, Ingo Molnar, Kees Cook,
	Kirill A. Shutemov, Mel Gorman, Oleg Nesterov, Peter Zijlstra,
	Thomas Gleixner, Vladimir Davydov, linux-kernel

On Thu, Oct 23, 2014 at 01:05:19PM -0500, Alex Thorlton wrote:
> I'll double check everything and do a resend here shortly.

Resend is out there.  It looks like I got this one right (maybe next
time I'll get it on the first try :).  Thanks for pointing out my error,
Rik!

- Alex


* Re: [PATCH 0/4] Convert khugepaged to a task_work function
  2014-10-23  2:49 [PATCH 0/4] Convert khugepaged to a task_work function Alex Thorlton
                   ` (5 preceding siblings ...)
  2014-10-23 17:55 ` [PATCH 0/4] Convert khugepaged to a task_work function Rik van Riel
@ 2014-10-28 12:12 ` Andi Kleen
  2014-10-28 12:58   ` Rik van Riel
  2014-10-29 21:58   ` Alex Thorlton
  6 siblings, 2 replies; 21+ messages in thread
From: Andi Kleen @ 2014-10-28 12:12 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-mm, Andrew Morton, Bob Liu, David Rientjes,
	Eric W. Biederman, Hugh Dickins, Ingo Molnar, Kees Cook,
	Kirill A. Shutemov, Mel Gorman, Oleg Nesterov, Peter Zijlstra,
	Rik van Riel, Thomas Gleixner, Vladimir Davydov, linux-kernel

Alex Thorlton <athorlton@sgi.com> writes:

> Last week, while discussing possible fixes for some unexpected/unwanted behavior
> from khugepaged (see: https://lkml.org/lkml/2014/10/8/515) several people
> mentioned possibly changing khugepaged to work as a task_work function
> instead of a kernel thread.  This will give us finer grained control over the
> page collapse scans, eliminate some unnecessary scans since tasks that are
> relatively inactive will not be scanned often, and eliminate the unwanted
> behavior described in the email thread I mentioned.

With your change, what would happen in a single threaded case?

Previously one core would scan and another would run the workload.
With your change both scanning and running would be on the same
core.

Would seem like a step backwards to me.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH 0/4] Convert khugepaged to a task_work function
  2014-10-28 12:12 ` Andi Kleen
@ 2014-10-28 12:58   ` Rik van Riel
  2014-10-28 15:39     ` Rik van Riel
  2014-11-10 11:03     ` Mel Gorman
  2014-10-29 21:58   ` Alex Thorlton
  1 sibling, 2 replies; 21+ messages in thread
From: Rik van Riel @ 2014-10-28 12:58 UTC (permalink / raw)
  To: Andi Kleen, Alex Thorlton
  Cc: linux-mm, Andrew Morton, Bob Liu, David Rientjes,
	Eric W. Biederman, Hugh Dickins, Ingo Molnar, Kees Cook,
	Kirill A. Shutemov, Mel Gorman, Oleg Nesterov, Peter Zijlstra,
	Thomas Gleixner, Vladimir Davydov, linux-kernel

On 10/28/2014 08:12 AM, Andi Kleen wrote:
> Alex Thorlton <athorlton@sgi.com> writes:
> 
>> Last week, while discussing possible fixes for some unexpected/unwanted behavior
>> from khugepaged (see: https://lkml.org/lkml/2014/10/8/515) several people
>> mentioned possibly changing khugepaged to work as a task_work function
>> instead of a kernel thread.  This will give us finer grained control over the
>> page collapse scans, eliminate some unnecessary scans since tasks that are
>> relatively inactive will not be scanned often, and eliminate the unwanted
>> behavior described in the email thread I mentioned.
> 
> With your change, what would happen in a single threaded case?
> 
> Previously one core would scan and another would run the workload.
> With your change both scanning and running would be on the same
> core.
> 
> Would seem like a step backwards to me.

It's not just scanning, either.

Memory compaction can spend a lot of time waiting on
locks. Not consuming CPU or anything, but just waiting.

I am not convinced that moving all that waiting to task
context is a good idea.



* Re: [PATCH 0/4] Convert khugepaged to a task_work function
  2014-10-28 12:58   ` Rik van Riel
@ 2014-10-28 15:39     ` Rik van Riel
  2014-10-31 20:27       ` Vlastimil Babka
  2014-11-10 11:03     ` Mel Gorman
  1 sibling, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2014-10-28 15:39 UTC (permalink / raw)
  To: Andi Kleen, Alex Thorlton
  Cc: linux-mm, Andrew Morton, Bob Liu, David Rientjes,
	Eric W. Biederman, Hugh Dickins, Ingo Molnar, Kees Cook,
	Kirill A. Shutemov, Mel Gorman, Oleg Nesterov, Peter Zijlstra,
	Thomas Gleixner, Vladimir Davydov, linux-kernel

On 10/28/2014 08:58 AM, Rik van Riel wrote:
> On 10/28/2014 08:12 AM, Andi Kleen wrote:
>> Alex Thorlton <athorlton@sgi.com> writes:
>>
>>> Last week, while discussing possible fixes for some unexpected/unwanted behavior
>>> from khugepaged (see: https://lkml.org/lkml/2014/10/8/515) several people
>>> mentioned possibly changing khugepaged to work as a task_work function
>>> instead of a kernel thread.  This will give us finer grained control over the
>>> page collapse scans, eliminate some unnecessary scans since tasks that are
>>> relatively inactive will not be scanned often, and eliminate the unwanted
>>> behavior described in the email thread I mentioned.
>>
>> With your change, what would happen in a single threaded case?
>>
>> Previously one core would scan and another would run the workload.
>> With your change both scanning and running would be on the same
>> core.
>>
>> Would seem like a step backwards to me.
>
> It's not just scanning, either.
>
> Memory compaction can spend a lot of time waiting on
> locks. Not consuming CPU or anything, but just waiting.
>
> I am not convinced that moving all that waiting to task
> context is a good idea.

It may be worth investigating how the hugepage code calls
the memory allocation & compaction code.

Doing only async compaction from task_work context should
probably be ok.



* Re: [PATCH 0/4] Convert khugepaged to a task_work function
  2014-10-28 12:12 ` Andi Kleen
  2014-10-28 12:58   ` Rik van Riel
@ 2014-10-29 21:58   ` Alex Thorlton
  2014-10-30  0:23     ` Kirill A. Shutemov
  2014-10-30  8:35     ` Andi Kleen
  1 sibling, 2 replies; 21+ messages in thread
From: Alex Thorlton @ 2014-10-29 21:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alex Thorlton, linux-mm, Andrew Morton, Bob Liu, David Rientjes,
	Eric W. Biederman, Hugh Dickins, Ingo Molnar, Kees Cook,
	Kirill A. Shutemov, Mel Gorman, Oleg Nesterov, Peter Zijlstra,
	Rik van Riel, Thomas Gleixner, Vladimir Davydov, linux-kernel

On Tue, Oct 28, 2014 at 05:12:26AM -0700, Andi Kleen wrote:
> Alex Thorlton <athorlton@sgi.com> writes:
> 
> > Last week, while discussing possible fixes for some unexpected/unwanted behavior
> > from khugepaged (see: https://lkml.org/lkml/2014/10/8/515) several people
> > mentioned possibly changing khugepaged to work as a task_work function
> > instead of a kernel thread.  This will give us finer grained control over the
> > page collapse scans, eliminate some unnecessary scans since tasks that are
> > relatively inactive will not be scanned often, and eliminate the unwanted
> > behavior described in the email thread I mentioned.
> 
> With your change, what would happen in a single threaded case?
> 
> Previously one core would scan and another would run the workload.
> With your change both scanning and running would be on the same
> core.
> 
> Would seem like a step backwards to me.

I suppose from the single-threaded point of view, it could be.  Maybe we
could look at this a bit differently.  What if we allow processes to
choose their collapse mechanism on fork?  That way, the system could
default to using the standard khugepaged mechanism, but we could request
that processes handle collapses themselves if we want.  Overall, I don't
think that would add too much overhead to what I've already proposed
here, and it gives us more flexibility.
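
A minimal form of that could be a per-mm mode flag inherited across fork
(kernel-style pseudocode only, a sketch of the idea above; the flag name,
field, and hook points are all hypothetical, not existing kernel interfaces):

```c
/* Hypothetical per-mm collapse-mode flag, inherited in dup_mm(). */
enum collapse_mode {
	COLLAPSE_KHUGEPAGED,	/* default: background khugepaged thread */
	COLLAPSE_TASK_WORK,	/* opt-in: scan from task_work context */
};

/* Checked wherever a collapse scan would be scheduled; the mode could
 * be flipped via a prctl() or similar per-process interface. */
static bool use_task_work_collapse(struct mm_struct *mm)
{
	return mm->collapse_mode == COLLAPSE_TASK_WORK; /* hypothetical field */
}
```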

Thoughts?

- Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] Convert khugepaged to a task_work function
  2014-10-29 21:58   ` Alex Thorlton
@ 2014-10-30  0:23     ` Kirill A. Shutemov
  2014-10-30  8:35     ` Andi Kleen
  1 sibling, 0 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2014-10-30  0:23 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Andi Kleen, linux-mm, Andrew Morton, Bob Liu, David Rientjes,
	Eric W. Biederman, Hugh Dickins, Ingo Molnar, Kees Cook,
	Kirill A. Shutemov, Mel Gorman, Oleg Nesterov, Peter Zijlstra,
	Rik van Riel, Thomas Gleixner, Vladimir Davydov, linux-kernel

On Wed, Oct 29, 2014 at 04:58:39PM -0500, Alex Thorlton wrote:
> On Tue, Oct 28, 2014 at 05:12:26AM -0700, Andi Kleen wrote:
> > Alex Thorlton <athorlton@sgi.com> writes:
> > 
> > > Last week, while discussing possible fixes for some unexpected/unwanted behavior
> > > from khugepaged (see: https://lkml.org/lkml/2014/10/8/515) several people
> > > mentioned possibly changing khugepaged to work as a task_work function
> > > instead of a kernel thread.  This will give us finer grained control over the
> > > page collapse scans, eliminate some unnecessary scans since tasks that are
> > > relatively inactive will not be scanned often, and eliminate the unwanted
> > > behavior described in the email thread I mentioned.
> > 
> > With your change, what would happen in a single threaded case?
> > 
> > Previously one core would scan and another would run the workload.
> > With your change both scanning and running would be on the same
> > core.
> > 
> > Would seem like a step backwards to me.
> 
> I suppose from the single-threaded point of view, it could be.  Maybe we
> could look at this a bit differently.  What if we allow processes to
> choose their collapse mechanism on fork?

Yet another knob nobody uses? Let's just do it right.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] Convert khugepaged to a task_work function
  2014-10-29 21:58   ` Alex Thorlton
  2014-10-30  0:23     ` Kirill A. Shutemov
@ 2014-10-30  8:35     ` Andi Kleen
  2014-10-30 18:25       ` Alex Thorlton
  1 sibling, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2014-10-30  8:35 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Andi Kleen, linux-mm, Andrew Morton, Bob Liu, David Rientjes,
	Eric W. Biederman, Hugh Dickins, Ingo Molnar, Kees Cook,
	Kirill A. Shutemov, Mel Gorman, Oleg Nesterov, Peter Zijlstra,
	Rik van Riel, Thomas Gleixner, Vladimir Davydov, linux-kernel

> 
> I suppose from the single-threaded point of view, it could be.  Maybe we

It's not only the single-threaded case. Consider the "has to wait a long
time for a lock" problem Rik pointed out. With that, multiple threads are
always better.

> could look at this a bit differently.  What if we allow processes to
> choose their collapse mechanism on fork?  That way, the system could
> default to using the standard khugepaged mechanism, but we could request
> that processes handle collapses themselves if we want.  Overall, I don't
> think that would add too much overhead to what I've already proposed
> here, and it gives us more flexibility.

We already have too many VM tunables. Better would be to switch
automatically somehow.

I guess you could use some kind of work stealing scheduler, but these
are fairly complicated. Maybe some simpler heuristics can be found.

BTW my thinking has been usually to actually use more khugepageds to 
scan large address spaces faster.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] Convert khugepaged to a task_work function
  2014-10-30  8:35     ` Andi Kleen
@ 2014-10-30 18:25       ` Alex Thorlton
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Thorlton @ 2014-10-30 18:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alex Thorlton, linux-mm, Andrew Morton, Bob Liu, David Rientjes,
	Eric W. Biederman, Hugh Dickins, Ingo Molnar, Kees Cook,
	Kirill A. Shutemov, Mel Gorman, Oleg Nesterov, Peter Zijlstra,
	Rik van Riel, Thomas Gleixner, Vladimir Davydov, linux-kernel

On Thu, Oct 30, 2014 at 09:35:44AM +0100, Andi Kleen wrote:
> We already have too many VM tunables. Better would be to switch
> automatically somehow.
> 
> I guess you could use some kind of work stealing scheduler, but these
> are fairly complicated. Maybe some simpler heuristics can be found.

That would be a better option in general, but (admittedly not having
thought about it much) I can't think of a good way to determine when to
make that switch.  The main problem is that we're not really seeing a
negative performance impact from khugepaged, just some undesired
behavior, and that behavior is always present.

Perhaps we could make a decision based on the number of remote
allocations made by khugepaged?  If we see a lot of allocations to
distant nodes, then maybe we tell khugepaged to stop running scans for a
particular process/mm and let the job handle things itself, either using
the task_work style scan that I've proposed, or just banning khugepaged,
period.  Again, I don't think this is a very good way to make the
decision, but something to think about.
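As a rough illustration, the remote-allocation heuristic above boils down
to a ratio check. A minimal userspace sketch of just the decision logic
(not kernel code; the struct, function name, and the one-quarter threshold
are all hypothetical placeholders, not tuned values):

```c
#include <assert.h>

/*
 * Hypothetical per-mm counters: how many of khugepaged's collapse
 * allocations landed on the task's local node vs. a remote one.
 */
struct collapse_stats {
	unsigned long local_allocs;
	unsigned long remote_allocs;
};

/*
 * Return 1 if khugepaged should back off and let the task run its own
 * task_work-based scans: here, when more than a quarter of observed
 * collapse allocations were remote.  The threshold is arbitrary.
 */
int should_defer_to_task(const struct collapse_stats *s)
{
	unsigned long total = s->local_allocs + s->remote_allocs;

	if (total == 0)
		return 0;	/* no data yet: keep the default behavior */
	return s->remote_allocs * 4 > total;
}
```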

> BTW my thinking has been usually to actually use more khugepageds to 
> scan large address spaces faster.

I hadn't thought of it, but I suppose that is an option as well.  Unless
I've completely missed something in the code, I don't think there's a
way to do this now, right?  Either way, I suppose it wouldn't be too hard
to do, but this still leaves the window wide open for allocations to be
made far away from where the process really needs them.  Maybe if we had
a way to spin up a new khugepaged on the fly, so that users can pin it
where they want it, that would work?  Just brainstorming here...

- Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] Convert khugepaged to a task_work function
  2014-10-28 15:39     ` Rik van Riel
@ 2014-10-31 20:27       ` Vlastimil Babka
  2014-11-17 21:34         ` Alex Thorlton
  0 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2014-10-31 20:27 UTC (permalink / raw)
  To: Rik van Riel, Andi Kleen, Alex Thorlton
  Cc: linux-mm, Andrew Morton, Bob Liu, David Rientjes,
	Eric W. Biederman, Hugh Dickins, Ingo Molnar, Kees Cook,
	Kirill A. Shutemov, Mel Gorman, Oleg Nesterov, Peter Zijlstra,
	Thomas Gleixner, Vladimir Davydov, linux-kernel

On 28.10.2014 16:39, Rik van Riel wrote:
> On 10/28/2014 08:58 AM, Rik van Riel wrote:
>> On 10/28/2014 08:12 AM, Andi Kleen wrote:
>>> Alex Thorlton <athorlton@sgi.com> writes:
>>>
>>>> Last week, while discussing possible fixes for some 
>>>> unexpected/unwanted behavior
>>>> from khugepaged (see: https://lkml.org/lkml/2014/10/8/515) several 
>>>> people
>>>> mentioned possibly changing khugepaged to work as a 
>>>> task_work function
>>>> instead of a kernel thread.  This will give us finer grained 
>>>> control over the
>>>> page collapse scans, eliminate some unnecessary scans since tasks 
>>>> that are
>>>> relatively inactive will not be scanned often, and eliminate the 
>>>> unwanted
>>>> behavior described in the email thread I mentioned.
>>>
>>> With your change, what would happen in a single threaded case?
>>>
>>> Previously one core would scan and another would run the workload.
>>> With your change both scanning and running would be on the same
>>> core.
>>>
>>> Would seem like a step backwards to me.
>>
>> It's not just scanning, either.
>>
>> Memory compaction can spend a lot of time waiting on
>> locks. Not consuming CPU or anything, but just waiting.
>>
>> I am not convinced that moving all that waiting to task
>> context is a good idea.
>
> It may be worth investigating how the hugepage code calls
> the memory allocation & compaction code.

It's actually quite stupid, AFAIK: it will scan for collapse candidates,
and only then try to allocate the THP, which may involve compaction.  If
that fails, the scanning time was wasted.

What could help would be to cache one or a few free huge pages per zone,
with cache re-fill done asynchronously, e.g. via work queues.  The cache
could benefit fault-THP allocations as well.  We could add some logic so
that if nobody uses the cached pages and memory is low, we free them.
And importantly, if it's not possible to allocate huge pages for the
cache, then prevent scanning for collapse candidates, as there's no
point.  (Well, this is probably more complex if some nodes can allocate
huge pages and others cannot.)
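
The bookkeeping side of such a cache could be quite small. A userspace
sketch of the cache logic only (names, the cache size, and the
scan-gating rule are all hypothetical; real refill would be driven from
a workqueue and allocate actual huge pages):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-zone cache of pre-allocated huge pages. */
#define HPAGE_CACHE_MAX 4

struct hpage_cache {
	void *pages[HPAGE_CACHE_MAX];
	int nr;			/* currently cached huge pages */
	int refill_failed;	/* last async refill could not allocate */
};

/* Take a cached huge page, or NULL if empty (caller falls back). */
void *hpage_cache_take(struct hpage_cache *c)
{
	if (c->nr == 0)
		return NULL;
	return c->pages[--c->nr];
}

/* Refill would run asynchronously, e.g. from a workqueue item. */
void hpage_cache_refill(struct hpage_cache *c, void *(*alloc_huge)(void))
{
	while (c->nr < HPAGE_CACHE_MAX) {
		void *p = alloc_huge();
		if (!p) {
			c->refill_failed = 1;	/* remember the failure */
			return;
		}
		c->pages[c->nr++] = p;
	}
	c->refill_failed = 0;
}

/* Vlastimil's point: if no huge pages can be had, don't bother scanning. */
int hpage_scan_worthwhile(const struct hpage_cache *c)
{
	return c->nr > 0 || !c->refill_failed;
}
```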

For the scanning itself, I think NUMA balancing does a similar thing in
task_work context already, no?
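For reference, NUMA balancing does queue its scans that way: the scheduler
tick checks whether a scan is due and queues task_numa_work() via
task_work_add(), so the scan runs in the task's own context on return to
userspace. A khugepaged conversion could hook in similarly (kernel-style
pseudocode sketch; `khugepaged_scan_work()`, the `collapse_work` field, and
the `next_collapse_scan` timestamp are hypothetical):

```c
/* Kernel-style pseudocode, modeled loosely on task_tick_numa(). */
void task_tick_collapse(struct task_struct *curr)
{
	struct callback_head *work = &curr->collapse_work; /* hypothetical */

	if (time_before(jiffies, curr->mm->next_collapse_scan)) /* hypothetical */
		return;

	/* Runs khugepaged_scan_work() in task context on return to user. */
	if (!work->func) {
		init_task_work(work, khugepaged_scan_work);
		task_work_add(curr, work, true);
	}
}
```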

> Doing only async compaction from task_work context should
> probably be ok.

I'm afraid that if we give up sync compaction here, then there will be
nothing left to defragment MIGRATE_UNMOVABLE pageblocks.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] Convert khugepaged to a task_work function
  2014-10-28 12:58   ` Rik van Riel
  2014-10-28 15:39     ` Rik van Riel
@ 2014-11-10 11:03     ` Mel Gorman
  2014-11-17 21:16       ` Alex Thorlton
  1 sibling, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2014-11-10 11:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andi Kleen, Alex Thorlton, linux-mm, Andrew Morton, Bob Liu,
	David Rientjes, Eric W. Biederman, Hugh Dickins, Ingo Molnar,
	Kees Cook, Kirill A. Shutemov, Oleg Nesterov, Peter Zijlstra,
	Thomas Gleixner, Vladimir Davydov, linux-kernel

On Tue, Oct 28, 2014 at 08:58:42AM -0400, Rik van Riel wrote:
> On 10/28/2014 08:12 AM, Andi Kleen wrote:
> > Alex Thorlton <athorlton@sgi.com> writes:
> > 
> >> Last week, while discussing possible fixes for some unexpected/unwanted behavior
> >> from khugepaged (see: https://lkml.org/lkml/2014/10/8/515) several people
> >> mentioned possibly changing khugepaged to work as a task_work function
> >> instead of a kernel thread.  This will give us finer grained control over the
> >> page collapse scans, eliminate some unnecessary scans since tasks that are
> >> relatively inactive will not be scanned often, and eliminate the unwanted
> >> behavior described in the email thread I mentioned.
> > 
> > With your change, what would happen in a single threaded case?
> > 
> > Previously one core would scan and another would run the workload.
> > With your change both scanning and running would be on the same
> > core.
> > 
> > Would seem like a step backwards to me.
> 

Only in the single-threaded, one process for the whole system case.
khugepaged can only scan one address space at a time and if processes
fail to allocate a huge page on fault then they must wait until
khugepaged gets to scan them. The wait time is not unbounded, but it
could be considerable.

As pointed out elsewhere, scanning from task-work context allows the
scan rate to adapt due to different inputs -- runtime on CPU probably
being the most relevant. Another scan factor could be NUMA sharing within
THP-boundaries in which case we don't want to either collapse or continue
scanning at the same rate.

> It's not just scanning, either.
> 
> Memory compaction can spend a lot of time waiting on
> locks. Not consuming CPU or anything, but just waiting.
> 

I did not pick apart the implementation closely as it's still RFC but
there is no requirement for the reclaim/compaction to take place from
task_work context. That would likely cause user-visible stalls in any
number of situations and could trigger bug reports.

One possibility would be to try to allocate a THP GFP_ATOMIC from task_work
context and only start the scan if that allocation succeeds. Scan the
address space for a THP to collapse. If a collapse target is found and
the allocated THP is on the correct node then great -- use it. If not,
the first page should be freed and a second GFP_ATOMIC allocation
attempt made.
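
Roughly, in kernel-style pseudocode (a sketch of the scheme above, not an
implementation: `scan_mm_for_collapse()` and `collapse_huge_page_into()`
are hypothetical names, while `alloc_pages_node()`, `page_to_nid()` and
`put_page()` are the real interfaces they would presumably sit on):

```c
/* Speculative-allocation scheme sketched above, as pseudocode. */
static void task_work_collapse(struct mm_struct *mm)
{
	struct page *hpage;
	int node;

	/* GFP_ATOMIC: no reclaim/compaction from task_work context. */
	hpage = alloc_pages_node(numa_node_id(),
				 GFP_ATOMIC | __GFP_COMP, HPAGE_PMD_ORDER);
	if (!hpage)
		return;	/* defer: something else must do the hard work */

	node = scan_mm_for_collapse(mm);	/* hypothetical: target node */
	if (node < 0) {
		put_page(hpage);		/* nothing to collapse */
		return;
	}

	if (page_to_nid(hpage) != node) {
		/* Wrong node: free and make one more atomic attempt. */
		put_page(hpage);
		hpage = alloc_pages_node(node, GFP_ATOMIC | __GFP_COMP,
					 HPAGE_PMD_ORDER);
		if (!hpage)
			return;
	}

	collapse_huge_page_into(mm, node, hpage);	/* hypothetical */
}
```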

If a THP allocation fails then we need something to try to allocate the
page on the process's behalf. khugepaged could be repurposed to do the
reclaim/compaction step or kswapd could be woken up. Either option may
be tricky to get right as currently waking kswapd is avoided to prevent
excessive reclaim. khugepaged could do the work but would need similar
back-off logic in the event of failures. Workqueues could also be used
but I'd worry about controlling the number of active workqueue requests,
and accounting for the reclaim/compaction work is trickier if workqueues
were used.

> I am not convinced that moving all that waiting to task
> context is a good idea.
> 

It allows the scanning of page tables to be parallelised, moves the
work into the task context where it can be accounted for and the scan
rate can be adapted to prevent useless work. I think those are desirable
characteristics although there is no data on the expected gains of doing
something like this. It's the proper deferral of THP allocations that is
likely to cause the most headaches.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] Convert khugepaged to a task_work function
  2014-11-10 11:03     ` Mel Gorman
@ 2014-11-17 21:16       ` Alex Thorlton
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Thorlton @ 2014-11-17 21:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Andi Kleen, Alex Thorlton, linux-mm, Andrew Morton,
	Bob Liu, David Rientjes, Eric W. Biederman, Hugh Dickins,
	Ingo Molnar, Kees Cook, Kirill A. Shutemov, Oleg Nesterov,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, linux-kernel

On Mon, Nov 10, 2014 at 11:03:57AM +0000, Mel Gorman wrote:
> On Tue, Oct 28, 2014 at 08:58:42AM -0400, Rik van Riel wrote:
> > On 10/28/2014 08:12 AM, Andi Kleen wrote:
> > > Alex Thorlton <athorlton@sgi.com> writes:
> > > 
> > >> Last week, while discussing possible fixes for some unexpected/unwanted behavior
> > >> from khugepaged (see: https://lkml.org/lkml/2014/10/8/515) several people
> > >> mentioned possibly changing khugepaged to work as a task_work function
> > >> instead of a kernel thread.  This will give us finer grained control over the
> > >> page collapse scans, eliminate some unnecessary scans since tasks that are
> > >> relatively inactive will not be scanned often, and eliminate the unwanted
> > >> behavior described in the email thread I mentioned.
> > > 
> > > With your change, what would happen in a single threaded case?
> > > 
> > > Previously one core would scan and another would run the workload.
> > > With your change both scanning and running would be on the same
> > > core.
> > > 
> > > Would seem like a step backwards to me.
> > 
> 
> Only in the single-threaded, one process for the whole system case.
> khugepaged can only scan one address space at a time and if processes
> fail to allocate a huge page on fault then they must wait until
> khugepaged gets to scan them. The wait time is not unbounded, but it
> could be considerable.
> 
> As pointed out elsewhere, scanning from task-work context allows the
> scan rate to adapt due to different inputs -- runtime on CPU probably
> being the most relevant. Another scan factor could be NUMA sharing within
> THP-boundaries in which case we don't want to either collapse or continue
> scanning at the same rate.
> 
> > It's not just scanning, either.
> > 
> > Memory compaction can spend a lot of time waiting on
> > locks. Not consuming CPU or anything, but just waiting.
> > 
> 
> I did not pick apart the implementation closely as it's still RFC but
> there is no requirement for the reclaim/compaction to take place from
> task_work context. That would likely cause user-visible stalls in any
> number of situations and could trigger bug reports.

Not much to pick apart, really.  Basically just turns off khugepaged and
calls into the mm scan code from a task work function.  I didn't
consider the problems with possibly hitting compaction code.

> One possibility would be to try to allocate a THP GFP_ATOMIC from task_work
> context and only start the scan if that allocation succeeds. Scan the
> address space for a THP to collapse. If a collapse target is found and
> the allocated THP is on the correct node then great -- use it. If not,
> the first page should be freed and a second GFP_ATOMIC allocation
> attempt made.

I like this idea, especially because it wouldn't be all that complicated
to implement.  I'll take a look at doing something like this.

> If a THP allocation fails then we need something to try to allocate the
> page on the process's behalf. khugepaged could be repurposed to do the
> reclaim/compaction step or kswapd could be woken up.

I think keeping khugepaged around to do the reclaim/compaction stuff
sounds like a good idea as well.  khugepaged could continue to scan in
the same way that it does now, but the only work it would do would be to
handle compaction/reclaim.  I don't know that code path very well, so
I'm not really sure if that makes a lot of sense, but I'll look into it.

> > I am not convinced that moving all that waiting to task
> > context is a good idea.
> > 
> 
> It allows the scanning of page tables to be parallelised, moves the
> work into the task context where it can be accounted for and the scan
> rate can be adapted to prevent useless work. I think those are desirable
> characteristics although there is no data on the expected gains of doing
> something like this. It's the proper deferral of THP allocations that is
> likely to cause the most headaches.

I might have missed something in the code, but isn't it possible to run
into compaction if a high-order allocation fails from task context
already?  It would likely be rare to hit, but I don't see anything that
prevents it.  That being said, the way the scans work in this code,
I'm adding in an extra opportunity to run into compaction code, which
probably isn't desirable.

- Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/4] Convert khugepaged to a task_work function
  2014-10-31 20:27       ` Vlastimil Babka
@ 2014-11-17 21:34         ` Alex Thorlton
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Thorlton @ 2014-11-17 21:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Rik van Riel, Andi Kleen, Alex Thorlton, linux-mm, Andrew Morton,
	Bob Liu, David Rientjes, Eric W. Biederman, Hugh Dickins,
	Ingo Molnar, Kees Cook, Kirill A. Shutemov, Mel Gorman,
	Oleg Nesterov, Peter Zijlstra, Thomas Gleixner, Vladimir Davydov,
	linux-kernel

On Fri, Oct 31, 2014 at 09:27:16PM +0100, Vlastimil Babka wrote:
> What could help would be to cache one or few free huge pages per zone
> with cache re-fill done asynchronously, e.g. via work queues. The
> cache could benefit fault-THP allocations as well. And adding some
> logic that if nobody uses the cached pages and memory is low, then
> free them. And importantly, if it's not possible to allocate huge
> pages for the cache, then prevent scanning for collapse candidates as
> there's no point. (Well this is probably more complex if some nodes
> can allocate huge pages and others not.)

I think this would be a pretty cool addition, even separately from this
effort.  If we keep a page cached on each NUMA node, then we could,
theoretically, really speed up the khugepaged scans (even if we don't
move those scans to task_work), and regular THP faults.  I'll add it to
my ever-growing wish list :)

- Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-11-17 21:33 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-23  2:49 [PATCH 0/4] Convert khugepaged to a task_work function Alex Thorlton
2014-10-23  2:49 ` [PATCH 1/4] Disable khugepaged thread Alex Thorlton
2014-10-23  2:49 ` [PATCH] Add pgcollapse controls to task_struct Alex Thorlton
2014-10-23 15:29   ` Alex Thorlton
2014-10-23  2:49 ` [PATCH 4/4] Add /proc files to expose per-mm pgcollapse stats Alex Thorlton
2014-10-23  3:06 ` [PATCH 1/2] Add pgcollapse stat counter to task_struct Alex Thorlton
2014-10-23  3:06 ` [PATCH 2/2] Add /proc files to expose per-mm pgcollapse stats Alex Thorlton
2014-10-23 17:55 ` [PATCH 0/4] Convert khugepaged to a task_work function Rik van Riel
2014-10-23 18:05   ` Alex Thorlton
2014-10-23 18:52     ` Alex Thorlton
2014-10-28 12:12 ` Andi Kleen
2014-10-28 12:58   ` Rik van Riel
2014-10-28 15:39     ` Rik van Riel
2014-10-31 20:27       ` Vlastimil Babka
2014-11-17 21:34         ` Alex Thorlton
2014-11-10 11:03     ` Mel Gorman
2014-11-17 21:16       ` Alex Thorlton
2014-10-29 21:58   ` Alex Thorlton
2014-10-30  0:23     ` Kirill A. Shutemov
2014-10-30  8:35     ` Andi Kleen
2014-10-30 18:25       ` Alex Thorlton
