* [PATCH 0/3] memdelay: memory health metric for systems and workloads
From: Johannes Weiner @ 2017-07-27 15:30 UTC
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Mel Gorman
  Cc: linux-mm, linux-kernel, kernel-team

This patch series implements a fine-grained metric for memory
health. It builds on top of the refault detection code to quantify the
time lost on VM events that occur exclusively due to a lack of memory and
maps it into a percentage of lost walltime for the system and cgroups.

Rationale

When presented with a Linux system or container executing a workload,
it's hard to judge the health of its memory situation.

The statistics exported by the memory management subsystem can reveal
smoking guns: page reclaim activity, major faults and refaults can be
indicative of an unhealthy memory situation. But they don't actually
quantify the cost a memory shortage imposes on the system or workload.

How bad is it when 2000 pages are refaulting each second? If the data
is stored contiguously on a fast flash drive, it might be okay. If the
data is spread out all over a rotating disk, it could be a problem -
unless the CPUs are still fully utilized, in which case adding memory
wouldn't make things move faster; tasks would just wait for CPU time.

A previous attempt to provide a health signal from the VM was the
vmpressure interface, 70ddf637eebe ("memcg: add memory.pressure_level
events"). This derives its pressure levels from recently observed
reclaim efficiency. As pages are scanned but not reclaimed, the ratio
is translated into levels of low, medium, and critical pressure.

However, the vmpressure scale is too coarse for today's systems. The
accuracy relies on storage being relatively slow compared to how fast
the CPU can go through the LRUs, so that when LRU scan cycles outstrip
IO completion rates the reclaim code runs into pages that are still
reading from disk. But as solid state devices close this speed gap,
and memory sizes are in the hundreds of gigabytes, this effect has
almost completely disappeared. By the time the reclaim scanner runs
into in-flight pages, the tasks in the system already spend a
significant part of their runtime waiting for refaulting pages. The
vmpressure range is compressed into the split second before OOM and
misses large, practically relevant parts of the pressure spectrum.

Knowing the exact time penalty that the kernel's paging activity is
imposing on a workload is a powerful tool. It allows users to fine-tune
a workload to available memory, and also to detect and quantify minute
regressions and improvements in the reclaim and caching algorithms.

Structure

The first patch cleans up the different loadavg callsites and macros
as the memdelay averages are going to be tracked using these.

The second patch adds a distinction between page cache transitions
(inactive list refaults) and page cache thrashing (active list
refaults), since only the latter are unproductive refaults.

The third patch finally adds the memdelay accounting and interface:
its scheduler side identifies productive and unproductive task states,
and the VM side aggregates them into system and cgroup domain states
and calculates moving averages of the time spent in each state.
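
As a rough illustration of such a moving average, the sketch below
applies the fixed-point exponential decay used for load averages to a
per-period sample. It assumes the memdelay averages follow the same
scheme as the load average; calc_avg() and the idea of feeding it a
"fraction of the period spent delayed" are hypothetical choices for this
example. FSHIFT, FIXED_1 and EXP_1 mirror the values in
include/linux/sched/loadavg.h.

```c
#include <assert.h>

#define FSHIFT   11                /* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)   /* 1.0 in fixed-point */
#define EXP_1    1884UL            /* 1/exp(5sec/1min) in fixed-point */

/*
 * Hypothetical helper: fold one sample into a running average.
 * 'sample' is a fixed-point fraction in [0, FIXED_1], e.g. the share
 * of the last period that was lost to memory delays (an assumption).
 */
static unsigned long calc_avg(unsigned long avg, unsigned long exp,
			      unsigned long sample)
{
	assert(exp <= FIXED_1);
	avg *= exp;                        /* decay the old average */
	avg += sample * (FIXED_1 - exp);   /* blend in the new sample */
	return avg >> FSHIFT;              /* back to fixed-point scale */
}
```

Feeding a constant sample of FIXED_1 makes the average climb toward
FIXED_1 over successive periods, which is the behaviour one would expect
from a "percentage of time delayed" signal under sustained pressure.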

 arch/powerpc/platforms/cell/spufs/sched.c |   3 -
 arch/s390/appldata/appldata_os.c          |   4 -
 drivers/cpuidle/governors/menu.c          |   4 -
 fs/proc/array.c                           |   8 +
 fs/proc/base.c                            |   2 +
 fs/proc/internal.h                        |   2 +
 fs/proc/loadavg.c                         |   3 -
 include/linux/cgroup.h                    |  14 ++
 include/linux/memcontrol.h                |  14 ++
 include/linux/memdelay.h                  | 174 +++++++++++++++++
 include/linux/mmzone.h                    |   1 +
 include/linux/page-flags.h                |   5 +-
 include/linux/sched.h                     |  10 +-
 include/linux/sched/loadavg.h             |   3 +
 include/linux/swap.h                      |   2 +-
 include/trace/events/mmflags.h            |   1 +
 kernel/cgroup/cgroup.c                    |   4 +-
 kernel/debug/kdb/kdb_main.c               |   7 +-
 kernel/fork.c                             |   4 +
 kernel/sched/Makefile                     |   2 +-
 kernel/sched/core.c                       |  20 ++
 kernel/sched/memdelay.c                   | 112 +++++++++++
 mm/Makefile                               |   2 +-
 mm/compaction.c                           |   4 +
 mm/filemap.c                              |  18 +-
 mm/huge_memory.c                          |   1 +
 mm/memcontrol.c                           |  25 +++
 mm/memdelay.c                             | 289 ++++++++++++++++++++++++++++
 mm/migrate.c                              |   2 +
 mm/page_alloc.c                           |  11 +-
 mm/swap_state.c                           |   1 +
 mm/vmscan.c                               |  10 +
 mm/vmstat.c                               |   1 +
 mm/workingset.c                           |  98 ++++++----
 34 files changed, 792 insertions(+), 69 deletions(-)

* [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros
From: Johannes Weiner @ 2017-07-27 15:30 UTC
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Mel Gorman
  Cc: linux-mm, linux-kernel, kernel-team

There are several identical definitions of the LOAD_INT and LOAD_FRAC
macros in places that work with fixed-point load averages. Provide a
single shared version in <linux/sched/loadavg.h>.
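
For readers unfamiliar with these macros, here is a minimal userspace
sketch of what they compute. The macro bodies are copied from the
patch; FSHIFT and FIXED_1 mirror the kernel header (the UL suffix is
only there to keep this sketch unsigned), and format_load() is a
hypothetical helper that is not part of the patch.

```c
#include <assert.h>
#include <stdio.h>

#define FSHIFT   11                /* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)   /* 1.0 in fixed-point */

/* Integer part, and fractional part scaled to two decimal digits */
#define LOAD_INT(x)  ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)

/* Hypothetical helper: print a fixed-point load value as "I.FF". */
static int format_load(char *buf, size_t len, unsigned long load)
{
	assert(buf != NULL);
	return snprintf(buf, len, "%lu.%02lu",
			LOAD_INT(load), LOAD_FRAC(load));
}
```

With these definitions, a value of FIXED_1 + FIXED_1 / 2 (1.5 in
fixed-point) splits into an integer part of 1 and a fraction of 50.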

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 arch/powerpc/platforms/cell/spufs/sched.c | 3 ---
 arch/s390/appldata/appldata_os.c          | 4 ----
 drivers/cpuidle/governors/menu.c          | 4 ----
 fs/proc/loadavg.c                         | 3 ---
 include/linux/sched/loadavg.h             | 3 +++
 kernel/debug/kdb/kdb_main.c               | 7 +------
 6 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 1fbb5da17dd2..de544070def3 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -1071,9 +1071,6 @@ void spuctx_switch_state(struct spu_context *ctx,
 	}
 }
 
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static int show_spu_loadavg(struct seq_file *s, void *private)
 {
 	int a, b, c;
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 45b3178200ab..a8aac17e1e82 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -24,10 +24,6 @@
 
 #include "appldata.h"
 
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 /*
  * OS data
  *
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index b2330fd69e34..3d7275ea541d 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -132,10 +132,6 @@ struct menu_device {
 	int		interval_ptr;
 };
 
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static inline int get_loadavg(unsigned long load)
 {
 	return LOAD_INT(load) * 10 + LOAD_FRAC(load) / 10;
diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index 983fce5c2418..111a25e4b088 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -9,9 +9,6 @@
 #include <linux/seqlock.h>
 #include <linux/time.h>
 
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static int loadavg_proc_show(struct seq_file *m, void *v)
 {
 	unsigned long avnrun[3];
diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index 4264bc6b2c27..745483bb5cca 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -26,6 +26,9 @@ extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
 	load += n*(FIXED_1-exp); \
 	load >>= FSHIFT;
 
+#define LOAD_INT(x) ((x) >> FSHIFT)
+#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
+
 extern void calc_global_load(unsigned long ticks);
 
 #endif /* _LINUX_SCHED_LOADAVG_H */
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index c8146d53ca67..2dddd25ccd7a 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -2571,16 +2571,11 @@ static int kdb_summary(int argc, const char **argv)
 	}
 	kdb_printf("%02ld:%02ld\n", val.uptime/(60*60), (val.uptime/60)%60);
 
-	/* lifted from fs/proc/proc_misc.c::loadavg_read_proc() */
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
 	kdb_printf("load avg   %ld.%02ld %ld.%02ld %ld.%02ld\n",
 		LOAD_INT(val.loads[0]), LOAD_FRAC(val.loads[0]),
 		LOAD_INT(val.loads[1]), LOAD_FRAC(val.loads[1]),
 		LOAD_INT(val.loads[2]), LOAD_FRAC(val.loads[2]));
-#undef LOAD_INT
-#undef LOAD_FRAC
+
 	/* Display in kilobytes */
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	kdb_printf("\nMemTotal:       %8lu kB\nMemFree:        %8lu kB\n"
-- 
2.13.3

* [PATCH 2/3] mm: workingset: tell cache transitions from workingset thrashing
From: Johannes Weiner @ 2017-07-27 15:30 UTC
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Mel Gorman
  Cc: linux-mm, linux-kernel, kernel-team

Refaults happen during transitions between workingsets as well as
in-place thrashing. Knowing the difference between the two has a range
of applications, including measuring the impact of memory shortage on
system performance, as well as balancing pressure more intelligently
between the filesystem cache and the swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.
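
The distinction can be sketched in userspace as follows. This is a
simplified model of the decision made in workingset_refault() in this
patch, not the kernel code itself: the enum, the classify_refault()
name and the flat parameter list are hypothetical, and memcg lookup,
statistics and page flag updates are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three possible outcomes */
enum refault_kind {
	REFAULT_IGNORED,     /* distance too large, page could not have stayed */
	REFAULT_TRANSITION,  /* inactive page refaulted: workingset is changing */
	REFAULT_THRASHING,   /* previously active page refaulted */
};

static enum refault_kind classify_refault(unsigned long eviction,
					  unsigned long refault,
					  unsigned long mask,
					  unsigned long active_file,
					  bool was_active)
{
	unsigned long distance;

	assert(eviction <= mask && refault <= mask);

	/* Unsigned subtraction stays correct across counter wraparound */
	distance = (refault - eviction) & mask;

	/* Page would not have fit even with the whole active list */
	if (distance > active_file)
		return REFAULT_IGNORED;

	return was_active ? REFAULT_THRASHING : REFAULT_TRANSITION;
}
```

Here `mask` stands in for EVICTION_MASK and `was_active` for the bit
recovered from the shadow entry.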

Introduce a new page flag that tells on eviction whether the page has
been active or not in its lifetime. This bit is then stored in the
shadow entry, to classify refaults as transitioning or thrashing.
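
To make the shadow entry layout concrete, here is a self-contained
sketch of packing and unpacking with the extra bit. The field order
follows pack_shadow() and unpack_shadow() in this patch, but the widths
and tag values are illustrative stand-ins (the kernel's depend on
configuration), the bucket_order scaling is omitted, and the struct is
a hypothetical convenience for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative widths and tag; assumptions, not the kernel's values */
#define TAG_SHIFT   2     /* low bits reserved for the entry tag */
#define TAG_SHADOW  2UL   /* tag value marking a shadow entry */
#define NODE_BITS   6     /* stand-in for NODES_SHIFT */
#define MEMCG_BITS  16    /* stand-in for MEM_CGROUP_ID_SHIFT */

struct shadow_fields {
	unsigned int memcgid;
	unsigned int nid;
	uint64_t eviction;
	bool workingset;   /* page was active at some point before eviction */
};

static uint64_t pack_shadow(struct shadow_fields f)
{
	uint64_t e = f.eviction;

	assert(f.memcgid < (1u << MEMCG_BITS));
	assert(f.nid < (1u << NODE_BITS));

	e = (e << MEMCG_BITS) | f.memcgid;
	e = (e << NODE_BITS) | f.nid;
	e = (e << 1) | f.workingset;   /* the new bit sits just above the tag */
	e <<= TAG_SHIFT;
	return e | TAG_SHADOW;
}

static struct shadow_fields unpack_shadow(uint64_t entry)
{
	struct shadow_fields f;

	/* Peel the fields off in the reverse order of packing */
	entry >>= TAG_SHIFT;
	f.workingset = entry & 1;
	entry >>= 1;
	f.nid = entry & ((1u << NODE_BITS) - 1);
	entry >>= NODE_BITS;
	f.memcgid = entry & ((1u << MEMCG_BITS) - 1);
	entry >>= MEMCG_BITS;
	f.eviction = entry;
	return f;
}
```

Since the new bit takes one position away from the eviction counter,
EVICTION_SHIFT grows by one in the patch.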

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h         |  1 +
 include/linux/page-flags.h     |  5 ++-
 include/linux/swap.h           |  2 +-
 include/trace/events/mmflags.h |  1 +
 mm/filemap.c                   |  9 ++--
 mm/huge_memory.c               |  1 +
 mm/migrate.c                   |  2 +
 mm/swap_state.c                |  1 +
 mm/vmscan.c                    |  1 +
 mm/vmstat.c                    |  1 +
 mm/workingset.c                | 98 +++++++++++++++++++++++++++---------------
 11 files changed, 79 insertions(+), 43 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ef6a13b7bd3e..f33ad8d411e1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -156,6 +156,7 @@ enum node_stat_item {
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
+	WORKINGSET_RESTORE,
 	WORKINGSET_NODERECLAIM,
 	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6b5818d6de32..4d1e557d1f8c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -73,13 +73,14 @@
  */
 enum pageflags {
 	PG_locked,		/* Page is locked. Don't touch. */
-	PG_error,
 	PG_referenced,
 	PG_uptodate,
 	PG_dirty,
 	PG_lru,
 	PG_active,
+	PG_workingset,
 	PG_waiters,		/* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
+	PG_error,
 	PG_slab,
 	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
 	PG_arch_1,
@@ -272,6 +273,8 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
+PAGEFLAG(Workingset, workingset, PF_HEAD)
+	TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)	   /* Used by some filesystems */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ba5882419a7d..dc18a2b0b8aa 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -252,7 +252,7 @@ struct swap_info_struct {
 
 /* linux/mm/workingset.c */
 void *workingset_eviction(struct address_space *mapping, struct page *page);
-bool workingset_refault(void *shadow);
+void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 void workingset_update_node(struct radix_tree_node *node, void *private);
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 304ff94363b2..d7916f1f8240 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -90,6 +90,7 @@
 	{1UL << PG_dirty,		"dirty"		},		\
 	{1UL << PG_lru,			"lru"		},		\
 	{1UL << PG_active,		"active"	},		\
+	{1UL << PG_workingset,		"workingset"	},		\
 	{1UL << PG_slab,		"slab"		},		\
 	{1UL << PG_owner_priv_1,	"owner_priv_1"	},		\
 	{1UL << PG_arch_1,		"arch_1"	},		\
diff --git a/mm/filemap.c b/mm/filemap.c
index 6f1be573a5e6..5c592e925805 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -697,12 +697,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 		 * data from the working set, only to cache data that will
 		 * get overwritten with something else, is a waste of memory.
 		 */
-		if (!(gfp_mask & __GFP_WRITE) &&
-		    shadow && workingset_refault(shadow)) {
-			SetPageActive(page);
-			workingset_activation(page);
-		} else
-			ClearPageActive(page);
+		WARN_ON_ONCE(PageActive(page));
+		if (!(gfp_mask & __GFP_WRITE) && shadow)
+			workingset_refault(page, shadow);
 		lru_cache_add(page);
 	}
 	return ret;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 88c6167f194d..5f049628cf05 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2217,6 +2217,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
 			 (1L << PG_active) |
+			 (1L << PG_workingset) |
 			 (1L << PG_locked) |
 			 (1L << PG_unevictable) |
 			 (1L << PG_dirty)));
diff --git a/mm/migrate.c b/mm/migrate.c
index 89a0a1707f4c..bd3f13fef8ca 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -622,6 +622,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 		SetPageActive(newpage);
 	} else if (TestClearPageUnevictable(page))
 		SetPageUnevictable(newpage);
+	if (PageWorkingset(page))
+		SetPageWorkingset(newpage);
 	if (PageChecked(page))
 		SetPageChecked(newpage);
 	if (PageMappedToDisk(page))
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 539b8885e3d1..6ae807135887 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -379,6 +379,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			/*
 			 * Initiate read into locked page and return.
 			 */
+			SetPageWorkingset(new_page);
 			lru_cache_add_anon(new_page);
 			*new_page_allocated = true;
 			return new_page;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8ad39bbc79e6..285db147d013 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1978,6 +1978,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		}
 
 		ClearPageActive(page);	/* we are de-activating */
+		SetPageWorkingset(page);
 		list_add(&page->lru, &l_inactive);
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 76f73670200a..8cd81d40d97b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -956,6 +956,7 @@ const char * const vmstat_text[] = {
 	"nr_isolated_file",
 	"workingset_refault",
 	"workingset_activate",
+	"workingset_restore",
 	"workingset_nodereclaim",
 	"nr_anon_pages",
 	"nr_mapped",
diff --git a/mm/workingset.c b/mm/workingset.c
index b8c9ab678479..2192e52e7957 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -120,7 +120,7 @@
  * the only thing eating into inactive list space is active pages.
  *
  *
- *		Activating refaulting pages
+ *		Refaulting inactive pages
  *
  * All that is known about the active list is that the pages have been
  * accessed more than once in the past.  This means that at any given
@@ -133,6 +133,10 @@
  * used less frequently than the refaulting page - or even not used at
  * all anymore.
  *
+ * That means if inactive cache is refaulting with a suitable refault
+ * distance, we assume the cache workingset is transitioning and put
+ * pressure on the current active list.
+ *
  * If this is wrong and demotion kicks in, the pages which are truly
  * used more frequently will be reactivated while the less frequently
  * used once will be evicted from memory.
@@ -140,6 +144,14 @@
  * But if this is right, the stale pages will be pushed out of memory
  * and the used pages get to stay in cache.
  *
+ *		Refaulting active pages
+ *
+ * If on the other hand the refaulting pages have recently been
+ * deactivated, it means that the active list is no longer protecting
+ * actively used cache from reclaim. The cache is NOT transitioning to
+ * a different workingset; the existing workingset is thrashing in the
+ * space allocated to the page cache.
+ *
  *
  *		Implementation
  *
@@ -155,8 +167,7 @@
  */
 
 #define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
-			 NODES_SHIFT +	\
-			 MEM_CGROUP_ID_SHIFT)
+			 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -169,23 +180,28 @@
  */
 static unsigned int bucket_order __read_mostly;
 
-static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
+static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
+			 bool workingset)
 {
 	eviction >>= bucket_order;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
+	eviction = (eviction << 1) | workingset;
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
 
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
 static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
-			  unsigned long *evictionp)
+			  unsigned long *evictionp, bool *workingsetp)
 {
 	unsigned long entry = (unsigned long)shadow;
 	int memcgid, nid;
+	bool workingset;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	workingset = entry & 1;
+	entry >>= 1;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
 	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
@@ -194,6 +210,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
 	*evictionp = entry << bucket_order;
+	*workingsetp = workingset;
 }
 
 /**
@@ -206,8 +223,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
  */
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
-	struct mem_cgroup *memcg = page_memcg(page);
 	struct pglist_data *pgdat = page_pgdat(page);
+	struct mem_cgroup *memcg = page_memcg(page);
 	int memcgid = mem_cgroup_id(memcg);
 	unsigned long eviction;
 	struct lruvec *lruvec;
@@ -219,30 +236,30 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 
 	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
-	return pack_shadow(memcgid, pgdat, eviction);
+	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
 /**
  * workingset_refault - evaluate the refault of a previously evicted page
+ * @page: the freshly allocated replacement page
  * @shadow: shadow entry of the evicted page
  *
  * Calculates and evaluates the refault distance of the previously
  * evicted page in the context of the node it was allocated in.
- *
- * Returns %true if the page should be activated, %false otherwise.
  */
-bool workingset_refault(void *shadow)
+void workingset_refault(struct page *page, void *shadow)
 {
 	unsigned long refault_distance;
+	struct pglist_data *pgdat;
 	unsigned long active_file;
 	struct mem_cgroup *memcg;
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	unsigned long refault;
-	struct pglist_data *pgdat;
+	bool workingset;
 	int memcgid;
 
-	unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
+	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
 
 	rcu_read_lock();
 	/*
@@ -262,43 +279,54 @@ bool workingset_refault(void *shadow)
 	 * configurations instead.
 	 */
 	memcg = mem_cgroup_from_id(memcgid);
-	if (!mem_cgroup_disabled() && !memcg) {
-		rcu_read_unlock();
-		return false;
-	}
+	if (!mem_cgroup_disabled() && !memcg)
+		goto out;
 	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
 
 	/*
-	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases.
+	 * Calculate the refault distance
 	 *
-	 * There is a special case: usually, shadow entries have a
-	 * short lifetime and are either refaulted or reclaimed along
-	 * with the inode before they get too old.  But it is not
-	 * impossible for the inactive_age to lap a shadow entry in
-	 * the field, which can then can result in a false small
-	 * refault distance, leading to a false activation should this
-	 * old entry actually refault again.  However, earlier kernels
-	 * used to deactivate unconditionally with *every* reclaim
-	 * invocation for the longest time, so the occasional
-	 * inappropriate activation leading to pressure on the active
-	 * list is not a problem.
+	 * The unsigned subtraction here gives an accurate distance
+	 * across inactive_age overflows in most cases. There is a
+	 * special case: usually, shadow entries have a short lifetime
+	 * and are either refaulted or reclaimed along with the inode
+	 * before they get too old.  But it is not impossible for the
+	 * inactive_age to lap a shadow entry in the field, which can
+	 * then result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again.  However, earlier kernels used to deactivate
+	 * unconditionally with *every* reclaim invocation for the
+	 * longest time, so the occasional inappropriate activation
+	 * leading to pressure on the active list is not a problem.
 	 */
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
 	inc_node_state(pgdat, WORKINGSET_REFAULT);
 	inc_memcg_state(memcg, WORKINGSET_REFAULT);
 
-	if (refault_distance <= active_file) {
-		inc_node_state(pgdat, WORKINGSET_ACTIVATE);
-		inc_memcg_state(memcg, WORKINGSET_ACTIVATE);
-		rcu_read_unlock();
-		return true;
+	/*
+	 * Compare the distance to the existing workingset size. We
+	 * don't act on pages that couldn't stay resident even if all
+	 * the memory was available to the page cache.
+	 */
+	if (refault_distance > active_file)
+		goto out;
+
+	SetPageActive(page);
+	atomic_long_inc(&lruvec->inactive_age);
+	inc_node_state(pgdat, WORKINGSET_ACTIVATE);
+	inc_memcg_state(memcg, WORKINGSET_ACTIVATE);
+
+	/* Page was active prior to eviction */
+	if (workingset) {
+		SetPageWorkingset(page);
+		inc_node_state(pgdat, WORKINGSET_RESTORE);
+		inc_memcg_state(memcg, WORKINGSET_RESTORE);
 	}
+out:
 	rcu_read_unlock();
-	return false;
 }
 
 /**
-- 
2.13.3


* [PATCH 2/3] mm: workingset: tell cache transitions from workingset thrashing
@ 2017-07-27 15:30   ` Johannes Weiner
  0 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2017-07-27 15:30 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Mel Gorman
  Cc: linux-mm, linux-kernel, kernel-team

Refaults happen during transitions between workingsets as well as
during in-place thrashing. Knowing the difference between the two has
a range of applications, including measuring the impact of memory
shortage on system performance and balancing pressure more
intelligently between the filesystem cache and the swap-backed
workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.

Introduce a new page flag that tells on eviction whether the page has
been active or not in its lifetime. This bit is then stored in the
shadow entry, to classify refaults as transitioning or thrashing.
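
As an illustration of the paragraph above, here is a minimal userspace
sketch of how one extra bit in the shadow entry lets a refault be
classified. The field widths (DEMO_NODE_BITS, DEMO_MEMCG_BITS), the
demo_* helpers, and the refault_kind enum are hypothetical names chosen
for this example. The real packing lives in pack_shadow() and
unpack_shadow() in mm/workingset.c and additionally handles the radix
tree exceptional-entry bits and bucket_order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical field widths, for illustration only. */
#define DEMO_NODE_BITS   6
#define DEMO_MEMCG_BITS  16

enum refault_kind {
	REFAULT_IGNORED,     /* distance exceeds the active list size */
	REFAULT_TRANSITION,  /* page was never active: workingset is changing */
	REFAULT_THRASHING,   /* page was active before eviction: thrashing */
};

/* Pack eviction counter, memcg id, node id, and the workingset bit. */
static uint64_t demo_pack_shadow(uint64_t eviction, unsigned int memcgid,
				 unsigned int nid, bool workingset)
{
	uint64_t entry = eviction;

	assert(memcgid < (1u << DEMO_MEMCG_BITS));
	assert(nid < (1u << DEMO_NODE_BITS));

	entry = (entry << DEMO_MEMCG_BITS) | memcgid;
	entry = (entry << DEMO_NODE_BITS) | nid;
	entry = (entry << 1) | workingset;
	return entry;
}

/* Reverse of demo_pack_shadow(): peel the fields off lowest bit first. */
static void demo_unpack_shadow(uint64_t entry, uint64_t *eviction,
			       unsigned int *memcgid, unsigned int *nid,
			       bool *workingset)
{
	*workingset = entry & 1;
	entry >>= 1;
	*nid = entry & ((1u << DEMO_NODE_BITS) - 1);
	entry >>= DEMO_NODE_BITS;
	*memcgid = entry & ((1u << DEMO_MEMCG_BITS) - 1);
	entry >>= DEMO_MEMCG_BITS;
	*eviction = entry;
}

/* Classify a refault from its distance and the stored workingset bit. */
static enum refault_kind demo_classify(uint64_t refault_distance,
				       uint64_t active_file, bool workingset)
{
	if (refault_distance > active_file)
		return REFAULT_IGNORED;
	return workingset ? REFAULT_THRASHING : REFAULT_TRANSITION;
}
```

In this sketch, an entry packed with the workingset bit set that
refaults within the active list size is classified as thrashing; the
same refault with the bit clear is classified as a transition.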

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h         |  1 +
 include/linux/page-flags.h     |  5 ++-
 include/linux/swap.h           |  2 +-
 include/trace/events/mmflags.h |  1 +
 mm/filemap.c                   |  9 ++--
 mm/huge_memory.c               |  1 +
 mm/migrate.c                   |  2 +
 mm/swap_state.c                |  1 +
 mm/vmscan.c                    |  1 +
 mm/vmstat.c                    |  1 +
 mm/workingset.c                | 98 +++++++++++++++++++++++++++---------------
 11 files changed, 79 insertions(+), 43 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ef6a13b7bd3e..f33ad8d411e1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -156,6 +156,7 @@ enum node_stat_item {
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
+	WORKINGSET_RESTORE,
 	WORKINGSET_NODERECLAIM,
 	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6b5818d6de32..4d1e557d1f8c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -73,13 +73,14 @@
  */
 enum pageflags {
 	PG_locked,		/* Page is locked. Don't touch. */
-	PG_error,
 	PG_referenced,
 	PG_uptodate,
 	PG_dirty,
 	PG_lru,
 	PG_active,
+	PG_workingset,
 	PG_waiters,		/* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
+	PG_error,
 	PG_slab,
 	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
 	PG_arch_1,
@@ -272,6 +273,8 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
+PAGEFLAG(Workingset, workingset, PF_HEAD)
+	TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)	   /* Used by some filesystems */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ba5882419a7d..dc18a2b0b8aa 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -252,7 +252,7 @@ struct swap_info_struct {
 
 /* linux/mm/workingset.c */
 void *workingset_eviction(struct address_space *mapping, struct page *page);
-bool workingset_refault(void *shadow);
+void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 void workingset_update_node(struct radix_tree_node *node, void *private);
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 304ff94363b2..d7916f1f8240 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -90,6 +90,7 @@
 	{1UL << PG_dirty,		"dirty"		},		\
 	{1UL << PG_lru,			"lru"		},		\
 	{1UL << PG_active,		"active"	},		\
+	{1UL << PG_workingset,		"workingset"	},		\
 	{1UL << PG_slab,		"slab"		},		\
 	{1UL << PG_owner_priv_1,	"owner_priv_1"	},		\
 	{1UL << PG_arch_1,		"arch_1"	},		\
diff --git a/mm/filemap.c b/mm/filemap.c
index 6f1be573a5e6..5c592e925805 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -697,12 +697,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 		 * data from the working set, only to cache data that will
 		 * get overwritten with something else, is a waste of memory.
 		 */
-		if (!(gfp_mask & __GFP_WRITE) &&
-		    shadow && workingset_refault(shadow)) {
-			SetPageActive(page);
-			workingset_activation(page);
-		} else
-			ClearPageActive(page);
+		WARN_ON_ONCE(PageActive(page));
+		if (!(gfp_mask & __GFP_WRITE) && shadow)
+			workingset_refault(page, shadow);
 		lru_cache_add(page);
 	}
 	return ret;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 88c6167f194d..5f049628cf05 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2217,6 +2217,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
 			 (1L << PG_active) |
+			 (1L << PG_workingset) |
 			 (1L << PG_locked) |
 			 (1L << PG_unevictable) |
 			 (1L << PG_dirty)));
diff --git a/mm/migrate.c b/mm/migrate.c
index 89a0a1707f4c..bd3f13fef8ca 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -622,6 +622,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 		SetPageActive(newpage);
 	} else if (TestClearPageUnevictable(page))
 		SetPageUnevictable(newpage);
+	if (PageWorkingset(page))
+		SetPageWorkingset(newpage);
 	if (PageChecked(page))
 		SetPageChecked(newpage);
 	if (PageMappedToDisk(page))
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 539b8885e3d1..6ae807135887 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -379,6 +379,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			/*
 			 * Initiate read into locked page and return.
 			 */
+			SetPageWorkingset(new_page);
 			lru_cache_add_anon(new_page);
 			*new_page_allocated = true;
 			return new_page;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8ad39bbc79e6..285db147d013 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1978,6 +1978,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		}
 
 		ClearPageActive(page);	/* we are de-activating */
+		SetPageWorkingset(page);
 		list_add(&page->lru, &l_inactive);
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 76f73670200a..8cd81d40d97b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -956,6 +956,7 @@ const char * const vmstat_text[] = {
 	"nr_isolated_file",
 	"workingset_refault",
 	"workingset_activate",
+	"workingset_restore",
 	"workingset_nodereclaim",
 	"nr_anon_pages",
 	"nr_mapped",
diff --git a/mm/workingset.c b/mm/workingset.c
index b8c9ab678479..2192e52e7957 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -120,7 +120,7 @@
  * the only thing eating into inactive list space is active pages.
  *
  *
- *		Activating refaulting pages
+ *		Refaulting inactive pages
  *
  * All that is known about the active list is that the pages have been
  * accessed more than once in the past.  This means that at any given
@@ -133,6 +133,10 @@
  * used less frequently than the refaulting page - or even not used at
  * all anymore.
  *
+ * That means if inactive cache is refaulting with a suitable refault
+ * distance, we assume the cache workingset is transitioning and put
+ * pressure on the current active list.
+ *
  * If this is wrong and demotion kicks in, the pages which are truly
  * used more frequently will be reactivated while the less frequently
  * used once will be evicted from memory.
@@ -140,6 +144,14 @@
  * But if this is right, the stale pages will be pushed out of memory
  * and the used pages get to stay in cache.
  *
+ *		Refaulting active pages
+ *
+ * If on the other hand the refaulting pages have recently been
+ * deactivated, it means that the active list is no longer protecting
+ * actively used cache from reclaim. The cache is NOT transitioning to
+ * a different workingset; the existing workingset is thrashing in the
+ * space allocated to the page cache.
+ *
  *
  *		Implementation
  *
@@ -155,8 +167,7 @@
  */
 
 #define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
-			 NODES_SHIFT +	\
-			 MEM_CGROUP_ID_SHIFT)
+			 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -169,23 +180,28 @@
  */
 static unsigned int bucket_order __read_mostly;
 
-static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
+static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
+			 bool workingset)
 {
 	eviction >>= bucket_order;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
+	eviction = (eviction << 1) | workingset;
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
 
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
 static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
-			  unsigned long *evictionp)
+			  unsigned long *evictionp, bool *workingsetp)
 {
 	unsigned long entry = (unsigned long)shadow;
 	int memcgid, nid;
+	bool workingset;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	workingset = entry & 1;
+	entry >>= 1;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
 	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
@@ -194,6 +210,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
 	*evictionp = entry << bucket_order;
+	*workingsetp = workingset;
 }
 
 /**
@@ -206,8 +223,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
  */
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
-	struct mem_cgroup *memcg = page_memcg(page);
 	struct pglist_data *pgdat = page_pgdat(page);
+	struct mem_cgroup *memcg = page_memcg(page);
 	int memcgid = mem_cgroup_id(memcg);
 	unsigned long eviction;
 	struct lruvec *lruvec;
@@ -219,30 +236,30 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 
 	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
-	return pack_shadow(memcgid, pgdat, eviction);
+	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
 /**
  * workingset_refault - evaluate the refault of a previously evicted page
+ * @page: the freshly allocated replacement page
  * @shadow: shadow entry of the evicted page
  *
  * Calculates and evaluates the refault distance of the previously
  * evicted page in the context of the node it was allocated in.
- *
- * Returns %true if the page should be activated, %false otherwise.
  */
-bool workingset_refault(void *shadow)
+void workingset_refault(struct page *page, void *shadow)
 {
 	unsigned long refault_distance;
+	struct pglist_data *pgdat;
 	unsigned long active_file;
 	struct mem_cgroup *memcg;
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	unsigned long refault;
-	struct pglist_data *pgdat;
+	bool workingset;
 	int memcgid;
 
-	unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
+	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
 
 	rcu_read_lock();
 	/*
@@ -262,43 +279,54 @@ bool workingset_refault(void *shadow)
 	 * configurations instead.
 	 */
 	memcg = mem_cgroup_from_id(memcgid);
-	if (!mem_cgroup_disabled() && !memcg) {
-		rcu_read_unlock();
-		return false;
-	}
+	if (!mem_cgroup_disabled() && !memcg)
+		goto out;
 	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
 
 	/*
-	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases.
+	 * Calculate the refault distance
 	 *
-	 * There is a special case: usually, shadow entries have a
-	 * short lifetime and are either refaulted or reclaimed along
-	 * with the inode before they get too old.  But it is not
-	 * impossible for the inactive_age to lap a shadow entry in
-	 * the field, which can then can result in a false small
-	 * refault distance, leading to a false activation should this
-	 * old entry actually refault again.  However, earlier kernels
-	 * used to deactivate unconditionally with *every* reclaim
-	 * invocation for the longest time, so the occasional
-	 * inappropriate activation leading to pressure on the active
-	 * list is not a problem.
+	 * The unsigned subtraction here gives an accurate distance
+	 * across inactive_age overflows in most cases. There is a
+	 * special case: usually, shadow entries have a short lifetime
+	 * and are either refaulted or reclaimed along with the inode
+	 * before they get too old.  But it is not impossible for the
+	 * inactive_age to lap a shadow entry in the field, which can
+	 * then result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again.  However, earlier kernels used to deactivate
+	 * unconditionally with *every* reclaim invocation for the
+	 * longest time, so the occasional inappropriate activation
+	 * leading to pressure on the active list is not a problem.
 	 */
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
 	inc_node_state(pgdat, WORKINGSET_REFAULT);
 	inc_memcg_state(memcg, WORKINGSET_REFAULT);
 
-	if (refault_distance <= active_file) {
-		inc_node_state(pgdat, WORKINGSET_ACTIVATE);
-		inc_memcg_state(memcg, WORKINGSET_ACTIVATE);
-		rcu_read_unlock();
-		return true;
+	/*
+	 * Compare the distance to the existing workingset size. We
+	 * don't act on pages that couldn't stay resident even if all
+	 * the memory was available to the page cache.
+	 */
+	if (refault_distance > active_file)
+		goto out;
+
+	SetPageActive(page);
+	atomic_long_inc(&lruvec->inactive_age);
+	inc_node_state(pgdat, WORKINGSET_ACTIVATE);
+	inc_memcg_state(memcg, WORKINGSET_ACTIVATE);
+
+	/* Page was active prior to eviction */
+	if (workingset) {
+		SetPageWorkingset(page);
+		inc_node_state(pgdat, WORKINGSET_RESTORE);
+		inc_memcg_state(memcg, WORKINGSET_RESTORE);
 	}
+out:
 	rcu_read_unlock();
-	return false;
 }
 
 /**
-- 
2.13.3



* [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
  2017-07-27 15:30 ` Johannes Weiner
@ 2017-07-27 15:30   ` Johannes Weiner
  -1 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2017-07-27 15:30 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Mel Gorman
  Cc: linux-mm, linux-kernel, kernel-team

Linux doesn't have a useful metric to describe the memory health of a
system, a cgroup container, or individual tasks.

When workloads are bigger than available memory, they spend a certain
amount of their time inside page reclaim, waiting on thrashing cache,
and swapping in. This has an impact on latency and, depending on the
CPU capacity in the system, can also translate to a decrease in
throughput.

While Linux exports some stats and counters for these events, it does
not quantify the true impact they have on throughput and latency. How
much of the execution time is spent unproductively? This is important
to know when sizing workloads to systems and containers. It also comes
in handy when evaluating the effectiveness and efficiency of the
kernel's memory management policies and heuristics.

This patch implements a metric that quantifies memory pressure in a
unit that matters most to applications and does not rely on hardware
aspects to be meaningful: wallclock time lost while waiting on memory.

Whenever a task is blocked on refaults, swapins, or direct reclaim,
the time it spends is accounted on the task level and aggregated into
a domain state along with other tasks on the system and cgroup level.
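
The aggregation step can be sketched in userspace C as follows. This is
a simplified reading of the MTS_* and MDS_* comments in
include/linux/memdelay.h from this patch, not the code in
mm/memdelay.c: all demo_* names are hypothetical, per-CPU accounting
and locking are left out, and any special handling of
MTS_DELAYED_ACTIVE is ignored.

```c
#include <assert.h>

enum demo_task_state {
	DEMO_TASK_NONE,            /* idle or untracked */
	DEMO_TASK_WORKING,         /* runnable or waiting for IO */
	DEMO_TASK_DELAYED,         /* memory delayed, not running */
	DEMO_TASK_DELAYED_ACTIVE,  /* memory delayed, actively running */
	DEMO_NR_TASK_STATES,
};

enum demo_domain_state {
	DEMO_DOMAIN_NONE,          /* no delayed tasks */
	DEMO_DOMAIN_SOME,          /* delayed tasks and working tasks */
	DEMO_DOMAIN_FULL,          /* delayed tasks, no working tasks */
};

struct demo_domain {
	int tasks[DEMO_NR_TASK_STATES];
};

/* Move one task between states; idle tasks are not counted. */
static void demo_task_change(struct demo_domain *d,
			     enum demo_task_state prev,
			     enum demo_task_state next)
{
	if (prev != DEMO_TASK_NONE) {
		assert(d->tasks[prev] > 0);
		d->tasks[prev]--;
	}
	if (next != DEMO_TASK_NONE)
		d->tasks[next]++;
}

/* Derive the domain state from the current task counts. */
static enum demo_domain_state demo_eval_domain(const struct demo_domain *d)
{
	int delayed = d->tasks[DEMO_TASK_DELAYED] +
		      d->tasks[DEMO_TASK_DELAYED_ACTIVE];

	if (!delayed)
		return DEMO_DOMAIN_NONE;
	return d->tasks[DEMO_TASK_WORKING] ? DEMO_DOMAIN_SOME
					   : DEMO_DOMAIN_FULL;
}
```

With two working tasks the domain is in the NONE state; once one of
them becomes delayed it is in SOME, and once both are delayed it is in
FULL.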

Each task has a /proc/<pid>/memdelay file that lists the microseconds
the task has been delayed since it's been forked. That file can be
sampled periodically for recent delays, or before and after certain
operations to measure their memory-related latencies.
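
A before/after measurement along these lines could look like the
following userspace sketch. It assumes only what is stated above, that
the file holds a single decimal microsecond count; the demo_* helper
names are hypothetical, and the file exists only on a kernel carrying
this patch.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a single decimal microsecond count such as "2489084\n". */
static int demo_parse_memdelay(const char *text, unsigned long long *usec)
{
	char *end;

	errno = 0;
	*usec = strtoull(text, &end, 10);
	if (errno || end == text)
		return -1;	/* overflow or no digits found */
	return 0;
}

/* Read the counter from a file such as /proc/<pid>/memdelay. */
static int demo_read_memdelay(const char *path, unsigned long long *usec)
{
	char buf[64];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = demo_parse_memdelay(buf, usec);
	fclose(f);
	return ret;
}

/* Delay accrued between two samples of the cumulative counter. */
static unsigned long long demo_delay_delta(unsigned long long before,
					   unsigned long long after)
{
	return after - before;
}
```

A caller would sample the file once before and once after the
operation of interest and pass both values to demo_delay_delta() to
get the memory-related latency of that operation in microseconds.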

On the system and cgroup level, there are /proc/memdelay and
memory.memdelay, respectively, with the following format:

$ cat /proc/memdelay
2489084
41.61 47.28 29.66
0.00 0.00 0.00

The first line shows the cumulative delay times of all tasks in the
domain - in this case, all tasks in the system cumulatively lost 2.49
seconds due to memory delays.

The second and third lines show percentages of time spent in aggregate
states for the domain - system or cgroup - in a load-average-style
format, as decaying averages over the last 1m, 5m, and 15m:

The second line indicates the share of wall-time the domain spends in
a state where SOME tasks are delayed by memory while others are still
productive (runnable or iowait). This indicates a latency problem for
individual tasks, but since the CPU/IO capacity is still used, adding
more memory might not necessarily improve the domain's throughput.

The third line indicates the share of wall-time the domain spends in a
state where ALL non-idle tasks are delayed by memory. In this state,
the domain is entirely unproductive due to a lack of memory.
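
To make the three-line format concrete, here is a small userspace
sketch that parses it and converts the cumulative total to seconds.
The struct and function names are hypothetical and the layout is taken
only from the sample output above; this is not an interface defined by
the patch.

```c
#include <stdio.h>

struct demo_memdelay {
	unsigned long long total_us;	/* line 1: cumulative delay in us */
	double some[3];			/* line 2: SOME %, 1m/5m/15m */
	double full[3];			/* line 3: FULL %, 1m/5m/15m */
};

/* Parse the contents of a domain-level memdelay file. Returns 0 on success. */
static int demo_parse_domain(const char *text, struct demo_memdelay *md)
{
	int n = sscanf(text, "%llu %lf %lf %lf %lf %lf %lf",
		       &md->total_us,
		       &md->some[0], &md->some[1], &md->some[2],
		       &md->full[0], &md->full[1], &md->full[2]);

	return n == 7 ? 0 : -1;
}

/* Convert the cumulative microsecond total to seconds. */
static double demo_total_seconds(const struct demo_memdelay *md)
{
	return md->total_us / 1e6;
}
```

Fed the sample output shown earlier, the total of 2489084 microseconds
converts to roughly 2.49 seconds, matching the figure quoted above.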

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/proc/array.c            |   8 ++
 fs/proc/base.c             |   2 +
 fs/proc/internal.h         |   2 +
 include/linux/cgroup.h     |  14 +++
 include/linux/memcontrol.h |  14 +++
 include/linux/memdelay.h   | 174 +++++++++++++++++++++++++++
 include/linux/sched.h      |  10 +-
 kernel/cgroup/cgroup.c     |   4 +-
 kernel/fork.c              |   4 +
 kernel/sched/Makefile      |   2 +-
 kernel/sched/core.c        |  20 ++++
 kernel/sched/memdelay.c    | 112 ++++++++++++++++++
 mm/Makefile                |   2 +-
 mm/compaction.c            |   4 +
 mm/filemap.c               |   9 ++
 mm/memcontrol.c            |  25 ++++
 mm/memdelay.c              | 289 +++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |  11 +-
 mm/vmscan.c                |   9 ++
 19 files changed, 709 insertions(+), 6 deletions(-)
 create mode 100644 include/linux/memdelay.h
 create mode 100644 kernel/sched/memdelay.c
 create mode 100644 mm/memdelay.c

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 88c355574aa0..00e0e9aa3e70 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -611,6 +611,14 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
 	return 0;
 }
 
+int proc_pid_memdelay(struct seq_file *m, struct pid_namespace *ns,
+		      struct pid *pid, struct task_struct *task)
+{
+	seq_put_decimal_ull(m, "", task->memdelay_total);
+	seq_putc(m, '\n');
+	return 0;
+}
+
 #ifdef CONFIG_PROC_CHILDREN
 static struct pid *
 get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index f1e1927ccd48..cd653729b0c6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2873,6 +2873,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("cmdline",    S_IRUGO, proc_pid_cmdline_ops),
 	ONE("stat",       S_IRUGO, proc_tgid_stat),
 	ONE("statm",      S_IRUGO, proc_pid_statm),
+	ONE("memdelay",   S_IRUGO, proc_pid_memdelay),
 	REG("maps",       S_IRUGO, proc_pid_maps_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
@@ -3263,6 +3264,7 @@ static const struct pid_entry tid_base_stuff[] = {
 	REG("cmdline",   S_IRUGO, proc_pid_cmdline_ops),
 	ONE("stat",      S_IRUGO, proc_tid_stat),
 	ONE("statm",     S_IRUGO, proc_pid_statm),
+	ONE("memdelay",  S_IRUGO, proc_pid_memdelay),
 	REG("maps",      S_IRUGO, proc_tid_maps_operations),
 #ifdef CONFIG_PROC_CHILDREN
 	REG("children",  S_IRUGO, proc_tid_children_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c5ae09b6c726..49eba8f0cc7c 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -146,6 +146,8 @@ extern int proc_pid_status(struct seq_file *, struct pid_namespace *,
 			   struct pid *, struct task_struct *);
 extern int proc_pid_statm(struct seq_file *, struct pid_namespace *,
 			  struct pid *, struct task_struct *);
+extern int proc_pid_memdelay(struct seq_file *, struct pid_namespace *,
+			     struct pid *, struct task_struct *);
 
 /*
  * base.c
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 710a005c6b7a..7283439043d9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -102,6 +102,17 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry);
 int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 		     struct pid *pid, struct task_struct *tsk);
 
+/* caller must have irqs disabled */
+static inline void lock_task_cgroup(struct task_struct *p)
+{
+	spin_lock(&p->cgroups_lock);
+}
+
+static inline void unlock_task_cgroup(struct task_struct *p)
+{
+	spin_unlock(&p->cgroups_lock);
+}
+
 void cgroup_fork(struct task_struct *p);
 extern int cgroup_can_fork(struct task_struct *p);
 extern void cgroup_cancel_fork(struct task_struct *p);
@@ -620,6 +631,9 @@ static inline int cgroup_attach_task_all(struct task_struct *from,
 static inline int cgroupstats_build(struct cgroupstats *stats,
 				    struct dentry *dentry) { return -EINVAL; }
 
+static inline void lock_task_cgroup(struct task_struct *p) {}
+static inline void unlock_task_cgroup(struct task_struct *p) {}
+
 static inline void cgroup_fork(struct task_struct *p) {}
 static inline int cgroup_can_fork(struct task_struct *p) { return 0; }
 static inline void cgroup_cancel_fork(struct task_struct *p) {}
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 899949bbb2f9..579a28e84f3b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,6 +29,7 @@
 #include <linux/mmzone.h>
 #include <linux/writeback.h>
 #include <linux/page-flags.h>
+#include <linux/memdelay.h>
 
 struct mem_cgroup;
 struct page;
@@ -179,6 +180,9 @@ struct mem_cgroup {
 
 	unsigned long soft_limit;
 
+	/* Memory delay measurement domain */
+	struct memdelay_domain *memdelay_domain;
+
 	/* vmpressure notifications */
 	struct vmpressure vmpressure;
 
@@ -632,6 +636,11 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &pgdat->lruvec;
 }
 
+static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
+{
+	return NULL;
+}
+
 static inline bool mm_match_cgroup(struct mm_struct *mm,
 		struct mem_cgroup *memcg)
 {
@@ -644,6 +653,11 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
 	return true;
 }
 
+static inline struct mem_cgroup *mem_cgroup_from_task(struct task_struct *task)
+{
+	return NULL;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
diff --git a/include/linux/memdelay.h b/include/linux/memdelay.h
new file mode 100644
index 000000000000..7187fdb49204
--- /dev/null
+++ b/include/linux/memdelay.h
@@ -0,0 +1,174 @@
+#ifndef _LINUX_MEMDELAY_H
+#define _LINUX_MEMDELAY_H
+
+#include <linux/spinlock_types.h>
+#include <linux/sched.h>
+
+struct seq_file;
+struct css_set;
+
+/*
+ * Task productivity states tracked by the scheduler
+ */
+enum memdelay_task_state {
+	MTS_NONE,		/* Idle/unqueued/untracked */
+	MTS_WORKING,		/* Runnable or waiting for IO */
+	MTS_DELAYED,		/* Memory delayed, not running */
+	MTS_DELAYED_ACTIVE,	/* Memory delayed, actively running */
+	NR_MEMDELAY_TASK_STATES,
+};
+
+/*
+ * System/cgroup delay state tracked by the VM, composed of the
+ * productivity states of all tasks inside the domain.
+ */
+enum memdelay_domain_state {
+	MDS_NONE,		/* No delayed tasks */
+	MDS_SOME,		/* Delayed tasks, working tasks */
+	MDS_FULL,		/* Delayed tasks, no working tasks */
+	NR_MEMDELAY_DOMAIN_STATES,
+};
+
+struct memdelay_domain_cpu {
+	spinlock_t lock;
+
+	/* Task states of the domain on this CPU */
+	int tasks[NR_MEMDELAY_TASK_STATES];
+
+	/* Delay state of the domain on this CPU */
+	enum memdelay_domain_state state;
+
+	/* Time of last state change */
+	unsigned long state_start;
+};
+
+struct memdelay_domain {
+	/* Aggregate delayed time of all domain tasks */
+	unsigned long aggregate;
+
+	/* Per-CPU delay states in the domain */
+	struct memdelay_domain_cpu __percpu *mdcs;
+
+	/* Cumulative state times from all CPUs */
+	unsigned long times[NR_MEMDELAY_DOMAIN_STATES];
+
+	/* Decaying state time averages over 1m, 5m, 15m */
+	unsigned long period_expires;
+	unsigned long avg_full[3];
+	unsigned long avg_some[3];
+};
+
+/* mm/memdelay.c */
+extern struct memdelay_domain memdelay_global_domain;
+void memdelay_init(void);
+void memdelay_task_change(struct task_struct *task, int old, int new);
+struct memdelay_domain *memdelay_domain_alloc(void);
+void memdelay_domain_free(struct memdelay_domain *md);
+int memdelay_domain_show(struct seq_file *s, struct memdelay_domain *md);
+
+/* kernel/sched/memdelay.c */
+void memdelay_enter(unsigned long *flags);
+void memdelay_leave(unsigned long *flags);
+
+/**
+ * memdelay_schedule - note a context switch
+ * @prev: task scheduling out
+ * @next: task scheduling in
+ *
+ * A task switch doesn't affect the balance between delayed and
+ * productive tasks, but we have to update whether the delay is
+ * actively using the CPU or not.
+ */
+static inline void memdelay_schedule(struct task_struct *prev,
+				     struct task_struct *next)
+{
+	if (prev->flags & PF_MEMDELAY)
+		memdelay_task_change(prev, MTS_DELAYED_ACTIVE, MTS_DELAYED);
+
+	if (next->flags & PF_MEMDELAY)
+		memdelay_task_change(next, MTS_DELAYED, MTS_DELAYED_ACTIVE);
+}
+
+/**
+ * memdelay_wakeup - note a task waking up
+ * @task: the task
+ *
+ * Notes an idle task becoming productive. Delayed tasks remain
+ * delayed even when they become runnable; tasks in iowait are
+ * considered productive.
+ */
+static inline void memdelay_wakeup(struct task_struct *task)
+{
+	if (task->flags & PF_MEMDELAY || task->in_iowait)
+		return;
+
+	memdelay_task_change(task, MTS_NONE, MTS_WORKING);
+}
+
+/**
+ * memdelay_sleep - note a task going to sleep
+ * @task: the task
+ *
+ * Notes a working task becoming unproductive. Delayed tasks remain
+ * delayed; tasks sleeping in an iowait remain productive.
+ */
+static inline void memdelay_sleep(struct task_struct *task)
+{
+	if (task->flags & PF_MEMDELAY || task->in_iowait)
+		return;
+
+	memdelay_task_change(task, MTS_WORKING, MTS_NONE);
+}
+
+/**
+ * memdelay_del_add - track task movement between runqueues
+ * @task: the task
+ * @runnable: a runnable task is moved if %true, unqueued otherwise
+ * @add: task is being added if %true, removed otherwise
+ *
+ * Update the memdelay domain per-cpu states as tasks are being moved
+ * around the runqueues.
+ */
+static inline void memdelay_del_add(struct task_struct *task,
+				    bool runnable, bool add)
+{
+	int state;
+
+	if (task->flags & PF_MEMDELAY)
+		state = MTS_DELAYED;
+	else if (runnable || task->in_iowait)
+		state = MTS_WORKING;
+	else
+		return; /* already MTS_NONE */
+
+	if (add)
+		memdelay_task_change(task, MTS_NONE, state);
+	else
+		memdelay_task_change(task, state, MTS_NONE);
+}
+
+static inline void memdelay_del_runnable(struct task_struct *task)
+{
+	memdelay_del_add(task, true, false);
+}
+
+static inline void memdelay_add_runnable(struct task_struct *task)
+{
+	memdelay_del_add(task, true, true);
+}
+
+static inline void memdelay_del_sleeping(struct task_struct *task)
+{
+	memdelay_del_add(task, false, false);
+}
+
+static inline void memdelay_add_sleeping(struct task_struct *task)
+{
+	memdelay_del_add(task, false, true);
+}
+
+#ifdef CONFIG_CGROUPS
+void cgroup_move_task(struct task_struct *task, struct css_set *to);
+#endif
+
+#endif /* _LINUX_MEMDELAY_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2b69fc650201..c5da04c260e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -837,6 +837,12 @@ struct task_struct {
 
 	struct io_context		*io_context;
 
+	unsigned long			memdelay_start;
+	unsigned long			memdelay_total;
+#ifdef CONFIG_DEBUG_VM
+	int				memdelay_state;
+#endif
+
 	/* Ptrace state: */
 	unsigned long			ptrace_message;
 	siginfo_t			*last_siginfo;
@@ -859,7 +865,8 @@ struct task_struct {
 	int				cpuset_slab_spread_rotor;
 #endif
 #ifdef CONFIG_CGROUPS
-	/* Control Group info protected by css_set_lock: */
+	spinlock_t			cgroups_lock;
+	/* Control Group info protected by cgroups_lock: */
 	struct css_set __rcu		*cgroups;
 	/* cg_list protected by css_set_lock and tsk->alloc_lock: */
 	struct list_head		cg_list;
@@ -1231,6 +1238,7 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+#define PF_MEMDELAY		0x01000000	/* Delayed due to lack of memory */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_allowed */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MUTEX_TESTER		0x20000000	/* Thread belongs to the rt mutex tester */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8d4e85eae42c..f442e16911bc 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -694,7 +694,8 @@ static void css_set_move_task(struct task_struct *task,
 		 */
 		WARN_ON_ONCE(task->flags & PF_EXITING);
 
-		rcu_assign_pointer(task->cgroups, to_cset);
+		cgroup_move_task(task, to_cset);
+
 		list_add_tail(&task->cg_list, use_mg_tasks ? &to_cset->mg_tasks :
 							     &to_cset->tasks);
 	}
@@ -4693,6 +4694,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
  */
 void cgroup_fork(struct task_struct *child)
 {
+	spin_lock_init(&child->cgroups_lock);
 	RCU_INIT_POINTER(child->cgroups, &init_css_set);
 	INIT_LIST_HEAD(&child->cg_list);
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index e53770d2bf95..73b8dae7b34e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1201,6 +1201,10 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	int retval;
 
 	tsk->min_flt = tsk->maj_flt = 0;
+	tsk->memdelay_total = 0;
+#ifdef CONFIG_DEBUG_VM
+	tsk->memdelay_state = 0;
+#endif
 	tsk->nvcsw = tsk->nivcsw = 0;
 #ifdef CONFIG_DETECT_HUNG_TASK
 	tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw;
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 89ab6758667b..5efb0fddc3d3 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -17,7 +17,7 @@ endif
 
 obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o
-obj-y += wait.o swait.o completion.o idle.o
+obj-y += wait.o swait.o completion.o idle.o memdelay.o
 obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 326d4f88e2b1..a90399a5473f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -25,6 +25,7 @@
 #include <linux/profile.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/memdelay.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -758,6 +759,11 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & ENQUEUE_RESTORE))
 		sched_info_queued(rq, p);
 
+	if (flags & ENQUEUE_WAKEUP)
+		memdelay_wakeup(p);
+	else
+		memdelay_add_runnable(p);
+
 	p->sched_class->enqueue_task(rq, p, flags);
 }
 
@@ -769,6 +775,11 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & DEQUEUE_SAVE))
 		sched_info_dequeued(rq, p);
 
+	if (flags & DEQUEUE_SLEEP)
+		memdelay_sleep(p);
+	else
+		memdelay_del_runnable(p);
+
 	p->sched_class->dequeue_task(rq, p, flags);
 }
 
@@ -2053,7 +2064,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
+
+		memdelay_del_sleeping(p);
+
 		set_task_cpu(p, cpu);
+
+		memdelay_add_sleeping(p);
 	}
 
 #else /* CONFIG_SMP */
@@ -3434,6 +3450,8 @@ static void __sched notrace __schedule(bool preempt)
 		rq->curr = next;
 		++*switch_count;
 
+		memdelay_schedule(prev, next);
+
 		trace_sched_switch(preempt, prev, next);
 
 		/* Also unlocks the rq: */
@@ -6210,6 +6228,8 @@ void __init sched_init(void)
 
 	init_schedstats();
 
+	memdelay_init();
+
 	scheduler_running = 1;
 }
 
diff --git a/kernel/sched/memdelay.c b/kernel/sched/memdelay.c
new file mode 100644
index 000000000000..971f45a0b946
--- /dev/null
+++ b/kernel/sched/memdelay.c
@@ -0,0 +1,112 @@
+/*
+ * Memory delay metric
+ *
+ * Copyright (c) 2017 Facebook, Johannes Weiner
+ *
+ * This code quantifies and reports to userspace the wall-time impact
+ * of memory pressure on the system and memory-controlled cgroups.
+ */
+
+#include <linux/memdelay.h>
+#include <linux/cgroup.h>
+#include <linux/sched.h>
+
+#include "sched.h"
+
+/**
+ * memdelay_enter - mark the beginning of a memory delay section
+ * @flags: flags to handle nested memdelay sections
+ *
+ * Marks the calling task as being delayed due to a lack of memory,
+ * such as waiting for a workingset refault or performing reclaim.
+ */
+void memdelay_enter(unsigned long *flags)
+{
+	*flags = current->flags & PF_MEMDELAY;
+	if (*flags)
+		return;
+	/*
+	 * PF_MEMDELAY & accounting needs to be atomic wrt changes to
+	 * the task's scheduling state (hence IRQ disabling) and its
+	 * domain association (hence lock_task_cgroup). Otherwise we
+	 * could race with CPU or cgroup migration and misaccount.
+	 */
+	WARN_ON_ONCE(irqs_disabled());
+	local_irq_disable();
+	lock_task_cgroup(current);
+
+	current->flags |= PF_MEMDELAY;
+	memdelay_task_change(current, MTS_WORKING, MTS_DELAYED_ACTIVE);
+
+	unlock_task_cgroup(current);
+	local_irq_enable();
+}
+
+/**
+ * memdelay_leave - mark the end of a memory delay section
+ * @flags: flags to handle nested memdelay sections
+ *
+ * Marks the calling task as no longer delayed due to memory.
+ */
+void memdelay_leave(unsigned long *flags)
+{
+	if (*flags)
+		return;
+	/*
+	 * PF_MEMDELAY & accounting needs to be atomic wrt changes to
+	 * the task's scheduling state (hence IRQ disabling) and its
+	 * domain association (hence lock_task_cgroup). Otherwise we
+	 * could race with CPU or cgroup migration and misaccount.
+	 */
+	WARN_ON_ONCE(irqs_disabled());
+	local_irq_disable();
+	lock_task_cgroup(current);
+
+	current->flags &= ~PF_MEMDELAY;
+	memdelay_task_change(current, MTS_DELAYED_ACTIVE, MTS_WORKING);
+
+	unlock_task_cgroup(current);
+	local_irq_enable();
+}
+
+#ifdef CONFIG_CGROUPS
+/**
+ * cgroup_move_task - move task to a different cgroup
+ * @task: the task
+ * @to: the target css_set
+ *
+ * Move task to a new cgroup and safely migrate its associated
+ * delayed/working state between the different domains.
+ *
+ * This function acquires the task's rq lock and lock_task_cgroup() to
+ * lock out concurrent changes to the task's scheduling state and - in
+ * case the task is running - concurrent changes to its delay state.
+ */
+void cgroup_move_task(struct task_struct *task, struct css_set *to)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+	int state;
+
+	lock_task_cgroup(task);
+	rq = task_rq_lock(task, &rf);
+
+	if (task->flags & PF_MEMDELAY)
+		state = MTS_DELAYED + task_current(rq, task);
+	else if (task_on_rq_queued(task) || task->in_iowait)
+		state = MTS_WORKING;
+	else
+		state = MTS_NONE;
+
+	/*
+	 * Lame to do this here, but the scheduler cannot be locked
+	 * from the outside, so we move cgroups from inside sched/.
+	 */
+	memdelay_task_change(task, state, MTS_NONE);
+	rcu_assign_pointer(task->cgroups, to);
+	memdelay_task_change(task, MTS_NONE, state);
+
+	task_rq_unlock(rq, task, &rf);
+	unlock_task_cgroup(task);
+}
+#endif /* CONFIG_CGROUPS */
diff --git a/mm/Makefile b/mm/Makefile
index 026f6a828a50..ac020693031d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,7 +39,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o vmacache.o swap_slots.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o $(mmu-y)
+			   memdelay.o debug.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 613c59e928cb..d4b81318d1d7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2041,11 +2041,15 @@ static int kcompactd(void *p)
 	pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
 
 	while (!kthread_should_stop()) {
+		unsigned long mdflags;
+
 		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
 		wait_event_freezable(pgdat->kcompactd_wait,
 				kcompactd_work_requested(pgdat));
 
+		memdelay_enter(&mdflags);
 		kcompactd_do_work(pgdat);
+		memdelay_leave(&mdflags);
 	}
 
 	return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index 5c592e925805..12869768e2e4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -36,6 +36,7 @@
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
 #include <linux/rmap.h>
+#include <linux/memdelay.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -830,10 +831,15 @@ static void wake_up_page(struct page *page, int bit)
 static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 		struct page *page, int bit_nr, int state, bool lock)
 {
+	bool refault = bit_nr == PG_locked && PageWorkingset(page);
 	struct wait_page_queue wait_page;
 	wait_queue_t *wait = &wait_page.wait;
+	unsigned long mdflags;
 	int ret = 0;
 
+	if (refault)
+		memdelay_enter(&mdflags);
+
 	init_wait(wait);
 	wait->func = wake_page_function;
 	wait_page.page = page;
@@ -873,6 +879,9 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 
 	finish_wait(q, wait);
 
+	if (refault)
+		memdelay_leave(&mdflags);
+
 	/*
 	 * A signal could leave PageWaiters set. Clearing it here if
 	 * !waitqueue_active would be possible (by open-coding finish_wait),
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 94172089f52f..5d1ebe329c48 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -65,6 +65,7 @@
 #include <linux/lockdep.h>
 #include <linux/file.h>
 #include <linux/tracehook.h>
+#include <linux/memdelay.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -3907,6 +3908,8 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 	return ret;
 }
 
+static int memory_memdelay_show(struct seq_file *m, void *v);
+
 static struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -3974,6 +3977,10 @@ static struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "pressure_level",
 	},
+	{
+		.name = "memdelay",
+		.seq_show = memory_memdelay_show,
+	},
 #ifdef CONFIG_NUMA
 	{
 		.name = "numa_stat",
@@ -4142,6 +4149,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 
 	for_each_node(node)
 		free_mem_cgroup_per_node_info(memcg, node);
+	memdelay_domain_free(memcg->memdelay_domain);
 	free_percpu(memcg->stat);
 	kfree(memcg);
 }
@@ -4247,10 +4255,15 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	/* The following stuff does not apply to the root */
 	if (!parent) {
+		memcg->memdelay_domain = &memdelay_global_domain;
 		root_mem_cgroup = memcg;
 		return &memcg->css;
 	}
 
+	memcg->memdelay_domain = memdelay_domain_alloc();
+	if (!memcg->memdelay_domain)
+		goto fail;
+
 	error = memcg_online_kmem(memcg);
 	if (error)
 		goto fail;
@@ -5241,6 +5254,13 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int memory_memdelay_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+	return memdelay_domain_show(m, memcg->memdelay_domain);
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -5276,6 +5296,11 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = memory_stat_show,
 	},
+	{
+		.name = "memdelay",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_memdelay_show,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/mm/memdelay.c b/mm/memdelay.c
new file mode 100644
index 000000000000..337a6bca9ee8
--- /dev/null
+++ b/mm/memdelay.c
@@ -0,0 +1,289 @@
+/*
+ * Memory delay metric
+ *
+ * Copyright (c) 2017 Facebook, Johannes Weiner
+ *
+ * This code quantifies and reports to userspace the wall-time impact
+ * of memory pressure on the system and memory-controlled cgroups.
+ */
+
+#include <linux/sched/loadavg.h>
+#include <linux/memcontrol.h>
+#include <linux/memdelay.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+
+static DEFINE_PER_CPU(struct memdelay_domain_cpu, global_domain_cpus);
+
+/* System-level keeping of memory delay statistics */
+struct memdelay_domain memdelay_global_domain = {
+	.mdcs = &global_domain_cpus,
+};
+
+static void domain_init(struct memdelay_domain *md)
+{
+	int cpu;
+
+	md->period_expires = jiffies + LOAD_FREQ;
+	for_each_possible_cpu(cpu) {
+		struct memdelay_domain_cpu *mdc;
+
+		mdc = per_cpu_ptr(md->mdcs, cpu);
+		spin_lock_init(&mdc->lock);
+	}
+}
+
+/**
+ * memdelay_init - initialize the memdelay subsystem
+ *
+ * This needs to run before the scheduler starts queuing and
+ * scheduling tasks.
+ */
+void __init memdelay_init(void)
+{
+	domain_init(&memdelay_global_domain);
+}
+
+static void domain_move_clock(struct memdelay_domain *md)
+{
+	unsigned long expires = READ_ONCE(md->period_expires);
+	unsigned long none, some, full;
+	int missed_periods;
+	unsigned long next;
+	int i;
+
+	if (time_before(jiffies, expires))
+		return;
+
+	missed_periods = 1 + (jiffies - expires) / LOAD_FREQ;
+	next = expires + (missed_periods * LOAD_FREQ);
+
+	if (cmpxchg(&md->period_expires, expires, next) != expires)
+		return;
+
+	none = xchg(&md->times[MDS_NONE], 0);
+	some = xchg(&md->times[MDS_SOME], 0);
+	full = xchg(&md->times[MDS_FULL], 0);
+
+	for (i = 0; i < missed_periods; i++) {
+		unsigned long pct;
+
+		pct = some * 100 / max(none + some + full, 1UL);
+		pct *= FIXED_1;
+		CALC_LOAD(md->avg_some[0], EXP_1, pct);
+		CALC_LOAD(md->avg_some[1], EXP_5, pct);
+		CALC_LOAD(md->avg_some[2], EXP_15, pct);
+
+		pct = full * 100 / max(none + some + full, 1UL);
+		pct *= FIXED_1;
+		CALC_LOAD(md->avg_full[0], EXP_1, pct);
+		CALC_LOAD(md->avg_full[1], EXP_5, pct);
+		CALC_LOAD(md->avg_full[2], EXP_15, pct);
+
+		none = some = full = 0;
+	}
+}
+
+static void domain_cpu_update(struct memdelay_domain *md, int cpu,
+			      int old, int new)
+{
+	enum memdelay_domain_state state;
+	struct memdelay_domain_cpu *mdc;
+	unsigned long now, delta;
+	unsigned long flags;
+
+	mdc = per_cpu_ptr(md->mdcs, cpu);
+	spin_lock_irqsave(&mdc->lock, flags);
+
+	if (old) {
+		WARN_ONCE(!mdc->tasks[old], "cpu=%d old=%d new=%d counter=%d\n",
+			  cpu, old, new, mdc->tasks[old]);
+		mdc->tasks[old] -= 1;
+	}
+	if (new)
+		mdc->tasks[new] += 1;
+
+	/*
+	 * The domain is somewhat delayed when a number of tasks are
+	 * delayed but there are still others running the workload.
+	 *
+	 * The domain is fully delayed when all non-idle tasks on the
+	 * CPU are delayed, or when a delayed task is actively running
+	 * and preventing productive tasks from making headway.
+	 *
+	 * The state times then add up over all CPUs in the domain: if
+	 * the domain is fully blocked on one CPU and there is another
+	 * one running the workload, the domain is considered fully
+	 * blocked 50% of the time.
+	 */
+	if (!mdc->tasks[MTS_DELAYED_ACTIVE] && !mdc->tasks[MTS_DELAYED])
+		state = MDS_NONE;
+	else if (mdc->tasks[MTS_WORKING])
+		state = MDS_SOME;
+	else
+		state = MDS_FULL;
+
+	if (mdc->state == state)
+		goto unlock;
+
+	now = ktime_to_ns(ktime_get());
+	delta = now - mdc->state_start;
+
+	domain_move_clock(md);
+	md->times[mdc->state] += delta;
+
+	mdc->state = state;
+	mdc->state_start = now;
+unlock:
+	spin_unlock_irqrestore(&mdc->lock, flags);
+}
+
+static struct memdelay_domain *memcg_domain(struct mem_cgroup *memcg)
+{
+#ifdef CONFIG_MEMCG
+	if (!mem_cgroup_disabled())
+		return memcg->memdelay_domain;
+#endif
+	return &memdelay_global_domain;
+}
+
+/**
+ * memdelay_task_change - note a task changing its delay/work state
+ * @task: the task changing state
+ * @old: the state the task is leaving (enum memdelay_task_state)
+ * @new: the state the task is entering (enum memdelay_task_state),
+ *       MTS_NONE when it is no longer tracked
+ *
+ * Updates the task's domain counters to reflect a change in the
+ * task's delayed/working state.
+ */
+void memdelay_task_change(struct task_struct *task, int old, int new)
+{
+	int cpu = task_cpu(task);
+	struct mem_cgroup *memcg;
+	unsigned long delay = 0;
+
+#ifdef CONFIG_DEBUG_VM
+	WARN_ONCE(task->memdelay_state != old,
+		  "cpu=%d task=%p state=%d (in_iowait=%d PF_MEMDELAYED=%d) old=%d new=%d\n",
+		  cpu, task, task->memdelay_state, task->in_iowait,
+		  !!(task->flags & PF_MEMDELAY), old, new);
+	task->memdelay_state = new;
+#endif
+
+	/* Account when tasks are entering and leaving delays */
+	if (old < MTS_DELAYED && new >= MTS_DELAYED) {
+		task->memdelay_start = ktime_to_ms(ktime_get());
+	} else if (old >= MTS_DELAYED && new < MTS_DELAYED) {
+		delay = ktime_to_ms(ktime_get()) - task->memdelay_start;
+		task->memdelay_total += delay;
+	}
+
+	/* Account domain state changes */
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(task);
+	do {
+		struct memdelay_domain *md;
+
+		md = memcg_domain(memcg);
+		md->aggregate += delay;
+		domain_cpu_update(md, cpu, old, new);
+	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
+	rcu_read_unlock();
+}
+
+/**
+ * memdelay_domain_alloc - allocate a cgroup memory delay domain
+ */
+struct memdelay_domain *memdelay_domain_alloc(void)
+{
+	struct memdelay_domain *md;
+
+	md = kzalloc(sizeof(*md), GFP_KERNEL);
+	if (!md)
+		return NULL;
+	md->mdcs = alloc_percpu(struct memdelay_domain_cpu);
+	if (!md->mdcs) {
+		kfree(md);
+		return NULL;
+	}
+	domain_init(md);
+	return md;
+}
+
+/**
+ * memdelay_domain_free - free a cgroup memory delay domain
+ */
+void memdelay_domain_free(struct memdelay_domain *md)
+{
+	if (md) {
+		free_percpu(md->mdcs);
+		kfree(md);
+	}
+}
+
+/**
+ * memdelay_domain_show - format memory delay domain stats to a seq_file
+ * @s: the seq_file
+ * @md: the memory domain
+ */
+int memdelay_domain_show(struct seq_file *s, struct memdelay_domain *md)
+{
+	int cpu;
+
+	domain_move_clock(md);
+
+	seq_printf(s, "%lu\n", md->aggregate);
+
+	seq_printf(s, "%lu.%02lu %lu.%02lu %lu.%02lu\n",
+		   LOAD_INT(md->avg_some[0]), LOAD_FRAC(md->avg_some[0]),
+		   LOAD_INT(md->avg_some[1]), LOAD_FRAC(md->avg_some[1]),
+		   LOAD_INT(md->avg_some[2]), LOAD_FRAC(md->avg_some[2]));
+
+	seq_printf(s, "%lu.%02lu %lu.%02lu %lu.%02lu\n",
+		   LOAD_INT(md->avg_full[0]), LOAD_FRAC(md->avg_full[0]),
+		   LOAD_INT(md->avg_full[1]), LOAD_FRAC(md->avg_full[1]),
+		   LOAD_INT(md->avg_full[2]), LOAD_FRAC(md->avg_full[2]));
+
+#ifdef CONFIG_DEBUG_VM
+	for_each_online_cpu(cpu) {
+		struct memdelay_domain_cpu *mdc;
+
+		mdc = per_cpu_ptr(md->mdcs, cpu);
+		seq_printf(s, "%d %d %d\n",
+			   mdc->tasks[MTS_WORKING],
+			   mdc->tasks[MTS_DELAYED],
+			   mdc->tasks[MTS_DELAYED_ACTIVE]);
+	}
+#endif
+
+	return 0;
+}
+
+static int memdelay_show(struct seq_file *m, void *v)
+{
+	return memdelay_domain_show(m, &memdelay_global_domain);
+}
+
+static int memdelay_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, memdelay_show, NULL);
+}
+
+static const struct file_operations memdelay_fops = {
+	.open           = memdelay_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = single_release,
+};
+
+static int __init memdelay_proc_init(void)
+{
+	proc_create("memdelay", 0, NULL, &memdelay_fops);
+	return 0;
+}
+module_init(memdelay_proc_init);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2302f250d6b1..bec5e96f3b88 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -66,6 +66,7 @@
 #include <linux/kthread.h>
 #include <linux/memcontrol.h>
 #include <linux/ftrace.h>
+#include <linux/memdelay.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -3293,16 +3294,19 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
 		enum compact_priority prio, enum compact_result *compact_result)
 {
-	struct page *page;
 	unsigned int noreclaim_flag;
+	unsigned long mdflags;
+	struct page *page;
 
 	if (!order)
 		return NULL;
 
+	memdelay_enter(&mdflags);
 	noreclaim_flag = memalloc_noreclaim_save();
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
 									prio);
 	memalloc_noreclaim_restore(noreclaim_flag);
+	memdelay_leave(&mdflags);
 
 	if (*compact_result <= COMPACT_INACTIVE)
 		return NULL;
@@ -3448,13 +3452,15 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 					const struct alloc_context *ac)
 {
 	struct reclaim_state reclaim_state;
-	int progress;
 	unsigned int noreclaim_flag;
+	unsigned long mdflags;
+	int progress;
 
 	cond_resched();
 
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
+	memdelay_enter(&mdflags);
 	noreclaim_flag = memalloc_noreclaim_save();
 	lockdep_set_current_reclaim_state(gfp_mask);
 	reclaim_state.reclaimed_slab = 0;
@@ -3466,6 +3472,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 	current->reclaim_state = NULL;
 	lockdep_clear_current_reclaim_state();
 	memalloc_noreclaim_restore(noreclaim_flag);
+	memdelay_leave(&mdflags);
 
 	cond_resched();
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 285db147d013..f44651b49670 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,7 @@
 #include <linux/prefetch.h>
 #include <linux/printk.h>
 #include <linux/dax.h>
+#include <linux/memdelay.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -3045,6 +3046,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 {
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
+	unsigned long mdflags;
 	int nid;
 	unsigned int noreclaim_flag;
 	struct scan_control sc = {
@@ -3073,9 +3075,11 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					    sc.gfp_mask,
 					    sc.reclaim_idx);
 
+	memdelay_enter(&mdflags);
 	noreclaim_flag = memalloc_noreclaim_save();
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 	memalloc_noreclaim_restore(noreclaim_flag);
+	memdelay_leave(&mdflags);
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
@@ -3497,6 +3501,7 @@ static int kswapd(void *p)
 	pgdat->kswapd_order = 0;
 	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	for ( ; ; ) {
+		unsigned long mdflags;
 		bool ret;
 
 		alloc_order = reclaim_order = pgdat->kswapd_order;
@@ -3533,7 +3538,11 @@ static int kswapd(void *p)
 		 */
 		trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
 						alloc_order);
+
+		memdelay_enter(&mdflags);
 		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
+		memdelay_leave(&mdflags);
+
 		if (reclaim_order < alloc_order)
 			goto kswapd_try_sleep;
 	}
-- 
2.13.3
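
For readers following the averaging in domain_move_clock() above, here
is a minimal userspace sketch of the same fixed-point decay. The
constants are assumed to match include/linux/sched/loadavg.h (FSHIFT
of 11, EXP_1/EXP_5/EXP_15 of 1884/2014/2037); the function names are
hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Assumed to match the kernel's load average fixed-point constants. */
#define FSHIFT	11
#define FIXED_1	(1UL << FSHIFT)
#define EXP_1	1884UL
#define EXP_5	2014UL
#define EXP_15	2037UL

/* One decay step, equivalent in spirit to CALC_LOAD(avg, exp, sample). */
static unsigned long decay(unsigned long avg, unsigned long exp,
			   unsigned long sample)
{
	avg *= exp;
	avg += sample * (FIXED_1 - exp);
	return avg >> FSHIFT;
}

/* Share of a period spent in one state, as a fixed-point percentage. */
static unsigned long state_pct(unsigned long state_time,
			       unsigned long total_time)
{
	return state_time * 100 / (total_time ? total_time : 1) * FIXED_1;
}

/* Integer and two-digit fractional parts, as printed by the patch. */
static unsigned long fixed_int(unsigned long x)
{
	return x >> FSHIFT;
}

static unsigned long fixed_frac(unsigned long x)
{
	return ((x & (FIXED_1 - 1)) * 100) >> FSHIFT;
}
```

A domain that spends half of one period in a state and starts from a
zero average moves the 1m average to about 4%; feeding the same sample
every period converges on 50%.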

* [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
@ 2017-07-27 15:30   ` Johannes Weiner
From: Johannes Weiner @ 2017-07-27 15:30 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Mel Gorman
  Cc: linux-mm, linux-kernel, kernel-team

Linux doesn't have a useful metric to describe the memory health of a
system, a cgroup container, or individual tasks.

When workloads are bigger than available memory, they spend a certain
amount of their time inside page reclaim, waiting on thrashing cache,
and swapping in. This has an impact on latency, and depending on the CPU
capacity in the system can also translate to a decrease in throughput.

While Linux exports some stats and counters for these events, it does
not quantify the true impact they have on throughput and latency. How
much of the execution time is spent unproductively? This is important
to know when sizing workloads to systems and containers. It also comes
in handy when evaluating the effectiveness and efficiency of the
kernel's memory management policies and heuristics.

This patch implements a metric that quantifies memory pressure in a
unit that matters most to applications and does not rely on hardware
aspects to be meaningful: wallclock time lost while waiting on memory.

Whenever a task is blocked on refaults, swapins, or direct reclaim,
the time it spends is accounted on the task level and aggregated into
a domain state along with other tasks on the system and cgroup level.

Each task has a /proc/<pid>/memdelay file that lists the microseconds
the task has been delayed since it's been forked. That file can be
sampled periodically for recent delays, or before and after certain
operations to measure their memory-related latencies.
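
A minimal userspace sketch of the before/after sampling, assuming a
kernel with this patch applied and a file that holds a single decimal
counter followed by a newline. The helper names are hypothetical. The
counter is treated as opaque: its unit is whatever the kernel side
accounts.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse the single decimal counter in the file. */
static int memdelay_parse_total(const char *text, unsigned long long *total)
{
	char *end;

	assert(text && total);
	errno = 0;
	*total = strtoull(text, &end, 10);
	if (errno || end == text)
		return -1;
	return 0;
}

/* Hypothetical helper: read the counter, e.g. from /proc/self/memdelay. */
static int memdelay_read_total(const char *path, unsigned long long *total)
{
	char buf[64];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;	/* e.g. a kernel without this patch */
	if (fgets(buf, sizeof(buf), f))
		ret = memdelay_parse_total(buf, total);
	fclose(f);
	return ret;
}

/* Delay accumulated between two samples of the same task. */
static unsigned long long memdelay_delta(unsigned long long before,
					 unsigned long long after)
{
	return after - before;
}
```

A caller would read the file once before the operation of interest,
once after it, and report memdelay_delta() of the two values.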

On the system and cgroup-level, there are /proc/memdelay and
memory.memdelay, respectively, and their format is as such:

$ cat /proc/memdelay
2489084
41.61 47.28 29.66
0.00 0.00 0.00

The first line shows the cumulative delay times of all tasks in the
domain - in this case, all tasks in the system cumulatively lost 2.49
seconds due to memory delays.
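
The three-line output shown above could be parsed from userspace along
these lines. This is a sketch only: the struct and function names are
hypothetical, and it assumes the layout stays exactly as in the
example. Any per-CPU lines that a CONFIG_DEBUG_VM kernel appends are
ignored.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one sample of the memdelay file. */
struct memdelay_sample {
	unsigned long long aggregate;	/* first line */
	double some[3];			/* second line: 1m, 5m, 15m */
	double full[3];			/* third line: 1m, 5m, 15m */
};

/* Returns 0 when all seven fields were found, -1 otherwise. */
static int memdelay_parse(const char *text, struct memdelay_sample *s)
{
	int n;

	assert(text && s);
	/* Whitespace in the format also matches the newlines. */
	n = sscanf(text, "%llu %lf %lf %lf %lf %lf %lf",
		   &s->aggregate,
		   &s->some[0], &s->some[1], &s->some[2],
		   &s->full[0], &s->full[1], &s->full[2]);
	return n == 7 ? 0 : -1;
}
```

The text would typically come from reading /proc/memdelay or a
cgroup's memory.memdelay into a buffer first.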

The second and third lines show percentages spent in aggregate states
for the domain - system or cgroup - in a load average type format as
decaying averages over the last 1m, 5m, and 15m:

The second line indicates the share of wall-time the domain spends in
a state where SOME tasks are delayed by memory while others are still
productive (runnable or iowait). This indicates a latency problem for
individual tasks, but since the CPU/IO capacity is still used, adding
more memory might not necessarily improve the domain's throughput.

The third line indicates the share of wall-time the domain spends in a
state where ALL non-idle tasks are delayed by memory. In this state,
the domain is entirely unproductive due to a lack of memory.
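
The SOME/FULL distinction can be illustrated with a small standalone
sketch that mirrors the per-CPU condition in domain_cpu_update() of
this patch. The enum names are copied from the patch for readability;
the classify() helper is hypothetical and runs in userspace, not in
the kernel.

```c
#include <assert.h>

enum memdelay_task_state {
	MTS_NONE,		/* Idle/unqueued/untracked */
	MTS_WORKING,		/* Runnable or waiting for IO */
	MTS_DELAYED,		/* Memory delayed, not running */
	MTS_DELAYED_ACTIVE,	/* Memory delayed, actively running */
	NR_MEMDELAY_TASK_STATES,
};

enum memdelay_domain_state {
	MDS_NONE,		/* No delayed tasks */
	MDS_SOME,		/* Delayed tasks, working tasks */
	MDS_FULL,		/* Delayed tasks, no working tasks */
};

/* Derive the state of one CPU from its per-state task counts. */
static enum memdelay_domain_state
classify(const int tasks[NR_MEMDELAY_TASK_STATES])
{
	assert(tasks);
	if (!tasks[MTS_DELAYED_ACTIVE] && !tasks[MTS_DELAYED])
		return MDS_NONE;
	if (tasks[MTS_WORKING])
		return MDS_SOME;
	return MDS_FULL;
}
```

With these rules, one delayed task next to one working task on a CPU
yields SOME, while a CPU whose only tracked task is delayed yields
FULL.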

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/proc/array.c            |   8 ++
 fs/proc/base.c             |   2 +
 fs/proc/internal.h         |   2 +
 include/linux/cgroup.h     |  14 +++
 include/linux/memcontrol.h |  14 +++
 include/linux/memdelay.h   | 174 +++++++++++++++++++++++++++
 include/linux/sched.h      |  10 +-
 kernel/cgroup/cgroup.c     |   4 +-
 kernel/fork.c              |   4 +
 kernel/sched/Makefile      |   2 +-
 kernel/sched/core.c        |  20 ++++
 kernel/sched/memdelay.c    | 112 ++++++++++++++++++
 mm/Makefile                |   2 +-
 mm/compaction.c            |   4 +
 mm/filemap.c               |   9 ++
 mm/memcontrol.c            |  25 ++++
 mm/memdelay.c              | 289 +++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |  11 +-
 mm/vmscan.c                |   9 ++
 19 files changed, 709 insertions(+), 6 deletions(-)
 create mode 100644 include/linux/memdelay.h
 create mode 100644 kernel/sched/memdelay.c
 create mode 100644 mm/memdelay.c

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 88c355574aa0..00e0e9aa3e70 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -611,6 +611,14 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
 	return 0;
 }
 
+int proc_pid_memdelay(struct seq_file *m, struct pid_namespace *ns,
+		      struct pid *pid, struct task_struct *task)
+{
+	seq_put_decimal_ull(m, "", task->memdelay_total);
+	seq_putc(m, '\n');
+	return 0;
+}
+
 #ifdef CONFIG_PROC_CHILDREN
 static struct pid *
 get_children_pid(struct inode *inode, struct pid *pid_prev, loff_t pos)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index f1e1927ccd48..cd653729b0c6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2873,6 +2873,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("cmdline",    S_IRUGO, proc_pid_cmdline_ops),
 	ONE("stat",       S_IRUGO, proc_tgid_stat),
 	ONE("statm",      S_IRUGO, proc_pid_statm),
+	ONE("memdelay",   S_IRUGO, proc_pid_memdelay),
 	REG("maps",       S_IRUGO, proc_pid_maps_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
@@ -3263,6 +3264,7 @@ static const struct pid_entry tid_base_stuff[] = {
 	REG("cmdline",   S_IRUGO, proc_pid_cmdline_ops),
 	ONE("stat",      S_IRUGO, proc_tid_stat),
 	ONE("statm",     S_IRUGO, proc_pid_statm),
+	ONE("memdelay",  S_IRUGO, proc_pid_memdelay),
 	REG("maps",      S_IRUGO, proc_tid_maps_operations),
 #ifdef CONFIG_PROC_CHILDREN
 	REG("children",  S_IRUGO, proc_tid_children_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c5ae09b6c726..49eba8f0cc7c 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -146,6 +146,8 @@ extern int proc_pid_status(struct seq_file *, struct pid_namespace *,
 			   struct pid *, struct task_struct *);
 extern int proc_pid_statm(struct seq_file *, struct pid_namespace *,
 			  struct pid *, struct task_struct *);
+extern int proc_pid_memdelay(struct seq_file *, struct pid_namespace *,
+			     struct pid *, struct task_struct *);
 
 /*
  * base.c
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 710a005c6b7a..7283439043d9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -102,6 +102,17 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry);
 int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 		     struct pid *pid, struct task_struct *tsk);
 
+/* caller must have irqs disabled */
+static inline void lock_task_cgroup(struct task_struct *p)
+{
+	spin_lock(&p->cgroups_lock);
+}
+
+static inline void unlock_task_cgroup(struct task_struct *p)
+{
+	spin_unlock(&p->cgroups_lock);
+}
+
 void cgroup_fork(struct task_struct *p);
 extern int cgroup_can_fork(struct task_struct *p);
 extern void cgroup_cancel_fork(struct task_struct *p);
@@ -620,6 +631,9 @@ static inline int cgroup_attach_task_all(struct task_struct *from,
 static inline int cgroupstats_build(struct cgroupstats *stats,
 				    struct dentry *dentry) { return -EINVAL; }
 
+static inline void lock_task_cgroup(struct task_struct *p) {}
+static inline void unlock_task_cgroup(struct task_struct *p) {}
+
 static inline void cgroup_fork(struct task_struct *p) {}
 static inline int cgroup_can_fork(struct task_struct *p) { return 0; }
 static inline void cgroup_cancel_fork(struct task_struct *p) {}
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 899949bbb2f9..579a28e84f3b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,6 +29,7 @@
 #include <linux/mmzone.h>
 #include <linux/writeback.h>
 #include <linux/page-flags.h>
+#include <linux/memdelay.h>
 
 struct mem_cgroup;
 struct page;
@@ -179,6 +180,9 @@ struct mem_cgroup {
 
 	unsigned long soft_limit;
 
+	/* Memory delay measurement domain */
+	struct memdelay_domain *memdelay_domain;
+
 	/* vmpressure notifications */
 	struct vmpressure vmpressure;
 
@@ -632,6 +636,11 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &pgdat->lruvec;
 }
 
+static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
+{
+	return NULL;
+}
+
 static inline bool mm_match_cgroup(struct mm_struct *mm,
 		struct mem_cgroup *memcg)
 {
@@ -644,6 +653,11 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
 	return true;
 }
 
+static inline struct mem_cgroup *mem_cgroup_from_task(struct task_struct *task)
+{
+	return NULL;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
diff --git a/include/linux/memdelay.h b/include/linux/memdelay.h
new file mode 100644
index 000000000000..7187fdb49204
--- /dev/null
+++ b/include/linux/memdelay.h
@@ -0,0 +1,174 @@
+#ifndef _LINUX_MEMDELAY_H
+#define _LINUX_MEMDELAY_H
+
+#include <linux/spinlock_types.h>
+#include <linux/sched.h>
+
+struct seq_file;
+struct css_set;
+
+/*
+ * Task productivity states tracked by the scheduler
+ */
+enum memdelay_task_state {
+	MTS_NONE,		/* Idle/unqueued/untracked */
+	MTS_WORKING,		/* Runnable or waiting for IO */
+	MTS_DELAYED,		/* Memory delayed, not running */
+	MTS_DELAYED_ACTIVE,	/* Memory delayed, actively running */
+	NR_MEMDELAY_TASK_STATES,
+};
+
+/*
+ * System/cgroup delay state tracked by the VM, composed of the
+ * productivity states of all tasks inside the domain.
+ */
+enum memdelay_domain_state {
+	MDS_NONE,		/* No delayed tasks */
+	MDS_SOME,		/* Delayed tasks, working tasks */
+	MDS_FULL,		/* Delayed tasks, no working tasks */
+	NR_MEMDELAY_DOMAIN_STATES,
+};
+
+struct memdelay_domain_cpu {
+	spinlock_t lock;
+
+	/* Task states of the domain on this CPU */
+	int tasks[NR_MEMDELAY_TASK_STATES];
+
+	/* Delay state of the domain on this CPU */
+	enum memdelay_domain_state state;
+
+	/* Time of last state change */
+	unsigned long state_start;
+};
+
+struct memdelay_domain {
+	/* Aggregate delayed time of all domain tasks */
+	unsigned long aggregate;
+
+	/* Per-CPU delay states in the domain */
+	struct memdelay_domain_cpu __percpu *mdcs;
+
+	/* Cumulative state times from all CPUs */
+	unsigned long times[NR_MEMDELAY_DOMAIN_STATES];
+
+	/* Decaying state time averages over 1m, 5m, 15m */
+	unsigned long period_expires;
+	unsigned long avg_full[3];
+	unsigned long avg_some[3];
+};
+
+/* mm/memdelay.c */
+extern struct memdelay_domain memdelay_global_domain;
+void memdelay_init(void);
+void memdelay_task_change(struct task_struct *task, int old, int new);
+struct memdelay_domain *memdelay_domain_alloc(void);
+void memdelay_domain_free(struct memdelay_domain *md);
+int memdelay_domain_show(struct seq_file *s, struct memdelay_domain *md);
+
+/* kernel/sched/memdelay.c */
+void memdelay_enter(unsigned long *flags);
+void memdelay_leave(unsigned long *flags);
+
+/**
+ * memdelay_schedule - note a context switch
+ * @prev: task scheduling out
+ * @next: task scheduling in
+ *
+ * A task switch doesn't affect the balance between delayed and
+ * productive tasks, but we have to update whether the delay is
+ * actively using the CPU or not.
+ */
+static inline void memdelay_schedule(struct task_struct *prev,
+				     struct task_struct *next)
+{
+	if (prev->flags & PF_MEMDELAY)
+		memdelay_task_change(prev, MTS_DELAYED_ACTIVE, MTS_DELAYED);
+
+	if (next->flags & PF_MEMDELAY)
+		memdelay_task_change(next, MTS_DELAYED, MTS_DELAYED_ACTIVE);
+}
+
+/**
+ * memdelay_wakeup - note a task waking up
+ * @task: the task
+ *
+ * Notes an idle task becoming productive. Delayed tasks remain
+ * delayed even when they become runnable; tasks in iowait are
+ * considered productive.
+ */
+static inline void memdelay_wakeup(struct task_struct *task)
+{
+	if (task->flags & PF_MEMDELAY || task->in_iowait)
+		return;
+
+	memdelay_task_change(task, MTS_NONE, MTS_WORKING);
+}
+
+/**
+ * memdelay_sleep - note a task going to sleep
+ * @task: the task
+ *
+ * Notes a working task becoming unproductive. Delayed tasks remain
+ * delayed; tasks sleeping in an iowait remain productive.
+ */
+static inline void memdelay_sleep(struct task_struct *task)
+{
+	if (task->flags & PF_MEMDELAY || task->in_iowait)
+		return;
+
+	memdelay_task_change(task, MTS_WORKING, MTS_NONE);
+}
+
+/**
+ * memdelay_del_add - track task movement between runqueues
+ * @task: the task
+ * @runnable: a runnable task is moved if %true, unqueued otherwise
+ * @add: task is being added if %true, removed otherwise
+ *
+ * Update the memdelay domain per-cpu states as tasks are being moved
+ * around the runqueues.
+ */
+static inline void memdelay_del_add(struct task_struct *task,
+				    bool runnable, bool add)
+{
+	int state;
+
+	if (task->flags & PF_MEMDELAY)
+		state = MTS_DELAYED;
+	else if (runnable || task->in_iowait)
+		state = MTS_WORKING;
+	else
+		return; /* already MTS_NONE */
+
+	if (add)
+		memdelay_task_change(task, MTS_NONE, state);
+	else
+		memdelay_task_change(task, state, MTS_NONE);
+}
+
+static inline void memdelay_del_runnable(struct task_struct *task)
+{
+	memdelay_del_add(task, true, false);
+}
+
+static inline void memdelay_add_runnable(struct task_struct *task)
+{
+	memdelay_del_add(task, true, true);
+}
+
+static inline void memdelay_del_sleeping(struct task_struct *task)
+{
+	memdelay_del_add(task, false, false);
+}
+
+static inline void memdelay_add_sleeping(struct task_struct *task)
+{
+	memdelay_del_add(task, false, true);
+}
+
+#ifdef CONFIG_CGROUPS
+void cgroup_move_task(struct task_struct *task, struct css_set *to);
+#endif
+
+#endif /* _LINUX_MEMDELAY_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2b69fc650201..c5da04c260e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -837,6 +837,12 @@ struct task_struct {
 
 	struct io_context		*io_context;
 
+	unsigned long			memdelay_start;
+	unsigned long			memdelay_total;
+#ifdef CONFIG_DEBUG_VM
+	int				memdelay_state;
+#endif
+
 	/* Ptrace state: */
 	unsigned long			ptrace_message;
 	siginfo_t			*last_siginfo;
@@ -859,7 +865,8 @@ struct task_struct {
 	int				cpuset_slab_spread_rotor;
 #endif
 #ifdef CONFIG_CGROUPS
-	/* Control Group info protected by css_set_lock: */
+	spinlock_t			cgroups_lock;
+	/* Control Group info protected by cgroups_lock: */
 	struct css_set __rcu		*cgroups;
 	/* cg_list protected by css_set_lock and tsk->alloc_lock: */
 	struct list_head		cg_list;
@@ -1231,6 +1238,7 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+#define PF_MEMDELAY		0x01000000	/* Delayed due to lack of memory */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_allowed */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MUTEX_TESTER		0x20000000	/* Thread belongs to the rt mutex tester */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8d4e85eae42c..f442e16911bc 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -694,7 +694,8 @@ static void css_set_move_task(struct task_struct *task,
 		 */
 		WARN_ON_ONCE(task->flags & PF_EXITING);
 
-		rcu_assign_pointer(task->cgroups, to_cset);
+		cgroup_move_task(task, to_cset);
+
 		list_add_tail(&task->cg_list, use_mg_tasks ? &to_cset->mg_tasks :
 							     &to_cset->tasks);
 	}
@@ -4693,6 +4694,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
  */
 void cgroup_fork(struct task_struct *child)
 {
+	spin_lock_init(&child->cgroups_lock);
 	RCU_INIT_POINTER(child->cgroups, &init_css_set);
 	INIT_LIST_HEAD(&child->cg_list);
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index e53770d2bf95..73b8dae7b34e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1201,6 +1201,10 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	int retval;
 
 	tsk->min_flt = tsk->maj_flt = 0;
+	tsk->memdelay_total = 0;
+#ifdef CONFIG_DEBUG_VM
+	tsk->memdelay_state = 0;
+#endif
 	tsk->nvcsw = tsk->nivcsw = 0;
 #ifdef CONFIG_DETECT_HUNG_TASK
 	tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw;
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 89ab6758667b..5efb0fddc3d3 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -17,7 +17,7 @@ endif
 
 obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o
-obj-y += wait.o swait.o completion.o idle.o
+obj-y += wait.o swait.o completion.o idle.o memdelay.o
 obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 326d4f88e2b1..a90399a5473f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -25,6 +25,7 @@
 #include <linux/profile.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/memdelay.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -758,6 +759,11 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & ENQUEUE_RESTORE))
 		sched_info_queued(rq, p);
 
+	if (flags & ENQUEUE_WAKEUP)
+		memdelay_wakeup(p);
+	else
+		memdelay_add_runnable(p);
+
 	p->sched_class->enqueue_task(rq, p, flags);
 }
 
@@ -769,6 +775,11 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & DEQUEUE_SAVE))
 		sched_info_dequeued(rq, p);
 
+	if (flags & DEQUEUE_SLEEP)
+		memdelay_sleep(p);
+	else
+		memdelay_del_runnable(p);
+
 	p->sched_class->dequeue_task(rq, p, flags);
 }
 
@@ -2053,7 +2064,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
+
+		memdelay_del_sleeping(p);
+
 		set_task_cpu(p, cpu);
+
+		memdelay_add_sleeping(p);
 	}
 
 #else /* CONFIG_SMP */
@@ -3434,6 +3450,8 @@ static void __sched notrace __schedule(bool preempt)
 		rq->curr = next;
 		++*switch_count;
 
+		memdelay_schedule(prev, next);
+
 		trace_sched_switch(preempt, prev, next);
 
 		/* Also unlocks the rq: */
@@ -6210,6 +6228,8 @@ void __init sched_init(void)
 
 	init_schedstats();
 
+	memdelay_init();
+
 	scheduler_running = 1;
 }
 
diff --git a/kernel/sched/memdelay.c b/kernel/sched/memdelay.c
new file mode 100644
index 000000000000..971f45a0b946
--- /dev/null
+++ b/kernel/sched/memdelay.c
@@ -0,0 +1,112 @@
+/*
+ * Memory delay metric
+ *
+ * Copyright (c) 2017 Facebook, Johannes Weiner
+ *
+ * This code quantifies and reports to userspace the wall-time impact
+ * of memory pressure on the system and memory-controlled cgroups.
+ */
+
+#include <linux/memdelay.h>
+#include <linux/cgroup.h>
+#include <linux/sched.h>
+
+#include "sched.h"
+
+/**
+ * memdelay_enter - mark the beginning of a memory delay section
+ * @flags: flags to handle nested memdelay sections
+ *
+ * Marks the calling task as being delayed due to a lack of memory,
+ * such as waiting for a workingset refault or performing reclaim.
+ */
+void memdelay_enter(unsigned long *flags)
+{
+	*flags = current->flags & PF_MEMDELAY;
+	if (*flags)
+		return;
+	/*
+	 * PF_MEMDELAY & accounting needs to be atomic wrt changes to
+	 * the task's scheduling state (hence IRQ disabling) and its
+	 * domain association (hence lock_task_cgroup). Otherwise we
+	 * could race with CPU or cgroup migration and misaccount.
+	 */
+	WARN_ON_ONCE(irqs_disabled());
+	local_irq_disable();
+	lock_task_cgroup(current);
+
+	current->flags |= PF_MEMDELAY;
+	memdelay_task_change(current, MTS_WORKING, MTS_DELAYED_ACTIVE);
+
+	unlock_task_cgroup(current);
+	local_irq_enable();
+}
+
+/**
+ * memdelay_leave - mark the end of a memory delay section
+ * @flags: flags to handle nested memdelay sections
+ *
+ * Marks the calling task as no longer delayed due to memory.
+ */
+void memdelay_leave(unsigned long *flags)
+{
+	if (*flags)
+		return;
+	/*
+	 * PF_MEMDELAY & accounting needs to be atomic wrt changes to
+	 * the task's scheduling state (hence IRQ disabling) and its
+	 * domain association (hence lock_task_cgroup). Otherwise we
+	 * could race with CPU or cgroup migration and misaccount.
+	 */
+	WARN_ON_ONCE(irqs_disabled());
+	local_irq_disable();
+	lock_task_cgroup(current);
+
+	current->flags &= ~PF_MEMDELAY;
+	memdelay_task_change(current, MTS_DELAYED_ACTIVE, MTS_WORKING);
+
+	unlock_task_cgroup(current);
+	local_irq_enable();
+}
+
+#ifdef CONFIG_CGROUPS
+/**
+ * cgroup_move_task - move task to a different cgroup
+ * @task: the task
+ * @to: the target css_set
+ *
+ * Move task to a new cgroup and safely migrate its associated
+ * delayed/working state between the different domains.
+ *
+ * This function acquires the task's rq lock and lock_task_cgroup() to
+ * lock out concurrent changes to the task's scheduling state and - in
+ * case the task is running - concurrent changes to its delay state.
+ */
+void cgroup_move_task(struct task_struct *task, struct css_set *to)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+	int state;
+
+	lock_task_cgroup(task);
+	rq = task_rq_lock(task, &rf);
+
+	if (task->flags & PF_MEMDELAY)
+		state = MTS_DELAYED + task_current(rq, task);
+	else if (task_on_rq_queued(task) || task->in_iowait)
+		state = MTS_WORKING;
+	else
+		state = MTS_NONE;
+
+	/*
+	 * Lame to do this here, but the scheduler cannot be locked
+	 * from the outside, so we move cgroups from inside sched/.
+	 */
+	memdelay_task_change(task, state, MTS_NONE);
+	rcu_assign_pointer(task->cgroups, to);
+	memdelay_task_change(task, MTS_NONE, state);
+
+	task_rq_unlock(rq, task, &rf);
+	unlock_task_cgroup(task);
+}
+#endif /* CONFIG_CGROUPS */
diff --git a/mm/Makefile b/mm/Makefile
index 026f6a828a50..ac020693031d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,7 +39,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o vmacache.o swap_slots.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o $(mmu-y)
+			   memdelay.o debug.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 613c59e928cb..d4b81318d1d7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2041,11 +2041,15 @@ static int kcompactd(void *p)
 	pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
 
 	while (!kthread_should_stop()) {
+		unsigned long mdflags;
+
 		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
 		wait_event_freezable(pgdat->kcompactd_wait,
 				kcompactd_work_requested(pgdat));
 
+		memdelay_enter(&mdflags);
 		kcompactd_do_work(pgdat);
+		memdelay_leave(&mdflags);
 	}
 
 	return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index 5c592e925805..12869768e2e4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -36,6 +36,7 @@
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
 #include <linux/rmap.h>
+#include <linux/memdelay.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -830,10 +831,15 @@ static void wake_up_page(struct page *page, int bit)
 static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 		struct page *page, int bit_nr, int state, bool lock)
 {
+	bool refault = bit_nr == PG_locked && PageWorkingset(page);
 	struct wait_page_queue wait_page;
 	wait_queue_t *wait = &wait_page.wait;
+	unsigned long mdflags;
 	int ret = 0;
 
+	if (refault)
+		memdelay_enter(&mdflags);
+
 	init_wait(wait);
 	wait->func = wake_page_function;
 	wait_page.page = page;
@@ -873,6 +879,9 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 
 	finish_wait(q, wait);
 
+	if (refault)
+		memdelay_leave(&mdflags);
+
 	/*
 	 * A signal could leave PageWaiters set. Clearing it here if
 	 * !waitqueue_active would be possible (by open-coding finish_wait),
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 94172089f52f..5d1ebe329c48 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -65,6 +65,7 @@
 #include <linux/lockdep.h>
 #include <linux/file.h>
 #include <linux/tracehook.h>
+#include <linux/memdelay.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -3907,6 +3908,8 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 	return ret;
 }
 
+static int memory_memdelay_show(struct seq_file *m, void *v);
+
 static struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -3974,6 +3977,10 @@ static struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "pressure_level",
 	},
+	{
+		.name = "memdelay",
+		.seq_show = memory_memdelay_show,
+	},
 #ifdef CONFIG_NUMA
 	{
 		.name = "numa_stat",
@@ -4142,6 +4149,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 
 	for_each_node(node)
 		free_mem_cgroup_per_node_info(memcg, node);
+	memdelay_domain_free(memcg->memdelay_domain);
 	free_percpu(memcg->stat);
 	kfree(memcg);
 }
@@ -4247,10 +4255,15 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	/* The following stuff does not apply to the root */
 	if (!parent) {
+		memcg->memdelay_domain = &memdelay_global_domain;
 		root_mem_cgroup = memcg;
 		return &memcg->css;
 	}
 
+	memcg->memdelay_domain = memdelay_domain_alloc();
+	if (!memcg->memdelay_domain)
+		goto fail;
+
 	error = memcg_online_kmem(memcg);
 	if (error)
 		goto fail;
@@ -5241,6 +5254,13 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int memory_memdelay_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+	return memdelay_domain_show(m, memcg->memdelay_domain);
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -5276,6 +5296,11 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = memory_stat_show,
 	},
+	{
+		.name = "memdelay",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_memdelay_show,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/mm/memdelay.c b/mm/memdelay.c
new file mode 100644
index 000000000000..337a6bca9ee8
--- /dev/null
+++ b/mm/memdelay.c
@@ -0,0 +1,289 @@
+/*
+ * Memory delay metric
+ *
+ * Copyright (c) 2017 Facebook, Johannes Weiner
+ *
+ * This code quantifies and reports to userspace the wall-time impact
+ * of memory pressure on the system and memory-controlled cgroups.
+ */
+
+#include <linux/sched/loadavg.h>
+#include <linux/memcontrol.h>
+#include <linux/memdelay.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+
+static DEFINE_PER_CPU(struct memdelay_domain_cpu, global_domain_cpus);
+
+/* System-level keeping of memory delay statistics */
+struct memdelay_domain memdelay_global_domain = {
+	.mdcs = &global_domain_cpus,
+};
+
+static void domain_init(struct memdelay_domain *md)
+{
+	int cpu;
+
+	md->period_expires = jiffies + LOAD_FREQ;
+	for_each_possible_cpu(cpu) {
+		struct memdelay_domain_cpu *mdc;
+
+		mdc = per_cpu_ptr(md->mdcs, cpu);
+		spin_lock_init(&mdc->lock);
+	}
+}
+
+/**
+ * memdelay_init - initialize the memdelay subsystem
+ *
+ * This needs to run before the scheduler starts queuing and
+ * scheduling tasks.
+ */
+void __init memdelay_init(void)
+{
+	domain_init(&memdelay_global_domain);
+}
+
+static void domain_move_clock(struct memdelay_domain *md)
+{
+	unsigned long expires = READ_ONCE(md->period_expires);
+	unsigned long none, some, full;
+	int missed_periods;
+	unsigned long next;
+	int i;
+
+	if (time_before(jiffies, expires))
+		return;
+
+	missed_periods = 1 + (jiffies - expires) / LOAD_FREQ;
+	next = expires + (missed_periods * LOAD_FREQ);
+
+	if (cmpxchg(&md->period_expires, expires, next) != expires)
+		return;
+
+	none = xchg(&md->times[MDS_NONE], 0);
+	some = xchg(&md->times[MDS_SOME], 0);
+	full = xchg(&md->times[MDS_FULL], 0);
+
+	for (i = 0; i < missed_periods; i++) {
+		unsigned long pct;
+
+		pct = some * 100 / max(none + some + full, 1UL);
+		pct *= FIXED_1;
+		CALC_LOAD(md->avg_some[0], EXP_1, pct);
+		CALC_LOAD(md->avg_some[1], EXP_5, pct);
+		CALC_LOAD(md->avg_some[2], EXP_15, pct);
+
+		pct = full * 100 / max(none + some + full, 1UL);
+		pct *= FIXED_1;
+		CALC_LOAD(md->avg_full[0], EXP_1, pct);
+		CALC_LOAD(md->avg_full[1], EXP_5, pct);
+		CALC_LOAD(md->avg_full[2], EXP_15, pct);
+
+		none = some = full = 0;
+	}
+}
+
+static void domain_cpu_update(struct memdelay_domain *md, int cpu,
+			      int old, int new)
+{
+	enum memdelay_domain_state state;
+	struct memdelay_domain_cpu *mdc;
+	unsigned long now, delta;
+	unsigned long flags;
+
+	mdc = per_cpu_ptr(md->mdcs, cpu);
+	spin_lock_irqsave(&mdc->lock, flags);
+
+	if (old) {
+		WARN_ONCE(!mdc->tasks[old], "cpu=%d old=%d new=%d counter=%d\n",
+			  cpu, old, new, mdc->tasks[old]);
+		mdc->tasks[old] -= 1;
+	}
+	if (new)
+		mdc->tasks[new] += 1;
+
+	/*
+	 * The domain is somewhat delayed when a number of tasks are
+	 * delayed but there are still others running the workload.
+	 *
+	 * The domain is fully delayed when all non-idle tasks on the
+	 * CPU are delayed, or when a delayed task is actively running
+	 * and preventing productive tasks from making headway.
+	 *
+	 * The state times then add up over all CPUs in the domain: if
+	 * the domain is fully blocked on one CPU and there is another
+	 * one running the workload, the domain is considered fully
+	 * blocked 50% of the time.
+	 */
+	if (!mdc->tasks[MTS_DELAYED_ACTIVE] && !mdc->tasks[MTS_DELAYED])
+		state = MDS_NONE;
+	else if (mdc->tasks[MTS_WORKING])
+		state = MDS_SOME;
+	else
+		state = MDS_FULL;
+
+	if (mdc->state == state)
+		goto unlock;
+
+	now = ktime_to_ns(ktime_get());
+	delta = now - mdc->state_start;
+
+	domain_move_clock(md);
+	md->times[mdc->state] += delta;
+
+	mdc->state = state;
+	mdc->state_start = now;
+unlock:
+	spin_unlock_irqrestore(&mdc->lock, flags);
+}
+
+static struct memdelay_domain *memcg_domain(struct mem_cgroup *memcg)
+{
+#ifdef CONFIG_MEMCG
+	if (!mem_cgroup_disabled())
+		return memcg->memdelay_domain;
+#endif
+	return &memdelay_global_domain;
+}
+
+/**
+ * memdelay_task_change - note a task changing its delay/work state
+ * @task: the task changing state
+ * @old: the memdelay_task_state the task is leaving
+ * @new: the memdelay_task_state the task is entering
+ *       (either may be MTS_NONE)
+ *
+ * Updates the task's domain counters to reflect a change in the
+ * task's delayed/working state.
+ */
+void memdelay_task_change(struct task_struct *task, int old, int new)
+{
+	int cpu = task_cpu(task);
+	struct mem_cgroup *memcg;
+	unsigned long delay = 0;
+
+#ifdef CONFIG_DEBUG_VM
+	WARN_ONCE(task->memdelay_state != old,
+		  "cpu=%d task=%p state=%d (in_iowait=%d PF_MEMDELAYED=%d) old=%d new=%d\n",
+		  cpu, task, task->memdelay_state, task->in_iowait,
+		  !!(task->flags & PF_MEMDELAY), old, new);
+	task->memdelay_state = new;
+#endif
+
+	/* Account when tasks are entering and leaving delays */
+	if (old < MTS_DELAYED && new >= MTS_DELAYED) {
+		task->memdelay_start = ktime_to_ms(ktime_get());
+	} else if (old >= MTS_DELAYED && new < MTS_DELAYED) {
+		delay = ktime_to_ms(ktime_get()) - task->memdelay_start;
+		task->memdelay_total += delay;
+	}
+
+	/* Account domain state changes */
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(task);
+	do {
+		struct memdelay_domain *md;
+
+		md = memcg_domain(memcg);
+		md->aggregate += delay;
+		domain_cpu_update(md, cpu, old, new);
+	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
+	rcu_read_unlock();
+}
+
+/**
+ * memdelay_domain_alloc - allocate a cgroup memory delay domain
+ */
+struct memdelay_domain *memdelay_domain_alloc(void)
+{
+	struct memdelay_domain *md;
+
+	md = kzalloc(sizeof(*md), GFP_KERNEL);
+	if (!md)
+		return NULL;
+	md->mdcs = alloc_percpu(struct memdelay_domain_cpu);
+	if (!md->mdcs) {
+		kfree(md);
+		return NULL;
+	}
+	domain_init(md);
+	return md;
+}
+
+/**
+ * memdelay_domain_free - free a cgroup memory delay domain
+ */
+void memdelay_domain_free(struct memdelay_domain *md)
+{
+	if (md) {
+		free_percpu(md->mdcs);
+		kfree(md);
+	}
+}
+
+/**
+ * memdelay_domain_show - format memory delay domain stats to a seq_file
+ * @s: the seq_file
+ * @md: the memory domain
+ */
+int memdelay_domain_show(struct seq_file *s, struct memdelay_domain *md)
+{
+	int cpu;
+
+	domain_move_clock(md);
+
+	seq_printf(s, "%lu\n", md->aggregate);
+
+	seq_printf(s, "%lu.%02lu %lu.%02lu %lu.%02lu\n",
+		   LOAD_INT(md->avg_some[0]), LOAD_FRAC(md->avg_some[0]),
+		   LOAD_INT(md->avg_some[1]), LOAD_FRAC(md->avg_some[1]),
+		   LOAD_INT(md->avg_some[2]), LOAD_FRAC(md->avg_some[2]));
+
+	seq_printf(s, "%lu.%02lu %lu.%02lu %lu.%02lu\n",
+		   LOAD_INT(md->avg_full[0]), LOAD_FRAC(md->avg_full[0]),
+		   LOAD_INT(md->avg_full[1]), LOAD_FRAC(md->avg_full[1]),
+		   LOAD_INT(md->avg_full[2]), LOAD_FRAC(md->avg_full[2]));
+
+#ifdef CONFIG_DEBUG_VM
+	for_each_online_cpu(cpu) {
+		struct memdelay_domain_cpu *mdc;
+
+		mdc = per_cpu_ptr(md->mdcs, cpu);
+		seq_printf(s, "%d %d %d\n",
+			   mdc->tasks[MTS_WORKING],
+			   mdc->tasks[MTS_DELAYED],
+			   mdc->tasks[MTS_DELAYED_ACTIVE]);
+	}
+#endif
+
+	return 0;
+}
+
+static int memdelay_show(struct seq_file *m, void *v)
+{
+	return memdelay_domain_show(m, &memdelay_global_domain);
+}
+
+static int memdelay_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, memdelay_show, NULL);
+}
+
+static const struct file_operations memdelay_fops = {
+	.open           = memdelay_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = single_release,
+};
+
+static int __init memdelay_proc_init(void)
+{
+	proc_create("memdelay", 0, NULL, &memdelay_fops);
+	return 0;
+}
+module_init(memdelay_proc_init);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2302f250d6b1..bec5e96f3b88 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -66,6 +66,7 @@
 #include <linux/kthread.h>
 #include <linux/memcontrol.h>
 #include <linux/ftrace.h>
+#include <linux/memdelay.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -3293,16 +3294,19 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
 		enum compact_priority prio, enum compact_result *compact_result)
 {
-	struct page *page;
 	unsigned int noreclaim_flag;
+	unsigned long mdflags;
+	struct page *page;
 
 	if (!order)
 		return NULL;
 
+	memdelay_enter(&mdflags);
 	noreclaim_flag = memalloc_noreclaim_save();
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
 									prio);
 	memalloc_noreclaim_restore(noreclaim_flag);
+	memdelay_leave(&mdflags);
 
 	if (*compact_result <= COMPACT_INACTIVE)
 		return NULL;
@@ -3448,13 +3452,15 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 					const struct alloc_context *ac)
 {
 	struct reclaim_state reclaim_state;
-	int progress;
 	unsigned int noreclaim_flag;
+	unsigned long mdflags;
+	int progress;
 
 	cond_resched();
 
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
+	memdelay_enter(&mdflags);
 	noreclaim_flag = memalloc_noreclaim_save();
 	lockdep_set_current_reclaim_state(gfp_mask);
 	reclaim_state.reclaimed_slab = 0;
@@ -3466,6 +3472,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 	current->reclaim_state = NULL;
 	lockdep_clear_current_reclaim_state();
 	memalloc_noreclaim_restore(noreclaim_flag);
+	memdelay_leave(&mdflags);
 
 	cond_resched();
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 285db147d013..f44651b49670 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,7 @@
 #include <linux/prefetch.h>
 #include <linux/printk.h>
 #include <linux/dax.h>
+#include <linux/memdelay.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -3045,6 +3046,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 {
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
+	unsigned long mdflags;
 	int nid;
 	unsigned int noreclaim_flag;
 	struct scan_control sc = {
@@ -3073,9 +3075,11 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					    sc.gfp_mask,
 					    sc.reclaim_idx);
 
+	memdelay_enter(&mdflags);
 	noreclaim_flag = memalloc_noreclaim_save();
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 	memalloc_noreclaim_restore(noreclaim_flag);
+	memdelay_leave(&mdflags);
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
@@ -3497,6 +3501,7 @@ static int kswapd(void *p)
 	pgdat->kswapd_order = 0;
 	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	for ( ; ; ) {
+		unsigned long mdflags;
 		bool ret;
 
 		alloc_order = reclaim_order = pgdat->kswapd_order;
@@ -3533,7 +3538,11 @@ static int kswapd(void *p)
 		 */
 		trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
 						alloc_order);
+
+		memdelay_enter(&mdflags);
 		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
+		memdelay_leave(&mdflags);
+
 		if (reclaim_order < alloc_order)
 			goto kswapd_try_sleep;
 	}
-- 
2.13.3

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
  2017-07-27 15:30   ` Johannes Weiner
@ 2017-07-27 15:56     ` Johannes Weiner
  -1 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2017-07-27 15:56 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Rik van Riel, Mel Gorman
  Cc: linux-mm, linux-kernel, kernel-team

On Thu, Jul 27, 2017 at 11:30:10AM -0400, Johannes Weiner wrote:
> +	/*
> +	 * The domain is somewhat delayed when a number of tasks are
> +	 * delayed but there are still others running the workload.
> +	 *
> +	 * The domain is fully delayed when all non-idle tasks on the
> +	 * CPU are delayed, or when a delayed task is actively running
> +	 * and preventing productive tasks from making headway.
> +	 *
> +	 * The state times then add up over all CPUs in the domain: if
> +	 * the domain is fully blocked on one CPU and there is another
> +	 * one running the workload, the domain is considered fully
> +	 * blocked 50% of the time.
> +	 */
> +	if (!mdc->tasks[MTS_DELAYED_ACTIVE] && !mdc->tasks[MTS_DELAYED])
> +		state = MDS_NONE;
> +	else if (mdc->tasks[MTS_WORKING])
> +		state = MDS_SOME;
> +	else
> +		state = MDS_FULL;

Just a heads-up, in case you're wondering why there is a distinction
between delayed and delayed_active: I used to track iowait separately
from working, and in a brainfart I oversimplified this part right
here. It should really be:

	if (delayed_active && !iowait)
		state = full
	else if (delayed)
		state = (working || iowait) ? some : full
	else
		state = none

I'm going to re-add separate iowait tracking in v2 and fix this, but
since this patch is already big and spans two major subsystems, I
wanted to run the overall design and idea by you first before doing
more polishing on this.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/3] memdelay: memory health metric for systems and workloads
  2017-07-27 15:30 ` Johannes Weiner
@ 2017-07-27 20:43   ` Andrew Morton
  -1 siblings, 0 replies; 43+ messages in thread
From: Andrew Morton @ 2017-07-27 20:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Peter Zijlstra, Rik van Riel, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

On Thu, 27 Jul 2017 11:30:07 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> This patch series implements a fine-grained metric for memory
> health.

I assume some Documentation/ is forthcoming.

Consuming another page flag hurts.  What's our current status there?

I'd be interested in seeing some usage examples.  Perhaps anecdotes
where "we observed problem X so we used memdelay in manner Y and saw
result Z".

I assume that some userspace code which utilizes this interface exists
already.  What's the long-term plan here?  systemd changes?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/3] memdelay: memory health metric for systems and workloads
  2017-07-27 20:43   ` Andrew Morton
@ 2017-07-28 19:43     ` Johannes Weiner
  -1 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2017-07-28 19:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Peter Zijlstra, Rik van Riel, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

Hi Andrew,

On Thu, Jul 27, 2017 at 01:43:25PM -0700, Andrew Morton wrote:
> On Thu, 27 Jul 2017 11:30:07 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > This patch series implements a fine-grained metric for memory
> > health.
> 
> I assume some Documentation/ is forthcoming.

Yep, I'll describe the interface and how to use this more extensively.

> Consuming another page flag hurts.  What's our current status there?

I would say we can make it 64-bit only, but I also need this refault
distinction flag in the LRU balancing patches [1] to apply pressure on
anon pages only when the page cache is actually thrashing, not when
it's just transitioning to another workingset. So let's see...

20 flags are always defined.

21 if you have an MMU.

23 with the zone bits for DMA, Normal, HighMem, Movable.

29 with the sparsemem section bits.

30 if PAE is enabled.

On that config, NUMA gets 2 bits for 4 nodes. If I take the 31st bit,
it'd be left with 2 possible nodes. If that's not enough, that system
can switch to discontigmem and regain the 6 or 7 sparsemem bits.
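
The bit budget above can be written out as a small calculation. This is
only a sketch of the arithmetic in this mail for a 32-bit flags word; the
enum and function names are made up for illustration and are not kernel
symbols, and the PAE bit is assumed to be one extra section bit (per the
"6 or 7" remark).

```c
/*
 * Bit budget of a 32-bit page flags word, using the counts quoted in
 * the message. All names here are illustrative, not kernel symbols.
 */
enum {
	FLAGS_WORD_BITS = 32,
	BASE_FLAGS      = 20,	/* always defined */
	MMU_FLAGS       = 1,	/* 21 with an MMU */
	ZONE_BITS       = 2,	/* 23: DMA, Normal, HighMem, Movable */
	SECTION_BITS    = 6,	/* 29 with the sparsemem section bits */
	PAE_BITS        = 1,	/* 30 with PAE (assumed: one more section bit) */
};

/* Bits left for the NUMA node id after taking extra_flags more page flags. */
static int node_bits_left(int extra_flags)
{
	int used = BASE_FLAGS + MMU_FLAGS + ZONE_BITS + SECTION_BITS + PAE_BITS;

	return FLAGS_WORD_BITS - used - extra_flags;
}

/* Number of NUMA nodes that can be encoded in the remaining bits. */
static int max_nodes(int extra_flags)
{
	int bits = node_bits_left(extra_flags);

	return bits > 0 ? 1 << bits : 1;
}
```

With no extra flag, max_nodes(0) gives the 4 nodes mentioned above; taking
one more flag, max_nodes(1), leaves 2.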

> I'd be interested in seeing some usage examples.  Perhaps anecdotes
> where "we observed problem X so we used memdelay in manner Y and saw
> result Z".

The very first thing that had me look at this was the pathological
behavior of memory pressure after switching my systems from spinning
disks to SSDs. Just like vmpressure, the OOM killer depends on reclaim
efficiency dropping through the floor - but that doesn't really happen
anymore. Sometimes my systems lock up for seconds, sometimes minutes,
or until I hard-reset them. The page cache, including executables, is
thrashing like crazy while reclaim efficiency hovers around 100%.

The same happens at FB data centers, where we lose machines during
peak times with no kernel-side remedy for recovering this livelock.

The OOM killer really needs to be hooked up to a portable measure of
thrashing impact rather than an inability of the VM to recycle pages.
I think expressing this cutoff in terms of unproductive time makes the
most sense: e.g. 60%+ of the last 10 seconds of elapsed walltime the
system was doing nothing but waiting for refaults or reclaiming; time
to kill something to free up memory and reduce access frequencies.
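
As a rough sketch of what such a cutoff could look like, assuming some
counter of unproductive nanoseconds within a sliding window is available:
the 60% and 10 second values are just the example numbers from the
paragraph above, and the function and macro names are hypothetical, not
part of the posted patches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Example policy values from the text; not tunables that exist anywhere. */
#define WINDOW_NS      (10ULL * 1000000000ULL)	/* last 10 seconds */
#define CUTOFF_PERCENT 60ULL

/*
 * stalled_ns: walltime inside the window during which the system did
 * nothing but wait for refaults or reclaim (hypothetical input).
 * Returns true when the unproductive share reaches the cutoff.
 */
static bool thrashing_cutoff_reached(uint64_t stalled_ns, uint64_t window_ns)
{
	if (window_ns == 0)
		return false;
	if (stalled_ns > window_ns)
		stalled_ns = window_ns;	/* clamp bogus input */

	/* Compare in integer math to avoid a division. */
	return stalled_ns * 100 >= CUTOFF_PERCENT * window_ns;
}
```

Expressing the policy as a share of walltime keeps it independent of how
fast the storage is, which is the point of moving away from reclaim
efficiency.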

But even before OOM, we need to know when we start packing machines
and containers too tightly in terms of memory. Free pages don't mean
anything because of the page cache, and the refault rate on its own
doesn't tell you anything about throughput or latency deterioration.

A recurring scenario for me is that somebody has a machine running a
workload with peaks of 100% CPU, 100% IO, bursts of refaults and a
slowdown in the application. What resource is really lacking here? A
lack of memory can result in high CPU and IO times, but it could also
be mostly the application's own appetite for those resources. The
memdelay code shows us how much of the slowdown is caused by memory.

Figuring this out with tracing and profiling is *sometimes* possible,
but takes a ridiculous amount of effort and a reproducible workload.
In many cases it's not an option due to the scale we're dealing with.

For example, we have large pools of machines that run some hundred
jobs whose peak activity depends in part on factors out of our
control, such as user activity. When the peaks of several jobs align,
their individual throughput and latency deteriorate, and as above we
see CPU, IO, and latency spikes. Separating out how much of that is
due to memory then feeds into the job scheduler, which adjusts the job
placement, cgroup limits etc. accordingly throughout the pool.

Another thing is detecting regressions. Kernel developers tend to run
handpicked, idempotent A/B tests, on single machines, to detect
walltime impact of VM changes. That's often not very representative of
real applications. By tracking memdelay trends averaged over thousands
of machines that run similar workloads, we can tell whether a kernel
upgrade introduced a VM regression that matters to real applications
down to sub-percent walltime impact fairly easily and reliably.

Even with workloads that have their own clear throughput metrics to
detect regressions, knowing where to look makes finding problems
easier, which makes upgrades faster, which means we can run yet more
recent kernels :)

> I assume that some userspace code which utilizes this interface exists
> already.  What's the long-term plan here?  systemd changes?

We're putting it into our custom job scheduler/load balancers and
fleet monitoring infrastructure to track capacity and regressions.

System health monitoring tools like top, atop etc. can incorporate
this in their summaries as well as per-task statistics.

Things like systemd-cgtop that give container overviews can as well.

And as mentioned above, IMO the OOM killer is a prime candidate for
being an in-kernel user of this.

Thanks

[1] https://lwn.net/Articles/690079/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/3] memdelay: memory health metric for systems and workloads
  2017-07-27 15:30 ` Johannes Weiner
@ 2017-07-29  2:48   ` Mike Galbraith
  -1 siblings, 0 replies; 43+ messages in thread
From: Mike Galbraith @ 2017-07-29  2:48 UTC (permalink / raw)
  To: Johannes Weiner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Mel Gorman
  Cc: linux-mm, linux-kernel, kernel-team

On Thu, 2017-07-27 at 11:30 -0400, Johannes Weiner wrote:
> 
> Structure
> 
> The first patch cleans up the different loadavg callsites and macros
> as the memdelay averages are going to be tracked using these.
> 
> The second patch adds a distinction between page cache transitions
> (inactive list refaults) and page cache thrashing (active list
> refaults), since only the latter are unproductive refaults.
> 
> The third patch finally adds the memdelay accounting and interface:
> its scheduler side identifies productive and unproductive task states,
> and the VM side aggregates them into system and cgroup domain states
> and calculates moving averages of the time spent in each state.
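
For readers unfamiliar with the loadavg machinery the first patch refers
to, here is a userspace sketch of the fixed-point moving average it is
built on. The constants and the calc_load() arithmetic follow the kernel's
loadavg code as of this era; how memdelay samples its input is not shown
in this thread, so feeding it a 0..FIXED_1 "share of time delayed" is an
assumption made only for illustration.

```c
/* Fixed-point constants as used by the kernel's loadavg code. */
#define FSHIFT  11			/* bits of fractional precision */
#define FIXED_1 (1UL << FSHIFT)		/* 1.0 in fixed point */
#define EXP_1   1884UL			/* 1/exp(5sec/1min) in fixed point */

/*
 * One step of the exponentially decaying average:
 *   load = load * exp + active * (1 - exp)
 * The round-up when the sample is not below the average lets the
 * average actually reach the sample instead of stalling just under it.
 */
static unsigned long calc_load(unsigned long load, unsigned long exp,
			       unsigned long active)
{
	unsigned long newload;

	newload = load * exp + active * (FIXED_1 - exp);
	if (active >= load)
		newload += FIXED_1 - 1;

	return newload / FIXED_1;
}
```

Calling calc_load(avg, EXP_1, sample) once per sampling period with a
constant sample moves avg toward that sample; with a sample of 0 the
average decays by the factor EXP_1/FIXED_1 per step.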

What tree is this against?  ttwu asm delta says "measure me".

	-Mike

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/3] memdelay: memory health metric for systems and workloads
  2017-07-29  2:48   ` Mike Galbraith
@ 2017-07-29  3:21     ` Mike Galbraith
  -1 siblings, 0 replies; 43+ messages in thread
From: Mike Galbraith @ 2017-07-29  3:21 UTC (permalink / raw)
  To: Johannes Weiner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Mel Gorman
  Cc: linux-mm, linux-kernel, kernel-team

On Sat, 2017-07-29 at 04:48 +0200, Mike Galbraith wrote:
> On Thu, 2017-07-27 at 11:30 -0400, Johannes Weiner wrote:
> > 
> > Structure
> > 
> > The first patch cleans up the different loadavg callsites and macros
> > as the memdelay averages are going to be tracked using these.
> > 
> > The second patch adds a distinction between page cache transitions
> > (inactive list refaults) and page cache thrashing (active list
> > refaults), since only the latter are unproductive refaults.
> > 
> > The third patch finally adds the memdelay accounting and interface:
> > its scheduler side identifies productive and unproductive task states,
> > and the VM side aggregates them into system and cgroup domain states
> > and calculates moving averages of the time spent in each state.
> 
> What tree is this against?  ttwu asm delta says "measure me".

(mm/master.. gee)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/3] memdelay: memory health metric for systems and workloads
  2017-07-29  2:48   ` Mike Galbraith
@ 2017-07-29  6:38     ` Mike Galbraith
  -1 siblings, 0 replies; 43+ messages in thread
From: Mike Galbraith @ 2017-07-29  6:38 UTC (permalink / raw)
  To: Johannes Weiner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Mel Gorman
  Cc: linux-mm, linux-kernel, kernel-team

On Sat, 2017-07-29 at 04:48 +0200, Mike Galbraith wrote:
> ttwu asm delta says "measure me".

q/d measurement with pipe-test

+cgroup_disable=memory
2.241926 usecs/loop -- avg 2.242376 891.9 KHz  1.000
+patchset
2.284428 usecs/loop -- avg 2.357621 848.3 KHz   .951

-cgroup_disable=memory
2.257433 usecs/loop -- avg 2.327356 859.3 KHz  1.000
+patchset
2.394804 usecs/loop -- avg 2.404556 831.8 KHz   .967
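
The last column appears to be the patched rate relative to the unpatched
baseline of the same boot configuration; that reading is an assumption,
since the mail does not say so. Under that assumption the figures can be
re-derived like this (the function name is made up for illustration):

```c
/*
 * Relative throughput of a patched kernel versus its baseline, given
 * the average loop rates in KHz as printed by the benchmark.
 * A result below 1.0 means the patched kernel is slower.
 */
static double relative_rate(double baseline_khz, double patched_khz)
{
	return patched_khz / baseline_khz;
}
```

relative_rate(891.9, 848.3) is roughly 0.951 and relative_rate(859.3,
831.8) roughly 0.968, i.e. a slowdown of about 3 to 5 percent in this
quick test.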

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
  2017-07-27 15:30   ` Johannes Weiner
@ 2017-07-29  9:10     ` Peter Zijlstra
  -1 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2017-07-29  9:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Rik van Riel, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

So no, this doesn't have a chance in hell of making it.

On Thu, Jul 27, 2017 at 11:30:10AM -0400, Johannes Weiner wrote:
> +static void domain_cpu_update(struct memdelay_domain *md, int cpu,
> +			      int old, int new)
> +{
> +	enum memdelay_domain_state state;
> +	struct memdelay_domain_cpu *mdc;
> +	unsigned long now, delta;
> +	unsigned long flags;
> +
> +	mdc = per_cpu_ptr(md->mdcs, cpu);
> +	spin_lock_irqsave(&mdc->lock, flags);

Afaict this is inside scheduler locks, this cannot be a spinlock. Also,
do we really want to add more atomics there?

> +	if (old) {
> +		WARN_ONCE(!mdc->tasks[old], "cpu=%d old=%d new=%d counter=%d\n",
> +			  cpu, old, new, mdc->tasks[old]);
> +		mdc->tasks[old] -= 1;
> +	}
> +	if (new)
> +		mdc->tasks[new] += 1;
> +
> +	/*
> +	 * The domain is somewhat delayed when a number of tasks are
> +	 * delayed but there are still others running the workload.
> +	 *
> +	 * The domain is fully delayed when all non-idle tasks on the
> +	 * CPU are delayed, or when a delayed task is actively running
> +	 * and preventing productive tasks from making headway.
> +	 *
> +	 * The state times then add up over all CPUs in the domain: if
> +	 * the domain is fully blocked on one CPU and there is another
> +	 * one running the workload, the domain is considered fully
> +	 * blocked 50% of the time.
> +	 */
> +	if (!mdc->tasks[MTS_DELAYED_ACTIVE] && !mdc->tasks[MTS_DELAYED])
> +		state = MDS_NONE;
> +	else if (mdc->tasks[MTS_WORKING])
> +		state = MDS_SOME;
> +	else
> +		state = MDS_FULL;
> +
> +	if (mdc->state == state)
> +		goto unlock;
> +
> +	now = ktime_to_ns(ktime_get());

ktime_get_ns(), also no ktime in scheduler code.

> +	delta = now - mdc->state_start;
> +
> +	domain_move_clock(md);
> +	md->times[mdc->state] += delta;
> +
> +	mdc->state = state;
> +	mdc->state_start = now;
> +unlock:
> +	spin_unlock_irqrestore(&mdc->lock, flags);
> +}
> +
> +static struct memdelay_domain *memcg_domain(struct mem_cgroup *memcg)
> +{
> +#ifdef CONFIG_MEMCG
> +	if (!mem_cgroup_disabled())
> +		return memcg->memdelay_domain;
> +#endif
> +	return &memdelay_global_domain;
> +}
> +
> +/**
> + * memdelay_task_change - note a task changing its delay/work state
> + * @task: the task changing state
> + * @delayed: 1 when task enters delayed state, -1 when it leaves
> + * @working: 1 when task enters working state, -1 when it leaves
> + * @active_delay: 1 when task enters active delay, -1 when it leaves
> + *
> + * Updates the task's domain counters to reflect a change in the
> + * task's delayed/working state.
> + */
> +void memdelay_task_change(struct task_struct *task, int old, int new)
> +{
> +	int cpu = task_cpu(task);
> +	struct mem_cgroup *memcg;
> +	unsigned long delay = 0;
> +
> +#ifdef CONFIG_DEBUG_VM
> +	WARN_ONCE(task->memdelay_state != old,
> +		  "cpu=%d task=%p state=%d (in_iowait=%d PF_MEMDELAYED=%d) old=%d new=%d\n",
> +		  cpu, task, task->memdelay_state, task->in_iowait,
> +		  !!(task->flags & PF_MEMDELAY), old, new);
> +	task->memdelay_state = new;
> +#endif
> +
> +	/* Account when tasks are entering and leaving delays */
> +	if (old < MTS_DELAYED && new >= MTS_DELAYED) {
> +		task->memdelay_start = ktime_to_ms(ktime_get());
> +	} else if (old >= MTS_DELAYED && new < MTS_DELAYED) {
> +		delay = ktime_to_ms(ktime_get()) - task->memdelay_start;
> +		task->memdelay_total += delay;
> +	}

Scheduler stuff will _NOT_ use ktime_get() and will _NOT_ do pointless
divisions into ms.

> +
> +	/* Account domain state changes */
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(task);
> +	do {
> +		struct memdelay_domain *md;
> +
> +		md = memcg_domain(memcg);
> +		md->aggregate += delay;
> +		domain_cpu_update(md, cpu, old, new);
> +	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
> +	rcu_read_unlock();

We are _NOT_ going to do a 3rd cgroup iteration for every task action.

> +};
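
For reference, the per-CPU state classification in the quoted
domain_cpu_update() can be tried out in userspace. This sketch mirrors
only the if/else chain and the "fully blocked 50% of the time" example
from the quoted comment; the enum ordering is inferred from the
comparisons in the quoted code, and struct cpu_counts and full_percent()
are hypothetical helpers that treat each CPU as staying in one state for
the whole interval.

```c
/* Task states; ordering inferred from "old < MTS_DELAYED" style checks. */
enum task_state { MTS_NONE, MTS_WORKING, MTS_DELAYED, MTS_DELAYED_ACTIVE, NR_MTS };
/* Domain states as named in the quoted patch. */
enum domain_state { MDS_NONE, MDS_SOME, MDS_FULL };

/* Hypothetical stand-in for the per-CPU task counters. */
struct cpu_counts {
	unsigned int tasks[NR_MTS];
};

/* Same decision chain as the quoted patch, minus locking and clocks. */
static enum domain_state classify(const struct cpu_counts *c)
{
	if (!c->tasks[MTS_DELAYED_ACTIVE] && !c->tasks[MTS_DELAYED])
		return MDS_NONE;	/* nobody is delayed */
	if (c->tasks[MTS_WORKING])
		return MDS_SOME;	/* delayed tasks, but work still runs */
	return MDS_FULL;		/* only delayed tasks on this CPU */
}

/* Percentage of CPUs in MDS_FULL, i.e. the summed-over-CPUs share. */
static unsigned int full_percent(const struct cpu_counts *cpus, int nr_cpus)
{
	int i, full = 0;

	for (i = 0; i < nr_cpus; i++)
		if (classify(&cpus[i]) == MDS_FULL)
			full++;

	return nr_cpus > 0 ? 100 * full / nr_cpus : 0;
}
```

With one CPU holding only a delayed task and a second CPU running the
workload, full_percent() gives 50, matching the example in the quoted
comment.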

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
  2017-07-27 15:30   ` Johannes Weiner
                     ` (2 preceding siblings ...)
  (?)
@ 2017-07-29 13:31   ` kbuild test robot
  -1 siblings, 0 replies; 43+ messages in thread
From: kbuild test robot @ 2017-07-29 13:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kbuild-all, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, Mel Gorman, linux-mm, linux-kernel, kernel-team

[-- Attachment #1: Type: text/plain, Size: 3286 bytes --]

Hi Johannes,

[auto build test ERROR on v4.12]
[cannot apply to linus/master linux/master v4.13-rc2 v4.13-rc1 next-20170728]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Johannes-Weiner/sched-loadavg-consolidate-LOAD_INT-LOAD_FRAC-macros/20170729-191658
config: arm-sunxi_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm 

All errors (new ones prefixed by >>):

   In file included from include/linux/spinlock_types.h:13:0,
                    from include/linux/memdelay.h:4,
                    from kernel/sched/memdelay.c:10:
>> arch/arm/include/asm/spinlock_types.h:12:3: error: unknown type name 'u32'
      u32 slock;
      ^~~
>> arch/arm/include/asm/spinlock_types.h:18:4: error: unknown type name 'u16'
       u16 owner;
       ^~~
   arch/arm/include/asm/spinlock_types.h:19:4: error: unknown type name 'u16'
       u16 next;
       ^~~
   arch/arm/include/asm/spinlock_types.h:28:2: error: unknown type name 'u32'
     u32 lock;
     ^~~

vim +/u32 +12 arch/arm/include/asm/spinlock_types.h

546c2896 arch/arm/include/asm/spinlock_types.h Will Deacon     2012-07-06   9  
fb1c8f93 include/asm-arm/spinlock_types.h      Ingo Molnar     2005-09-10  10  typedef struct {
546c2896 arch/arm/include/asm/spinlock_types.h Will Deacon     2012-07-06  11  	union {
546c2896 arch/arm/include/asm/spinlock_types.h Will Deacon     2012-07-06 @12  		u32 slock;
546c2896 arch/arm/include/asm/spinlock_types.h Will Deacon     2012-07-06  13  		struct __raw_tickets {
546c2896 arch/arm/include/asm/spinlock_types.h Will Deacon     2012-07-06  14  #ifdef __ARMEB__
546c2896 arch/arm/include/asm/spinlock_types.h Will Deacon     2012-07-06  15  			u16 next;
546c2896 arch/arm/include/asm/spinlock_types.h Will Deacon     2012-07-06  16  			u16 owner;
546c2896 arch/arm/include/asm/spinlock_types.h Will Deacon     2012-07-06  17  #else
546c2896 arch/arm/include/asm/spinlock_types.h Will Deacon     2012-07-06 @18  			u16 owner;
546c2896 arch/arm/include/asm/spinlock_types.h Will Deacon     2012-07-06  19  			u16 next;
546c2896 arch/arm/include/asm/spinlock_types.h Will Deacon     2012-07-06  20  #endif
546c2896 arch/arm/include/asm/spinlock_types.h Will Deacon     2012-07-06  21  		} tickets;
546c2896 arch/arm/include/asm/spinlock_types.h Will Deacon     2012-07-06  22  	};
445c8951 arch/arm/include/asm/spinlock_types.h Thomas Gleixner 2009-12-02  23  } arch_spinlock_t;
fb1c8f93 include/asm-arm/spinlock_types.h      Ingo Molnar     2005-09-10  24  

:::::: The code at line 12 was first introduced by commit
:::::: 546c2896a42202dbc7d02f7c6ec9948ac1bf511b ARM: 7446/1: spinlock: use ticket algorithm for ARMv6+ locking implementation

:::::: TO: Will Deacon <will.deacon@arm.com>
:::::: CC: Russell King <rmk+kernel@arm.linux.org.uk>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 21035 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
  2017-07-29  9:10     ` Peter Zijlstra
@ 2017-07-30 15:28       ` Johannes Weiner
  -1 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2017-07-30 15:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Rik van Riel, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

On Sat, Jul 29, 2017 at 11:10:55AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 27, 2017 at 11:30:10AM -0400, Johannes Weiner wrote:
> > +static void domain_cpu_update(struct memdelay_domain *md, int cpu,
> > +			      int old, int new)
> > +{
> > +	enum memdelay_domain_state state;
> > +	struct memdelay_domain_cpu *mdc;
> > +	unsigned long now, delta;
> > +	unsigned long flags;
> > +
> > +	mdc = per_cpu_ptr(md->mdcs, cpu);
> > +	spin_lock_irqsave(&mdc->lock, flags);
> 
> Afaict this is inside scheduler locks, this cannot be a spinlock. Also,
> do we really want to add more atomics there?

I think we should be able to get away without an additional lock and
rely on the rq lock instead. schedule, enqueue, dequeue already hold
it, memdelay_enter/leave could be added. I need to think about what to
do with try_to_wake_up in order to get the cpu move accounting inside
the locked section of ttwu_queue(), but that should be doable too.

> > +	if (old) {
> > +		WARN_ONCE(!mdc->tasks[old], "cpu=%d old=%d new=%d counter=%d\n",
> > +			  cpu, old, new, mdc->tasks[old]);
> > +		mdc->tasks[old] -= 1;
> > +	}
> > +	if (new)
> > +		mdc->tasks[new] += 1;
> > +
> > +	/*
> > +	 * The domain is somewhat delayed when a number of tasks are
> > +	 * delayed but there are still others running the workload.
> > +	 *
> > +	 * The domain is fully delayed when all non-idle tasks on the
> > +	 * CPU are delayed, or when a delayed task is actively running
> > +	 * and preventing productive tasks from making headway.
> > +	 *
> > +	 * The state times then add up over all CPUs in the domain: if
> > +	 * the domain is fully blocked on one CPU and there is another
> > +	 * one running the workload, the domain is considered fully
> > +	 * blocked 50% of the time.
> > +	 */
> > +	if (!mdc->tasks[MTS_DELAYED_ACTIVE] && !mdc->tasks[MTS_DELAYED])
> > +		state = MDS_NONE;
> > +	else if (mdc->tasks[MTS_WORKING])
> > +		state = MDS_SOME;
> > +	else
> > +		state = MDS_FULL;
> > +
> > +	if (mdc->state == state)
> > +		goto unlock;
> > +
> > +	now = ktime_to_ns(ktime_get());
> 
> ktime_get_ns(), also no ktime in scheduler code.

Okay.

I actually don't need a time source that's comparable across CPUs
since accounting periods are always fully contained within one
CPU. From the comment docs, it sounds like cpu_clock() is what I want
to use there?
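
To illustrate the accounting under discussion, here is a minimal
userspace sketch, not code from the patch. All names are hypothetical,
and the timestamp is passed in by the caller so the choice of clock
source stays open. It charges the elapsed time to the state being left,
which only requires that both timestamps come from the same CPU:

```c
#include <stdint.h>

/* Hypothetical names; the patch's real structures are not shown in full. */
enum domain_state { STATE_NONE, STATE_SOME, STATE_FULL, NR_STATES };

struct cpu_acct {
	enum domain_state state;	/* state this CPU is currently in */
	uint64_t state_start;		/* timestamp of the last change, in ns */
	uint64_t times[NR_STATES];	/* accumulated ns spent in each state */
};

/*
 * Charge the time since the last change to the old state, then switch.
 * Both timestamps are taken on the same CPU, so the clock only has to
 * be monotonic locally, not comparable across CPUs.
 */
static void cpu_acct_change(struct cpu_acct *ca, enum domain_state new_state,
			    uint64_t now)
{
	if (ca->state == new_state)
		return;
	ca->times[ca->state] += now - ca->state_start;
	ca->state = new_state;
	ca->state_start = now;
}
```

A caller would read its per-CPU clock once per state change and pass
the value in as 'now'.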

> > +	/* Account domain state changes */
> > +	rcu_read_lock();
> > +	memcg = mem_cgroup_from_task(task);
> > +	do {
> > +		struct memdelay_domain *md;
> > +
> > +		md = memcg_domain(memcg);
> > +		md->aggregate += delay;
> > +		domain_cpu_update(md, cpu, old, new);
> > +	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
> > +	rcu_read_unlock();
> 
> We are _NOT_ going to do a 3rd cgroup iteration for every task action.

I'll look into that.

Thanks

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
  2017-07-30 15:28       ` Johannes Weiner
@ 2017-07-31  8:31         ` Peter Zijlstra
  -1 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2017-07-31  8:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Rik van Riel, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

On Sun, Jul 30, 2017 at 11:28:13AM -0400, Johannes Weiner wrote:
> On Sat, Jul 29, 2017 at 11:10:55AM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 27, 2017 at 11:30:10AM -0400, Johannes Weiner wrote:
> > > +static void domain_cpu_update(struct memdelay_domain *md, int cpu,
> > > +			      int old, int new)
> > > +{
> > > +	enum memdelay_domain_state state;
> > > +	struct memdelay_domain_cpu *mdc;
> > > +	unsigned long now, delta;
> > > +	unsigned long flags;
> > > +
> > > +	mdc = per_cpu_ptr(md->mdcs, cpu);
> > > +	spin_lock_irqsave(&mdc->lock, flags);
> > 
> > Afaict this is inside scheduler locks, this cannot be a spinlock. Also,
> > do we really want to add more atomics there?
> 
> I think we should be able to get away without an additional lock and
> rely on the rq lock instead. schedule, enqueue, dequeue already hold
> it, memdelay_enter/leave could be added. I need to think about what to
> do with try_to_wake_up in order to get the cpu move accounting inside
> the locked section of ttwu_queue(), but that should be doable too.

So could you start by describing what actual statistics we need? Because
as is the scheduler already does a gazillion stats and why can't we
repurpose some of those?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
  2017-07-31  8:31         ` Peter Zijlstra
@ 2017-07-31 18:41           ` Johannes Weiner
  -1 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2017-07-31 18:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Rik van Riel, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

On Mon, Jul 31, 2017 at 10:31:11AM +0200, Peter Zijlstra wrote:
> On Sun, Jul 30, 2017 at 11:28:13AM -0400, Johannes Weiner wrote:
> > On Sat, Jul 29, 2017 at 11:10:55AM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 27, 2017 at 11:30:10AM -0400, Johannes Weiner wrote:
> > > > +static void domain_cpu_update(struct memdelay_domain *md, int cpu,
> > > > +			      int old, int new)
> > > > +{
> > > > +	enum memdelay_domain_state state;
> > > > +	struct memdelay_domain_cpu *mdc;
> > > > +	unsigned long now, delta;
> > > > +	unsigned long flags;
> > > > +
> > > > +	mdc = per_cpu_ptr(md->mdcs, cpu);
> > > > +	spin_lock_irqsave(&mdc->lock, flags);
> > > 
> > > Afaict this is inside scheduler locks, this cannot be a spinlock. Also,
> > > do we really want to add more atomics there?
> > 
> > I think we should be able to get away without an additional lock and
> > rely on the rq lock instead. schedule, enqueue, dequeue already hold
> > it, memdelay_enter/leave could be added. I need to think about what to
> > do with try_to_wake_up in order to get the cpu move accounting inside
> > the locked section of ttwu_queue(), but that should be doable too.
> 
> So could you start by describing what actual statistics we need? Because
> as is the scheduler already does a gazillion stats and why can't we
> repurpose some of those?

If that's possible, that would be great of course.

We want to be able to tell how many tasks in a domain (the system or a
memory cgroup) are inside a memdelay section as opposed to how many
are in a "productive" state such as runnable or iowait. Then derive
from that whether the domain as a whole is unproductive (all non-idle
tasks memdelayed), or partially unproductive (some delayed, but CPUs
are productive or there are iowait tasks). Then derive the percentages
of walltime the domain spends partially or fully unproductive.

For that we need per-domain counters for

	1) nr of tasks in memdelay sections
	2) nr of iowait or runnable/queued tasks that are NOT inside
	   memdelay sections

The memdelay and runnable counts need to be per-cpu as well. (The idea
is this: if you have one CPU and some tasks are delayed while others
are runnable, you're 100% partially productive, as the CPU is fully
used. But if you have two CPUs, and the tasks on one CPU are all
runnable while the tasks on the other are all delayed, the domain is
50% of the time fully unproductive (and not 100% partially productive)
as half the available CPU time is being squandered by delays).
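
The arithmetic behind that example can be sketched as follows. This is
a userspace illustration with hypothetical names, assuming each CPU has
already accumulated the time it spent in the partially and fully
unproductive states over one sampling period:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: ns each CPU spent partially / fully unproductive. */
struct cpu_times {
	uint64_t some_ns;
	uint64_t full_ns;
};

/*
 * Sum one state over all CPUs and express it as a percentage of the
 * CPU time available in the period (nr_cpus * period_ns).
 */
static unsigned int domain_pct(const struct cpu_times *cpus, size_t nr_cpus,
			       uint64_t period_ns, int full)
{
	uint64_t sum = 0;
	size_t i;

	if (!nr_cpus || !period_ns)
		return 0;
	for (i = 0; i < nr_cpus; i++)
		sum += full ? cpus[i].full_ns : cpus[i].some_ns;
	return (unsigned int)(sum * 100 / (nr_cpus * period_ns));
}
```

With two CPUs where one is fully delayed for the whole period and the
other is never delayed, the "full" figure comes out at 50 and the
"some" figure at 0, matching the description above.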

On the system-level, we already count runnable/queued per cpu through
rq->nr_running.

However, we need to distinguish between productive runnables and tasks
that are runnable while in a memdelay section (doing reclaim). The
current counters don't do that.

Lastly, and somewhat obscurely, the presence of runnable tasks means
that usually the domain is at least partially productive. But if the
CPU is used by a task in a memdelay section (direct reclaim), the
domain is fully unproductive (unless there are iowait tasks in the
domain, since they make "progress" without CPU). So we need to track
task_current() && task_memdelayed() per-domain per-cpu as well.
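
Put together, the rules above could be expressed roughly like this. It
is a simplified userspace sketch with hypothetical names, not the
patch's code. It assumes nr_delayed includes a delayed task that is
currently on the CPU, and it folds the domain's iowait count into the
same structure for brevity:

```c
#include <stdbool.h>

enum domain_state { STATE_NONE, STATE_SOME, STATE_FULL };

struct cpu_counts {
	unsigned int nr_delayed;	/* tasks inside memdelay sections */
	unsigned int nr_runnable;	/* runnable/queued, NOT in a memdelay section */
	unsigned int nr_iowait;		/* iowait, NOT in a memdelay section */
	bool delayed_running;		/* current task is inside a memdelay section */
};

static enum domain_state classify(const struct cpu_counts *c)
{
	/* Nobody is delayed: the domain is fully productive. */
	if (!c->nr_delayed)
		return STATE_NONE;
	/* iowait tasks make "progress" without needing the CPU. */
	if (c->nr_iowait)
		return STATE_SOME;
	/* A delayed task occupying the CPU keeps runnable tasks waiting. */
	if (c->delayed_running)
		return STATE_FULL;
	/* Otherwise it depends on whether anything productive can run. */
	return c->nr_runnable ? STATE_SOME : STATE_FULL;
}
```

The third check is the "somewhat obscure" case: productive runnable
tasks exist, but a task in direct reclaim holds the CPU, so the domain
still counts as fully unproductive.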

Now, thinking only about the system-level, we could split
rq->nr_running into sets of delayed and non-delayed counters
(present them as sum in all current read sides).

Adding an rq counter for tasks inside memdelay sections should be
straight-forward as well (except for maybe the migration cost of that
state between CPUs in ttwu that Mike pointed out).

That leaves the question of how to track these numbers per cgroup at
an acceptable cost. The idea for a tree of cgroups is that walltime
impact of delays at each level is reported for all tasks at or below
that level. E.g. a leaf group aggregates the state of its own tasks,
the root/system aggregates the state of all tasks in the system; hence
the propagation of the task state counters up the hierarchy.
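
The propagation described here amounts to a walk from the task's group
up to the root on every state change. A minimal userspace sketch, with
a hypothetical group structure standing in for the memory cgroup and a
negative index meaning "no state":

```c
#include <stddef.h>

#define NR_TASK_STATES 3

struct group {
	struct group *parent;			/* NULL for the root */
	unsigned int tasks[NR_TASK_STATES];	/* tasks at or below this level */
};

/*
 * Move one task from state 'old' to state 'new' in its own group and in
 * every ancestor. Pass -1 for 'old' when the task has no previous state
 * and -1 for 'new' when it is going away.
 */
static void group_task_change(struct group *g, int old, int new)
{
	for (; g; g = g->parent) {
		if (old >= 0)
			g->tasks[old]--;
		if (new >= 0)
			g->tasks[new]++;
	}
}
```

Each transition costs one update per level of the hierarchy, which is
the overhead the question of "acceptable cost" refers to.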

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
  2017-07-31 18:41           ` Johannes Weiner
@ 2017-07-31 19:49             ` Mike Galbraith
  -1 siblings, 0 replies; 43+ messages in thread
From: Mike Galbraith @ 2017-07-31 19:49 UTC (permalink / raw)
  To: Johannes Weiner, Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Rik van Riel, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

On Mon, 2017-07-31 at 14:41 -0400, Johannes Weiner wrote:
> 
> Adding an rq counter for tasks inside memdelay sections should be
> straight-forward as well (except for maybe the migration cost of that
> state between CPUs in ttwu that Mike pointed out).

What I pointed out should be easily eliminated (zero use case).
 
> That leaves the question of how to track these numbers per cgroup at
> an acceptable cost. The idea for a tree of cgroups is that walltime
> impact of delays at each level is reported for all tasks at or below
> that level. E.g. a leaf group aggregates the state of its own tasks,
> the root/system aggregates the state of all tasks in the system; hence
> the propagation of the task state counters up the hierarchy.

The crux of the biscuit is where exactly the investment return lies.
 Gathering of these numbers ain't gonna be free, no matter how hard you
try, and you're plugging into paths where every cycle added is made of
userspace hide.

	-Mike

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
  2017-07-31 19:49             ` Mike Galbraith
@ 2017-07-31 20:38               ` Johannes Weiner
  -1 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2017-07-31 20:38 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Rik van Riel,
	Mel Gorman, linux-mm, linux-kernel, kernel-team

On Mon, Jul 31, 2017 at 09:49:39PM +0200, Mike Galbraith wrote:
> On Mon, 2017-07-31 at 14:41 -0400, Johannes Weiner wrote:
> > 
> > Adding an rq counter for tasks inside memdelay sections should be
> > straight-forward as well (except for maybe the migration cost of that
> > state between CPUs in ttwu that Mike pointed out).
> 
> What I pointed out should be easily eliminated (zero use case).

How so?

> > That leaves the question of how to track these numbers per cgroup at
> > an acceptable cost. The idea for a tree of cgroups is that walltime
> > impact of delays at each level is reported for all tasks at or below
> > that level. E.g. a leaf group aggregates the state of its own tasks,
> > the root/system aggregates the state of all tasks in the system; hence
> > the propagation of the task state counters up the hierarchy.
> 
> The crux of the biscuit is where exactly the investment return lies.
>  Gathering of these numbers ain't gonna be free, no matter how hard you
> try, and you're plugging into paths where every cycle added is made of
> userspace hide.

Right. But how to implement it sanely and optimize for cycles, and
whether we want to default-enable this interface are two separate
conversations.

It makes sense to me to first make the implementation as lightweight
on cycles and maintainability as possible, and then worry about the
cost / benefit defaults of the shipped Linux kernel afterwards.

That goes for the purely informative userspace interface, anyway. The
easily-provoked thrashing livelock I have described in the email to
Andrew is a different matter. If the OOM killer requires hooking up to
this metric to fix it, it won't be optional. But the OOM code isn't
part of this series yet, so again a conversation best had later, IMO.

PS: I'm stealing the "made of userspace hide" thing.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
  2017-07-31 20:38               ` Johannes Weiner
@ 2017-08-01  2:23                 ` Mike Galbraith
  -1 siblings, 0 replies; 43+ messages in thread
From: Mike Galbraith @ 2017-08-01  2:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Rik van Riel,
	Mel Gorman, linux-mm, linux-kernel, kernel-team

On Mon, 2017-07-31 at 16:38 -0400, Johannes Weiner wrote:
> On Mon, Jul 31, 2017 at 09:49:39PM +0200, Mike Galbraith wrote:
> > On Mon, 2017-07-31 at 14:41 -0400, Johannes Weiner wrote:
> > > 
> > > Adding an rq counter for tasks inside memdelay sections should be
> > > straight-forward as well (except for maybe the migration cost of that
> > > state between CPUs in ttwu that Mike pointed out).
> > 
> > What I pointed out should be easily eliminated (zero use case).
> 
> How so?

I was thinking along the lines of schedstat_enabled().
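
The general shape of such a gate, as a userspace sketch with
hypothetical names: the accounting path checks a switch first and does
nothing when the feature is off. A plain flag stands in here for
whatever mechanism the kernel would actually use:

```c
#include <stdbool.h>

/* Hypothetical runtime switch; a plain bool stands in for the real gate. */
static bool memdelay_enabled;
static unsigned long nr_updates;

static void memdelay_account(void)
{
	/* Disabled: the hot path pays for one branch and nothing more. */
	if (!memdelay_enabled)
		return;
	nr_updates++;	/* placeholder for the real accounting work */
}
```

With the switch off, users who never read the statistics pay only for
the check.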

> > > That leaves the question of how to track these numbers per cgroup at
> > > an acceptable cost. The idea for a tree of cgroups is that walltime
> > > impact of delays at each level is reported for all tasks at or below
> > > that level. E.g. a leaf group aggregates the state of its own tasks,
> > > the root/system aggregates the state of all tasks in the system; hence
> > > the propagation of the task state counters up the hierarchy.
> > 
> > The crux of the biscuit is where exactly the investment return lies.
> >  Gathering of these numbers ain't gonna be free, no matter how hard you
> > try, and you're plugging into paths where every cycle added is made of
> > userspace hide.
> 
> Right. But how to implement it sanely and optimize for cycles, and
> whether we want to default-enable this interface are two separate
> conversations.
> 
> It makes sense to me to first make the implementation as lightweight
> on cycles and maintainability as possible, and then worry about the
> cost / benefit defaults of the shipped Linux kernel afterwards.
> 
> That goes for the purely informative userspace interface, anyway. The
> easily-provoked thrashing livelock I have described in the email to
> Andrew is a different matter. If the OOM killer requires hooking up to
> this metric to fix it, it won't be optional. But the OOM code isn't
> part of this series yet, so again a conversation best had later, IMO.

If that "the many must pay a toll to save the few" conversation ever
happens, just recall me registering my boo/hiss in advance.  I don't
have to feel guilty about not liking the idea of making donations to
feed the poor starving proggies ;-)

	-Mike

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
@ 2017-08-01  2:23                 ` Mike Galbraith
  0 siblings, 0 replies; 43+ messages in thread
From: Mike Galbraith @ 2017-08-01  2:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Rik van Riel,
	Mel Gorman, linux-mm, linux-kernel, kernel-team

On Mon, 2017-07-31 at 16:38 -0400, Johannes Weiner wrote:
> On Mon, Jul 31, 2017 at 09:49:39PM +0200, Mike Galbraith wrote:
> > On Mon, 2017-07-31 at 14:41 -0400, Johannes Weiner wrote:
> > > 
> > > Adding an rq counter for tasks inside memdelay sections should be
> > > straight-forward as well (except for maybe the migration cost of that
> > > state between CPUs in ttwu that Mike pointed out).
> > 
> > What I pointed out should be easily eliminated (zero use case).
> 
> How so?

I was thinking along the lines of schedstat_enabled().

> > > That leaves the question of how to track these numbers per cgroup at
> > > an acceptable cost. The idea for a tree of cgroups is that walltime
> > > impact of delays at each level is reported for all tasks at or below
> > > that level. E.g. a leave group aggregates the state of its own tasks,
> > > the root/system aggregates the state of all tasks in the system; hence
> > > the propagation of the task state counters up the hierarchy.
> > 
> > The crux of the biscuit is where exactly the investment return lies.
> >  Gathering of these numbers ain't gonna be free, no matter how hard you
> > try, and you're plugging into paths where every cycle added is made of
> > userspace hide.
> 
> Right. But how to implement it sanely and optimize for cycles, and
> whether we want to default-enable this interface are two separate
> conversations.
> 
> It makes sense to me to first make the implementation as lightweight
> on cycles and maintainability as possible, and then worry about the
> cost / benefit defaults of the shipped Linux kernel afterwards.
> 
> That goes for the purely informative userspace interface, anyway. The
> easily-provoked thrashing livelock I have described in the email to
> Andrew is a different matter. If the OOM killer requires hooking up to
> this metric to fix it, it won't be optional. But the OOM code isn't
> part of this series yet, so again a conversation best had later, IMO.

If that "the many must pay a toll to save the few" conversation ever
happens, just recall me registering my boo/hiss in advance.  I don't
have to feel guilty about not liking the idea of making donations to
feed the poor starving proggies ;-)

	-Mike

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
  2017-07-31 18:41           ` Johannes Weiner
@ 2017-08-01  7:57             ` Peter Zijlstra
  -1 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2017-08-01  7:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Rik van Riel, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

On Mon, Jul 31, 2017 at 02:41:42PM -0400, Johannes Weiner wrote:
> On Mon, Jul 31, 2017 at 10:31:11AM +0200, Peter Zijlstra wrote:

> > So could you start by describing what actual statistics we need? Because
> > as is the scheduler already does a gazillion stats and why can't we
> > repurpose some of those?
> 
> If that's possible, that would be great of course.
> 
> We want to be able to tell how many tasks in a domain (the system or a
> memory cgroup) are inside a memdelay section as opposed to how many

And you haven't even defined wth a memdelay section is yet..

> are in a "productive" state such as runnable or iowait. Then derive
> from that whether the domain as a whole is unproductive (all non-idle
> tasks memdelayed), or partially unproductive (some delayed, but CPUs
> are productive or there are iowait tasks). Then derive the percentages
> of walltime the domain spends partially or fully unproductive.
> 
> For that we need per-domain counters for
> 
> 	1) nr of tasks in memdelay sections
> 	2) nr of iowait or runnable/queued tasks that are NOT inside
> 	   memdelay sections

And I still have no clue..

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
  2017-08-01  7:57             ` Peter Zijlstra
@ 2017-08-01 12:26               ` Johannes Weiner
  -1 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2017-08-01 12:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Rik van Riel, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

On Tue, Aug 01, 2017 at 09:57:28AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 31, 2017 at 02:41:42PM -0400, Johannes Weiner wrote:
> > On Mon, Jul 31, 2017 at 10:31:11AM +0200, Peter Zijlstra wrote:
> 
> > > So could you start by describing what actual statistics we need? Because
> > > as is the scheduler already does a gazillion stats and why can't we
> > > repurpose some of those?
> > 
> > If that's possible, that would be great of course.
> > 
> > We want to be able to tell how many tasks in a domain (the system or a
> > memory cgroup) are inside a memdelay section as opposed to how many
> 
> And you haven't even defined wth a memdelay section is yet..

It's what a task is in after it calls memdelay_enter() and before it
calls memdelay_leave().

Tasks mark themselves as being in a memdelay section when they know
they are performing work that is only necessary due to a lack of
memory, such as waiting for a refault or a direct reclaim invocation.
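
As a rough illustration of how such section marking could nest, here is a userspace sketch. It is not the implementation from the patch: the struct and the _sketch function names are hypothetical, and it assumes the flags argument simply saves the state found on entry.

```c
#include <stdbool.h>

/* Hypothetical stand-in for per-task state; not the kernel's task_struct. */
struct task {
	bool in_memdelay;
};

/*
 * Enter a memdelay section. The previous state is saved in @flags so that
 * a nested section leaves the outer section's marking intact.
 */
static void memdelay_enter_sketch(struct task *t, unsigned long *flags)
{
	*flags = t->in_memdelay;
	t->in_memdelay = true;
}

/* Leave a memdelay section, restoring the state saved by the enter call. */
static void memdelay_leave_sketch(struct task *t, const unsigned long *flags)
{
	t->in_memdelay = *flags;
}
```

Under this assumption, leaving an inner section keeps the task marked as delayed until the outermost section ends.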

From the patch:

+/**
+ * memdelay_enter - mark the beginning of a memory delay section
+ * @flags: flags to handle nested memdelay sections
+ *
+ * Marks the calling task as being delayed due to a lack of memory,
+ * such as waiting for a workingset refault or performing reclaim.
+ */

+/**
+ * memdelay_leave - mark the end of a memory delay section
+ * @flags: flags to handle nested memdelay sections
+ *
+ * Marks the calling task as no longer delayed due to memory.
+ */

where a reclaim callsite looks like this (decluttered):

	memdelay_enter()
	nr_reclaimed = do_try_to_free_pages()
	memdelay_leave()

That's what defines the "unproductive due to lack of memory" state of
a task. Time spent in that state weighed against time spent while the
task is productive - runnable or in iowait while not in a memdelay
section - gives the memory health of the task. And the system and
cgroup states/health can be derived from task states as described:

> > are in a "productive" state such as runnable or iowait. Then derive
> > from that whether the domain as a whole is unproductive (all non-idle
> > tasks memdelayed), or partially unproductive (some delayed, but CPUs
> > are productive or there are iowait tasks). Then derive the percentages
> > of walltime the domain spends partially or fully unproductive.
> > 
> > For that we need per-domain counters for
> > 
> > 	1) nr of tasks in memdelay sections
> > 	2) nr of iowait or runnable/queued tasks that are NOT inside
> > 	   memdelay sections
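
The derivation quoted above can be sketched in plain C. This is a userspace illustration under stated assumptions, not code from the patch: the struct, enum, and function names are hypothetical, and the two counters are assumed to be sampled consistently at every state change.

```c
#include <stdint.h>

/* Hypothetical per-domain counters, matching the two items listed above. */
struct domain_counters {
	unsigned int nr_memdelayed;	/* 1) tasks inside memdelay sections */
	unsigned int nr_productive;	/* 2) iowait or runnable/queued tasks
					 *    NOT inside memdelay sections */
};

enum domain_state {
	DOMAIN_NONE,	/* no task is delayed by memory */
	DOMAIN_SOME,	/* some tasks delayed, others still productive */
	DOMAIN_FULL,	/* all non-idle tasks are memdelayed */
};

static enum domain_state domain_classify(const struct domain_counters *c)
{
	if (!c->nr_memdelayed)
		return DOMAIN_NONE;
	return c->nr_productive ? DOMAIN_SOME : DOMAIN_FULL;
}

/* Accumulated time per state; hypothetical bookkeeping. */
struct domain_times {
	uint64_t ns[3];	/* indexed by enum domain_state */
};

/* Charge the time since the last state change to the state then in effect. */
static void domain_account(struct domain_times *t,
			   const struct domain_counters *c, uint64_t delta_ns)
{
	t->ns[domain_classify(c)] += delta_ns;
}

/* Percentage of recorded walltime spent in @state; 0 if nothing recorded. */
static unsigned int domain_percent(const struct domain_times *t,
				   enum domain_state state)
{
	uint64_t total = t->ns[DOMAIN_NONE] + t->ns[DOMAIN_SOME] +
			 t->ns[DOMAIN_FULL];

	return total ? (unsigned int)(t->ns[state] * 100 / total) : 0;
}
```

A caller would invoke domain_account() with the elapsed time just before changing either counter, then read the partially and fully unproductive percentages from DOMAIN_SOME and DOMAIN_FULL.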

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/3] memdelay: memory health metric for systems and workloads
  2017-07-28 19:43     ` Johannes Weiner
@ 2017-08-02  8:11       ` Michal Hocko
  -1 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2017-08-02  8:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Ingo Molnar, Peter Zijlstra, Rik van Riel,
	Mel Gorman, linux-mm, linux-kernel, kernel-team

On Fri 28-07-17 15:43:37, Johannes Weiner wrote:
> Hi Andrew,
> 
> On Thu, Jul 27, 2017 at 01:43:25PM -0700, Andrew Morton wrote:
> > On Thu, 27 Jul 2017 11:30:07 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> > 
> > > This patch series implements a fine-grained metric for memory
> > > health.
> > 
> > I assume some Documentation/ is forthcoming.
> 
> Yep, I'll describe the interface and how to use this more extensively.
> 
> > Consuming another page flag hurts.  What's our current status there?
> 
> I would say we can make it 64-bit only, but I also need this refault
> distinction flag in the LRU balancing patches [1] to apply pressure on
> anon pages only when the page cache is actually thrashing, not when
> it's just transitioning to another workingset. So let's see...

I haven't looked at the patchset yet, so this is only about this part. I
guess you can do without a new page flag. PG_slab could be reused with
some care AFAICS.  Slab allocators do not seem to use other page flags,
so we could make

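bool PageSlab(struct page *page)
{
	/* Only the page flag bits proper, not the zone/node bits above them. */
	unsigned long flags = page->flags & ((1UL << NR_PAGEFLAGS) - 1);

	/* A slab page has PG_slab set and no other page flag. */
	return flags == (1UL << PG_slab);
}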

and then reuse the same bit for working set pages. Page cache will
almost always have LRU bit set and workingset_eviction assumes PG_locked
so we will have another bit set when needed. I know this is fuggly and
subtle but basically everything about struct page is inevitably like
that...
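
To make the bit-sharing idea concrete, here is a self-contained userspace sketch. The flag layout, struct page, and helper names are hypothetical stand-ins (the real definitions live in page-flags.h), and it assumes a workingset page always carries at least one other flag such as LRU or locked:

```c
#include <stdbool.h>

/* Hypothetical flag layout for illustration only. */
enum { PG_locked, PG_lru, PG_slab, NR_PAGEFLAGS };

struct page {
	unsigned long flags;	/* bits above NR_PAGEFLAGS hold other data */
};

#define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1)

/* A slab page has PG_slab set and no other page flag. */
static bool page_is_slab(const struct page *page)
{
	return (page->flags & PAGEFLAGS_MASK) == (1UL << PG_slab);
}

/* A workingset page reuses the PG_slab bit, always alongside another flag. */
static bool page_is_workingset(const struct page *page)
{
	unsigned long flags = page->flags & PAGEFLAGS_MASK;

	return (flags & (1UL << PG_slab)) && flags != (1UL << PG_slab);
}
```

The scheme only holds as long as slab pages never set any other flag, which is the "some care" part.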
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads
  2017-08-01 12:26               ` Johannes Weiner
@ 2017-08-13 14:52                 ` Peter Zijlstra
  -1 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2017-08-13 14:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Rik van Riel, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

On Tue, Aug 01, 2017 at 08:26:34AM -0400, Johannes Weiner wrote:
> On Tue, Aug 01, 2017 at 09:57:28AM +0200, Peter Zijlstra wrote:

> > And you haven't even defined wth a memdelay section is yet..
> 
> It's what a task is in after it calls memdelay_enter() and before it
> calls memdelay_leave().

Urgh, yes that makes it harder to reuse existing bits.. although
delayacct seems to do something vaguely similar. I've never really
looked at that, but if you can reuse/merge that would of course be good.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2017-08-13 14:52 UTC | newest]

Thread overview: 43+ messages
2017-07-27 15:30 [PATCH 0/3] memdelay: memory health metric for systems and workloads Johannes Weiner
2017-07-27 15:30 ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros Johannes Weiner
2017-07-27 15:30 ` [PATCH 2/3] mm: workingset: tell cache transitions from workingset thrashing Johannes Weiner
2017-07-27 15:30 ` [PATCH 3/3] mm/sched: memdelay: memory health interface for systems and workloads Johannes Weiner
2017-07-27 15:56   ` Johannes Weiner
2017-07-29  9:10   ` Peter Zijlstra
2017-07-30 15:28     ` Johannes Weiner
2017-07-31  8:31       ` Peter Zijlstra
2017-07-31 18:41         ` Johannes Weiner
2017-07-31 19:49           ` Mike Galbraith
2017-07-31 20:38             ` Johannes Weiner
2017-08-01  2:23               ` Mike Galbraith
2017-08-01  7:57           ` Peter Zijlstra
2017-08-01 12:26             ` Johannes Weiner
2017-08-13 14:52               ` Peter Zijlstra
2017-07-29 13:31   ` kbuild test robot
2017-07-27 20:43 ` [PATCH 0/3] memdelay: memory health metric " Andrew Morton
2017-07-28 19:43   ` Johannes Weiner
2017-08-02  8:11     ` Michal Hocko
2017-07-29  2:48 ` Mike Galbraith
2017-07-29  3:21   ` Mike Galbraith
2017-07-29  6:38   ` Mike Galbraith