linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
@ 2018-07-12 17:29 Johannes Weiner
  2018-07-12 17:29 ` [PATCH 01/10] mm: workingset: don't drop refault information prematurely Johannes Weiner
                   ` (14 more replies)
  0 siblings, 15 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-12 17:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
  Cc: Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

PSI aggregates and reports the overall wallclock time in which the
tasks in a system (or cgroup) wait for contended hardware resources.

This helps users understand the resource pressure their workloads are
under, which allows them to rootcause and fix throughput and latency
problems caused by overcommitting, underprovisioning, suboptimal job
placement in a grid, as well as anticipate major disruptions like OOM.

This version 2 of the series incorporates a ton of feedback from
PeterZ and SurenB; more details at the end of this email.

		Real-world applications

We're using the data collected by psi (and its previous incarnation,
memdelay) quite extensively at Facebook, with several success stories.

One usecase is avoiding OOM hangs/livelocks. These happen because
the OOM killer is triggered by reclaim not being able to
free pages, but with fast flash devices there is *always* some clean
and uptodate cache to reclaim; the OOM killer never kicks in, even as
tasks spend 90% of the time thrashing the cache pages of their own
executables. There is no situation where this ever makes sense in
practice. We wrote a <100 line POC python script to monitor memory
pressure and kill stuff way before such pathological thrashing leads
to full system losses that require forcible hard resets.

We've since extended and deployed this code into other places to
guarantee latency and throughput SLAs, since they're usually violated
way before the kernel OOM killer would ever kick in.

The idea is to eventually incorporate this back into the kernel, so
that Linux can avoid OOM livelocks (which TECHNICALLY aren't memory
deadlocks, but are indistinguishable to the user) out of the box.

We also use psi memory pressure for loadshedding. Our batch job
infrastructure used to use heuristics based on various VM stats to
anticipate OOM situations, with lackluster success. We switched it to
psi and managed to anticipate and avoid OOM kills and hangs fairly
reliably. The reduction of OOM outages in the worker pool raised the
pool's aggregate productivity, and we were able to switch that service
to smaller machines.

Lastly, we use cgroups to isolate a machine's main workload from
maintenance crap like package upgrades, logging, configuration, as
well as to prevent multiple workloads on a machine from stepping on
each other's toes. We were not able to configure this properly without
the pressure metrics; we would see latency or bandwidth drops, but it
would often be hard to impossible to rootcause it post-mortem.

We now log and graph pressure for the containers in our fleet and can
trivially link latency spikes and throughput drops to shortages of
specific resources after the fact, and fix the job config/scheduling.

I've also received feedback and feature requests from Android for the
purpose of low-latency OOM killing. The on-demand stats aggregation in
the last patch of this series is for this purpose, to allow Android to
react to pressure before the system starts visibly hanging.

		How do you use this feature?

A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
3 files: cpu, memory, and io. If using cgroup2, cgroups will also have
cpu.pressure, memory.pressure and io.pressure files, which simply
aggregate task stalls at the cgroup level instead of system-wide.

The cpu file contains one line:

	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722

The averages give the percentage of walltime in which one or more
tasks are delayed on the runqueue while another task has the
CPU. They're recent averages over 10s, 1m, 5m windows, so you can tell
short term trends from long term ones, similarly to the load average.

The total= value gives the absolute stall time in microseconds. This
allows detecting latency spikes that might be too short to sway the
running averages. It also allows custom time averaging in case the
10s/1m/5m windows aren't adequate for the usecase (or are too coarse
with future hardware).
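
As a rough illustration of the custom averaging described above, here
is a minimal userspace sketch in C. It is not part of this series: the
struct and function names (psi_line, psi_parse_line, psi_window_pct)
are hypothetical, and it assumes the line format shown above with
total= in microseconds. A monitor would sample total= twice and divide
the difference by the elapsed wallclock time.

```c
#include <stdio.h>

/* Hypothetical container for one parsed line of a pressure file. */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* percentages */
	unsigned long long total;	/* cumulative stall time, microseconds */
};

/* Parse one line in the format shown above; returns 0 on success. */
static int psi_parse_line(const char *s, struct psi_line *out)
{
	int n = sscanf(s, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total);

	return n == 5 ? 0 : -1;
}

/*
 * Percentage of wallclock time stalled between two samples of total=
 * that were taken window_us microseconds apart.
 */
static double psi_window_pct(unsigned long long total_prev,
			     unsigned long long total_now,
			     unsigned long long window_us)
{
	if (window_us == 0 || total_now < total_prev)
		return 0.0;
	return 100.0 * (double)(total_now - total_prev) / (double)window_us;
}
```

For example, if total= grows by 500000 between two reads that are two
seconds apart, psi_window_pct() reports 25% stall time for that window,
regardless of what the 10s/1m/5m averages currently show.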

What to make of this "some" metric? If CPU utilization is at 100% and
CPU pressure is 0, it means the system is perfectly utilized, with one
runnable thread per CPU and nobody waiting. At two or more runnable
tasks per CPU, the system is 100% overcommitted and the pressure
average will indicate as much. From a utilization perspective this is
a great state of course: no CPU cycles are being wasted, even if 50%
of the threads were to go idle (as most workloads do vary). From the
perspective of the individual job it's not great, however, and they
would do better with more resources. Depending on what your priority
and options are, raised "some" numbers may or may not require action.

The memory file contains two lines:

	some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
	full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258

The some line is the same as for cpu, the time in which at least one
task is stalled on the resource. In the case of memory, this includes
waiting on swap-in, page cache refaults and page reclaim.

The full line, however, indicates time in which *nobody* is using the
CPU productively due to pressure: all non-idle tasks are waiting for
memory in one form or another. Significant time spent in there is a
good trigger for killing things, moving jobs to other machines, or
dropping incoming requests, since neither the jobs nor the machine
overall are making too much headway.
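
The difference between the two lines can be sketched as two predicates
over a snapshot of task counts. This is a deliberate simplification for
illustration only: the struct and function names are hypothetical, and
the actual accounting in this series works per CPU and is more involved
than a single global count.

```c
#include <stdbool.h>

/* Hypothetical snapshot of task counts at one instant. */
struct task_counts {
	unsigned int nonidle;	/* tasks that are running or stalled */
	unsigned int stalled;	/* of those, tasks waiting on the resource */
};

/* "some": at least one task is stalled on the resource. */
static bool psi_some(const struct task_counts *c)
{
	return c->stalled > 0;
}

/* "full": every non-idle task is stalled, so nobody makes progress. */
static bool psi_full(const struct task_counts *c)
{
	return c->stalled > 0 && c->stalled == c->nonidle;
}
```

Under this simplified model, four non-idle tasks with one stalled count
toward "some" only, while three non-idle tasks that are all stalled
count toward both "some" and "full".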

The io file is similar to memory. Because the block layer doesn't have
a concept of hardware contention right now (how much longer is my IO
request taking due to other tasks?), it reports CPU potential lost on
all IO delays, not just the potential lost due to competition.

		FAQ

Q: How is PSI's CPU component different from the load average?

A: There are several quirks in the load average that make it hard to
   impossible to tell how overcommitted the CPU really is.

   1. The load average is reported as a raw number of active tasks.
      You need to know how many CPUs there are in the system, how many
      CPUs the workload is allowed to use, then think about what the
      proportion between load and the number of CPUs means for the
      tasks trying to run.

      PSI reports the percentage of wallclock time in which tasks are
      waiting for a CPU to run on. It doesn't matter how many CPUs are
      present or usable. The number always tells the quality of life
      of tasks in the system or in a particular cgroup.

   2. The shortest averaging window is 1m, which is extremely coarse,
      and it's sampled in 5s intervals. A *lot* can happen on a CPU in
      5 seconds. This *may* be able to identify persistent long-term
      trends and very clear and obvious overloads, but it's unusable
      for latency spikes and more subtle overutilization.

      PSI's shortest window is 10s. It also exports the cumulative
      stall times (in microseconds) of synchronously recorded events.

   3. On Linux, the load average for historical reasons includes all
      TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
      busy the system is, but on the flipside it doesn't distinguish
      whether tasks are likely to contend over the CPU or IO - which
      obviously requires very different interventions from a sys admin
      or a job scheduler.

      PSI reports independent metrics for CPU and IO. You can tell
      which resource is making the tasks wait, but in conjunction
      still see how overloaded the system is overall.

These patches are against v4.17. They're maintained against upstream
here as well: http://git.cmpxchg.org/cgit.cgi/linux-psi.git

 Documentation/accounting/psi.txt                |  73 +++
 Documentation/cgroup-v2.txt                     |  18 +
 arch/powerpc/platforms/cell/cpufreq_spudemand.c |   2 +-
 arch/powerpc/platforms/cell/spufs/sched.c       |   9 +-
 arch/s390/appldata/appldata_os.c                |   4 -
 drivers/cpuidle/governors/menu.c                |   4 -
 fs/proc/loadavg.c                               |   3 -
 include/linux/cgroup-defs.h                     |   4 +
 include/linux/cgroup.h                          |  15 +
 include/linux/delayacct.h                       |  23 +
 include/linux/mmzone.h                          |   1 +
 include/linux/page-flags.h                      |   5 +-
 include/linux/psi.h                             |  52 ++
 include/linux/psi_types.h                       |  90 +++
 include/linux/sched.h                           |  10 +
 include/linux/sched/loadavg.h                   |  24 +-
 include/linux/sched/stat.h                      |  10 +-
 include/linux/swap.h                            |   2 +-
 include/trace/events/mmflags.h                  |   1 +
 include/uapi/linux/taskstats.h                  |   6 +-
 init/Kconfig                                    |  20 +
 kernel/cgroup/cgroup.c                          |  45 +-
 kernel/debug/kdb/kdb_main.c                     |   7 +-
 kernel/delayacct.c                              |  15 +
 kernel/fork.c                                   |   4 +
 kernel/sched/Makefile                           |   1 +
 kernel/sched/core.c                             |  11 +-
 kernel/sched/loadavg.c                          | 139 ++---
 kernel/sched/psi.c                              | 699 ++++++++++++++++++++++
 kernel/sched/sched.h                            | 178 +++---
 kernel/sched/stats.h                            | 102 +++-
 mm/compaction.c                                 |   5 +
 mm/filemap.c                                    |  27 +-
 mm/huge_memory.c                                |   1 +
 mm/memcontrol.c                                 |   2 +
 mm/migrate.c                                    |   2 +
 mm/page_alloc.c                                 |  10 +
 mm/swap_state.c                                 |   1 +
 mm/vmscan.c                                     |  14 +
 mm/vmstat.c                                     |   1 +
 mm/workingset.c                                 | 113 ++--
 tools/accounting/getdelays.c                    |   8 +-
 42 files changed, 1505 insertions(+), 256 deletions(-)

Changes in v2:
- Extensive documentation and comment update. Per everybody.
  In particular, I've added a much more detailed explanation
  of the SMP model, which caused some misunderstandings last time.
- Uninlined calc_load_n(), as it was just too fat. Per Peter.
- Split kernel/sched/stats.h churn into its own commit to
  avoid noise in the main patch and explain the reshuffle. Per Peter.
- Abstracted this_rq_lock_irq(). Per Peter.
- Eliminated cumulative clock drift error. Per Peter.
- Packed the per-cpu datastructure. Per Peter.
- Fixed 64-bit divisions on 32 bit. Per Peter.
- Added outer-most psi_disabled checks. Per Peter.
- Fixed some coding style issues. Per Peter.
- Fixed a bug in the lazy clock. Per Suren.
- On-demand stat aggregation when user reads. Per Suren.
- Fixed task state corruption on preemption race. Per Suren.
- Fixed a CONFIG_PSI=n build error.
- Minor cleanups, optimizations.




* [PATCH 01/10] mm: workingset: don't drop refault information prematurely
  2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
@ 2018-07-12 17:29 ` Johannes Weiner
  2018-07-12 17:29 ` [PATCH 02/10] mm: workingset: tell cache transitions from workingset thrashing Johannes Weiner
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-12 17:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
  Cc: Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

From: Johannes Weiner <jweiner@fb.com>

If we just keep enough refault information to match the CURRENT page
cache during reclaim time, we could lose a lot of events when there is
only a temporary spike in non-cache memory consumption that pushes out
all the cache. Once cache comes back, we won't see those refaults.
They might not be actionable for LRU aging, but we want to know about
them for measuring memory pressure.
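
To make the effect on the shadow node limit concrete, here is a small
standalone sketch of the shift used in count_shadow_nodes(). The value
6 for RADIX_TREE_MAP_SHIFT is an assumption (it is the usual value, but
it depends on the kernel configuration), and the function name is
hypothetical.

```c
/*
 * Assumed value: RADIX_TREE_MAP_SHIFT is 6 in common configurations
 * (64 slots per node). This is an illustration, not a kernel constant.
 */
#define MAP_SHIFT_ASSUMED 6

/* Upper bound on shadow nodes to keep for a given number of pages. */
static unsigned long max_shadow_nodes(unsigned long pages)
{
	return pages >> (MAP_SHIFT_ASSUMED - 3);
}
```

With that assumption, 1048576 pages (4G of memory with 4k pages) allow
up to 131072 shadow nodes. After this patch the input is the memcg's
memory usage or the node's present pages, rather than only the file
pages that happen to be on the LRU at reclaim time.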

Signed-off-by: Johannes Weiner <jweiner@fb.com>
---
 mm/workingset.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 40ee02c83978..53759a3cf99a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -364,7 +364,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 {
 	unsigned long max_nodes;
 	unsigned long nodes;
-	unsigned long cache;
+	unsigned long pages;
 
 	/* list_lru lock nests inside the IRQ-safe i_pages lock */
 	local_irq_disable();
@@ -393,14 +393,14 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 	 *
 	 * PAGE_SIZE / radix_tree_nodes / node_entries * 8 / PAGE_SIZE
 	 */
-	if (sc->memcg) {
-		cache = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
-						     LRU_ALL_FILE);
-	} else {
-		cache = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
-			node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
-	}
-	max_nodes = cache >> (RADIX_TREE_MAP_SHIFT - 3);
+#ifdef CONFIG_MEMCG
+	if (sc->memcg)
+		pages = page_counter_read(&sc->memcg->memory);
+	else
+#endif
+		pages = node_present_pages(sc->nid);
+
+	max_nodes = pages >> (RADIX_TREE_MAP_SHIFT - 3);
 
 	if (nodes <= max_nodes)
 		return 0;
-- 
2.18.0



* [PATCH 02/10] mm: workingset: tell cache transitions from workingset thrashing
  2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
  2018-07-12 17:29 ` [PATCH 01/10] mm: workingset: don't drop refault information prematurely Johannes Weiner
@ 2018-07-12 17:29 ` Johannes Weiner
  2018-07-23 13:36   ` Arnd Bergmann
  2018-07-12 17:29 ` [PATCH 03/10] delayacct: track delays from thrashing cache pages Johannes Weiner
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2018-07-12 17:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
  Cc: Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

Refaults happen during transitions between workingsets as well as
in-place thrashing. Knowing the difference between the two has a range
of applications, including measuring the impact of memory shortage on
the system performance, as well as the ability to balance pressure
more intelligently between the filesystem cache and the swap-backed
workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bonafide thrashing.

Introduce a new page flag that tells on eviction whether the page has
been active or not in its lifetime. This bit is then stored in the
shadow entry, to classify refaults as transitioning or thrashing.
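
The following userspace sketch illustrates the idea: one extra bit
travels with the eviction counter in the shadow entry and is used on
refault to tell a transition from thrashing. The field widths and all
names here are illustrative assumptions; the kernel derives the real
widths from its configuration and also applies a bucket_order shift,
which is omitted.

```c
#include <stdbool.h>

/* Illustrative field widths, not the kernel's actual configuration. */
#define EX_SHIFT	2	/* low bits reserved for the entry tag */
#define EX_TAG		2UL
#define NODE_BITS	6
#define MEMCG_BITS	16

static unsigned long shadow_pack(unsigned long memcgid, unsigned long nid,
				 unsigned long eviction, bool workingset)
{
	eviction = (eviction << MEMCG_BITS) | memcgid;
	eviction = (eviction << NODE_BITS) | nid;
	eviction = (eviction << 1) | workingset;	/* the new bit */
	return (eviction << EX_SHIFT) | EX_TAG;
}

static void shadow_unpack(unsigned long entry, unsigned long *memcgid,
			  unsigned long *nid, unsigned long *eviction,
			  bool *workingset)
{
	entry >>= EX_SHIFT;
	*workingset = entry & 1;
	entry >>= 1;
	*nid = entry & ((1UL << NODE_BITS) - 1);
	entry >>= NODE_BITS;
	*memcgid = entry & ((1UL << MEMCG_BITS) - 1);
	entry >>= MEMCG_BITS;
	*eviction = entry;
}

enum refault_kind { REFAULT_IGNORED, REFAULT_TRANSITION, REFAULT_THRASHING };

/* Classify a refault from its distance and the stored workingset bit. */
static enum refault_kind classify_refault(unsigned long distance,
					  unsigned long active_size,
					  bool workingset)
{
	if (distance > active_size)
		return REFAULT_IGNORED;	/* would not have stayed resident */
	return workingset ? REFAULT_THRASHING : REFAULT_TRANSITION;
}
```

In this sketch, a refault within the active list size counts as
thrashing only when the evicted page had been active at some point in
its lifetime; otherwise it is treated as a workingset transition.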

How many page->flags does this leave us with on 32-bit?

	20 bits are always page flags

	21 if you have an MMU

	23 with the zone bits for DMA, Normal, HighMem, Movable

	29 with the sparsemem section bits

	30 if PAE is enabled

	31 with this patch.

So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA
nodes. If that's not enough, the system can switch to discontigmem and
re-gain the 6 or 7 sparsemem section bits.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h         |  1 +
 include/linux/page-flags.h     |  5 +-
 include/linux/swap.h           |  2 +-
 include/trace/events/mmflags.h |  1 +
 mm/filemap.c                   |  9 ++--
 mm/huge_memory.c               |  1 +
 mm/memcontrol.c                |  2 +
 mm/migrate.c                   |  2 +
 mm/swap_state.c                |  1 +
 mm/vmscan.c                    |  1 +
 mm/vmstat.c                    |  1 +
 mm/workingset.c                | 95 ++++++++++++++++++++++------------
 12 files changed, 79 insertions(+), 42 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2dc52a..6af87946d241 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,7 @@ enum node_stat_item {
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
+	WORKINGSET_RESTORE,
 	WORKINGSET_NODERECLAIM,
 	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e34a27727b9a..7af1c3c15d8e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -69,13 +69,14 @@
  */
 enum pageflags {
 	PG_locked,		/* Page is locked. Don't touch. */
-	PG_error,
 	PG_referenced,
 	PG_uptodate,
 	PG_dirty,
 	PG_lru,
 	PG_active,
+	PG_workingset,
 	PG_waiters,		/* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
+	PG_error,
 	PG_slab,
 	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
 	PG_arch_1,
@@ -280,6 +281,8 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
+PAGEFLAG(Workingset, workingset, PF_HEAD)
+	TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)	   /* Used by some filesystems */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2417d288e016..d8c47dcdec6f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -296,7 +296,7 @@ struct vma_swap_readahead {
 
 /* linux/mm/workingset.c */
 void *workingset_eviction(struct address_space *mapping, struct page *page);
-bool workingset_refault(void *shadow);
+void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 
 /* Do not use directly, use workingset_lookup_update */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a81cffb76d89..a1675d43777e 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -88,6 +88,7 @@
 	{1UL << PG_dirty,		"dirty"		},		\
 	{1UL << PG_lru,			"lru"		},		\
 	{1UL << PG_active,		"active"	},		\
+	{1UL << PG_workingset,		"workingset"	},		\
 	{1UL << PG_slab,		"slab"		},		\
 	{1UL << PG_owner_priv_1,	"owner_priv_1"	},		\
 	{1UL << PG_arch_1,		"arch_1"	},		\
diff --git a/mm/filemap.c b/mm/filemap.c
index 0604cb02e6f3..bd36b7226cf4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -915,12 +915,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 		 * data from the working set, only to cache data that will
 		 * get overwritten with something else, is a waste of memory.
 		 */
-		if (!(gfp_mask & __GFP_WRITE) &&
-		    shadow && workingset_refault(shadow)) {
-			SetPageActive(page);
-			workingset_activation(page);
-		} else
-			ClearPageActive(page);
+		WARN_ON_ONCE(PageActive(page));
+		if (!(gfp_mask & __GFP_WRITE) && shadow)
+			workingset_refault(page, shadow);
 		lru_cache_add(page);
 	}
 	return ret;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b9f3dbd885bd..c67ecf77ea8b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2370,6 +2370,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
 			 (1L << PG_active) |
+			 (1L << PG_workingset) |
 			 (1L << PG_locked) |
 			 (1L << PG_unevictable) |
 			 (1L << PG_dirty)));
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bd3df3d101a..c59519d600ea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5283,6 +5283,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
 		   stat[WORKINGSET_REFAULT]);
 	seq_printf(m, "workingset_activate %lu\n",
 		   stat[WORKINGSET_ACTIVATE]);
+	seq_printf(m, "workingset_restore %lu\n",
+		   stat[WORKINGSET_RESTORE]);
 	seq_printf(m, "workingset_nodereclaim %lu\n",
 		   stat[WORKINGSET_NODERECLAIM]);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 8c0af0f7cab1..a6a9114e62dc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -682,6 +682,8 @@ void migrate_page_states(struct page *newpage, struct page *page)
 		SetPageActive(newpage);
 	} else if (TestClearPageUnevictable(page))
 		SetPageUnevictable(newpage);
+	if (PageWorkingset(page))
+		SetPageWorkingset(newpage);
 	if (PageChecked(page))
 		SetPageChecked(newpage);
 	if (PageMappedToDisk(page))
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 07f9aa2340c3..2721ef8862d1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -451,6 +451,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			/*
 			 * Initiate read into locked page and return.
 			 */
+			SetPageWorkingset(new_page);
 			lru_cache_add_anon(new_page);
 			*new_page_allocated = true;
 			return new_page;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9270a4370d54..8d1ad48ffbcd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1976,6 +1976,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		}
 
 		ClearPageActive(page);	/* we are de-activating */
+		SetPageWorkingset(page);
 		list_add(&page->lru, &l_inactive);
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a2b9518980ce..507dc9c01b88 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1145,6 +1145,7 @@ const char * const vmstat_text[] = {
 	"nr_isolated_file",
 	"workingset_refault",
 	"workingset_activate",
+	"workingset_restore",
 	"workingset_nodereclaim",
 	"nr_anon_pages",
 	"nr_mapped",
diff --git a/mm/workingset.c b/mm/workingset.c
index 53759a3cf99a..ef6be3d92116 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -121,7 +121,7 @@
  * the only thing eating into inactive list space is active pages.
  *
  *
- *		Activating refaulting pages
+ *		Refaulting inactive pages
  *
  * All that is known about the active list is that the pages have been
  * accessed more than once in the past.  This means that at any given
@@ -134,6 +134,10 @@
  * used less frequently than the refaulting page - or even not used at
  * all anymore.
  *
+ * That means if inactive cache is refaulting with a suitable refault
+ * distance, we assume the cache workingset is transitioning and put
+ * pressure on the current active list.
+ *
  * If this is wrong and demotion kicks in, the pages which are truly
  * used more frequently will be reactivated while the less frequently
  * used once will be evicted from memory.
@@ -141,6 +145,14 @@
  * But if this is right, the stale pages will be pushed out of memory
  * and the used pages get to stay in cache.
  *
+ *		Refaulting active pages
+ *
+ * If on the other hand the refaulting pages have recently been
+ * deactivated, it means that the active list is no longer protecting
+ * actively used cache from reclaim. The cache is NOT transitioning to
+ * a different workingset; the existing workingset is thrashing in the
+ * space allocated to the page cache.
+ *
  *
  *		Implementation
  *
@@ -156,8 +168,7 @@
  */
 
 #define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
-			 NODES_SHIFT +	\
-			 MEM_CGROUP_ID_SHIFT)
+			 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -170,23 +181,28 @@
  */
 static unsigned int bucket_order __read_mostly;
 
-static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
+static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
+			 bool workingset)
 {
 	eviction >>= bucket_order;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
+	eviction = (eviction << 1) | workingset;
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
 
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
 static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
-			  unsigned long *evictionp)
+			  unsigned long *evictionp, bool *workingsetp)
 {
 	unsigned long entry = (unsigned long)shadow;
 	int memcgid, nid;
+	bool workingset;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	workingset = entry & 1;
+	entry >>= 1;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
 	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
@@ -195,6 +211,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*memcgidp = memcgid;
 	*pgdat = NODE_DATA(nid);
 	*evictionp = entry << bucket_order;
+	*workingsetp = workingset;
 }
 
 /**
@@ -207,8 +224,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
  */
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
-	struct mem_cgroup *memcg = page_memcg(page);
 	struct pglist_data *pgdat = page_pgdat(page);
+	struct mem_cgroup *memcg = page_memcg(page);
 	int memcgid = mem_cgroup_id(memcg);
 	unsigned long eviction;
 	struct lruvec *lruvec;
@@ -220,30 +237,30 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 
 	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
-	return pack_shadow(memcgid, pgdat, eviction);
+	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
 /**
  * workingset_refault - evaluate the refault of a previously evicted page
+ * @page: the freshly allocated replacement page
  * @shadow: shadow entry of the evicted page
  *
  * Calculates and evaluates the refault distance of the previously
  * evicted page in the context of the node it was allocated in.
- *
- * Returns %true if the page should be activated, %false otherwise.
  */
-bool workingset_refault(void *shadow)
+void workingset_refault(struct page *page, void *shadow)
 {
 	unsigned long refault_distance;
+	struct pglist_data *pgdat;
 	unsigned long active_file;
 	struct mem_cgroup *memcg;
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	unsigned long refault;
-	struct pglist_data *pgdat;
+	bool workingset;
 	int memcgid;
 
-	unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
+	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
 
 	rcu_read_lock();
 	/*
@@ -263,41 +280,51 @@ bool workingset_refault(void *shadow)
 	 * configurations instead.
 	 */
 	memcg = mem_cgroup_from_id(memcgid);
-	if (!mem_cgroup_disabled() && !memcg) {
-		rcu_read_unlock();
-		return false;
-	}
+	if (!mem_cgroup_disabled() && !memcg)
+		goto out;
 	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
 
 	/*
-	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases.
+	 * Calculate the refault distance
 	 *
-	 * There is a special case: usually, shadow entries have a
-	 * short lifetime and are either refaulted or reclaimed along
-	 * with the inode before they get too old.  But it is not
-	 * impossible for the inactive_age to lap a shadow entry in
-	 * the field, which can then can result in a false small
-	 * refault distance, leading to a false activation should this
-	 * old entry actually refault again.  However, earlier kernels
-	 * used to deactivate unconditionally with *every* reclaim
-	 * invocation for the longest time, so the occasional
-	 * inappropriate activation leading to pressure on the active
-	 * list is not a problem.
+	 * The unsigned subtraction here gives an accurate distance
+	 * across inactive_age overflows in most cases. There is a
+	 * special case: usually, shadow entries have a short lifetime
+	 * and are either refaulted or reclaimed along with the inode
+	 * before they get too old.  But it is not impossible for the
+	 * inactive_age to lap a shadow entry in the field, which can
+	 * then can result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again.  However, earlier kernels used to deactivate
+	 * unconditionally with *every* reclaim invocation for the
+	 * longest time, so the occasional inappropriate activation
+	 * leading to pressure on the active list is not a problem.
 	 */
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
 	inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
 
-	if (refault_distance <= active_file) {
-		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
-		rcu_read_unlock();
-		return true;
+	/*
+	 * Compare the distance to the existing workingset size. We
+	 * don't act on pages that couldn't stay resident even if all
+	 * the memory was available to the page cache.
+	 */
+	if (refault_distance > active_file)
+		goto out;
+
+	SetPageActive(page);
+	atomic_long_inc(&lruvec->inactive_age);
+	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+
+	/* Page was active prior to eviction */
+	if (workingset) {
+		SetPageWorkingset(page);
+		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
 	}
+out:
 	rcu_read_unlock();
-	return false;
 }
 
 /**
-- 
2.18.0



* [PATCH 03/10] delayacct: track delays from thrashing cache pages
  2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
  2018-07-12 17:29 ` [PATCH 01/10] mm: workingset: don't drop refault information prematurely Johannes Weiner
  2018-07-12 17:29 ` [PATCH 02/10] mm: workingset: tell cache transitions from workingset thrashing Johannes Weiner
@ 2018-07-12 17:29 ` Johannes Weiner
  2018-07-12 17:29 ` [PATCH 04/10] sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD Johannes Weiner
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-12 17:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
  Cc: Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

Delay accounting already measures the time a task spends in direct
reclaim and waiting for swapin, but in low memory situations tasks
can spend a significant amount of their time waiting on
thrashing page cache. This isn't tracked right now.

To know the full impact of memory contention on an individual task,
measure the delay when waiting for a recently evicted active cache
page to read back into memory.
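
The bookkeeping amounts to a start timestamp, an accumulated delay and
a wait count per task. A minimal userspace sketch of that pattern
follows; the struct and function names are hypothetical, timestamps are
passed in by the caller rather than read from a clock, and nanosecond
units are an assumption.

```c
#include <stdint.h>

/* Hypothetical per-task record, modeled on the fields this patch adds. */
struct thrash_delay {
	uint64_t start;	/* timestamp when the current wait began, ns */
	uint64_t delay;	/* accumulated wait time, ns */
	uint32_t count;	/* number of completed waits */
};

/* Record the beginning of a wait on a thrashing page. */
static void thrash_wait_start(struct thrash_delay *d, uint64_t now_ns)
{
	d->start = now_ns;
}

/* Record the end of the wait and fold it into the totals. */
static void thrash_wait_end(struct thrash_delay *d, uint64_t now_ns)
{
	if (now_ns > d->start) {
		d->delay += now_ns - d->start;
		d->count++;
	}
}

/* Average wait in nanoseconds, or 0 if nothing was recorded. */
static uint64_t thrash_avg_ns(const struct thrash_delay *d)
{
	return d->count ? d->delay / d->count : 0;
}
```

A caller would bracket the wait for the page with thrash_wait_start()
and thrash_wait_end(), and a reporting tool can derive the average from
the two totals.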

Also update tools/accounting/getdelays.c:

     [hannes@computer accounting]$ sudo ./getdelays -d -p 1
     print delayacct stats ON
     PID     1

     CPU             count     real total  virtual total    delay total  delay average
                     50318      745000000      847346785      400533713          0.008ms
     IO              count    delay total  delay average
                       435      122601218              0ms
     SWAP            count    delay total  delay average
                         0              0              0ms
     RECLAIM         count    delay total  delay average
                         0              0              0ms
     THRASHING       count    delay total  delay average
                        19       12621439              0ms
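
The THRASHING row above reports 19 waits totalling 12621439 ns, yet the
average reads 0ms. A likely explanation, assuming the tool averages in
whole milliseconds, is integer truncation. The sketch below uses two
hypothetical helpers (neither is taken from getdelays.c) to show the
truncated and the fractional result side by side.

```c
#include <stdint.h>

/* Hypothetical: average delay in whole milliseconds (integer division). */
static uint64_t avg_ms_int(uint64_t total_ns, uint64_t count)
{
	return total_ns / 1000000ULL / (count ? count : 1);
}

/* Hypothetical: the same average, keeping the fractional part. */
static double avg_ms_frac(uint64_t total_ns, uint64_t count)
{
	return (double)total_ns / 1000000.0 / (double)(count ? count : 1);
}
```

With the numbers from the sample, avg_ms_int(12621439, 19) yields 0,
while avg_ms_frac(12621439, 19) yields roughly 0.664, i.e. about 664
microseconds per thrashing wait.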

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/delayacct.h      | 23 +++++++++++++++++++++++
 include/uapi/linux/taskstats.h |  6 +++++-
 kernel/delayacct.c             | 15 +++++++++++++++
 mm/filemap.c                   | 11 +++++++++++
 tools/accounting/getdelays.c   |  8 +++++++-
 5 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 5e335b6203f4..d3e75b3ba487 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -57,7 +57,12 @@ struct task_delay_info {
 
 	u64 freepages_start;
 	u64 freepages_delay;	/* wait for memory reclaim */
+
+	u64 thrashing_start;
+	u64 thrashing_delay;	/* wait for thrashing page */
+
 	u32 freepages_count;	/* total count of memory reclaim */
+	u32 thrashing_count;	/* total count of thrash waits */
 };
 #endif
 
@@ -76,6 +81,8 @@ extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 extern __u64 __delayacct_blkio_ticks(struct task_struct *);
 extern void __delayacct_freepages_start(void);
 extern void __delayacct_freepages_end(void);
+extern void __delayacct_thrashing_start(void);
+extern void __delayacct_thrashing_end(void);
 
 static inline int delayacct_is_task_waiting_on_io(struct task_struct *p)
 {
@@ -156,6 +163,18 @@ static inline void delayacct_freepages_end(void)
 		__delayacct_freepages_end();
 }
 
+static inline void delayacct_thrashing_start(void)
+{
+	if (current->delays)
+		__delayacct_thrashing_start();
+}
+
+static inline void delayacct_thrashing_end(void)
+{
+	if (current->delays)
+		__delayacct_thrashing_end();
+}
+
 #else
 static inline void delayacct_set_flag(int flag)
 {}
@@ -182,6 +201,10 @@ static inline void delayacct_freepages_start(void)
 {}
 static inline void delayacct_freepages_end(void)
 {}
+static inline void delayacct_thrashing_start(void)
+{}
+static inline void delayacct_thrashing_end(void)
+{}
 
 #endif /* CONFIG_TASK_DELAY_ACCT */
 
diff --git a/include/uapi/linux/taskstats.h b/include/uapi/linux/taskstats.h
index b7aa7bb2349f..5e8ca16a9079 100644
--- a/include/uapi/linux/taskstats.h
+++ b/include/uapi/linux/taskstats.h
@@ -34,7 +34,7 @@
  */
 
 
-#define TASKSTATS_VERSION	8
+#define TASKSTATS_VERSION	9
 #define TS_COMM_LEN		32	/* should be >= TASK_COMM_LEN
 					 * in linux/sched.h */
 
@@ -164,6 +164,10 @@ struct taskstats {
 	/* Delay waiting for memory reclaim */
 	__u64	freepages_count;
 	__u64	freepages_delay_total;
+
+	/* Delay waiting for thrashing page */
+	__u64	thrashing_count;
+	__u64	thrashing_delay_total;
 };
 
 
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index e2764d767f18..02ba745c448d 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -134,9 +134,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
 	d->swapin_delay_total = (tmp < d->swapin_delay_total) ? 0 : tmp;
 	tmp = d->freepages_delay_total + tsk->delays->freepages_delay;
 	d->freepages_delay_total = (tmp < d->freepages_delay_total) ? 0 : tmp;
+	tmp = d->thrashing_delay_total + tsk->delays->thrashing_delay;
+	d->thrashing_delay_total = (tmp < d->thrashing_delay_total) ? 0 : tmp;
 	d->blkio_count += tsk->delays->blkio_count;
 	d->swapin_count += tsk->delays->swapin_count;
 	d->freepages_count += tsk->delays->freepages_count;
+	d->thrashing_count += tsk->delays->thrashing_count;
 	spin_unlock_irqrestore(&tsk->delays->lock, flags);
 
 	return 0;
@@ -168,3 +171,15 @@ void __delayacct_freepages_end(void)
 		&current->delays->freepages_count);
 }
 
+void __delayacct_thrashing_start(void)
+{
+	current->delays->thrashing_start = ktime_get_ns();
+}
+
+void __delayacct_thrashing_end(void)
+{
+	delayacct_end(&current->delays->lock,
+		      &current->delays->thrashing_start,
+		      &current->delays->thrashing_delay,
+		      &current->delays->thrashing_count);
+}
diff --git a/mm/filemap.c b/mm/filemap.c
index bd36b7226cf4..e49961e13dd9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -36,6 +36,7 @@
 #include <linux/cleancache.h>
 #include <linux/shmem_fs.h>
 #include <linux/rmap.h>
+#include <linux/delayacct.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -1073,8 +1074,15 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 {
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
+	bool thrashing = false;
 	int ret = 0;
 
+	if (bit_nr == PG_locked && !PageSwapBacked(page) &&
+	    !PageUptodate(page) && PageWorkingset(page)) {
+		delayacct_thrashing_start();
+		thrashing = true;
+	}
+
 	init_wait(wait);
 	wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0;
 	wait->func = wake_page_function;
@@ -1113,6 +1121,9 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 
 	finish_wait(q, wait);
 
+	if (thrashing)
+		delayacct_thrashing_end();
+
 	/*
 	 * A signal could leave PageWaiters set. Clearing it here if
 	 * !waitqueue_active would be possible (by open-coding finish_wait),
diff --git a/tools/accounting/getdelays.c b/tools/accounting/getdelays.c
index 9f420d98b5fb..8cb504d30384 100644
--- a/tools/accounting/getdelays.c
+++ b/tools/accounting/getdelays.c
@@ -203,6 +203,8 @@ static void print_delayacct(struct taskstats *t)
 	       "SWAP  %15s%15s%15s\n"
 	       "      %15llu%15llu%15llums\n"
 	       "RECLAIM  %12s%15s%15s\n"
+	       "      %15llu%15llu%15llums\n"
+	       "THRASHING%12s%15s%15s\n"
 	       "      %15llu%15llu%15llums\n",
 	       "count", "real total", "virtual total",
 	       "delay total", "delay average",
@@ -222,7 +224,11 @@ static void print_delayacct(struct taskstats *t)
 	       "count", "delay total", "delay average",
 	       (unsigned long long)t->freepages_count,
 	       (unsigned long long)t->freepages_delay_total,
-	       average_ms(t->freepages_delay_total, t->freepages_count));
+	       average_ms(t->freepages_delay_total, t->freepages_count),
+	       "count", "delay total", "delay average",
+	       (unsigned long long)t->thrashing_count,
+	       (unsigned long long)t->thrashing_delay_total,
+	       average_ms(t->thrashing_delay_total, t->thrashing_count));
 }
 
 static void task_context_switch_counts(struct taskstats *t)
-- 
2.18.0


* [PATCH 04/10] sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD
  2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
                   ` (2 preceding siblings ...)
  2018-07-12 17:29 ` [PATCH 03/10] delayacct: track delays from thrashing cache pages Johannes Weiner
@ 2018-07-12 17:29 ` Johannes Weiner
  2018-07-12 17:29 ` [PATCH 05/10] sched: loadavg: make calc_load_n() public Johannes Weiner
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-12 17:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
  Cc: Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

There are several definitions of those functions/macros in places that
mess with fixed-point load averages. Provide an official version.
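
For readers unfamiliar with the fixed-point format, here is a small
standalone sketch of what the consolidated helpers compute. calc_load(),
LOAD_INT and LOAD_FRAC are copied from the patch below; FSHIFT = 11 and
EXP_1 = 1884 are the values from the kernel header as far as I know, but
they are not shown in this excerpt, so treat them as assumptions.

```c
#define FSHIFT	11			/* assumed: bits of fractional precision */
#define FIXED_1	(1UL << FSHIFT)		/* 1.0 in fixed point */
#define EXP_1	1884			/* assumed: 1/exp(5sec/1min), fixed point */

#define LOAD_INT(x)  ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

/* a1 = a0 * e + a * (1 - e), rounding up while the load is rising */
static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
	unsigned long newload;

	newload = load * exp + active * (FIXED_1 - exp);
	if (active >= load)
		newload += FIXED_1 - 1;

	return newload / FIXED_1;
}
```

Starting from a load of 0 with one active task, a single step gives
calc_load(0, EXP_1, 1 * FIXED_1) = 164, which LOAD_INT/LOAD_FRAC render
as 0.08. Note that, unlike the old CALC_LOAD macro, the function returns
the new value instead of modifying its first argument, which is why the
call sites in this patch gain an assignment.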

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 .../platforms/cell/cpufreq_spudemand.c        |  2 +-
 arch/powerpc/platforms/cell/spufs/sched.c     |  9 +++-----
 arch/s390/appldata/appldata_os.c              |  4 ----
 drivers/cpuidle/governors/menu.c              |  4 ----
 fs/proc/loadavg.c                             |  3 ---
 include/linux/sched/loadavg.h                 | 21 +++++++++++++++----
 kernel/debug/kdb/kdb_main.c                   |  7 +------
 kernel/sched/loadavg.c                        | 15 -------------
 8 files changed, 22 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/platforms/cell/cpufreq_spudemand.c b/arch/powerpc/platforms/cell/cpufreq_spudemand.c
index 882944c36ef5..5d8e8b6bb1cc 100644
--- a/arch/powerpc/platforms/cell/cpufreq_spudemand.c
+++ b/arch/powerpc/platforms/cell/cpufreq_spudemand.c
@@ -49,7 +49,7 @@ static int calc_freq(struct spu_gov_info_struct *info)
 	cpu = info->policy->cpu;
 	busy_spus = atomic_read(&cbe_spu_info[cpu_to_node(cpu)].busy_spus);
 
-	CALC_LOAD(info->busy_spus, EXP, busy_spus * FIXED_1);
+	info->busy_spus = calc_load(info->busy_spus, EXP, busy_spus * FIXED_1);
 	pr_debug("cpu %d: busy_spus=%d, info->busy_spus=%ld\n",
 			cpu, busy_spus, info->busy_spus);
 
diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index ccc421503363..70101510b19d 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -987,9 +987,9 @@ static void spu_calc_load(void)
 	unsigned long active_tasks; /* fixed-point */
 
 	active_tasks = count_active_contexts() * FIXED_1;
-	CALC_LOAD(spu_avenrun[0], EXP_1, active_tasks);
-	CALC_LOAD(spu_avenrun[1], EXP_5, active_tasks);
-	CALC_LOAD(spu_avenrun[2], EXP_15, active_tasks);
+	spu_avenrun[0] = calc_load(spu_avenrun[0], EXP_1, active_tasks);
+	spu_avenrun[1] = calc_load(spu_avenrun[1], EXP_5, active_tasks);
+	spu_avenrun[2] = calc_load(spu_avenrun[2], EXP_15, active_tasks);
 }
 
 static void spusched_wake(struct timer_list *unused)
@@ -1071,9 +1071,6 @@ void spuctx_switch_state(struct spu_context *ctx,
 	}
 }
 
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static int show_spu_loadavg(struct seq_file *s, void *private)
 {
 	int a, b, c;
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 433a994b1a89..54f375627532 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -25,10 +25,6 @@
 
 #include "appldata.h"
 
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 /*
  * OS data
  *
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 1bfe03ceb236..3738b670df7a 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -133,10 +133,6 @@ struct menu_device {
 	int		interval_ptr;
 };
 
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static inline int get_loadavg(unsigned long load)
 {
 	return LOAD_INT(load) * 10 + LOAD_FRAC(load) / 10;
diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index b572cc865b92..8bee50a97c0f 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -10,9 +10,6 @@
 #include <linux/seqlock.h>
 #include <linux/time.h>
 
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
 static int loadavg_proc_show(struct seq_file *m, void *v)
 {
 	unsigned long avnrun[3];
diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index 80bc84ba5d2a..cc9cc62bb1f8 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -22,10 +22,23 @@ extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
 #define EXP_5		2014		/* 1/exp(5sec/5min) */
 #define EXP_15		2037		/* 1/exp(5sec/15min) */
 
-#define CALC_LOAD(load,exp,n) \
-	load *= exp; \
-	load += n*(FIXED_1-exp); \
-	load >>= FSHIFT;
+/*
+ * a1 = a0 * e + a * (1 - e)
+ */
+static inline unsigned long
+calc_load(unsigned long load, unsigned long exp, unsigned long active)
+{
+	unsigned long newload;
+
+	newload = load * exp + active * (FIXED_1 - exp);
+	if (active >= load)
+		newload += FIXED_1-1;
+
+	return newload / FIXED_1;
+}
+
+#define LOAD_INT(x) ((x) >> FSHIFT)
+#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
 
 extern void calc_global_load(unsigned long ticks);
 
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index e405677ee08d..a8f5aca5eb5e 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -2556,16 +2556,11 @@ static int kdb_summary(int argc, const char **argv)
 	}
 	kdb_printf("%02ld:%02ld\n", val.uptime/(60*60), (val.uptime/60)%60);
 
-	/* lifted from fs/proc/proc_misc.c::loadavg_read_proc() */
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
 	kdb_printf("load avg   %ld.%02ld %ld.%02ld %ld.%02ld\n",
 		LOAD_INT(val.loads[0]), LOAD_FRAC(val.loads[0]),
 		LOAD_INT(val.loads[1]), LOAD_FRAC(val.loads[1]),
 		LOAD_INT(val.loads[2]), LOAD_FRAC(val.loads[2]));
-#undef LOAD_INT
-#undef LOAD_FRAC
+
 	/* Display in kilobytes */
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	kdb_printf("\nMemTotal:       %8lu kB\nMemFree:        %8lu kB\n"
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index a171c1258109..54fbdfb2d86c 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -91,21 +91,6 @@ long calc_load_fold_active(struct rq *this_rq, long adjust)
 	return delta;
 }
 
-/*
- * a1 = a0 * e + a * (1 - e)
- */
-static unsigned long
-calc_load(unsigned long load, unsigned long exp, unsigned long active)
-{
-	unsigned long newload;
-
-	newload = load * exp + active * (FIXED_1 - exp);
-	if (active >= load)
-		newload += FIXED_1-1;
-
-	return newload / FIXED_1;
-}
-
 #ifdef CONFIG_NO_HZ_COMMON
 /*
  * Handle NO_HZ for the global load-average.
-- 
2.18.0


* [PATCH 05/10] sched: loadavg: make calc_load_n() public
  2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
                   ` (3 preceding siblings ...)
  2018-07-12 17:29 ` [PATCH 04/10] sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD Johannes Weiner
@ 2018-07-12 17:29 ` Johannes Weiner
  2018-07-12 17:29 ` [PATCH 06/10] sched: sched.h: make rq locking and clock functions available in stats.h Johannes Weiner
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-12 17:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
  Cc: Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

It's going to be used in a later patch. Keep the churn separate.
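
As a reminder of what calc_load_n() does: it applies n decay steps at
once by raising the decay factor to the n-th power in fixed point. The
standalone sketch below copies fixed_power_int(), calc_load() and
calc_load_n() from the series; FSHIFT = 11 and EXP_1 = 1884 are assumed
values that this excerpt does not show.

```c
#define FSHIFT	11			/* assumed: bits of fractional precision */
#define FIXED_1	(1UL << FSHIFT)		/* 1.0 in fixed point */
#define EXP_1	1884			/* assumed: 1/exp(5sec/1min), fixed point */

/* x^n in fixed point by square-and-multiply, rounding to nearest. */
static unsigned long
fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
{
	unsigned long result = 1UL << frac_bits;

	if (n) {
		for (;;) {
			if (n & 1) {
				result *= x;
				result += 1UL << (frac_bits - 1);
				result >>= frac_bits;
			}
			n >>= 1;
			if (!n)
				break;
			x *= x;
			x += 1UL << (frac_bits - 1);
			x >>= frac_bits;
		}
	}

	return result;
}

/* One decay step: a1 = a0 * e + a * (1 - e) */
static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
	unsigned long newload;

	newload = load * exp + active * (FIXED_1 - exp);
	if (active >= load)
		newload += FIXED_1 - 1;

	return newload / FIXED_1;
}

/* n decay steps at once: an = a0 * e^n + a * (1 - e^n) */
static unsigned long
calc_load_n(unsigned long load, unsigned long exp,
	    unsigned long active, unsigned int n)
{
	return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
}
```

For example, decaying a load of 1.0 (FIXED_1) with no active tasks over
two periods gives the same value as two single steps in this case;
rounding may make the two paths differ slightly for other inputs.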

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/sched/loadavg.h |   3 +
 kernel/sched/loadavg.c        | 138 +++++++++++++++++-----------------
 2 files changed, 72 insertions(+), 69 deletions(-)

diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index cc9cc62bb1f8..4859bea47a7b 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -37,6 +37,9 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
 	return newload / FIXED_1;
 }
 
+extern unsigned long calc_load_n(unsigned long load, unsigned long exp,
+				 unsigned long active, unsigned int n);
+
 #define LOAD_INT(x) ((x) >> FSHIFT)
 #define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
 
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 54fbdfb2d86c..28a516575c18 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -91,6 +91,75 @@ long calc_load_fold_active(struct rq *this_rq, long adjust)
 	return delta;
 }
 
+/**
+ * fixed_power_int - compute: x^n, in O(log n) time
+ *
+ * @x:         base of the power
+ * @frac_bits: fractional bits of @x
+ * @n:         power to raise @x to.
+ *
+ * By exploiting the relation between the definition of the natural power
+ * function: x^n := x*x*...*x (x multiplied by itself for n times), and
+ * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
+ * (where: n_i \elem {0, 1}, the binary vector representing n),
+ * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
+ * of course trivially computable in O(log_2 n), the length of our binary
+ * vector.
+ */
+static unsigned long
+fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
+{
+	unsigned long result = 1UL << frac_bits;
+
+	if (n) {
+		for (;;) {
+			if (n & 1) {
+				result *= x;
+				result += 1UL << (frac_bits - 1);
+				result >>= frac_bits;
+			}
+			n >>= 1;
+			if (!n)
+				break;
+			x *= x;
+			x += 1UL << (frac_bits - 1);
+			x >>= frac_bits;
+		}
+	}
+
+	return result;
+}
+
+/*
+ * a1 = a0 * e + a * (1 - e)
+ *
+ * a2 = a1 * e + a * (1 - e)
+ *    = (a0 * e + a * (1 - e)) * e + a * (1 - e)
+ *    = a0 * e^2 + a * (1 - e) * (1 + e)
+ *
+ * a3 = a2 * e + a * (1 - e)
+ *    = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
+ *    = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
+ *
+ *  ...
+ *
+ * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
+ *    = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
+ *    = a0 * e^n + a * (1 - e^n)
+ *
+ * [1] application of the geometric series:
+ *
+ *              n         1 - x^(n+1)
+ *     S_n := \Sum x^i = -------------
+ *             i=0          1 - x
+ */
+unsigned long
+calc_load_n(unsigned long load, unsigned long exp,
+	    unsigned long active, unsigned int n)
+{
+	return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
+}
+
 #ifdef CONFIG_NO_HZ_COMMON
 /*
  * Handle NO_HZ for the global load-average.
@@ -210,75 +279,6 @@ static long calc_load_nohz_fold(void)
 	return delta;
 }
 
-/**
- * fixed_power_int - compute: x^n, in O(log n) time
- *
- * @x:         base of the power
- * @frac_bits: fractional bits of @x
- * @n:         power to raise @x to.
- *
- * By exploiting the relation between the definition of the natural power
- * function: x^n := x*x*...*x (x multiplied by itself for n times), and
- * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
- * (where: n_i \elem {0, 1}, the binary vector representing n),
- * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
- * of course trivially computable in O(log_2 n), the length of our binary
- * vector.
- */
-static unsigned long
-fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
-{
-	unsigned long result = 1UL << frac_bits;
-
-	if (n) {
-		for (;;) {
-			if (n & 1) {
-				result *= x;
-				result += 1UL << (frac_bits - 1);
-				result >>= frac_bits;
-			}
-			n >>= 1;
-			if (!n)
-				break;
-			x *= x;
-			x += 1UL << (frac_bits - 1);
-			x >>= frac_bits;
-		}
-	}
-
-	return result;
-}
-
-/*
- * a1 = a0 * e + a * (1 - e)
- *
- * a2 = a1 * e + a * (1 - e)
- *    = (a0 * e + a * (1 - e)) * e + a * (1 - e)
- *    = a0 * e^2 + a * (1 - e) * (1 + e)
- *
- * a3 = a2 * e + a * (1 - e)
- *    = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
- *    = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
- *
- *  ...
- *
- * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
- *    = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
- *    = a0 * e^n + a * (1 - e^n)
- *
- * [1] application of the geometric series:
- *
- *              n         1 - x^(n+1)
- *     S_n := \Sum x^i = -------------
- *             i=0          1 - x
- */
-static unsigned long
-calc_load_n(unsigned long load, unsigned long exp,
-	    unsigned long active, unsigned int n)
-{
-	return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
-}
-
 /*
  * NO_HZ can leave us missing all per-CPU ticks calling
  * calc_load_fold_active(), but since a NO_HZ CPU folds its delta into
-- 
2.18.0


* [PATCH 06/10] sched: sched.h: make rq locking and clock functions available in stats.h
  2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
                   ` (4 preceding siblings ...)
  2018-07-12 17:29 ` [PATCH 05/10] sched: loadavg: make calc_load_n() public Johannes Weiner
@ 2018-07-12 17:29 ` Johannes Weiner
  2018-07-12 17:29 ` [PATCH 07/10] sched: introduce this_rq_lock_irq() Johannes Weiner
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-12 17:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
  Cc: Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

kernel/sched/sched.h includes "stats.h" half-way through the file. The
next patch introduces users of sched.h's rq locking functions and
update_rq_clock() in kernel/sched/stats.h. Move those definitions up
in the file so they are available in stats.h.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/sched.h | 164 +++++++++++++++++++++----------------------
 1 file changed, 82 insertions(+), 82 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cb467c221b15..b8f038497240 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -919,6 +919,8 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+extern void update_rq_clock(struct rq *rq);
+
 static inline u64 __rq_clock_broken(struct rq *rq)
 {
 	return READ_ONCE(rq->clock);
@@ -1037,6 +1039,86 @@ static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
 #endif
 }
 
+struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
+	__acquires(rq->lock);
+
+struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
+	__acquires(p->pi_lock)
+	__acquires(rq->lock);
+
+static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
+	__releases(rq->lock)
+{
+	rq_unpin_lock(rq, rf);
+	raw_spin_unlock(&rq->lock);
+}
+
+static inline void
+task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+	__releases(rq->lock)
+	__releases(p->pi_lock)
+{
+	rq_unpin_lock(rq, rf);
+	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
+}
+
+static inline void
+rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
+	__acquires(rq->lock)
+{
+	raw_spin_lock_irqsave(&rq->lock, rf->flags);
+	rq_pin_lock(rq, rf);
+}
+
+static inline void
+rq_lock_irq(struct rq *rq, struct rq_flags *rf)
+	__acquires(rq->lock)
+{
+	raw_spin_lock_irq(&rq->lock);
+	rq_pin_lock(rq, rf);
+}
+
+static inline void
+rq_lock(struct rq *rq, struct rq_flags *rf)
+	__acquires(rq->lock)
+{
+	raw_spin_lock(&rq->lock);
+	rq_pin_lock(rq, rf);
+}
+
+static inline void
+rq_relock(struct rq *rq, struct rq_flags *rf)
+	__acquires(rq->lock)
+{
+	raw_spin_lock(&rq->lock);
+	rq_repin_lock(rq, rf);
+}
+
+static inline void
+rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
+	__releases(rq->lock)
+{
+	rq_unpin_lock(rq, rf);
+	raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
+}
+
+static inline void
+rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
+	__releases(rq->lock)
+{
+	rq_unpin_lock(rq, rf);
+	raw_spin_unlock_irq(&rq->lock);
+}
+
+static inline void
+rq_unlock(struct rq *rq, struct rq_flags *rf)
+	__releases(rq->lock)
+{
+	rq_unpin_lock(rq, rf);
+	raw_spin_unlock(&rq->lock);
+}
+
 #ifdef CONFIG_NUMA
 enum numa_topology_type {
 	NUMA_DIRECT,
@@ -1670,8 +1752,6 @@ static inline void sub_nr_running(struct rq *rq, unsigned count)
 	sched_update_tick_dependency(rq);
 }
 
-extern void update_rq_clock(struct rq *rq);
-
 extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
 extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
 
@@ -1752,86 +1832,6 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
 static inline void sched_avg_update(struct rq *rq) { }
 #endif
 
-struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
-	__acquires(rq->lock);
-
-struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
-	__acquires(p->pi_lock)
-	__acquires(rq->lock);
-
-static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
-	__releases(rq->lock)
-{
-	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
-}
-
-static inline void
-task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
-	__releases(rq->lock)
-	__releases(p->pi_lock)
-{
-	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
-	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
-}
-
-static inline void
-rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
-	__acquires(rq->lock)
-{
-	raw_spin_lock_irqsave(&rq->lock, rf->flags);
-	rq_pin_lock(rq, rf);
-}
-
-static inline void
-rq_lock_irq(struct rq *rq, struct rq_flags *rf)
-	__acquires(rq->lock)
-{
-	raw_spin_lock_irq(&rq->lock);
-	rq_pin_lock(rq, rf);
-}
-
-static inline void
-rq_lock(struct rq *rq, struct rq_flags *rf)
-	__acquires(rq->lock)
-{
-	raw_spin_lock(&rq->lock);
-	rq_pin_lock(rq, rf);
-}
-
-static inline void
-rq_relock(struct rq *rq, struct rq_flags *rf)
-	__acquires(rq->lock)
-{
-	raw_spin_lock(&rq->lock);
-	rq_repin_lock(rq, rf);
-}
-
-static inline void
-rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
-	__releases(rq->lock)
-{
-	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
-}
-
-static inline void
-rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
-	__releases(rq->lock)
-{
-	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irq(&rq->lock);
-}
-
-static inline void
-rq_unlock(struct rq *rq, struct rq_flags *rf)
-	__releases(rq->lock)
-{
-	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
-}
-
 #ifdef CONFIG_SMP
 #ifdef CONFIG_PREEMPT
 
-- 
2.18.0


* [PATCH 07/10] sched: introduce this_rq_lock_irq()
  2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
                   ` (5 preceding siblings ...)
  2018-07-12 17:29 ` [PATCH 06/10] sched: sched.h: make rq locking and clock functions available in stats.h Johannes Weiner
@ 2018-07-12 17:29 ` Johannes Weiner
  2018-07-12 17:29 ` [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-12 17:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
  Cc: Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

do_sched_yield() disables IRQs, looks up this_rq() and locks it. The
next patch is adding another site with the same pattern, so provide a
convenience function for it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/core.c  |  4 +---
 kernel/sched/sched.h | 12 ++++++++++++
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 211890edf37e..9586a8141f16 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4960,9 +4960,7 @@ static void do_sched_yield(void)
 	struct rq_flags rf;
 	struct rq *rq;
 
-	local_irq_disable();
-	rq = this_rq();
-	rq_lock(rq, &rf);
+	rq = this_rq_lock_irq(&rf);
 
 	schedstat_inc(rq->yld_count);
 	current->sched_class->yield_task(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b8f038497240..bc798c7cb4d4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1119,6 +1119,18 @@ rq_unlock(struct rq *rq, struct rq_flags *rf)
 	raw_spin_unlock(&rq->lock);
 }
 
+static inline struct rq *
+this_rq_lock_irq(struct rq_flags *rf)
+	__acquires(rq->lock)
+{
+	struct rq *rq;
+
+	local_irq_disable();
+	rq = this_rq();
+	rq_lock(rq, rf);
+	return rq;
+}
+
 #ifdef CONFIG_NUMA
 enum numa_topology_type {
 	NUMA_DIRECT,
-- 
2.18.0


* [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
                   ` (6 preceding siblings ...)
  2018-07-12 17:29 ` [PATCH 07/10] sched: introduce this_rq_lock_irq() Johannes Weiner
@ 2018-07-12 17:29 ` Johannes Weiner
  2018-07-13  9:21   ` Peter Zijlstra
                     ` (9 more replies)
  2018-07-12 17:29 ` [PATCH 09/10] psi: cgroup support Johannes Weiner
                   ` (6 subsequent siblings)
  14 siblings, 10 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-12 17:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
  Cc: Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

When systems are overcommitted and resources become contended, it's
hard to tell exactly the impact this has on workload productivity, or
how close the system is to lockups and OOM kills. In particular, when
machines work multiple jobs concurrently, the impact of overcommit in
terms of latency and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing
individual job health or risking complete machine lockups, this patch
implements a way to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or
IO, respectively. Stall states are aggregate versions of the per-task
delay accounting delays:

       cpu: some tasks are runnable but not executing on a CPU
       memory: tasks are reclaiming, or waiting for swapin or thrashing cache
       io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure
percentages, and they give a general sense of system health and
productivity loss incurred by resource overcommit. They can also
indicate when the system is approaching lockup scenarios and OOMs.
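
A consumer of these files might parse one line like this. The sketch
assumes the line layout documented in the psi.txt added by this patch
("some avg10=... avg60=... avg300=... total=..."); struct psi_line and
psi_parse_line() are hypothetical names and not part of the series.

```c
#include <stdio.h>

/* Hypothetical container for one line of a /proc/pressure/ file. */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* running averages, in percent */
	unsigned long long total;	/* cumulative counter as exported */
};

/* Returns 0 on success, -1 if the line does not match the layout. */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total);

	return n == 5 ? 0 : -1;
}
```

A monitoring tool would read /proc/pressure/memory line by line with
fgets(), feed each line to psi_parse_line(), and compare avg10 against a
threshold of its own choosing.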

To do this, psi keeps track of the task states associated with each
CPU and samples the time they spend in stall states. Every 2 seconds,
the samples are averaged across CPUs - weighted by the CPUs' non-idle
time to eliminate artifacts from unused CPUs - and translated into
percentages of walltime. A running average of those percentages is
maintained over 10s, 1m, and 5m periods (similar to the loadaverage).
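
The aggregation step can be illustrated with a simplified floating-point
sketch. It is not the psi.c implementation: struct cpu_sample,
stall_pct() and running_avg() are hypothetical, and the decay factor is
passed in rather than derived from the period and the averaging window.

```c
#include <stddef.h>

/* Hypothetical per-CPU sample for one aggregation period. */
struct cpu_sample {
	double stall_s;		/* time spent in the stall state */
	double nonidle_s;	/* time the CPU had non-idle tasks */
};

/*
 * Average the per-CPU stall times, weighting each CPU by its non-idle
 * time, then express the result as a percentage of the period.
 */
static double stall_pct(const struct cpu_sample *s, size_t n, double period_s)
{
	double weighted = 0.0, weight = 0.0;
	size_t i;

	for (i = 0; i < n; i++) {
		weighted += s[i].stall_s * s[i].nonidle_s;
		weight += s[i].nonidle_s;
	}
	if (weight == 0.0 || period_s <= 0.0)
		return 0.0;

	return 100.0 * (weighted / weight) / period_s;
}

/* One step of an exponential running average: avg * e + sample * (1 - e) */
static double running_avg(double avg, double sample, double decay)
{
	return avg * decay + sample * (1.0 - decay);
}
```

With one CPU stalled for 1s of a 2s period and a second CPU completely
idle, stall_pct() reports 50%, whereas a plain mean over both CPUs would
report 25%. This is the artifact from unused CPUs that the weighting is
meant to remove.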

v2:
- stable clock tick, as per Peter
- data structure layout optimization, as per Peter
- fix u64 divisions on 32 bit, as per Peter
- outermost psi_disabled checks, as per Peter
- coding style fixes, as per Peter
- just-in-time stats aggregation, as per Suren
- fix task state corruption with CONFIG_PREEMPT, as per Suren
- CONFIG_PSI=n build error
- avoid writing p->sched_psi_wake_requeue unnecessarily
- documentation & comment updates

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/accounting/psi.txt |  64 ++++
 include/linux/psi.h              |  27 ++
 include/linux/psi_types.h        |  90 +++++
 include/linux/sched.h            |  10 +
 include/linux/sched/stat.h       |  10 +-
 init/Kconfig                     |  16 +
 kernel/fork.c                    |   4 +
 kernel/sched/Makefile            |   1 +
 kernel/sched/core.c              |   7 +-
 kernel/sched/psi.c               | 585 +++++++++++++++++++++++++++++++
 kernel/sched/sched.h             |   2 +
 kernel/sched/stats.h             | 102 +++++-
 mm/compaction.c                  |   5 +
 mm/filemap.c                     |  15 +-
 mm/page_alloc.c                  |  10 +
 mm/vmscan.c                      |  13 +
 16 files changed, 946 insertions(+), 15 deletions(-)
 create mode 100644 Documentation/accounting/psi.txt
 create mode 100644 include/linux/psi.h
 create mode 100644 include/linux/psi_types.h
 create mode 100644 kernel/sched/psi.c

diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
new file mode 100644
index 000000000000..51e7ef14142e
--- /dev/null
+++ b/Documentation/accounting/psi.txt
@@ -0,0 +1,64 @@
+================================
+PSI - Pressure Stall Information
+================================
+
+:Date: April, 2018
+:Author: Johannes Weiner <hannes@cmpxchg.org>
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact they have on complex workloads
+or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==================
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+The format for CPU is as such:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are stalled on a given resource.
+
+The "full" line indicates the share of time in which all non-idle
+tasks are stalled on a given resource simultaneously. In this state
+actual CPU cycles are going to waste, and a workload that spends
+extended time in this state is considered to be thrashing. This has
+severe impact on performance, and it's useful to distinguish this
+situation from a state where some tasks are stalled but the CPU is
+still doing productive work. As such, time spent in this subset of the
+stall state is tracked separately and exported in the "full" averages.
+
+The ratios are tracked as recent trends over ten, sixty, and three
+hundred second windows, which gives insight into short term events as
+well as medium and long term trends. The total absolute stall time is
+tracked and exported as well, to allow detection of latency spikes
+which wouldn't necessarily make a dent in the time averages, or to
+average trends over custom time frames.
diff --git a/include/linux/psi.h b/include/linux/psi.h
new file mode 100644
index 000000000000..371af1479699
--- /dev/null
+++ b/include/linux/psi.h
@@ -0,0 +1,27 @@
+#ifndef _LINUX_PSI_H
+#define _LINUX_PSI_H
+
+#include <linux/psi_types.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_PSI
+
+extern bool psi_disabled;
+
+void psi_init(void);
+
+void psi_task_change(struct task_struct *task, u64 now, int clear, int set);
+
+void psi_memstall_enter(unsigned long *flags);
+void psi_memstall_leave(unsigned long *flags);
+
+#else /* CONFIG_PSI */
+
+static inline void psi_init(void) {}
+
+static inline void psi_memstall_enter(unsigned long *flags) {}
+static inline void psi_memstall_leave(unsigned long *flags) {}
+
+#endif /* CONFIG_PSI */
+
+#endif /* _LINUX_PSI_H */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
new file mode 100644
index 000000000000..0ac74bb496e6
--- /dev/null
+++ b/include/linux/psi_types.h
@@ -0,0 +1,90 @@
+#ifndef _LINUX_PSI_TYPES_H
+#define _LINUX_PSI_TYPES_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_PSI
+
+/* Tracked task states */
+enum psi_task_count {
+	NR_RUNNING,
+	NR_IOWAIT,
+	NR_MEMSTALL,
+	NR_PSI_TASK_COUNTS,
+};
+
+/* Task state bitmasks */
+#define TSK_RUNNING	(1 << NR_RUNNING)
+#define TSK_IOWAIT	(1 << NR_IOWAIT)
+#define TSK_MEMSTALL	(1 << NR_MEMSTALL)
+
+/* Resources that workloads could be stalled on */
+enum psi_res {
+	PSI_CPU,
+	PSI_MEM,
+	PSI_IO,
+	NR_PSI_RESOURCES,
+};
+
+/* Pressure states for a group of tasks */
+enum psi_state {
+	PSI_NONE,		/* No stalled tasks */
+	PSI_SOME,		/* Stalled tasks & working tasks */
+	PSI_FULL,		/* Stalled tasks & no working tasks */
+	NR_PSI_STATES,
+};
+
+struct psi_resource {
+	/* Current pressure state for this resource */
+	enum psi_state state;
+
+	/* Start of current state (rq_clock) */
+	u64 state_start;
+
+	/* Time sampling buckets for pressure states SOME and FULL (ns) */
+	u64 times[2];
+};
+
+struct psi_group_cpu {
+	/* States of the tasks belonging to this group */
+	unsigned int tasks[NR_PSI_TASK_COUNTS];
+
+	/* There are runnable or D-state tasks */
+	int nonidle;
+
+	/* Start of current non-idle state (rq_clock) */
+	u64 nonidle_start;
+
+	/* Time sampling bucket for non-idle state (ns) */
+	u64 nonidle_time;
+
+	/* Per-resource pressure tracking in this group */
+	struct psi_resource res[NR_PSI_RESOURCES];
+};
+
+struct psi_group {
+	struct psi_group_cpu *cpus;
+
+	struct mutex stat_lock;
+
+	u64 some[NR_PSI_RESOURCES];
+	u64 full[NR_PSI_RESOURCES];
+
+	unsigned long period_expires;
+
+	u64 last_some[NR_PSI_RESOURCES];
+	u64 last_full[NR_PSI_RESOURCES];
+
+	unsigned long avg_some[NR_PSI_RESOURCES][3];
+	unsigned long avg_full[NR_PSI_RESOURCES][3];
+
+	struct delayed_work clock_work;
+};
+
+#else /* CONFIG_PSI */
+
+struct psi_group { };
+
+#endif /* CONFIG_PSI */
+
+#endif /* _LINUX_PSI_TYPES_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca3f3eae8980..d5e4ee234114 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -25,6 +25,7 @@
 #include <linux/latencytop.h>
 #include <linux/sched/prio.h>
 #include <linux/signal_types.h>
+#include <linux/psi_types.h>
 #include <linux/mm_types_task.h>
 #include <linux/task_io_accounting.h>
 
@@ -709,6 +710,10 @@ struct task_struct {
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
 	unsigned			sched_remote_wakeup:1;
+#ifdef CONFIG_PSI
+	unsigned			sched_psi_wake_requeue:1;
+#endif
+
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
 
@@ -956,6 +961,10 @@ struct task_struct {
 	siginfo_t			*last_siginfo;
 
 	struct task_io_accounting	ioac;
+#ifdef CONFIG_PSI
+	/* Pressure stall state */
+	unsigned int			psi_flags;
+#endif
 #ifdef CONFIG_TASK_XACCT
 	/* Accumulated RSS usage: */
 	u64				acct_rss_mem1;
@@ -1385,6 +1394,7 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+#define PF_MEMSTALL		0x01000000	/* Stalled due to lack of memory */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_allowed */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MUTEX_TESTER		0x20000000	/* Thread belongs to the rt mutex tester */
diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
index 04f1321d14c4..ac39435d1521 100644
--- a/include/linux/sched/stat.h
+++ b/include/linux/sched/stat.h
@@ -28,10 +28,14 @@ static inline int sched_info_on(void)
 	return 1;
 #elif defined(CONFIG_TASK_DELAY_ACCT)
 	extern int delayacct_on;
-	return delayacct_on;
-#else
-	return 0;
+	if (delayacct_on)
+		return 1;
+#elif defined(CONFIG_PSI)
+	extern bool psi_disabled;
+	if (!psi_disabled)
+		return 1;
 #endif
+	return 0;
 }
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/init/Kconfig b/init/Kconfig
index 18b151f0ddc1..e34859bda33e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -457,6 +457,22 @@ config TASK_IO_ACCOUNTING
 
 	  Say N if unsure.
 
+config PSI
+	bool "Pressure stall information tracking"
+	select SCHED_INFO
+	help
+	  Collect metrics that indicate how overcommitted the CPU, memory,
+	  and IO capacity are in the system.
+
+	  If you say Y here, the kernel will create /proc/pressure/ with the
+	  pressure statistics files cpu, memory, and io. These will indicate
+	  the share of walltime in which some or all tasks in the system are
+	  delayed due to contention of the respective resource.
+
+	  For more details see Documentation/accounting/psi.txt.
+
+	  Say N if unsure.
+
 endmenu # "CPU/Task time and stats accounting"
 
 config CPU_ISOLATION
diff --git a/kernel/fork.c b/kernel/fork.c
index a5d21c42acfc..067aa5c28526 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1704,6 +1704,10 @@ static __latent_entropy struct task_struct *copy_process(
 
 	p->default_timer_slack_ns = current->timer_slack_ns;
 
+#ifdef CONFIG_PSI
+	p->psi_flags = 0;
+#endif
+
 	task_io_accounting_init(&p->ioac);
 	acct_clear_integrals(p);
 
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d9a02b318108..b29bc18f2704 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_CPU_FREQ) += cpufreq.o
 obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
+obj-$(CONFIG_PSI) += psi.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9586a8141f16..16e8c8c8f432 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -744,7 +744,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 		update_rq_clock(rq);
 
 	if (!(flags & ENQUEUE_RESTORE))
-		sched_info_queued(rq, p);
+		sched_info_queued(rq, p, flags & ENQUEUE_WAKEUP);
 
 	p->sched_class->enqueue_task(rq, p, flags);
 }
@@ -755,7 +755,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 		update_rq_clock(rq);
 
 	if (!(flags & DEQUEUE_SAVE))
-		sched_info_dequeued(rq, p);
+		sched_info_dequeued(rq, p, flags & DEQUEUE_SLEEP);
 
 	p->sched_class->dequeue_task(rq, p, flags);
 }
@@ -2058,6 +2058,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
+		psi_ttwu_dequeue(p);
 		set_task_cpu(p, cpu);
 	}
 
@@ -6124,6 +6125,8 @@ void __init sched_init(void)
 
 	init_schedstats();
 
+	psi_init();
+
 	scheduler_running = 1;
 }
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
new file mode 100644
index 000000000000..ef8e20383e4c
--- /dev/null
+++ b/kernel/sched/psi.c
@@ -0,0 +1,585 @@
+/*
+ * Pressure stall information for CPU, memory and IO
+ *
+ * Copyright (c) 2018 Facebook, Inc.
+ * Author: Johannes Weiner <hannes@cmpxchg.org>
+ *
+ * When CPU, memory and IO are contended, tasks experience delays that
+ * reduce throughput and introduce latencies into the workload. Memory
+ * and IO contention, in addition, can cause a full loss of forward
+ * progress in which the CPU goes idle.
+ *
+ * This code aggregates individual task delays into resource pressure
+ * metrics that indicate problems with both workload health and
+ * resource utilization.
+ *
+ *			Model
+ *
+ * The time in which a task can execute on a CPU is our baseline for
+ * productivity. Pressure expresses the amount of time in which this
+ * potential cannot be realized due to resource contention.
+ *
+ * This concept of productivity has two components: the workload and
+ * the CPU. To measure the impact of pressure on both, we define two
+ * contention states for a resource: SOME and FULL.
+ *
+ * In the SOME state of a given resource, one or more tasks are
+ * delayed on that resource. This affects the workload's ability to
+ * perform work, but the CPU may still be executing other tasks.
+ *
+ * In the FULL state of a given resource, all non-idle tasks are
+ * delayed on that resource such that nobody is advancing and the CPU
+ * goes idle. This leaves both workload and CPU unproductive.
+ *
+ * (Naturally, the FULL state doesn't exist for the CPU resource.)
+ *
+ *	SOME = nr_delayed_tasks != 0
+ *	FULL = nr_delayed_tasks != 0 && nr_running_tasks == 0
+ *
+ * The percentage of wallclock time spent in those compound stall
+ * states gives pressure numbers between 0 and 100 for each resource,
+ * where the SOME percentage indicates workload slowdowns and the FULL
+ * percentage indicates reduced CPU utilization:
+ *
+ *	%SOME = time(SOME) / period
+ *	%FULL = time(FULL) / period
+ *
+ *			Multiple CPUs
+ *
+ * The more tasks and available CPUs there are, the more work can be
+ * performed concurrently. This means that the potential that can go
+ * unrealized due to resource contention *also* scales with non-idle
+ * tasks and CPUs.
+ *
+ * Consider a scenario where 257 number crunching tasks are trying to
+ * run concurrently on 256 CPUs. If we simply aggregated the task
+ * states, we would have to conclude a CPU SOME pressure number of
+ * 100%, since *somebody* is waiting on a runqueue at all
+ * times. However, that is clearly not the amount of contention the
+ * workload is experiencing: only one out of 256 possible execution
+ * threads will be contended at any given time, or about 0.4%.
+ *
+ * Conversely, consider a scenario of 4 tasks and 4 CPUs where at any
+ * given time *one* of the tasks is delayed due to a lack of memory.
+ * Again, looking purely at the task state would yield a memory FULL
+ * pressure number of 0%, since *somebody* is always making forward
+ * progress. But again this wouldn't capture the amount of execution
+ * potential lost, which is 1 out of 4 CPUs, or 25%.
+ *
+ * To calculate wasted potential (pressure) with multiple processors,
+ * we have to base our calculation on the number of non-idle tasks in
+ * conjunction with the number of available CPUs, which is the number
+ * of potential execution threads. SOME becomes then the proportion of
+ * delayed tasks to possible threads, and FULL is the share of possible
+ * threads that are unproductive due to delays:
+ *
+ *	threads = min(nr_nonidle_tasks, nr_cpus)
+ *	   SOME = min(nr_delayed_tasks / threads, 1)
+ *	   FULL = (threads - min(nr_running_tasks, threads)) / threads
+ *
+ * For the 257 number crunchers on 256 CPUs, this yields:
+ *
+ *	threads = min(257, 256)
+ *	   SOME = min(1 / 256, 1)             = 0.4%
+ *	   FULL = (256 - min(257, 256)) / 256 = 0%
+ *
+ * For the 1 out of 4 memory-delayed tasks, this yields:
+ *
+ *	threads = min(4, 4)
+ *	   SOME = min(1 / 4, 1)               = 25%
+ *	   FULL = (4 - min(3, 4)) / 4         = 25%
+ *
+ * [ Substitute nr_cpus with 1, and you can see that it's a natural
+ *   extension of the single-CPU model. ]
+ *
+ *			Implementation
+ *
+ * To assess the precise time spent in each such state, we would have
+ * to freeze the system on task changes and start/stop the state
+ * clocks accordingly. Obviously that doesn't scale in practice.
+ *
+ * Because the scheduler aims to distribute the compute load evenly
+ * among the available CPUs, we can track task state locally to each
+ * CPU and, at much lower frequency, extrapolate the global state for
+ * the cumulative stall times and the running averages.
+ *
+ * For each runqueue, we track:
+ *
+ *	   tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)
+ *	   tFULL[cpu] = time(nr_delayed_tasks[cpu] && !nr_running_tasks[cpu])
+ *	tNONIDLE[cpu] = time(nr_nonidle_tasks[cpu] != 0)
+ *
+ * and then periodically aggregate:
+ *
+ *	tNONIDLE = sum(tNONIDLE[i])
+ *
+ *	   tSOME = sum(tSOME[i] * tNONIDLE[i]) / tNONIDLE
+ *	   tFULL = sum(tFULL[i] * tNONIDLE[i]) / tNONIDLE
+ *
+ *	   %SOME = tSOME / period
+ *	   %FULL = tFULL / period
+ *
+ * This gives us an approximation of pressure that is practical
+ * cost-wise, yet way more sensitive and accurate than periodic
+ * sampling of the aggregate task states would be.
+ */
+
+#include <linux/sched/loadavg.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <linux/cgroup.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/psi.h>
+#include "sched.h"
+
+static int psi_bug __read_mostly;
+
+bool psi_disabled __read_mostly;
+core_param(psi_disabled, psi_disabled, bool, 0644);
+
+/* Running averages - we need to be higher-res than loadavg */
+#define PSI_FREQ	(2*HZ+1)	/* 2 sec intervals */
+#define EXP_10s		1677		/* 1/exp(2s/10s) as fixed-point */
+#define EXP_60s		1981		/* 1/exp(2s/60s) */
+#define EXP_300s	2034		/* 1/exp(2s/300s) */
+
+/* Sampling period in nanoseconds */
+static u64 psi_period __read_mostly;
+
+/* System-level pressure and stall tracking */
+static DEFINE_PER_CPU(struct psi_group_cpu, system_group_cpus);
+static struct psi_group psi_system = {
+	.cpus = &system_group_cpus,
+};
+
+static void psi_clock(struct work_struct *work);
+
+static void psi_group_init(struct psi_group *group)
+{
+	group->period_expires = jiffies + PSI_FREQ;
+	INIT_DELAYED_WORK(&group->clock_work, psi_clock);
+	mutex_init(&group->stat_lock);
+}
+
+void __init psi_init(void)
+{
+	if (psi_disabled)
+		return;
+
+	psi_period = jiffies_to_nsecs(PSI_FREQ);
+	psi_group_init(&psi_system);
+}
+
+static void calc_avgs(unsigned long avg[3], u64 time, int missed_periods)
+{
+	unsigned long pct;
+
+	/* Sample the most recent active period */
+	pct = time * 100 / psi_period;
+	pct *= FIXED_1;
+	avg[0] = calc_load(avg[0], EXP_10s, pct);
+	avg[1] = calc_load(avg[1], EXP_60s, pct);
+	avg[2] = calc_load(avg[2], EXP_300s, pct);
+
+	/* Fill in zeroes for periods of no activity */
+	if (missed_periods) {
+		avg[0] = calc_load_n(avg[0], EXP_10s, 0, missed_periods);
+		avg[1] = calc_load_n(avg[1], EXP_60s, 0, missed_periods);
+		avg[2] = calc_load_n(avg[2], EXP_300s, 0, missed_periods);
+	}
+}
+
+static bool psi_update_stats(struct psi_group *group)
+{
+	u64 some[NR_PSI_RESOURCES] = { 0, };
+	u64 full[NR_PSI_RESOURCES] = { 0, };
+	unsigned long nonidle_total = 0;
+	unsigned long missed_periods;
+	unsigned long expires;
+	int cpu;
+	int r;
+
+	mutex_lock(&group->stat_lock);
+
+	/*
+	 * Collect the per-cpu time buckets and average them into a
+	 * single time sample that is normalized to wallclock time.
+	 *
+	 * For averaging, each CPU is weighted by its non-idle time in
+	 * the sampling period. This eliminates artifacts from uneven
+	 * loading, or even entirely idle CPUs.
+	 *
+	 * We could pin the online CPUs here, but the noise introduced
+	 * by missing up to one sample period from CPUs that are going
+	 * away shouldn't matter in practice - just like the noise of
+	 * previously offlined CPUs returning with a non-zero sample.
+	 */
+	for_each_online_cpu(cpu) {
+		struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
+		unsigned long nonidle;
+
+		if (!groupc->nonidle_time)
+			continue;
+
+		nonidle = nsecs_to_jiffies(groupc->nonidle_time);
+		groupc->nonidle_time = 0;
+		nonidle_total += nonidle;
+
+		for (r = 0; r < NR_PSI_RESOURCES; r++) {
+			struct psi_resource *res = &groupc->res[r];
+
+			some[r] += (res->times[0] + res->times[1]) * nonidle;
+			full[r] += res->times[1] * nonidle;
+
+			/* It's racy, but we can tolerate some error */
+			res->times[0] = 0;
+			res->times[1] = 0;
+		}
+	}
+
+	/*
+	 * Integrate the sample into the running statistics that are
+	 * reported to userspace: the cumulative stall times and the
+	 * decaying averages.
+	 *
+	 * Pressure percentages are sampled at PSI_FREQ. We might be
+	 * called more often when the user polls more frequently than
+	 * that; we might be called less often when there is no task
+	 * activity, thus no data, and clock ticks are sporadic. The
+	 * below handles both.
+	 */
+
+	/* total= */
+	for (r = 0; r < NR_PSI_RESOURCES; r++) {
+		do_div(some[r], max(nonidle_total, 1UL));
+		do_div(full[r], max(nonidle_total, 1UL));
+
+		group->some[r] += some[r];
+		group->full[r] += full[r];
+	}
+
+	/* avgX= */
+	expires = group->period_expires;
+	if (time_before(jiffies, expires))
+		goto out;
+
+	missed_periods = (jiffies - expires) / PSI_FREQ;
+	group->period_expires = expires + ((1 + missed_periods) * PSI_FREQ);
+
+	for (r = 0; r < NR_PSI_RESOURCES; r++) {
+		u64 some, full;
+
+		some = group->some[r] - group->last_some[r];
+		full = group->full[r] - group->last_full[r];
+
+		calc_avgs(group->avg_some[r], some, missed_periods);
+		calc_avgs(group->avg_full[r], full, missed_periods);
+
+		group->last_some[r] = group->some[r];
+		group->last_full[r] = group->full[r];
+	}
+out:
+	mutex_unlock(&group->stat_lock);
+	return nonidle_total;
+}
+
+static void psi_clock(struct work_struct *work)
+{
+	struct delayed_work *dwork;
+	struct psi_group *group;
+	bool nonidle;
+
+	dwork = to_delayed_work(work);
+	group = container_of(dwork, struct psi_group, clock_work);
+
+	/*
+	 * If there is task activity, periodically fold the per-cpu
+	 * times and feed samples into the running averages. If things
+	 * are idle and there is no data to process, stop the clock.
+	 * Once restarted, we'll catch up the running averages in one
+	 * go - see calc_avgs() and missed_periods.
+	 */
+
+	nonidle = psi_update_stats(group);
+
+	if (nonidle) {
+		unsigned long delay = 0;
+		unsigned long now;
+
+		now = READ_ONCE(jiffies);
+		if (time_after(group->period_expires, now))
+			delay = group->period_expires - now;
+		schedule_delayed_work(dwork, delay);
+	}
+}
+
+static void time_state(struct psi_resource *res, int state, u64 now)
+{
+	if (res->state != PSI_NONE) {
+		bool was_full = res->state == PSI_FULL;
+
+		res->times[was_full] += now - res->state_start;
+	}
+	if (res->state != state)
+		res->state = state;
+	if (res->state != PSI_NONE)
+		res->state_start = now;
+}
+
+static void psi_group_change(struct psi_group *group, int cpu, u64 now,
+			     unsigned int clear, unsigned int set)
+{
+	enum psi_state state = PSI_NONE;
+	struct psi_group_cpu *groupc;
+	unsigned int *tasks;
+	unsigned int to, bo;
+
+	groupc = per_cpu_ptr(group->cpus, cpu);
+	tasks = groupc->tasks;
+
+	/* Update task counts according to the set/clear bitmasks */
+	for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
+		int idx = to + (bo - 1);
+
+		if (tasks[idx] == 0 && !psi_bug) {
+			printk_deferred(KERN_ERR "psi: task underflow! cpu=%d idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
+					cpu, idx, tasks[0], tasks[1], tasks[2],
+					clear, set);
+			psi_bug = 1;
+		}
+		tasks[idx]--;
+	}
+	for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
+		tasks[to + (bo - 1)]++;
+
+	/* Time in which tasks wait for the CPU */
+	state = PSI_NONE;
+	if (tasks[NR_RUNNING] > 1)
+		state = PSI_SOME;
+	time_state(&groupc->res[PSI_CPU], state, now);
+
+	/* Time in which tasks wait for memory */
+	state = PSI_NONE;
+	if (tasks[NR_MEMSTALL]) {
+		if (!tasks[NR_RUNNING] ||
+		    (cpu_curr(cpu)->flags & PF_MEMSTALL))
+			state = PSI_FULL;
+		else
+			state = PSI_SOME;
+	}
+	time_state(&groupc->res[PSI_MEM], state, now);
+
+	/* Time in which tasks wait for IO */
+	state = PSI_NONE;
+	if (tasks[NR_IOWAIT]) {
+		if (!tasks[NR_RUNNING])
+			state = PSI_FULL;
+		else
+			state = PSI_SOME;
+	}
+	time_state(&groupc->res[PSI_IO], state, now);
+
+	/* Time in which tasks are non-idle, to weigh the CPU in summaries */
+	if (groupc->nonidle)
+		groupc->nonidle_time += now - groupc->nonidle_start;
+	groupc->nonidle = tasks[NR_RUNNING] ||
+		tasks[NR_IOWAIT] || tasks[NR_MEMSTALL];
+	if (groupc->nonidle)
+		groupc->nonidle_start = now;
+
+	/* Kick the stats aggregation worker if it's gone to sleep */
+	if (!delayed_work_pending(&group->clock_work))
+		schedule_delayed_work(&group->clock_work, PSI_FREQ);
+}
+
+void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
+{
+	int cpu = task_cpu(task);
+
+	if (psi_disabled)
+		return;
+
+	if (!task->pid)
+		return;
+
+	if (((task->psi_flags & set) ||
+	     (task->psi_flags & clear) != clear) &&
+	    !psi_bug) {
+		printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
+				task->pid, task->comm, cpu,
+				task->psi_flags, clear, set);
+		psi_bug = 1;
+	}
+
+	task->psi_flags &= ~clear;
+	task->psi_flags |= set;
+
+	psi_group_change(&psi_system, cpu, now, clear, set);
+}
+
+/**
+ * psi_memstall_enter - mark the beginning of a memory stall section
+ * @flags: flags to handle nested sections
+ *
+ * Marks the calling task as being stalled due to a lack of memory,
+ * such as waiting for a refault or performing reclaim.
+ */
+void psi_memstall_enter(unsigned long *flags)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	if (psi_disabled)
+		return;
+
+	*flags = current->flags & PF_MEMSTALL;
+	if (*flags)
+		return;
+	/*
+	 * PF_MEMSTALL setting & accounting needs to be atomic wrt
+	 * changes to the task's scheduling state, otherwise we can
+	 * race with CPU migration.
+	 */
+	rq = this_rq_lock_irq(&rf);
+
+	update_rq_clock(rq);
+
+	current->flags |= PF_MEMSTALL;
+	psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
+
+	rq_unlock_irq(rq, &rf);
+}
+
+/**
+ * psi_memstall_leave - mark the end of a memory stall section
+ * @flags: flags to handle nested memdelay sections
+ *
+ * Marks the calling task as no longer stalled due to lack of memory.
+ */
+void psi_memstall_leave(unsigned long *flags)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	if (psi_disabled)
+		return;
+
+	if (*flags)
+		return;
+	/*
+	 * PF_MEMSTALL clearing & accounting needs to be atomic wrt
+	 * changes to the task's scheduling state, otherwise we could
+	 * race with CPU migration.
+	 */
+	rq = this_rq_lock_irq(&rf);
+
+	update_rq_clock(rq);
+
+	current->flags &= ~PF_MEMSTALL;
+	psi_task_change(current, rq_clock(rq), TSK_MEMSTALL, 0);
+
+	rq_unlock_irq(rq, &rf);
+}
+
+static int psi_show(struct seq_file *m, struct psi_group *group,
+		    enum psi_res res)
+{
+	unsigned long avg[2][3];
+	u64 some, full;
+	int w;
+
+	if (psi_disabled)
+		return -EOPNOTSUPP;
+
+	psi_update_stats(group);
+
+	for (w = 0; w < 3; w++) {
+		avg[0][w] = group->avg_some[res][w];
+		avg[1][w] = group->avg_full[res][w];
+	}
+
+	some = group->some[res];
+	do_div(some, NSEC_PER_USEC);
+
+	seq_printf(m, "some avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
+		   LOAD_INT(avg[0][0]), LOAD_FRAC(avg[0][0]),
+		   LOAD_INT(avg[0][1]), LOAD_FRAC(avg[0][1]),
+		   LOAD_INT(avg[0][2]), LOAD_FRAC(avg[0][2]),
+		   some);
+
+	if (res == PSI_CPU)
+		return 0;
+
+	full = group->full[res];
+	do_div(full, NSEC_PER_USEC);
+
+	seq_printf(m, "full avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
+		   LOAD_INT(avg[1][0]), LOAD_FRAC(avg[1][0]),
+		   LOAD_INT(avg[1][1]), LOAD_FRAC(avg[1][1]),
+		   LOAD_INT(avg[1][2]), LOAD_FRAC(avg[1][2]),
+		   full);
+
+	return 0;
+}
+
+static int psi_cpu_show(struct seq_file *m, void *v)
+{
+	return psi_show(m, &psi_system, PSI_CPU);
+}
+
+static int psi_memory_show(struct seq_file *m, void *v)
+{
+	return psi_show(m, &psi_system, PSI_MEM);
+}
+
+static int psi_io_show(struct seq_file *m, void *v)
+{
+	return psi_show(m, &psi_system, PSI_IO);
+}
+
+static int psi_cpu_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, psi_cpu_show, NULL);
+}
+
+static int psi_memory_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, psi_memory_show, NULL);
+}
+
+static int psi_io_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, psi_io_show, NULL);
+}
+
+static const struct file_operations psi_cpu_fops = {
+	.open           = psi_cpu_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = single_release,
+};
+
+static const struct file_operations psi_memory_fops = {
+	.open           = psi_memory_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = single_release,
+};
+
+static const struct file_operations psi_io_fops = {
+	.open           = psi_io_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = single_release,
+};
+
+static int __init psi_proc_init(void)
+{
+	proc_mkdir("pressure", NULL);
+	proc_create("pressure/cpu", 0, NULL, &psi_cpu_fops);
+	proc_create("pressure/memory", 0, NULL, &psi_memory_fops);
+	proc_create("pressure/io", 0, NULL, &psi_io_fops);
+	return 0;
+}
+module_init(psi_proc_init);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bc798c7cb4d4..e798491ff329 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,6 +54,7 @@
 #include <linux/proc_fs.h>
 #include <linux/prefetch.h>
 #include <linux/profile.h>
+#include <linux/psi.h>
 #include <linux/rcupdate_wait.h>
 #include <linux/security.h>
 #include <linux/stackprotector.h>
@@ -320,6 +321,7 @@ extern bool dl_cpu_busy(unsigned int cpu);
 #ifdef CONFIG_CGROUP_SCHED
 
 #include <linux/cgroup.h>
+#include <linux/psi.h>
 
 struct cfs_rq;
 struct rt_rq;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8aea199a39b4..15b858cbbcb0 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -55,25 +55,111 @@ static inline void rq_sched_info_depart  (struct rq *rq, unsigned long long delt
 # define   schedstat_val_or_zero(var)	0
 #endif /* CONFIG_SCHEDSTATS */
 
+#ifdef CONFIG_PSI
+/*
+ * PSI tracks state that persists across sleeps, such as iowaits and
+ * memory stalls. As a result, it has to distinguish between sleeps,
+ * where a task's runnable state changes, and requeues, where a task
+ * and its state are being moved between CPUs and runqueues.
+ */
+static inline void psi_enqueue(struct task_struct *p, u64 now, bool wakeup)
+{
+	int clear = 0, set = TSK_RUNNING;
+
+	if (psi_disabled)
+		return;
+
+	if (!wakeup || p->sched_psi_wake_requeue) {
+		if (p->flags & PF_MEMSTALL)
+			set |= TSK_MEMSTALL;
+		if (p->sched_psi_wake_requeue)
+			p->sched_psi_wake_requeue = 0;
+	} else {
+		if (p->in_iowait)
+			clear |= TSK_IOWAIT;
+	}
+
+	psi_task_change(p, now, clear, set);
+}
+
+static inline void psi_dequeue(struct task_struct *p, u64 now, bool sleep)
+{
+	int clear = TSK_RUNNING, set = 0;
+
+	if (psi_disabled)
+		return;
+
+	if (!sleep) {
+		if (p->flags & PF_MEMSTALL)
+			clear |= TSK_MEMSTALL;
+	} else {
+		if (p->in_iowait)
+			set |= TSK_IOWAIT;
+	}
+
+	psi_task_change(p, now, clear, set);
+}
+
+static inline void psi_ttwu_dequeue(struct task_struct *p)
+{
+	if (psi_disabled)
+		return;
+	/*
+	 * Is the task being migrated during a wakeup? Make sure to
+	 * deregister its sleep-persistent psi states from the old
+	 * queue, and let psi_enqueue() know it has to requeue.
+	 */
+	if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
+		struct rq_flags rf;
+		struct rq *rq;
+		int clear = 0;
+
+		if (p->in_iowait)
+			clear |= TSK_IOWAIT;
+		if (p->flags & PF_MEMSTALL)
+			clear |= TSK_MEMSTALL;
+
+		rq = __task_rq_lock(p, &rf);
+		update_rq_clock(rq);
+		psi_task_change(p, rq_clock(rq), clear, 0);
+		p->sched_psi_wake_requeue = 1;
+		__task_rq_unlock(rq, &rf);
+	}
+}
+#else /* CONFIG_PSI */
+static inline void psi_enqueue(struct task_struct *p, u64 now, bool wakeup) {}
+static inline void psi_dequeue(struct task_struct *p, u64 now, bool sleep) {}
+static inline void psi_ttwu_dequeue(struct task_struct *p) {}
+#endif /* CONFIG_PSI */
+
 #ifdef CONFIG_SCHED_INFO
 static inline void sched_info_reset_dequeued(struct task_struct *t)
 {
 	t->sched_info.last_queued = 0;
 }
 
+static inline void sched_info_reset_queued(struct task_struct *t, u64 now)
+{
+	if (!t->sched_info.last_queued)
+		t->sched_info.last_queued = now;
+}
+
 /*
  * We are interested in knowing how long it was from the *first* time a
  * task was queued to the time that it finally hit a CPU, we call this routine
  * from dequeue_task() to account for possible rq->clock skew across CPUs. The
  * delta taken on each CPU would annul the skew.
  */
-static inline void sched_info_dequeued(struct rq *rq, struct task_struct *t)
+static inline void sched_info_dequeued(struct rq *rq, struct task_struct *t,
+				       bool sleep)
 {
 	unsigned long long now = rq_clock(rq), delta = 0;
 
-	if (unlikely(sched_info_on()))
+	if (unlikely(sched_info_on())) {
 		if (t->sched_info.last_queued)
 			delta = now - t->sched_info.last_queued;
+		psi_dequeue(t, now, sleep);
+	}
 	sched_info_reset_dequeued(t);
 	t->sched_info.run_delay += delta;
 
@@ -104,11 +190,14 @@ static void sched_info_arrive(struct rq *rq, struct task_struct *t)
  * the timestamp if it is already not set.  It's assumed that
  * sched_info_dequeued() will clear that stamp when appropriate.
  */
-static inline void sched_info_queued(struct rq *rq, struct task_struct *t)
+static inline void sched_info_queued(struct rq *rq, struct task_struct *t,
+				     bool wakeup)
 {
 	if (unlikely(sched_info_on())) {
-		if (!t->sched_info.last_queued)
-			t->sched_info.last_queued = rq_clock(rq);
+		unsigned long long now = rq_clock(rq);
+
+		sched_info_reset_queued(t, now);
+		psi_enqueue(t, now, wakeup);
 	}
 }
 
@@ -127,7 +216,8 @@ static inline void sched_info_depart(struct rq *rq, struct task_struct *t)
 	rq_sched_info_depart(rq, delta);
 
 	if (t->state == TASK_RUNNING)
-		sched_info_queued(rq, t);
+		if (unlikely(sched_info_on()))
+			sched_info_reset_queued(t, rq_clock(rq));
 }
 
 /*
diff --git a/mm/compaction.c b/mm/compaction.c
index 29bd1df18b98..8f9566745902 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -22,6 +22,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/page_owner.h>
+#include <linux/psi.h>
 #include "internal.h"
 
 #ifdef CONFIG_COMPACTION
@@ -2068,11 +2069,15 @@ static int kcompactd(void *p)
 	pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
 
 	while (!kthread_should_stop()) {
+		unsigned long pflags;
+
 		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
 		wait_event_freezable(pgdat->kcompactd_wait,
 				kcompactd_work_requested(pgdat));
 
+		psi_memstall_enter(&pflags);
 		kcompactd_do_work(pgdat);
+		psi_memstall_leave(&pflags);
 	}
 
 	return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index e49961e13dd9..eee06145b997 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -37,6 +37,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/rmap.h>
 #include <linux/delayacct.h>
+#include <linux/psi.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -1075,11 +1076,14 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
 	bool thrashing = false;
+	unsigned long pflags;
 	int ret = 0;
 
-	if (bit_nr == PG_locked && !PageSwapBacked(page) &&
+	if (bit_nr == PG_locked &&
 	    !PageUptodate(page) && PageWorkingset(page)) {
-		delayacct_thrashing_start();
+		if (!PageSwapBacked(page))
+			delayacct_thrashing_start();
+		psi_memstall_enter(&pflags);
 		thrashing = true;
 	}
 
@@ -1121,8 +1125,11 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 
 	finish_wait(q, wait);
 
-	if (thrashing)
-		delayacct_thrashing_end();
+	if (thrashing) {
+		if (!PageSwapBacked(page))
+			delayacct_thrashing_end();
+		psi_memstall_leave(&pflags);
+	}
 
 	/*
 	 * A signal could leave PageWaiters set. Clearing it here if
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 22320ea27489..8469f34e6731 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -67,6 +67,7 @@
 #include <linux/ftrace.h>
 #include <linux/lockdep.h>
 #include <linux/nmi.h>
+#include <linux/psi.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -3552,15 +3553,20 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		enum compact_priority prio, enum compact_result *compact_result)
 {
 	struct page *page;
+	unsigned long pflags;
 	unsigned int noreclaim_flag;
 
 	if (!order)
 		return NULL;
 
+	psi_memstall_enter(&pflags);
 	noreclaim_flag = memalloc_noreclaim_save();
+
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
 									prio);
+
 	memalloc_noreclaim_restore(noreclaim_flag);
+	psi_memstall_leave(&pflags);
 
 	if (*compact_result <= COMPACT_INACTIVE)
 		return NULL;
@@ -3749,11 +3755,14 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct reclaim_state reclaim_state;
 	int progress;
 	unsigned int noreclaim_flag;
+	unsigned long pflags;
 
 	cond_resched();
 
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
+
+	psi_memstall_enter(&pflags);
 	noreclaim_flag = memalloc_noreclaim_save();
 	fs_reclaim_acquire(gfp_mask);
 	reclaim_state.reclaimed_slab = 0;
@@ -3765,6 +3774,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 	current->reclaim_state = NULL;
 	fs_reclaim_release(gfp_mask);
 	memalloc_noreclaim_restore(noreclaim_flag);
+	psi_memstall_leave(&pflags);
 
 	cond_resched();
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8d1ad48ffbcd..ee91e8cbeb5a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -49,6 +49,7 @@
 #include <linux/prefetch.h>
 #include <linux/printk.h>
 #include <linux/dax.h>
+#include <linux/psi.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -3115,6 +3116,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 {
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
+	unsigned long pflags;
 	int nid;
 	unsigned int noreclaim_flag;
 	struct scan_control sc = {
@@ -3143,9 +3145,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					    sc.gfp_mask,
 					    sc.reclaim_idx);
 
+	psi_memstall_enter(&pflags);
 	noreclaim_flag = memalloc_noreclaim_save();
+
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+
 	memalloc_noreclaim_restore(noreclaim_flag);
+	psi_memstall_leave(&pflags);
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
@@ -3565,6 +3571,7 @@ static int kswapd(void *p)
 	pgdat->kswapd_order = 0;
 	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	for ( ; ; ) {
+		unsigned long pflags;
 		bool ret;
 
 		alloc_order = reclaim_order = pgdat->kswapd_order;
@@ -3601,9 +3608,15 @@ static int kswapd(void *p)
 		 */
 		trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
 						alloc_order);
+
+		psi_memstall_enter(&pflags);
 		fs_reclaim_acquire(GFP_KERNEL);
+
 		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
+
 		fs_reclaim_release(GFP_KERNEL);
+		psi_memstall_leave(&pflags);
+
 		if (reclaim_order < alloc_order)
 			goto kswapd_try_sleep;
 	}
-- 
2.18.0


* [PATCH 09/10] psi: cgroup support
  2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
                   ` (7 preceding siblings ...)
  2018-07-12 17:29 ` [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
@ 2018-07-12 17:29 ` Johannes Weiner
  2018-07-12 20:08   ` Tejun Heo
  2018-07-17 15:40   ` Peter Zijlstra
  2018-07-12 17:29 ` [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure Johannes Weiner
                   ` (5 subsequent siblings)
  14 siblings, 2 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-12 17:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
  Cc: Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system,
but also that of individual jobs, so that we can ensure individual job
health, fairness between jobs, or prioritize some jobs over others.

This patch implements pressure stall tracking for cgroups. In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure,
memory.pressure, and io.pressure files that track aggregate pressure
stall times for only the tasks inside the cgroup.
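
To illustrate how the per-cgroup numbers relate to each other, here is a
minimal userspace sketch of the ancestor walk that psi_task_change()
performs in the diff below: a state change is applied to the task's own
cgroup and to every ancestor except the root. The root is presumably
skipped because the system-wide group already covers it. The names
`struct group`, `nr_stalled` and `group_stall_change()` are hypothetical
stand-ins for illustration, not kernel API.

```c
#include <stddef.h>

/* Hypothetical stand-in for a cgroup with an embedded psi group. */
struct group {
	struct group *parent;	/* NULL for the root group */
	int nr_stalled;		/* stalled tasks in this subtree */
};

/*
 * Apply a change in the number of stalled tasks to @g and to all of
 * its ancestors, stopping before the root group (which has no parent).
 */
static void group_stall_change(struct group *g, int delta)
{
	while (g && g->parent) {
		g->nr_stalled += delta;
		g = g->parent;
	}
}
```

With a hierarchy root -> job -> worker, a stall reported for a task in
"worker" shows up in both "worker" and "job", which is why a parent's
pressure file reflects the stalls of all its descendants.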

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/accounting/psi.txt |  9 ++++
 Documentation/cgroup-v2.txt      | 18 +++++++
 include/linux/cgroup-defs.h      |  4 ++
 include/linux/cgroup.h           | 15 ++++++
 include/linux/psi.h              | 25 ++++++++++
 init/Kconfig                     |  4 ++
 kernel/cgroup/cgroup.c           | 45 +++++++++++++++++-
 kernel/sched/psi.c               | 82 +++++++++++++++++++++++++++++++-
 8 files changed, 198 insertions(+), 4 deletions(-)

diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
index 51e7ef14142e..e051810d5127 100644
--- a/Documentation/accounting/psi.txt
+++ b/Documentation/accounting/psi.txt
@@ -62,3 +62,12 @@ well as medium and long term trends. The total absolute stall time is
 tracked and exported as well, to allow detection of latency spikes
 which wouldn't necessarily make a dent in the time averages, or to
 average trends over custom time frames.
+
+Cgroup2 interface
+=================
+
+In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
+mounted, pressure stall information is also tracked for tasks grouped
+into cgroups. Each subdirectory in the cgroupfs mountpoint contains
+cpu.pressure, memory.pressure, and io.pressure files; the format is
+the same as the /proc/pressure/ files.
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeaed9f7a..a22879dba019 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -963,6 +963,12 @@ All time durations are in microseconds.
 	$PERIOD duration.  "max" for $MAX indicates no limit.  If only
 	one number is written, $MAX is updated.
 
+  cpu.pressure
+	A read-only nested-key file which exists on non-root cgroups.
+
+	Shows pressure stall information for CPU. See
+	Documentation/accounting/psi.txt for details.
+
 
 Memory
 ------
@@ -1199,6 +1205,12 @@ PAGE_SIZE multiple when read back.
 	Swap usage hard limit.  If a cgroup's swap usage reaches this
 	limit, anonymous memory of the cgroup will not be swapped out.
 
+  memory.pressure
+	A read-only nested-key file which exists on non-root cgroups.
+
+	Shows pressure stall information for memory. See
+	Documentation/accounting/psi.txt for details.
+
 
 Usage Guidelines
 ~~~~~~~~~~~~~~~~
@@ -1334,6 +1346,12 @@ IO Interface Files
 
 	  8:16 rbps=2097152 wbps=max riops=max wiops=max
 
+  io.pressure
+	A read-only nested-key file which exists on non-root cgroups.
+
+	Shows pressure stall information for IO. See
+	Documentation/accounting/psi.txt for details.
+
 
 Writeback
 ~~~~~~~~~
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index dc5b70449dc6..280f18da956a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -20,6 +20,7 @@
 #include <linux/u64_stats_sync.h>
 #include <linux/workqueue.h>
 #include <linux/bpf-cgroup.h>
+#include <linux/psi_types.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -424,6 +425,9 @@ struct cgroup {
 	/* used to schedule release agent */
 	struct work_struct release_agent_work;
 
+	/* used to track pressure stalls */
+	struct psi_group psi;
+
 	/* used to store eBPF programs */
 	struct cgroup_bpf bpf;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 473e0c0abb86..fd94c294c207 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -627,6 +627,11 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 	pr_cont_kernfs_path(cgrp->kn);
 }
 
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+	return &cgrp->psi;
+}
+
 static inline void cgroup_init_kthreadd(void)
 {
 	/*
@@ -680,6 +685,16 @@ static inline union kernfs_node_id *cgroup_get_kernfs_id(struct cgroup *cgrp)
 	return NULL;
 }
 
+static inline struct cgroup *cgroup_parent(struct cgroup *cgrp)
+{
+	return NULL;
+}
+
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+	return NULL;
+}
+
 static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 					       struct cgroup *ancestor)
 {
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 371af1479699..05c3dae3e9c5 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -4,6 +4,9 @@
 #include <linux/psi_types.h>
 #include <linux/sched.h>
 
+struct seq_file;
+struct css_set;
+
 #ifdef CONFIG_PSI
 
 extern bool psi_disabled;
@@ -15,6 +18,14 @@ void psi_task_change(struct task_struct *task, u64 now, int clear, int set);
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
+int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
+
+#ifdef CONFIG_CGROUPS
+int psi_cgroup_alloc(struct cgroup *cgrp);
+void psi_cgroup_free(struct cgroup *cgrp);
+void cgroup_move_task(struct task_struct *p, struct css_set *to);
+#endif
+
 #else /* CONFIG_PSI */
 
 static inline void psi_init(void) {}
@@ -22,6 +33,20 @@ static inline void psi_init(void) {}
 static inline void psi_memstall_enter(unsigned long *flags) {}
 static inline void psi_memstall_leave(unsigned long *flags) {}
 
+#ifdef CONFIG_CGROUPS
+static inline int psi_cgroup_alloc(struct cgroup *cgrp)
+{
+	return 0;
+}
+static inline void psi_cgroup_free(struct cgroup *cgrp)
+{
+}
+static inline void cgroup_move_task(struct task_struct *p, struct css_set *to)
+{
+	rcu_assign_pointer(p->cgroups, to);
+}
+#endif
+
 #endif /* CONFIG_PSI */
 
 #endif /* _LINUX_PSI_H */
diff --git a/init/Kconfig b/init/Kconfig
index e34859bda33e..b471dee2f0d4 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -469,6 +469,10 @@ config PSI
 	  the share of walltime in which some or all tasks in the system are
 	  delayed due to contention of the respective resource.
 
+	  In kernels with cgroup support (cgroup2 only), cgroups will
+	  have cpu.pressure, memory.pressure, and io.pressure files,
+	  which aggregate pressure stalls for the grouped tasks only.
+
 	  For more details see Documentation/accounting/psi.txt.
 
 	  Say N if unsure.
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a662bfcbea0e..de1ca380f234 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -54,6 +54,7 @@
 #include <linux/proc_ns.h>
 #include <linux/nsproxy.h>
 #include <linux/file.h>
+#include <linux/psi.h>
 #include <net/sock.h>
 
 #define CREATE_TRACE_POINTS
@@ -826,7 +827,7 @@ static void css_set_move_task(struct task_struct *task,
 		 */
 		WARN_ON_ONCE(task->flags & PF_EXITING);
 
-		rcu_assign_pointer(task->cgroups, to_cset);
+		cgroup_move_task(task, to_cset);
 		list_add_tail(&task->cg_list, use_mg_tasks ? &to_cset->mg_tasks :
 							     &to_cset->tasks);
 	}
@@ -3388,6 +3389,21 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
 	return ret;
 }
 
+#ifdef CONFIG_PSI
+static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
+{
+	return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU);
+}
+static int cgroup_memory_pressure_show(struct seq_file *seq, void *v)
+{
+	return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_MEM);
+}
+static int cgroup_io_pressure_show(struct seq_file *seq, void *v)
+{
+	return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_IO);
+}
+#endif
+
 static int cgroup_file_open(struct kernfs_open_file *of)
 {
 	struct cftype *cft = of->kn->priv;
@@ -4499,6 +4515,23 @@ static struct cftype cgroup_base_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = cpu_stat_show,
 	},
+#ifdef CONFIG_PSI
+	{
+		.name = "cpu.pressure",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cgroup_cpu_pressure_show,
+	},
+	{
+		.name = "memory.pressure",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cgroup_memory_pressure_show,
+	},
+	{
+		.name = "io.pressure",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cgroup_io_pressure_show,
+	},
+#endif
 	{ }	/* terminate */
 };
 
@@ -4559,6 +4592,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 			 */
 			cgroup_put(cgroup_parent(cgrp));
 			kernfs_put(cgrp->kn);
+			psi_cgroup_free(cgrp);
 			if (cgroup_on_dfl(cgrp))
 				cgroup_stat_exit(cgrp);
 			kfree(cgrp);
@@ -4805,10 +4839,15 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	cgrp->self.parent = &parent->self;
 	cgrp->root = root;
 	cgrp->level = level;
-	ret = cgroup_bpf_inherit(cgrp);
+
+	ret = psi_cgroup_alloc(cgrp);
 	if (ret)
 		goto out_idr_free;
 
+	ret = cgroup_bpf_inherit(cgrp);
+	if (ret)
+		goto out_psi_free;
+
 	for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
 		cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
 
@@ -4846,6 +4885,8 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 
 	return cgrp;
 
+out_psi_free:
+	psi_cgroup_free(cgrp);
 out_idr_free:
 	cgroup_idr_remove(&root->cgroup_idr, cgrp->id);
 out_stat_exit:
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ef8e20383e4c..53e0b7b83e2e 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -395,6 +395,9 @@ static void psi_group_change(struct psi_group *group, int cpu, u64 now,
 
 void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
 {
+#ifdef CONFIG_CGROUPS
+	struct cgroup *cgroup, *parent;
+#endif
 	int cpu = task_cpu(task);
 
 	if (psi_disabled)
@@ -416,6 +419,18 @@ void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
 	task->psi_flags |= set;
 
 	psi_group_change(&psi_system, cpu, now, clear, set);
+
+#ifdef CONFIG_CGROUPS
+	cgroup = task->cgroups->dfl_cgrp;
+	while (cgroup && (parent = cgroup_parent(cgroup))) {
+		struct psi_group *group;
+
+		group = cgroup_psi(cgroup);
+		psi_group_change(group, cpu, now, clear, set);
+
+		cgroup = parent;
+	}
+#endif
 }
 
 /**
@@ -482,8 +497,71 @@ void psi_memstall_leave(unsigned long *flags)
 	rq_unlock_irq(rq, &rf);
 }
 
-static int psi_show(struct seq_file *m, struct psi_group *group,
-		    enum psi_res res)
+#ifdef CONFIG_CGROUPS
+int psi_cgroup_alloc(struct cgroup *cgroup)
+{
+	cgroup->psi.cpus = alloc_percpu(struct psi_group_cpu);
+	if (!cgroup->psi.cpus)
+		return -ENOMEM;
+	psi_group_init(&cgroup->psi);
+	return 0;
+}
+
+void psi_cgroup_free(struct cgroup *cgroup)
+{
+	cancel_delayed_work_sync(&cgroup->psi.clock_work);
+	free_percpu(cgroup->psi.cpus);
+}
+
+/**
+ * cgroup_move_task - move task to a different cgroup
+ * @task: the task
+ * @to: the target css_set
+ *
+ * Move task to a new cgroup and safely migrate its associated stall
+ * state between the different groups.
+ *
+ * This function acquires the task's rq lock to lock out concurrent
+ * changes to the task's scheduling state and - in case the task is
+ * running - concurrent changes to its stall state.
+ */
+void cgroup_move_task(struct task_struct *task, struct css_set *to)
+{
+	unsigned int task_flags = 0;
+	struct rq_flags rf;
+	struct rq *rq;
+	u64 now;
+
+	rq = task_rq_lock(task, &rf);
+
+	if (task_on_rq_queued(task)) {
+		task_flags = TSK_RUNNING;
+	} else if (task->in_iowait) {
+		task_flags = TSK_IOWAIT;
+	}
+	if (task->flags & PF_MEMSTALL)
+		task_flags |= TSK_MEMSTALL;
+
+	if (task_flags) {
+		update_rq_clock(rq);
+		now = rq_clock(rq);
+		psi_task_change(task, now, task_flags, 0);
+	}
+
+	/*
+	 * Lame to do this here, but the scheduler cannot be locked
+	 * from the outside, so we move cgroups from inside sched/.
+	 */
+	rcu_assign_pointer(task->cgroups, to);
+
+	if (task_flags)
+		psi_task_change(task, now, 0, task_flags);
+
+	task_rq_unlock(rq, task, &rf);
+}
+#endif /* CONFIG_CGROUPS */
+
+int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 {
 	unsigned long avg[2][3];
 	u64 some, full;
-- 
2.18.0


* [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure
  2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
                   ` (8 preceding siblings ...)
  2018-07-12 17:29 ` [PATCH 09/10] psi: cgroup support Johannes Weiner
@ 2018-07-12 17:29 ` Johannes Weiner
  2018-07-12 23:45   ` Andrew Morton
                     ` (2 more replies)
  2018-07-12 17:37 ` [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Linus Torvalds
                   ` (4 subsequent siblings)
  14 siblings, 3 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-12 17:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
  Cc: Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

Right now, psi reports pressure and stall times of already concluded
stall events. For most use cases this is current enough, but certain
highly latency-sensitive applications, like the Android OOM killer,
might want to know about and react to stall states before they have
even concluded (e.g. a prolonged reclaim cycle).

This patches the procfs/cgroupfs interface such that when the pressure
metrics are read, the current per-cpu states, if any, are taken into
account as well.

Any ongoing states are concluded, their time snapshotted, and then
restarted. This requires holding the rq lock to avoid corruption. It
could use some form of rq lock ratelimiting or avoidance.
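
The conclude/snapshot/restart step can be sketched in plain C as
follows. This is a simplified userspace model under stated assumptions,
not the kernel code: `struct stall_tracker`, `stall_snapshot()` and
`stall_flush_some()` are hypothetical names, timestamps are passed in by
the caller, and the rq locking that the patch needs is left out.

```c
#include <stdint.h>

enum stall_state { STALL_NONE, STALL_SOME, STALL_FULL };

/* Hypothetical, simplified model of one per-cpu resource bucket. */
struct stall_tracker {
	enum stall_state state;	/* currently active state, if any */
	uint64_t state_start;	/* time at which that state began */
	uint64_t times[2];	/* accumulated time: [0] some, [1] full */
};

/*
 * Conclude the ongoing state at @now: fold the elapsed time into the
 * matching bucket, then restart the state from @now so the same time
 * is not counted twice when the state really ends.
 */
static void stall_snapshot(struct stall_tracker *t, uint64_t now)
{
	if (t->state == STALL_NONE)
		return;
	t->times[t->state == STALL_FULL] += now - t->state_start;
	t->state_start = now;
}

/*
 * Reader side: snapshot first, then collect and reset the buckets.
 * "some" includes "full" time, mirroring the summation in the patch.
 */
static uint64_t stall_flush_some(struct stall_tracker *t, uint64_t now)
{
	uint64_t some;

	stall_snapshot(t, now);
	some = t->times[0] + t->times[1];
	t->times[0] = 0;
	t->times[1] = 0;
	return some;
}
```

Without the snapshot, a reader would see nothing from a stall that
began before the read and has not yet ended, which is the gap this RFC
is meant to close.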

Requested-by: Suren Baghdasaryan <surenb@google.com>
Not-yet-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/psi.c | 56 +++++++++++++++++++++++++++++++++++++---------
 1 file changed, 46 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 53e0b7b83e2e..5a6c6057f775 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -190,7 +190,7 @@ static void calc_avgs(unsigned long avg[3], u64 time, int missed_periods)
 	}
 }
 
-static bool psi_update_stats(struct psi_group *group)
+static bool psi_update_stats(struct psi_group *group, bool ondemand)
 {
 	u64 some[NR_PSI_RESOURCES] = { 0, };
 	u64 full[NR_PSI_RESOURCES] = { 0, };
@@ -200,8 +200,6 @@ static bool psi_update_stats(struct psi_group *group)
 	int cpu;
 	int r;
 
-	mutex_lock(&group->stat_lock);
-
 	/*
 	 * Collect the per-cpu time buckets and average them into a
 	 * single time sample that is normalized to wallclock time.
@@ -218,10 +216,36 @@ static bool psi_update_stats(struct psi_group *group)
 	for_each_online_cpu(cpu) {
 		struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
 		unsigned long nonidle;
+		struct rq_flags rf;
+		struct rq *rq;
+		u64 now;
 
-		if (!groupc->nonidle_time)
+		if (!groupc->nonidle_time && !groupc->nonidle)
 			continue;
 
+		/*
+		 * We come here for two things: 1) periodic per-cpu
+		 * bucket flushing and averaging and 2) when the user
+		 * wants to read a pressure file. For flushing and
+		 * averaging, which is relatively infrequent, we can
+		 * be lazy and tolerate some raciness with concurrent
+		 * updates to the per-cpu counters. However, if a user
+		 * polls the pressure state, we want to give them the
+		 * most uptodate information we have, including any
+		 * currently active state which hasn't been timed yet,
+		 * because in case of an iowait or a reclaim run, that
+		 * can be significant.
+		 */
+		if (ondemand) {
+			rq = cpu_rq(cpu);
+			rq_lock_irq(rq, &rf);
+
+			now = cpu_clock(cpu);
+
+			groupc->nonidle_time += now - groupc->nonidle_start;
+			groupc->nonidle_start = now;
+		}
+
 		nonidle = nsecs_to_jiffies(groupc->nonidle_time);
 		groupc->nonidle_time = 0;
 		nonidle_total += nonidle;
@@ -229,13 +253,27 @@ static bool psi_update_stats(struct psi_group *group)
 		for (r = 0; r < NR_PSI_RESOURCES; r++) {
 			struct psi_resource *res = &groupc->res[r];
 
+			if (ondemand && res->state != PSI_NONE) {
+				bool is_full = res->state == PSI_FULL;
+
+				res->times[is_full] += now - res->state_start;
+				res->state_start = now;
+			}
+
 			some[r] += (res->times[0] + res->times[1]) * nonidle;
 			full[r] += res->times[1] * nonidle;
 
-			/* It's racy, but we can tolerate some error */
 			res->times[0] = 0;
 			res->times[1] = 0;
 		}
+
+		if (ondemand)
+			rq_unlock_irq(rq, &rf);
+	}
+
+	for (r = 0; r < NR_PSI_RESOURCES; r++) {
+		do_div(some[r], max(nonidle_total, 1UL));
+		do_div(full[r], max(nonidle_total, 1UL));
 	}
 
 	/*
@@ -249,12 +287,10 @@ static bool psi_update_stats(struct psi_group *group)
 	 * activity, thus no data, and clock ticks are sporadic. The
 	 * below handles both.
 	 */
+	mutex_lock(&group->stat_lock);
 
 	/* total= */
 	for (r = 0; r < NR_PSI_RESOURCES; r++) {
-		do_div(some[r], max(nonidle_total, 1UL));
-		do_div(full[r], max(nonidle_total, 1UL));
-
 		group->some[r] += some[r];
 		group->full[r] += full[r];
 	}
@@ -301,7 +337,7 @@ static void psi_clock(struct work_struct *work)
 	 * go - see calc_avgs() and missed_periods.
 	 */
 
-	nonidle = psi_update_stats(group);
+	nonidle = psi_update_stats(group, false);
 
 	if (nonidle) {
 		unsigned long delay = 0;
@@ -570,7 +606,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 	if (psi_disabled)
 		return -EOPNOTSUPP;
 
-	psi_update_stats(group);
+	psi_update_stats(group, true);
 
 	for (w = 0; w < 3; w++) {
 		avg[0][w] = group->avg_some[res][w];
-- 
2.18.0


* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
                   ` (9 preceding siblings ...)
  2018-07-12 17:29 ` [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure Johannes Weiner
@ 2018-07-12 17:37 ` Linus Torvalds
  2018-07-12 23:44 ` Andrew Morton
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2018-07-12 17:37 UTC (permalink / raw)
  To: hannes
  Cc: mingo, Peter Zijlstra, Andrew Morton, tj, surenb, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, shakeelb, linux-mm, cgroups,
	lkml, kernel-team

On Thu, Jul 12, 2018 at 10:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> PSI aggregates and reports the overall wallclock time in which the
> tasks in a system (or cgroup) wait for contended hardware resources.

No comments on the patches themselves (the concept looks sane, and I'm
finding it more intriguing for non-oom uses than for oom), but just a
note to say that gmail hates you and marked every single patch as spam
for some reason.

I have no idea why. All the headers look fine, DKIM passes, nothing
bad stands out.

So it must be personal.

             Linus

* Re: [PATCH 09/10] psi: cgroup support
  2018-07-12 17:29 ` [PATCH 09/10] psi: cgroup support Johannes Weiner
@ 2018-07-12 20:08   ` Tejun Heo
  2018-07-17 15:40   ` Peter Zijlstra
  1 sibling, 0 replies; 83+ messages in thread
From: Tejun Heo @ 2018-07-12 20:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 12, 2018 at 01:29:41PM -0400, Johannes Weiner wrote:
> On a system that executes multiple cgrouped jobs and independent
> workloads, we don't just care about the health of the overall system,
> but also that of individual jobs, so that we can ensure individual job
> health, fairness between jobs, or prioritize some jobs over others.
> 
> This patch implements pressure stall tracking for cgroups. In kernels
> with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure,
> memory.pressure, and io.pressure files that track aggregate pressure
> stall times for only the tasks inside the cgroup.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Tejun Heo <tj@kernel.org>

Please feel free to route with the rest of the patchset.

Thanks.

-- 
tejun

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
                   ` (10 preceding siblings ...)
  2018-07-12 17:37 ` [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Linus Torvalds
@ 2018-07-12 23:44 ` Andrew Morton
  2018-07-13 22:14   ` Johannes Weiner
  2018-07-16 15:57 ` Daniel Drake
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 83+ messages in thread
From: Andrew Morton @ 2018-07-12 23:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, 12 Jul 2018 13:29:32 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

>
> ...
>
> The io file is similar to memory. Because the block layer doesn't have
> a concept of hardware contention right now (how much longer is my IO
> request taking due to other tasks?), it reports CPU potential lost on
> all IO delays, not just the potential lost due to competition.

Probably dumb question: disks aren't the only form of IO.  Does it make
sense to accumulate PSI for other forms of IO?  Networking comes to
mind...


* Re: [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure
  2018-07-12 17:29 ` [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure Johannes Weiner
@ 2018-07-12 23:45   ` Andrew Morton
  2018-07-13 22:17     ` Johannes Weiner
  2018-07-13 22:13   ` Suren Baghdasaryan
  2018-07-17 15:13   ` Peter Zijlstra
  2 siblings, 1 reply; 83+ messages in thread
From: Andrew Morton @ 2018-07-12 23:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, 12 Jul 2018 13:29:42 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Right now, psi reports pressure and stall times of already concluded
> stall events. For most use cases this is current enough, but certain
> highly latency-sensitive applications, like the Android OOM killer,
> might want to know about and react to stall states before they have
> even concluded (e.g. a prolonged reclaim cycle).
> 
> This patches the procfs/cgroupfs interface such that when the pressure
> metrics are read, the current per-cpu states, if any, are taken into
> account as well.
> 
> Any ongoing states are concluded, their time snapshotted, and then
> restarted. This requires holding the rq lock to avoid corruption. It
> could use some form of rq lock ratelimiting or avoidance.
> 
> Requested-by: Suren Baghdasaryan <surenb@google.com>
> Not-yet-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

What-does-that-mean:?

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-12 17:29 ` [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
@ 2018-07-13  9:21   ` Peter Zijlstra
  2018-07-13 16:17     ` Johannes Weiner
  2018-07-17 10:03   ` Peter Zijlstra
                     ` (8 subsequent siblings)
  9 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-13  9:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +static inline void psi_ttwu_dequeue(struct task_struct *p)
> +{
> +	if (psi_disabled)
> +		return;
> +	/*
> +	 * Is the task being migrated during a wakeup? Make sure to
> +	 * deregister its sleep-persistent psi states from the old
> +	 * queue, and let psi_enqueue() know it has to requeue.
> +	 */
> +	if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> +		struct rq_flags rf;
> +		struct rq *rq;
> +		int clear = 0;
> +
> +		if (p->in_iowait)
> +			clear |= TSK_IOWAIT;
> +		if (p->flags & PF_MEMSTALL)
> +			clear |= TSK_MEMSTALL;
> +
> +		rq = __task_rq_lock(p, &rf);
> +		update_rq_clock(rq);
> +		psi_task_change(p, rq_clock(rq), clear, 0);
> +		p->sched_psi_wake_requeue = 1;
> +		__task_rq_unlock(rq, &rf);
> +	}
> +}

Still NAK, what happened to this here:

  https://lkml.kernel.org/r/20180514083353.GN12217@hirez.programming.kicks-ass.net


* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-13  9:21   ` Peter Zijlstra
@ 2018-07-13 16:17     ` Johannes Weiner
  2018-07-14  8:48       ` Peter Zijlstra
  2018-07-14  9:02       ` Peter Zijlstra
  0 siblings, 2 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-13 16:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

Hi Peter,

On Fri, Jul 13, 2018 at 11:21:53AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +static inline void psi_ttwu_dequeue(struct task_struct *p)
> > +{
> > +	if (psi_disabled)
> > +		return;
> > +	/*
> > +	 * Is the task being migrated during a wakeup? Make sure to
> > +	 * deregister its sleep-persistent psi states from the old
> > +	 * queue, and let psi_enqueue() know it has to requeue.
> > +	 */
> > +	if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> > +		struct rq_flags rf;
> > +		struct rq *rq;
> > +		int clear = 0;
> > +
> > +		if (p->in_iowait)
> > +			clear |= TSK_IOWAIT;
> > +		if (p->flags & PF_MEMSTALL)
> > +			clear |= TSK_MEMSTALL;
> > +
> > +		rq = __task_rq_lock(p, &rf);
> > +		update_rq_clock(rq);
> > +		psi_task_change(p, rq_clock(rq), clear, 0);
> > +		p->sched_psi_wake_requeue = 1;
> > +		__task_rq_unlock(rq, &rf);
> > +	}
> > +}
> 
> Still NAK, what happened to this here:
> 
>   https://lkml.kernel.org/r/20180514083353.GN12217@hirez.programming.kicks-ass.net

I did react to this in the v2 docs / code comments, but I should have
been more direct about addressing your points - sorry about that.

In that thread we disagree about exactly how to aggregate task stalls
to produce meaningful numbers, but your main issue is with the way we
track state per-CPU instead of globally, given the rq lock cost on
wake-migration and the meaning of task->cpu of a sleeping task.

First off, what I want to do can indeed be done without a strong link
of a sleeping task to a CPU. We don't rely on it, and it's something I
only figured out in v2. The important thing is not, as I previously
thought, that CPUs are tracked independently from each other, but that
we use potential execution threads as the baseline for potential that
could be wasted by resource delays. Tracking CPUs independently just
happens to do that implicitly, but it's not a requirement.

In v2 of psi.c I'm outlining a model that formulates the SOME and FULL
states from global state in a way that still produces meaningful
numbers on SMP machines by comparing the task state to the number of
possible concurrent execution threads. Here is the excerpt:

	threads = min(nr_nonidle_tasks, nr_cpus)
	   SOME = min(nr_delayed_tasks / threads, 1)
	   FULL = (threads - min(nr_running_tasks, threads)) / threads

It's followed in psi.c by examples of how/why it works, but whether
you agree with the exact formula or not, what you can see is that it
could be implemented exactly like the load average: use per-cpu
counters to construct global values for those task counts, fold and
sample that state periodically and feed it into the running averages.
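
As a rough userspace illustration of the excerpt above (not the kernel
code), the formula can be sketched in C as below. The struct and
function names are hypothetical; only the four task/CPU counts come
from the excerpt, and the early return for threads == 0 is an added
assumption to avoid dividing by zero on a fully idle system.

```c
/* Hypothetical container for the two ratios; not a kernel type. */
struct psi_ratios {
	double some;	/* share of execution potential delayed for some tasks */
	double full;	/* share of execution potential lost entirely */
};

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * threads = min(nr_nonidle_tasks, nr_cpus)
 *    SOME = min(nr_delayed_tasks / threads, 1)
 *    FULL = (threads - min(nr_running_tasks, threads)) / threads
 */
static struct psi_ratios psi_global_ratios(unsigned int nr_nonidle_tasks,
					   unsigned int nr_delayed_tasks,
					   unsigned int nr_running_tasks,
					   unsigned int nr_cpus)
{
	struct psi_ratios r = { 0.0, 0.0 };
	unsigned int threads = min_uint(nr_nonidle_tasks, nr_cpus);

	/* Assumption: with no non-idle tasks there is no pressure to report. */
	if (!threads)
		return r;

	r.some = (double)nr_delayed_tasks / threads;
	if (r.some > 1.0)
		r.some = 1.0;

	r.full = (double)(threads - min_uint(nr_running_tasks, threads)) / threads;
	return r;
}
```

With two CPUs, two non-idle tasks, one of them delayed and one
running, this yields SOME = 0.5 and FULL = 0.5.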

So whytf is it still done with cpu-local task states?

The general problem with sampling here is that it's way too coarse to
capture the events we want to know about. The load average is okay-ish
for long term trends, but interactive things care about stalls in the
millisecond range each, and we cannot get those accurately with
second-long sampling intervals (and we cannot fold the CPU state much
more frequently than this before it gets prohibitively expensive).

Since our stall states are composed of multiple tasks, recording the
precise time spent in them requires some sort of serialization with
scheduling activity, and doing that globally would be a non-starter on
SMP. Hence still the CPU-local state tracking to approximate the
global state.

Now to your concern about relying on the task<->CPU association.

We don't *really* rely on a strict association, it's more of a hint or
historic correlation. It's fine if tasks move around on us, we just
want to approximate when CPUs go idle due to stalls or lack of work.
Let's take your quote from the thread:

: Note that a task doesn't sleep on a CPU. When it sleeps it is not
: strictly associated with a CPU, only when it runs does it have an
: association.
:
: What is the value of accounting a sleep state to a particular CPU
: if the task wakes up on another? Where did the sleep take place?

Let's say you have a CPU running a task that then stalls on
memory. When it wakes back up it gets moved to another CPU.

We don't care so much about what happens after the task wakes up, we
just need to know where the task was running when it stalled. Even if
the task gets migrated on wakeup - *while* the stall is occurring, we
can say whether that task's old CPU goes idle due to that stall, and
has to report FULL; or something else can run on it, in which case it
only reports SOME. And even if the task bounced around CPUs while it
was running, and it was only briefly on the CPU on which it stalled -
what we care about is a CPU being idle because of stalls instead of a
genuine lack of work.
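
To make the per-CPU decision concrete, here is a simplified sketch of
how one CPU's state for a resource could be classified from its local
task counts. It follows the spirit of psi_group_change() in the patch,
but the function name is hypothetical and it deliberately leaves out
the patch's extra check for the currently running task being in a
memory stall itself (PF_MEMSTALL).

```c
/* State names as used in the patch; the helper below is hypothetical. */
enum psi_state { PSI_NONE, PSI_SOME, PSI_FULL };

/*
 * nr_stalled: tasks tracked on this CPU that are stalled on the resource
 * nr_running: tasks tracked on this CPU that are runnable
 */
static enum psi_state cpu_stall_state(unsigned int nr_stalled,
				      unsigned int nr_running)
{
	if (!nr_stalled)
		return PSI_NONE;	/* nobody is waiting on the resource */
	if (!nr_running)
		return PSI_FULL;	/* CPU goes idle because of the stall */
	return PSI_SOME;		/* something else can still run here */
}
```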

This is certainly susceptible to delayed tasks bunching up unevenly on
CPUs, like the comment in the referenced e33a9bba85a8 ("sched/core:
move IO scheduling accounting from io_schedule_timeout() into
scheduler") points out. I.e. a second task starts running on that CPU
with the delayed task, then gets delayed itself; now you have two
delayed tasks on a single CPU and possibly none on some other CPU.

Does that mean we underreport pressure, or report "a lower bound of
pressure" in the words of e33a9bba85a8?

Not entirely. We average CPUs based on nonidle weight. If you have two
CPUs and one has two stalled tasks while the other CPU is idle, the
average still works out to 100% FULL since the idle CPU doesn't weigh
anything in the aggregation.
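
The two-CPU example can be checked with a small sketch of the
nonidle-weighted average. This mirrors the shape of the aggregation in
psi_update_stats() (multiply each CPU's stall time by its nonidle
weight, then divide by the total weight, clamped to at least 1), but
the struct and function names are hypothetical and the time units are
arbitrary.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU sample for one aggregation period. */
struct cpu_sample {
	uint64_t full_time;	/* time this CPU spent in the FULL state */
	uint64_t nonidle_time;	/* time this CPU had any non-idle task */
};

/* FULL time averaged over CPUs, each weighted by its nonidle time. */
static uint64_t weighted_full_time(const struct cpu_sample *cpus, size_t nr)
{
	uint64_t sum = 0, weight = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		sum += cpus[i].full_time * cpus[i].nonidle_time;
		weight += cpus[i].nonidle_time;
	}
	if (!weight)
		weight = 1;	/* same guard as max(nonidle_total, 1UL) */

	return sum / weight;
}
```

For a period of 1000 units, a CPU with { full_time = 1000,
nonidle_time = 1000 } next to an idle CPU with { 0, 0 } averages to
1000, i.e. 100% FULL, whereas an unweighted mean would report 50%.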

It's not perfect since the nonidle tracking is shared between all
three resources and, say, an iowait task tracked on the other CPU
would render that CPU "productive" from a *memory* stand point. We
*could* change that by splitting out nonidle tracking per resource,
but I'm honestly not convinced that this is an issue in practice - it
certainly hasn't been for us. Even if we said this *is* a legitimate
issue, reporting the lower bound of all stall events is a smaller
error than missing events entirely like periodic sampling would.

That's my thought process, anyway. I'd be more than happy to make this
more lightweight, but I don't see a way to do it without losing
significant functional precision.


* Re: [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure
  2018-07-12 17:29 ` [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure Johannes Weiner
  2018-07-12 23:45   ` Andrew Morton
@ 2018-07-13 22:13   ` Suren Baghdasaryan
  2018-07-13 22:49     ` Johannes Weiner
  2018-07-17 15:13   ` Peter Zijlstra
  2 siblings, 1 reply; 83+ messages in thread
From: Suren Baghdasaryan @ 2018-07-13 22:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds,
	Tejun Heo, Vinayak Menon, Christopher Lameter, Mike Galbraith,
	Shakeel Butt, linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Jul 12, 2018 at 10:29 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Right now, psi reports pressure and stall times of already concluded
> stall events. For most use cases this is current enough, but certain
> highly latency-sensitive applications, like the Android OOM killer,

To be more precise, it's Android LMKD (the low memory killer daemon),
not to be confused with the kernel OOM killer.

> might want to know about and react to stall states before they have
> even concluded (e.g. a prolonged reclaim cycle).
>
> This patches the procfs/cgroupfs interface such that when the pressure
> metrics are read, the current per-cpu states, if any, are taken into
> account as well.
>
> Any ongoing states are concluded, their time snapshotted, and then
> restarted. This requires holding the rq lock to avoid corruption. It
> could use some form of rq lock ratelimiting or avoidance.
>
> Requested-by: Suren Baghdasaryan <surenb@google.com>
> Not-yet-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---

IMHO this description is a little difficult to understand. In essence,
PSI information is being updated periodically every 2secs and without
this patch the data can be stale at the time when we read it (because
it was last updated up to 2secs ago). To avoid this we update the PSI
"total" values when data is being read.

>  kernel/sched/psi.c | 56 +++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 46 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 53e0b7b83e2e..5a6c6057f775 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -190,7 +190,7 @@ static void calc_avgs(unsigned long avg[3], u64 time, int missed_periods)
>         }
>  }
>
> -static bool psi_update_stats(struct psi_group *group)
> +static bool psi_update_stats(struct psi_group *group, bool ondemand)
>  {
>         u64 some[NR_PSI_RESOURCES] = { 0, };
>         u64 full[NR_PSI_RESOURCES] = { 0, };
> @@ -200,8 +200,6 @@ static bool psi_update_stats(struct psi_group *group)
>         int cpu;
>         int r;
>
> -       mutex_lock(&group->stat_lock);
> -
>         /*
>          * Collect the per-cpu time buckets and average them into a
>          * single time sample that is normalized to wallclock time.
> @@ -218,10 +216,36 @@ static bool psi_update_stats(struct psi_group *group)
>         for_each_online_cpu(cpu) {
>                 struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
>                 unsigned long nonidle;
> +               struct rq_flags rf;
> +               struct rq *rq;
> +               u64 now;
>
> -               if (!groupc->nonidle_time)
> +               if (!groupc->nonidle_time && !groupc->nonidle)
>                         continue;
>
> +               /*
> +                * We come here for two things: 1) periodic per-cpu
> +                * bucket flushing and averaging and 2) when the user
> +                * wants to read a pressure file. For flushing and
> +                * averaging, which is relatively infrequent, we can
> +                * be lazy and tolerate some raciness with concurrent
> +                * updates to the per-cpu counters. However, if a user
> +                * polls the pressure state, we want to give them the
> +                * most uptodate information we have, including any
> +                * currently active state which hasn't been timed yet,
> +                * because in case of an iowait or a reclaim run, that
> +                * can be significant.
> +                */
> +               if (ondemand) {
> +                       rq = cpu_rq(cpu);
> +                       rq_lock_irq(rq, &rf);
> +
> +                       now = cpu_clock(cpu);
> +
> +                       groupc->nonidle_time += now - groupc->nonidle_start;
> +                       groupc->nonidle_start = now;
> +               }
> +
>                 nonidle = nsecs_to_jiffies(groupc->nonidle_time);
>                 groupc->nonidle_time = 0;
>                 nonidle_total += nonidle;
> @@ -229,13 +253,27 @@ static bool psi_update_stats(struct psi_group *group)
>                 for (r = 0; r < NR_PSI_RESOURCES; r++) {
>                         struct psi_resource *res = &groupc->res[r];
>
> +                       if (ondemand && res->state != PSI_NONE) {
> +                               bool is_full = res->state == PSI_FULL;
> +
> +                               res->times[is_full] += now - res->state_start;
> +                               res->state_start = now;
> +                       }
> +
>                         some[r] += (res->times[0] + res->times[1]) * nonidle;
>                         full[r] += res->times[1] * nonidle;
>
> -                       /* It's racy, but we can tolerate some error */
>                         res->times[0] = 0;
>                         res->times[1] = 0;
>                 }
> +
> +               if (ondemand)
> +                       rq_unlock_irq(rq, &rf);
> +       }
> +
> +       for (r = 0; r < NR_PSI_RESOURCES; r++) {
> +               do_div(some[r], max(nonidle_total, 1UL));
> +               do_div(full[r], max(nonidle_total, 1UL));
>         }
>
>         /*
> @@ -249,12 +287,10 @@ static bool psi_update_stats(struct psi_group *group)
>          * activity, thus no data, and clock ticks are sporadic. The
>          * below handles both.
>          */
> +       mutex_lock(&group->stat_lock);
>
>         /* total= */
>         for (r = 0; r < NR_PSI_RESOURCES; r++) {
> -               do_div(some[r], max(nonidle_total, 1UL));
> -               do_div(full[r], max(nonidle_total, 1UL));
> -
>                 group->some[r] += some[r];
>                 group->full[r] += full[r];
>         }
> @@ -301,7 +337,7 @@ static void psi_clock(struct work_struct *work)
>          * go - see calc_avgs() and missed_periods.
>          */
>
> -       nonidle = psi_update_stats(group);
> +       nonidle = psi_update_stats(group, false);
>
>         if (nonidle) {
>                 unsigned long delay = 0;
> @@ -570,7 +606,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
>         if (psi_disabled)
>                 return -EOPNOTSUPP;
>
> -       psi_update_stats(group);
> +       psi_update_stats(group, true);
>
>         for (w = 0; w < 3; w++) {
>                 avg[0][w] = group->avg_some[res][w];
> --
> 2.18.0
>


* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-12 23:44 ` Andrew Morton
@ 2018-07-13 22:14   ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-13 22:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 12, 2018 at 04:44:22PM -0700, Andrew Morton wrote:
> On Thu, 12 Jul 2018 13:29:32 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> >
> > ...
> >
> > The io file is similar to memory. Because the block layer doesn't have
> > a concept of hardware contention right now (how much longer is my IO
> > request taking due to other tasks?), it reports CPU potential lost on
> > all IO delays, not just the potential lost due to competition.
> 
> Probably dumb question: disks aren't the only form of IO.  Does it make
> sense to accumulate PSI for other forms of IO?  Networking comes to
> mind...

It's conceivable, although I haven't thought too much about it yet. If
that turns out to be a state we might want to track, we can easily add
a task state to identify such stalls and add /proc/pressure/net e.g.

"io" in this case means only the block layer / filesystems. I think
keeping this distinction makes sense in the interest of identifying
which type of hardware resource is posing a pressure problem.


* Re: [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure
  2018-07-12 23:45   ` Andrew Morton
@ 2018-07-13 22:17     ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-13 22:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 12, 2018 at 04:45:37PM -0700, Andrew Morton wrote:
> On Thu, 12 Jul 2018 13:29:42 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Right now, psi reports pressure and stall times of already concluded
> > stall events. For most use cases this is current enough, but certain
> > highly latency-sensitive applications, like the Android OOM killer,
> > might want to know about and react to stall states before they have
> > even concluded (e.g. a prolonged reclaim cycle).
> > 
> > This patches the procfs/cgroupfs interface such that when the pressure
> > metrics are read, the current per-cpu states, if any, are taken into
> > account as well.
> > 
> > Any ongoing states are concluded, their time snapshotted, and then
> > restarted. This requires holding the rq lock to avoid corruption. It
> > could use some form of rq lock ratelimiting or avoidance.
> > 
> > Requested-by: Suren Baghdasaryan <surenb@google.com>
> > Not-yet-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> What-does-that-mean:?

I didn't think this patch was ready for upstream yet, hence the RFC
and the lack of a proper sign-off.

But Suren has been testing this and found it useful in his specific
low-latency application, so I included it for completeness, for other
testers to find, and for possible suggestions on how to improve it.


* Re: [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure
  2018-07-13 22:13   ` Suren Baghdasaryan
@ 2018-07-13 22:49     ` Johannes Weiner
  2018-07-13 23:34       ` Suren Baghdasaryan
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2018-07-13 22:49 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds,
	Tejun Heo, Vinayak Menon, Christopher Lameter, Mike Galbraith,
	Shakeel Butt, linux-mm, cgroups, linux-kernel, kernel-team

On Fri, Jul 13, 2018 at 03:13:07PM -0700, Suren Baghdasaryan wrote:
> On Thu, Jul 12, 2018 at 10:29 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > might want to know about and react to stall states before they have
> > even concluded (e.g. a prolonged reclaim cycle).
> >
> > This patches the procfs/cgroupfs interface such that when the pressure
> > metrics are read, the current per-cpu states, if any, are taken into
> > account as well.
> >
> > Any ongoing states are concluded, their time snapshotted, and then
> > restarted. This requires holding the rq lock to avoid corruption. It
> > could use some form of rq lock ratelimiting or avoidance.
> >
> > Requested-by: Suren Baghdasaryan <surenb@google.com>
> > Not-yet-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> 
> IMHO this description is a little difficult to understand. In essence,
> PSI information is being updated periodically every 2secs and without
> this patch the data can be stale at the time when we read it (because
> it was last updated up to 2secs ago). To avoid this we update the PSI
> "total" values when data is being read.

That fix I actually folded into the main patch. We now always update
the total= field at the time the user reads to include all concluded
events, even if we sampled less than 2s ago. Only the running averages
are still bound to the 2s sampling window.

What this patch adds on top is for total= to include any *ongoing*
stall events that might be happening on a CPU at the time of reading
from the interface, like a reclaim cycle that hasn't finished yet.
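
The "conclude, snapshot, restart" step for an ongoing stall can be
sketched in isolation as follows. This is a simplified single-state
model with hypothetical names; the real patch does this per CPU and
per resource while holding the rq lock, which the sketch ignores.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracker for one stall state on one CPU. */
struct stall_tracker {
	bool     active;	/* a stall is ongoing right now */
	uint64_t state_start;	/* timestamp at which the ongoing stall began */
	uint64_t total;		/* stall time accounted so far */
};

/*
 * Called when somebody reads the pressure file: fold the elapsed part
 * of any ongoing stall into the total and restart its clock at 'now',
 * so the same interval is never counted twice.
 */
static uint64_t read_total(struct stall_tracker *t, uint64_t now)
{
	if (t->active) {
		t->total += now - t->state_start;
		t->state_start = now;
	}
	return t->total;
}
```

A tracker with total = 50 whose stall started at time 100 reports 80
when read at time 130; a second read at the same time still reports 80.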


* Re: [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure
  2018-07-13 22:49     ` Johannes Weiner
@ 2018-07-13 23:34       ` Suren Baghdasaryan
  0 siblings, 0 replies; 83+ messages in thread
From: Suren Baghdasaryan @ 2018-07-13 23:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds,
	Tejun Heo, Vinayak Menon, Christopher Lameter, Mike Galbraith,
	Shakeel Butt, linux-mm, cgroups, linux-kernel, kernel-team

On Fri, Jul 13, 2018 at 3:49 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Fri, Jul 13, 2018 at 03:13:07PM -0700, Suren Baghdasaryan wrote:
>> On Thu, Jul 12, 2018 at 10:29 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > might want to know about and react to stall states before they have
>> > even concluded (e.g. a prolonged reclaim cycle).
>> >
>> > This patches the procfs/cgroupfs interface such that when the pressure
>> > metrics are read, the current per-cpu states, if any, are taken into
>> > account as well.
>> >
>> > Any ongoing states are concluded, their time snapshotted, and then
>> > restarted. This requires holding the rq lock to avoid corruption. It
>> > could use some form of rq lock ratelimiting or avoidance.
>> >
>> > Requested-by: Suren Baghdasaryan <surenb@google.com>
>> > Not-yet-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>> > ---
>>
>> IMHO this description is a little difficult to understand. In essence,
>> PSI information is being updated periodically every 2secs and without
>> this patch the data can be stale at the time when we read it (because
>> it was last updated up to 2secs ago). To avoid this we update the PSI
>> "total" values when data is being read.
>
> That fix I actually folded into the main patch. We now always update
> the total= field at the time the user reads to include all concluded
> events, even if we sampled less than 2s ago. Only the running averages
> are still bound to the 2s sampling window.
>
> What this patch adds on top is for total= to include any *ongoing*
> stall events that might be happening on a CPU at the time of reading
> from the interface, like a reclaim cycle that hasn't finished yet.

Ok, I see now what you mean. So ondemand flag controls whether
*ongoing* stall events are accounted for or not. Nit: maybe rename
that flag to better explain its function?


* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-13 16:17     ` Johannes Weiner
@ 2018-07-14  8:48       ` Peter Zijlstra
  2018-07-14  9:02       ` Peter Zijlstra
  1 sibling, 0 replies; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-14  8:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team


Hi Johannes,

A few quick comments on first reading; I'll do a second and more
thorough reading on Monday.

On Fri, Jul 13, 2018 at 12:17:56PM -0400, Johannes Weiner wrote:
> First off, what I want to do can indeed be done without a strong link
> of a sleeping task to a CPU. We don't rely on it, and it's something I
> only figured out in v2. The important thing is not, as I previously
> thought, that CPUs are tracked independently from each other, but that
> we use potential execution threads as the baseline for potential that
> could be wasted by resource delays. Tracking CPUs independently just
> happens to do that implicitly, but it's not a requirement.

I don't follow, but I don't think I agree.

Consider the case of 2 CPUs and 2 blocked tasks. If they both blocked on
the same CPU, then only that CPU has lost potential. Whereas the only
thing that matters is the number of blocked tasks and the number of idle
CPUs.

Those two tasks can fill the two idle CPUs. Tracking per CPU just
utterly confuses the matter.


* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-13 16:17     ` Johannes Weiner
  2018-07-14  8:48       ` Peter Zijlstra
@ 2018-07-14  9:02       ` Peter Zijlstra
  1 sibling, 0 replies; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-14  9:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Fri, Jul 13, 2018 at 12:17:56PM -0400, Johannes Weiner wrote:
> On Fri, Jul 13, 2018 at 11:21:53AM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > > +static inline void psi_ttwu_dequeue(struct task_struct *p)
> > > +{
> > > +	if (psi_disabled)
> > > +		return;
> > > +	/*
> > > +	 * Is the task being migrated during a wakeup? Make sure to
> > > +	 * deregister its sleep-persistent psi states from the old
> > > +	 * queue, and let psi_enqueue() know it has to requeue.
> > > +	 */
> > > +	if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> > > +		struct rq_flags rf;
> > > +		struct rq *rq;
> > > +		int clear = 0;
> > > +
> > > +		if (p->in_iowait)
> > > +			clear |= TSK_IOWAIT;
> > > +		if (p->flags & PF_MEMSTALL)
> > > +			clear |= TSK_MEMSTALL;
> > > +
> > > +		rq = __task_rq_lock(p, &rf);
> > > +		update_rq_clock(rq);
> > > +		psi_task_change(p, rq_clock(rq), clear, 0);
> > > +		p->sched_psi_wake_requeue = 1;
> > > +		__task_rq_unlock(rq, &rf);
> > > +	}
> > > +}
> > 
> > Still NAK, what happened to this here:

> That's my thought process, anyway. I'd be more than happy to make this
> more lightweight, but I don't see a way to do it without losing
> significant functional precision.

I think you're going to have to. We put a lot of effort into not taking
the old rq->lock on remote wakeups and got a significant performance
benefit from that.

You just utterly destroyed that for workloads with a high number of
iowait wakeups.


* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
                   ` (11 preceding siblings ...)
  2018-07-12 23:44 ` Andrew Morton
@ 2018-07-16 15:57 ` Daniel Drake
  2018-07-17 11:25   ` Michal Hocko
  2018-07-23 21:14 ` Balbir Singh
  2018-07-27 22:01 ` Pavel Machek
  14 siblings, 1 reply; 83+ messages in thread
From: Daniel Drake @ 2018-07-16 15:57 UTC (permalink / raw)
  To: hannes
  Cc: linux-kernel, linux-mm, cgroups, linux, linux-block, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Tejun Heo, Balbir Singh,
	Mike Galbraith, Oliver Yang, Shakeel Butt, xxx xxx,
	Taras Kondratiuk, Daniel Walker, Vinayak Menon,
	Ruslan Ruslichenko, kernel-team

Hi Johannes,

Thanks for your work on psi! 

We have also been investigating the "thrashing problem" on our Endless
desktop OS. We have seen that systems can easily get into a state where the
UI becomes unresponsive to input, and the mouse cursor becomes extremely
slow or stuck when the system is running out of memory. We are working with
a full GNOME desktop environment on systems with only 2GB RAM, and
sometimes no real swap (although zram-swap helps mitigate the problem to
some extent).

My analysis so far indicates that when the system is low on memory and hits
this condition, the system is spending much of the time under
__alloc_pages_direct_reclaim. "perf trace -F" shows many many page faults
in executable code while this is going on. I believe the kernel is
swapping out executable code in order to satisfy memory allocation
requests, but then that swapped-out code is needed a moment later so it
gets swapped in again via the page fault handler, and all this activity
severely starves the system from being able to respond to user input.

I appreciate the kernel's attempt to keep processes alive, but in the
desktop case we see that the system rarely recovers from this situation,
so you have to do a hard shutdown. In this case we view it as desirable that
the OOM killer would step in (it is not doing so because direct reclaim
is not actually failing).

I had recently touched upon the cpuset mempressure counter, which
looked promising, but in practice I found that it was not a useful enough
representation of thrashing. It measures the rate at which
__perform_reclaim() is called, but I have observed that as the system gets
deeper and deeper into thrashing, __perform_reclaim() is actually called
at an increasingly slower rate, because each invocation ends up taking
more and more time (after 2 minutes of thrashing it can take close to 1s).

Instead of rate of function call it seems necessary to measure the amount
of work done by that codepath, and that's what you are doing with psi.

I tried psi on a 2GB RAM system with no swap (also no zram-swap) and
was pleased with the results combined with this sample userspace code:

https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd

It invokes the OOM killer when memory full_avg10 is >=10%, i.e. it kills
if all tasks were blocked on memory management for at least 1s in a 10s
period.
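
The threshold check itself is small. The sketch below is not the code
from the gist; it assumes the "full" line of the memory pressure file
looks like "full avg10=<pct> avg60=<pct> avg300=<pct> total=<usec>",
which may not match the exact format of this version of the series,
and the function name is hypothetical. A value of 10.00 in avg10
corresponds to 1s of full stall per 10s window.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Returns true if 'line' is a "full" pressure line whose avg10 value
 * is at or above threshold_pct. Returns false for other lines or on a
 * parse failure. The line format is an assumption, see above.
 */
static bool full_avg10_exceeds(const char *line, double threshold_pct)
{
	double avg10;

	if (sscanf(line, "full avg10=%lf", &avg10) != 1)
		return false;

	return avg10 >= threshold_pct;
}
```

A monitoring daemon would read each line of the pressure file with
fgets() and pass it to this helper, acting only when it returns true.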

Upon initial tests it is working very well. The system recovers quickly
from thrashing after the daemon steps in and kills a process. I have yet
to see any kills being made prematurely. It would be great to see this
upstream soon.

I also support your ideas to have the kernel offer mechanisms to handle
this directly in future; it would be nice not to have the requirement of
delegating this task to userspace, plus there may be a possibility that
userspace is starved so much that it cannot step in to handle this.

The only question I have is about the format of the data in /proc. The
memory file returns two lines and several values on each line. This
requires a bit more parsing than what I have become accustomed to in recent
years of the "one value per file" approach that seems prevalent in sysfs.
Would it make sense to instead have a single value read from (say)
/proc/pressure/memory/full_avg10 ?

Thanks
Daniel



* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-12 17:29 ` [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
  2018-07-13  9:21   ` Peter Zijlstra
@ 2018-07-17 10:03   ` Peter Zijlstra
  2018-07-18 21:56     ` Johannes Weiner
  2018-07-17 14:16   ` Peter Zijlstra
                     ` (7 subsequent siblings)
  9 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-17 10:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +static void time_state(struct psi_resource *res, int state, u64 now)
> +{
> +	if (res->state != PSI_NONE) {
> +		bool was_full = res->state == PSI_FULL;
> +
> +		res->times[was_full] += now - res->state_start;
> +	}
> +	if (res->state != state)
> +		res->state = state;
> +	if (res->state != PSI_NONE)
> +		res->state_start = now;
> +}
> +
> +static void psi_group_change(struct psi_group *group, int cpu, u64 now,
> +			     unsigned int clear, unsigned int set)
> +{
> +	enum psi_state state = PSI_NONE;
> +	struct psi_group_cpu *groupc;
> +	unsigned int *tasks;
> +	unsigned int to, bo;
> +
> +	groupc = per_cpu_ptr(group->cpus, cpu);
> +	tasks = groupc->tasks;
> +
> +	/* Update task counts according to the set/clear bitmasks */
> +	for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> +		int idx = to + (bo - 1);
> +
> +		if (tasks[idx] == 0 && !psi_bug) {
> +			printk_deferred(KERN_ERR "psi: task underflow! cpu=%d idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> +					cpu, idx, tasks[0], tasks[1], tasks[2],
> +					clear, set);
> +			psi_bug = 1;
> +		}
> +		tasks[idx]--;
> +	}
> +	for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
> +		tasks[to + (bo - 1)]++;
> +
> +	/* Time in which tasks wait for the CPU */
> +	state = PSI_NONE;
> +	if (tasks[NR_RUNNING] > 1)
> +		state = PSI_SOME;
> +	time_state(&groupc->res[PSI_CPU], state, now);
> +
> +	/* Time in which tasks wait for memory */
> +	state = PSI_NONE;
> +	if (tasks[NR_MEMSTALL]) {
> +		if (!tasks[NR_RUNNING] ||
> +		    (cpu_curr(cpu)->flags & PF_MEMSTALL))
> +			state = PSI_FULL;
> +		else
> +			state = PSI_SOME;
> +	}
> +	time_state(&groupc->res[PSI_MEM], state, now);
> +
> +	/* Time in which tasks wait for IO */
> +	state = PSI_NONE;
> +	if (tasks[NR_IOWAIT]) {
> +		if (!tasks[NR_RUNNING])
> +			state = PSI_FULL;
> +		else
> +			state = PSI_SOME;
> +	}
> +	time_state(&groupc->res[PSI_IO], state, now);
> +
> +	/* Time in which tasks are non-idle, to weigh the CPU in summaries */
> +	if (groupc->nonidle)
> +		groupc->nonidle_time += now - groupc->nonidle_start;
> +	groupc->nonidle = tasks[NR_RUNNING] ||
> +		tasks[NR_IOWAIT] || tasks[NR_MEMSTALL];
> +	if (groupc->nonidle)
> +		groupc->nonidle_start = now;
> +
> +	/* Kick the stats aggregation worker if it's gone to sleep */
> +	if (!delayed_work_pending(&group->clock_work))
> +		schedule_delayed_work(&group->clock_work, PSI_FREQ);
> +}
> +
> +void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
> +{
> +	int cpu = task_cpu(task);
> +
> +	if (psi_disabled)
> +		return;
> +
> +	if (!task->pid)
> +		return;
> +
> +	if (((task->psi_flags & set) ||
> +	     (task->psi_flags & clear) != clear) &&
> +	    !psi_bug) {
> +		printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
> +				task->pid, task->comm, cpu,
> +				task->psi_flags, clear, set);
> +		psi_bug = 1;
> +	}
> +
> +	task->psi_flags &= ~clear;
> +	task->psi_flags |= set;
> +
> +	psi_group_change(&psi_system, cpu, now, clear, set);
> +}


> +/*
> + * PSI tracks state that persists across sleeps, such as iowaits and
> + * memory stalls. As a result, it has to distinguish between sleeps,
> + * where a task's runnable state changes, and requeues, where a task
> + * and its state are being moved between CPUs and runqueues.
> + */
> +static inline void psi_enqueue(struct task_struct *p, u64 now, bool wakeup)
> +{
> +	int clear = 0, set = TSK_RUNNING;
> +
> +	if (psi_disabled)
> +		return;
> +
> +	if (!wakeup || p->sched_psi_wake_requeue) {
> +		if (p->flags & PF_MEMSTALL)
> +			set |= TSK_MEMSTALL;
> +		if (p->sched_psi_wake_requeue)
> +			p->sched_psi_wake_requeue = 0;
> +	} else {
> +		if (p->in_iowait)
> +			clear |= TSK_IOWAIT;
> +	}
> +
> +	psi_task_change(p, now, clear, set);
> +}
> +
> +static inline void psi_dequeue(struct task_struct *p, u64 now, bool sleep)
> +{
> +	int clear = TSK_RUNNING, set = 0;
> +
> +	if (psi_disabled)
> +		return;
> +
> +	if (!sleep) {
> +		if (p->flags & PF_MEMSTALL)
> +			clear |= TSK_MEMSTALL;
> +	} else {
> +		if (p->in_iowait)
> +			set |= TSK_IOWAIT;
> +	}
> +
> +	psi_task_change(p, now, clear, set);
> +}

This is still a scary amount of accounting; not to mention you'll be
adding O(cgroup-depth) to this in a later patch.

Where are the performance numbers for all this?

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-16 15:57 ` Daniel Drake
@ 2018-07-17 11:25   ` Michal Hocko
  2018-07-17 12:13     ` Daniel Drake
  2018-07-18 22:21     ` Johannes Weiner
  0 siblings, 2 replies; 83+ messages in thread
From: Michal Hocko @ 2018-07-17 11:25 UTC (permalink / raw)
  To: Daniel Drake
  Cc: hannes, linux-kernel, linux-mm, cgroups, linux, linux-block,
	Ingo Molnar, Peter Zijlstra, Andrew Morton, Tejun Heo,
	Balbir Singh, Mike Galbraith, Oliver Yang, Shakeel Butt, xxx xxx,
	Taras Kondratiuk, Daniel Walker, Vinayak Menon,
	Ruslan Ruslichenko, kernel-team

On Mon 16-07-18 10:57:45, Daniel Drake wrote:
> Hi Johannes,
> 
> Thanks for your work on psi! 
> 
> We have also been investigating the "thrashing problem" on our Endless
> desktop OS. We have seen that systems can easily get into a state where the
> UI becomes unresponsive to input, and the mouse cursor becomes extremely
> slow or stuck when the system is running out of memory. We are working with
> a full GNOME desktop environment on systems with only 2GB RAM, and
> sometimes no real swap (although zram-swap helps mitigate the problem to
> some extent).
> 
> My analysis so far indicates that when the system is low on memory and hits
> this condition, the system is spending much of the time under
> __alloc_pages_direct_reclaim. "perf trace -F" shows many many page faults
> in executable code while this is going on. I believe the kernel is
> swapping out executable code in order to satisfy memory allocation
> requests, but then that swapped-out code is needed a moment later so it
> gets swapped in again via the page fault handler, and all this activity
> severely starves the system from being able to respond to user input.
> 
> I appreciate the kernel's attempt to keep processes alive, but in the
> desktop case we see that the system rarely recovers from this situation,
> so you have to hard shutdown. In this case we view it as desirable that
> the OOM killer would step in (it is not doing so because direct reclaim
> is not actually failing).

Yes this is really unfortunate. One thing that could help would be to
consider a thrashing level during the reclaim (get_scan_count) to simply
forget about LRUs which are constantly refaulting pages back. We already
have the infrastructure for that. We just need to plumb it in.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-17 11:25   ` Michal Hocko
@ 2018-07-17 12:13     ` Daniel Drake
  2018-07-17 12:23       ` Michal Hocko
  2018-07-18 22:21     ` Johannes Weiner
  1 sibling, 1 reply; 83+ messages in thread
From: Daniel Drake @ 2018-07-17 12:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: hannes, Linux Kernel, linux-mm, cgroups, Linux Upstreaming Team,
	linux-block, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Tejun Heo, Balbir Singh, Mike Galbraith, Oliver Yang,
	Shakeel Butt, xxx xxx, Taras Kondratiuk, Daniel Walker,
	Vinayak Menon, Ruslan Ruslichenko, kernel-team

On Tue, Jul 17, 2018 at 6:25 AM, Michal Hocko <mhocko@kernel.org> wrote:
> Yes this is really unfortunate. One thing that could help would be to
> consider a thrashing level during the reclaim (get_scan_count) to simply
> forget about LRUs which are constantly refaulting pages back. We already
> have the infrastructure for that. We just need to plumb it in.

Can you go into a bit more detail about that infrastructure and how we
might detect which pages are being constantly refaulted? I'm
interested in spending a few hours on this topic to see if I can come
up with anything.

Thanks
Daniel

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-17 12:13     ` Daniel Drake
@ 2018-07-17 12:23       ` Michal Hocko
  2018-07-25 22:57         ` Daniel Drake
  0 siblings, 1 reply; 83+ messages in thread
From: Michal Hocko @ 2018-07-17 12:23 UTC (permalink / raw)
  To: Daniel Drake
  Cc: hannes, Linux Kernel, linux-mm, cgroups, Linux Upstreaming Team,
	linux-block, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Tejun Heo, Balbir Singh, Mike Galbraith, Oliver Yang,
	Shakeel Butt, xxx xxx, Taras Kondratiuk, Daniel Walker,
	Vinayak Menon, Ruslan Ruslichenko, kernel-team

On Tue 17-07-18 07:13:52, Daniel Drake wrote:
> On Tue, Jul 17, 2018 at 6:25 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > Yes this is really unfortunate. One thing that could help would be to
> > consider a thrashing level during the reclaim (get_scan_count) to simply
> > forget about LRUs which are constantly refaulting pages back. We already
> > have the infrastructure for that. We just need to plumb it in.
> 
> Can you go into a bit more detail about that infrastructure and how we
> might detect which pages are being constantly refaulted? I'm
> interested in spending a few hours on this topic to see if I can come
> up with anything.

mm/workingset.c allows for tracking when an actual page got evicted.
workingset_refault tells us whether a given filemap fault is a recent
refault and activates the page if that is the case. So what you need is
to note how many refaulted pages we have on the active LRU list. If that
is a large part of the list and if the inactive list is really small
then we know we are thrashing. This all sounds much easier than it will
eventually turn out to be of course but I didn't really get to play with
this much.
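
The check described above can be sketched in plain C. This is only an
illustration of the idea, not kernel code: struct lru_sample, its fields,
lru_is_thrashing() and both thresholds are hypothetical names and values,
and the kernel does not keep a "refaulted pages on the active list"
counter in this form.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical per-LRU counters, made up for this sketch so that the
 * heuristic has something concrete to look at.
 */
struct lru_sample {
	unsigned long nr_active;		/* pages on the active list */
	unsigned long nr_inactive;		/* pages on the inactive list */
	unsigned long nr_active_refaulted;	/* active pages activated by a refault */
};

/* Illustrative thresholds, not tuned and not taken from the thread. */
#define REFAULT_SHARE_PCT	50	/* "a large part of the list" */
#define INACTIVE_SHARE_PCT	10	/* "the inactive list is really small" */

static bool lru_is_thrashing(const struct lru_sample *s)
{
	unsigned long total = s->nr_active + s->nr_inactive;

	/* Refaulted active pages are a subset of the active list. */
	assert(s->nr_active_refaulted <= s->nr_active);

	if (!total || !s->nr_active)
		return false;

	/* Compare cross-multiplied values to stay in integer arithmetic. */
	return s->nr_active_refaulted * 100 >= s->nr_active * REFAULT_SHARE_PCT &&
	       s->nr_inactive * 100 <= total * INACTIVE_SHARE_PCT;
}
```

With these made-up thresholds, a list of 900 active pages (700 of them
refaulted) and 50 inactive pages is flagged, while an evenly split list
with few refaults is not. The hard part, as noted, is obtaining such
counts cheaply in the first place.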

HTH even though it is not really thought through well.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-12 17:29 ` [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
  2018-07-13  9:21   ` Peter Zijlstra
  2018-07-17 10:03   ` Peter Zijlstra
@ 2018-07-17 14:16   ` Peter Zijlstra
  2018-07-18 22:00     ` Johannes Weiner
  2018-07-17 14:21   ` Peter Zijlstra
                     ` (6 subsequent siblings)
  9 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-17 14:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +/* Tracked task states */
> +enum psi_task_count {
> +	NR_RUNNING,
> +	NR_IOWAIT,
> +	NR_MEMSTALL,
> +	NR_PSI_TASK_COUNTS,
> +};

> +/* Resources that workloads could be stalled on */
> +enum psi_res {
> +	PSI_CPU,
> +	PSI_MEM,
> +	PSI_IO,
> +	NR_PSI_RESOURCES,
> +};

These two have mem and iowait in different order. It really doesn't
matter, but my brain stumbled.

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-12 17:29 ` [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
                     ` (2 preceding siblings ...)
  2018-07-17 14:16   ` Peter Zijlstra
@ 2018-07-17 14:21   ` Peter Zijlstra
  2018-07-18 22:03     ` Johannes Weiner
  2018-07-17 15:01   ` Peter Zijlstra
                     ` (5 subsequent siblings)
  9 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-17 14:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
> index 04f1321d14c4..ac39435d1521 100644
> --- a/include/linux/sched/stat.h
> +++ b/include/linux/sched/stat.h
> @@ -28,10 +28,14 @@ static inline int sched_info_on(void)
>  	return 1;
>  #elif defined(CONFIG_TASK_DELAY_ACCT)
>  	extern int delayacct_on;
> +	if (delayacct_on)
> +		return 1;
> +#elif defined(CONFIG_PSI)
> +	extern int psi_disabled;
> +	if (!psi_disabled)
> +		return 1;
>  #endif
> +	return 0;
>  }

Doesn't that want to be something like:

static inline bool sched_info_on(void)
{
#ifdef CONFIG_SCHEDSTATS
	return true;
#else /* !SCHEDSTATS */
#ifdef CONFIG_TASK_DELAY_ACCT
	extern int delayacct_on;
	if (delayacct_on)
		return true;
#endif /* DELAYACCT */
#ifdef CONFIG_PSI
	extern int psi_disabled;
	if (!psi_disabled)
		return true;
#endif
	return false;
#endif /* !SCHEDSTATS */
}

Such that if you build a TASK_DELAY_ACCT && PSI kernel, and boot with
nodelayacct, you still get sched_info_on().
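
The difference between the two preprocessor shapes can be reproduced in
userspace. The sketch below assumes a kernel built with both
TASK_DELAY_ACCT and PSI but without SCHEDSTATS; the #defines and the two
globals only stand in for the real Kconfig symbols and kernel variables,
and the function names sched_info_on_elif() and sched_info_on_nested()
are made up for the comparison.

```c
#include <stdbool.h>

/* Stand-ins for the Kconfig symbols: delayacct and PSI built in. */
#define CONFIG_TASK_DELAY_ACCT 1
#define CONFIG_PSI 1

int delayacct_on;	/* 0 models booting with "nodelayacct" */
int psi_disabled;	/* 0 models PSI left enabled */

/* Shape of the patch as posted: the PSI branch hangs off an #elif. */
static bool sched_info_on_elif(void)
{
#if defined(CONFIG_SCHEDSTATS)
	return true;
#elif defined(CONFIG_TASK_DELAY_ACCT)
	if (delayacct_on)
		return true;
#elif defined(CONFIG_PSI)
	/* Never compiled when CONFIG_TASK_DELAY_ACCT is defined. */
	if (!psi_disabled)
		return true;
#endif
	return false;
}

/* Shape of the suggestion: every facility gets its own #ifdef. */
static bool sched_info_on_nested(void)
{
#ifdef CONFIG_SCHEDSTATS
	return true;
#else
#ifdef CONFIG_TASK_DELAY_ACCT
	if (delayacct_on)
		return true;
#endif
#ifdef CONFIG_PSI
	if (!psi_disabled)
		return true;
#endif
	return false;
#endif
}
```

With delayacct_on == 0 and psi_disabled == 0, the #elif variant returns
false because the PSI branch was dropped by the preprocessor, while the
nested variant returns true.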

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-12 17:29 ` [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
                     ` (3 preceding siblings ...)
  2018-07-17 14:21   ` Peter Zijlstra
@ 2018-07-17 15:01   ` Peter Zijlstra
  2018-07-18 22:06     ` Johannes Weiner
  2018-07-17 15:17   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  9 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-17 15:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +static bool psi_update_stats(struct psi_group *group)
> +{
> +	u64 some[NR_PSI_RESOURCES] = { 0, };
> +	u64 full[NR_PSI_RESOURCES] = { 0, };
> +	unsigned long nonidle_total = 0;
> +	unsigned long missed_periods;
> +	unsigned long expires;
> +	int cpu;
> +	int r;
> +
> +	mutex_lock(&group->stat_lock);
> +
> +	/*
> +	 * Collect the per-cpu time buckets and average them into a
> +	 * single time sample that is normalized to wallclock time.
> +	 *
> +	 * For averaging, each CPU is weighted by its non-idle time in
> +	 * the sampling period. This eliminates artifacts from uneven
> +	 * loading, or even entirely idle CPUs.
> +	 *
> +	 * We could pin the online CPUs here, but the noise introduced
> +	 * by missing up to one sample period from CPUs that are going
> +	 * away shouldn't matter in practice - just like the noise of
> +	 * previously offlined CPUs returning with a non-zero sample.

But why!? cpus_read_lock() is neither expensive nor complicated. So why
try and avoid it?

> +	 */
> +	for_each_online_cpu(cpu) {
> +		struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
> +		unsigned long nonidle;
> +
> +		if (!groupc->nonidle_time)
> +			continue;
> +
> +		nonidle = nsecs_to_jiffies(groupc->nonidle_time);
> +		groupc->nonidle_time = 0;
> +		nonidle_total += nonidle;
> +
> +		for (r = 0; r < NR_PSI_RESOURCES; r++) {
> +			struct psi_resource *res = &groupc->res[r];
> +
> +			some[r] += (res->times[0] + res->times[1]) * nonidle;
> +			full[r] += res->times[1] * nonidle;
> +
> +			/* It's racy, but we can tolerate some error */
> +			res->times[0] = 0;
> +			res->times[1] = 0;
> +		}
> +	}
> +
> +	/*
> +	 * Integrate the sample into the running statistics that are
> +	 * reported to userspace: the cumulative stall times and the
> +	 * decaying averages.
> +	 *
> +	 * Pressure percentages are sampled at PSI_FREQ. We might be
> +	 * called more often when the user polls more frequently than
> +	 * that; we might be called less often when there is no task
> +	 * activity, thus no data, and clock ticks are sporadic. The
> +	 * below handles both.
> +	 */
> +
> +	/* total= */
> +	for (r = 0; r < NR_PSI_RESOURCES; r++) {
> +		do_div(some[r], max(nonidle_total, 1UL));
> +		do_div(full[r], max(nonidle_total, 1UL));
> +
> +		group->some[r] += some[r];
> +		group->full[r] += full[r];

		group->some[r] += div64_ul(some[r], max(nonidle_total, 1UL));
		group->full[r] += div64_ul(full[r], max(nonidle_total, 1UL));

Is easier to read imo.

> +	}
> +
> +	/* avgX= */
> +	expires = group->period_expires;
> +	if (time_before(jiffies, expires))
> +		goto out;
> +
> +	missed_periods = (jiffies - expires) / PSI_FREQ;
> +	group->period_expires = expires + ((1 + missed_periods) * PSI_FREQ);
> +
> +	for (r = 0; r < NR_PSI_RESOURCES; r++) {
> +		u64 some, full;
> +
> +		some = group->some[r] - group->last_some[r];
> +		full = group->full[r] - group->last_full[r];
> +
> +		calc_avgs(group->avg_some[r], some, missed_periods);
> +		calc_avgs(group->avg_full[r], full, missed_periods);
> +
> +		group->last_some[r] = group->some[r];
> +		group->last_full[r] = group->full[r];
> +	}
> +out:
> +	mutex_unlock(&group->stat_lock);
> +	return nonidle_total;
> +}
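
For readers following the averaging step in the quoted function, here is
a minimal userspace sketch of the nonidle-weighted average for a single
resource. struct cpu_sample, struct stall_avg and weighted_stall() are
hypothetical names; the patch keeps the same values in struct
psi_group_cpu and struct psi_resource and divides with do_div().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened per-CPU sample for one resource. */
struct cpu_sample {
	uint64_t some_ns;	/* time in state SOME (times[0] in the patch) */
	uint64_t full_ns;	/* time in state FULL (times[1] in the patch) */
	uint64_t nonidle;	/* non-idle time in the period, used as weight */
};

struct stall_avg {
	uint64_t some;
	uint64_t full;
};

static struct stall_avg weighted_stall(const struct cpu_sample *cpus, size_t n)
{
	uint64_t some = 0, full = 0, nonidle_total = 0;
	struct stall_avg avg;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!cpus[i].nonidle)
			continue;	/* idle CPUs carry no weight */

		nonidle_total += cpus[i].nonidle;
		/* FULL time also counts as SOME, as in the quoted code. */
		some += (cpus[i].some_ns + cpus[i].full_ns) * cpus[i].nonidle;
		full += cpus[i].full_ns * cpus[i].nonidle;
	}

	if (!nonidle_total)
		nonidle_total = 1;	/* mirrors max(nonidle_total, 1UL) */

	avg.some = some / nonidle_total;
	avg.full = full / nonidle_total;
	return avg;
}
```

A CPU that was non-idle three times as long as another contributes three
times as much to the result, and a CPU with no non-idle time is skipped
entirely, which is what keeps idle CPUs from diluting the sample.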

* Re: [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure
  2018-07-12 17:29 ` [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure Johannes Weiner
  2018-07-12 23:45   ` Andrew Morton
  2018-07-13 22:13   ` Suren Baghdasaryan
@ 2018-07-17 15:13   ` Peter Zijlstra
  2 siblings, 0 replies; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-17 15:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 12, 2018 at 01:29:42PM -0400, Johannes Weiner wrote:
> @@ -218,10 +216,36 @@ static bool psi_update_stats(struct psi_group *group)
>  	for_each_online_cpu(cpu) {
>  		struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
>  		unsigned long nonidle;
> +		struct rq_flags rf;
> +		struct rq *rq;
> +		u64 now;
>  
> -		if (!groupc->nonidle_time)
> +		if (!groupc->nonidle_time && !groupc->nonidle)
>  			continue;
>  
> +		/*
> +		 * We come here for two things: 1) periodic per-cpu
> +		 * bucket flushing and averaging and 2) when the user
> +		 * wants to read a pressure file. For flushing and
> +		 * averaging, which is relatively infrequent, we can
> +		 * be lazy and tolerate some raciness with concurrent
> +		 * updates to the per-cpu counters. However, if a user
> +		 * polls the pressure state, we want to give them the
> +		 * most uptodate information we have, including any
> +		 * currently active state which hasn't been timed yet,
> +		 * because in case of an iowait or a reclaim run, that
> +		 * can be significant.
> +		 */
> +		if (ondemand) {
> +			rq = cpu_rq(cpu);
> +			rq_lock_irq(rq, &rf);

That's a DoS right there..

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-12 17:29 ` [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
                     ` (4 preceding siblings ...)
  2018-07-17 15:01   ` Peter Zijlstra
@ 2018-07-17 15:17   ` Peter Zijlstra
  2018-07-18 22:11     ` Johannes Weiner
  2018-07-17 15:32   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  9 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-17 15:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
> index 04f1321d14c4..ac39435d1521 100644
> --- a/include/linux/sched/stat.h
> +++ b/include/linux/sched/stat.h
> @@ -28,10 +28,14 @@ static inline int sched_info_on(void)
>  	return 1;
>  #elif defined(CONFIG_TASK_DELAY_ACCT)
>  	extern int delayacct_on;
> -	return delayacct_on;
> -#else
> -	return 0;
> +	if (delayacct_on)
> +		return 1;
> +#elif defined(CONFIG_PSI)
> +	extern int psi_disabled;
> +	if (!psi_disabled)
> +		return 1;
>  #endif
> +	return 0;
>  }
>  
>  #ifdef CONFIG_SCHEDSTATS

> diff --git a/init/Kconfig b/init/Kconfig
> index 18b151f0ddc1..e34859bda33e 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -457,6 +457,22 @@ config TASK_IO_ACCOUNTING
>  
>  	  Say N if unsure.
>  
> +config PSI
> +	bool "Pressure stall information tracking"
> +	select SCHED_INFO

What's the deal here? AFAICT it does not in fact use SCHED_INFO for
_anything_. You just hooked into the sched_info_{en,de}queue() hooks,
but you don't use any of the sched_info data.

So the dependency is an artificial one that should not exist.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9586a8141f16..16e8c8c8f432 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -744,7 +744,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>  		update_rq_clock(rq);
>  
>  	if (!(flags & ENQUEUE_RESTORE))
> -		sched_info_queued(rq, p);
> +		sched_info_queued(rq, p, flags & ENQUEUE_WAKEUP);
>  
>  	p->sched_class->enqueue_task(rq, p, flags);
>  }
> @@ -755,7 +755,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>  		update_rq_clock(rq);
>  
>  	if (!(flags & DEQUEUE_SAVE))
> -		sched_info_dequeued(rq, p);
> +		sched_info_dequeued(rq, p, flags & DEQUEUE_SLEEP);
>  
>  	p->sched_class->dequeue_task(rq, p, flags);
>  }

> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> index 8aea199a39b4..15b858cbbcb0 100644
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h

>  #ifdef CONFIG_SCHED_INFO
>  static inline void sched_info_reset_dequeued(struct task_struct *t)
>  {
>  	t->sched_info.last_queued = 0;
>  }
>  
> +static inline void sched_info_reset_queued(struct task_struct *t, u64 now)
> +{
> +	if (!t->sched_info.last_queued)
> +		t->sched_info.last_queued = now;
> +}
> +
>  /*
>   * We are interested in knowing how long it was from the *first* time a
>   * task was queued to the time that it finally hit a CPU, we call this routine
>   * from dequeue_task() to account for possible rq->clock skew across CPUs. The
>   * delta taken on each CPU would annul the skew.
>   */
> -static inline void sched_info_dequeued(struct rq *rq, struct task_struct *t)
> +static inline void sched_info_dequeued(struct rq *rq, struct task_struct *t,
> +				       bool sleep)
>  {
>  	unsigned long long now = rq_clock(rq), delta = 0;
>  
> -	if (unlikely(sched_info_on()))
> +	if (unlikely(sched_info_on())) {
>  		if (t->sched_info.last_queued)
>  			delta = now - t->sched_info.last_queued;
> +		psi_dequeue(t, now, sleep);
> +	}
>  	sched_info_reset_dequeued(t);
>  	t->sched_info.run_delay += delta;
>  
> @@ -104,11 +190,14 @@ static void sched_info_arrive(struct rq *rq, struct task_struct *t)
>   * the timestamp if it is already not set.  It's assumed that
>   * sched_info_dequeued() will clear that stamp when appropriate.
>   */
> -static inline void sched_info_queued(struct rq *rq, struct task_struct *t)
> +static inline void sched_info_queued(struct rq *rq, struct task_struct *t,
> +				     bool wakeup)
>  {
>  	if (unlikely(sched_info_on())) {
> -		if (!t->sched_info.last_queued)
> -			t->sched_info.last_queued = rq_clock(rq);
> +		unsigned long long now = rq_clock(rq);
> +
> +		sched_info_reset_queued(t, now);
> +		psi_enqueue(t, now, wakeup);
>  	}
>  }
>  
> @@ -127,7 +216,8 @@ static inline void sched_info_depart(struct rq *rq, struct task_struct *t)
>  	rq_sched_info_depart(rq, delta);
>  
>  	if (t->state == TASK_RUNNING)
> -		sched_info_queued(rq, t);
> +		if (unlikely(sched_info_on()))
> +			sched_info_reset_queued(t, rq_clock(rq));
>  }

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-12 17:29 ` [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
                     ` (5 preceding siblings ...)
  2018-07-17 15:17   ` Peter Zijlstra
@ 2018-07-17 15:32   ` Peter Zijlstra
  2018-07-18 12:03   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-17 15:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +struct psi_group {
> +	struct psi_group_cpu *cpus;

That one wants a __percpu annotation, I think. Also, maybe a rename.

> +
> +	struct mutex stat_lock;
> +
> +	u64 some[NR_PSI_RESOURCES];
> +	u64 full[NR_PSI_RESOURCES];
> +
> +	unsigned long period_expires;
> +
> +	u64 last_some[NR_PSI_RESOURCES];
> +	u64 last_full[NR_PSI_RESOURCES];
> +
> +	unsigned long avg_some[NR_PSI_RESOURCES][3];
> +	unsigned long avg_full[NR_PSI_RESOURCES][3];
> +
> +	struct delayed_work clock_work;
> +};

* Re: [PATCH 09/10] psi: cgroup support
  2018-07-12 17:29 ` [PATCH 09/10] psi: cgroup support Johannes Weiner
  2018-07-12 20:08   ` Tejun Heo
@ 2018-07-17 15:40   ` Peter Zijlstra
  2018-07-24 15:54     ` Johannes Weiner
  1 sibling, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-17 15:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 12, 2018 at 01:29:41PM -0400, Johannes Weiner wrote:
> +/**
> + * cgroup_move_task - move task to a different cgroup
> + * @task: the task
> + * @to: the target css_set
> + *
> + * Move task to a new cgroup and safely migrate its associated stall
> + * state between the different groups.
> + *
> + * This function acquires the task's rq lock to lock out concurrent
> + * changes to the task's scheduling state and - in case the task is
> + * running - concurrent changes to its stall state.
> + */
> +void cgroup_move_task(struct task_struct *task, struct css_set *to)
> +{
> +	unsigned int task_flags = 0;
> +	struct rq_flags rf;
> +	struct rq *rq;
> +	u64 now;
> +
> +	rq = task_rq_lock(task, &rf);
> +
> +	if (task_on_rq_queued(task)) {
> +		task_flags = TSK_RUNNING;
> +	} else if (task->in_iowait) {
> +		task_flags = TSK_IOWAIT;
> +	}
> +	if (task->flags & PF_MEMSTALL)
> +		task_flags |= TSK_MEMSTALL;
> +
> +	if (task_flags) {
> +		update_rq_clock(rq);
> +		now = rq_clock(rq);
> +		psi_task_change(task, now, task_flags, 0);
> +	}
> +
> +	/*
> +	 * Lame to do this here, but the scheduler cannot be locked
> +	 * from the outside, so we move cgroups from inside sched/.
> +	 */
> +	rcu_assign_pointer(task->cgroups, to);
> +
> +	if (task_flags)
> +		psi_task_change(task, now, 0, task_flags);
> +
> +	task_rq_unlock(rq, task, &rf);
> +}

Why is that not part of cpu_cgroup_attach() / sched_move_task() ?

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-12 17:29 ` [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
                     ` (6 preceding siblings ...)
  2018-07-17 15:32   ` Peter Zijlstra
@ 2018-07-18 12:03   ` Peter Zijlstra
  2018-07-18 12:22     ` Peter Zijlstra
                       ` (4 more replies)
  2018-07-18 12:46   ` Peter Zijlstra
  2018-07-20 20:35   ` Peter Zijlstra
  9 siblings, 5 replies; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-18 12:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +/* Tracked task states */
> +enum psi_task_count {
> +	NR_RUNNING,
> +	NR_IOWAIT,
> +	NR_MEMSTALL,
> +	NR_PSI_TASK_COUNTS,
> +};
> +
> +/* Task state bitmasks */
> +#define TSK_RUNNING	(1 << NR_RUNNING)
> +#define TSK_IOWAIT	(1 << NR_IOWAIT)
> +#define TSK_MEMSTALL	(1 << NR_MEMSTALL)
> +
> +/* Resources that workloads could be stalled on */
> +enum psi_res {
> +	PSI_CPU,
> +	PSI_MEM,
> +	PSI_IO,
> +	NR_PSI_RESOURCES,
> +};
> +
> +/* Pressure states for a group of tasks */
> +enum psi_state {
> +	PSI_NONE,		/* No stalled tasks */
> +	PSI_SOME,		/* Stalled tasks & working tasks */
> +	PSI_FULL,		/* Stalled tasks & no working tasks */
> +	NR_PSI_STATES,
> +};
> +
> +struct psi_resource {
> +	/* Current pressure state for this resource */
> +	enum psi_state state;

This has a 4 byte hole here (really 7 but GCC is generous and uses 4
bytes for the enum that spans the value range [0-2]).

> +	/* Start of current state (rq_clock) */
> +	u64 state_start;
> +
> +	/* Time sampling buckets for pressure states SOME and FULL (ns) */
> +	u64 times[2];
> +};
> +
> +struct psi_group_cpu {
> +	/* States of the tasks belonging to this group */
> +	unsigned int tasks[NR_PSI_TASK_COUNTS];
> +
> +	/* There are runnable or D-state tasks */
> +	int nonidle;
> +
> +	/* Start of current non-idle state (rq_clock) */
> +	u64 nonidle_start;
> +
> +	/* Time sampling bucket for non-idle state (ns) */
> +	u64 nonidle_time;
> +
> +	/* Per-resource pressure tracking in this group */
> +	struct psi_resource res[NR_PSI_RESOURCES];
> +};

> +static DEFINE_PER_CPU(struct psi_group_cpu, system_group_cpus);

Since psi_group_cpu is exactly 2 cachelines big, I think you want the above
to be DEFINE_PER_CPU_SHARED_ALIGNED() to minimize cache misses on
accounting. Also, I think you want to stick ____cacheline_aligned_in_smp
on the structure, such that alloc_percpu() also DTRT.

Of those 2 lines, 12 bytes are wasted because of that hole above, and a
further 8 are wasted because PSI_CPU does not use FULL, for a total of
20 wasted bytes in there.
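
The size and the hole can be checked with a userspace copy of the two
structures. This sketch assumes an LP64 ABI such as x86-64 Linux, where
the enum takes 4 bytes and a 64-bit integer is 8-byte aligned, and
64-byte cache lines; uint64_t stands in for the kernel's u64.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* offsetof */
#include <stdint.h>

enum psi_state { PSI_NONE, PSI_SOME, PSI_FULL, NR_PSI_STATES };
enum { NR_PSI_TASK_COUNTS = 3, NR_PSI_RESOURCES = 3 };

struct psi_resource {
	enum psi_state state;	/* 4 bytes, followed by a 4-byte hole */
	uint64_t state_start;
	uint64_t times[2];
};

struct psi_group_cpu {
	unsigned int tasks[NR_PSI_TASK_COUNTS];
	int nonidle;
	uint64_t nonidle_start;
	uint64_t nonidle_time;
	struct psi_resource res[NR_PSI_RESOURCES];
};

/* state ends at offset 4, state_start only begins at offset 8. */
static_assert(offsetof(struct psi_resource, state_start) == 8,
	      "expected a hole after state");
static_assert(sizeof(struct psi_resource) == 32, "4 + 4 (hole) + 8 + 16");
/* 12 + 4 + 8 + 8 + 3 * 32 = 128 bytes, i.e. two 64-byte cache lines. */
static_assert(sizeof(struct psi_group_cpu) == 128, "two cache lines");
```

The 4-byte hole appears once per resource (3 * 4 = 12 bytes), and the
unused FULL bucket of PSI_CPU accounts for the other 8 bytes.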

> +static void time_state(struct psi_resource *res, int state, u64 now)
> +{
> +	if (res->state != PSI_NONE) {
> +		bool was_full = res->state == PSI_FULL;
> +
> +		res->times[was_full] += now - res->state_start;
> +	}
> +	if (res->state != state)
> +		res->state = state;
> +	if (res->state != PSI_NONE)
> +		res->state_start = now;
> +}

Does the compiler optimize that and fold the two != NONE branches?
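
As a reference point for that question, below is a userspace sketch of
the same bookkeeping with the redundant "res->state != state" test
removed. Note that the two PSI_NONE checks look at the old and the new
state respectively, so they are different conditions and cannot simply
be merged. uint64_t stands in for u64, and the structure is a local copy
for illustration only.

```c
#include <stdint.h>

enum psi_state { PSI_NONE, PSI_SOME, PSI_FULL };

struct psi_resource {
	enum psi_state state;
	uint64_t state_start;
	uint64_t times[2];	/* [0] = SOME, [1] = FULL */
};

/* Close the interval spent in the old state, then enter the new one. */
static void time_state(struct psi_resource *res, enum psi_state state,
		       uint64_t now)
{
	/* Old state: charge the elapsed time to its bucket. */
	if (res->state != PSI_NONE)
		res->times[res->state == PSI_FULL] += now - res->state_start;

	res->state = state;

	/* New state: start timing only if there is a stall to time. */
	if (state != PSI_NONE)
		res->state_start = now;
}
```

Walking a resource through NONE -> SOME at t=10, SOME -> FULL at t=25 and
FULL -> NONE at t=40 leaves 15 in each bucket.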

> +static void psi_group_change(struct psi_group *group, int cpu, u64 now,
> +			     unsigned int clear, unsigned int set)
> +{
> +	enum psi_state state = PSI_NONE;
> +	struct psi_group_cpu *groupc;
> +	unsigned int *tasks;
> +	unsigned int to, bo;
> +
> +	groupc = per_cpu_ptr(group->cpus, cpu);
> +	tasks = groupc->tasks;

	bool was_nonidle = tasks[NR_RUNNING] || tasks[NR_IOWAIT] || tasks[NR_MEMSTALL];

> +	/* Update task counts according to the set/clear bitmasks */
> +	for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> +		int idx = to + (bo - 1);
> +
> +		if (tasks[idx] == 0 && !psi_bug) {
> +			printk_deferred(KERN_ERR "psi: task underflow! cpu=%d idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> +					cpu, idx, tasks[0], tasks[1], tasks[2],
> +					clear, set);
> +			psi_bug = 1;
> +		}

		WARN_ONCE(!tasks[idx], ...);

> +		tasks[idx]--;
> +	}
> +	for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
> +		tasks[to + (bo - 1)]++;

You want to benchmark this, but since it's only 3 consecutive bits, it
might actually be faster to not use ffs() and simply test all 3 bits:

	for (bo = 0; bo < NR_PSI_TASK_COUNTS; bo++)
		if (set & (1 << bo))
			tasks[bo]++;

or something like that.
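
A userspace sketch of the two variants, useful for checking that they
agree before benchmarking them. It relies on the POSIX ffs() from
<strings.h> in place of the kernel helper; count_set_ffs() and
count_set_bits() are hypothetical names.

```c
#include <strings.h>	/* ffs() (POSIX) */

#define NR_PSI_TASK_COUNTS 3

/* The ffs()-driven walk from the quoted patch. */
static void count_set_ffs(unsigned int *tasks, unsigned int set)
{
	unsigned int to, bo;

	for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
		tasks[to + (bo - 1)]++;
}

/* Test each of the three state bits directly instead. */
static void count_set_bits(unsigned int *tasks, unsigned int set)
{
	unsigned int bit;

	for (bit = 0; bit < NR_PSI_TASK_COUNTS; bit++)
		if (set & (1U << bit))
			tasks[bit]++;
}
```

Both functions update the same counters for every mask in the range 0-7;
which one is faster is exactly what the benchmark would have to show.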

> +
> +	/* Time in which tasks wait for the CPU */
> +	state = PSI_NONE;
> +	if (tasks[NR_RUNNING] > 1)
> +		state = PSI_SOME;
> +	time_state(&groupc->res[PSI_CPU], state, now);
> +
> +	/* Time in which tasks wait for memory */
> +	state = PSI_NONE;
> +	if (tasks[NR_MEMSTALL]) {
> +		if (!tasks[NR_RUNNING] ||
> +		    (cpu_curr(cpu)->flags & PF_MEMSTALL))

I'm confused, why do we care if the current task is MEMSTALL or not?

> +			state = PSI_FULL;
> +		else
> +			state = PSI_SOME;
> +	}
> +	time_state(&groupc->res[PSI_MEM], state, now);
> +
> +	/* Time in which tasks wait for IO */
> +	state = PSI_NONE;
> +	if (tasks[NR_IOWAIT]) {
> +		if (!tasks[NR_RUNNING])
> +			state = PSI_FULL;
> +		else
> +			state = PSI_SOME;
> +	}
> +	time_state(&groupc->res[PSI_IO], state, now);
> +
> +	/* Time in which tasks are non-idle, to weigh the CPU in summaries */
	if (was_nonidle)
> +		groupc->nonidle_time += now - groupc->nonidle_start;

	if (tasks[NR_RUNNING] || tasks[NR_IOWAIT] || tasks[NR_MEMSTALL])
> +		groupc->nonidle_start = now;

Does away with groupc->nonidle, giving us 24 bytes free.

> +	/* Kick the stats aggregation worker if it's gone to sleep */
> +	if (!delayed_work_pending(&group->clock_work))
> +		schedule_delayed_work(&group->clock_work, PSI_FREQ);
> +}

If you always update the time buckets, rename nonidle_start as last_time
and do away with psi_resource::state_start, you gain another 24 bytes,
giving 48 bytes free.

And as said before, we can compress the state from 12 bytes, to 6 bits
(or 1 byte), giving another 11 bytes for 59 bytes free.

Leaving us just 5 bytes short of needing a single cacheline :/

struct ponies {
        unsigned int               tasks[3];                                             /*     0    12 */
        unsigned int               cpu_state:2;                                          /*    12:30  4 */
        unsigned int               io_state:2;                                           /*    12:28  4 */
        unsigned int               mem_state:2;                                          /*    12:26  4 */

        /* XXX 26 bits hole, try to pack */

        /* typedef u64 */ long long unsigned int     last_time;                          /*    16     8 */
        /* typedef u64 */ long long unsigned int     some_time[3];                       /*    24    24 */
        /* typedef u64 */ long long unsigned int     full_time[2];                       /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        /* typedef u64 */ long long unsigned int     nonidle_time;                       /*    64     8 */

        /* size: 72, cachelines: 2, members: 8 */
        /* bit holes: 1, sum bit holes: 26 bits */
        /* last cacheline: 8 bytes */
};

ARGGH!
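
To double-check the layout above without pahole, here is a minimal
userspace sketch. It assumes a typical ABI with 4-byte unsigned int
and 8-byte uint64_t; uint64_t stands in for the kernel's u64, and the
helper name is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Same members as the pahole dump above, with uint64_t for u64. */
struct ponies {
	unsigned int	tasks[3];
	unsigned int	cpu_state:2;
	unsigned int	io_state:2;
	unsigned int	mem_state:2;
	/* the remaining 26 bits of this word are the bit hole */
	uint64_t	last_time;
	uint64_t	some_time[3];
	uint64_t	full_time[2];
	uint64_t	nonidle_time;
};

/* Hypothetical helper: bytes that spill past a 64-byte cacheline. */
static size_t bytes_past_first_cacheline(void)
{
	size_t sz = sizeof(struct ponies);

	return sz > 64 ? sz - 64 : 0;
}
```

On such an ABI the three bitfields share one 4-byte word, so last_time
still lands at offset 16 and nonidle_time is pushed into the second
cacheline.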

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-18 12:03   ` Peter Zijlstra
@ 2018-07-18 12:22     ` Peter Zijlstra
  2018-07-18 22:36     ` Johannes Weiner
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-18 12:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +	for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
> > +		tasks[to + (bo - 1)]++;
> 
> You want to benchmark this, but since it's only 3 consecutive bits, it
> might actually be faster to not use ffs() and simply test all 3 bits:
> 
> 	for (to = set, bo = 0; to; to &= ~(1 << bo), bo++)

		if (to & (1 << bo))

> 		tasks[bo]++;
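
For comparison, a minimal userspace sketch of both variants. This is
not kernel code: ffs() here is the libc one from <strings.h>, and
NR_COUNTS and the function names are made up for illustration. The
second variant uses a plain bounded loop rather than the exact loop
quoted above.

```c
#include <strings.h>	/* ffs() */

#define NR_COUNTS 3	/* assumption: three tracked task states */

/* Variant 1: walk the set bits with ffs(), as in the patch. */
static void inc_ffs(unsigned int *tasks, unsigned int set)
{
	int to, bo;

	for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
		tasks[to + (bo - 1)]++;
}

/* Variant 2: test each of the three bits directly. */
static void inc_test(unsigned int *tasks, unsigned int set)
{
	unsigned int bo;

	for (bo = 0; bo < NR_COUNTS; bo++)
		if (set & (1u << bo))
			tasks[bo]++;
}
```

Both should produce the same counts for any 3-bit mask; which one is
faster still needs to be measured.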



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-12 17:29 ` [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
                     ` (7 preceding siblings ...)
  2018-07-18 12:03   ` Peter Zijlstra
@ 2018-07-18 12:46   ` Peter Zijlstra
  2018-07-18 13:56     ` Johannes Weiner
  2018-07-20 20:35   ` Peter Zijlstra
  9 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-18 12:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:

> +static inline void psi_enqueue(struct task_struct *p, u64 now, bool wakeup)
> +{
> +	int clear = 0, set = TSK_RUNNING;
> +
> +	if (psi_disabled)
> +		return;
> +
> +	if (!wakeup || p->sched_psi_wake_requeue) {
> +		if (p->flags & PF_MEMSTALL)
> +			set |= TSK_MEMSTALL;
> +		if (p->sched_psi_wake_requeue)
> +			p->sched_psi_wake_requeue = 0;
> +	} else {
> +		if (p->in_iowait)
> +			clear |= TSK_IOWAIT;
> +	}
> +
> +	psi_task_change(p, now, clear, set);
> +}
> +
> +static inline void psi_dequeue(struct task_struct *p, u64 now, bool sleep)
> +{
> +	int clear = TSK_RUNNING, set = 0;
> +
> +	if (psi_disabled)
> +		return;
> +
> +	if (!sleep) {
> +		if (p->flags & PF_MEMSTALL)
> +			clear |= TSK_MEMSTALL;
> +	} else {
> +		if (p->in_iowait)
> +			set |= TSK_IOWAIT;
> +	}
> +
> +	psi_task_change(p, now, clear, set);
> +}

> +/**
> + * psi_memstall_enter - mark the beginning of a memory stall section
> + * @flags: flags to handle nested sections
> + *
> + * Marks the calling task as being stalled due to a lack of memory,
> + * such as waiting for a refault or performing reclaim.
> + */
> +void psi_memstall_enter(unsigned long *flags)
> +{
> +	struct rq_flags rf;
> +	struct rq *rq;
> +
> +	if (psi_disabled)
> +		return;
> +
> +	*flags = current->flags & PF_MEMSTALL;
> +	if (*flags)
> +		return;
> +	/*
> +	 * PF_MEMSTALL setting & accounting needs to be atomic wrt
> +	 * changes to the task's scheduling state, otherwise we can
> +	 * race with CPU migration.
> +	 */
> +	rq = this_rq_lock_irq(&rf);
> +
> +	update_rq_clock(rq);
> +
> +	current->flags |= PF_MEMSTALL;
> +	psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
> +
> +	rq_unlock_irq(rq, &rf);
> +}

I'm confused by this whole MEMSTALL thing... I thought the idea was to
account the time we were _blocked_ because of memstall, but you seem to
count the time we're _running_ with PF_MEMSTALL.


And esp. the wait_on_page_bit_common caller seems performance sensitive,
and the above function is quite expensive.



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-18 12:46   ` Peter Zijlstra
@ 2018-07-18 13:56     ` Johannes Weiner
  2018-07-18 16:31       ` Peter Zijlstra
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2018-07-18 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

Hi Peter,

thanks for the feedback so far, I'll get to the other emails
later. I'm currently running A/B tests against our production traffic
to get uptodate numbers in particular on the optimizations you
suggested for the cacheline packing, time_state(), ffs() etc.

On Wed, Jul 18, 2018 at 02:46:27PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> 
> > +static inline void psi_enqueue(struct task_struct *p, u64 now, bool wakeup)
> > +{
> > +	int clear = 0, set = TSK_RUNNING;
> > +
> > +	if (psi_disabled)
> > +		return;
> > +
> > +	if (!wakeup || p->sched_psi_wake_requeue) {
> > +		if (p->flags & PF_MEMSTALL)
> > +			set |= TSK_MEMSTALL;
> > +		if (p->sched_psi_wake_requeue)
> > +			p->sched_psi_wake_requeue = 0;
> > +	} else {
> > +		if (p->in_iowait)
> > +			clear |= TSK_IOWAIT;
> > +	}
> > +
> > +	psi_task_change(p, now, clear, set);
> > +}
> > +
> > +static inline void psi_dequeue(struct task_struct *p, u64 now, bool sleep)
> > +{
> > +	int clear = TSK_RUNNING, set = 0;
> > +
> > +	if (psi_disabled)
> > +		return;
> > +
> > +	if (!sleep) {
> > +		if (p->flags & PF_MEMSTALL)
> > +			clear |= TSK_MEMSTALL;
> > +	} else {
> > +		if (p->in_iowait)
> > +			set |= TSK_IOWAIT;
> > +	}
> > +
> > +	psi_task_change(p, now, clear, set);
> > +}
> 
> > +/**
> > + * psi_memstall_enter - mark the beginning of a memory stall section
> > + * @flags: flags to handle nested sections
> > + *
> > + * Marks the calling task as being stalled due to a lack of memory,
> > + * such as waiting for a refault or performing reclaim.
> > + */
> > +void psi_memstall_enter(unsigned long *flags)
> > +{
> > +	struct rq_flags rf;
> > +	struct rq *rq;
> > +
> > +	if (psi_disabled)
> > +		return;
> > +
> > +	*flags = current->flags & PF_MEMSTALL;
> > +	if (*flags)
> > +		return;
> > +	/*
> > +	 * PF_MEMSTALL setting & accounting needs to be atomic wrt
> > +	 * changes to the task's scheduling state, otherwise we can
> > +	 * race with CPU migration.
> > +	 */
> > +	rq = this_rq_lock_irq(&rf);
> > +
> > +	update_rq_clock(rq);
> > +
> > +	current->flags |= PF_MEMSTALL;
> > +	psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
> > +
> > +	rq_unlock_irq(rq, &rf);
> > +}
> 
> I'm confused by this whole MEMSTALL thing... I thought the idea was to
> account the time we were _blocked_ because of memstall, but you seem to
> count the time we're _running_ with PF_MEMSTALL.

Under heavy memory pressure, a lot of active CPU time is spent
scanning and rotating through the LRU lists, which we do want to
capture in the pressure metric. What we really want to know is the
time in which CPU potential goes to waste due to a lack of
resources. That's the CPU going idle due to a memstall, but it's also
a CPU doing *work* which only occurs due to a lack of memory. We want
to know about both to judge how productive system and workload are.

> And esp. the wait_on_page_bit_common caller seems performance sensitive,
> and the above function is quite expensive.

Right, but we don't call it on every invocation, only when waiting for
the IO to read back a page that was recently deactivated and evicted:

	if (bit_nr == PG_locked &&
	    !PageUptodate(page) && PageWorkingset(page)) {
		if (!PageSwapBacked(page))
			delayacct_thrashing_start();
		psi_memstall_enter(&pflags);
		thrashing = true;
	}

That means the page cache workingset/file active list is thrashing, in
which case the IO itself is our biggest concern, not necessarily a few
additional cycles before going to sleep to wait on its completion.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-18 13:56     ` Johannes Weiner
@ 2018-07-18 16:31       ` Peter Zijlstra
  2018-07-18 16:46         ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-18 16:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Wed, Jul 18, 2018 at 09:56:33AM -0400, Johannes Weiner wrote:
> On Wed, Jul 18, 2018 at 02:46:27PM +0200, Peter Zijlstra wrote:

> > I'm confused by this whole MEMSTALL thing... I thought the idea was to
> > account the time we were _blocked_ because of memstall, but you seem to
> > count the time we're _running_ with PF_MEMSTALL.
> 
> Under heavy memory pressure, a lot of active CPU time is spent
> scanning and rotating through the LRU lists, which we do want to
> capture in the pressure metric. What we really want to know is the
> time in which CPU potential goes to waste due to a lack of
> resources. That's the CPU going idle due to a memstall, but it's also
> a CPU doing *work* which only occurs due to a lack of memory. We want
> to know about both to judge how productive system and workload are.

Then maybe memstall (esp. the 'stall' part of it) is a bit of a
misnomer.

> > And esp. the wait_on_page_bit_common caller seems performance sensitive,
> > and the above function is quite expensive.
> 
> Right, but we don't call it on every invocation, only when waiting for
> the IO to read back a page that was recently deactivated and evicted:
> 
> 	if (bit_nr == PG_locked &&
> 	    !PageUptodate(page) && PageWorkingset(page)) {
> 		if (!PageSwapBacked(page))
> 			delayacct_thrashing_start();
> 		psi_memstall_enter(&pflags);
> 		thrashing = true;
> 	}
> 
> That means the page cache workingset/file active list is thrashing, in
> which case the IO itself is our biggest concern, not necessarily a few
> additional cycles before going to sleep to wait on its completion.

Ah, right. PageWorkingset() is only true if we (recently) evicted that
page before, right?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-18 16:31       ` Peter Zijlstra
@ 2018-07-18 16:46         ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-18 16:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Wed, Jul 18, 2018 at 06:31:15PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 18, 2018 at 09:56:33AM -0400, Johannes Weiner wrote:
> > On Wed, Jul 18, 2018 at 02:46:27PM +0200, Peter Zijlstra wrote:
> 
> > > I'm confused by this whole MEMSTALL thing... I thought the idea was to
> > > account the time we were _blocked_ because of memstall, but you seem to
> > > count the time we're _running_ with PF_MEMSTALL.
> > 
> > Under heavy memory pressure, a lot of active CPU time is spent
> > scanning and rotating through the LRU lists, which we do want to
> > capture in the pressure metric. What we really want to know is the
> > time in which CPU potential goes to waste due to a lack of
> > resources. That's the CPU going idle due to a memstall, but it's also
> > a CPU doing *work* which only occurs due to a lack of memory. We want
> > to know about both to judge how productive system and workload are.
> 
> Then maybe memstall (esp. the 'stall' part of it) is a bit of a
> misnomer.

I'm not tied to that name, but I can't really think of a better
one. It was called PF_MEMDELAY in the past, but "delay" also has
busy-spinning connotations in the kernel. "wait" also implies that
it's a passive state.

> > > And esp. the wait_on_page_bit_common caller seems performance sensitive,
> > > and the above function is quite expensive.
> > 
> > Right, but we don't call it on every invocation, only when waiting for
> > the IO to read back a page that was recently deactivated and evicted:
> > 
> > 	if (bit_nr == PG_locked &&
> > 	    !PageUptodate(page) && PageWorkingset(page)) {
> > 		if (!PageSwapBacked(page))
> > 			delayacct_thrashing_start();
> > 		psi_memstall_enter(&pflags);
> > 		thrashing = true;
> > 	}
> > 
> > That means the page cache workingset/file active list is thrashing, in
> > which case the IO itself is our biggest concern, not necessarily a few
> > additional cycles before going to sleep to wait on its completion.
> 
> Ah, right. PageWorkingset() is only true if we (recently) evicted that
> page before, right?

Yep, but not all of those, only the ones that were on the active list
in their previous incarnation, aka refaulting *hot* pages, aka there
is little chance this is healthy behavior.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-17 10:03   ` Peter Zijlstra
@ 2018-07-18 21:56     ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-18 21:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Jul 17, 2018 at 12:03:47PM +0200, Peter Zijlstra wrote:
> This is still a scary amount of accounting; not to mention you'll be
> adding O(cgroup-depth) to this in a later patch.
> 
> Where are the performance numbers for all this?

I benchmarked it using our two most scheduling sensitive workloads:
memcache and webserver. They handle a ton of small requests - lots of
wakeups and sleeps with little actual work in between - so they tend
to be canaries for scheduler regressions.

In the tests, the boxes were handling live traffic over the course of
several hours. Half the machines, the control, ran with CONFIG_PSI=n.

For memcache I used eight machines total. They're 2-socket, 14 core,
56 thread boxes. The test runs for half the test period, flips the
test and control kernels on the hardware to rule out HW factors, DC
location etc., then runs the other half of the test.

For the webservers, I used 32 machines total. They're single socket,
16 core, 32 thread machines.

During the memcache test, CPU load was nopsi=78.05% psi=78.98% in the
first half and nopsi=77.52% psi=78.25% in the second half, so psi added
between 0.7 and 0.9 percentage points to the CPU load, a relative
difference of about 1%.

As far as end-to-end request latency from the client perspective goes,
we don't sample those finely enough to capture the requests going to
those particular machines during the test, but we know the p50
turnaround time in this workload is 54us, and perf bench sched pipe on
those machines shows nopsi=5.232666 us/op and psi=5.587347 us/op, so
this doesn't add much here either.
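
As a worked check of the numbers above, a small sketch that computes
the absolute and relative differences. The helper names are made up
for illustration; the inputs are the figures quoted in this mail.

```c
/* Absolute difference, in the unit of the inputs
 * (percentage points for CPU load, us/op for the pipe benchmark). */
static double abs_delta(double nopsi, double psi)
{
	return psi - nopsi;
}

/* Relative overhead of psi over nopsi, in percent. */
static double rel_overhead_pct(double nopsi, double psi)
{
	return (psi - nopsi) / nopsi * 100.0;
}
```

With these, abs_delta(78.05, 78.98) is 0.93 and abs_delta(77.52, 78.25)
is 0.73 percentage points, i.e. roughly 1% relative in both halves. For
the pipe benchmark, abs_delta(5.232666, 5.587347) is about 0.35 us/op,
roughly 7% relative on a benchmark that does little besides wakeups
and sleeps.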

The profile for the pipe benchmark shows:

     0.87%  sched-pipe  [kernel.vmlinux]    [k] psi_group_change
     0.83%  perf.real   [kernel.vmlinux]    [k] psi_group_change
     0.82%  perf.real   [kernel.vmlinux]    [k] psi_task_change
     0.58%  sched-pipe  [kernel.vmlinux]    [k] psi_task_change


The webserver load is running inside 4 nested cgroup levels. The CPU
load with both nopsi and psi kernels was indistinguishable at 81%.

For comparison, we had to disable the cgroup cpu controller on the
webservers because it added 4 percentage points to the CPU% during
this same exact test.

Versions of this accounting code now run on 80% of our fleet. None of
our workloads have reported regressions during the rollout.

[ Also note that the webservers that tested the nopsi kernel were
  during that time susceptible to swap storms, memory livelocks, and
  eventual hardresets because without psi they couldn't run our full
  resource isolation stack that would prevent that ;) ]

Let me know if there are other tests I could run.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-17 14:16   ` Peter Zijlstra
@ 2018-07-18 22:00     ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-18 22:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Jul 17, 2018 at 04:16:14PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +/* Tracked task states */
> > +enum psi_task_count {
> > +	NR_RUNNING,
> > +	NR_IOWAIT,
> > +	NR_MEMSTALL,
> > +	NR_PSI_TASK_COUNTS,
> > +};
> 
> > +/* Resources that workloads could be stalled on */
> > +enum psi_res {
> > +	PSI_CPU,
> > +	PSI_MEM,
> > +	PSI_IO,
> > +	NR_PSI_RESOURCES,
> > +};
> 
> These two have mem and iowait in different order. It really doesn't
> matter, but my brain stumbled.

No problem, I swapped them around for v3.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-17 14:21   ` Peter Zijlstra
@ 2018-07-18 22:03     ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-18 22:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Jul 17, 2018 at 04:21:57PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
> > index 04f1321d14c4..ac39435d1521 100644
> > --- a/include/linux/sched/stat.h
> > +++ b/include/linux/sched/stat.h
> > @@ -28,10 +28,14 @@ static inline int sched_info_on(void)
> >  	return 1;
> >  #elif defined(CONFIG_TASK_DELAY_ACCT)
> >  	extern int delayacct_on;
> > +	if (delayacct_on)
> > +		return 1;
> > +#elif defined(CONFIG_PSI)
> > +	extern int psi_disabled;
> > +	if (!psi_disabled)
> > +		return 1;
> >  #endif
> > +	return 0;
> >  }
> 
> Doesn't that want to be something like:
> 
> static inline bool sched_info_on(void)
> {
> #ifdef CONFIG_SCHEDSTATS
> 	return true;
> #else /* !SCHEDSTAT */
> #ifdef CONFIG_TASK_DELAY_ACCT
> 	extern int delayacct_on;
> 	if (delayacct_on)
> 		return true;
> #endif /* DELAYACCT */
> #ifdef CONFIG_PSI
> 	extern int psi_disabled;
> 	if (!psi_disabled)
> 		return true;
> #endif
> 	return false;
> #endif /* !SCHEDSTAT */
> }
> 
> Such that if you build a TASK_DELAY_ACCT && PSI kernel, and boot with
> nodelayacct, you still get sched_info_on().

You're right, that was a brainfart on my end. But as you point out in
the other email, the SCHED_INFO dependency is artificial, so I'll
rework this entire part.
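
For the record, a minimal userspace sketch of why the #elif chain goes
wrong. The CONFIG_* defines and the two globals are stand-ins for the
kernel symbols, and the function names are made up; it models a
TASK_DELAY_ACCT && PSI kernel booted with nodelayacct.

```c
/* Stand-ins for the kernel config: both options built in. */
#define CONFIG_TASK_DELAY_ACCT 1
#define CONFIG_PSI 1

static int delayacct_on;	/* 0 models booting with "nodelayacct" */
static int psi_disabled;	/* 0 means psi is active */

/* Shape of the v2 hunk: the PSI branch is an #elif, so it is
 * compiled out whenever CONFIG_TASK_DELAY_ACCT is set. */
static int sched_info_on_elif(void)
{
#if defined(CONFIG_SCHEDSTATS)
	return 1;
#elif defined(CONFIG_TASK_DELAY_ACCT)
	if (delayacct_on)
		return 1;
#elif defined(CONFIG_PSI)
	if (!psi_disabled)
		return 1;
#endif
	return 0;
}

/* Shape of the suggested fix: independent #ifdef blocks, so both
 * conditions are checked when both options are built in. */
static int sched_info_on_ifdef(void)
{
#ifdef CONFIG_SCHEDSTATS
	return 1;
#else
#ifdef CONFIG_TASK_DELAY_ACCT
	if (delayacct_on)
		return 1;
#endif
#ifdef CONFIG_PSI
	if (!psi_disabled)
		return 1;
#endif
	return 0;
#endif
}
```

In this configuration the #elif variant reports 0 even though psi is
enabled, while the #ifdef variant reports 1.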

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-17 15:01   ` Peter Zijlstra
@ 2018-07-18 22:06     ` Johannes Weiner
  2018-07-20 14:13       ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2018-07-18 22:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Jul 17, 2018 at 05:01:42PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +static bool psi_update_stats(struct psi_group *group)
> > +{
> > +	u64 some[NR_PSI_RESOURCES] = { 0, };
> > +	u64 full[NR_PSI_RESOURCES] = { 0, };
> > +	unsigned long nonidle_total = 0;
> > +	unsigned long missed_periods;
> > +	unsigned long expires;
> > +	int cpu;
> > +	int r;
> > +
> > +	mutex_lock(&group->stat_lock);
> > +
> > +	/*
> > +	 * Collect the per-cpu time buckets and average them into a
> > +	 * single time sample that is normalized to wallclock time.
> > +	 *
> > +	 * For averaging, each CPU is weighted by its non-idle time in
> > +	 * the sampling period. This eliminates artifacts from uneven
> > +	 * loading, or even entirely idle CPUs.
> > +	 *
> > +	 * We could pin the online CPUs here, but the noise introduced
> > +	 * by missing up to one sample period from CPUs that are going
> > +	 * away shouldn't matter in practice - just like the noise of
> > +	 * previously offlined CPUs returning with a non-zero sample.
> 
> But why!? cpus_read_lock() is neither expensive nor complicated. So why
> try and avoid it?

Hm, I don't feel strongly about it either way. I'll add it.

> > +	/* total= */
> > +	for (r = 0; r < NR_PSI_RESOURCES; r++) {
> > +		do_div(some[r], max(nonidle_total, 1UL));
> > +		do_div(full[r], max(nonidle_total, 1UL));
> > +
> > +		group->some[r] += some[r];
> > +		group->full[r] += full[r];
> 
> 		group->some[r] = div64_ul(some[r], max(nonidle_total, 1UL));
> 		group->full[r] = div64_ul(full[r], max(nonidle_total, 1UL));
> 
> Is easier to read imo.

Sounds good to me, I'll change that.
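
To illustrate the averaging described in the comment above, a minimal
userspace sketch for a single resource. It assumes each CPU's stall
time is multiplied by that CPU's non-idle time before the division
shown in the hunk; struct cpu_sample and weighted_some() are made-up
names, and uint64_t stands in for u64.

```c
#include <stdint.h>

/* Hypothetical per-CPU sample for one resource and one period. */
struct cpu_sample {
	uint64_t some;		/* time spent in the SOME state */
	uint64_t nonidle;	/* non-idle time, used as the weight */
};

/* Average the per-CPU SOME times, weighting each CPU by its
 * non-idle time so that idle CPUs do not dilute the result. */
static uint64_t weighted_some(const struct cpu_sample *s, int nr_cpus)
{
	uint64_t sum = 0, nonidle_total = 0;
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		sum += s[cpu].some * s[cpu].nonidle;
		nonidle_total += s[cpu].nonidle;
	}

	/* Same guard as max(nonidle_total, 1UL) in the patch. */
	return sum / (nonidle_total ? nonidle_total : 1);
}
```

An entirely idle CPU contributes a weight of zero, so one busy CPU
stalled for 100 units out of 1000 still averages to 100 no matter how
many idle CPUs sit next to it.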

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-17 15:17   ` Peter Zijlstra
@ 2018-07-18 22:11     ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-18 22:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Jul 17, 2018 at 05:17:05PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > @@ -457,6 +457,22 @@ config TASK_IO_ACCOUNTING
> >  
> >  	  Say N if unsure.
> >  
> > +config PSI
> > +	bool "Pressure stall information tracking"
> > +	select SCHED_INFO
> 
> What's the deal here? AFAICT it does not in fact use SCHED_INFO for
> _anything_. You just hooked into the sched_info_{en,de}queue() hooks,
> but you don't use any of the sched_info data.
> 
> So the dependency is an artificial one that should not exist.

You're right, it doesn't strictly depend on it. I'll split that out.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-17 11:25   ` Michal Hocko
  2018-07-17 12:13     ` Daniel Drake
@ 2018-07-18 22:21     ` Johannes Weiner
  2018-07-19 11:29       ` peter enderborg
  1 sibling, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2018-07-18 22:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Drake, linux-kernel, linux-mm, cgroups, linux,
	linux-block, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Tejun Heo, Balbir Singh, Mike Galbraith, Oliver Yang,
	Shakeel Butt, xxx xxx, Taras Kondratiuk, Daniel Walker,
	Vinayak Menon, Ruslan Ruslichenko, kernel-team

On Tue, Jul 17, 2018 at 01:25:15PM +0200, Michal Hocko wrote:
> On Mon 16-07-18 10:57:45, Daniel Drake wrote:
> > Hi Johannes,
> > 
> > Thanks for your work on psi! 
> > 
> > We have also been investigating the "thrashing problem" on our Endless
> > desktop OS. We have seen that systems can easily get into a state where the
> > UI becomes unresponsive to input, and the mouse cursor becomes extremely
> > slow or stuck when the system is running out of memory. We are working with
> > a full GNOME desktop environment on systems with only 2GB RAM, and
> > sometimes no real swap (although zram-swap helps mitigate the problem to
> > some extent).
> > 
> > My analysis so far indicates that when the system is low on memory and hits
> > this condition, the system is spending much of the time under
> > __alloc_pages_direct_reclaim. "perf trace -F" shows many many page faults
> > in executable code while this is going on. I believe the kernel is
> > swapping out executable code in order to satisfy memory allocation
> > requests, but then that swapped-out code is needed a moment later so it
> > gets swapped in again via the page fault handler, and all this activity
> > severely starves the system from being able to respond to user input.
> > 
> > I appreciate the kernel's attempt to keep processes alive, but in the
> > desktop case we see that the system rarely recovers from this situation,
> > so you have to hard shutdown. In this case we view it as desirable that
> > the OOM killer would step in (it is not doing so because direct reclaim
> > is not actually failing).

Yes, we currently use a userspace application that monitors pressure
and OOM kills (there is usually plenty of headroom left for a small
application to run by the time quality of service for most workloads
has already tanked to unacceptable levels). We want to eventually add
this back into the kernel with the appropriate configuration options
(pressure threshold value and sustained duration etc.)

> Yes this is really unfortunate. One thing that could help would be to
> consider a thrashing level during the reclaim (get_scan_count) to simply
> forget about LRUs which are constantly refaulting pages back. We already
> have the infrastructure for that. We just need to plumb it in.

This doesn't work without quantifying the actual time you're spending
on thrashing IO. The cutoff for acceptable refaults is very different
between rotating disks, crappy SSDs, and high-end flash.

But in the future we might want the OOM killer to monitor psi memory
levels and dispatch tasks when we sustain X percent for Y seconds.
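
A rough sketch of what "X percent for Y seconds" could look like, fed
with one pressure sample per second. This is purely illustrative and
not the tool we run; struct pressure_watch and its fields are made-up
names, and where the samples come from is left out.

```c
/* Hypothetical trigger state for "X percent for Y seconds". */
struct pressure_watch {
	double		threshold_pct;	/* X: pressure threshold */
	unsigned int	need_secs;	/* Y: required duration */
	unsigned int	over_secs;	/* consecutive seconds at or above X */
};

/* Feed one sample per second. Returns 1 once pressure has stayed at
 * or above the threshold for need_secs consecutive samples. */
static int pressure_watch_sample(struct pressure_watch *w, double pct)
{
	if (pct >= w->threshold_pct)
		w->over_secs++;
	else
		w->over_secs = 0;	/* any dip restarts the clock */

	return w->over_secs >= w->need_secs;
}
```

Resetting on every dip keeps short spikes from triggering a kill; the
right values for X and Y depend on the workload and the storage.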

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-18 12:03   ` Peter Zijlstra
  2018-07-18 12:22     ` Peter Zijlstra
@ 2018-07-18 22:36     ` Johannes Weiner
  2018-07-19 13:58       ` Peter Zijlstra
  2018-07-19  9:26     ` Peter Zijlstra
                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2018-07-18 22:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +	/* Time in which tasks wait for the CPU */
> > +	state = PSI_NONE;
> > +	if (tasks[NR_RUNNING] > 1)
> > +		state = PSI_SOME;
> > +	time_state(&groupc->res[PSI_CPU], state, now);
> > +
> > +	/* Time in which tasks wait for memory */
> > +	state = PSI_NONE;
> > +	if (tasks[NR_MEMSTALL]) {
> > +		if (!tasks[NR_RUNNING] ||
> > +		    (cpu_curr(cpu)->flags & PF_MEMSTALL))
> 
> I'm confused, why do we care if the current task is MEMSTALL or not?

We want to know whether we're losing CPU potential because of a lack
of memory. That can happen when the task waits for refaults and the
CPU goes idle, but it can also happen when the CPU is performing
reclaim.

If the task waits for refaults and something else is runnable, we're
not losing CPU potential. But if the task performs reclaim and uses
the CPU, nothing else can do productive work on that CPU.
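
Spelled out as a standalone function, the memory classification from
the quoted hunk looks roughly like this. It is a userspace sketch: the
enum ordering and the mem_state() name are assumptions, and
curr_memstall stands in for cpu_curr(cpu)->flags & PF_MEMSTALL.

```c
enum psi_state { PSI_NONE, PSI_SOME, PSI_FULL };

/* Classify one CPU's memory pressure from its task counts. */
static enum psi_state mem_state(unsigned int nr_running,
				unsigned int nr_memstall,
				int curr_memstall)
{
	if (!nr_memstall)
		return PSI_NONE;

	/* Nothing runnable, or the CPU itself is busy reclaiming:
	 * either way no productive work happens on this CPU. */
	if (!nr_running || curr_memstall)
		return PSI_FULL;

	/* Someone is stalled, but another task makes progress. */
	return PSI_SOME;
}
```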

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-18 12:03   ` Peter Zijlstra
  2018-07-18 12:22     ` Peter Zijlstra
  2018-07-18 22:36     ` Johannes Weiner
@ 2018-07-19  9:26     ` Peter Zijlstra
  2018-07-19 12:50       ` Johannes Weiner
  2018-07-19 15:08     ` Linus Torvalds
  2018-07-19 18:47     ` Johannes Weiner
  4 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-19  9:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:

> Leaving us just 5 bytes short of needing a single cacheline :/
> 
> struct ponies {
>         unsigned int               tasks[3];                                             /*     0    12 */
>         unsigned int               cpu_state:2;                                          /*    12:30  4 */
>         unsigned int               io_state:2;                                           /*    12:28  4 */
>         unsigned int               mem_state:2;                                          /*    12:26  4 */
> 
>         /* XXX 26 bits hole, try to pack */
> 
>         /* typedef u64 */ long long unsigned int     last_time;                          /*    16     8 */
>         /* typedef u64 */ long long unsigned int     some_time[3];                       /*    24    24 */
>         /* typedef u64 */ long long unsigned int     full_time[2];                       /*    48    16 */
>         /* --- cacheline 1 boundary (64 bytes) --- */
>         /* typedef u64 */ long long unsigned int     nonidle_time;                       /*    64     8 */
> 
>         /* size: 72, cachelines: 2, members: 8 */
>         /* bit holes: 1, sum bit holes: 26 bits */
>         /* last cacheline: 8 bytes */
> };
> 
> ARGGH!

It _might_ be possible to use curr->se.exec_start for last_time if you
very carefully audit and place the hooks. I've not gone through it in
detail, but it might just work.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-18 22:21     ` Johannes Weiner
@ 2018-07-19 11:29       ` peter enderborg
  2018-07-19 12:18         ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: peter enderborg @ 2018-07-19 11:29 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: Daniel Drake, linux-kernel, linux-mm, cgroups, linux,
	linux-block, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Tejun Heo, Balbir Singh, Mike Galbraith, Oliver Yang,
	Shakeel Butt, xxx xxx, Taras Kondratiuk, Daniel Walker,
	Vinayak Menon, Ruslan Ruslichenko, kernel-team

On 07/19/2018 12:21 AM, Johannes Weiner wrote:
>
> Yes, we currently use a userspace application that monitors pressure
> and OOM kills (there is usually plenty of headroom left for a small
> application to run by the time quality of service for most workloads
> has already tanked to unacceptable levels). We want to eventually add
> this back into the kernel with the appropriate configuration options
> (pressure threshold value and sustained duration etc.)
Is that the same application as Google's lmkd for Android? Is there any
source that you might share?



* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-19 11:29       ` peter enderborg
@ 2018-07-19 12:18         ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-19 12:18 UTC (permalink / raw)
  To: peter enderborg
  Cc: Michal Hocko, Daniel Drake, linux-kernel, linux-mm, cgroups,
	linux, linux-block, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Tejun Heo, Balbir Singh, Mike Galbraith, Oliver Yang,
	Shakeel Butt, xxx xxx, Taras Kondratiuk, Daniel Walker,
	Vinayak Menon, Ruslan Ruslichenko, kernel-team

On Thu, Jul 19, 2018 at 01:29:39PM +0200, peter enderborg wrote:
> On 07/19/2018 12:21 AM, Johannes Weiner wrote:
> >
> > Yes, we currently use a userspace application that monitors pressure
> > and OOM kills (there is usually plenty of headroom left for a small
> > application to run by the time quality of service for most workloads
> > has already tanked to unacceptable levels). We want to eventually add
> > this back into the kernel with the appropriate configuration options
> > (pressure threshold value and sustained duration etc.)
> Is that the same application as Google's lmkd for Android? Is there any
> source that you might share?

Sure! This is the oomd we've been developing and using at Facebook:

	https://github.com/facebookincubator/oomd


* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-19  9:26     ` Peter Zijlstra
@ 2018-07-19 12:50       ` Johannes Weiner
  2018-07-19 13:18         ` Peter Zijlstra
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2018-07-19 12:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 19, 2018 at 11:26:14AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> 
> > Leaving us just 5 bytes short of needing a single cacheline :/
> > 
> > struct ponies {
> >         unsigned int               tasks[3];                                             /*     0    12 */
> >         unsigned int               cpu_state:2;                                          /*    12:30  4 */
> >         unsigned int               io_state:2;                                           /*    12:28  4 */
> >         unsigned int               mem_state:2;                                          /*    12:26  4 */
> > 
> >         /* XXX 26 bits hole, try to pack */
> > 
> >         /* typedef u64 */ long long unsigned int     last_time;                          /*    16     8 */
> >         /* typedef u64 */ long long unsigned int     some_time[3];                       /*    24    24 */
> >         /* typedef u64 */ long long unsigned int     full_time[2];                       /*    48    16 */
> >         /* --- cacheline 1 boundary (64 bytes) --- */
> >         /* typedef u64 */ long long unsigned int     nonidle_time;                       /*    64     8 */
> > 
> >         /* size: 72, cachelines: 2, members: 8 */
> >         /* bit holes: 1, sum bit holes: 26 bits */
> >         /* last cacheline: 8 bytes */
> > };
> > 
> > ARGGH!
> 
> It _might_ be possible to use curr->se.exec_start for last_time if you
> very carefully audit and place the hooks. I've not gone through it in
> detail, but it might just work.

Hnngg, and chop off an entire cacheline...

But don't we flush that delta out and update the timestamp on every
tick? entity_tick() does update_curr(). That might be too expensive :(


* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-19 12:50       ` Johannes Weiner
@ 2018-07-19 13:18         ` Peter Zijlstra
  0 siblings, 0 replies; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-19 13:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team, Arnaldo Carvalho de Melo

On Thu, Jul 19, 2018 at 08:50:38AM -0400, Johannes Weiner wrote:
> On Thu, Jul 19, 2018 at 11:26:14AM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> > 
> > > Leaving us just 5 bytes short of needing a single cacheline :/
> > > 
> > > struct ponies {
> > >         unsigned int               tasks[3];                                             /*     0    12 */
> > >         unsigned int               cpu_state:2;                                          /*    12:30  4 */
> > >         unsigned int               io_state:2;                                           /*    12:28  4 */
> > >         unsigned int               mem_state:2;                                          /*    12:26  4 */
> > > 
> > >         /* XXX 26 bits hole, try to pack */
> > > 
> > >         /* typedef u64 */ long long unsigned int     last_time;                          /*    16     8 */
> > >         /* typedef u64 */ long long unsigned int     some_time[3];                       /*    24    24 */
> > >         /* typedef u64 */ long long unsigned int     full_time[2];                       /*    48    16 */
> > >         /* --- cacheline 1 boundary (64 bytes) --- */
> > >         /* typedef u64 */ long long unsigned int     nonidle_time;                       /*    64     8 */
> > > 
> > >         /* size: 72, cachelines: 2, members: 8 */
> > >         /* bit holes: 1, sum bit holes: 26 bits */
> > >         /* last cacheline: 8 bytes */
> > > };
> > > 
> > > ARGGH!
> > 
> > It _might_ be possible to use curr->se.exec_start for last_time if you
> > very carefully audit and place the hooks. I've not gone through it in
> > detail, but it might just work.
> 
> Hnngg, and chop off an entire cacheline...

Yes.. a worthy goal :-)

> But don't we flush that delta out and update the timestamp on every
> tick?

Indeed.

> entity_tick() does update_curr(). That might be too expensive :(

Well, since you already do all this accounting on every enqueue/dequeue,
this can run many thousands of times per tick already, so once per tick
doesn't sound bad.

However, I just realized this might not in fact work, because
curr->se.exec_start is per task, and you really want something per-cpu
for this.

Bah, if only perf had a useful tool to report on data layout instead of
this c2c crap.. :-( The thinking being that we could maybe find a
usage-hole (a data member that is not in fact used) near something we
already touch for writing.

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-18 22:36     ` Johannes Weiner
@ 2018-07-19 13:58       ` Peter Zijlstra
  0 siblings, 0 replies; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-19 13:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Wed, Jul 18, 2018 at 06:36:44PM -0400, Johannes Weiner wrote:
> On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > > +	/* Time in which tasks wait for the CPU */
> > > +	state = PSI_NONE;
> > > +	if (tasks[NR_RUNNING] > 1)
> > > +		state = PSI_SOME;
> > > +	time_state(&groupc->res[PSI_CPU], state, now);
> > > +
> > > +	/* Time in which tasks wait for memory */
> > > +	state = PSI_NONE;
> > > +	if (tasks[NR_MEMSTALL]) {
> > > +		if (!tasks[NR_RUNNING] ||
> > > +		    (cpu_curr(cpu)->flags & PF_MEMSTALL))
> > 
> > > I'm confused, why do we care if the current task is MEMSTALL or not?
> 
> We want to know whether we're losing CPU potential because of a lack
> of memory. That can happen when the task waits for refaults and the
> CPU goes idle, but it can also happen when the CPU is performing
> reclaim.
> 
> If the task waits for refaults and something else is runnable, we're
> not losing CPU potential. But if the task performs reclaim and uses
> the CPU, nothing else can do productive work on that CPU.

Right, this is because MEMSTALL is not just blocking (as per that other
sub-thread).

This is really unfortunate, because it means the state is not a simple
function of the task counts.
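
To make the dependency concrete, here is a minimal userspace sketch of the memory state decision discussed above. PSI_NONE and PSI_SOME are the names from the patch; PSI_FULL, the task count indices and the curr_in_memstall parameter (standing in for cpu_curr(cpu)->flags & PF_MEMSTALL) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum psi_state { PSI_NONE, PSI_SOME, PSI_FULL };	/* PSI_FULL is assumed */
enum { NR_RUNNING, NR_IOWAIT, NR_MEMSTALL, NR_PSI_TASK_COUNTS };

/*
 * Memory pressure state of one CPU. The third input is what keeps this
 * from being a pure function of the task counts.
 */
static enum psi_state mem_state(const unsigned int *tasks, bool curr_in_memstall)
{
	assert(tasks != NULL);

	if (!tasks[NR_MEMSTALL])
		return PSI_NONE;
	/* CPU idle, or busy doing reclaim: nothing productive can run */
	if (!tasks[NR_RUNNING] || curr_in_memstall)
		return PSI_FULL;
	/* Someone stalls on memory, but another task makes progress */
	return PSI_SOME;
}
```

Two calls with identical counts can return different states depending on whether the running task is the one in reclaim.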


* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-18 12:03   ` Peter Zijlstra
                       ` (2 preceding siblings ...)
  2018-07-19  9:26     ` Peter Zijlstra
@ 2018-07-19 15:08     ` Linus Torvalds
  2018-07-19 17:54       ` Johannes Weiner
  2018-07-19 18:47     ` Johannes Weiner
  4 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2018-07-19 15:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Johannes Weiner, Ingo Molnar, Andrew Morton, Tejun Heo, surenb,
	Vinayak Menon, Christoph Lameter, Mike Galbraith, shakeelb,
	linux-mm, cgroups, Linux Kernel Mailing List, kernel-team

On Wed, Jul 18, 2018 at 5:03 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> And as said before, we can compress the state from 12 bytes, to 6 bits
> (or 1 byte), giving another 11 bytes for 59 bytes free.
>
> Leaving us just 5 bytes short of needing a single cacheline :/

Do you actually need 64 bits for the times?

That's the big cost. And it seems ridiculous, if you actually care about size.

You already have a 64-bit start time. Everything else is some
cumulative relative time. Do those really need 64-bit and nanosecond
resolution?

Maybe a 32-bit microsecond would be ok - would you ever account more
than 35 minutes of anything without starting anew?

             Linus
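
The ranges in question are easy to check. The following is a sketch only; wrap_seconds() is a hypothetical helper. A full 32-bit microsecond counter wraps after roughly 71.6 minutes, and the 35 minute figure matches a 31-bit (signed) range.

```c
#include <assert.h>

/*
 * Seconds until an unsigned counter of the given width wraps when it
 * advances units_per_sec times per second.
 */
static double wrap_seconds(unsigned int bits, double units_per_sec)
{
	double values = 1.0;

	assert(bits >= 1 && bits <= 64 && units_per_sec > 0.0);

	/* 2^bits distinct values, computed without needing <math.h> */
	for (unsigned int i = 0; i < bits; i++)
		values *= 2.0;
	return values / units_per_sec;
}
```

For example, wrap_seconds(32, 1e6) / 60 gives the microsecond range in minutes, and wrap_seconds(32, 1e9) gives the range of a 32-bit nanosecond counter in seconds.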


* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-19 15:08     ` Linus Torvalds
@ 2018-07-19 17:54       ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-19 17:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Tejun Heo, surenb,
	Vinayak Menon, Christoph Lameter, Mike Galbraith, shakeelb,
	linux-mm, cgroups, Linux Kernel Mailing List, kernel-team

On Thu, Jul 19, 2018 at 08:08:20AM -0700, Linus Torvalds wrote:
> On Wed, Jul 18, 2018 at 5:03 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > And as said before, we can compress the state from 12 bytes, to 6 bits
> > (or 1 byte), giving another 11 bytes for 59 bytes free.
> >
> > Leaving us just 5 bytes short of needing a single cacheline :/
> 
> Do you actually need 64 bits for the times?
> 
> That's the big cost. And it seems ridiculous, if you actually care about size.
> 
> You already have a 64-bit start time. Everything else is some
> cumulative relative time. Do those really need 64-bit and nanosecond
> resolution?
> 
> Maybe a 32-bit microsecond would be ok - would you ever account more
> than 35 minutes of anything without starting anew?

D'oh, you're right, the per-cpu buckets don't need to be this big at
all. In fact, we flush those deltas out every 2 seconds when there is
activity to maintain the running averages. Since we get 4.2s worth of
nanoseconds into a u32, we don't even need to divide in the hotpath.

Something along the lines of this here should work:

static void psi_group_change(struct psi_group *group, int cpu, u64 now,
			     unsigned int clear, unsigned int set)
{
	struct psi_group_cpu *groupc;
	unsigned int *tasks;
	unsigned int t;
	u32 delta;

	groupc = per_cpu_ptr(group->cpus, cpu);
	tasks = groupc->tasks;

	/* Time since last task change on this runqueue */
	delta = now - groupc->last_time;
	groupc->last_time = now;

	/* Tasks waited for IO? */
	if (tasks[NR_IOWAIT]) {
		if (!tasks[NR_RUNNING])
			groupc->full_time[PSI_IO] += delta;
		else
			groupc->some_time[PSI_IO] += delta;
	}

	/* Tasks waited for memory? */
	if (tasks[NR_MEMSTALL]) {
		if (!tasks[NR_RUNNING] ||
		    (cpu_curr(cpu)->flags & PF_MEMSTALL))
			groupc->full_time[PSI_MEM] += delta;
		else
			groupc->some_time[PSI_MEM] += delta;
	}

	/* Tasks waited for the CPU? */
	if (tasks[NR_RUNNING] > 1)
		groupc->some_time[PSI_CPU] += delta;

	/* Tasks were generally non-idle? To weigh the CPU in summaries */
	if (tasks[NR_RUNNING] || tasks[NR_IOWAIT] || tasks[NR_MEMSTALL])
		groupc->nonidle_time += delta;

	/* Update task counts according to the set/clear bitmasks */
	for (t = 0; clear; clear &= ~(1 << t), t++)
		if (clear & (1 << t))
			groupc->tasks[t]--;
	for (t = 0; set; set &= ~(1 << t), t++)
		if (set & (1 << t))
			groupc->tasks[t]++;

	/* Kick the stats aggregation worker if it's gone to sleep */
	if (!delayed_work_pending(&group->clock_work))
		schedule_delayed_work(&group->clock_work, PSI_FREQ);
}

And then we can pack it down to one cacheline:

struct psi_group_cpu {
	/* States of the tasks belonging to this group */
	unsigned int tasks[NR_PSI_TASK_COUNTS]; // 3

	/* Time sampling bucket for pressure states - no FULL for CPU */
	u32 some_time[NR_PSI_RESOURCES];
	u32 full_time[NR_PSI_RESOURCES - 1];

	/* Time sampling bucket for non-idle state (ns) */
	u32 nonidle_time;

	/* Time of last task change in this group (rq_clock) */
	u64 last_time;
};

I'm going to go test with this.
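
A quick userspace check of the packing, as a sketch: it assumes 64-byte cachelines, NR_PSI_RESOURCES == 3 (IO, memory, CPU, matching the some[3]/full[2] arrays above), and uintN_t as stand-ins for the kernel types. bucket_delta() is a hypothetical helper mirroring the "delta = now - groupc->last_time" step.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PSI_TASK_COUNTS 3	/* per the "// 3" note in the struct above */
#define NR_PSI_RESOURCES   3	/* assumed: IO, memory, CPU */

struct psi_group_cpu {
	unsigned int tasks[NR_PSI_TASK_COUNTS];
	uint32_t some_time[NR_PSI_RESOURCES];
	uint32_t full_time[NR_PSI_RESOURCES - 1];
	uint32_t nonidle_time;
	uint64_t last_time;
};

/* Fails the build if the structure outgrows one cacheline. */
_Static_assert(sizeof(struct psi_group_cpu) <= 64,
	       "psi_group_cpu spans more than one 64-byte cacheline");

/* Elapsed nanoseconds since the last change, truncated to bucket width. */
static uint32_t bucket_delta(uint64_t now, uint64_t last_time)
{
	assert(now >= last_time);
	return (uint32_t)(now - last_time);
}
```

A 2 second flush interval is 2,000,000,000 ns, which stays below the 4,294,967,296 ns a u32 can hold, so the truncation loses nothing as long as the buckets are drained on schedule.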

Thanks


* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-18 12:03   ` Peter Zijlstra
                       ` (3 preceding siblings ...)
  2018-07-19 15:08     ` Linus Torvalds
@ 2018-07-19 18:47     ` Johannes Weiner
  2018-07-19 20:31       ` Peter Zijlstra
  4 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2018-07-19 18:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > +	/* Update task counts according to the set/clear bitmasks */
> > +	for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> > +		int idx = to + (bo - 1);
> > +
> > +		if (tasks[idx] == 0 && !psi_bug) {
> > +			printk_deferred(KERN_ERR "psi: task underflow! cpu=%d idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> > +					cpu, idx, tasks[0], tasks[1], tasks[2],
> > +					clear, set);
> > +			psi_bug = 1;
> > +		}
> 
> 		WARN_ONCE(!tasks[idx], ...);

It's just open-coded because of the printk_deferred, since this is
inside the scheduler.

It actually used to be a straight-up WARN_ONCE() in older
versions. Recursive scheduling bugs are no fun to debug ;)
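
For readers unfamiliar with the ffs() walk in the quoted hunk, here is a minimal userspace sketch of the same loop shape. clear_counts() and NR_COUNTS are hypothetical names, and instead of printk_deferred() the sketch simply returns an error so the caller can report the underflow once.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>	/* ffs() */

#define NR_COUNTS 3

/*
 * Decrement counts[i] for every bit i set in mask.
 * Returns 0 on success, or -1 as soon as a count would underflow.
 */
static int clear_counts(unsigned int *counts, unsigned int mask)
{
	int base, bit;

	assert(counts != NULL);
	assert((mask >> NR_COUNTS) == 0);	/* only valid indices set */

	/* ffs() returns the 1-based position of the lowest set bit, 0 if none */
	for (base = 0; (bit = ffs((int)mask)) != 0; base += bit, mask >>= bit) {
		int idx = base + (bit - 1);

		if (counts[idx] == 0)
			return -1;
		counts[idx]--;
	}
	return 0;
}
```

Each iteration shifts the handled bit out of the mask, so the loop runs once per set bit rather than once per possible index.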


* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-19 18:47     ` Johannes Weiner
@ 2018-07-19 20:31       ` Peter Zijlstra
  2018-07-24 16:01         ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-19 20:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 19, 2018 at 02:47:40PM -0400, Johannes Weiner wrote:
> On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > > +	/* Update task counts according to the set/clear bitmasks */
> > > +	for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> > > +		int idx = to + (bo - 1);
> > > +
> > > +		if (tasks[idx] == 0 && !psi_bug) {
> > > +			printk_deferred(KERN_ERR "psi: task underflow! cpu=%d idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> > > +					cpu, idx, tasks[0], tasks[1], tasks[2],
> > > +					clear, set);
> > > +			psi_bug = 1;
> > > +		}
> > 
> > 		WARN_ONCE(!tasks[idx], ...);
> 
> It's just open-coded because of the printk_deferred, since this is
> inside the scheduler.

Yeah, meh. There's a ton of WARNs in the scheduler, WARNs should not
trigger anyway. But yeah printk is crap, which is why I don't use printk
anymore:

  https://lkml.kernel.org/r/20170928121823.430053219@infradead.org




* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-18 22:06     ` Johannes Weiner
@ 2018-07-20 14:13       ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-20 14:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Wed, Jul 18, 2018 at 06:06:23PM -0400, Johannes Weiner wrote:
> On Tue, Jul 17, 2018 at 05:01:42PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > > +static bool psi_update_stats(struct psi_group *group)
> > > +{
> > > +	u64 some[NR_PSI_RESOURCES] = { 0, };
> > > +	u64 full[NR_PSI_RESOURCES] = { 0, };
> > > +	unsigned long nonidle_total = 0;
> > > +	unsigned long missed_periods;
> > > +	unsigned long expires;
> > > +	int cpu;
> > > +	int r;
> > > +
> > > +	mutex_lock(&group->stat_lock);
> > > +
> > > +	/*
> > > +	 * Collect the per-cpu time buckets and average them into a
> > > +	 * single time sample that is normalized to wallclock time.
> > > +	 *
> > > +	 * For averaging, each CPU is weighted by its non-idle time in
> > > +	 * the sampling period. This eliminates artifacts from uneven
> > > +	 * loading, or even entirely idle CPUs.
> > > +	 *
> > > +	 * We could pin the online CPUs here, but the noise introduced
> > > +	 * by missing up to one sample period from CPUs that are going
> > > +	 * away shouldn't matter in practice - just like the noise of
> > > +	 * previously offlined CPUs returning with a non-zero sample.
> > 
> > > But why!? cpus_read_lock() is neither expensive nor complicated. So why
> > > try and avoid it?
> 
> Hm, I don't feel strongly about it either way. I'll add it.

Thinking more about it, this really doesn't buy anything. Whether a
CPU comes online or goes offline during the loop is no different than
that happening right before grabbing the cpus_read_lock(). If we see a
sample from a CPU, we incorporate it, if not we don't.

So it's not so much avoidance as it is a lack of reason for synchronizing
against hotplugging in any fashion. The comment is wrong. The noise
it points to is there with and without the lock, and the only way to
avoid it would be to either do for_each_possible_cpu() in that loop or
have a hotplug callback that flushes the offlining CPU's bucket
into a holding place for missed dead-CPU samples that the aggregation
loop checks every time. Neither of these seems remotely worth the cost.

I'll fix the comment instead.
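
To illustrate why a CPU without a sample simply drops out, here is a minimal userspace sketch of the non-idle weighting described in the quoted comment. struct cpu_sample and weighted_some() are hypothetical names, and the final division by the total non-idle time is my reading of "average them into a single time sample", not code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cpu_sample {
	uint64_t nonidle;	/* non-idle time in the sampling period */
	uint64_t some;		/* stall time in the period, same unit */
};

/* Average the per-CPU stall time, weighting each CPU by its non-idle time. */
static uint64_t weighted_some(const struct cpu_sample *cpus, size_t nr_cpus)
{
	uint64_t weighted = 0, nonidle_total = 0;

	assert(cpus != NULL || nr_cpus == 0);

	for (size_t i = 0; i < nr_cpus; i++) {
		if (!cpus[i].nonidle)
			continue;	/* idle or absent CPU contributes nothing */
		weighted += cpus[i].some * cpus[i].nonidle;
		nonidle_total += cpus[i].nonidle;
	}
	return nonidle_total ? weighted / nonidle_total : 0;
}
```

A CPU that went offline before the pass looks exactly like an idle one here: its weight is zero, so the result is the same whether or not the loop visits it.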


* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-12 17:29 ` [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
                     ` (8 preceding siblings ...)
  2018-07-18 12:46   ` Peter Zijlstra
@ 2018-07-20 20:35   ` Peter Zijlstra
  9 siblings, 0 replies; 83+ messages in thread
From: Peter Zijlstra @ 2018-07-20 20:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +static bool psi_update_stats(struct psi_group *group)
> +{

> +	for_each_online_cpu(cpu) {
> +		struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
> +		unsigned long nonidle;
> +
> +		if (!groupc->nonidle_time)
> +			continue;
> +
> +		nonidle = nsecs_to_jiffies(groupc->nonidle_time);
> +		groupc->nonidle_time = 0;
> +		nonidle_total += nonidle;
> +
> +		for (r = 0; r < NR_PSI_RESOURCES; r++) {
> +			struct psi_resource *res = &groupc->res[r];
> +
> +			some[r] += (res->times[0] + res->times[1]) * nonidle;
> +			full[r] += res->times[1] * nonidle;
> +
> +			/* It's racy, but we can tolerate some error */
> +			res->times[0] = 0;
> +			res->times[1] = 0;
> +		}
> +	}

An alternative that also allows the on-demand update, but without
spamming the rq->lock, would be something like:

struct psi_group_cpu {
	u32 tasks[3];
	u32 cpu_state : 2;
	u32 mem_state : 2;
	u32 io_state  : 2;
	u32 :0;

	u64 last_update_time;

	u32 nonidle;
	u32 full[2];
	u32 some[3];
} ____cacheline_aligned_in_smp;

/* Allocate _2_ copies */
DEFINE_PER_CPU_SHARED_ALIGNED(struct psi_group_cpu[2], psi_cpus);

struct psi_group global_psi = {
	.cpus = &psi_cpus[0],
};


	u64 sums[6] = { 0, };

	for_each_possible_cpu(cpu) {
		struct psi_group_cpu *pgc = per_cpu_ptr(group->cpus, cpu);
		u32 *active, *shadow;

		active = &pgc[0].nonidle;
		shadow = &pgc[1].nonidle;

		/*
		 * Compare the active count to the shadow count. If they
		 * differ, compute the delta and update the shadow copy.
		 * This only writes to the shadow copy (separate line)
		 * and leaves the active copy as a read-only access.
		 */
		for (i = 0; i < 6; i++) {
			u32 old = READ_ONCE(shadow[i]);
			u32 new = READ_ONCE(active[i]);

			delta = (new - old);
			if (!delta) {
				if (!i)
					goto next;
				continue;
			}

			WRITE_ONCE(shadow[i], new);

			sums[i] += delta;
		}
next:		;
	}
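
The inner loop above can be exercised in userspace with a sketch like the following. It is single-threaded, so plain loads and stores stand in for READ_ONCE()/WRITE_ONCE(); collect_deltas() and NR_BUCKETS are hypothetical names, with the six buckets being nonidle, full[2] and some[3].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_BUCKETS 6	/* nonidle + full[2] + some[3] */

/*
 * Fold whatever changed since the previous pass into sums. Only the
 * shadow copy is written; the active copy is read-only here.
 * Returns the number of buckets that changed.
 */
static unsigned int collect_deltas(const uint32_t *active, uint32_t *shadow,
				   uint64_t *sums)
{
	unsigned int changed = 0;

	assert(active != NULL && shadow != NULL && sums != NULL);

	for (size_t i = 0; i < NR_BUCKETS; i++) {
		/* Unsigned subtraction stays correct across u32 wraparound */
		uint32_t delta = active[i] - shadow[i];

		if (!delta) {
			if (i == 0)
				break;	/* no non-idle time: skip this CPU */
			continue;
		}
		shadow[i] = active[i];
		sums[i] += delta;
		changed++;
	}
	return changed;
}
```

Because the active counters are never reset, they may wrap freely; the modulo-2^32 subtraction still yields the right delta as long as less than 2^32 accumulates between passes.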


* Re: [PATCH 02/10] mm: workingset: tell cache transitions from workingset thrashing
  2018-07-12 17:29 ` [PATCH 02/10] mm: workingset: tell cache transitions from workingset thrashing Johannes Weiner
@ 2018-07-23 13:36   ` Arnd Bergmann
  2018-07-23 15:23     ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Arnd Bergmann @ 2018-07-23 13:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds,
	Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, Linux-MM,
	cgroups, Linux Kernel Mailing List, kernel-team, Catalin Marinas,
	Will Deacon, Linux ARM

On Thu, Jul 12, 2018 at 7:29 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> How many page->flags does this leave us with on 32-bit?
>
>         20 bits are always page flags
>
>         21 if you have an MMU
>
>         23 with the zone bits for DMA, Normal, HighMem, Movable
>
>         29 with the sparsemem section bits
>
>         30 if PAE is enabled
>
>         31 with this patch.
>
> So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA
> nodes. If that's not enough, the system can switch to discontigmem and
> re-gain the 6 or 7 sparsemem section bits.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

It seems we ran out of bits on arm64 in randconfig builds:

In file included from /git/arm-soc/include/linux/kernel.h:10,
                 from /git/arm-soc/arch/arm64/mm/init.c:20:
/git/arm-soc/arch/arm64/mm/init.c: In function 'mem_init':
/git/arm-soc/include/linux/compiler.h:357:38: error: call to
'__compiletime_assert_618' declared with attribute error: BUILD_BUG_ON
failed: sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT)
  _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                      ^
/git/arm-soc/include/linux/compiler.h:337:4: note: in definition of
macro '__compiletime_assert'
    prefix ## suffix();    \
    ^~~~~~
/git/arm-soc/include/linux/compiler.h:357:2: note: in expansion of
macro '_compiletime_assert'
  _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
  ^~~~~~~~~~~~~~~~~~~
/git/arm-soc/include/linux/build_bug.h:45:37: note: in expansion of
macro 'compiletime_assert'
 #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                     ^~~~~~~~~~~~~~~~~~
/git/arm-soc/include/linux/build_bug.h:69:2: note: in expansion of
macro 'BUILD_BUG_ON_MSG'
  BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
  ^~~~~~~~~~~~~~~~
/git/arm-soc/arch/arm64/mm/init.c:618:2: note: in expansion of macro
'BUILD_BUG_ON'
  BUILD_BUG_ON(sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT));
  ^~~~~~~~~~~~
/git/arm-soc/scripts/Makefile.build:317: recipe for target
'arch/arm64/mm/init.o' failed

Apparently this triggered

#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <=
BITS_PER_LONG - NR_PAGEFLAGS
#define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
#else
#define LAST_CPUPID_WIDTH 0
#endif

and in turn

#if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
#define LAST_CPUPID_NOT_IN_PAGE_FLAGS
#endif

and that _last_cpupid in struct page made sizeof(struct page) larger than 64.

This is for a randconfig build, see https://pastebin.com/YuwSTah3
for the configuration file, some of the relevant options are

CONFIG_64BIT=y
CONFIG_MEMCG=y
CONFIG_SPARSEMEM=y
CONFIG_ARM64_PA_BITS=52
CONFIG_ARM64_64K_PAGES=y
CONFIG_NR_CPUS=64
CONFIG_NUMA_BALANCING=y
# CONFIG_SPARSEMEM_VMEMMAP is not set
CONFIG_NODES_SHIFT=2
# CONFIG_ARCH_USES_PG_UNCACHED is not set
CONFIG_MEMORY_FAILURE=y
CONFIG_IDLE_PAGE_TRACKING=y

#define MAX_NR_ZONES 3
#define ZONES_SHIFT 2
#define MAX_PHYSMEM_BITS 52
#define SECTION_SIZE_BITS 30
#define SECTIONS_WIDTH 22
#define ZONES_WIDTH 2
#define NODES_SHIFT 2
#define LAST__PID_SHIFT 8
#define NR_CPUS_BITS 6
#define LAST_CPUPID_SHIFT 14
#define NR_PAGEFLAGS 25

With the extra page flag, the sum of SECTIONS_WIDTH, NODES_SHIFT,  ZONES_WIDTH,
LAST_CPUPID_SHIFT, and NR_PAGEFLAGS is now 65. Before this change, I could
not trigger that error in randconfig builds. However, setting CONFIG_NR_CPUS or
CONFIG_NODES_SHIFT higher than the defaults would trigger it as well (randconfig
does not randomize those options).

       Arnd
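
The bit budget above can be written out as a small sketch. The struct and function names are hypothetical; the numbers are the ones listed in this message, and last_cpupid_width() mirrors the quoted #if in runtime form.

```c
#include <assert.h>
#include <stddef.h>

struct page_flags_layout {
	unsigned int sections_width;
	unsigned int zones_width;
	unsigned int nodes_shift;
	unsigned int last_cpupid_shift;
	unsigned int nr_pageflags;
};

/* Total bits page->flags would need to hold every field. */
static unsigned int bits_needed(const struct page_flags_layout *l)
{
	assert(l != NULL);
	return l->sections_width + l->zones_width + l->nodes_shift +
	       l->last_cpupid_shift + l->nr_pageflags;
}

/* The cpupid stays in page->flags only if everything fits in one long. */
static unsigned int last_cpupid_width(const struct page_flags_layout *l,
				      unsigned int bits_per_long)
{
	return bits_needed(l) <= bits_per_long ? l->last_cpupid_shift : 0;
}
```

With { 22, 2, 2, 14, 25 } the total is 65, one more than BITS_PER_LONG on arm64, so the width drops to 0 and the cpupid moves into its own struct page member. With 24 page flags, or with the 22 section bits gone, it fits.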


* Re: [PATCH 02/10] mm: workingset: tell cache transitions from workingset thrashing
  2018-07-23 13:36   ` Arnd Bergmann
@ 2018-07-23 15:23     ` Johannes Weiner
  2018-07-23 15:35       ` Arnd Bergmann
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2018-07-23 15:23 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds,
	Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, Linux-MM,
	cgroups, Linux Kernel Mailing List, kernel-team, Catalin Marinas,
	Will Deacon, Linux ARM

Hi Arnd,

On Mon, Jul 23, 2018 at 03:36:09PM +0200, Arnd Bergmann wrote:
> On Thu, Jul 12, 2018 at 7:29 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > How many page->flags does this leave us with on 32-bit?
> >
> >         20 bits are always page flags
> >
> >         21 if you have an MMU
> >
> >         23 with the zone bits for DMA, Normal, HighMem, Movable
> >
> >         29 with the sparsemem section bits
> >
> >         30 if PAE is enabled
> >
> >         31 with this patch.
> >
> > So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA
> > nodes. If that's not enough, the system can switch to discontigmem and
> > re-gain the 6 or 7 sparsemem section bits.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> It seems we ran out of bits on arm64 in randconfig builds:
> 
> In file included from /git/arm-soc/include/linux/kernel.h:10,
>                  from /git/arm-soc/arch/arm64/mm/init.c:20:
> /git/arm-soc/arch/arm64/mm/init.c: In function 'mem_init':
> /git/arm-soc/include/linux/compiler.h:357:38: error: call to
> '__compiletime_assert_618' declared with attribute error: BUILD_BUG_ON
> failed: sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT)

This BUILD_BUG_ON() is to make sure we're sizing the VMEMMAP struct
page array properly (address space divided by struct page size).

From the code:

/*
 * Log2 of the upper bound of the size of a struct page. Used for sizing
 * the vmemmap region only, does not affect actual memory footprint.
 * We don't use sizeof(struct page) directly since taking its size here
 * requires its definition to be available at this point in the inclusion
 * chain, and it may not be a power of 2 in the first place.
 */
#define STRUCT_PAGE_MAX_SHIFT	6

> Apparently this triggered
> 
> #if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <=
> BITS_PER_LONG - NR_PAGEFLAGS
> #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
> #else
> #define LAST_CPUPID_WIDTH 0
> #endif
> 
> and in turn
> 
> #if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
> #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
> #endif
> 
> and that _last_cpupid in struct page made sizeof(struct page) larger than 64.
> 
> This is for a randconfig build, see https://pastebin.com/YuwSTah3
> for the configuration file, some of the relevant options are
> 
> CONFIG_64BIT=y
> CONFIG_MEMCG=y
> CONFIG_SPARSEMEM=y
> CONFIG_ARM64_PA_BITS=52
> CONFIG_ARM64_64K_PAGES=y
> CONFIG_NR_CPUS=64
> CONFIG_NUMA_BALANCING=y
> # CONFIG_SPARSEMEM_VMEMMAP is not set

However, the check isn't conditional on that config option. And when
VMEMMAP is disabled, we need 22 additional bits to identify the sparse
memory sections in page->flags as well:

> CONFIG_NODES_SHIFT=2
> # CONFIG_ARCH_USES_PG_UNCACHED is not set
> CONFIG_MEMORY_FAILURE=y
> CONFIG_IDLE_PAGE_TRACKING=y
> 
> #define MAX_NR_ZONES 3
> #define ZONES_SHIFT 2
> #define MAX_PHYSMEM_BITS 52
> #define SECTION_SIZE_BITS 30
> #define SECTIONS_WIDTH 22

^^^ Those we get back with VMEMMAP enabled.

So for configs for which the check is intended, it passes. We just
need to make it conditional on those.

---

From 1d24635a6c7cd395bad5c29a3b9e5d2e98d9ab84 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 23 Jul 2018 10:18:23 -0400
Subject: [PATCH] arm64: fix vmemmap BUILD_BUG_ON() triggering on !vmemmap
 setups

Arnd reports the following arm64 randconfig build error with the PSI
patches that add another page flag:

  /git/arm-soc/arch/arm64/mm/init.c: In function 'mem_init':
  /git/arm-soc/include/linux/compiler.h:357:38: error: call to
  '__compiletime_assert_618' declared with attribute error: BUILD_BUG_ON
  failed: sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT)

The additional page flag causes other information stored in
page->flags to get bumped into its own struct page member:

  #if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <=
  BITS_PER_LONG - NR_PAGEFLAGS
  #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
  #else
  #define LAST_CPUPID_WIDTH 0
  #endif

  #if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
  #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
  #endif

which in turn causes the struct page size to exceed the size set in
STRUCT_PAGE_MAX_SHIFT. This value is an estimate used to size the
VMEMMAP page array according to address space and struct page size.

However, the check is performed - and triggers here - on a !VMEMMAP
config, which consumes an additional 22 page bits for the sparse
section id. When VMEMMAP is enabled, those bits are returned, cpupid
doesn't need its own member, and the page passes the VMEMMAP check.

Restrict that check to the situation it was meant to check: that we
are sizing the VMEMMAP page array correctly.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 arch/arm64/mm/init.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 1b18b4722420..72c9b6778b0a 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -611,11 +611,13 @@ void __init mem_init(void)
 	BUILD_BUG_ON(TASK_SIZE_32			> TASK_SIZE_64);
 #endif
 
+#ifndef CONFIG_SPARSEMEM_VMEMMAP
 	/*
 	 * Make sure we chose the upper bound of sizeof(struct page)
-	 * correctly.
+	 * correctly when sizing the VMEMMAP array.
 	 */
 	BUILD_BUG_ON(sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT));
+#endif
 
 	if (PAGE_SIZE >= 16384 && get_num_physpages() <= 128) {
 		extern int sysctl_overcommit_memory;
-- 
2.18.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/10] mm: workingset: tell cache transitions from workingset thrashing
  2018-07-23 15:23     ` Johannes Weiner
@ 2018-07-23 15:35       ` Arnd Bergmann
  2018-07-23 16:27         ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Arnd Bergmann @ 2018-07-23 15:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Suren Baghdasaryan, Mike Galbraith, Will Deacon,
	Linux Kernel Mailing List, kernel-team, Linux-MM, Vinayak Menon,
	Ingo Molnar, Shakeel Butt, Catalin Marinas, Tejun Heo, cgroups,
	Andrew Morton, Linus Torvalds, Christopher Lameter, Linux ARM

On Mon, Jul 23, 2018 at 5:23 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Mon, Jul 23, 2018 at 03:36:09PM +0200, Arnd Bergmann wrote:
>> On Thu, Jul 12, 2018 at 7:29 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> In file included from /git/arm-soc/include/linux/kernel.h:10,
>>                  from /git/arm-soc/arch/arm64/mm/init.c:20:
>> /git/arm-soc/arch/arm64/mm/init.c: In function 'mem_init':
>> /git/arm-soc/include/linux/compiler.h:357:38: error: call to
>> '__compiletime_assert_618' declared with attribute error: BUILD_BUG_ON
>> failed: sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT)
>
> This BUILD_BUG_ON() is to make sure we're sizing the VMEMMAP struct
> page array properly (address space divided by struct page size).
>
> From the code:
>
> /*
>  * Log2 of the upper bound of the size of a struct page. Used for sizing
>  * the vmemmap region only, does not affect actual memory footprint.
>  * We don't use sizeof(struct page) directly since taking its size here
>  * requires its definition to be available at this point in the inclusion
>  * chain, and it may not be a power of 2 in the first place.
>  */
> #define STRUCT_PAGE_MAX_SHIFT   6
>
...
> However, the check isn't conditional on that config option. And when
> VMEMMAP is disabled, we need 22 additional bits to identify the sparse
> memory sections in page->flags as well:
>
>> CONFIG_NODES_SHIFT=2
>> # CONFIG_ARCH_USES_PG_UNCACHED is not set
>> CONFIG_MEMORY_FAILURE=y
>> CONFIG_IDLE_PAGE_TRACKING=y
>>
>> #define MAX_NR_ZONES 3
>> #define ZONES_SHIFT 2
>> #define MAX_PHYSMEM_BITS 52
>> #define SECTION_SIZE_BITS 30
>> #define SECTIONS_WIDTH 22
>
> ^^^ Those we get back with VMEMMAP enabled.
>
> So for configs for which the check is intended, it passes. We just
> need to make it conditional to those.

Ok, thanks for the analysis, I had missed that and was about to
send a different patch to increase STRUCT_PAGE_MAX_SHIFT
in some configurations, which is not as good.

> From 1d24635a6c7cd395bad5c29a3b9e5d2e98d9ab84 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 23 Jul 2018 10:18:23 -0400
> Subject: [PATCH] arm64: fix vmemmap BUILD_BUG_ON() triggering on !vmemmap
>  setups
>
> Arnd reports the following arm64 randconfig build error with the PSI
> patches that add another page flag:
>

You could add further text here that I had just added to my
patch description (not sent):

    Further experiments show that the build error already existed before,
    but was only triggered with larger values of CONFIG_NR_CPUS and/or
    CONFIG_NODES_SHIFT that might be used in actual configurations but
    not in randconfig builds.

    With longer CPU and node masks, I could recreate the problem with
    kernels as old as linux-4.7 when arm64 NUMA support got added.

    Cc: stable@vger.kernel.org
    Fixes: 1a2db300348b ("arm64, numa: Add NUMA support for arm64 platforms.")
    Fixes: 3e1907d5bf5a ("arm64: mm: move vmemmap region right below the linear region")

>  arch/arm64/mm/init.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 1b18b4722420..72c9b6778b0a 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -611,11 +611,13 @@ void __init mem_init(void)
>         BUILD_BUG_ON(TASK_SIZE_32                       > TASK_SIZE_64);
>  #endif
>
> +#ifndef CONFIG_SPARSEMEM_VMEMMAP
>         /*

I tested it on two broken configurations, and found that you have
a typo here, it should be 'ifdef', not 'ifndef'. With that change, it
seems to build fine.

Tested-by: Arnd Bergmann <arnd@arndb.de>

      Arnd

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/10] mm: workingset: tell cache transitions from workingset thrashing
  2018-07-23 15:35       ` Arnd Bergmann
@ 2018-07-23 16:27         ` Johannes Weiner
  2018-07-24 15:04           ` Will Deacon
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2018-07-23 16:27 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Peter Zijlstra, Suren Baghdasaryan, Mike Galbraith, Will Deacon,
	Linux Kernel Mailing List, kernel-team, Linux-MM, Vinayak Menon,
	Ingo Molnar, Shakeel Butt, Catalin Marinas, Tejun Heo, cgroups,
	Andrew Morton, Linus Torvalds, Christopher Lameter, Linux ARM

On Mon, Jul 23, 2018 at 05:35:35PM +0200, Arnd Bergmann wrote:
> On Mon, Jul 23, 2018 at 5:23 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > From 1d24635a6c7cd395bad5c29a3b9e5d2e98d9ab84 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Mon, 23 Jul 2018 10:18:23 -0400
> > Subject: [PATCH] arm64: fix vmemmap BUILD_BUG_ON() triggering on !vmemmap
> >  setups
> >
> > Arnd reports the following arm64 randconfig build error with the PSI
> > patches that add another page flag:
> >
> 
> You could add further text here that I had just added to my
> patch description (not sent):
> 
>     Further experiments show that the build error already existed before,
>     but was only triggered with larger values of CONFIG_NR_CPUS and/or
>     CONFIG_NODES_SHIFT that might be used in actual configurations but
>     not in randconfig builds.
> 
>     With longer CPU and node masks, I could recreate the problem with
>     kernels as old as linux-4.7 when arm64 NUMA support got added.
> 
>     Cc: stable@vger.kernel.org
>     Fixes: 1a2db300348b ("arm64, numa: Add NUMA support for arm64 platforms.")
>     Fixes: 3e1907d5bf5a ("arm64: mm: move vmemmap region right below the linear region")

Sure thing.

> >  arch/arm64/mm/init.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 1b18b4722420..72c9b6778b0a 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -611,11 +611,13 @@ void __init mem_init(void)
> >         BUILD_BUG_ON(TASK_SIZE_32                       > TASK_SIZE_64);
> >  #endif
> >
> > +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> >         /*
> 
> I tested it on two broken configurations, and found that you have
> a typo here, it should be 'ifdef', not 'ifndef'. With that change, it
> seems to build fine.
> 
> Tested-by: Arnd Bergmann <arnd@arndb.de>

Thanks for testing it, I don't have a cross-compile toolchain set up.

---

From 34c4c4549f09f971d2d391a8d652d56cb9b05475 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 23 Jul 2018 10:18:23 -0400
Subject: [PATCH] arm64: fix vmemmap BUILD_BUG_ON() triggering on !vmemmap
 setups

Arnd reports the following arm64 randconfig build error with the PSI
patches that add another page flag:

  /git/arm-soc/arch/arm64/mm/init.c: In function 'mem_init':
  /git/arm-soc/include/linux/compiler.h:357:38: error: call to
  '__compiletime_assert_618' declared with attribute error: BUILD_BUG_ON
  failed: sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT)

The additional page flag causes other information stored in
page->flags to get bumped into their own struct page member:

  #if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <=
  BITS_PER_LONG - NR_PAGEFLAGS
  #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
  #else
  #define LAST_CPUPID_WIDTH 0
  #endif

  #if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
  #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
  #endif

which in turn causes the struct page size to exceed the size set in
STRUCT_PAGE_MAX_SHIFT. This value is an estimate used to size the
VMEMMAP page array according to address space and struct page size.

However, the check is performed - and triggers here - on a !VMEMMAP
config, which consumes an additional 22 page bits for the sparse
section id. When VMEMMAP is enabled, those bits are returned, cpupid
doesn't need its own member, and the page passes the VMEMMAP check.

Restrict that check to the situation it was meant to check: that we
are sizing the VMEMMAP page array correctly.

Says Arnd:

    Further experiments show that the build error already existed before,
    but was only triggered with larger values of CONFIG_NR_CPUS and/or
    CONFIG_NODES_SHIFT that might be used in actual configurations but
    not in randconfig builds.

    With longer CPU and node masks, I could recreate the problem with
    kernels as old as linux-4.7 when arm64 NUMA support got added.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Arnd Bergmann <arnd@arndb.de>
Cc: stable@vger.kernel.org
Fixes: 1a2db300348b ("arm64, numa: Add NUMA support for arm64 platforms.")
Fixes: 3e1907d5bf5a ("arm64: mm: move vmemmap region right below the linear region")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 arch/arm64/mm/init.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 1b18b4722420..86d9f9d303b0 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -611,11 +611,13 @@ void __init mem_init(void)
 	BUILD_BUG_ON(TASK_SIZE_32			> TASK_SIZE_64);
 #endif
 
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
 	/*
 	 * Make sure we chose the upper bound of sizeof(struct page)
-	 * correctly.
+	 * correctly when sizing the VMEMMAP array.
 	 */
 	BUILD_BUG_ON(sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT));
+#endif
 
 	if (PAGE_SIZE >= 16384 && get_num_physpages() <= 128) {
 		extern int sysctl_overcommit_memory;
-- 
2.18.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
                   ` (12 preceding siblings ...)
  2018-07-16 15:57 ` Daniel Drake
@ 2018-07-23 21:14 ` Balbir Singh
  2018-07-24 15:15   ` Johannes Weiner
  2018-07-27 22:01 ` Pavel Machek
  14 siblings, 1 reply; 83+ messages in thread
From: Balbir Singh @ 2018-07-23 21:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Peter Zijlstra, akpm, Linus Torvalds, Tejun Heo,
	surenb, Vinayak Menon, Christoph Lameter, Mike Galbraith,
	Shakeel Butt, linux-mm, cgroups, linux-kernel, kernel-team

On Fri, Jul 13, 2018 at 3:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> PSI aggregates and reports the overall wallclock time in which the
> tasks in a system (or cgroup) wait for contended hardware resources.
>
> This helps users understand the resource pressure their workloads are
> under, which allows them to rootcause and fix throughput and latency
> problems caused by overcommitting, underprovisioning, suboptimal job
> placement in a grid, as well as anticipate major disruptions like OOM.
>
> This version 2 of the series incorporates a ton of feedback from
> PeterZ and SurenB; more details at the end of this email.
>
>                 Real-world applications
>
> We're using the data collected by psi (and its previous incarnation,
> memdelay) quite extensively at Facebook, with several success stories.
>
> One usecase is avoiding OOM hangs/livelocks. The reason these happen
> is because the OOM killer is triggered by reclaim not being able to
> free pages, but with fast flash devices there is *always* some clean
> and uptodate cache to reclaim; the OOM killer never kicks in, even as
> tasks spend 90% of the time thrashing the cache pages of their own
> executables. There is no situation where this ever makes sense in
> practice. We wrote a <100 line POC python script to monitor memory
> pressure and kill stuff way before such pathological thrashing leads
> to full system losses that require forcible hard resets.
>
> We've since extended and deployed this code into other places to
> guarantee latency and throughput SLAs, since they're usually violated
> way before the kernel OOM killer would ever kick in.
>
> The idea is to eventually incorporate this back into the kernel, so
> that Linux can avoid OOM livelocks (which TECHNICALLY aren't memory
> deadlocks, but for the user indistinguishable) out of the box.
>
> We also use psi memory pressure for loadshedding. Our batch job
> infrastructure used to use heuristics based on various VM stats to
> anticipate OOM situations, with lackluster success. We switched it to
> psi and managed to anticipate and avoid OOM kills and hangs fairly
> reliably. The reduction of OOM outages in the worker pool raised the
> pool's aggregate productivity, and we were able to switch that service
> to smaller machines.
>
> Lastly, we use cgroups to isolate a machine's main workload from
> maintenance crap like package upgrades, logging, configuration, as
> well as to prevent multiple workloads on a machine from stepping on
> each others' toes. We were not able to configure this properly without
> the pressure metrics; we would see latency or bandwidth drops, but it
> would often be hard to impossible to rootcause it post-mortem.
>
> We now log and graph pressure for the containers in our fleet and can
> trivially link latency spikes and throughput drops to shortages of
> specific resources after the fact, and fix the job config/scheduling.
>
> I've also received feedback and feature requests from Android for the
> purpose of low-latency OOM killing. The on-demand stats aggregation in
> the last patch of this series is for this purpose, to allow Android to
> react to pressure before the system starts visibly hanging.
>
>                 How do you use this feature?
>
> A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
> 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have
> cpu.pressure, memory.pressure and io.pressure files, which simply
> aggregate task stalls at the cgroup level instead of system-wide.
>
> The cpu file contains one line:
>
>         some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
>
> The averages give the percentage of walltime in which one or more
> tasks are delayed on the runqueue while another task has the
> CPU. They're recent averages over 10s, 1m, 5m windows, so you can tell
> short term trends from long term ones, similarly to the load average.
>

Does the mechanism scale? I am a little concerned about how frequently
this infrastructure is monitored/read/acted upon. Why aren't existing
mechanisms sufficient -- why is the avg delay calculation in the
kernel?

> The total= value gives the absolute stall time in microseconds. This
> allows detecting latency spikes that might be too short to sway the
> running averages. It also allows custom time averaging in case the
> 10s/1m/5m windows aren't adequate for the usecase (or are too coarse
> with future hardware).
>
> What to make of this "some" metric? If CPU utilization is at 100% and
> CPU pressure is 0, it means the system is perfectly utilized, with one
> runnable thread per CPU and nobody waiting. At two or more runnable
> tasks per CPU, the system is 100% overcommitted and the pressure
> average will indicate as much. From a utilization perspective this is
> a great state of course: no CPU cycles are being wasted, even when 50%
> of the threads were to go idle (as most workloads do vary). From the
> perspective of the individual job it's not great, however, and they
> would do better with more resources. Depending on what your priority
> and options are, raised "some" numbers may or may not require action.
>
> The memory file contains two lines:
>
> some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
> full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
>
> The some line is the same as for cpu, the time in which at least one
> task is stalled on the resource. In the case of memory, this includes
> waiting on swap-in, page cache refaults and page reclaim.
>
> The full line, however, indicates time in which *nobody* is using the
> CPU productively due to pressure: all non-idle tasks are waiting for
> memory in one form or another. Significant time spent in there is a
> good trigger for killing things, moving jobs to other machines, or
> dropping incoming requests, since neither the jobs nor the machine
> overall are making too much headway.
>
> The io file is similar to memory. Because the block layer doesn't have
> a concept of hardware contention right now (how much longer is my IO
> request taking due to other tasks?), it reports CPU potential lost on
> all IO delays, not just the potential lost due to competition.
>

There is no talk about the overhead this introduces in general; maybe
the details are in the patches. I'll read through them.

Balbir Singh.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/10] mm: workingset: tell cache transitions from workingset thrashing
  2018-07-23 16:27         ` Johannes Weiner
@ 2018-07-24 15:04           ` Will Deacon
  2018-07-25 16:06             ` Will Deacon
  0 siblings, 1 reply; 83+ messages in thread
From: Will Deacon @ 2018-07-24 15:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Arnd Bergmann, Peter Zijlstra, Suren Baghdasaryan,
	Mike Galbraith, Linux Kernel Mailing List, kernel-team, Linux-MM,
	Vinayak Menon, Ingo Molnar, Shakeel Butt, Catalin Marinas,
	Tejun Heo, cgroups, Andrew Morton, Linus Torvalds,
	Christopher Lameter, Linux ARM

On Mon, Jul 23, 2018 at 12:27:35PM -0400, Johannes Weiner wrote:
> On Mon, Jul 23, 2018 at 05:35:35PM +0200, Arnd Bergmann wrote:
> > On Mon, Jul 23, 2018 at 5:23 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > > index 1b18b4722420..72c9b6778b0a 100644
> > > --- a/arch/arm64/mm/init.c
> > > +++ b/arch/arm64/mm/init.c
> > > @@ -611,11 +611,13 @@ void __init mem_init(void)
> > >         BUILD_BUG_ON(TASK_SIZE_32                       > TASK_SIZE_64);
> > >  #endif
> > >
> > > +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> > >         /*
> > 
> > I tested it on two broken configurations, and found that you have
> > a typo here, it should be 'ifdef', not 'ifndef'. With that change, it
> > seems to build fine.
> > 
> > Tested-by: Arnd Bergmann <arnd@arndb.de>
> 
> Thanks for testing it, I don't have a cross-compile toolchain set up.
> 
> ---

Thanks Arnd, Johannes. I can pick this up for -rc7 via the arm64 tree,
unless it's already queued elsewhere?

Will

> From 34c4c4549f09f971d2d391a8d652d56cb9b05475 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 23 Jul 2018 10:18:23 -0400
> Subject: [PATCH] arm64: fix vmemmap BUILD_BUG_ON() triggering on !vmemmap
>  setups
> 
> Arnd reports the following arm64 randconfig build error with the PSI
> patches that add another page flag:
> 
>   /git/arm-soc/arch/arm64/mm/init.c: In function 'mem_init':
>   /git/arm-soc/include/linux/compiler.h:357:38: error: call to
>   '__compiletime_assert_618' declared with attribute error: BUILD_BUG_ON
>   failed: sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT)
> 
> The additional page flag causes other information stored in
> page->flags to get bumped into their own struct page member:
> 
>   #if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <=
>   BITS_PER_LONG - NR_PAGEFLAGS
>   #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
>   #else
>   #define LAST_CPUPID_WIDTH 0
>   #endif
> 
>   #if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
>   #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
>   #endif
> 
> which in turn causes the struct page size to exceed the size set in
> STRUCT_PAGE_MAX_SHIFT. This value is an estimate used to size the
> VMEMMAP page array according to address space and struct page size.
> 
> However, the check is performed - and triggers here - on a !VMEMMAP
> config, which consumes an additional 22 page bits for the sparse
> section id. When VMEMMAP is enabled, those bits are returned, cpupid
> doesn't need its own member, and the page passes the VMEMMAP check.
> 
> Restrict that check to the situation it was meant to check: that we
> are sizing the VMEMMAP page array correctly.
> 
> Says Arnd:
> 
>     Further experiments show that the build error already existed before,
>     but was only triggered with larger values of CONFIG_NR_CPUS and/or
>     CONFIG_NODES_SHIFT that might be used in actual configurations but
>     not in randconfig builds.
> 
>     With longer CPU and node masks, I could recreate the problem with
>     kernels as old as linux-4.7 when arm64 NUMA support got added.
> 
> Reported-by: Arnd Bergmann <arnd@arndb.de>
> Tested-by: Arnd Bergmann <arnd@arndb.de>
> Cc: stable@vger.kernel.org
> Fixes: 1a2db300348b ("arm64, numa: Add NUMA support for arm64 platforms.")
> Fixes: 3e1907d5bf5a ("arm64: mm: move vmemmap region right below the linear region")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  arch/arm64/mm/init.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 1b18b4722420..86d9f9d303b0 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -611,11 +611,13 @@ void __init mem_init(void)
>  	BUILD_BUG_ON(TASK_SIZE_32			> TASK_SIZE_64);
>  #endif
>  
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
>  	/*
>  	 * Make sure we chose the upper bound of sizeof(struct page)
> -	 * correctly.
> +	 * correctly when sizing the VMEMMAP array.
>  	 */
>  	BUILD_BUG_ON(sizeof(struct page) > (1 << STRUCT_PAGE_MAX_SHIFT));
> +#endif
>  
>  	if (PAGE_SIZE >= 16384 && get_num_physpages() <= 128) {
>  		extern int sysctl_overcommit_memory;
> -- 
> 2.18.0
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-23 21:14 ` Balbir Singh
@ 2018-07-24 15:15   ` Johannes Weiner
  2018-07-26  1:07     ` Singh, Balbir
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2018-07-24 15:15 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Ingo Molnar, Peter Zijlstra, akpm, Linus Torvalds, Tejun Heo,
	surenb, Vinayak Menon, Christoph Lameter, Mike Galbraith,
	Shakeel Butt, linux-mm, cgroups, linux-kernel, kernel-team

Hi Balbir,

On Tue, Jul 24, 2018 at 07:14:02AM +1000, Balbir Singh wrote:
> Does the mechanism scale? I am a little concerned about how frequently
> this infrastructure is monitored/read/acted upon.

I expect most users to poll in the frequency ballpark of the running
averages (10s, 1m, 5m). Our OOMD defaults to 5s polling of the 10s
average; we collect the 1m average once per minute from our machines
and cgroups to log the system/workload health trends in our fleet.

Suren has been experimenting with adaptive polling down to the
millisecond range on Android.
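
For pollers that want their own window instead of the running
averages, a rough userspace sketch could take two samples of the
total= field and divide the difference by the polling interval. The
helper names below are hypothetical, and the line format is assumed to
match the example in the cover letter.

```c
#include <stdio.h>
#include <stdint.h>

/*
 * Extract the total= field (stall time in microseconds) from one
 * pressure line such as
 *   "some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722"
 * Returns 0 on success, -1 if the line does not match.
 */
static int psi_parse_total(const char *line, uint64_t *total_us)
{
	double avg10, avg60, avg300;
	unsigned long long total;

	/* %*s skips the leading "some"/"full" keyword. */
	if (sscanf(line, "%*s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   &avg10, &avg60, &avg300, &total) != 4)
		return -1;

	*total_us = total;
	return 0;
}

/* Percentage of the interval spent stalled between two total= samples. */
static double psi_stall_percent(uint64_t prev_us, uint64_t cur_us,
				uint64_t interval_us)
{
	if (interval_us == 0 || cur_us < prev_us)
		return 0.0;

	return 100.0 * (double)(cur_us - prev_us) / (double)interval_us;
}
```

A poller would read the file, call psi_parse_total(), sleep for its
interval, read again, and feed both totals to psi_stall_percent().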

> Why aren't existing mechanisms sufficient

Our existing stuff gives a lot of indication when something *may* be
an issue, like the rate of page reclaim, the number of refaults, the
average number of active processes, one task waiting on a resource.

But the real difference between an issue and a non-issue is how much
it affects your overall goal of making forward progress or reacting to
a request in time. And that's the only thing users really care
about. It doesn't matter whether my system is doing 2314 or 6723 page
refaults per minute, or scanned 8495 pages recently. I need to know
whether I'm losing 1% or 20% of my time on overcommitted memory.

Delayacct is time-based, so it's a step in the right direction, but it
doesn't aggregate tasks and CPUs into compound productivity states to
tell you if only parts of your workload are seeing delays (which is
often tolerable for the purpose of ensuring maximum HW utilization) or
your system overall is not making forward progress. That aggregation
isn't something you can do in userspace with polled delayacct data.

> -- why is the avg delay calculation in the kernel?

For one, as per above, most users will probably be using the standard
averaging windows, and we already have this highly optimizd
infrastructure from the load average. I don't see why we shouldn't use
that instead of exporting an obscure number that requires most users
to have an additional library or copy-paste the loadavg code.

I also mentioned the OOM killer as a likely in-kernel user of the
pressure percentages to protect from memory livelocks out of the box,
in which case we have to do this calculation in the kernel anyway.
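
For readers unfamiliar with the loadavg-style calculation, the general
shape is an exponentially decaying average in fixed point. The sketch
below is a userspace illustration only, not the code from this series:
the 11-bit fixed point, the 2s sampling period and the constant and
function names are assumptions made for the example.

```c
#define FSHIFT   11			/* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)	/* 1.0 in fixed point */
#define EXP_10S  1677UL			/* FIXED_1 * exp(-2s/10s), rounded */

/*
 * One update step of an exponentially decaying average:
 *   avg = avg * decay + sample * (1 - decay)
 * where avg, sample and decay are all FSHIFT fixed-point values.
 */
static unsigned long ema_step(unsigned long avg, unsigned long decay,
			      unsigned long sample)
{
	return (avg * decay + sample * (FIXED_1 - decay)) >> FSHIFT;
}
```

Calling ema_step() once per period with the stall percentage of that
period, scaled by FIXED_1, yields a running average; dividing the
result by FIXED_1 gives the percentage back.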

> There is no talk about the overhead this introduces in general, may be
> the details are in the patches. I'll read through them

I sent an email on benchmarks and overhead in one of the subthreads, I
will include that information in the cover letter in v3.

https://lore.kernel.org/lkml/20180718215644.GB2838@cmpxchg.org/

Thanks!

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 09/10] psi: cgroup support
  2018-07-17 15:40   ` Peter Zijlstra
@ 2018-07-24 15:54     ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-24 15:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

Hi Peter,

On Tue, Jul 17, 2018 at 05:40:59PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 12, 2018 at 01:29:41PM -0400, Johannes Weiner wrote:
> > +/**
> > + * cgroup_move_task - move task to a different cgroup
> > + * @task: the task
> > + * @to: the target css_set
> > + *
> > + * Move task to a new cgroup and safely migrate its associated stall
> > + * state between the different groups.
> > + *
> > + * This function acquires the task's rq lock to lock out concurrent
> > + * changes to the task's scheduling state and - in case the task is
> > + * running - concurrent changes to its stall state.
> > + */
> > +void cgroup_move_task(struct task_struct *task, struct css_set *to)
> > +{
> > +	unsigned int task_flags = 0;
> > +	struct rq_flags rf;
> > +	struct rq *rq;
> > +	u64 now;
> > +
> > +	rq = task_rq_lock(task, &rf);
> > +
> > +	if (task_on_rq_queued(task)) {
> > +		task_flags = TSK_RUNNING;
> > +	} else if (task->in_iowait) {
> > +		task_flags = TSK_IOWAIT;
> > +	}
> > +	if (task->flags & PF_MEMSTALL)
> > +		task_flags |= TSK_MEMSTALL;
> > +
> > +	if (task_flags) {
> > +		update_rq_clock(rq);
> > +		now = rq_clock(rq);
> > +		psi_task_change(task, now, task_flags, 0);
> > +	}
> > +
> > +	/*
> > +	 * Lame to do this here, but the scheduler cannot be locked
> > +	 * from the outside, so we move cgroups from inside sched/.
> > +	 */
> > +	rcu_assign_pointer(task->cgroups, to);
> > +
> > +	if (task_flags)
> > +		psi_task_change(task, now, 0, task_flags);
> > +
> > +	task_rq_unlock(rq, task, &rf);
> > +}
> 
> Why is that not part of cpu_cgroup_attach() / sched_move_task() ?

Hm, there is some overlap, but it's not the same operation.

cpu_cgroup_attach() handles rq migration between cgroups that have the
cpu controller enabled, but psi needs to migrate task counts around
for memory and IO as well, as we always need to know nr_runnable.

The cpu controller is super expensive, though, and e.g. we had to
disable it for cost purposes while still running psi, so it wouldn't
be great to need full hierarchical per-cgroup scheduling policy just
to know the runnable count in a group.

Likewise, I don't think we'd want to change the cgroup core to call
->attach for *all* cgroups and have the callback figure out whether
the controller is actually enabled on them or not for this one case.
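
The clear/reassign/set ordering in cgroup_move_task() can be modeled
in plain userspace C. This is a simplified sketch with made-up types
and names; it leaves out the rq lock, the clock and RCU entirely, and
the flag values are assumptions, but it shows why the task's state has
to be removed from the old group before the pointer changes and added
to the new group afterwards.

```c
/* Hypothetical task state bits and per-group counters. */
#define TSK_IOWAIT	(1u << 0)
#define TSK_MEMSTALL	(1u << 1)
#define TSK_RUNNING	(1u << 2)
#define NR_STATES	3

struct group {
	unsigned int tasks[NR_STATES];	/* tasks currently in each state */
};

struct task {
	struct group *grp;		/* group the task is accounted to */
	unsigned int state;		/* TSK_* bits currently set */
};

/* Drop the counts for the 'clear' bits, then add the 'set' bits. */
static void group_change(struct group *g, unsigned int clear,
			 unsigned int set)
{
	for (int i = 0; i < NR_STATES; i++) {
		if (clear & (1u << i))
			g->tasks[i]--;
		if (set & (1u << i))
			g->tasks[i]++;
	}
}

/* Move a task and carry its state counts over to the new group. */
static void move_task(struct task *t, struct group *to)
{
	unsigned int flags = t->state;

	if (flags)
		group_change(t->grp, flags, 0);	/* leave old group */

	t->grp = to;

	if (flags)
		group_change(t->grp, 0, flags);	/* enter new group */
}
```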

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
  2018-07-19 20:31       ` Peter Zijlstra
@ 2018-07-24 16:01         ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2018-07-24 16:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
	Suren Baghdasaryan, Vinayak Menon, Christopher Lameter,
	Mike Galbraith, Shakeel Butt, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Jul 19, 2018 at 10:31:15PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 19, 2018 at 02:47:40PM -0400, Johannes Weiner wrote:
> > On Wed, Jul 18, 2018 at 02:03:18PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> > > > +	/* Update task counts according to the set/clear bitmasks */
> > > > +	for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> > > > +		int idx = to + (bo - 1);
> > > > +
> > > > +		if (tasks[idx] == 0 && !psi_bug) {
> > > > +			printk_deferred(KERN_ERR "psi: task underflow! cpu=%d idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> > > > +					cpu, idx, tasks[0], tasks[1], tasks[2],
> > > > +					clear, set);
> > > > +			psi_bug = 1;
> > > > +		}
> > > 
> > > 		WARN_ONCE(!tasks[idx], ...);
> > 
> > It's just open-coded because of the printk_deferred, since this is
> > inside the scheduler.
> 
> Yeah, meh. There's ton of WARNs in the scheduler, WARNs should not
> trigger anyway.

This one in particular gave us quite a runaround. We had a subtle bug
in how psi processed task CPU migration that would only manifest with
hundreds of thousands of machine hours. When it triggered, instead of
the warning, we'd crash on a corrupted stack with a completely useless
crash dump - PC pointing to things that couldn't possibly trap etc.

So printk_deferred has been a lot more useful in those rare but
desperate cases ;-) Plus we keep the machine alive.
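
As an aside, the ffs()-based walk over the clear bitmask quoted above
can be tried out in userspace. The sketch below uses the POSIX ffs()
from <strings.h>; the helper name is hypothetical. Unlike the quoted
loop it shifts in two steps, so a mask with bit 31 set does not end up
shifting by the full width of the type.

```c
#include <strings.h>	/* ffs() */

/*
 * Store the indices of the set bits in 'mask' into out[], lowest bit
 * first, and return how many were stored (at most 'max').
 */
static int collect_set_bits(unsigned int mask, int *out, int max)
{
	int to, bo, n = 0;

	/*
	 * 'to' counts the bits already shifted out; 'bo' is the 1-based
	 * position of the lowest set bit in what remains.
	 */
	for (to = 0; n < max && (bo = ffs((int)mask));
	     to += bo, mask >>= bo - 1, mask >>= 1)
		out[n++] = to + (bo - 1);

	return n;
}
```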

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/10] mm: workingset: tell cache transitions from workingset thrashing
  2018-07-24 15:04           ` Will Deacon
@ 2018-07-25 16:06             ` Will Deacon
  0 siblings, 0 replies; 83+ messages in thread
From: Will Deacon @ 2018-07-25 16:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Arnd Bergmann, Peter Zijlstra, Suren Baghdasaryan,
	Mike Galbraith, Linux Kernel Mailing List, kernel-team, Linux-MM,
	Vinayak Menon, Ingo Molnar, Shakeel Butt, Catalin Marinas,
	Tejun Heo, cgroups, Andrew Morton, Linus Torvalds,
	Christopher Lameter, Linux ARM

On Tue, Jul 24, 2018 at 04:04:48PM +0100, Will Deacon wrote:
> On Mon, Jul 23, 2018 at 12:27:35PM -0400, Johannes Weiner wrote:
> > On Mon, Jul 23, 2018 at 05:35:35PM +0200, Arnd Bergmann wrote:
> > > On Mon, Jul 23, 2018 at 5:23 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > > > index 1b18b4722420..72c9b6778b0a 100644
> > > > --- a/arch/arm64/mm/init.c
> > > > +++ b/arch/arm64/mm/init.c
> > > > @@ -611,11 +611,13 @@ void __init mem_init(void)
> > > >         BUILD_BUG_ON(TASK_SIZE_32                       > TASK_SIZE_64);
> > > >  #endif
> > > >
> > > > +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> > > >         /*
> > > 
> > > I tested it on two broken configurations, and found that you have
> > > a typo here, it should be 'ifdef', not 'ifndef'. With that change, it
> > > seems to build fine.
> > > 
> > > Tested-by: Arnd Bergmann <arnd@arndb.de>
> > 
> > Thanks for testing it, I don't have a cross-compile toolchain set up.
> > 
> > ---
> 
> Thanks Arnd, Johannes. I can pick this up for -rc7 via the arm64 tree,
> unless it's already queued elsewhere?

I've pushed this to the arm64 for-next/fixes branch heading for -rc7.

Will

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-17 12:23       ` Michal Hocko
@ 2018-07-25 22:57         ` Daniel Drake
  0 siblings, 0 replies; 83+ messages in thread
From: Daniel Drake @ 2018-07-25 22:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: hannes, Linux Kernel, linux-mm, cgroups, Linux Upstreaming Team,
	linux-block, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Tejun Heo, Balbir Singh, Mike Galbraith, Oliver Yang,
	Shakeel Butt, xxx xxx, Taras Kondratiuk, Daniel Walker,
	Vinayak Menon, Ruslan Ruslichenko, kernel-team

On Tue, Jul 17, 2018 at 7:23 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Tue 17-07-18 07:13:52, Daniel Drake wrote:
>> On Tue, Jul 17, 2018 at 6:25 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> > Yes this is really unfortunate. One thing that could help would be to
>> > consider a thrashing level during the reclaim (get_scan_count) to simply
>> > forget about LRUs which are constantly refaulting pages back. We already
>> > have the infrastructure for that. We just need to plumb it in.
>>
>> Can you go into a bit more detail about that infrastructure and how we
>> might detect which pages are being constantly refaulted? I'm
>> interested in spending a few hours on this topic to see if I can come
>> up with anything.
>
> mm/workingset.c allows for tracking when an actual page got evicted.
> workingset_refault tells us whether a given filemap fault is a recent
> refault and activates the page if that is the case. So what you need is
> to note how many refaulted pages we have on the active LRU list. If that
> is a large part of the list and if the inactive list is really small
> then we know we are thrashing.

Thanks for the guidance. So this sounds like it is something that
should be done on a timer (or on some other condition?), check the
state of the active LRU list as described and if things are bad then
invoke the OOM killer?

I'm having trouble linking that idea to your original suggestion:

> One thing that could help would be to consider a thrashing level during the reclaim
> (get_scan_count) to simply forget about LRUs which are constantly refaulting
> pages back.

which I interpret to mean that the for_each_evictable_lru loop in
get_scan_count should skip over constantly-refaulty LRUs rather than
add them to nr[] and lru_pages, which I assume would then cause direct
reclaim to fail when we are thrashing, leading to OOM kill?

Are these two different ideas, or am I just misunderstanding something basic?

That confusion aside, studying the code to understand how I can
determine if a page is being constantly refaulted or not, I see that
the well documented condition for this (in workingset_refault) is:

  ((refault - eviction) & EVICTION_MASK) <= active_file

refault and active_file are just values from the lruvec, which seems
easily accessible. However, the eviction value is taken at the point of
page eviction and then stored in the shadow entries kept in
the page cache for pages that have been evicted, but the shadow entry
is then lost when the page is reactivated.
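
To make the condition above concrete, here is a small sketch of the
masked refault-distance check. It is an assumption-laden illustration,
not the code in mm/workingset.c: refault_is_recent() is a name invented
here, and EVICTION_BITS is set to a hypothetical 20 (the real width
depends on the kernel configuration). The point it shows is that the
mask makes the subtraction behave correctly even after the packed
counter has wrapped.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical width of the eviction counter packed into a shadow entry. */
#define EVICTION_BITS	20UL
#define EVICTION_MASK	((1UL << EVICTION_BITS) - 1)

/*
 * A refault counts as "recent" when the number of evictions that
 * happened between this page's eviction and its refault is no larger
 * than the size of the active file list.
 */
static bool refault_is_recent(unsigned long refault,
			      unsigned long eviction,
			      unsigned long active_file)
{
	unsigned long distance;

	assert(EVICTION_BITS < sizeof(unsigned long) * CHAR_BIT);

	/* Unsigned subtraction plus the mask handles counter wraparound. */
	distance = (refault - eviction) & EVICTION_MASK;
	return distance <= active_file;
}
```

For instance, with eviction = EVICTION_MASK - 1 and refault = 3 (the
counter wrapped in between), the masked distance is 5, so the refault
is recent for active_file = 10 but not for active_file = 4.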

The suggestion(s) seem to revolve around checking if currently-active
pages are refaulting a lot, and I am still not clear on how to
determine that, given that the shadow/eviction information was lost at
the point when those active pages were refaulted.


BTW feel free to drop this thread if you are busy, or delay your
response to a convenient time. I'm new to this area and probably
making silly mistakes, and not yet convinced that I'll be able to see
it through.

Daniel

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-24 15:15   ` Johannes Weiner
@ 2018-07-26  1:07     ` Singh, Balbir
  2018-07-26 20:07       ` Johannes Weiner
  0 siblings, 1 reply; 83+ messages in thread
From: Singh, Balbir @ 2018-07-26  1:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Peter Zijlstra, akpm, Linus Torvalds, Tejun Heo,
	surenb, Vinayak Menon, Christoph Lameter, Mike Galbraith,
	Shakeel Butt, linux-mm, cgroups, linux-kernel, kernel-team



On 7/25/18 1:15 AM, Johannes Weiner wrote:
> Hi Balbir,
> 
> On Tue, Jul 24, 2018 at 07:14:02AM +1000, Balbir Singh wrote:
>> Does the mechanism scale? I am a little concerned about how frequently
>> this infrastructure is monitored/read/acted upon.
> 
> I expect most users to poll in the frequency ballpark of the running
> averages (10s, 1m, 5m). Our OOMD defaults to 5s polling of the 10s
> average; we collect the 1m average once per minute from our machines
> and cgroups to log the system/workload health trends in our fleet.
> 
> Suren has been experimenting with adaptive polling down to the
> millisecond range on Android.
> 

I think this is a bad way of doing things; polling only adds to
overheads. There needs to be an event-driven mechanism, and the
selection of the events needs to happen in user space.

>> Why aren't existing mechanisms sufficient
> 
> Our existing stuff gives a lot of indication when something *may* be
> an issue, like the rate of page reclaim, the number of refaults, the
> average number of active processes, one task waiting on a resource.
> 
> But the real difference between an issue and a non-issue is how much
> it affects your overall goal of making forward progress or reacting to
> a request in time. And that's the only thing users really care
> about. It doesn't matter whether my system is doing 2314 or 6723 page
> refaults per minute, or scanned 8495 pages recently. I need to know
> whether I'm losing 1% or 20% of my time on overcommitted memory.
> 
> Delayacct is time-based, so it's a step in the right direction, but it
> doesn't aggregate tasks and CPUs into compound productivity states to
> tell you if only parts of your workload are seeing delays (which is
> often tolerable for the purpose of ensuring maximum HW utilization) or
> your system overall is not making forward progress. That aggregation
> isn't something you can do in userspace with polled delayacct data.

By aggregation you mean cgroup aggregation?

> 
>> -- why is the avg delay calculation in the kernel?
> 
> For one, as per above, most users will probably be using the standard
> averaging windows, and we already have this highly optimized
> infrastructure from the load average. I don't see why we shouldn't use
> that instead of exporting an obscure number that requires most users
> to have an additional library or copy-paste the loadavg code.
> 
> I also mentioned the OOM killer as a likely in-kernel user of the
> pressure percentages to protect from memory livelocks out of the box,
> in which case we have to do this calculation in the kernel anyway.
> 
>> There is no talk about the overhead this introduces in general, maybe
>> the details are in the patches. I'll read through them
> 
> I sent an email on benchmarks and overhead in one of the subthreads, I
> will include that information in the cover letter in v3.
> 
> https://lore.kernel.org/lkml/20180718215644.GB2838@cmpxchg.org/

Thanks, I'll take a look

Balbir Singh.

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-26  1:07     ` Singh, Balbir
@ 2018-07-26 20:07       ` Johannes Weiner
  2018-07-27 23:40         ` Suren Baghdasaryan
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2018-07-26 20:07 UTC (permalink / raw)
  To: Singh, Balbir
  Cc: Ingo Molnar, Peter Zijlstra, akpm, Linus Torvalds, Tejun Heo,
	surenb, Vinayak Menon, Christoph Lameter, Mike Galbraith,
	Shakeel Butt, linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Jul 26, 2018 at 11:07:32AM +1000, Singh, Balbir wrote:
> On 7/25/18 1:15 AM, Johannes Weiner wrote:
> > On Tue, Jul 24, 2018 at 07:14:02AM +1000, Balbir Singh wrote:
> >> Does the mechanism scale? I am a little concerned about how frequently
> >> this infrastructure is monitored/read/acted upon.
> > 
> > I expect most users to poll in the frequency ballpark of the running
> > averages (10s, 1m, 5m). Our OOMD defaults to 5s polling of the 10s
> > average; we collect the 1m average once per minute from our machines
> > and cgroups to log the system/workload health trends in our fleet.
> > 
> > Suren has been experimenting with adaptive polling down to the
> > millisecond range on Android.
> > 
> 
> I think this is a bad way of doing things, polling only adds to
> overheads, there needs to be an event driven mechanism and the
> selection of the events need to happen in user space.

Of course, I'm not saying you should be doing this, and in fact Suren
and I were talking about notification/event infrastructure.

You asked if this scales and I'm telling you it's not impossible to
read at such frequencies.
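
As a rough idea of what a userspace poller has to do per read, here is
a sketch that parses one line of a pressure file. The line format
("some avg10=... avg60=... avg300=... total=...") is an assumption
based on the interface as later documented, and struct psi_line and
parse_psi_line() are names made up for this example; a real monitor
would read the line from /proc/pressure/memory or a cgroup's pressure
file on each poll.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One parsed line of a pressure file (assumed format, see above). */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* running averages, in percent */
	unsigned long long total;	/* cumulative stall time */
};

/* Returns 0 on success, -1 if the line does not match the format. */
static int parse_psi_line(const char *line, struct psi_line *out)
{
	int n;

	assert(line != NULL && out != NULL);

	n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   out->kind, &out->avg10, &out->avg60, &out->avg300,
		   &out->total);
	return n == 5 ? 0 : -1;
}
```

A poller in the spirit of the 5s/10s setup described above would call
this once per interval and compare avg10 against its own limit.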

Maybe you can clarify your question.

> >> Why aren't existing mechanisms sufficient
> > 
> > Our existing stuff gives a lot of indication when something *may* be
> > an issue, like the rate of page reclaim, the number of refaults, the
> > average number of active processes, one task waiting on a resource.
> > 
> > But the real difference between an issue and a non-issue is how much
> > it affects your overall goal of making forward progress or reacting to
> > a request in time. And that's the only thing users really care
> > about. It doesn't matter whether my system is doing 2314 or 6723 page
> > refaults per minute, or scanned 8495 pages recently. I need to know
> > whether I'm losing 1% or 20% of my time on overcommitted memory.
> > 
> > Delayacct is time-based, so it's a step in the right direction, but it
> > doesn't aggregate tasks and CPUs into compound productivity states to
> > tell you if only parts of your workload are seeing delays (which is
> > often tolerable for the purpose of ensuring maximum HW utilization) or
> > your system overall is not making forward progress. That aggregation
> > isn't something you can do in userspace with polled delayacct data.
> 
> By aggregation you mean cgroup aggregation?

System-wide and per cgroup.

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
                   ` (13 preceding siblings ...)
  2018-07-23 21:14 ` Balbir Singh
@ 2018-07-27 22:01 ` Pavel Machek
  2018-07-30 15:40   ` Johannes Weiner
  14 siblings, 1 reply; 83+ messages in thread
From: Pavel Machek @ 2018-07-27 22:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds,
	Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

Hi!

> The idea is to eventually incorporate this back into the kernel, so
> that Linux can avoid OOM livelocks (which TECHNICALLY aren't memory
> deadlocks, but for the user indistinguishable) out of the box.
> 
> We also use psi memory pressure for loadshedding. Our batch job

psi->PSI?

> 		How do you use this feature?
> 
> A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
> 3 files: cpu, memory, and io. If using cgroup2, cgroups will also

Could we get the config named CONFIG_PRESSURE to match /proc/pressure?
"PSI" is little too terse...

								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-26 20:07       ` Johannes Weiner
@ 2018-07-27 23:40         ` Suren Baghdasaryan
  0 siblings, 0 replies; 83+ messages in thread
From: Suren Baghdasaryan @ 2018-07-27 23:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Singh, Balbir, Ingo Molnar, Peter Zijlstra, akpm, Linus Torvalds,
	Tejun Heo, Vinayak Menon, Christoph Lameter, Mike Galbraith,
	Shakeel Butt, linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Jul 26, 2018 at 1:07 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Thu, Jul 26, 2018 at 11:07:32AM +1000, Singh, Balbir wrote:
>> On 7/25/18 1:15 AM, Johannes Weiner wrote:
>> > On Tue, Jul 24, 2018 at 07:14:02AM +1000, Balbir Singh wrote:
>> >> Does the mechanism scale? I am a little concerned about how frequently
>> >> this infrastructure is monitored/read/acted upon.
>> >
>> > I expect most users to poll in the frequency ballpark of the running
>> > averages (10s, 1m, 5m). Our OOMD defaults to 5s polling of the 10s
>> > average; we collect the 1m average once per minute from our machines
>> > and cgroups to log the system/workload health trends in our fleet.
>> >
>> > Suren has been experimenting with adaptive polling down to the
>> > millisecond range on Android.
>> >
>>
>> I think this is a bad way of doing things, polling only adds to
>> overheads, there needs to be an event driven mechanism and the
>> selection of the events need to happen in user space.
>
> Of course, I'm not saying you should be doing this, and in fact Suren
> and I were talking about notification/event infrastructure.

I implemented a psi-monitor prototype which allows userspace to
specify the max PSI stall it can tolerate (in terms of % of time spent
on memory management). When that threshold is breached an event to
userspace is generated. I'm still testing it but early results look
promising. I'm planning to send it upstream when it's ready and after
the main PSI patchset is merged.
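
The threshold idea can be sketched as below. This is not the
psi-monitor prototype, whose interface is not shown in this thread; it
is only an illustration of the arithmetic, assuming two samples of a
cumulative stall counter in microseconds taken one window apart.
stall_percent() and stall_threshold_breached() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Percentage of the window spent stalled, given two samples of a
 * cumulative stall counter (assumed to be in microseconds).
 */
static double stall_percent(unsigned long long total_prev_us,
			    unsigned long long total_now_us,
			    unsigned long long window_us)
{
	assert(window_us > 0 && total_now_us >= total_prev_us);

	return 100.0 * (double)(total_now_us - total_prev_us) /
	       (double)window_us;
}

/* True when the stall share exceeds what userspace said it tolerates. */
static bool stall_threshold_breached(unsigned long long total_prev_us,
				     unsigned long long total_now_us,
				     unsigned long long window_us,
				     double max_stall_pct)
{
	return stall_percent(total_prev_us, total_now_us, window_us) >
	       max_stall_pct;
}
```

With 150ms of stall inside a 1s window, the share is 15%, which
breaches a 10% threshold but not a 20% one.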

>
> You asked if this scales and I'm telling you it's not impossible to
> read at such frequencies.
>

Yes it's doable. One usecase might be to poll at a higher rate for a
short period of time immediately after the initial event is received
to clarify the short-term signal dynamics.
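
One way to picture that usecase is a poll interval that drops for a
short burst after an event and then falls back. The sketch below is
purely illustrative: the three constants and next_poll_interval_ms()
are assumptions for this example, not values or names from any
existing monitor.

```c
#include <assert.h>

#define SLOW_POLL_MS	5000	/* hypothetical steady-state interval */
#define FAST_POLL_MS	100	/* hypothetical interval during a burst */
#define BURST_WINDOW_MS	2000	/* hypothetical burst length after an event */

/*
 * Pick the next polling interval. While we are within BURST_WINDOW_MS
 * of the last event, poll fast; otherwise poll at the slow rate.
 * have_event is zero if no event has been received yet.
 */
static unsigned int next_poll_interval_ms(unsigned long long now_ms,
					  unsigned long long last_event_ms,
					  int have_event)
{
	if (!have_event)
		return SLOW_POLL_MS;

	assert(now_ms >= last_event_ms);

	if (now_ms - last_event_ms < BURST_WINDOW_MS)
		return FAST_POLL_MS;
	return SLOW_POLL_MS;
}
```

The burst keeps the expensive high-rate reads confined to the moments
right after a threshold event, which is when the short-term dynamics
matter.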

> Maybe you can clarify your question.
>
>> >> Why aren't existing mechanisms sufficient
>> >
>> > Our existing stuff gives a lot of indication when something *may* be
>> > an issue, like the rate of page reclaim, the number of refaults, the
>> > average number of active processes, one task waiting on a resource.
>> >
>> > But the real difference between an issue and a non-issue is how much
>> > it affects your overall goal of making forward progress or reacting to
>> > a request in time. And that's the only thing users really care
>> > about. It doesn't matter whether my system is doing 2314 or 6723 page
>> > refaults per minute, or scanned 8495 pages recently. I need to know
>> > whether I'm losing 1% or 20% of my time on overcommitted memory.
>> >
>> > Delayacct is time-based, so it's a step in the right direction, but it
>> > doesn't aggregate tasks and CPUs into compound productivity states to
>> > tell you if only parts of your workload are seeing delays (which is
>> > often tolerable for the purpose of ensuring maximum HW utilization) or
>> > your system overall is not making forward progress. That aggregation
>> > isn't something you can do in userspace with polled delayacct data.
>>
>> By aggregation you mean cgroup aggregation?
>
> System-wide and per cgroup.

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-27 22:01 ` Pavel Machek
@ 2018-07-30 15:40   ` Johannes Weiner
  2018-07-30 17:39     ` Pavel Machek
  0 siblings, 1 reply; 83+ messages in thread
From: Johannes Weiner @ 2018-07-30 15:40 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds,
	Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

On Sat, Jul 28, 2018 at 12:01:23AM +0200, Pavel Machek wrote:
> > 		How do you use this feature?
> > 
> > A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
> > 3 files: cpu, memory, and io. If using cgroup2, cgroups will also
> 
> Could we get the config named CONFIG_PRESSURE to match /proc/pressure?
> "PSI" is little too terse...

I'd rather have the internal config symbol match the naming scheme in
the code, where psi is a shorter, unique token as compared to e.g.
pressure, press, prsr, etc.

The prompt text that the user primarily sees spells out "Pressure", so
I don't think this is confusing.

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-30 15:40   ` Johannes Weiner
@ 2018-07-30 17:39     ` Pavel Machek
  2018-07-30 17:51       ` Tejun Heo
  0 siblings, 1 reply; 83+ messages in thread
From: Pavel Machek @ 2018-07-30 17:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds,
	Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

On Mon 2018-07-30 11:40:35, Johannes Weiner wrote:
> On Sat, Jul 28, 2018 at 12:01:23AM +0200, Pavel Machek wrote:
> > > 		How do you use this feature?
> > > 
> > > A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
> > > 3 files: cpu, memory, and io. If using cgroup2, cgroups will also
> > 
> > Could we get the config named CONFIG_PRESSURE to match /proc/pressure?
> > "PSI" is little too terse...
> 
> I'd rather have the internal config symbol match the naming scheme in
> the code, where psi is a shorter, unique token as compared to e.g.
> pressure, press, prsr, etc.

I'd do "pressure", really. Yes, psi is shorter, but I'd say that
length is not really important there.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-30 17:39     ` Pavel Machek
@ 2018-07-30 17:51       ` Tejun Heo
  2018-07-30 17:54         ` Randy Dunlap
  2018-07-30 17:59         ` Pavel Machek
  0 siblings, 2 replies; 83+ messages in thread
From: Tejun Heo @ 2018-07-30 17:51 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Johannes Weiner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Linus Torvalds, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

Hello,

On Mon, Jul 30, 2018 at 07:39:40PM +0200, Pavel Machek wrote:
> > I'd rather have the internal config symbol match the naming scheme in
> > the code, where psi is a shorter, unique token as compared to e.g.
> > pressure, press, prsr, etc.
> 
> I'd do "pressure", really. Yes, psi is shorter, but I'd say that
> length is not really important there.

This is extreme bikeshedding without any relevance.  You can make
suggestions but please lay it to rest.  There isn't any general
consensus against the current name and you're just trying to push your
favorite name without proper justifications after contributing nothing
to the project.  Please stop.

Thanks.

-- 
tejun

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-30 17:51       ` Tejun Heo
@ 2018-07-30 17:54         ` Randy Dunlap
  2018-07-30 18:05           ` Tejun Heo
  2018-07-30 17:59         ` Pavel Machek
  1 sibling, 1 reply; 83+ messages in thread
From: Randy Dunlap @ 2018-07-30 17:54 UTC (permalink / raw)
  To: Tejun Heo, Pavel Machek
  Cc: Johannes Weiner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Linus Torvalds, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

On 07/30/2018 10:51 AM, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jul 30, 2018 at 07:39:40PM +0200, Pavel Machek wrote:
>>> I'd rather have the internal config symbol match the naming scheme in
>>> the code, where psi is a shorter, unique token as compared to e.g.
>>> pressure, press, prsr, etc.
>>
>> I'd do "pressure", really. Yes, psi is shorter, but I'd say that
>> length is not really important there.
> 
> This is extreme bikeshedding without any relevance.  You can make
> suggestions but please lay it to rest.  There isn't any general
> consensus against the current name and you're just trying to push your
> favorite name without proper justifications after contributing nothing
> to the project.  Please stop.
> 
> Thanks.

I'd say he's trying to make something that is readable and easier to
understand for users.

Thanks.


-- 
~Randy

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-30 17:51       ` Tejun Heo
  2018-07-30 17:54         ` Randy Dunlap
@ 2018-07-30 17:59         ` Pavel Machek
  2018-07-30 18:07           ` Tejun Heo
  1 sibling, 1 reply; 83+ messages in thread
From: Pavel Machek @ 2018-07-30 17:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Linus Torvalds, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

On Mon 2018-07-30 10:51:20, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jul 30, 2018 at 07:39:40PM +0200, Pavel Machek wrote:
> > > I'd rather have the internal config symbol match the naming scheme in
> > > the code, where psi is a shorter, unique token as compared to e.g.
> > > pressure, press, prsr, etc.
> > 
> > I'd do "pressure", really. Yes, psi is shorter, but I'd say that
> > length is not really important there.
> 
> This is extreme bikeshedding without any relevance.  You can make
> suggestions but please lay it to rest.  There isn't any general
> consensus against the current name and you're just trying to push your
> favorite name without proper justifications after contributing nothing
> to the project.  Please stop.

It's true I have no interest in psi. But I'm trying to use the same kernel
you are trying to "improve" and I was confused enough by seeing
"CONFIG_PSI". And yes, my association was "pounds per square inch" and
"what is it doing here".

So I'm asking you to change the name.

USB is a well-known acronym, so it is okay to have CONFIG_USB. PSI is
also well known -- but means something else.

And the code kind-of acknowledges that the acronym is unknown, by having
/proc/pressure.

So please just fix it.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-30 17:54         ` Randy Dunlap
@ 2018-07-30 18:05           ` Tejun Heo
  0 siblings, 0 replies; 83+ messages in thread
From: Tejun Heo @ 2018-07-30 18:05 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Pavel Machek, Johannes Weiner, Ingo Molnar, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

Hello,

On Mon, Jul 30, 2018 at 10:54:05AM -0700, Randy Dunlap wrote:
> I'd say he's trying to make something that is readable and easier to
> understand for users.

Sure, it's perfectly fine to make those suggestions and discuss but
the counter points have already been discussed (e.g. PSI is a known
acronym associated with pressure and internal symbols all use them for
brevity and uniqueness).  There's no clear technically winning choice
here and it's a decision of a relatively low importance given that
it's confined to kernel config.  I can't see any merit in turning it
into a last-word match.

Thanks.

-- 
tejun

* Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
  2018-07-30 17:59         ` Pavel Machek
@ 2018-07-30 18:07           ` Tejun Heo
  0 siblings, 0 replies; 83+ messages in thread
From: Tejun Heo @ 2018-07-30 18:07 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Johannes Weiner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Linus Torvalds, Suren Baghdasaryan, Vinayak Menon,
	Christopher Lameter, Mike Galbraith, Shakeel Butt, linux-mm,
	cgroups, linux-kernel, kernel-team

On Mon, Jul 30, 2018 at 07:59:36PM +0200, Pavel Machek wrote:
> It's true I have no interest in psi. But I'm trying to use the same kernel
> you are trying to "improve" and I was confused enough by seeing
> "CONFIG_PSI". And yes, my association was "pounds per square inch" and
> "what is it doing here".

Read the help message.  If that's not enough, we sure can improve it.

> So I'm asking you to change the name.
> 
> USB is a well-known acronym, so it is okay to have CONFIG_USB. PSI is
> also well known -- but means something else.
> 
> And the code kind-of acknowledges that the acronym is unknown, by having
> /proc/pressure.

Your momentary confusion isn't the only criterion.

Thanks.

-- 
tejun

Thread overview: 83+ messages
2018-07-12 17:29 [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Johannes Weiner
2018-07-12 17:29 ` [PATCH 01/10] mm: workingset: don't drop refault information prematurely Johannes Weiner
2018-07-12 17:29 ` [PATCH 02/10] mm: workingset: tell cache transitions from workingset thrashing Johannes Weiner
2018-07-23 13:36   ` Arnd Bergmann
2018-07-23 15:23     ` Johannes Weiner
2018-07-23 15:35       ` Arnd Bergmann
2018-07-23 16:27         ` Johannes Weiner
2018-07-24 15:04           ` Will Deacon
2018-07-25 16:06             ` Will Deacon
2018-07-12 17:29 ` [PATCH 03/10] delayacct: track delays from thrashing cache pages Johannes Weiner
2018-07-12 17:29 ` [PATCH 04/10] sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD Johannes Weiner
2018-07-12 17:29 ` [PATCH 05/10] sched: loadavg: make calc_load_n() public Johannes Weiner
2018-07-12 17:29 ` [PATCH 06/10] sched: sched.h: make rq locking and clock functions available in stats.h Johannes Weiner
2018-07-12 17:29 ` [PATCH 07/10] sched: introduce this_rq_lock_irq() Johannes Weiner
2018-07-12 17:29 ` [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
2018-07-13  9:21   ` Peter Zijlstra
2018-07-13 16:17     ` Johannes Weiner
2018-07-14  8:48       ` Peter Zijlstra
2018-07-14  9:02       ` Peter Zijlstra
2018-07-17 10:03   ` Peter Zijlstra
2018-07-18 21:56     ` Johannes Weiner
2018-07-17 14:16   ` Peter Zijlstra
2018-07-18 22:00     ` Johannes Weiner
2018-07-17 14:21   ` Peter Zijlstra
2018-07-18 22:03     ` Johannes Weiner
2018-07-17 15:01   ` Peter Zijlstra
2018-07-18 22:06     ` Johannes Weiner
2018-07-20 14:13       ` Johannes Weiner
2018-07-17 15:17   ` Peter Zijlstra
2018-07-18 22:11     ` Johannes Weiner
2018-07-17 15:32   ` Peter Zijlstra
2018-07-18 12:03   ` Peter Zijlstra
2018-07-18 12:22     ` Peter Zijlstra
2018-07-18 22:36     ` Johannes Weiner
2018-07-19 13:58       ` Peter Zijlstra
2018-07-19  9:26     ` Peter Zijlstra
2018-07-19 12:50       ` Johannes Weiner
2018-07-19 13:18         ` Peter Zijlstra
2018-07-19 15:08     ` Linus Torvalds
2018-07-19 17:54       ` Johannes Weiner
2018-07-19 18:47     ` Johannes Weiner
2018-07-19 20:31       ` Peter Zijlstra
2018-07-24 16:01         ` Johannes Weiner
2018-07-18 12:46   ` Peter Zijlstra
2018-07-18 13:56     ` Johannes Weiner
2018-07-18 16:31       ` Peter Zijlstra
2018-07-18 16:46         ` Johannes Weiner
2018-07-20 20:35   ` Peter Zijlstra
2018-07-12 17:29 ` [PATCH 09/10] psi: cgroup support Johannes Weiner
2018-07-12 20:08   ` Tejun Heo
2018-07-17 15:40   ` Peter Zijlstra
2018-07-24 15:54     ` Johannes Weiner
2018-07-12 17:29 ` [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure Johannes Weiner
2018-07-12 23:45   ` Andrew Morton
2018-07-13 22:17     ` Johannes Weiner
2018-07-13 22:13   ` Suren Baghdasaryan
2018-07-13 22:49     ` Johannes Weiner
2018-07-13 23:34       ` Suren Baghdasaryan
2018-07-17 15:13   ` Peter Zijlstra
2018-07-12 17:37 ` [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2 Linus Torvalds
2018-07-12 23:44 ` Andrew Morton
2018-07-13 22:14   ` Johannes Weiner
2018-07-16 15:57 ` Daniel Drake
2018-07-17 11:25   ` Michal Hocko
2018-07-17 12:13     ` Daniel Drake
2018-07-17 12:23       ` Michal Hocko
2018-07-25 22:57         ` Daniel Drake
2018-07-18 22:21     ` Johannes Weiner
2018-07-19 11:29       ` peter enderborg
2018-07-19 12:18         ` Johannes Weiner
2018-07-23 21:14 ` Balbir Singh
2018-07-24 15:15   ` Johannes Weiner
2018-07-26  1:07     ` Singh, Balbir
2018-07-26 20:07       ` Johannes Weiner
2018-07-27 23:40         ` Suren Baghdasaryan
2018-07-27 22:01 ` Pavel Machek
2018-07-30 15:40   ` Johannes Weiner
2018-07-30 17:39     ` Pavel Machek
2018-07-30 17:51       ` Tejun Heo
2018-07-30 17:54         ` Randy Dunlap
2018-07-30 18:05           ` Tejun Heo
2018-07-30 17:59         ` Pavel Machek
2018-07-30 18:07           ` Tejun Heo
