* [PATCH 1/9] mm: workingset: don't drop refault information prematurely
2018-08-28 17:22 [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
@ 2018-08-28 17:22 ` Johannes Weiner
2018-08-28 17:22 ` [PATCH 2/9] mm: workingset: tell cache transitions from workingset thrashing Johannes Weiner
` (9 subsequent siblings)
10 siblings, 0 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-08-28 17:22 UTC
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
From: Johannes Weiner <jweiner@fb.com>
If we keep just enough refault information to match the *current* page
cache during reclaim time, we could lose a lot of events when there is
only a temporary spike in non-cache memory consumption that pushes out
all the cache. Once cache comes back, we won't see those refaults.
They might not be actionable for LRU aging, but we want to know about
them for measuring memory pressure.
Signed-off-by: Johannes Weiner <jweiner@fb.com>
---
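Note (not part of the patch): a minimal userspace sketch of the shadow
node limit that count_shadow_nodes() computes after this change. It
assumes RADIX_TREE_MAP_SHIFT is 6 (64 slots per node), and it assumes
the function returns the excess over the limit, which lies outside the
hunk shown here. max_shadow_nodes() and excess_shadow_nodes() are
hypothetical helper names used only for illustration.

```c
#include <assert.h>

/* Assumed value: 64 slots per radix tree node. */
#define RADIX_TREE_MAP_SHIFT 6

/*
 * Upper bound on shadow nodes, mirroring
 * "max_nodes = pages >> (RADIX_TREE_MAP_SHIFT - 3)" from the patch.
 * "pages" is all memory of the node or cgroup, not only the cache
 * that happens to be resident at reclaim time.
 */
static inline unsigned long max_shadow_nodes(unsigned long pages)
{
	assert(RADIX_TREE_MAP_SHIFT >= 3);
	return pages >> (RADIX_TREE_MAP_SHIFT - 3);
}

/* Nodes above the limit (assumed to be what the shrinker reports). */
static inline unsigned long
excess_shadow_nodes(unsigned long nodes, unsigned long pages)
{
	unsigned long max_nodes = max_shadow_nodes(pages);

	return nodes <= max_nodes ? 0 : nodes - max_nodes;
}
```

Because the limit now scales with total memory rather than with the
current file LRU size, a temporary spike in non-cache memory no longer
shrinks the limit and forces out the refault history.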
mm/workingset.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index 40ee02c83978..53759a3cf99a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -364,7 +364,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
{
unsigned long max_nodes;
unsigned long nodes;
- unsigned long cache;
+ unsigned long pages;
/* list_lru lock nests inside the IRQ-safe i_pages lock */
local_irq_disable();
@@ -393,14 +393,14 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
*
* PAGE_SIZE / radix_tree_nodes / node_entries * 8 / PAGE_SIZE
*/
- if (sc->memcg) {
- cache = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
- LRU_ALL_FILE);
- } else {
- cache = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
- node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
- }
- max_nodes = cache >> (RADIX_TREE_MAP_SHIFT - 3);
+#ifdef CONFIG_MEMCG
+ if (sc->memcg)
+ pages = page_counter_read(&sc->memcg->memory);
+ else
+#endif
+ pages = node_present_pages(sc->nid);
+
+ max_nodes = pages >> (RADIX_TREE_MAP_SHIFT - 3);
if (nodes <= max_nodes)
return 0;
--
2.18.0
* [PATCH 2/9] mm: workingset: tell cache transitions from workingset thrashing
2018-08-28 17:22 [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
2018-08-28 17:22 ` [PATCH 1/9] mm: workingset: don't drop refault information prematurely Johannes Weiner
@ 2018-08-28 17:22 ` Johannes Weiner
2018-08-28 17:22 ` [PATCH 3/9] delayacct: track delays from thrashing cache pages Johannes Weiner
` (8 subsequent siblings)
10 siblings, 0 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-08-28 17:22 UTC
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
Refaults happen during transitions between workingsets as well as
in-place thrashing. Knowing the difference between the two has a range
of applications, including measuring the impact of memory shortage on
system performance and balancing pressure more intelligently between
the filesystem cache and the swap-backed workingset.
During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bona fide thrashing.
Introduce a new page flag that tells on eviction whether the page has
been active or not in its lifetime. This bit is then stored in the
shadow entry, to classify refaults as transitioning or thrashing.
How many page->flags does this leave us with on 32-bit?
20 bits are always page flags
21 if you have an MMU
23 with the zone bits for DMA, Normal, HighMem, Movable
29 with the sparsemem section bits
30 if PAE is enabled
31 with this patch.
So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA
nodes. If that's not enough, the system can switch to discontigmem and
re-gain the 6 or 7 sparsemem section bits.
v4:
- fix a typo in the comments, as per Suren
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
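Note (not part of the patch): a minimal userspace sketch of how the
extra workingset bit travels through the shadow entry and how a refault
is then classified. The bit widths below (2 tag bits, NODES_SHIFT of 6,
a 16-bit memcg ID, 64-bit entries) are illustrative assumptions, and
bucket_order is taken as 0. struct shadow, enum refault_kind and
classify_refault() are hypothetical names; in the patch itself a
restore is also counted as an activation rather than being a separate
category.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative widths; the kernel derives these from its configuration. */
#define EXCEPTIONAL_ENTRY 2u	/* tag marking a non-pointer entry */
#define EXCEPTIONAL_SHIFT 2
#define NODES_SHIFT       6
#define MEMCG_ID_SHIFT    16
#define EVICTION_SHIFT    (EXCEPTIONAL_SHIFT + 1 + NODES_SHIFT + MEMCG_ID_SHIFT)
#define EVICTION_MASK     (~(uint64_t)0 >> EVICTION_SHIFT)

struct shadow {
	unsigned int memcgid;
	unsigned int nid;
	uint64_t eviction;	/* inactive_age snapshot at eviction */
	bool workingset;	/* page had been active in its lifetime */
};

static inline uint64_t pack_shadow(struct shadow s)
{
	/* Keep only the bits that survive the shifts below. */
	uint64_t e = s.eviction & EVICTION_MASK;

	e = (e << MEMCG_ID_SHIFT) | s.memcgid;
	e = (e << NODES_SHIFT) | s.nid;
	e = (e << 1) | s.workingset;
	e <<= EXCEPTIONAL_SHIFT;
	return e | EXCEPTIONAL_ENTRY;
}

static inline struct shadow unpack_shadow(uint64_t entry)
{
	struct shadow s;

	entry >>= EXCEPTIONAL_SHIFT;
	s.workingset = entry & 1;
	entry >>= 1;
	s.nid = entry & ((1u << NODES_SHIFT) - 1);
	entry >>= NODES_SHIFT;
	s.memcgid = entry & ((1u << MEMCG_ID_SHIFT) - 1);
	entry >>= MEMCG_ID_SHIFT;
	s.eviction = entry;
	return s;
}

enum refault_kind { REFAULT_DISTANT, REFAULT_ACTIVATE, REFAULT_RESTORE };

static inline enum refault_kind
classify_refault(uint64_t entry, uint64_t inactive_age, uint64_t active_file)
{
	struct shadow s = unpack_shadow(entry);
	/* Masked unsigned subtraction stays correct across counter wraps. */
	uint64_t distance = (inactive_age - s.eviction) & EVICTION_MASK;

	if (distance > active_file)
		return REFAULT_DISTANT;	/* could not have stayed resident */
	return s.workingset ? REFAULT_RESTORE : REFAULT_ACTIVATE;
}
```

A refault within the active list size that carries the workingset bit
is the thrashing case; the same distance without the bit is treated as
a workingset transition.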
include/linux/mmzone.h | 1 +
include/linux/page-flags.h | 5 +-
include/linux/swap.h | 2 +-
include/trace/events/mmflags.h | 1 +
mm/filemap.c | 9 ++--
mm/huge_memory.c | 1 +
mm/memcontrol.c | 2 +
mm/migrate.c | 2 +
mm/swap_state.c | 1 +
mm/vmscan.c | 1 +
mm/vmstat.c | 1 +
mm/workingset.c | 95 ++++++++++++++++++++++------------
12 files changed, 79 insertions(+), 42 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2dc52a..6af87946d241 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,7 @@ enum node_stat_item {
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
+ WORKINGSET_RESTORE,
WORKINGSET_NODERECLAIM,
NR_ANON_MAPPED, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 901943e4754b..79346bc1da7a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -69,13 +69,14 @@
*/
enum pageflags {
PG_locked, /* Page is locked. Don't touch. */
- PG_error,
PG_referenced,
PG_uptodate,
PG_dirty,
PG_lru,
PG_active,
+ PG_workingset,
PG_waiters, /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
+ PG_error,
PG_slab,
PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
PG_arch_1,
@@ -280,6 +281,8 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
+PAGEFLAG(Workingset, workingset, PF_HEAD)
+ TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
__PAGEFLAG(Slab, slab, PF_NO_TAIL)
__PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
PAGEFLAG(Checked, checked, PF_NO_COMPOUND) /* Used by some filesystems */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c063443d8638..d8822365782b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -296,7 +296,7 @@ struct vma_swap_readahead {
/* linux/mm/workingset.c */
void *workingset_eviction(struct address_space *mapping, struct page *page);
-bool workingset_refault(void *shadow);
+void workingset_refault(struct page *page, void *shadow);
void workingset_activation(struct page *page);
/* Do not use directly, use workingset_lookup_update */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a81cffb76d89..a1675d43777e 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -88,6 +88,7 @@
{1UL << PG_dirty, "dirty" }, \
{1UL << PG_lru, "lru" }, \
{1UL << PG_active, "active" }, \
+ {1UL << PG_workingset, "workingset" }, \
{1UL << PG_slab, "slab" }, \
{1UL << PG_owner_priv_1, "owner_priv_1" }, \
{1UL << PG_arch_1, "arch_1" }, \
diff --git a/mm/filemap.c b/mm/filemap.c
index 52517f28e6f4..5e53424d9097 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -915,12 +915,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
* data from the working set, only to cache data that will
* get overwritten with something else, is a waste of memory.
*/
- if (!(gfp_mask & __GFP_WRITE) &&
- shadow && workingset_refault(shadow)) {
- SetPageActive(page);
- workingset_activation(page);
- } else
- ClearPageActive(page);
+ WARN_ON_ONCE(PageActive(page));
+ if (!(gfp_mask & __GFP_WRITE) && shadow)
+ workingset_refault(page, shadow);
lru_cache_add(page);
}
return ret;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 25346bd99364..04d663c58bbe 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2369,6 +2369,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
(1L << PG_mlocked) |
(1L << PG_uptodate) |
(1L << PG_active) |
+ (1L << PG_workingset) |
(1L << PG_locked) |
(1L << PG_unevictable) |
(1L << PG_dirty)));
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b2173f7e5164..84824b775470 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5329,6 +5329,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
stat[WORKINGSET_REFAULT]);
seq_printf(m, "workingset_activate %lu\n",
stat[WORKINGSET_ACTIVATE]);
+ seq_printf(m, "workingset_restore %lu\n",
+ stat[WORKINGSET_RESTORE]);
seq_printf(m, "workingset_nodereclaim %lu\n",
stat[WORKINGSET_NODERECLAIM]);
diff --git a/mm/migrate.c b/mm/migrate.c
index 8c0af0f7cab1..a6a9114e62dc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -682,6 +682,8 @@ void migrate_page_states(struct page *newpage, struct page *page)
SetPageActive(newpage);
} else if (TestClearPageUnevictable(page))
SetPageUnevictable(newpage);
+ if (PageWorkingset(page))
+ SetPageWorkingset(newpage);
if (PageChecked(page))
SetPageChecked(newpage);
if (PageMappedToDisk(page))
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ecee9c6c4cc1..0d6a7f268d2e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -448,6 +448,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
/*
* Initiate read into locked page and return.
*/
+ SetPageWorkingset(new_page);
lru_cache_add_anon(new_page);
*new_page_allocated = true;
return new_page;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 03822f86f288..7fdbc18fea6f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1976,6 +1976,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
}
ClearPageActive(page); /* we are de-activating */
+ SetPageWorkingset(page);
list_add(&page->lru, &l_inactive);
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8ba0870ecddd..28f2faad95d4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1145,6 +1145,7 @@ const char * const vmstat_text[] = {
"nr_isolated_file",
"workingset_refault",
"workingset_activate",
+ "workingset_restore",
"workingset_nodereclaim",
"nr_anon_pages",
"nr_mapped",
diff --git a/mm/workingset.c b/mm/workingset.c
index 53759a3cf99a..f1bbce55ea60 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -121,7 +121,7 @@
* the only thing eating into inactive list space is active pages.
*
*
- * Activating refaulting pages
+ * Refaulting inactive pages
*
* All that is known about the active list is that the pages have been
* accessed more than once in the past. This means that at any given
@@ -134,6 +134,10 @@
* used less frequently than the refaulting page - or even not used at
* all anymore.
*
+ * That means if inactive cache is refaulting with a suitable refault
+ * distance, we assume the cache workingset is transitioning and put
+ * pressure on the current active list.
+ *
* If this is wrong and demotion kicks in, the pages which are truly
* used more frequently will be reactivated while the less frequently
* used once will be evicted from memory.
@@ -141,6 +145,14 @@
* But if this is right, the stale pages will be pushed out of memory
* and the used pages get to stay in cache.
*
+ * Refaulting active pages
+ *
+ * If on the other hand the refaulting pages have recently been
+ * deactivated, it means that the active list is no longer protecting
+ * actively used cache from reclaim. The cache is NOT transitioning to
+ * a different workingset; the existing workingset is thrashing in the
+ * space allocated to the page cache.
+ *
*
* Implementation
*
@@ -156,8 +168,7 @@
*/
#define EVICTION_SHIFT (RADIX_TREE_EXCEPTIONAL_ENTRY + \
- NODES_SHIFT + \
- MEM_CGROUP_ID_SHIFT)
+ 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT)
#define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
/*
@@ -170,23 +181,28 @@
*/
static unsigned int bucket_order __read_mostly;
-static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
+static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
+ bool workingset)
{
eviction >>= bucket_order;
eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
+ eviction = (eviction << 1) | workingset;
eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
}
static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
- unsigned long *evictionp)
+ unsigned long *evictionp, bool *workingsetp)
{
unsigned long entry = (unsigned long)shadow;
int memcgid, nid;
+ bool workingset;
entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+ workingset = entry & 1;
+ entry >>= 1;
nid = entry & ((1UL << NODES_SHIFT) - 1);
entry >>= NODES_SHIFT;
memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
@@ -195,6 +211,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
*memcgidp = memcgid;
*pgdat = NODE_DATA(nid);
*evictionp = entry << bucket_order;
+ *workingsetp = workingset;
}
/**
@@ -207,8 +224,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
*/
void *workingset_eviction(struct address_space *mapping, struct page *page)
{
- struct mem_cgroup *memcg = page_memcg(page);
struct pglist_data *pgdat = page_pgdat(page);
+ struct mem_cgroup *memcg = page_memcg(page);
int memcgid = mem_cgroup_id(memcg);
unsigned long eviction;
struct lruvec *lruvec;
@@ -220,30 +237,30 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
lruvec = mem_cgroup_lruvec(pgdat, memcg);
eviction = atomic_long_inc_return(&lruvec->inactive_age);
- return pack_shadow(memcgid, pgdat, eviction);
+ return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
}
/**
* workingset_refault - evaluate the refault of a previously evicted page
+ * @page: the freshly allocated replacement page
* @shadow: shadow entry of the evicted page
*
* Calculates and evaluates the refault distance of the previously
* evicted page in the context of the node it was allocated in.
- *
- * Returns %true if the page should be activated, %false otherwise.
*/
-bool workingset_refault(void *shadow)
+void workingset_refault(struct page *page, void *shadow)
{
unsigned long refault_distance;
+ struct pglist_data *pgdat;
unsigned long active_file;
struct mem_cgroup *memcg;
unsigned long eviction;
struct lruvec *lruvec;
unsigned long refault;
- struct pglist_data *pgdat;
+ bool workingset;
int memcgid;
- unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
+ unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
rcu_read_lock();
/*
@@ -263,41 +280,51 @@ bool workingset_refault(void *shadow)
* configurations instead.
*/
memcg = mem_cgroup_from_id(memcgid);
- if (!mem_cgroup_disabled() && !memcg) {
- rcu_read_unlock();
- return false;
- }
+ if (!mem_cgroup_disabled() && !memcg)
+ goto out;
lruvec = mem_cgroup_lruvec(pgdat, memcg);
refault = atomic_long_read(&lruvec->inactive_age);
active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
/*
- * The unsigned subtraction here gives an accurate distance
- * across inactive_age overflows in most cases.
+ * Calculate the refault distance
*
- * There is a special case: usually, shadow entries have a
- * short lifetime and are either refaulted or reclaimed along
- * with the inode before they get too old. But it is not
- * impossible for the inactive_age to lap a shadow entry in
- * the field, which can then can result in a false small
- * refault distance, leading to a false activation should this
- * old entry actually refault again. However, earlier kernels
- * used to deactivate unconditionally with *every* reclaim
- * invocation for the longest time, so the occasional
- * inappropriate activation leading to pressure on the active
- * list is not a problem.
+ * The unsigned subtraction here gives an accurate distance
+ * across inactive_age overflows in most cases. There is a
+ * special case: usually, shadow entries have a short lifetime
+ * and are either refaulted or reclaimed along with the inode
+ * before they get too old. But it is not impossible for the
+ * inactive_age to lap a shadow entry in the field, which can
+ * then result in a false small refault distance, leading to a
+ * false activation should this old entry actually refault
+ * again. However, earlier kernels used to deactivate
+ * unconditionally with *every* reclaim invocation for the
+ * longest time, so the occasional inappropriate activation
+ * leading to pressure on the active list is not a problem.
*/
refault_distance = (refault - eviction) & EVICTION_MASK;
inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
- if (refault_distance <= active_file) {
- inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
- rcu_read_unlock();
- return true;
+ /*
+ * Compare the distance to the existing workingset size. We
+ * don't act on pages that couldn't stay resident even if all
+ * the memory was available to the page cache.
+ */
+ if (refault_distance > active_file)
+ goto out;
+
+ SetPageActive(page);
+ atomic_long_inc(&lruvec->inactive_age);
+ inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+
+ /* Page was active prior to eviction */
+ if (workingset) {
+ SetPageWorkingset(page);
+ inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
}
+out:
rcu_read_unlock();
- return false;
}
/**
--
2.18.0
* [PATCH 3/9] delayacct: track delays from thrashing cache pages
2018-08-28 17:22 [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
2018-08-28 17:22 ` [PATCH 1/9] mm: workingset: don't drop refault information prematurely Johannes Weiner
2018-08-28 17:22 ` [PATCH 2/9] mm: workingset: tell cache transitions from workingset thrashing Johannes Weiner
@ 2018-08-28 17:22 ` Johannes Weiner
2018-08-28 17:22 ` [PATCH 4/9] sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD Johannes Weiner
` (7 subsequent siblings)
10 siblings, 0 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-08-28 17:22 UTC
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
Delay accounting already measures the time a task spends in direct
reclaim and waiting for swapin, but in low memory situations tasks
can spend a significant amount of their time waiting on
thrashing page cache. This isn't tracked right now.
To know the full impact of memory contention on an individual task,
measure the delay when waiting for a recently evicted active cache
page to read back into memory.
Also update tools/accounting/getdelays.c:
[hannes@computer accounting]$ sudo ./getdelays -d -p 1
print delayacct stats ON
PID 1
CPU count real total virtual total delay total delay average
50318 745000000 847346785 400533713 0.008ms
IO count delay total delay average
435 122601218 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
0 0 0ms
THRASHING count delay total delay average
19 12621439 0ms
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
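Note (not part of the patch): a minimal userspace sketch of the
start/end accounting pattern that the new thrashing counters follow.
struct thrash_acct and the thrash_*() helpers are hypothetical
stand-ins for the thrashing fields in task_delay_info, and timestamps
are passed in explicitly instead of being read from a clock. The
whole-millisecond average_ms() below is an assumption about how
getdelays rounds; under that assumption the sample output's 12621439 ns
over 19 waits (roughly 0.66 ms each) prints as 0ms.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-task thrashing fields. */
struct thrash_acct {
	uint64_t start_ns;	/* timestamp of the wait in progress */
	uint64_t delay_ns;	/* accumulated wait time */
	uint32_t count;		/* number of accounted waits */
};

static inline void thrash_start(struct thrash_acct *a, uint64_t now_ns)
{
	a->start_ns = now_ns;
}

static inline void thrash_end(struct thrash_acct *a, uint64_t now_ns)
{
	int64_t ns = (int64_t)(now_ns - a->start_ns);

	/* Only waits with a positive duration are accounted. */
	if (ns > 0) {
		a->delay_ns += (uint64_t)ns;
		a->count++;
	}
}

/* Average per wait in whole milliseconds, truncating. */
static inline uint64_t average_ms(uint64_t total_ns, uint64_t count)
{
	return total_ns / 1000000ULL / (count ? count : 1);
}
```

In the patch, the start/end pair brackets the wait in
wait_on_page_bit_common() only when the page is a locked, not yet
up-to-date, file-backed page with the workingset flag set.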
include/linux/delayacct.h | 23 +++++++++++++++++++++++
include/uapi/linux/taskstats.h | 6 +++++-
kernel/delayacct.c | 15 +++++++++++++++
mm/filemap.c | 11 +++++++++++
tools/accounting/getdelays.c | 8 +++++++-
5 files changed, 61 insertions(+), 2 deletions(-)
diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index 31c865d1842e..577d1b25fccd 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -57,7 +57,12 @@ struct task_delay_info {
u64 freepages_start;
u64 freepages_delay; /* wait for memory reclaim */
+
+ u64 thrashing_start;
+ u64 thrashing_delay; /* wait for thrashing page */
+
u32 freepages_count; /* total count of memory reclaim */
+ u32 thrashing_count; /* total count of thrash waits */
};
#endif
@@ -76,6 +81,8 @@ extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
extern __u64 __delayacct_blkio_ticks(struct task_struct *);
extern void __delayacct_freepages_start(void);
extern void __delayacct_freepages_end(void);
+extern void __delayacct_thrashing_start(void);
+extern void __delayacct_thrashing_end(void);
static inline int delayacct_is_task_waiting_on_io(struct task_struct *p)
{
@@ -156,6 +163,18 @@ static inline void delayacct_freepages_end(void)
__delayacct_freepages_end();
}
+static inline void delayacct_thrashing_start(void)
+{
+ if (current->delays)
+ __delayacct_thrashing_start();
+}
+
+static inline void delayacct_thrashing_end(void)
+{
+ if (current->delays)
+ __delayacct_thrashing_end();
+}
+
#else
static inline void delayacct_set_flag(int flag)
{}
@@ -182,6 +201,10 @@ static inline void delayacct_freepages_start(void)
{}
static inline void delayacct_freepages_end(void)
{}
+static inline void delayacct_thrashing_start(void)
+{}
+static inline void delayacct_thrashing_end(void)
+{}
#endif /* CONFIG_TASK_DELAY_ACCT */
diff --git a/include/uapi/linux/taskstats.h b/include/uapi/linux/taskstats.h
index b7aa7bb2349f..5e8ca16a9079 100644
--- a/include/uapi/linux/taskstats.h
+++ b/include/uapi/linux/taskstats.h
@@ -34,7 +34,7 @@
*/
-#define TASKSTATS_VERSION 8
+#define TASKSTATS_VERSION 9
#define TS_COMM_LEN 32 /* should be >= TASK_COMM_LEN
* in linux/sched.h */
@@ -164,6 +164,10 @@ struct taskstats {
/* Delay waiting for memory reclaim */
__u64 freepages_count;
__u64 freepages_delay_total;
+
+ /* Delay waiting for thrashing page */
+ __u64 thrashing_count;
+ __u64 thrashing_delay_total;
};
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index ca8ac2824f0b..2a12b988c717 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -135,9 +135,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
d->swapin_delay_total = (tmp < d->swapin_delay_total) ? 0 : tmp;
tmp = d->freepages_delay_total + tsk->delays->freepages_delay;
d->freepages_delay_total = (tmp < d->freepages_delay_total) ? 0 : tmp;
+ tmp = d->thrashing_delay_total + tsk->delays->thrashing_delay;
+ d->thrashing_delay_total = (tmp < d->thrashing_delay_total) ? 0 : tmp;
d->blkio_count += tsk->delays->blkio_count;
d->swapin_count += tsk->delays->swapin_count;
d->freepages_count += tsk->delays->freepages_count;
+ d->thrashing_count += tsk->delays->thrashing_count;
raw_spin_unlock_irqrestore(&tsk->delays->lock, flags);
return 0;
@@ -169,3 +172,15 @@ void __delayacct_freepages_end(void)
&current->delays->freepages_count);
}
+void __delayacct_thrashing_start(void)
+{
+ current->delays->thrashing_start = ktime_get_ns();
+}
+
+void __delayacct_thrashing_end(void)
+{
+ delayacct_end(&current->delays->lock,
+ &current->delays->thrashing_start,
+ &current->delays->thrashing_delay,
+ &current->delays->thrashing_count);
+}
diff --git a/mm/filemap.c b/mm/filemap.c
index 5e53424d9097..ca895ebe43ac 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -36,6 +36,7 @@
#include <linux/cleancache.h>
#include <linux/shmem_fs.h>
#include <linux/rmap.h>
+#include <linux/delayacct.h>
#include "internal.h"
#define CREATE_TRACE_POINTS
@@ -1073,8 +1074,15 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
{
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
+ bool thrashing = false;
int ret = 0;
+ if (bit_nr == PG_locked && !PageSwapBacked(page) &&
+ !PageUptodate(page) && PageWorkingset(page)) {
+ delayacct_thrashing_start();
+ thrashing = true;
+ }
+
init_wait(wait);
wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0;
wait->func = wake_page_function;
@@ -1113,6 +1121,9 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
finish_wait(q, wait);
+ if (thrashing)
+ delayacct_thrashing_end();
+
/*
* A signal could leave PageWaiters set. Clearing it here if
* !waitqueue_active would be possible (by open-coding finish_wait),
diff --git a/tools/accounting/getdelays.c b/tools/accounting/getdelays.c
index 9f420d98b5fb..8cb504d30384 100644
--- a/tools/accounting/getdelays.c
+++ b/tools/accounting/getdelays.c
@@ -203,6 +203,8 @@ static void print_delayacct(struct taskstats *t)
"SWAP %15s%15s%15s\n"
" %15llu%15llu%15llums\n"
"RECLAIM %12s%15s%15s\n"
+ " %15llu%15llu%15llums\n"
+ "THRASHING%12s%15s%15s\n"
" %15llu%15llu%15llums\n",
"count", "real total", "virtual total",
"delay total", "delay average",
@@ -222,7 +224,11 @@ static void print_delayacct(struct taskstats *t)
"count", "delay total", "delay average",
(unsigned long long)t->freepages_count,
(unsigned long long)t->freepages_delay_total,
- average_ms(t->freepages_delay_total, t->freepages_count));
+ average_ms(t->freepages_delay_total, t->freepages_count),
+ "count", "delay total", "delay average",
+ (unsigned long long)t->thrashing_count,
+ (unsigned long long)t->thrashing_delay_total,
+ average_ms(t->thrashing_delay_total, t->thrashing_count));
}
static void task_context_switch_counts(struct taskstats *t)
--
2.18.0
* [PATCH 4/9] sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD
2018-08-28 17:22 [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
` (2 preceding siblings ...)
2018-08-28 17:22 ` [PATCH 3/9] delayacct: track delays from thrashing cache pages Johannes Weiner
@ 2018-08-28 17:22 ` Johannes Weiner
2018-09-12 23:28 ` Andrew Morton
2018-08-28 17:22 ` [PATCH 5/9] sched: loadavg: make calc_load_n() public Johannes Weiner
` (6 subsequent siblings)
10 siblings, 1 reply; 55+ messages in thread
From: Johannes Weiner @ 2018-08-28 17:22 UTC
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
There are several definitions of those functions/macros in places that
mess with fixed-point load averages. Provide an official version.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
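Note (not part of the patch): a minimal userspace sketch of the
consolidated helpers, useful for checking the fixed-point arithmetic by
hand. calc_load(), LOAD_INT() and LOAD_FRAC() are copied from the diff
below; FSHIFT of 11 (so FIXED_1 is 2048) and EXP_5 of 2014 are the
values the thread itself quotes. Unlike the old CALC_LOAD() macro,
calc_load() takes its arguments by value and does not modify them, so
callers have to assign the return value, as the converted call sites
in this patch do.

```c
#include <assert.h>

#define FSHIFT   11			/* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)	/* 1.0 in fixed point (2048) */
#define EXP_5    2014			/* 1/exp(5sec/5min) in fixed point */

/*
 * a1 = a0 * e + a * (1 - e)
 */
static inline unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
	unsigned long newload;

	newload = load * exp + active * (FIXED_1 - exp);
	/* Round up while the average is rising so it can reach "active". */
	if (active >= load)
		newload += FIXED_1 - 1;

	return newload / FIXED_1;
}

/* Integer and two-digit fractional parts of a fixed-point load value. */
#define LOAD_INT(x)  ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1 - 1)) * 100)
```

A caller updates its average with
"avg = calc_load(avg, EXP_5, active * FIXED_1);" and prints it with
LOAD_INT(avg) and LOAD_FRAC(avg).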
.../platforms/cell/cpufreq_spudemand.c | 2 +-
arch/powerpc/platforms/cell/spufs/sched.c | 9 +++-----
arch/s390/appldata/appldata_os.c | 4 ----
drivers/cpuidle/governors/menu.c | 4 ----
fs/proc/loadavg.c | 3 ---
include/linux/sched/loadavg.h | 21 +++++++++++++++----
kernel/debug/kdb/kdb_main.c | 7 +------
kernel/sched/loadavg.c | 15 -------------
8 files changed, 22 insertions(+), 43 deletions(-)
diff --git a/arch/powerpc/platforms/cell/cpufreq_spudemand.c b/arch/powerpc/platforms/cell/cpufreq_spudemand.c
index 882944c36ef5..5d8e8b6bb1cc 100644
--- a/arch/powerpc/platforms/cell/cpufreq_spudemand.c
+++ b/arch/powerpc/platforms/cell/cpufreq_spudemand.c
@@ -49,7 +49,7 @@ static int calc_freq(struct spu_gov_info_struct *info)
cpu = info->policy->cpu;
busy_spus = atomic_read(&cbe_spu_info[cpu_to_node(cpu)].busy_spus);
- CALC_LOAD(info->busy_spus, EXP, busy_spus * FIXED_1);
+ info->busy_spus = calc_load(info->busy_spus, EXP, busy_spus * FIXED_1);
pr_debug("cpu %d: busy_spus=%d, info->busy_spus=%ld\n",
cpu, busy_spus, info->busy_spus);
diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index c9ef3c532169..9fcccb4490b9 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -987,9 +987,9 @@ static void spu_calc_load(void)
unsigned long active_tasks; /* fixed-point */
active_tasks = count_active_contexts() * FIXED_1;
- CALC_LOAD(spu_avenrun[0], EXP_1, active_tasks);
- CALC_LOAD(spu_avenrun[1], EXP_5, active_tasks);
- CALC_LOAD(spu_avenrun[2], EXP_15, active_tasks);
+ spu_avenrun[0] = calc_load(spu_avenrun[0], EXP_1, active_tasks);
+ spu_avenrun[1] = calc_load(spu_avenrun[1], EXP_5, active_tasks);
+ spu_avenrun[2] = calc_load(spu_avenrun[2], EXP_15, active_tasks);
}
static void spusched_wake(struct timer_list *unused)
@@ -1071,9 +1071,6 @@ void spuctx_switch_state(struct spu_context *ctx,
}
}
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
static int show_spu_loadavg(struct seq_file *s, void *private)
{
int a, b, c;
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 433a994b1a89..54f375627532 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -25,10 +25,6 @@
#include "appldata.h"
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
/*
* OS data
*
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 1aef60d160eb..e508d08b7ccb 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -131,10 +131,6 @@ struct menu_device {
int interval_ptr;
};
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
static inline int get_loadavg(unsigned long load)
{
return LOAD_INT(load) * 10 + LOAD_FRAC(load) / 10;
diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index d06694757201..8468baee951d 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -10,9 +10,6 @@
#include <linux/seqlock.h>
#include <linux/time.h>
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
-
static int loadavg_proc_show(struct seq_file *m, void *v)
{
unsigned long avnrun[3];
diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index 80bc84ba5d2a..cc9cc62bb1f8 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -22,10 +22,23 @@ extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
#define EXP_5 2014 /* 1/exp(5sec/5min) */
#define EXP_15 2037 /* 1/exp(5sec/15min) */
-#define CALC_LOAD(load,exp,n) \
- load *= exp; \
- load += n*(FIXED_1-exp); \
- load >>= FSHIFT;
+/*
+ * a1 = a0 * e + a * (1 - e)
+ */
+static inline unsigned long
+calc_load(unsigned long load, unsigned long exp, unsigned long active)
+{
+ unsigned long newload;
+
+ newload = load * exp + active * (FIXED_1 - exp);
+ if (active >= load)
+ newload += FIXED_1-1;
+
+ return newload / FIXED_1;
+}
+
+#define LOAD_INT(x) ((x) >> FSHIFT)
+#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
extern void calc_global_load(unsigned long ticks);
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index 2ddfce8f1e8f..bb4fe4e1a601 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -2556,16 +2556,11 @@ static int kdb_summary(int argc, const char **argv)
}
kdb_printf("%02ld:%02ld\n", val.uptime/(60*60), (val.uptime/60)%60);
- /* lifted from fs/proc/proc_misc.c::loadavg_read_proc() */
-
-#define LOAD_INT(x) ((x) >> FSHIFT)
-#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
kdb_printf("load avg %ld.%02ld %ld.%02ld %ld.%02ld\n",
LOAD_INT(val.loads[0]), LOAD_FRAC(val.loads[0]),
LOAD_INT(val.loads[1]), LOAD_FRAC(val.loads[1]),
LOAD_INT(val.loads[2]), LOAD_FRAC(val.loads[2]));
-#undef LOAD_INT
-#undef LOAD_FRAC
+
/* Display in kilobytes */
#define K(x) ((x) << (PAGE_SHIFT - 10))
kdb_printf("\nMemTotal: %8lu kB\nMemFree: %8lu kB\n"
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index a171c1258109..54fbdfb2d86c 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -91,21 +91,6 @@ long calc_load_fold_active(struct rq *this_rq, long adjust)
return delta;
}
-/*
- * a1 = a0 * e + a * (1 - e)
- */
-static unsigned long
-calc_load(unsigned long load, unsigned long exp, unsigned long active)
-{
- unsigned long newload;
-
- newload = load * exp + active * (FIXED_1 - exp);
- if (active >= load)
- newload += FIXED_1-1;
-
- return newload / FIXED_1;
-}
-
#ifdef CONFIG_NO_HZ_COMMON
/*
* Handle NO_HZ for the global load-average.
--
2.18.0
* Re: [PATCH 4/9] sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD
2018-08-28 17:22 ` [PATCH 4/9] sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD Johannes Weiner
@ 2018-09-12 23:28 ` Andrew Morton
2018-09-13 1:49 ` Johannes Weiner
0 siblings, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2018-09-12 23:28 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Tue, 28 Aug 2018 13:22:53 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> There are several definitions of those functions/macros in places that
> mess with fixed-point load averages. Provide an official version.
missed blk-iolatency.c for some reason?
--- a/block/blk-iolatency.c~sched-loadavg-consolidate-load_int-load_frac-calc_load-fix
+++ a/block/blk-iolatency.c
@@ -139,7 +139,7 @@ struct iolatency_grp {
#define BLKIOLATENCY_MAX_WIN_SIZE NSEC_PER_SEC
/*
* These are the constants used to fake the fixed-point moving average
- * calculation just like load average. The call to CALC_LOAD folds
+ * calculation just like load average. The call to calc_load() folds
* (FIXED_1 (2048) - exp_factor) * new_sample into lat_avg. The sampling
* window size is bucketed to try to approximately calculate average
* latency such that 1/exp (decay rate) is [1 min, 2.5 min) when windows
@@ -503,7 +503,7 @@ static void iolatency_check_latencies(st
lat_info = &parent->child_lat;
/*
- * CALC_LOAD takes in a number stored in fixed point representation.
+ * calc_load() takes in a number stored in fixed point representation.
* Because we are using this for IO time in ns, the values stored
* are significantly larger than the FIXED_1 denominator (2048).
* Therefore, rounding errors in the calculation are negligible and
@@ -512,7 +512,7 @@ static void iolatency_check_latencies(st
exp_idx = min_t(int, BLKIOLATENCY_NR_EXP_FACTORS - 1,
div64_u64(iolat->cur_win_nsec,
BLKIOLATENCY_EXP_BUCKET_SIZE));
- CALC_LOAD(iolat->lat_avg, iolatency_exp_factors[exp_idx], stat.mean);
+ calc_load(iolat->lat_avg, iolatency_exp_factors[exp_idx], stat.mean);
/* Everything is ok and we don't need to adjust the scale. */
if (stat.mean <= iolat->min_lat_nsec &&
_
* Re: [PATCH 4/9] sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD
2018-09-12 23:28 ` Andrew Morton
@ 2018-09-13 1:49 ` Johannes Weiner
0 siblings, 0 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-09-13 1:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Wed, Sep 12, 2018 at 04:28:28PM -0700, Andrew Morton wrote:
> On Tue, 28 Aug 2018 13:22:53 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> > There are several definitions of those functions/macros in places that
> > mess with fixed-point load averages. Provide an official version.
>
> missed blk-iolatency.c for some reason?
Ah, that callsite came in with this merge window. Thanks for the
fixup.
> --- a/block/blk-iolatency.c~sched-loadavg-consolidate-load_int-load_frac-calc_load-fix
> +++ a/block/blk-iolatency.c
> @@ -512,7 +512,7 @@ static void iolatency_check_latencies(st
> exp_idx = min_t(int, BLKIOLATENCY_NR_EXP_FACTORS - 1,
> div64_u64(iolat->cur_win_nsec,
> BLKIOLATENCY_EXP_BUCKET_SIZE));
> - CALC_LOAD(iolat->lat_avg, iolatency_exp_factors[exp_idx], stat.mean);
> + calc_load(iolat->lat_avg, iolatency_exp_factors[exp_idx], stat.mean);
The macro used to modify the avg parameter in place, but with the
function we need an explicit assignment to update the variable:
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index 335c22317757..8793f1344e11 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -512,7 +512,8 @@ static void iolatency_check_latencies(struct iolatency_grp *iolat, u64 now)
exp_idx = min_t(int, BLKIOLATENCY_NR_EXP_FACTORS - 1,
div64_u64(iolat->cur_win_nsec,
BLKIOLATENCY_EXP_BUCKET_SIZE));
- calc_load(iolat->lat_avg, iolatency_exp_factors[exp_idx], stat.mean);
+ iolat->lat_avg = calc_load(iolat->lat_avg,
+ iolatency_exp_factors[exp_idx], stat.mean);
/* Everything is ok and we don't need to adjust the scale. */
if (stat.mean <= iolat->min_lat_nsec &&
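The difference between the two forms can be reproduced in userspace. This is only a sketch: the CALC_LOAD macro below is modeled on the old in-place macro (wrapped in do/while here for safety), calc_load() is the function from patch 4/9, and the update_*() wrappers are hypothetical names for illustration.

```c
#include <assert.h>

#define FSHIFT   11
#define FIXED_1  (1UL << FSHIFT)

/* Macro form, modeled on the old CALC_LOAD: updates 'load' in place. */
#define CALC_LOAD(load, exp, n)				\
	do {						\
		(load) *= (exp);			\
		(load) += (n) * (FIXED_1 - (exp));	\
		(load) >>= FSHIFT;			\
	} while (0)

/* Function form: arguments are passed by value, the result is returned. */
static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
	unsigned long newload;

	newload = load * exp + active * (FIXED_1 - exp);
	if (active >= load)
		newload += FIXED_1 - 1;

	return newload / FIXED_1;
}

/* The macro writes through to 'avg'. */
unsigned long update_with_macro(unsigned long avg, unsigned long exp,
				unsigned long sample)
{
	CALC_LOAD(avg, exp, sample);
	return avg;
}

/* Mechanical conversion: the result is discarded, 'avg' never changes. */
unsigned long update_discarding_result(unsigned long avg, unsigned long exp,
				       unsigned long sample)
{
	(void)calc_load(avg, exp, sample);
	return avg;
}

/* Fixed conversion: assign the return value back. */
unsigned long update_with_assignment(unsigned long avg, unsigned long exp,
				     unsigned long sample)
{
	avg = calc_load(avg, exp, sample);
	return avg;
}
```

The function also rounds up while the average is rising, which the macro did not do, so the two forms can differ by one unit for inputs that do not divide evenly.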
* [PATCH 5/9] sched: loadavg: make calc_load_n() public
2018-08-28 17:22 [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
` (3 preceding siblings ...)
2018-08-28 17:22 ` [PATCH 4/9] sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD Johannes Weiner
@ 2018-08-28 17:22 ` Johannes Weiner
2018-08-28 17:22 ` [PATCH 6/9] sched: sched.h: make rq locking and clock functions available in stats.h Johannes Weiner
` (5 subsequent siblings)
10 siblings, 0 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-08-28 17:22 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
It's going to be used in a later patch. Keep the churn separate.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
include/linux/sched/loadavg.h | 3 +
kernel/sched/loadavg.c | 138 +++++++++++++++++-----------------
2 files changed, 72 insertions(+), 69 deletions(-)
diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index cc9cc62bb1f8..4859bea47a7b 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -37,6 +37,9 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
return newload / FIXED_1;
}
+extern unsigned long calc_load_n(unsigned long load, unsigned long exp,
+ unsigned long active, unsigned int n);
+
#define LOAD_INT(x) ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 54fbdfb2d86c..28a516575c18 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -91,6 +91,75 @@ long calc_load_fold_active(struct rq *this_rq, long adjust)
return delta;
}
+/**
+ * fixed_power_int - compute: x^n, in O(log n) time
+ *
+ * @x: base of the power
+ * @frac_bits: fractional bits of @x
+ * @n: power to raise @x to.
+ *
+ * By exploiting the relation between the definition of the natural power
+ * function: x^n := x*x*...*x (x multiplied by itself for n times), and
+ * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
+ * (where: n_i \elem {0, 1}, the binary vector representing n),
+ * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
+ * of course trivially computable in O(log_2 n), the length of our binary
+ * vector.
+ */
+static unsigned long
+fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
+{
+ unsigned long result = 1UL << frac_bits;
+
+ if (n) {
+ for (;;) {
+ if (n & 1) {
+ result *= x;
+ result += 1UL << (frac_bits - 1);
+ result >>= frac_bits;
+ }
+ n >>= 1;
+ if (!n)
+ break;
+ x *= x;
+ x += 1UL << (frac_bits - 1);
+ x >>= frac_bits;
+ }
+ }
+
+ return result;
+}
+
+/*
+ * a1 = a0 * e + a * (1 - e)
+ *
+ * a2 = a1 * e + a * (1 - e)
+ * = (a0 * e + a * (1 - e)) * e + a * (1 - e)
+ * = a0 * e^2 + a * (1 - e) * (1 + e)
+ *
+ * a3 = a2 * e + a * (1 - e)
+ * = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
+ * = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
+ *
+ * ...
+ *
+ * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
+ * = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
+ * = a0 * e^n + a * (1 - e^n)
+ *
+ * [1] application of the geometric series:
+ *
+ * n 1 - x^(n+1)
+ * S_n := \Sum x^i = -------------
+ * i=0 1 - x
+ */
+unsigned long
+calc_load_n(unsigned long load, unsigned long exp,
+ unsigned long active, unsigned int n)
+{
+ return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
+}
+
#ifdef CONFIG_NO_HZ_COMMON
/*
* Handle NO_HZ for the global load-average.
@@ -210,75 +279,6 @@ static long calc_load_nohz_fold(void)
return delta;
}
-/**
- * fixed_power_int - compute: x^n, in O(log n) time
- *
- * @x: base of the power
- * @frac_bits: fractional bits of @x
- * @n: power to raise @x to.
- *
- * By exploiting the relation between the definition of the natural power
- * function: x^n := x*x*...*x (x multiplied by itself for n times), and
- * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
- * (where: n_i \elem {0, 1}, the binary vector representing n),
- * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
- * of course trivially computable in O(log_2 n), the length of our binary
- * vector.
- */
-static unsigned long
-fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
-{
- unsigned long result = 1UL << frac_bits;
-
- if (n) {
- for (;;) {
- if (n & 1) {
- result *= x;
- result += 1UL << (frac_bits - 1);
- result >>= frac_bits;
- }
- n >>= 1;
- if (!n)
- break;
- x *= x;
- x += 1UL << (frac_bits - 1);
- x >>= frac_bits;
- }
- }
-
- return result;
-}
-
-/*
- * a1 = a0 * e + a * (1 - e)
- *
- * a2 = a1 * e + a * (1 - e)
- * = (a0 * e + a * (1 - e)) * e + a * (1 - e)
- * = a0 * e^2 + a * (1 - e) * (1 + e)
- *
- * a3 = a2 * e + a * (1 - e)
- * = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
- * = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
- *
- * ...
- *
- * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]
- * = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
- * = a0 * e^n + a * (1 - e^n)
- *
- * [1] application of the geometric series:
- *
- * n 1 - x^(n+1)
- * S_n := \Sum x^i = -------------
- * i=0 1 - x
- */
-static unsigned long
-calc_load_n(unsigned long load, unsigned long exp,
- unsigned long active, unsigned int n)
-{
- return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
-}
-
/*
* NO_HZ can leave us missing all per-CPU ticks calling
* calc_load_fold_active(), but since a NO_HZ CPU folds its delta into
--
2.18.0
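As a sanity check of the closed form in the comment above, the following userspace sketch compares calc_load_n() against n repeated calls to calc_load(). fixed_power_int() and calc_load_n() are copied from this patch. FSHIFT and EXP_1 come from the kernel's loadavg header, and calc_load_iterated() is a hypothetical helper added only for the comparison.

```c
#include <assert.h>

#define FSHIFT   11
#define FIXED_1  (1UL << FSHIFT)
#define EXP_1    1884UL            /* 1/exp(5sec/1min) in fixed point */

static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
	unsigned long newload;

	newload = load * exp + active * (FIXED_1 - exp);
	if (active >= load)
		newload += FIXED_1 - 1;

	return newload / FIXED_1;
}

/* x^n in fixed point by binary exponentiation, rounding to nearest */
static unsigned long
fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
{
	unsigned long result = 1UL << frac_bits;

	if (n) {
		for (;;) {
			if (n & 1) {
				result *= x;
				result += 1UL << (frac_bits - 1);
				result >>= frac_bits;
			}
			n >>= 1;
			if (!n)
				break;
			x *= x;
			x += 1UL << (frac_bits - 1);
			x >>= frac_bits;
		}
	}

	return result;
}

/* an = a0 * e^n + a * (1 - e^n), computed in one step */
unsigned long
calc_load_n(unsigned long load, unsigned long exp,
	    unsigned long active, unsigned int n)
{
	return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
}

/* Hypothetical reference: apply the single-step update n times. */
unsigned long
calc_load_iterated(unsigned long load, unsigned long exp,
		   unsigned long active, unsigned int n)
{
	while (n--)
		load = calc_load(load, exp, active);
	return load;
}
```

The two results are not expected to be bit-identical, since the iterated version rounds at every step while the closed form rounds only inside the exponentiation and once at the end. They should agree to within a few fixed-point units.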
* [PATCH 6/9] sched: sched.h: make rq locking and clock functions available in stats.h
2018-08-28 17:22 [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
` (4 preceding siblings ...)
2018-08-28 17:22 ` [PATCH 5/9] sched: loadavg: make calc_load_n() public Johannes Weiner
@ 2018-08-28 17:22 ` Johannes Weiner
2018-08-28 17:22 ` [PATCH 7/9] sched: introduce this_rq_lock_irq() Johannes Weiner
` (4 subsequent siblings)
10 siblings, 0 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-08-28 17:22 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
kernel/sched/sched.h includes "stats.h" half-way through the file. The
next patch introduces users of sched.h's rq locking functions and
update_rq_clock() in kernel/sched/stats.h. Move those definitions up
in the file so they are available in stats.h.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
kernel/sched/sched.h | 164 +++++++++++++++++++++----------------------
1 file changed, 82 insertions(+), 82 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7742dcc136c..eb9b1326906c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -926,6 +926,8 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() raw_cpu_ptr(&runqueues)
+extern void update_rq_clock(struct rq *rq);
+
static inline u64 __rq_clock_broken(struct rq *rq)
{
return READ_ONCE(rq->clock);
@@ -1044,6 +1046,86 @@ static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
#endif
}
+struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
+ __acquires(rq->lock);
+
+struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
+ __acquires(p->pi_lock)
+ __acquires(rq->lock);
+
+static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
+ __releases(rq->lock)
+{
+ rq_unpin_lock(rq, rf);
+ raw_spin_unlock(&rq->lock);
+}
+
+static inline void
+task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
+ __releases(rq->lock)
+ __releases(p->pi_lock)
+{
+ rq_unpin_lock(rq, rf);
+ raw_spin_unlock(&rq->lock);
+ raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
+}
+
+static inline void
+rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
+ __acquires(rq->lock)
+{
+ raw_spin_lock_irqsave(&rq->lock, rf->flags);
+ rq_pin_lock(rq, rf);
+}
+
+static inline void
+rq_lock_irq(struct rq *rq, struct rq_flags *rf)
+ __acquires(rq->lock)
+{
+ raw_spin_lock_irq(&rq->lock);
+ rq_pin_lock(rq, rf);
+}
+
+static inline void
+rq_lock(struct rq *rq, struct rq_flags *rf)
+ __acquires(rq->lock)
+{
+ raw_spin_lock(&rq->lock);
+ rq_pin_lock(rq, rf);
+}
+
+static inline void
+rq_relock(struct rq *rq, struct rq_flags *rf)
+ __acquires(rq->lock)
+{
+ raw_spin_lock(&rq->lock);
+ rq_repin_lock(rq, rf);
+}
+
+static inline void
+rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
+ __releases(rq->lock)
+{
+ rq_unpin_lock(rq, rf);
+ raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
+}
+
+static inline void
+rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
+ __releases(rq->lock)
+{
+ rq_unpin_lock(rq, rf);
+ raw_spin_unlock_irq(&rq->lock);
+}
+
+static inline void
+rq_unlock(struct rq *rq, struct rq_flags *rf)
+ __releases(rq->lock)
+{
+ rq_unpin_lock(rq, rf);
+ raw_spin_unlock(&rq->lock);
+}
+
#ifdef CONFIG_NUMA
enum numa_topology_type {
NUMA_DIRECT,
@@ -1683,8 +1765,6 @@ static inline void sub_nr_running(struct rq *rq, unsigned count)
sched_update_tick_dependency(rq);
}
-extern void update_rq_clock(struct rq *rq);
-
extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
@@ -1765,86 +1845,6 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
static inline void sched_avg_update(struct rq *rq) { }
#endif
-struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
- __acquires(rq->lock);
-
-struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
- __acquires(p->pi_lock)
- __acquires(rq->lock);
-
-static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
- __releases(rq->lock)
-{
- rq_unpin_lock(rq, rf);
- raw_spin_unlock(&rq->lock);
-}
-
-static inline void
-task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
- __releases(rq->lock)
- __releases(p->pi_lock)
-{
- rq_unpin_lock(rq, rf);
- raw_spin_unlock(&rq->lock);
- raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
-}
-
-static inline void
-rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
- __acquires(rq->lock)
-{
- raw_spin_lock_irqsave(&rq->lock, rf->flags);
- rq_pin_lock(rq, rf);
-}
-
-static inline void
-rq_lock_irq(struct rq *rq, struct rq_flags *rf)
- __acquires(rq->lock)
-{
- raw_spin_lock_irq(&rq->lock);
- rq_pin_lock(rq, rf);
-}
-
-static inline void
-rq_lock(struct rq *rq, struct rq_flags *rf)
- __acquires(rq->lock)
-{
- raw_spin_lock(&rq->lock);
- rq_pin_lock(rq, rf);
-}
-
-static inline void
-rq_relock(struct rq *rq, struct rq_flags *rf)
- __acquires(rq->lock)
-{
- raw_spin_lock(&rq->lock);
- rq_repin_lock(rq, rf);
-}
-
-static inline void
-rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
- __releases(rq->lock)
-{
- rq_unpin_lock(rq, rf);
- raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
-}
-
-static inline void
-rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
- __releases(rq->lock)
-{
- rq_unpin_lock(rq, rf);
- raw_spin_unlock_irq(&rq->lock);
-}
-
-static inline void
-rq_unlock(struct rq *rq, struct rq_flags *rf)
- __releases(rq->lock)
-{
- rq_unpin_lock(rq, rf);
- raw_spin_unlock(&rq->lock);
-}
-
#ifdef CONFIG_SMP
#ifdef CONFIG_PREEMPT
--
2.18.0
* [PATCH 7/9] sched: introduce this_rq_lock_irq()
2018-08-28 17:22 [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
` (5 preceding siblings ...)
2018-08-28 17:22 ` [PATCH 6/9] sched: sched.h: make rq locking and clock functions available in stats.h Johannes Weiner
@ 2018-08-28 17:22 ` Johannes Weiner
2018-08-28 17:22 ` [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
` (3 subsequent siblings)
10 siblings, 0 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-08-28 17:22 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
do_sched_yield() disables IRQs, looks up this_rq() and locks it. The
next patch is adding another site with the same pattern, so provide a
convenience function for it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
kernel/sched/core.c | 4 +---
kernel/sched/sched.h | 12 ++++++++++++
2 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe365c9a08e9..61059e671fc6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4960,9 +4960,7 @@ static void do_sched_yield(void)
struct rq_flags rf;
struct rq *rq;
- local_irq_disable();
- rq = this_rq();
- rq_lock(rq, &rf);
+ rq = this_rq_lock_irq(&rf);
schedstat_inc(rq->yld_count);
current->sched_class->yield_task(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eb9b1326906c..83db5de1464c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1126,6 +1126,18 @@ rq_unlock(struct rq *rq, struct rq_flags *rf)
raw_spin_unlock(&rq->lock);
}
+static inline struct rq *
+this_rq_lock_irq(struct rq_flags *rf)
+ __acquires(rq->lock)
+{
+ struct rq *rq;
+
+ local_irq_disable();
+ rq = this_rq();
+ rq_lock(rq, rf);
+ return rq;
+}
+
#ifdef CONFIG_NUMA
enum numa_topology_type {
NUMA_DIRECT,
--
2.18.0
* [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-28 17:22 [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
` (6 preceding siblings ...)
2018-08-28 17:22 ` [PATCH 7/9] sched: introduce this_rq_lock_irq() Johannes Weiner
@ 2018-08-28 17:22 ` Johannes Weiner
2018-08-28 20:11 ` Randy Dunlap
` (2 more replies)
2018-08-28 17:22 ` [PATCH 9/9] psi: cgroup support Johannes Weiner
` (2 subsequent siblings)
10 siblings, 3 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-08-28 17:22 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
When systems are overcommitted and resources become contended, it's
hard to tell exactly the impact this has on workload productivity, or
how close the system is to lockups and OOM kills. In particular, when
machines work multiple jobs concurrently, the impact of overcommit in
terms of latency and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing
individual job health or risking complete machine lockups, this patch
implements a way to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or
IO, respectively. Stall states are aggregate versions of the per-task
delay accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
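The stall conditions listed above can be sketched as predicates over per-CPU task counts. The enum names mirror include/linux/psi_types.h from this patch. The state_active() helper and its exact predicates are an illustrative assumption based on the SOME/FULL definitions in that header, not a copy of the code in kernel/sched/psi.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Tracked task states (as in psi_types.h) */
enum psi_task_count {
	NR_IOWAIT,
	NR_MEMSTALL,
	NR_RUNNING,
	NR_PSI_TASK_COUNTS,
};

/* Pressure states for each resource (as in psi_types.h) */
enum psi_states {
	PSI_IO_SOME,
	PSI_IO_FULL,
	PSI_MEM_SOME,
	PSI_MEM_FULL,
	PSI_CPU_SOME,
	PSI_NONIDLE,
	NR_PSI_STATES,
};

/*
 * Hypothetical helper: is @state in effect for the given task counts?
 *
 * SOME: at least one task is stalled on the resource.
 * FULL: tasks are stalled and nothing is left to run productively.
 */
bool state_active(const unsigned int *tasks, enum psi_states state)
{
	switch (state) {
	case PSI_IO_SOME:
		return tasks[NR_IOWAIT] > 0;
	case PSI_IO_FULL:
		return tasks[NR_IOWAIT] > 0 && tasks[NR_RUNNING] == 0;
	case PSI_MEM_SOME:
		return tasks[NR_MEMSTALL] > 0;
	case PSI_MEM_FULL:
		return tasks[NR_MEMSTALL] > 0 && tasks[NR_RUNNING] == 0;
	case PSI_CPU_SOME:
		/* More runnable tasks than this one CPU can execute */
		return tasks[NR_RUNNING] > 1;
	case PSI_NONIDLE:
		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
		       tasks[NR_RUNNING];
	default:
		return false;
	}
}
```

There is no "full" state for CPU in this model: a CPU with runnable tasks is by definition executing one of them.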
These percentages of walltime can be thought of as pressure
percentages, and they give a general sense of system health and
productivity loss incurred by resource overcommit. They can also
indicate when the system is approaching lockup scenarios and OOMs.
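A consumer of these files might look roughly like the sketch below. It assumes the line format given in Documentation/accounting/psi.txt from this patch ("some avg10=... avg60=... avg300=... total=..."). struct psi_line, parse_psi_line() and stall_pct() are hypothetical names, and the unit of total is left to the caller since the documentation here does not state it.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed line of a /proc/pressure/ file (hypothetical layout) */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* percentages of walltime */
	unsigned long long total;	/* absolute stall time */
};

/* Parse one line; returns 0 on success, -1 if the line is malformed. */
int parse_psi_line(const char *line, struct psi_line *out)
{
	int n;

	n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   out->kind, &out->avg10, &out->avg60, &out->avg300,
		   &out->total);
	return n == 5 ? 0 : -1;
}

/*
 * Pressure over a custom window, from two samples of total= taken
 * @elapsed apart. @elapsed must be in the same time unit as total.
 */
double stall_pct(unsigned long long total_prev,
		 unsigned long long total_now,
		 unsigned long long elapsed)
{
	if (!elapsed)
		return 0.0;
	return 100.0 * (double)(total_now - total_prev) / (double)elapsed;
}
```

Sampling total= at the start and end of an interval and passing the difference to stall_pct() gives an average over that interval, independent of the fixed 10s/1m/5m windows.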
To do this, psi keeps track of the task states associated with each
CPU and samples the time they spend in stall states. Every 2 seconds,
the samples are averaged across CPUs - weighted by the CPUs' non-idle
time to eliminate artifacts from unused CPUs - and translated into
percentages of walltime. A running average of those percentages is
maintained over 10s, 1m, and 5m periods (similar to the load average).
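The weighting step can be illustrated with a small userspace sketch. This is a simplified model of the idea described above, not the aggregation code from kernel/sched/psi.c: struct cpu_sample and weighted_stall_us() are hypothetical, and times are kept in microseconds so the products stay well within 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-CPU sample for one period (hypothetical layout) */
struct cpu_sample {
	uint64_t stall_us;	/* time spent in the stall state */
	uint64_t nonidle_us;	/* time the CPU had any tasks at all */
};

/*
 * Average the per-CPU stall times, weighting each CPU by its non-idle
 * time. A CPU that sat idle for the whole period gets weight zero and
 * so does not dilute the pressure seen on busy CPUs.
 */
uint64_t weighted_stall_us(const struct cpu_sample *cpus, size_t nr_cpus)
{
	uint64_t weighted = 0;
	uint64_t nonidle_total = 0;
	size_t i;

	for (i = 0; i < nr_cpus; i++) {
		weighted += cpus[i].stall_us * cpus[i].nonidle_us;
		nonidle_total += cpus[i].nonidle_us;
	}

	/* No CPU did anything: no pressure to report. */
	if (!nonidle_total)
		return 0;

	return weighted / nonidle_total;
}
```

With one CPU stalled for 1s out of a 2s period and a second CPU fully idle, this yields 1s of stall, or 50% of the period. A plain arithmetic mean over both CPUs would have reported 25%.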
v2:
- stable clock tick, as per Peter
- data structure layout optimization, as per Peter
- fix u64 divisions on 32 bit, as per Peter
- outermost psi_disabled checks, as per Peter
- coding style fixes, as per Peter
- just-in-time stats aggregation, as per Suren
- fix task state corruption with CONFIG_PREEMPT, as per Suren
- CONFIG_PSI=n build error
- avoid writing p->sched_psi_wake_requeue unnecessarily
- documentation & comment updates
v3:
- pack scheduler hotpath data into one cacheline, as per Peter and Linus
- drop unnecessary SCHED_INFO dependency, as per Peter
- lockless live-state aggregation, as per Peter
- do_div -> div64_ul and some other cleanups, as per Peter
- realtime sampling period and slipped sample handling, as per Tejun
v4:
- replace an unsafe cpu_curr() dereference in the aggregator by
sampling active reclaimers from scheduler_tick(), as per Peter
- fix several race conditions that cause the unlocked live aggregator
to get ahead of the scheduler's recorded times and cause sample
calculations to underflow into bogusly large time deltas, as per Suren
- fix rare accounting artifacts from CPU hotplugging, as per Peter
- make the aggregation loop over all states more readable, as per Peter
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/accounting/psi.txt | 64 +++
include/linux/psi.h | 28 ++
include/linux/psi_types.h | 92 +++++
include/linux/sched.h | 10 +
init/Kconfig | 15 +
kernel/fork.c | 4 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 12 +-
kernel/sched/psi.c | 650 +++++++++++++++++++++++++++++++
kernel/sched/sched.h | 2 +
kernel/sched/stats.h | 86 ++++
mm/compaction.c | 5 +
mm/filemap.c | 15 +-
mm/page_alloc.c | 9 +
mm/vmscan.c | 9 +
15 files changed, 996 insertions(+), 6 deletions(-)
create mode 100644 Documentation/accounting/psi.txt
create mode 100644 include/linux/psi.h
create mode 100644 include/linux/psi_types.h
create mode 100644 kernel/sched/psi.c
diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
new file mode 100644
index 000000000000..51e7ef14142e
--- /dev/null
+++ b/Documentation/accounting/psi.txt
@@ -0,0 +1,64 @@
+================================
+PSI - Pressure Stall Information
+================================
+
+:Date: April, 2018
+:Author: Johannes Weiner <hannes@cmpxchg.org>
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact they have on complex
+workloads or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==================
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+The format for CPU is as such:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are stalled on a given resource.
+
+The "full" line indicates the share of time in which all non-idle
+tasks are stalled on a given resource simultaneously. In this state
+actual CPU cycles are going to waste, and a workload that spends
+extended time in this state is considered to be thrashing. This has
+severe impact on performance, and it's useful to distinguish this
+situation from a state where some tasks are stalled but the CPU is
+still doing productive work. As such, time spent in this subset of the
+stall state is tracked separately and exported in the "full" averages.
+
+The ratios are tracked as recent trends over ten, sixty, and three
+hundred second windows, which gives insight into short term events as
+well as medium and long term trends. The total absolute stall time is
+tracked and exported as well, to allow detection of latency spikes
+which wouldn't necessarily make a dent in the time averages, or to
+average trends over custom time frames.
diff --git a/include/linux/psi.h b/include/linux/psi.h
new file mode 100644
index 000000000000..b0daf050de58
--- /dev/null
+++ b/include/linux/psi.h
@@ -0,0 +1,28 @@
+#ifndef _LINUX_PSI_H
+#define _LINUX_PSI_H
+
+#include <linux/psi_types.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_PSI
+
+extern bool psi_disabled;
+
+void psi_init(void);
+
+void psi_task_change(struct task_struct *task, int clear, int set);
+
+void psi_memstall_tick(struct task_struct *task, int cpu);
+void psi_memstall_enter(unsigned long *flags);
+void psi_memstall_leave(unsigned long *flags);
+
+#else /* CONFIG_PSI */
+
+static inline void psi_init(void) {}
+
+static inline void psi_memstall_enter(unsigned long *flags) {}
+static inline void psi_memstall_leave(unsigned long *flags) {}
+
+#endif /* CONFIG_PSI */
+
+#endif /* _LINUX_PSI_H */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
new file mode 100644
index 000000000000..2cf422db5d18
--- /dev/null
+++ b/include/linux/psi_types.h
@@ -0,0 +1,92 @@
+#ifndef _LINUX_PSI_TYPES_H
+#define _LINUX_PSI_TYPES_H
+
+#include <linux/seqlock.h>
+#include <linux/types.h>
+
+#ifdef CONFIG_PSI
+
+/* Tracked task states */
+enum psi_task_count {
+ NR_IOWAIT,
+ NR_MEMSTALL,
+ NR_RUNNING,
+ NR_PSI_TASK_COUNTS,
+};
+
+/* Task state bitmasks */
+#define TSK_IOWAIT (1 << NR_IOWAIT)
+#define TSK_MEMSTALL (1 << NR_MEMSTALL)
+#define TSK_RUNNING (1 << NR_RUNNING)
+
+/* Resources that workloads could be stalled on */
+enum psi_res {
+ PSI_IO,
+ PSI_MEM,
+ PSI_CPU,
+ NR_PSI_RESOURCES,
+};
+
+/*
+ * Pressure states for each resource:
+ *
+ * SOME: Stalled tasks & working tasks
+ * FULL: Stalled tasks & no working tasks
+ */
+enum psi_states {
+ PSI_IO_SOME,
+ PSI_IO_FULL,
+ PSI_MEM_SOME,
+ PSI_MEM_FULL,
+ PSI_CPU_SOME,
+ /* Only per-CPU, to weigh the CPU in the global average: */
+ PSI_NONIDLE,
+ NR_PSI_STATES,
+};
+
+struct psi_group_cpu {
+ /* 1st cacheline updated by the scheduler */
+
+ /* Aggregator needs to know of concurrent changes */
+ seqcount_t seq ____cacheline_aligned_in_smp;
+
+ /* States of the tasks belonging to this group */
+ unsigned int tasks[NR_PSI_TASK_COUNTS];
+
+ /* Period time sampling buckets for each state of interest (ns) */
+ u32 times[NR_PSI_STATES];
+
+ /* Time of last task change in this group (rq_clock) */
+ u64 state_start;
+
+ /* 2nd cacheline updated by the aggregator */
+
+ /* Delta detection against the sampling buckets */
+ u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp;
+};
+
+struct psi_group {
+ /* Protects data updated during an aggregation */
+ struct mutex stat_lock;
+
+ /* Per-cpu task state & time tracking */
+ struct psi_group_cpu __percpu *pcpu;
+
+ /* Periodic aggregation state */
+ u64 total_prev[NR_PSI_STATES - 1];
+ u64 last_update;
+ u64 next_update;
+ struct delayed_work clock_work;
+
+ /* Total stall times and sampled pressure averages */
+ u64 total[NR_PSI_STATES - 1];
+ unsigned long avg[NR_PSI_STATES - 1][3];
+};
+
+#else /* CONFIG_PSI */
+
+struct psi_group { };
+
+#endif /* CONFIG_PSI */
+
+#endif /* _LINUX_PSI_TYPES_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 43731fe51c97..87c2fe4a28b3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -25,6 +25,7 @@
#include <linux/latencytop.h>
#include <linux/sched/prio.h>
#include <linux/signal_types.h>
+#include <linux/psi_types.h>
#include <linux/mm_types_task.h>
#include <linux/task_io_accounting.h>
#include <linux/rseq.h>
@@ -710,6 +711,10 @@ struct task_struct {
unsigned sched_contributes_to_load:1;
unsigned sched_migrated:1;
unsigned sched_remote_wakeup:1;
+#ifdef CONFIG_PSI
+ unsigned sched_psi_wake_requeue:1;
+#endif
+
/* Force alignment to the next boundary: */
unsigned :0;
@@ -957,6 +962,10 @@ struct task_struct {
siginfo_t *last_siginfo;
struct task_io_accounting ioac;
+#ifdef CONFIG_PSI
+ /* Pressure stall state */
+ unsigned int psi_flags;
+#endif
#ifdef CONFIG_TASK_XACCT
/* Accumulated RSS usage: */
u64 acct_rss_mem1;
@@ -1397,6 +1406,7 @@ extern struct pid *cad_pid;
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
+#define PF_MEMSTALL 0x01000000 /* Stalled due to lack of memory */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
diff --git a/init/Kconfig b/init/Kconfig
index 041f3a022122..98d59bc268df 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -455,6 +455,21 @@ config TASK_IO_ACCOUNTING
Say N if unsure.
+config PSI
+ bool "Pressure stall information tracking"
+ help
+ Collect metrics that indicate how overcommitted the CPU, memory,
+ and IO capacity are in the system.
+
+ If you say Y here, the kernel will create /proc/pressure/ with the
+ pressure statistics files cpu, memory, and io. These will indicate
+ the share of walltime in which some or all tasks in the system are
+ delayed due to contention of the respective resource.
+
+ For more details see Documentation/accounting/psi.txt.
+
+ Say N if unsure.
+
endmenu # "CPU/Task time and stats accounting"
config CPU_ISOLATION
diff --git a/kernel/fork.c b/kernel/fork.c
index 1b27babc4c78..f6cd2dd13db8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1736,6 +1736,10 @@ static __latent_entropy struct task_struct *copy_process(
p->default_timer_slack_ns = current->timer_slack_ns;
+#ifdef CONFIG_PSI
+ p->psi_flags = 0;
+#endif
+
task_io_accounting_init(&p->ioac);
acct_clear_integrals(p);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d9a02b318108..b29bc18f2704 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_CPU_ISOLATION) += isolation.o
+obj-$(CONFIG_PSI) += psi.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 61059e671fc6..0fa008c43400 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -744,8 +744,10 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);
- if (!(flags & ENQUEUE_RESTORE))
+ if (!(flags & ENQUEUE_RESTORE)) {
sched_info_queued(rq, p);
+ psi_enqueue(p, flags & ENQUEUE_WAKEUP);
+ }
p->sched_class->enqueue_task(rq, p, flags);
}
@@ -755,8 +757,10 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & DEQUEUE_NOCLOCK))
update_rq_clock(rq);
- if (!(flags & DEQUEUE_SAVE))
+ if (!(flags & DEQUEUE_SAVE)) {
sched_info_dequeued(rq, p);
+ psi_dequeue(p, flags & DEQUEUE_SLEEP);
+ }
p->sched_class->dequeue_task(rq, p, flags);
}
@@ -2060,6 +2064,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;
+ psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
}
@@ -3078,6 +3083,7 @@ void scheduler_tick(void)
curr->sched_class->task_tick(rq, curr, 0);
cpu_load_update_active(rq);
calc_global_load_tick(rq);
+ psi_task_tick(rq);
rq_unlock(rq, &rf);
@@ -6110,6 +6116,8 @@ void __init sched_init(void)
init_schedstats();
+ psi_init();
+
scheduler_running = 1;
}
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
new file mode 100644
index 000000000000..92489e66840b
--- /dev/null
+++ b/kernel/sched/psi.c
@@ -0,0 +1,650 @@
+/*
+ * Pressure stall information for CPU, memory and IO
+ *
+ * Copyright (c) 2018 Facebook, Inc.
+ * Author: Johannes Weiner <hannes@cmpxchg.org>
+ *
+ * When CPU, memory and IO are contended, tasks experience delays that
+ * reduce throughput and introduce latencies into the workload. Memory
+ * and IO contention, in addition, can cause a full loss of forward
+ * progress in which the CPU goes idle.
+ *
+ * This code aggregates individual task delays into resource pressure
+ * metrics that indicate problems with both workload health and
+ * resource utilization.
+ *
+ * Model
+ *
+ * The time in which a task can execute on a CPU is our baseline for
+ * productivity. Pressure expresses the amount of time in which this
+ * potential cannot be realized due to resource contention.
+ *
+ * This concept of productivity has two components: the workload and
+ * the CPU. To measure the impact of pressure on both, we define two
+ * contention states for a resource: SOME and FULL.
+ *
+ * In the SOME state of a given resource, one or more tasks are
+ * delayed on that resource. This affects the workload's ability to
+ * perform work, but the CPU may still be executing other tasks.
+ *
+ * In the FULL state of a given resource, all non-idle tasks are
+ * delayed on that resource such that nobody is advancing and the CPU
+ * goes idle. This leaves both workload and CPU unproductive.
+ *
+ * (Naturally, the FULL state doesn't exist for the CPU resource.)
+ *
+ * SOME = nr_delayed_tasks != 0
+ * FULL = nr_delayed_tasks != 0 && nr_running_tasks == 0
+ *
+ * The percentage of wallclock time spent in those compound stall
+ * states gives pressure numbers between 0 and 100 for each resource,
+ * where the SOME percentage indicates workload slowdowns and the FULL
+ * percentage indicates reduced CPU utilization:
+ *
+ * %SOME = time(SOME) / period
+ * %FULL = time(FULL) / period
+ *
+ * Multiple CPUs
+ *
+ * The more tasks and available CPUs there are, the more work can be
+ * performed concurrently. This means that the potential that can go
+ * unrealized due to resource contention *also* scales with non-idle
+ * tasks and CPUs.
+ *
+ * Consider a scenario where 257 number crunching tasks are trying to
+ * run concurrently on 256 CPUs. If we simply aggregated the task
+ * states, we would have to conclude a CPU SOME pressure number of
+ * 100%, since *somebody* is waiting on a runqueue at all
+ * times. However, that is clearly not the amount of contention the
+ * workload is experiencing: only one out of 256 possible execution
+ * threads will be contended at any given time, or about 0.4%.
+ *
+ * Conversely, consider a scenario of 4 tasks and 4 CPUs where at any
+ * given time *one* of the tasks is delayed due to a lack of memory.
+ * Again, looking purely at the task state would yield a memory FULL
+ * pressure number of 0%, since *somebody* is always making forward
+ * progress. But again this wouldn't capture the amount of execution
+ * potential lost, which is 1 out of 4 CPUs, or 25%.
+ *
+ * To calculate wasted potential (pressure) with multiple processors,
+ * we have to base our calculation on the number of non-idle tasks in
+ * conjunction with the number of available CPUs, which is the number
+ * of potential execution threads. SOME becomes then the proportion of
+ * delayed tasks to possible threads, and FULL is the share of possible
+ * threads that are unproductive due to delays:
+ *
+ * threads = min(nr_nonidle_tasks, nr_cpus)
+ * SOME = min(nr_delayed_tasks / threads, 1)
+ * FULL = (threads - min(nr_running_tasks, threads)) / threads
+ *
+ * For the 257 number crunchers on 256 CPUs, this yields:
+ *
+ * threads = min(257, 256)
+ * SOME = min(1 / 256, 1) = 0.4%
+ * FULL = (256 - min(257, 256)) / 256 = 0%
+ *
+ * For the 1 out of 4 memory-delayed tasks, this yields:
+ *
+ * threads = min(4, 4)
+ * SOME = min(1 / 4, 1) = 25%
+ * FULL = (4 - min(3, 4)) / 4 = 25%
+ *
+ * [ Substitute nr_cpus with 1, and you can see that it's a natural
+ * extension of the single-CPU model. ]
+ *
+ * Implementation
+ *
+ * To assess the precise time spent in each such state, we would have
+ * to freeze the system on task changes and start/stop the state
+ * clocks accordingly. Obviously that doesn't scale in practice.
+ *
+ * Because the scheduler aims to distribute the compute load evenly
+ * among the available CPUs, we can track task state locally to each
+ * CPU and, at much lower frequency, extrapolate the global state for
+ * the cumulative stall times and the running averages.
+ *
+ * For each runqueue, we track:
+ *
+ * tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)
+ * tFULL[cpu] = time(nr_delayed_tasks[cpu] && !nr_running_tasks[cpu])
+ * tNONIDLE[cpu] = time(nr_nonidle_tasks[cpu] != 0)
+ *
+ * and then periodically aggregate:
+ *
+ * tNONIDLE = sum(tNONIDLE[i])
+ *
+ * tSOME = sum(tSOME[i] * tNONIDLE[i]) / tNONIDLE
+ * tFULL = sum(tFULL[i] * tNONIDLE[i]) / tNONIDLE
+ *
+ * %SOME = tSOME / period
+ * %FULL = tFULL / period
+ *
+ * This gives us an approximation of pressure that is practical
+ * cost-wise, yet way more sensitive and accurate than periodic
+ * sampling of the aggregate task states would be.
+ */
+
+#include <linux/sched/loadavg.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <linux/seqlock.h>
+#include <linux/cgroup.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/psi.h>
+#include "sched.h"
+
+static int psi_bug __read_mostly;
+
+bool psi_disabled __read_mostly;
+core_param(psi_disabled, psi_disabled, bool, 0644);
+
+/* Running averages - we need to be higher-res than loadavg */
+#define PSI_FREQ (2*HZ+1) /* 2 sec intervals */
+#define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */
+#define EXP_60s 1981 /* 1/exp(2s/60s) */
+#define EXP_300s 2034 /* 1/exp(2s/300s) */
+
+/* Sampling frequency in nanoseconds */
+static u64 psi_period __read_mostly;
+
+/* System-level pressure and stall tracking */
+static DEFINE_PER_CPU(struct psi_group_cpu, system_group_pcpu);
+static struct psi_group psi_system = {
+ .pcpu = &system_group_pcpu,
+};
+
+static void psi_clock(struct work_struct *work);
+
+static void group_init(struct psi_group *group)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
+ group->next_update = sched_clock() + psi_period;
+ INIT_DELAYED_WORK(&group->clock_work, psi_clock);
+ mutex_init(&group->stat_lock);
+}
+
+void __init psi_init(void)
+{
+ if (psi_disabled)
+ return;
+
+ psi_period = jiffies_to_nsecs(PSI_FREQ);
+ group_init(&psi_system);
+}
+
+static bool test_state(unsigned int *tasks, enum psi_states state)
+{
+ switch (state) {
+ case PSI_IO_SOME:
+ return tasks[NR_IOWAIT];
+ case PSI_IO_FULL:
+ return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
+ case PSI_MEM_SOME:
+ return tasks[NR_MEMSTALL];
+ case PSI_MEM_FULL:
+ return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
+ case PSI_CPU_SOME:
+ return tasks[NR_RUNNING] > 1;
+ case PSI_NONIDLE:
+ return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
+ tasks[NR_RUNNING];
+ default:
+ return false;
+ }
+}
+
+static u32 get_recent_time(struct psi_group *group, int cpu,
+ enum psi_states state)
+{
+ struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+ unsigned int seq;
+ u32 time, delta;
+
+ do {
+ seq = read_seqcount_begin(&groupc->seq);
+
+ time = groupc->times[state];
+ /*
+ * In addition to already concluded states, we also
+ * incorporate currently active states on the CPU,
+ * since states may last for many sampling periods.
+ *
+ * This way we keep our delta sampling buckets small
+ * (u32) and our reported pressure close to what's
+ * actually happening.
+ */
+ if (test_state(groupc->tasks, state))
+ time += cpu_clock(cpu) - groupc->state_start;
+ } while (read_seqcount_retry(&groupc->seq, seq));
+
+ delta = time - groupc->times_prev[state];
+ groupc->times_prev[state] = time;
+
+ return delta;
+}
+
+static void calc_avgs(unsigned long avg[3], int missed_periods,
+ u64 time, u64 period)
+{
+ unsigned long pct;
+
+ /* Fill in zeroes for periods of no activity */
+ if (missed_periods) {
+ avg[0] = calc_load_n(avg[0], EXP_10s, 0, missed_periods);
+ avg[1] = calc_load_n(avg[1], EXP_60s, 0, missed_periods);
+ avg[2] = calc_load_n(avg[2], EXP_300s, 0, missed_periods);
+ }
+
+ /* Sample the most recent active period */
+ pct = div_u64(time * 100, period);
+ pct *= FIXED_1;
+ avg[0] = calc_load(avg[0], EXP_10s, pct);
+ avg[1] = calc_load(avg[1], EXP_60s, pct);
+ avg[2] = calc_load(avg[2], EXP_300s, pct);
+}
+
+static bool update_stats(struct psi_group *group)
+{
+ u64 deltas[NR_PSI_STATES - 1] = { 0, };
+ unsigned long missed_periods = 0;
+ unsigned long nonidle_total = 0;
+ u64 now, expires, period;
+ int cpu;
+ int s;
+
+ mutex_lock(&group->stat_lock);
+
+ /*
+ * Collect the per-cpu time buckets and average them into a
+ * single time sample that is normalized to wallclock time.
+ *
+ * For averaging, each CPU is weighted by its non-idle time in
+ * the sampling period. This eliminates artifacts from uneven
+ * loading, or even entirely idle CPUs.
+ */
+ for_each_possible_cpu(cpu) {
+ u32 nonidle;
+
+ nonidle = get_recent_time(group, cpu, PSI_NONIDLE);
+ nonidle = nsecs_to_jiffies(nonidle);
+ nonidle_total += nonidle;
+
+ for (s = 0; s < PSI_NONIDLE; s++) {
+ u32 delta;
+
+ delta = get_recent_time(group, cpu, s);
+ deltas[s] += (u64)delta * nonidle;
+ }
+ }
+
+ /*
+ * Integrate the sample into the running statistics that are
+ * reported to userspace: the cumulative stall times and the
+ * decaying averages.
+ *
+ * Pressure percentages are sampled at PSI_FREQ. We might be
+ * called more often when the user polls more frequently than
+ * that; we might be called less often when there is no task
+ * activity, thus no data, and clock ticks are sporadic. The
+ * below handles both.
+ */
+
+ /* total= */
+ for (s = 0; s < NR_PSI_STATES - 1; s++)
+ group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL));
+
+ /* avgX= */
+ now = sched_clock();
+ expires = group->next_update;
+ if (now < expires)
+ goto out;
+ if (now - expires > psi_period)
+ missed_periods = div_u64(now - expires, psi_period);
+
+ /*
+ * The periodic clock tick can get delayed for various
+ * reasons, especially on loaded systems. To avoid clock
+ * drift, we schedule the clock in fixed psi_period intervals.
+ * But the deltas we sample out of the per-cpu buckets above
+ * are based on the actual time elapsing between clock ticks.
+ */
+ group->next_update = expires + ((1 + missed_periods) * psi_period);
+ period = now - (group->last_update + (missed_periods * psi_period));
+ group->last_update = now;
+
+ for (s = 0; s < NR_PSI_STATES - 1; s++) {
+ u32 sample;
+
+ sample = group->total[s] - group->total_prev[s];
+ /*
+ * Due to the lockless sampling of the time buckets,
+ * recorded time deltas can slip into the next period,
+ * which under full pressure can result in samples in
+ * excess of the period length.
+ *
+ * We don't want to report nonsensical pressures in
+ * excess of 100%, nor do we want to drop such events
+ * on the floor. Instead we punt any overage into the
+ * future until pressure subsides. By doing this we
+ * don't underreport the occurring pressure curve, we
+ * just report it delayed by one period length.
+ *
+ * The error isn't cumulative. As soon as another
+ * delta slips from a period P to P+1, by definition
+ * it frees up its time T in P.
+ */
+ if (sample > period)
+ sample = period;
+ group->total_prev[s] += sample;
+ calc_avgs(group->avg[s], missed_periods, sample, period);
+ }
+out:
+ mutex_unlock(&group->stat_lock);
+ return nonidle_total;
+}
+
+static void psi_clock(struct work_struct *work)
+{
+ struct delayed_work *dwork;
+ struct psi_group *group;
+ bool nonidle;
+
+ dwork = to_delayed_work(work);
+ group = container_of(dwork, struct psi_group, clock_work);
+
+ /*
+ * If there is task activity, periodically fold the per-cpu
+ * times and feed samples into the running averages. If things
+ * are idle and there is no data to process, stop the clock.
+ * Once restarted, we'll catch up the running averages in one
+ * go - see calc_avgs() and missed_periods.
+ */
+
+ nonidle = update_stats(group);
+
+ if (nonidle) {
+ unsigned long delay = 0;
+ u64 now;
+
+ now = sched_clock();
+ if (group->next_update > now)
+ delay = nsecs_to_jiffies(group->next_update - now) + 1;
+ schedule_delayed_work(dwork, delay);
+ }
+}
+
+static void record_times(struct psi_group_cpu *groupc, int cpu,
+ bool memstall_tick)
+{
+ u32 delta;
+ u64 now;
+
+ now = cpu_clock(cpu);
+ delta = now - groupc->state_start;
+ groupc->state_start = now;
+
+ if (test_state(groupc->tasks, PSI_IO_SOME)) {
+ groupc->times[PSI_IO_SOME] += delta;
+ if (test_state(groupc->tasks, PSI_IO_FULL))
+ groupc->times[PSI_IO_FULL] += delta;
+ }
+
+ if (test_state(groupc->tasks, PSI_MEM_SOME)) {
+ groupc->times[PSI_MEM_SOME] += delta;
+ if (test_state(groupc->tasks, PSI_MEM_FULL))
+ groupc->times[PSI_MEM_FULL] += delta;
+ else if (memstall_tick) {
+ u32 sample;
+ /*
+ * Since we care about lost potential, a
+ * memstall is FULL when there are no other
+ * working tasks, but also when the CPU is
+ * actively reclaiming and nothing productive
+ * could run even if it were runnable.
+ *
+ * When the timer tick sees a reclaiming CPU,
+ * regardless of runnable tasks, sample a FULL
+ * tick (or less if it hasn't been a full tick
+ * since the last state change).
+ */
+ sample = min(delta, (u32)jiffies_to_nsecs(1));
+ groupc->times[PSI_MEM_FULL] += sample;
+ }
+ }
+
+ if (test_state(groupc->tasks, PSI_CPU_SOME))
+ groupc->times[PSI_CPU_SOME] += delta;
+
+ if (test_state(groupc->tasks, PSI_NONIDLE))
+ groupc->times[PSI_NONIDLE] += delta;
+}
+
+static void psi_group_change(struct psi_group *group, int cpu,
+ unsigned int clear, unsigned int set)
+{
+ struct psi_group_cpu *groupc;
+ unsigned int t, m;
+
+ groupc = per_cpu_ptr(group->pcpu, cpu);
+
+ /*
+ * First we assess the aggregate resource states this CPU's
+ * tasks have been in since the last change, and account any
+ * SOME and FULL time these may have resulted in.
+ *
+ * Then we update the task counts according to the state
+ * change requested through the @clear and @set bits.
+ */
+ write_seqcount_begin(&groupc->seq);
+
+ record_times(groupc, cpu, false);
+
+ for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
+ if (!(m & (1 << t)))
+ continue;
+ if (groupc->tasks[t] == 0 && !psi_bug) {
+ printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u] clear=%x set=%x\n",
+ cpu, t, groupc->tasks[0],
+ groupc->tasks[1], groupc->tasks[2],
+ clear, set);
+ psi_bug = 1;
+ }
+ groupc->tasks[t]--;
+ }
+
+ for (t = 0; set; set &= ~(1 << t), t++)
+ if (set & (1 << t))
+ groupc->tasks[t]++;
+
+ write_seqcount_end(&groupc->seq);
+
+ if (!delayed_work_pending(&group->clock_work))
+ schedule_delayed_work(&group->clock_work, PSI_FREQ);
+}
+
+void psi_task_change(struct task_struct *task, int clear, int set)
+{
+ int cpu = task_cpu(task);
+
+ if (!task->pid)
+ return;
+
+ if (((task->psi_flags & set) ||
+ (task->psi_flags & clear) != clear) &&
+ !psi_bug) {
+ printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
+ task->pid, task->comm, cpu,
+ task->psi_flags, clear, set);
+ psi_bug = 1;
+ }
+
+ task->psi_flags &= ~clear;
+ task->psi_flags |= set;
+
+ psi_group_change(&psi_system, cpu, clear, set);
+}
+
+void psi_memstall_tick(struct task_struct *task, int cpu)
+{
+ struct psi_group_cpu *groupc;
+
+ groupc = per_cpu_ptr(psi_system.pcpu, cpu);
+ write_seqcount_begin(&groupc->seq);
+ record_times(groupc, cpu, true);
+ write_seqcount_end(&groupc->seq);
+}
+
+/**
+ * psi_memstall_enter - mark the beginning of a memory stall section
+ * @flags: flags to handle nested sections
+ *
+ * Marks the calling task as being stalled due to a lack of memory,
+ * such as waiting for a refault or performing reclaim.
+ */
+void psi_memstall_enter(unsigned long *flags)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ if (psi_disabled)
+ return;
+
+ *flags = current->flags & PF_MEMSTALL;
+ if (*flags)
+ return;
+ /*
+ * PF_MEMSTALL setting & accounting needs to be atomic wrt
+ * changes to the task's scheduling state, otherwise we can
+ * race with CPU migration.
+ */
+ rq = this_rq_lock_irq(&rf);
+
+ current->flags |= PF_MEMSTALL;
+ psi_task_change(current, 0, TSK_MEMSTALL);
+
+ rq_unlock_irq(rq, &rf);
+}
+
+/**
+ * psi_memstall_leave - mark the end of a memory stall section
+ * @flags: flags to handle nested memdelay sections
+ *
+ * Marks the calling task as no longer stalled due to lack of memory.
+ */
+void psi_memstall_leave(unsigned long *flags)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ if (psi_disabled)
+ return;
+
+ if (*flags)
+ return;
+ /*
+ * PF_MEMSTALL clearing & accounting needs to be atomic wrt
+ * changes to the task's scheduling state, otherwise we could
+ * race with CPU migration.
+ */
+ rq = this_rq_lock_irq(&rf);
+
+ current->flags &= ~PF_MEMSTALL;
+ psi_task_change(current, TSK_MEMSTALL, 0);
+
+ rq_unlock_irq(rq, &rf);
+}
+
+static int psi_show(struct seq_file *m, struct psi_group *group,
+ enum psi_res res)
+{
+ int full;
+
+ if (psi_disabled)
+ return -EOPNOTSUPP;
+
+ update_stats(group);
+
+ for (full = 0; full < 2 - (res == PSI_CPU); full++) {
+ unsigned long avg[3];
+ u64 total;
+ int w;
+
+ for (w = 0; w < 3; w++)
+ avg[w] = group->avg[res * 2 + full][w];
+ total = div_u64(group->total[res * 2 + full], NSEC_PER_USEC);
+
+ seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
+ full ? "full" : "some",
+ LOAD_INT(avg[0]), LOAD_FRAC(avg[0]),
+ LOAD_INT(avg[1]), LOAD_FRAC(avg[1]),
+ LOAD_INT(avg[2]), LOAD_FRAC(avg[2]),
+ total);
+ }
+
+ return 0;
+}
+
+static int psi_io_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_IO);
+}
+
+static int psi_memory_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_MEM);
+}
+
+static int psi_cpu_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_CPU);
+}
+
+static int psi_io_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, psi_io_show, NULL);
+}
+
+static int psi_memory_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, psi_memory_show, NULL);
+}
+
+static int psi_cpu_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, psi_cpu_show, NULL);
+}
+
+static const struct file_operations psi_io_fops = {
+ .open = psi_io_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static const struct file_operations psi_memory_fops = {
+ .open = psi_memory_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static const struct file_operations psi_cpu_fops = {
+ .open = psi_cpu_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init psi_proc_init(void)
+{
+ proc_mkdir("pressure", NULL);
+ proc_create("pressure/io", 0, NULL, &psi_io_fops);
+ proc_create("pressure/memory", 0, NULL, &psi_memory_fops);
+ proc_create("pressure/cpu", 0, NULL, &psi_cpu_fops);
+ return 0;
+}
+module_init(psi_proc_init);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 83db5de1464c..25c5538647ad 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,6 +54,7 @@
#include <linux/proc_fs.h>
#include <linux/prefetch.h>
#include <linux/profile.h>
+#include <linux/psi.h>
#include <linux/rcupdate_wait.h>
#include <linux/security.h>
#include <linux/stackprotector.h>
@@ -320,6 +321,7 @@ extern bool dl_cpu_busy(unsigned int cpu);
#ifdef CONFIG_CGROUP_SCHED
#include <linux/cgroup.h>
+#include <linux/psi.h>
struct cfs_rq;
struct rt_rq;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8aea199a39b4..2e07d8f59b3e 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -55,6 +55,92 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt
# define schedstat_val_or_zero(var) 0
#endif /* CONFIG_SCHEDSTATS */
+#ifdef CONFIG_PSI
+/*
+ * PSI tracks state that persists across sleeps, such as iowaits and
+ * memory stalls. As a result, it has to distinguish between sleeps,
+ * where a task's runnable state changes, and requeues, where a task
+ * and its state are being moved between CPUs and runqueues.
+ */
+static inline void psi_enqueue(struct task_struct *p, bool wakeup)
+{
+ int clear = 0, set = TSK_RUNNING;
+
+ if (psi_disabled)
+ return;
+
+ if (!wakeup || p->sched_psi_wake_requeue) {
+ if (p->flags & PF_MEMSTALL)
+ set |= TSK_MEMSTALL;
+ if (p->sched_psi_wake_requeue)
+ p->sched_psi_wake_requeue = 0;
+ } else {
+ if (p->in_iowait)
+ clear |= TSK_IOWAIT;
+ }
+
+ psi_task_change(p, clear, set);
+}
+
+static inline void psi_dequeue(struct task_struct *p, bool sleep)
+{
+ int clear = TSK_RUNNING, set = 0;
+
+ if (psi_disabled)
+ return;
+
+ if (!sleep) {
+ if (p->flags & PF_MEMSTALL)
+ clear |= TSK_MEMSTALL;
+ } else {
+ if (p->in_iowait)
+ set |= TSK_IOWAIT;
+ }
+
+ psi_task_change(p, clear, set);
+}
+
+static inline void psi_ttwu_dequeue(struct task_struct *p)
+{
+ if (psi_disabled)
+ return;
+ /*
+ * Is the task being migrated during a wakeup? Make sure to
+ * deregister its sleep-persistent psi states from the old
+ * queue, and let psi_enqueue() know it has to requeue.
+ */
+ if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
+ struct rq_flags rf;
+ struct rq *rq;
+ int clear = 0;
+
+ if (p->in_iowait)
+ clear |= TSK_IOWAIT;
+ if (p->flags & PF_MEMSTALL)
+ clear |= TSK_MEMSTALL;
+
+ rq = __task_rq_lock(p, &rf);
+ psi_task_change(p, clear, 0);
+ p->sched_psi_wake_requeue = 1;
+ __task_rq_unlock(rq, &rf);
+ }
+}
+
+static inline void psi_task_tick(struct rq *rq)
+{
+ if (psi_disabled)
+ return;
+
+ if (unlikely(rq->curr->flags & PF_MEMSTALL))
+ psi_memstall_tick(rq->curr, rq->cpu);
+}
+#else /* CONFIG_PSI */
+static inline void psi_enqueue(struct task_struct *p, bool wakeup) {}
+static inline void psi_dequeue(struct task_struct *p, bool sleep) {}
+static inline void psi_ttwu_dequeue(struct task_struct *p) {}
+static inline void psi_task_tick(struct rq *rq) {}
+#endif /* CONFIG_PSI */
+
#ifdef CONFIG_SCHED_INFO
static inline void sched_info_reset_dequeued(struct task_struct *t)
{
diff --git a/mm/compaction.c b/mm/compaction.c
index faca45ebe62d..7c607479de4a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -22,6 +22,7 @@
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/page_owner.h>
+#include <linux/psi.h>
#include "internal.h"
#ifdef CONFIG_COMPACTION
@@ -2068,11 +2069,15 @@ static int kcompactd(void *p)
pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
while (!kthread_should_stop()) {
+ unsigned long pflags;
+
trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
wait_event_freezable(pgdat->kcompactd_wait,
kcompactd_work_requested(pgdat));
+ psi_memstall_enter(&pflags);
kcompactd_do_work(pgdat);
+ psi_memstall_leave(&pflags);
}
return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index ca895ebe43ac..5d27f7f51aa4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -37,6 +37,7 @@
#include <linux/shmem_fs.h>
#include <linux/rmap.h>
#include <linux/delayacct.h>
+#include <linux/psi.h>
#include "internal.h"
#define CREATE_TRACE_POINTS
@@ -1075,11 +1076,14 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
bool thrashing = false;
+ unsigned long pflags;
int ret = 0;
- if (bit_nr == PG_locked && !PageSwapBacked(page) &&
+ if (bit_nr == PG_locked &&
!PageUptodate(page) && PageWorkingset(page)) {
- delayacct_thrashing_start();
+ if (!PageSwapBacked(page))
+ delayacct_thrashing_start();
+ psi_memstall_enter(&pflags);
thrashing = true;
}
@@ -1121,8 +1125,11 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
finish_wait(q, wait);
- if (thrashing)
- delayacct_thrashing_end();
+ if (thrashing) {
+ if (!PageSwapBacked(page))
+ delayacct_thrashing_end();
+ psi_memstall_leave(&pflags);
+ }
/*
* A signal could leave PageWaiters set. Clearing it here if
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a790ef4be74e..2974b92273e0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -67,6 +67,7 @@
#include <linux/ftrace.h>
#include <linux/lockdep.h>
#include <linux/nmi.h>
+#include <linux/psi.h>
#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -3549,15 +3550,20 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
enum compact_priority prio, enum compact_result *compact_result)
{
struct page *page;
+ unsigned long pflags;
unsigned int noreclaim_flag;
if (!order)
return NULL;
+ psi_memstall_enter(&pflags);
noreclaim_flag = memalloc_noreclaim_save();
+
*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
prio);
+
memalloc_noreclaim_restore(noreclaim_flag);
+ psi_memstall_leave(&pflags);
if (*compact_result <= COMPACT_INACTIVE)
return NULL;
@@ -3756,11 +3762,13 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
struct reclaim_state reclaim_state;
int progress;
unsigned int noreclaim_flag;
+ unsigned long pflags;
cond_resched();
/* We now go into synchronous reclaim */
cpuset_memory_pressure_bump();
+ psi_memstall_enter(&pflags);
fs_reclaim_acquire(gfp_mask);
noreclaim_flag = memalloc_noreclaim_save();
reclaim_state.reclaimed_slab = 0;
@@ -3772,6 +3780,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
current->reclaim_state = NULL;
memalloc_noreclaim_restore(noreclaim_flag);
fs_reclaim_release(gfp_mask);
+ psi_memstall_leave(&pflags);
cond_resched();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7fdbc18fea6f..818dd786a355 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -49,6 +49,7 @@
#include <linux/prefetch.h>
#include <linux/printk.h>
#include <linux/dax.h>
+#include <linux/psi.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -3131,6 +3132,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
+ unsigned long pflags;
int nid;
unsigned int noreclaim_flag;
struct scan_control sc = {
@@ -3159,9 +3161,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
sc.gfp_mask,
sc.reclaim_idx);
+ psi_memstall_enter(&pflags);
noreclaim_flag = memalloc_noreclaim_save();
+
nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+
memalloc_noreclaim_restore(noreclaim_flag);
+ psi_memstall_leave(&pflags);
trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
@@ -3326,6 +3332,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
int i;
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
+ unsigned long pflags;
struct zone *zone;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
@@ -3336,6 +3343,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
.may_swap = 1,
};
+ psi_memstall_enter(&pflags);
__fs_reclaim_acquire();
count_vm_event(PAGEOUTRUN);
@@ -3437,6 +3445,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
out:
snapshot_refaults(NULL, pgdat);
__fs_reclaim_release();
+ psi_memstall_leave(&pflags);
/*
* Return the order kswapd stopped reclaiming at as
* prepare_kswapd_sleep() takes it into account. If another caller
--
2.18.0
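The multi-CPU model described in the psi.c header comment of the patch above can be sketched in user space as follows. This is only an illustration of the three formulas (threads, SOME, FULL), not kernel code: the names psi_model_sample, psi_model and min_uint are hypothetical, and floating point is used purely for readability, which the kernel implementation does not do.

```c
#include <assert.h>

/* Hypothetical container for the two pressure fractions (0.0 .. 1.0). */
struct psi_model_sample {
	double some;
	double full;
};

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Evaluate the model from the header comment:
 *
 *   threads = min(nr_nonidle_tasks, nr_cpus)
 *   SOME    = min(nr_delayed_tasks / threads, 1)
 *   FULL    = (threads - min(nr_running_tasks, threads)) / threads
 */
static struct psi_model_sample psi_model(unsigned int nr_nonidle,
					 unsigned int nr_delayed,
					 unsigned int nr_running,
					 unsigned int nr_cpus)
{
	struct psi_model_sample s = { 0.0, 0.0 };
	unsigned int threads;

	assert(nr_cpus > 0);

	threads = min_uint(nr_nonidle, nr_cpus);
	if (threads == 0)
		return s;	/* nothing could run, so no potential is lost */

	s.some = (double)nr_delayed / threads;
	if (s.some > 1.0)
		s.some = 1.0;

	s.full = (double)(threads - min_uint(nr_running, threads)) / threads;
	return s;
}
```

Calling psi_model(257, 1, 257, 256) reproduces the number-cruncher example (SOME = 1/256, FULL = 0), and psi_model(4, 1, 3, 4) reproduces the memory-delayed example (SOME = FULL = 25%).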
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-28 17:22 ` [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
@ 2018-08-28 20:11 ` Randy Dunlap
2018-08-28 20:56 ` Johannes Weiner
2018-09-07 10:16 ` Peter Zijlstra
2018-09-07 10:24 ` Peter Zijlstra
2 siblings, 1 reply; 55+ messages in thread
From: Randy Dunlap @ 2018-08-28 20:11 UTC (permalink / raw)
To: Johannes Weiner, Ingo Molnar, Peter Zijlstra, Andrew Morton,
Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On 08/28/2018 10:22 AM, Johannes Weiner wrote:
> diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
> new file mode 100644
> index 000000000000..51e7ef14142e
> --- /dev/null
> +++ b/Documentation/accounting/psi.txt
> @@ -0,0 +1,64 @@
> +================================
> +PSI - Pressure Stall Information
> +================================
> +
> +:Date: April, 2018
> +:Author: Johannes Weiner <hannes@cmpxchg.org>
> +
> +When CPU, memory or IO devices are contended, workloads experience
> +latency spikes, throughput losses, and run the risk of OOM kills.
> +
> +Without an accurate measure of such contention, users are forced to
> +either play it safe and under-utilize their hardware resources, or
> +roll the dice and frequently suffer the disruptions resulting from
> +excessive overcommit.
> +
> +The psi feature identifies and quantifies the disruptions caused by
> +such resource crunches and the time impact it has on complex workloads
> +or even entire systems.
> +
> +Having an accurate measure of productivity losses caused by resource
> +scarcity aids users in sizing workloads to hardware--or provisioning
> +hardware according to workload demand.
> +
> +As psi aggregates this information in realtime, systems can be managed
> +dynamically using techniques such as load shedding, migrating jobs to
> +other systems or data centers, or strategically pausing or killing low
> +priority or restartable batch jobs.
> +
> +This allows maximizing hardware utilization without sacrificing
> +workload health or risking major disruptions such as OOM kills.
> +
> +Pressure interface
> +==================
> +
> +Pressure information for each resource is exported through the
> +respective file in /proc/pressure/ -- cpu, memory, and io.
> +
Hi,
> +In both cases, the format for CPU is as such:
I don't see what "In both cases" refers to here. It seems that you could
just remove it.
> +
> +some avg10=0.00 avg60=0.00 avg300=0.00 total=0
> +
> +and for memory and IO:
> +
> +some avg10=0.00 avg60=0.00 avg300=0.00 total=0
> +full avg10=0.00 avg60=0.00 avg300=0.00 total=0
> +
> +The "some" line indicates the share of time in which at least some
> +tasks are stalled on a given resource.
--
~Randy
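For readers who want to consume the interface discussed in this thread, here is a minimal user-space sketch of parsing one line of a /proc/pressure/ file. It assumes the line layout shown in the quoted documentation and printed by psi_show() in the patch ("some|full avg10=... avg60=... avg300=... total=..."); the names psi_line and psi_parse_line are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical representation of one "some" or "full" line. */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* running averages, in percent */
	unsigned long long total;	/* cumulative stall time, in microseconds */
};

/* Returns 0 on success, -1 if the line does not match the format. */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n;

	assert(line != NULL && out != NULL);

	n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   out->kind, &out->avg10, &out->avg60, &out->avg300,
		   &out->total);
	return n == 5 ? 0 : -1;
}
```

A caller would read each line of, say, /proc/pressure/memory with fgets() and pass it to psi_parse_line(); for the cpu file only a "some" line is expected, as described in the quoted documentation.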
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-28 20:11 ` Randy Dunlap
@ 2018-08-28 20:56 ` Johannes Weiner
2018-08-28 21:30 ` Randy Dunlap
0 siblings, 1 reply; 55+ messages in thread
From: Johannes Weiner @ 2018-08-28 20:56 UTC (permalink / raw)
To: Randy Dunlap
Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds,
Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Tue, Aug 28, 2018 at 01:11:11PM -0700, Randy Dunlap wrote:
> On 08/28/2018 10:22 AM, Johannes Weiner wrote:
> > diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
> > new file mode 100644
> > index 000000000000..51e7ef14142e
> > --- /dev/null
> > +++ b/Documentation/accounting/psi.txt
> > @@ -0,0 +1,64 @@
> > +================================
> > +PSI - Pressure Stall Information
> > +================================
> > +
> > +:Date: April, 2018
> > +:Author: Johannes Weiner <hannes@cmpxchg.org>
> > +
> > +When CPU, memory or IO devices are contended, workloads experience
> > +latency spikes, throughput losses, and run the risk of OOM kills.
> > +
> > +Without an accurate measure of such contention, users are forced to
> > +either play it safe and under-utilize their hardware resources, or
> > +roll the dice and frequently suffer the disruptions resulting from
> > +excessive overcommit.
> > +
> > +The psi feature identifies and quantifies the disruptions caused by
> > +such resource crunches and the time impact it has on complex workloads
> > +or even entire systems.
> > +
> > +Having an accurate measure of productivity losses caused by resource
> > +scarcity aids users in sizing workloads to hardware--or provisioning
> > +hardware according to workload demand.
> > +
> > +As psi aggregates this information in realtime, systems can be managed
> > +dynamically using techniques such as load shedding, migrating jobs to
> > +other systems or data centers, or strategically pausing or killing low
> > +priority or restartable batch jobs.
> > +
> > +This allows maximizing hardware utilization without sacrificing
> > +workload health or risking major disruptions such as OOM kills.
> > +
> > +Pressure interface
> > +==================
> > +
> > +Pressure information for each resource is exported through the
> > +respective file in /proc/pressure/ -- cpu, memory, and io.
> > +
>
> Hi,
>
> > +In both cases, the format for CPU is as such:
>
> I don't see what "In both cases" refers to here. It seems that you could
> just remove it.
You're right, that must be a left-over from when I described CPU
separately; "both cases" referred to memory and IO which have
identical formats. It needs to be removed:
diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
index e051810d5127..b8ca28b60215 100644
--- a/Documentation/accounting/psi.txt
+++ b/Documentation/accounting/psi.txt
@@ -35,7 +35,7 @@ Pressure interface
Pressure information for each resource is exported through the
respective file in /proc/pressure/ -- cpu, memory, and io.
-In both cases, the format for CPU is as such:
+The format for CPU is as such:
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-28 20:56 ` Johannes Weiner
@ 2018-08-28 21:30 ` Randy Dunlap
0 siblings, 0 replies; 55+ messages in thread
From: Randy Dunlap @ 2018-08-28 21:30 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds,
Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On 08/28/2018 01:56 PM, Johannes Weiner wrote:
> On Tue, Aug 28, 2018 at 01:11:11PM -0700, Randy Dunlap wrote:
>> On 08/28/2018 10:22 AM, Johannes Weiner wrote:
>>> diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
>>> new file mode 100644
>>> index 000000000000..51e7ef14142e
>>> --- /dev/null
>>> +++ b/Documentation/accounting/psi.txt
>>> @@ -0,0 +1,64 @@
>>> +================================
>>> +PSI - Pressure Stall Information
>>> +================================
>>> +
>>> +:Date: April, 2018
>>> +:Author: Johannes Weiner <hannes@cmpxchg.org>
>>> +
>>> +When CPU, memory or IO devices are contended, workloads experience
>>> +latency spikes, throughput losses, and run the risk of OOM kills.
>>> +
>>> +Without an accurate measure of such contention, users are forced to
>>> +either play it safe and under-utilize their hardware resources, or
>>> +roll the dice and frequently suffer the disruptions resulting from
>>> +excessive overcommit.
>>> +
>>> +The psi feature identifies and quantifies the disruptions caused by
>>> +such resource crunches and the time impact it has on complex workloads
>>> +or even entire systems.
>>> +
>>> +Having an accurate measure of productivity losses caused by resource
>>> +scarcity aids users in sizing workloads to hardware--or provisioning
>>> +hardware according to workload demand.
>>> +
>>> +As psi aggregates this information in realtime, systems can be managed
>>> +dynamically using techniques such as load shedding, migrating jobs to
>>> +other systems or data centers, or strategically pausing or killing low
>>> +priority or restartable batch jobs.
>>> +
>>> +This allows maximizing hardware utilization without sacrificing
>>> +workload health or risking major disruptions such as OOM kills.
>>> +
>>> +Pressure interface
>>> +==================
>>> +
>>> +Pressure information for each resource is exported through the
>>> +respective file in /proc/pressure/ -- cpu, memory, and io.
>>> +
>>
>> Hi,
>>
>>> +In both cases, the format for CPU is as such:
>>
>> I don't see what "In both cases" refers to here. It seems that you could
>> just remove it.
>
> You're right, that must be a left-over from when I described CPU
> separately; "both cases" referred to memory and IO which have
> identical formats. It needs to be removed:
>
> diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
> index e051810d5127..b8ca28b60215 100644
> --- a/Documentation/accounting/psi.txt
> +++ b/Documentation/accounting/psi.txt
> @@ -35,7 +35,7 @@ Pressure interface
> Pressure information for each resource is exported through the
> respective file in /proc/pressure/ -- cpu, memory, and io.
>
> -In both cases, the format for CPU is as such:
> +The format for CPU is as such:
>
> some avg10=0.00 avg60=0.00 avg300=0.00 total=0
OK. However, after reading patch 9/9, I thought that the "both cases"
could possibly mean the files in /proc/pressure/ and the files in
cgroup ({cpu,io,memory}.pressure).
--
~Randy
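
Since the /proc/pressure/ files and the cgroup {cpu,io,memory}.pressure
files are described as sharing one format, a single parser should be able
to handle both. The following is only a minimal userspace sketch, assuming
the line layout quoted above; struct psi_line and psi_parse_line() are
hypothetical names and are not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One line of a pressure file (hypothetical helper type, not from the patch). */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* running averages, as printed */
	unsigned long long total;	/* absolute stall time, as printed */
};

/*
 * Parse a line such as
 *   "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
 * Returns 0 on success, -1 if the line does not match that layout.
 */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n;

	assert(line && out);

	n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   out->kind, &out->avg10, &out->avg60, &out->avg300,
		   &out->total);
	if (n != 5)
		return -1;
	if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
		return -1;
	return 0;
}
```

A caller would read the file line by line (for instance with fgets()) and
hand each line to psi_parse_line(); the CPU file would then yield only a
"some" line, memory and IO a "some" and a "full" line.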
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-28 17:22 ` [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
2018-08-28 20:11 ` Randy Dunlap
@ 2018-09-07 10:16 ` Peter Zijlstra
2018-09-07 10:21 ` Peter Zijlstra
2018-09-07 14:44 ` Johannes Weiner
2018-09-07 10:24 ` Peter Zijlstra
2 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2018-09-07 10:16 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Tue, Aug 28, 2018 at 01:22:57PM -0400, Johannes Weiner wrote:
> +enum psi_states {
> + PSI_IO_SOME,
> + PSI_IO_FULL,
> + PSI_MEM_SOME,
> + PSI_MEM_FULL,
> + PSI_CPU_SOME,
> + /* Only per-CPU, to weigh the CPU in the global average: */
> + PSI_NONIDLE,
> + NR_PSI_STATES,
> +};
> +static u32 get_recent_time(struct psi_group *group, int cpu,
> + enum psi_states state)
> +{
> + struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> + unsigned int seq;
> + u32 time, delta;
> +
> + do {
> + seq = read_seqcount_begin(&groupc->seq);
> +
> + time = groupc->times[state];
> + /*
> + * In addition to already concluded states, we also
> + * incorporate currently active states on the CPU,
> + * since states may last for many sampling periods.
> + *
> + * This way we keep our delta sampling buckets small
> + * (u32) and our reported pressure close to what's
> + * actually happening.
> + */
> + if (test_state(groupc->tasks, state))
> + time += cpu_clock(cpu) - groupc->state_start;
> + } while (read_seqcount_retry(&groupc->seq, seq));
> +
> + delta = time - groupc->times_prev[state];
> + groupc->times_prev[state] = time;
> +
> + return delta;
> +}
> +static bool update_stats(struct psi_group *group)
> +{
> + u64 deltas[NR_PSI_STATES - 1] = { 0, };
> + unsigned long missed_periods = 0;
> + unsigned long nonidle_total = 0;
> + u64 now, expires, period;
> + int cpu;
> + int s;
> +
> + mutex_lock(&group->stat_lock);
> +
> + /*
> + * Collect the per-cpu time buckets and average them into a
> + * single time sample that is normalized to wallclock time.
> + *
> + * For averaging, each CPU is weighted by its non-idle time in
> + * the sampling period. This eliminates artifacts from uneven
> + * loading, or even entirely idle CPUs.
> + */
> + for_each_possible_cpu(cpu) {
> + u32 nonidle;
> +
> + nonidle = get_recent_time(group, cpu, PSI_NONIDLE);
> + nonidle = nsecs_to_jiffies(nonidle);
> + nonidle_total += nonidle;
> +
> + for (s = 0; s < PSI_NONIDLE; s++) {
> + u32 delta;
> +
> + delta = get_recent_time(group, cpu, s);
> + deltas[s] += (u64)delta * nonidle;
> + }
> + }
This does the whole seqcount thing 6x, which is a bit of a waste.
struct snapshot {
	u32 times[NR_PSI_STATES];
};

static inline struct snapshot get_times_snapshot(struct psi_group *pg, int cpu)
{
	struct psi_group_cpu *pgc = per_cpu_ptr(pg->pcpu, cpu);
	struct snapshot s;
	unsigned int seq;
	u32 delta;
	int i;

	do {
		seq = read_seqcount_begin(&pgc->seq);

		delta = cpu_clock(cpu) - pgc->state_start;
		for (i = 0; i < NR_PSI_STATES; i++) {
			s.times[i] = pgc->times[i];
			if (test_state(pgc->tasks, i))
				s.times[i] += delta;
		}

	} while (read_seqcount_retry(&pgc->seq, seq));

	return s;
}

	for_each_possible_cpu(cpu) {
		struct snapshot s = get_times_snapshot(pg, cpu);

		nonidle = nsecs_to_jiffies(s.times[PSI_NONIDLE]);
		nonidle_total += nonidle;

		for (i = 0; i < PSI_NONIDLE; i++)
			deltas[i] += (u64)s.times[i] * nonidle;

		/* ... */
	}
It's a bit cumbersome, but that's because of C.
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-09-07 10:16 ` Peter Zijlstra
@ 2018-09-07 10:21 ` Peter Zijlstra
2018-09-07 14:44 ` Johannes Weiner
1 sibling, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2018-09-07 10:21 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Fri, Sep 07, 2018 at 12:16:34PM +0200, Peter Zijlstra wrote:
> This does the whole seqcount thing 6x, which is a bit of a waste.
>
> struct snapshot {
> u32 times[NR_PSI_STATES];
> };
>
> static inline struct snapshot get_times_snapshot(struct psi_group *pg, int cpu)
> {
> struct psi_group_cpu *pgc = per_cpu_ptr(pg->pcpu, cpu);
> struct snapshot s;
> unsigned int seq;
> u32 delta;
> int i;
>
> do {
> seq = read_seqcount_begin(&pgc->seq);
>
> delta = cpu_clock(cpu) - pgc->state_start;
> for (i = 0; i < NR_PSI_STATES; i++) {
> s.times[i] = pgc->times[i];
> if (test_state(pgc->tasks, i))
> s.times[i] += delta;
> }
>
> } while (read_seqcount_retry(&pgc->seq, seq));
Sorry, I forgot the whole times_prev thing:
for (i = 0; i < NR_PSI_STATES; i++) {
tmp = s.times[i];
s.times[i] -= pgc->times_prev[i];
pgc->times_prev[i] = tmp;
}
> return s;
> }
>
>
> for_each_possible_cpu(cpu) {
> struct snapshot s = get_times_snapshot(pg, cpu);
>
> nonidle = nsecs_to_jiffies(s.times[PSI_NONIDLE]);
> nonidle_total += nonidle;
>
> for (i = 0; i < PSI_NONIDLE; i++)
> deltas[i] += (u64)s.times[i] * nonidle;
>
> /* ... */
>
> }
>
>
> It's a bit cumbersome, but that's because of C.
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-09-07 10:16 ` Peter Zijlstra
2018-09-07 10:21 ` Peter Zijlstra
@ 2018-09-07 14:44 ` Johannes Weiner
2018-09-07 14:58 ` Peter Zijlstra
1 sibling, 1 reply; 55+ messages in thread
From: Johannes Weiner @ 2018-09-07 14:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Fri, Sep 07, 2018 at 12:16:34PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 28, 2018 at 01:22:57PM -0400, Johannes Weiner wrote:
> > +enum psi_states {
> > + PSI_IO_SOME,
> > + PSI_IO_FULL,
> > + PSI_MEM_SOME,
> > + PSI_MEM_FULL,
> > + PSI_CPU_SOME,
> > + /* Only per-CPU, to weigh the CPU in the global average: */
> > + PSI_NONIDLE,
> > + NR_PSI_STATES,
> > +};
>
> > +static u32 get_recent_time(struct psi_group *group, int cpu,
> > + enum psi_states state)
> > +{
> > + struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> > + unsigned int seq;
> > + u32 time, delta;
> > +
> > + do {
> > + seq = read_seqcount_begin(&groupc->seq);
> > +
> > + time = groupc->times[state];
> > + /*
> > + * In addition to already concluded states, we also
> > + * incorporate currently active states on the CPU,
> > + * since states may last for many sampling periods.
> > + *
> > + * This way we keep our delta sampling buckets small
> > + * (u32) and our reported pressure close to what's
> > + * actually happening.
> > + */
> > + if (test_state(groupc->tasks, state))
> > + time += cpu_clock(cpu) - groupc->state_start;
> > + } while (read_seqcount_retry(&groupc->seq, seq));
> > +
> > + delta = time - groupc->times_prev[state];
> > + groupc->times_prev[state] = time;
> > +
> > + return delta;
> > +}
>
> > +static bool update_stats(struct psi_group *group)
> > +{
> > + u64 deltas[NR_PSI_STATES - 1] = { 0, };
> > + unsigned long missed_periods = 0;
> > + unsigned long nonidle_total = 0;
> > + u64 now, expires, period;
> > + int cpu;
> > + int s;
> > +
> > + mutex_lock(&group->stat_lock);
> > +
> > + /*
> > + * Collect the per-cpu time buckets and average them into a
> > + * single time sample that is normalized to wallclock time.
> > + *
> > + * For averaging, each CPU is weighted by its non-idle time in
> > + * the sampling period. This eliminates artifacts from uneven
> > + * loading, or even entirely idle CPUs.
> > + */
> > + for_each_possible_cpu(cpu) {
> > + u32 nonidle;
> > +
> > + nonidle = get_recent_time(group, cpu, PSI_NONIDLE);
> > + nonidle = nsecs_to_jiffies(nonidle);
> > + nonidle_total += nonidle;
> > +
> > + for (s = 0; s < PSI_NONIDLE; s++) {
> > + u32 delta;
> > +
> > + delta = get_recent_time(group, cpu, s);
> > + deltas[s] += (u64)delta * nonidle;
> > + }
> > + }
>
> This does the whole seqcount thing 6x, which is a bit of a waste.
[...]
> It's a bit cumbersome, but that's because of C.
I was actually debating exactly this with Suren before, but since this
is a super cold path I went with readability. I was also thinking that
restarts could happen quite regularly under heavy scheduler load, and
so keeping the individual retry sections small could be helpful - but
I didn't instrument this in any way.
No strong opinion from me, I can send an updated patch if you prefer.
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-09-07 14:44 ` Johannes Weiner
@ 2018-09-07 14:58 ` Peter Zijlstra
2018-09-07 17:50 ` Johannes Weiner
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2018-09-07 14:58 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Fri, Sep 07, 2018 at 10:44:22AM -0400, Johannes Weiner wrote:
> > This does the whole seqcount thing 6x, which is a bit of a waste.
>
> [...]
>
> > It's a bit cumbersome, but that's because of C.
>
> I was actually debating exactly this with Suren before, but since this
> is a super cold path I went with readability. I was also thinking that
> restarts could happen quite regularly under heavy scheduler load, and
> so keeping the individual retry sections small could be helpful - but
> I didn't instrument this in any way.
I was hoping going over the whole thing once would reduce the time we
need to keep that line in shared mode and reduce traffic. And yes, this
path is cold, but I was thinking about reducing the interference on the
remote CPU.
Alternatively, we memcpy the whole line under the seqlock and then do
everything later.
Also, this only has a single cpu_clock() invocation.
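
To make the "copy the whole line under the seqlock, do everything later"
idea concrete, here is a rough userspace sketch of the pattern. It uses
C11 atomics in place of the kernel's seqcount API, and every name in it
(struct cpu_state, state_write(), state_snapshot(), NR_STATES) is made up
for illustration; it is not the code proposed in this thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_STATES 6	/* hypothetical; stands in for NR_PSI_STATES */

/* Writer-side per-CPU state (illustrative layout only). */
struct cpu_state {
	atomic_uint seq;	/* even: stable, odd: update in flight */
	_Atomic uint32_t times[NR_STATES];
	_Atomic uint64_t state_start;
};

/* Private copy the reader can process after the retry loop. */
struct snapshot {
	uint32_t times[NR_STATES];
	uint64_t state_start;
};

/* Single writer: make seq odd, update the data, make seq even again. */
static void state_write(struct cpu_state *c, const uint32_t *times,
			uint64_t state_start)
{
	unsigned int seq = atomic_load_explicit(&c->seq, memory_order_relaxed);
	int i;

	assert(!(seq & 1));	/* this sketch assumes exactly one writer */

	atomic_store_explicit(&c->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	for (i = 0; i < NR_STATES; i++)
		atomic_store_explicit(&c->times[i], times[i],
				      memory_order_relaxed);
	atomic_store_explicit(&c->state_start, state_start,
			      memory_order_relaxed);
	atomic_store_explicit(&c->seq, seq + 2, memory_order_release);
}

/* Reader: one retry section that only copies; all math happens later. */
static struct snapshot state_snapshot(struct cpu_state *c)
{
	struct snapshot s;
	unsigned int seq;
	int i;

	do {
		seq = atomic_load_explicit(&c->seq, memory_order_acquire);
		for (i = 0; i < NR_STATES; i++)
			s.times[i] = atomic_load_explicit(&c->times[i],
							  memory_order_relaxed);
		s.state_start = atomic_load_explicit(&c->state_start,
						     memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		/* Retry if a write was in flight or completed meanwhile. */
	} while ((seq & 1) ||
		 atomic_load_explicit(&c->seq, memory_order_relaxed) != seq);

	return s;
}
```

The point of the shape is that the retry section touches the shared data
exactly once per field, so the reader holds the writer's cache line for
as short a time as possible and any clock read can sit outside the
per-state work.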
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-09-07 14:58 ` Peter Zijlstra
@ 2018-09-07 17:50 ` Johannes Weiner
0 siblings, 0 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-09-07 17:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Fri, Sep 07, 2018 at 04:58:58PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 07, 2018 at 10:44:22AM -0400, Johannes Weiner wrote:
>
> > > This does the whole seqcount thing 6x, which is a bit of a waste.
> >
> > [...]
> >
> > > It's a bit cumbersome, but that's because of C.
> >
> > I was actually debating exactly this with Suren before, but since this
> > is a super cold path I went with readability. I was also thinking that
> > restarts could happen quite regularly under heavy scheduler load, and
> > so keeping the individual retry sections small could be helpful - but
> > I didn't instrument this in any way.
>
> I was hoping going over the whole thing once would reduce the time we
> need to keep that line in shared mode and reduce traffic. And yes, this
> path is cold, but I was thinking about reducing the interference on the
> remote CPU.
>
> Alternatively, we memcpy the whole line under the seqlock and then do
> everything later.
>
> Also, this only has a single cpu_clock() invocation.
Good points.
How about the below? It's still pretty readable, and generates compact
code inside the now single retry section:
ffffffff81ed464f: 44 89 ff mov %r15d,%edi
ffffffff81ed4652: e8 00 00 00 00 callq ffffffff81ed4657 <update_stats+0xca>
ffffffff81ed4653: R_X86_64_PLT32 sched_clock_cpu-0x4
memcpy(times, groupc->times, sizeof(groupc->times));
ffffffff81ed4657: 49 8b 14 24 mov (%r12),%rdx
state_start = groupc->state_start;
ffffffff81ed465b: 48 8b 4b 50 mov 0x50(%rbx),%rcx
memcpy(times, groupc->times, sizeof(groupc->times));
ffffffff81ed465f: 48 89 54 24 30 mov %rdx,0x30(%rsp)
ffffffff81ed4664: 49 8b 54 24 08 mov 0x8(%r12),%rdx
ffffffff81ed4669: 48 89 54 24 38 mov %rdx,0x38(%rsp)
ffffffff81ed466e: 49 8b 54 24 10 mov 0x10(%r12),%rdx
ffffffff81ed4673: 48 89 54 24 40 mov %rdx,0x40(%rsp)
memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
ffffffff81ed4678: 49 8b 55 00 mov 0x0(%r13),%rdx
ffffffff81ed467c: 48 89 54 24 24 mov %rdx,0x24(%rsp)
ffffffff81ed4681: 41 8b 55 08 mov 0x8(%r13),%edx
ffffffff81ed4685: 89 54 24 2c mov %edx,0x2c(%rsp)
---
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 0f07749b60a4..595414599b98 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -197,17 +197,26 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
}
}
-static u32 get_recent_time(struct psi_group *group, int cpu,
- enum psi_states state)
+static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
{
struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+ unsigned int tasks[NR_PSI_TASK_COUNTS];
+ u64 now, state_start;
unsigned int seq;
- u32 time, delta;
+ int s;
+ /* Snapshot a coherent view of the CPU state */
do {
seq = read_seqcount_begin(&groupc->seq);
+ now = cpu_clock(cpu);
+ memcpy(times, groupc->times, sizeof(groupc->times));
+ memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
+ state_start = groupc->state_start;
+ } while (read_seqcount_retry(&groupc->seq, seq));
- time = groupc->times[state];
+ /* Calculate state time deltas against the previous snapshot */
+ for (s = 0; s < NR_PSI_STATES; s++) {
+ u32 delta;
/*
* In addition to already concluded states, we also
* incorporate currently active states on the CPU,
@@ -217,14 +226,14 @@ static u32 get_recent_time(struct psi_group *group, int cpu,
* (u32) and our reported pressure close to what's
* actually happening.
*/
- if (test_state(groupc->tasks, state))
- time += cpu_clock(cpu) - groupc->state_start;
- } while (read_seqcount_retry(&groupc->seq, seq));
+ if (test_state(tasks, s))
+ times[s] += now - state_start;
- delta = time - groupc->times_prev[state];
- groupc->times_prev[state] = time;
+ delta = times[s] - groupc->times_prev[s];
+ groupc->times_prev[s] = times[s];
- return delta;
+ times[s] = delta;
+ }
}
static void calc_avgs(unsigned long avg[3], int missed_periods,
@@ -267,18 +276,16 @@ static bool update_stats(struct psi_group *group)
* loading, or even entirely idle CPUs.
*/
for_each_possible_cpu(cpu) {
+ u32 times[NR_PSI_STATES];
u32 nonidle;
- nonidle = get_recent_time(group, cpu, PSI_NONIDLE);
- nonidle = nsecs_to_jiffies(nonidle);
- nonidle_total += nonidle;
+ get_recent_times(group, cpu, times);
- for (s = 0; s < PSI_NONIDLE; s++) {
- u32 delta;
+ nonidle = nsecs_to_jiffies(times[PSI_NONIDLE]);
+ nonidle_total += nonidle;
- delta = get_recent_time(group, cpu, s);
- deltas[s] += (u64)delta * nonidle;
- }
+ for (s = 0; s < PSI_NONIDLE; s++)
+ deltas[s] += (u64)times[s] * nonidle;
}
/*
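
For readers following the weighting in update_stats(): each CPU's state
delta is multiplied by that CPU's non-idle time before being summed. The
sketch below works that through in plain userspace C. The final division
by the summed non-idle time is an assumption about how the weighted sums
are normalized (that step is not part of the hunks shown here), and
weighted_stall() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Average per-CPU stall deltas, weighting each CPU by its non-idle time
 * in the sampling period. Fully idle CPUs get weight 0 and so do not
 * dilute the result. Returns 0 if no CPU was active at all.
 *
 * Assumption: the weighted sum is divided by the total non-idle time.
 */
static uint64_t weighted_stall(const uint32_t *stall, const uint32_t *nonidle,
			       int nr_cpus)
{
	uint64_t sum = 0, weight = 0;
	int cpu;

	assert(nr_cpus >= 0);

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		/* Widen before multiplying so u32 * u32 cannot overflow. */
		sum += (uint64_t)stall[cpu] * nonidle[cpu];
		weight += nonidle[cpu];
	}

	return weight ? sum / weight : 0;
}
```

With stall deltas of {100, 300} and non-idle weights of {1, 3}, this gives
(100 * 1 + 300 * 3) / 4 = 250, whereas an unweighted mean would give 200.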
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-28 17:22 ` [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
2018-08-28 20:11 ` Randy Dunlap
2018-09-07 10:16 ` Peter Zijlstra
@ 2018-09-07 10:24 ` Peter Zijlstra
2018-09-07 14:54 ` Johannes Weiner
2 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2018-09-07 10:24 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Tue, Aug 28, 2018 at 01:22:57PM -0400, Johannes Weiner wrote:
> +static void psi_clock(struct work_struct *work)
> +{
> + struct delayed_work *dwork;
> + struct psi_group *group;
> + bool nonidle;
> +
> + dwork = to_delayed_work(work);
> + group = container_of(dwork, struct psi_group, clock_work);
> +
> + /*
> + * If there is task activity, periodically fold the per-cpu
> + * times and feed samples into the running averages. If things
> + * are idle and there is no data to process, stop the clock.
> + * Once restarted, we'll catch up the running averages in one
> + * go - see calc_avgs() and missed_periods.
> + */
> +
> + nonidle = update_stats(group);
> +
> + if (nonidle) {
> + unsigned long delay = 0;
> + u64 now;
> +
> + now = sched_clock();
> + if (group->next_update > now)
> + delay = nsecs_to_jiffies(group->next_update - now) + 1;
> + schedule_delayed_work(dwork, delay);
> + }
> +}
Just a little nit; I would expect a function called *clock() to return a
time.
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-09-07 10:24 ` Peter Zijlstra
@ 2018-09-07 14:54 ` Johannes Weiner
0 siblings, 0 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-09-07 14:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Fri, Sep 07, 2018 at 12:24:58PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 28, 2018 at 01:22:57PM -0400, Johannes Weiner wrote:
> > +static void psi_clock(struct work_struct *work)
> > +{
> > + struct delayed_work *dwork;
> > + struct psi_group *group;
> > + bool nonidle;
> > +
> > + dwork = to_delayed_work(work);
> > + group = container_of(dwork, struct psi_group, clock_work);
> > +
> > + /*
> > + * If there is task activity, periodically fold the per-cpu
> > + * times and feed samples into the running averages. If things
> > + * are idle and there is no data to process, stop the clock.
> > + * Once restarted, we'll catch up the running averages in one
> > + * go - see calc_avgs() and missed_periods.
> > + */
> > +
> > + nonidle = update_stats(group);
> > +
> > + if (nonidle) {
> > + unsigned long delay = 0;
> > + u64 now;
> > +
> > + now = sched_clock();
> > + if (group->next_update > now)
> > + delay = nsecs_to_jiffies(group->next_update - now) + 1;
> > + schedule_delayed_work(dwork, delay);
> > + }
> > +}
>
> Just a little nit; I would expect a function called *clock() to return a
> time.
Fair enough, let's rename this. How about this on top?
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 92489e66840b..0f07749b60a4 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -154,7 +154,7 @@ static struct psi_group psi_system = {
.pcpu = &system_group_pcpu,
};
-static void psi_clock(struct work_struct *work);
+static void psi_update_work(struct work_struct *work);
static void group_init(struct psi_group *group)
{
@@ -163,7 +163,7 @@ static void group_init(struct psi_group *group)
for_each_possible_cpu(cpu)
seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
group->next_update = sched_clock() + psi_period;
- INIT_DELAYED_WORK(&group->clock_work, psi_clock);
+ INIT_DELAYED_WORK(&group->clock_work, psi_update_work);
mutex_init(&group->stat_lock);
}
@@ -347,7 +347,7 @@ static bool update_stats(struct psi_group *group)
return nonidle_total;
}
-static void psi_clock(struct work_struct *work)
+static void psi_update_work(struct work_struct *work)
{
struct delayed_work *dwork;
struct psi_group *group;
* [PATCH 9/9] psi: cgroup support
2018-08-28 17:22 [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
` (7 preceding siblings ...)
2018-08-28 17:22 ` [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
@ 2018-08-28 17:22 ` Johannes Weiner
2018-09-05 21:43 ` [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
2018-10-19 2:07 ` Andrew Morton
10 siblings, 0 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-08-28 17:22 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system,
but also that of individual jobs, so that we can ensure individual job
health, fairness between jobs, or prioritize some jobs over others.
This patch implements pressure stall tracking for cgroups. In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure,
memory.pressure, and io.pressure files that track aggregate pressure
stall times for only the tasks inside the cgroup.
v3:
- fix copy-paste indentation screwups
v4:
- propagate psi_disabled checks outward
- factor out iterate_groups()
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/accounting/psi.txt | 9 ++
Documentation/admin-guide/cgroup-v2.rst | 18 ++++
include/linux/cgroup-defs.h | 4 +
include/linux/cgroup.h | 15 +++
include/linux/psi.h | 25 +++++
init/Kconfig | 4 +
kernel/cgroup/cgroup.c | 45 ++++++++-
kernel/sched/psi.c | 118 ++++++++++++++++++++++--
8 files changed, 228 insertions(+), 10 deletions(-)
diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
index 51e7ef14142e..e051810d5127 100644
--- a/Documentation/accounting/psi.txt
+++ b/Documentation/accounting/psi.txt
@@ -62,3 +62,12 @@ well as medium and long term trends. The total absolute stall time is
tracked and exported as well, to allow detection of latency spikes
which wouldn't necessarily make a dent in the time averages, or to
average trends over custom time frames.
+
+Cgroup2 interface
+=================
+
+In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
+mounted, pressure stall information is also tracked for tasks grouped
+into cgroups. Each subdirectory in the cgroupfs mountpoint contains
+cpu.pressure, memory.pressure, and io.pressure files; the format is
+the same as the /proc/pressure/ files.
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8a2c52d5c53b..02cb308ea400 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -963,6 +963,12 @@ All time durations are in microseconds.
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.
+ cpu.pressure
+ A read-only nested-key file which exists on non-root cgroups.
+
+ Shows pressure stall information for CPU. See
+ Documentation/accounting/psi.txt for details.
+
Memory
------
@@ -1250,6 +1256,12 @@ PAGE_SIZE multiple when read back.
higher than the limit for an extended period of time. This
reduces the impact on the workload and memory management.
+ memory.pressure
+ A read-only nested-key file which exists on non-root cgroups.
+
+ Shows pressure stall information for memory. See
+ Documentation/accounting/psi.txt for details.
+
Usage Guidelines
~~~~~~~~~~~~~~~~
@@ -1385,6 +1397,12 @@ IO Interface Files
8:16 rbps=2097152 wbps=max riops=max wiops=max
+ io.pressure
+ A read-only nested-key file which exists on non-root cgroups.
+
+ Shows pressure stall information for IO. See
+ Documentation/accounting/psi.txt for details.
+
Writeback
~~~~~~~~~
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index c0e68f903011..f4be871ca169 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -20,6 +20,7 @@
#include <linux/u64_stats_sync.h>
#include <linux/workqueue.h>
#include <linux/bpf-cgroup.h>
+#include <linux/psi_types.h>
#ifdef CONFIG_CGROUPS
@@ -435,6 +436,9 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
+ /* used to track pressure stalls */
+ struct psi_group psi;
+
/* used to store eBPF programs */
struct cgroup_bpf bpf;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c9fdf6f57913..7b667a89704b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -627,6 +627,11 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
pr_cont_kernfs_path(cgrp->kn);
}
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+ return &cgrp->psi;
+}
+
static inline void cgroup_init_kthreadd(void)
{
/*
@@ -680,6 +685,16 @@ static inline union kernfs_node_id *cgroup_get_kernfs_id(struct cgroup *cgrp)
return NULL;
}
+static inline struct cgroup *cgroup_parent(struct cgroup *cgrp)
+{
+ return NULL;
+}
+
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+ return NULL;
+}
+
static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
struct cgroup *ancestor)
{
diff --git a/include/linux/psi.h b/include/linux/psi.h
index b0daf050de58..8e0725aac0aa 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -4,6 +4,9 @@
#include <linux/psi_types.h>
#include <linux/sched.h>
+struct seq_file;
+struct css_set;
+
#ifdef CONFIG_PSI
extern bool psi_disabled;
@@ -16,6 +19,14 @@ void psi_memstall_tick(struct task_struct *task, int cpu);
void psi_memstall_enter(unsigned long *flags);
void psi_memstall_leave(unsigned long *flags);
+int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
+
+#ifdef CONFIG_CGROUPS
+int psi_cgroup_alloc(struct cgroup *cgrp);
+void psi_cgroup_free(struct cgroup *cgrp);
+void cgroup_move_task(struct task_struct *p, struct css_set *to);
+#endif
+
#else /* CONFIG_PSI */
static inline void psi_init(void) {}
@@ -23,6 +34,20 @@ static inline void psi_init(void) {}
static inline void psi_memstall_enter(unsigned long *flags) {}
static inline void psi_memstall_leave(unsigned long *flags) {}
+#ifdef CONFIG_CGROUPS
+static inline int psi_cgroup_alloc(struct cgroup *cgrp)
+{
+ return 0;
+}
+static inline void psi_cgroup_free(struct cgroup *cgrp)
+{
+}
+static inline void cgroup_move_task(struct task_struct *p, struct css_set *to)
+{
+ rcu_assign_pointer(p->cgroups, to);
+}
+#endif
+
#endif /* CONFIG_PSI */
#endif /* _LINUX_PSI_H */
diff --git a/init/Kconfig b/init/Kconfig
index 98d59bc268df..7506dcd81d1c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -466,6 +466,10 @@ config PSI
the share of walltime in which some or all tasks in the system are
delayed due to contention of the respective resource.
+ In kernels with cgroup support, cgroups (cgroup2 only) will
+ have cpu.pressure, memory.pressure, and io.pressure files,
+ which aggregate pressure stalls for the grouped tasks only.
+
For more details see Documentation/accounting/psi.txt.
Say N if unsure.
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 077370bf8964..ba7d3e1e3970 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -55,6 +55,7 @@
#include <linux/nsproxy.h>
#include <linux/file.h>
#include <linux/sched/cputime.h>
+#include <linux/psi.h>
#include <net/sock.h>
#define CREATE_TRACE_POINTS
@@ -829,7 +830,7 @@ static void css_set_move_task(struct task_struct *task,
*/
WARN_ON_ONCE(task->flags & PF_EXITING);
- rcu_assign_pointer(task->cgroups, to_cset);
+ cgroup_move_task(task, to_cset);
list_add_tail(&task->cg_list, use_mg_tasks ? &to_cset->mg_tasks :
&to_cset->tasks);
}
@@ -3406,6 +3407,21 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
return ret;
}
+#ifdef CONFIG_PSI
+static int cgroup_io_pressure_show(struct seq_file *seq, void *v)
+{
+ return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_IO);
+}
+static int cgroup_memory_pressure_show(struct seq_file *seq, void *v)
+{
+ return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_MEM);
+}
+static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
+{
+ return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU);
+}
+#endif
+
static int cgroup_file_open(struct kernfs_open_file *of)
{
struct cftype *cft = of->kn->priv;
@@ -4534,6 +4550,23 @@ static struct cftype cgroup_base_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cpu_stat_show,
},
+#ifdef CONFIG_PSI
+ {
+ .name = "io.pressure",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cgroup_io_pressure_show,
+ },
+ {
+ .name = "memory.pressure",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cgroup_memory_pressure_show,
+ },
+ {
+ .name = "cpu.pressure",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cgroup_cpu_pressure_show,
+ },
+#endif
{ } /* terminate */
};
@@ -4594,6 +4627,7 @@ static void css_free_rwork_fn(struct work_struct *work)
*/
cgroup_put(cgroup_parent(cgrp));
kernfs_put(cgrp->kn);
+ psi_cgroup_free(cgrp);
if (cgroup_on_dfl(cgrp))
cgroup_rstat_exit(cgrp);
kfree(cgrp);
@@ -4850,10 +4884,15 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
cgrp->self.parent = &parent->self;
cgrp->root = root;
cgrp->level = level;
- ret = cgroup_bpf_inherit(cgrp);
+
+ ret = psi_cgroup_alloc(cgrp);
if (ret)
goto out_idr_free;
+ ret = cgroup_bpf_inherit(cgrp);
+ if (ret)
+ goto out_psi_free;
+
for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
@@ -4891,6 +4930,8 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
return cgrp;
+out_psi_free:
+ psi_cgroup_free(cgrp);
out_idr_free:
cgroup_idr_remove(&root->cgroup_idr, cgrp->id);
out_stat_exit:
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 92489e66840b..84127de49193 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -466,9 +466,35 @@ static void psi_group_change(struct psi_group *group, int cpu,
schedule_delayed_work(&group->clock_work, PSI_FREQ);
}
+static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
+{
+#ifdef CONFIG_CGROUPS
+ struct cgroup *cgroup = NULL;
+
+ if (!*iter)
+ cgroup = task->cgroups->dfl_cgrp;
+ else if (*iter == &psi_system)
+ return NULL;
+ else
+ cgroup = cgroup_parent(*iter);
+
+ if (cgroup && cgroup_parent(cgroup)) {
+ *iter = cgroup;
+ return cgroup_psi(cgroup);
+ }
+#else
+ if (*iter)
+ return NULL;
+#endif
+ *iter = &psi_system;
+ return &psi_system;
+}
+
void psi_task_change(struct task_struct *task, int clear, int set)
{
int cpu = task_cpu(task);
+ struct psi_group *group;
+ void *iter = NULL;
if (!task->pid)
return;
@@ -485,17 +511,23 @@ void psi_task_change(struct task_struct *task, int clear, int set)
task->psi_flags &= ~clear;
task->psi_flags |= set;
- psi_group_change(&psi_system, cpu, clear, set);
+ while ((group = iterate_groups(task, &iter)))
+ psi_group_change(group, cpu, clear, set);
}
void psi_memstall_tick(struct task_struct *task, int cpu)
{
- struct psi_group_cpu *groupc;
+ struct psi_group *group;
+ void *iter = NULL;
- groupc = per_cpu_ptr(psi_system.pcpu, cpu);
- write_seqcount_begin(&groupc->seq);
- record_times(groupc, cpu, true);
- write_seqcount_end(&groupc->seq);
+ while ((group = iterate_groups(task, &iter))) {
+ struct psi_group_cpu *groupc;
+
+ groupc = per_cpu_ptr(group->pcpu, cpu);
+ write_seqcount_begin(&groupc->seq);
+ record_times(groupc, cpu, true);
+ write_seqcount_end(&groupc->seq);
+ }
}
/**
@@ -558,8 +590,78 @@ void psi_memstall_leave(unsigned long *flags)
rq_unlock_irq(rq, &rf);
}
-static int psi_show(struct seq_file *m, struct psi_group *group,
- enum psi_res res)
+#ifdef CONFIG_CGROUPS
+int psi_cgroup_alloc(struct cgroup *cgroup)
+{
+ if (psi_disabled)
+ return 0;
+
+ cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu);
+ if (!cgroup->psi.pcpu)
+ return -ENOMEM;
+ group_init(&cgroup->psi);
+ return 0;
+}
+
+void psi_cgroup_free(struct cgroup *cgroup)
+{
+ if (psi_disabled)
+ return;
+
+ cancel_delayed_work_sync(&cgroup->psi.clock_work);
+ free_percpu(cgroup->psi.pcpu);
+}
+
+/**
+ * cgroup_move_task - move task to a different cgroup
+ * @task: the task
+ * @to: the target css_set
+ *
+ * Move task to a new cgroup and safely migrate its associated stall
+ * state between the different groups.
+ *
+ * This function acquires the task's rq lock to lock out concurrent
+ * changes to the task's scheduling state and - in case the task is
+ * running - concurrent changes to its stall state.
+ */
+void cgroup_move_task(struct task_struct *task, struct css_set *to)
+{
+ bool move_psi = !psi_disabled;
+ unsigned int task_flags = 0;
+ struct rq_flags rf;
+ struct rq *rq;
+
+ if (move_psi) {
+ rq = task_rq_lock(task, &rf);
+
+ if (task_on_rq_queued(task))
+ task_flags = TSK_RUNNING;
+ else if (task->in_iowait)
+ task_flags = TSK_IOWAIT;
+
+ if (task->flags & PF_MEMSTALL)
+ task_flags |= TSK_MEMSTALL;
+
+ if (task_flags)
+ psi_task_change(task, task_flags, 0);
+ }
+
+ /*
+ * Lame to do this here, but the scheduler cannot be locked
+ * from the outside, so we move cgroups from inside sched/.
+ */
+ rcu_assign_pointer(task->cgroups, to);
+
+ if (move_psi) {
+ if (task_flags)
+ psi_task_change(task, 0, task_flags);
+
+ task_rq_unlock(rq, task, &rf);
+ }
+}
+#endif /* CONFIG_CGROUPS */
+
+int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
{
int full;
--
2.18.0
^ permalink raw reply related [flat|nested] 55+ messages in thread
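For readers following the hierarchy walk that iterate_groups() adds in
the patch above, here is a minimal userspace sketch of the same iterator
protocol. It is not kernel code: struct group, system_group, next_group()
and account_change() are hypothetical stand-ins for struct psi_group,
psi_system, iterate_groups() and the loop in psi_task_change(). Locking
and per-CPU state are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup with embedded pressure state. */
struct group {
	struct group *parent;	/* NULL for the root of the hierarchy */
	int nr_changes;		/* how many state changes were accounted here */
};

/* Hypothetical stand-in for the system-wide group (psi_system). */
static struct group system_group;

/*
 * Same protocol as the iterator in the patch: *iter is NULL on the first
 * call, the function returns NULL once the system-wide group was handed out.
 */
static struct group *next_group(struct group *task_cgroup, struct group **iter)
{
	struct group *cg;

	assert(iter != NULL);

	if (!*iter)
		cg = task_cgroup;		/* start at the task's own cgroup */
	else if (*iter == &system_group)
		return NULL;			/* walk is complete */
	else
		cg = (*iter)->parent;		/* step to the next ancestor */

	/* The root cgroup is skipped; the system-wide group covers it. */
	if (cg && cg->parent) {
		*iter = cg;
		return cg;
	}

	*iter = &system_group;
	return &system_group;
}

/* Account one state change in every group the task belongs to. */
static int account_change(struct group *task_cgroup)
{
	struct group *iter = NULL;
	struct group *g;
	int visited = 0;

	while ((g = next_group(task_cgroup, &iter))) {
		g->nr_changes++;
		visited++;
	}
	return visited;
}
```

With a hierarchy root -> a -> b and a task in b, account_change(&b)
touches b, then a, then the system-wide group. A task sitting directly in
the root cgroup is only accounted in the system-wide group.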
* Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
2018-08-28 17:22 [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
` (8 preceding siblings ...)
2018-08-28 17:22 ` [PATCH 9/9] psi: cgroup support Johannes Weiner
@ 2018-09-05 21:43 ` Johannes Weiner
2018-09-07 7:36 ` Daniel Drake
` (2 more replies)
2018-10-19 2:07 ` Andrew Morton
10 siblings, 3 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-09-05 21:43 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Tue, Aug 28, 2018 at 01:22:49PM -0400, Johannes Weiner wrote:
> This version 4 of the PSI series incorporates feedback from Peter and
> fixes two races in the lockless aggregator that Suren found in his
> testing and which caused the sample calculation to sometimes underflow
> and record bogusly large samples; details at the bottom of this email.
Peter, do the changes from v3 look sane to you?
If there aren't any further objections, I was hoping we could get this
lined up for 4.20.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
2018-09-05 21:43 ` [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
@ 2018-09-07 7:36 ` Daniel Drake
2018-09-07 7:46 ` Peter Zijlstra
2018-09-07 11:04 ` Peter Zijlstra
2 siblings, 0 replies; 55+ messages in thread
From: Daniel Drake @ 2018-09-07 7:36 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds,
Tejun Heo, Suren Baghdasaryan, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, Linux Kernel, kernel-team
On Thu, Sep 6, 2018 at 5:43 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Peter, do the changes from v3 look sane to you?
>
> If there aren't any further objections, I was hoping we could get this
> lined up for 4.20.
That would be excellent. I just retested the latest version at
http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the
results are great.
Test setup:
Endless OS
GeminiLake N4200 low end laptop
2GB RAM
swap (and zram swap) disabled
Baseline test: open a handful of large-ish apps and several website
tabs in Google Chrome.
Results: after a couple of minutes, the system is thrashing excessively,
the mouse cursor can barely be moved, and the UI does not respond to mouse
clicks, so it is impractical for an ordinary user to recover from this
situation.
Add my simple killer:
https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
Results: when the thrashing causes the UI to become sluggish, the
killer steps in and kills something (usually a chrome tab), and the
system remains usable. I repeatedly opened more apps and more websites
over a 15 minute period but I wasn't able to get the system to a point
of UI unresponsiveness.
Thanks,
Daniel
^ permalink raw reply [flat|nested] 55+ messages in thread
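A userspace monitor like the one described above has to read the
pressure files this series exports. As a minimal sketch, assuming the
one-line-per-state layout described in Documentation/accounting/psi.txt
("some avg10=... avg60=... avg300=... total=..."), a single line could be
parsed as follows. struct psi_sample and psi_parse_line() are hypothetical
names, not part of the series or of the script linked above, and the
total is assumed to be reported in microseconds.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one line of a pressure file. */
struct psi_sample {
	char kind[8];			/* "some" or "full" */
	double avg10;			/* percent stalled, 10s window */
	double avg60;			/* percent stalled, 60s window */
	double avg300;			/* percent stalled, 300s window */
	unsigned long long total_us;	/* cumulative stall time (assumed us) */
};

/* Returns 0 on success, -1 if the line does not match the assumed layout. */
static int psi_parse_line(const char *line, struct psi_sample *out)
{
	int n;

	assert(line != NULL && out != NULL);

	/* %7s bounds the copy into kind[8], leaving room for the NUL byte. */
	n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   out->kind, &out->avg10, &out->avg60, &out->avg300,
		   &out->total_us);

	return n == 5 ? 0 : -1;
}
```

A monitor would typically read each line of a pressure file with fgets(),
pass it to psi_parse_line(), and compare one of the averages against a
threshold of its own choosing before acting.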
* Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
2018-09-05 21:43 ` [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
2018-09-07 7:36 ` Daniel Drake
@ 2018-09-07 7:46 ` Peter Zijlstra
2018-09-07 11:04 ` Peter Zijlstra
2 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2018-09-07 7:46 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Wed, Sep 05, 2018 at 05:43:03PM -0400, Johannes Weiner wrote:
> On Tue, Aug 28, 2018 at 01:22:49PM -0400, Johannes Weiner wrote:
> > This version 4 of the PSI series incorporates feedback from Peter and
> > fixes two races in the lockless aggregator that Suren found in his
> > testing and which caused the sample calculation to sometimes underflow
> > and record bogusly large samples; details at the bottom of this email.
>
> Peter, do the changes from v3 look sane to you?
I'll go have a look.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
2018-09-05 21:43 ` [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
2018-09-07 7:36 ` Daniel Drake
2018-09-07 7:46 ` Peter Zijlstra
@ 2018-09-07 11:04 ` Peter Zijlstra
2018-09-07 15:09 ` Johannes Weiner
2 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2018-09-07 11:04 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Wed, Sep 05, 2018 at 05:43:03PM -0400, Johannes Weiner wrote:
> On Tue, Aug 28, 2018 at 01:22:49PM -0400, Johannes Weiner wrote:
> > This version 4 of the PSI series incorporates feedback from Peter and
> > fixes two races in the lockless aggregator that Suren found in his
> > testing and which caused the sample calculation to sometimes underflow
> > and record bogusly large samples; details at the bottom of this email.
>
> Peter, do the changes from v3 look sane to you?
>
> If there aren't any further objections, I was hoping we could get this
> lined up for 4.20.
I suppose it looks ok, there's a few small nits, but nothing big.
I still hate psi_ttwu_dequeue(), but I don't really know what to do about
that.
So yeah, grudgingly acked. Did you want me to pick this up through the
scheduler tree since most of this lives there?
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
2018-09-07 11:04 ` Peter Zijlstra
@ 2018-09-07 15:09 ` Johannes Weiner
2018-09-07 15:58 ` Suren Baghdasaryan
0 siblings, 1 reply; 55+ messages in thread
From: Johannes Weiner @ 2018-09-07 15:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Fri, Sep 07, 2018 at 01:04:07PM +0200, Peter Zijlstra wrote:
> So yeah, grudgingly acked. Did you want me to pick this up through the
> scheduler tree since most of this lives there?
Thanks for the ack.
As for routing it, I'll leave that decision to you and Andrew. It
touches stuff all over, so it could result in quite a few conflicts
between trees (although I don't expect any of them to be non-trivial).
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
2018-09-07 15:09 ` Johannes Weiner
@ 2018-09-07 15:58 ` Suren Baghdasaryan
2018-09-17 5:22 ` Daniel Drake
2018-09-17 13:29 ` peter enderborg
0 siblings, 2 replies; 55+ messages in thread
From: Suren Baghdasaryan @ 2018-09-07 15:58 UTC (permalink / raw)
To: Johannes Weiner
Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Linus Torvalds,
Tejun Heo, Daniel Drake, Vinayak Menon, Christopher Lameter,
Peter Enderborg, Shakeel Butt, Mike Galbraith, linux-mm, cgroups,
LKML, kernel-team
Thanks for the new patchset! Backported to 4.9 and retested on an ARMv8
8-core system running Android. Signals behave as expected, reacting to
memory pressure, with no jumps in the "total" counters that would
indicate overflow/underflow issues. Nicely done!
Tested-by: Suren Baghdasaryan <surenb@google.com>
On Fri, Sep 7, 2018 at 8:09 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Fri, Sep 07, 2018 at 01:04:07PM +0200, Peter Zijlstra wrote:
>> So yeah, grudgingly acked. Did you want me to pick this up through the
>> scheduler tree since most of this lives there?
>
> Thanks for the ack.
>
> As for routing it, I'll leave that decision to you and Andrew. It
> touches stuff all over, so it could result in quite a few conflicts
> between trees (although I don't expect any of them to be non-trivial).
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
2018-09-07 15:58 ` Suren Baghdasaryan
@ 2018-09-17 5:22 ` Daniel Drake
2018-09-18 15:53 ` Suren Baghdasaryan
2018-09-17 13:29 ` peter enderborg
1 sibling, 1 reply; 55+ messages in thread
From: Daniel Drake @ 2018-09-17 5:22 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Johannes Weiner, Peter Zijlstra, Ingo Molnar, Andrew Morton,
Linus Torvalds, Tejun Heo, Vinayak Menon, Christopher Lameter,
Peter Enderborg, Shakeel Butt, Mike Galbraith, linux-mm, cgroups,
LKML, kernel-team
Hi Suren
On Fri, Sep 7, 2018 at 11:58 PM, Suren Baghdasaryan <surenb@google.com> wrote:
> Thanks for the new patchset! Backported to 4.9 and retested on an ARMv8
> 8-core system running Android. Signals behave as expected, reacting to
> memory pressure, with no jumps in the "total" counters that would
> indicate overflow/underflow issues. Nicely done!
Can you share your Linux v4.9 psi backport somewhere?
Thanks
Daniel
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
2018-09-17 5:22 ` Daniel Drake
@ 2018-09-18 15:53 ` Suren Baghdasaryan
2018-09-25 22:05 ` Suren Baghdasaryan
0 siblings, 1 reply; 55+ messages in thread
From: Suren Baghdasaryan @ 2018-09-18 15:53 UTC (permalink / raw)
To: Daniel Drake
Cc: Johannes Weiner, Peter Zijlstra, Ingo Molnar, Andrew Morton,
Linus Torvalds, Tejun Heo, Vinayak Menon, Christopher Lameter,
Peter Enderborg, Shakeel Butt, Mike Galbraith, linux-mm, cgroups,
LKML, kernel-team
Hi Daniel,
On Sun, Sep 16, 2018 at 10:22 PM, Daniel Drake <drake@endlessm.com> wrote:
> Hi Suren
>
> On Fri, Sep 7, 2018 at 11:58 PM, Suren Baghdasaryan <surenb@google.com> wrote:
>> Thanks for the new patchset! Backported to 4.9 and retested on an ARMv8
>> 8-core system running Android. Signals behave as expected, reacting to
>> memory pressure, with no jumps in the "total" counters that would
>> indicate overflow/underflow issues. Nicely done!
>
> Can you share your Linux v4.9 psi backport somewhere?
>
Absolutely. Let me figure out the best way to share the patches and
make sure they apply cleanly on official 4.9 (I was using a vendor's
tree for testing). I will need a day or so to get this done.
In case you need them sooner, there were several "prerequisite"
patches that I had to backport to make PSI backporting
easier/possible. Following is the list as shown by "git log
--oneline":
PSI patches:
ef94c067f360 psi: cgroup support
60081a7aeb0b psi: pressure stall information for CPU, memory, and IO
acd2a16497e9 sched: introduce this_rq_lock_irq()
f30268c29309 sched: sched.h: make rq locking and clock functions available in stats.h
a2fd1c94b743 sched: loadavg: make calc_load_n() public
32a74dec4967 sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD
8e3991dd1a73 delayacct: track delays from thrashing cache pages
4ae940e7e6ff mm: workingset: tell cache transitions from workingset thrashing
e9ccd63399e0 mm: workingset: don't drop refault information prematurely
Prerequisites:
b5a58c778c54 workqueue: make workqueue available early during boot
ae5f39ee13b5 sched/core: Add wrappers for lockdep_(un)pin_lock()
7276f98a72c1 sched/headers, delayacct: Move the 'struct task_delay_info' definition from <linux/sched.h> to <linux/delayacct.h>
287318d13688 mm: add PageWaiters indicating tasks are waiting for a page bit
edfa64560aaa sched/headers: Remove <linux/sched.h> from <linux/sched/loadavg.h>
f6b6ba853959 sched/headers: Move loadavg related definitions from <linux/sched.h> to <linux/sched/loadavg.h>
395b0a9f7aae sched/headers: Prepare for new header dependencies before moving code to <linux/sched/loadavg.h>
PSI patches needed some adjustments but nothing really major.
> Thanks
> Daniel
Thanks,
Suren.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
2018-09-18 15:53 ` Suren Baghdasaryan
@ 2018-09-25 22:05 ` Suren Baghdasaryan
0 siblings, 0 replies; 55+ messages in thread
From: Suren Baghdasaryan @ 2018-09-25 22:05 UTC (permalink / raw)
To: Daniel Drake
Cc: Johannes Weiner, Peter Zijlstra, Ingo Molnar, Andrew Morton,
Linus Torvalds, Tejun Heo, Vinayak Menon, Christopher Lameter,
Peter Enderborg, Shakeel Butt, Mike Galbraith, linux-mm, cgroups,
LKML, kernel-team
I emailed Daniel the 4.9 backport patches. Unfortunately that seems to be
the easiest way to share them. If anyone else is interested in them,
please email me directly.
Thanks,
Suren.
On Tue, Sep 18, 2018 at 8:53 AM, Suren Baghdasaryan <surenb@google.com> wrote:
> Hi Daniel,
>
> On Sun, Sep 16, 2018 at 10:22 PM, Daniel Drake <drake@endlessm.com> wrote:
>> Hi Suren
>>
>> On Fri, Sep 7, 2018 at 11:58 PM, Suren Baghdasaryan <surenb@google.com> wrote:
>>> Thanks for the new patchset! Backported to 4.9 and retested on an ARMv8
>>> 8-core system running Android. Signals behave as expected, reacting to
>>> memory pressure, with no jumps in the "total" counters that would
>>> indicate overflow/underflow issues. Nicely done!
>>
>> Can you share your Linux v4.9 psi backport somewhere?
>>
>
> Absolutely. Let me figure out the best way to share the patches and
> make sure they apply cleanly on official 4.9 (I was using a vendor's
> tree for testing). I will need a day or so to get this done.
> In case you need them sooner, there were several "prerequisite"
> patches that I had to backport to make PSI backporting
> easier/possible. Following is the list as shown by "git log
> --oneline":
>
> PSI patches:
>
> ef94c067f360 psi: cgroup support
> 60081a7aeb0b psi: pressure stall information for CPU, memory, and IO
> acd2a16497e9 sched: introduce this_rq_lock_irq()
> f30268c29309 sched: sched.h: make rq locking and clock functions
> available in stats.h
> a2fd1c94b743 sched: loadavg: make calc_load_n() public
> 32a74dec4967 sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD
> 8e3991dd1a73 delayacct: track delays from thrashing cache pages
> 4ae940e7e6ff mm: workingset: tell cache transitions from workingset thrashing
> e9ccd63399e0 mm: workingset: don't drop refault information prematurely
>
> Prerequisites:
>
> b5a58c778c54 workqueue: make workqueue available early during boot
> ae5f39ee13b5 sched/core: Add wrappers for lockdep_(un)pin_lock()
> 7276f98a72c1 sched/headers, delayacct: Move the 'struct
> task_delay_info' definition from <linux/sched.h> to
> <linux/delayacct.h>
> 287318d13688 mm: add PageWaiters indicating tasks are waiting for a page bit
> edfa64560aaa sched/headers: Remove <linux/sched.h> from <linux/sched/loadavg.h>
> f6b6ba853959 sched/headers: Move loadavg related definitions from
> <linux/sched.h> to <linux/sched/loadavg.h>
> 395b0a9f7aae sched/headers: Prepare for new header dependencies before
> moving code to <linux/sched/loadavg.h>
>
> PSI patches needed some adjustments but nothing really major.
>
>> Thanks
>> Daniel
>
> Thanks,
> Suren.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
2018-09-07 15:58 ` Suren Baghdasaryan
@ 2018-09-17 13:29 ` peter enderborg
2018-09-17 13:29 ` peter enderborg
1 sibling, 0 replies; 55+ messages in thread
From: peter enderborg @ 2018-09-17 13:29 UTC (permalink / raw)
To: Suren Baghdasaryan, Johannes Weiner
Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Linus Torvalds,
Tejun Heo, Daniel Drake, Vinayak Menon, Christopher Lameter,
Shakeel Butt, Mike Galbraith, linux-mm, cgroups, LKML,
kernel-team
Will it be part of the backport to 4.9 Google Android, or is it for test only?
I guess that this patch is too big for the LTS tree.
On 09/07/2018 05:58 PM, Suren Baghdasaryan wrote:
> Thanks for the new patchset! Backported to 4.9 and retested on an ARMv8
> 8-core system running Android. Signals behave as expected, reacting to
> memory pressure, with no jumps in the "total" counters that would
> indicate overflow/underflow issues. Nicely done!
>
> Tested-by: Suren Baghdasaryan <surenb@google.com>
>
> On Fri, Sep 7, 2018 at 8:09 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> On Fri, Sep 07, 2018 at 01:04:07PM +0200, Peter Zijlstra wrote:
>>> So yeah, grudgingly acked. Did you want me to pick this up through the
>>> scheduler tree since most of this lives there?
>> Thanks for the ack.
>>
>> As for routing it, I'll leave that decision to you and Andrew. It
>> touches stuff all over, so it could result in quite a few conflicts
>> between trees (although I don't expect any of them to be non-trivial).
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
2018-09-17 13:29 ` peter enderborg
(?)
@ 2018-09-17 13:40 ` Peter Zijlstra
-1 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2018-09-17 13:40 UTC (permalink / raw)
To: peter enderborg
Cc: Suren Baghdasaryan, Johannes Weiner, Ingo Molnar, Andrew Morton,
Linus Torvalds, Tejun Heo, Daniel Drake, Vinayak Menon,
Christopher Lameter, Shakeel Butt, Mike Galbraith, linux-mm,
cgroups, LKML, kernel-team
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
2018-09-17 13:29 ` peter enderborg
(?)
(?)
@ 2018-09-18 16:03 ` Suren Baghdasaryan
-1 siblings, 0 replies; 55+ messages in thread
From: Suren Baghdasaryan @ 2018-09-18 16:03 UTC (permalink / raw)
To: peter enderborg
Cc: Johannes Weiner, Peter Zijlstra, Ingo Molnar, Andrew Morton,
Linus Torvalds, Tejun Heo, Daniel Drake, Vinayak Menon,
Christopher Lameter, Shakeel Butt, Mike Galbraith, linux-mm,
cgroups, LKML, kernel-team
On Mon, Sep 17, 2018 at 6:29 AM, peter enderborg
<peter.enderborg@sony.com> wrote:
> Will it be part of the backport to 4.9 Google Android, or is it for test only?
Currently I'm testing these patches in tandem with a PSI monitor that
I'm developing, and test results look good. If things go well and we
start using PSI for Android, I will try to upstream the backport. If
upstream rejects it, we will have to merge it into the Android common
kernel repo as a last resort. Hope this answers your question.
> I guess that this patch is too big for the LTS tree.
>
> On 09/07/2018 05:58 PM, Suren Baghdasaryan wrote:
>> Thanks for the new patchset! Backported to 4.9 and retested on an ARMv8
>> 8-core system running Android. Signals behave as expected, reacting to
>> memory pressure, with no jumps in the "total" counters that would
>> indicate overflow/underflow issues. Nicely done!
>>
>> Tested-by: Suren Baghdasaryan <surenb@google.com>
>>
>> On Fri, Sep 7, 2018 at 8:09 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>>> On Fri, Sep 07, 2018 at 01:04:07PM +0200, Peter Zijlstra wrote:
>>>> So yeah, grudgingly acked. Did you want me to pick this up through the
>>>> scheduler tree since most of this lives there?
>>> Thanks for the ack.
>>>
>>> As for routing it, I'll leave that decision to you and Andrew. It
>>> touches stuff all over, so it could result in quite a few conflicts
>>> between trees (although I don't expect any of them to be non-trivial).
>
>
Thanks,
Suren.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
2018-08-28 17:22 [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
` (9 preceding siblings ...)
2018-09-05 21:43 ` [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4 Johannes Weiner
@ 2018-10-19 2:07 ` Andrew Morton
2018-10-23 17:29 ` Johannes Weiner
10 siblings, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2018-10-19 2:07 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Tue, 28 Aug 2018 13:22:49 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> This version 4 of the PSI series incorporates feedback from Peter and
> fixes two races in the lockless aggregator that Suren found in his
> testing and which caused the sample calculation to sometimes underflow
> and record bogusly large samples; details at the bottom of this email.
We've had very little in the way of review activity for the PSI
patchset. According to the changelog tags, anyway.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
2018-10-19 2:07 ` Andrew Morton
@ 2018-10-23 17:29 ` Johannes Weiner
2018-10-23 17:41 ` Peter Zijlstra
0 siblings, 1 reply; 55+ messages in thread
From: Johannes Weiner @ 2018-10-23 17:29 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Thu, Oct 18, 2018 at 07:07:10PM -0700, Andrew Morton wrote:
> On Tue, 28 Aug 2018 13:22:49 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> > This version 4 of the PSI series incorporates feedback from Peter and
> > fixes two races in the lockless aggregator that Suren found in his
> > testing and which caused the sample calculation to sometimes underflow
> > and record bogusly large samples; details at the bottom of this email.
>
> We've had very little in the way of review activity for the PSI
> patchset. According to the changelog tags, anyway.
Peter reviewed it quite extensively over all revisions, and acked the
final version. Peter, can we add your acked-by or reviewed-by tag(s)?
The scheduler part accounts for 99% of the complexity in those
patches. The mm bits, while somewhat sprawling, are mostly mechanical.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
2018-10-23 17:29 ` Johannes Weiner
@ 2018-10-23 17:41 ` Peter Zijlstra
0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2018-10-23 17:41 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Ingo Molnar, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Peter Enderborg, Shakeel Butt,
Mike Galbraith, linux-mm, cgroups, linux-kernel, kernel-team
On Tue, Oct 23, 2018 at 01:29:37PM -0400, Johannes Weiner wrote:
> On Thu, Oct 18, 2018 at 07:07:10PM -0700, Andrew Morton wrote:
> > On Tue, 28 Aug 2018 13:22:49 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > > This version 4 of the PSI series incorporates feedback from Peter and
> > > fixes two races in the lockless aggregator that Suren found in his
> > > testing and which caused the sample calculation to sometimes underflow
> > > and record bogusly large samples; details at the bottom of this email.
> >
> > We've had very little in the way of review activity for the PSI
> > patchset. According to the changelog tags, anyway.
>
> Peter reviewed it quite extensively over all revisions, and acked the
> final version. Peter, can we add your acked-by or reviewed-by tag(s)?
I don't really do reviewed by; but yes, I thought I already did; lemme
find.
> The scheduler part accounts for 99% of the complexity in those
> patches. The mm bits, while somewhat sprawling, are mostly mechanical.
Ah, I now see my mistake;
https://lkml.kernel.org/r/20180907110407.GQ24106@hirez.programming.kicks-ass.net
I forgot to include an actual tag therein. My bad.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
* [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-01 15:19 [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v3 Johannes Weiner
@ 2018-08-01 15:19 ` Johannes Weiner
2018-08-03 16:56 ` Peter Zijlstra
` (3 more replies)
0 siblings, 4 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-08-01 15:19 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
When systems are overcommitted and resources become contended, it's
hard to tell exactly the impact this has on workload productivity, or
how close the system is to lockups and OOM kills. In particular, when
machines work multiple jobs concurrently, the impact of overcommit in
terms of latency and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing
individual job health or risking complete machine lockups, this patch
implements a way to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or
IO, respectively. Stall states are aggregate versions of the per-task
delay accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure
percentages, and they give a general sense of system health and
productivity loss incurred by resource overcommit. They can also
indicate when the system is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each
CPU and samples the time they spend in stall states. Every 2 seconds,
the samples are averaged across CPUs - weighted by the CPUs' non-idle
time to eliminate artifacts from unused CPUs - and translated into
percentages of walltime. A running average of those percentages is
maintained over 10s, 1m, and 5m periods (similar to the loadaverage).
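
The aggregation described above can be illustrated with a small
userspace sketch. It is not the kernel code: struct cpu_sample,
weighted_stall_ns(), stall_pct_fp() and ema_step() are hypothetical
names, FIXED_1 is assumed to be the loadavg fixed-point unit (2048),
and the decay step omits the rounding the kernel's calc_load() does.
EXP_10s is the constant from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIXED_1 2048u /* 1.0 in fixed point; assumed to match loadavg */
#define EXP_10s 1677u /* 1/exp(2s/10s) in fixed point, from this patch */

/* Hypothetical per-CPU sample for one period (not a kernel structure). */
struct cpu_sample {
	uint64_t stall_ns;   /* time spent in the stall state */
	uint64_t nonidle_ms; /* weight: time the CPU had non-idle tasks */
};

/*
 * Average the stall time across CPUs, weighting each CPU by its
 * non-idle time so that idle CPUs do not dilute the result. The
 * weight is kept in milliseconds so the products stay well inside
 * 64 bits.
 */
static uint64_t weighted_stall_ns(const struct cpu_sample *s, size_t ncpus)
{
	uint64_t weighted = 0, nonidle_total = 0;
	size_t i;

	assert(s != NULL);
	for (i = 0; i < ncpus; i++) {
		weighted += s[i].stall_ns * s[i].nonidle_ms;
		nonidle_total += s[i].nonidle_ms;
	}
	return nonidle_total ? weighted / nonidle_total : 0;
}

/* Translate stall time into a fixed-point percentage of walltime. */
static uint64_t stall_pct_fp(uint64_t stall_ns, uint64_t period_ns)
{
	assert(period_ns > 0);
	return stall_ns * 100 / period_ns * FIXED_1;
}

/* One simplified decay step of the running average (no rounding). */
static uint64_t ema_step(uint64_t avg_fp, uint64_t exp, uint64_t sample_fp)
{
	return (avg_fp * exp + sample_fp * (FIXED_1 - exp)) / FIXED_1;
}
```

With one CPU stalled for 1s of a 2s period and a second CPU entirely
idle, the weighted sample is still 1s, i.e. 50% pressure; feeding that
into an avg10 of zero moves it to roughly 9% after one step.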
v2:
- stable clock tick, as per Peter
- data structure layout optimization, as per Peter
- fix u64 divisions on 32 bit, as per Peter
- outermost psi_disabled checks, as per Peter
- coding style fixes, as per Peter
- just-in-time stats aggregation, as per Suren
- fix task state corruption with CONFIG_PREEMPT, as per Suren
- CONFIG_PSI=n build error
- avoid writing p->sched_psi_wake_requeue unnecessarily
- documentation & comment updates
v3:
- pack scheduler hotpath data into one cacheline, as per Peter and Linus
- drop unnecessary SCHED_INFO dependency, as per Peter
- lockless live-state aggregation, as per Peter
- do_div -> div64_ul and some other cleanups, as per Peter
- realtime sampling period and slipped sample handling, as per Tejun
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/accounting/psi.txt | 64 +++
include/linux/psi.h | 27 ++
include/linux/psi_types.h | 87 +++++
include/linux/sched.h | 10 +
init/Kconfig | 15 +
kernel/fork.c | 4 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 11 +-
kernel/sched/psi.c | 643 +++++++++++++++++++++++++++++++
kernel/sched/sched.h | 2 +
kernel/sched/stats.h | 80 ++++
mm/compaction.c | 5 +
mm/filemap.c | 15 +-
mm/page_alloc.c | 10 +
mm/vmscan.c | 13 +
15 files changed, 981 insertions(+), 6 deletions(-)
create mode 100644 Documentation/accounting/psi.txt
create mode 100644 include/linux/psi.h
create mode 100644 include/linux/psi_types.h
create mode 100644 kernel/sched/psi.c
diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
new file mode 100644
index 000000000000..51e7ef14142e
--- /dev/null
+++ b/Documentation/accounting/psi.txt
@@ -0,0 +1,64 @@
+================================
+PSI - Pressure Stall Information
+================================
+
+:Date: April, 2018
+:Author: Johannes Weiner <hannes@cmpxchg.org>
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact they have on complex workloads
+or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==================
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+The format for CPU is as such:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are stalled on a given resource.
+
+The "full" line indicates the share of time in which all non-idle
+tasks are stalled on a given resource simultaneously. In this state
+actual CPU cycles are going to waste, and a workload that spends
+extended time in this state is considered to be thrashing. This has
+severe impact on performance, and it's useful to distinguish this
+situation from a state where some tasks are stalled but the CPU is
+still doing productive work. As such, time spent in this subset of the
+stall state is tracked separately and exported in the "full" averages.
+
+The ratios are tracked as recent trends over ten, sixty, and three
+hundred second windows, which gives insight into short term events as
+well as medium and long term trends. The total absolute stall time is
+tracked and exported as well, to allow detection of latency spikes
+which wouldn't necessarily make a dent in the time averages, or to
+average trends over custom time frames.
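
As an illustration of the interface documented above, here is a
minimal userspace sketch that parses one line of a pressure file and
derives a stall percentage over a custom window from two total=
readings. struct psi_line, psi_parse_line() and psi_window_pct() are
hypothetical names, not part of this patch; the sketch assumes total=
is in microseconds, which is what psi_show() in this patch prints.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one line of a /proc/pressure/ file. */
struct psi_line {
	char kind[8]; /* "some" or "full" */
	double avg10, avg60, avg300;
	unsigned long long total_us;
};

/* Parse one line; returns 0 on success, -1 if it does not match. */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	assert(line != NULL && out != NULL);
	if (sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   out->kind, &out->avg10, &out->avg60, &out->avg300,
		   &out->total_us) != 5)
		return -1;
	return 0;
}

/* Stall percentage over a custom window from two total= readings. */
static double psi_window_pct(unsigned long long total_before_us,
			     unsigned long long total_after_us,
			     unsigned long long elapsed_us)
{
	if (elapsed_us == 0 || total_after_us < total_before_us)
		return 0.0;
	return 100.0 * (double)(total_after_us - total_before_us) /
	       (double)elapsed_us;
}
```

A monitor could read the same file twice, parse the "some" line each
time, and pass the two total_us values plus the elapsed wall time to
psi_window_pct() to get an average over any interval it likes.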
diff --git a/include/linux/psi.h b/include/linux/psi.h
new file mode 100644
index 000000000000..371af1479699
--- /dev/null
+++ b/include/linux/psi.h
@@ -0,0 +1,27 @@
+#ifndef _LINUX_PSI_H
+#define _LINUX_PSI_H
+
+#include <linux/psi_types.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_PSI
+
+extern bool psi_disabled;
+
+void psi_init(void);
+
+void psi_task_change(struct task_struct *task, u64 now, int clear, int set);
+
+void psi_memstall_enter(unsigned long *flags);
+void psi_memstall_leave(unsigned long *flags);
+
+#else /* CONFIG_PSI */
+
+static inline void psi_init(void) {}
+
+static inline void psi_memstall_enter(unsigned long *flags) {}
+static inline void psi_memstall_leave(unsigned long *flags) {}
+
+#endif /* CONFIG_PSI */
+
+#endif /* _LINUX_PSI_H */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
new file mode 100644
index 000000000000..b6ff46362eb3
--- /dev/null
+++ b/include/linux/psi_types.h
@@ -0,0 +1,87 @@
+#ifndef _LINUX_PSI_TYPES_H
+#define _LINUX_PSI_TYPES_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_PSI
+
+/* Tracked task states */
+enum psi_task_count {
+ NR_IOWAIT,
+ NR_MEMSTALL,
+ NR_RUNNING,
+ NR_PSI_TASK_COUNTS,
+};
+
+/* Task state bitmasks */
+#define TSK_IOWAIT (1 << NR_IOWAIT)
+#define TSK_MEMSTALL (1 << NR_MEMSTALL)
+#define TSK_RUNNING (1 << NR_RUNNING)
+
+/* Resources that workloads could be stalled on */
+enum psi_res {
+ PSI_IO,
+ PSI_MEM,
+ PSI_CPU,
+ NR_PSI_RESOURCES,
+};
+
+/*
+ * Pressure states for each resource:
+ *
+ * SOME: Stalled tasks & working tasks
+ * FULL: Stalled tasks & no working tasks
+ */
+enum psi_states {
+ PSI_IO_SOME,
+ PSI_IO_FULL,
+ PSI_MEM_SOME,
+ PSI_MEM_FULL,
+ PSI_CPU_SOME,
+ PSI_NONIDLE,
+ NR_PSI_STATES,
+};
+
+struct psi_group_cpu {
+ /* 1st cacheline updated by the scheduler */
+
+ /* States of the tasks belonging to this group */
+ unsigned int tasks[NR_PSI_TASK_COUNTS] ____cacheline_aligned_in_smp;
+
+ /* Period time sampling buckets for each state of interest (ns) */
+ u32 times[NR_PSI_STATES];
+
+ /* Time of last task change in this group (rq_clock) */
+ u64 state_start;
+
+ /* 2nd cacheline updated by the aggregator */
+
+ /* Delta detection against the sampling buckets */
+ u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp;
+};
+
+struct psi_group {
+ /* Protects data updated during an aggregation */
+ struct mutex stat_lock;
+
+ /* Per-cpu task state & time tracking */
+ struct psi_group_cpu __percpu *pcpu;
+
+ /* Periodic aggregation state */
+ u64 total_prev[NR_PSI_STATES - 1];
+ u64 last_update;
+ u64 next_update;
+ struct delayed_work clock_work;
+
+ /* Total stall times and sampled pressure averages */
+ u64 total[NR_PSI_STATES - 1];
+ unsigned long avg[NR_PSI_STATES - 1][3];
+};
+
+#else /* CONFIG_PSI */
+
+struct psi_group { };
+
+#endif /* CONFIG_PSI */
+
+#endif /* _LINUX_PSI_TYPES_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca3f3eae8980..d5e4ee234114 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -25,6 +25,7 @@
#include <linux/latencytop.h>
#include <linux/sched/prio.h>
#include <linux/signal_types.h>
+#include <linux/psi_types.h>
#include <linux/mm_types_task.h>
#include <linux/task_io_accounting.h>
@@ -709,6 +710,10 @@ struct task_struct {
unsigned sched_contributes_to_load:1;
unsigned sched_migrated:1;
unsigned sched_remote_wakeup:1;
+#ifdef CONFIG_PSI
+ unsigned sched_psi_wake_requeue:1;
+#endif
+
/* Force alignment to the next boundary: */
unsigned :0;
@@ -956,6 +961,10 @@ struct task_struct {
siginfo_t *last_siginfo;
struct task_io_accounting ioac;
+#ifdef CONFIG_PSI
+ /* Pressure stall state */
+ unsigned int psi_flags;
+#endif
#ifdef CONFIG_TASK_XACCT
/* Accumulated RSS usage: */
u64 acct_rss_mem1;
@@ -1385,6 +1394,7 @@ extern struct pid *cad_pid;
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
+#define PF_MEMSTALL 0x01000000 /* Stalled due to lack of memory */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
diff --git a/init/Kconfig b/init/Kconfig
index 18b151f0ddc1..ad61ddb5d68e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -457,6 +457,21 @@ config TASK_IO_ACCOUNTING
Say N if unsure.
+config PSI
+ bool "Pressure stall information tracking"
+ help
+ Collect metrics that indicate how overcommitted the CPU, memory,
+ and IO capacity are in the system.
+
+ If you say Y here, the kernel will create /proc/pressure/ with the
+ pressure statistics files cpu, memory, and io. These will indicate
+ the share of walltime in which some or all tasks in the system are
+ delayed due to contention of the respective resource.
+
+ For more details see Documentation/accounting/psi.txt.
+
+ Say N if unsure.
+
endmenu # "CPU/Task time and stats accounting"
config CPU_ISOLATION
diff --git a/kernel/fork.c b/kernel/fork.c
index a5d21c42acfc..067aa5c28526 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1704,6 +1704,10 @@ static __latent_entropy struct task_struct *copy_process(
p->default_timer_slack_ns = current->timer_slack_ns;
+#ifdef CONFIG_PSI
+ p->psi_flags = 0;
+#endif
+
task_io_accounting_init(&p->ioac);
acct_clear_integrals(p);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d9a02b318108..b29bc18f2704 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_CPU_ISOLATION) += isolation.o
+obj-$(CONFIG_PSI) += psi.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9586a8141f16..e53137df405b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -743,8 +743,10 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);
- if (!(flags & ENQUEUE_RESTORE))
+ if (!(flags & ENQUEUE_RESTORE)) {
sched_info_queued(rq, p);
+ psi_enqueue(rq, p, flags & ENQUEUE_WAKEUP);
+ }
p->sched_class->enqueue_task(rq, p, flags);
}
@@ -754,8 +756,10 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & DEQUEUE_NOCLOCK))
update_rq_clock(rq);
- if (!(flags & DEQUEUE_SAVE))
+ if (!(flags & DEQUEUE_SAVE)) {
sched_info_dequeued(rq, p);
+ psi_dequeue(rq, p, flags & DEQUEUE_SLEEP);
+ }
p->sched_class->dequeue_task(rq, p, flags);
}
@@ -2058,6 +2062,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;
+ psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
}
@@ -6124,6 +6129,8 @@ void __init sched_init(void)
init_schedstats();
+ psi_init();
+
scheduler_running = 1;
}
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
new file mode 100644
index 000000000000..57ec86592b5a
--- /dev/null
+++ b/kernel/sched/psi.c
@@ -0,0 +1,643 @@
+/*
+ * Pressure stall information for CPU, memory and IO
+ *
+ * Copyright (c) 2018 Facebook, Inc.
+ * Author: Johannes Weiner <hannes@cmpxchg.org>
+ *
+ * When CPU, memory and IO are contended, tasks experience delays that
+ * reduce throughput and introduce latencies into the workload. Memory
+ * and IO contention, in addition, can cause a full loss of forward
+ * progress in which the CPU goes idle.
+ *
+ * This code aggregates individual task delays into resource pressure
+ * metrics that indicate problems with both workload health and
+ * resource utilization.
+ *
+ * Model
+ *
+ * The time in which a task can execute on a CPU is our baseline for
+ * productivity. Pressure expresses the amount of time in which this
+ * potential cannot be realized due to resource contention.
+ *
+ * This concept of productivity has two components: the workload and
+ * the CPU. To measure the impact of pressure on both, we define two
+ * contention states for a resource: SOME and FULL.
+ *
+ * In the SOME state of a given resource, one or more tasks are
+ * delayed on that resource. This affects the workload's ability to
+ * perform work, but the CPU may still be executing other tasks.
+ *
+ * In the FULL state of a given resource, all non-idle tasks are
+ * delayed on that resource such that nobody is advancing and the CPU
+ * goes idle. This leaves both workload and CPU unproductive.
+ *
+ * (Naturally, the FULL state doesn't exist for the CPU resource.)
+ *
+ * SOME = nr_delayed_tasks != 0
+ * FULL = nr_delayed_tasks != 0 && nr_running_tasks == 0
+ *
+ * The percentage of wallclock time spent in those compound stall
+ * states gives pressure numbers between 0 and 100 for each resource,
+ * where the SOME percentage indicates workload slowdowns and the FULL
+ * percentage indicates reduced CPU utilization:
+ *
+ * %SOME = time(SOME) / period
+ * %FULL = time(FULL) / period
+ *
+ * Multiple CPUs
+ *
+ * The more tasks and available CPUs there are, the more work can be
+ * performed concurrently. This means that the potential that can go
+ * unrealized due to resource contention *also* scales with non-idle
+ * tasks and CPUs.
+ *
+ * Consider a scenario where 257 number crunching tasks are trying to
+ * run concurrently on 256 CPUs. If we simply aggregated the task
+ * states, we would have to conclude a CPU SOME pressure number of
+ * 100%, since *somebody* is waiting on a runqueue at all
+ * times. However, that is clearly not the amount of contention the
+ * workload is experiencing: only one out of 256 possible execution
+ * threads will be contended at any given time, or about 0.4%.
+ *
+ * Conversely, consider a scenario of 4 tasks and 4 CPUs where at any
+ * given time *one* of the tasks is delayed due to a lack of memory.
+ * Again, looking purely at the task state would yield a memory FULL
+ * pressure number of 0%, since *somebody* is always making forward
+ * progress. But again this wouldn't capture the amount of execution
+ * potential lost, which is 1 out of 4 CPUs, or 25%.
+ *
+ * To calculate wasted potential (pressure) with multiple processors,
+ * we have to base our calculation on the number of non-idle tasks in
+ * conjunction with the number of available CPUs, which is the number
+ * of potential execution threads. SOME becomes then the proportion of
+ * delayed tasks to possible threads, and FULL is the share of possible
+ * threads that are unproductive due to delays:
+ *
+ * threads = min(nr_nonidle_tasks, nr_cpus)
+ * SOME = min(nr_delayed_tasks / threads, 1)
+ * FULL = (threads - min(nr_running_tasks, threads)) / threads
+ *
+ * For the 257 number crunchers on 256 CPUs, this yields:
+ *
+ * threads = min(257, 256)
+ * SOME = min(1 / 256, 1) = 0.4%
+ * FULL = (256 - min(257, 256)) / 256 = 0%
+ *
+ * For the 1 out of 4 memory-delayed tasks, this yields:
+ *
+ * threads = min(4, 4)
+ * SOME = min(1 / 4, 1) = 25%
+ * FULL = (4 - min(3, 4)) / 4 = 25%
+ *
+ * [ Substitute nr_cpus with 1, and you can see that it's a natural
+ * extension of the single-CPU model. ]
+ *
+ * Implementation
+ *
+ * To assess the precise time spent in each such state, we would have
+ * to freeze the system on task changes and start/stop the state
+ * clocks accordingly. Obviously that doesn't scale in practice.
+ *
+ * Because the scheduler aims to distribute the compute load evenly
+ * among the available CPUs, we can track task state locally to each
+ * CPU and, at much lower frequency, extrapolate the global state for
+ * the cumulative stall times and the running averages.
+ *
+ * For each runqueue, we track:
+ *
+ * tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)
+ * tFULL[cpu] = time(nr_delayed_tasks[cpu] && !nr_running_tasks[cpu])
+ * tNONIDLE[cpu] = time(nr_nonidle_tasks[cpu] != 0)
+ *
+ * and then periodically aggregate:
+ *
+ * tNONIDLE = sum(tNONIDLE[i])
+ *
+ * tSOME = sum(tSOME[i] * tNONIDLE[i]) / tNONIDLE
+ * tFULL = sum(tFULL[i] * tNONIDLE[i]) / tNONIDLE
+ *
+ * %SOME = tSOME / period
+ * %FULL = tFULL / period
+ *
+ * This gives us an approximation of pressure that is practical
+ * cost-wise, yet way more sensitive and accurate than periodic
+ * sampling of the aggregate task states would be.
+ */
+
+#include <linux/sched/loadavg.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <linux/cgroup.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/psi.h>
+#include "sched.h"
+
+static int psi_bug __read_mostly;
+
+bool psi_disabled __read_mostly;
+core_param(psi_disabled, psi_disabled, bool, 0644);
+
+/* Running averages - we need to be higher-res than loadavg */
+#define PSI_FREQ (2*HZ+1) /* 2 sec intervals */
+#define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */
+#define EXP_60s 1981 /* 1/exp(2s/60s) */
+#define EXP_300s 2034 /* 1/exp(2s/300s) */
+
+/* Sampling period in nanoseconds */
+static u64 psi_period __read_mostly;
+
+/* System-level pressure and stall tracking */
+static DEFINE_PER_CPU(struct psi_group_cpu, system_group_pcpu);
+static struct psi_group psi_system = {
+ .pcpu = &system_group_pcpu,
+};
+
+static void psi_clock(struct work_struct *work);
+
+static void psi_group_init(struct psi_group *group)
+{
+ group->next_update = sched_clock() + psi_period;
+ INIT_DELAYED_WORK(&group->clock_work, psi_clock);
+ mutex_init(&group->stat_lock);
+}
+
+void __init psi_init(void)
+{
+ if (psi_disabled)
+ return;
+
+ psi_period = jiffies_to_nsecs(PSI_FREQ);
+ psi_group_init(&psi_system);
+}
+
+static void calc_avgs(unsigned long avg[3], int missed_periods,
+ u64 time, u64 period)
+{
+ unsigned long pct;
+
+ /* Fill in zeroes for periods of no activity */
+ if (missed_periods) {
+ avg[0] = calc_load_n(avg[0], EXP_10s, 0, missed_periods);
+ avg[1] = calc_load_n(avg[1], EXP_60s, 0, missed_periods);
+ avg[2] = calc_load_n(avg[2], EXP_300s, 0, missed_periods);
+ }
+
+ /* Sample the most recent active period */
+ pct = div_u64(time * 100, period);
+ pct *= FIXED_1;
+ avg[0] = calc_load(avg[0], EXP_10s, pct);
+ avg[1] = calc_load(avg[1], EXP_60s, pct);
+ avg[2] = calc_load(avg[2], EXP_300s, pct);
+}
+
+static bool test_state(unsigned int *tasks, int cpu, enum psi_states state)
+{
+ switch (state) {
+ case PSI_IO_SOME:
+ return tasks[NR_IOWAIT];
+ case PSI_IO_FULL:
+ return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
+ case PSI_MEM_SOME:
+ return tasks[NR_MEMSTALL];
+ case PSI_MEM_FULL:
+ /*
+ * Since we care about lost potential, things are
+ * fully blocked on memory when there are no other
+ * working tasks, but also when the CPU is actively
+ * being used by a reclaimer and nothing productive
+ * could run even if it were runnable.
+ */
+ return tasks[NR_MEMSTALL] &&
+ (!tasks[NR_RUNNING] ||
+ cpu_curr(cpu)->flags & PF_MEMSTALL);
+ case PSI_CPU_SOME:
+ return tasks[NR_RUNNING] > 1;
+ case PSI_NONIDLE:
+ return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
+ tasks[NR_RUNNING];
+ default:
+ return false;
+ }
+}
+
+static bool psi_update_stats(struct psi_group *group)
+{
+ u64 deltas[NR_PSI_STATES - 1] = { 0, };
+ unsigned long missed_periods = 0;
+ unsigned long nonidle_total = 0;
+ u64 now, expires, period;
+ int cpu;
+ int s;
+
+ mutex_lock(&group->stat_lock);
+
+ /*
+ * Collect the per-cpu time buckets and average them into a
+ * single time sample that is normalized to wallclock time.
+ *
+ * For averaging, each CPU is weighted by its non-idle time in
+ * the sampling period. This eliminates artifacts from uneven
+ * loading, or even entirely idle CPUs.
+ *
+ * We don't need to synchronize against CPU hotplugging. If we
+ * see a CPU that's online and has samples, we incorporate it.
+ */
+ for_each_online_cpu(cpu) {
+ struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+ u32 uninitialized_var(nonidle);
+
+ BUILD_BUG_ON(PSI_NONIDLE != NR_PSI_STATES - 1);
+
+ for (s = PSI_NONIDLE; s >= 0; s--) {
+ u32 time, delta;
+
+ time = READ_ONCE(groupc->times[s]);
+ /*
+ * In addition to already concluded states, we
+ * also incorporate currently active states on
+ * the CPU, since states may last for many
+ * sampling periods.
+ *
+ * This way we keep our delta sampling buckets
+ * small (u32) and our reported pressure close
+ * to what's actually happening.
+ */
+ if (test_state(groupc->tasks, cpu, s)) {
+ /*
+ * We can race with a state change and
+ * need to make sure the state_start
+ * update is ordered against the
+ * updates to the live state and the
+ * time buckets (groupc->times).
+ *
+ * 1. If we observe task state that
+ * needs to be recorded, make sure we
+ * see state_start from when that
+ * state went into effect or we'll
+ * count time from the previous state.
+ *
+ * 2. If the time delta has already
+ * been added to the bucket, make sure
+ * we don't see it in state_start or
+ * we'll count it twice.
+ *
+ * If the time delta is out of
+ * state_start but not in the time
+ * bucket yet, we'll miss it entirely
+ * and handle it in the next period.
+ */
+ smp_rmb();
+ time += cpu_clock(cpu) - groupc->state_start;
+ }
+ delta = time - groupc->times_prev[s];
+ groupc->times_prev[s] = time;
+
+ if (s == PSI_NONIDLE) {
+ nonidle = nsecs_to_jiffies(delta);
+ nonidle_total += nonidle;
+ } else {
+ deltas[s] += (u64)delta * nonidle;
+ }
+ }
+ }
+
+ /*
+ * Integrate the sample into the running statistics that are
+ * reported to userspace: the cumulative stall times and the
+ * decaying averages.
+ *
+ * Pressure percentages are sampled at PSI_FREQ. We might be
+ * called more often when the user polls more frequently than
+ * that; we might be called less often when there is no task
+ * activity, thus no data, and clock ticks are sporadic. The
+ * below handles both.
+ */
+
+ /* total= */
+ for (s = 0; s < NR_PSI_STATES - 1; s++)
+ group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL));
+
+ /* avgX= */
+ now = sched_clock();
+ expires = group->next_update;
+ if (now < expires)
+ goto out;
+ if (now - expires > psi_period)
+ missed_periods = div_u64(now - expires, psi_period);
+
+ /*
+ * The periodic clock tick can get delayed for various
+ * reasons, especially on loaded systems. To avoid clock
+ * drift, we schedule the clock in fixed psi_period intervals.
+ * But the deltas we sample out of the per-cpu buckets above
+ * are based on the actual time elapsing between clock ticks.
+ */
+ group->next_update = expires + ((1 + missed_periods) * psi_period);
+ period = now - (group->last_update + (missed_periods * psi_period));
+ group->last_update = now;
+
+ for (s = 0; s < NR_PSI_STATES - 1; s++) {
+ u32 sample;
+
+ sample = group->total[s] - group->total_prev[s];
+ /*
+ * Due to the lockless sampling of the time buckets,
+ * recorded time deltas can slip into the next period,
+ * which under full pressure can result in samples in
+ * excess of the period length.
+ *
+ * We don't want to report nonsensical pressures in
+ * excess of 100%, nor do we want to drop such events
+ * on the floor. Instead we punt any overage into the
+ * future until pressure subsides. By doing this we
+ * don't underreport the occurring pressure curve, we
+ * just report it delayed by one period length.
+ *
+ * The error isn't cumulative. As soon as another
+ * delta slips from a period P to P+1, by definition
+ * it frees up its time T in P.
+ */
+ if (sample > period)
+ sample = period;
+ group->total_prev[s] += sample;
+ calc_avgs(group->avg[s], missed_periods, sample, period);
+ }
+out:
+ mutex_unlock(&group->stat_lock);
+ return nonidle_total;
+}
+
+static void psi_clock(struct work_struct *work)
+{
+ struct delayed_work *dwork;
+ struct psi_group *group;
+ bool nonidle;
+
+ dwork = to_delayed_work(work);
+ group = container_of(dwork, struct psi_group, clock_work);
+
+ /*
+ * If there is task activity, periodically fold the per-cpu
+ * times and feed samples into the running averages. If things
+ * are idle and there is no data to process, stop the clock.
+ * Once restarted, we'll catch up the running averages in one
+ * go - see calc_avgs() and missed_periods.
+ */
+
+ nonidle = psi_update_stats(group);
+
+ if (nonidle) {
+ unsigned long delay = 0;
+ u64 now;
+
+ now = sched_clock();
+ if (group->next_update > now)
+ delay = nsecs_to_jiffies(group->next_update - now) + 1;
+ schedule_delayed_work(dwork, delay);
+ }
+}
+
+static void psi_group_change(struct psi_group *group, int cpu, u64 now,
+ unsigned int clear, unsigned int set)
+{
+ struct psi_group_cpu *groupc;
+ unsigned int t, m;
+ u32 delta;
+
+ groupc = per_cpu_ptr(group->pcpu, cpu);
+
+ /*
+ * First we assess the aggregate resource states this CPU's
+ * tasks have been in since the last change, and account any
+ * SOME and FULL time these may have resulted in.
+ *
+ * Then we update the task counts according to the state
+ * change requested through the @clear and @set bits.
+ */
+
+ delta = now - groupc->state_start;
+ groupc->state_start = now;
+
+ /*
+ * Update state_start before recording time in the sampling
+ * buckets and changing task counts, to prevent a racing
+ * aggregation from counting the delta twice or attributing it
+ * to an old state.
+ */
+ smp_wmb();
+
+ if (test_state(groupc->tasks, cpu, PSI_IO_SOME)) {
+ groupc->times[PSI_IO_SOME] += delta;
+ if (test_state(groupc->tasks, cpu, PSI_IO_FULL))
+ groupc->times[PSI_IO_FULL] += delta;
+ }
+ if (test_state(groupc->tasks, cpu, PSI_MEM_SOME)) {
+ groupc->times[PSI_MEM_SOME] += delta;
+ if (test_state(groupc->tasks, cpu, PSI_MEM_FULL))
+ groupc->times[PSI_MEM_FULL] += delta;
+ }
+ if (test_state(groupc->tasks, cpu, PSI_CPU_SOME))
+ groupc->times[PSI_CPU_SOME] += delta;
+ if (test_state(groupc->tasks, cpu, PSI_NONIDLE))
+ groupc->times[PSI_NONIDLE] += delta;
+
+ for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
+ if (!(m & (1 << t)))
+ continue;
+ if (groupc->tasks[t] == 0 && !psi_bug) {
+ printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u] clear=%x set=%x\n",
+ cpu, t, groupc->tasks[0],
+ groupc->tasks[1], groupc->tasks[2],
+ clear, set);
+ psi_bug = 1;
+ }
+ groupc->tasks[t]--;
+ }
+ for (t = 0; set; set &= ~(1 << t), t++)
+ if (set & (1 << t))
+ groupc->tasks[t]++;
+
+ if (!delayed_work_pending(&group->clock_work))
+ schedule_delayed_work(&group->clock_work, PSI_FREQ);
+}
+
+void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
+{
+ int cpu = task_cpu(task);
+
+ if (psi_disabled)
+ return;
+
+ if (!task->pid)
+ return;
+
+ if (((task->psi_flags & set) ||
+ (task->psi_flags & clear) != clear) &&
+ !psi_bug) {
+ printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
+ task->pid, task->comm, cpu,
+ task->psi_flags, clear, set);
+ psi_bug = 1;
+ }
+
+ task->psi_flags &= ~clear;
+ task->psi_flags |= set;
+
+ psi_group_change(&psi_system, cpu, now, clear, set);
+}
+
+/**
+ * psi_memstall_enter - mark the beginning of a memory stall section
+ * @flags: flags to handle nested sections
+ *
+ * Marks the calling task as being stalled due to a lack of memory,
+ * such as waiting for a refault or performing reclaim.
+ */
+void psi_memstall_enter(unsigned long *flags)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ if (psi_disabled)
+ return;
+
+ *flags = current->flags & PF_MEMSTALL;
+ if (*flags)
+ return;
+ /*
+ * PF_MEMSTALL setting & accounting needs to be atomic wrt
+ * changes to the task's scheduling state, otherwise we can
+ * race with CPU migration.
+ */
+ rq = this_rq_lock_irq(&rf);
+
+ update_rq_clock(rq);
+
+ current->flags |= PF_MEMSTALL;
+ psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
+
+ rq_unlock_irq(rq, &rf);
+}
+
+/**
+ * psi_memstall_leave - mark the end of a memory stall section
+ * @flags: flags to handle nested sections
+ *
+ * Marks the calling task as no longer stalled due to lack of memory.
+ */
+void psi_memstall_leave(unsigned long *flags)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ if (psi_disabled)
+ return;
+
+ if (*flags)
+ return;
+ /*
+ * PF_MEMSTALL clearing & accounting needs to be atomic wrt
+ * changes to the task's scheduling state, otherwise we could
+ * race with CPU migration.
+ */
+ rq = this_rq_lock_irq(&rf);
+
+ update_rq_clock(rq);
+
+ current->flags &= ~PF_MEMSTALL;
+ psi_task_change(current, rq_clock(rq), TSK_MEMSTALL, 0);
+
+ rq_unlock_irq(rq, &rf);
+}
+
+static int psi_show(struct seq_file *m, struct psi_group *group,
+ enum psi_res res)
+{
+ int full;
+
+ if (psi_disabled)
+ return -EOPNOTSUPP;
+
+ psi_update_stats(group);
+
+ for (full = 0; full < 2 - (res == PSI_CPU); full++) {
+ unsigned long avg[3];
+ u64 total;
+ int w;
+
+ for (w = 0; w < 3; w++)
+ avg[w] = group->avg[res * 2 + full][w];
+ total = div_u64(group->total[res * 2 + full], NSEC_PER_USEC);
+
+ seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
+ full ? "full" : "some",
+ LOAD_INT(avg[0]), LOAD_FRAC(avg[0]),
+ LOAD_INT(avg[1]), LOAD_FRAC(avg[1]),
+ LOAD_INT(avg[2]), LOAD_FRAC(avg[2]),
+ total);
+ }
+
+ return 0;
+}
+
+static int psi_io_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_IO);
+}
+
+static int psi_memory_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_MEM);
+}
+
+static int psi_cpu_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_CPU);
+}
+
+static int psi_io_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, psi_io_show, NULL);
+}
+
+static int psi_memory_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, psi_memory_show, NULL);
+}
+
+static int psi_cpu_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, psi_cpu_show, NULL);
+}
+
+static const struct file_operations psi_io_fops = {
+ .open = psi_io_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static const struct file_operations psi_memory_fops = {
+ .open = psi_memory_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static const struct file_operations psi_cpu_fops = {
+ .open = psi_cpu_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init psi_proc_init(void)
+{
+ proc_mkdir("pressure", NULL);
+ proc_create("pressure/io", 0, NULL, &psi_io_fops);
+ proc_create("pressure/memory", 0, NULL, &psi_memory_fops);
+ proc_create("pressure/cpu", 0, NULL, &psi_cpu_fops);
+ return 0;
+}
+module_init(psi_proc_init);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bc798c7cb4d4..e798491ff329 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,6 +54,7 @@
#include <linux/proc_fs.h>
#include <linux/prefetch.h>
#include <linux/profile.h>
+#include <linux/psi.h>
#include <linux/rcupdate_wait.h>
#include <linux/security.h>
#include <linux/stackprotector.h>
@@ -320,6 +321,7 @@ extern bool dl_cpu_busy(unsigned int cpu);
#ifdef CONFIG_CGROUP_SCHED
#include <linux/cgroup.h>
+#include <linux/psi.h>
struct cfs_rq;
struct rt_rq;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8aea199a39b4..f3e0267eb47d 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -55,6 +55,86 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt
# define schedstat_val_or_zero(var) 0
#endif /* CONFIG_SCHEDSTATS */
+#ifdef CONFIG_PSI
+/*
+ * PSI tracks state that persists across sleeps, such as iowaits and
+ * memory stalls. As a result, it has to distinguish between sleeps,
+ * where a task's runnable state changes, and requeues, where a task
+ * and its state are being moved between CPUs and runqueues.
+ */
+static inline void psi_enqueue(struct rq *rq, struct task_struct *p,
+ bool wakeup)
+{
+ int clear = 0, set = TSK_RUNNING;
+
+ if (psi_disabled)
+ return;
+
+ if (!wakeup || p->sched_psi_wake_requeue) {
+ if (p->flags & PF_MEMSTALL)
+ set |= TSK_MEMSTALL;
+ if (p->sched_psi_wake_requeue)
+ p->sched_psi_wake_requeue = 0;
+ } else {
+ if (p->in_iowait)
+ clear |= TSK_IOWAIT;
+ }
+
+ psi_task_change(p, rq_clock(rq), clear, set);
+}
+
+static inline void psi_dequeue(struct rq *rq, struct task_struct *p, bool sleep)
+{
+ int clear = TSK_RUNNING, set = 0;
+
+ if (psi_disabled)
+ return;
+
+ if (!sleep) {
+ if (p->flags & PF_MEMSTALL)
+ clear |= TSK_MEMSTALL;
+ } else {
+ if (p->in_iowait)
+ set |= TSK_IOWAIT;
+ }
+
+ psi_task_change(p, rq_clock(rq), clear, set);
+}
+
+static inline void psi_ttwu_dequeue(struct task_struct *p)
+{
+ if (psi_disabled)
+ return;
+ /*
+ * Is the task being migrated during a wakeup? Make sure to
+ * deregister its sleep-persistent psi states from the old
+ * queue, and let psi_enqueue() know it has to requeue.
+ */
+ if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
+ struct rq_flags rf;
+ struct rq *rq;
+ int clear = 0;
+
+ if (p->in_iowait)
+ clear |= TSK_IOWAIT;
+ if (p->flags & PF_MEMSTALL)
+ clear |= TSK_MEMSTALL;
+
+ rq = __task_rq_lock(p, &rf);
+ update_rq_clock(rq);
+ psi_task_change(p, rq_clock(rq), clear, 0);
+ p->sched_psi_wake_requeue = 1;
+ __task_rq_unlock(rq, &rf);
+ }
+}
+#else /* CONFIG_PSI */
+static inline void psi_enqueue(struct rq *rq, struct task_struct *p,
+ bool wakeup) {}
+static inline void psi_dequeue(struct rq *rq, struct task_struct *p,
+ bool sleep) {}
+static inline void psi_ttwu_dequeue(struct task_struct *p) {}
+#endif /* CONFIG_PSI */
+
#ifdef CONFIG_SCHED_INFO
static inline void sched_info_reset_dequeued(struct task_struct *t)
{
diff --git a/mm/compaction.c b/mm/compaction.c
index 29bd1df18b98..8f9566745902 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -22,6 +22,7 @@
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/page_owner.h>
+#include <linux/psi.h>
#include "internal.h"
#ifdef CONFIG_COMPACTION
@@ -2068,11 +2069,15 @@ static int kcompactd(void *p)
pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
while (!kthread_should_stop()) {
+ unsigned long pflags;
+
trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
wait_event_freezable(pgdat->kcompactd_wait,
kcompactd_work_requested(pgdat));
+ psi_memstall_enter(&pflags);
kcompactd_do_work(pgdat);
+ psi_memstall_leave(&pflags);
}
return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index e49961e13dd9..eee06145b997 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -37,6 +37,7 @@
#include <linux/shmem_fs.h>
#include <linux/rmap.h>
#include <linux/delayacct.h>
+#include <linux/psi.h>
#include "internal.h"
#define CREATE_TRACE_POINTS
@@ -1075,11 +1076,14 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
bool thrashing = false;
+ unsigned long pflags;
int ret = 0;
- if (bit_nr == PG_locked && !PageSwapBacked(page) &&
+ if (bit_nr == PG_locked &&
!PageUptodate(page) && PageWorkingset(page)) {
- delayacct_thrashing_start();
+ if (!PageSwapBacked(page))
+ delayacct_thrashing_start();
+ psi_memstall_enter(&pflags);
thrashing = true;
}
@@ -1121,8 +1125,11 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
finish_wait(q, wait);
- if (thrashing)
- delayacct_thrashing_end();
+ if (thrashing) {
+ if (!PageSwapBacked(page))
+ delayacct_thrashing_end();
+ psi_memstall_leave(&pflags);
+ }
/*
* A signal could leave PageWaiters set. Clearing it here if
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 22320ea27489..8469f34e6731 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -67,6 +67,7 @@
#include <linux/ftrace.h>
#include <linux/lockdep.h>
#include <linux/nmi.h>
+#include <linux/psi.h>
#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -3552,15 +3553,20 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
enum compact_priority prio, enum compact_result *compact_result)
{
struct page *page;
+ unsigned long pflags;
unsigned int noreclaim_flag;
if (!order)
return NULL;
+ psi_memstall_enter(&pflags);
noreclaim_flag = memalloc_noreclaim_save();
+
*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
prio);
+
memalloc_noreclaim_restore(noreclaim_flag);
+ psi_memstall_leave(&pflags);
if (*compact_result <= COMPACT_INACTIVE)
return NULL;
@@ -3749,11 +3755,14 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
struct reclaim_state reclaim_state;
int progress;
unsigned int noreclaim_flag;
+ unsigned long pflags;
cond_resched();
/* We now go into synchronous reclaim */
cpuset_memory_pressure_bump();
+
+ psi_memstall_enter(&pflags);
noreclaim_flag = memalloc_noreclaim_save();
fs_reclaim_acquire(gfp_mask);
reclaim_state.reclaimed_slab = 0;
@@ -3765,6 +3774,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
current->reclaim_state = NULL;
fs_reclaim_release(gfp_mask);
memalloc_noreclaim_restore(noreclaim_flag);
+ psi_memstall_leave(&pflags);
cond_resched();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8d1ad48ffbcd..ee91e8cbeb5a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -49,6 +49,7 @@
#include <linux/prefetch.h>
#include <linux/printk.h>
#include <linux/dax.h>
+#include <linux/psi.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -3115,6 +3116,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
+ unsigned long pflags;
int nid;
unsigned int noreclaim_flag;
struct scan_control sc = {
@@ -3143,9 +3145,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
sc.gfp_mask,
sc.reclaim_idx);
+ psi_memstall_enter(&pflags);
noreclaim_flag = memalloc_noreclaim_save();
+
nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+
memalloc_noreclaim_restore(noreclaim_flag);
+ psi_memstall_leave(&pflags);
trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
@@ -3565,6 +3571,7 @@ static int kswapd(void *p)
pgdat->kswapd_order = 0;
pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
for ( ; ; ) {
+ unsigned long pflags;
bool ret;
alloc_order = reclaim_order = pgdat->kswapd_order;
@@ -3601,9 +3608,15 @@ static int kswapd(void *p)
*/
trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
alloc_order);
+
+ psi_memstall_enter(&pflags);
fs_reclaim_acquire(GFP_KERNEL);
+
reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
+
fs_reclaim_release(GFP_KERNEL);
+ psi_memstall_leave(&pflags);
+
if (reclaim_order < alloc_order)
goto kswapd_try_sleep;
}
--
2.18.0
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-01 15:19 ` [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
@ 2018-08-03 16:56 ` Peter Zijlstra
2018-08-06 15:05 ` Johannes Weiner
` (2 more replies)
2018-08-03 17:07 ` Peter Zijlstra
` (2 subsequent siblings)
3 siblings, 3 replies; 55+ messages in thread
From: Peter Zijlstra @ 2018-08-03 16:56 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
On Wed, Aug 01, 2018 at 11:19:57AM -0400, Johannes Weiner wrote:
> +static bool test_state(unsigned int *tasks, int cpu, enum psi_states state)
> +{
> + switch (state) {
> + case PSI_IO_SOME:
> + return tasks[NR_IOWAIT];
> + case PSI_IO_FULL:
> + return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
> + case PSI_MEM_SOME:
> + return tasks[NR_MEMSTALL];
> + case PSI_MEM_FULL:
> + /*
> + * Since we care about lost potential, things are
> + * fully blocked on memory when there are no other
> + * working tasks, but also when the CPU is actively
> + * being used by a reclaimer and nothing productive
> + * could run even if it were runnable.
> + */
> + return tasks[NR_MEMSTALL] &&
> + (!tasks[NR_RUNNING] ||
> + cpu_curr(cpu)->flags & PF_MEMSTALL);
I don't think you can do this, there is nothing that guarantees
cpu_curr() still exists.
> + case PSI_CPU_SOME:
> + return tasks[NR_RUNNING] > 1;
> + case PSI_NONIDLE:
> + return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
> + tasks[NR_RUNNING];
> + default:
> + return false;
> + }
> +}
> +
> +static bool psi_update_stats(struct psi_group *group)
> +{
> + u64 deltas[NR_PSI_STATES - 1] = { 0, };
> + unsigned long missed_periods = 0;
> + unsigned long nonidle_total = 0;
> + u64 now, expires, period;
> + int cpu;
> + int s;
> +
> + mutex_lock(&group->stat_lock);
> +
> + /*
> + * Collect the per-cpu time buckets and average them into a
> + * single time sample that is normalized to wallclock time.
> + *
> + * For averaging, each CPU is weighted by its non-idle time in
> + * the sampling period. This eliminates artifacts from uneven
> + * loading, or even entirely idle CPUs.
> + *
> + * We don't need to synchronize against CPU hotplugging. If we
> + * see a CPU that's online and has samples, we incorporate it.
> + */
> + for_each_online_cpu(cpu) {
> + struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> + u32 uninitialized_var(nonidle);
urgh.. I can see why the compiler got confused. Dodgy :-)
> +
> + BUILD_BUG_ON(PSI_NONIDLE != NR_PSI_STATES - 1);
> +
> + for (s = PSI_NONIDLE; s >= 0; s--) {
> + u32 time, delta;
> +
> + time = READ_ONCE(groupc->times[s]);
> + /*
> + * In addition to already concluded states, we
> + * also incorporate currently active states on
> + * the CPU, since states may last for many
> + * sampling periods.
> + *
> + * This way we keep our delta sampling buckets
> + * small (u32) and our reported pressure close
> + * to what's actually happening.
> + */
> + if (test_state(groupc->tasks, cpu, s)) {
> + /*
> + * We can race with a state change and
> + * need to make sure the state_start
> + * update is ordered against the
> + * updates to the live state and the
> + * time buckets (groupc->times).
> + *
> + * 1. If we observe task state that
> + * needs to be recorded, make sure we
> + * see state_start from when that
> + * state went into effect or we'll
> + * count time from the previous state.
> + *
> + * 2. If the time delta has already
> + * been added to the bucket, make sure
> + * we don't see it in state_start or
> + * we'll count it twice.
> + *
> + * If the time delta is out of
> + * state_start but not in the time
> + * bucket yet, we'll miss it entirely
> + * and handle it in the next period.
> + */
> + smp_rmb();
> + time += cpu_clock(cpu) - groupc->state_start;
> + }
The alternative is adding an update to scheduler_tick(), that would
ensure you're never more than nr_cpu_ids * TICK_NSEC behind.
> + delta = time - groupc->times_prev[s];
> + groupc->times_prev[s] = time;
> +
> + if (s == PSI_NONIDLE) {
> + nonidle = nsecs_to_jiffies(delta);
> + nonidle_total += nonidle;
> + } else {
> + deltas[s] += (u64)delta * nonidle;
> + }
> + }
> + }
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-03 16:56 ` Peter Zijlstra
@ 2018-08-06 15:05 ` Johannes Weiner
2018-08-06 15:25 ` Peter Zijlstra
2018-08-06 15:19 ` Johannes Weiner
2018-08-21 19:44 ` Johannes Weiner
2 siblings, 1 reply; 55+ messages in thread
From: Johannes Weiner @ 2018-08-06 15:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
On Fri, Aug 03, 2018 at 06:56:41PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 01, 2018 at 11:19:57AM -0400, Johannes Weiner wrote:
> > +static bool test_state(unsigned int *tasks, int cpu, enum psi_states state)
> > +{
> > + switch (state) {
> > + case PSI_IO_SOME:
> > + return tasks[NR_IOWAIT];
> > + case PSI_IO_FULL:
> > + return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
> > + case PSI_MEM_SOME:
> > + return tasks[NR_MEMSTALL];
> > + case PSI_MEM_FULL:
> > + /*
> > + * Since we care about lost potential, things are
> > + * fully blocked on memory when there are no other
> > + * working tasks, but also when the CPU is actively
> > + * being used by a reclaimer and nothing productive
> > + * could run even if it were runnable.
> > + */
> > + return tasks[NR_MEMSTALL] &&
> > + (!tasks[NR_RUNNING] ||
> > + cpu_curr(cpu)->flags & PF_MEMSTALL);
>
> I don't think you can do this, there is nothing that guarantees
> cpu_curr() still exists.
Argh, that's right. This needs an explicit count if we want to access
it locklessly. And you already said you didn't like that this is the
only state not derived purely from the task counters, so maybe this is
the way to go after all.
How about something like this (untested)?
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index b6ff46362eb3..afc39fbbf9dd 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -10,6 +10,7 @@ enum psi_task_count {
NR_IOWAIT,
NR_MEMSTALL,
NR_RUNNING,
+ NR_RECLAIMING,
NR_PSI_TASK_COUNTS,
};
@@ -17,6 +18,7 @@ enum psi_task_count {
#define TSK_IOWAIT (1 << NR_IOWAIT)
#define TSK_MEMSTALL (1 << NR_MEMSTALL)
#define TSK_RUNNING (1 << NR_RUNNING)
+#define TSK_RECLAIMING (1 << NR_RECLAIMING)
/* Resources that workloads could be stalled on */
enum psi_res {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e53137df405b..90fd813dd7c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3517,6 +3517,7 @@ static void __sched notrace __schedule(bool preempt)
*/
++*switch_count;
+ psi_switch(rq, prev, next);
trace_sched_switch(preempt, prev, next);
/* Also unlocks the rq: */
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index a20f885da66f..352c3a032ff0 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -209,8 +209,7 @@ static bool test_state(unsigned int *tasks, int cpu, enum psi_states state)
* could run even if it were runnable.
*/
return tasks[NR_MEMSTALL] &&
- (!tasks[NR_RUNNING] ||
- cpu_curr(cpu)->flags & PF_MEMSTALL);
+ (!tasks[NR_RUNNING] || tasks[NR_RECLAIMING]);
case PSI_CPU_SOME:
return tasks[NR_RUNNING] > 1;
case PSI_NONIDLE:
@@ -530,7 +529,7 @@ void psi_memstall_enter(unsigned long *flags)
update_rq_clock(rq);
current->flags |= PF_MEMSTALL;
- psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
+ psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL|TSK_RECLAIMING);
rq_unlock_irq(rq, &rf);
}
@@ -561,7 +560,7 @@ void psi_memstall_leave(unsigned long *flags)
update_rq_clock(rq);
current->flags &= ~PF_MEMSTALL;
- psi_task_change(current, rq_clock(rq), TSK_MEMSTALL, 0);
+ psi_task_change(current, rq_clock(rq), TSK_MEMSTALL|TSK_RECLAIMING, 0);
rq_unlock_irq(rq, &rf);
}
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index f3e0267eb47d..2babdd53715d 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -127,12 +127,26 @@ static inline void psi_ttwu_dequeue(struct task_struct *p)
__task_rq_unlock(rq, &rf);
}
}
+
+static inline void psi_switch(struct rq *rq, struct task_struct *prev,
+ struct task_struct *next)
+{
+ if (psi_disabled)
+ return;
+
+ if (unlikely(prev->flags & PF_MEMSTALL))
+ psi_task_change(prev, rq_clock(rq), TSK_RECLAIMING, 0);
+ if (unlikely(next->flags & PF_MEMSTALL))
+ psi_task_change(next, rq_clock(rq), 0, TSK_RECLAIMING);
+}
#else /* CONFIG_PSI */
static inline void psi_enqueue(struct rq *rq, struct task_struct *p,
bool wakeup) {}
static inline void psi_dequeue(struct rq *rq, struct task_struct *p,
bool sleep) {}
static inline void psi_ttwu_dequeue(struct task_struct *p) {}
+static inline void psi_switch(struct rq *rq, struct task_struct *prev,
+ struct task_struct *next) {}
#endif /* CONFIG_PSI */
#ifdef CONFIG_SCHED_INFO
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-06 15:05 ` Johannes Weiner
@ 2018-08-06 15:25 ` Peter Zijlstra
2018-08-06 15:40 ` Johannes Weiner
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2018-08-06 15:25 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
On Mon, Aug 06, 2018 at 11:05:50AM -0400, Johannes Weiner wrote:
> Argh, that's right. This needs an explicit count if we want to access
> it locklessly. And you already said you didn't like that this is the
> only state not derived purely from the task counters, so maybe this is
> the way to go after all.
>
> How about something like this (untested)?
> +static inline void psi_switch(struct rq *rq, struct task_struct *prev,
> + struct task_struct *next)
> +{
> + if (psi_disabled)
> + return;
> +
> + if (unlikely(prev->flags & PF_MEMSTALL))
> + psi_task_change(prev, rq_clock(rq), TSK_RECLAIMING, 0);
> + if (unlikely(next->flags & PF_MEMSTALL))
> + psi_task_change(next, rq_clock(rq), 0, TSK_RECLAIMING);
> +}
Urgh... can't say I really like that.
I would really rather do that scheduler_tick() thing to avoid the remote
update. The tick is a lot less hot than the switch path and esp.
next->flags might be a cold line (prev->flags is typically the same line
as prev->state so we already have that, but I don't think anybody now
looks at next->flags or its line, so that'd be cold load).
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-06 15:25 ` Peter Zijlstra
@ 2018-08-06 15:40 ` Johannes Weiner
0 siblings, 0 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-08-06 15:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
On Mon, Aug 06, 2018 at 05:25:28PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 06, 2018 at 11:05:50AM -0400, Johannes Weiner wrote:
> > Argh, that's right. This needs an explicit count if we want to access
> > it locklessly. And you already said you didn't like that this is the
> > only state not derived purely from the task counters, so maybe this is
> > the way to go after all.
> >
> > How about something like this (untested)?
>
>
> > +static inline void psi_switch(struct rq *rq, struct task_struct *prev,
> > + struct task_struct *next)
> > +{
> > + if (psi_disabled)
> > + return;
> > +
> > + if (unlikely(prev->flags & PF_MEMSTALL))
> > + psi_task_change(prev, rq_clock(rq), TSK_RECLAIMING, 0);
> > + if (unlikely(next->flags & PF_MEMSTALL))
> > + psi_task_change(next, rq_clock(rq), 0, TSK_RECLAIMING);
> > +}
>
>
> Urgh... can't say I really like that.
>
> I would really rather do that scheduler_tick() thing to avoid the remote
> update. The tick is a lot less hot than the switch path and esp.
> next->flags might be a cold line (prev->flags is typically the same line
> as prev->state so we already have that, but I don't think anybody now
> looks at next->flags or its line, so that'd be cold load).
Okay, the tick updater sounds like a much better option then. HZ
frequency should produce more than recent enough data.
That means we will retain the not-so-nice PF_MEMSTALL flag test under
rq lock, but it'll eliminate most of that memory ordering headache.
I'll do that. Thanks!
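
For illustration, the tick-based sampling agreed on here can be modelled in plain userspace C. This is only a sketch under stated assumptions, not the code that went into the series: every `model_`/`MODEL_` name is hypothetical, the tick rate is an arbitrary 250 Hz, and the rq lock under which the real flag test would run is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for HZ, TICK_NSEC and PF_MEMSTALL. */
#define MODEL_HZ          250u
#define MODEL_TICK_NSEC   (1000000000u / MODEL_HZ)
#define MODEL_PF_MEMSTALL 0x1u

struct model_task {
	unsigned int flags;
};

struct model_cpu {
	const struct model_task *curr;	/* task on the CPU when the tick fires */
	uint64_t mem_full_ns;		/* sampled "full" memory stall time */
};

/*
 * Called once per tick for the local CPU. If the running task is in a
 * memory stall section, charge one tick's worth of time to the MEM_FULL
 * bucket; otherwise leave the bucket alone. The worst-case error is one
 * tick per state change, which is the trade-off of sampling.
 */
static void model_psi_tick(struct model_cpu *cpu)
{
	assert(cpu && cpu->curr);

	if (cpu->curr->flags & MODEL_PF_MEMSTALL)
		cpu->mem_full_ns += MODEL_TICK_NSEC;
}
```

Because the sample is taken by the CPU itself, nothing has to dereference another CPU's current task remotely, which is the problem the original `cpu_curr(cpu)->flags` test had.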
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-03 16:56 ` Peter Zijlstra
2018-08-06 15:05 ` Johannes Weiner
@ 2018-08-06 15:19 ` Johannes Weiner
2018-08-06 16:03 ` Peter Zijlstra
2018-08-21 19:44 ` Johannes Weiner
2 siblings, 1 reply; 55+ messages in thread
From: Johannes Weiner @ 2018-08-06 15:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
On Fri, Aug 03, 2018 at 06:56:41PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 01, 2018 at 11:19:57AM -0400, Johannes Weiner wrote:
> > +static bool psi_update_stats(struct psi_group *group)
> > +{
> > + u64 deltas[NR_PSI_STATES - 1] = { 0, };
> > + unsigned long missed_periods = 0;
> > + unsigned long nonidle_total = 0;
> > + u64 now, expires, period;
> > + int cpu;
> > + int s;
> > +
> > + mutex_lock(&group->stat_lock);
> > +
> > + /*
> > + * Collect the per-cpu time buckets and average them into a
> > + * single time sample that is normalized to wallclock time.
> > + *
> > + * For averaging, each CPU is weighted by its non-idle time in
> > + * the sampling period. This eliminates artifacts from uneven
> > + * loading, or even entirely idle CPUs.
> > + *
> > + * We don't need to synchronize against CPU hotplugging. If we
> > + * see a CPU that's online and has samples, we incorporate it.
> > + */
> > + for_each_online_cpu(cpu) {
> > + struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> > + u32 uninitialized_var(nonidle);
>
> urgh.. I can see why the compiler got confused. Dodgy :-)
:-) I think we can make this cleaner. Something like this (modulo the
READ_ONCE/WRITE_ONCE you pointed out in the other email)?
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index abccfddba5d5..ce6f02ada1cd 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -220,6 +220,49 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
}
}
+static u32 read_update_delta(struct psi_group_cpu *groupc,
+ enum psi_states state, int cpu)
+{
+ u32 time, delta;
+
+ time = READ_ONCE(groupc->times[state]);
+ /*
+ * In addition to already concluded states, we also
+ * incorporate currently active states on the CPU, since
+ * states may last for many sampling periods.
+ *
+ * This way we keep our delta sampling buckets small (u32) and
+ * our reported pressure close to what's actually happening.
+ */
+ if (test_state(groupc->tasks, state)) {
+ /*
+ * We can race with a state change and need to make
+ * sure the state_start update is ordered against the
+ * updates to the live state and the time buckets
+ * (groupc->times).
+ *
+ * 1. If we observe task state that needs to be
+ * recorded, make sure we see state_start from when
+ * that state went into effect or we'll count time
+ * from the previous state.
+ *
+ * 2. If the time delta has already been added to the
+ * bucket, make sure we don't see it in state_start or
+ * we'll count it twice.
+ *
+ * If the time delta is out of state_start but not in
+ * the time bucket yet, we'll miss it entirely and
+ * handle it in the next period.
+ */
+ smp_rmb();
+ time += cpu_clock(cpu) - groupc->state_start;
+ }
+ delta = time - groupc->times_prev[state];
+ groupc->times_prev[state] = time;
+
+ return delta;
+}
+
static bool psi_update_stats(struct psi_group *group)
{
u64 deltas[NR_PSI_STATES - 1] = { 0, };
@@ -244,60 +287,17 @@ static bool psi_update_stats(struct psi_group *group)
*/
for_each_online_cpu(cpu) {
struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
- u32 uninitialized_var(nonidle);
-
- BUILD_BUG_ON(PSI_NONIDLE != NR_PSI_STATES - 1);
-
- for (s = PSI_NONIDLE; s >= 0; s--) {
- u32 time, delta;
-
- time = READ_ONCE(groupc->times[s]);
- /*
- * In addition to already concluded states, we
- * also incorporate currently active states on
- * the CPU, since states may last for many
- * sampling periods.
- *
- * This way we keep our delta sampling buckets
- * small (u32) and our reported pressure close
- * to what's actually happening.
- */
- if (test_state(groupc->tasks, cpu, s)) {
- /*
- * We can race with a state change and
- * need to make sure the state_start
- * update is ordered against the
- * updates to the live state and the
- * time buckets (groupc->times).
- *
- * 1. If we observe task state that
- * needs to be recorded, make sure we
- * see state_start from when that
- * state went into effect or we'll
- * count time from the previous state.
- *
- * 2. If the time delta has already
- * been added to the bucket, make sure
- * we don't see it in state_start or
- * we'll count it twice.
- *
- * If the time delta is out of
- * state_start but not in the time
- * bucket yet, we'll miss it entirely
- * and handle it in the next period.
- */
- smp_rmb();
- time += cpu_clock(cpu) - groupc->state_start;
- }
- delta = time - groupc->times_prev[s];
- groupc->times_prev[s] = time;
-
- if (s == PSI_NONIDLE) {
- nonidle = nsecs_to_jiffies(delta);
- nonidle_total += nonidle;
- } else {
- deltas[s] += (u64)delta * nonidle;
- }
+ u32 nonidle;
+
+ nonidle = read_update_delta(groupc, PSI_NONIDLE, cpu);
+ nonidle = nsecs_to_jiffies(nonidle);
+ nonidle_total += nonidle;
+
+ for (s = 0; s < PSI_NONIDLE; s++) {
+ u32 delta;
+
+ delta = read_update_delta(groupc, s, cpu);
+ deltas[s] += (u64)delta * nonidle;
}
}
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-06 15:19 ` Johannes Weiner
@ 2018-08-06 16:03 ` Peter Zijlstra
0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2018-08-06 16:03 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
On Mon, Aug 06, 2018 at 11:19:28AM -0400, Johannes Weiner wrote:
> On Fri, Aug 03, 2018 at 06:56:41PM +0200, Peter Zijlstra wrote:
> > On Wed, Aug 01, 2018 at 11:19:57AM -0400, Johannes Weiner wrote:
> > > + u32 uninitialized_var(nonidle);
> >
> > urgh.. I can see why the compiler got confused. Dodgy :-)
>
> :-) I think we can make this cleaner. Something like this (modulo the
> READ_ONCE/WRITE_ONCE you pointed out in the other email)?
>
> @@ -244,60 +287,17 @@ static bool psi_update_stats(struct psi_group *group)
> */
> for_each_online_cpu(cpu) {
> struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
> + u32 nonidle;
> +
> + nonidle = read_update_delta(groupc, PSI_NONIDLE, cpu);
> + nonidle = nsecs_to_jiffies(nonidle);
> + nonidle_total += nonidle;
> +
> + for (s = 0; s < PSI_NONIDLE; s++) {
> + u32 delta;
> +
> + delta = read_update_delta(groupc, s, cpu);
> + deltas[s] += (u64)delta * nonidle;
> }
> }
Yes, much clearer, thanks!
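
As a worked example of what the refactored loop computes, here is a small userspace sketch of the non-idle weighting. It assumes that the accumulated `deltas[s]` is later divided by `nonidle_total` (guarded against zero); that step is not part of the quoted hunk, so treat it as an assumption. `struct cpu_sample` and `weighted_stall()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cpu_sample {
	uint32_t nonidle;	/* non-idle time in the period (weight) */
	uint32_t stall;		/* time spent in one stall state */
};

/*
 * Average the per-CPU stall times, weighting each CPU by how long it
 * was non-idle. A completely idle CPU has weight 0 and therefore does
 * not dilute the result.
 */
static uint64_t weighted_stall(const struct cpu_sample *s, size_t n)
{
	uint64_t sum = 0, nonidle_total = 0;
	size_t i;

	assert(s || n == 0);

	for (i = 0; i < n; i++) {
		sum += (uint64_t)s[i].stall * s[i].nonidle;
		nonidle_total += s[i].nonidle;
	}

	/* Assumed guard: avoid dividing by zero when every CPU was idle. */
	return sum / (nonidle_total ? nonidle_total : 1);
}
```

With samples (nonidle, stall) of (10, 500), (0, 0) and (30, 100), the weighted sum is 10 * 500 + 30 * 100 = 8000 and the total weight is 40, so the result is 200. A plain average over all three CPUs would also give 200 here only by coincidence; dropping the idle CPU from a plain average would give 300, which overstates the lightly loaded CPU.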
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-03 16:56 ` Peter Zijlstra
2018-08-06 15:05 ` Johannes Weiner
2018-08-06 15:19 ` Johannes Weiner
@ 2018-08-21 19:44 ` Johannes Weiner
2018-08-22 9:16 ` Peter Zijlstra
2 siblings, 1 reply; 55+ messages in thread
From: Johannes Weiner @ 2018-08-21 19:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
Hi,
a quick update on that feedback before I send out v4:
On Fri, Aug 03, 2018 at 06:56:41PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 01, 2018 at 11:19:57AM -0400, Johannes Weiner wrote:
> > +static bool test_state(unsigned int *tasks, int cpu, enum psi_states state)
> > +{
> > + switch (state) {
> > + case PSI_IO_SOME:
> > + return tasks[NR_IOWAIT];
> > + case PSI_IO_FULL:
> > + return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
> > + case PSI_MEM_SOME:
> > + return tasks[NR_MEMSTALL];
> > + case PSI_MEM_FULL:
> > + /*
> > + * Since we care about lost potential, things are
> > + * fully blocked on memory when there are no other
> > + * working tasks, but also when the CPU is actively
> > + * being used by a reclaimer and nothing productive
> > + * could run even if it were runnable.
> > + */
> > + return tasks[NR_MEMSTALL] &&
> > + (!tasks[NR_RUNNING] ||
> > + cpu_curr(cpu)->flags & PF_MEMSTALL);
>
> I don't think you can do this, there is nothing that guarantees
> cpu_curr() still exists.
As discussed later in this thread, I've replaced this with time
sampling from inside scheduler_tick(): in the unlikely event that
rq->curr is PF_MEMSTALL, it'll record TICK_NSEC worth of MEM_FULL.
However:
> > + for (s = PSI_NONIDLE; s >= 0; s--) {
> > + u32 time, delta;
> > +
> > + time = READ_ONCE(groupc->times[s]);
> > + /*
> > + * In addition to already concluded states, we
> > + * also incorporate currently active states on
> > + * the CPU, since states may last for many
> > + * sampling periods.
> > + *
> > + * This way we keep our delta sampling buckets
> > + * small (u32) and our reported pressure close
> > + * to what's actually happening.
> > + */
> > + if (test_state(groupc->tasks, cpu, s)) {
> > + /*
> > + * We can race with a state change and
> > + * need to make sure the state_start
> > + * update is ordered against the
> > + * updates to the live state and the
> > + * time buckets (groupc->times).
> > + *
> > + * 1. If we observe task state that
> > + * needs to be recorded, make sure we
> > + * see state_start from when that
> > + * state went into effect or we'll
> > + * count time from the previous state.
> > + *
> > + * 2. If the time delta has already
> > + * been added to the bucket, make sure
> > + * we don't see it in state_start or
> > + * we'll count it twice.
> > + *
> > + * If the time delta is out of
> > + * state_start but not in the time
> > + * bucket yet, we'll miss it entirely
> > + * and handle it in the next period.
> > + */
> > + smp_rmb();
> > + time += cpu_clock(cpu) - groupc->state_start;
> > + }
>
> The alternative is adding an update to scheduler_tick(), that would
> ensure you're never more than nr_cpu_ids * TICK_NSEC behind.
I wasn't able to convert *all* states to tick updates like this.
The reason is that, while testing rq->curr for PF_MEMSTALL is cheap,
other tasks associated with the rq could be from any cgroup in the
system. That means we'd have to do for_each_cgroup() on every tick to
keep the groupc->times that closely up to date, and that wouldn't scale.
We tend to have hundreds of them, some setups have thousands.
Since we don't need to be *that* current, I left the on-demand update
inside the aggregator for now. It's a bit trickier, but much cheaper.
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-21 19:44 ` Johannes Weiner
@ 2018-08-22 9:16 ` Peter Zijlstra
0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2018-08-22 9:16 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
On Tue, Aug 21, 2018 at 03:44:13PM -0400, Johannes Weiner wrote:
> > > + for (s = PSI_NONIDLE; s >= 0; s--) {
> > > + u32 time, delta;
> > > +
> > > + time = READ_ONCE(groupc->times[s]);
> > > + /*
> > > + * In addition to already concluded states, we
> > > + * also incorporate currently active states on
> > > + * the CPU, since states may last for many
> > > + * sampling periods.
> > > + *
> > > + * This way we keep our delta sampling buckets
> > > + * small (u32) and our reported pressure close
> > > + * to what's actually happening.
> > > + */
> > > + if (test_state(groupc->tasks, cpu, s)) {
> > > + /*
> > > + * We can race with a state change and
> > > + * need to make sure the state_start
> > > + * update is ordered against the
> > > + * updates to the live state and the
> > > + * time buckets (groupc->times).
> > > + *
> > > + * 1. If we observe task state that
> > > + * needs to be recorded, make sure we
> > > + * see state_start from when that
> > > + * state went into effect or we'll
> > > + * count time from the previous state.
> > > + *
> > > + * 2. If the time delta has already
> > > + * been added to the bucket, make sure
> > > + * we don't see it in state_start or
> > > + * we'll count it twice.
> > > + *
> > > + * If the time delta is out of
> > > + * state_start but not in the time
> > > + * bucket yet, we'll miss it entirely
> > > + * and handle it in the next period.
> > > + */
> > > + smp_rmb();
> > > + time += cpu_clock(cpu) - groupc->state_start;
> > > + }
> >
> > The alternative is adding an update to scheduler_tick(), that would
> > ensure you're never more than nr_cpu_ids * TICK_NSEC behind.
>
> I wasn't able to convert *all* states to tick updates like this.
>
> The reason is that, while testing rq->curr for PF_MEMSTALL is cheap,
> other tasks associated with the rq could be from any cgroup in the
> system. That means we'd have to do for_each_cgroup() on every tick to
> keep the groupc->times that closely up to date, and that wouldn't scale.
> We tend to have hundreds of them, some setups have thousands.
>
> Since we don't need to be *that* current, I left the on-demand update
> inside the aggregator for now. It's a bit trickier, but much cheaper.
ARGH indeed; I was thinking we only need to update current. But because
we're tracking blocked state that doesn't work.
Sorry for that :/
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-01 15:19 ` [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
2018-08-03 16:56 ` Peter Zijlstra
@ 2018-08-03 17:07 ` Peter Zijlstra
2018-08-06 15:23 ` Johannes Weiner
2018-08-03 17:15 ` Peter Zijlstra
2018-08-03 17:21 ` Peter Zijlstra
3 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2018-08-03 17:07 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
On Wed, Aug 01, 2018 at 11:19:57AM -0400, Johannes Weiner wrote:
> +static bool psi_update_stats(struct psi_group *group)
> +{
> + u64 deltas[NR_PSI_STATES - 1] = { 0, };
> + unsigned long missed_periods = 0;
> + unsigned long nonidle_total = 0;
> + u64 now, expires, period;
> + int cpu;
> + int s;
> +
> + mutex_lock(&group->stat_lock);
> +
> + /*
> + * Collect the per-cpu time buckets and average them into a
> + * single time sample that is normalized to wallclock time.
> + *
> + * For averaging, each CPU is weighted by its non-idle time in
> + * the sampling period. This eliminates artifacts from uneven
> + * loading, or even entirely idle CPUs.
> + *
> + * We don't need to synchronize against CPU hotplugging. If we
> + * see a CPU that's online and has samples, we incorporate it.
> + */
> + for_each_online_cpu(cpu) {
I'm still puzzled by this.. for 99% of the machines online == possible.
Why not always iterate possible and leave it at that? This is hardly a
fast path.
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-03 17:07 ` Peter Zijlstra
@ 2018-08-06 15:23 ` Johannes Weiner
0 siblings, 0 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-08-06 15:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
On Fri, Aug 03, 2018 at 07:07:33PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 01, 2018 at 11:19:57AM -0400, Johannes Weiner wrote:
> > +static bool psi_update_stats(struct psi_group *group)
> > +{
> > + u64 deltas[NR_PSI_STATES - 1] = { 0, };
> > + unsigned long missed_periods = 0;
> > + unsigned long nonidle_total = 0;
> > + u64 now, expires, period;
> > + int cpu;
> > + int s;
> > +
> > + mutex_lock(&group->stat_lock);
> > +
> > + /*
> > + * Collect the per-cpu time buckets and average them into a
> > + * single time sample that is normalized to wallclock time.
> > + *
> > + * For averaging, each CPU is weighted by its non-idle time in
> > + * the sampling period. This eliminates artifacts from uneven
> > + * loading, or even entirely idle CPUs.
> > + *
> > + * We don't need to synchronize against CPU hotplugging. If we
> > + * see a CPU that's online and has samples, we incorporate it.
> > + */
> > + for_each_online_cpu(cpu) {
>
> I'm still puzzled by this.. for 99% of the machines online == possible.
> Why not always iterate possible and leave it at that? This is hardly a
> fast path.
Hmm, you're right, that makes things much simpler. I guess I'm mostly
worried about the 1% where this significantly differs, but it looks
like we're smarter than simply doing CONFIG_NR_CPUS for the possible
map, and we can easily stomach a bit of discrepancy in this path.
I'll change that to possible and delete/update the third paragraph.
Thanks
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-01 15:19 ` [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
2018-08-03 16:56 ` Peter Zijlstra
2018-08-03 17:07 ` Peter Zijlstra
@ 2018-08-03 17:15 ` Peter Zijlstra
2018-08-03 17:21 ` Peter Zijlstra
3 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2018-08-03 17:15 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
On Wed, Aug 01, 2018 at 11:19:57AM -0400, Johannes Weiner wrote:
> + /* total= */
> + for (s = 0; s < NR_PSI_STATES - 1; s++)
> + group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL));
Just a nit; probably not worth fixing.
This loses the remainder of that division. But since the divisor is
variable it becomes really hard to not lose something at some point.
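To make the nit concrete, here is a minimal userspace sketch, not the kernel
code. accumulate_truncating() mirrors the quoted "total += deltas / divisor"
step. accumulate_carry() is a hypothetical variant (the struct and function
names are made up for illustration) that carries the remainder into the next
period; it is only exact while the divisor stays the same between periods,
which is why a variable divisor makes this hard.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accumulator; names are illustrative, not from the patch. */
struct acc {
	uint64_t total;     /* accumulated quotient */
	uint64_t remainder; /* left over from the previous division */
};

/* Mirrors total += delta / divisor: the remainder is dropped every period. */
static void accumulate_truncating(struct acc *a, uint64_t delta,
				  uint64_t divisor)
{
	assert(divisor != 0);
	a->total += delta / divisor;
}

/*
 * Carries the remainder forward. This is only exact if the divisor does
 * not change between calls; with a changing divisor the carried remainder
 * is expressed in the wrong unit.
 */
static void accumulate_carry(struct acc *a, uint64_t delta, uint64_t divisor)
{
	assert(divisor != 0);
	delta += a->remainder;
	a->total += delta / divisor;
	a->remainder = delta % divisor;
}
```

With four periods of delta 7 and divisor 4, the truncating version ends at 4
while the carrying version ends at 7, the exact value of 28 / 4.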
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-01 15:19 ` [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO Johannes Weiner
` (2 preceding siblings ...)
2018-08-03 17:15 ` Peter Zijlstra
@ 2018-08-03 17:21 ` Peter Zijlstra
2018-08-21 20:11 ` Johannes Weiner
3 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2018-08-03 17:21 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
On Wed, Aug 01, 2018 at 11:19:57AM -0400, Johannes Weiner wrote:
> + time = READ_ONCE(groupc->times[s]);
> + /*
> + * In addition to already concluded states, we
> + * also incorporate currently active states on
> + * the CPU, since states may last for many
> + * sampling periods.
> + *
> + * This way we keep our delta sampling buckets
> + * small (u32) and our reported pressure close
> + * to what's actually happening.
> + */
> + if (test_state(groupc->tasks, cpu, s)) {
> + /*
> + * We can race with a state change and
> + * need to make sure the state_start
> + * update is ordered against the
> + * updates to the live state and the
> + * time buckets (groupc->times).
> + *
> + * 1. If we observe task state that
> + * needs to be recorded, make sure we
> + * see state_start from when that
> + * state went into effect or we'll
> + * count time from the previous state.
> + *
> + * 2. If the time delta has already
> + * been added to the bucket, make sure
> + * we don't see it in state_start or
> + * we'll count it twice.
> + *
> + * If the time delta is out of
> + * state_start but not in the time
> + * bucket yet, we'll miss it entirely
> + * and handle it in the next period.
> + */
> + smp_rmb();
> + time += cpu_clock(cpu) - groupc->state_start;
> + }
As is, groupc->state_start needs a READ_ONCE() above and a WRITE_ONCE()
below. But like stated earlier, doing an update in scheduler_tick() is
probably easier.
> +static void psi_group_change(struct psi_group *group, int cpu, u64 now,
> + unsigned int clear, unsigned int set)
> +{
> + struct psi_group_cpu *groupc;
> + unsigned int t, m;
> + u32 delta;
> +
> + groupc = per_cpu_ptr(group->pcpu, cpu);
> +
> + /*
> + * First we assess the aggregate resource states these CPU's
> + * tasks have been in since the last change, and account any
> + * SOME and FULL time that may have resulted in.
> + *
> + * Then we update the task counts according to the state
> + * change requested through the @clear and @set bits.
> + */
> +
> + delta = now - groupc->state_start;
> + groupc->state_start = now;
> +
> + /*
> + * Update state_start before recording time in the sampling
> + * buckets and changing task counts, to prevent a racing
> + * aggregation from counting the delta twice or attributing it
> + * to an old state.
> + */
> + smp_wmb();
> +
> + if (test_state(groupc->tasks, cpu, PSI_IO_SOME)) {
> + groupc->times[PSI_IO_SOME] += delta;
> + if (test_state(groupc->tasks, cpu, PSI_IO_FULL))
> + groupc->times[PSI_IO_FULL] += delta;
> + }
> + if (test_state(groupc->tasks, cpu, PSI_MEM_SOME)) {
> + groupc->times[PSI_MEM_SOME] += delta;
> + if (test_state(groupc->tasks, cpu, PSI_MEM_FULL))
> + groupc->times[PSI_MEM_FULL] += delta;
> + }
Might be worth checking the compiler does the right thing here and
optimizes this branch fest into something sensible.
> + if (test_state(groupc->tasks, cpu, PSI_CPU_SOME))
> + groupc->times[PSI_CPU_SOME] += delta;
> + if (test_state(groupc->tasks, cpu, PSI_NONIDLE))
> + groupc->times[PSI_NONIDLE] += delta;
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-03 17:21 ` Peter Zijlstra
@ 2018-08-21 20:11 ` Johannes Weiner
2018-08-22 9:10 ` Peter Zijlstra
0 siblings, 1 reply; 55+ messages in thread
From: Johannes Weiner @ 2018-08-21 20:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
On Fri, Aug 03, 2018 at 07:21:39PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 01, 2018 at 11:19:57AM -0400, Johannes Weiner wrote:
> > + time = READ_ONCE(groupc->times[s]);
> > + /*
> > + * In addition to already concluded states, we
> > + * also incorporate currently active states on
> > + * the CPU, since states may last for many
> > + * sampling periods.
> > + *
> > + * This way we keep our delta sampling buckets
> > + * small (u32) and our reported pressure close
> > + * to what's actually happening.
> > + */
> > + if (test_state(groupc->tasks, cpu, s)) {
> > + /*
> > + * We can race with a state change and
> > + * need to make sure the state_start
> > + * update is ordered against the
> > + * updates to the live state and the
> > + * time buckets (groupc->times).
> > + *
> > + * 1. If we observe task state that
> > + * needs to be recorded, make sure we
> > + * see state_start from when that
> > + * state went into effect or we'll
> > + * count time from the previous state.
> > + *
> > + * 2. If the time delta has already
> > + * been added to the bucket, make sure
> > + * we don't see it in state_start or
> > + * we'll count it twice.
> > + *
> > + * If the time delta is out of
> > + * state_start but not in the time
> > + * bucket yet, we'll miss it entirely
> > + * and handle it in the next period.
> > + */
> > + smp_rmb();
> > + time += cpu_clock(cpu) - groupc->state_start;
> > + }
>
> As is, groupc->state_start needs a READ_ONCE() above and a WRITE_ONCE()
> below. But like stated earlier, doing an update in scheduler_tick() is
> probably easier.
I've wrapped these in READ_ONCE/WRITE_ONCE.
> > +static void psi_group_change(struct psi_group *group, int cpu, u64 now,
> > + unsigned int clear, unsigned int set)
> > +{
> > + struct psi_group_cpu *groupc;
> > + unsigned int t, m;
> > + u32 delta;
> > +
> > + groupc = per_cpu_ptr(group->pcpu, cpu);
> > +
> > + /*
> > + * First we assess the aggregate resource states these CPU's
> > + * tasks have been in since the last change, and account any
> > + * SOME and FULL time that may have resulted in.
> > + *
> > + * Then we update the task counts according to the state
> > + * change requested through the @clear and @set bits.
> > + */
> > +
> > + delta = now - groupc->state_start;
> > + groupc->state_start = now;
> > +
> > + /*
> > + * Update state_start before recording time in the sampling
> > + * buckets and changing task counts, to prevent a racing
> > + * aggregation from counting the delta twice or attributing it
> > + * to an old state.
> > + */
> > + smp_wmb();
> > +
> > + if (test_state(groupc->tasks, cpu, PSI_IO_SOME)) {
> > + groupc->times[PSI_IO_SOME] += delta;
> > + if (test_state(groupc->tasks, cpu, PSI_IO_FULL))
> > + groupc->times[PSI_IO_FULL] += delta;
> > + }
> > + if (test_state(groupc->tasks, cpu, PSI_MEM_SOME)) {
> > + groupc->times[PSI_MEM_SOME] += delta;
> > + if (test_state(groupc->tasks, cpu, PSI_MEM_FULL))
> > + groupc->times[PSI_MEM_FULL] += delta;
> > + }
>
> Might be worth checking the compiler does the right thing here and
> optimizes this branch fest into something sensible.
Yup, the results looked good. It recognizes that SOME and FULL have
overlapping conditions and then lays out the branches such that it
does not have to do redundant tests. It also recognizes that NONIDLE
is true when any of the other states is true and collapses that.
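For reference, the SOME/FULL overlap can be reproduced in a standalone
userspace sketch. This is a simplification under stated assumptions: it covers
only the IO and memory states quoted in this thread, it drops the cpu_curr()
check from MEM_FULL, and the names test_state_sketch() and account() are
hypothetical stand-ins rather than the patch's functions.

```c
#include <assert.h>
#include <stdbool.h>

enum task_count { NR_IOWAIT, NR_MEMSTALL, NR_RUNNING, NR_TASK_COUNTS };
enum state { IO_SOME, IO_FULL, MEM_SOME, MEM_FULL, NR_STATES };

/* Simplified state test: FULL is SOME plus "nothing is running". */
static bool test_state_sketch(const unsigned int *tasks, enum state s)
{
	switch (s) {
	case IO_SOME:
		return tasks[NR_IOWAIT] != 0;
	case IO_FULL:
		return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
	case MEM_SOME:
		return tasks[NR_MEMSTALL] != 0;
	case MEM_FULL:
		return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
	default:
		return false;
	}
}

/* Adds delta to every bucket whose state is active, nesting FULL in SOME. */
static void account(const unsigned int *tasks, unsigned int *times,
		    unsigned int delta)
{
	assert(tasks && times);
	if (test_state_sketch(tasks, IO_SOME)) {
		times[IO_SOME] += delta;
		if (test_state_sketch(tasks, IO_FULL))
			times[IO_FULL] += delta;
	}
	if (test_state_sketch(tasks, MEM_SOME)) {
		times[MEM_SOME] += delta;
		if (test_state_sketch(tasks, MEM_FULL))
			times[MEM_FULL] += delta;
	}
}
```

Because FULL can only be true when SOME is, the nested test never needs to
re-check the stalled-task count, which is the redundancy the compiler removes.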
> > + if (test_state(groupc->tasks, cpu, PSI_CPU_SOME))
> > + groupc->times[PSI_CPU_SOME] += delta;
> > + if (test_state(groupc->tasks, cpu, PSI_NONIDLE))
> > + groupc->times[PSI_NONIDLE] += delta;
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-21 20:11 ` Johannes Weiner
@ 2018-08-22 9:10 ` Peter Zijlstra
2018-08-22 17:28 ` Johannes Weiner
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2018-08-22 9:10 UTC (permalink / raw)
To: Johannes Weiner
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
On Tue, Aug 21, 2018 at 04:11:15PM -0400, Johannes Weiner wrote:
> On Fri, Aug 03, 2018 at 07:21:39PM +0200, Peter Zijlstra wrote:
> > On Wed, Aug 01, 2018 at 11:19:57AM -0400, Johannes Weiner wrote:
> > > + time = READ_ONCE(groupc->times[s]);
> > > + /*
> > > + * In addition to already concluded states, we
> > > + * also incorporate currently active states on
> > > + * the CPU, since states may last for many
> > > + * sampling periods.
> > > + *
> > > + * This way we keep our delta sampling buckets
> > > + * small (u32) and our reported pressure close
> > > + * to what's actually happening.
> > > + */
> > > + if (test_state(groupc->tasks, cpu, s)) {
> > > + /*
> > > + * We can race with a state change and
> > > + * need to make sure the state_start
> > > + * update is ordered against the
> > > + * updates to the live state and the
> > > + * time buckets (groupc->times).
> > > + *
> > > + * 1. If we observe task state that
> > > + * needs to be recorded, make sure we
> > > + * see state_start from when that
> > > + * state went into effect or we'll
> > > + * count time from the previous state.
> > > + *
> > > + * 2. If the time delta has already
> > > + * been added to the bucket, make sure
> > > + * we don't see it in state_start or
> > > + * we'll count it twice.
> > > + *
> > > + * If the time delta is out of
> > > + * state_start but not in the time
> > > + * bucket yet, we'll miss it entirely
> > > + * and handle it in the next period.
> > > + */
> > > + smp_rmb();
> > > + time += cpu_clock(cpu) - groupc->state_start;
> > > + }
> >
> > As is, groupc->state_start needs a READ_ONCE() above and a WRITE_ONCE()
> > below. But like stated earlier, doing an update in scheduler_tick() is
> > probably easier.
>
> I've wrapped these in READ_ONCE/WRITE_ONCE.
I just realized, these are u64, so READ_ONCE/WRITE_ONCE will not work
correctly on 32bit.
* Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-22 9:10 ` Peter Zijlstra
@ 2018-08-22 17:28 ` Johannes Weiner
0 siblings, 0 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-08-22 17:28 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Tejun Heo,
Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
On Wed, Aug 22, 2018 at 11:10:24AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 21, 2018 at 04:11:15PM -0400, Johannes Weiner wrote:
> > On Fri, Aug 03, 2018 at 07:21:39PM +0200, Peter Zijlstra wrote:
> > > On Wed, Aug 01, 2018 at 11:19:57AM -0400, Johannes Weiner wrote:
> > > > + time = READ_ONCE(groupc->times[s]);
> > > > + /*
> > > > + * In addition to already concluded states, we
> > > > + * also incorporate currently active states on
> > > > + * the CPU, since states may last for many
> > > > + * sampling periods.
> > > > + *
> > > > + * This way we keep our delta sampling buckets
> > > > + * small (u32) and our reported pressure close
> > > > + * to what's actually happening.
> > > > + */
> > > > + if (test_state(groupc->tasks, cpu, s)) {
> > > > + /*
> > > > + * We can race with a state change and
> > > > + * need to make sure the state_start
> > > > + * update is ordered against the
> > > > + * updates to the live state and the
> > > > + * time buckets (groupc->times).
> > > > + *
> > > > + * 1. If we observe task state that
> > > > + * needs to be recorded, make sure we
> > > > + * see state_start from when that
> > > > + * state went into effect or we'll
> > > > + * count time from the previous state.
> > > > + *
> > > > + * 2. If the time delta has already
> > > > + * been added to the bucket, make sure
> > > > + * we don't see it in state_start or
> > > > + * we'll count it twice.
> > > > + *
> > > > + * If the time delta is out of
> > > > + * state_start but not in the time
> > > > + * bucket yet, we'll miss it entirely
> > > > + * and handle it in the next period.
> > > > + */
> > > > + smp_rmb();
> > > > + time += cpu_clock(cpu) - groupc->state_start;
> > > > + }
> > >
> > > As is, groupc->state_start needs a READ_ONCE() above and a WRITE_ONCE()
> > > below. But like stated earlier, doing an update in scheduler_tick() is
> > > probably easier.
> >
> > I've wrapped these in READ_ONCE/WRITE_ONCE.
>
> I just realized, these are u64, so READ_ONCE/WRITE_ONCE will not work
> correctly on 32bit.
Ah, right.
Actually, that race described in the comment above - "If the time
delta is out of state_start but not in the time bucket yet, we'll miss
it entirely and handle it in the next period" - can cause bogus time
samples if state persists for more than 2s. Because if we observed a
live state and included it in our private copy of the time bucket
(times_prev), missing the delta in transit to the time bucket in the
next aggregation results in times_prev being ahead of 'time', which
causes the delta to underflow into a bogusly large sample.
Memory barriers alone cannot guarantee full coherency here (neither
seeing the delta twice, nor missing it entirely) so I'm switching this
over to seqcount to make sure the aggregator sees something sensible.
And then I don't need the READ_ONCE/WRITE_ONCE.
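The underflow can be shown with plain u32 arithmetic. The following is a
userspace sketch only: 'time' and 'times_prev' follow the names used above,
while sample_delta() is a hypothetical helper standing in for the aggregator's
delta detection.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns how much 'time' advanced since the last sample and remembers
 * the new value. If 'time' is behind *times_prev, the unsigned
 * subtraction wraps modulo 2^32 and yields a bogusly large delta.
 */
static uint32_t sample_delta(uint32_t time, uint32_t *times_prev)
{
	uint32_t delta;

	assert(times_prev);
	delta = time - *times_prev;
	*times_prev = time;
	return delta;
}
```

Suppose the bucket holds 100 and a live state contributes another 50: the
first sample sees 150. If the next aggregation runs while those 50 have left
state_start but have not reached the bucket yet, it sees 100 against a
times_prev of 150, and the delta wraps to a value just below 2^32.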
* [PATCH 8/9] psi: pressure stall information for CPU, memory, and IO
2018-08-01 15:12 Johannes Weiner
@ 2018-08-01 15:13 ` Johannes Weiner
0 siblings, 0 replies; 55+ messages in thread
From: Johannes Weiner @ 2018-08-01 15:13 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
Christopher Lameter, Mike Galbraith, Shakeel Butt,
Peter Enderborg, linux-mm, cgroups, linux-kernel, kernel-team
When systems are overcommitted and resources become contended, it's
hard to tell exactly the impact this has on workload productivity, or
how close the system is to lockups and OOM kills. In particular, when
machines work multiple jobs concurrently, the impact of overcommit in
terms of latency and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing
individual job health or risking complete machine lockups, this patch
implements a way to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or
IO, respectively. Stall states are aggregate versions of the per-task
delay accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure
percentages, and they give a general sense of system health and
productivity loss incurred by resource overcommit. They can also
indicate when the system is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each
CPU and samples the time they spend in stall states. Every 2 seconds,
the samples are averaged across CPUs - weighted by the CPUs' non-idle
time to eliminate artifacts from unused CPUs - and translated into
percentages of walltime. A running average of those percentages is
maintained over 10s, 1m, and 5m periods (similar to the loadaverage).
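The weighting step described above can be sketched in userspace roughly as
follows. This is an illustration under assumptions, not the code in
kernel/sched/psi.c: struct cpu_sample and weighted_stall() are hypothetical
names, and the zero-divisor guard stands in for a max(nonidle_total, 1UL)
style clamp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU sample for one sampling period, in ns. */
struct cpu_sample {
	uint32_t stall;   /* time spent in the stall state */
	uint32_t nonidle; /* time the CPU had any tracked task */
};

/*
 * Averages stall time across CPUs, weighting each CPU by its non-idle
 * time so that idle CPUs do not dilute the result.
 */
static uint64_t weighted_stall(const struct cpu_sample *s, size_t n)
{
	uint64_t weighted = 0;
	uint64_t nonidle_total = 0;

	assert(s || n == 0);
	for (size_t i = 0; i < n; i++) {
		weighted += (uint64_t)s[i].stall * s[i].nonidle;
		nonidle_total += s[i].nonidle;
	}
	/* Avoid dividing by zero when every CPU was idle. */
	if (nonidle_total == 0)
		nonidle_total = 1;
	return weighted / nonidle_total;
}
```

With one CPU stalled for 1000ns out of 2000ns non-idle time and a second CPU
entirely idle, the weighted result is 1000ns, whereas a plain mean over both
CPUs would report 500ns.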
v2:
- stable clock tick, as per Peter
- data structure layout optimization, as per Peter
- fix u64 divisions on 32 bit, as per Peter
- outermost psi_disabled checks, as per Peter
- coding style fixes, as per Peter
- just-in-time stats aggregation, as per Suren
- fix task state corruption with CONFIG_PREEMPT, as per Suren
- CONFIG_PSI=n build error
- avoid writing p->sched_psi_wake_requeue unnecessarily
- documentation & comment updates
v3:
- pack scheduler hotpath data into one cacheline, as per Peter and Linus
- drop unnecessary SCHED_INFO dependency, as per Peter
- lockless live-state aggregation, as per Peter
- do_div -> div64_ul and some other cleanups, as per Peter
- realtime sampling period and slipped sample handling, as per Tejun
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/accounting/psi.txt | 64 +++
include/linux/psi.h | 27 ++
include/linux/psi_types.h | 87 +++++
include/linux/sched.h | 10 +
init/Kconfig | 15 +
kernel/fork.c | 4 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 11 +-
kernel/sched/psi.c | 643 +++++++++++++++++++++++++++++++
kernel/sched/sched.h | 2 +
kernel/sched/stats.h | 80 ++++
mm/compaction.c | 5 +
mm/filemap.c | 15 +-
mm/page_alloc.c | 10 +
mm/vmscan.c | 13 +
15 files changed, 981 insertions(+), 6 deletions(-)
create mode 100644 Documentation/accounting/psi.txt
create mode 100644 include/linux/psi.h
create mode 100644 include/linux/psi_types.h
create mode 100644 kernel/sched/psi.c
diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
new file mode 100644
index 000000000000..51e7ef14142e
--- /dev/null
+++ b/Documentation/accounting/psi.txt
@@ -0,0 +1,64 @@
+================================
+PSI - Pressure Stall Information
+================================
+
+:Date: April, 2018
+:Author: Johannes Weiner <hannes@cmpxchg.org>
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact they have on complex workloads
+or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==================
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+The format for CPU is as such:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are stalled on a given resource.
+
+The "full" line indicates the share of time in which all non-idle
+tasks are stalled on a given resource simultaneously. In this state
+actual CPU cycles are going to waste, and a workload that spends
+extended time in this state is considered to be thrashing. This has
+severe impact on performance, and it's useful to distinguish this
+situation from a state where some tasks are stalled but the CPU is
+still doing productive work. As such, time spent in this subset of the
+stall state is tracked separately and exported in the "full" averages.
+
+The ratios are tracked as recent trends over ten, sixty, and three
+hundred second windows, which gives insight into short term events as
+well as medium and long term trends. The total absolute stall time is
+tracked and exported as well, to allow detection of latency spikes
+which wouldn't necessarily make a dent in the time averages, or to
+average trends over custom time frames.
diff --git a/include/linux/psi.h b/include/linux/psi.h
new file mode 100644
index 000000000000..371af1479699
--- /dev/null
+++ b/include/linux/psi.h
@@ -0,0 +1,27 @@
+#ifndef _LINUX_PSI_H
+#define _LINUX_PSI_H
+
+#include <linux/psi_types.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_PSI
+
+extern bool psi_disabled;
+
+void psi_init(void);
+
+void psi_task_change(struct task_struct *task, u64 now, int clear, int set);
+
+void psi_memstall_enter(unsigned long *flags);
+void psi_memstall_leave(unsigned long *flags);
+
+#else /* CONFIG_PSI */
+
+static inline void psi_init(void) {}
+
+static inline void psi_memstall_enter(unsigned long *flags) {}
+static inline void psi_memstall_leave(unsigned long *flags) {}
+
+#endif /* CONFIG_PSI */
+
+#endif /* _LINUX_PSI_H */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
new file mode 100644
index 000000000000..b6ff46362eb3
--- /dev/null
+++ b/include/linux/psi_types.h
@@ -0,0 +1,87 @@
+#ifndef _LINUX_PSI_TYPES_H
+#define _LINUX_PSI_TYPES_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_PSI
+
+/* Tracked task states */
+enum psi_task_count {
+ NR_IOWAIT,
+ NR_MEMSTALL,
+ NR_RUNNING,
+ NR_PSI_TASK_COUNTS,
+};
+
+/* Task state bitmasks */
+#define TSK_IOWAIT (1 << NR_IOWAIT)
+#define TSK_MEMSTALL (1 << NR_MEMSTALL)
+#define TSK_RUNNING (1 << NR_RUNNING)
+
+/* Resources that workloads could be stalled on */
+enum psi_res {
+ PSI_IO,
+ PSI_MEM,
+ PSI_CPU,
+ NR_PSI_RESOURCES,
+};
+
+/*
+ * Pressure states for each resource:
+ *
+ * SOME: Stalled tasks & working tasks
+ * FULL: Stalled tasks & no working tasks
+ */
+enum psi_states {
+ PSI_IO_SOME,
+ PSI_IO_FULL,
+ PSI_MEM_SOME,
+ PSI_MEM_FULL,
+ PSI_CPU_SOME,
+ PSI_NONIDLE,
+ NR_PSI_STATES,
+};
+
+struct psi_group_cpu {
+ /* 1st cacheline updated by the scheduler */
+
+ /* States of the tasks belonging to this group */
+ unsigned int tasks[NR_PSI_TASK_COUNTS] ____cacheline_aligned_in_smp;
+
+ /* Period time sampling buckets for each state of interest (ns) */
+ u32 times[NR_PSI_STATES];
+
+ /* Time of last task change in this group (rq_clock) */
+ u64 state_start;
+
+ /* 2nd cacheline updated by the aggregator */
+
+ /* Delta detection against the sampling buckets */
+ u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp;
+};
+
+struct psi_group {
+ /* Protects data updated during an aggregation */
+ struct mutex stat_lock;
+
+ /* Per-cpu task state & time tracking */
+ struct psi_group_cpu __percpu *pcpu;
+
+ /* Periodic aggregation state */
+ u64 total_prev[NR_PSI_STATES - 1];
+ u64 last_update;
+ u64 next_update;
+ struct delayed_work clock_work;
+
+ /* Total stall times and sampled pressure averages */
+ u64 total[NR_PSI_STATES - 1];
+ unsigned long avg[NR_PSI_STATES - 1][3];
+};
+
+#else /* CONFIG_PSI */
+
+struct psi_group { };
+
+#endif /* CONFIG_PSI */
+
+#endif /* _LINUX_PSI_TYPES_H */
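
To illustrate how the task state bitmasks and the per-CPU task counters above relate, here is a small userspace sketch. It mirrors the clear/set counter update and a few of the state predicates from this patch; the model_* names are hypothetical, the enum is redeclared locally so the sketch builds on its own, and the memory FULL case is omitted because it additionally inspects the task currently running on the CPU.

```c
#include <assert.h>
#include <stdbool.h>

enum { NR_IOWAIT, NR_MEMSTALL, NR_RUNNING, NR_PSI_TASK_COUNTS };

#define TSK_IOWAIT	(1u << NR_IOWAIT)
#define TSK_MEMSTALL	(1u << NR_MEMSTALL)
#define TSK_RUNNING	(1u << NR_RUNNING)

/* Drop one task from every state in @clear, then add one to every state in @set. */
static void model_apply(unsigned int tasks[NR_PSI_TASK_COUNTS],
			unsigned int clear, unsigned int set)
{
	for (int t = 0; t < NR_PSI_TASK_COUNTS; t++) {
		if (clear & (1u << t)) {
			assert(tasks[t] > 0);	/* the patch reports an underflow here */
			tasks[t]--;
		}
	}
	for (int t = 0; t < NR_PSI_TASK_COUNTS; t++) {
		if (set & (1u << t))
			tasks[t]++;
	}
}

/* SOME: at least one task is waiting for IO. */
static bool model_io_some(const unsigned int *tasks)
{
	return tasks[NR_IOWAIT] != 0;
}

/* FULL: tasks are waiting for IO and nothing is runnable. */
static bool model_io_full(const unsigned int *tasks)
{
	return tasks[NR_IOWAIT] != 0 && tasks[NR_RUNNING] == 0;
}

/* CPU SOME: more runnable tasks than this one CPU can execute. */
static bool model_cpu_some(const unsigned int *tasks)
{
	return tasks[NR_RUNNING] > 1;
}
```

Starting from two runnable tasks, moving one to TSK_IOWAIT yields IO SOME, and moving the second one as well yields IO FULL.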
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca3f3eae8980..d5e4ee234114 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -25,6 +25,7 @@
#include <linux/latencytop.h>
#include <linux/sched/prio.h>
#include <linux/signal_types.h>
+#include <linux/psi_types.h>
#include <linux/mm_types_task.h>
#include <linux/task_io_accounting.h>
@@ -709,6 +710,10 @@ struct task_struct {
unsigned sched_contributes_to_load:1;
unsigned sched_migrated:1;
unsigned sched_remote_wakeup:1;
+#ifdef CONFIG_PSI
+ unsigned sched_psi_wake_requeue:1;
+#endif
+
/* Force alignment to the next boundary: */
unsigned :0;
@@ -956,6 +961,10 @@ struct task_struct {
siginfo_t *last_siginfo;
struct task_io_accounting ioac;
+#ifdef CONFIG_PSI
+ /* Pressure stall state */
+ unsigned int psi_flags;
+#endif
#ifdef CONFIG_TASK_XACCT
/* Accumulated RSS usage: */
u64 acct_rss_mem1;
@@ -1385,6 +1394,7 @@ extern struct pid *cad_pid;
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
+#define PF_MEMSTALL 0x01000000 /* Stalled due to lack of memory */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
diff --git a/init/Kconfig b/init/Kconfig
index 18b151f0ddc1..ad61ddb5d68e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -457,6 +457,21 @@ config TASK_IO_ACCOUNTING
Say N if unsure.
+config PSI
+ bool "Pressure stall information tracking"
+ help
+ Collect metrics that indicate how overcommitted the CPU, memory,
+ and IO capacity are in the system.
+
+ If you say Y here, the kernel will create /proc/pressure/ with the
+ pressure statistics files cpu, memory, and io. These will indicate
+ the share of walltime in which some or all tasks in the system are
+ delayed due to contention of the respective resource.
+
+ For more details see Documentation/accounting/psi.txt.
+
+ Say N if unsure.
+
endmenu # "CPU/Task time and stats accounting"
config CPU_ISOLATION
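
As a usage illustration for the /proc/pressure files mentioned in the help text, the sketch below parses one line in the "some avg10=... avg60=... avg300=... total=..." layout that psi_show() in this patch prints. struct psi_line and psi_parse_line() are hypothetical names for this example only; the total is taken to be in microseconds because psi_show() divides by NSEC_PER_USEC.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one line of a /proc/pressure file. */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* pressure percentages */
	unsigned long long total_us;	/* cumulative stall time in microseconds */
};

/* Parse one line; returns 0 on success, -1 if the line does not match. */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n;

	assert(line && out);

	n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   out->kind, &out->avg10, &out->avg60, &out->avg300,
		   &out->total_us);
	if (n != 5)
		return -1;

	/* Only the two state names printed by psi_show() are accepted. */
	if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
		return -1;

	return 0;
}
```

A monitoring tool could read /proc/pressure/memory line by line with fgets() and feed each line to psi_parse_line(), assuming the kernel was built with CONFIG_PSI=y.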
diff --git a/kernel/fork.c b/kernel/fork.c
index a5d21c42acfc..067aa5c28526 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1704,6 +1704,10 @@ static __latent_entropy struct task_struct *copy_process(
p->default_timer_slack_ns = current->timer_slack_ns;
+#ifdef CONFIG_PSI
+ p->psi_flags = 0;
+#endif
+
task_io_accounting_init(&p->ioac);
acct_clear_integrals(p);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d9a02b318108..b29bc18f2704 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_CPU_ISOLATION) += isolation.o
+obj-$(CONFIG_PSI) += psi.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9586a8141f16..e53137df405b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -743,8 +743,10 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);
- if (!(flags & ENQUEUE_RESTORE))
+ if (!(flags & ENQUEUE_RESTORE)) {
sched_info_queued(rq, p);
+ psi_enqueue(rq, p, flags & ENQUEUE_WAKEUP);
+ }
p->sched_class->enqueue_task(rq, p, flags);
}
@@ -754,8 +756,10 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & DEQUEUE_NOCLOCK))
update_rq_clock(rq);
- if (!(flags & DEQUEUE_SAVE))
+ if (!(flags & DEQUEUE_SAVE)) {
sched_info_dequeued(rq, p);
+ psi_dequeue(rq, p, flags & DEQUEUE_SLEEP);
+ }
p->sched_class->dequeue_task(rq, p, flags);
}
@@ -2058,6 +2062,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;
+ psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
}
@@ -6124,6 +6129,8 @@ void __init sched_init(void)
init_schedstats();
+ psi_init();
+
scheduler_running = 1;
}
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
new file mode 100644
index 000000000000..57ec86592b5a
--- /dev/null
+++ b/kernel/sched/psi.c
@@ -0,0 +1,643 @@
+/*
+ * Pressure stall information for CPU, memory and IO
+ *
+ * Copyright (c) 2018 Facebook, Inc.
+ * Author: Johannes Weiner <hannes@cmpxchg.org>
+ *
+ * When CPU, memory and IO are contended, tasks experience delays that
+ * reduce throughput and introduce latencies into the workload. Memory
+ * and IO contention, in addition, can cause a full loss of forward
+ * progress in which the CPU goes idle.
+ *
+ * This code aggregates individual task delays into resource pressure
+ * metrics that indicate problems with both workload health and
+ * resource utilization.
+ *
+ * Model
+ *
+ * The time in which a task can execute on a CPU is our baseline for
+ * productivity. Pressure expresses the amount of time in which this
+ * potential cannot be realized due to resource contention.
+ *
+ * This concept of productivity has two components: the workload and
+ * the CPU. To measure the impact of pressure on both, we define two
+ * contention states for a resource: SOME and FULL.
+ *
+ * In the SOME state of a given resource, one or more tasks are
+ * delayed on that resource. This affects the workload's ability to
+ * perform work, but the CPU may still be executing other tasks.
+ *
+ * In the FULL state of a given resource, all non-idle tasks are
+ * delayed on that resource such that nobody is advancing and the CPU
+ * goes idle. This leaves both workload and CPU unproductive.
+ *
+ * (Naturally, the FULL state doesn't exist for the CPU resource.)
+ *
+ * SOME = nr_delayed_tasks != 0
+ * FULL = nr_delayed_tasks != 0 && nr_running_tasks == 0
+ *
+ * The percentage of wallclock time spent in those compound stall
+ * states gives pressure numbers between 0 and 100 for each resource,
+ * where the SOME percentage indicates workload slowdowns and the FULL
+ * percentage indicates reduced CPU utilization:
+ *
+ * %SOME = time(SOME) / period
+ * %FULL = time(FULL) / period
+ *
+ * Multiple CPUs
+ *
+ * The more tasks and available CPUs there are, the more work can be
+ * performed concurrently. This means that the potential that can go
+ * unrealized due to resource contention *also* scales with non-idle
+ * tasks and CPUs.
+ *
+ * Consider a scenario where 257 number crunching tasks are trying to
+ * run concurrently on 256 CPUs. If we simply aggregated the task
+ * states, we would have to conclude a CPU SOME pressure number of
+ * 100%, since *somebody* is waiting on a runqueue at all
+ * times. However, that is clearly not the amount of contention the
+ * workload is experiencing: only one out of 256 possible execution
+ * threads will be contended at any given time, or about 0.4%.
+ *
+ * Conversely, consider a scenario of 4 tasks and 4 CPUs where at any
+ * given time *one* of the tasks is delayed due to a lack of memory.
+ * Again, looking purely at the task state would yield a memory FULL
+ * pressure number of 0%, since *somebody* is always making forward
+ * progress. But again this wouldn't capture the amount of execution
+ * potential lost, which is 1 out of 4 CPUs, or 25%.
+ *
+ * To calculate wasted potential (pressure) with multiple processors,
+ * we have to base our calculation on the number of non-idle tasks in
+ * conjunction with the number of available CPUs, which is the number
+ * of potential execution threads. SOME then becomes the proportion of
+ * delayed tasks to possible threads, and FULL is the share of possible
+ * threads that are unproductive due to delays:
+ *
+ * threads = min(nr_nonidle_tasks, nr_cpus)
+ * SOME = min(nr_delayed_tasks / threads, 1)
+ * FULL = (threads - min(nr_running_tasks, threads)) / threads
+ *
+ * For the 257 number crunchers on 256 CPUs, this yields:
+ *
+ * threads = min(257, 256)
+ * SOME = min(1 / 256, 1) = 0.4%
+ * FULL = (256 - min(257, 256)) / 256 = 0%
+ *
+ * For the 1 out of 4 memory-delayed tasks, this yields:
+ *
+ * threads = min(4, 4)
+ * SOME = min(1 / 4, 1) = 25%
+ * FULL = (4 - min(3, 4)) / 4 = 25%
+ *
+ * [ Substitute nr_cpus with 1, and you can see that it's a natural
+ * extension of the single-CPU model. ]
+ *
+ * Implementation
+ *
+ * To assess the precise time spent in each such state, we would have
+ * to freeze the system on task changes and start/stop the state
+ * clocks accordingly. Obviously that doesn't scale in practice.
+ *
+ * Because the scheduler aims to distribute the compute load evenly
+ * among the available CPUs, we can track task state locally to each
+ * CPU and, at much lower frequency, extrapolate the global state for
+ * the cumulative stall times and the running averages.
+ *
+ * For each runqueue, we track:
+ *
+ * tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)
+ * tFULL[cpu] = time(nr_delayed_tasks[cpu] && !nr_running_tasks[cpu])
+ * tNONIDLE[cpu] = time(nr_nonidle_tasks[cpu] != 0)
+ *
+ * and then periodically aggregate:
+ *
+ * tNONIDLE = sum(tNONIDLE[i])
+ *
+ * tSOME = sum(tSOME[i] * tNONIDLE[i]) / tNONIDLE
+ * tFULL = sum(tFULL[i] * tNONIDLE[i]) / tNONIDLE
+ *
+ * %SOME = tSOME / period
+ * %FULL = tFULL / period
+ *
+ * This gives us an approximation of pressure that is practical
+ * cost-wise, yet way more sensitive and accurate than periodic
+ * sampling of the aggregate task states would be.
+ */
+
+#include <linux/sched/loadavg.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <linux/cgroup.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/psi.h>
+#include "sched.h"
+
+static int psi_bug __read_mostly;
+
+bool psi_disabled __read_mostly;
+core_param(psi_disabled, psi_disabled, bool, 0644);
+
+/* Running averages - we need to be higher-res than loadavg */
+#define PSI_FREQ (2*HZ+1) /* 2 sec intervals */
+#define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */
+#define EXP_60s 1981 /* 1/exp(2s/60s) */
+#define EXP_300s 2034 /* 1/exp(2s/300s) */
+
+/* Sampling period in nanoseconds */
+static u64 psi_period __read_mostly;
+
+/* System-level pressure and stall tracking */
+static DEFINE_PER_CPU(struct psi_group_cpu, system_group_pcpu);
+static struct psi_group psi_system = {
+ .pcpu = &system_group_pcpu,
+};
+
+static void psi_clock(struct work_struct *work);
+
+static void psi_group_init(struct psi_group *group)
+{
+ group->next_update = sched_clock() + psi_period;
+ INIT_DELAYED_WORK(&group->clock_work, psi_clock);
+ mutex_init(&group->stat_lock);
+}
+
+void __init psi_init(void)
+{
+ if (psi_disabled)
+ return;
+
+ psi_period = jiffies_to_nsecs(PSI_FREQ);
+ psi_group_init(&psi_system);
+}
+
+static void calc_avgs(unsigned long avg[3], int missed_periods,
+ u64 time, u64 period)
+{
+ unsigned long pct;
+
+ /* Fill in zeroes for periods of no activity */
+ if (missed_periods) {
+ avg[0] = calc_load_n(avg[0], EXP_10s, 0, missed_periods);
+ avg[1] = calc_load_n(avg[1], EXP_60s, 0, missed_periods);
+ avg[2] = calc_load_n(avg[2], EXP_300s, 0, missed_periods);
+ }
+
+ /* Sample the most recent active period */
+ pct = div_u64(time * 100, period);
+ pct *= FIXED_1;
+ avg[0] = calc_load(avg[0], EXP_10s, pct);
+ avg[1] = calc_load(avg[1], EXP_60s, pct);
+ avg[2] = calc_load(avg[2], EXP_300s, pct);
+}
+
+static bool test_state(unsigned int *tasks, int cpu, enum psi_states state)
+{
+ switch (state) {
+ case PSI_IO_SOME:
+ return tasks[NR_IOWAIT];
+ case PSI_IO_FULL:
+ return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
+ case PSI_MEM_SOME:
+ return tasks[NR_MEMSTALL];
+ case PSI_MEM_FULL:
+ /*
+ * Since we care about lost potential, things are
+ * fully blocked on memory when there are no other
+ * working tasks, but also when the CPU is actively
+ * being used by a reclaimer and nothing productive
+ * could run even if it were runnable.
+ */
+ return tasks[NR_MEMSTALL] &&
+ (!tasks[NR_RUNNING] ||
+ cpu_curr(cpu)->flags & PF_MEMSTALL);
+ case PSI_CPU_SOME:
+ return tasks[NR_RUNNING] > 1;
+ case PSI_NONIDLE:
+ return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
+ tasks[NR_RUNNING];
+ default:
+ return false;
+ }
+}
+
+static bool psi_update_stats(struct psi_group *group)
+{
+ u64 deltas[NR_PSI_STATES - 1] = { 0, };
+ unsigned long missed_periods = 0;
+ unsigned long nonidle_total = 0;
+ u64 now, expires, period;
+ int cpu;
+ int s;
+
+ mutex_lock(&group->stat_lock);
+
+ /*
+ * Collect the per-cpu time buckets and average them into a
+ * single time sample that is normalized to wallclock time.
+ *
+ * For averaging, each CPU is weighted by its non-idle time in
+ * the sampling period. This eliminates artifacts from uneven
+ * loading, or even entirely idle CPUs.
+ *
+ * We don't need to synchronize against CPU hotplugging. If we
+ * see a CPU that's online and has samples, we incorporate it.
+ */
+ for_each_online_cpu(cpu) {
+ struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+ u32 uninitialized_var(nonidle);
+
+ BUILD_BUG_ON(PSI_NONIDLE != NR_PSI_STATES - 1);
+
+ for (s = PSI_NONIDLE; s >= 0; s--) {
+ u32 time, delta;
+
+ time = READ_ONCE(groupc->times[s]);
+ /*
+ * In addition to already concluded states, we
+ * also incorporate currently active states on
+ * the CPU, since states may last for many
+ * sampling periods.
+ *
+ * This way we keep our delta sampling buckets
+ * small (u32) and our reported pressure close
+ * to what's actually happening.
+ */
+ if (test_state(groupc->tasks, cpu, s)) {
+ /*
+ * We can race with a state change and
+ * need to make sure the state_start
+ * update is ordered against the
+ * updates to the live state and the
+ * time buckets (groupc->times).
+ *
+ * 1. If we observe task state that
+ * needs to be recorded, make sure we
+ * see state_start from when that
+ * state went into effect or we'll
+ * count time from the previous state.
+ *
+ * 2. If the time delta has already
+ * been added to the bucket, make sure
+ * we don't see it in state_start or
+ * we'll count it twice.
+ *
+ * If the time delta is out of
+ * state_start but not in the time
+ * bucket yet, we'll miss it entirely
+ * and handle it in the next period.
+ */
+ smp_rmb();
+ time += cpu_clock(cpu) - groupc->state_start;
+ }
+ delta = time - groupc->times_prev[s];
+ groupc->times_prev[s] = time;
+
+ if (s == PSI_NONIDLE) {
+ nonidle = nsecs_to_jiffies(delta);
+ nonidle_total += nonidle;
+ } else {
+ deltas[s] += (u64)delta * nonidle;
+ }
+ }
+ }
+
+ /*
+ * Integrate the sample into the running statistics that are
+ * reported to userspace: the cumulative stall times and the
+ * decaying averages.
+ *
+ * Pressure percentages are sampled at PSI_FREQ. We might be
+ * called more often when the user polls more frequently than
+ * that; we might be called less often when there is no task
+ * activity, thus no data, and clock ticks are sporadic. The
+ * below handles both.
+ */
+
+ /* total= */
+ for (s = 0; s < NR_PSI_STATES - 1; s++)
+ group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL));
+
+ /* avgX= */
+ now = sched_clock();
+ expires = group->next_update;
+ if (now < expires)
+ goto out;
+ if (now - expires > psi_period)
+ missed_periods = div_u64(now - expires, psi_period);
+
+ /*
+ * The periodic clock tick can get delayed for various
+ * reasons, especially on loaded systems. To avoid clock
+ * drift, we schedule the clock in fixed psi_period intervals.
+ * But the deltas we sample out of the per-cpu buckets above
+ * are based on the actual time elapsing between clock ticks.
+ */
+ group->next_update = expires + ((1 + missed_periods) * psi_period);
+ period = now - (group->last_update + (missed_periods * psi_period));
+ group->last_update = now;
+
+ for (s = 0; s < NR_PSI_STATES - 1; s++) {
+ u32 sample;
+
+ sample = group->total[s] - group->total_prev[s];
+ /*
+ * Due to the lockless sampling of the time buckets,
+ * recorded time deltas can slip into the next period,
+ * which under full pressure can result in samples in
+ * excess of the period length.
+ *
+ * We don't want to report nonsensical pressures in
+ * excess of 100%, nor do we want to drop such events
+ * on the floor. Instead we punt any overage into the
+ * future until pressure subsides. By doing this we
+ * don't underreport the occurring pressure curve, we
+ * just report it delayed by one period length.
+ *
+ * The error isn't cumulative. As soon as another
+ * delta slips from a period P to P+1, by definition
+ * it frees up its time T in P.
+ */
+ if (sample > period)
+ sample = period;
+ group->total_prev[s] += sample;
+ calc_avgs(group->avg[s], missed_periods, sample, period);
+ }
+out:
+ mutex_unlock(&group->stat_lock);
+ return nonidle_total;
+}
+
+static void psi_clock(struct work_struct *work)
+{
+ struct delayed_work *dwork;
+ struct psi_group *group;
+ bool nonidle;
+
+ dwork = to_delayed_work(work);
+ group = container_of(dwork, struct psi_group, clock_work);
+
+ /*
+ * If there is task activity, periodically fold the per-cpu
+ * times and feed samples into the running averages. If things
+ * are idle and there is no data to process, stop the clock.
+ * Once restarted, we'll catch up the running averages in one
+ * go - see calc_avgs() and missed_periods.
+ */
+
+ nonidle = psi_update_stats(group);
+
+ if (nonidle) {
+ unsigned long delay = 0;
+ u64 now;
+
+ now = sched_clock();
+ if (group->next_update > now)
+ delay = nsecs_to_jiffies(group->next_update - now) + 1;
+ schedule_delayed_work(dwork, delay);
+ }
+}
+
+static void psi_group_change(struct psi_group *group, int cpu, u64 now,
+ unsigned int clear, unsigned int set)
+{
+ struct psi_group_cpu *groupc;
+ unsigned int t, m;
+ u32 delta;
+
+ groupc = per_cpu_ptr(group->pcpu, cpu);
+
+ /*
+ * First we assess the aggregate resource states this CPU's
+ * tasks have been in since the last change, and account any
+ * SOME and FULL time that may have resulted in.
+ *
+ * Then we update the task counts according to the state
+ * change requested through the @clear and @set bits.
+ */
+
+ delta = now - groupc->state_start;
+ groupc->state_start = now;
+
+ /*
+ * Update state_start before recording time in the sampling
+ * buckets and changing task counts, to prevent a racing
+ * aggregation from counting the delta twice or attributing it
+ * to an old state.
+ */
+ smp_wmb();
+
+ if (test_state(groupc->tasks, cpu, PSI_IO_SOME)) {
+ groupc->times[PSI_IO_SOME] += delta;
+ if (test_state(groupc->tasks, cpu, PSI_IO_FULL))
+ groupc->times[PSI_IO_FULL] += delta;
+ }
+ if (test_state(groupc->tasks, cpu, PSI_MEM_SOME)) {
+ groupc->times[PSI_MEM_SOME] += delta;
+ if (test_state(groupc->tasks, cpu, PSI_MEM_FULL))
+ groupc->times[PSI_MEM_FULL] += delta;
+ }
+ if (test_state(groupc->tasks, cpu, PSI_CPU_SOME))
+ groupc->times[PSI_CPU_SOME] += delta;
+ if (test_state(groupc->tasks, cpu, PSI_NONIDLE))
+ groupc->times[PSI_NONIDLE] += delta;
+
+ for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
+ if (!(m & (1 << t)))
+ continue;
+ if (groupc->tasks[t] == 0 && !psi_bug) {
+ printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u] clear=%x set=%x\n",
+ cpu, t, groupc->tasks[0],
+ groupc->tasks[1], groupc->tasks[2],
+ clear, set);
+ psi_bug = 1;
+ }
+ groupc->tasks[t]--;
+ }
+ for (t = 0; set; set &= ~(1 << t), t++)
+ if (set & (1 << t))
+ groupc->tasks[t]++;
+
+ if (!delayed_work_pending(&group->clock_work))
+ schedule_delayed_work(&group->clock_work, PSI_FREQ);
+}
+
+void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
+{
+ int cpu = task_cpu(task);
+
+ if (psi_disabled)
+ return;
+
+ if (!task->pid)
+ return;
+
+ if (((task->psi_flags & set) ||
+ (task->psi_flags & clear) != clear) &&
+ !psi_bug) {
+ printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
+ task->pid, task->comm, cpu,
+ task->psi_flags, clear, set);
+ psi_bug = 1;
+ }
+
+ task->psi_flags &= ~clear;
+ task->psi_flags |= set;
+
+ psi_group_change(&psi_system, cpu, now, clear, set);
+}
+
+/**
+ * psi_memstall_enter - mark the beginning of a memory stall section
+ * @flags: flags to handle nested sections
+ *
+ * Marks the calling task as being stalled due to a lack of memory,
+ * such as waiting for a refault or performing reclaim.
+ */
+void psi_memstall_enter(unsigned long *flags)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ if (psi_disabled)
+ return;
+
+ *flags = current->flags & PF_MEMSTALL;
+ if (*flags)
+ return;
+ /*
+ * PF_MEMSTALL setting & accounting needs to be atomic wrt
+ * changes to the task's scheduling state, otherwise we can
+ * race with CPU migration.
+ */
+ rq = this_rq_lock_irq(&rf);
+
+ update_rq_clock(rq);
+
+ current->flags |= PF_MEMSTALL;
+ psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
+
+ rq_unlock_irq(rq, &rf);
+}
+
+/**
+ * psi_memstall_leave - mark the end of a memory stall section
+ * @flags: flags to handle nested sections
+ *
+ * Marks the calling task as no longer stalled due to lack of memory.
+ */
+void psi_memstall_leave(unsigned long *flags)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ if (psi_disabled)
+ return;
+
+ if (*flags)
+ return;
+ /*
+ * PF_MEMSTALL clearing & accounting needs to be atomic wrt
+ * changes to the task's scheduling state, otherwise we could
+ * race with CPU migration.
+ */
+ rq = this_rq_lock_irq(&rf);
+
+ update_rq_clock(rq);
+
+ current->flags &= ~PF_MEMSTALL;
+ psi_task_change(current, rq_clock(rq), TSK_MEMSTALL, 0);
+
+ rq_unlock_irq(rq, &rf);
+}
+
+static int psi_show(struct seq_file *m, struct psi_group *group,
+ enum psi_res res)
+{
+ int full;
+
+ if (psi_disabled)
+ return -EOPNOTSUPP;
+
+ psi_update_stats(group);
+
+ for (full = 0; full < 2 - (res == PSI_CPU); full++) {
+ unsigned long avg[3];
+ u64 total;
+ int w;
+
+ for (w = 0; w < 3; w++)
+ avg[w] = group->avg[res * 2 + full][w];
+ total = div_u64(group->total[res * 2 + full], NSEC_PER_USEC);
+
+ seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
+ full ? "full" : "some",
+ LOAD_INT(avg[0]), LOAD_FRAC(avg[0]),
+ LOAD_INT(avg[1]), LOAD_FRAC(avg[1]),
+ LOAD_INT(avg[2]), LOAD_FRAC(avg[2]),
+ total);
+ }
+
+ return 0;
+}
+
+static int psi_io_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_IO);
+}
+
+static int psi_memory_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_MEM);
+}
+
+static int psi_cpu_show(struct seq_file *m, void *v)
+{
+ return psi_show(m, &psi_system, PSI_CPU);
+}
+
+static int psi_io_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, psi_io_show, NULL);
+}
+
+static int psi_memory_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, psi_memory_show, NULL);
+}
+
+static int psi_cpu_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, psi_cpu_show, NULL);
+}
+
+static const struct file_operations psi_io_fops = {
+ .open = psi_io_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static const struct file_operations psi_memory_fops = {
+ .open = psi_memory_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static const struct file_operations psi_cpu_fops = {
+ .open = psi_cpu_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init psi_proc_init(void)
+{
+ proc_mkdir("pressure", NULL);
+ proc_create("pressure/io", 0, NULL, &psi_io_fops);
+ proc_create("pressure/memory", 0, NULL, &psi_memory_fops);
+ proc_create("pressure/cpu", 0, NULL, &psi_cpu_fops);
+ return 0;
+}
+module_init(psi_proc_init);
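
The multi-CPU model described in the header comment of psi.c can be checked with a few lines of userspace C. This is only a sketch of the formulas for threads, SOME and FULL as stated in that comment, not of the per-CPU time bucket aggregation the patch actually implements; struct psi_pressure, min_uint() and model_pressure() are hypothetical names.

```c
#include <assert.h>

/* Instantaneous pressure as fractions between 0.0 and 1.0. */
struct psi_pressure {
	double some;
	double full;
};

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * threads = min(nr_nonidle_tasks, nr_cpus)
 * SOME    = min(nr_delayed_tasks / threads, 1)
 * FULL    = (threads - min(nr_running_tasks, threads)) / threads
 */
static struct psi_pressure model_pressure(unsigned int nr_nonidle,
					  unsigned int nr_delayed,
					  unsigned int nr_running,
					  unsigned int nr_cpus)
{
	struct psi_pressure p = { 0.0, 0.0 };
	unsigned int threads;

	assert(nr_cpus > 0);

	threads = min_uint(nr_nonidle, nr_cpus);
	if (threads == 0)
		return p;	/* no non-idle tasks, so no pressure */

	p.some = (double)nr_delayed / threads;
	if (p.some > 1.0)
		p.some = 1.0;

	p.full = (double)(threads - min_uint(nr_running, threads)) / threads;
	return p;
}
```

Calling model_pressure(257, 1, 257, 256) reproduces the number-cruncher example (SOME = 1/256, FULL = 0), and model_pressure(4, 1, 3, 4) reproduces the memory-delayed example (SOME = FULL = 1/4).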
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bc798c7cb4d4..e798491ff329 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,6 +54,7 @@
#include <linux/proc_fs.h>
#include <linux/prefetch.h>
#include <linux/profile.h>
+#include <linux/psi.h>
#include <linux/rcupdate_wait.h>
#include <linux/security.h>
#include <linux/stackprotector.h>
@@ -320,6 +321,7 @@ extern bool dl_cpu_busy(unsigned int cpu);
#ifdef CONFIG_CGROUP_SCHED
#include <linux/cgroup.h>
+#include <linux/psi.h>
struct cfs_rq;
struct rt_rq;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8aea199a39b4..f3e0267eb47d 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -55,6 +55,86 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt
# define schedstat_val_or_zero(var) 0
#endif /* CONFIG_SCHEDSTATS */
+#ifdef CONFIG_PSI
+/*
+ * PSI tracks state that persists across sleeps, such as iowaits and
+ * memory stalls. As a result, it has to distinguish between sleeps,
+ * where a task's runnable state changes, and requeues, where a task
+ * and its state are being moved between CPUs and runqueues.
+ */
+static inline void psi_enqueue(struct rq *rq, struct task_struct *p,
+ bool wakeup)
+{
+ int clear = 0, set = TSK_RUNNING;
+
+ if (psi_disabled)
+ return;
+
+ if (!wakeup || p->sched_psi_wake_requeue) {
+ if (p->flags & PF_MEMSTALL)
+ set |= TSK_MEMSTALL;
+ if (p->sched_psi_wake_requeue)
+ p->sched_psi_wake_requeue = 0;
+ } else {
+ if (p->in_iowait)
+ clear |= TSK_IOWAIT;
+ }
+
+ psi_task_change(p, rq_clock(rq), clear, set);
+}
+
+static inline void psi_dequeue(struct rq *rq, struct task_struct *p, bool sleep)
+{
+ int clear = TSK_RUNNING, set = 0;
+
+ if (psi_disabled)
+ return;
+
+ if (!sleep) {
+ if (p->flags & PF_MEMSTALL)
+ clear |= TSK_MEMSTALL;
+ } else {
+ if (p->in_iowait)
+ set |= TSK_IOWAIT;
+ }
+
+ psi_task_change(p, rq_clock(rq), clear, set);
+}
+
+static inline void psi_ttwu_dequeue(struct task_struct *p)
+{
+ if (psi_disabled)
+ return;
+ /*
+ * Is the task being migrated during a wakeup? Make sure to
+ * deregister its sleep-persistent psi states from the old
+ * queue, and let psi_enqueue() know it has to requeue.
+ */
+ if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
+ struct rq_flags rf;
+ struct rq *rq;
+ int clear = 0;
+
+ if (p->in_iowait)
+ clear |= TSK_IOWAIT;
+ if (p->flags & PF_MEMSTALL)
+ clear |= TSK_MEMSTALL;
+
+ rq = __task_rq_lock(p, &rf);
+ update_rq_clock(rq);
+ psi_task_change(p, rq_clock(rq), clear, 0);
+ p->sched_psi_wake_requeue = 1;
+ __task_rq_unlock(rq, &rf);
+ }
+}
+#else /* CONFIG_PSI */
+static inline void psi_enqueue(struct rq *rq, struct task_struct *p,
+ bool wakeup) {}
+static inline void psi_dequeue(struct rq *rq, struct task_struct *p,
+ bool sleep) {}
+static inline void psi_ttwu_dequeue(struct task_struct *p) {}
+#endif /* CONFIG_PSI */
+
#ifdef CONFIG_SCHED_INFO
static inline void sched_info_reset_dequeued(struct task_struct *t)
{
diff --git a/mm/compaction.c b/mm/compaction.c
index 29bd1df18b98..8f9566745902 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -22,6 +22,7 @@
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/page_owner.h>
+#include <linux/psi.h>
#include "internal.h"
#ifdef CONFIG_COMPACTION
@@ -2068,11 +2069,15 @@ static int kcompactd(void *p)
pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
while (!kthread_should_stop()) {
+ unsigned long pflags;
+
trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
wait_event_freezable(pgdat->kcompactd_wait,
kcompactd_work_requested(pgdat));
+ psi_memstall_enter(&pflags);
kcompactd_do_work(pgdat);
+ psi_memstall_leave(&pflags);
}
return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index e49961e13dd9..eee06145b997 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -37,6 +37,7 @@
#include <linux/shmem_fs.h>
#include <linux/rmap.h>
#include <linux/delayacct.h>
+#include <linux/psi.h>
#include "internal.h"
#define CREATE_TRACE_POINTS
@@ -1075,11 +1076,14 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
bool thrashing = false;
+ unsigned long pflags;
int ret = 0;
- if (bit_nr == PG_locked && !PageSwapBacked(page) &&
+ if (bit_nr == PG_locked &&
!PageUptodate(page) && PageWorkingset(page)) {
- delayacct_thrashing_start();
+ if (!PageSwapBacked(page))
+ delayacct_thrashing_start();
+ psi_memstall_enter(&pflags);
thrashing = true;
}
@@ -1121,8 +1125,11 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
finish_wait(q, wait);
- if (thrashing)
- delayacct_thrashing_end();
+ if (thrashing) {
+ if (!PageSwapBacked(page))
+ delayacct_thrashing_end();
+ psi_memstall_leave(&pflags);
+ }
/*
* A signal could leave PageWaiters set. Clearing it here if
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 22320ea27489..8469f34e6731 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -67,6 +67,7 @@
#include <linux/ftrace.h>
#include <linux/lockdep.h>
#include <linux/nmi.h>
+#include <linux/psi.h>
#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -3552,15 +3553,20 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
enum compact_priority prio, enum compact_result *compact_result)
{
struct page *page;
+ unsigned long pflags;
unsigned int noreclaim_flag;
if (!order)
return NULL;
+ psi_memstall_enter(&pflags);
noreclaim_flag = memalloc_noreclaim_save();
+
*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
prio);
+
memalloc_noreclaim_restore(noreclaim_flag);
+ psi_memstall_leave(&pflags);
if (*compact_result <= COMPACT_INACTIVE)
return NULL;
@@ -3749,11 +3755,14 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
struct reclaim_state reclaim_state;
int progress;
unsigned int noreclaim_flag;
+ unsigned long pflags;
cond_resched();
/* We now go into synchronous reclaim */
cpuset_memory_pressure_bump();
+
+ psi_memstall_enter(&pflags);
noreclaim_flag = memalloc_noreclaim_save();
fs_reclaim_acquire(gfp_mask);
reclaim_state.reclaimed_slab = 0;
@@ -3765,6 +3774,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
current->reclaim_state = NULL;
fs_reclaim_release(gfp_mask);
memalloc_noreclaim_restore(noreclaim_flag);
+ psi_memstall_leave(&pflags);
cond_resched();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8d1ad48ffbcd..ee91e8cbeb5a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -49,6 +49,7 @@
#include <linux/prefetch.h>
#include <linux/printk.h>
#include <linux/dax.h>
+#include <linux/psi.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -3115,6 +3116,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
+ unsigned long pflags;
int nid;
unsigned int noreclaim_flag;
struct scan_control sc = {
@@ -3143,9 +3145,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
sc.gfp_mask,
sc.reclaim_idx);
+ psi_memstall_enter(&pflags);
noreclaim_flag = memalloc_noreclaim_save();
+
nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+
memalloc_noreclaim_restore(noreclaim_flag);
+ psi_memstall_leave(&pflags);
trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
@@ -3565,6 +3571,7 @@ static int kswapd(void *p)
pgdat->kswapd_order = 0;
pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
for ( ; ; ) {
+ unsigned long pflags;
bool ret;
alloc_order = reclaim_order = pgdat->kswapd_order;
@@ -3601,9 +3608,15 @@ static int kswapd(void *p)
*/
trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
alloc_order);
+
+ psi_memstall_enter(&pflags);
fs_reclaim_acquire(GFP_KERNEL);
+
reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
+
fs_reclaim_release(GFP_KERNEL);
+ psi_memstall_leave(&pflags);
+
if (reclaim_order < alloc_order)
goto kswapd_try_sleep;
}
--
2.18.0