* [PATCH 0/2] psi: enhance psi with the help of ebpf
@ 2020-03-26 11:12 Yafang Shao
  2020-03-26 11:12 ` [PATCH 1/2] psi: introduce various types of memstall Yafang Shao
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Yafang Shao @ 2020-03-26 11:12 UTC (permalink / raw)
  To: hannes, peterz, akpm, mhocko, axboe, mgorman, rostedt, mingo
  Cc: linux-mm, linux-block, linux-kernel, Yafang Shao

PSI gives us a powerful way to analyze memory pressure issues, but we
can make it more powerful with the help of tracepoints, kprobes, eBPF
and other tools. Especially with eBPF we can flexibly get more details
of the memory pressure.

In order to achieve this goal, a new parameter is added into
psi_memstall_{enter, leave}, which indicates the specific type of a
memstall. There are ten memstall types so far,
        MEMSTALL_KSWAPD
        MEMSTALL_RECLAIM_DIRECT
        MEMSTALL_RECLAIM_MEMCG
        MEMSTALL_RECLAIM_HIGH
        MEMSTALL_KCOMPACTD
        MEMSTALL_COMPACT
        MEMSTALL_WORKINGSET_REFAULT
        MEMSTALL_WORKINGSET_THRASHING
        MEMSTALL_MEMDELAY
        MEMSTALL_SWAPIO
By tracing this newly added argument with a kprobe or tracepoint we
can tell which type of memstall occurred and then make the
corresponding improvement. It can also help us analyze latency spikes
caused by memory pressure.
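
For example, a minimal (untested) bpftrace sketch that counts memstalls
by type via a kprobe, assuming this series is applied so that the second
argument of psi_memstall_enter() is the type:

kprobe:psi_memstall_enter
{
        /* arg1 is the 'enum memstall_types type' argument */
        @memstalls[arg1] = count();
}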

But note that we can't use it to build memory pressure metrics for a
specific type of memstall, e.g. memcg pressure, compaction pressure,
etc., because it doesn't implement per-type variants of
task->in_memstall, e.g. task->in_memcgstall, task->in_compactionstall
and so on.

Although there are already some tracepoints that can help us achieve
this goal, e.g.
        vmscan:mm_vmscan_kswapd_{wake, sleep}
        vmscan:mm_vmscan_direct_reclaim_{begin, end}
        vmscan:mm_vmscan_memcg_reclaim_{begin, end}
        /* no tracepoint for memcg high reclaim */
        compaction:mm_compaction_kcompactd_{wake, sleep}
        compaction:mm_compaction_{begin, end}
        /* no tracepoint for workingset refault */
        /* no tracepoint for workingset thrashing */
        /* no tracepoint for memdelay */
        /* no tracepoint for swapio */
psi_memstall_{enter, leave} gives us a unified entry point for all
types of memstall, so we don't need to add the many begin and end
tracepoints that haven't been implemented yet.

Patch #2 gives us an example of how to use it with eBPF. With the help
of eBPF we can trace a specific task, application, container and so
on. It can also help us analyze the spread of latencies and whether
they were clustered at a point in time or spread out over long periods
of time.
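
For instance, here is a rough, untested bpftrace sketch that limits the
tracing to a single application ("mysqld" is just a placeholder task
name):

tracepoint:sched:psi_memstall_enter
/comm == "mysqld"/
{
        @start[tid] = nsecs;
}

tracepoint:sched:psi_memstall_leave
/@start[tid]/
{
        /* per-type histogram of stall time, in microseconds */
        @usecs[args->type] = hist((nsecs - @start[tid]) / 1000);
        delete(@start[tid]);
}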

Yafang Shao (2):
  psi: introduce various types of memstall
  psi, tracepoint: introduce tracepoints for psi_memstall_{enter, leave}

 block/blk-cgroup.c           |  4 ++--
 block/blk-core.c             |  4 ++--
 include/linux/psi.h          | 15 +++++++++++----
 include/linux/psi_types.h    | 13 +++++++++++++
 include/trace/events/sched.h | 41 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/psi.c           | 14 ++++++++++++--
 mm/compaction.c              |  4 ++--
 mm/filemap.c                 |  4 ++--
 mm/memcontrol.c              |  4 ++--
 mm/page_alloc.c              |  8 ++++----
 mm/page_io.c                 |  4 ++--
 mm/vmscan.c                  |  8 ++++----
 12 files changed, 97 insertions(+), 26 deletions(-)

-- 
1.8.3.1



* [PATCH 1/2] psi: introduce various types of memstall
  2020-03-26 11:12 [PATCH 0/2] psi: enhance psi with the help of ebpf Yafang Shao
@ 2020-03-26 11:12 ` Yafang Shao
  2020-03-26 11:12 ` [PATCH 2/2] psi, tracepoint: introduce tracepoints for psi_memstall_{enter, leave} Yafang Shao
  2020-03-26 14:31 ` [PATCH 0/2] psi: enhance psi with the help of ebpf Johannes Weiner
  2 siblings, 0 replies; 9+ messages in thread
From: Yafang Shao @ 2020-03-26 11:12 UTC (permalink / raw)
  To: hannes, peterz, akpm, mhocko, axboe, mgorman, rostedt, mingo
  Cc: linux-mm, linux-block, linux-kernel, Yafang Shao

Memstall is used as a memory pressure indicator now. But there are
many paths into a memstall, so once a memstall happens we don't know
its specific reason.

This patch introduces various types of memstall as below,
	MEMSTALL_KSWAPD
	MEMSTALL_RECLAIM_DIRECT
	MEMSTALL_RECLAIM_MEMCG
	MEMSTALL_RECLAIM_HIGH
	MEMSTALL_KCOMPACTD
	MEMSTALL_COMPACT
	MEMSTALL_WORKINGSET_REFAULT
	MEMSTALL_WORKINGSET_THRASHING
	MEMSTALL_MEMDELAY
	MEMSTALL_SWAPIO
and adds a new parameter 'type' to psi_memstall_{enter, leave}.

After that, we can trace specific types of memstall with other
powerful tools like tracepoints, kprobes, eBPF and so on. It can also
help us analyze latency spikes caused by memory pressure. But note
that we can't use it to build memory pressure metrics for a specific
type of memstall, e.g. memcg pressure, compaction pressure, etc.,
because it doesn't implement per-type variants of task->in_memstall,
e.g. task->in_memcgstall, task->in_compactionstall and so on. IOW,
the main goal is to trace the spread of latencies and the specific
reasons behind them.

Although there are already some tracepoints that can help us achieve
this goal, e.g.
	vmscan:mm_vmscan_kswapd_{wake, sleep}
	vmscan:mm_vmscan_direct_reclaim_{begin, end}
	vmscan:mm_vmscan_memcg_reclaim_{begin, end}
	/* no tracepoint for memcg high reclaim */
	compaction:mm_compaction_kcompactd_{wake, sleep}
	compaction:mm_compaction_{begin, end}
	/* no tracepoint for workingset refault */
	/* no tracepoint for workingset thrashing */
	/* no tracepoint for memdelay */
	/* no tracepoint for swapio */
psi_memstall_{enter, leave} gives us a unified entry point for all
types of memstall, so we don't need to add the many begin and end
tracepoints that haven't been implemented yet.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 block/blk-cgroup.c        |  4 ++--
 block/blk-core.c          |  4 ++--
 include/linux/psi.h       | 15 +++++++++++----
 include/linux/psi_types.h | 13 +++++++++++++
 kernel/sched/psi.c        |  6 ++++--
 mm/compaction.c           |  4 ++--
 mm/filemap.c              |  4 ++--
 mm/memcontrol.c           |  4 ++--
 mm/page_alloc.c           |  8 ++++----
 mm/page_io.c              |  4 ++--
 mm/vmscan.c               |  8 ++++----
 11 files changed, 48 insertions(+), 26 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index a229b94..fc24095 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1593,7 +1593,7 @@ static void blkcg_maybe_throttle_blkg(struct blkcg_gq *blkg, bool use_memdelay)
 	delay_nsec = min_t(u64, delay_nsec, 250 * NSEC_PER_MSEC);
 
 	if (use_memdelay)
-		psi_memstall_enter(&pflags);
+		psi_memstall_enter(&pflags, MEMSTALL_MEMDELAY);
 
 	exp = ktime_add_ns(now, delay_nsec);
 	tok = io_schedule_prepare();
@@ -1605,7 +1605,7 @@ static void blkcg_maybe_throttle_blkg(struct blkcg_gq *blkg, bool use_memdelay)
 	io_schedule_finish(tok);
 
 	if (use_memdelay)
-		psi_memstall_leave(&pflags);
+		psi_memstall_leave(&pflags, MEMSTALL_MEMDELAY);
 }
 
 /**
diff --git a/block/blk-core.c b/block/blk-core.c
index 60dc955..e2039cf 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1190,12 +1190,12 @@ blk_qc_t submit_bio(struct bio *bio)
 	 * submission can be a significant part of overall IO time.
 	 */
 	if (workingset_read)
-		psi_memstall_enter(&pflags);
+		psi_memstall_enter(&pflags, MEMSTALL_WORKINGSET_REFAULT);
 
 	ret = generic_make_request(bio);
 
 	if (workingset_read)
-		psi_memstall_leave(&pflags);
+		psi_memstall_leave(&pflags, MEMSTALL_WORKINGSET_REFAULT);
 
 	return ret;
 }
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7b3de73..7bf94f6 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -19,8 +19,8 @@
 void psi_task_change(struct task_struct *task, int clear, int set);
 
 void psi_memstall_tick(struct task_struct *task, int cpu);
-void psi_memstall_enter(unsigned long *flags);
-void psi_memstall_leave(unsigned long *flags);
+void psi_memstall_enter(unsigned long *flags, enum memstall_types type);
+void psi_memstall_leave(unsigned long *flags, enum memstall_types type);
 
 int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
 
@@ -41,8 +41,15 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
 
 static inline void psi_init(void) {}
 
-static inline void psi_memstall_enter(unsigned long *flags) {}
-static inline void psi_memstall_leave(unsigned long *flags) {}
+static inline void psi_memstall_enter(unsigned long *flags,
+				      enum memstall_types type)
+{
+}
+
+static inline void psi_memstall_leave(unsigned long *flags,
+				      enum memstall_types type)
+{
+}
 
 #ifdef CONFIG_CGROUPS
 static inline int psi_cgroup_alloc(struct cgroup *cgrp)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 07aaf9b..52a3f08 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -7,6 +7,19 @@
 #include <linux/kref.h>
 #include <linux/wait.h>
 
+enum memstall_types {
+	MEMSTALL_KSWAPD,
+	MEMSTALL_RECLAIM_DIRECT,
+	MEMSTALL_RECLAIM_MEMCG,
+	MEMSTALL_RECLAIM_HIGH,
+	MEMSTALL_KCOMPACTD,
+	MEMSTALL_COMPACT,
+	MEMSTALL_WORKINGSET_REFAULT,
+	MEMSTALL_WORKINGSET_THRASHING,
+	MEMSTALL_MEMDELAY,
+	MEMSTALL_SWAPIO,
+};
+
 #ifdef CONFIG_PSI
 
 /* Tracked task states */
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 0285207..460f084 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -806,11 +806,12 @@ void psi_memstall_tick(struct task_struct *task, int cpu)
 /**
  * psi_memstall_enter - mark the beginning of a memory stall section
  * @flags: flags to handle nested sections
+ * @type: type of memstall
  *
  * Marks the calling task as being stalled due to a lack of memory,
  * such as waiting for a refault or performing reclaim.
  */
-void psi_memstall_enter(unsigned long *flags)
+void psi_memstall_enter(unsigned long *flags, enum memstall_types type)
 {
 	struct rq_flags rf;
 	struct rq *rq;
@@ -837,10 +838,11 @@ void psi_memstall_enter(unsigned long *flags)
 /**
  * psi_memstall_leave - mark the end of an memory stall section
  * @flags: flags to handle nested memdelay sections
+ * @type: type of memstall
  *
  * Marks the calling task as no longer stalled due to lack of memory.
  */
-void psi_memstall_leave(unsigned long *flags)
+void psi_memstall_leave(unsigned long *flags, enum memstall_types type)
 {
 	struct rq_flags rf;
 	struct rq *rq;
diff --git a/mm/compaction.c b/mm/compaction.c
index 672d3c7..c0d5331 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2647,9 +2647,9 @@ static int kcompactd(void *p)
 		wait_event_freezable(pgdat->kcompactd_wait,
 				kcompactd_work_requested(pgdat));
 
-		psi_memstall_enter(&pflags);
+		psi_memstall_enter(&pflags, MEMSTALL_KCOMPACTD);
 		kcompactd_do_work(pgdat);
-		psi_memstall_leave(&pflags);
+		psi_memstall_leave(&pflags, MEMSTALL_KCOMPACTD);
 	}
 
 	return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index 1784478..f5459e3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1123,7 +1123,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 			delayacct_thrashing_start();
 			delayacct = true;
 		}
-		psi_memstall_enter(&pflags);
+		psi_memstall_enter(&pflags, MEMSTALL_WORKINGSET_THRASHING);
 		thrashing = true;
 	}
 
@@ -1182,7 +1182,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	if (thrashing) {
 		if (delayacct)
 			delayacct_thrashing_end();
-		psi_memstall_leave(&pflags);
+		psi_memstall_leave(&pflags, MEMSTALL_WORKINGSET_THRASHING);
 	}
 
 	/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7a4bd8b..a9b336e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2399,9 +2399,9 @@ void mem_cgroup_handle_over_high(void)
 	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
 	 * need to account for any ill-begotten jiffies to pay them off later.
 	 */
-	psi_memstall_enter(&pflags);
+	psi_memstall_enter(&pflags, MEMSTALL_RECLAIM_HIGH);
 	schedule_timeout_killable(penalty_jiffies);
-	psi_memstall_leave(&pflags);
+	psi_memstall_leave(&pflags, MEMSTALL_RECLAIM_HIGH);
 
 out:
 	css_put(&memcg->css);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb75..8789234a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3884,14 +3884,14 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	if (!order)
 		return NULL;
 
-	psi_memstall_enter(&pflags);
+	psi_memstall_enter(&pflags, MEMSTALL_COMPACT);
 	noreclaim_flag = memalloc_noreclaim_save();
 
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
 								prio, &page);
 
 	memalloc_noreclaim_restore(noreclaim_flag);
-	psi_memstall_leave(&pflags);
+	psi_memstall_leave(&pflags, MEMSTALL_COMPACT);
 
 	/*
 	 * At least in one zone compaction wasn't deferred or skipped, so let's
@@ -4106,7 +4106,7 @@ void fs_reclaim_release(gfp_t gfp_mask)
 
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
-	psi_memstall_enter(&pflags);
+	psi_memstall_enter(&pflags, MEMSTALL_RECLAIM_DIRECT);
 	fs_reclaim_acquire(gfp_mask);
 	noreclaim_flag = memalloc_noreclaim_save();
 
@@ -4115,7 +4115,7 @@ void fs_reclaim_release(gfp_t gfp_mask)
 
 	memalloc_noreclaim_restore(noreclaim_flag);
 	fs_reclaim_release(gfp_mask);
-	psi_memstall_leave(&pflags);
+	psi_memstall_leave(&pflags, MEMSTALL_RECLAIM_DIRECT);
 
 	cond_resched();
 
diff --git a/mm/page_io.c b/mm/page_io.c
index 76965be..67de6b1 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -369,7 +369,7 @@ int swap_readpage(struct page *page, bool synchronous)
 	 * or the submitting cgroup IO-throttled, submission can be a
 	 * significant part of overall IO time.
 	 */
-	psi_memstall_enter(&pflags);
+	psi_memstall_enter(&pflags, MEMSTALL_SWAPIO);
 
 	if (frontswap_load(page) == 0) {
 		SetPageUptodate(page);
@@ -431,7 +431,7 @@ int swap_readpage(struct page *page, bool synchronous)
 	bio_put(bio);
 
 out:
-	psi_memstall_leave(&pflags);
+	psi_memstall_leave(&pflags, MEMSTALL_SWAPIO);
 	return ret;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8763705..4445c1d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3352,13 +3352,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask);
 
-	psi_memstall_enter(&pflags);
+	psi_memstall_enter(&pflags, MEMSTALL_RECLAIM_MEMCG);
 	noreclaim_flag = memalloc_noreclaim_save();
 
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	memalloc_noreclaim_restore(noreclaim_flag);
-	psi_memstall_leave(&pflags);
+	psi_memstall_leave(&pflags, MEMSTALL_RECLAIM_MEMCG);
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 	set_task_reclaim_state(current, NULL);
@@ -3568,7 +3568,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	};
 
 	set_task_reclaim_state(current, &sc.reclaim_state);
-	psi_memstall_enter(&pflags);
+	psi_memstall_enter(&pflags, MEMSTALL_KSWAPD);
 	__fs_reclaim_acquire();
 
 	count_vm_event(PAGEOUTRUN);
@@ -3747,7 +3747,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 	snapshot_refaults(NULL, pgdat);
 	__fs_reclaim_release();
-	psi_memstall_leave(&pflags);
+	psi_memstall_leave(&pflags, MEMSTALL_KSWAPD);
 	set_task_reclaim_state(current, NULL);
 
 	/*
-- 
1.8.3.1



* [PATCH 2/2] psi, tracepoint: introduce tracepoints for psi_memstall_{enter, leave}
  2020-03-26 11:12 [PATCH 0/2] psi: enhance psi with the help of ebpf Yafang Shao
  2020-03-26 11:12 ` [PATCH 1/2] psi: introduce various types of memstall Yafang Shao
@ 2020-03-26 11:12 ` Yafang Shao
  2020-03-26 14:31 ` [PATCH 0/2] psi: enhance psi with the help of ebpf Johannes Weiner
  2 siblings, 0 replies; 9+ messages in thread
From: Yafang Shao @ 2020-03-26 11:12 UTC (permalink / raw)
  To: hannes, peterz, akpm, mhocko, axboe, mgorman, rostedt, mingo
  Cc: linux-mm, linux-block, linux-kernel, Yafang Shao

With the new parameter introduced in psi_memstall_{enter, leave} we
can get the specific type of a memstall. To make it easier to use,
introduce tracepoints for psi_memstall_{enter, leave}. Once these two
tracepoints are added we can easily use tools like eBPF or bpftrace to
collect and analyze the memstall data.

Here's one example with bpftrace to measure an application's memstall latencies.
tracepoint:sched:psi_memstall_enter
{
        @start[tid, args->type] = nsecs
}

tracepoint:sched:psi_memstall_leave
{
        @time[comm, args->type] = hist(nsecs - @start[tid, args->type]);
        delete(@start[tid, args->type]);
}

Below is part of the result after generating some memory pressure.
@time[objdump, 7]:
[256K, 512K)           1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@time[objdump, 6]:
[8K, 16K)              2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@time[objcopy, 7]:
[16K, 32K)             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@time[ld, 7]:
[4M, 8M)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8M, 16M)              1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@time[khugepaged, 5]:
[4K, 8K)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)              0 |                                                    |
[16K, 32K)             0 |                                                    |
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              0 |                                                    |
[16M, 32M)             0 |                                                    |
[32M, 64M)             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@time[kswapd0, 0]:
[16K, 32K)             1 |@@@@@                                               |
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              1 |@@@@@                                               |
[16M, 32M)            10 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32M, 64M)             9 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[64M, 128M)            2 |@@@@@@@@@@                                          |
[128M, 256M)           2 |@@@@@@@@@@                                          |
[256M, 512M)           3 |@@@@@@@@@@@@@@@                                     |
[512M, 1G)             1 |@@@@@                                               |

@time[kswapd1, 0]:
[1M, 2M)               1 |@@@@                                                |
[2M, 4M)               2 |@@@@@@@@                                            |
[4M, 8M)               0 |                                                    |
[8M, 16M)             12 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16M, 32M)             7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
[32M, 64M)             5 |@@@@@@@@@@@@@@@@@@@@@                               |
[64M, 128M)            5 |@@@@@@@@@@@@@@@@@@@@@                               |
[128M, 256M)           3 |@@@@@@@@@@@@@                                       |
[256M, 512M)           1 |@@@@                                                |

@time[khugepaged, 1]:
[2M, 4M)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

With the builtin variable 'cgroup' of bpftrace we can also filter on a
memcg and its descendants.
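
A rough, untested sketch of such a filter, to be combined with the
leave probe from the example above (the cgroup path is just an example;
an exact match only covers that one cgroup, so matching descendants
would need one predicate per child cgroup or a map of cgroup ids):

tracepoint:sched:psi_memstall_enter
/cgroup == cgroupid("/sys/fs/cgroup/mygroup")/
{
        @start[tid, args->type] = nsecs;
}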

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/trace/events/sched.h | 41 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/psi.c           |  8 ++++++++
 2 files changed, 49 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 420e80e..6aca996 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -7,8 +7,20 @@
 
 #include <linux/sched/numa_balancing.h>
 #include <linux/tracepoint.h>
+#include <linux/psi_types.h>
 #include <linux/binfmts.h>
 
+#define show_psi_memstall_type(type) __print_symbolic(type,	\
+	{MEMSTALL_KSWAPD, "MEMSTALL_KSWAPD"},			\
+	{MEMSTALL_RECLAIM_DIRECT, "MEMSTALL_RECLAIM_DIRECT"},	\
+	{MEMSTALL_RECLAIM_MEMCG, "MEMSTALL_RECLAIM_MEMCG"},	\
+	{MEMSTALL_RECLAIM_HIGH, "MEMSTALL_RECLAIM_HIGH"},	\
+	{MEMSTALL_KCOMPACTD, "MEMSTALL_KCOMPACTD"},		\
+	{MEMSTALL_COMPACT, "MEMSTALL_COMPACT"},			\
+	{MEMSTALL_WORKINGSET, "MEMSTALL_WORKINGSET"},		\
+	{MEMSTALL_PGLOCK, "MEMSTALL_PGLOCK"},			\
+	{MEMSTALL_MEMDELAY, "MEMSTALL_MEMDELAY"},		\
+	{MEMSTALL_SWAP, "MEMSTALL_SWAP"})
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
  */
@@ -625,6 +637,35 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
 	TP_PROTO(struct root_domain *rd, bool overutilized),
 	TP_ARGS(rd, overutilized));
 
+DECLARE_EVENT_CLASS(psi_memstall_template,
+
+	TP_PROTO(int type),
+
+	TP_ARGS(type),
+
+	TP_STRUCT__entry(
+		__field(int, type)
+	),
+
+	TP_fast_assign(
+		__entry->type = type;
+	),
+
+	TP_printk("type=%s",
+		show_psi_memstall_type(__entry->type))
+);
+
+DEFINE_EVENT(psi_memstall_template, psi_memstall_enter,
+	TP_PROTO(int type),
+	TP_ARGS(type)
+);
+
+DEFINE_EVENT(psi_memstall_template, psi_memstall_leave,
+	TP_PROTO(int type),
+	TP_ARGS(type)
+);
+
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 460f084..4c5a402 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -142,6 +142,8 @@
 #include <linux/psi.h>
 #include "sched.h"
 
+#include <trace/events/sched.h>
+
 static int psi_bug __read_mostly;
 
 DEFINE_STATIC_KEY_FALSE(psi_disabled);
@@ -822,6 +824,9 @@ void psi_memstall_enter(unsigned long *flags, enum memstall_types type)
 	*flags = current->flags & PF_MEMSTALL;
 	if (*flags)
 		return;
+
+	trace_psi_memstall_enter(type);
+
 	/*
 	 * PF_MEMSTALL setting & accounting needs to be atomic wrt
 	 * changes to the task's scheduling state, otherwise we can
@@ -852,6 +857,9 @@ void psi_memstall_leave(unsigned long *flags, enum memstall_types type)
 
 	if (*flags)
 		return;
+
+	trace_psi_memstall_leave(type);
+
 	/*
 	 * PF_MEMSTALL clearing & accounting needs to be atomic wrt
 	 * changes to the task's scheduling state, otherwise we could
-- 
1.8.3.1



* Re: [PATCH 0/2] psi: enhance psi with the help of ebpf
  2020-03-26 11:12 [PATCH 0/2] psi: enhance psi with the help of ebpf Yafang Shao
  2020-03-26 11:12 ` [PATCH 1/2] psi: introduce various types of memstall Yafang Shao
  2020-03-26 11:12 ` [PATCH 2/2] psi, tracepoint: introduce tracepoints for psi_memstall_{enter, leave} Yafang Shao
@ 2020-03-26 14:31 ` Johannes Weiner
  2020-03-27  1:17   ` Yafang Shao
  2 siblings, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2020-03-26 14:31 UTC (permalink / raw)
  To: Yafang Shao
  Cc: peterz, akpm, mhocko, axboe, mgorman, rostedt, mingo, linux-mm,
	linux-block, linux-kernel

On Thu, Mar 26, 2020 at 07:12:05AM -0400, Yafang Shao wrote:
> PSI gives us a powerful way to analyze memory pressure issues, but we
> can make it more powerful with the help of tracepoints, kprobes, eBPF
> and other tools. Especially with eBPF we can flexibly get more details
> of the memory pressure.
> 
> In order to achieve this goal, a new parameter is added into
> psi_memstall_{enter, leave}, which indicates the specific type of a
> memstall. There are ten memstall types so far,
>         MEMSTALL_KSWAPD
>         MEMSTALL_RECLAIM_DIRECT
>         MEMSTALL_RECLAIM_MEMCG
>         MEMSTALL_RECLAIM_HIGH
>         MEMSTALL_KCOMPACTD
>         MEMSTALL_COMPACT
>         MEMSTALL_WORKINGSET_REFAULT
>         MEMSTALL_WORKINGSET_THRASHING
>         MEMSTALL_MEMDELAY
>         MEMSTALL_SWAPIO

What does this provide over the events tracked in /proc/vmstat?

Can you elaborate a bit how you are using this information? It's not
quite clear to me from the example in patch #2.



* Re: [PATCH 0/2] psi: enhance psi with the help of ebpf
  2020-03-26 14:31 ` [PATCH 0/2] psi: enhance psi with the help of ebpf Johannes Weiner
@ 2020-03-27  1:17   ` Yafang Shao
  2020-03-31 15:11     ` Johannes Weiner
  0 siblings, 1 reply; 9+ messages in thread
From: Yafang Shao @ 2020-03-27  1:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Andrew Morton, Michal Hocko, Jens Axboe, mgorman,
	Steven Rostedt, mingo, Linux MM, linux-block, LKML

On Thu, Mar 26, 2020 at 10:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Mar 26, 2020 at 07:12:05AM -0400, Yafang Shao wrote:
> > PSI gives us a powerful way to analyze memory pressure issues, but we
> > can make it more powerful with the help of tracepoints, kprobes, eBPF
> > and other tools. Especially with eBPF we can flexibly get more details
> > of the memory pressure.
> >
> > In order to achieve this goal, a new parameter is added into
> > psi_memstall_{enter, leave}, which indicates the specific type of a
> > memstall. There are ten memstall types so far,
> >         MEMSTALL_KSWAPD
> >         MEMSTALL_RECLAIM_DIRECT
> >         MEMSTALL_RECLAIM_MEMCG
> >         MEMSTALL_RECLAIM_HIGH
> >         MEMSTALL_KCOMPACTD
> >         MEMSTALL_COMPACT
> >         MEMSTALL_WORKINGSET_REFAULT
> >         MEMSTALL_WORKINGSET_THRASHING
> >         MEMSTALL_MEMDELAY
> >         MEMSTALL_SWAPIO
>
> What does this provide over the events tracked in /proc/vmstat?
>

/proc/vmstat only tells us which events occurred, but it can't tell
us how long these events took.
Sometimes we really want to know how long an event takes, and PSI can
provide us that data.
For example, in the past days when I did performance tuning for a
database service, I observed that a latency spike was related to the
workingset_refault counter in /proc/vmstat, and at that time I really
wanted to know the spread of latencies caused by workingset refaults,
but there was no easy way to get it. Now, with the newly added
MEMSTALL_WORKINGSET_REFAULT, I can get the latencies caused by
workingset refaults.
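
A rough, untested bpftrace sketch for exactly that case, keyed to
workingset refaults only (type 6, MEMSTALL_WORKINGSET_REFAULT, in this
series):

tracepoint:sched:psi_memstall_enter
/args->type == 6/
{
        @start[tid] = nsecs;
}

tracepoint:sched:psi_memstall_leave
/@start[tid] && args->type == 6/
{
        /* spread of refault latencies, in microseconds */
        @refault_usecs[comm] = hist((nsecs - @start[tid]) / 1000);
        delete(@start[tid]);
}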

> Can you elaborate a bit how you are using this information? It's not
> quite clear to me from the example in patch #2.
>

From the traced data in patch #2, we can find that the high latencies
of user tasks are always memstall type 7, which is
MEMSTALL_WORKINGSET_THRASHING, and then we should look into the
details of the workingset of the user tasks and think about how to
improve it - for example, by reducing the workingset.

BTW, there's an error in the definition of show_psi_memstall_type()
in patch #2 (that's an old version); I will correct it.

To summarize, with the pressure data in /proc/pressure/memory we know
that the system is under memory pressure, and then with the newly
added tracing facility in this patchset we can get the reason for this
memory pressure and think about what change to make.
The workflow can be illustrated as below.

                          REASON         ACTION
                    |  compaction  |  look into the details of compaction  |
Memory pressure  -  |  vmscan      |  look into the details of vmscan      |
                    |  workingset  |  look into the details of workingset  |
                    |  etc         |  ...                                  |

Thanks

Yafang


* Re: [PATCH 0/2] psi: enhance psi with the help of ebpf
  2020-03-27  1:17   ` Yafang Shao
@ 2020-03-31 15:11     ` Johannes Weiner
  2020-04-01  1:22       ` Yafang Shao
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2020-03-31 15:11 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Peter Zijlstra, Andrew Morton, Michal Hocko, Jens Axboe, mgorman,
	Steven Rostedt, mingo, Linux MM, linux-block, LKML

On Fri, Mar 27, 2020 at 09:17:59AM +0800, Yafang Shao wrote:
> On Thu, Mar 26, 2020 at 10:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Mar 26, 2020 at 07:12:05AM -0400, Yafang Shao wrote:
> > > PSI gives us a powerful way to analyze memory pressure issues, but we
> > > can make it more powerful with the help of tracepoints, kprobes, eBPF
> > > and other tools. Especially with eBPF we can flexibly get more details
> > > of the memory pressure.
> > >
> > > In order to achieve this goal, a new parameter is added into
> > > psi_memstall_{enter, leave}, which indicates the specific type of a
> > > memstall. There are ten memstall types so far,
> > >         MEMSTALL_KSWAPD
> > >         MEMSTALL_RECLAIM_DIRECT
> > >         MEMSTALL_RECLAIM_MEMCG
> > >         MEMSTALL_RECLAIM_HIGH
> > >         MEMSTALL_KCOMPACTD
> > >         MEMSTALL_COMPACT
> > >         MEMSTALL_WORKINGSET_REFAULT
> > >         MEMSTALL_WORKINGSET_THRASHING
> > >         MEMSTALL_MEMDELAY
> > >         MEMSTALL_SWAPIO
> >
> > What does this provide over the events tracked in /proc/vmstat?
> >
> 
> /proc/vmstat only tells us which events occurred, but it can't tell
> us how long these events took.
> Sometimes we really want to know how long an event takes, and PSI can
> provide us that data.
> For example, in the past days when I did performance tuning for a
> database service, I observed that a latency spike was related to the
> workingset_refault counter in /proc/vmstat, and at that time I really
> wanted to know the spread of latencies caused by workingset refaults,
> but there was no easy way to get it. Now, with the newly added
> MEMSTALL_WORKINGSET_REFAULT, I can get the latencies caused by
> workingset refaults.

Okay, but how do you use that information in practice?

> > Can you elaborate a bit how you are using this information? It's not
> > quite clear to me from the example in patch #2.
> >
> 
> > From the traced data in patch #2, we can find that the high latencies
> > of user tasks are always memstall type 7, which is
> > MEMSTALL_WORKINGSET_THRASHING, and then we should look into the
> > details of the workingset of the user tasks and think about how to
> > improve it - for example, by reducing the workingset.

That's an analysis we run frequently as well: we see high pressure,
and then correlate it with the events.

High rate of refaults? The workingset is too big.

High rate of compaction work? Somebody is asking for higher order
pages under load; check THP events next.

etc.

This works fairly reliably. I'm curious what the extra per-event
latency breakdown would add and where it would be helpful.

I'm not really opposed to your patches if it is, I just don't see
the use case right now.


* Re: [PATCH 0/2] psi: enhance psi with the help of ebpf
  2020-03-31 15:11     ` Johannes Weiner
@ 2020-04-01  1:22       ` Yafang Shao
  2020-04-03 15:48         ` Johannes Weiner
  0 siblings, 1 reply; 9+ messages in thread
From: Yafang Shao @ 2020-04-01  1:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Andrew Morton, Michal Hocko, Jens Axboe, mgorman,
	Steven Rostedt, mingo, Linux MM, linux-block, LKML

On Tue, Mar 31, 2020 at 11:11 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Mar 27, 2020 at 09:17:59AM +0800, Yafang Shao wrote:
> > On Thu, Mar 26, 2020 at 10:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Thu, Mar 26, 2020 at 07:12:05AM -0400, Yafang Shao wrote:
> > > > PSI gives us a powerful way to analyze memory pressure issues, but we
> > > > can make it more powerful with the help of tracepoints, kprobes, eBPF
> > > > and other tools. Especially with eBPF we can flexibly get more details
> > > > of the memory pressure.
> > > >
> > > > In order to achieve this goal, a new parameter is added into
> > > > psi_memstall_{enter, leave}, which indicates the specific type of a
> > > > memstall. There are ten memstall types so far,
> > > >         MEMSTALL_KSWAPD
> > > >         MEMSTALL_RECLAIM_DIRECT
> > > >         MEMSTALL_RECLAIM_MEMCG
> > > >         MEMSTALL_RECLAIM_HIGH
> > > >         MEMSTALL_KCOMPACTD
> > > >         MEMSTALL_COMPACT
> > > >         MEMSTALL_WORKINGSET_REFAULT
> > > >         MEMSTALL_WORKINGSET_THRASHING
> > > >         MEMSTALL_MEMDELAY
> > > >         MEMSTALL_SWAPIO
> > >
> > > What does this provide over the events tracked in /proc/vmstat?
> > >
> >
> > /proc/vmstat only tells us which events occurred, but it can't tell
> > us how long these events took.
> > Sometimes we really want to know how long an event takes, and PSI can
> > provide us that data.
> > For example, in the past days when I did performance tuning for a
> > database service, I observed that a latency spike was related to the
> > workingset_refault counter in /proc/vmstat, and at that time I really
> > wanted to know the spread of latencies caused by workingset refaults,
> > but there was no easy way to get it. Now, with the newly added
> > MEMSTALL_WORKINGSET_REFAULT, I can get the latencies caused by
> > workingset refaults.
>
> Okay, but how do you use that information in practice?
>

With the newly added facility, we can know when these events occur
and how long each event takes. Then we can use these data to generate
a Latency Heat Map[1] and check whether these latencies match the
application latencies recorded in its log - for example the slow
query log in MySQL. If the refault latencies match the slow query
log, then these slow queries are caused by workingset refaults. If
the refault latencies don't match the slow query log, IOW they are
much smaller, then the slow queries aren't caused by workingset
refaults.

A high rate of refaults may not cause high pressure if the backing
device is fast enough, while the latencies of refaults have a direct
relationship with memory pressure.

[1]. http://www.brendangregg.com/HeatMaps/latency.html
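
A rough, untested sketch of collecting the per-interval histograms
behind such a heat map, assuming the tracepoints from patch #2; each
printed snapshot becomes one column of the map:

tracepoint:sched:psi_memstall_enter
{
        @start[tid] = nsecs;
}

tracepoint:sched:psi_memstall_leave
/@start[tid]/
{
        @usecs = hist((nsecs - @start[tid]) / 1000);
        delete(@start[tid]);
}

interval:s:1
{
        /* one histogram per second; print and reset */
        print(@usecs);
        clear(@usecs);
}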

> > > Can you elaborate a bit how you are using this information? It's not
> > > quite clear to me from the example in patch #2.
> > >
> >
> > > From the traced data in patch #2, we can find that the high latencies
> > > of user tasks are always memstall type 7, which is
> > > MEMSTALL_WORKINGSET_THRASHING, and then we should look into the
> > > details of the workingset of the user tasks and think about how to
> > > improve it - for example, by reducing the workingset.
>
> > That's an analysis we run frequently as well: we see high pressure,
> and then correlate it with the events.
>
> High rate of refaults? The workingset is too big.
>
> High rate of compaction work? Somebody is asking for higher order
> pages under load; check THP events next.
>
> etc.
>
> This works fairly reliably. I'm curious what the extra per-event
> latency breakdown would add and where it would be helpful.
>
> > I'm not really opposed to your patches if it is, I just don't see
> > the use case right now.
>

As I explained above, the rate of these events can't give us a full
view of the memory pressure. High memory pressure may not be caused
by a high rate of vmstat events; it can be caused by a low rate of
events with high latencies. Latencies are what the application really
cares about; that's why PSI is very useful for us.

Furthermore, there are some events that are not recorded in vmstat, e.g.

type of memstall                 vmstat event

MEMSTALL_KSWAPD                  pageoutrun, {pgscan, pgsteal}_kswapd
MEMSTALL_RECLAIM_DIRECT          {pgscan, pgsteal}_direct
MEMSTALL_RECLAIM_MEMCG           /* no event */
MEMSTALL_RECLAIM_HIGH            /* no event */
MEMSTALL_KCOMPACTD               compact_daemon_wake
MEMSTALL_COMPACT                 compact_{stall, fail, success}
MEMSTALL_WORKINGSET_REFAULT      workingset_refault
MEMSTALL_WORKINGSET_THRASHING    /* no event */
MEMSTALL_MEMDELAY                /* no event */
MEMSTALL_SWAPIO                  pswpin

After we add these types of memstall, we don't need to add these
missed events one by one.

Thanks
Yafang


* Re: [PATCH 0/2] psi: enhance psi with the help of ebpf
  2020-04-01  1:22       ` Yafang Shao
@ 2020-04-03 15:48         ` Johannes Weiner
  2020-04-04  2:54           ` Yafang Shao
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2020-04-03 15:48 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Peter Zijlstra, Andrew Morton, Michal Hocko, Jens Axboe, mgorman,
	Steven Rostedt, mingo, Linux MM, linux-block, LKML

On Wed, Apr 01, 2020 at 09:22:24AM +0800, Yafang Shao wrote:
> On Tue, Mar 31, 2020 at 11:11 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Fri, Mar 27, 2020 at 09:17:59AM +0800, Yafang Shao wrote:
> > > On Thu, Mar 26, 2020 at 10:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> With the newly added facility, we can know when these events occur
> and how long each event takes. Then we can use these data to generate
> a Latency Heat Map[1] and check whether these latencies match the
> application latencies recorded in its log - for example the slow
> query log in MySQL. If the refault latencies match the slow query
> log, then these slow queries are caused by workingset refaults. If
> the refault latencies don't match the slow query log, IOW they are
> much smaller, then the slow queries aren't caused by workingset
> refaults.

Okay, you want to use it much finer-grained to understand individual
end-to-end latencies for your services, rather than "the system is
melting down and I want to know why." That sounds valid to me.

> > > > Can you elaborate a bit how you are using this information? It's not
> > > > quite clear to me from the example in patch #2.
> > > >
> > >
> > > From the traced data in patch #2, we can find that the high latencies
> > > of user tasks are always memstall type 7, which is
> > > MEMSTALL_WORKINGSET_THRASHING, and then we should look into the
> > > details of the workingset of the user tasks and think about how to
> > > improve it - for example, by reducing the workingset.
> >
> > That's an analysis we run frequently as well: we see high pressure,
> > and then correlate it with the events.
> >
> > High rate of refaults? The workingset is too big.
> >
> > High rate of compaction work? Somebody is asking for higher order
> > pages under load; check THP events next.
> >
> > etc.
> >
> > This works fairly reliably. I'm curious what the extra per-event
> > latency breakdown would add and where it would be helpful.
> >
> > I'm not really opposed to your patches if it is, I just don't see
> > the use case right now.
> >
> 
> As I explained above, the rate of these events can't give us a full
> view of the memory pressure. High memory pressure may not be caused
> by a high rate of vmstat events; it can be caused by a low rate of
> events with high latencies. Latencies are what the application really
> cares about; that's why PSI is very useful for us.
> 
> Furthermore, there are some events that are not recorded in vmstat, e.g.
> 
> type of memstall                 vmstat event
>
> MEMSTALL_KSWAPD                  pageoutrun, {pgscan, pgsteal}_kswapd
> MEMSTALL_RECLAIM_DIRECT          {pgscan, pgsteal}_direct
> MEMSTALL_RECLAIM_MEMCG           /* no event */
> MEMSTALL_RECLAIM_HIGH            /* no event */
> MEMSTALL_KCOMPACTD               compact_daemon_wake
> MEMSTALL_COMPACT                 compact_{stall, fail, success}
> MEMSTALL_WORKINGSET_REFAULT      workingset_refault
> MEMSTALL_WORKINGSET_THRASHING    /* no event */
> MEMSTALL_MEMDELAY                /* no event */
> MEMSTALL_SWAPIO                  pswpin
> 
> After we add these types of memstall, we don't need to add these
> missed events one by one.

I'm a bit concerned about the maintainability of these things. It
makes moving code around harder, and it forces somebody who has no
interest in this debugging facility to think about the categories.

And naming them is hard even for somebody who does care. I'm not a fan
of MEMSTALL_MEMDELAY, for example, because it's way too
nondescript. The distinction between MEMSTALL_WORKINGSET_REFAULT and
MEMSTALL_WORKINGSET_THRASH is dubious as well.

These are recipes for bit rot and making the code harder to hack on.

I see two options to do this better: One is to use stack traces as
identifiers instead of a made-up type. The other is to use the calling
function as the id (see how kmalloc_track_caller() utilizes _RET_IP_).

bpftrace can use the stack as a map key. So this should already work
without any kernel modifications, using @start[tid, kstack]?
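
E.g. an untested sketch along those lines, keying the latency histogram
by the stack captured at enter time (assuming a bpftrace version that
can store kstack in a map):

kprobe:psi_memstall_enter
{
        @start[tid] = nsecs;
        @site[tid] = kstack;
}

kprobe:psi_memstall_leave
/@start[tid]/
{
        @time[comm, @site[tid]] = hist(nsecs - @start[tid]);
        delete(@start[tid]);
        delete(@site[tid]);
}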


* Re: [PATCH 0/2] psi: enhance psi with the help of ebpf
  2020-04-03 15:48         ` Johannes Weiner
@ 2020-04-04  2:54           ` Yafang Shao
  0 siblings, 0 replies; 9+ messages in thread
From: Yafang Shao @ 2020-04-04  2:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Andrew Morton, Michal Hocko, Jens Axboe, mgorman,
	Steven Rostedt, mingo, Linux MM, linux-block, LKML

On Fri, Apr 3, 2020 at 11:48 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Apr 01, 2020 at 09:22:24AM +0800, Yafang Shao wrote:
> > On Tue, Mar 31, 2020 at 11:11 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > On Fri, Mar 27, 2020 at 09:17:59AM +0800, Yafang Shao wrote:
> > > > On Thu, Mar 26, 2020 at 10:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > With the newly added facility, we can know when these events occur
> > and how long each event takes. Then we can use these data to generate
> > a Latency Heat Map[1] and check whether these latencies match the
> > application latencies recorded in its log - for example the slow
> > query log in MySQL. If the refault latencies match the slow query
> > log, then these slow queries are caused by workingset refaults. If
> > the refault latencies don't match the slow query log, IOW they are
> > much smaller, then the slow queries aren't caused by workingset
> > refaults.
>
> Okay, you want to use it much finer-grained to understand individual
> end-to-end latencies for your services, rather than "the system is
> melting down and I want to know why." That sounds valid to me.
>

Right, individual end-to-end latencies are very important for
latency-sensitive services.

> > > > > Can you elaborate a bit how you are using this information? It's not
> > > > > quite clear to me from the example in patch #2.
> > > > >
> > > >
> > > > From the traced data in patch #2, we can find that the high latencies
> > > > of user tasks are always memstall type 7, which is
> > > > MEMSTALL_WORKINGSET_THRASHING, and then we should look into the
> > > > details of the workingset of the user tasks and think about how to
> > > > improve it - for example, by reducing the workingset.
> > >
> > > That's an analysis we run frequently as well: we see high pressure,
> > > and then correlate it with the events.
> > >
> > > High rate of refaults? The workingset is too big.
> > >
> > > High rate of compaction work? Somebody is asking for higher order
> > > pages under load; check THP events next.
> > >
> > > etc.
> > >
> > > This works fairly reliably. I'm curious what the extra per-event
> > > latency breakdown would add and where it would be helpful.
> > >
> > > I'm not really opposed to your patches if it is, I just don't see
> > > the use case right now.
> > >
> >
> > As I explained above, the rate of these events can't give us a full
> > view of the memory pressure. High memory pressure may not be caused
> > by a high rate of vmstat events; it can be caused by a low rate of
> > events with high latencies. Latencies are what the application really
> > cares about; that's why PSI is very useful for us.
> >
> > Furthermore, there are some events that are not recorded in vmstat, e.g.
> >
> > type of memstall                 vmstat event
> >
> > MEMSTALL_KSWAPD                  pageoutrun, {pgscan, pgsteal}_kswapd
> > MEMSTALL_RECLAIM_DIRECT          {pgscan, pgsteal}_direct
> > MEMSTALL_RECLAIM_MEMCG           /* no event */
> > MEMSTALL_RECLAIM_HIGH            /* no event */
> > MEMSTALL_KCOMPACTD               compact_daemon_wake
> > MEMSTALL_COMPACT                 compact_{stall, fail, success}
> > MEMSTALL_WORKINGSET_REFAULT      workingset_refault
> > MEMSTALL_WORKINGSET_THRASHING    /* no event */
> > MEMSTALL_MEMDELAY                /* no event */
> > MEMSTALL_SWAPIO                  pswpin
> >
> > After we add these types of memstall, we don't need to add these
> > missed events one by one.
>
> I'm a bit concerned about the maintainability of these things. It
> makes moving code around harder, and it forces somebody who has no
> interest in this debugging facility to think about the categories.
>
> And naming them is hard even for somebody who does care. I'm not a fan
> of MEMSTALL_MEMDELAY, for example, because it's way too
> nondescript. The distinction between MEMSTALL_WORKINGSET_REFAULT and
> MEMSTALL_WORKINGSET_THRASH is dubious as well.
>

Agree with you that the naming is not good.

> These are recipes for bit rot and making the code harder to hack on.
>
> I see two options to do this better: One is to use stack traces as
> identifiers instead of a made-up type. The other is to use the calling
> function as the id (see how kmalloc_track_caller() utilizes _RET_IP_).
>
> bpftrace can use the stack as a map key. So this should already work
> without any kernel modifications, using @start[tid, kstack]?
>

If we don't make any kernel modifications, it is not easy to tell
whether psi_memstall_{enter, leave} is nested or not, and the nested
calls may add some noise. So the second option seems better; with
_RET_IP_ we can also get the caller.
How about adding tracepoints for psi_memstall_{enter, leave} as below?

@@ -904,7 +904,7 @@ void psi_memstall_enter(unsigned long *flags, enum memstall_types type)
 	if (*flags)
 		return;

+	trace_psi_memstall_enter(_RET_IP_);

@@ -944,7 +943,7 @@ void psi_memstall_leave(unsigned long *flags, enum memstall_types type)
 	if (*flags)
 		return;

+	trace_psi_memstall_leave(_RET_IP_);

Thanks
Yafang

