linux-mm.kvack.org archive mirror
* [PATCH v2 0/2] psi: enhance psi with the help of ebpf
@ 2020-03-31 10:04 Yafang Shao
  2020-03-31 10:04 ` [PATCH v2 1/2] psi: introduce various types of memstall Yafang Shao
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Yafang Shao @ 2020-03-31 10:04 UTC (permalink / raw)
  To: hannes, peterz, akpm; +Cc: linux-mm, linux-block, linux-kernel, Yafang Shao

PSI gives us a powerful way to analyze memory pressure issues, but we can
make it more powerful with the help of tracepoints, kprobes, eBPF, etc.
Especially with eBPF we can flexibly get more details of the memory
pressure.

In order to achieve this goal, a new parameter is added to
psi_memstall_{enter, leave} which indicates the specific type of a
memstall. There are ten memstall types for now:
        MEMSTALL_KSWAPD
        MEMSTALL_RECLAIM_DIRECT
        MEMSTALL_RECLAIM_MEMCG
        MEMSTALL_RECLAIM_HIGH
        MEMSTALL_KCOMPACTD
        MEMSTALL_COMPACT
        MEMSTALL_WORKINGSET_REFAULT
        MEMSTALL_WORKINGSET_THRASH
        MEMSTALL_MEMDELAY
        MEMSTALL_SWAPIO
By tracing this newly added argument with a kprobe or tracepoint we can
tell which type of memstall it is and then make the corresponding
improvement. It can also help us to analyze latency spikes caused by
memory pressure.
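
For example (just a sketch, assuming only patch #1 is applied so the
dedicated tracepoints from patch #2 are not in place yet), a bpftrace
one-liner on the kprobe can count stalls per type, where arg1 is the
newly added 'type' argument:

        bpftrace -e 'kprobe:psi_memstall_enter { @[arg1] = count(); }'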

But note that we can't use it to build memory pressure for a specific type
of memstall, e.g. memcg pressure, compaction pressure, etc., because it
doesn't implement various types of task->in_memstall, e.g.
task->in_memcgstall, task->in_compactionstall and so on.

Although there are already some tracepoints that can help us achieve this
goal, e.g.
        vmscan:mm_vmscan_kswapd_{wake, sleep}
        vmscan:mm_vmscan_direct_reclaim_{begin, end}
        vmscan:mm_vmscan_memcg_reclaim_{begin, end}
        /* no tracepoint for memcg high reclaim */
        compaction:mm_compaction_kcompactd_{wake, sleep}
        compaction:mm_compaction_{begin, end}
        /* no tracepoint for workingset refault */
        /* no tracepoint for workingset thrashing */
        /* no tracepoint for use_memdelay */
        /* no tracepoint for swapio */
psi_memstall_{enter, leave} gives us a unified entry point for all
types of memstall, so we don't need to add the many begin and end
tracepoints that haven't been implemented yet.

Patch #2 gives an example of how to use it with eBPF. With the help of
eBPF we can trace a specific task, application, container, etc. It can
also help us to analyze the spread of latencies and whether they were
clustered at a point in time or spread out over long periods of time.

To summarize, with the pressure data in /proc/pressure/memory we know that
the system is under memory pressure, and then with the newly added tracing
facility in this patchset we can find the reason for this memory pressure
and think about how to make the change.
The workflow can be illustrated as below.

		   REASON	  ACTION
		 | compaction	| improve compaction	|
		 | vmscan	| improve vmscan	|
Memory pressure -| workingset	| improve workingset	|
		 | etc		| ...			|
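
For instance, once the tracepoints in patch #2 are in place, a small
bpftrace script (only a sketch, not part of this patchset) can total
the stall time per type to point at the REASON above:

tracepoint:sched:psi_memstall_enter
{
        @s[tid] = nsecs;
}

tracepoint:sched:psi_memstall_leave
/@s[tid]/
{
        /* total stall time, keyed by memstall type */
        @stall_ns[args->type] = sum(nsecs - @s[tid]);
        delete(@s[tid]);
}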

Yafang Shao (2):
  psi: introduce various types of memstall
  psi, tracepoint: introduce tracepoints for psi_memstall_{enter, leave}

 block/blk-cgroup.c           |  4 ++--
 block/blk-core.c             |  4 ++--
 include/linux/psi.h          | 15 +++++++++++----
 include/linux/psi_types.h    | 13 +++++++++++++
 include/trace/events/sched.h | 41 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/psi.c           | 14 ++++++++++++--
 mm/compaction.c              |  4 ++--
 mm/filemap.c                 |  4 ++--
 mm/memcontrol.c              |  4 ++--
 mm/page_alloc.c              |  8 ++++----
 mm/page_io.c                 |  4 ++--
 mm/vmscan.c                  |  8 ++++----
 12 files changed, 97 insertions(+), 26 deletions(-)

-- 
2.18.2




* [PATCH v2 1/2] psi: introduce various types of memstall
  2020-03-31 10:04 [PATCH v2 0/2] psi: enhance psi with the help of ebpf Yafang Shao
@ 2020-03-31 10:04 ` Yafang Shao
  2020-03-31 10:04 ` [PATCH v2 2/2] psi, tracepoint: introduce tracepoints for psi_memstall_{enter, leave} Yafang Shao
  2020-07-15 16:36 ` [PATCH v2 0/2] psi: enhance psi with the help of ebpf Shakeel Butt
  2 siblings, 0 replies; 7+ messages in thread
From: Yafang Shao @ 2020-03-31 10:04 UTC (permalink / raw)
  To: hannes, peterz, akpm; +Cc: linux-mm, linux-block, linux-kernel, Yafang Shao

The memstall state is used as a memory pressure indicator now. But there
are many paths to get into memstall, so once a memstall happens we don't
know the specific reason for it.

This patch introduces various types of memstall as below,
	MEMSTALL_KSWAPD
	MEMSTALL_RECLAIM_DIRECT
	MEMSTALL_RECLAIM_MEMCG
	MEMSTALL_RECLAIM_HIGH
	MEMSTALL_KCOMPACTD
	MEMSTALL_COMPACT
	MEMSTALL_WORKINGSET_REFAULT
	MEMSTALL_WORKINGSET_THRASH
	MEMSTALL_MEMDELAY
	MEMSTALL_SWAPIO
and adds a new parameter 'type' to psi_memstall_{enter, leave}.

After that, we can trace specific types of memstall with other powerful
tools like tracepoints, kprobes, eBPF, etc. It can also help us to
analyze latency spikes caused by memory pressure. But note that we
can't use it to build memory pressure for a specific type of memstall,
e.g. memcg pressure, compaction pressure, etc., because it doesn't
implement various types of task->in_memstall, e.g. task->in_memcgstall,
task->in_compactionstall and so on. IOW, the main goal of it is to trace
the spread of latencies and the specific reasons for these latencies.

Although there are already some tracepoints that can help us achieve this
goal, e.g.
	vmscan:mm_vmscan_kswapd_{wake, sleep}
	vmscan:mm_vmscan_direct_reclaim_{begin, end}
	vmscan:mm_vmscan_memcg_reclaim_{begin, end}
	/* no tracepoint for memcg high reclaim */
	compaction:mm_compaction_kcompactd_{wake, sleep}
	compaction:mm_compaction_{begin, end}
	/* no tracepoint for workingset refault */
	/* no tracepoint for workingset thrashing */
	/* no tracepoint for use_memdelay */
	/* no tracepoint for swapio */
psi_memstall_{enter, leave} gives us a unified entry point for all
types of memstall, so we don't need to add the many begin and end
tracepoints that haven't been implemented yet.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 block/blk-cgroup.c        |  4 ++--
 block/blk-core.c          |  4 ++--
 include/linux/psi.h       | 15 +++++++++++----
 include/linux/psi_types.h | 13 +++++++++++++
 kernel/sched/psi.c        |  6 ++++--
 mm/compaction.c           |  4 ++--
 mm/filemap.c              |  4 ++--
 mm/memcontrol.c           |  4 ++--
 mm/page_alloc.c           |  8 ++++----
 mm/page_io.c              |  4 ++--
 mm/vmscan.c               |  8 ++++----
 11 files changed, 48 insertions(+), 26 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index a229b94d5390..fc24095c13c0 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1593,7 +1593,7 @@ static void blkcg_maybe_throttle_blkg(struct blkcg_gq *blkg, bool use_memdelay)
 	delay_nsec = min_t(u64, delay_nsec, 250 * NSEC_PER_MSEC);
 
 	if (use_memdelay)
-		psi_memstall_enter(&pflags);
+		psi_memstall_enter(&pflags, MEMSTALL_MEMDELAY);
 
 	exp = ktime_add_ns(now, delay_nsec);
 	tok = io_schedule_prepare();
@@ -1605,7 +1605,7 @@ static void blkcg_maybe_throttle_blkg(struct blkcg_gq *blkg, bool use_memdelay)
 	io_schedule_finish(tok);
 
 	if (use_memdelay)
-		psi_memstall_leave(&pflags);
+		psi_memstall_leave(&pflags, MEMSTALL_MEMDELAY);
 }
 
 /**
diff --git a/block/blk-core.c b/block/blk-core.c
index 60dc9552ef8d..e2039cf4719a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1190,12 +1190,12 @@ blk_qc_t submit_bio(struct bio *bio)
 	 * submission can be a significant part of overall IO time.
 	 */
 	if (workingset_read)
-		psi_memstall_enter(&pflags);
+		psi_memstall_enter(&pflags, MEMSTALL_WORKINGSET_REFAULT);
 
 	ret = generic_make_request(bio);
 
 	if (workingset_read)
-		psi_memstall_leave(&pflags);
+		psi_memstall_leave(&pflags, MEMSTALL_WORKINGSET_REFAULT);
 
 	return ret;
 }
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7b3de7321219..7bf94f6fb5e8 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -19,8 +19,8 @@ void psi_init(void);
 void psi_task_change(struct task_struct *task, int clear, int set);
 
 void psi_memstall_tick(struct task_struct *task, int cpu);
-void psi_memstall_enter(unsigned long *flags);
-void psi_memstall_leave(unsigned long *flags);
+void psi_memstall_enter(unsigned long *flags, enum memstall_types type);
+void psi_memstall_leave(unsigned long *flags, enum memstall_types type);
 
 int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
 
@@ -41,8 +41,15 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
 
 static inline void psi_init(void) {}
 
-static inline void psi_memstall_enter(unsigned long *flags) {}
-static inline void psi_memstall_leave(unsigned long *flags) {}
+static inline void psi_memstall_enter(unsigned long *flags,
+				      enum memstall_types type)
+{
+}
+
+static inline void psi_memstall_leave(unsigned long *flags,
+				      enum memstall_types type)
+{
+}
 
 #ifdef CONFIG_CGROUPS
 static inline int psi_cgroup_alloc(struct cgroup *cgrp)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 07aaf9b82241..48ebb51484f9 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -7,6 +7,19 @@
 #include <linux/kref.h>
 #include <linux/wait.h>
 
+enum memstall_types {
+	MEMSTALL_KSWAPD,
+	MEMSTALL_RECLAIM_DIRECT,
+	MEMSTALL_RECLAIM_MEMCG,
+	MEMSTALL_RECLAIM_HIGH,
+	MEMSTALL_KCOMPACTD,
+	MEMSTALL_COMPACT,
+	MEMSTALL_WORKINGSET_REFAULT,
+	MEMSTALL_WORKINGSET_THRASH,
+	MEMSTALL_MEMDELAY,
+	MEMSTALL_SWAPIO,
+};
+
 #ifdef CONFIG_PSI
 
 /* Tracked task states */
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 028520702717..460f08436b58 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -806,11 +806,12 @@ void psi_memstall_tick(struct task_struct *task, int cpu)
 /**
  * psi_memstall_enter - mark the beginning of a memory stall section
  * @flags: flags to handle nested sections
+ * @type: type of memstall
  *
  * Marks the calling task as being stalled due to a lack of memory,
  * such as waiting for a refault or performing reclaim.
  */
-void psi_memstall_enter(unsigned long *flags)
+void psi_memstall_enter(unsigned long *flags, enum memstall_types type)
 {
 	struct rq_flags rf;
 	struct rq *rq;
@@ -837,10 +838,11 @@ void psi_memstall_enter(unsigned long *flags)
 /**
  * psi_memstall_leave - mark the end of an memory stall section
  * @flags: flags to handle nested memdelay sections
+ * @type: type of memstall
  *
  * Marks the calling task as no longer stalled due to lack of memory.
  */
-void psi_memstall_leave(unsigned long *flags)
+void psi_memstall_leave(unsigned long *flags, enum memstall_types type)
 {
 	struct rq_flags rf;
 	struct rq *rq;
diff --git a/mm/compaction.c b/mm/compaction.c
index 672d3c78c6ab..c0d533192974 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2647,9 +2647,9 @@ static int kcompactd(void *p)
 		wait_event_freezable(pgdat->kcompactd_wait,
 				kcompactd_work_requested(pgdat));
 
-		psi_memstall_enter(&pflags);
+		psi_memstall_enter(&pflags, MEMSTALL_KCOMPACTD);
 		kcompactd_do_work(pgdat);
-		psi_memstall_leave(&pflags);
+		psi_memstall_leave(&pflags, MEMSTALL_KCOMPACTD);
 	}
 
 	return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index 1784478270e1..f5459e3850ef 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1123,7 +1123,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 			delayacct_thrashing_start();
 			delayacct = true;
 		}
-		psi_memstall_enter(&pflags);
+		psi_memstall_enter(&pflags, MEMSTALL_WORKINGSET_THRASH);
 		thrashing = true;
 	}
 
@@ -1182,7 +1182,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	if (thrashing) {
 		if (delayacct)
 			delayacct_thrashing_end();
-		psi_memstall_leave(&pflags);
+		psi_memstall_leave(&pflags, MEMSTALL_WORKINGSET_THRASH);
 	}
 
 	/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7a4bd8b9adc2..a9b336ea7fe5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2399,9 +2399,9 @@ void mem_cgroup_handle_over_high(void)
 	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
 	 * need to account for any ill-begotten jiffies to pay them off later.
 	 */
-	psi_memstall_enter(&pflags);
+	psi_memstall_enter(&pflags, MEMSTALL_RECLAIM_HIGH);
 	schedule_timeout_killable(penalty_jiffies);
-	psi_memstall_leave(&pflags);
+	psi_memstall_leave(&pflags, MEMSTALL_RECLAIM_HIGH);
 
 out:
 	css_put(&memcg->css);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..8789234a2fca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3884,14 +3884,14 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	if (!order)
 		return NULL;
 
-	psi_memstall_enter(&pflags);
+	psi_memstall_enter(&pflags, MEMSTALL_COMPACT);
 	noreclaim_flag = memalloc_noreclaim_save();
 
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
 								prio, &page);
 
 	memalloc_noreclaim_restore(noreclaim_flag);
-	psi_memstall_leave(&pflags);
+	psi_memstall_leave(&pflags, MEMSTALL_COMPACT);
 
 	/*
 	 * At least in one zone compaction wasn't deferred or skipped, so let's
@@ -4106,7 +4106,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
-	psi_memstall_enter(&pflags);
+	psi_memstall_enter(&pflags, MEMSTALL_RECLAIM_DIRECT);
 	fs_reclaim_acquire(gfp_mask);
 	noreclaim_flag = memalloc_noreclaim_save();
 
@@ -4115,7 +4115,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	memalloc_noreclaim_restore(noreclaim_flag);
 	fs_reclaim_release(gfp_mask);
-	psi_memstall_leave(&pflags);
+	psi_memstall_leave(&pflags, MEMSTALL_RECLAIM_DIRECT);
 
 	cond_resched();
 
diff --git a/mm/page_io.c b/mm/page_io.c
index 76965be1d40e..67de6b1801a4 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -369,7 +369,7 @@ int swap_readpage(struct page *page, bool synchronous)
 	 * or the submitting cgroup IO-throttled, submission can be a
 	 * significant part of overall IO time.
 	 */
-	psi_memstall_enter(&pflags);
+	psi_memstall_enter(&pflags, MEMSTALL_SWAPIO);
 
 	if (frontswap_load(page) == 0) {
 		SetPageUptodate(page);
@@ -431,7 +431,7 @@ int swap_readpage(struct page *page, bool synchronous)
 	bio_put(bio);
 
 out:
-	psi_memstall_leave(&pflags);
+	psi_memstall_leave(&pflags, MEMSTALL_SWAPIO);
 	return ret;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 876370565455..4445c1dd9551 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3352,13 +3352,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask);
 
-	psi_memstall_enter(&pflags);
+	psi_memstall_enter(&pflags, MEMSTALL_RECLAIM_MEMCG);
 	noreclaim_flag = memalloc_noreclaim_save();
 
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	memalloc_noreclaim_restore(noreclaim_flag);
-	psi_memstall_leave(&pflags);
+	psi_memstall_leave(&pflags, MEMSTALL_RECLAIM_MEMCG);
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 	set_task_reclaim_state(current, NULL);
@@ -3568,7 +3568,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	};
 
 	set_task_reclaim_state(current, &sc.reclaim_state);
-	psi_memstall_enter(&pflags);
+	psi_memstall_enter(&pflags, MEMSTALL_KSWAPD);
 	__fs_reclaim_acquire();
 
 	count_vm_event(PAGEOUTRUN);
@@ -3747,7 +3747,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 	snapshot_refaults(NULL, pgdat);
 	__fs_reclaim_release();
-	psi_memstall_leave(&pflags);
+	psi_memstall_leave(&pflags, MEMSTALL_KSWAPD);
 	set_task_reclaim_state(current, NULL);
 
 	/*
-- 
2.18.2




* [PATCH v2 2/2] psi, tracepoint: introduce tracepoints for psi_memstall_{enter, leave}
  2020-03-31 10:04 [PATCH v2 0/2] psi: enhance psi with the help of ebpf Yafang Shao
  2020-03-31 10:04 ` [PATCH v2 1/2] psi: introduce various types of memstall Yafang Shao
@ 2020-03-31 10:04 ` Yafang Shao
  2020-07-15 16:36 ` [PATCH v2 0/2] psi: enhance psi with the help of ebpf Shakeel Butt
  2 siblings, 0 replies; 7+ messages in thread
From: Yafang Shao @ 2020-03-31 10:04 UTC (permalink / raw)
  To: hannes, peterz, akpm; +Cc: linux-mm, linux-block, linux-kernel, Yafang Shao

With the new parameter introduced in psi_memstall_{enter, leave} we can
get the specific type of a memstall. To make it easier to use, introduce
tracepoints for these two helpers. Once the tracepoints are added we can
easily use other tools like eBPF or a bash script to collect and analyze
the memstall data.

The output of these tracepoints looks like this:

          usemem-30288 [012] .... 302479.734290: psi_memstall_enter: type=MEMSTALL_RECLAIM_DIRECT
          usemem-30288 [012] .N.. 302479.741186: psi_memstall_leave: type=MEMSTALL_RECLAIM_DIRECT
          usemem-30288 [021] .... 302479.742075: psi_memstall_enter: type=MEMSTALL_COMPACT
          usemem-30288 [021] .... 302479.744869: psi_memstall_leave: type=MEMSTALL_COMPACT
           <...>-388   [000] .... 302514.609040: psi_memstall_enter: type=MEMSTALL_KSWAPD
         kswapd0-388   [000] .... 302514.616376: psi_memstall_leave: type=MEMSTALL_KSWAPD
           <...>-223   [024] .... 302514.616380: psi_memstall_enter: type=MEMSTALL_KCOMPACTD
      kcompactd0-223   [024] .... 302514.618414: psi_memstall_leave: type=MEMSTALL_KCOMPACTD
   supervisorctl-31675 [014] .... 302516.281293: psi_memstall_enter: type=MEMSTALL_WORKINGSET_REFAULT
   supervisorctl-31675 [014] .N.. 302516.281314: psi_memstall_leave: type=MEMSTALL_WORKINGSET_REFAULT
            bash-32092 [034] .... 302526.225639: psi_memstall_enter: type=MEMSTALL_WORKINGSET_THRASH
            bash-32092 [034] .... 302526.225843: psi_memstall_leave: type=MEMSTALL_WORKINGSET_THRASH

Here's an example of using bpftrace with these tracepoints to measure an
application's latency.

tracepoint:sched:psi_memstall_enter
{
        @start[tid, args->type] = nsecs;
}

tracepoint:sched:psi_memstall_leave
/@start[tid, args->type]/
{
        /* the filter above skips a leave without a matching traced enter */
        @time[comm, args->type] = hist(nsecs - @start[tid, args->type]);
        delete(@start[tid, args->type]);
}

Below is part of the result after producing some memory pressure.
@time[objdump, 7]:
[256K, 512K)           1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@time[objdump, 6]:
[8K, 16K)              2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@time[objcopy, 7]:
[16K, 32K)             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@time[ld, 7]:
[4M, 8M)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8M, 16M)              1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@time[khugepaged, 5]:
[4K, 8K)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)              0 |                                                    |
[16K, 32K)             0 |                                                    |
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              0 |                                                    |
[16M, 32M)             0 |                                                    |
[32M, 64M)             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@time[kswapd0, 0]:
[16K, 32K)             1 |@@@@@                                               |
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              1 |@@@@@                                               |
[16M, 32M)            10 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32M, 64M)             9 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[64M, 128M)            2 |@@@@@@@@@@                                          |
[128M, 256M)           2 |@@@@@@@@@@                                          |
[256M, 512M)           3 |@@@@@@@@@@@@@@@                                     |
[512M, 1G)             1 |@@@@@                                               |

@time[kswapd1, 0]:
[1M, 2M)               1 |@@@@                                                |
[2M, 4M)               2 |@@@@@@@@                                            |
[4M, 8M)               0 |                                                    |
[8M, 16M)             12 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16M, 32M)             7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
[32M, 64M)             5 |@@@@@@@@@@@@@@@@@@@@@                               |
[64M, 128M)            5 |@@@@@@@@@@@@@@@@@@@@@                               |
[128M, 256M)           3 |@@@@@@@@@@@@@                                       |
[256M, 512M)           1 |@@@@                                                |

@time[khugepaged, 1]:
[2M, 4M)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

From the traced data we can find that the high latencies of user tasks
are always memstall type 7, which is MEMSTALL_WORKINGSET_THRASH, so we
should look into the details of the user tasks' workingset and think
about how to improve it - for example by reducing the workingset.

With the builtin variable 'cgroup' of bpftrace we can also filter on a
memcg and its descendants, as sketched below.
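
For example (a sketch - the cgroup path below is hypothetical, and this
exact-match filter covers only the one cgroup, so matching descendants
would need extra logic):

tracepoint:sched:psi_memstall_enter
/cgroup == cgroupid("/sys/fs/cgroup/unified/mycontainer")/
{
        @start[tid, args->type] = nsecs;
}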

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/trace/events/sched.h | 41 ++++++++++++++++++++++++++++++++++++
 kernel/sched/psi.c           |  8 +++++++
 2 files changed, 49 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 420e80e56e55..8ea2cdf78810 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -7,8 +7,20 @@
 
 #include <linux/sched/numa_balancing.h>
 #include <linux/tracepoint.h>
+#include <linux/psi_types.h>
 #include <linux/binfmts.h>
 
+#define show_psi_memstall_type(type) __print_symbolic(type,		\
+	{MEMSTALL_KSWAPD, "MEMSTALL_KSWAPD"},				\
+	{MEMSTALL_RECLAIM_DIRECT, "MEMSTALL_RECLAIM_DIRECT"},		\
+	{MEMSTALL_RECLAIM_MEMCG, "MEMSTALL_RECLAIM_MEMCG"},		\
+	{MEMSTALL_RECLAIM_HIGH, "MEMSTALL_RECLAIM_HIGH"},		\
+	{MEMSTALL_KCOMPACTD, "MEMSTALL_KCOMPACTD"},			\
+	{MEMSTALL_COMPACT, "MEMSTALL_COMPACT"},				\
+	{MEMSTALL_WORKINGSET_REFAULT, "MEMSTALL_WORKINGSET_REFAULT"},	\
+	{MEMSTALL_WORKINGSET_THRASH, "MEMSTALL_WORKINGSET_THRASH"},	\
+	{MEMSTALL_MEMDELAY, "MEMSTALL_MEMDELAY"},			\
+	{MEMSTALL_SWAPIO, "MEMSTALL_SWAPIO"})
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
  */
@@ -625,6 +637,35 @@ DECLARE_TRACE(sched_overutilized_tp,
 	TP_PROTO(struct root_domain *rd, bool overutilized),
 	TP_ARGS(rd, overutilized));
 
+DECLARE_EVENT_CLASS(psi_memstall_template,
+
+	TP_PROTO(int type),
+
+	TP_ARGS(type),
+
+	TP_STRUCT__entry(
+		__field(int, type)
+	),
+
+	TP_fast_assign(
+		__entry->type = type;
+	),
+
+	TP_printk("type=%s",
+		show_psi_memstall_type(__entry->type))
+);
+
+DEFINE_EVENT(psi_memstall_template, psi_memstall_enter,
+	TP_PROTO(int type),
+	TP_ARGS(type)
+);
+
+DEFINE_EVENT(psi_memstall_template, psi_memstall_leave,
+	TP_PROTO(int type),
+	TP_ARGS(type)
+);
+
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 460f08436b58..4c5a40222e88 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -142,6 +142,8 @@
 #include <linux/psi.h>
 #include "sched.h"
 
+#include <trace/events/sched.h>
+
 static int psi_bug __read_mostly;
 
 DEFINE_STATIC_KEY_FALSE(psi_disabled);
@@ -822,6 +824,9 @@ void psi_memstall_enter(unsigned long *flags, enum memstall_types type)
 	*flags = current->flags & PF_MEMSTALL;
 	if (*flags)
 		return;
+
+	trace_psi_memstall_enter(type);
+
 	/*
 	 * PF_MEMSTALL setting & accounting needs to be atomic wrt
 	 * changes to the task's scheduling state, otherwise we can
@@ -852,6 +857,9 @@ void psi_memstall_leave(unsigned long *flags, enum memstall_types type)
 
 	if (*flags)
 		return;
+
+	trace_psi_memstall_leave(type);
+
 	/*
 	 * PF_MEMSTALL clearing & accounting needs to be atomic wrt
 	 * changes to the task's scheduling state, otherwise we could
-- 
2.18.2




* Re: [PATCH v2 0/2] psi: enhance psi with the help of ebpf
  2020-03-31 10:04 [PATCH v2 0/2] psi: enhance psi with the help of ebpf Yafang Shao
  2020-03-31 10:04 ` [PATCH v2 1/2] psi: introduce various types of memstall Yafang Shao
  2020-03-31 10:04 ` [PATCH v2 2/2] psi, tracepoint: introduce tracepoints for psi_memstall_{enter, leave} Yafang Shao
@ 2020-07-15 16:36 ` Shakeel Butt
  2020-07-16  3:18   ` Yafang Shao
  2 siblings, 1 reply; 7+ messages in thread
From: Shakeel Butt @ 2020-07-15 16:36 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Johannes Weiner, Peter Zijlstra (Intel),
	Andrew Morton, Linux MM, open list:BLOCK LAYER, LKML

Hi Yafang,

On Tue, Mar 31, 2020 at 3:05 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> [...]
>

I have not looked at the patch series in detail but I wanted to get
your thoughts on whether it is possible to achieve what I am trying to
do with this patch series.

At the moment I am only interested in global reclaim, and I wanted to
enable alerts like "alert if there is a process stuck in global reclaim
for x seconds in the last y seconds window" or "alert if all the
processes are stuck in global reclaim for some z seconds".

I see that using this series I can identify global reclaim, but I am
wondering if alerts or notifications are possible. Android is using psi
monitors for such alerts but it does not use cgroups, so most of the
memstalls are related to global reclaim stalls. For a cgroup environment,
do we need to add support to the psi monitor similar to this patch
series?

thanks,
Shakeel



* Re: [PATCH v2 0/2] psi: enhance psi with the help of ebpf
  2020-07-15 16:36 ` [PATCH v2 0/2] psi: enhance psi with the help of ebpf Shakeel Butt
@ 2020-07-16  3:18   ` Yafang Shao
  2020-07-16 17:04     ` Shakeel Butt
  0 siblings, 1 reply; 7+ messages in thread
From: Yafang Shao @ 2020-07-16  3:18 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Peter Zijlstra (Intel),
	Andrew Morton, Linux MM, open list:BLOCK LAYER, LKML

On Thu, Jul 16, 2020 at 12:36 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> Hi Yafang,
>
> On Tue, Mar 31, 2020 at 3:05 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > [...]
>
> I have not looked at the patch series in detail but I wanted to get
> your thoughts on whether it is possible to achieve what I am trying to
> do with this patch series.
>
> At the moment I am only interested in global reclaim, and I wanted to
> enable alerts like "alert if there is a process stuck in global reclaim
> for x seconds in the last y seconds window" or "alert if all the
> processes are stuck in global reclaim for some z seconds".
>
> I see that using this series I can identify global reclaim, but I am
> wondering if alerts or notifications are possible. Android is using psi
> monitors for such alerts but it does not use cgroups, so most of the
> memstalls are related to global reclaim stalls. For a cgroup environment,
> do we need to add support to the psi monitor similar to this patch
> series?
>

Hi Shakeel,

We use the PSI tracepoints in our kernel to analyze the individual
latency caused by memory pressure, but our PSI tracepoints are
implemented in a different form, as below:
    trace_psi_memstall_enter(_RET_IP_);
    trace_psi_memstall_leave(_RET_IP_);
We then use the _RET_IP_ to identify the specific PSI type.

If the _RET_IP_ is in try_to_free_mem_cgroup_pages(), then it means the
pressure is caused by the memory cgroup, IOW, the limit of the memcg has
been reached and it has to do memcg reclaim. Otherwise we can consider
it global memory pressure.
try_to_free_mem_cgroup_pages
    psi_memstall_enter
        if (static_branch_likely(&psi_disabled))
            return;
        *flags = current->in_memstall;
         if (*flags)
             return;
         trace_psi_memstall_enter(_RET_IP_);  <<<<< memcg pressure
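
For reference, a minimal bpftrace sketch on top of such tracepoints
could aggregate stalls by call site - assuming the out-of-tree
tracepoints expose the _RET_IP_ value as an 'ip' field, which is not
part of this patchset:

tracepoint:sched:psi_memstall_enter
{
        /* resolve the caller's address to a kernel symbol, e.g.
         * try_to_free_mem_cgroup_pages for memcg reclaim
         */
        @stalls[ksym(args->ip)] = count();
}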


-- 
Thanks
Yafang



* Re: [PATCH v2 0/2] psi: enhance psi with the help of ebpf
  2020-07-16  3:18   ` Yafang Shao
@ 2020-07-16 17:04     ` Shakeel Butt
  2020-07-17  1:43       ` Yafang Shao
  0 siblings, 1 reply; 7+ messages in thread
From: Shakeel Butt @ 2020-07-16 17:04 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Johannes Weiner, Peter Zijlstra (Intel),
	Andrew Morton, Linux MM, open list:BLOCK LAYER, LKML

On Wed, Jul 15, 2020 at 8:19 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, Jul 16, 2020 at 12:36 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > [...]
>
> Hi Shakeel,
>
> We use the PSI tracepoints in our kernel to analyze the individual
> latency caused by memory pressure, but our PSI tracepoints are
> implemented in a different form, as below:
>     trace_psi_memstall_enter(_RET_IP_);
>     trace_psi_memstall_leave(_RET_IP_);
> We then use the _RET_IP_ to identify the specific PSI type.
>
> If the _RET_IP_ is in try_to_free_mem_cgroup_pages(), then it means the
> pressure is caused by the memory cgroup, IOW, the limit of the memcg has
> been reached and it has to do memcg reclaim. Otherwise we can consider
> it global memory pressure.
> try_to_free_mem_cgroup_pages
>     psi_memstall_enter
>         if (static_branch_likely(&psi_disabled))
>             return;
>         *flags = current->in_memstall;
>          if (*flags)
>              return;
>          trace_psi_memstall_enter(_RET_IP_);  <<<<< memcg pressure
>

Thanks for the response. I am looking for 'always on' monitoring, more
specifically defining system-level SLIs based on PSI. My concern with
ftrace is its global shared state, and also that it is not really meant
for 'always on' monitoring. You mentioned eBPF. Is eBPF fine for
'always on' monitoring, and is it possible to notify user space from
eBPF on specific conditions (e.g. a process stuck in global reclaim for
60 seconds)?

thanks,
Shakeel



* Re: [PATCH v2 0/2] psi: enhance psi with the help of ebpf
  2020-07-16 17:04     ` Shakeel Butt
@ 2020-07-17  1:43       ` Yafang Shao
  0 siblings, 0 replies; 7+ messages in thread
From: Yafang Shao @ 2020-07-17  1:43 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Peter Zijlstra (Intel),
	Andrew Morton, Linux MM, open list:BLOCK LAYER, LKML

On Fri, Jul 17, 2020 at 1:04 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Jul 15, 2020 at 8:19 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > [...]
>
>
> Thanks for the response. I am looking for 'always on' monitoring. More
> specifically defining the system level SLIs based on PSI. My concern
> with ftrace is its global shared state and also it is not really for
> 'always on' monitoring. You have mentioned ebpf. Is ebpf fine for
> 'always on' monitoring and is it possible to notify user space by ebpf
> on specific conditions (e.g. a process stuck in global reclaim for 60
> seconds)?
>

eBPF is fine for 'always on' monitoring in my experience, but I'm not
sure whether it is possible to notify user space on specific conditions.
Notifying user space would be a useful feature, so I think we can give
it a try.
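
For illustration only, a rough bpftrace sketch of such a check (assuming
the tracepoints from patch #2, with MEMSTALL_RECLAIM_DIRECT == 1; note it
only reports once the task leaves the stall, and a user-space agent
reading its output would turn the message into an alert):

tracepoint:sched:psi_memstall_enter
/args->type == 1/
{
        @start[tid] = nsecs;
}

tracepoint:sched:psi_memstall_leave
/@start[tid]/
{
        $stall = nsecs - @start[tid];
        /* the 60s threshold is just the example condition above */
        if ($stall >= 60 * 1000000000) {
                printf("%s (pid %d) stalled in direct reclaim for %d ms\n",
                       comm, pid, $stall / 1000000);
        }
        delete(@start[tid]);
}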


-- 
Thanks
Yafang


