linux-kernel.vger.kernel.org archive mirror
From: Yafang Shao <laoar.shao@gmail.com>
To: hannes@cmpxchg.org, peterz@infradead.org, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, Yafang Shao <laoar.shao@gmail.com>
Subject: [PATCH v2 2/2] psi, tracepoint: introduce tracepoints for psi_memstall_{enter, leave}
Date: Tue, 31 Mar 2020 06:04:37 -0400
Message-ID: <1585649077-10896-3-git-send-email-laoar.shao@gmail.com>
In-Reply-To: <1585649077-10896-1-git-send-email-laoar.shao@gmail.com>

With the new parameter introduced in psi_memstall_{enter, leave} we can
get the specific type of memstall. To make it easier to use, introduce
tracepoints for these two functions. Once the tracepoints are added,
tools such as eBPF programs or bash scripts can easily collect and
analyze the memstall data.

The output of these tracepoints looks like this:

          usemem-30288 [012] .... 302479.734290: psi_memstall_enter: type=MEMSTALL_RECLAIM_DIRECT
          usemem-30288 [012] .N.. 302479.741186: psi_memstall_leave: type=MEMSTALL_RECLAIM_DIRECT
          usemem-30288 [021] .... 302479.742075: psi_memstall_enter: type=MEMSTALL_COMPACT
          usemem-30288 [021] .... 302479.744869: psi_memstall_leave: type=MEMSTALL_COMPACT
           <...>-388   [000] .... 302514.609040: psi_memstall_enter: type=MEMSTALL_KSWAPD
         kswapd0-388   [000] .... 302514.616376: psi_memstall_leave: type=MEMSTALL_KSWAPD
           <...>-223   [024] .... 302514.616380: psi_memstall_enter: type=MEMSTALL_KCOMPACTD
      kcompactd0-223   [024] .... 302514.618414: psi_memstall_leave: type=MEMSTALL_KCOMPACTD
   supervisorctl-31675 [014] .... 302516.281293: psi_memstall_enter: type=MEMSTALL_WORKINGSET_REFAULT
   supervisorctl-31675 [014] .N.. 302516.281314: psi_memstall_leave: type=MEMSTALL_WORKINGSET_REFAULT
            bash-32092 [034] .... 302526.225639: psi_memstall_enter: type=MEMSTALL_WORKINGSET_THRASH
            bash-32092 [034] .... 302526.225843: psi_memstall_leave: type=MEMSTALL_WORKINGSET_THRASH
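
As a rough illustration of the script-style analysis mentioned above (not
part of this patch), enter/leave lines can be paired up in userspace to get
per-stall durations. The sketch below is a hypothetical standalone C helper:
parse_memstall_line(), memstall_account() and both structs are names made up
for this example, and it assumes the default ftrace line layout shown above
(comm-pid, [cpu], flags, seconds.microseconds).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One parsed psi_memstall_{enter,leave} trace line (hypothetical type). */
struct memstall_event {
	int pid;
	long long ts_us;	/* timestamp in microseconds */
	int is_enter;		/* 1 for enter, 0 for leave */
	char type[64];		/* e.g. "MEMSTALL_COMPACT" */
};

/* One stall that has been entered but not yet left (hypothetical type). */
struct open_stall {
	int used;
	int pid;
	long long ts_us;
	char type[64];
};

static int starts_with(const char *s, const char *prefix)
{
	return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Parse one trace line. Returns 0 on success, -1 if it does not match. */
static int parse_memstall_line(const char *line, struct memstall_event *ev)
{
	const char *cpu, *name, *dash, *type;
	long long sec, usec;

	assert(line && ev);

	cpu = strchr(line, '[');
	name = strstr(line, "psi_memstall_");
	if (!cpu || !name || cpu > name)
		return -1;

	if (starts_with(name, "psi_memstall_enter:"))
		ev->is_enter = 1;
	else if (starts_with(name, "psi_memstall_leave:"))
		ev->is_enter = 0;
	else
		return -1;

	/* The PID follows the last '-' before the "[cpu]" column. */
	for (dash = cpu; dash > line && *dash != '-'; dash--)
		;
	if (*dash != '-')
		return -1;
	ev->pid = atoi(dash + 1);

	/* "[cpu] flags seconds.microseconds:" */
	if (sscanf(cpu, "[%*d] %*s %lld.%6lld", &sec, &usec) != 2)
		return -1;
	ev->ts_us = sec * 1000000LL + usec;

	type = strstr(name, "type=");
	if (!type || sscanf(type, "type=%63s", ev->type) != 1)
		return -1;
	return 0;
}

/*
 * Feed events in trace order. Returns the stall duration in microseconds
 * when ev is a leave that matches an earlier enter (same pid and type),
 * otherwise -1 (enter recorded, table full, or unmatched leave).
 */
static long long memstall_account(struct open_stall *tbl, size_t n,
				  const struct memstall_event *ev)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (ev->is_enter && !tbl[i].used) {
			tbl[i].used = 1;
			tbl[i].pid = ev->pid;
			tbl[i].ts_us = ev->ts_us;
			strcpy(tbl[i].type, ev->type);
			return -1;
		}
		if (!ev->is_enter && tbl[i].used && tbl[i].pid == ev->pid &&
		    !strcmp(tbl[i].type, ev->type)) {
			tbl[i].used = 0;
			return ev->ts_us - tbl[i].ts_us;
		}
	}
	return -1;
}
```

Fed the first two usemem lines above, memstall_account() reports a direct
reclaim stall of 6896 us. Events are matched by PID rather than by comm,
because the comm column can show up as <...> on the enter side, as in the
kswapd and kcompactd lines.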

Here is an example that uses bpftrace with these tracepoints to measure
the memstall latency of applications.

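// Record when each thread enters a memstall section, keyed by type.
tracepoint:sched:psi_memstall_enter
{
        @start[tid, args->type] = nsecs;
}

// Skip leave events whose enter was not seen (tracing started mid-stall).
tracepoint:sched:psi_memstall_leave
/@start[tid, args->type]/
{
        @time[comm, args->type] = hist(nsecs - @start[tid, args->type]);
        delete(@start[tid, args->type]);
}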

Below is part of the result after generating some memory pressure
(latencies are in nanoseconds).

@time[objdump, 7]:
[256K, 512K)           1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@time[objdump, 6]:
[8K, 16K)              2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@time[objcopy, 7]:
[16K, 32K)             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@time[ld, 7]:
[4M, 8M)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8M, 16M)              1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@time[khugepaged, 5]:
[4K, 8K)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)              0 |                                                    |
[16K, 32K)             0 |                                                    |
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              0 |                                                    |
[16M, 32M)             0 |                                                    |
[32M, 64M)             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@time[kswapd0, 0]:
[16K, 32K)             1 |@@@@@                                               |
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              1 |@@@@@                                               |
[16M, 32M)            10 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32M, 64M)             9 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[64M, 128M)            2 |@@@@@@@@@@                                          |
[128M, 256M)           2 |@@@@@@@@@@                                          |
[256M, 512M)           3 |@@@@@@@@@@@@@@@                                     |
[512M, 1G)             1 |@@@@@                                               |

@time[kswapd1, 0]:
[1M, 2M)               1 |@@@@                                                |
[2M, 4M)               2 |@@@@@@@@                                            |
[4M, 8M)               0 |                                                    |
[8M, 16M)             12 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16M, 32M)             7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
[32M, 64M)             5 |@@@@@@@@@@@@@@@@@@@@@                               |
[64M, 128M)            5 |@@@@@@@@@@@@@@@@@@@@@                               |
[128M, 256M)           3 |@@@@@@@@@@@@@                                       |
[256M, 512M)           1 |@@@@                                                |

@time[khugepaged, 1]:
[2M, 4M)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
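
For reading the histograms above: bpftrace's hist() groups values into
power-of-two buckets, so with nsecs as input "[4M, 8M)" means 2^22 to 2^23
nanoseconds, roughly 4.2 to 8.4 ms. A minimal sketch of that arithmetic in
C follows; hist_bucket_log2() and bucket_low_ms() are hypothetical helper
names used only for this illustration, and only positive values are
handled.

```c
#include <assert.h>
#include <stdint.h>

/* Index k of the power-of-two bucket [2^k, 2^(k+1)) that contains ns. */
static unsigned int hist_bucket_log2(uint64_t ns)
{
	unsigned int k = 0;

	assert(ns > 0);		/* zero and negative buckets are not covered */
	while (ns >>= 1)
		k++;
	return k;
}

/* Lower bound of bucket k, converted from nanoseconds to milliseconds. */
static double bucket_low_ms(unsigned int k)
{
	assert(k < 64);
	return (double)((uint64_t)1 << k) / 1e6;
}
```

For instance, a 5 ms stall (5000000 ns) lands in bucket 22, which is the
"[4M, 8M)" row, and the "[512M, 1G)" row of kswapd0 starts at about 537 ms.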

From this trace data we can see that the highest latencies of the user
tasks (objdump, objcopy, ld) are all of memstall type 7, which is
MEMSTALL_WORKINGSET_THRASH. The next step is to look into the working set
of these tasks and think about how to improve it, for example by reducing
the working set.
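
The numeric keys in the bpftrace output can be mapped back to names with a
small lookup table. The sketch below ASSUMES that the enum values start at 0
and follow the order of the show_psi_memstall_type() list in this patch; the
enum itself is defined in patch 1/2 and is not shown here, so check it
before relying on the table. The assumption is consistent with the output
above (kswapd is 0, WORKINGSET_THRASH is 7). The function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASSUMPTION: same order as show_psi_memstall_type(), starting at 0. */
static const char *const memstall_type_names[] = {
	"MEMSTALL_KSWAPD",
	"MEMSTALL_RECLAIM_DIRECT",
	"MEMSTALL_RECLAIM_MEMCG",
	"MEMSTALL_RECLAIM_HIGH",
	"MEMSTALL_KCOMPACTD",
	"MEMSTALL_COMPACT",
	"MEMSTALL_WORKINGSET_REFAULT",
	"MEMSTALL_WORKINGSET_THRASH",
	"MEMSTALL_MEMDELAY",
	"MEMSTALL_SWAPIO",
};

#define MEMSTALL_NR_NAMES \
	(sizeof(memstall_type_names) / sizeof(memstall_type_names[0]))

/* Numeric type (as printed by bpftrace) to name, or "UNKNOWN". */
static const char *memstall_type_name(int type)
{
	if (type < 0 || (size_t)type >= MEMSTALL_NR_NAMES)
		return "UNKNOWN";
	return memstall_type_names[type];
}

/* Name (as printed by the tracepoint) to numeric type, or -1. */
static int memstall_type_from_name(const char *name)
{
	size_t i;

	assert(name);
	for (i = 0; i < MEMSTALL_NR_NAMES; i++)
		if (!strcmp(name, memstall_type_names[i]))
			return (int)i;
	return -1;
}
```

Under that assumption, the @time[khugepaged, 5] entry above corresponds to
MEMSTALL_COMPACT and @time[objdump, 6] to MEMSTALL_WORKINGSET_REFAULT.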

With bpftrace's builtin variable 'cgroup' we can also restrict the
tracing to a memcg and its descendants.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/trace/events/sched.h | 41 ++++++++++++++++++++++++++++++++++++
 kernel/sched/psi.c           |  8 +++++++
 2 files changed, 49 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 420e80e56e55..8ea2cdf78810 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -7,8 +7,20 @@
 
 #include <linux/sched/numa_balancing.h>
 #include <linux/tracepoint.h>
+#include <linux/psi_types.h>
 #include <linux/binfmts.h>
 
+#define show_psi_memstall_type(type) __print_symbolic(type,		\
+	{MEMSTALL_KSWAPD, "MEMSTALL_KSWAPD"},				\
+	{MEMSTALL_RECLAIM_DIRECT, "MEMSTALL_RECLAIM_DIRECT"},		\
+	{MEMSTALL_RECLAIM_MEMCG, "MEMSTALL_RECLAIM_MEMCG"},		\
+	{MEMSTALL_RECLAIM_HIGH, "MEMSTALL_RECLAIM_HIGH"},		\
+	{MEMSTALL_KCOMPACTD, "MEMSTALL_KCOMPACTD"},			\
+	{MEMSTALL_COMPACT, "MEMSTALL_COMPACT"},				\
+	{MEMSTALL_WORKINGSET_REFAULT, "MEMSTALL_WORKINGSET_REFAULT"},	\
+	{MEMSTALL_WORKINGSET_THRASH, "MEMSTALL_WORKINGSET_THRASH"},	\
+	{MEMSTALL_MEMDELAY, "MEMSTALL_MEMDELAY"},			\
+	{MEMSTALL_SWAPIO, "MEMSTALL_SWAPIO"})
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
  */
@@ -625,6 +637,35 @@ DECLARE_TRACE(sched_overutilized_tp,
 	TP_PROTO(struct root_domain *rd, bool overutilized),
 	TP_ARGS(rd, overutilized));
 
+DECLARE_EVENT_CLASS(psi_memstall_template,
+
+	TP_PROTO(int type),
+
+	TP_ARGS(type),
+
+	TP_STRUCT__entry(
+		__field(int, type)
+	),
+
+	TP_fast_assign(
+		__entry->type = type;
+	),
+
+	TP_printk("type=%s",
+		show_psi_memstall_type(__entry->type))
+);
+
+DEFINE_EVENT(psi_memstall_template, psi_memstall_enter,
+	TP_PROTO(int type),
+	TP_ARGS(type)
+);
+
+DEFINE_EVENT(psi_memstall_template, psi_memstall_leave,
+	TP_PROTO(int type),
+	TP_ARGS(type)
+);
+
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 460f08436b58..4c5a40222e88 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -142,6 +142,8 @@
 #include <linux/psi.h>
 #include "sched.h"
 
+#include <trace/events/sched.h>
+
 static int psi_bug __read_mostly;
 
 DEFINE_STATIC_KEY_FALSE(psi_disabled);
@@ -822,6 +824,9 @@ void psi_memstall_enter(unsigned long *flags, enum memstall_types type)
 	*flags = current->flags & PF_MEMSTALL;
 	if (*flags)
 		return;
+
+	trace_psi_memstall_enter(type);
+
 	/*
 	 * PF_MEMSTALL setting & accounting needs to be atomic wrt
 	 * changes to the task's scheduling state, otherwise we can
@@ -852,6 +857,9 @@ void psi_memstall_leave(unsigned long *flags, enum memstall_types type)
 
 	if (*flags)
 		return;
+
+	trace_psi_memstall_leave(type);
+
 	/*
 	 * PF_MEMSTALL clearing & accounting needs to be atomic wrt
 	 * changes to the task's scheduling state, otherwise we could
-- 
2.18.2

