linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] perf c2c: Add report option to show false sharing in adjacent cachelines
@ 2023-02-13  3:17 Feng Tang
  2023-02-13  8:02 ` Leo Yan
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Feng Tang @ 2023-02-13  3:17 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joe Mario, linux-perf-users, linux-kernel
  Cc: Andi Kleen, Kan Liang, Xing Zhengjun, Feng Tang

Many platforms have feature of adjacent cachelines prefetch, when it
is enabled, for data in RAM of 2 cachelines (2N and 2N+1) granularity,
if one is fetched to cache, the other one could likely be fetched too,
which sort of extends the cacheline size to double, thus the false
sharing could happens in adjacent cachelines.

0Day has captured performance changed related with this [1], and some
commercial software explicitly makes its hot global variables 128 bytes
aligned (2 cache lines) to avoid this kind of extended false sharing.

So add an option "-a" or "--double-cl" for c2c report to show false
sharing in double cache line granularity, which acts just like the
cacheline size is doubled. There is no change to c2c record. The
hardware HITM events are still per cacheline. The option just changes
the granularity of how events are grouped and displayed.

In the c2c report below (will-it-scale's pagefault2 case on old kernel):

  ----------------------------------------------------------------------
     26       31        2        0        0        0  0xffff888103ec6000
  ----------------------------------------------------------------------
   35.48%   50.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff8133148b      1153        66       971     3748        74  [k] get_mem_cgroup_from_mm
    6.45%    0.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff813396e4       570         0      1531      879        75  [k] mem_cgroup_charge
   25.81%   50.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81331472       949        70       593     3359        74  [k] get_mem_cgroup_from_mm
   19.35%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81339686      1352         0      1073     1022        74  [k] mem_cgroup_charge
    9.68%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff813396d6      1401         0       863      768        74  [k] mem_cgroup_charge
    3.23%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81333106       618         0       804       11         9  [k] uncharge_batch

The offset 0x10 and 0x54 used to displayed in 2 groups, and now they
are listed together to give users a hint.

[1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/

Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/Documentation/perf-c2c.txt |  6 ++++++
 tools/perf/builtin-c2c.c              | 22 +++++++++++++---------
 tools/perf/util/cacheline.h           | 25 ++++++++++++++++++++-----
 tools/perf/util/sort.c                | 13 ++++++++++---
 tools/perf/util/sort.h                |  1 +
 5 files changed, 50 insertions(+), 17 deletions(-)

diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
index 5c5eb2def83e..a8e1e40d270e 100644
--- a/tools/perf/Documentation/perf-c2c.txt
+++ b/tools/perf/Documentation/perf-c2c.txt
@@ -126,6 +126,12 @@ REPORT OPTIONS
 	The known limitations include exception handing such as
 	setjmp/longjmp will have calls/returns not match.
 
+-a::
+--double-cl::
+	Group HITM events in double cacheline granularity. Some architecture
+	has Adjacent Cacheline Prefetch feature, which behaves like the
+	cachline size is doubled.
+
 C2C RECORD
 ----------
 The perf c2c record command setup options related to HITM cacheline analysis
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 52d94c7dd836..7d495db7e5a2 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -524,7 +524,7 @@ static int dcacheline_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
 	char buf[20];
 
 	if (he->mem_info)
-		addr = cl_address(he->mem_info->daddr.addr);
+		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
 
 	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
 }
@@ -562,7 +562,7 @@ static int offset_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
 	char buf[20];
 
 	if (he->mem_info)
-		addr = cl_offset(he->mem_info->daddr.al_addr);
+		addr = cl_offset(he->mem_info->daddr.al_addr, chk_double_cl);
 
 	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
 }
@@ -574,9 +574,10 @@ offset_cmp(struct perf_hpp_fmt *fmt __maybe_unused,
 	uint64_t l = 0, r = 0;
 
 	if (left->mem_info)
-		l = cl_offset(left->mem_info->daddr.addr);
+		l = cl_offset(left->mem_info->daddr.addr, chk_double_cl);
+
 	if (right->mem_info)
-		r = cl_offset(right->mem_info->daddr.addr);
+		r = cl_offset(right->mem_info->daddr.addr, chk_double_cl);
 
 	return (int64_t)(r - l);
 }
@@ -2590,7 +2591,7 @@ perf_c2c_cacheline_browser__title(struct hist_browser *browser,
 	he = cl_browser->he;
 
 	if (he->mem_info)
-		addr = cl_address(he->mem_info->daddr.addr);
+		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
 
 	scnprintf(bf, size, "Cacheline 0x%lx", addr);
 	return 0;
@@ -2788,15 +2789,16 @@ static int ui_quirks(void)
 	if (!c2c.use_stdio) {
 		dim_offset.width  = 5;
 		dim_offset.header = header_offset_tui;
-		nodestr = "CL";
+		nodestr = chk_double_cl ? "Double-CL" : "CL";
 	}
 
 	dim_percent_costly_snoop.header = percent_costly_snoop_header[c2c.display];
 
 	/* Fix the zero line for dcacheline column. */
-	buf = fill_line("Cacheline", dim_dcacheline.width +
-				     dim_dcacheline_node.width +
-				     dim_dcacheline_count.width + 4);
+	buf = fill_line(chk_double_cl ? "Double-Cacheline" : "Cacheline",
+				dim_dcacheline.width +
+				dim_dcacheline_node.width +
+				dim_dcacheline_count.width + 4);
 	if (!buf)
 		return -ENOMEM;
 
@@ -3037,6 +3039,8 @@ static int perf_c2c__report(int argc, const char **argv)
 	OPT_BOOLEAN('f', "force", &symbol_conf.force, "don't complain, do it"),
 	OPT_BOOLEAN(0, "stitch-lbr", &c2c.stitch_lbr,
 		    "Enable LBR callgraph stitching approach"),
+	OPT_BOOLEAN('a', "double-cl", &chk_double_cl,
+		    "Check adjacent cachline false sharing"),
 	OPT_PARENT(c2c_options),
 	OPT_END()
 	};
diff --git a/tools/perf/util/cacheline.h b/tools/perf/util/cacheline.h
index dec8c0fb1f4a..630d16731b4f 100644
--- a/tools/perf/util/cacheline.h
+++ b/tools/perf/util/cacheline.h
@@ -6,16 +6,31 @@
 
 int __pure cacheline_size(void);
 
-static inline u64 cl_address(u64 address)
+
+/*
+ * Some architecture has 'Adjacent Cacheline Prefetch' feature,
+ * which performs like the cacheline size being doubled.
+ */
+static inline u64 cl_address(u64 address, bool double_cl)
 {
+	u64 size = cacheline_size();
+
+	if (double_cl)
+		size *= 2;
+
 	/* return the cacheline of the address */
-	return (address & ~(cacheline_size() - 1));
+	return (address & ~(size - 1));
 }
 
-static inline u64 cl_offset(u64 address)
+static inline u64 cl_offset(u64 address, bool double_cl)
 {
-	/* return the cacheline of the address */
-	return (address & (cacheline_size() - 1));
+	u64 size = cacheline_size();
+
+	if (double_cl)
+		size *= 2;
+
+	/* return the offset inside cachline */
+	return (address & (size - 1));
 }
 
 #endif // PERF_CACHELINE_H
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index e188f74698dd..148b28f0a7e2 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -52,6 +52,13 @@ enum sort_mode	sort__mode = SORT_MODE__NORMAL;
 static const char *const dynamic_headers[] = {"local_ins_lat", "ins_lat", "local_p_stage_cyc", "p_stage_cyc"};
 static const char *const arch_specific_sort_keys[] = {"local_p_stage_cyc", "p_stage_cyc"};
 
+/*
+ * Some architecture has Adjacent Cacheline Prefetch feature, which behaves
+ * like the cachline size is doubled. Enable this flag to check things in
+ * double cacheline granularity.
+ */
+bool chk_double_cl;
+
 /*
  * Replaces all occurrences of a char used with the:
  *
@@ -1499,8 +1506,8 @@ sort__dcacheline_cmp(struct hist_entry *left, struct hist_entry *right)
 
 addr:
 	/* al_addr does all the right addr - start + offset calculations */
-	l = cl_address(left->mem_info->daddr.al_addr);
-	r = cl_address(right->mem_info->daddr.al_addr);
+	l = cl_address(left->mem_info->daddr.al_addr, chk_double_cl);
+	r = cl_address(right->mem_info->daddr.al_addr, chk_double_cl);
 
 	if (l > r) return -1;
 	if (l < r) return 1;
@@ -1519,7 +1526,7 @@ static int hist_entry__dcacheline_snprintf(struct hist_entry *he, char *bf,
 	if (he->mem_info) {
 		struct map *map = he->mem_info->daddr.ms.map;
 
-		addr = cl_address(he->mem_info->daddr.al_addr);
+		addr = cl_address(he->mem_info->daddr.al_addr, chk_double_cl);
 		ms = &he->mem_info->daddr.ms;
 
 		/* print [s] for shared data mmaps */
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 921715e6aec4..04f0a6dc7381 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -35,6 +35,7 @@ extern struct sort_entry sort_sym_from;
 extern struct sort_entry sort_sym_to;
 extern struct sort_entry sort_srcline;
 extern const char default_mem_sort_order[];
+extern bool chk_double_cl;
 
 struct res_sample {
 	u64 time;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] perf c2c: Add report option to show false sharing in adjacent cachelines
  2023-02-13  3:17 [PATCH] perf c2c: Add report option to show false sharing in adjacent cachelines Feng Tang
@ 2023-02-13  8:02 ` Leo Yan
  2023-02-13  8:18 ` Leo Yan
  2023-02-13 13:39 ` Joe Mario
  2 siblings, 0 replies; 8+ messages in thread
From: Leo Yan @ 2023-02-13  8:02 UTC (permalink / raw)
  To: Feng Tang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joe Mario, linux-perf-users, linux-kernel, Andi Kleen, Kan Liang,
	Xing Zhengjun

On Mon, Feb 13, 2023 at 11:17:33AM +0800, Feng Tang wrote:
> Many platforms have feature of adjacent cachelines prefetch, when it
> is enabled, for data in RAM of 2 cachelines (2N and 2N+1) granularity,
> if one is fetched to cache, the other one could likely be fetched too,
> which sort of extends the cacheline size to double, thus the false
> sharing could happens in adjacent cachelines.
> 
> 0Day has captured performance changed related with this [1], and some
> commercial software explicitly makes its hot global variables 128 bytes
> aligned (2 cache lines) to avoid this kind of extended false sharing.
> 
> So add an option "-a" or "--double-cl" for c2c report to show false
> sharing in double cache line granularity, which acts just like the
> cacheline size is doubled. There is no change to c2c record. The
> hardware HITM events are still per cacheline. The option just changes
> the granularity of how events are grouped and displayed.
> 
> In the c2c report below (will-it-scale's pagefault2 case on old kernel):
> 
>   ----------------------------------------------------------------------
>      26       31        2        0        0        0  0xffff888103ec6000
>   ----------------------------------------------------------------------
>    35.48%   50.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff8133148b      1153        66       971     3748        74  [k] get_mem_cgroup_from_mm
>     6.45%    0.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff813396e4       570         0      1531      879        75  [k] mem_cgroup_charge
>    25.81%   50.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81331472       949        70       593     3359        74  [k] get_mem_cgroup_from_mm
>    19.35%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81339686      1352         0      1073     1022        74  [k] mem_cgroup_charge
>     9.68%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff813396d6      1401         0       863      768        74  [k] mem_cgroup_charge
>     3.23%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81333106       618         0       804       11         9  [k] uncharge_batch
> 
> The offset 0x10 and 0x54 used to displayed in 2 groups, and now they
> are listed together to give users a hint.
> 
> [1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
> 
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>

LGTM and I verified it on my Arm64 platform with peer flag:

Reviewed-by: Leo Yan <leo.yan@linaro.org>
Tested-by: Leo Yan <leo.yan@linaro.org>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] perf c2c: Add report option to show false sharing in adjacent cachelines
  2023-02-13  3:17 [PATCH] perf c2c: Add report option to show false sharing in adjacent cachelines Feng Tang
  2023-02-13  8:02 ` Leo Yan
@ 2023-02-13  8:18 ` Leo Yan
  2023-02-13  8:24   ` Feng Tang
  2023-02-13 13:39 ` Joe Mario
  2 siblings, 1 reply; 8+ messages in thread
From: Leo Yan @ 2023-02-13  8:18 UTC (permalink / raw)
  To: Feng Tang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joe Mario, linux-perf-users, linux-kernel, Andi Kleen, Kan Liang,
	Xing Zhengjun

On Mon, Feb 13, 2023 at 11:17:33AM +0800, Feng Tang wrote:
> Many platforms have feature of adjacent cachelines prefetch, when it
> is enabled, for data in RAM of 2 cachelines (2N and 2N+1) granularity,
> if one is fetched to cache, the other one could likely be fetched too,
> which sort of extends the cacheline size to double, thus the false
> sharing could happens in adjacent cachelines.
> 
> 0Day has captured performance changed related with this [1], and some
> commercial software explicitly makes its hot global variables 128 bytes
> aligned (2 cache lines) to avoid this kind of extended false sharing.
> 
> So add an option "-a" or "--double-cl" for c2c report to show false
> sharing in double cache line granularity, which acts just like the
> cacheline size is doubled. There is no change to c2c record. The
> hardware HITM events are still per cacheline. The option just changes
> the granularity of how events are grouped and displayed.
> 
> In the c2c report below (will-it-scale's pagefault2 case on old kernel):
> 
>   ----------------------------------------------------------------------
>      26       31        2        0        0        0  0xffff888103ec6000
>   ----------------------------------------------------------------------
>    35.48%   50.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff8133148b      1153        66       971     3748        74  [k] get_mem_cgroup_from_mm
>     6.45%    0.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff813396e4       570         0      1531      879        75  [k] mem_cgroup_charge
>    25.81%   50.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81331472       949        70       593     3359        74  [k] get_mem_cgroup_from_mm
>    19.35%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81339686      1352         0      1073     1022        74  [k] mem_cgroup_charge
>     9.68%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff813396d6      1401         0       863      768        74  [k] mem_cgroup_charge
>     3.23%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81333106       618         0       804       11         9  [k] uncharge_batch
> 
> The offset 0x10 and 0x54 used to displayed in 2 groups, and now they
> are listed together to give users a hint.
> 
> [1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
> 
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> ---
>  tools/perf/Documentation/perf-c2c.txt |  6 ++++++
>  tools/perf/builtin-c2c.c              | 22 +++++++++++++---------
>  tools/perf/util/cacheline.h           | 25 ++++++++++++++++++++-----
>  tools/perf/util/sort.c                | 13 ++++++++++---
>  tools/perf/util/sort.h                |  1 +
>  5 files changed, 50 insertions(+), 17 deletions(-)
> 
> diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
> index 5c5eb2def83e..a8e1e40d270e 100644
> --- a/tools/perf/Documentation/perf-c2c.txt
> +++ b/tools/perf/Documentation/perf-c2c.txt
> @@ -126,6 +126,12 @@ REPORT OPTIONS
>  	The known limitations include exception handing such as
>  	setjmp/longjmp will have calls/returns not match.
>  
> +-a::
> +--double-cl::
> +	Group HITM events in double cacheline granularity. Some architecture
> +	has Adjacent Cacheline Prefetch feature, which behaves like the
> +	cachline size is doubled.

Sorry I gave my review tags too quick...

Just found several typos, please fix it.

s/Some architecture has/Some architectures have

This is just literal fixing, except x86, I don't know if any other
architecture has Adjacent Cacheline Prefetch feature :)

s/cachline/cacheline

> +
>  C2C RECORD
>  ----------
>  The perf c2c record command setup options related to HITM cacheline analysis
> diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> index 52d94c7dd836..7d495db7e5a2 100644
> --- a/tools/perf/builtin-c2c.c
> +++ b/tools/perf/builtin-c2c.c
> @@ -524,7 +524,7 @@ static int dcacheline_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
>  	char buf[20];
>  
>  	if (he->mem_info)
> -		addr = cl_address(he->mem_info->daddr.addr);
> +		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
>  
>  	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
>  }
> @@ -562,7 +562,7 @@ static int offset_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
>  	char buf[20];
>  
>  	if (he->mem_info)
> -		addr = cl_offset(he->mem_info->daddr.al_addr);
> +		addr = cl_offset(he->mem_info->daddr.al_addr, chk_double_cl);
>  
>  	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
>  }
> @@ -574,9 +574,10 @@ offset_cmp(struct perf_hpp_fmt *fmt __maybe_unused,
>  	uint64_t l = 0, r = 0;
>  
>  	if (left->mem_info)
> -		l = cl_offset(left->mem_info->daddr.addr);
> +		l = cl_offset(left->mem_info->daddr.addr, chk_double_cl);
> +
>  	if (right->mem_info)
> -		r = cl_offset(right->mem_info->daddr.addr);
> +		r = cl_offset(right->mem_info->daddr.addr, chk_double_cl);
>  
>  	return (int64_t)(r - l);
>  }
> @@ -2590,7 +2591,7 @@ perf_c2c_cacheline_browser__title(struct hist_browser *browser,
>  	he = cl_browser->he;
>  
>  	if (he->mem_info)
> -		addr = cl_address(he->mem_info->daddr.addr);
> +		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
>  
>  	scnprintf(bf, size, "Cacheline 0x%lx", addr);
>  	return 0;
> @@ -2788,15 +2789,16 @@ static int ui_quirks(void)
>  	if (!c2c.use_stdio) {
>  		dim_offset.width  = 5;
>  		dim_offset.header = header_offset_tui;
> -		nodestr = "CL";
> +		nodestr = chk_double_cl ? "Double-CL" : "CL";
>  	}
>  
>  	dim_percent_costly_snoop.header = percent_costly_snoop_header[c2c.display];
>  
>  	/* Fix the zero line for dcacheline column. */
> -	buf = fill_line("Cacheline", dim_dcacheline.width +
> -				     dim_dcacheline_node.width +
> -				     dim_dcacheline_count.width + 4);
> +	buf = fill_line(chk_double_cl ? "Double-Cacheline" : "Cacheline",
> +				dim_dcacheline.width +
> +				dim_dcacheline_node.width +
> +				dim_dcacheline_count.width + 4);
>  	if (!buf)
>  		return -ENOMEM;
>  
> @@ -3037,6 +3039,8 @@ static int perf_c2c__report(int argc, const char **argv)
>  	OPT_BOOLEAN('f', "force", &symbol_conf.force, "don't complain, do it"),
>  	OPT_BOOLEAN(0, "stitch-lbr", &c2c.stitch_lbr,
>  		    "Enable LBR callgraph stitching approach"),
> +	OPT_BOOLEAN('a', "double-cl", &chk_double_cl,
> +		    "Check adjacent cachline false sharing"),

I personally think the word "Detect" is better than "Check".

s/cachline/cacheline

>  	OPT_PARENT(c2c_options),
>  	OPT_END()
>  	};
> diff --git a/tools/perf/util/cacheline.h b/tools/perf/util/cacheline.h
> index dec8c0fb1f4a..630d16731b4f 100644
> --- a/tools/perf/util/cacheline.h
> +++ b/tools/perf/util/cacheline.h
> @@ -6,16 +6,31 @@
>  
>  int __pure cacheline_size(void);
>  
> -static inline u64 cl_address(u64 address)
> +
> +/*
> + * Some architecture has 'Adjacent Cacheline Prefetch' feature,

s/Some architecture has/Some architectures have

> + * which performs like the cacheline size being doubled.
> + */
> +static inline u64 cl_address(u64 address, bool double_cl)
>  {
> +	u64 size = cacheline_size();
> +
> +	if (double_cl)
> +		size *= 2;
> +
>  	/* return the cacheline of the address */
> -	return (address & ~(cacheline_size() - 1));
> +	return (address & ~(size - 1));
>  }
>  
> -static inline u64 cl_offset(u64 address)
> +static inline u64 cl_offset(u64 address, bool double_cl)
>  {
> -	/* return the cacheline of the address */
> -	return (address & (cacheline_size() - 1));
> +	u64 size = cacheline_size();
> +
> +	if (double_cl)
> +		size *= 2;
> +
> +	/* return the offset inside cachline */

s/cachline/cacheline

> +	return (address & (size - 1));
>  }
>  
>  #endif // PERF_CACHELINE_H
> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> index e188f74698dd..148b28f0a7e2 100644
> --- a/tools/perf/util/sort.c
> +++ b/tools/perf/util/sort.c
> @@ -52,6 +52,13 @@ enum sort_mode	sort__mode = SORT_MODE__NORMAL;
>  static const char *const dynamic_headers[] = {"local_ins_lat", "ins_lat", "local_p_stage_cyc", "p_stage_cyc"};
>  static const char *const arch_specific_sort_keys[] = {"local_p_stage_cyc", "p_stage_cyc"};
>  
> +/*
> + * Some architecture has Adjacent Cacheline Prefetch feature, which behaves
> + * like the cachline size is doubled. Enable this flag to check things in

s/Some architecture has/Some architectures have
s/cachline/cacheline

With these typo fixes, you could add my reviewed and tested tags.

Thanks,
Leo

> + * double cacheline granularity.
> + */
> +bool chk_double_cl;
> +
>  /*
>   * Replaces all occurrences of a char used with the:
>   *
> @@ -1499,8 +1506,8 @@ sort__dcacheline_cmp(struct hist_entry *left, struct hist_entry *right)
>  
>  addr:
>  	/* al_addr does all the right addr - start + offset calculations */
> -	l = cl_address(left->mem_info->daddr.al_addr);
> -	r = cl_address(right->mem_info->daddr.al_addr);
> +	l = cl_address(left->mem_info->daddr.al_addr, chk_double_cl);
> +	r = cl_address(right->mem_info->daddr.al_addr, chk_double_cl);
>  
>  	if (l > r) return -1;
>  	if (l < r) return 1;
> @@ -1519,7 +1526,7 @@ static int hist_entry__dcacheline_snprintf(struct hist_entry *he, char *bf,
>  	if (he->mem_info) {
>  		struct map *map = he->mem_info->daddr.ms.map;
>  
> -		addr = cl_address(he->mem_info->daddr.al_addr);
> +		addr = cl_address(he->mem_info->daddr.al_addr, chk_double_cl);
>  		ms = &he->mem_info->daddr.ms;
>  
>  		/* print [s] for shared data mmaps */
> diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
> index 921715e6aec4..04f0a6dc7381 100644
> --- a/tools/perf/util/sort.h
> +++ b/tools/perf/util/sort.h
> @@ -35,6 +35,7 @@ extern struct sort_entry sort_sym_from;
>  extern struct sort_entry sort_sym_to;
>  extern struct sort_entry sort_srcline;
>  extern const char default_mem_sort_order[];
> +extern bool chk_double_cl;
>  
>  struct res_sample {
>  	u64 time;
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] perf c2c: Add report option to show false sharing in adjacent cachelines
  2023-02-13  8:18 ` Leo Yan
@ 2023-02-13  8:24   ` Feng Tang
  2023-02-13  8:54     ` Leo Yan
  0 siblings, 1 reply; 8+ messages in thread
From: Feng Tang @ 2023-02-13  8:24 UTC (permalink / raw)
  To: Leo Yan
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joe Mario, linux-perf-users, linux-kernel, Andi Kleen, Kan Liang,
	Xing Zhengjun

Hi Leo,

On Mon, Feb 13, 2023 at 04:18:50PM +0800, Leo Yan wrote:
> On Mon, Feb 13, 2023 at 11:17:33AM +0800, Feng Tang wrote:
> > Many platforms have feature of adjacent cachelines prefetch, when it
> > is enabled, for data in RAM of 2 cachelines (2N and 2N+1) granularity,
> > if one is fetched to cache, the other one could likely be fetched too,
> > which sort of extends the cacheline size to double, thus the false
> > sharing could happens in adjacent cachelines.
> > 
> > 0Day has captured performance changed related with this [1], and some
> > commercial software explicitly makes its hot global variables 128 bytes
> > aligned (2 cache lines) to avoid this kind of extended false sharing.
> > 
> > So add an option "-a" or "--double-cl" for c2c report to show false
> > sharing in double cache line granularity, which acts just like the
> > cacheline size is doubled. There is no change to c2c record. The
> > hardware HITM events are still per cacheline. The option just changes
> > the granularity of how events are grouped and displayed.
> > 
> > In the c2c report below (will-it-scale's pagefault2 case on old kernel):
> > 
> >   ----------------------------------------------------------------------
> >      26       31        2        0        0        0  0xffff888103ec6000
> >   ----------------------------------------------------------------------
> >    35.48%   50.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff8133148b      1153        66       971     3748        74  [k] get_mem_cgroup_from_mm
> >     6.45%    0.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff813396e4       570         0      1531      879        75  [k] mem_cgroup_charge
> >    25.81%   50.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81331472       949        70       593     3359        74  [k] get_mem_cgroup_from_mm
> >    19.35%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81339686      1352         0      1073     1022        74  [k] mem_cgroup_charge
> >     9.68%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff813396d6      1401         0       863      768        74  [k] mem_cgroup_charge
> >     3.23%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81333106       618         0       804       11         9  [k] uncharge_batch
> > 
> > The offset 0x10 and 0x54 used to displayed in 2 groups, and now they
> > are listed together to give users a hint.
> > 
> > [1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
> > 
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > Reviewed-by: Andi Kleen <ak@linux.intel.com>
> > ---
> >  tools/perf/Documentation/perf-c2c.txt |  6 ++++++
> >  tools/perf/builtin-c2c.c              | 22 +++++++++++++---------
> >  tools/perf/util/cacheline.h           | 25 ++++++++++++++++++++-----
> >  tools/perf/util/sort.c                | 13 ++++++++++---
> >  tools/perf/util/sort.h                |  1 +
> >  5 files changed, 50 insertions(+), 17 deletions(-)
> > 
> > diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
> > index 5c5eb2def83e..a8e1e40d270e 100644
> > --- a/tools/perf/Documentation/perf-c2c.txt
> > +++ b/tools/perf/Documentation/perf-c2c.txt
> > @@ -126,6 +126,12 @@ REPORT OPTIONS
> >  	The known limitations include exception handing such as
> >  	setjmp/longjmp will have calls/returns not match.
> >  
> > +-a::
> > +--double-cl::
> > +	Group HITM events in double cacheline granularity. Some architecture
> > +	has Adjacent Cacheline Prefetch feature, which behaves like the
> > +	cachline size is doubled.
> 
> Sorry I gave my review tags too quick...
> 
> Just found several typos, please fix it.

Thanks for the catch, will fix them.

> s/Some architecture has/Some architectures have
> 
> This is just literal fixing, except x86, I don't know if any other
> architecture has Adjacent Cacheline Prefetch feature :)
> 
> s/cachline/cacheline
> 
> > +
> >  C2C RECORD
> >  ----------
> >  The perf c2c record command setup options related to HITM cacheline analysis
> > diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> > index 52d94c7dd836..7d495db7e5a2 100644
> > --- a/tools/perf/builtin-c2c.c
> > +++ b/tools/perf/builtin-c2c.c
> > @@ -524,7 +524,7 @@ static int dcacheline_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
> >  	char buf[20];
> >  
> >  	if (he->mem_info)
> > -		addr = cl_address(he->mem_info->daddr.addr);
> > +		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
> >  
> >  	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
> >  }
> > @@ -562,7 +562,7 @@ static int offset_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
> >  	char buf[20];
> >  
> >  	if (he->mem_info)
> > -		addr = cl_offset(he->mem_info->daddr.al_addr);
> > +		addr = cl_offset(he->mem_info->daddr.al_addr, chk_double_cl);
> >  
> >  	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
> >  }
> > @@ -574,9 +574,10 @@ offset_cmp(struct perf_hpp_fmt *fmt __maybe_unused,
> >  	uint64_t l = 0, r = 0;
> >  
> >  	if (left->mem_info)
> > -		l = cl_offset(left->mem_info->daddr.addr);
> > +		l = cl_offset(left->mem_info->daddr.addr, chk_double_cl);
> > +
> >  	if (right->mem_info)
> > -		r = cl_offset(right->mem_info->daddr.addr);
> > +		r = cl_offset(right->mem_info->daddr.addr, chk_double_cl);
> >  
> >  	return (int64_t)(r - l);
> >  }
> > @@ -2590,7 +2591,7 @@ perf_c2c_cacheline_browser__title(struct hist_browser *browser,
> >  	he = cl_browser->he;
> >  
> >  	if (he->mem_info)
> > -		addr = cl_address(he->mem_info->daddr.addr);
> > +		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
> >  
> >  	scnprintf(bf, size, "Cacheline 0x%lx", addr);
> >  	return 0;
> > @@ -2788,15 +2789,16 @@ static int ui_quirks(void)
> >  	if (!c2c.use_stdio) {
> >  		dim_offset.width  = 5;
> >  		dim_offset.header = header_offset_tui;
> > -		nodestr = "CL";
> > +		nodestr = chk_double_cl ? "Double-CL" : "CL";
> >  	}
> >  
> >  	dim_percent_costly_snoop.header = percent_costly_snoop_header[c2c.display];
> >  
> >  	/* Fix the zero line for dcacheline column. */
> > -	buf = fill_line("Cacheline", dim_dcacheline.width +
> > -				     dim_dcacheline_node.width +
> > -				     dim_dcacheline_count.width + 4);
> > +	buf = fill_line(chk_double_cl ? "Double-Cacheline" : "Cacheline",
> > +				dim_dcacheline.width +
> > +				dim_dcacheline_node.width +
> > +				dim_dcacheline_count.width + 4);
> >  	if (!buf)
> >  		return -ENOMEM;
> >  
> > @@ -3037,6 +3039,8 @@ static int perf_c2c__report(int argc, const char **argv)
> >  	OPT_BOOLEAN('f', "force", &symbol_conf.force, "don't complain, do it"),
> >  	OPT_BOOLEAN(0, "stitch-lbr", &c2c.stitch_lbr,
> >  		    "Enable LBR callgraph stitching approach"),
> > +	OPT_BOOLEAN('a', "double-cl", &chk_double_cl,
> > +		    "Check adjacent cachline false sharing"),
> 
> I personally think the word "Detect" is better than "Check".

Ok will use this. The possiblity of finding a false sharing of
adjacent cachelines is indeed much less.

> s/cachline/cacheline

Will fix.

> >  	OPT_PARENT(c2c_options),
> >  	OPT_END()
> >  	};
> > diff --git a/tools/perf/util/cacheline.h b/tools/perf/util/cacheline.h
> > index dec8c0fb1f4a..630d16731b4f 100644
> > --- a/tools/perf/util/cacheline.h
> > +++ b/tools/perf/util/cacheline.h
> > @@ -6,16 +6,31 @@
> >  
> >  int __pure cacheline_size(void);
> >  
> > -static inline u64 cl_address(u64 address)
> > +
> > +/*
> > + * Some architecture has 'Adjacent Cacheline Prefetch' feature,
> 
> s/Some architecture has/Some architectures have

Will change.

> > + * which performs like the cacheline size being doubled.
> > + */
> > +static inline u64 cl_address(u64 address, bool double_cl)
> >  {
> > +	u64 size = cacheline_size();
> > +
> > +	if (double_cl)
> > +		size *= 2;
> > +
> >  	/* return the cacheline of the address */
> > -	return (address & ~(cacheline_size() - 1));
> > +	return (address & ~(size - 1));
> >  }
> >  
> > -static inline u64 cl_offset(u64 address)
> > +static inline u64 cl_offset(u64 address, bool double_cl)
> >  {
> > -	/* return the cacheline of the address */
> > -	return (address & (cacheline_size() - 1));
> > +	u64 size = cacheline_size();
> > +
> > +	if (double_cl)
> > +		size *= 2;
> > +
> > +	/* return the offset inside cachline */
> 
> s/cachline/cacheline
 
Will fix.

> > +	return (address & (size - 1));
> >  }
> >  
> >  #endif // PERF_CACHELINE_H
> > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > index e188f74698dd..148b28f0a7e2 100644
> > --- a/tools/perf/util/sort.c
> > +++ b/tools/perf/util/sort.c
> > @@ -52,6 +52,13 @@ enum sort_mode	sort__mode = SORT_MODE__NORMAL;
> >  static const char *const dynamic_headers[] = {"local_ins_lat", "ins_lat", "local_p_stage_cyc", "p_stage_cyc"};
> >  static const char *const arch_specific_sort_keys[] = {"local_p_stage_cyc", "p_stage_cyc"};
> >  
> > +/*
> > + * Some architecture has Adjacent Cacheline Prefetch feature, which behaves
> > + * like the cachline size is doubled. Enable this flag to check things in
> 
> s/Some architecture has/Some architectures have
> s/cachline/cacheline
> 
> With these typo fixes, you could add my reviewed and tested tags.

Thanks a lot for the review and helping to test on Arm64 platform
which I don't have access to!

- Feng

> Thanks,
> Leo
> 
> > + * double cacheline granularity.
> > + */
> > +bool chk_double_cl;
> > +
> >  /*
> >   * Replaces all occurrences of a char used with the:
> >   *
> > @@ -1499,8 +1506,8 @@ sort__dcacheline_cmp(struct hist_entry *left, struct hist_entry *right)
> >  
> >  addr:
> >  	/* al_addr does all the right addr - start + offset calculations */
> > -	l = cl_address(left->mem_info->daddr.al_addr);
> > -	r = cl_address(right->mem_info->daddr.al_addr);
> > +	l = cl_address(left->mem_info->daddr.al_addr, chk_double_cl);
> > +	r = cl_address(right->mem_info->daddr.al_addr, chk_double_cl);
> >  
> >  	if (l > r) return -1;
> >  	if (l < r) return 1;
> > @@ -1519,7 +1526,7 @@ static int hist_entry__dcacheline_snprintf(struct hist_entry *he, char *bf,
> >  	if (he->mem_info) {
> >  		struct map *map = he->mem_info->daddr.ms.map;
> >  
> > -		addr = cl_address(he->mem_info->daddr.al_addr);
> > +		addr = cl_address(he->mem_info->daddr.al_addr, chk_double_cl);
> >  		ms = &he->mem_info->daddr.ms;
> >  
> >  		/* print [s] for shared data mmaps */
> > diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
> > index 921715e6aec4..04f0a6dc7381 100644
> > --- a/tools/perf/util/sort.h
> > +++ b/tools/perf/util/sort.h
> > @@ -35,6 +35,7 @@ extern struct sort_entry sort_sym_from;
> >  extern struct sort_entry sort_sym_to;
> >  extern struct sort_entry sort_srcline;
> >  extern const char default_mem_sort_order[];
> > +extern bool chk_double_cl;
> >  
> >  struct res_sample {
> >  	u64 time;
> > -- 
> > 2.30.2
> > 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] perf c2c: Add report option to show false sharing in adjacent cachelines
  2023-02-13  8:24   ` Feng Tang
@ 2023-02-13  8:54     ` Leo Yan
  0 siblings, 0 replies; 8+ messages in thread
From: Leo Yan @ 2023-02-13  8:54 UTC (permalink / raw)
  To: Feng Tang
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Joe Mario, linux-perf-users, linux-kernel, Andi Kleen, Kan Liang,
	Xing Zhengjun

On Mon, Feb 13, 2023 at 04:24:01PM +0800, Feng Tang wrote:

[...]

> > With these typo fixes, you could add my reviewed and tested tags.
> 
> Thanks a lot for the review and helping to test on Arm64 platform
> which I don't have access to!

Welcome, Feng!  And thanks for your patch!

Leo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] perf c2c: Add report option to show false sharing in adjacent cachelines
  2023-02-13  3:17 [PATCH] perf c2c: Add report option to show false sharing in adjacent cachelines Feng Tang
  2023-02-13  8:02 ` Leo Yan
  2023-02-13  8:18 ` Leo Yan
@ 2023-02-13 13:39 ` Joe Mario
  2023-02-13 14:04   ` Feng Tang
  2023-02-14 19:31   ` Arnaldo Carvalho de Melo
  2 siblings, 2 replies; 8+ messages in thread
From: Joe Mario @ 2023-02-13 13:39 UTC (permalink / raw)
  To: Feng Tang, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	linux-perf-users, linux-kernel
  Cc: Andi Kleen, Kan Liang, Xing Zhengjun



On 2/12/23 10:17 PM, Feng Tang wrote:
> Many platforms have feature of adjacent cachelines prefetch, when it
> is enabled, for data in RAM of 2 cachelines (2N and 2N+1) granularity,
> if one is fetched to cache, the other one could likely be fetched too,
> which sort of extends the cacheline size to double, thus the false
> sharing could happens in adjacent cachelines.
> 
> 0Day has captured performance changed related with this [1], and some
> commercial software explicitly makes its hot global variables 128 bytes
> aligned (2 cache lines) to avoid this kind of extended false sharing.
> 
> So add an option "-a" or "--double-cl" for c2c report to show false
> sharing in double cache line granularity, which acts just like the
> cacheline size is doubled. There is no change to c2c record. The
> hardware HITM events are still per cacheline. The option just changes
> the granularity of how events are grouped and displayed.
> 
> In the c2c report below (will-it-scale's pagefault2 case on old kernel):
> 
>   ----------------------------------------------------------------------
>      26       31        2        0        0        0  0xffff888103ec6000
>   ----------------------------------------------------------------------
>    35.48%   50.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff8133148b      1153        66       971     3748        74  [k] get_mem_cgroup_from_mm
>     6.45%    0.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff813396e4       570         0      1531      879        75  [k] mem_cgroup_charge
>    25.81%   50.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81331472       949        70       593     3359        74  [k] get_mem_cgroup_from_mm
>    19.35%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81339686      1352         0      1073     1022        74  [k] mem_cgroup_charge
>     9.68%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff813396d6      1401         0       863      768        74  [k] mem_cgroup_charge
>     3.23%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81333106       618         0       804       11         9  [k] uncharge_batch
> 
> The offset 0x10 and 0x54 used to displayed in 2 groups, and now they
> are listed together to give users a hint.
> 
> [1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
> 
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> ---
>  tools/perf/Documentation/perf-c2c.txt |  6 ++++++
>  tools/perf/builtin-c2c.c              | 22 +++++++++++++---------
>  tools/perf/util/cacheline.h           | 25 ++++++++++++++++++++-----
>  tools/perf/util/sort.c                | 13 ++++++++++---
>  tools/perf/util/sort.h                |  1 +
>  5 files changed, 50 insertions(+), 17 deletions(-)
> 
> diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
> index 5c5eb2def83e..a8e1e40d270e 100644
> --- a/tools/perf/Documentation/perf-c2c.txt
> +++ b/tools/perf/Documentation/perf-c2c.txt
> @@ -126,6 +126,12 @@ REPORT OPTIONS
>  	The known limitations include exception handing such as
>  	setjmp/longjmp will have calls/returns not match.
>  
> +-a::
> +--double-cl::
> +	Group HITM events in double cacheline granularity. Some architecture
> +	has Adjacent Cacheline Prefetch feature, which behaves like the
> +	cachline size is doubled.
> +

Hi Feng:
One or two of the recent C2C patches changed the Intel-specifc "HITM" wording to something more generic, mostly because other arches do not have HITM events.

How about changing the above description to something like:
+-a::
+--double-cl::
+       Group the detection of shared cacheline events into double cacheline granularity.
+       Some architectures have an Adjacent Cacheline Prefetch feature, which causes
+       cacheline sharing to behave like the cachline size is doubled.

By the way, this patch is awesome.  
It's not often that Adjacent Cacheline Prefetching causes performance problems, but when it does, it's a pain to track down.
This will really help.  Thank you for doing it.

Joe

>  C2C RECORD
>  ----------
>  The perf c2c record command setup options related to HITM cacheline analysis
> diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> index 52d94c7dd836..7d495db7e5a2 100644
> --- a/tools/perf/builtin-c2c.c
> +++ b/tools/perf/builtin-c2c.c
> @@ -524,7 +524,7 @@ static int dcacheline_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
>  	char buf[20];
>  
>  	if (he->mem_info)
> -		addr = cl_address(he->mem_info->daddr.addr);
> +		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
>  
>  	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
>  }
> @@ -562,7 +562,7 @@ static int offset_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
>  	char buf[20];
>  
>  	if (he->mem_info)
> -		addr = cl_offset(he->mem_info->daddr.al_addr);
> +		addr = cl_offset(he->mem_info->daddr.al_addr, chk_double_cl);
>  
>  	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
>  }
> @@ -574,9 +574,10 @@ offset_cmp(struct perf_hpp_fmt *fmt __maybe_unused,
>  	uint64_t l = 0, r = 0;
>  
>  	if (left->mem_info)
> -		l = cl_offset(left->mem_info->daddr.addr);
> +		l = cl_offset(left->mem_info->daddr.addr, chk_double_cl);
> +
>  	if (right->mem_info)
> -		r = cl_offset(right->mem_info->daddr.addr);
> +		r = cl_offset(right->mem_info->daddr.addr, chk_double_cl);
>  
>  	return (int64_t)(r - l);
>  }
> @@ -2590,7 +2591,7 @@ perf_c2c_cacheline_browser__title(struct hist_browser *browser,
>  	he = cl_browser->he;
>  
>  	if (he->mem_info)
> -		addr = cl_address(he->mem_info->daddr.addr);
> +		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
>  
>  	scnprintf(bf, size, "Cacheline 0x%lx", addr);
>  	return 0;
> @@ -2788,15 +2789,16 @@ static int ui_quirks(void)
>  	if (!c2c.use_stdio) {
>  		dim_offset.width  = 5;
>  		dim_offset.header = header_offset_tui;
> -		nodestr = "CL";
> +		nodestr = chk_double_cl ? "Double-CL" : "CL";
>  	}
>  
>  	dim_percent_costly_snoop.header = percent_costly_snoop_header[c2c.display];
>  
>  	/* Fix the zero line for dcacheline column. */
> -	buf = fill_line("Cacheline", dim_dcacheline.width +
> -				     dim_dcacheline_node.width +
> -				     dim_dcacheline_count.width + 4);
> +	buf = fill_line(chk_double_cl ? "Double-Cacheline" : "Cacheline",
> +				dim_dcacheline.width +
> +				dim_dcacheline_node.width +
> +				dim_dcacheline_count.width + 4);
>  	if (!buf)
>  		return -ENOMEM;
>  
> @@ -3037,6 +3039,8 @@ static int perf_c2c__report(int argc, const char **argv)
>  	OPT_BOOLEAN('f', "force", &symbol_conf.force, "don't complain, do it"),
>  	OPT_BOOLEAN(0, "stitch-lbr", &c2c.stitch_lbr,
>  		    "Enable LBR callgraph stitching approach"),
> +	OPT_BOOLEAN('a', "double-cl", &chk_double_cl,
> +		    "Check adjacent cachline false sharing"),
>  	OPT_PARENT(c2c_options),
>  	OPT_END()
>  	};
> diff --git a/tools/perf/util/cacheline.h b/tools/perf/util/cacheline.h
> index dec8c0fb1f4a..630d16731b4f 100644
> --- a/tools/perf/util/cacheline.h
> +++ b/tools/perf/util/cacheline.h
> @@ -6,16 +6,31 @@
>  
>  int __pure cacheline_size(void);
>  
> -static inline u64 cl_address(u64 address)
> +
> +/*
> + * Some architecture has 'Adjacent Cacheline Prefetch' feature,
> + * which performs like the cacheline size being doubled.
> + */
> +static inline u64 cl_address(u64 address, bool double_cl)
>  {
> +	u64 size = cacheline_size();
> +
> +	if (double_cl)
> +		size *= 2;
> +
>  	/* return the cacheline of the address */
> -	return (address & ~(cacheline_size() - 1));
> +	return (address & ~(size - 1));
>  }
>  
> -static inline u64 cl_offset(u64 address)
> +static inline u64 cl_offset(u64 address, bool double_cl)
>  {
> -	/* return the cacheline of the address */
> -	return (address & (cacheline_size() - 1));
> +	u64 size = cacheline_size();
> +
> +	if (double_cl)
> +		size *= 2;
> +
> +	/* return the offset inside cachline */
> +	return (address & (size - 1));
>  }
>  
>  #endif // PERF_CACHELINE_H
> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> index e188f74698dd..148b28f0a7e2 100644
> --- a/tools/perf/util/sort.c
> +++ b/tools/perf/util/sort.c
> @@ -52,6 +52,13 @@ enum sort_mode	sort__mode = SORT_MODE__NORMAL;
>  static const char *const dynamic_headers[] = {"local_ins_lat", "ins_lat", "local_p_stage_cyc", "p_stage_cyc"};
>  static const char *const arch_specific_sort_keys[] = {"local_p_stage_cyc", "p_stage_cyc"};
>  
> +/*
> + * Some architecture has Adjacent Cacheline Prefetch feature, which behaves
> + * like the cachline size is doubled. Enable this flag to check things in
> + * double cacheline granularity.
> + */
> +bool chk_double_cl;
> +
>  /*
>   * Replaces all occurrences of a char used with the:
>   *
> @@ -1499,8 +1506,8 @@ sort__dcacheline_cmp(struct hist_entry *left, struct hist_entry *right)
>  
>  addr:
>  	/* al_addr does all the right addr - start + offset calculations */
> -	l = cl_address(left->mem_info->daddr.al_addr);
> -	r = cl_address(right->mem_info->daddr.al_addr);
> +	l = cl_address(left->mem_info->daddr.al_addr, chk_double_cl);
> +	r = cl_address(right->mem_info->daddr.al_addr, chk_double_cl);
>  
>  	if (l > r) return -1;
>  	if (l < r) return 1;
> @@ -1519,7 +1526,7 @@ static int hist_entry__dcacheline_snprintf(struct hist_entry *he, char *bf,
>  	if (he->mem_info) {
>  		struct map *map = he->mem_info->daddr.ms.map;
>  
> -		addr = cl_address(he->mem_info->daddr.al_addr);
> +		addr = cl_address(he->mem_info->daddr.al_addr, chk_double_cl);
>  		ms = &he->mem_info->daddr.ms;
>  
>  		/* print [s] for shared data mmaps */
> diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
> index 921715e6aec4..04f0a6dc7381 100644
> --- a/tools/perf/util/sort.h
> +++ b/tools/perf/util/sort.h
> @@ -35,6 +35,7 @@ extern struct sort_entry sort_sym_from;
>  extern struct sort_entry sort_sym_to;
>  extern struct sort_entry sort_srcline;
>  extern const char default_mem_sort_order[];
> +extern bool chk_double_cl;
>  
>  struct res_sample {
>  	u64 time;


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] perf c2c: Add report option to show false sharing in adjacent cachelines
  2023-02-13 13:39 ` Joe Mario
@ 2023-02-13 14:04   ` Feng Tang
  2023-02-14 19:31   ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 8+ messages in thread
From: Feng Tang @ 2023-02-13 14:04 UTC (permalink / raw)
  To: Joe Mario
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	linux-perf-users, linux-kernel, Andi Kleen, Kan Liang,
	Xing Zhengjun

Hi Joe,

On Mon, Feb 13, 2023 at 08:39:15AM -0500, Joe Mario wrote:
> 
> 
> On 2/12/23 10:17 PM, Feng Tang wrote:
> > Many platforms have feature of adjacent cachelines prefetch, when it
> > is enabled, for data in RAM of 2 cachelines (2N and 2N+1) granularity,
> > if one is fetched to cache, the other one could likely be fetched too,
> > which sort of extends the cacheline size to double, thus the false
> > sharing could happens in adjacent cachelines.
> > 
> > 0Day has captured performance changed related with this [1], and some
> > commercial software explicitly makes its hot global variables 128 bytes
> > aligned (2 cache lines) to avoid this kind of extended false sharing.
> > 
> > So add an option "-a" or "--double-cl" for c2c report to show false
> > sharing in double cache line granularity, which acts just like the
> > cacheline size is doubled. There is no change to c2c record. The
> > hardware HITM events are still per cacheline. The option just changes
> > the granularity of how events are grouped and displayed.
> > 
> > In the c2c report below (will-it-scale's pagefault2 case on old kernel):
> > 
> >   ----------------------------------------------------------------------
> >      26       31        2        0        0        0  0xffff888103ec6000
> >   ----------------------------------------------------------------------
> >    35.48%   50.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff8133148b      1153        66       971     3748        74  [k] get_mem_cgroup_from_mm
> >     6.45%    0.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff813396e4       570         0      1531      879        75  [k] mem_cgroup_charge
> >    25.81%   50.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81331472       949        70       593     3359        74  [k] get_mem_cgroup_from_mm
> >    19.35%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81339686      1352         0      1073     1022        74  [k] mem_cgroup_charge
> >     9.68%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff813396d6      1401         0       863      768        74  [k] mem_cgroup_charge
> >     3.23%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81333106       618         0       804       11         9  [k] uncharge_batch
> > 
> > The offset 0x10 and 0x54 used to displayed in 2 groups, and now they
> > are listed together to give users a hint.
> > 
> > [1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
> > 
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > Reviewed-by: Andi Kleen <ak@linux.intel.com>
> > ---
> >  tools/perf/Documentation/perf-c2c.txt |  6 ++++++
> >  tools/perf/builtin-c2c.c              | 22 +++++++++++++---------
> >  tools/perf/util/cacheline.h           | 25 ++++++++++++++++++++-----
> >  tools/perf/util/sort.c                | 13 ++++++++++---
> >  tools/perf/util/sort.h                |  1 +
> >  5 files changed, 50 insertions(+), 17 deletions(-)
> > 
> > diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
> > index 5c5eb2def83e..a8e1e40d270e 100644
> > --- a/tools/perf/Documentation/perf-c2c.txt
> > +++ b/tools/perf/Documentation/perf-c2c.txt
> > @@ -126,6 +126,12 @@ REPORT OPTIONS
> >  	The known limitations include exception handing such as
> >  	setjmp/longjmp will have calls/returns not match.
> >  
> > +-a::
> > +--double-cl::
> > +	Group HITM events in double cacheline granularity. Some architecture
> > +	has Adjacent Cacheline Prefetch feature, which behaves like the
> > +	cachline size is doubled.
> > +
> 
> Hi Feng:
> One or two of the recent C2C patches changed the Intel-specifc "HITM" wording to something more generic, mostly because other arches do not have HITM events.
> 
> How about changing the above description to something like:
> +-a::
> +--double-cl::
> +       Group the detection of shared cacheline events into double cacheline granularity.
> +       Some architectures have an Adjacent Cacheline Prefetch feature, which causes
> +       cacheline sharing to behave like the cachline size is doubled.

Yes, HITM should not be in this architecture independent description.
Thanks for helping to improve it, which is much better now!

> By the way, this patch is awesome.  
> It's not often that Adjacent Cacheline Prefetching causes performance problems, but when it does, it's a pain to track down.
> This will really help.  Thank you for doing it.
 
Thanks for your work on perf-c2c and the great blog (I forwarded the
link to tens of colleagues :))

Thanks,
Feng

> Joe
> 
> >  C2C RECORD
> >  ----------
> >  The perf c2c record command setup options related to HITM cacheline analysis
> > diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> > index 52d94c7dd836..7d495db7e5a2 100644
> > --- a/tools/perf/builtin-c2c.c
> > +++ b/tools/perf/builtin-c2c.c
> > @@ -524,7 +524,7 @@ static int dcacheline_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
> >  	char buf[20];
> >  
> >  	if (he->mem_info)
> > -		addr = cl_address(he->mem_info->daddr.addr);
> > +		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
> >  
> >  	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
> >  }
> > @@ -562,7 +562,7 @@ static int offset_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
> >  	char buf[20];
> >  
> >  	if (he->mem_info)
> > -		addr = cl_offset(he->mem_info->daddr.al_addr);
> > +		addr = cl_offset(he->mem_info->daddr.al_addr, chk_double_cl);
> >  
> >  	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
> >  }
> > @@ -574,9 +574,10 @@ offset_cmp(struct perf_hpp_fmt *fmt __maybe_unused,
> >  	uint64_t l = 0, r = 0;
> >  
> >  	if (left->mem_info)
> > -		l = cl_offset(left->mem_info->daddr.addr);
> > +		l = cl_offset(left->mem_info->daddr.addr, chk_double_cl);
> > +
> >  	if (right->mem_info)
> > -		r = cl_offset(right->mem_info->daddr.addr);
> > +		r = cl_offset(right->mem_info->daddr.addr, chk_double_cl);
> >  
> >  	return (int64_t)(r - l);
> >  }
> > @@ -2590,7 +2591,7 @@ perf_c2c_cacheline_browser__title(struct hist_browser *browser,
> >  	he = cl_browser->he;
> >  
> >  	if (he->mem_info)
> > -		addr = cl_address(he->mem_info->daddr.addr);
> > +		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
> >  
> >  	scnprintf(bf, size, "Cacheline 0x%lx", addr);
> >  	return 0;
> > @@ -2788,15 +2789,16 @@ static int ui_quirks(void)
> >  	if (!c2c.use_stdio) {
> >  		dim_offset.width  = 5;
> >  		dim_offset.header = header_offset_tui;
> > -		nodestr = "CL";
> > +		nodestr = chk_double_cl ? "Double-CL" : "CL";
> >  	}
> >  
> >  	dim_percent_costly_snoop.header = percent_costly_snoop_header[c2c.display];
> >  
> >  	/* Fix the zero line for dcacheline column. */
> > -	buf = fill_line("Cacheline", dim_dcacheline.width +
> > -				     dim_dcacheline_node.width +
> > -				     dim_dcacheline_count.width + 4);
> > +	buf = fill_line(chk_double_cl ? "Double-Cacheline" : "Cacheline",
> > +				dim_dcacheline.width +
> > +				dim_dcacheline_node.width +
> > +				dim_dcacheline_count.width + 4);
> >  	if (!buf)
> >  		return -ENOMEM;
> >  
> > @@ -3037,6 +3039,8 @@ static int perf_c2c__report(int argc, const char **argv)
> >  	OPT_BOOLEAN('f', "force", &symbol_conf.force, "don't complain, do it"),
> >  	OPT_BOOLEAN(0, "stitch-lbr", &c2c.stitch_lbr,
> >  		    "Enable LBR callgraph stitching approach"),
> > +	OPT_BOOLEAN('a', "double-cl", &chk_double_cl,
> > +		    "Check adjacent cachline false sharing"),
> >  	OPT_PARENT(c2c_options),
> >  	OPT_END()
> >  	};
> > diff --git a/tools/perf/util/cacheline.h b/tools/perf/util/cacheline.h
> > index dec8c0fb1f4a..630d16731b4f 100644
> > --- a/tools/perf/util/cacheline.h
> > +++ b/tools/perf/util/cacheline.h
> > @@ -6,16 +6,31 @@
> >  
> >  int __pure cacheline_size(void);
> >  
> > -static inline u64 cl_address(u64 address)
> > +
> > +/*
> > + * Some architecture has 'Adjacent Cacheline Prefetch' feature,
> > + * which performs like the cacheline size being doubled.
> > + */
> > +static inline u64 cl_address(u64 address, bool double_cl)
> >  {
> > +	u64 size = cacheline_size();
> > +
> > +	if (double_cl)
> > +		size *= 2;
> > +
> >  	/* return the cacheline of the address */
> > -	return (address & ~(cacheline_size() - 1));
> > +	return (address & ~(size - 1));
> >  }
> >  
> > -static inline u64 cl_offset(u64 address)
> > +static inline u64 cl_offset(u64 address, bool double_cl)
> >  {
> > -	/* return the cacheline of the address */
> > -	return (address & (cacheline_size() - 1));
> > +	u64 size = cacheline_size();
> > +
> > +	if (double_cl)
> > +		size *= 2;
> > +
> > +	/* return the offset inside cachline */
> > +	return (address & (size - 1));
> >  }
> >  
> >  #endif // PERF_CACHELINE_H
> > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > index e188f74698dd..148b28f0a7e2 100644
> > --- a/tools/perf/util/sort.c
> > +++ b/tools/perf/util/sort.c
> > @@ -52,6 +52,13 @@ enum sort_mode	sort__mode = SORT_MODE__NORMAL;
> >  static const char *const dynamic_headers[] = {"local_ins_lat", "ins_lat", "local_p_stage_cyc", "p_stage_cyc"};
> >  static const char *const arch_specific_sort_keys[] = {"local_p_stage_cyc", "p_stage_cyc"};
> >  
> > +/*
> > + * Some architecture has Adjacent Cacheline Prefetch feature, which behaves
> > + * like the cachline size is doubled. Enable this flag to check things in
> > + * double cacheline granularity.
> > + */
> > +bool chk_double_cl;
> > +
> >  /*
> >   * Replaces all occurrences of a char used with the:
> >   *
> > @@ -1499,8 +1506,8 @@ sort__dcacheline_cmp(struct hist_entry *left, struct hist_entry *right)
> >  
> >  addr:
> >  	/* al_addr does all the right addr - start + offset calculations */
> > -	l = cl_address(left->mem_info->daddr.al_addr);
> > -	r = cl_address(right->mem_info->daddr.al_addr);
> > +	l = cl_address(left->mem_info->daddr.al_addr, chk_double_cl);
> > +	r = cl_address(right->mem_info->daddr.al_addr, chk_double_cl);
> >  
> >  	if (l > r) return -1;
> >  	if (l < r) return 1;
> > @@ -1519,7 +1526,7 @@ static int hist_entry__dcacheline_snprintf(struct hist_entry *he, char *bf,
> >  	if (he->mem_info) {
> >  		struct map *map = he->mem_info->daddr.ms.map;
> >  
> > -		addr = cl_address(he->mem_info->daddr.al_addr);
> > +		addr = cl_address(he->mem_info->daddr.al_addr, chk_double_cl);
> >  		ms = &he->mem_info->daddr.ms;
> >  
> >  		/* print [s] for shared data mmaps */
> > diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
> > index 921715e6aec4..04f0a6dc7381 100644
> > --- a/tools/perf/util/sort.h
> > +++ b/tools/perf/util/sort.h
> > @@ -35,6 +35,7 @@ extern struct sort_entry sort_sym_from;
> >  extern struct sort_entry sort_sym_to;
> >  extern struct sort_entry sort_srcline;
> >  extern const char default_mem_sort_order[];
> > +extern bool chk_double_cl;
> >  
> >  struct res_sample {
> >  	u64 time;
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] perf c2c: Add report option to show false sharing in adjacent cachelines
  2023-02-13 13:39 ` Joe Mario
  2023-02-13 14:04   ` Feng Tang
@ 2023-02-14 19:31   ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 8+ messages in thread
From: Arnaldo Carvalho de Melo @ 2023-02-14 19:31 UTC (permalink / raw)
  To: Joe Mario
  Cc: Feng Tang, Peter Zijlstra, Ingo Molnar, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel, Andi Kleen, Kan Liang, Xing Zhengjun

Em Mon, Feb 13, 2023 at 08:39:15AM -0500, Joe Mario escreveu:
> 
> 
> On 2/12/23 10:17 PM, Feng Tang wrote:
> > Many platforms have feature of adjacent cachelines prefetch, when it
> > is enabled, for data in RAM of 2 cachelines (2N and 2N+1) granularity,
> > if one is fetched to cache, the other one could likely be fetched too,
> > which sort of extends the cacheline size to double, thus the false
> > sharing could happens in adjacent cachelines.
> > 
> > 0Day has captured performance changed related with this [1], and some
> > commercial software explicitly makes its hot global variables 128 bytes
> > aligned (2 cache lines) to avoid this kind of extended false sharing.
> > 
> > So add an option "-a" or "--double-cl" for c2c report to show false
> > sharing in double cache line granularity, which acts just like the
> > cacheline size is doubled. There is no change to c2c record. The
> > hardware HITM events are still per cacheline. The option just changes
> > the granularity of how events are grouped and displayed.
> > 
> > In the c2c report below (will-it-scale's pagefault2 case on old kernel):
> > 
> >   ----------------------------------------------------------------------
> >      26       31        2        0        0        0  0xffff888103ec6000
> >   ----------------------------------------------------------------------
> >    35.48%   50.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff8133148b      1153        66       971     3748        74  [k] get_mem_cgroup_from_mm
> >     6.45%    0.00%    0.00%    0.00%    0.00%   0x10     0       1  0xffffffff813396e4       570         0      1531      879        75  [k] mem_cgroup_charge
> >    25.81%   50.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81331472       949        70       593     3359        74  [k] get_mem_cgroup_from_mm
> >    19.35%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81339686      1352         0      1073     1022        74  [k] mem_cgroup_charge
> >     9.68%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff813396d6      1401         0       863      768        74  [k] mem_cgroup_charge
> >     3.23%    0.00%    0.00%    0.00%    0.00%   0x54     0       1  0xffffffff81333106       618         0       804       11         9  [k] uncharge_batch
> > 
> > The offset 0x10 and 0x54 used to displayed in 2 groups, and now they
> > are listed together to give users a hint.
> > 
> > [1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
> > 
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > Reviewed-by: Andi Kleen <ak@linux.intel.com>
> > ---
> >  tools/perf/Documentation/perf-c2c.txt |  6 ++++++
> >  tools/perf/builtin-c2c.c              | 22 +++++++++++++---------
> >  tools/perf/util/cacheline.h           | 25 ++++++++++++++++++++-----
> >  tools/perf/util/sort.c                | 13 ++++++++++---
> >  tools/perf/util/sort.h                |  1 +
> >  5 files changed, 50 insertions(+), 17 deletions(-)
> > 
> > diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
> > index 5c5eb2def83e..a8e1e40d270e 100644
> > --- a/tools/perf/Documentation/perf-c2c.txt
> > +++ b/tools/perf/Documentation/perf-c2c.txt
> > @@ -126,6 +126,12 @@ REPORT OPTIONS
> >  	The known limitations include exception handing such as
> >  	setjmp/longjmp will have calls/returns not match.
> >  
> > +-a::
> > +--double-cl::
> > +	Group HITM events in double cacheline granularity. Some architecture
> > +	has Adjacent Cacheline Prefetch feature, which behaves like the
> > +	cachline size is doubled.
> > +
> 
> Hi Feng:
> One or two of the recent C2C patches changed the Intel-specifc "HITM" wording to something more generic, mostly because other arches do not have HITM events.
> 
> How about changing the above description to something like:
> +-a::
> +--double-cl::
> +       Group the detection of shared cacheline events into double cacheline granularity.
> +       Some architectures have an Adjacent Cacheline Prefetch feature, which causes
> +       cacheline sharing to behave like the cachline size is doubled.
> 
> By the way, this patch is awesome.  
> It's not often that Adjacent Cacheline Prefetching causes performance problems, but when it does, it's a pain to track down.
> This will really help.  Thank you for doing it.

Hi Joe,

	May I add an Acked-by or even Reviewed-by you to the patch?

Thanks,

- Arnaldo
 
> Joe
> 
> >  C2C RECORD
> >  ----------
> >  The perf c2c record command setup options related to HITM cacheline analysis
> > diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> > index 52d94c7dd836..7d495db7e5a2 100644
> > --- a/tools/perf/builtin-c2c.c
> > +++ b/tools/perf/builtin-c2c.c
> > @@ -524,7 +524,7 @@ static int dcacheline_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
> >  	char buf[20];
> >  
> >  	if (he->mem_info)
> > -		addr = cl_address(he->mem_info->daddr.addr);
> > +		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
> >  
> >  	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
> >  }
> > @@ -562,7 +562,7 @@ static int offset_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
> >  	char buf[20];
> >  
> >  	if (he->mem_info)
> > -		addr = cl_offset(he->mem_info->daddr.al_addr);
> > +		addr = cl_offset(he->mem_info->daddr.al_addr, chk_double_cl);
> >  
> >  	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
> >  }
> > @@ -574,9 +574,10 @@ offset_cmp(struct perf_hpp_fmt *fmt __maybe_unused,
> >  	uint64_t l = 0, r = 0;
> >  
> >  	if (left->mem_info)
> > -		l = cl_offset(left->mem_info->daddr.addr);
> > +		l = cl_offset(left->mem_info->daddr.addr, chk_double_cl);
> > +
> >  	if (right->mem_info)
> > -		r = cl_offset(right->mem_info->daddr.addr);
> > +		r = cl_offset(right->mem_info->daddr.addr, chk_double_cl);
> >  
> >  	return (int64_t)(r - l);
> >  }
> > @@ -2590,7 +2591,7 @@ perf_c2c_cacheline_browser__title(struct hist_browser *browser,
> >  	he = cl_browser->he;
> >  
> >  	if (he->mem_info)
> > -		addr = cl_address(he->mem_info->daddr.addr);
> > +		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
> >  
> >  	scnprintf(bf, size, "Cacheline 0x%lx", addr);
> >  	return 0;
> > @@ -2788,15 +2789,16 @@ static int ui_quirks(void)
> >  	if (!c2c.use_stdio) {
> >  		dim_offset.width  = 5;
> >  		dim_offset.header = header_offset_tui;
> > -		nodestr = "CL";
> > +		nodestr = chk_double_cl ? "Double-CL" : "CL";
> >  	}
> >  
> >  	dim_percent_costly_snoop.header = percent_costly_snoop_header[c2c.display];
> >  
> >  	/* Fix the zero line for dcacheline column. */
> > -	buf = fill_line("Cacheline", dim_dcacheline.width +
> > -				     dim_dcacheline_node.width +
> > -				     dim_dcacheline_count.width + 4);
> > +	buf = fill_line(chk_double_cl ? "Double-Cacheline" : "Cacheline",
> > +				dim_dcacheline.width +
> > +				dim_dcacheline_node.width +
> > +				dim_dcacheline_count.width + 4);
> >  	if (!buf)
> >  		return -ENOMEM;
> >  
> > @@ -3037,6 +3039,8 @@ static int perf_c2c__report(int argc, const char **argv)
> >  	OPT_BOOLEAN('f', "force", &symbol_conf.force, "don't complain, do it"),
> >  	OPT_BOOLEAN(0, "stitch-lbr", &c2c.stitch_lbr,
> >  		    "Enable LBR callgraph stitching approach"),
> > +	OPT_BOOLEAN('a', "double-cl", &chk_double_cl,
> > +		    "Check adjacent cachline false sharing"),
> >  	OPT_PARENT(c2c_options),
> >  	OPT_END()
> >  	};
> > diff --git a/tools/perf/util/cacheline.h b/tools/perf/util/cacheline.h
> > index dec8c0fb1f4a..630d16731b4f 100644
> > --- a/tools/perf/util/cacheline.h
> > +++ b/tools/perf/util/cacheline.h
> > @@ -6,16 +6,31 @@
> >  
> >  int __pure cacheline_size(void);
> >  
> > -static inline u64 cl_address(u64 address)
> > +
> > +/*
> > + * Some architecture has 'Adjacent Cacheline Prefetch' feature,
> > + * which performs like the cacheline size being doubled.
> > + */
> > +static inline u64 cl_address(u64 address, bool double_cl)
> >  {
> > +	u64 size = cacheline_size();
> > +
> > +	if (double_cl)
> > +		size *= 2;
> > +
> >  	/* return the cacheline of the address */
> > -	return (address & ~(cacheline_size() - 1));
> > +	return (address & ~(size - 1));
> >  }
> >  
> > -static inline u64 cl_offset(u64 address)
> > +static inline u64 cl_offset(u64 address, bool double_cl)
> >  {
> > -	/* return the cacheline of the address */
> > -	return (address & (cacheline_size() - 1));
> > +	u64 size = cacheline_size();
> > +
> > +	if (double_cl)
> > +		size *= 2;
> > +
> > +	/* return the offset inside cachline */
> > +	return (address & (size - 1));
> >  }
> >  
> >  #endif // PERF_CACHELINE_H
> > diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> > index e188f74698dd..148b28f0a7e2 100644
> > --- a/tools/perf/util/sort.c
> > +++ b/tools/perf/util/sort.c
> > @@ -52,6 +52,13 @@ enum sort_mode	sort__mode = SORT_MODE__NORMAL;
> >  static const char *const dynamic_headers[] = {"local_ins_lat", "ins_lat", "local_p_stage_cyc", "p_stage_cyc"};
> >  static const char *const arch_specific_sort_keys[] = {"local_p_stage_cyc", "p_stage_cyc"};
> >  
> > +/*
> > + * Some architecture has Adjacent Cacheline Prefetch feature, which behaves
> > + * like the cachline size is doubled. Enable this flag to check things in
> > + * double cacheline granularity.
> > + */
> > +bool chk_double_cl;
> > +
> >  /*
> >   * Replaces all occurrences of a char used with the:
> >   *
> > @@ -1499,8 +1506,8 @@ sort__dcacheline_cmp(struct hist_entry *left, struct hist_entry *right)
> >  
> >  addr:
> >  	/* al_addr does all the right addr - start + offset calculations */
> > -	l = cl_address(left->mem_info->daddr.al_addr);
> > -	r = cl_address(right->mem_info->daddr.al_addr);
> > +	l = cl_address(left->mem_info->daddr.al_addr, chk_double_cl);
> > +	r = cl_address(right->mem_info->daddr.al_addr, chk_double_cl);
> >  
> >  	if (l > r) return -1;
> >  	if (l < r) return 1;
> > @@ -1519,7 +1526,7 @@ static int hist_entry__dcacheline_snprintf(struct hist_entry *he, char *bf,
> >  	if (he->mem_info) {
> >  		struct map *map = he->mem_info->daddr.ms.map;
> >  
> > -		addr = cl_address(he->mem_info->daddr.al_addr);
> > +		addr = cl_address(he->mem_info->daddr.al_addr, chk_double_cl);
> >  		ms = &he->mem_info->daddr.ms;
> >  
> >  		/* print [s] for shared data mmaps */
> > diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
> > index 921715e6aec4..04f0a6dc7381 100644
> > --- a/tools/perf/util/sort.h
> > +++ b/tools/perf/util/sort.h
> > @@ -35,6 +35,7 @@ extern struct sort_entry sort_sym_from;
> >  extern struct sort_entry sort_sym_to;
> >  extern struct sort_entry sort_srcline;
> >  extern const char default_mem_sort_order[];
> > +extern bool chk_double_cl;
> >  
> >  struct res_sample {
> >  	u64 time;
> 

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-02-14 19:31 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-13  3:17 [PATCH] perf c2c: Add report option to show false sharing in adjacent cachelines Feng Tang
2023-02-13  8:02 ` Leo Yan
2023-02-13  8:18 ` Leo Yan
2023-02-13  8:24   ` Feng Tang
2023-02-13  8:54     ` Leo Yan
2023-02-13 13:39 ` Joe Mario
2023-02-13 14:04   ` Feng Tang
2023-02-14 19:31   ` Arnaldo Carvalho de Melo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).