linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] Extend Parsing "ibm,thread-groups" for Shared-L2 information
@ 2020-12-04  4:48 Gautham R. Shenoy
  2020-12-04  4:48 ` [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties Gautham R. Shenoy
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Gautham R. Shenoy @ 2020-12-04  4:48 UTC (permalink / raw)
  To: Srikar Dronamraju, Anton Blanchard, Vaidyanathan Srinivasan,
	Michael Ellerman, Michael Neuling, Nicholas Piggin, Nathan Lynch,
	Peter Zijlstra, Valentin Schneider
  Cc: linuxppc-dev, linux-kernel, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

The "ibm,thread-groups" device-tree property is an array that is used
to indicate if groups of threads within a core share certain
properties. It provides details of which property is being shared by
which groups of threads. This array can encode information about
multiple properties being shared by different thread-groups within the
core.

Example: Suppose,
"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where in,

a) provides information about Property "1" being shared by "2" groups,
   each with "4" threads. The "ibm,ppc-interrupt-server#s" of the
   first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of
   the second group is {9,11,13,15}. Property "1" indicates that the
   threads in the group share the L1 cache, the translation cache and
   the Instruction Data flow.

b) provides information about Property "2" being shared by "2" groups,
   each with "4" threads. The "ibm,ppc-interrupt-server#s" of
   the first group is {8,10,12,14} and the
   "ibm,ppc-interrupt-server#s" of the second group is
   {9,11,13,15}. Property "2" indicates that the threads in each group
   share the L2-cache.
   
The existing code assumes that "ibm,thread-groups" encodes
information about only one property. Hence, even on platforms which
encode information about multiple properties being shared by the
corresponding groups of threads, the current code will only pick the
first one (in the above example, it will only consider
[1,2,4,8,10,12,14,9,11,13,15] and ignore [2,2,4,8,10,12,14,9,11,13,15]).

Furthermore, currently on platforms where groups of threads share L2
cache, we incorrectly create an extra CACHE level sched-domain that
maps to all the threads of the core.

For example, if "ibm,thread-groups" is 
		 00000001 00000002 00000004 00000000
		 00000002 00000004 00000006 00000001
		 00000003 00000005 00000007 00000002
		 00000002 00000004 00000000 00000002
		 00000004 00000006 00000001 00000003
		 00000005 00000007

then, the sub-array
[00000002 00000002 00000004
 00000000 00000002 00000004 00000006
 00000001 00000003 00000005 00000007]
indicates that L2 (Property "2") is shared only between the threads
of a single group. There are "2" groups, each containing "4" threads:
{0,2,4,6} and {1,3,5,7}.

However, the sched-domain hierarchy for CPUs 0,1 is
	CPU0 attaching sched-domain(s):
	domain-0: span=0,2,4,6 level=SMT
	domain-1: span=0-7 level=CACHE
	domain-2: span=0-15,24-39,48-55 level=MC
	domain-3: span=0-55 level=DIE

	CPU1 attaching sched-domain(s):
	domain-0: span=1,3,5,7 level=SMT
	domain-1: span=0-7 level=CACHE
	domain-2: span=0-15,24-39,48-55 level=MC
	domain-3: span=0-55 level=DIE

where the CACHE domain reports that L2 is shared across the entire
core which is incorrect on such platforms.


This patchset remedies these issues by extending the parsing support
for "ibm,thread-groups" to discover information about multiple
properties being shared by the corresponding groups of threads. In
particular, we can now detect if the groups of threads within a core
share the L2-cache. On such platforms, we populate the
cpu_l2_cache_mask of every CPU with the core-siblings which share L2
with the CPU, as specified by the "ibm,thread-groups" property
array.

With the patchset, the sched-domain hierarchy is correctly
reported. For example, for CPUs 0 and 1:

     	CPU0 attaching sched-domain(s):
	domain-0: span=0,2,4,6 level=SMT
	domain-1: span=0-15,24-39,48-55 level=MC
	domain-2: span=0-55 level=DIE

	CPU1 attaching sched-domain(s):
	domain-0: span=1,3,5,7 level=SMT
	domain-1: span=0-15,24-39,48-55 level=MC
	domain-2: span=0-55 level=DIE

The CACHE domain with span=0,2,4,6 for CPU 0 (resp. span=1,3,5,7 for
CPU 1) degenerates into the SMT domain. Furthermore, the
last-level-cache domain is now correctly set to the SMT sched-domain.

Finally, this patchset reports the correct shared_cpu_map/list in
sysfs for the L2 cache on such platforms. With the patchset, for
CPUs 0 and 1, we see the correct L2 shared_cpu_map/list:

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:000000,00000055

/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:1,3,5,7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:000000,000000aa

The patchset has been tested on older platforms which encode only the
L1 sharing information via "ibm,thread-groups", and no regressions
were found.


Gautham R. Shenoy (3):
  powerpc/smp: Parse ibm,thread-groups with multiple properties
  powerpc/smp: Add support detecting thread-groups sharing L2 cache
  powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache

 arch/powerpc/kernel/cacheinfo.c |   7 ++
 arch/powerpc/kernel/smp.c       | 198 +++++++++++++++++++++++++++++-----------
 2 files changed, 150 insertions(+), 55 deletions(-)

-- 
1.9.4


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties
  2020-12-04  4:48 [PATCH 0/3] Extend Parsing "ibm,thread-groups" for Shared-L2 information Gautham R. Shenoy
@ 2020-12-04  4:48 ` Gautham R. Shenoy
  2020-12-07 12:10   ` Srikar Dronamraju
  2020-12-04  4:48 ` [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache Gautham R. Shenoy
  2020-12-04  4:48 ` [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for " Gautham R. Shenoy
  2 siblings, 1 reply; 16+ messages in thread
From: Gautham R. Shenoy @ 2020-12-04  4:48 UTC (permalink / raw)
  To: Srikar Dronamraju, Anton Blanchard, Vaidyanathan Srinivasan,
	Michael Ellerman, Michael Neuling, Nicholas Piggin, Nathan Lynch,
	Peter Zijlstra, Valentin Schneider
  Cc: linuxppc-dev, linux-kernel, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

The "ibm,thread-groups" device-tree property is an array that is used
to indicate if groups of threads within a core share certain
properties. It provides details of which property is being shared by
which groups of threads. This array can encode information about
multiple properties being shared by different thread-groups within the
core.

Example: Suppose,
"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where in,

a) provides information about Property "1" being shared by "2" groups,
   each with "4" threads. The "ibm,ppc-interrupt-server#s" of the
   first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of
   the second group is {9,11,13,15}. Property "1" indicates that the
   threads in the group share the L1 cache, the translation cache and
   the Instruction Data flow.

b) provides information about Property "2" being shared by "2" groups,
   each with "4" threads. The "ibm,ppc-interrupt-server#s" of
   the first group is {8,10,12,14} and the
   "ibm,ppc-interrupt-server#s" of the second group is
   {9,11,13,15}. Property "2" indicates that the threads in each group
   share the L2-cache.
   
The existing code assumes that "ibm,thread-groups" encodes
information about only one property. Hence, even on platforms which
encode information about multiple properties being shared by the
corresponding groups of threads, the current code will only pick the
first one (in the above example, it will only consider
[1,2,4,8,10,12,14,9,11,13,15] and ignore [2,2,4,8,10,12,14,9,11,13,15]).

This patch extends the parsing support to handle platforms which
encode information about multiple properties being shared by the
corresponding groups of threads.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 146 +++++++++++++++++++++++++++++-----------------
 1 file changed, 92 insertions(+), 54 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8c2857c..6a242a3 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -106,6 +106,15 @@ struct thread_groups {
 	unsigned int thread_list[MAX_THREAD_LIST_SIZE];
 };
 
+/* Maximum number of properties that groups of threads within a core can share */
+#define MAX_THREAD_GROUP_PROPERTIES 1
+
+struct thread_groups_list {
+	unsigned int nr_properties;
+	struct thread_groups property_tgs[MAX_THREAD_GROUP_PROPERTIES];
+};
+
+static struct thread_groups_list tgl[NR_CPUS] __initdata;
 /*
  * On big-cores system, cpu_l1_cache_map for each CPU corresponds to
  * the set its siblings that share the L1-cache.
@@ -695,81 +704,94 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
 /*
  * parse_thread_groups: Parses the "ibm,thread-groups" device tree
  *                      property for the CPU device node @dn and stores
- *                      the parsed output in the thread_groups
- *                      structure @tg if the ibm,thread-groups[0]
- *                      matches @property.
+ *                      the parsed output in the thread_groups_list
+ *                      structure @tglp.
  *
  * @dn: The device node of the CPU device.
- * @tg: Pointer to a thread group structure into which the parsed
+ * @tglp: Pointer to a thread group list structure into which the parsed
  *      output of "ibm,thread-groups" is stored.
- * @property: The property of the thread-group that the caller is
- *            interested in.
  *
  * ibm,thread-groups[0..N-1] array defines which group of threads in
  * the CPU-device node can be grouped together based on the property.
  *
- * ibm,thread-groups[0] tells us the property based on which the
+ * This array can represent thread groupings for multiple properties.
+ *
+ * ibm,thread-groups[i + 0] tells us the property based on which the
  * threads are being grouped together. If this value is 1, it implies
  * that the threads in the same group share L1, translation cache.
  *
- * ibm,thread-groups[1] tells us how many such thread groups exist.
+ * ibm,thread-groups[i+1] tells us how many such thread groups exist for the
+ * property ibm,thread-groups[i]
  *
- * ibm,thread-groups[2] tells us the number of threads in each such
+ * ibm,thread-groups[i+2] tells us the number of threads in each such
  * group.
+ * Suppose k = (ibm,thread-groups[i+1] * ibm,thread-groups[i+2]), then,
  *
- * ibm,thread-groups[3..N-1] is the list of threads identified by
+ * ibm,thread-groups[i+3..i+k+2] is the list of threads identified by
  * "ibm,ppc-interrupt-server#s" arranged as per their membership in
  * the grouping.
  *
- * Example: If ibm,thread-groups = [1,2,4,5,6,7,8,9,10,11,12] it
- * implies that there are 2 groups of 4 threads each, where each group
- * of threads share L1, translation cache.
+ * Example:
+ * If ibm,thread-groups = [1,2,4,5,6,7,8,9,10,11,12,2,2,4,5,7,9,11,6,8,10,12]
+ * This can be decomposed into two consecutive arrays:
+ * a) [1,2,4,5,6,7,8,9,10,11,12]
+ * b) [2,2,4,5,7,9,11,6,8,10,12]
+ * where in,
  *
- * The "ibm,ppc-interrupt-server#s" of the first group is {5,6,7,8}
- * and the "ibm,ppc-interrupt-server#s" of the second group is {9, 10,
- * 11, 12} structure
+ * a) there are 2 groups of 4 threads each, where each group of
+ * threads share Property 1 (L1, translation cache). The
+ * "ibm,ppc-interrupt-server#s" of the first group is {5,6,7,8} and
+ * the "ibm,ppc-interrupt-server#s" of the second group is {9, 10, 11,
+ * 12}.
+ *
+ * b) there are 2 groups of 4 threads each, where each group of
+ *    threads share some property indicated by the first value 2. The
+ *    "ibm,ppc-interrupt-server#s" of the first group is {5,7,9,11}
+ *    and the "ibm,ppc-interrupt-server#s" of the second group is
+ *    {6,8,10,12} structure
  *
  * Returns 0 on success, -EINVAL if the property does not exist,
  * -ENODATA if property does not have a value, and -EOVERFLOW if the
  * property data isn't large enough.
  */
 static int parse_thread_groups(struct device_node *dn,
-			       struct thread_groups *tg,
-			       unsigned int property)
+			       struct thread_groups_list *tglp)
 {
-	int i;
-	u32 thread_group_array[3 + MAX_THREAD_LIST_SIZE];
+	int i = 0;
+	u32 *thread_group_array;
 	u32 *thread_list;
 	size_t total_threads;
-	int ret;
+	int ret = 0, count;
+	unsigned int property_idx = 0;
 
+	count = of_property_count_u32_elems(dn, "ibm,thread-groups");
+	thread_group_array = kcalloc(count, sizeof(u32), GFP_KERNEL);
 	ret = of_property_read_u32_array(dn, "ibm,thread-groups",
-					 thread_group_array, 3);
+					 thread_group_array, count);
 	if (ret)
-		return ret;
-
-	tg->property = thread_group_array[0];
-	tg->nr_groups = thread_group_array[1];
-	tg->threads_per_group = thread_group_array[2];
-	if (tg->property != property ||
-	    tg->nr_groups < 1 ||
-	    tg->threads_per_group < 1)
-		return -ENODATA;
+		goto out_free;
 
-	total_threads = tg->nr_groups * tg->threads_per_group;
+	while (i < count && property_idx < MAX_THREAD_GROUP_PROPERTIES) {
+		int j;
+		struct thread_groups *tg = &tglp->property_tgs[property_idx++];
 
-	ret = of_property_read_u32_array(dn, "ibm,thread-groups",
-					 thread_group_array,
-					 3 + total_threads);
-	if (ret)
-		return ret;
+		tg->property = thread_group_array[i];
+		tg->nr_groups = thread_group_array[i + 1];
+		tg->threads_per_group = thread_group_array[i + 2];
+		total_threads = tg->nr_groups * tg->threads_per_group;
+
+		thread_list = &thread_group_array[i + 3];
 
-	thread_list = &thread_group_array[3];
+		for (j = 0; j < total_threads; j++)
+			tg->thread_list[j] = thread_list[j];
+		i = i + 3 + total_threads;
+	}
 
-	for (i = 0 ; i < total_threads; i++)
-		tg->thread_list[i] = thread_list[i];
+	tglp->nr_properties = property_idx;
 
-	return 0;
+out_free:
+	kfree(thread_group_array);
+	return ret;
 }
 
 /*
@@ -805,24 +827,39 @@ static int get_cpu_thread_group_start(int cpu, struct thread_groups *tg)
 	return -1;
 }
 
-static int init_cpu_l1_cache_map(int cpu)
+static int init_cpu_cache_map(int cpu, unsigned int cache_property)
 
 {
 	struct device_node *dn = of_get_cpu_node(cpu, NULL);
-	struct thread_groups tg = {.property = 0,
-				   .nr_groups = 0,
-				   .threads_per_group = 0};
+	struct thread_groups *tg = NULL;
 	int first_thread = cpu_first_thread_sibling(cpu);
 	int i, cpu_group_start = -1, err = 0;
+	cpumask_var_t *mask;
+	struct thread_groups_list *cpu_tgl = &tgl[cpu];
 
 	if (!dn)
 		return -ENODATA;
 
-	err = parse_thread_groups(dn, &tg, THREAD_GROUP_SHARE_L1);
-	if (err)
-		goto out;
+	if (!(cache_property == THREAD_GROUP_SHARE_L1))
+		return -EINVAL;
 
-	cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
+	if (!cpu_tgl->nr_properties) {
+		err = parse_thread_groups(dn, cpu_tgl);
+		if (err)
+			goto out;
+	}
+
+	for (i = 0; i < cpu_tgl->nr_properties; i++) {
+		if (cpu_tgl->property_tgs[i].property == cache_property) {
+			tg = &cpu_tgl->property_tgs[i];
+			break;
+		}
+	}
+
+	if (!tg)
+		return -EINVAL;
+
+	cpu_group_start = get_cpu_thread_group_start(cpu, tg);
 
 	if (unlikely(cpu_group_start == -1)) {
 		WARN_ON_ONCE(1);
@@ -830,11 +867,12 @@ static int init_cpu_l1_cache_map(int cpu)
 		goto out;
 	}
 
-	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
-				GFP_KERNEL, cpu_to_node(cpu));
+	mask = &per_cpu(cpu_l1_cache_map, cpu);
+
+	zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
 
 	for (i = first_thread; i < first_thread + threads_per_core; i++) {
-		int i_group_start = get_cpu_thread_group_start(i, &tg);
+		int i_group_start = get_cpu_thread_group_start(i, tg);
 
 		if (unlikely(i_group_start == -1)) {
 			WARN_ON_ONCE(1);
@@ -843,7 +881,7 @@ static int init_cpu_l1_cache_map(int cpu)
 		}
 
 		if (i_group_start == cpu_group_start)
-			cpumask_set_cpu(i, per_cpu(cpu_l1_cache_map, cpu));
+			cpumask_set_cpu(i, *mask);
 	}
 
 out:
@@ -924,7 +962,7 @@ static int init_big_cores(void)
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
-		int err = init_cpu_l1_cache_map(cpu);
+		int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L1);
 
 		if (err)
 			return err;
-- 
1.9.4



* [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache
  2020-12-04  4:48 [PATCH 0/3] Extend Parsing "ibm,thread-groups" for Shared-L2 information Gautham R. Shenoy
  2020-12-04  4:48 ` [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties Gautham R. Shenoy
@ 2020-12-04  4:48 ` Gautham R. Shenoy
  2020-12-07 12:40   ` Srikar Dronamraju
  2020-12-04  4:48 ` [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for " Gautham R. Shenoy
  2 siblings, 1 reply; 16+ messages in thread
From: Gautham R. Shenoy @ 2020-12-04  4:48 UTC (permalink / raw)
  To: Srikar Dronamraju, Anton Blanchard, Vaidyanathan Srinivasan,
	Michael Ellerman, Michael Neuling, Nicholas Piggin, Nathan Lynch,
	Peter Zijlstra, Valentin Schneider
  Cc: linuxppc-dev, linux-kernel, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

On POWER systems, groups of threads within a core sharing the L2-cache
can be indicated by the "ibm,thread-groups" property array with the
identifier "2".

This patch adds support for detecting this and, when present,
populates the cpu_l2_cache_mask of every CPU with the core-siblings
which share L2 with the CPU, as specified by the "ibm,thread-groups"
property array.

On a platform with the following "ibm,thread-groups" configuration:
		 00000001 00000002 00000004 00000000
		 00000002 00000004 00000006 00000001
		 00000003 00000005 00000007 00000002
		 00000002 00000004 00000000 00000002
		 00000004 00000006 00000001 00000003
		 00000005 00000007

Without this patch, the sched-domain hierarchy for CPUs 0,1 would be
	CPU0 attaching sched-domain(s):
	domain-0: span=0,2,4,6 level=SMT
	domain-1: span=0-7 level=CACHE
	domain-2: span=0-15,24-39,48-55 level=MC
	domain-3: span=0-55 level=DIE

	CPU1 attaching sched-domain(s):
	domain-0: span=1,3,5,7 level=SMT
	domain-1: span=0-7 level=CACHE
	domain-2: span=0-15,24-39,48-55 level=MC
	domain-3: span=0-55 level=DIE

The CACHE domain at 0-7 is incorrect since the ibm,thread-groups
sub-array
[00000002 00000002 00000004
 00000000 00000002 00000004 00000006
 00000001 00000003 00000005 00000007]
indicates that L2 (Property "2") is shared only between the threads
of a single group. There are "2" groups, each containing "4" threads:
{0,2,4,6} and {1,3,5,7}.

With this patch, the sched-domain hierarchy for CPUs 0,1 would be
     	CPU0 attaching sched-domain(s):
	domain-0: span=0,2,4,6 level=SMT
	domain-1: span=0-15,24-39,48-55 level=MC
	domain-2: span=0-55 level=DIE

	CPU1 attaching sched-domain(s):
	domain-0: span=1,3,5,7 level=SMT
	domain-1: span=0-15,24-39,48-55 level=MC
	domain-2: span=0-55 level=DIE

The CACHE domain with span=0,2,4,6 for CPU 0 (resp. span=1,3,5,7 for
CPU 1) degenerates into the SMT domain. Furthermore, the
last-level-cache domain is now correctly set to the SMT sched-domain.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 66 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 58 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6a242a3..a116d2d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -76,6 +76,7 @@
 struct task_struct *secondary_current;
 bool has_big_cores;
 bool coregroup_enabled;
+bool thread_group_shares_l2;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
@@ -99,6 +100,7 @@ enum {
 
 #define MAX_THREAD_LIST_SIZE	8
 #define THREAD_GROUP_SHARE_L1   1
+#define THREAD_GROUP_SHARE_L2   2
 struct thread_groups {
 	unsigned int property;
 	unsigned int nr_groups;
@@ -107,7 +109,7 @@ struct thread_groups {
 };
 
 /* Maximum number of properties that groups of threads within a core can share */
-#define MAX_THREAD_GROUP_PROPERTIES 1
+#define MAX_THREAD_GROUP_PROPERTIES 2
 
 struct thread_groups_list {
 	unsigned int nr_properties;
@@ -121,6 +123,13 @@ struct thread_groups_list {
  */
 DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map);
 
+/*
+ * On some big-cores system, thread_group_l2_cache_map for each CPU
+ * corresponds to the set its siblings within the core that share the
+ * L2-cache.
+ */
+DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
+
 /* SMP operations for this machine */
 struct smp_ops_t *smp_ops;
 
@@ -718,7 +727,9 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
  *
  * ibm,thread-groups[i + 0] tells us the property based on which the
  * threads are being grouped together. If this value is 1, it implies
- * that the threads in the same group share L1, translation cache.
+ * that the threads in the same group share L1, translation cache. If
+ * the value is 2, it implies that the threads in the same group share
+ * the same L2 cache.
  *
  * ibm,thread-groups[i+1] tells us how many such thread groups exist for the
  * property ibm,thread-groups[i]
@@ -745,10 +756,10 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
  * 12}.
  *
  * b) there are 2 groups of 4 threads each, where each group of
- *    threads share some property indicated by the first value 2. The
- *    "ibm,ppc-interrupt-server#s" of the first group is {5,7,9,11}
- *    and the "ibm,ppc-interrupt-server#s" of the second group is
- *    {6,8,10,12} structure
+ *    threads share some property indicated by the first value 2 (L2
+ *    cache). The "ibm,ppc-interrupt-server#s" of the first group is
+ *    {5,7,9,11} and the "ibm,ppc-interrupt-server#s" of the second
+ *    group is {6,8,10,12}.
  *
  * Returns 0 on success, -EINVAL if the property does not exist,
  * -ENODATA if property does not have a value, and -EOVERFLOW if the
@@ -840,7 +851,8 @@ static int init_cpu_cache_map(int cpu, unsigned int cache_property)
 	if (!dn)
 		return -ENODATA;
 
-	if (!(cache_property == THREAD_GROUP_SHARE_L1))
+	if (!(cache_property == THREAD_GROUP_SHARE_L1 ||
+	      cache_property == THREAD_GROUP_SHARE_L2))
 		return -EINVAL;
 
 	if (!cpu_tgl->nr_properties) {
@@ -867,7 +879,10 @@ static int init_cpu_cache_map(int cpu, unsigned int cache_property)
 		goto out;
 	}
 
-	mask = &per_cpu(cpu_l1_cache_map, cpu);
+	if (cache_property == THREAD_GROUP_SHARE_L1)
+		mask = &per_cpu(cpu_l1_cache_map, cpu);
+	else if (cache_property == THREAD_GROUP_SHARE_L2)
+		mask = &per_cpu(thread_group_l2_cache_map, cpu);
 
 	zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
 
@@ -973,6 +988,16 @@ static int init_big_cores(void)
 	}
 
 	has_big_cores = true;
+
+	for_each_possible_cpu(cpu) {
+		int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2);
+
+		if (err)
+			return err;
+	}
+
+	thread_group_shares_l2 = true;
+	pr_info("Thread-groups in a core share L2-cache\n");
 	return 0;
 }
 
@@ -1287,6 +1312,31 @@ static bool update_mask_by_l2(int cpu, cpumask_var_t *mask)
 	if (has_big_cores)
 		submask_fn = cpu_smallcore_mask;
 
+
+	/*
+	 * If the threads in a thread-group share L2 cache, then
+	 * the L2-mask can be obtained from thread_group_l2_cache_map.
+	 */
+	if (thread_group_shares_l2) {
+		/* Siblings that share L1 are a subset of siblings that share L2. */
+		or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);
+		if (*mask) {
+			cpumask_andnot(*mask,
+				       per_cpu(thread_group_l2_cache_map, cpu),
+				       cpu_l2_cache_mask(cpu));
+		} else {
+			mask = &per_cpu(thread_group_l2_cache_map, cpu);
+		}
+
+		for_each_cpu(i, *mask) {
+			if (!cpu_online(i))
+				continue;
+			set_cpus_related(i, cpu, cpu_l2_cache_mask);
+		}
+
+		return true;
+	}
+
 	l2_cache = cpu_to_l2cache(cpu);
 	if (!l2_cache || !*mask) {
 		/* Assume only core siblings share cache with this CPU */
-- 
1.9.4



* [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
  2020-12-04  4:48 [PATCH 0/3] Extend Parsing "ibm,thread-groups" for Shared-L2 information Gautham R. Shenoy
  2020-12-04  4:48 ` [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties Gautham R. Shenoy
  2020-12-04  4:48 ` [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache Gautham R. Shenoy
@ 2020-12-04  4:48 ` Gautham R. Shenoy
  2020-12-07 13:11   ` Srikar Dronamraju
  2 siblings, 1 reply; 16+ messages in thread
From: Gautham R. Shenoy @ 2020-12-04  4:48 UTC (permalink / raw)
  To: Srikar Dronamraju, Anton Blanchard, Vaidyanathan Srinivasan,
	Michael Ellerman, Michael Neuling, Nicholas Piggin, Nathan Lynch,
	Peter Zijlstra, Valentin Schneider
  Cc: linuxppc-dev, linux-kernel, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

On POWER platforms where only some groups of threads within a core
share the L2-cache (indicated by the "ibm,thread-groups" device-tree
property), we currently print an incorrect shared_cpu_map/list for
the L2-cache in sysfs.

This patch reports the correct shared_cpu_map/list on such platforms.

Example:
On a platform with "ibm,thread-groups" set to
                 00000001 00000002 00000004 00000000
                 00000002 00000004 00000006 00000001
                 00000003 00000005 00000007 00000002
                 00000002 00000004 00000000 00000002
                 00000004 00000006 00000001 00000003
                 00000005 00000007

This indicates that threads {0,2,4,6} of the core share one L2-cache
and threads {1,3,5,7} of the core share another L2-cache.

However, without the patch, the shared_cpu_map/list for the L2-cache
for CPUs 0 and 1 is reported in sysfs as follows:

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:000000,000000ff

/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:000000,000000ff

With the patch, the shared_cpu_map/list for L2 cache for CPUs 0, 1 is
correctly reported as follows:

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:000000,00000055

/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:1,3,5,7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:000000,000000aa

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/cacheinfo.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 65ab9fc..1cc8f37 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -651,15 +651,22 @@ static unsigned int index_dir_to_cpu(struct cache_index_dir *index)
 	return dev->id;
 }
 
+extern bool thread_group_shares_l2;
 /*
  * On big-core systems, each core has two groups of CPUs each of which
  * has its own L1-cache. The thread-siblings which share l1-cache with
  * @cpu can be obtained via cpu_smallcore_mask().
+ *
+ * On some big-core systems, the L2 cache is shared only between some
+ * groups of siblings. This is already parsed and encoded in
+ * cpu_l2_cache_mask().
  */
 static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache *cache)
 {
 	if (cache->level == 1)
 		return cpu_smallcore_mask(cpu);
+	if (cache->level == 2 && thread_group_shares_l2)
+		return cpu_l2_cache_mask(cpu);
 
 	return &cache->shared_cpu_map;
 }
-- 
1.9.4



* Re: [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties
  2020-12-04  4:48 ` [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties Gautham R. Shenoy
@ 2020-12-07 12:10   ` Srikar Dronamraju
  2020-12-08 17:25     ` Gautham R Shenoy
  0 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2020-12-07 12:10 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Anton Blanchard, Vaidyanathan Srinivasan, Michael Ellerman,
	Michael Neuling, Nicholas Piggin, Nathan Lynch, Peter Zijlstra,
	Valentin Schneider, linuxppc-dev, linux-kernel

* Gautham R. Shenoy <ego@linux.vnet.ibm.com> [2020-12-04 10:18:45]:

> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

<snipped>

> 
>  static int parse_thread_groups(struct device_node *dn,
> -			       struct thread_groups *tg,
> -			       unsigned int property)
> +			       struct thread_groups_list *tglp)
>  {
> -	int i;
> -	u32 thread_group_array[3 + MAX_THREAD_LIST_SIZE];
> +	int i = 0;
> +	u32 *thread_group_array;
>  	u32 *thread_list;
>  	size_t total_threads;
> -	int ret;
> +	int ret = 0, count;
> +	unsigned int property_idx = 0;

NIT:
tglx mentions in one of his recent comments to try to keep a reverse
fir-tree ordering of variables where possible.

> 
> +	count = of_property_count_u32_elems(dn, "ibm,thread-groups");
> +	thread_group_array = kcalloc(count, sizeof(u32), GFP_KERNEL);
>  	ret = of_property_read_u32_array(dn, "ibm,thread-groups",
> -					 thread_group_array, 3);
> +					 thread_group_array, count);
>  	if (ret)
> -		return ret;
> -
> -	tg->property = thread_group_array[0];
> -	tg->nr_groups = thread_group_array[1];
> -	tg->threads_per_group = thread_group_array[2];
> -	if (tg->property != property ||
> -	    tg->nr_groups < 1 ||
> -	    tg->threads_per_group < 1)
> -		return -ENODATA;
> +		goto out_free;
> 
> -	total_threads = tg->nr_groups * tg->threads_per_group;
> +	while (i < count && property_idx < MAX_THREAD_GROUP_PROPERTIES) {
> +		int j;
> +		struct thread_groups *tg = &tglp->property_tgs[property_idx++];

NIT: same as above.

> 
> -	ret = of_property_read_u32_array(dn, "ibm,thread-groups",
> -					 thread_group_array,
> -					 3 + total_threads);
> -	if (ret)
> -		return ret;
> +		tg->property = thread_group_array[i];
> +		tg->nr_groups = thread_group_array[i + 1];
> +		tg->threads_per_group = thread_group_array[i + 2];
> +		total_threads = tg->nr_groups * tg->threads_per_group;
> +
> +		thread_list = &thread_group_array[i + 3];
> 
> -	thread_list = &thread_group_array[3];
> +		for (j = 0; j < total_threads; j++)
> +			tg->thread_list[j] = thread_list[j];
> +		i = i + 3 + total_threads;

	Can't we simply use memcpy instead?

> +	}
> 
> -	for (i = 0 ; i < total_threads; i++)
> -		tg->thread_list[i] = thread_list[i];
> +	tglp->nr_properties = property_idx;
> 
> -	return 0;
> +out_free:
> +	kfree(thread_group_array);
> +	return ret;
>  }
> 
>  /*
> @@ -805,24 +827,39 @@ static int get_cpu_thread_group_start(int cpu, struct thread_groups *tg)
>  	return -1;
>  }
> 
> -static int init_cpu_l1_cache_map(int cpu)
> +static int init_cpu_cache_map(int cpu, unsigned int cache_property)
> 
>  {
>  	struct device_node *dn = of_get_cpu_node(cpu, NULL);
> -	struct thread_groups tg = {.property = 0,
> -				   .nr_groups = 0,
> -				   .threads_per_group = 0};
> +	struct thread_groups *tg = NULL;
>  	int first_thread = cpu_first_thread_sibling(cpu);
>  	int i, cpu_group_start = -1, err = 0;
> +	cpumask_var_t *mask;
> +	struct thread_groups_list *cpu_tgl = &tgl[cpu];

NIT: same as 1st comment.

> 
>  	if (!dn)
>  		return -ENODATA;
> 
> -	err = parse_thread_groups(dn, &tg, THREAD_GROUP_SHARE_L1);
> -	if (err)
> -		goto out;
> +	if (!(cache_property == THREAD_GROUP_SHARE_L1))
> +		return -EINVAL;
> 
> -	cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
> +	if (!cpu_tgl->nr_properties) {
> +		err = parse_thread_groups(dn, cpu_tgl);
> +		if (err)
> +			goto out;
> +	}
> +
> +	for (i = 0; i < cpu_tgl->nr_properties; i++) {
> +		if (cpu_tgl->property_tgs[i].property == cache_property) {
> +			tg = &cpu_tgl->property_tgs[i];
> +			break;
> +		}
> +	}
> +
> +	if (!tg)
> +		return -EINVAL;
> +
> +	cpu_group_start = get_cpu_thread_group_start(cpu, tg);

This whole hunk should be moved to a new function and called before
init_cpu_cache_map. It will simplify the logic to great extent.

> 
>  	if (unlikely(cpu_group_start == -1)) {
>  		WARN_ON_ONCE(1);
> @@ -830,11 +867,12 @@ static int init_cpu_l1_cache_map(int cpu)
>  		goto out;
>  	}
> 
> -	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> -				GFP_KERNEL, cpu_to_node(cpu));
> +	mask = &per_cpu(cpu_l1_cache_map, cpu);
> +
> +	zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
> 

This hunk (and the next hunk) should be moved to next patch.

>  	for (i = first_thread; i < first_thread + threads_per_core; i++) {
> -		int i_group_start = get_cpu_thread_group_start(i, &tg);
> +		int i_group_start = get_cpu_thread_group_start(i, tg);
> 
>  		if (unlikely(i_group_start == -1)) {
>  			WARN_ON_ONCE(1);
> @@ -843,7 +881,7 @@ static int init_cpu_l1_cache_map(int cpu)
>  		}
> 
>  		if (i_group_start == cpu_group_start)
> -			cpumask_set_cpu(i, per_cpu(cpu_l1_cache_map, cpu));
> +			cpumask_set_cpu(i, *mask);
>  	}
> 
>  out:
> @@ -924,7 +962,7 @@ static int init_big_cores(void)
>  	int cpu;
> 
>  	for_each_possible_cpu(cpu) {
> -		int err = init_cpu_l1_cache_map(cpu);
> +		int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L1);
> 
>  		if (err)
>  			return err;
> -- 
> 1.9.4
> 

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache
  2020-12-04  4:48 ` [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache Gautham R. Shenoy
@ 2020-12-07 12:40   ` Srikar Dronamraju
  2020-12-08 17:42     ` Gautham R Shenoy
  0 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2020-12-07 12:40 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Anton Blanchard, Vaidyanathan Srinivasan, Michael Ellerman,
	Michael Neuling, Nicholas Piggin, Nathan Lynch, Peter Zijlstra,
	Valentin Schneider, linuxppc-dev, linux-kernel

* Gautham R. Shenoy <ego@linux.vnet.ibm.com> [2020-12-04 10:18:46]:

> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> On POWER systems, groups of threads within a core sharing the L2-cache
> can be indicated by the "ibm,thread-groups" property array with the
> identifier "2".
> 
> This patch adds support for detecting this, and when present, populates
> the cpu_l2_cache_mask of every CPU with the core-siblings which share
> L2 with that CPU, as specified by the "ibm,thread-groups" property
> array.
> 
> On a platform with the following "ibm,thread-groups" configuration
> 		 00000001 00000002 00000004 00000000
> 		 00000002 00000004 00000006 00000001
> 		 00000003 00000005 00000007 00000002
> 		 00000002 00000004 00000000 00000002
> 		 00000004 00000006 00000001 00000003
> 		 00000005 00000007
> 
> Without this patch, the sched-domain hierarchy for CPUs 0,1 would be
> 	CPU0 attaching sched-domain(s):
> 	domain-0: span=0,2,4,6 level=SMT
> 	domain-1: span=0-7 level=CACHE
> 	domain-2: span=0-15,24-39,48-55 level=MC
> 	domain-3: span=0-55 level=DIE
> 
> 	CPU1 attaching sched-domain(s):
> 	domain-0: span=1,3,5,7 level=SMT
> 	domain-1: span=0-7 level=CACHE
> 	domain-2: span=0-15,24-39,48-55 level=MC
> 	domain-3: span=0-55 level=DIE
> 
> The CACHE domain at 0-7 is incorrect since the ibm,thread-groups
> sub-array
> [00000002 00000002 00000004
>  00000000 00000002 00000004 00000006
>  00000001 00000003 00000005 00000007]
> indicates that L2 (Property "2") is shared only between the threads of
> a single group. There are "2" groups of threads, each containing "4"
> threads. The groups are {0,2,4,6} and {1,3,5,7}.
> 
> With this patch, the sched-domain hierarchy for CPUs 0,1 would be
>      	CPU0 attaching sched-domain(s):
> 	domain-0: span=0,2,4,6 level=SMT
> 	domain-1: span=0-15,24-39,48-55 level=MC
> 	domain-2: span=0-55 level=DIE
> 
> 	CPU1 attaching sched-domain(s):
> 	domain-0: span=1,3,5,7 level=SMT
> 	domain-1: span=0-15,24-39,48-55 level=MC
> 	domain-2: span=0-55 level=DIE
> 
> The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
> resp.) gets degenerated into the SMT domain. Furthermore, the
> last-level-cache domain gets correctly set to the SMT sched-domain.
> 
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/smp.c | 66 +++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 58 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 6a242a3..a116d2d 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -76,6 +76,7 @@
>  struct task_struct *secondary_current;
>  bool has_big_cores;
>  bool coregroup_enabled;
> +bool thread_group_shares_l2;

Either keep this as static in this patch or add its declaration

> 
>  DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
> @@ -99,6 +100,7 @@ enum {
> 
>  #define MAX_THREAD_LIST_SIZE	8
>  #define THREAD_GROUP_SHARE_L1   1
> +#define THREAD_GROUP_SHARE_L2   2
>  struct thread_groups {
>  	unsigned int property;
>  	unsigned int nr_groups;
> @@ -107,7 +109,7 @@ struct thread_groups {
>  };
> 
>  /* Maximum number of properties that groups of threads within a core can share */
> -#define MAX_THREAD_GROUP_PROPERTIES 1
> +#define MAX_THREAD_GROUP_PROPERTIES 2
> 
>  struct thread_groups_list {
>  	unsigned int nr_properties;
> @@ -121,6 +123,13 @@ struct thread_groups_list {
>   */
>  DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map);
> 
> +/*
> + * On some big-cores system, thread_group_l2_cache_map for each CPU
> + * corresponds to the set its siblings within the core that share the
> + * L2-cache.
> + */
> +DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
> +

NIT:
We are trying to confuse ourselves with the names.
For L1 we have cpu_l2_cache_map to store the tasks from the thread group.
but cpu_smallcore_map for keeping track of tasks.

For L2 we have thread_group_l2_cache_map to store the tasks from the thread
group.  but cpu_l2_cache_map for keeping track of tasks.

I think we should do some renaming to keep the names consistent.
I would probably say we move the current cpu_l2_cache_map to
cpu_llc_cache_map and the new thread_group_l2_cache_map to
cpu_l2_cache_map, to be somewhat consistent.

>  /* SMP operations for this machine */
>  struct smp_ops_t *smp_ops;
> 
> @@ -840,7 +851,8 @@ static int init_cpu_cache_map(int cpu, unsigned int cache_property)
>  	if (!dn)
>  		return -ENODATA;
> 
> -	if (!(cache_property == THREAD_GROUP_SHARE_L1))
> +	if (!(cache_property == THREAD_GROUP_SHARE_L1 ||
> +	      cache_property == THREAD_GROUP_SHARE_L2))
>  		return -EINVAL;
> 
>  	if (!cpu_tgl->nr_properties) {
> @@ -867,7 +879,10 @@ static int init_cpu_cache_map(int cpu, unsigned int cache_property)
>  		goto out;
>  	}
> 
> -	mask = &per_cpu(cpu_l1_cache_map, cpu);
> +	if (cache_property == THREAD_GROUP_SHARE_L1)
> +		mask = &per_cpu(cpu_l1_cache_map, cpu);
> +	else if (cache_property == THREAD_GROUP_SHARE_L2)
> +		mask = &per_cpu(thread_group_l2_cache_map, cpu);
> 
>  	zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
> 
> @@ -973,6 +988,16 @@ static int init_big_cores(void)
>  	}
> 
>  	has_big_cores = true;
> +
> +	for_each_possible_cpu(cpu) {
> +		int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2);
> +
> +		if (err)
> +			return err;
> +	}
> +
> +	thread_group_shares_l2 = true;

Why do we need a separate loop? Why can't we merge this into the above
loop itself?

> +	pr_info("Thread-groups in a core share L2-cache\n");

Can this be moved to a pr_debug? Does it help any regular users/admins
to know if thread-groups share the L2 cache? In fact it may confuse
users about what thread groups are and which thread groups don't share
a cache. I would prefer some other name than thread_group_shares_l2,
but I don't know any better alternatives, and maybe my choices are even
worse.

>  	return 0;
>  }
> 
> @@ -1287,6 +1312,31 @@ static bool update_mask_by_l2(int cpu, cpumask_var_t *mask)
>  	if (has_big_cores)
>  		submask_fn = cpu_smallcore_mask;
> 
> +

NIT: extra blank line?

> +	/*
> +	 * If the threads in a thread-group share L2 cache, then
> +	 * the L2-mask can be obtained from thread_group_l2_cache_map.
> +	 */
> +	if (thread_group_shares_l2) {
> +		/* Siblings that share L1 is a subset of siblings that share L2.*/
> +		or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);
> +		if (*mask) {
> +			cpumask_andnot(*mask,
> +				       per_cpu(thread_group_l2_cache_map, cpu),
> +				       cpu_l2_cache_mask(cpu));
> +		} else {
> +			mask = &per_cpu(thread_group_l2_cache_map, cpu);
> +		}
> +
> +		for_each_cpu(i, *mask) {
> +			if (!cpu_online(i))
> +				continue;
> +			set_cpus_related(i, cpu, cpu_l2_cache_mask);
> +		}
> +
> +		return true;
> +	}
> +

Ah this can be simplified to:
if (thread_group_shares_l2) {
	cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu));

	for_each_cpu(i, per_cpu(thread_group_l2_cache_map, cpu)) {
		if (cpu_online(i))
			set_cpus_related(i, cpu, cpu_l2_cache_mask);
	}
}

No?

>  	l2_cache = cpu_to_l2cache(cpu);
>  	if (!l2_cache || !*mask) {
>  		/* Assume only core siblings share cache with this CPU */

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
  2020-12-04  4:48 ` [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for " Gautham R. Shenoy
@ 2020-12-07 13:11   ` Srikar Dronamraju
  2020-12-08 17:56     ` Gautham R Shenoy
  0 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2020-12-07 13:11 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Anton Blanchard, Vaidyanathan Srinivasan, Michael Ellerman,
	Michael Neuling, Nicholas Piggin, Nathan Lynch, Peter Zijlstra,
	Valentin Schneider, linuxppc-dev, linux-kernel

* Gautham R. Shenoy <ego@linux.vnet.ibm.com> [2020-12-04 10:18:47]:

> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> 
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
> 
> +extern bool thread_group_shares_l2;
>  /*
>   * On big-core systems, each core has two groups of CPUs each of which
>   * has its own L1-cache. The thread-siblings which share l1-cache with
>   * @cpu can be obtained via cpu_smallcore_mask().
> + *
> + * On some big-core systems, the L2 cache is shared only between some
> + * groups of siblings. This is already parsed and encoded in
> + * cpu_l2_cache_mask().
>   */
>  static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache *cache)
>  {
>  	if (cache->level == 1)
>  		return cpu_smallcore_mask(cpu);
> +	if (cache->level == 2 && thread_group_shares_l2)
> +		return cpu_l2_cache_mask(cpu);
> 
>  	return &cache->shared_cpu_map;

As pointed with lkp@intel.org, we need to do this only with #CONFIG_SMP,
even for cache->level = 1 too.

I agree that we are displaying shared_cpu_map correctly. But should we
also update/clear shared_cpu_map in the first place? For example: for a
P9 core with CPUs 0-7, the cache->shared_cpu_map for L1 would have 0-7
but would display 0,2,4,6.

The drawback of this is that even if CPUs 0,2,4,6 are released, the L1
cache will not be released. Is this as expected?


-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties
  2020-12-07 12:10   ` Srikar Dronamraju
@ 2020-12-08 17:25     ` Gautham R Shenoy
  2020-12-09  3:59       ` Michael Ellerman
  2020-12-09  8:35       ` Srikar Dronamraju
  0 siblings, 2 replies; 16+ messages in thread
From: Gautham R Shenoy @ 2020-12-08 17:25 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Gautham R. Shenoy, Anton Blanchard, Vaidyanathan Srinivasan,
	Michael Ellerman, Michael Neuling, Nicholas Piggin, Nathan Lynch,
	Peter Zijlstra, Valentin Schneider, linuxppc-dev, linux-kernel

Hello Srikar,

Thanks for taking a look at the patch.

On Mon, Dec 07, 2020 at 05:40:42PM +0530, Srikar Dronamraju wrote:
> * Gautham R. Shenoy <ego@linux.vnet.ibm.com> [2020-12-04 10:18:45]:
> 
> > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> <snipped>
> 
> > 
> >  static int parse_thread_groups(struct device_node *dn,
> > -			       struct thread_groups *tg,
> > -			       unsigned int property)
> > +			       struct thread_groups_list *tglp)
> >  {
> > -	int i;
> > -	u32 thread_group_array[3 + MAX_THREAD_LIST_SIZE];
> > +	int i = 0;
> > +	u32 *thread_group_array;
> >  	u32 *thread_list;
> >  	size_t total_threads;
> > -	int ret;
> > +	int ret = 0, count;
> > +	unsigned int property_idx = 0;
> 
> NIT:
> tglx mentions in one of his recent comments to try keep a reverse fir tree
> ordering of variables where possible.

I suppose you mean moving the longer local variable declarations to the
top and the shorter ones to the bottom. Thanks. Will fix this.


> 
> > 
> > +	count = of_property_count_u32_elems(dn, "ibm,thread-groups");
> > +	thread_group_array = kcalloc(count, sizeof(u32), GFP_KERNEL);
> >  	ret = of_property_read_u32_array(dn, "ibm,thread-groups",
> > -					 thread_group_array, 3);
> > +					 thread_group_array, count);
> >  	if (ret)
> > -		return ret;
> > -
> > -	tg->property = thread_group_array[0];
> > -	tg->nr_groups = thread_group_array[1];
> > -	tg->threads_per_group = thread_group_array[2];
> > -	if (tg->property != property ||
> > -	    tg->nr_groups < 1 ||
> > -	    tg->threads_per_group < 1)
> > -		return -ENODATA;
> > +		goto out_free;
> > 
> > -	total_threads = tg->nr_groups * tg->threads_per_group;
> > +	while (i < count && property_idx < MAX_THREAD_GROUP_PROPERTIES) {
> > +		int j;
> > +		struct thread_groups *tg = &tglp->property_tgs[property_idx++];
> 
> NIT: same as above.

Ok.
> 
> > 
> > -	ret = of_property_read_u32_array(dn, "ibm,thread-groups",
> > -					 thread_group_array,
> > -					 3 + total_threads);
> > -	if (ret)
> > -		return ret;
> > +		tg->property = thread_group_array[i];
> > +		tg->nr_groups = thread_group_array[i + 1];
> > +		tg->threads_per_group = thread_group_array[i + 2];
> > +		total_threads = tg->nr_groups * tg->threads_per_group;
> > +
> > +		thread_list = &thread_group_array[i + 3];
> > 
> > -	thread_list = &thread_group_array[3];
> > +		for (j = 0; j < total_threads; j++)
> > +			tg->thread_list[j] = thread_list[j];
> > +		i = i + 3 + total_threads;
> 
> 	Can't we simply use memcpy instead?

We could. But this one makes it more explicit.


> 
> > +	}
> > 
> > -	for (i = 0 ; i < total_threads; i++)
> > -		tg->thread_list[i] = thread_list[i];
> > +	tglp->nr_properties = property_idx;
> > 
> > -	return 0;
> > +out_free:
> > +	kfree(thread_group_array);
> > +	return ret;
> >  }
> > 
> >  /*
> > @@ -805,24 +827,39 @@ static int get_cpu_thread_group_start(int cpu, struct thread_groups *tg)
> >  	return -1;
> >  }
> > 
> > -static int init_cpu_l1_cache_map(int cpu)
> > +static int init_cpu_cache_map(int cpu, unsigned int cache_property)
> > 
> >  {
> >  	struct device_node *dn = of_get_cpu_node(cpu, NULL);
> > -	struct thread_groups tg = {.property = 0,
> > -				   .nr_groups = 0,
> > -				   .threads_per_group = 0};
> > +	struct thread_groups *tg = NULL;
> >  	int first_thread = cpu_first_thread_sibling(cpu);
> >  	int i, cpu_group_start = -1, err = 0;
> > +	cpumask_var_t *mask;
> > +	struct thread_groups_list *cpu_tgl = &tgl[cpu];
> 
> NIT: same as 1st comment.

Sure, will fix this.

> 
> > 
> >  	if (!dn)
> >  		return -ENODATA;
> > 
> > -	err = parse_thread_groups(dn, &tg, THREAD_GROUP_SHARE_L1);
> > -	if (err)
> > -		goto out;
> > +	if (!(cache_property == THREAD_GROUP_SHARE_L1))
> > +		return -EINVAL;
> > 
> > -	cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
> > +	if (!cpu_tgl->nr_properties) {
> > +		err = parse_thread_groups(dn, cpu_tgl);
> > +		if (err)
> > +			goto out;
> > +	}
> > +
> > +	for (i = 0; i < cpu_tgl->nr_properties; i++) {
> > +		if (cpu_tgl->property_tgs[i].property == cache_property) {
> > +			tg = &cpu_tgl->property_tgs[i];
> > +			break;
> > +		}
> > +	}
> > +
> > +	if (!tg)
> > +		return -EINVAL;
> > +
> > +	cpu_group_start = get_cpu_thread_group_start(cpu, tg);
> 
> This whole hunk should be moved to a new function and called before
> init_cpu_cache_map. It will simplify the logic to great extent.

I suppose you are referring to the part where we select the correct
tg. Yeah, that can move to a different helper.

> 
> > 
> >  	if (unlikely(cpu_group_start == -1)) {
> >  		WARN_ON_ONCE(1);
> > @@ -830,11 +867,12 @@ static int init_cpu_l1_cache_map(int cpu)
> >  		goto out;
> >  	}
> > 
> > -	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > -				GFP_KERNEL, cpu_to_node(cpu));
> > +	mask = &per_cpu(cpu_l1_cache_map, cpu);
> > +
> > +	zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
> > 
> 
> This hunk (and the next hunk) should be moved to next patch.
>

The next patch is only about introducing THREAD_GROUP_SHARE_L2. Hence I
put the rest of the code in this patch, since it seems to be a logical
place to collate whatever we have in a generic form.



> >  	for (i = first_thread; i < first_thread + threads_per_core; i++) {
> > -		int i_group_start = get_cpu_thread_group_start(i, &tg);
> > +		int i_group_start = get_cpu_thread_group_start(i, tg);
> > 
> >  		if (unlikely(i_group_start == -1)) {
> >  			WARN_ON_ONCE(1);
> > @@ -843,7 +881,7 @@ static int init_cpu_l1_cache_map(int cpu)
> >  		}
> > 
> >  		if (i_group_start == cpu_group_start)
> > -			cpumask_set_cpu(i, per_cpu(cpu_l1_cache_map, cpu));
> > +			cpumask_set_cpu(i, *mask);
> >  	}
> > 
> >  out:
> > @@ -924,7 +962,7 @@ static int init_big_cores(void)
> >  	int cpu;
> > 
> >  	for_each_possible_cpu(cpu) {
> > -		int err = init_cpu_l1_cache_map(cpu);
> > +		int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L1);
> > 
> >  		if (err)
> >  			return err;
> > -- 
> > 1.9.4
> > 
> 
> -- 
> Thanks and Regards
> Srikar Dronamraju


* Re: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache
  2020-12-07 12:40   ` Srikar Dronamraju
@ 2020-12-08 17:42     ` Gautham R Shenoy
  2020-12-09  9:14       ` Srikar Dronamraju
  0 siblings, 1 reply; 16+ messages in thread
From: Gautham R Shenoy @ 2020-12-08 17:42 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Gautham R. Shenoy, Anton Blanchard, Vaidyanathan Srinivasan,
	Michael Ellerman, Michael Neuling, Nicholas Piggin, Nathan Lynch,
	Peter Zijlstra, Valentin Schneider, linuxppc-dev, linux-kernel

Hello Srikar,

On Mon, Dec 07, 2020 at 06:10:39PM +0530, Srikar Dronamraju wrote:
> * Gautham R. Shenoy <ego@linux.vnet.ibm.com> [2020-12-04 10:18:46]:
> 
> > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > 
> > On POWER systems, groups of threads within a core sharing the L2-cache
> > can be indicated by the "ibm,thread-groups" property array with the
> > identifier "2".
> > 
> > This patch adds support for detecting this, and when present, populates
> > the cpu_l2_cache_mask of every CPU with the core-siblings which share
> > L2 with that CPU, as specified by the "ibm,thread-groups" property
> > array.
> > 
> > On a platform with the following "ibm,thread-groups" configuration
> > 		 00000001 00000002 00000004 00000000
> > 		 00000002 00000004 00000006 00000001
> > 		 00000003 00000005 00000007 00000002
> > 		 00000002 00000004 00000000 00000002
> > 		 00000004 00000006 00000001 00000003
> > 		 00000005 00000007
> > 
> > Without this patch, the sched-domain hierarchy for CPUs 0,1 would be
> > 	CPU0 attaching sched-domain(s):
> > 	domain-0: span=0,2,4,6 level=SMT
> > 	domain-1: span=0-7 level=CACHE
> > 	domain-2: span=0-15,24-39,48-55 level=MC
> > 	domain-3: span=0-55 level=DIE
> > 
> > 	CPU1 attaching sched-domain(s):
> > 	domain-0: span=1,3,5,7 level=SMT
> > 	domain-1: span=0-7 level=CACHE
> > 	domain-2: span=0-15,24-39,48-55 level=MC
> > 	domain-3: span=0-55 level=DIE
> > 
> > The CACHE domain at 0-7 is incorrect since the ibm,thread-groups
> > sub-array
> > [00000002 00000002 00000004
> >  00000000 00000002 00000004 00000006
> >  00000001 00000003 00000005 00000007]
> > indicates that L2 (Property "2") is shared only between the threads of
> > a single group. There are "2" groups of threads, each containing "4"
> > threads. The groups are {0,2,4,6} and {1,3,5,7}.
> > 
> > With this patch, the sched-domain hierarchy for CPUs 0,1 would be
> >      	CPU0 attaching sched-domain(s):
> > 	domain-0: span=0,2,4,6 level=SMT
> > 	domain-1: span=0-15,24-39,48-55 level=MC
> > 	domain-2: span=0-55 level=DIE
> > 
> > 	CPU1 attaching sched-domain(s):
> > 	domain-0: span=1,3,5,7 level=SMT
> > 	domain-1: span=0-15,24-39,48-55 level=MC
> > 	domain-2: span=0-55 level=DIE
> > 
> > The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
> > resp.) gets degenerated into the SMT domain. Furthermore, the
> > last-level-cache domain gets correctly set to the SMT sched-domain.
> > 
> > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/kernel/smp.c | 66 +++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 58 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 6a242a3..a116d2d 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -76,6 +76,7 @@
> >  struct task_struct *secondary_current;
> >  bool has_big_cores;
> >  bool coregroup_enabled;
> > +bool thread_group_shares_l2;
> 
> Either keep this as static in this patch or add its declaration
>

This will be used in Patch 3 in kernel/cacheinfo.c, but not any other
place. Hence I am not making it static here.


> > 
> >  DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
> >  DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
> > @@ -99,6 +100,7 @@ enum {
> > 
> >  #define MAX_THREAD_LIST_SIZE	8
> >  #define THREAD_GROUP_SHARE_L1   1
> > +#define THREAD_GROUP_SHARE_L2   2
> >  struct thread_groups {
> >  	unsigned int property;
> >  	unsigned int nr_groups;
> > @@ -107,7 +109,7 @@ struct thread_groups {
> >  };
> > 
> >  /* Maximum number of properties that groups of threads within a core can share */
> > -#define MAX_THREAD_GROUP_PROPERTIES 1
> > +#define MAX_THREAD_GROUP_PROPERTIES 2
> > 
> >  struct thread_groups_list {
> >  	unsigned int nr_properties;
> > @@ -121,6 +123,13 @@ struct thread_groups_list {
> >   */
> >  DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map);
> > 
> > +/*
> > + * On some big-cores system, thread_group_l2_cache_map for each CPU
> > + * corresponds to the set its siblings within the core that share the
> > + * L2-cache.
> > + */
> > +DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
> > +
> 
> NIT:
> We are trying to confuse ourselves with the names.
> For L1 we have cpu_l2_cache_map to store the tasks from the thread group.
> but cpu_smallcore_map for keeping track of tasks.
>

I suppose you mean cpu_l1_cache_map here. We are using
cpu_smallcore_map because, when the "ibm,thread-groups" property is 1,
the group shares both the L1 cache and the instruction data flow (SMT).
We already have a cpu_smt_map; hence this was named cpu_smallcore_map a
couple of years ago when I wrote that patch.


> For L2 we have thread_group_l2_cache_map to store the tasks from the thread
> group.  but cpu_l2_cache_map for keeping track of tasks.

> 
> I think we should do some renaming to keep the names consistent.
> I would probably say we move the current cpu_l2_cache_map to
> cpu_llc_cache_map and the new thread_group_l2_cache_map to
> cpu_l2_cache_map, to be somewhat consistent.

Hmm.. cpu_llc_cache_map is still very generic. We want to have
something that defines l2 map.

I agree that we need to keep it consistent. How about renaming
cpu_l1_cache_map to thread_groups_l1_cache_map?

That way thread_groups_l1_cache_map and thread_groups_l2_cache_map
refer to the corresponding L1 and L2 siblings as discovered from
ibm,thread-groups property.

We have the cpu_smallcore_mask and the cpu_l2_cache_map unchanged as
it was before.


> 
> >  /* SMP operations for this machine */
> >  struct smp_ops_t *smp_ops;
> > 
> > @@ -840,7 +851,8 @@ static int init_cpu_cache_map(int cpu, unsigned int cache_property)
> >  	if (!dn)
> >  		return -ENODATA;
> > 
> > -	if (!(cache_property == THREAD_GROUP_SHARE_L1))
> > +	if (!(cache_property == THREAD_GROUP_SHARE_L1 ||
> > +	      cache_property == THREAD_GROUP_SHARE_L2))
> >  		return -EINVAL;
> > 
> >  	if (!cpu_tgl->nr_properties) {
> > @@ -867,7 +879,10 @@ static int init_cpu_cache_map(int cpu, unsigned int cache_property)
> >  		goto out;
> >  	}
> > 
> > -	mask = &per_cpu(cpu_l1_cache_map, cpu);
> > +	if (cache_property == THREAD_GROUP_SHARE_L1)
> > +		mask = &per_cpu(cpu_l1_cache_map, cpu);
> > +	else if (cache_property == THREAD_GROUP_SHARE_L2)
> > +		mask = &per_cpu(thread_group_l2_cache_map, cpu);
> > 
> >  	zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
> > 
> > @@ -973,6 +988,16 @@ static int init_big_cores(void)
> >  	}
> > 
> >  	has_big_cores = true;
> > +
> > +	for_each_possible_cpu(cpu) {
> > +		int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2);
> > +
> > +		if (err)
> > +			return err;
> > +	}
> > +
> > +	thread_group_shares_l2 = true;
> 
> Why do we need a separate loop? Why can't we merge this into the above
> loop itself?


No, there are platforms where only THREAD_GROUP_SHARE_L1 exists while
THREAD_GROUP_SHARE_L2 doesn't. It becomes easier if these are tracked
separately. Also, what do we gain if we put this in the same loop? It
will be (nr_possible_cpus * 2 * invocations of init_cpu_cache_map())
as opposed to 2 * (nr_possible_cpus * invocations of
init_cpu_cache_map()). Isn't it?




> 
> > +	pr_info("Thread-groups in a core share L2-cache\n");
> 
> Can this be moved to a pr_debug? Does it help any regular users/admins
> to know if thread-groups share the L2 cache? In fact it may confuse
> users about what thread groups are and which thread groups don't share
> a cache. I would prefer some other name than thread_group_shares_l2,
> but I don't know any better alternatives, and maybe my choices are even
> worse.

Would you be OK with "L2 cache shared by threads of the small core"?


> 
> >  	return 0;
> >  }
> > 
> > @@ -1287,6 +1312,31 @@ static bool update_mask_by_l2(int cpu, cpumask_var_t *mask)
> >  	if (has_big_cores)
> >  		submask_fn = cpu_smallcore_mask;
> > 
> > +
> 
> NIT: extra blank line?

Will remove this. 
> 
> > +	/*
> > +	 * If the threads in a thread-group share L2 cache, then
> > +	 * the L2-mask can be obtained from thread_group_l2_cache_map.
> > +	 */
> > +	if (thread_group_shares_l2) {
> > +		/* Siblings that share L1 is a subset of siblings that share L2.*/
> > +		or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);
> > +		if (*mask) {
> > +			cpumask_andnot(*mask,
> > +				       per_cpu(thread_group_l2_cache_map, cpu),
> > +				       cpu_l2_cache_mask(cpu));
> > +		} else {
> > +			mask = &per_cpu(thread_group_l2_cache_map, cpu);
> > +		}
> > +
> > +		for_each_cpu(i, *mask) {
> > +			if (!cpu_online(i))
> > +				continue;
> > +			set_cpus_related(i, cpu, cpu_l2_cache_mask);
> > +		}
> > +
> > +		return true;
> > +	}
> > +
> 
> Ah this can be simplified to:
> if (thread_group_shares_l2) {
> 	cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu));
> 
> 	for_each_cpu(i, per_cpu(thread_group_l2_cache_map, cpu)) {
> 		if (cpu_online(i))
> 			set_cpus_related(i, cpu, cpu_l2_cache_mask);
> 	}

Don't we want to enforce that the siblings sharing L1 be a subset of
the siblings sharing L2? Or do you recommend putting in a check for
that somewhere?


> }
> 
> No?
> 
> >  	l2_cache = cpu_to_l2cache(cpu);
> >  	if (!l2_cache || !*mask) {
> >  		/* Assume only core siblings share cache with this CPU */
> 
> -- 
> Thanks and Regards
> Srikar Dronamraju


* Re: [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
  2020-12-07 13:11   ` Srikar Dronamraju
@ 2020-12-08 17:56     ` Gautham R Shenoy
  2020-12-09  8:39       ` Srikar Dronamraju
  0 siblings, 1 reply; 16+ messages in thread
From: Gautham R Shenoy @ 2020-12-08 17:56 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Gautham R. Shenoy, Anton Blanchard, Vaidyanathan Srinivasan,
	Michael Ellerman, Michael Neuling, Nicholas Piggin, Nathan Lynch,
	Peter Zijlstra, Valentin Schneider, linuxppc-dev, linux-kernel

On Mon, Dec 07, 2020 at 06:41:38PM +0530, Srikar Dronamraju wrote:
> * Gautham R. Shenoy <ego@linux.vnet.ibm.com> [2020-12-04 10:18:47]:
> 
> > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > 
> > 
> > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > ---
> > 
> > +extern bool thread_group_shares_l2;
> >  /*
> >   * On big-core systems, each core has two groups of CPUs each of which
> >   * has its own L1-cache. The thread-siblings which share l1-cache with
> >   * @cpu can be obtained via cpu_smallcore_mask().
> > + *
> > + * On some big-core systems, the L2 cache is shared only between some
> > + * groups of siblings. This is already parsed and encoded in
> > + * cpu_l2_cache_mask().
> >   */
> >  static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache *cache)
> >  {
> >  	if (cache->level == 1)
> >  		return cpu_smallcore_mask(cpu);
> > +	if (cache->level == 2 && thread_group_shares_l2)
> > +		return cpu_l2_cache_mask(cpu);
> > 
> >  	return &cache->shared_cpu_map;
> 
> As pointed with lkp@intel.org, we need to do this only with #CONFIG_SMP,
> even for cache->level = 1 too.

Yes, I have fixed that in the next version.

> 
> I agree that we are displaying shared_cpu_map correctly. But should we
> also update/clear shared_cpu_map in the first place? For example: for a
> P9 core with CPUs 0-7, the cache->shared_cpu_map for L1 would have 0-7
> but would display 0,2,4,6.
> 
> The drawback of this is that even if CPUs 0,2,4,6 are released, the L1
> cache will not be released. Is this as expected?

cacheinfo populates the cache->shared_cpu_map on the basis of which
CPUs share the common device-tree node for a particular cache. There
is one l1-cache object in the device-tree for a CPU node corresponding
to a big-core. That the L1 is further split between the threads of the
core is shown using ibm,thread-groups.

The ideal thing would be to add a "group_leader" field to "struct
cache" so that we can create separate cache objects, one per thread
group. I will take a stab at this in v2.

Thanks for the review comments.



> 
> 
> -- 
> Thanks and Regards
> Srikar Dronamraju


* Re: [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties
  2020-12-08 17:25     ` Gautham R Shenoy
@ 2020-12-09  3:59       ` Michael Ellerman
  2020-12-09  8:35       ` Srikar Dronamraju
  1 sibling, 0 replies; 16+ messages in thread
From: Michael Ellerman @ 2020-12-09  3:59 UTC (permalink / raw)
  To: Gautham R Shenoy, Srikar Dronamraju
  Cc: Gautham R. Shenoy, Anton Blanchard, Vaidyanathan Srinivasan,
	Michael Neuling, Nicholas Piggin, Nathan Lynch, Peter Zijlstra,
	Valentin Schneider, linuxppc-dev, linux-kernel

Gautham R Shenoy <ego@linux.vnet.ibm.com> writes:
> Hello Srikar,
>
> Thanks for taking a look at the patch.
>
> On Mon, Dec 07, 2020 at 05:40:42PM +0530, Srikar Dronamraju wrote:
>> * Gautham R. Shenoy <ego@linux.vnet.ibm.com> [2020-12-04 10:18:45]:
>> 
>> > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>> 
>> <snipped>
>> 
>> > 
>> >  static int parse_thread_groups(struct device_node *dn,
>> > -			       struct thread_groups *tg,
>> > -			       unsigned int property)
>> > +			       struct thread_groups_list *tglp)
>> >  {
>> > -	int i;
>> > -	u32 thread_group_array[3 + MAX_THREAD_LIST_SIZE];
>> > +	int i = 0;
>> > +	u32 *thread_group_array;
>> >  	u32 *thread_list;
>> >  	size_t total_threads;
>> > -	int ret;
>> > +	int ret = 0, count;
>> > +	unsigned int property_idx = 0;
>> 
>> NIT:
> tglx mentions in one of his recent comments to try to keep a reverse fir tree
> ordering of variables where possible.
>
> I suppose you mean moving the longer local variable declarations to
> the top and shorter ones to the bottom. Thanks. Will fix this.

Yeah. It's called "reverse christmas tree", that's googleable.

I also prefer that style, it makes the locals visually sit with the
beginning of the function body.

cheers


* Re: [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties
  2020-12-08 17:25     ` Gautham R Shenoy
  2020-12-09  3:59       ` Michael Ellerman
@ 2020-12-09  8:35       ` Srikar Dronamraju
  2020-12-09  9:05         ` Gautham R Shenoy
  1 sibling, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2020-12-09  8:35 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Anton Blanchard, Vaidyanathan Srinivasan, Michael Ellerman,
	Michael Neuling, Nicholas Piggin, Nathan Lynch, Peter Zijlstra,
	Valentin Schneider, linuxppc-dev, linux-kernel

* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-12-08 22:55:40]:

> > 
> > NIT:
> > tglx mentions in one of his recent comments to try to keep a reverse fir tree
> > ordering of variables where possible.
> 
> I suppose you mean moving the longer local variable declarations to
> the top and shorter ones to the bottom. Thanks. Will fix this.
> 

Yes.

> > > +	}
> > > +
> > > +	if (!tg)
> > > +		return -EINVAL;
> > > +
> > > +	cpu_group_start = get_cpu_thread_group_start(cpu, tg);
> > 
> > This whole hunk should be moved to a new function and called before
> > init_cpu_cache_map. It will simplify the logic to a great extent.
> 
> I suppose you are referring to the part where we select the correct
> tg. Yeah, that can move to a different helper.
> 

Yes, I would prefer if we could call this new helper outside
init_cpu_cache_map.

> > > 
> > > -	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > > -				GFP_KERNEL, cpu_to_node(cpu));
> > > +	mask = &per_cpu(cpu_l1_cache_map, cpu);
> > > +
> > > +	zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
> > > 
> > 
> > This hunk (and the next hunk) should be moved to next patch.
> >
> 
The next patch is only about introducing THREAD_GROUP_SHARE_L2. Hence
I put the other code in this patch, since it seemed to be a logical
place to collate whatever we have in a generic form.
> 

While I am fine with it, having a pointer that always points to the same
mask looks weird.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
  2020-12-08 17:56     ` Gautham R Shenoy
@ 2020-12-09  8:39       ` Srikar Dronamraju
  2020-12-09  9:07         ` Gautham R Shenoy
  0 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2020-12-09  8:39 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Anton Blanchard, Vaidyanathan Srinivasan, Michael Ellerman,
	Michael Neuling, Nicholas Piggin, Nathan Lynch, Peter Zijlstra,
	Valentin Schneider, linuxppc-dev, linux-kernel

* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-12-08 23:26:47]:

> > The drawback of this is that even if CPUs 0,2,4,6 are released, the L1
> > cache will not be released. Is this as expected?
> 
> cacheinfo populates the cache->shared_cpu_map on the basis of which
> CPUs share the common device-tree node for a particular cache.  There
> is one l1-cache object in the device-tree for a CPU node corresponding
> to a big-core. That the L1 is further split between the threads of the
> core is shown using ibm,thread-groups.
> 

Yes.

> The ideal thing would be to add a "group_leader" field to "struct
> cache" so that we can create separate cache objects, one per thread
> group. I will take a stab at this in the v2.
> 

I am not saying this needs to be done immediately. We could add a TODO and
get it done later. Your patch is not making it worse. It's just that there is
still something more left to be done.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties
  2020-12-09  8:35       ` Srikar Dronamraju
@ 2020-12-09  9:05         ` Gautham R Shenoy
  0 siblings, 0 replies; 16+ messages in thread
From: Gautham R Shenoy @ 2020-12-09  9:05 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Gautham R Shenoy, Anton Blanchard, Vaidyanathan Srinivasan,
	Michael Ellerman, Michael Neuling, Nicholas Piggin, Nathan Lynch,
	Peter Zijlstra, Valentin Schneider, linuxppc-dev, linux-kernel

On Wed, Dec 09, 2020 at 02:05:41PM +0530, Srikar Dronamraju wrote:
> * Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-12-08 22:55:40]:
> 
> > > 
> > > NIT:
> > > tglx mentions in one of his recent comments to try keep a reverse fir tree
> > > ordering of variables where possible.
> > 
> > I suppose you mean moving the longer local variable declarations to to
> > the top and shorter ones to the bottom. Thanks. Will fix this.
> > 
> 
> Yes.
> 
> > > > +	}
> > > > +
> > > > +	if (!tg)
> > > > +		return -EINVAL;
> > > > +
> > > > +	cpu_group_start = get_cpu_thread_group_start(cpu, tg);
> > > 
> > > This whole hunk should be moved to a new function and called before
> > > init_cpu_cache_map. It will simplify the logic to a great extent.
> > 
> > I suppose you are referring to the part where we select the correct
> > tg. Yeah, that can move to a different helper.
> > 
> 
> Yes, I would prefer if we could call this new helper outside
> init_cpu_cache_map.
> 
> > > > 
> > > > -	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > > > -				GFP_KERNEL, cpu_to_node(cpu));
> > > > +	mask = &per_cpu(cpu_l1_cache_map, cpu);
> > > > +
> > > > +	zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
> > > > 
> > > 
> > > This hunk (and the next hunk) should be moved to next patch.
> > >
> > 
> > The next patch is only about introducing THREAD_GROUP_SHARE_L2. Hence
> > I put the other code in this patch, since it seemed to be a logical
> > place to collate whatever we have in a generic form.
> > 
> 
> While I am fine with it, having a pointer that always points to the same
> mask looks weird.

Sure. Moving some of this to a separate preparatory patch.

> 
> -- 
> Thanks and Regards
> Srikar Dronamraju


* Re: [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
  2020-12-09  8:39       ` Srikar Dronamraju
@ 2020-12-09  9:07         ` Gautham R Shenoy
  0 siblings, 0 replies; 16+ messages in thread
From: Gautham R Shenoy @ 2020-12-09  9:07 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Gautham R Shenoy, Anton Blanchard, Vaidyanathan Srinivasan,
	Michael Ellerman, Michael Neuling, Nicholas Piggin, Nathan Lynch,
	Peter Zijlstra, Valentin Schneider, linuxppc-dev, linux-kernel

On Wed, Dec 09, 2020 at 02:09:21PM +0530, Srikar Dronamraju wrote:
> * Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-12-08 23:26:47]:
> 
> > > The drawback of this is that even if CPUs 0,2,4,6 are released, the L1
> > > cache will not be released. Is this as expected?
> > 
> > cacheinfo populates the cache->shared_cpu_map on the basis of which
> > CPUs share the common device-tree node for a particular cache.  There
> > is one l1-cache object in the device-tree for a CPU node corresponding
> > to a big-core. That the L1 is further split between the threads of the
> > core is shown using ibm,thread-groups.
> > 
> 
> Yes.
> 
> > The ideal thing would be to add a "group_leader" field to "struct
> > cache" so that we can create separate cache objects, one per thread
> > group. I will take a stab at this in the v2.
> > 
> 
> I am not saying this needs to be done immediately. We could add a TODO and
> get it done later. Your patch is not making it worse. It's just that there is
> still something more left to be done.

Yeah, it needs to be fixed but it may not be a 5.11 target. For now I
will fix this patch to take care of the build errors on !PPC64 !SMT
configs. I will post a separate series for making cacheinfo.c aware of
thread-groups at the time of construction of the cache-chain.

> 
> -- 
> Thanks and Regards
> Srikar Dronamraju


* Re: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache
  2020-12-08 17:42     ` Gautham R Shenoy
@ 2020-12-09  9:14       ` Srikar Dronamraju
  0 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2020-12-09  9:14 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Anton Blanchard, Vaidyanathan Srinivasan, Michael Ellerman,
	Michael Neuling, Nicholas Piggin, Nathan Lynch, Peter Zijlstra,
	Valentin Schneider, linuxppc-dev, linux-kernel

* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-12-08 23:12:37]:

> 
> > For L2 we have thread_group_l2_cache_map to store the tasks from the thread
> > group, but cpu_l2_cache_map for keeping track of tasks.
> 
> > 
> > I think we should do some renaming to keep the names consistent.
> > I would say probably say move the current cpu_l2_cache_map to
> > cpu_llc_cache_map and move the new aka  thread_group_l2_cache_map as
> > cpu_l2_cache_map to be somewhat consistent.
> 
> Hmm.. cpu_llc_cache_map is still very generic. We want to have
> something that denotes the L2 map.
> 
> I agree that we need to keep it consistent. How about renaming
> cpu_l1_cache_map to thread_groups_l1_cache_map?
> 
> That way thread_groups_l1_cache_map and thread_groups_l2_cache_map
> refer to the corresponding L1 and L2 siblings as discovered from
> ibm,thread-groups property.

I am fine with this.

> > > +
> > > +	for_each_possible_cpu(cpu) {
> > > +		int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2);
> > > +
> > > +		if (err)
> > > +			return err;
> > > +	}
> > > +
> > > +	thread_group_shares_l2 = true;
> > 
> > Why do we need a separate loop. Why cant we merge this in the above loop
> > itself?
> 
> No, there are platforms where THREAD_GROUP_SHARE_L1 exists while
> THREAD_GROUP_SHARE_L2 doesn't. It becomes easier if these are
> separately tracked. Also, what do we gain if we put this in the same
> loop? It will be (nr_possible_cpus * 2 * invocations of
> init_cpu_cache_map()) as opposed to 2 * (nr_possible_cpus *
> invocations of init_cpu_cache_map()). Isn't it?
> 
It's not about the number of invocations but the per-cpu thread group list
that would not need to be loaded again. Currently it would probably be in the
cache line, but gets dropped and has to be loaded again in the next loop.
And we can still support platforms with only THREAD_GROUP_SHARE_L1, since
parse_thread_groups would have told us how many levels of thread groups are
supported on a platform.

> > 
> > > +	pr_info("Thread-groups in a core share L2-cache\n");
> > 
> > Can this be moved to a pr_debug? Does it help regular users/admins to
> > know if thread-groups share the L2 cache? In fact it may confuse users on what
> > thread groups are and which thread groups don't share cache.
> > I would prefer some other name than thread_group_shares_l2 but don't know any
> > better alternatives, and maybe my choices are even worse.
> 
> Would you be OK with "L2 cache shared by threads of the small core"?

Sounds better to me. I would still think pr_debug is better, since regular
admins/users may not glean much information from this.
> 
> > 
> > Ah this can be simplified to:
> > if (thread_group_shares_l2) {
> > 	cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu));
> > 
> > 	for_each_cpu(i, per_cpu(thread_group_l2_cache_map, cpu)) {
> > 		if (cpu_online(i))
> > 			set_cpus_related(i, cpu, cpu_l2_cache_mask);
> > 	}
> 
> Don't we want to enforce that the siblings sharing L1 be a subset of
> the siblings sharing L2? Or do you recommend putting in a check for
> that somewhere?
> 
I didn't think about the case where the device-tree could show L2 to be a
subset of L1.

How about initializing thread_group_l2_cache_map itself with
cpu_l1_cache_map? It would be a simple one-time operation and reduce the
overhead here on every CPU online.
And it would help in your subsequent patch too. We don't want the cacheinfo
for L1 showing CPUs not present in L2.

-- 
Thanks and Regards
Srikar Dronamraju


end of thread, other threads:[~2020-12-09  9:17 UTC | newest]

Thread overview: 16+ messages
2020-12-04  4:48 [PATCH 0/3] Extend Parsing "ibm,thread-groups" for Shared-L2 information Gautham R. Shenoy
2020-12-04  4:48 ` [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties Gautham R. Shenoy
2020-12-07 12:10   ` Srikar Dronamraju
2020-12-08 17:25     ` Gautham R Shenoy
2020-12-09  3:59       ` Michael Ellerman
2020-12-09  8:35       ` Srikar Dronamraju
2020-12-09  9:05         ` Gautham R Shenoy
2020-12-04  4:48 ` [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache Gautham R. Shenoy
2020-12-07 12:40   ` Srikar Dronamraju
2020-12-08 17:42     ` Gautham R Shenoy
2020-12-09  9:14       ` Srikar Dronamraju
2020-12-04  4:48 ` [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for " Gautham R. Shenoy
2020-12-07 13:11   ` Srikar Dronamraju
2020-12-08 17:56     ` Gautham R Shenoy
2020-12-09  8:39       ` Srikar Dronamraju
2020-12-09  9:07         ` Gautham R Shenoy
