linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] drivers: base: cacheinfo: Fix shared_cpu_list inconsistency in event of CPU hotplug
@ 2023-05-08  8:41 K Prateek Nayak
  2023-05-08  8:41 ` [PATCH 1/2] drivers: base: cacheinfo: Fix shared_cpu_map changes " K Prateek Nayak
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: K Prateek Nayak @ 2023-05-08  8:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: gregkh, rafael, sudeep.holla, yongxuan.wang, pierre.gondois,
	vincent.chen, greentime.hu, yangyicong, prime.zeng, palmer,
	puwen

Since v6.3-rc1, the shared_cpu_list in per-cpu cacheinfo breaks in case
of hotplug activity on x86. This can be tracked back to two commits:

o commit 198102c9103f ("cacheinfo: Fix shared_cpu_map to handle shared
  caches at different levels") that matches cache instance IDs without
  considering if the instance IDs belong to same cache level or not.

o commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported through
  sysfs") which skips calling populate_cache_leaves() if
  last_level_cache_is_valid(cpu) returns true. populate_cache_leaves()
  on x86 would have populated the shared_cpu_map when CPU comes online,
  which is now skipped, and the alternate path has an early bailout
  before setting the CPU in the shared_cpu_map is even attempted.

  On x86, populate_cache_leaves() also sets the
  cpu_cacheinfo->cpu_map_populated flag when the cacheinfo is first
  populated, the cache_shared_cpu_map_setup() in the driver is bypassed
  when a thread comes back online during the hotplug activity. This leads
  to the shared_cpu_list displaying abnormal values for the CPU that was
  offlined and then onlined since the shared_cpu_maps are never
  revaluated.

Following is the output from a dual socket 3rd Generation AMD EPYC
processor (2 x 64C/128T) for cachinfo when offlining and then onlining
CPU8:

o v6.3-rc5 with no changes:

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143

  # echo 0 > /sys/devices/system/cpu/cpu8/online
  # echo 1 > /sys/devices/system/cpu/cpu8/online

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8

  # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
    136

  # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
    9-15,136-143

o v6.3-rc5 with commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported
  through sysfs") reverted (Behavior consistent with v6.2):

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143

  # echo 0 > /sys/devices/system/cpu/cpu8/online
  # echo 1 > /sys/devices/system/cpu/cpu8/online

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143

  # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
    8,136

  # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
    8-15,136-143

This is not only limited to AMD processors but affects Intel processors
too. Following is the output from same experiment on a dual socket Intel
Ice Lake server (2 x 32C/64T) running kernel v6.3-rc5:

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,72
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,72
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,72
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,
    26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,
    88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126

  # echo 0 > /sys/devices/system/cpu/cpu8/online
  # echo 1 > /sys/devices/system/cpu/cpu8/online

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8

  # cat /sys/devices/system/cpu/cpu72/cache/index0/shared_cpu_list
    72

  # cat /sys/devices/system/cpu/cpu72/cache/index3/shared_cpu_list
    0,2,4,6,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,
    66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,
    120,122,124,126

This patch addresses two issues associated with building
shared_cpu_list:

o Patch 1 fixes an ID matching issue that can lead to cacheinfo
  associating CPUs from different cache levels in case IDs are not
  unique across all the different cache levels. 

o Patch 2 clears the cpu_cacheinfo->cpu_map_populated flag when CPU goes
  offline and is removed from the shared_cpu_map.

Following are the results after applying the series on v6.3-rc5 on
respective x86 platforms:

o 3rd Generation AMD EPYC Processor (2 x 64C/128T)

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
  
  # echo 0 > /sys/devices/system/cpu/cpu8/online
  # echo 1 > /sys/devices/system/cpu/cpu8/online
  
  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
  
  # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
    8,136
  
  # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
    8-15,136-143

o Intel Ice Lake Xeon (2 x 32C/128T)

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,72
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,72
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,72
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,
    28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,
    92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
  
  # echo 0 > /sys/devices/system/cpu/cpu8/online
  # echo 1 > /sys/devices/system/cpu/cpu8/online
  
  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,72
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,72
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,72
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,
    28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,
    92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
  
  # cat /sys/devices/system/cpu/cpu72/cache/index0/shared_cpu_list
    8,72
  
  # cat /sys/devices/system/cpu/cpu72/cache/index3/shared_cpu_list
    0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,
    68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,
    124,126

Running "grep -r 'cpu_map_populated' arch/" shows MIPS and loongarch too
set the cpu_cacheinfo->cpu_map_populated who might also be affected by
the changes in commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported
through sysfs") and this series. Changes from Patch 1 might also affect
RISC-V since Yong-Xuan Wang <yongxuan.wang@sifive.com> from SiFive last
made changes to cache_shared_cpu_map_setup() and
cache_shared_cpu_map_remove() in commit 198102c9103f ("cacheinfo: Fix
shared_cpu_map to handle shared caches at different levels").
---
K Prateek Nayak (2):
  drivers: base: cacheinfo: Fix shared_cpu_map changes in event of CPU
    hotplug
  drivers: base: cacheinfo: Clear cpu_map_populated when CPU goes
    offline

 drivers/base/cacheinfo.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/2] drivers: base: cacheinfo: Fix shared_cpu_map changes in event of CPU hotplug
  2023-05-08  8:41 [PATCH 0/2] drivers: base: cacheinfo: Fix shared_cpu_list inconsistency in event of CPU hotplug K Prateek Nayak
@ 2023-05-08  8:41 ` K Prateek Nayak
  2023-05-08  8:41 ` [PATCH 2/2] drivers: base: cacheinfo: Update cpu_map_populated during CPU Hotplug K Prateek Nayak
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: K Prateek Nayak @ 2023-05-08  8:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: gregkh, rafael, sudeep.holla, yongxuan.wang, pierre.gondois,
	vincent.chen, greentime.hu, yangyicong, prime.zeng, palmer,
	puwen

While building the shared_cpu_map, check if the cache level and cache
type matches. On certain systems that build the cache topology based on
the instance ID, there are cases where the same ID may repeat across
multiple cache levels, leading inaccurate topology.

In event of CPU offlining, the cache_shared_cpu_map_remove() does not
consider if IDs at same level are being compared. As a result, when same
IDs repeat across different cache levels, the CPU going offline is not
removed from all the shared_cpu_map.

Below is the output of cache topology of CPU8 and it's SMT sibling after
CPU8 is offlined on a dual socket 3rd Generation AMD EPYC processor
(2 x 64C/128T) running kernel release v6.3:

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143

  # echo 0 > /sys/devices/system/cpu/cpu8/online

  # for i in /sys/devices/system/cpu/cpu136/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list: 136
    /sys/devices/system/cpu/cpu136/cache/index1/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu136/cache/index2/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list: 9-15,136-143

CPU8 is removed from index0 (L1i) but remains in the shared_cpu_list of
index1 (L1d) and index2 (L2). Since L1i, L1d, and L2 are shared by the
SMT siblings, and they have the same cache instance ID, CPU 2 is only
removed from the first index with matching ID which is index1 (L1i) in
this case. With this fix, the results are as expected when performing
the same experiment on the same system:

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143

  # echo 0 > /sys/devices/system/cpu/cpu8/online

  # for i in /sys/devices/system/cpu/cpu136/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list: 136
    /sys/devices/system/cpu/cpu136/cache/index1/shared_cpu_list: 136
    /sys/devices/system/cpu/cpu136/cache/index2/shared_cpu_list: 136
    /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list: 9-15,136-143

When rebuilding topology, the same problem appears as
cache_shared_cpu_map_setup() implements a similar logic. Consider the
same 3rd Generation EPYC processor: CPUs in Core 1, that share the L1
and L2 caches, have L1 and L2 instance ID as 1. For all the CPUs on
the second chiplet, the L3 ID is also 1 leading to grouping on CPUs from
Core 1 (1, 17) and the entire second chiplet (8-15, 24-31) as CPUs
sharing one cache domain. This went undetected since x86 processors
depended on arch specific populate_cache_leaves() method to repopulate
the shared_cpus_map when CPU came back online until kernel release
v6.3-rc5.

Fixes: 198102c9103f ("cacheinfo: Fix shared_cpu_map to handle shared caches at different levels")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 drivers/base/cacheinfo.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index bba3482ddeb8..d1ae443fd7a0 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -388,6 +388,16 @@ static int cache_shared_cpu_map_setup(unsigned int cpu)
 				continue;/* skip if itself or no cacheinfo */
 			for (sib_index = 0; sib_index < cache_leaves(i); sib_index++) {
 				sib_leaf = per_cpu_cacheinfo_idx(i, sib_index);
+
+				/*
+				 * Comparing cache IDs only makes sense if the leaves
+				 * belong to the same cache level of same type. Skip
+				 * the check if level and type do not match.
+				 */
+				if (sib_leaf->level != this_leaf->level ||
+				    sib_leaf->type != this_leaf->type)
+					continue;
+
 				if (cache_leaves_are_shared(this_leaf, sib_leaf)) {
 					cpumask_set_cpu(cpu, &sib_leaf->shared_cpu_map);
 					cpumask_set_cpu(i, &this_leaf->shared_cpu_map);
@@ -419,6 +429,16 @@ static void cache_shared_cpu_map_remove(unsigned int cpu)
 
 			for (sib_index = 0; sib_index < cache_leaves(sibling); sib_index++) {
 				sib_leaf = per_cpu_cacheinfo_idx(sibling, sib_index);
+
+				/*
+				 * Comparing cache IDs only makes sense if the leaves
+				 * belong to the same cache level of same type. Skip
+				 * the check if level and type do not match.
+				 */
+				if (sib_leaf->level != this_leaf->level ||
+				    sib_leaf->type != this_leaf->type)
+					continue;
+
 				if (cache_leaves_are_shared(this_leaf, sib_leaf)) {
 					cpumask_clear_cpu(cpu, &sib_leaf->shared_cpu_map);
 					cpumask_clear_cpu(sibling, &this_leaf->shared_cpu_map);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/2] drivers: base: cacheinfo: Update cpu_map_populated during CPU Hotplug
  2023-05-08  8:41 [PATCH 0/2] drivers: base: cacheinfo: Fix shared_cpu_list inconsistency in event of CPU hotplug K Prateek Nayak
  2023-05-08  8:41 ` [PATCH 1/2] drivers: base: cacheinfo: Fix shared_cpu_map changes " K Prateek Nayak
@ 2023-05-08  8:41 ` K Prateek Nayak
  2023-05-20  6:56   ` Yicong Yang
  2023-05-09 15:01 ` [PATCH 0/2] drivers: base: cacheinfo: Fix shared_cpu_list inconsistency in event of CPU hotplug Sudeep Holla
  2023-05-18  1:02 ` Ricardo Neri
  3 siblings, 1 reply; 9+ messages in thread
From: K Prateek Nayak @ 2023-05-08  8:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: gregkh, rafael, sudeep.holla, yongxuan.wang, pierre.gondois,
	vincent.chen, greentime.hu, yangyicong, prime.zeng, palmer,
	puwen

Until commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported through
sysfs"), cacheinfo called populate_cache_leaves() for CPU coming online
which let the arch specific functions handle (at least on x86)
populating the shared_cpu_map. However, with the changes in the
aforementioned commit, populate_cache_leaves() is not called when a CPU
comes online as a result of hotplug since last_level_cache_is_valid()
returns true as the cacheinfo data is not discarded. The CPU coming
online is not present in shared_cpu_map, however, it will not be added
since the cpu_cacheinfo->cpu_map_populated flag is set (it is set in
populate_cache_leaves() when cacheinfo is first populated for x86)

This can lead to inconsistencies in the shared_cpu_map when an offlined
CPU comes online again. Example below depicts the inconsistency in the
shared_cpu_list in cacheinfo when CPU8 is offlined and onlined again on
a 3rd Generation EPYC processor:

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143

  # echo 0 > /sys/devices/system/cpu/cpu8/online
  # echo 1 > /sys/devices/system/cpu/cpu8/online

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8

  # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
    136

  # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
    9-15,136-143

Clear the flag when the CPU is removed from shared_cpu_map when
cache_shared_cpu_map_remove() is called during CPU hotplug. This will
allow cache_shared_cpu_map_setup() to add the CPU coming back online in
the shared_cpu_map. Set the flag again when the shared_cpu_map is setup.
Following are results of performing the same test as described above with
the changes:

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143

  # echo 0 > /sys/devices/system/cpu/cpu8/online
  # echo 1 > /sys/devices/system/cpu/cpu8/online

  # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
    /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
    /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143

  # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
    8,136

  # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
    8-15,136-143

Fixes: 5c2712387d48 ("cacheinfo: Fix LLC is not exported through sysfs")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 drivers/base/cacheinfo.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index d1ae443fd7a0..cbae8be1fe52 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -410,11 +410,14 @@ static int cache_shared_cpu_map_setup(unsigned int cpu)
 			coherency_max_size = this_leaf->coherency_line_size;
 	}
 
+	/* shared_cpu_map is now populated for the cpu */
+	this_cpu_ci->cpu_map_populated = true;
 	return 0;
 }
 
 static void cache_shared_cpu_map_remove(unsigned int cpu)
 {
+	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
 	struct cacheinfo *this_leaf, *sib_leaf;
 	unsigned int sibling, index, sib_index;
 
@@ -447,6 +450,9 @@ static void cache_shared_cpu_map_remove(unsigned int cpu)
 			}
 		}
 	}
+
+	/* cpu is no longer populated in the shared map */
+	this_cpu_ci->cpu_map_populated = false;
 }
 
 static void free_cache_attributes(unsigned int cpu)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/2] drivers: base: cacheinfo: Fix shared_cpu_list inconsistency in event of CPU hotplug
  2023-05-08  8:41 [PATCH 0/2] drivers: base: cacheinfo: Fix shared_cpu_list inconsistency in event of CPU hotplug K Prateek Nayak
  2023-05-08  8:41 ` [PATCH 1/2] drivers: base: cacheinfo: Fix shared_cpu_map changes " K Prateek Nayak
  2023-05-08  8:41 ` [PATCH 2/2] drivers: base: cacheinfo: Update cpu_map_populated during CPU Hotplug K Prateek Nayak
@ 2023-05-09 15:01 ` Sudeep Holla
  2023-05-09 15:10   ` K Prateek Nayak
  2023-05-18  1:02 ` Ricardo Neri
  3 siblings, 1 reply; 9+ messages in thread
From: Sudeep Holla @ 2023-05-09 15:01 UTC (permalink / raw)
  To: K Prateek Nayak, gregkh
  Cc: linux-kernel, Sudeep Holla, rafael, yongxuan.wang,
	pierre.gondois, vincent.chen, greentime.hu, yangyicong,
	prime.zeng, palmer, puwen

On Mon, May 08, 2023 at 02:11:13PM +0530, K Prateek Nayak wrote:
> Since v6.3-rc1, the shared_cpu_list in per-cpu cacheinfo breaks in case
> of hotplug activity on x86. This can be tracked back to two commits:
> 
> o commit 198102c9103f ("cacheinfo: Fix shared_cpu_map to handle shared
>   caches at different levels") that matches cache instance IDs without
>   considering if the instance IDs belong to same cache level or not.
> 
> o commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported through
>   sysfs") which skips calling populate_cache_leaves() if
>   last_level_cache_is_valid(cpu) returns true. populate_cache_leaves()
>   on x86 would have populated the shared_cpu_map when CPU comes online,
>   which is now skipped, and the alternate path has an early bailout
>   before setting the CPU in the shared_cpu_map is even attempted.
> 
>   On x86, populate_cache_leaves() also sets the
>   cpu_cacheinfo->cpu_map_populated flag when the cacheinfo is first
>   populated, the cache_shared_cpu_map_setup() in the driver is bypassed
>   when a thread comes back online during the hotplug activity. This leads
>   to the shared_cpu_list displaying abnormal values for the CPU that was
>   offlined and then onlined since the shared_cpu_maps are never
>   revaluated.
> 
> Following is the output from a dual socket 3rd Generation AMD EPYC
> processor (2 x 64C/128T) for cachinfo when offlining and then onlining
> CPU8:
> 
> o v6.3-rc5 with no changes:
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
> 
>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>   # echo 1 > /sys/devices/system/cpu/cpu8/online
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8
> 
>   # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>     136
> 
>   # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>     9-15,136-143
> 
> o v6.3-rc5 with commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported
>   through sysfs") reverted (Behavior consistent with v6.2):
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
> 
>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>   # echo 1 > /sys/devices/system/cpu/cpu8/online
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
> 
>   # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>     8,136
> 
>   # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>     8-15,136-143
> 
> This is not only limited to AMD processors but affects Intel processors
> too. Following is the output from same experiment on a dual socket Intel
> Ice Lake server (2 x 32C/64T) running kernel v6.3-rc5:
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,
>     26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,
>     88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
> 
>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>   # echo 1 > /sys/devices/system/cpu/cpu8/online
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8
> 
>   # cat /sys/devices/system/cpu/cpu72/cache/index0/shared_cpu_list
>     72
> 
>   # cat /sys/devices/system/cpu/cpu72/cache/index3/shared_cpu_list
>     0,2,4,6,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,
>     66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,
>     120,122,124,126
> 
> This patch addresses two issues associated with building
> shared_cpu_list:
> 
> o Patch 1 fixes an ID matching issue that can lead to cacheinfo
>   associating CPUs from different cache levels in case IDs are not
>   unique across all the different cache levels. 
> 
> o Patch 2 clears the cpu_cacheinfo->cpu_map_populated flag when CPU goes
>   offline and is removed from the shared_cpu_map.
> 
> Following are the results after applying the series on v6.3-rc5 on
> respective x86 platforms:
> 
> o 3rd Generation AMD EPYC Processor (2 x 64C/128T)
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>   
>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>   # echo 1 > /sys/devices/system/cpu/cpu8/online
>   
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>   
>   # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>     8,136
>   
>   # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>     8-15,136-143
> 
> o Intel Ice Lake Xeon (2 x 32C/128T)
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,
>     28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,
>     92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
>   
>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>   # echo 1 > /sys/devices/system/cpu/cpu8/online
>   
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,
>     28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,
>     92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
>   
>   # cat /sys/devices/system/cpu/cpu72/cache/index0/shared_cpu_list
>     8,72
>   
>   # cat /sys/devices/system/cpu/cpu72/cache/index3/shared_cpu_list
>     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,
>     68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,
>     124,126
> 
> Running "grep -r 'cpu_map_populated' arch/" shows MIPS and loongarch too
> set the cpu_cacheinfo->cpu_map_populated who might also be affected by
> the changes in commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported
> through sysfs") and this series. Changes from Patch 1 might also affect
> RISC-V since Yong-Xuan Wang <yongxuan.wang@sifive.com> from SiFive last
> made changes to cache_shared_cpu_map_setup() and
> cache_shared_cpu_map_remove() in commit 198102c9103f ("cacheinfo: Fix
> shared_cpu_map to handle shared caches at different levels").

I think they may be affected as well, it is just that it is not caught
in the testing.

Thanks for the detailed explanation and output logs. Not sure how much
time it took you to write but saved lot of time by making it so simple
to understand the exact issue. The changes look good.

Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>

Hi Greg,

Can you please pick up these fixes in your next cycle of fixes for v6.4 ?

-- 
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/2] drivers: base: cacheinfo: Fix shared_cpu_list inconsistency in event of CPU hotplug
  2023-05-09 15:01 ` [PATCH 0/2] drivers: base: cacheinfo: Fix shared_cpu_list inconsistency in event of CPU hotplug Sudeep Holla
@ 2023-05-09 15:10   ` K Prateek Nayak
  0 siblings, 0 replies; 9+ messages in thread
From: K Prateek Nayak @ 2023-05-09 15:10 UTC (permalink / raw)
  To: Sudeep Holla, gregkh
  Cc: linux-kernel, rafael, yongxuan.wang, pierre.gondois,
	vincent.chen, greentime.hu, yangyicong, prime.zeng, palmer,
	puwen

Hello Sudeep,

On 5/9/2023 8:31 PM, Sudeep Holla wrote:
> On Mon, May 08, 2023 at 02:11:13PM +0530, K Prateek Nayak wrote:
>> Since v6.3-rc1, the shared_cpu_list in per-cpu cacheinfo breaks in case
>> of hotplug activity on x86. This can be tracked back to two commits:
>>
>> o commit 198102c9103f ("cacheinfo: Fix shared_cpu_map to handle shared
>>   caches at different levels") that matches cache instance IDs without
>>   considering if the instance IDs belong to same cache level or not.
>>
>> o commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported through
>>   sysfs") which skips calling populate_cache_leaves() if
>>   last_level_cache_is_valid(cpu) returns true. populate_cache_leaves()
>>   on x86 would have populated the shared_cpu_map when CPU comes online,
>>   which is now skipped, and the alternate path has an early bailout
>>   before setting the CPU in the shared_cpu_map is even attempted.
>>
>>   On x86, populate_cache_leaves() also sets the
>>   cpu_cacheinfo->cpu_map_populated flag when the cacheinfo is first
>>   populated, the cache_shared_cpu_map_setup() in the driver is bypassed
>>   when a thread comes back online during the hotplug activity. This leads
>>   to the shared_cpu_list displaying abnormal values for the CPU that was
>>   offlined and then onlined since the shared_cpu_maps are never
>>   revaluated.
>>
>> Following is the output from a dual socket 3rd Generation AMD EPYC
>> processor (2 x 64C/128T) for cachinfo when offlining and then onlining
>> CPU8:
>>
>> o v6.3-rc5 with no changes:
>>
>>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>>
>>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>>   # echo 1 > /sys/devices/system/cpu/cpu8/online
>>
>>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8
>>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8
>>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8
>>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8
>>
>>   # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>>     136
>>
>>   # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>>     9-15,136-143
>>
>> o v6.3-rc5 with commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported
>>   through sysfs") reverted (Behavior consistent with v6.2):
>>
>>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>>
>>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>>   # echo 1 > /sys/devices/system/cpu/cpu8/online
>>
>>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>>
>>   # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>>     8,136
>>
>>   # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>>     8-15,136-143
>>
>> This is not only limited to AMD processors but affects Intel processors
>> too. Following is the output from same experiment on a dual socket Intel
>> Ice Lake server (2 x 32C/64T) running kernel v6.3-rc5:
>>
>>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,72
>>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,72
>>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,72
>>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,
>>     26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,
>>     88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
>>
>>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>>   # echo 1 > /sys/devices/system/cpu/cpu8/online
>>
>>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8
>>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8
>>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8
>>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8
>>
>>   # cat /sys/devices/system/cpu/cpu72/cache/index0/shared_cpu_list
>>     72
>>
>>   # cat /sys/devices/system/cpu/cpu72/cache/index3/shared_cpu_list
>>     0,2,4,6,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,
>>     66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,
>>     120,122,124,126
>>
>> This patch addresses two issues associated with building
>> shared_cpu_list:
>>
>> o Patch 1 fixes an ID matching issue that can lead to cacheinfo
>>   associating CPUs from different cache levels in case IDs are not
>>   unique across all the different cache levels. 
>>
>> o Patch 2 clears the cpu_cacheinfo->cpu_map_populated flag when CPU goes
>>   offline and is removed from the shared_cpu_map.
>>
>> Following are the results after applying the series on v6.3-rc5 on
>> respective x86 platforms:
>>
>> o 3rd Generation AMD EPYC Processor (2 x 64C/128T)
>>
>>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>>   
>>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>>   # echo 1 > /sys/devices/system/cpu/cpu8/online
>>   
>>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>>   
>>   # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>>     8,136
>>   
>>   # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>>     8-15,136-143
>>
>> o Intel Ice Lake Xeon (2 x 32C/128T)
>>
>>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,72
>>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,72
>>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,72
>>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,
>>     28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,
>>     92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
>>   
>>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>>   # echo 1 > /sys/devices/system/cpu/cpu8/online
>>   
>>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,72
>>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,72
>>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,72
>>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,
>>     28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,
>>     92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
>>   
>>   # cat /sys/devices/system/cpu/cpu72/cache/index0/shared_cpu_list
>>     8,72
>>   
>>   # cat /sys/devices/system/cpu/cpu72/cache/index3/shared_cpu_list
>>     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,
>>     68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,
>>     124,126
>>
>> Running "grep -r 'cpu_map_populated' arch/" shows MIPS and loongarch too
>> set the cpu_cacheinfo->cpu_map_populated who might also be affected by
>> the changes in commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported
>> through sysfs") and this series. Changes from Patch 1 might also affect
>> RISC-V since Yong-Xuan Wang <yongxuan.wang@sifive.com> from SiFive last
>> made changes to cache_shared_cpu_map_setup() and
>> cache_shared_cpu_map_remove() in commit 198102c9103f ("cacheinfo: Fix
>> shared_cpu_map to handle shared caches at different levels").
> 
> I think they may be affected as well, it is just that it is not caught
> in the testing.

Yes, you are right, since those platforms support CPU hotplug as well.

> 
> Thanks for the detailed explanation and output logs. Not sure how much
> time it took you to write but saved lot of time by making it so simple
> to understand the exact issue. The changes look good.

Glad to know it helped during the review!

> 
> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>

Thank you for reviewing the changes :)

> 
> Hi Greg,
> 
> Can you please pick up these fixes in your next cycle of fixes for v6.4 ?
> 
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/2] drivers: base: cacheinfo: Fix shared_cpu_list inconsistency in event of CPU hotplug
  2023-05-08  8:41 [PATCH 0/2] drivers: base: cacheinfo: Fix shared_cpu_list inconsistency in event of CPU hotplug K Prateek Nayak
                   ` (2 preceding siblings ...)
  2023-05-09 15:01 ` [PATCH 0/2] drivers: base: cacheinfo: Fix shared_cpu_list inconsistency in event of CPU hotplug Sudeep Holla
@ 2023-05-18  1:02 ` Ricardo Neri
  2023-05-18  2:08   ` K Prateek Nayak
  3 siblings, 1 reply; 9+ messages in thread
From: Ricardo Neri @ 2023-05-18  1:02 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-kernel, gregkh, rafael, sudeep.holla, yongxuan.wang,
	pierre.gondois, vincent.chen, greentime.hu, yangyicong,
	prime.zeng, palmer, puwen

On Mon, May 08, 2023 at 02:11:13PM +0530, K Prateek Nayak wrote:
> Since v6.3-rc1, the shared_cpu_list in per-cpu cacheinfo breaks in case
> of hotplug activity on x86. This can be tracked back to two commits:
> 
> o commit 198102c9103f ("cacheinfo: Fix shared_cpu_map to handle shared
>   caches at different levels") that matches cache instance IDs without
>   considering if the instance IDs belong to same cache level or not.
> 
> o commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported through
>   sysfs") which skips calling populate_cache_leaves() if
>   last_level_cache_is_valid(cpu) returns true. populate_cache_leaves()
>   on x86 would have populated the shared_cpu_map when CPU comes online,
>   which is now skipped, and the alternate path has an early bailout
>   before setting the CPU in the shared_cpu_map is even attempted.
> 
>   On x86, populate_cache_leaves() also sets the
>   cpu_cacheinfo->cpu_map_populated flag when the cacheinfo is first
>   populated, the cache_shared_cpu_map_setup() in the driver is bypassed
>   when a thread comes back online during the hotplug activity. This leads
>   to the shared_cpu_list displaying abnormal values for the CPU that was
>   offlined and then onlined since the shared_cpu_maps are never
>   revaluated.
> 
> Following is the output from a dual socket 3rd Generation AMD EPYC
> processor (2 x 64C/128T) for cachinfo when offlining and then onlining
> CPU8:
> 
> o v6.3-rc5 with no changes:
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
> 
>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>   # echo 1 > /sys/devices/system/cpu/cpu8/online
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8
> 
>   # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>     136
> 
>   # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>     9-15,136-143
> 
> o v6.3-rc5 with commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported
>   through sysfs") reverted (Behavior consistent with v6.2):
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
> 
>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>   # echo 1 > /sys/devices/system/cpu/cpu8/online
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
> 
>   # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>     8,136
> 
>   # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>     8-15,136-143
> 
> This is not only limited to AMD processors but affects Intel processors
> too. Following is the output from same experiment on a dual socket Intel
> Ice Lake server (2 x 32C/64T) running kernel v6.3-rc5:
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,
>     26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,
>     88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
> 
>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>   # echo 1 > /sys/devices/system/cpu/cpu8/online
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8
> 
>   # cat /sys/devices/system/cpu/cpu72/cache/index0/shared_cpu_list
>     72
> 
>   # cat /sys/devices/system/cpu/cpu72/cache/index3/shared_cpu_list
>     0,2,4,6,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,
>     66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,
>     120,122,124,126
> 
> This patch addresses two issues associated with building
> shared_cpu_list:
> 
> o Patch 1 fixes an ID matching issue that can lead to cacheinfo
>   associating CPUs from different cache levels in case IDs are not
>   unique across all the different cache levels. 
> 
> o Patch 2 clears the cpu_cacheinfo->cpu_map_populated flag when CPU goes
>   offline and is removed from the shared_cpu_map.
> 
> Following are the results after applying the series on v6.3-rc5 on
> respective x86 platforms:
> 
> o 3rd Generation AMD EPYC Processor (2 x 64C/128T)
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>   
>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>   # echo 1 > /sys/devices/system/cpu/cpu8/online
>   
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>   
>   # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>     8,136
>   
>   # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>     8-15,136-143
> 
> o Intel Ice Lake Xeon (2 x 32C/128T)
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,
>     28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,
>     92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
>   
>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>   # echo 1 > /sys/devices/system/cpu/cpu8/online
>   
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,72
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,
>     28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,
>     92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
>   
>   # cat /sys/devices/system/cpu/cpu72/cache/index0/shared_cpu_list
>     8,72
>   
>   # cat /sys/devices/system/cpu/cpu72/cache/index3/shared_cpu_list
>     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,
>     68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,
>     124,126
> 
> Running "grep -r 'cpu_map_populated' arch/" shows MIPS and loongarch too
> set the cpu_cacheinfo->cpu_map_populated who might also be affected by
> the changes in commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported
> through sysfs") and this series. Changes from Patch 1 might also affect
> RISC-V since Yong-Xuan Wang <yongxuan.wang@sifive.com> from SiFive last
> made changes to cache_shared_cpu_map_setup() and
> cache_shared_cpu_map_remove() in commit 198102c9103f ("cacheinfo: Fix
> shared_cpu_map to handle shared caches at different levels").


I was able to reproduce the issue on an Alder Lake system (single socket).
I also was able to verify that the this patchset fixes the issue.

FWIW, Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/2] drivers: base: cacheinfo: Fix shared_cpu_list inconsistency in event of CPU hotplug
  2023-05-18  1:02 ` Ricardo Neri
@ 2023-05-18  2:08   ` K Prateek Nayak
  0 siblings, 0 replies; 9+ messages in thread
From: K Prateek Nayak @ 2023-05-18  2:08 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: linux-kernel, gregkh, rafael, sudeep.holla, yongxuan.wang,
	pierre.gondois, vincent.chen, greentime.hu, yangyicong,
	prime.zeng, palmer, puwen

Hello Ricardo,

On 5/18/2023 6:32 AM, Ricardo Neri wrote:
> On Mon, May 08, 2023 at 02:11:13PM +0530, K Prateek Nayak wrote:
>> [..snip..]
>>
>>
>> Running "grep -r 'cpu_map_populated' arch/" shows MIPS and loongarch too
>> set the cpu_cacheinfo->cpu_map_populated who might also be affected by
>> the changes in commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported
>> through sysfs") and this series. Changes from Patch 1 might also affect
>> RISC-V since Yong-Xuan Wang <yongxuan.wang@sifive.com> from SiFive last
>> made changes to cache_shared_cpu_map_setup() and
>> cache_shared_cpu_map_remove() in commit 198102c9103f ("cacheinfo: Fix
>> shared_cpu_map to handle shared caches at different levels").
> 
> 
> I was able to reproduce the issue on an Alder Lake system (single socket).
> I also was able to verify that the this patchset fixes the issue.
> 
> FWIW, Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>

Thank you for confirming the observation and testing the patches :)

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] drivers: base: cacheinfo: Update cpu_map_populated during CPU Hotplug
  2023-05-08  8:41 ` [PATCH 2/2] drivers: base: cacheinfo: Update cpu_map_populated during CPU Hotplug K Prateek Nayak
@ 2023-05-20  6:56   ` Yicong Yang
  2023-05-24  3:43     ` K Prateek Nayak
  0 siblings, 1 reply; 9+ messages in thread
From: Yicong Yang @ 2023-05-20  6:56 UTC (permalink / raw)
  To: K Prateek Nayak, linux-kernel, sudeep.holla
  Cc: yangyicong, gregkh, rafael, yongxuan.wang, pierre.gondois,
	vincent.chen, greentime.hu, prime.zeng, palmer, puwen

Hi Prateek,

On 2023/5/8 16:41, K Prateek Nayak wrote:
> Until commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported through
> sysfs"), cacheinfo called populate_cache_leaves() for CPU coming online
> which let the arch specific functions handle (at least on x86)
> populating the shared_cpu_map. However, with the changes in the
> aforementioned commit, populate_cache_leaves() is not called when a CPU
> comes online as a result of hotplug since last_level_cache_is_valid()
> returns true as the cacheinfo data is not discarded. The CPU coming

Yes in free_cache_attributes() we only update the shared_cpu_map but make
other attributes remained. From my feelings we should do all the work
opposite to detect_cache_attributes(), including free the memory allocated.

> online is not present in shared_cpu_map, however, it will not be added
> since the cpu_cacheinfo->cpu_map_populated flag is set (it is set in
> populate_cache_leaves() when cacheinfo is first populated for x86)
> 
> This can lead to inconsistencies in the shared_cpu_map when an offlined
> CPU comes online again. Example below depicts the inconsistency in the
> shared_cpu_list in cacheinfo when CPU8 is offlined and onlined again on
> a 3rd Generation EPYC processor:
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
> 
>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>   # echo 1 > /sys/devices/system/cpu/cpu8/online
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8
> 
>   # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>     136
> 
>   # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>     9-15,136-143
> 
> Clear the flag when the CPU is removed from shared_cpu_map when
> cache_shared_cpu_map_remove() is called during CPU hotplug. This will
> allow cache_shared_cpu_map_setup() to add the CPU coming back online in
> the shared_cpu_map. Set the flag again when the shared_cpu_map is setup.
> Following are results of performing the same test as described above with
> the changes:
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
> 
>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>   # echo 1 > /sys/devices/system/cpu/cpu8/online
> 
>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
> 
>   # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>     8,136
> 
>   # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>     8-15,136-143
> 
> Fixes: 5c2712387d48 ("cacheinfo: Fix LLC is not exported through sysfs")

It's ok for me to have this tag but I don't think this is the root cause,
the commit happens to expose the problem. Other arthitectures like arm64
never updates the this_cpu_ci->cpu_map_populated even after the cpumap is
populated.

> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thanks for fixing this!

Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>

> ---
>  drivers/base/cacheinfo.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
> index d1ae443fd7a0..cbae8be1fe52 100644
> --- a/drivers/base/cacheinfo.c
> +++ b/drivers/base/cacheinfo.c
> @@ -410,11 +410,14 @@ static int cache_shared_cpu_map_setup(unsigned int cpu)
>  			coherency_max_size = this_leaf->coherency_line_size;
>  	}
>  
> +	/* shared_cpu_map is now populated for the cpu */
> +	this_cpu_ci->cpu_map_populated = true;
>  	return 0;
>  }
>  
>  static void cache_shared_cpu_map_remove(unsigned int cpu)
>  {
> +	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
>  	struct cacheinfo *this_leaf, *sib_leaf;
>  	unsigned int sibling, index, sib_index;
>  
> @@ -447,6 +450,9 @@ static void cache_shared_cpu_map_remove(unsigned int cpu)
>  			}
>  		}
>  	}
> +
> +	/* cpu is no longer populated in the shared map */
> +	this_cpu_ci->cpu_map_populated = false;
>  }
>  
>  static void free_cache_attributes(unsigned int cpu)
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] drivers: base: cacheinfo: Update cpu_map_populated during CPU Hotplug
  2023-05-20  6:56   ` Yicong Yang
@ 2023-05-24  3:43     ` K Prateek Nayak
  0 siblings, 0 replies; 9+ messages in thread
From: K Prateek Nayak @ 2023-05-24  3:43 UTC (permalink / raw)
  To: Yicong Yang, linux-kernel, sudeep.holla
  Cc: yangyicong, gregkh, rafael, yongxuan.wang, pierre.gondois,
	vincent.chen, greentime.hu, prime.zeng, palmer, puwen

Hello Yicong,

On 5/20/2023 12:26 PM, Yicong Yang wrote:
> Hi Prateek,
> 
> On 2023/5/8 16:41, K Prateek Nayak wrote:
>> Until commit 5c2712387d48 ("cacheinfo: Fix LLC is not exported through
>> sysfs"), cacheinfo called populate_cache_leaves() for CPU coming online
>> which let the arch specific functions handle (at least on x86)
>> populating the shared_cpu_map. However, with the changes in the
>> aforementioned commit, populate_cache_leaves() is not called when a CPU
>> comes online as a result of hotplug since last_level_cache_is_valid()
>> returns true as the cacheinfo data is not discarded. The CPU coming
> 
> Yes in free_cache_attributes() we only update the shared_cpu_map but make
> other attributes remained. From my feelings we should do all the work
> opposite to detect_cache_attributes(), including free the memory allocated.

In fact, when free_cache_attributes() was first added in
commit 246246cbde5e ("drivers: base: support cpu cache information
interface to userspace via sysfs"), it did exactly that. It was later
changed in commit 5944ce092b97 ("arch_topology: Build cacheinfo from
primary CPU")

> 
>> online is not present in shared_cpu_map, however, it will not be added
>> since the cpu_cacheinfo->cpu_map_populated flag is set (it is set in
>> populate_cache_leaves() when cacheinfo is first populated for x86)
>>
>> This can lead to inconsistencies in the shared_cpu_map when an offlined
>> CPU comes online again. Example below depicts the inconsistency in the
>> shared_cpu_list in cacheinfo when CPU8 is offlined and onlined again on
>> a 3rd Generation EPYC processor:
>>
>>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>>
>>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>>   # echo 1 > /sys/devices/system/cpu/cpu8/online
>>
>>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8
>>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8
>>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8
>>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8
>>
>>   # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>>     136
>>
>>   # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>>     9-15,136-143
>>
>> Clear the flag when the CPU is removed from shared_cpu_map when
>> cache_shared_cpu_map_remove() is called during CPU hotplug. This will
>> allow cache_shared_cpu_map_setup() to add the CPU coming back online in
>> the shared_cpu_map. Set the flag again when the shared_cpu_map is setup.
>> Following are results of performing the same test as described above with
>> the changes:
>>
>>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>>
>>   # echo 0 > /sys/devices/system/cpu/cpu8/online
>>   # echo 1 > /sys/devices/system/cpu/cpu8/online
>>
>>   # for i in /sys/devices/system/cpu/cpu8/cache/index*/shared_cpu_list; do echo -n "$i: "; cat $i; done
>>     /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list: 8,136
>>     /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list: 8-15,136-143
>>
>>   # cat /sys/devices/system/cpu/cpu136/cache/index0/shared_cpu_list
>>     8,136
>>
>>   # cat /sys/devices/system/cpu/cpu136/cache/index3/shared_cpu_list
>>     8-15,136-143
>>
>> Fixes: 5c2712387d48 ("cacheinfo: Fix LLC is not exported through sysfs")
> 
> It's ok for me to have this tag but I don't think this is the root cause,
> the commit happens to expose the problem. Other arthitectures like arm64
> never updates the this_cpu_ci->cpu_map_populated even after the cpumap is
> populated.

I agree. I added the tag to indicate where the behavior changed for x86.
Fun fact, "cpu_map_populated" was added specifically for x86 in commit
fac51482577d5 ("drivers: base: cacheinfo: fix x86 with CONFIG_OF
enabled") back in 2016. There seems to be a long history between the
cacheinfo driver and the arch specific methods :)

> 
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> 
> Thanks for fixing this!
> 
> Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>

Thank you for reviewing the changes.

> 
>> ---
>>  drivers/base/cacheinfo.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
>> index d1ae443fd7a0..cbae8be1fe52 100644
>> --- a/drivers/base/cacheinfo.c
>> +++ b/drivers/base/cacheinfo.c
>> @@ -410,11 +410,14 @@ static int cache_shared_cpu_map_setup(unsigned int cpu)
>>  			coherency_max_size = this_leaf->coherency_line_size;
>>  	}
>>  
>> +	/* shared_cpu_map is now populated for the cpu */
>> +	this_cpu_ci->cpu_map_populated = true;
>>  	return 0;
>>  }
>>  
>>  static void cache_shared_cpu_map_remove(unsigned int cpu)
>>  {
>> +	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
>>  	struct cacheinfo *this_leaf, *sib_leaf;
>>  	unsigned int sibling, index, sib_index;
>>  
>> @@ -447,6 +450,9 @@ static void cache_shared_cpu_map_remove(unsigned int cpu)
>>  			}
>>  		}
>>  	}
>> +
>> +	/* cpu is no longer populated in the shared map */
>> +	this_cpu_ci->cpu_map_populated = false;
>>  }
>>  
>>  static void free_cache_attributes(unsigned int cpu)
>>
 
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-05-24  3:43 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-08  8:41 [PATCH 0/2] drivers: base: cacheinfo: Fix shared_cpu_list inconsistency in event of CPU hotplug K Prateek Nayak
2023-05-08  8:41 ` [PATCH 1/2] drivers: base: cacheinfo: Fix shared_cpu_map changes " K Prateek Nayak
2023-05-08  8:41 ` [PATCH 2/2] drivers: base: cacheinfo: Update cpu_map_populated during CPU Hotplug K Prateek Nayak
2023-05-20  6:56   ` Yicong Yang
2023-05-24  3:43     ` K Prateek Nayak
2023-05-09 15:01 ` [PATCH 0/2] drivers: base: cacheinfo: Fix shared_cpu_list inconsistency in event of CPU hotplug Sudeep Holla
2023-05-09 15:10   ` K Prateek Nayak
2023-05-18  1:02 ` Ricardo Neri
2023-05-18  2:08   ` K Prateek Nayak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).