* [PATCH v4 00/24] sched: Introduce classes of tasks for load balance
@ 2023-06-13  4:23 Ricardo Neri
  2023-06-13  4:23 ` [PATCH v4 01/24] sched/task_struct: Introduce IPC classes of tasks Ricardo Neri
                   ` (23 more replies)
  0 siblings, 24 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:23 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri

Hi,

This is the fourth version of this patchset. Previous versions can be found
here [1], here [2], and here [3]. For brevity, I did not include the cover
letter from the original posting. You can read it here [1].

This patchset depends on a separate patchset that improves asym_packing
between SMT cores [4]; that patchset was recently merged into the tip tree.

For your convenience, you can retrieve this patchset from [5]. It is based
on the master branch of the tip tree as of 12th June 2023.

Changes since v3:

Vincent highlighted the issue of using the current tasks of runqueues to
collect IPCC statistics. Since these are the last tasks to be pulled during
load balance, the proposed tie breakers would have little effect. Instead,
I have reworked the collection of IPCC statistics to look at the back of
the runqueues. This makes the IPCC scores more useful, as these are the
first tasks to be pulled during load balance.

Rafael argued that the HFI driver should not handle tasks directly. I have
reworked the code to let the HFI driver read the classification result.
A new sched_ipcc.c file under arch/x86 is in charge of handling tasks and
postprocessing the classification results.

Rafael also raised concerns about the need for a memory barrier to order
reads and writes of the cached IPCC scores that the scheduler can access
without holding the HFI lock. I decided to use a seqcount. The seqcount
provides store-release and load-acquire semantics, ensuring proper
ordering of reads and writes of the cached IPCC scores. It also provides a
compiler barrier. A CPU memory barrier is not needed.
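
A minimal sketch of that pattern, with made-up names (not the exact code in
the HFI driver changes later in this series; writers still serialize among
themselves, e.g., via the HFI instance lock):

  #include <linux/seqlock.h>

  static seqcount_t ipcc_seqcount = SEQCNT_ZERO(ipcc_seqcount);
  static unsigned long ipcc_scores[8];	/* hypothetical class count */

  /* Writer (HFI update path); writers serialize via the instance lock. */
  static void update_cached_score(int class, unsigned long new_score)
  {
	write_seqcount_begin(&ipcc_seqcount);
	ipcc_scores[class] = new_score;
	write_seqcount_end(&ipcc_seqcount);
  }

  /* Reader (scheduler); lockless, retries if it races with a writer. */
  static unsigned long read_cached_score(int class)
  {
	unsigned long score;
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&ipcc_seqcount);
		score = ipcc_scores[class];
	} while (read_seqcount_retry(&ipcc_seqcount, seq));

	return score;
  }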

Zhao noticed that there is a new CPUID_7_1_EAX leaf, so using cpuid_bits[]
in scattered.c is not needed. I fixed it.

lkp@intel.com found a build break due to a misplaced definition of
MSR_IA32_HW_FEEDBACK_CHAR. I moved the definition to patch 13.

Updated patches: 6, 10, 13, 14, 15, 17, 21, 22
New patches: none
Unchanged patches: 1, 2, 3, 4, 5, 7, 8, 9, 11, 12, 16, 18, 19, 20, 23, 24

Thanks in advance for your kind feedback!
BR,
Ricardo

[1]. https://lore.kernel.org/lkml/20220909231205.14009-1-ricardo.neri-calderon@linux.intel.com/
[2]. https://lore.kernel.org/lkml/20221128132100.30253-1-ricardo.neri-calderon@linux.intel.com/
[3]. https://lore.kernel.org/all/20230207051105.11575-1-ricardo.neri-calderon@linux.intel.com/
[4]. https://lore.kernel.org/all/20230406203148.19182-1-ricardo.neri-calderon@linux.intel.com/
[5]. https://github.com/ricardon/tip/tree/rneri/ipc_classes_v4

Ricardo Neri (24):
  sched/task_struct: Introduce IPC classes of tasks
  sched: Add interfaces for IPC classes
  sched/core: Initialize the IPC class of a new task
  sched/core: Add user_tick as argument to scheduler_tick()
  sched/core: Update the IPC class of the current task
  sched/fair: Collect load-balancing stats for IPC classes
  sched/fair: Compute IPC class scores for load balancing
  sched/fair: Use IPCC stats to break ties between asym_packing sched
    groups
  sched/fair: Use IPCC stats to break ties between fully_busy SMT groups
  sched/fair: Use IPCC scores to select a busiest runqueue
  thermal: intel: hfi: Introduce Intel Thread Director classes
  x86/cpufeatures: Add the Intel Thread Director feature definitions
  x86/sched: Update the IPC class of the current task
  thermal: intel: hfi: Store per-CPU IPCC scores
  thermal: intel: hfi: Report the IPC class score of a CPU
  thermal: intel: hfi: Define a default class for unclassified tasks
  thermal: intel: hfi: Enable the Intel Thread Director
  sched/task_struct: Add helpers for IPC classification
  sched/core: Initialize helpers of task classification
  sched/fair: Introduce sched_smt_siblings_idle()
  x86/sched/ipcc: Implement model-specific checks for task
    classification
  x86/cpufeatures: Add feature bit for HRESET
  x86/hreset: Configure history reset
  x86/process: Reset hardware history in context switch

 arch/x86/include/asm/cpufeatures.h       |   2 +
 arch/x86/include/asm/disabled-features.h |   8 +-
 arch/x86/include/asm/hreset.h            |  30 +++
 arch/x86/include/asm/msr-index.h         |   5 +
 arch/x86/include/asm/topology.h          |  15 ++
 arch/x86/kernel/Makefile                 |   2 +
 arch/x86/kernel/cpu/common.c             |  30 ++-
 arch/x86/kernel/cpu/cpuid-deps.c         |   1 +
 arch/x86/kernel/process_32.c             |   3 +
 arch/x86/kernel/process_64.c             |   3 +
 arch/x86/kernel/sched_ipcc.c             |  93 +++++++
 drivers/thermal/intel/Kconfig            |   1 +
 drivers/thermal/intel/intel_hfi.c        | 219 +++++++++++++++-
 include/linux/sched.h                    |  24 +-
 include/linux/sched/topology.h           |   6 +
 init/Kconfig                             |  12 +
 kernel/sched/core.c                      |  10 +-
 kernel/sched/fair.c                      | 316 ++++++++++++++++++++++-
 kernel/sched/sched.h                     |  66 +++++
 kernel/sched/topology.c                  |   9 +
 kernel/time/timer.c                      |   2 +-
 21 files changed, 840 insertions(+), 17 deletions(-)
 create mode 100644 arch/x86/include/asm/hreset.h
 create mode 100644 arch/x86/kernel/sched_ipcc.c

-- 
2.25.1



* [PATCH v4 01/24] sched/task_struct: Introduce IPC classes of tasks
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
@ 2023-06-13  4:23 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 02/24] sched: Add interfaces for IPC classes Ricardo Neri
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:23 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

On hybrid processors, the architecture differences between the types of
CPUs result in different instructions-per-cycle (IPC) rates for each type
of CPU. IPCs may vary further by the type of instructions being executed.
Instructions can be grouped into classes of similar IPCs.

Tasks can be classified into groups based on the type of instructions they
execute.

Add a new member task_struct::ipcc to associate a particular task to
an IPC class that depends on the instructions it executes.

The scheduler may use the IPC class of a task and data about the
performance among CPUs of a given IPC class to improve throughput. It
may, for instance, place certain classes of tasks on CPUs of higher
performance.

The methods to determine the classification of a task and its relative
IPC score are specific to each CPU architecture.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * Changed the type of task_struct::ipcc to unsigned short. A subsequent
   patch uses bit fields so that ipcc takes 9 bits, along with other
   auxiliary members.

Changes since v1:
 * Renamed task_struct::class as task_struct::ipcc. (Joel)
 * Use task_struct::ipcc = 0 for unclassified tasks. (PeterZ)
 * Renamed CONFIG_SCHED_TASK_CLASSES as CONFIG_IPC_CLASSES. (PeterZ, Joel)
---
 include/linux/sched.h | 10 ++++++++++
 init/Kconfig          | 12 ++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1292d38d66cc..9fdee040f450 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -129,6 +129,8 @@ struct user_event_mm;
 					 __TASK_TRACED | EXIT_DEAD | EXIT_ZOMBIE | \
 					 TASK_PARKED)
 
+#define IPC_CLASS_UNCLASSIFIED		0
+
 #define task_is_running(task)		(READ_ONCE((task)->__state) == TASK_RUNNING)
 
 #define task_is_traced(task)		((READ_ONCE(task->jobctl) & JOBCTL_TRACED) != 0)
@@ -1534,6 +1536,14 @@ struct task_struct {
 	struct user_event_mm		*user_event_mm;
 #endif
 
+#ifdef CONFIG_IPC_CLASSES
+	/*
+	 * A hardware-defined classification of the task that reflects, but
+	 * is not identical to, the number of instructions per cycle.
+	 */
+	unsigned short			ipcc;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/init/Kconfig b/init/Kconfig
index 32c24950c4ce..ea3371ccb530 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -839,6 +839,18 @@ config UCLAMP_BUCKETS_COUNT
 
 	  If in doubt, use the default value.
 
+config IPC_CLASSES
+	bool "IPC classes of tasks"
+	depends on SMP
+	help
+	  If selected, each task is assigned a classification value that
+	  reflects the type of instructions that the task executes. This
+	  classification reflects but is not equal to the number of
+	  instructions retired per cycle.
+
+	  The scheduler uses the classification value to improve the placement
+	  of tasks.
+
 endmenu
 
 #
-- 
2.25.1



* [PATCH v4 02/24] sched: Add interfaces for IPC classes
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
  2023-06-13  4:23 ` [PATCH v4 01/24] sched/task_struct: Introduce IPC classes of tasks Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 03/24] sched/core: Initialize the IPC class of a new task Ricardo Neri
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

Add the interfaces that architectures shall implement to convey the data
to support IPC classes.

arch_update_ipcc() updates the IPC classification of the current task as
given by hardware.

arch_get_ipcc_score() provides a performance score for a given IPC class
when placed on a specific CPU. Higher scores indicate higher performance.

When a driver or equivalent enablement code has configured the necessary
hardware to support IPC classes, it should call sched_enable_ipc_classes()
to notify the scheduler that it can start using IPC classes data.

The number of classes and the score of each class of task are determined
by hardware.
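
To sketch the expected usage (hypothetical architecture code, not the x86
implementation added later in this series): an architecture overrides the
hooks in a header visible to the scheduler, and its enablement code flips
the static key once the hardware is configured:

  /* arch/xyz/include/asm/topology.h (hypothetical) */
  #define arch_update_ipcc	xyz_update_ipcc
  #define arch_get_ipcc_score	xyz_get_ipcc_score

  void xyz_update_ipcc(struct task_struct *curr);
  unsigned long xyz_get_ipcc_score(unsigned short ipcc, int cpu);

  /*
   * Enablement code, once the classification hardware is ready. Note that
   * sched_enable_ipc_classes() enables the static key with
   * static_branch_enable_cpuslocked(), so the CPU hotplug lock must be
   * held here.
   */
  static void xyz_enable_task_classification(void)
  {
	/* ... configure the classification hardware ... */
	sched_enable_ipc_classes();
  }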

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * Clarified the properties of the IPC score: abstract and linear. It can
   be normalized when needed. (Ionela)
 * Selected a better default IPC score. (Ionela)
 * Removed arch_has_ipc_classes(). It is not suitable for hardware that is
   not ready to support IPC classes after boot. (Lukasz)
 * Added a new sched_enable_ipc_classes() interface that drivers or
   enablement code can call when ready to support IPC classes. (Lukasz)

Changes since v1:
 * Shortened the names of the IPCC interfaces (PeterZ):
   sched_task_classes_enabled >> sched_ipcc_enabled
   arch_has_task_classes >> arch_has_ipc_classes
   arch_update_task_class >> arch_update_ipcc
   arch_get_task_class_score >> arch_get_ipcc_score
 * Removed smt_siblings_idle argument from arch_update_ipcc(). (PeterZ)
---
 include/linux/sched/topology.h |  6 ++++
 kernel/sched/sched.h           | 66 ++++++++++++++++++++++++++++++++++
 kernel/sched/topology.c        |  9 +++++
 3 files changed, 81 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..5b084d3c9ad1 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -280,4 +280,10 @@ static inline int task_node(const struct task_struct *p)
 	return cpu_to_node(task_cpu(p));
 }
 
+#ifdef CONFIG_IPC_CLASSES
+extern void sched_enable_ipc_classes(void);
+#else
+static inline void sched_enable_ipc_classes(void) { }
+#endif
+
 #endif /* _LINUX_SCHED_TOPOLOGY_H */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 556496c77dc2..03fb53e9f340 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2544,6 +2544,72 @@ void arch_scale_freq_tick(void)
 }
 #endif
 
+#ifdef CONFIG_IPC_CLASSES
+DECLARE_STATIC_KEY_FALSE(sched_ipcc);
+
+static inline bool sched_ipcc_enabled(void)
+{
+	return static_branch_unlikely(&sched_ipcc);
+}
+
+#ifndef arch_update_ipcc
+/**
+ * arch_update_ipcc() - Update the IPC class of the current task
+ * @curr:		The current task
+ *
+ * Request that the IPC classification of @curr is updated.
+ *
+ * Returns: none
+ */
+static __always_inline
+void arch_update_ipcc(struct task_struct *curr)
+{
+}
+#endif
+
+#ifndef arch_get_ipcc_score
+
+#define SCHED_IPCC_SCORE_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)
+/**
+ * arch_get_ipcc_score() - Get the IPC score of a class of task
+ * @ipcc:	The IPC class
+ * @cpu:	A CPU number
+ *
+ * The IPC performance score reflects (but is not identical to) the number
+ * of instructions retired per cycle for a given IPC class. It is a linear and
+ * abstract metric. Higher scores reflect better performance.
+ *
+ * The IPC score can be normalized with respect to the class, i, with the
+ * highest IPC score on the CPU, c, with highest performance:
+ *
+ *            IPC(i, c)
+ *  ------------------------------------ * SCHED_IPCC_SCORE_SCALE
+ *     max(IPC(i, c) : (i, c))
+ *
+ * Scheduling schemes that want to use the IPC score along with other
+ * normalized metrics for scheduling (e.g., CPU capacity) may need to normalize
+ * it.
+ *
+ * Other scheduling schemes (e.g., asym_packing) do not need normalization.
+ *
+ * Returns the performance score of an IPC class, @ipcc, when running on @cpu.
+ * Error when either @ipcc or @cpu is invalid.
+ */
+static __always_inline
+unsigned long arch_get_ipcc_score(unsigned short ipcc, int cpu)
+{
+	return SCHED_IPCC_SCORE_SCALE;
+}
+#endif
+#else /* CONFIG_IPC_CLASSES */
+
+#define arch_get_ipcc_score(ipcc, cpu) (-EINVAL)
+#define arch_update_ipcc(curr)
+
+static inline bool sched_ipcc_enabled(void) { return false; }
+
+#endif /* CONFIG_IPC_CLASSES */
+
 #ifndef arch_scale_freq_capacity
 /**
  * arch_scale_freq_capacity - get the frequency scale factor of a given CPU.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ca4472281c28..c3a633a4b474 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -672,6 +672,15 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
 DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
 
+#ifdef CONFIG_IPC_CLASSES
+DEFINE_STATIC_KEY_FALSE(sched_ipcc);
+
+void sched_enable_ipc_classes(void)
+{
+	static_branch_enable_cpuslocked(&sched_ipcc);
+}
+#endif
+
 static void update_top_cache_domain(int cpu)
 {
 	struct sched_domain_shared *sds = NULL;
-- 
2.25.1



* [PATCH v4 03/24] sched/core: Initialize the IPC class of a new task
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
  2023-06-13  4:23 ` [PATCH v4 01/24] sched/task_struct: Introduce IPC classes of tasks Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 02/24] sched: Add interfaces for IPC classes Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 04/24] sched/core: Add user_tick as argument to scheduler_tick() Ricardo Neri
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

New tasks shall start life as unclassified. They will be classified by
hardware when they run.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * None
---
 kernel/sched/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d9a04faa48a1..77b769c23186 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4500,6 +4500,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+#ifdef CONFIG_IPC_CLASSES
+	p->ipcc				= IPC_CLASS_UNCLASSIFIED;
+#endif
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
2.25.1



* [PATCH v4 04/24] sched/core: Add user_tick as argument to scheduler_tick()
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (2 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 03/24] sched/core: Initialize the IPC class of a new task Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 05/24] sched/core: Update the IPC class of the current task Ricardo Neri
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

Differentiate between user and kernel ticks so that the scheduler updates
the IPC class of the current task during the former.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * Corrected error in the changeset description: the IPC class of the
   current task is updated at user tick. (Dietmar)

Changes since v1:
 * None
---
 include/linux/sched.h | 2 +-
 kernel/sched/core.c   | 2 +-
 kernel/time/timer.c   | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9fdee040f450..0e1c38ad09c2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -295,7 +295,7 @@ enum {
 	TASK_COMM_LEN = 16,
 };
 
-extern void scheduler_tick(void);
+extern void scheduler_tick(bool user_tick);
 
 #define	MAX_SCHEDULE_TIMEOUT		LONG_MAX
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 77b769c23186..6ac53f01d989 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5639,7 +5639,7 @@ static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
  */
-void scheduler_tick(void)
+void scheduler_tick(bool user_tick)
 {
 	int cpu = smp_processor_id();
 	struct rq *rq = cpu_rq(cpu);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 63a8ce7177dd..e15e24105891 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -2073,7 +2073,7 @@ void update_process_times(int user_tick)
 	if (in_irq())
 		irq_work_tick();
 #endif
-	scheduler_tick();
+	scheduler_tick(user_tick);
 	if (IS_ENABLED(CONFIG_POSIX_TIMERS))
 		run_posix_cpu_timers();
 }
-- 
2.25.1



* [PATCH v4 05/24] sched/core: Update the IPC class of the current task
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (3 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 04/24] sched/core: Add user_tick as argument to scheduler_tick() Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 06/24] sched/fair: Collect load-balancing stats for IPC classes Ricardo Neri
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

When supported, hardware monitors the instruction stream to classify the
current task. Hence, at the user tick, we are ready to read the most recent
classification result for the current task.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * Removed argument smt_siblings_idle from call to arch_ipcc_update().
 * Used the new IPCC interfaces names.
---
 kernel/sched/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6ac53f01d989..876396b1d077 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5651,6 +5651,9 @@ void scheduler_tick(bool user_tick)
 	if (housekeeping_cpu(cpu, HK_TYPE_TICK))
 		arch_scale_freq_tick();
 
+	if (sched_ipcc_enabled() && user_tick)
+		arch_update_ipcc(curr);
+
 	sched_clock_tick();
 
 	rq_lock(rq, &rf);
-- 
2.25.1



* [PATCH v4 06/24] sched/fair: Collect load-balancing stats for IPC classes
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (4 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 05/24] sched/core: Update the IPC class of the current task Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-14  0:29   ` Ricardo Neri
  2023-06-22  9:01   ` Ionela Voinescu
  2023-06-13  4:24 ` [PATCH v4 07/24] sched/fair: Compute IPC class scores for load balancing Ricardo Neri
                   ` (17 subsequent siblings)
  23 siblings, 2 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

When selecting the busiest scheduling group between two otherwise identical
groups of types asym_packing or fully_busy, IPC classes can be used to
break the tie.

Compute the IPC class performance score for a scheduling group. It is
defined as the sum of the IPC scores of the tasks at the back of each
runqueue in the group. Load balancing pulls tasks from the back of the
runqueue first, making this tiebreaker more useful.

Also, track the IPC class with the lowest score in the scheduling group. A
task of this class will be pulled when the destination CPU has lower
priority than the fully_busy busiest group.

These two metrics will be used during idle load balancing to compute the
current and the potential IPC class score of a scheduling group in a
subsequent changeset.
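
As a concrete example (invented numbers): in a scheduling group with two
busy CPUs whose last-queued tasks have IPCC scores 700 and 300, sum_score =
700 + 300 = 1000 and min_score = 300, with min_ipcc recording the class of
the task that scored 300. That task is the candidate to be pulled when the
destination CPU has lower priority than the busiest group.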

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Do not compute the IPCC stats using the current tasks of runqueues.
   Instead, use the tasks at the back of the queue. These are the tasks
   that will be pulled first during load balance. (Vincent)

Changes since v2:
 * Also excluded deadline and realtime tasks from IPCC stats. (Dietmar)
 * Also excluded tasks that cannot run on the destination CPU from the
   IPCC stats.
 * Folded struct sg_lb_ipcc_stats into struct sg_lb_stats. (Dietmar)
 * Reworded description sg_lb_stats::min_ipcc. (Ionela)
 * Handle errors of arch_get_ipcc_score(). (Ionela)

Changes since v1:
 * Implemented cleanups and reworks from PeterZ. Thanks!
 * Used the new interface names.
---
 kernel/sched/fair.c | 79 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6189d1a45635..c0cab5e501b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9110,6 +9110,11 @@ struct sg_lb_stats {
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
 #endif
+#ifdef CONFIG_IPC_CLASSES
+	unsigned long min_score; /* Min(score(ipcc of rq's last task)) */
+	unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the group */
+	unsigned long sum_score; /* Sum(score(ipcc of rq's last task)) */
+#endif
 };
 
 /*
@@ -9387,6 +9392,77 @@ group_type group_classify(unsigned int imbalance_pct,
 	return group_has_spare;
 }
 
+#ifdef CONFIG_IPC_CLASSES
+static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
+{
+	/* All IPCC stats have been set to zero in update_sg_lb_stats(). */
+	sgs->min_score = ULONG_MAX;
+}
+
+static int rq_last_task_ipcc(int dst_cpu, struct rq *rq, unsigned short *ipcc)
+{
+	struct list_head *tasks = &rq->cfs_tasks;
+	struct task_struct *p;
+	struct rq_flags rf;
+	int ret = -EINVAL;
+
+	rq_lock_irqsave(rq, &rf);
+	if (list_empty(tasks))
+		goto out;
+
+	p = list_last_entry(tasks, struct task_struct, se.group_node);
+	if (p->flags & PF_EXITING || is_idle_task(p) ||
+	    !cpumask_test_cpu(dst_cpu, p->cpus_ptr))
+		goto out;
+
+	ret = 0;
+	*ipcc = p->ipcc;
+out:
+	rq_unlock_irqrestore(rq, &rf);
+	return ret;
+}
+
+/* Called only if cpu_of(@rq) is not idle and has tasks running. */
+static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
+				    struct rq *rq)
+{
+	unsigned short ipcc;
+	unsigned long score;
+
+	if (!sched_ipcc_enabled())
+		return;
+
+	if (rq_last_task_ipcc(dst_cpu, rq, &ipcc))
+		return;
+
+	score = arch_get_ipcc_score(ipcc, cpu_of(rq));
+
+	/*
+	 * Ignore tasks with invalid scores. When finding the busiest group, we
+	 * prefer those with higher sum_score. This group will not be selected.
+	 */
+	if (IS_ERR_VALUE(score))
+		return;
+
+	sgs->sum_score += score;
+
+	if (score < sgs->min_score) {
+		sgs->min_score = score;
+		sgs->min_ipcc = ipcc;
+	}
+}
+
+#else /* CONFIG_IPC_CLASSES */
+static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
+				    struct rq *rq)
+{
+}
+
+static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
+{
+}
+#endif /* CONFIG_IPC_CLASSES */
+
 /**
  * sched_use_asym_prio - Check whether asym_packing priority must be used
  * @sd:		The scheduling domain of the load balancing
@@ -9477,6 +9553,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	int i, nr_running, local_group;
 
 	memset(sgs, 0, sizeof(*sgs));
+	init_rq_ipcc_stats(sgs);
 
 	local_group = group == sds->local;
 
@@ -9526,6 +9603,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			if (sgs->group_misfit_task_load < load)
 				sgs->group_misfit_task_load = load;
 		}
+
+		update_sg_lb_ipcc_stats(env->dst_cpu, sgs, rq);
 	}
 
 	sgs->group_capacity = group->sgc->capacity;
-- 
2.25.1



* [PATCH v4 07/24] sched/fair: Compute IPC class scores for load balancing
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (5 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 06/24] sched/fair: Collect load-balancing stats for IPC classes Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-22  9:02   ` Ionela Voinescu
  2023-06-13  4:24 ` [PATCH v4 08/24] sched/fair: Use IPCC stats to break ties between asym_packing sched groups Ricardo Neri
                   ` (16 subsequent siblings)
  23 siblings, 1 reply; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

When using IPCC scores to break ties between two scheduling groups, it is
necessary to consider both the current score and the score that would
result after load balancing.

Compute the combined IPC class score of a scheduling group and the local
scheduling group. Compute both the current score and the prospective score.

Collect IPCC statistics only for asym_packing and fully_busy scheduling
groups. These are the only cases that use IPCC scores.

These IPCC statistics are used during idle load balancing. The candidate
scheduling group will have one fewer busy CPU after load balancing. This
observation is important for cores with SMT support.

The IPCC score of scheduling groups composed of SMT siblings needs to
consider that the siblings share CPU resources. When computing the total
IPCC score of the scheduling group, divide the score of each sibling by
the number of busy siblings.
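
For example (invented numbers): an SMT-2 core with both siblings busy and
last-task scores of 800 and 600 contributes a current score of
(800 + 600) / 2 = 700. After load balancing pulls the minimum-score task,
the remaining sibling no longer shares the core, so the prospective score
becomes 800 / 1 = 800, plus the score that the pulled task obtains on the
destination CPU.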

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * Also collect IPCC stats for fully_busy sched groups.
 * Restrict use of IPCC stats to SD_ASYM_PACKING. (Ionela)
 * Handle errors of arch_get_ipcc_score(). (Ionela)

Changes since v1:
 * Implemented cleanups and reworks from PeterZ. I took all his
   suggestions, except the computation of the  IPC score before and after
   load balancing. We are computing not the average score, but the *total*.
 * Check for the SD_SHARE_CPUCAPACITY to compute the throughput of the SMT
   siblings of a physical core.
 * Used the new interface names.
 * Reworded commit message for clarity.
---
 kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0cab5e501b6..a51c65c9335f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9114,6 +9114,8 @@ struct sg_lb_stats {
 	unsigned long min_score; /* Min(score(ipcc of rq's last task)) */
 	unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the group */
 	unsigned long sum_score; /* Sum(score(ipcc of rq's last task)) */
+	long ipcc_score_after; /* Prospective IPCC score after load balancing */
+	unsigned long ipcc_score_before; /* IPCC score before load balancing */
 #endif
 };
 
@@ -9452,6 +9454,62 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
 	}
 }
 
+static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
+				      struct sched_group *sg,
+				      struct lb_env *env)
+{
+	unsigned long score_on_dst_cpu, before;
+	int busy_cpus;
+	long after;
+
+	if (!sched_ipcc_enabled())
+		return;
+
+	/*
+	 * IPCC scores are only useful during idle load balancing. For now,
+	 * only asym_packing uses IPCC scores.
+	 */
+	if (!(env->sd->flags & SD_ASYM_PACKING) ||
+	    env->idle == CPU_NOT_IDLE)
+		return;
+
+	/*
+	 * IPCC scores are used to break ties only between these types of
+	 * groups.
+	 */
+	if (sgs->group_type != group_fully_busy &&
+	    sgs->group_type != group_asym_packing)
+		return;
+
+	busy_cpus = sgs->group_weight - sgs->idle_cpus;
+
+	/* No busy CPUs in the group. No tasks to move. */
+	if (!busy_cpus)
+		return;
+
+	score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
+
+	/*
+	 * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
+	 * and not used.
+	 */
+	if (IS_ERR_VALUE(score_on_dst_cpu))
+		return;
+
+	before = sgs->sum_score;
+	after = before - sgs->min_score;
+
+	/* SMT siblings share throughput. */
+	if (busy_cpus > 1 && sg->flags & SD_SHARE_CPUCAPACITY) {
+		before /= busy_cpus;
+		/* One sibling will become idle after load balance. */
+		after /= busy_cpus - 1;
+	}
+
+	sgs->ipcc_score_after = after + score_on_dst_cpu;
+	sgs->ipcc_score_before = before;
+}
+
 #else /* CONFIG_IPC_CLASSES */
 static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
 				    struct rq *rq)
@@ -9461,6 +9519,13 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
 static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
 {
 }
+
+static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
+				      struct sched_group *sg,
+				      struct lb_env *env)
+{
+}
+
 #endif /* CONFIG_IPC_CLASSES */
 
 /**
@@ -9620,6 +9685,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
+	if (!local_group)
+		update_sg_lb_stats_scores(sgs, group, env);
+
 	/* Computing avg_load makes sense only when group is overloaded */
 	if (sgs->group_type == group_overloaded)
 		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
-- 
2.25.1



* [PATCH v4 08/24] sched/fair: Use IPCC stats to break ties between asym_packing sched groups
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (6 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 07/24] sched/fair: Compute IPC class scores for load balancing Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 09/24] sched/fair: Use IPCC stats to break ties between fully_busy SMT groups Ricardo Neri
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

update_sd_pick_busiest() selects as busiest the candidate group passed to
it as argument if it has the same priority as the current busiest. Either
group is a good choice. IPCC statistics reflect the class of work on a
scheduling group. Use this data to break the priority tie between the
candidate and current busiest groups.

Pick as busiest the scheduling group that yields a higher IPCC score
after load balancing.
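
For instance (invented numbers): if the candidate group yields
ipcc_score_after = 1200 and the current busiest yields 1100, the candidate
is picked. When both yield the same ipcc_score_after, the group with the
lower ipcc_score_before is preferred, as it gains more from the migration.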

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * Added a comment to clarify why sched_asym_prefer() needs a tie breaker
   only in update_sd_pick_busiest(). (PeterZ)
 * Renamed functions for accuracy:
   sched_asym_class_prefer() >> sched_asym_ipcc_prefer()
   sched_asym_class_pick() >> sched_asym_ipcc_pick()
 * Reworded commit message for clarity.
---
 kernel/sched/fair.c | 72 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a51c65c9335f..fb3d793fe9ad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9510,6 +9510,60 @@ static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
 	sgs->ipcc_score_before = before;
 }
 
+/**
+ * sched_asym_ipcc_prefer - Select a sched group based on its IPCC score
+ * @a:	Load balancing statistics of a sched group
+ * @b:	Load balancing statistics of a second sched group
+ *
+ * Returns: true if @a has a higher IPCC score than @b after load balance.
+ * False otherwise.
+ */
+static bool sched_asym_ipcc_prefer(struct sg_lb_stats *a,
+				   struct sg_lb_stats *b)
+{
+	if (!sched_ipcc_enabled())
+		return false;
+
+	/* @a increases overall throughput after load balance. */
+	if (a->ipcc_score_after > b->ipcc_score_after)
+		return true;
+
+	/*
+	 * If @a and @b yield the same overall throughput, pick @a if
+	 * its current throughput is lower than that of @b.
+	 */
+	if (a->ipcc_score_after == b->ipcc_score_after)
+		return a->ipcc_score_before < b->ipcc_score_before;
+
+	return false;
+}
+
+/**
+ * sched_asym_ipcc_pick - Select a sched group based on its IPCC score
+ * @a:		A scheduling group
+ * @b:		A second scheduling group
+ * @a_stats:	Load balancing statistics of @a
+ * @b_stats:	Load balancing statistics of @b
+ *
+ * Returns: true if @a has the same priority and @a has tasks with IPC classes
+ * that yield higher overall throughput after load balance. False otherwise.
+ */
+static bool sched_asym_ipcc_pick(struct sched_group *a,
+				 struct sched_group *b,
+				 struct sg_lb_stats *a_stats,
+				 struct sg_lb_stats *b_stats)
+{
+	/*
+	 * Only use the class-specific preference selection if both sched
+	 * groups have the same priority.
+	 */
+	if (arch_asym_cpu_priority(a->asym_prefer_cpu) !=
+	    arch_asym_cpu_priority(b->asym_prefer_cpu))
+		return false;
+
+	return sched_asym_ipcc_prefer(a_stats, b_stats);
+}
+
 #else /* CONFIG_IPC_CLASSES */
 static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
 				    struct rq *rq)
@@ -9526,6 +9580,14 @@ static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
 {
 }
 
+static bool sched_asym_ipcc_pick(struct sched_group *a,
+				 struct sched_group *b,
+				 struct sg_lb_stats *a_stats,
+				 struct sg_lb_stats *b_stats)
+{
+	return false;
+}
+
 #endif /* CONFIG_IPC_CLASSES */
 
 /**
@@ -9759,6 +9821,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		/* Prefer to move from lowest priority CPU's work */
 		if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
 			return false;
+
+		/*
+		 * Unlike other callers of sched_asym_prefer(), here both @sg
+		 * and @sds::busiest have tasks running. When they have equal
+		 * priority, their IPC class scores can be used to select a
+		 * better busiest.
+		 */
+		if (sched_asym_ipcc_pick(sds->busiest, sg, &sds->busiest_stat, sgs))
+			return false;
+
 		break;
 
 	case group_misfit_task:
-- 
2.25.1



* [PATCH v4 09/24] sched/fair: Use IPCC stats to break ties between fully_busy SMT groups
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (7 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 08/24] sched/fair: Use IPCC stats to break ties between asym_packing sched groups Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 10/24] sched/fair: Use IPCC scores to select a busiest runqueue Ricardo Neri
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

IPCC statistics are used during idle load balancing. After balancing, one
of the siblings of an SMT core becomes idle. The remaining busy siblings
experience increased throughput. The IPCC statistics provide a measure of
the increased throughput. Use them to select the busiest group between
otherwise identical fully_busy scheduling groups. (The avg_load is not
computed in this case and is zero for both groups).

IPCC scores are not needed to break ties with non-SMT fully_busy sched
groups. SMT sched groups always need more help.

Add a stub sched_asym_ipcc_prefer() to handle !CONFIG_IPC_CLASSES.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 kernel/sched/fair.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fb3d793fe9ad..fcec791ede4f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9580,6 +9580,12 @@ static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
 {
 }
 
+static bool sched_asym_ipcc_prefer(struct sg_lb_stats *a,
+				   struct sg_lb_stats *b)
+{
+	return false;
+}
+
 static bool sched_asym_ipcc_pick(struct sched_group *a,
 				 struct sched_group *b,
 				 struct sg_lb_stats *a_stats,
@@ -9861,10 +9867,21 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		if (sgs->avg_load == busiest->avg_load) {
 			/*
 			 * SMT sched groups need more help than non-SMT groups.
-			 * If @sg happens to also be SMT, either choice is good.
 			 */
-			if (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
-				return false;
+			if (sds->busiest->flags & SD_SHARE_CPUCAPACITY) {
+				if (!(sg->flags & SD_SHARE_CPUCAPACITY))
+					return false;
+
+				/*
+				 * Between two SMT groups, use IPCC scores to pick the
+				 * one that would improve throughput the most (only
+				 * asym_packing uses IPCC scores for now).
+				 */
+				if (sched_ipcc_enabled() &&
+				    env->sd->flags & SD_ASYM_PACKING &&
+				    sched_asym_ipcc_prefer(busiest, sgs))
+					return false;
+			}
 		}
 
 		break;
-- 
2.25.1



* [PATCH v4 10/24] sched/fair: Use IPCC scores to select a busiest runqueue
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (8 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 09/24] sched/fair: Use IPCC stats to break ties between fully_busy SMT groups Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-22  9:03   ` Ionela Voinescu
  2023-06-13  4:24 ` [PATCH v4 11/24] thermal: intel: hfi: Introduce Intel Thread Director classes Ricardo Neri
                   ` (13 subsequent siblings)
  23 siblings, 1 reply; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

Use IPCC scores to break a tie between two runqueues with the same priority
and number of running tasks: select the runqueue whose last-enqueued task
would get a higher IPC boost when migrated to the destination CPU. (These
tasks are migrated first during load balance.)

For now, restrict the utilization of IPCC scores to scheduling domains
marked with the SD_ASYM_PACKING flag.
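
As an example (invented numbers): two runqueues both have three running
tasks. The last task of runqueue A would score 900 on the destination CPU
versus 700 where it runs now (delta = +200), while the last task of
runqueue B would score 750 versus 700 (delta = +50). Runqueue A is selected
as the busiest, since migrating its task yields the bigger IPC boost.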

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Do not compute the IPCC stats using the current tasks of runqueues.
   Instead, use the tasks at the back of the queue. These are the tasks
   that will be pulled first during load balance. (Vincent)

Changes since v2:
 * Only use IPCC scores to break ties if the sched domain uses
   asym_packing. (Ionela)
 * Handle errors of arch_get_ipcc_score(). (Ionela)

Changes since v1:
 * Fixed a bug when selecting a busiest runqueue: when comparing two
   runqueues with equal nr_running, we must compute the IPCC score delta
   of both.
 * Renamed local variables to improve the layout of the code block.
   (PeterZ)
 * Used the new interface names.
---
 kernel/sched/fair.c | 61 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fcec791ede4f..da3e009eef42 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9564,6 +9564,41 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
 	return sched_asym_ipcc_prefer(a_stats, b_stats);
 }
 
+/**
+ * ipcc_score_delta - Get the IPCC score delta wrt the load balance's dst_cpu
+ * @rq:		A runqueue
+ * @env:	Load balancing environment
+ *
+ * Returns: The IPCC score delta that the last task enqueued in @rq would get
+ * if placed in the destination CPU of @env. LONG_MIN to indicate that the
+ * delta should not be used.
+ */
+static long ipcc_score_delta(struct rq *rq, struct lb_env *env)
+{
+	unsigned long score_src, score_dst;
+	unsigned short ipcc;
+
+	if (!sched_ipcc_enabled())
+		return LONG_MIN;
+
+	/* Only asym_packing uses IPCC scores at the moment. */
+	if (!(env->sd->flags & SD_ASYM_PACKING))
+		return LONG_MIN;
+
+	if (rq_last_task_ipcc(env->dst_cpu, rq, &ipcc))
+		return LONG_MIN;
+
+	score_dst = arch_get_ipcc_score(ipcc, env->dst_cpu);
+	if (IS_ERR_VALUE(score_dst))
+		return LONG_MIN;
+
+	score_src = arch_get_ipcc_score(ipcc, cpu_of(rq));
+	if (IS_ERR_VALUE(score_src))
+		return LONG_MIN;
+
+	return score_dst - score_src;
+}
+
 #else /* CONFIG_IPC_CLASSES */
 static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
 				    struct rq *rq)
@@ -9594,6 +9629,11 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
 	return false;
 }
 
+static long ipcc_score_delta(struct rq *rq, struct lb_env *env)
+{
+	return LONG_MIN;
+}
+
 #endif /* CONFIG_IPC_CLASSES */
 
 /**
@@ -10769,6 +10809,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 {
 	struct rq *busiest = NULL, *rq;
 	unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
+	long busiest_ipcc_delta = LONG_MIN;
 	unsigned int busiest_nr = 0;
 	int i;
 
@@ -10885,6 +10926,26 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			if (busiest_nr < nr_running) {
 				busiest_nr = nr_running;
 				busiest = rq;
+
+				/*
+				 * Remember the IPCC score delta of the busiest
+				 * runqueue. We may need it to break a tie with
+				 * other queues with equal nr_running.
+				 */
+				busiest_ipcc_delta = ipcc_score_delta(busiest, env);
+			/*
+			 * For ties, select @rq if doing so would give its last
+			 * queued task a bigger IPC boost when migrated to
+			 * dst_cpu.
+			 */
+			} else if (busiest_nr == nr_running) {
+				long delta = ipcc_score_delta(rq, env);
+
+				if (busiest_ipcc_delta < delta) {
+					busiest_ipcc_delta = delta;
+					busiest_nr = nr_running;
+					busiest = rq;
+				}
 			}
 			break;
 
-- 
2.25.1



* [PATCH v4 11/24] thermal: intel: hfi: Introduce Intel Thread Director classes
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (9 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 10/24] sched/fair: Use IPCC scores to select a busiest runqueue Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 12/24] x86/cpufeatures: Add the Intel Thread Director feature definitions Ricardo Neri
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

On Intel hybrid parts, each type of CPU has specific performance and
energy efficiency capabilities. The Intel Thread Director technology
extends the Hardware Feedback Interface (HFI) to provide performance and
energy efficiency data for advanced classes of instructions.

Add support to parse per-class capabilities.
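
To sketch how the strides described below compose (a hypothetical helper
that assumes byte-granular strides; the actual driver arithmetic and units
may differ), the per-class capability data of a logical processor would be
located as:

  /* Hypothetical illustration only; hfi_features as defined below. */
  static u8 *hfi_class_caps(u8 *table, unsigned int cpu_index,
			    unsigned int class)
  {
	return table + hfi_features.hdr_size +
	       cpu_index * hfi_features.cpu_stride +
	       class * hfi_features.class_stride;
  }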

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Added Acked-by tag from Rafael.

Changes since v2:
 * None

Changes since v1:
 * Removed a now obsolete comment.
---
 drivers/thermal/intel/intel_hfi.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index c69db6c90869..b41547912fda 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -78,7 +78,7 @@ union cpuid6_edx {
  * @ee_cap:		Energy efficiency capability
  *
  * Capabilities of a logical processor in the HFI table. These capabilities are
- * unitless.
+ * unitless and specific to each HFI class.
  */
 struct hfi_cpu_data {
 	u8	perf_cap;
@@ -90,7 +90,8 @@ struct hfi_cpu_data {
  * @perf_updated:	Hardware updated performance capabilities
  * @ee_updated:		Hardware updated energy efficiency capabilities
  *
- * Properties of the data in an HFI table.
+ * Properties of the data in an HFI table. There is one header for each
+ * HFI class.
  */
 struct hfi_hdr {
 	u8	perf_updated;
@@ -128,16 +129,21 @@ struct hfi_instance {
 
 /**
  * struct hfi_features - Supported HFI features
+ * @nr_classes:		Number of classes supported
  * @nr_table_pages:	Size of the HFI table in 4KB pages
  * @cpu_stride:		Stride size to locate the capability data of a logical
  *			processor within the table (i.e., row stride)
+ * @class_stride:	Stride size to locate a class within the capability
+ *			data of a logical processor or the HFI table header
  * @hdr_size:		Size of the table header
  *
  * Parameters and supported features that are common to all HFI instances
  */
 struct hfi_features {
+	unsigned int	nr_classes;
 	size_t		nr_table_pages;
 	unsigned int	cpu_stride;
+	unsigned int	class_stride;
 	unsigned int	hdr_size;
 };
 
@@ -334,8 +340,8 @@ static void init_hfi_cpu_index(struct hfi_cpu_info *info)
 }
 
 /*
- * The format of the HFI table depends on the number of capabilities that the
- * hardware supports. Keep a data structure to navigate the table.
+ * The format of the HFI table depends on the number of capabilities and classes
+ * that the hardware supports. Keep a data structure to navigate the table.
  */
 static void init_hfi_instance(struct hfi_instance *hfi_instance)
 {
@@ -516,18 +522,30 @@ static __init int hfi_parse_features(void)
 	/* The number of 4KB pages required by the table */
 	hfi_features.nr_table_pages = edx.split.table_pages + 1;
 
+	/*
+	 * Capability fields of an HFI class are grouped together. Classes are
+	 * contiguous in memory.  Hence, use the number of supported features to
+	 * locate a specific class.
+	 */
+	hfi_features.class_stride = nr_capabilities;
+
+	/* For now, use only one class of the HFI table */
+	hfi_features.nr_classes = 1;
+
 	/*
 	 * The header contains change indications for each supported feature.
 	 * The size of the table header is rounded up to be a multiple of 8
 	 * bytes.
 	 */
-	hfi_features.hdr_size = DIV_ROUND_UP(nr_capabilities, 8) * 8;
+	hfi_features.hdr_size = DIV_ROUND_UP(nr_capabilities *
+					     hfi_features.nr_classes, 8) * 8;
 
 	/*
 	 * Data of each logical processor is also rounded up to be a multiple
 	 * of 8 bytes.
 	 */
-	hfi_features.cpu_stride = DIV_ROUND_UP(nr_capabilities, 8) * 8;
+	hfi_features.cpu_stride = DIV_ROUND_UP(nr_capabilities *
+					       hfi_features.nr_classes, 8) * 8;
 
 	return 0;
 }
-- 
2.25.1
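
The stride arithmetic above can be illustrated in userspace. The sizes
below are made up (two one-byte capabilities, four classes); only the
addressing scheme mirrors what hfi_parse_features() computes:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NR_CAPS		2			/* perf_cap, ee_cap */
#define NR_CLASSES	4
#define CLASS_STRIDE	NR_CAPS			/* bytes */
/* Per-CPU data is rounded up to a multiple of 8 bytes. */
#define CPU_STRIDE	(((NR_CAPS * NR_CLASSES) + 7) / 8 * 8)

int main(void)
{
	uint8_t table[2 * CPU_STRIDE];		/* room for two CPUs */
	int cpu_index = 1, class = 2;
	uint8_t *caps;

	memset(table, 0, sizeof(table));

	/* Row stride selects the CPU, column stride selects the class. */
	caps = table + cpu_index * CPU_STRIDE + class * CLASS_STRIDE;
	caps[0] = 42;				/* perf_cap */

	printf("offset=%td perf_cap=%u\n", caps - table, caps[0]);
	return 0;
}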



* [PATCH v4 12/24] x86/cpufeatures: Add the Intel Thread Director feature definitions
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (10 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 11/24] thermal: intel: hfi: Introduce Intel Thread Director classes Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 13/24] x86/sched: Update the IPC class of the current task Ricardo Neri
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

Intel Thread Director (ITD) provides hardware resources to classify
the current task. The classification reflects the type of instructions that
a task currently executes.

ITD extends the Hardware Feedback Interface table to provide performance
and energy efficiency capabilities for each of the supported classes of
tasks.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * Removed dependency on CONFIG_INTEL_THREAD_DIRECTOR. Instead, depend on
   CONFIG_IPC_CLASSES.
 * Added DISABLE_ITD to the correct DISABLE_MASK: 14 instead of 13.
---
 arch/x86/include/asm/cpufeatures.h       | 1 +
 arch/x86/include/asm/disabled-features.h | 8 +++++++-
 arch/x86/kernel/cpu/cpuid-deps.c         | 1 +
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index cb8ca46213be..98a84cbf4261 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -353,6 +353,7 @@
 #define X86_FEATURE_HWP_EPP		(14*32+10) /* HWP Energy Perf. Preference */
 #define X86_FEATURE_HWP_PKG_REQ		(14*32+11) /* HWP Package Level Request */
 #define X86_FEATURE_HFI			(14*32+19) /* Hardware Feedback Interface */
+#define X86_FEATURE_ITD			(14*32+23) /* Intel Thread Director */
 
 /* AMD SVM Feature Identification, CPUID level 0x8000000a (EDX), word 15 */
 #define X86_FEATURE_NPT			(15*32+ 0) /* Nested Page Table support */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index fafe9be7a6f4..fad78bd840cd 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -105,6 +105,12 @@
 # define DISABLE_TDX_GUEST	(1 << (X86_FEATURE_TDX_GUEST & 31))
 #endif
 
+#ifdef CONFIG_IPC_CLASSES
+# define DISABLE_ITD	0
+#else
+# define DISABLE_ITD	(1 << (X86_FEATURE_ITD & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -123,7 +129,7 @@
 			 DISABLE_CALL_DEPTH_TRACKING)
 #define DISABLED_MASK12	(DISABLE_LAM)
 #define DISABLED_MASK13	0
-#define DISABLED_MASK14	0
+#define DISABLED_MASK14	(DISABLE_ITD)
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
 			 DISABLE_ENQCMD)
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index f6748c8bd647..7a87b823eef3 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -81,6 +81,7 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_XFD,			X86_FEATURE_XSAVES    },
 	{ X86_FEATURE_XFD,			X86_FEATURE_XGETBV1   },
 	{ X86_FEATURE_AMX_TILE,			X86_FEATURE_XFD       },
+	{ X86_FEATURE_ITD,			X86_FEATURE_HFI       },
 	{}
 };
 
-- 
2.25.1
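
The disabled-features machinery is plain bit arithmetic: a feature is
encoded as word * 32 + bit, and the DISABLE_* macro keeps only the bit
within its 32-bit word. A userspace sketch of the values involved:

#include <stdio.h>

#define X86_FEATURE_ITD		(14 * 32 + 23)
#define DISABLE_ITD		(1u << (X86_FEATURE_ITD & 31))

int main(void)
{
	/*
	 * With CONFIG_IPC_CLASSES=n, this bit is set in DISABLED_MASK14
	 * and cpu_feature_enabled(X86_FEATURE_ITD) folds to false at
	 * compile time.
	 */
	printf("word=%d bit=%d mask14=0x%08x\n",
	       X86_FEATURE_ITD / 32, X86_FEATURE_ITD & 31, DISABLE_ITD);
	return 0;
}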



* [PATCH v4 13/24] x86/sched: Update the IPC class of the current task
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (11 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 12/24] x86/cpufeatures: Add the Intel Thread Director feature definitions Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 14/24] thermal: intel: hfi: Store per-CPU IPCC scores Ricardo Neri
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

Intel Thread Director provides a classification value based on the type of
instructions that the CPU is currently executing. Use this classification to
update the IPC class of the current task.

The responsibility for configuring and enabling both the Hardware Feedback
Interface and Intel Thread Director lies with the HFI driver, but it should
not directly handle tasks.

Update the HFI driver to read the register that provides the classification
result. Implement the arch_update_ipcc() interface of the scheduler under
arch/x86 code to update the IPC class of individual tasks.

Task classification only makes sense when used along with the HFI driver.
Make the HFI driver select CONFIG_IPC_CLASSES. Users may, however, enable
CONFIG_IPC_CLASSES without the HFI driver. Add function stubs to prevent
build errors in that case.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Relocated the implementation of arch_update_ipcc() from drivers/thermal
   to arch/x86. (Rafael)
 * Moved the definition of MSR_IA32_HW_FEEDBACK_CHAR into this patch.
   (Reported-by: kernel test robot <lkp@intel.com>)
 * Select CONFIG_IPC_CLASSES when CONFIG_INTEL_HFI_THERMAL is selected to
   reduce the configuration burden of the user/administrator. (Srinivas)

Changes since v2:
 * Removed the implementation of arch_has_ipc_classes().

Changes since v1:
 * Adjusted the result of the classification of Intel Thread Director to start
   at class 1. Class 0 for the scheduler means that the task is
   unclassified.
 * Redefined union hfi_thread_feedback_char_msr to ensure all
   bit-fields are packed. (PeterZ)
 * Removed CONFIG_INTEL_THREAD_DIRECTOR. (PeterZ)
 * Shortened the names of the functions that implement IPC classes.
 * Removed argument smt_siblings_idle from intel_hfi_update_ipcc().
   (PeterZ)
---
 arch/x86/include/asm/msr-index.h  |  1 +
 arch/x86/include/asm/topology.h   | 11 +++++++++
 arch/x86/kernel/Makefile          |  2 ++
 arch/x86/kernel/sched_ipcc.c      | 35 +++++++++++++++++++++++++++++
 drivers/thermal/intel/Kconfig     |  1 +
 drivers/thermal/intel/intel_hfi.c | 37 +++++++++++++++++++++++++++++++
 6 files changed, 87 insertions(+)
 create mode 100644 arch/x86/kernel/sched_ipcc.c

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3aedae61af4f..0bc4ed0ff787 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1108,6 +1108,7 @@
 /* Hardware Feedback Interface */
 #define MSR_IA32_HW_FEEDBACK_PTR        0x17d0
 #define MSR_IA32_HW_FEEDBACK_CONFIG     0x17d1
+#define MSR_IA32_HW_FEEDBACK_CHAR	0x17d2
 
 /* x2APIC locked status */
 #define MSR_IA32_XAPIC_DISABLE_STATUS	0xBD
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index caf41c4869a0..7d2ed7f7bb8c 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -235,4 +235,15 @@ void init_freq_invariance_cppc(void);
 #define arch_init_invariance_cppc init_freq_invariance_cppc
 #endif
 
+#ifdef CONFIG_INTEL_HFI_THERMAL
+int intel_hfi_read_classid(u8 *classid);
+#else
+static inline int intel_hfi_read_classid(u8 *classid) { return -ENODEV; }
+#endif
+
+#ifdef CONFIG_IPC_CLASSES
+void intel_update_ipcc(struct task_struct *curr);
+#define arch_update_ipcc intel_update_ipcc
+#endif
+
 #endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 4070a01c11b7..f9b9d213ddaa 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -145,6 +145,8 @@ obj-$(CONFIG_CFI_CLANG)			+= cfi.o
 
 obj-$(CONFIG_CALL_THUNKS)		+= callthunks.o
 
+obj-$(CONFIG_IPC_CLASSES)		+= sched_ipcc.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/sched_ipcc.c b/arch/x86/kernel/sched_ipcc.c
new file mode 100644
index 000000000000..685e7b3b5375
--- /dev/null
+++ b/arch/x86/kernel/sched_ipcc.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Intel support for scheduler IPC classes
+ *
+ * Copyright (c) 2023, Intel Corporation.
+ *
+ * Author: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
+ *
+ * On hybrid processors, the architecture differences between types of CPUs
+ * lead to different number of retired instructions per cycle (IPC). IPCs may
+ * differ further by classes of instructions.
+ *
+ * The scheduler assigns an IPC class to every task with arch_update_ipcc()
+ * from data that hardware provides. Implement this interface for x86.
+ *
+ * See kernel/sched/sched.h for details.
+ */
+
+#include <linux/sched.h>
+
+#include <asm/topology.h>
+
+void intel_update_ipcc(struct task_struct *curr)
+{
+	u8 hfi_class;
+
+	if (intel_hfi_read_classid(&hfi_class))
+		return;
+
+	/*
+	 * 0 is a valid classification for Intel Thread Director. A scheduler
+	 * IPCC class of 0 means that the task is unclassified. Adjust.
+	 */
+	curr->ipcc = hfi_class + 1;
+}
diff --git a/drivers/thermal/intel/Kconfig b/drivers/thermal/intel/Kconfig
index ecd7e07eece0..418db04dc876 100644
--- a/drivers/thermal/intel/Kconfig
+++ b/drivers/thermal/intel/Kconfig
@@ -109,6 +109,7 @@ config INTEL_HFI_THERMAL
 	depends on CPU_SUP_INTEL
 	depends on X86_THERMAL_VECTOR
 	select THERMAL_NETLINK
+	select IPC_CLASSES
 	help
 	  Select this option to enable the Hardware Feedback Interface. If
 	  selected, hardware provides guidance to the operating system on
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index b41547912fda..20ee4264dcd4 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -72,6 +72,15 @@ union cpuid6_edx {
 	u32 full;
 };
 
+union hfi_thread_feedback_char_msr {
+	struct {
+		u64	classid : 8;
+		u64	__reserved : 55;
+		u64	valid : 1;
+	} split;
+	u64 full;
+};
+
 /**
  * struct hfi_cpu_data - HFI capabilities per CPU
  * @perf_cap:		Performance capability
@@ -171,6 +180,34 @@ static struct workqueue_struct *hfi_updates_wq;
 #define HFI_UPDATE_INTERVAL		HZ
 #define HFI_MAX_THERM_NOTIFY_COUNT	16
 
+/**
+ * intel_hfi_read_classid() - Read the current classid
+ * @classid:	Variable to which the classid will be written.
+ *
+ * Read the classification that Intel Thread Director has produced when this
+ * function is called. Thread classification must be enabled before calling
+ * this function.
+ *
+ * Return: 0 if the produced classification is valid. Error otherwise.
+ */
+int intel_hfi_read_classid(u8 *classid)
+{
+	union hfi_thread_feedback_char_msr msr;
+
+	/* We should not be here if ITD is not supported. */
+	if (!cpu_feature_enabled(X86_FEATURE_ITD)) {
+		pr_warn_once("task classification requested but not supported!");
+		return -ENODEV;
+	}
+
+	rdmsrl(MSR_IA32_HW_FEEDBACK_CHAR, msr.full);
+	if (!msr.split.valid)
+		return -EINVAL;
+
+	*classid = msr.split.classid;
+	return 0;
+}
+
 static void get_hfi_caps(struct hfi_instance *hfi_instance,
 			 struct thermal_genl_cpu_caps *cpu_caps)
 {
-- 
2.25.1
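
The MSR decoding and the off-by-one between HFI and scheduler classes can
be sketched in userspace. The raw value below is made up (valid bit set,
classid 3); the union mirrors hfi_thread_feedback_char_msr from the patch:

#include <stdint.h>
#include <stdio.h>

union hfi_thread_feedback_char_msr {
	struct {
		uint64_t classid : 8;
		uint64_t __reserved : 55;
		uint64_t valid : 1;
	} split;
	uint64_t full;
};

int main(void)
{
	union hfi_thread_feedback_char_msr msr;
	unsigned short ipcc;

	msr.full = (1ULL << 63) | 3;	/* as if read with rdmsrl() */

	if (!msr.split.valid)
		return 1;

	/* HFI class 0 is valid; scheduler class 0 means unclassified. */
	ipcc = msr.split.classid + 1;
	printf("hfi_class=%llu ipcc=%u\n",
	       (unsigned long long)msr.split.classid, ipcc);
	return 0;
}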



* [PATCH v4 14/24] thermal: intel: hfi: Store per-CPU IPCC scores
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (12 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 13/24] x86/sched: Update the IPC class of the current task Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-29 18:53   ` Rafael J. Wysocki
  2023-06-13  4:24 ` [PATCH v4 15/24] thermal: intel: hfi: Report the IPC class score of a CPU Ricardo Neri
                   ` (9 subsequent siblings)
  23 siblings, 1 reply; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

The scheduler reads the IPCC scores when balancing load. These reads can
occur frequently and originate from many CPUs. Hardware may also
occasionally update the HFI table. Controlling access with locks would
cause contention.

Cache the IPCC scores in separate per-CPU variables that the scheduler can
use. Use a seqcount to synchronize memory accesses to these cached values.
This eliminates the need for locks, as the sequence counter provides the
memory ordering required to prevent the use of stale data.

The HFI delayed workqueue guarantees that only one CPU writes the cached
IPCC scores. Updates are infrequent (at most once every CONFIG_HZ jiffies),
and each update writes on the order of tens of values. Writes should not
starve reads.

Only cache IPCC scores in this changeset. A subsequent changeset will
use these scores.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * As Rafael requested, I reworked the memory ordering of the cached IPCC
   scores. I selected a seqcount as it is less expensive than a memory
   barrier, which is not necessary anyway.
 * Made alloc_hfi_ipcc_scores() return -ENOMEM on allocation failure.
   (Rafael)
 * Added a comment to describe hfi_ipcc_scores. (Rafael)

Changes since v2:
 * Only create these per-CPU variables when Intel Thread Director is
   supported.

Changes since v1:
 * Added this patch.
---
 drivers/thermal/intel/intel_hfi.c | 66 +++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index 20ee4264dcd4..d822ed0bb5c1 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -29,9 +29,11 @@
 #include <linux/kernel.h>
 #include <linux/math.h>
 #include <linux/mutex.h>
+#include <linux/percpu.h>
 #include <linux/percpu-defs.h>
 #include <linux/printk.h>
 #include <linux/processor.h>
+#include <linux/seqlock.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/string.h>
@@ -180,6 +182,62 @@ static struct workqueue_struct *hfi_updates_wq;
 #define HFI_UPDATE_INTERVAL		HZ
 #define HFI_MAX_THERM_NOTIFY_COUNT	16
 
+/* A cache of the HFI perf capabilities for lockless access. */
+static int __percpu *hfi_ipcc_scores;
+/* Sequence counter for hfi_ipcc_scores */
+static seqcount_t hfi_ipcc_seqcount = SEQCNT_ZERO(hfi_ipcc_seqcount);
+
+static int alloc_hfi_ipcc_scores(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_ITD))
+		return 0;
+
+	hfi_ipcc_scores = __alloc_percpu(sizeof(*hfi_ipcc_scores) *
+					 hfi_features.nr_classes,
+					 sizeof(*hfi_ipcc_scores));
+
+	return hfi_ipcc_scores ? 0 : -ENOMEM;
+}
+
+static void set_hfi_ipcc_scores(struct hfi_instance *hfi_instance)
+{
+	int cpu;
+
+	if (!cpu_feature_enabled(X86_FEATURE_ITD))
+		return;
+
+	/*
+	 * Serialize with writes to the HFI table. It also protects the write
+	 * loop against seqcount readers running in interrupt context.
+	 */
+	raw_spin_lock_irq(&hfi_instance->table_lock);
+	/*
+	 * The seqcount implies store-release semantics to order stores with
+	 * lockless loads from the seqcount read side. It also implies a
+	 * compiler barrier.
+	 */
+	write_seqcount_begin(&hfi_ipcc_seqcount);
+	for_each_cpu(cpu, hfi_instance->cpus) {
+		int c, *scores;
+		s16 index;
+
+		index = per_cpu(hfi_cpu_info, cpu).index;
+		scores = per_cpu_ptr(hfi_ipcc_scores, cpu);
+
+		for (c = 0;  c < hfi_features.nr_classes; c++) {
+			struct hfi_cpu_data *caps;
+
+			caps = hfi_instance->data +
+			       index * hfi_features.cpu_stride +
+			       c * hfi_features.class_stride;
+			scores[c] = caps->perf_cap;
+		}
+	}
+
+	write_seqcount_end(&hfi_ipcc_seqcount);
+	raw_spin_unlock_irq(&hfi_instance->table_lock);
+}
+
 /**
  * intel_hfi_read_classid() - Read the current classid
  * @classid:	Variable to which the classid will be written.
@@ -275,6 +333,8 @@ static void update_capabilities(struct hfi_instance *hfi_instance)
 		thermal_genl_cpu_capability_event(cpu_count, &cpu_caps[i]);
 
 	kfree(cpu_caps);
+
+	set_hfi_ipcc_scores(hfi_instance);
 out:
 	mutex_unlock(&hfi_instance_lock);
 }
@@ -618,8 +678,14 @@ void __init intel_hfi_init(void)
 	if (!hfi_updates_wq)
 		goto err_nomem;
 
+	if (alloc_hfi_ipcc_scores())
+		goto err_ipcc;
+
 	return;
 
+err_ipcc:
+	destroy_workqueue(hfi_updates_wq);
+
 err_nomem:
 	for (j = 0; j < i; ++j) {
 		hfi_instance = &hfi_instances[j];
-- 
2.25.1
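
The write side of the protocol can be sketched in userspace. This is a
simplification: a single writer (as guaranteed here by the HFI workqueue)
makes the sequence odd while an update is in flight, and the memory
barriers that the kernel's write_seqcount_begin()/end() provide are
omitted:

#include <stdio.h>

static unsigned int seq;
static int cached_scores[4];		/* one score per class */

static void update_scores(const int *fresh, int nr_classes)
{
	int c;

	seq++;				/* begin: seq becomes odd */
	for (c = 0; c < nr_classes; c++)
		cached_scores[c] = fresh[c];
	seq++;				/* end: seq is even again */
}

int main(void)
{
	int fresh[4] = { 12, 34, 56, 78 };

	update_scores(fresh, 4);
	printf("seq=%u class 1 score=%d\n", seq, cached_scores[0]);
	return 0;
}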



* [PATCH v4 15/24] thermal: intel: hfi: Report the IPC class score of a CPU
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (13 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 14/24] thermal: intel: hfi: Store per-CPU IPCC scores Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-29 18:56   ` Rafael J. Wysocki
  2023-06-13  4:24 ` [PATCH v4 16/24] thermal: intel: hfi: Define a default class for unclassified tasks Ricardo Neri
                   ` (8 subsequent siblings)
  23 siblings, 1 reply; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

Implement the arch_get_ipcc_score() interface of the scheduler. Use the
performance capabilities of the extended Hardware Feedback Interface table
as the IPC score.

Use the cached per-CPU IPCC scores. A seqcount provides lockless access and
the required memory ordering to avoid stale data.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Ordered memory loads using a seqcount.
 * Removed local variable hfi_class from intel_hfi_get_ipcc_score().
   (Rafael).

Changes since v2:
 * None

Changes since v1:
 * Adjusted the returned HFI class (which starts at 0) to match the
   scheduler IPCC class (which starts at 1). (PeterZ)
 * Used the new interface names.
---
 arch/x86/include/asm/topology.h   |  4 ++++
 drivers/thermal/intel/intel_hfi.c | 40 +++++++++++++++++++++++++++++--
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 7d2ed7f7bb8c..8dd328cdb5cf 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -237,13 +237,17 @@ void init_freq_invariance_cppc(void);
 
 #ifdef CONFIG_INTEL_HFI_THERMAL
 int intel_hfi_read_classid(u8 *classid);
+unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu);
 #else
 static inline int intel_hfi_read_classid(u8 *classid) { return -ENODEV; }
+static inline unsigned long
+intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu) { return -ENODEV; }
 #endif
 
 #ifdef CONFIG_IPC_CLASSES
 void intel_update_ipcc(struct task_struct *curr);
 #define arch_update_ipcc intel_update_ipcc
+#define arch_get_ipcc_score intel_hfi_get_ipcc_score
 #endif
 
 #endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index d822ed0bb5c1..fec2c01fda1b 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -199,6 +199,42 @@ static int alloc_hfi_ipcc_scores(void)
 	return hfi_ipcc_scores ? 0 : -ENOMEM;
 }
 
+unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
+{
+	int *scores, score;
+	unsigned long seq;
+
+	if (cpu < 0 || cpu >= nr_cpu_ids)
+		return -EINVAL;
+
+	if (!hfi_ipcc_scores)
+		return -ENODEV;
+	scores = per_cpu_ptr(hfi_ipcc_scores, cpu);
+
+	if (ipcc == IPC_CLASS_UNCLASSIFIED)
+		return -EINVAL;
+
+	/*
+	 * Scheduler IPC classes start at 1. HFI classes start at 0.
+	 * See the note in intel_update_ipcc().
+	 */
+	if (ipcc >= hfi_features.nr_classes + 1)
+		return -EINVAL;
+
+	/*
+	 * The seqcount implies load-acquire semantics to order loads with
+	 * lockless stores of the write side in set_hfi_ipcc_score(). It
+	 * also implies a compiler barrier.
+	 */
+	do {
+		seq = read_seqcount_begin(&hfi_ipcc_seqcount);
+		/* @ipcc is never 0. */
+		score = scores[ipcc - 1];
+	} while (read_seqcount_retry(&hfi_ipcc_seqcount, seq));
+
+	return score;
+}
+
 static void set_hfi_ipcc_scores(struct hfi_instance *hfi_instance)
 {
 	int cpu;
@@ -213,8 +249,8 @@ static void set_hfi_ipcc_scores(struct hfi_instance *hfi_instance)
 	raw_spin_lock_irq(&hfi_instance->table_lock);
 	/*
 	 * The seqcount implies store-release semantics to order stores with
-	 * lockless loads from the seqcount read side. It also implies a
-	 * compiler barrier.
+	 * lockless loads from the seqcount read side in
+	 * intel_hfi_get_ipcc_score(). It also implies a compiler barrier.
 	 */
 	write_seqcount_begin(&hfi_ipcc_seqcount);
 	for_each_cpu(cpu, hfi_instance->cpus) {
-- 
2.25.1
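
The matching read side retries whenever the sequence is odd (a write is
in flight) or changed between the two reads. Again a simplified userspace
sketch, without the acquire semantics that read_seqcount_begin() and
read_seqcount_retry() provide:

#include <stdio.h>

static unsigned int seq = 2;		/* even: no write in flight */
static int cached_scores[4] = { 12, 34, 56, 78 };

static int read_score(unsigned short ipcc)
{
	unsigned int s;
	int score;

	do {
		while ((s = seq) & 1)	/* writer active, wait */
			;
		score = cached_scores[ipcc - 1]; /* classes start at 1 */
	} while (seq != s);		/* raced with a writer, retry */

	return score;
}

int main(void)
{
	printf("ipcc=2 score=%d\n", read_score(2));
	return 0;
}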



* [PATCH v4 16/24] thermal: intel: hfi: Define a default class for unclassified tasks
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (14 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 15/24] thermal: intel: hfi: Report the IPC class score of a CPU Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 17/24] thermal: intel: hfi: Enable the Intel Thread Director Ricardo Neri
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

A task may be unclassified if it has been recently created, has spent most
of its lifetime sleeping, or hardware has not provided a classification.

Most tasks will eventually be classified as the scheduler's IPC class 1
(HFI class 0). This class corresponds to the capabilities in the legacy,
classless, HFI table.

IPC class 1 is a reasonable choice until hardware provides an actual
classification. Meanwhile, the scheduler will place classes of tasks with
higher IPC scores on higher-performance CPUs.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Added Acked-by tag from Rafael.

Changes since v2:
 * None

Changes since v1:
 * Now the default class is 1.
---
 drivers/thermal/intel/intel_hfi.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index fec2c01fda1b..e23c49da02ee 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -182,6 +182,19 @@ static struct workqueue_struct *hfi_updates_wq;
 #define HFI_UPDATE_INTERVAL		HZ
 #define HFI_MAX_THERM_NOTIFY_COUNT	16
 
+/*
+ * A task may be unclassified if it has been recently created, has spent most
+ * of its lifetime sleeping, or hardware has not provided a classification.
+ *
+ * Most tasks will be classified as the scheduler's IPC class 1 (HFI class 0)
+ * eventually. Meanwhile, the scheduler will place classes of tasks with higher
+ * IPC scores on higher-performance CPUs.
+ *
+ * IPC class 1 is a reasonable choice. It matches the performance capability
+ * of the legacy, classless, HFI table.
+ */
+#define HFI_UNCLASSIFIED_DEFAULT 1
+
 /* A cache of the HFI perf capabilities for lockless access. */
 static int __percpu *hfi_ipcc_scores;
 /* Sequence counter for hfi_ipcc_scores */
@@ -212,7 +225,7 @@ unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
 	scores = per_cpu_ptr(hfi_ipcc_scores, cpu);
 
 	if (ipcc == IPC_CLASS_UNCLASSIFIED)
-		return -EINVAL;
+		ipcc = HFI_UNCLASSIFIED_DEFAULT;
 
 	/*
 	 * Scheduler IPC classes start at 1. HFI classes start at 0.
-- 
2.25.1
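
A userspace sketch of the fallback, with made-up per-class scores: an
unclassified task (class 0) is scored as if it belonged to IPC class 1,
i.e., HFI class 0, the legacy table:

#include <stdio.h>

#define IPC_CLASS_UNCLASSIFIED		0
#define HFI_UNCLASSIFIED_DEFAULT	1

static const int scores[4] = { 50, 60, 90, 95 };	/* per HFI class */

static int get_score(unsigned short ipcc)
{
	if (ipcc == IPC_CLASS_UNCLASSIFIED)
		ipcc = HFI_UNCLASSIFIED_DEFAULT;

	return scores[ipcc - 1];	/* scheduler class 1 == HFI class 0 */
}

int main(void)
{
	printf("unclassified -> %d, class 3 -> %d\n",
	       get_score(IPC_CLASS_UNCLASSIFIED), get_score(3));
	return 0;
}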



* [PATCH v4 17/24] thermal: intel: hfi: Enable the Intel Thread Director
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (15 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 16/24] thermal: intel: hfi: Define a default class for unclassified tasks Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 18/24] sched/task_struct: Add helpers for IPC classification Ricardo Neri
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

Enable Intel Thread Director from the CPU hotplug callback: enable the
feature globally from CPU0, then enable the thread-classification hardware
in each logical processor individually.

Also, initialize the number of classes supported.

Let the scheduler know that it can start using IPC classes.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> # intel_hfi.c
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Dropped the definition of MSR_IA32_HW_FEEDBACK_CHAR. It is now added in
   patch 13 to fix a build break during bisection.
   (Reported-by: kernel test robot <lkp@intel.com>).
 * Added Acked-by from Rafael.

Changes since v2:
 * Use the new sched_enable_ipc_classes() interface to enable the use of
   IPC classes in the scheduler.

Changes since v1:
 * None
---
 arch/x86/include/asm/msr-index.h  |  1 +
 drivers/thermal/intel/intel_hfi.c | 41 +++++++++++++++++++++++++++++--
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 0bc4ed0ff787..7823b87bf383 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1108,6 +1108,7 @@
 /* Hardware Feedback Interface */
 #define MSR_IA32_HW_FEEDBACK_PTR        0x17d0
 #define MSR_IA32_HW_FEEDBACK_CONFIG     0x17d1
+#define MSR_IA32_HW_FEEDBACK_THREAD_CONFIG 0x17d4
 #define MSR_IA32_HW_FEEDBACK_CHAR	0x17d2
 
 /* x2APIC locked status */
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index e23c49da02ee..75bf18dc7f51 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -33,6 +33,7 @@
 #include <linux/percpu-defs.h>
 #include <linux/printk.h>
 #include <linux/processor.h>
+#include <linux/sched/topology.h>
 #include <linux/seqlock.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
@@ -50,6 +51,8 @@
 /* Hardware Feedback Interface MSR configuration bits */
 #define HW_FEEDBACK_PTR_VALID_BIT		BIT(0)
 #define HW_FEEDBACK_CONFIG_HFI_ENABLE_BIT	BIT(0)
+#define HW_FEEDBACK_CONFIG_ITD_ENABLE_BIT	BIT(1)
+#define HW_FEEDBACK_THREAD_CONFIG_ENABLE_BIT	BIT(0)
 
 /* CPUID detection and enumeration definitions for HFI */
 
@@ -74,6 +77,15 @@ union cpuid6_edx {
 	u32 full;
 };
 
+union cpuid6_ecx {
+	struct {
+		u32	dont_care0:8;
+		u32	nr_classes:8;
+		u32	dont_care1:16;
+	} split;
+	u32 full;
+};
+
 union hfi_thread_feedback_char_msr {
 	struct {
 		u64	classid : 8;
@@ -541,6 +553,11 @@ void intel_hfi_online(unsigned int cpu)
 
 	init_hfi_cpu_index(info);
 
+	if (cpu_feature_enabled(X86_FEATURE_ITD)) {
+		msr_val = HW_FEEDBACK_THREAD_CONFIG_ENABLE_BIT;
+		wrmsrl(MSR_IA32_HW_FEEDBACK_THREAD_CONFIG, msr_val);
+	}
+
 	/*
 	 * Now check if the HFI instance of the package/die of @cpu has been
 	 * initialized (by checking its header). In such case, all we have to
@@ -596,8 +613,22 @@ void intel_hfi_online(unsigned int cpu)
 	 */
 	rdmsrl(MSR_IA32_HW_FEEDBACK_CONFIG, msr_val);
 	msr_val |= HW_FEEDBACK_CONFIG_HFI_ENABLE_BIT;
+
+	if (cpu_feature_enabled(X86_FEATURE_ITD))
+		msr_val |= HW_FEEDBACK_CONFIG_ITD_ENABLE_BIT;
+
 	wrmsrl(MSR_IA32_HW_FEEDBACK_CONFIG, msr_val);
 
+	/*
+	 * We have all we need to support IPC classes. Task classification is
+	 * now working.
+	 *
+	 * All class scores are zero until after the first HFI update. That is
+	 * OK. The scheduler queries these scores at every load balance.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_ITD))
+		sched_enable_ipc_classes();
+
 unlock:
 	mutex_unlock(&hfi_instance_lock);
 	return;
@@ -675,8 +706,14 @@ static __init int hfi_parse_features(void)
 	 */
 	hfi_features.class_stride = nr_capabilities;
 
-	/* For now, use only one class of the HFI table */
-	hfi_features.nr_classes = 1;
+	if (cpu_feature_enabled(X86_FEATURE_ITD)) {
+		union cpuid6_ecx ecx;
+
+		ecx.full = cpuid_ecx(CPUID_HFI_LEAF);
+		hfi_features.nr_classes = ecx.split.nr_classes;
+	} else {
+		hfi_features.nr_classes = 1;
+	}
 
 	/*
 	 * The header contains change indications for each supported feature.
-- 
2.25.1
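
The number of classes is enumerated in bits 15:8 of ECX of the HFI CPUID
leaf, as the cpuid6_ecx union above encodes. A userspace sketch of the
extraction, with a made-up register value:

#include <stdint.h>
#include <stdio.h>

union cpuid6_ecx {
	struct {
		uint32_t dont_care0 : 8;
		uint32_t nr_classes : 8;
		uint32_t dont_care1 : 16;
	} split;
	uint32_t full;
};

int main(void)
{
	union cpuid6_ecx ecx;

	ecx.full = 0x00000400;	/* as if returned by cpuid_ecx(CPUID_HFI_LEAF) */
	printf("nr_classes=%u\n", ecx.split.nr_classes);	/* 4 */
	return 0;
}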



* [PATCH v4 18/24] sched/task_struct: Add helpers for IPC classification
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (16 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 17/24] thermal: intel: hfi: Enable the Intel Thread Director Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-22 10:20   ` Ionela Voinescu
  2023-06-13  4:24 ` [PATCH v4 19/24] sched/core: Initialize helpers of task classification Ricardo Neri
                   ` (5 subsequent siblings)
  23 siblings, 1 reply; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

The raw classification that hardware provides for a task may not
be directly usable by the scheduler: the classification may change too
frequently or architecture-specific implementations may need to consider
additional factors. For instance, some processors with Intel Thread
Director need to consider the state of the SMT siblings of a core.

Provide per-task helper variables that architectures can use to
postprocess the classification that hardware provides.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * Used bit-fields to fit all the IPC class data in 4 bytes. (PeterZ)
 * Shortened names of the helpers.
 * Renamed helpers with the ipcc_ prefix.
 * Reworded commit message for clarity
---
 include/linux/sched.h | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0e1c38ad09c2..719147460ca8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1541,7 +1541,17 @@ struct task_struct {
 	 * A hardware-defined classification of task that reflects but is
 	 * not identical to the number of instructions per cycle.
 	 */
-	unsigned short			ipcc;
+	unsigned int			ipcc : 9;
+	/*
+	 * A candidate classification that arch-specific implementations
+	 * qualify for correctness.
+	 */
+	unsigned int			ipcc_tmp : 9;
+	/*
+	 * Counter to filter out transient candidate classifications
+	 * of a task.
+	 */
+	unsigned int			ipcc_cntr : 14;
 #endif
 
 	/*
-- 
2.25.1
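
The three bit-fields are sized to share a single 32-bit word
(9 + 9 + 14 = 32 bits). A quick userspace check of the packing on a
typical ABI:

#include <stdio.h>

struct ipcc_fields {
	unsigned int ipcc : 9;		/* up to 512 classes */
	unsigned int ipcc_tmp : 9;
	unsigned int ipcc_cntr : 14;	/* up to 16384 tick counts */
};

int main(void)
{
	printf("sizeof=%zu bytes\n", sizeof(struct ipcc_fields));	/* 4 */
	return 0;
}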



* [PATCH v4 19/24] sched/core: Initialize helpers of task classification
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (17 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 18/24] sched/task_struct: Add helpers for IPC classification Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 20/24] sched/fair: Introduce sched_smt_siblings_idle() Ricardo Neri
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

Just as tasks start life unclassified, initialize the classification
auxiliary variables.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * None
---
 kernel/sched/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 876396b1d077..9c28be596938 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4502,6 +4502,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.vruntime			= 0;
 #ifdef CONFIG_IPC_CLASSES
 	p->ipcc				= IPC_CLASS_UNCLASSIFIED;
+	p->ipcc_tmp			= IPC_CLASS_UNCLASSIFIED;
+	p->ipcc_cntr			= 0;
 #endif
 	INIT_LIST_HEAD(&p->se.group_node);
 
-- 
2.25.1



* [PATCH v4 20/24] sched/fair: Introduce sched_smt_siblings_idle()
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (18 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 19/24] sched/core: Initialize helpers of task classification Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 21/24] x86/sched/ipcc: Implement model-specific checks for task classification Ricardo Neri
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

X86 needs to know the idle state of the SMT siblings of a CPU to improve
the accuracy of IPCC classification. X86 implements support for IPC classes
in the thermal HFI driver.

Rename is_core_idle() to sched_smt_siblings_idle() and make it available
outside the scheduler code.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
is_core_idle() is no longer an inline function after this patch. To rule
out performance degradation, I compared the execution time of the inline
and non-inline versions on a 4-socket Cascade Lake system using the NUMA
stressor of stress-ng.

I used this test command:

        $ stress-ng --numa 1500 -t 10m

During the test, is_core_idle() was called ~200,000 times. To measure the
execution time, I recorded the value of the TSC counter before and after
calling is_core_idle() and calculated the difference.

I arbitrarily removed outliers (defined as any delta larger than 5000
counts). This required removing ~40 samples.

The table below summarizes the difference in execution time. All values
are expressed in TSC counts, except for the standard deviation, expressed
as a percentage of the average.

                              Average  Median  Std(%) Mode
        TSCdelta inline        668.76     626   67.24   42
        TSCdelta non-inline    677.64     624   67.67   46

The metrics show that both the inline and non-inline versions exhibit
similar performance characteristics.
---
Changes since v3:
 * None

Changes since v2:
 * Brought back this previously dropped patch.
 * Profiled inline vs non-inline is_core_idle(). I found no major penalty.
 * Merged is_core_idle() and sched_smt_siblings_idle() into a single
   function. (Dietmar)

Changes since v1:
 * Dropped this patch.
---
 include/linux/sched.h |  2 ++
 kernel/sched/fair.c   | 13 ++++++++++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 719147460ca8..986344ebf2f6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2461,4 +2461,6 @@ static inline void sched_core_fork(struct task_struct *p) { }
 
 extern void sched_set_stop_task(int cpu, struct task_struct *stop);
 
+extern bool sched_smt_siblings_idle(int cpu);
+
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da3e009eef42..a52589a6c561 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1064,7 +1064,14 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */
 
-static inline bool is_core_idle(int cpu)
+/**
+ * sched_smt_siblings_idle - Check whether SMT siblings of a CPU are idle
+ * @cpu:	The CPU to check
+ *
+ * Returns true if all the SMT siblings of @cpu are idle or @cpu does not have
+ * SMT siblings. The idle state of @cpu is not considered.
+ */
+bool sched_smt_siblings_idle(int cpu)
 {
 #ifdef CONFIG_SCHED_SMT
 	int sibling;
@@ -1767,7 +1774,7 @@ static inline int numa_idle_core(int idle_core, int cpu)
 	 * Prefer cores instead of packing HT siblings
 	 * and triggering future load balancing.
 	 */
-	if (is_core_idle(cpu))
+	if (sched_smt_siblings_idle(cpu))
 		idle_core = cpu;
 
 	return idle_core;
@@ -9652,7 +9659,7 @@ static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
 	if (!sched_smt_active())
 		return true;
 
-	return sd->flags & SD_SHARE_CPUCAPACITY || is_core_idle(cpu);
+	return sd->flags & SD_SHARE_CPUCAPACITY || sched_smt_siblings_idle(cpu);
 }
 
 /**
-- 
2.25.1
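
The TSC-delta measurement described in the notes can be approximated in
userspace with the compiler's TSC intrinsic. The function below is a
stand-in, not the scheduler's is_core_idle(), and single-call deltas are
noisy (which is why outliers were dropped above):

#include <stdio.h>
#include <x86intrin.h>

static int dummy_is_core_idle(int cpu)
{
	return cpu & 1;			/* stand-in for the real check */
}

int main(void)
{
	unsigned long long t0, t1;
	volatile int r;

	t0 = __rdtsc();
	r = dummy_is_core_idle(3);
	t1 = __rdtsc();

	printf("delta=%llu TSC counts (r=%d)\n", t1 - t0, r);
	return 0;
}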



* [PATCH v4 21/24] x86/sched/ipcc: Implement model-specific checks for task classification
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (19 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 20/24] sched/fair: Introduce sched_smt_siblings_idle() Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 22/24] x86/cpufeatures: Add feature bit for HRESET Ricardo Neri
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

In Alder Lake and Raptor Lake, the result of thread classification is more
accurate when only one SMT sibling is busy. Classification results for
classes 2 and 3 are always reliable.

Changing the classification of a task too frequently may lead to
unnecessary migrations.

Only update the class of a task if it is considered accurate and has been
constant during four consecutive user ticks.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Relocated this code to arch/x86/kernel/sched_ipcc.c (Rafael)

Changes since v2:
 * None

Changes since v1:
 * Adjusted the result of the classification of Intel Thread Director to start
   at class 1. Class 0 for the scheduler means that the task is
   unclassified.
 * Used the new names of the IPC classes members in task_struct.
 * Reworked helper functions to use sched_smt_siblings_idle() to query
   the idle state of the SMT siblings of a CPU.
---
 arch/x86/kernel/sched_ipcc.c | 60 +++++++++++++++++++++++++++++++++++-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/sched_ipcc.c b/arch/x86/kernel/sched_ipcc.c
index 685e7b3b5375..dd73fc8be49b 100644
--- a/arch/x86/kernel/sched_ipcc.c
+++ b/arch/x86/kernel/sched_ipcc.c
@@ -18,11 +18,67 @@
 
 #include <linux/sched.h>
 
+#include <asm/intel-family.h>
 #include <asm/topology.h>
 
+#define CLASS_DEBOUNCER_SKIPS 4
+
+/**
+ * debounce_and_update_class() - Process and update a task's classification
+ *
+ * @p:		The task of which the classification will be updated
+ * @new_ipcc:	The new IPC classification
+ *
+ * Update the classification of @p with the new value that hardware provides.
+ * Only update the classification of @p if it has been the same during
+ * CLASS_DEBOUNCER_SKIPS consecutive ticks.
+ */
+static void debounce_and_update_class(struct task_struct *p, u8 new_ipcc)
+{
+	u16 debounce_skip;
+
+	/* The class of @p changed. Only restart the debounce counter. */
+	if (p->ipcc_tmp != new_ipcc) {
+		p->ipcc_cntr = 1;
+		goto out;
+	}
+
+	/*
+	 * The class of @p did not change. Update it if it has been the same
+	 * for CLASS_DEBOUNCER_SKIPS user ticks.
+	 */
+	debounce_skip = p->ipcc_cntr + 1;
+	if (debounce_skip < CLASS_DEBOUNCER_SKIPS)
+		p->ipcc_cntr++;
+	else
+		p->ipcc = new_ipcc;
+
+out:
+	p->ipcc_tmp = new_ipcc;
+}
+
+static bool classification_is_accurate(u8 hfi_class, bool smt_siblings_idle)
+{
+	switch (boot_cpu_data.x86_model) {
+	case INTEL_FAM6_ALDERLAKE:
+	case INTEL_FAM6_ALDERLAKE_L:
+	case INTEL_FAM6_RAPTORLAKE:
+	case INTEL_FAM6_RAPTORLAKE_P:
+	case INTEL_FAM6_RAPTORLAKE_S:
+		if (hfi_class == 3 || hfi_class == 2 || smt_siblings_idle)
+			return true;
+
+		return false;
+
+	default:
+		return false;
+	}
+}
+
 void intel_update_ipcc(struct task_struct *curr)
 {
 	u8 hfi_class;
+	bool idle;
 
 	if (intel_hfi_read_classid(&hfi_class))
 		return;
@@ -31,5 +87,7 @@ void intel_update_ipcc(struct task_struct *curr)
 	 * 0 is a valid classification for Intel Thread Director. A scheduler
 	 * IPCC class of 0 means that the task is unclassified. Adjust.
 	 */
-	curr->ipcc = hfi_class + 1;
+	idle = sched_smt_siblings_idle(task_cpu(curr));
+	if (classification_is_accurate(hfi_class, idle))
+		debounce_and_update_class(curr, hfi_class + 1);
 }
-- 
2.25.1
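
The debouncing can be exercised in userspace with a made-up sequence of
per-tick classifications; the helper mirrors debounce_and_update_class()
above:

#include <stdio.h>

#define CLASS_DEBOUNCER_SKIPS 4

struct task {
	unsigned int ipcc;
	unsigned int ipcc_tmp;
	unsigned int ipcc_cntr;
};

static void debounce_and_update_class(struct task *p, unsigned char new_ipcc)
{
	if (p->ipcc_tmp != new_ipcc)
		p->ipcc_cntr = 1;	/* class changed: restart counting */
	else if (p->ipcc_cntr + 1 < CLASS_DEBOUNCER_SKIPS)
		p->ipcc_cntr++;
	else
		p->ipcc = new_ipcc;	/* stable for four ticks: commit */

	p->ipcc_tmp = new_ipcc;
}

int main(void)
{
	struct task p = { .ipcc = 1, .ipcc_tmp = 1, .ipcc_cntr = 0 };
	unsigned char ticks[] = { 3, 3, 2, 3, 3, 3, 3 };
	unsigned int i;

	/* Class 3 is committed only at the last tick, after four in a row. */
	for (i = 0; i < sizeof(ticks); i++) {
		debounce_and_update_class(&p, ticks[i]);
		printf("tick %u: new=%u ipcc=%u\n", i, ticks[i], p.ipcc);
	}
	return 0;
}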



* [PATCH v4 22/24] x86/cpufeatures: Add feature bit for HRESET
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (20 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 21/24] x86/sched/ipcc: Implement model-specific checks for task classification Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 23/24] x86/hreset: Configure history reset Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 24/24] x86/process: Reset hardware history in context switch Ricardo Neri
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

The HRESET instruction isolates the classification of individual
tasks when they run sequentially on the same logical processor. It resets
the classification history that the logical processor maintains.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * Moved definition of HRESET to its correct leaf: CPUID_7_1_EAX (Zhao)

Changes since v2:
 * None

Changes since v1:
 * None
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/include/asm/msr-index.h   | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 98a84cbf4261..5edb63af2996 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -319,6 +319,7 @@
 #define X86_FEATURE_FSRC		(12*32+12) /* "" Fast short REP {CMPSB,SCASB} */
 #define X86_FEATURE_LKGS		(12*32+18) /* "" Load "kernel" (userspace) GS */
 #define X86_FEATURE_AMX_FP16		(12*32+21) /* "" AMX fp16 Support */
+#define X86_FEATURE_HRESET		(12*32+22) /* Hardware history reset instruction */
 #define X86_FEATURE_AVX_IFMA            (12*32+23) /* "" Support for VPMADD52[H,L]UQ */
 #define X86_FEATURE_LAM			(12*32+26) /* Linear Address Masking */
 
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 7823b87bf383..605c01539b0d 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1111,6 +1111,9 @@
 #define MSR_IA32_HW_FEEDBACK_THREAD_CONFIG 0x17d4
 #define MSR_IA32_HW_FEEDBACK_CHAR	0x17d2
 
+/* Hardware History Reset */
+#define MSR_IA32_HW_HRESET_ENABLE	0x17da
+
 /* x2APIC locked status */
 #define MSR_IA32_XAPIC_DISABLE_STATUS	0xBD
 #define LEGACY_XAPIC_DISABLED		BIT(0) /*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread
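
A note on enumeration: the new bit corresponds to CPUID.(EAX=7,ECX=1):EAX[22],
the CPUID_7_1_EAX leaf mentioned in the changelog. A minimal user-space probe
(a hypothetical sketch, assuming GCC or Clang on x86):

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* HRESET support is CPUID.(EAX=7,ECX=1):EAX[22]. */
		if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx)) {
			puts("CPUID leaf 7 not supported");
			return 1;
		}

		printf("HRESET %ssupported\n", (eax & (1U << 22)) ? "" : "not ");
		return 0;
	}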

* [PATCH v4 23/24] x86/hreset: Configure history reset
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (21 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 22/24] x86/cpufeatures: Add feature bit for HRESET Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  2023-06-13  4:24 ` [PATCH v4 24/24] x86/process: Reset hardware history in context switch Ricardo Neri
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

Configure the MSR that controls the behavior of HRESET on each logical
processor.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * Marked hardware_history_features as __ro_after_init instead of
   __read_mostly. (PeterZ)
---
 arch/x86/kernel/cpu/common.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8f284e185aea..d47a442900ad 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -396,6 +396,26 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
 	cr4_clear_bits(X86_CR4_UMIP);
 }
 
+static u32 hardware_history_features __ro_after_init;
+
+static __always_inline void setup_hreset(struct cpuinfo_x86 *c)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_HRESET))
+		return;
+
+	/*
+	 * On all CPUs, enable only the hardware history features that
+	 * the boot CPU supports.
+	 */
+	if (c == &boot_cpu_data)
+		hardware_history_features = cpuid_ebx(0x20);
+
+	if (!hardware_history_features)
+		return;
+
+	wrmsrl(MSR_IA32_HW_HRESET_ENABLE, hardware_history_features);
+}
+
 /* These bits should not change their value after CPU init is finished. */
 static const unsigned long cr4_pinned_mask =
 	X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
@@ -1834,10 +1854,11 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 	/* Disable the PN if appropriate */
 	squash_the_stupid_serial_number(c);
 
-	/* Set up SMEP/SMAP/UMIP */
+	/* Set up SMEP/SMAP/UMIP/HRESET */
 	setup_smep(c);
 	setup_smap(c);
 	setup_umip(c);
+	setup_hreset(c);
 
 	/* Enable FSGSBASE instructions if available. */
 	if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v4 24/24] x86/process: Reset hardware history in context switch
  2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
                   ` (22 preceding siblings ...)
  2023-06-13  4:24 ` [PATCH v4 23/24] x86/hreset: Configure history reset Ricardo Neri
@ 2023-06-13  4:24 ` Ricardo Neri
  23 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-13  4:24 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Ricardo Neri, Tim C . Chen, Zhao Liu

Reset the hardware classification history when switching to the next task.
This way, hardware starts the classification of the next task from scratch,
without carrying over history from the previous task.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Perry Yuan <Perry.Yuan@amd.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Zhao Liu <zhao1.liu@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * Measurements of the cost of the HRESET instruction (a minimal sketch of
   the measurement loop is included after this list)

   Methodology:
   I created a tight loop with interrupts and preemption disabled. I
   recorded the value of the TSC counter before and after executing
   HRESET or RDTSC. I repeated the measurement 100,000 times.
   I performed the experiment using an Alder Lake S system. I set the
   frequency of the CPUs at a fixed value.

   The table below compares the cost of HRESET with RDTSC (expressed in
   the elapsed TSC count). The cost of the two instructions is
   comparable.

                              PCore      ECore
        Frequency (GHz)        5.0        3.8
        HRESET (avg)          28.5       44.7
        HRESET (stdev %)       3.6        2.3
        RDTSC  (avg)          25.2       35.7
        RDTSC  (stdev %)       3.9        2.6

 * Used an ALTERNATIVE macro instead of static_cpu_has() to execute HRESET
   when supported. (PeterZ)
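
   A minimal sketch of such a measurement loop (hypothetical; not the
   exact harness used to produce the numbers above) could look like:

	/*
	 * Hypothetical sketch: time HRESET with the TSC while interrupts
	 * and preemption are disabled. Serializing the pipeline around
	 * the TSC reads and collecting statistics are omitted for brevity.
	 */
	static u64 measure_hreset_cycles(unsigned int iterations)
	{
		u64 start, end, total = 0;
		unsigned long flags;
		unsigned int i;

		preempt_disable();
		local_irq_save(flags);

		for (i = 0; i < iterations; i++) {
			start = rdtsc();
			reset_hardware_history();
			end = rdtsc();
			total += end - start;
		}

		local_irq_restore(flags);
		preempt_enable();

		return total / iterations;
	}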
---
 arch/x86/include/asm/hreset.h | 31 +++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/common.c  |  7 +++++++
 arch/x86/kernel/process_32.c  |  3 +++
 arch/x86/kernel/process_64.c  |  3 +++
 4 files changed, 44 insertions(+)
 create mode 100644 arch/x86/include/asm/hreset.h

diff --git a/arch/x86/include/asm/hreset.h b/arch/x86/include/asm/hreset.h
new file mode 100644
index 000000000000..d68ca2fb8642
--- /dev/null
+++ b/arch/x86/include/asm/hreset.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_HRESET_H
+#define _ASM_X86_HRESET_H
+
+/**
+ * HRESET - History reset. Available since binutils v2.36.
+ *
+ * Request the processor to reset the history of task classification on the
+ * current logical processor. The history components to be
+ * reset are specified in %eax. Only bits specified in CPUID(0x20).EBX
+ * and enabled in the IA32_HRESET_ENABLE MSR can be selected.
+ *
+ * The assembly code looks like:
+ *
+ *	hreset %eax
+ *
+ * The corresponding machine code looks like:
+ *
+ *	F3 0F 3A F0 ModRM Imm
+ *
+ * The value of ModRM is 0xc0 to specify %eax register addressing.
+ * The ignored immediate operand is set to 0.
+ *
+ * The instruction is documented in the Intel SDM.
+ */
+
+#define __ASM_HRESET  ".byte 0xf3, 0x0f, 0x3a, 0xf0, 0xc0, 0x00"
+
+void reset_hardware_history(void);
+
+#endif /* _ASM_X86_HRESET_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d47a442900ad..8cdd88dab7ed 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -53,6 +53,7 @@
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/cacheinfo.h>
+#include <asm/hreset.h>
 #include <asm/memtype.h>
 #include <asm/microcode.h>
 #include <asm/microcode_intel.h>
@@ -398,6 +399,12 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
 
 static u32 hardware_history_features __ro_after_init;
 
+void reset_hardware_history(void)
+{
+	asm_inline volatile (ALTERNATIVE("", __ASM_HRESET, X86_FEATURE_HRESET)
+			     : : "a" (hardware_history_features) : "memory");
+}
+
 static __always_inline void setup_hreset(struct cpuinfo_x86 *c)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_HRESET))
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 708c87b88cc1..7353bb119e79 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -52,6 +52,7 @@
 #include <asm/switch_to.h>
 #include <asm/vm86.h>
 #include <asm/resctrl.h>
+#include <asm/hreset.h>
 #include <asm/proto.h>
 
 #include "process.h"
@@ -214,6 +215,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	/* Load the Intel cache allocation PQR MSR. */
 	resctrl_sched_in(next_p);
 
+	reset_hardware_history();
+
 	return prev_p;
 }
 
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d181c16a2f6..fcf14cba2a9d 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -54,6 +54,7 @@
 #include <asm/xen/hypervisor.h>
 #include <asm/vdso.h>
 #include <asm/resctrl.h>
+#include <asm/hreset.h>
 #include <asm/unistd.h>
 #include <asm/fsgsbase.h>
 #ifdef CONFIG_IA32_EMULATION
@@ -659,6 +660,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	/* Load the Intel cache allocation PQR MSR. */
 	resctrl_sched_in(next_p);
 
+	reset_hardware_history();
+
 	return prev_p;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 06/24] sched/fair: Collect load-balancing stats for IPC classes
  2023-06-13  4:24 ` [PATCH v4 06/24] sched/fair: Collect load-balancing stats for IPC classes Ricardo Neri
@ 2023-06-14  0:29   ` Ricardo Neri
  2023-06-22  9:01   ` Ionela Voinescu
  1 sibling, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-14  0:29 UTC (permalink / raw)
  To: Peter Zijlstra (Intel), Juri Lelli, Vincent Guittot
  Cc: Ricardo Neri, Ravi V. Shankar, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Len Brown,
	Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

On Mon, Jun 12, 2023 at 09:24:04PM -0700, Ricardo Neri wrote:
> +static int rq_last_task_ipcc(int dst_cpu, struct rq *rq, unsigned short *ipcc)
> +{
> +	struct list_head *tasks = &rq->cfs_tasks;
> +	struct task_struct *p;
> +	struct rq_flags rf;
> +	int ret = -EINVAL;
> +
> +	rq_lock_irqsave(rq, &rf);
> +	if (list_empty(tasks))
> +		goto out;
> +
> +	p = list_last_entry(tasks, struct task_struct, se.group_node);
> +	if (p->flags & PF_EXITING || is_idle_task(p) ||
> +	    !cpumask_test_cpu(dst_cpu, p->cpus_ptr))
> +		goto out;
> +
> +	ret = 0;
> +	*ipcc = p->ipcc;
> +out:
> +	rq_unlock(rq, &rf);

This should be rq_unlock_irqrestore(). I will correct it in the next version.
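
That is, the exit path would read:

	out:
		rq_unlock_irqrestore(rq, &rf);
		return ret;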

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 06/24] sched/fair: Collect load-balancing stats for IPC classes
  2023-06-13  4:24 ` [PATCH v4 06/24] sched/fair: Collect load-balancing stats for IPC classes Ricardo Neri
  2023-06-14  0:29   ` Ricardo Neri
@ 2023-06-22  9:01   ` Ionela Voinescu
  2023-06-24  0:01     ` Ricardo Neri
  1 sibling, 1 reply; 43+ messages in thread
From: Ionela Voinescu @ 2023-06-22  9:01 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Zhao Liu, Yuan, Perry, x86, Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

Hi Ricardo,

On Monday 12 Jun 2023 at 21:24:04 (-0700), Ricardo Neri wrote:
> When selecting the busiest scheduling group between two otherwise identical
> groups of types asym_packing or fully_busy, IPC classes can be used to
> break the tie.
> 
> Compute the IPC class performance score for a scheduling group. It is
> defined as the sum of the IPC scores of the tasks at the back of each
> runqueue in the group. Load balancing starts by pulling tasks from the back
> of the runqueue first, making this tiebreaker more useful.
> 
> Also, track the IPC class with the lowest score in the scheduling group. A
> task of this class will be pulled when the destination CPU has lower
> priority than the fully_busy busiest group.
> 
> These two metrics will be used during idle load balancing to compute the
> current and the potential IPC class score of a scheduling group in a
> subsequent changeset.
> 
> Cc: Ben Segall <bsegall@google.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Lukasz Luba <lukasz.luba@arm.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Perry Yuan <Perry.Yuan@amd.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Tim C. Chen <tim.c.chen@intel.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Zhao Liu <zhao1.liu@linux.intel.com>
> Cc: x86@kernel.org
> Cc: linux-pm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v3:
>  * Do not compute the IPCC stats using the current tasks of runqueues.
>    Instead, use the tasks at the back of the queue. These are the tasks
>    that will be pulled first during load balance. (Vincent)
> 
> Changes since v2:
>  * Also excluded deadline and realtime tasks from IPCC stats. (Dietmar)
>  * Also excluded tasks that cannot run on the destination CPU from the
>    IPCC stats.
>  * Folded struct sg_lb_ipcc_stats into struct sg_lb_stats. (Dietmar)
>  * Reworded description sg_lb_stats::min_ipcc. (Ionela)
>  * Handle errors of arch_get_ipcc_score(). (Ionela)
> 
> Changes since v1:
>  * Implemented cleanups and reworks from PeterZ. Thanks!
>  * Used the new interface names.
> ---
>  kernel/sched/fair.c | 79 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 79 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6189d1a45635..c0cab5e501b6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9110,6 +9110,11 @@ struct sg_lb_stats {
>  	unsigned int nr_numa_running;
>  	unsigned int nr_preferred_running;
>  #endif
> +#ifdef CONFIG_IPC_CLASSES
> +	unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
                                              ^^^^^^^^^^^^^^
> +	unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
> +	unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
                                              ^^^^^^^^^^^^^^
These no longer apply. It might be easier to describe them all in a
comment just above their declaration. Something like:

"The sum, min and its class of the IPC scores of the tasks at the back of each
runqueue in the group."

> +#endif
>  };
>  
>  /*
> @@ -9387,6 +9392,77 @@ group_type group_classify(unsigned int imbalance_pct,
>  	return group_has_spare;
>  }
>  
> +#ifdef CONFIG_IPC_CLASSES
> +static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
> +{
> +	/* All IPCC stats have been set to zero in update_sg_lb_stats(). */
> +	sgs->min_score = ULONG_MAX;
> +}
> +
> +static int rq_last_task_ipcc(int dst_cpu, struct rq *rq, unsigned short *ipcc)
> +{
> +	struct list_head *tasks = &rq->cfs_tasks;
> +	struct task_struct *p;
> +	struct rq_flags rf;
> +	int ret = -EINVAL;
> +

It's more typical for ret to be initialised to 0 and changed to an error
value when there's an error case.

> +	rq_lock_irqsave(rq, &rf);
> +	if (list_empty(tasks))
> +		goto out;
> +
> +	p = list_last_entry(tasks, struct task_struct, se.group_node);
> +	if (p->flags & PF_EXITING || is_idle_task(p) ||
> +	    !cpumask_test_cpu(dst_cpu, p->cpus_ptr))
> +		goto out;
> +
> +	ret = 0;
> +	*ipcc = p->ipcc;
> +out:
> +	rq_unlock(rq, &rf);
> +	return ret;
> +}
> +
> +/* Called only if cpu_of(@rq) is not idle and has tasks running. */
> +static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> +				    struct rq *rq)
> +{
> +	unsigned short ipcc;
> +	unsigned long score;
> +
> +	if (!sched_ipcc_enabled())
> +		return;
> +
> +	if (rq_last_task_ipcc(dst_cpu, rq, &ipcc))
> +		return;
> +
> +	score = arch_get_ipcc_score(ipcc, cpu_of(rq));
> +
> +	/*
> +	 * Ignore tasks with invalid scores. When finding the busiest group, we
> +	 * prefer those with higher sum_score. This group will not be selected.
> +	 */

nit: the comment is unnecessary, and a bit misleading, IMO.

The comment says "This group will not be selected." but the only way to
guarantee that here is to reset the sum_score to 0 when you find an
invalid score, which I don't believe is your intention.

Also the use of sum_score is captured in later functions, so I don't
believe there's a need for additional comments here.

Hope it helps,
Ionela.

> +	if (IS_ERR_VALUE(score))
> +		return;
> +
> +	sgs->sum_score += score;
> +
> +	if (score < sgs->min_score) {
> +		sgs->min_score = score;
> +		sgs->min_ipcc = ipcc;
> +	}
> +}
> +
> +#else /* CONFIG_IPC_CLASSES */
> +static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> +				    struct rq *rq)
> +{
> +}
> +
> +static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
> +{
> +}
> +#endif /* CONFIG_IPC_CLASSES */
> +
>  /**
>   * sched_use_asym_prio - Check whether asym_packing priority must be used
>   * @sd:		The scheduling domain of the load balancing
> @@ -9477,6 +9553,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  	int i, nr_running, local_group;
>  
>  	memset(sgs, 0, sizeof(*sgs));
> +	init_rq_ipcc_stats(sgs);
>  
>  	local_group = group == sds->local;
>  
> @@ -9526,6 +9603,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  			if (sgs->group_misfit_task_load < load)
>  				sgs->group_misfit_task_load = load;
>  		}
> +
> +		update_sg_lb_ipcc_stats(env->dst_cpu, sgs, rq);
>  	}
>  
>  	sgs->group_capacity = group->sgc->capacity;
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 07/24] sched/fair: Compute IPC class scores for load balancing
  2023-06-13  4:24 ` [PATCH v4 07/24] sched/fair: Compute IPC class scores for load balancing Ricardo Neri
@ 2023-06-22  9:02   ` Ionela Voinescu
  2023-06-25 20:11     ` Ricardo Neri
  0 siblings, 1 reply; 43+ messages in thread
From: Ionela Voinescu @ 2023-06-22  9:02 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Zhao Liu, Yuan, Perry, x86, Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

On Monday 12 Jun 2023 at 21:24:05 (-0700), Ricardo Neri wrote:
> When using IPCC scores to break ties between two scheduling groups, it is
> necessary to consider both the current score and the score that would
> result after load balancing.
> 
> Compute the combined IPC class score of a scheduling group and the local
> scheduling group. Compute both the current score and the prospective score.
> 
> Collect IPCC statistics only for asym_packing and fully_busy scheduling
> groups. These are the only cases that use IPCC scores.
> 
> These IPCC statistics are used during idle load balancing. The candidate
> scheduling group will have one fewer busy CPU after load balancing. This
> observation is important for cores with SMT support.
> 
> The IPCC score of scheduling groups composed of SMT siblings needs to
> consider that the siblings share CPU resources. When computing the total
> IPCC score of the scheduling group, divide the score of each sibling by
> the number of busy siblings.
> 
> Cc: Ben Segall <bsegall@google.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Lukasz Luba <lukasz.luba@arm.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Perry Yuan <Perry.Yuan@amd.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Tim C. Chen <tim.c.chen@intel.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Zhao Liu <zhao1.liu@linux.intel.com>
> Cc: x86@kernel.org
> Cc: linux-pm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v3:
>  * None
> 
> Changes since v2:
>  * Also collect IPCC stats for fully_busy sched groups.
>  * Restrict use of IPCC stats to SD_ASYM_PACKING. (Ionela)
>  * Handle errors of arch_get_ipcc_score(). (Ionela)
> 
> Changes since v1:
>  * Implemented cleanups and reworks from PeterZ. I took all his
>    suggestions, except the computation of the  IPC score before and after
>    load balancing. We are computing not the average score, but the *total*.
>  * Check for the SD_SHARE_CPUCAPACITY to compute the throughput of the SMT
>    siblings of a physical core.
>  * Used the new interface names.
>  * Reworded commit message for clarity.
> ---
>  kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c0cab5e501b6..a51c65c9335f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9114,6 +9114,8 @@ struct sg_lb_stats {
>  	unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
>  	unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
>  	unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
> +	long ipcc_score_after; /* Prospective IPCC score after load balancing */
> +	unsigned long ipcc_score_before; /* IPCC score before load balancing */
>  #endif
>  };
>  
> @@ -9452,6 +9454,62 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
>  	}
>  }
>  
> +static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
> +				      struct sched_group *sg,
> +				      struct lb_env *env)
> +{
> +	unsigned long score_on_dst_cpu, before;
> +	int busy_cpus;
> +	long after;
> +
> +	if (!sched_ipcc_enabled())
> +		return;
> +
> +	/*
> +	 * IPCC scores are only useful during idle load balancing. For now,
> +	 * only asym_packing uses IPCC scores.
> +	 */
> +	if (!(env->sd->flags & SD_ASYM_PACKING) ||
> +	    env->idle == CPU_NOT_IDLE)
> +		return;
> +
> +	/*
> +	 * IPCC scores are used to break ties only between these types of
> +	 * groups.
> +	 */
> +	if (sgs->group_type != group_fully_busy &&
> +	    sgs->group_type != group_asym_packing)
> +		return;
> +
> +	busy_cpus = sgs->group_weight - sgs->idle_cpus;
> +
> +	/* No busy CPUs in the group. No tasks to move. */
> +	if (!busy_cpus)
> +		return;
> +
> +	score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
> +
> +	/*
> +	 * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
> +	 * and not used.
> +	 */
> +	if (IS_ERR_VALUE(score_on_dst_cpu))
> +		return;
> +
> +	before = sgs->sum_score;
> +	after = before - sgs->min_score;

I don't believe this can end up being negative as the sum of all
scores should be greater than or equal to the min score, right?

I'm just wondering if ipcc_score_after can be made unsigned long as well,
just for consistency.

> +
> +	/* SMT siblings share throughput. */
> +	if (busy_cpus > 1 && sg->flags & SD_SHARE_CPUCAPACITY) {
> +		before /= busy_cpus;
> +		/* One sibling will become idle after load balance. */
> +		after /= busy_cpus - 1;
> +	}
> +
> +	sgs->ipcc_score_after = after + score_on_dst_cpu;
> +	sgs->ipcc_score_before = before;

Shouldn't the score_on_dst_cpu be added to "after" before being divided
between the SMT siblings?

Thanks,
Ionela.

> +}
> +
>  #else /* CONFIG_IPC_CLASSES */
>  static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
>  				    struct rq *rq)
> @@ -9461,6 +9519,13 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
>  static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
>  {
>  }
> +
> +static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
> +				      struct sched_group *sg,
> +				      struct lb_env *env)
> +{
> +}
> +
>  #endif /* CONFIG_IPC_CLASSES */
>  
>  /**
> @@ -9620,6 +9685,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  
>  	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>  
> +	if (!local_group)
> +		update_sg_lb_stats_scores(sgs, group, env);
> +
>  	/* Computing avg_load makes sense only when group is overloaded */
>  	if (sgs->group_type == group_overloaded)
>  		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 10/24] sched/fair: Use IPCC scores to select a busiest runqueue
  2023-06-13  4:24 ` [PATCH v4 10/24] sched/fair: Use IPCC scores to select a busiest runqueue Ricardo Neri
@ 2023-06-22  9:03   ` Ionela Voinescu
  2023-06-24  0:25     ` Ricardo Neri
  0 siblings, 1 reply; 43+ messages in thread
From: Ionela Voinescu @ 2023-06-22  9:03 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Zhao Liu, Yuan, Perry, x86, Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

On Monday 12 Jun 2023 at 21:24:08 (-0700), Ricardo Neri wrote:
> Use IPCC scores to break a tie between two runqueues with the same priority
> and number of running tasks: select the runqueue whose last-enqueued task
> would get a higher IPC boost when migrated to the destination CPU.
> (These tasks are migrated first during load balance.)
> 
> For now, restrict the utilization of IPCC scores to scheduling domains
> marked with the SD_ASYM_PACKING flag.
> 
> Cc: Ben Segall <bsegall@google.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Lukasz Luba <lukasz.luba@arm.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Perry Yuan <Perry.Yuan@amd.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Tim C. Chen <tim.c.chen@intel.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Zhao Liu <zhao1.liu@linux.intel.com>
> Cc: x86@kernel.org
> Cc: linux-pm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v3:
>  * Do not compute the IPCC stats using the current tasks of runqueues.
>    Instead, use the tasks at the back of the queue. These are the tasks
>    that will be pulled first during load balance. (Vincent)
> 
> Changes since v2:
>  * Only use IPCC scores to break ties if the sched domain uses
>    asym_packing. (Ionela)
>  * Handle errors of arch_get_ipcc_score(). (Ionela)
> 
> Changes since v1:
>  * Fixed a bug when selecting a busiest runqueue: when comparing two
>    runqueues with equal nr_running, we must compute the IPCC score delta
>    of both.
>  * Renamed local variables to improve the layout of the code block.
>    (PeterZ)
>  * Used the new interface names.
> ---
>  kernel/sched/fair.c | 61 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 61 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fcec791ede4f..da3e009eef42 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9564,6 +9564,41 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
>  	return sched_asym_ipcc_prefer(a_stats, b_stats);
>  }
>  
> +/**
> + * ipcc_score_delta - Get the IPCC score delta wrt the load balance's dst_cpu
> + * @rq:		A runqueue
> + * @env:	Load balancing environment
> + *
> + * Returns: The IPCC score delta that the last task enqueued in @rq would get
> + * if placed in the destination CPU of @env. LONG_MIN to indicate that the
> + * delta should not be used.
> + */
> +static long ipcc_score_delta(struct rq *rq, struct lb_env *env)
> +{
> +	unsigned long score_src, score_dst;
> +	unsigned short ipcc;
> +
> +	if (!sched_ipcc_enabled())
> +		return LONG_MIN;
> +
> +	/* Only asym_packing uses IPCC scores at the moment. */
> +	if (!(env->sd->flags & SD_ASYM_PACKING))
> +		return LONG_MIN;
> +
> +	if (rq_last_task_ipcc(env->dst_cpu, rq, &ipcc))
> +		return LONG_MIN;
> +
> +	score_dst = arch_get_ipcc_score(ipcc, env->dst_cpu);
> +	if (IS_ERR_VALUE(score_dst))
> +		return LONG_MIN;
> +
> +	score_src = arch_get_ipcc_score(ipcc, cpu_of(rq));
> +	if (IS_ERR_VALUE(score_src))
> +		return LONG_MIN;
> +
> +	return score_dst - score_src;
> +}
> +
>  #else /* CONFIG_IPC_CLASSES */
>  static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
>  				    struct rq *rq)
> @@ -9594,6 +9629,11 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
>  	return false;
>  }
>  
> +static long ipcc_score_delta(struct rq *rq, struct lb_env *env)
> +{
> +	return LONG_MIN;
> +}
> +
>  #endif /* CONFIG_IPC_CLASSES */
>  
>  /**
> @@ -10769,6 +10809,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  {
>  	struct rq *busiest = NULL, *rq;
>  	unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
> +	long busiest_ipcc_delta = LONG_MIN;
>  	unsigned int busiest_nr = 0;
>  	int i;
>  
> @@ -10885,6 +10926,26 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  			if (busiest_nr < nr_running) {
>  				busiest_nr = nr_running;
>  				busiest = rq;
> +
> +				/*
> +				 * Remember the IPCC score of the busiest
> +				 * runqueue. We may need it to break a tie with
> +				 * other queues with equal nr_running.
> +				 */
> +				busiest_ipcc_delta = ipcc_score_delta(busiest, env);
> +			/*
> +			 * For ties, select @rq if doing so would give its last
> +			 * queued task a bigger IPC boost when migrated to
> +			 * dst_cpu.
> +			 */
> +			} else if (busiest_nr == nr_running) {
> +				long delta = ipcc_score_delta(rq, env);
> +
> +				if (busiest_ipcc_delta < delta) {
> +					busiest_ipcc_delta = delta;
> +					busiest_nr = nr_running;

nit: there's no need as busiest_nr is already equal to nr_running.

Ionela.

> +					busiest = rq;
> +				}
>  			}
>  			break;
>  
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 18/24] sched/task_struct: Add helpers for IPC classification
  2023-06-13  4:24 ` [PATCH v4 18/24] sched/task_struct: Add helpers for IPC classification Ricardo Neri
@ 2023-06-22 10:20   ` Ionela Voinescu
  2023-06-25 20:23     ` Ricardo Neri
  0 siblings, 1 reply; 43+ messages in thread
From: Ionela Voinescu @ 2023-06-22 10:20 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Zhao Liu, Yuan, Perry, x86, Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

Hi,

On Monday 12 Jun 2023 at 21:24:16 (-0700), Ricardo Neri wrote:
> The raw classification that hardware provides for a task may not
> be directly usable by the scheduler: the classification may change too
> frequently or architecture-specific implementations may need to consider
> additional factors. For instance, some processors with Intel Thread
> Director need to consider the state of the SMT siblings of a core.
> 
> Provide per-task helper variables that architectures can use to
> postprocess the classification that hardware provides.
> 
> Cc: Ben Segall <bsegall@google.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Lukasz Luba <lukasz.luba@arm.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Perry Yuan <Perry.Yuan@amd.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Tim C. Chen <tim.c.chen@intel.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Zhao Liu <zhao1.liu@linux.intel.com>
> Cc: x86@kernel.org
> Cc: linux-pm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v3:
>  * None
> 
> Changes since v2:
>  * None
> 
> Changes since v1:
>  * Used bit-fields to fit all the IPC class data in 4 bytes. (PeterZ)
>  * Shortened names of the helpers.
>  * Renamed helpers with the ipcc_ prefix.
>  * Reworded commit message for clarity
> ---
>  include/linux/sched.h | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0e1c38ad09c2..719147460ca8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1541,7 +1541,17 @@ struct task_struct {
>  	 * A hardware-defined classification of task that reflects but is
>  	 * not identical to the number of instructions per cycle.
>  	 */
> -	unsigned short			ipcc;
> +	unsigned int			ipcc : 9;
> +	/*
> +	 * A candidate classification that arch-specific implementations
> +	 * qualify for correctness.
> +	 */
> +	unsigned int			ipcc_tmp : 9;
> +	/*
> +	 * Counter to filter out transient candidate classifications
> +	 * of a task.
> +	 */
> +	unsigned int			ipcc_cntr : 14;
>  #endif
>  

Why does this need to be split in task_struct? Isn't this architecture
specific? IMO the scheduler should never make use of class information
itself. It only receives the value through the call of an arch function
and passes it as an argument to an arch function to get a performance
score. So the way one interprets the class value (splits it in relevant
and helper bits) should be defined and considered in the architecture
specific code.

Thanks,
Ionela.

>  	/*
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread
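
A side note on the layout above: the three bit-fields pack into a single
32-bit word (9 + 9 + 14 = 32), which is how they fit in the 4 bytes that
the changelog mentions. A quick compile-time check (hypothetical sketch;
the size assumes the usual x86 ABI for unsigned int bit-fields):

	#include <assert.h>

	struct ipcc_bits {
		unsigned int ipcc : 9;		/* committed IPC class */
		unsigned int ipcc_tmp : 9;	/* candidate IPC class */
		unsigned int ipcc_cntr : 14;	/* debounce counter */
	};

	/* C11 static_assert: all three fields share one 32-bit word. */
	static_assert(sizeof(struct ipcc_bits) == 4,
		      "IPCC data fits in 4 bytes");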

* Re: [PATCH v4 06/24] sched/fair: Collect load-balancing stats for IPC classes
  2023-06-22  9:01   ` Ionela Voinescu
@ 2023-06-24  0:01     ` Ricardo Neri
  2023-06-26 19:52       ` Tim Chen
  0 siblings, 1 reply; 43+ messages in thread
From: Ricardo Neri @ 2023-06-24  0:01 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Zhao Liu, Yuan, Perry, x86, Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

On Thu, Jun 22, 2023 at 10:01:48AM +0100, Ionela Voinescu wrote:
> Hi Ricardo,

Hi Ionela,

Thank you very much for reviewing the patches!

> 
> On Monday 12 Jun 2023 at 21:24:04 (-0700), Ricardo Neri wrote:
> > When selecting the busiest scheduling group between two otherwise identical
> > groups of types asym_packing or fully_busy, IPC classes can be used to
> > break the tie.
> > 
> > Compute the IPC class performance score for a scheduling group. It is
> > defined as the sum of the IPC scores of the tasks at the back of each
> > runqueue in the group. Load balancing starts by pulling tasks from the back
> > of the runqueue first, making this tiebreaker more useful.
> > 
> > Also, track the IPC class with the lowest score in the scheduling group. A
> > task of this class will be pulled when the destination CPU has lower
> > priority than the fully_busy busiest group.
> > 
> > These two metrics will be used during idle load balancing to compute the
> > current and the potential IPC class score of a scheduling group in a
> > subsequent changeset.
> > 
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> > Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: Lukasz Luba <lukasz.luba@arm.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Perry Yuan <Perry.Yuan@amd.com>
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Tim C. Chen <tim.c.chen@intel.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: Zhao Liu <zhao1.liu@linux.intel.com>
> > Cc: x86@kernel.org
> > Cc: linux-pm@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > ---
> > Changes since v3:
> >  * Do not compute the IPCC stats using the current tasks of runqueues.
> >    Instead, use the tasks at the back of the queue. These are the tasks
> >    that will be pulled first during load balance. (Vincent)
> > 
> > Changes since v2:
> >  * Also excluded deadline and realtime tasks from IPCC stats. (Dietmar)
> >  * Also excluded tasks that cannot run on the destination CPU from the
> >    IPCC stats.
> >  * Folded struct sg_lb_ipcc_stats into struct sg_lb_stats. (Dietmar)
> >  * Reworded description sg_lb_stats::min_ipcc. (Ionela)
> >  * Handle errors of arch_get_ipcc_score(). (Ionela)
> > 
> > Changes since v1:
> >  * Implemented cleanups and reworks from PeterZ. Thanks!
> >  * Used the new interface names.
> > ---
> >  kernel/sched/fair.c | 79 +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 79 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6189d1a45635..c0cab5e501b6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9110,6 +9110,11 @@ struct sg_lb_stats {
> >  	unsigned int nr_numa_running;
> >  	unsigned int nr_preferred_running;
> >  #endif
> > +#ifdef CONFIG_IPC_CLASSES
> > +	unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
>                                               ^^^^^^^^^^^^^^
> > +	unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
> > +	unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
>                                               ^^^^^^^^^^^^^^
> These no longer apply.

Indeed, I missed this. I will update them in the next version.

> It might be easier to describe them all in a
> comment just above their declaration. Something like:
> 
> "The sum, min and its class of the IPC scores of the tasks at the back of each
> runqueue in the group."

I am hesitant to describe the three members in a single comment as it
would not comply with what Documentation/doc-guide/kernel-doc.rst
specifies.

Admittedly, the current format does not comply either. :) I would opt to
be consistent with the existing format.

> 
> > +#endif
> >  };
> >  
> >  /*
> > @@ -9387,6 +9392,77 @@ group_type group_classify(unsigned int imbalance_pct,
> >  	return group_has_spare;
> >  }
> >  
> > +#ifdef CONFIG_IPC_CLASSES
> > +static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
> > +{
> > +	/* All IPCC stats have been set to zero in update_sg_lb_stats(). */
> > +	sgs->min_score = ULONG_MAX;
> > +}
> > +
> > +static int rq_last_task_ipcc(int dst_cpu, struct rq *rq, unsigned short *ipcc)
> > +{
> > +	struct list_head *tasks = &rq->cfs_tasks;
> > +	struct task_struct *p;
> > +	struct rq_flags rf;
> > +	int ret = -EINVAL;
> > +
> 
> It's more typical for ret to be initialised to 0 and changed to an error
> value when there's an error case.

My intention was to save lines of code. I would need to set the error
value and then goto out. I am ok with your suggestion if it improves
readability.
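
For reference, a sketch of that form (folding in the irqrestore fix from
earlier in this thread):

	int ret = 0;

	rq_lock_irqsave(rq, &rf);
	if (list_empty(tasks)) {
		ret = -EINVAL;
		goto out;
	}

	p = list_last_entry(tasks, struct task_struct, se.group_node);
	if (p->flags & PF_EXITING || is_idle_task(p) ||
	    !cpumask_test_cpu(dst_cpu, p->cpus_ptr)) {
		ret = -EINVAL;
		goto out;
	}

	*ipcc = p->ipcc;
out:
	rq_unlock_irqrestore(rq, &rf);
	return ret;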

> 
> > +	rq_lock_irqsave(rq, &rf);
> > +	if (list_empty(tasks))
> > +		goto out;
> > +
> > +	p = list_last_entry(tasks, struct task_struct, se.group_node);
> > +	if (p->flags & PF_EXITING || is_idle_task(p) ||
> > +	    !cpumask_test_cpu(dst_cpu, p->cpus_ptr))
> > +		goto out;
> > +
> > +	ret = 0;
> > +	*ipcc = p->ipcc;
> > +out:
> > +	rq_unlock(rq, &rf);
> > +	return ret;
> > +}
> > +
> > +/* Called only if cpu_of(@rq) is not idle and has tasks running. */
> > +static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> > +				    struct rq *rq)
> > +{
> > +	unsigned short ipcc;
> > +	unsigned long score;
> > +
> > +	if (!sched_ipcc_enabled())
> > +		return;
> > +
> > +	if (rq_last_task_ipcc(dst_cpu, rq, &ipcc))
> > +		return;
> > +
> > +	score = arch_get_ipcc_score(ipcc, cpu_of(rq));
> > +
> > +	/*
> > +	 * Ignore tasks with invalid scores. When finding the busiest group, we
> > +	 * prefer those with higher sum_score. This group will not be selected.
> > +	 */
> 
> nit: the comment is unnecessary, and a bit misleading, IMO.
> 
> The comment says "This group will not be selected." but the only way to
> guarantee that here is to reset the sum_score to 0 when you find an
> invalid score, which I don't believe is your intention.

You raise an interesting point. We will call this function for each rq
in the sched group. Thus, if we encounter an error, the scores would be
incomplete. Therefore, I think the scores should be discarded and reset to
0 so that they are not used.

> 
> Also the use of sum_score is captured in later functions, so I don't
> believe there's a need for additional comments here.

Sure, I can remove the comment. Or maybe rephrase it as per my previous
comment?

> 
> Hope it helps,

It does! Thanks again for taking the time.

BR,
Ricardo

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 10/24] sched/fair: Use IPCC scores to select a busiest runqueue
  2023-06-22  9:03   ` Ionela Voinescu
@ 2023-06-24  0:25     ` Ricardo Neri
  0 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-24  0:25 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Zhao Liu, Yuan, Perry, x86, Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

On Thu, Jun 22, 2023 at 10:03:17AM +0100, Ionela Voinescu wrote:
> On Monday 12 Jun 2023 at 21:24:08 (-0700), Ricardo Neri wrote:
> > Use IPCC scores to break a tie between two runqueues with the same priority
> > and number of running tasks: select the runqueue whose last-enqueued task
> > would get a higher IPC boost when migrated to the destination CPU.
> > (These tasks are migrated first during load balance.)
> > 
> > For now, restrict the utilization of IPCC scores to scheduling domains
> > marked with the SD_ASYM_PACKING flag.
> > 
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> > Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: Lukasz Luba <lukasz.luba@arm.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Perry Yuan <Perry.Yuan@amd.com>
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Tim C. Chen <tim.c.chen@intel.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: Zhao Liu <zhao1.liu@linux.intel.com>
> > Cc: x86@kernel.org
> > Cc: linux-pm@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > ---
> > Changes since v3:
> >  * Do not compute the IPCC stats using the current tasks of runqueues.
> >    Instead, use the tasks at the back of the queue. These are the tasks
> >    that will be pulled first during load balance. (Vincent)
> > 
> > Changes since v2:
> >  * Only use IPCC scores to break ties if the sched domain uses
> >    asym_packing. (Ionela)
> >  * Handle errors of arch_get_ipcc_score(). (Ionela)
> > 
> > Changes since v1:
> >  * Fixed a bug when selecting a busiest runqueue: when comparing two
> >    runqueues with equal nr_running, we must compute the IPCC score delta
> >    of both.
> >  * Renamed local variables to improve the layout of the code block.
> >    (PeterZ)
> >  * Used the new interface names.
> > ---
> >  kernel/sched/fair.c | 61 +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 61 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index fcec791ede4f..da3e009eef42 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9564,6 +9564,41 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
> >  	return sched_asym_ipcc_prefer(a_stats, b_stats);
> >  }
> >  
> > +/**
> > + * ipcc_score_delta - Get the IPCC score delta wrt the load balance's dst_cpu
> > + * @rq:		A runqueue
> > + * @env:	Load balancing environment
> > + *
> > + * Returns: The IPCC score delta that the last task enqueued in @rq would get
> > + * if placed in the destination CPU of @env. LONG_MIN to indicate that the
> > + * delta should not be used.
> > + */
> > +static long ipcc_score_delta(struct rq *rq, struct lb_env *env)
> > +{
> > +	unsigned long score_src, score_dst;
> > +	unsigned short ipcc;
> > +
> > +	if (!sched_ipcc_enabled())
> > +		return LONG_MIN;
> > +
> > +	/* Only asym_packing uses IPCC scores at the moment. */
> > +	if (!(env->sd->flags & SD_ASYM_PACKING))
> > +		return LONG_MIN;
> > +
> > +	if (rq_last_task_ipcc(env->dst_cpu, rq, &ipcc))
> > +		return LONG_MIN;
> > +
> > +	score_dst = arch_get_ipcc_score(ipcc, env->dst_cpu);
> > +	if (IS_ERR_VALUE(score_dst))
> > +		return LONG_MIN;
> > +
> > +	score_src = arch_get_ipcc_score(ipcc, cpu_of(rq));
> > +	if (IS_ERR_VALUE(score_src))
> > +		return LONG_MIN;
> > +
> > +	return score_dst - score_src;
> > +}
> > +
> >  #else /* CONFIG_IPC_CLASSES */
> >  static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> >  				    struct rq *rq)
> > @@ -9594,6 +9629,11 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
> >  	return false;
> >  }
> >  
> > +static long ipcc_score_delta(struct rq *rq, struct lb_env *env)
> > +{
> > +	return LONG_MIN;
> > +}
> > +
> >  #endif /* CONFIG_IPC_CLASSES */
> >  
> >  /**
> > @@ -10769,6 +10809,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> >  {
> >  	struct rq *busiest = NULL, *rq;
> >  	unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
> > +	long busiest_ipcc_delta = LONG_MIN;
> >  	unsigned int busiest_nr = 0;
> >  	int i;
> >  
> > @@ -10885,6 +10926,26 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> >  			if (busiest_nr < nr_running) {
> >  				busiest_nr = nr_running;
> >  				busiest = rq;
> > +
> > +				/*
> > +				 * Remember the IPCC score of the busiest
> > +				 * runqueue. We may need it to break a tie with
> > +				 * other queues with equal nr_running.
> > +				 */
> > +				busiest_ipcc_delta = ipcc_score_delta(busiest, env);
> > +			/*
> > +			 * For ties, select @rq if doing so would give its last
> > +			 * queued task a bigger IPC boost when migrated to
> > +			 * dst_cpu.
> > +			 */
> > +			} else if (busiest_nr == nr_running) {
> > +				long delta = ipcc_score_delta(rq, env);
> > +
> > +				if (busiest_ipcc_delta < delta) {
> > +					busiest_ipcc_delta = delta;
> > +					busiest_nr = nr_running;
> 
> nit: there's no need as busiest_nr is already equal to nr_running.

True! I will remove this pointless assignment.

Thanks and BR,
Ricardo

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 07/24] sched/fair: Compute IPC class scores for load balancing
  2023-06-22  9:02   ` Ionela Voinescu
@ 2023-06-25 20:11     ` Ricardo Neri
  2023-06-26 21:01       ` Tim Chen
  2023-06-27 15:19       ` Ionela Voinescu
  0 siblings, 2 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-25 20:11 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Zhao Liu, Yuan, Perry, x86, Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

On Thu, Jun 22, 2023 at 10:02:44AM +0100, Ionela Voinescu wrote:
> On Monday 12 Jun 2023 at 21:24:05 (-0700), Ricardo Neri wrote:
> > When using IPCC scores to break ties between two scheduling groups, it is
> > necessary to consider both the current score and the score that would
> > result after load balancing.
> > 
> > Compute the combined IPC class score of a scheduling group and the local
> > scheduling group. Compute both the current score and the prospective score.
> > 
> > Collect IPCC statistics only for asym_packing and fully_busy scheduling
> > groups. These are the only cases that use IPCC scores.
> > 
> > These IPCC statistics are used during idle load balancing. The candidate
> > scheduling group will have one fewer busy CPU after load balancing. This
> > observation is important for cores with SMT support.
> > 
> > The IPCC score of scheduling groups composed of SMT siblings needs to
> > consider that the siblings share CPU resources. When computing the total
> > IPCC score of the scheduling group, divide the score of each sibling by
> > the number of busy siblings.
> > 
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> > Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: Lukasz Luba <lukasz.luba@arm.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Perry Yuan <Perry.Yuan@amd.com>
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Tim C. Chen <tim.c.chen@intel.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: Zhao Liu <zhao1.liu@linux.intel.com>
> > Cc: x86@kernel.org
> > Cc: linux-pm@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > ---
> > Changes since v3:
> >  * None
> > 
> > Changes since v2:
> >  * Also collect IPCC stats for fully_busy sched groups.
> >  * Restrict use of IPCC stats to SD_ASYM_PACKING. (Ionela)
> >  * Handle errors of arch_get_ipcc_score(). (Ionela)
> > 
> > Changes since v1:
> >  * Implemented cleanups and reworks from PeterZ. I took all his
> >    suggestions, except the computation of the  IPC score before and after
> >    load balancing. We are computing not the average score, but the *total*.
> >  * Check for the SD_SHARE_CPUCAPACITY to compute the throughput of the SMT
> >    siblings of a physical core.
> >  * Used the new interface names.
> >  * Reworded commit message for clarity.
> > ---
> >  kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 68 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c0cab5e501b6..a51c65c9335f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9114,6 +9114,8 @@ struct sg_lb_stats {
> >  	unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
> >  	unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
> >  	unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
> > +	long ipcc_score_after; /* Prospective IPCC score after load balancing */
> > +	unsigned long ipcc_score_before; /* IPCC score before load balancing */
> >  #endif
> >  };
> >  
> > @@ -9452,6 +9454,62 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> >  	}
> >  }
> >  
> > +static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
> > +				      struct sched_group *sg,
> > +				      struct lb_env *env)
> > +{
> > +	unsigned long score_on_dst_cpu, before;
> > +	int busy_cpus;
> > +	long after;
> > +
> > +	if (!sched_ipcc_enabled())
> > +		return;
> > +
> > +	/*
> > +	 * IPCC scores are only useful during idle load balancing. For now,
> > +	 * only asym_packing uses IPCC scores.
> > +	 */
> > +	if (!(env->sd->flags & SD_ASYM_PACKING) ||
> > +	    env->idle == CPU_NOT_IDLE)
> > +		return;
> > +
> > +	/*
> > +	 * IPCC scores are used to break ties only between these types of
> > +	 * groups.
> > +	 */
> > +	if (sgs->group_type != group_fully_busy &&
> > +	    sgs->group_type != group_asym_packing)
> > +		return;
> > +
> > +	busy_cpus = sgs->group_weight - sgs->idle_cpus;
> > +
> > +	/* No busy CPUs in the group. No tasks to move. */
> > +	if (!busy_cpus)
> > +		return;
> > +
> > +	score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
> > +
> > +	/*
> > +	 * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
> > +	 * and not used.
> > +	 */
> > +	if (IS_ERR_VALUE(score_on_dst_cpu))
> > +		return;
> > +
> > +	before = sgs->sum_score;
> > +	after = before - sgs->min_score;
> 
> I don't believe this can end up being negative, as the sum of all
> scores should be higher than or equal to the min score, right?

Yes, I agree. `after` cannot be negative.

> 
> I'm just wondering if ipcc_score_after can be made unsigned long as well,
> just for consistency.

Sure. I can make it of type unsigned long as well.

> 
> > +
> > +	/* SMT siblings share throughput. */
> > +	if (busy_cpus > 1 && sg->flags & SD_SHARE_CPUCAPACITY) {
> > +		before /= busy_cpus;
> > +		/* One sibling will become idle after load balance. */
> > +		after /= busy_cpus - 1;
> > +	}
> > +
> > +	sgs->ipcc_score_after = after + score_on_dst_cpu;
> > +	sgs->ipcc_score_before = before;
> 
> Shouldn't the score_on_dst_cpu be added to "after" before being divided
> between the SMT siblings?

No, because ipcc_score_after represents the joint score of the busiest
core and the destination core after load balance has taken place. The
destination core was previously idle and now contributes to the joint
score.

Thanks and BR,
Ricardo


* Re: [PATCH v4 18/24] sched/task_struct: Add helpers for IPC classification
  2023-06-22 10:20   ` Ionela Voinescu
@ 2023-06-25 20:23     ` Ricardo Neri
  0 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-06-25 20:23 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Zhao Liu, Yuan, Perry, x86, Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

On Thu, Jun 22, 2023 at 11:20:47AM +0100, Ionela Voinescu wrote:
> Hi,
> 
> On Monday 12 Jun 2023 at 21:24:16 (-0700), Ricardo Neri wrote:
> > The raw classification that hardware provides for a task may not
> > be directly usable by the scheduler: the classification may change too
> > frequently or architecture-specific implementations may need to consider
> > additional factors. For instance, some processors with Intel Thread
> > Director need to consider the state of the SMT siblings of a core.
> > 
> > Provide per-task helper variables that architectures can use to
> > postprocess the classification that hardware provides.
> > 
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> > Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: Lukasz Luba <lukasz.luba@arm.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Perry Yuan <Perry.Yuan@amd.com>
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Tim C. Chen <tim.c.chen@intel.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: Zhao Liu <zhao1.liu@linux.intel.com>
> > Cc: x86@kernel.org
> > Cc: linux-pm@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > ---
> > Changes since v3:
> >  * None
> > 
> > Changes since v2:
> >  * None
> > 
> > Changes since v1:
> >  * Used bit-fields to fit all the IPC class data in 4 bytes. (PeterZ)
> >  * Shortened names of the helpers.
> >  * Renamed helpers with the ipcc_ prefix.
> >  * Reworded commit message for clarity
> > ---
> >  include/linux/sched.h | 12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 0e1c38ad09c2..719147460ca8 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1541,7 +1541,17 @@ struct task_struct {
> >  	 * A hardware-defined classification of task that reflects but is
> >  	 * not identical to the number of instructions per cycle.
> >  	 */
> > -	unsigned short			ipcc;
> > +	unsigned int			ipcc : 9;
> > +	/*
> > +	 * A candidate classification that arch-specific implementations
> > +	 * qualify for correctness.
> > +	 */
> > +	unsigned int			ipcc_tmp : 9;
> > +	/*
> > +	 * Counter to filter out transient candidate classifications
> > +	 * of a task.
> > +	 */
> > +	unsigned int			ipcc_cntr : 14;
> >  #endif
> >  
> 
> Why does this need to be split in task_struct? Isn't this architecture
> specific? IMO the scheduler should never make use of class information
> itself. It only receives the value though the call of an arch function
> and passes it as an argument to an arch function to get a performance
> score. So the way one interprets the class value (splits it in relevant
> and helper bits) should be defined and considered in the architecture
> specific code.

This is an excellent observation. The scheduler is unconcerned with what
the classes mean. I'll remove these auxiliary members.

This also removes 2 patches from the series.
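
As a sketch of the boundary you describe (illustrative only; the
interface names are the ones used in this series), the scheduler side
reduces to treating the class as an opaque token:

	/* task_struct keeps only the opaque classification: */
	unsigned short		ipcc;

	/* Tick path: arch code classifies and postprocesses. */
	arch_update_ipcc(curr);

	/* Load-balance path: arch code interprets the token. */
	score = arch_get_ipcc_score(p->ipcc, env->dst_cpu);

Any ipcc_tmp/ipcc_cntr-style filtering state then becomes private to
the architecture-specific implementation.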

Thanks and BR,
Ricardo


* Re: [PATCH v4 06/24] sched/fair: Collect load-balancing stats for IPC classes
  2023-06-24  0:01     ` Ricardo Neri
@ 2023-06-26 19:52       ` Tim Chen
  2023-07-06 23:40         ` Ricardo Neri
  0 siblings, 1 reply; 43+ messages in thread
From: Tim Chen @ 2023-06-26 19:52 UTC (permalink / raw)
  To: Ricardo Neri, Ionela Voinescu
  Cc: Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Lukasz Luba, Zhao Liu, Yuan,
	Perry, x86, Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu


> > 
> > nit: the comment is unnecessary, and a bit misleading, IMO.
> > 
> > The comment says "This group will not be selected." but the only way to
> > guarantee that here is to reset the sum_score to 0 when you find an
> > invalid score, which I don't believe is your intention.
> 
> You raise an interesting point. We will call this function for each rq
> in the sched group. Thus, if we encounter an error, the scores would be
> incomplete. Therefore, I think the scores should be discarded and reset
> to 0 so that they are not used.

Since we still have other rqs to loop through, resetting to 0 here does
not guarantee that it will stay that way at the end of the loop, as the
sum could still be added to. We may need a special value like -EINVAL,
and to make the score a "long", to mark such a case.
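
A minimal sketch of that idea (illustrative; simplified to the sum
only, and the rq parameter and per-rq call pattern are assumed):

	/* Poison the group stats on the first invalid score. */
	static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
					    struct rq *rq)
	{
		long score;

		if (sgs->sum_score == -EINVAL)	/* already poisoned */
			return;

		score = arch_get_ipcc_score(rq->curr->ipcc, dst_cpu);
		if (IS_ERR_VALUE(score)) {
			/* Stays invalid for the remaining rqs of the group. */
			sgs->sum_score = -EINVAL;
			return;
		}

		sgs->sum_score += score;
	}

This would also require sgs->sum_score to become a "long", as noted
above.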

Tim





* Re: [PATCH v4 07/24] sched/fair: Compute IPC class scores for load balancing
  2023-06-25 20:11     ` Ricardo Neri
@ 2023-06-26 21:01       ` Tim Chen
  2023-07-06 23:48         ` Ricardo Neri
  2023-06-27 15:19       ` Ionela Voinescu
  1 sibling, 1 reply; 43+ messages in thread
From: Tim Chen @ 2023-06-26 21:01 UTC (permalink / raw)
  To: Ricardo Neri, Ionela Voinescu
  Cc: Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Lukasz Luba, Zhao Liu, Yuan,
	Perry, x86, Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

On Sun, 2023-06-25 at 13:11 -0700, Ricardo Neri wrote:
> 
> > > +
> > > +	score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
> > > +
> > > +	/*
> > > +	 * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
> > > +	 * and not used.
> > > +	 */

The comment does not match the check below. If zero
is not used, the check should also reflect that case.

> > > +	if (IS_ERR_VALUE(score_on_dst_cpu))
> > > +		return;
> > > +
> > > +	before = sgs->sum_score;
> > > +	after = before - sgs->min_score;
> > 
> 
Tim


* Re: [PATCH v4 07/24] sched/fair: Compute IPC class scores for load balancing
  2023-06-25 20:11     ` Ricardo Neri
  2023-06-26 21:01       ` Tim Chen
@ 2023-06-27 15:19       ` Ionela Voinescu
  1 sibling, 0 replies; 43+ messages in thread
From: Ionela Voinescu @ 2023-06-27 15:19 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Zhao Liu, Yuan, Perry, x86, Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

Hey,

On Sunday 25 Jun 2023 at 13:11:55 (-0700), Ricardo Neri wrote:
> On Thu, Jun 22, 2023 at 10:02:44AM +0100, Ionela Voinescu wrote:
> > On Monday 12 Jun 2023 at 21:24:05 (-0700), Ricardo Neri wrote:
> > > When using IPCC scores to break ties between two scheduling groups, it is
> > > necessary to consider both the current score and the score that would
> > > result after load balancing.
> > > 
> > > Compute the combined IPC class score of a scheduling group and the local
> > > scheduling group. Compute both the current score and the prospective score.
> > > 
> > > Collect IPCC statistics only for asym_packing and fully_busy scheduling
> > > groups. These are the only cases that use IPCC scores.
> > > 
> > > These IPCC statistics are used during idle load balancing. The candidate
> > > scheduling group will have one fewer busy CPU after load balancing. This
> > > observation is important for cores with SMT support.
> > > 
> > > The IPCC score of scheduling groups composed of SMT siblings needs to
> > > consider that the siblings share CPU resources. When computing the total
> > > IPCC score of the scheduling group, divide the score of each sibling by
> > > the number of busy siblings.
> > > 
> > > Cc: Ben Segall <bsegall@google.com>
> > > Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> > > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> > > Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > Cc: Len Brown <len.brown@intel.com>
> > > Cc: Lukasz Luba <lukasz.luba@arm.com>
> > > Cc: Mel Gorman <mgorman@suse.de>
> > > Cc: Perry Yuan <Perry.Yuan@amd.com>
> > > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Tim C. Chen <tim.c.chen@intel.com>
> > > Cc: Valentin Schneider <vschneid@redhat.com>
> > > Cc: Zhao Liu <zhao1.liu@linux.intel.com>
> > > Cc: x86@kernel.org
> > > Cc: linux-pm@vger.kernel.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > > ---
> > > Changes since v3:
> > >  * None
> > > 
> > > Changes since v2:
> > >  * Also collect IPCC stats for fully_busy sched groups.
> > >  * Restrict use of IPCC stats to SD_ASYM_PACKING. (Ionela)
> > >  * Handle errors of arch_get_ipcc_score(). (Ionela)
> > > 
> > > Changes since v1:
> > >  * Implemented cleanups and reworks from PeterZ. I took all his
> > >    suggestions, except the computation of the IPC score before and after
> > >    load balancing. We are computing not the average score, but the *total*.
> > >  * Check for the SD_SHARE_CPUCAPACITY to compute the throughput of the SMT
> > >    siblings of a physical core.
> > >  * Used the new interface names.
> > >  * Reworded commit message for clarity.
> > > ---
> > >  kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 68 insertions(+)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index c0cab5e501b6..a51c65c9335f 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -9114,6 +9114,8 @@ struct sg_lb_stats {
> > >  	unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
> > >  	unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
> > >  	unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
> > > +	long ipcc_score_after; /* Prospective IPCC score after load balancing */
> > > +	unsigned long ipcc_score_before; /* IPCC score before load balancing */
> > >  #endif
> > >  };
> > >  
> > > @@ -9452,6 +9454,62 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> > >  	}
> > >  }
> > >  
> > > +static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
> > > +				      struct sched_group *sg,
> > > +				      struct lb_env *env)
> > > +{
> > > +	unsigned long score_on_dst_cpu, before;
> > > +	int busy_cpus;
> > > +	long after;
> > > +
> > > +	if (!sched_ipcc_enabled())
> > > +		return;
> > > +
> > > +	/*
> > > +	 * IPCC scores are only useful during idle load balancing. For now,
> > > +	 * only asym_packing uses IPCC scores.
> > > +	 */
> > > +	if (!(env->sd->flags & SD_ASYM_PACKING) ||
> > > +	    env->idle == CPU_NOT_IDLE)
> > > +		return;
> > > +
> > > +	/*
> > > +	 * IPCC scores are used to break ties only between these types of
> > > +	 * groups.
> > > +	 */
> > > +	if (sgs->group_type != group_fully_busy &&
> > > +	    sgs->group_type != group_asym_packing)
> > > +		return;
> > > +
> > > +	busy_cpus = sgs->group_weight - sgs->idle_cpus;
> > > +
> > > +	/* No busy CPUs in the group. No tasks to move. */
> > > +	if (!busy_cpus)
> > > +		return;
> > > +
> > > +	score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
> > > +
> > > +	/*
> > > +	 * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
> > > +	 * and not used.
> > > +	 */
> > > +	if (IS_ERR_VALUE(score_on_dst_cpu))
> > > +		return;
> > > +
> > > +	before = sgs->sum_score;
> > > +	after = before - sgs->min_score;
> > 
> > I don't believe this can end up being negative, as the sum of all
> > scores should be higher than or equal to the min score, right?
> 
> Yes, I agree. `after` cannot be negative.
> 
> > 
> > I'm just wondering if ipcc_score_after can be made unsigned long as well,
> > just for consistency.
> 
> Sure. I can make it of type unsigned long as well.
> 
> > 
> > > +
> > > +	/* SMT siblings share throughput. */
> > > +	if (busy_cpus > 1 && sg->flags & SD_SHARE_CPUCAPACITY) {
> > > +		before /= busy_cpus;
> > > +		/* One sibling will become idle after load balance. */
> > > +		after /= busy_cpus - 1;
> > > +	}
> > > +
> > > +	sgs->ipcc_score_after = after + score_on_dst_cpu;
> > > +	sgs->ipcc_score_before = before;
> > 
> > Shouldn't the score_on_dst_cpu be added to "after" before being divided
> > between the SMT siblings?
> 
> No, because ipcc_score_after represents the joint score of the busiest
> core and the destination core after load balance has taken place. The
> destination core was previously idle and now contributes to the joint
> score.
> 

Right! score_on_dst_cpu does not contribute to the per-cpu throughput of
the busiest core, but it reflects the improvement in score gained by the
move to the destination.

Thanks,
Ionela.

> Thanks and BR,
> Ricardo


* Re: [PATCH v4 14/24] thermal: intel: hfi: Store per-CPU IPCC scores
  2023-06-13  4:24 ` [PATCH v4 14/24] thermal: intel: hfi: Store per-CPU IPCC scores Ricardo Neri
@ 2023-06-29 18:53   ` Rafael J. Wysocki
  2023-07-06 23:23     ` Ricardo Neri
  0 siblings, 1 reply; 43+ messages in thread
From: Rafael J. Wysocki @ 2023-06-29 18:53 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

On Tue, Jun 13, 2023 at 6:23 AM Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> The scheduler reads the IPCC scores when balancing load. These reads can
> occur frequently and originate from many CPUs. Hardware may also
> occasionally update the HFI table. Controlling access with locks would
> cause contention.
>
> Cache the IPCC scores in separate per-CPU variables that the scheduler can
> use. Use a seqcount to synchronize memory accesses to these cached values.
> This eliminates the need for locks, as the sequence counter provides the
> memory ordering required to prevent the use of stale data.
>
> The HFI delayed workqueue guarantees that only one CPU writes the cached
> IPCC scores. The frequency of updates is low (every CONFIG_HZ jiffies or
> less), and the number of writes per update is in the order of tens. Writes
> should not starve reads.
>
> Only cache IPCC scores in this changeset. A subsequent changeset will
> use these scores.
>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Lukasz Luba <lukasz.luba@arm.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Perry Yuan <Perry.Yuan@amd.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Tim C. Chen <tim.c.chen@intel.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Zhao Liu <zhao1.liu@linux.intel.com>
> Cc: x86@kernel.org
> Cc: linux-pm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v3:
>  * As Rafael requested, I reworked the memory ordering of the cached IPCC
>    scores. I selected a seqcount as it is less expensive than a memory
>    barrier, which is not necessary anyway.
>  * Made alloc_hfi_ipcc_scores() return -ENOMEM on allocation failure.
>    (Rafael)
>  * Added a comment to describe hfi_ipcc_scores. (Rafael)
>
> Changes since v2:
>  * Only create these per-CPU variables when Intel Thread Director is
>    supported.
>
> Changes since v1:
>  * Added this patch.
> ---
>  drivers/thermal/intel/intel_hfi.c | 66 +++++++++++++++++++++++++++++++
>  1 file changed, 66 insertions(+)
>
> diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> index 20ee4264dcd4..d822ed0bb5c1 100644
> --- a/drivers/thermal/intel/intel_hfi.c
> +++ b/drivers/thermal/intel/intel_hfi.c
> @@ -29,9 +29,11 @@
>  #include <linux/kernel.h>
>  #include <linux/math.h>
>  #include <linux/mutex.h>
> +#include <linux/percpu.h>
>  #include <linux/percpu-defs.h>
>  #include <linux/printk.h>
>  #include <linux/processor.h>
> +#include <linux/seqlock.h>
>  #include <linux/slab.h>
>  #include <linux/spinlock.h>
>  #include <linux/string.h>
> @@ -180,6 +182,62 @@ static struct workqueue_struct *hfi_updates_wq;
>  #define HFI_UPDATE_INTERVAL            HZ
>  #define HFI_MAX_THERM_NOTIFY_COUNT     16
>
> +/* A cache of the HFI perf capabilities for lockless access. */
> +static int __percpu *hfi_ipcc_scores;
> +/* Sequence counter for hfi_ipcc_scores */
> +static seqcount_t hfi_ipcc_seqcount = SEQCNT_ZERO(hfi_ipcc_seqcount);
> +
> +static int alloc_hfi_ipcc_scores(void)
> +{
> +       if (!cpu_feature_enabled(X86_FEATURE_ITD))
> +               return 0;
> +
> +       hfi_ipcc_scores = __alloc_percpu(sizeof(*hfi_ipcc_scores) *
> +                                        hfi_features.nr_classes,
> +                                        sizeof(*hfi_ipcc_scores));
> +
> +       return hfi_ipcc_scores ? 0 : -ENOMEM;
> +}

This doesn't need to return an int.  It could be a bool function
returning !!hfi_ipcc_scores (or false for the feature missing case).
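
That is, something like this sketch (the caller would then need to
treat a false return from the feature-missing case as non-fatal):

	static bool alloc_hfi_ipcc_scores(void)
	{
		if (!cpu_feature_enabled(X86_FEATURE_ITD))
			return false;

		hfi_ipcc_scores = __alloc_percpu(sizeof(*hfi_ipcc_scores) *
						 hfi_features.nr_classes,
						 sizeof(*hfi_ipcc_scores));

		return !!hfi_ipcc_scores;
	}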

Apart from this minor thing, the patch looks reasonable to me.

> +
> +static void set_hfi_ipcc_scores(struct hfi_instance *hfi_instance)
> +{
> +       int cpu;
> +
> +       if (!cpu_feature_enabled(X86_FEATURE_ITD))
> +               return;
> +
> +       /*
> +        * Serialize with writes to the HFI table. It also protects the write
> +        * loop against seqcount readers running in interrupt context.
> +        */
> +       raw_spin_lock_irq(&hfi_instance->table_lock);
> +       /*
> +        * The seqcount implies store-release semantics to order stores with
> +        * lockless loads from the seqcount read side. It also implies a
> +        * compiler barrier.
> +        */
> +       write_seqcount_begin(&hfi_ipcc_seqcount);
> +       for_each_cpu(cpu, hfi_instance->cpus) {
> +               int c, *scores;
> +               s16 index;
> +
> +               index = per_cpu(hfi_cpu_info, cpu).index;
> +               scores = per_cpu_ptr(hfi_ipcc_scores, cpu);
> +
> +               for (c = 0;  c < hfi_features.nr_classes; c++) {
> +                       struct hfi_cpu_data *caps;
> +
> +                       caps = hfi_instance->data +
> +                              index * hfi_features.cpu_stride +
> +                              c * hfi_features.class_stride;
> +                       scores[c] = caps->perf_cap;
> +               }
> +       }
> +
> +       write_seqcount_end(&hfi_ipcc_seqcount);
> +       raw_spin_unlock_irq(&hfi_instance->table_lock);
> +}
> +
>  /**
>   * intel_hfi_read_classid() - Read the current classid
>   * @classid:   Variable to which the classid will be written.
> @@ -275,6 +333,8 @@ static void update_capabilities(struct hfi_instance *hfi_instance)
>                 thermal_genl_cpu_capability_event(cpu_count, &cpu_caps[i]);
>
>         kfree(cpu_caps);
> +
> +       set_hfi_ipcc_scores(hfi_instance);
>  out:
>         mutex_unlock(&hfi_instance_lock);
>  }
> @@ -618,8 +678,14 @@ void __init intel_hfi_init(void)
>         if (!hfi_updates_wq)
>                 goto err_nomem;
>
> +       if (alloc_hfi_ipcc_scores())
> +               goto err_ipcc;
> +
>         return;
>
> +err_ipcc:
> +       destroy_workqueue(hfi_updates_wq);
> +
>  err_nomem:
>         for (j = 0; j < i; ++j) {
>                 hfi_instance = &hfi_instances[j];
> --
> 2.25.1
>


* Re: [PATCH v4 15/24] thermal: intel: hfi: Report the IPC class score of a CPU
  2023-06-13  4:24 ` [PATCH v4 15/24] thermal: intel: hfi: Report the IPC class score of a CPU Ricardo Neri
@ 2023-06-29 18:56   ` Rafael J. Wysocki
  2023-07-06 23:10     ` Ricardo Neri
  0 siblings, 1 reply; 43+ messages in thread
From: Rafael J. Wysocki @ 2023-06-29 18:56 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

On Tue, Jun 13, 2023 at 6:23 AM Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> Implement the arch_get_ipcc_score() interface of the scheduler. Use the
> performance capabilities of the extended Hardware Feedback Interface table
> as the IPC score.
>
> Use the cached per-CPU IPCC scores. A seqcount provides lockless access and
> the required memory ordering to avoid stale data.
>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Lukasz Luba <lukasz.luba@arm.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Perry Yuan <Perry.Yuan@amd.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Tim C. Chen <tim.c.chen@intel.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Zhao Liu <zhao1.liu@linux.intel.com>
> Cc: x86@kernel.org
> Cc: linux-pm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
> Changes since v3:
>  * Ordered memory loads using a seqcount.
>  * Removed local variable hfi_class from intel_hfi_get_ipcc_score().
>    (Rafael).
>
> Changes since v2:
>  * None
>
> Changes since v1:
>  * Adjusted the returned HFI class (which starts at 0) to match the
>    scheduler IPCC class (which starts at 1). (PeterZ)
>  * Used the new interface names.
> ---
>  arch/x86/include/asm/topology.h   |  4 ++++
>  drivers/thermal/intel/intel_hfi.c | 40 +++++++++++++++++++++++++++++--
>  2 files changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index 7d2ed7f7bb8c..8dd328cdb5cf 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -237,13 +237,17 @@ void init_freq_invariance_cppc(void);
>
>  #ifdef CONFIG_INTEL_HFI_THERMAL
>  int intel_hfi_read_classid(u8 *classid);
> +unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu);
>  #else
>  static inline int intel_hfi_read_classid(u8 *classid) { return -ENODEV; }
> +static inline unsigned long
> +intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu) { return -ENODEV; }
>  #endif
>
>  #ifdef CONFIG_IPC_CLASSES
>  void intel_update_ipcc(struct task_struct *curr);
>  #define arch_update_ipcc intel_update_ipcc
> +#define arch_get_ipcc_score intel_hfi_get_ipcc_score
>  #endif
>
>  #endif /* _ASM_X86_TOPOLOGY_H */
> diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> index d822ed0bb5c1..fec2c01fda1b 100644
> --- a/drivers/thermal/intel/intel_hfi.c
> +++ b/drivers/thermal/intel/intel_hfi.c
> @@ -199,6 +199,42 @@ static int alloc_hfi_ipcc_scores(void)
>         return hfi_ipcc_scores ? 0 : -ENOMEM;
>  }
>
> +unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
> +{
> +       int *scores, score;
> +       unsigned long seq;
> +
> > +       if (cpu < 0 || cpu >= nr_cpu_ids)
> > +               return -EINVAL;
> > +
> > +       scores = per_cpu_ptr(hfi_ipcc_scores, cpu);
> > +       if (!scores)
> > +               return -ENODEV;
> +
> +       if (ipcc == IPC_CLASS_UNCLASSIFIED)
> +               return -EINVAL;
> +
> +       /*
> +        * Scheduler IPC classes start at 1. HFI classes start at 0.
> +        * See note intel_hfi_update_ipcc().
> +        */
> +       if (ipcc >= hfi_features.nr_classes + 1)
> +               return -EINVAL;
> +
> +       /*
> +        * The seqcount implies load-acquire semantics to order loads with
> +        * lockless stores of the write side in set_hfi_ipcc_score(). It
> +        * also implies a compiler barrier.
> +        */
> +       do {
> +               seq = read_seqcount_begin(&hfi_ipcc_seqcount);
> +               /* @ipcc is never 0. */
> +               score = scores[ipcc - 1];
> +       } while (read_seqcount_retry(&hfi_ipcc_seqcount, seq));
> +
> +       return score;
> +}
> +
>  static void set_hfi_ipcc_scores(struct hfi_instance *hfi_instance)
>  {
>         int cpu;
> @@ -213,8 +249,8 @@ static void set_hfi_ipcc_scores(struct hfi_instance *hfi_instance)
>         raw_spin_lock_irq(&hfi_instance->table_lock);
>         /*
>          * The seqcount implies store-release semantics to order stores with
> -        * lockless loads from the seqcount read side. It also implies a
> -        * compiler barrier.
> +        * lockless loads from the seqcount read side in
> +        * intel_hfi_get_ipcc_score(). It also implies a compiler barrier.
>          */
>         write_seqcount_begin(&hfi_ipcc_seqcount);
>         for_each_cpu(cpu, hfi_instance->cpus) {
> --
> 2.25.1
>


* Re: [PATCH v4 15/24] thermal: intel: hfi: Report the IPC class score of a CPU
  2023-06-29 18:56   ` Rafael J. Wysocki
@ 2023-07-06 23:10     ` Ricardo Neri
  0 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-07-06 23:10 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

On Thu, Jun 29, 2023 at 08:56:36PM +0200, Rafael J. Wysocki wrote:
> On Tue, Jun 13, 2023 at 6:23 AM Ricardo Neri
> <ricardo.neri-calderon@linux.intel.com> wrote:
> >
> > Implement the arch_get_ipcc_score() interface of the scheduler. Use the
> > performance capabilities of the extended Hardware Feedback Interface table
> > as the IPC score.
> >
> > Use the cached per-CPU IPCC scores. A seqcount provides lockless access and
> > the required memory ordering to avoid stale data.
> >
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> > Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: Lukasz Luba <lukasz.luba@arm.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Perry Yuan <Perry.Yuan@amd.com>
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Tim C. Chen <tim.c.chen@intel.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: Zhao Liu <zhao1.liu@linux.intel.com>
> > Cc: x86@kernel.org
> > Cc: linux-pm@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> 
> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 

Thank you Rafael!


* Re: [PATCH v4 14/24] thermal: intel: hfi: Store per-CPU IPCC scores
  2023-06-29 18:53   ` Rafael J. Wysocki
@ 2023-07-06 23:23     ` Ricardo Neri
  0 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-07-06 23:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Tim Chen, Valentin Schneider, Lukasz Luba,
	Ionela Voinescu, Zhao Liu, Yuan, Perry, x86,
	Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

On Thu, Jun 29, 2023 at 08:53:31PM +0200, Rafael J. Wysocki wrote:
> On Tue, Jun 13, 2023 at 6:23 AM Ricardo Neri
> <ricardo.neri-calderon@linux.intel.com> wrote:
> >
> > The scheduler reads the IPCC scores when balancing load. These reads can
> > occur frequently and originate from many CPUs. Hardware may also
> > occasionally update the HFI table. Controlling access with locks would
> > cause contention.
> >
> > Cache the IPCC scores in separate per-CPU variables that the scheduler can
> > use. Use a seqcount to synchronize memory accesses to these cached values.
> > This eliminates the need for locks, as the sequence counter provides the
> > memory ordering required to prevent the use of stale data.
> >
> > The HFI delayed workqueue guarantees that only one CPU writes the cached
> > IPCC scores. The frequency of updates is low (every CONFIG_HZ jiffies or
> > less), and the number of writes per update is in the order of tens. Writes
> > should not starve reads.
> >
> > Only cache IPCC scores in this changeset. A subsequent changeset will
> > use these scores.
> >
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Ionela Voinescu <ionela.voinescu@arm.com>
> > Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: Lukasz Luba <lukasz.luba@arm.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Perry Yuan <Perry.Yuan@amd.com>
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Tim C. Chen <tim.c.chen@intel.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: Zhao Liu <zhao1.liu@linux.intel.com>
> > Cc: x86@kernel.org
> > Cc: linux-pm@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > ---
> > Changes since v3:
> >  * As Rafael requested, I reworked the memory ordering of the cached IPCC
> >    scores. I selected a seqcount as it is less expensive than a memory
> >    barrier, which is not necessary anyway.
> >  * Made alloc_hfi_ipcc_scores() return -ENOMEM on allocation failure.
> >    (Rafael)
> >  * Added a comment to describe hfi_ipcc_scores. (Rafael)
> >
> > Changes since v2:
> >  * Only create these per-CPU variables when Intel Thread Director is
> >    supported.
> >
> > Changes since v1:
> >  * Added this patch.
> > ---
> >  drivers/thermal/intel/intel_hfi.c | 66 +++++++++++++++++++++++++++++++
> >  1 file changed, 66 insertions(+)
> >
> > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > index 20ee4264dcd4..d822ed0bb5c1 100644
> > --- a/drivers/thermal/intel/intel_hfi.c
> > +++ b/drivers/thermal/intel/intel_hfi.c
> > @@ -29,9 +29,11 @@
> >  #include <linux/kernel.h>
> >  #include <linux/math.h>
> >  #include <linux/mutex.h>
> > +#include <linux/percpu.h>
> >  #include <linux/percpu-defs.h>
> >  #include <linux/printk.h>
> >  #include <linux/processor.h>
> > +#include <linux/seqlock.h>
> >  #include <linux/slab.h>
> >  #include <linux/spinlock.h>
> >  #include <linux/string.h>
> > @@ -180,6 +182,62 @@ static struct workqueue_struct *hfi_updates_wq;
> >  #define HFI_UPDATE_INTERVAL            HZ
> >  #define HFI_MAX_THERM_NOTIFY_COUNT     16
> >
> > +/* A cache of the HFI perf capabilities for lockless access. */
> > +static int __percpu *hfi_ipcc_scores;
> > +/* Sequence counter for hfi_ipcc_scores */
> > +static seqcount_t hfi_ipcc_seqcount = SEQCNT_ZERO(hfi_ipcc_seqcount);
> > +
> > +static int alloc_hfi_ipcc_scores(void)
> > +{
> > +       if (!cpu_feature_enabled(X86_FEATURE_ITD))
> > +               return 0;
> > +
> > +       hfi_ipcc_scores = __alloc_percpu(sizeof(*hfi_ipcc_scores) *
> > +                                        hfi_features.nr_classes,
> > +                                        sizeof(*hfi_ipcc_scores));
> > +
> > +       return hfi_ipcc_scores ? 0 : -ENOMEM;
> > +}
> 
> This doesn't need to return an int.  It could be a bool function
> returning !!hfi_ipcc_scores (or false for the feature missing case).

Sure Rafael, I can make this change.

> 
> Apart from this minor thing, the patch looks reasonable to me.

Thank you for your review!


* Re: [PATCH v4 06/24] sched/fair: Collect load-balancing stats for IPC classes
  2023-06-26 19:52       ` Tim Chen
@ 2023-07-06 23:40         ` Ricardo Neri
  0 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-07-06 23:40 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ionela Voinescu, Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Lukasz Luba, Zhao Liu, Yuan,
	Perry, x86, Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

On Mon, Jun 26, 2023 at 12:52:59PM -0700, Tim Chen wrote:
> 
> > > 
> > > nit: the comment is unnecessary, and a bit misleading, IMO.
> > > 
> > > The comment says "This group will not be selected." but the only way to
> > > guarantee that here is to reset the sum_score to 0 when you find an
> > > invalid score, which I don't believe is your intention.
> > 
> > You raise an interesting point. We will call this function for each rq
> > in the sched group. Thus, if we encounter an error, the scores would be
> > incomplete. Therefore, I think the scores should be discarded and reset
> > to 0 so that they are not used.
> 
> Since we still have other rqs to loop through, resetting to 0 here does
> not guarantee that it will stay that way at the end of the loop, as the
> sum could still be added to.

That is true; to discard the scores we would need an indication that we
should not keep iterating on the runqueues of this group.

> We may need a special value like -EINVAL, and to make the score a
> "long", to mark such a case.

IIUC, unsigned longs can carry negative error values if you use
IS_ERR_VALUE(); but yes, it looks odd.
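
For reference, a micro-example of the idiom: IS_ERR_VALUE() only treats
the top -MAX_ERRNO..-1 slice of the unsigned range as errors, so a
negative errno stored in an unsigned long is still detectable:

	unsigned long score = arch_get_ipcc_score(p->ipcc, dst_cpu);

	if (IS_ERR_VALUE(score))	/* e.g. (unsigned long)-EINVAL */
		return;			/* real scores are small values, far below this range */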


* Re: [PATCH v4 07/24] sched/fair: Compute IPC class scores for load balancing
  2023-06-26 21:01       ` Tim Chen
@ 2023-07-06 23:48         ` Ricardo Neri
  0 siblings, 0 replies; 43+ messages in thread
From: Ricardo Neri @ 2023-07-06 23:48 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ionela Voinescu, Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Ricardo Neri, Ravi V. Shankar,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Len Brown, Mel Gorman, Rafael J. Wysocki, Srinivas Pandruvada,
	Steven Rostedt, Valentin Schneider, Lukasz Luba, Zhao Liu, Yuan,
	Perry, x86, Joel Fernandes (Google),
	linux-kernel, linux-pm, Tim C . Chen, Zhao Liu

On Mon, Jun 26, 2023 at 02:01:25PM -0700, Tim Chen wrote:
> On Sun, 2023-06-25 at 13:11 -0700, Ricardo Neri wrote:
> > 
> > > > +
> > > > +	score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
> > > > +
> > > > +	/*
> > > > +	 * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
> > > > +	 * and not used.
> > > > +	 */
> 
> The comment does not match the check below. If zero
> is not used, the check should also reflect that case.

Agreed. This comment is not clear. I meant to say that returning here
has the effect of leaving the `before` and `after` scores of this group as
zero.

Since zero is the minimum possible score, this group will not be selected
during the tie breaker, unless the statistics of all other groups are also
zero.
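
Concretely (made-up numbers): if this group errors out and keeps
(before, after) = (0, 0) while another candidate reports, say,
(300, 650), any "prefer the higher ipcc_score_after" comparison picks
the other group; the zeros can only win if every other group also
scored zero.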

> 
> > > > +	if (IS_ERR_VALUE(score_on_dst_cpu))
> > > > +		return;
> > > > +
> > > > +	before = sgs->sum_score;
> > > > +	after = before - sgs->min_score;
> > > 
> > 
> Tim


Thread overview: 43+ messages
2023-06-13  4:23 [PATCH v4 00/24] sched: Introduce classes of tasks for load balance Ricardo Neri
2023-06-13  4:23 ` [PATCH v4 01/24] sched/task_struct: Introduce IPC classes of tasks Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 02/24] sched: Add interfaces for IPC classes Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 03/24] sched/core: Initialize the IPC class of a new task Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 04/24] sched/core: Add user_tick as argument to scheduler_tick() Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 05/24] sched/core: Update the IPC class of the current task Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 06/24] sched/fair: Collect load-balancing stats for IPC classes Ricardo Neri
2023-06-14  0:29   ` Ricardo Neri
2023-06-22  9:01   ` Ionela Voinescu
2023-06-24  0:01     ` Ricardo Neri
2023-06-26 19:52       ` Tim Chen
2023-07-06 23:40         ` Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 07/24] sched/fair: Compute IPC class scores for load balancing Ricardo Neri
2023-06-22  9:02   ` Ionela Voinescu
2023-06-25 20:11     ` Ricardo Neri
2023-06-26 21:01       ` Tim Chen
2023-07-06 23:48         ` Ricardo Neri
2023-06-27 15:19       ` Ionela Voinescu
2023-06-13  4:24 ` [PATCH v4 08/24] sched/fair: Use IPCC stats to break ties between asym_packing sched groups Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 09/24] sched/fair: Use IPCC stats to break ties between fully_busy SMT groups Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 10/24] sched/fair: Use IPCC scores to select a busiest runqueue Ricardo Neri
2023-06-22  9:03   ` Ionela Voinescu
2023-06-24  0:25     ` Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 11/24] thermal: intel: hfi: Introduce Intel Thread Director classes Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 12/24] x86/cpufeatures: Add the Intel Thread Director feature definitions Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 13/24] x86/sched: Update the IPC class of the current task Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 14/24] thermal: intel: hfi: Store per-CPU IPCC scores Ricardo Neri
2023-06-29 18:53   ` Rafael J. Wysocki
2023-07-06 23:23     ` Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 15/24] thermal: intel: hfi: Report the IPC class score of a CPU Ricardo Neri
2023-06-29 18:56   ` Rafael J. Wysocki
2023-07-06 23:10     ` Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 16/24] thermal: intel: hfi: Define a default class for unclassified tasks Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 17/24] thermal: intel: hfi: Enable the Intel Thread Director Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 18/24] sched/task_struct: Add helpers for IPC classification Ricardo Neri
2023-06-22 10:20   ` Ionela Voinescu
2023-06-25 20:23     ` Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 19/24] sched/core: Initialize helpers of task classification Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 20/24] sched/fair: Introduce sched_smt_siblings_idle() Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 21/24] x86/sched/ipcc: Implement model-specific checks for task classification Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 22/24] x86/cpufeatures: Add feature bit for HRESET Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 23/24] x86/hreset: Configure history reset Ricardo Neri
2023-06-13  4:24 ` [PATCH v4 24/24] x86/process: Reset hardware history in context switch Ricardo Neri
