linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v2 0/4] Introduce group balancer
@ 2022-03-08  9:26 Tianchen Ding
  2022-03-08  9:26 ` [RFC PATCH v2 1/4] sched, cpuset: Introduce infrastructure of " Tianchen Ding
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Tianchen Ding @ 2022-03-08  9:26 UTC (permalink / raw)
  To: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Tejun Heo,
	Johannes Weiner, Tianchen Ding, Michael Wang, Cruz Zhao,
	Masahiro Yamada, Nathan Chancellor, Kees Cook, Andrew Morton,
	Vlastimil Babka, Gustavo A. R. Silva, Arnd Bergmann,
	Miguel Ojeda, Chris Down, Vipin Sharma, Daniel Borkmann
  Cc: linux-kernel, cgroups

Modern platforms are growing fast in CPU count. To achieve better
utilization of CPU resources, multiple apps are starting to share the CPUs.

What we need is a way to ease contention in share mode and make groups
as exclusive as possible, to gain both performance and resource
efficiency.

The main idea of the group balancer is to fulfill this requirement
by balancing groups of tasks among groups of CPUs; consider this a
dynamic demi-exclusive mode. A task triggers work to settle its group
into a proper partition (the one with the minimum predicted load) and
then tries to migrate itself into it, so that groups gradually settle
into the most exclusive partitions.

GB can be seen as an optimization policy built on top of load
balancing: it obeys the main idea of the load balancer and makes
adjustments based on that.

Our test on an ARM64 platform with 128 CPUs shows that the throughput
of sysbench memory is improved by about 25%, and redis-benchmark is
improved by up to about 10%.

See each patch for details:
The 1st patch introduces the infrastructure.
The 2nd patch handles the input of partition info.
The 3rd patch is the main part of the group balancer.
The 4th patch adds stats.

v2:
Put the partition info and period settings into the cpuset subsys of cgroup v2.

v1: https://lore.kernel.org/all/98f41efd-74b2-198a-839c-51b785b748a6@linux.alibaba.com/

Michael Wang (1):
  sched: Introduce group balancer

Tianchen Ding (3):
  sched, cpuset: Introduce infrastructure of group balancer
  cpuset: Handle input of partition info for group balancer
  cpuset, gb: Add stat for group balancer

 include/linux/cpuset.h   |   5 +
 include/linux/sched.h    |   5 +
 include/linux/sched/gb.h |  70 ++++++
 init/Kconfig             |  12 +
 kernel/cgroup/cpuset.c   | 405 +++++++++++++++++++++++++++++++-
 kernel/sched/Makefile    |   1 +
 kernel/sched/core.c      |   5 +
 kernel/sched/debug.c     |  10 +-
 kernel/sched/fair.c      |  26 ++-
 kernel/sched/gb.c        | 487 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h     |  14 ++
 11 files changed, 1037 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/sched/gb.h
 create mode 100644 kernel/sched/gb.c

-- 
2.27.0



* [RFC PATCH v2 1/4] sched, cpuset: Introduce infrastructure of group balancer
  2022-03-08  9:26 [RFC PATCH v2 0/4] Introduce group balancer Tianchen Ding
@ 2022-03-08  9:26 ` Tianchen Ding
  2022-03-08  9:26 ` [RFC PATCH v2 2/4] cpuset: Handle input of partition info for " Tianchen Ding
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Tianchen Ding @ 2022-03-08  9:26 UTC (permalink / raw)
  To: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Tejun Heo,
	Johannes Weiner, Tianchen Ding, Michael Wang, Cruz Zhao,
	Masahiro Yamada, Nathan Chancellor, Kees Cook, Andrew Morton,
	Vlastimil Babka, Gustavo A. R. Silva, Arnd Bergmann,
	Miguel Ojeda, Chris Down, Vipin Sharma, Daniel Borkmann
  Cc: linux-kernel, cgroups

Introduce the Kconfig option and structs for the group balancer.
Add a few interfaces on cpuset and debugfs to read and write params.

In detail:
cgroup:
  cpuset.gb.period_us:
    Controls the work period.
    Write a positive number to enable the group balancer
    for all descendants EXCEPT itself.
    Write 0 to disable it.
    Max 1000000 (1s).
    Can only be written after the partition info is set.

  cpuset.gb.partition:
    Partition info used by the group balancer.
    Tasks in child cgroups (EXCEPT this cgroup itself) will try to
    settle on one of the given partitions when gb is enabled.

/sys/kernel/debug/sched/group_balancer/:
  settle_period_ms:
    Default settle period for each group.

  settle_period_max_ms:
    Max settle period for each group.

  global_settle_period_ms:
    Global settle period for all groups.
    For one group, a settle operation must satisfy
    both its own period and the global one.
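
How the two periods are meant to gate a settle attempt can be sketched
as follows. settle_allowed() is an illustrative helper, not part of
this series; gi->settle_next is the per-group stamp in struct gb_info
below, and global_settle_next is the global stamp added by a later
patch:

  /*
   * A group may only attempt to settle once both its own settle
   * period and the global settle period have elapsed.
   */
  static bool settle_allowed(struct gb_info *gi)
  {
          return time_after(jiffies, global_settle_next) &&
                 time_after(jiffies, gi->settle_next);
  }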

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
 include/linux/sched/gb.h |  29 +++++++++++
 init/Kconfig             |  12 +++++
 kernel/cgroup/cpuset.c   | 109 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/Makefile    |   1 +
 kernel/sched/debug.c     |  10 +++-
 kernel/sched/gb.c        |  14 +++++
 kernel/sched/sched.h     |   4 ++
 7 files changed, 178 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/sched/gb.h
 create mode 100644 kernel/sched/gb.c

diff --git a/include/linux/sched/gb.h b/include/linux/sched/gb.h
new file mode 100644
index 000000000000..63c14289a748
--- /dev/null
+++ b/include/linux/sched/gb.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ *  Group Balancer header
+ *
+ *  Copyright (C) 2021 Alibaba, Inc., Michael Wang <yun.wang@linux.alibaba.com>
+ */
+#ifndef _LINUX_SCHED_GB_H
+#define _LINUX_SCHED_GB_H
+
+#include <linux/sched.h>
+
+struct gb_info {
+	/* ---for ancestor as controller--- */
+	/* Period(ns) for task work in task tick, 0 means disabled. */
+	u64 gb_period;
+
+	/* ---for descendants who scheduled by group balancer--- */
+	/* Period for settling to specific partition. */
+	unsigned long settle_period;
+
+	/* Stamp to record next settling. */
+	unsigned long settle_next;
+};
+
+#ifdef CONFIG_GROUP_BALANCER
+extern unsigned int sysctl_gb_settle_period;
+#endif
+
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 8b79eda8e1d4..8f05e5e7e299 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1043,6 +1043,18 @@ config RT_GROUP_SCHED
 	  realtime bandwidth for them.
 	  See Documentation/scheduler/sched-rt-group.rst for more information.
 
+config GROUP_BALANCER
+	bool "Group balancer support for SCHED_OTHER"
+	depends on (CGROUP_SCHED && SMP)
+	default CGROUP_SCHED
+	help
+	  This feature allows you to do load balancing in group mode, in
+	  other words, balancing groups of tasks among groups of CPUs.
+
+	  This can reduce conflicts between task groups and gain benefit
+	  from hot caches and affinity domains, usually in cases where
+	  multiple apps share the same CPUs.
+
 endif #CGROUP_SCHED
 
 config UCLAMP_TASK_GROUP
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index ef88cc366bb8..0349f3f64e3d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -65,6 +65,7 @@
 #include <linux/mutex.h>
 #include <linux/cgroup.h>
 #include <linux/wait.h>
+#include <linux/sched/gb.h>
 
 DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
 DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
@@ -170,6 +171,14 @@ struct cpuset {
 
 	/* Handle for cpuset.cpus.partition */
 	struct cgroup_file partition_file;
+
+#ifdef CONFIG_GROUP_BALANCER
+	/* Partition info and configs of group balancer */
+	struct gb_info gi;
+
+	/* Points to the controller (in an ancestor cgroup) that holds the partition info */
+	struct gb_info *control_gi;
+#endif
 };
 
 /*
@@ -215,6 +224,24 @@ static inline struct cpuset *parent_cs(struct cpuset *cs)
 	return css_cs(cs->css.parent);
 }
 
+#ifdef CONFIG_GROUP_BALANCER
+static inline struct gb_info *cs_gi(struct cpuset *cs)
+{
+	return &cs->gi;
+}
+static struct gb_info *css_gi(struct cgroup_subsys_state *css, bool get_controller)
+{
+	struct cpuset *cs = css_cs(css);
+
+	return get_controller ? cs->control_gi : &cs->gi;
+}
+#else
+static inline struct gb_info *cs_gi(struct cpuset *cs)
+{
+	return NULL;
+}
+#endif /* CONFIG_GROUP_BALANCER */
+
 /* bits in struct cpuset flags field */
 typedef enum {
 	CS_ONLINE,
@@ -2365,6 +2392,7 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_GB_CPULIST,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
@@ -2510,6 +2538,69 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	return retval ?: nbytes;
 }
 
+#ifdef CONFIG_GROUP_BALANCER
+static inline void init_gb(struct cpuset *cs)
+{
+	struct gb_info *gi = cs_gi(cs);
+
+	gi->settle_period = msecs_to_jiffies(sysctl_gb_settle_period);
+	gi->settle_next = jiffies + gi->settle_period;
+}
+
+static u64 gb_period_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct gb_info *gi = css_gi(css, false);
+
+	return gi->gb_period / NSEC_PER_USEC;
+}
+
+#define MAX_GB_PERIOD USEC_PER_SEC /* 1s */
+
+static int gb_period_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	struct cpuset *cs = css_cs(css);
+	struct gb_info *gi = cs_gi(cs);
+
+	if (val > MAX_GB_PERIOD)
+		return -EINVAL;
+
+	percpu_down_write(&cpuset_rwsem);
+
+	gi->gb_period = val * NSEC_PER_USEC;
+
+	percpu_up_write(&cpuset_rwsem);
+	return 0;
+}
+
+static void gb_partition_read(struct seq_file *sf, struct gb_info *gi)
+{
+	seq_putc(sf, '\n');
+}
+
+static ssize_t gb_partition_write(struct kernfs_open_file *of, char *buf,
+				     size_t nbytes, loff_t off)
+{
+	struct cgroup_subsys_state *css = of_css(of);
+	struct cpuset *cs = css_cs(css);
+	int retval = -ENODEV;
+
+	cpus_read_lock();
+	percpu_down_write(&cpuset_rwsem);
+	if (!is_cpuset_online(cs))
+		goto out_unlock;
+
+	retval = 0;
+
+out_unlock:
+	percpu_up_write(&cpuset_rwsem);
+	cpus_read_unlock();
+	return retval ?: nbytes;
+}
+#else
+static inline void init_gb(struct cpuset *cs) { }
+static inline void gb_partition_read(struct seq_file *sf, struct gb_info *gi) { }
+#endif /* CONFIG_GROUP_BALANCER */
+
 /*
  * These ascii lists should be read in a single call, by using a user
  * buffer large enough to hold the entire map.  If read in smaller
@@ -2542,6 +2633,9 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	case FILE_SUBPARTS_CPULIST:
 		seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->subparts_cpus));
 		break;
+	case FILE_GB_CPULIST:
+		gb_partition_read(sf, cs_gi(cs));
+		break;
 	default:
 		ret = -EINVAL;
 	}
@@ -2803,7 +2897,21 @@ static struct cftype dfl_files[] = {
 		.private = FILE_SUBPARTS_CPULIST,
 		.flags = CFTYPE_DEBUG,
 	},
+#ifdef CONFIG_GROUP_BALANCER
+	{
+		.name = "gb.period_us",
+		.read_u64 = gb_period_read_u64,
+		.write_u64 = gb_period_write_u64,
+	},
 
+	{
+		.name = "gb.partition",
+		.seq_show = cpuset_common_seq_show,
+		.write = gb_partition_write,
+		.max_write_len = (100U + 6 * NR_CPUS),
+		.private = FILE_GB_CPULIST,
+	},
+#endif
 	{ }	/* terminate */
 };
 
@@ -2870,6 +2978,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 		cs->effective_mems = parent->effective_mems;
 		cs->use_parent_ecpus = true;
 		parent->child_ecpus_count++;
+		init_gb(cs);
 	}
 	spin_unlock_irq(&callback_lock);
 
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index c83b37af155b..cb5b86bb161e 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -40,3 +40,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
 obj-$(CONFIG_SCHED_CORE) += core_sched.o
+obj-$(CONFIG_GROUP_BALANCER) += gb.o
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 102d6f70e84d..1800bdfe1d61 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -299,7 +299,7 @@ static struct dentry *debugfs_sched;
 
 static __init int sched_init_debug(void)
 {
-	struct dentry __maybe_unused *numa;
+	struct dentry __maybe_unused *numa, *gb;
 
 	debugfs_sched = debugfs_create_dir("sched", NULL);
 
@@ -336,6 +336,14 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_scan_size);
 #endif
 
+#ifdef CONFIG_GROUP_BALANCER
+	gb = debugfs_create_dir("group_balancer", debugfs_sched);
+
+	debugfs_create_u32("settle_period_ms", 0644, gb, &sysctl_gb_settle_period);
+	debugfs_create_u32("settle_period_max_ms", 0644, gb, &sysctl_gb_settle_period_max);
+	debugfs_create_u32("global_settle_period_ms", 0644, gb, &sysctl_gb_global_settle_period);
+#endif
+
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
 	return 0;
diff --git a/kernel/sched/gb.c b/kernel/sched/gb.c
new file mode 100644
index 000000000000..4a2876ec1cca
--- /dev/null
+++ b/kernel/sched/gb.c
@@ -0,0 +1,14 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  Group Balancer code
+ *
+ *  Copyright (C) 2021 Alibaba, Inc., Michael Wang <yun.wang@linux.alibaba.com>
+ */
+#include <linux/sched/gb.h>
+
+#include "sched.h"
+
+/* Params about settle period. (ms) */
+unsigned int sysctl_gb_settle_period = 200;
+unsigned int sysctl_gb_settle_period_max = 10000;
+unsigned int sysctl_gb_global_settle_period = 200;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ac1f9470c617..d6555aacdf78 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2433,6 +2433,10 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
+
+extern unsigned int sysctl_gb_settle_period;
+extern unsigned int sysctl_gb_settle_period_max;
+extern unsigned int sysctl_gb_global_settle_period;
 #endif
 
 #ifdef CONFIG_SCHED_HRTICK
-- 
2.27.0



* [RFC PATCH v2 2/4] cpuset: Handle input of partition info for group balancer
  2022-03-08  9:26 [RFC PATCH v2 0/4] Introduce group balancer Tianchen Ding
  2022-03-08  9:26 ` [RFC PATCH v2 1/4] sched, cpuset: Introduce infrastructure of " Tianchen Ding
@ 2022-03-08  9:26 ` Tianchen Ding
  2022-03-08  9:26 ` [RFC PATCH v2 3/4] sched: Introduce " Tianchen Ding
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Tianchen Ding @ 2022-03-08  9:26 UTC (permalink / raw)
  To: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Tejun Heo,
	Johannes Weiner, Tianchen Ding, Michael Wang, Cruz Zhao,
	Masahiro Yamada, Nathan Chancellor, Kees Cook, Andrew Morton,
	Vlastimil Babka, Gustavo A. R. Silva, Arnd Bergmann,
	Miguel Ojeda, Chris Down, Vipin Sharma, Daniel Borkmann
  Cc: linux-kernel, cgroups

Partition info is the key parameter of the group balancer. It
represents the division of the effective cpus: the cpus are divided
into several parts, and the group balancer will select one part for
each cgroup (called settling) and try to gather all tasks of that
cgroup onto the cpus of the selected part.

For example, the effective cpulist is "0-63". A valid input of partition
info can be "0-15;16-31;32-47;48-63;". This divides the cpus into 4 parts,
each with 16 cpus.
Intersections between parts or conflicts with the effective cpus are not
allowed, e.g., "0-40;30-63" or "0-31;32-70;".
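
These rules map onto the checks in __gb_partition_write() below. A
condensed sketch (allocation, locking and the CPU_PART_MAX limit are
omitted, and strsep() replaces the open-coded scan):

  while ((tok = strsep(&buf, ";")) != NULL) {
          if (*tok == '\0')
                  continue;
          /* each token must be a valid, non-empty cpulist */
          if (cpulist_parse(tok, cpus_allowed) || cpumask_empty(cpus_allowed))
                  return -EINVAL;
          /* no overlap with the parts collected so far */
          if (cpumask_intersects(summask, cpus_allowed))
                  return -EINVAL;
          cpumask_or(summask, summask, cpus_allowed);
          nr_part++;
  }
  /* all parts must fit in effective_cpus, and at least 2 are required */
  if (!cpumask_subset(summask, cs->effective_cpus) || nr_part < 2)
          return -EINVAL;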

Once the partition info is set, the corresponding cgroup becomes a
controller. All of its descendants are managed by the group balancer,
according to the params set in this controller cgroup.
A cgroup can become a controller only when none of its ancestors or
descendants is a controller.

To unset a controller, write an empty string to it
(i.e., echo > cpuset.gb.partition).
This clears its partition info and resets the relationship between
itself and its descendants.

Setting, modifying, or unsetting the partition info must be done while
the group balancer is disabled (i.e., gb.period_us == 0).

ROOT_CG
   |
   |
   |_________
   |         |
 CG_A       CG_B(controller, gb.partition = "...", gb.period_us > 0)
             |
             |
             |________________
             |                |
           CG_C(gb active)  CG_D(gb active)
               (prefer=0)       (prefer=3)

For example, if we write the valid partition info "0-15;16-31;32-47;48-63;"
and a positive period to CG_B, the group balancer is enabled for all
descendants of CG_B. All tasks in CG_C will tend to run on the cpus of
part0 (0-15), and all tasks in CG_D will tend to run on part3 (48-63).

After setting the partition info, CG_B becomes a controller, so setting
another partition info on ROOT_CG, CG_C or CG_D is not allowed until
CG_B is disabled and unset, while CG_A can still be set.
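
The "only one controller on any path" rule corresponds to two checks in
gb_partition_write() below, roughly (locking omitted):

  /* an ancestor is already a controller: control_gi points at it */
  if (css_gi(css, true))
          return -EINVAL;

  /* a descendant is already a controller: it has its own part_info */
  cpuset_for_each_descendant_pre(cp, pos_css, cs) {
          if (cp != cs && cs_gi(cp)->part_info)
                  return -EINVAL;
  }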

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
 include/linux/sched/gb.h |  29 +++++
 kernel/cgroup/cpuset.c   | 245 +++++++++++++++++++++++++++++++++++++--
 2 files changed, 267 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched/gb.h b/include/linux/sched/gb.h
index 63c14289a748..b03a2b4ef4b7 100644
--- a/include/linux/sched/gb.h
+++ b/include/linux/sched/gb.h
@@ -8,9 +8,30 @@
 #define _LINUX_SCHED_GB_H
 
 #include <linux/sched.h>
+#include <linux/cpumask.h>
+
+#define CPU_PART_MAX		32
+struct gb_part {
+	int id;
+	struct cpumask cpus;
+};
+
+struct gb_part_info {
+	int nr_part;
+	struct rcu_head rcu_head;
+	struct gb_part parts[CPU_PART_MAX];
+	int ctop[NR_CPUS];
+};
 
 struct gb_info {
 	/* ---for ancestor as controller--- */
+	/*
+	 * Partition info. Non-NULL means this cgroup is acting as a controller,
+	 * while NULL means it is not a controller.
+	 * (Maybe a descendant of a controller, or not managed by gb at all.)
+	 */
+	struct gb_part_info *part_info;
+
 	/* Period(ns) for task work in task tick, 0 means disabled. */
 	u64 gb_period;
 
@@ -22,6 +43,14 @@ struct gb_info {
 	unsigned long settle_next;
 };
 
+#define for_each_gbpart(i, pi)			\
+	for (i = 0; i < (pi)->nr_part; i++)
+
+static inline struct cpumask *part_cpus(struct gb_part_info *pi, int id)
+{
+	return &pi->parts[id].cpus;
+}
+
 #ifdef CONFIG_GROUP_BALANCER
 extern unsigned int sysctl_gb_settle_period;
 #endif
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0349f3f64e3d..4b456b379b87 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -683,13 +683,24 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 	if (cur == &top_cpuset)
 		goto out;
 
+	ret = -EINVAL;
+
+#ifdef CONFIG_GROUP_BALANCER
+	/*
+	 * If group balancer part_info is set, this cgroup acts as a controller.
+	 * Changing the cpumask is not allowed until it is unset
+	 * by writing an empty string to cpuset.gb.partition.
+	 */
+	if (cs_gi(cur)->part_info)
+		goto out;
+#endif
+
 	par = parent_cs(cur);
 
 	/*
 	 * If either I or some sibling (!= me) is exclusive, we can't
 	 * overlap
 	 */
-	ret = -EINVAL;
 	cpuset_for_each_child(c, css, par) {
 		if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) &&
 		    c != cur &&
@@ -2539,12 +2550,77 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 }
 
 #ifdef CONFIG_GROUP_BALANCER
-static inline void init_gb(struct cpuset *cs)
+static void free_gb_part_info(struct rcu_head *rcu_head)
+{
+	struct gb_part_info *pi = container_of(rcu_head, struct gb_part_info, rcu_head);
+
+	kfree(pi);
+}
+
+static inline void update_child_gb_controller(struct cpuset *cs, struct gb_info *gi)
+{
+	struct cpuset *cp;
+	struct cgroup_subsys_state *pos_css;
+
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
+		if (cp == cs)
+			continue;
+
+		cp->control_gi = gi;
+	}
+	rcu_read_unlock();
+}
+
+static inline void update_gb_part_info(struct cpuset *cs, struct gb_part_info *new)
 {
 	struct gb_info *gi = cs_gi(cs);
+	struct gb_part_info *old;
+
+	if (gi->part_info && !new) {
+		/*
+		 * We are clearing partition info.
+		 * This cgroup is no longer a controller.
+		 * Reset all descendants.
+		 */
+		update_child_gb_controller(cs, NULL);
+	}
+
+	old = xchg(&gi->part_info, new);
+
+	if (old) {
+		call_rcu(&old->rcu_head, free_gb_part_info);
+	} else if (new) {
+		/*
+		 * This cgroup is newly becoming a controller.
+		 * Set relationship between cs and descendants.
+		 */
+		update_child_gb_controller(cs, gi);
+	}
+}
+
+static inline void init_gb(struct cpuset *cs)
+{
+	struct gb_info *gi = cs_gi(cs), *gi_iter;
+	struct cgroup_subsys_state *css;
 
 	gi->settle_period = msecs_to_jiffies(sysctl_gb_settle_period);
 	gi->settle_next = jiffies + gi->settle_period;
+
+	/* Search upwards to find any existing controller. */
+	for (css = cs->css.parent; css; css = css->parent) {
+		gi_iter = css_gi(css, false);
+		if (gi_iter->part_info) {
+			cs->control_gi = gi_iter;
+			break;
+		}
+	}
+}
+
+static inline void remove_gb(struct cpuset *cs)
+{
+	if (cs_gi(cs)->part_info)
+		update_gb_part_info(cs, NULL);
 }
 
 static u64 gb_period_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
@@ -2560,28 +2636,156 @@ static int gb_period_write_u64(struct cgroup_subsys_state *css, struct cftype *c
 {
 	struct cpuset *cs = css_cs(css);
 	struct gb_info *gi = cs_gi(cs);
+	int retval = -EINVAL;
 
 	if (val > MAX_GB_PERIOD)
-		return -EINVAL;
+		return retval;
 
 	percpu_down_write(&cpuset_rwsem);
 
+	/*
+	 * Cannot enable the group balancer on cgroups
+	 * whose partition info is not set.
+	 */
+	if (!gi->part_info)
+		goto out_unlock;
+
 	gi->gb_period = val * NSEC_PER_USEC;
+	retval = 0;
 
+out_unlock:
 	percpu_up_write(&cpuset_rwsem);
-	return 0;
+	return retval;
 }
 
 static void gb_partition_read(struct seq_file *sf, struct gb_info *gi)
 {
+	struct gb_part_info *pi = gi->part_info;
+	int i;
+
+	if (!pi)
+		return;
+
+	for_each_gbpart(i, pi)
+		seq_printf(sf, "%*pbl;", cpumask_pr_args(part_cpus(pi, i)));
+
 	seq_putc(sf, '\n');
 }
 
+static void build_gb_partition(struct cpumask *cpus_allowed, struct gb_part_info *pi, int id)
+{
+	int i;
+	struct gb_part *part = &pi->parts[id];
+
+	part->id = id;
+	cpumask_copy(&part->cpus, cpus_allowed);
+
+	for_each_cpu(i, &part->cpus)
+		pi->ctop[i] = id;
+}
+
+static int __gb_partition_write(struct cpuset *cs, char *buf, size_t nbytes)
+{
+	struct gb_part_info *new_pi;
+	bool should_stop = false;
+	int ret, retval = -EINVAL, id = 0;
+	char *start, *end;
+	cpumask_var_t summask, cpus_allowed;
+
+	/*
+	 * Writing an empty string clears the partition info,
+	 * so this cgroup is no longer a controller.
+	 */
+	if (nbytes < 2) {
+		update_gb_part_info(cs, NULL);
+		retval = 0;
+		goto out;
+	}
+
+	retval = -ENOMEM;
+	if (!zalloc_cpumask_var(&summask, GFP_KERNEL))
+		goto out;
+	if (!zalloc_cpumask_var(&cpus_allowed, GFP_KERNEL))
+		goto out_free_cpumask;
+	new_pi = kzalloc(sizeof(*new_pi), GFP_KERNEL);
+	if (!new_pi)
+		goto out_free_cpumask2;
+
+	buf = strstrip(buf);
+	memset(new_pi->ctop, -1, sizeof(int) * num_possible_cpus());
+	start = buf;
+	end = strchr(start, ';');
+	retval = -EINVAL;
+
+	/* Handle user input in format of "cpulist1;cpulist2;...;cpulistN;" */
+	for (;;) {
+		if (!end)
+			should_stop = true;
+		else
+			*end = '\0';
+
+		if (*start == '\0')
+			goto next;
+
+		if (new_pi->nr_part >= CPU_PART_MAX) {
+			pr_warn("part number should be no larger than %d\n", CPU_PART_MAX);
+			goto out_free_pi;
+		}
+
+		ret = cpulist_parse(start, cpus_allowed);
+		if (ret || cpumask_empty(cpus_allowed)) {
+			pr_warn("invalid cpulist: %s\n", start);
+			goto out_free_pi;
+		}
+
+		/* There should not be intersections between partitions. */
+		if (cpumask_intersects(summask, cpus_allowed)) {
+			pr_warn("%*pbl intersect with others\n", cpumask_pr_args(cpus_allowed));
+			goto out_free_pi;
+		}
+
+		cpumask_or(summask, summask, cpus_allowed);
+
+		build_gb_partition(cpus_allowed, new_pi, id);
+		id++;
+		new_pi->nr_part++;
+next:
+		if (should_stop)
+			break;
+
+		start = end + 1;
+		end = strchr(start, ';');
+	}
+
+	/*
+	 * Check whether the input is valid.
+	 * Should not conflict with effective_cpus.
+	 */
+	if (!cpumask_subset(summask, cs->effective_cpus) || new_pi->nr_part < 2) {
+		pr_warn("invalid cpulist\n");
+		goto out_free_pi;
+	}
+
+	update_gb_part_info(cs, new_pi);
+
+	retval = 0;
+	goto out_free_cpumask2;
+
+out_free_pi:
+	kfree(new_pi);
+out_free_cpumask2:
+	free_cpumask_var(cpus_allowed);
+out_free_cpumask:
+	free_cpumask_var(summask);
+out:
+	return retval;
+}
+
 static ssize_t gb_partition_write(struct kernfs_open_file *of, char *buf,
 				     size_t nbytes, loff_t off)
 {
-	struct cgroup_subsys_state *css = of_css(of);
-	struct cpuset *cs = css_cs(css);
+	struct cgroup_subsys_state *css = of_css(of), *pos_css;
+	struct cpuset *cs = css_cs(css), *cp;
 	int retval = -ENODEV;
 
 	cpus_read_lock();
@@ -2589,7 +2793,31 @@ static ssize_t gb_partition_write(struct kernfs_open_file *of, char *buf,
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
-	retval = 0;
+	retval = -EINVAL;
+	/* Cannot change gb partition during enabling. */
+	if (cs_gi(cs)->gb_period)
+		goto out_unlock;
+
+	/*
+	 * Cannot set a gb partition on a cgroup whose
+	 * ancestor or descendant is already a controller.
+	 */
+	if (css_gi(css, true))
+		goto out_unlock;
+
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
+		if (cp == cs)
+			continue;
+
+		if (cs_gi(cp)->part_info) {
+			rcu_read_unlock();
+			goto out_unlock;
+		}
+	}
+	rcu_read_unlock();
+
+	retval = __gb_partition_write(cs, buf, nbytes);
 
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
@@ -2598,6 +2826,7 @@ static ssize_t gb_partition_write(struct kernfs_open_file *of, char *buf,
 }
 #else
 static inline void init_gb(struct cpuset *cs) { }
+static inline void remove_gb(struct cpuset *cs) { }
 static inline void gb_partition_read(struct seq_file *sf, struct gb_info *gi) { }
 #endif /* CONFIG_GROUP_BALANCER */
 
@@ -3051,6 +3280,8 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 		parent->child_ecpus_count--;
 	}
 
+	remove_gb(cs);
+
 	cpuset_dec();
 	clear_bit(CS_ONLINE, &cs->flags);
 
-- 
2.27.0



* [RFC PATCH v2 3/4] sched: Introduce group balancer
  2022-03-08  9:26 [RFC PATCH v2 0/4] Introduce group balancer Tianchen Ding
  2022-03-08  9:26 ` [RFC PATCH v2 1/4] sched, cpuset: Introduce infrastructure of " Tianchen Ding
  2022-03-08  9:26 ` [RFC PATCH v2 2/4] cpuset: Handle input of partition info for " Tianchen Ding
@ 2022-03-08  9:26 ` Tianchen Ding
  2022-03-08  9:26 ` [RFC PATCH v2 4/4] cpuset, gb: Add stat for " Tianchen Ding
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Tianchen Ding @ 2022-03-08  9:26 UTC (permalink / raw)
  To: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Tejun Heo,
	Johannes Weiner, Tianchen Ding, Michael Wang, Cruz Zhao,
	Masahiro Yamada, Nathan Chancellor, Kees Cook, Andrew Morton,
	Vlastimil Babka, Gustavo A. R. Silva, Arnd Bergmann,
	Miguel Ojeda, Chris Down, Vipin Sharma, Daniel Borkmann
  Cc: linux-kernel, cgroups

From: Michael Wang <yun.wang@linux.alibaba.com>

Modern platforms are growing fast in CPU count, and multiple
apps sharing one box is very common. They used to have
exclusive cpu settings, but nowadays things are changing.

To achieve better utilization of CPU resources, multiple apps
are starting to share the CPUs. The CPU resources are usually
overcommitted, since the apps' workloads fluctuate.

This introduces performance problems in share mode compared
with exclusive mode. For example, with cgroups A, B and C
deployed in exclusive mode, it will be:

  CPU_X (100%)     CPU_Y (100%)     CPU_Z (50%)
  T_1_CG_A         T_1_CG_B         T_1_CG_C
  T_2_CG_A         T_2_CG_B         T_2_CG_C
  T_3_CG_A         T_3_CG_B
  T_4_CG_A         T_4_CG_B

while the share mode will be:

  CPU_X (100%)     CPU_Y (75%)     CPU_Z (75%)
  T_1_CG_A         T_2_CG_A         T_1_CG_B
  T_2_CG_B         T_3_CG_B         T_2_CG_C
  T_4_CG_B         T_4_CG_A         T_3_CG_A
  T_1_CG_C

As we can see, contention between groups for CPU resources
is now happening all over the CPUs.

Testing with sysbench memory shows a 30+% drop in share
mode, and redis-benchmark shows a 10+% drop too, compared
to exclusive mode.

However, despite the performance drop, in the real world we
still prefer share mode. The fluctuating workload can make
exclusive mode very inefficient in CPU utilization. For example,
in the next period, when CG_A becomes 'idle', exclusive mode
will look like:

  CPU_X (0%)     CPU_Y (100%)     CPU_Z (50%)
                 T_1_CG_B         T_1_CG_C
                 T_2_CG_B         T_2_CG_C
                 T_3_CG_B
                 T_4_CG_B

while share mode looks like:

  CPU_X (50%)     CPU_Y (50%)     CPU_Z (50%)
  T_2_CG_B         T_1_CG_C        T_3_CG_B
  T_4_CG_B         T_1_CG_B        T_2_CG_C

CPU_X is totally wasted in exclusive mode; the resource
efficiency is really poor.

Thus what we need is a way to ease contention in share mode
and make groups as exclusive as possible, to gain both
performance and resource efficiency.

The main idea of the group balancer is to fulfill this requirement
by balancing groups of tasks among groups of CPUs; consider this a
dynamic demi-exclusive mode. A task triggers work to settle its group
into a proper partition (the one with the minimum predicted load) and
then tries to migrate itself into it, so that groups gradually settle
into the most exclusive partitions.

Just like tasks are balanced among CPUs, with GB a user can
now put CPUs X, Y, Z into three partitions and balance groups
A, B, C into these partitions, to make them as exclusive as possible.
GB can be seen as an optimization policy built on top of load
balancing: it obeys the main idea of the load balancer and makes
adjustments based on that.
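
The settle decision itself, simplified from try_to_settle() in this
patch (leaving out the effective_cpus filter, the prediction decay and
the resettle-cost check), boils down to:

  /*
   * Pick the partition with the minimum load once this group's own
   * contribution is subtracted; non-preferred partitions also carry
   * the predicted load left behind by earlier settles.
   */
  for_each_gbpart(i, pi) {
          u64 mgrt = tg_load_of_part(pi, tg, i);
          u64 load = load_of_part(pi, i);

          load -= min(load, mgrt);
          if (i != gi->gb_prefer)
                  load += pi->parts[i].predict_load;

          if (load < min_load) {
                  dst = i;
                  min_load = load;
          }
  }
  gi->gb_prefer = dst;    /* tasks of this group then migrate towards dst */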

How to use:

First, make sure the children of your cgroup enable both the "cpu" and
"cpuset" subsystems, because the group balancer gathers load info from
the task group and partition info from the cpuset group, so tasks
should stay in the same cgroup. Do this recursively if necessary:
  echo "+cpu +cpuset" > $CGROUP_PATH/cgroup.subtree_control

To create partitions, for example run:
  echo "0-15;16-31;32-47;48-63;" > $CGROUP_PATH/cpuset.gb.partition

This will create 4 partitions containing CPUs 0-15, 16-31, 32-47 and
48-63 respectively.

Then, to enable GB for your cgroup, run:
  echo 200000 > $CGROUP_PATH/cpuset.gb.period_us

This will enable GB for all descendants of $CGROUP_PATH, EXCEPT itself.
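
What the period actually controls: at each scheduler tick, a task that
is not yet on its group's preferred partition queues the group
balancing work only after it has consumed another period_us worth of
runtime, roughly as in task_tick_gb() below:

  now = curr->se.sum_exec_runtime;
  period = control_gi->gb_period;

  if (now > curr->gb_stamp + period) {
          curr->gb_stamp = now;
          task_work_add(curr, work, TWA_RESUME);
  }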

Testing Results:
  In order to enlarge the differences, we tested on an ARM platform
  with 128 CPUs and created 8 partitions according to the cluster info.

  Since we picked benchmarks that can gain benefit from exclusive mode,
  this is more of a functional test than a performance test, to show
  that GB helps win back the performance.

  Create 8 cgroups, each running 'sysbench memory --threads=16 run';
  the output of share mode is:
    events/s (eps):                      4939865.8892
    events/s (eps):                      4699033.0351
    events/s (eps):                      4373262.0563
    events/s (eps):                      3534852.1000
    events/s (eps):                      4724359.4354
    events/s (eps):                      3438985.1082
    events/s (eps):                      3600268.9196
    events/s (eps):                      3782130.8202
  the output of gb mode is:
    events/s (eps):                      4919287.0608
    events/s (eps):                      5926525.9995
    events/s (eps):                      4933459.3272
    events/s (eps):                      5184040.1349
    events/s (eps):                      5030940.1116
    events/s (eps):                      5773255.0246
    events/s (eps):                      4649109.0129
    events/s (eps):                      5683217.7641

  Create 4 cgroups, each running redis-server with 16 io threads;
  4 redis-benchmark instances per server show the average rps as:

                    share mode       gb mode      diff
  PING_INLINE        45903.94        46529.9      1.36%
  PING_MBULK         48342.58        50881.99     5.25%
  SET                38681.95        42108.17     8.86%
  GET                46774.09        51067.99     9.18%
  INCR               46092.24        50543.98     9.66%
  LPUSH              41723.35        45464.73     8.97%
  RPUSH              42722.77        47667.76     11.57%
  LPOP               41010.75        45077.65     9.92%
  RPOP               43198.33        47248.05     9.37%
  SADD               44750.16        50253.79     12.30%
  HSET               44352           47940.12     8.09%
  SPOP               47436.64        51658.99     8.90%
  ZADD               43124.02        45992.96     6.65%
  ZPOPMIN            46854.7         51561.52     10.05%
  LPUSH              41723.35        45464.73     8.97%
  LRANGE_100         22411.47        23311.32     4.02%
  LRANGE_300         11323.8         11585.06     2.31%
  LRANGE_500         7516.12         7577.76      0.82%
  LRANGE_600         6632.1          6737.31      1.59%
  MSET               27945.01        29401.3      5.21%

Co-developed-by: Cruz Zhao <cruzzhao@linux.alibaba.com>
Signed-off-by: Cruz Zhao <cruzzhao@linux.alibaba.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
 include/linux/cpuset.h   |   5 +
 include/linux/sched.h    |   5 +
 include/linux/sched/gb.h |  10 +
 kernel/cgroup/cpuset.c   |  39 ++++
 kernel/sched/core.c      |   5 +
 kernel/sched/fair.c      |  26 ++-
 kernel/sched/gb.c        | 448 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h     |  10 +
 8 files changed, 547 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d58e0476ee8e..3be2ab42bb98 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -178,6 +178,11 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+#ifdef CONFIG_GROUP_BALANCER
+struct gb_info *task_gi(struct task_struct *p, bool get_controller);
+cpumask_var_t *gi_cpus(struct gb_info *gi);
+#endif
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e0b5b4a4c8f..fca655770925 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1289,6 +1289,11 @@ struct task_struct {
 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_GROUP_BALANCER
+	u64			gb_stamp;
+	struct callback_head		gb_work;
+#endif
+
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
 	u32 rseq_sig;
diff --git a/include/linux/sched/gb.h b/include/linux/sched/gb.h
index b03a2b4ef4b7..7af91662b740 100644
--- a/include/linux/sched/gb.h
+++ b/include/linux/sched/gb.h
@@ -9,10 +9,14 @@
 
 #include <linux/sched.h>
 #include <linux/cpumask.h>
+#include <linux/seq_file.h>
+#include <linux/cgroup-defs.h>
 
 #define CPU_PART_MAX		32
 struct gb_part {
 	int id;
+	unsigned int mgrt_on;
+	u64 predict_load;
 	struct cpumask cpus;
 };
 
@@ -41,6 +45,12 @@ struct gb_info {
 
 	/* Stamp to record next settling. */
 	unsigned long settle_next;
+
+	/*
+	 * Record preferred partition (in part_info of controller) of this cgroup.
+	 * Default -1.
+	 */
+	int gb_prefer;
 };
 
 #define for_each_gbpart(i, pi)			\
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4b456b379b87..de13c22c1921 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -235,6 +235,20 @@ static struct gb_info *css_gi(struct cgroup_subsys_state *css, bool get_controll
 
 	return get_controller ? cs->control_gi : &cs->gi;
 }
+
+struct gb_info *task_gi(struct task_struct *p, bool get_controller)
+{
+	struct cpuset *cs = task_cs(p);
+
+	return cs ? css_gi(&cs->css, get_controller) : NULL;
+}
+
+cpumask_var_t *gi_cpus(struct gb_info *gi)
+{
+	struct cpuset *cs = container_of(gi, struct cpuset, gi);
+
+	return &cs->effective_cpus;
+}
 #else
 static inline struct gb_info *cs_gi(struct cpuset *cs)
 {
@@ -2572,6 +2586,21 @@ static inline void update_child_gb_controller(struct cpuset *cs, struct gb_info
 	rcu_read_unlock();
 }
 
+static inline void reset_child_gb_prefer(struct cpuset *cs)
+{
+	struct cpuset *cp;
+	struct cgroup_subsys_state *pos_css;
+
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
+		if (cp == cs)
+			continue;
+
+		cs_gi(cp)->gb_prefer = -1;
+	}
+	rcu_read_unlock();
+}
+
 static inline void update_gb_part_info(struct cpuset *cs, struct gb_part_info *new)
 {
 	struct gb_info *gi = cs_gi(cs);
@@ -2606,6 +2635,7 @@ static inline void init_gb(struct cpuset *cs)
 
 	gi->settle_period = msecs_to_jiffies(sysctl_gb_settle_period);
 	gi->settle_next = jiffies + gi->settle_period;
+	gi->gb_prefer = -1;
 
 	/* Search upwards to find any existing controller. */
 	for (css = cs->css.parent; css; css = css->parent) {
@@ -2637,6 +2667,7 @@ static int gb_period_write_u64(struct cgroup_subsys_state *css, struct cftype *c
 	struct cpuset *cs = css_cs(css);
 	struct gb_info *gi = cs_gi(cs);
 	int retval = -EINVAL;
+	bool reset;
 
 	if (val > MAX_GB_PERIOD)
 		return retval;
@@ -2650,9 +2681,17 @@ static int gb_period_write_u64(struct cgroup_subsys_state *css, struct cftype *c
 	if (!gi->part_info)
 		goto out_unlock;
 
+	/* gb_period = 0 means disabling group balancer. */
+	reset = gi->gb_period && !val;
+
 	gi->gb_period = val * NSEC_PER_USEC;
 	retval = 0;
 
+	if (reset) {
+		synchronize_rcu();
+		reset_child_gb_prefer(cs);
+	}
+
 out_unlock:
 	percpu_up_write(&cpuset_rwsem);
 	return retval;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e5365755bdd..d57d6210bcfc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4484,6 +4484,11 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_SMP
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 	RB_CLEAR_NODE(&p->pushable_dl_tasks);
+#endif
+#ifdef CONFIG_GROUP_BALANCER
+	p->gb_stamp = 0;
+	p->gb_work.next = &p->gb_work;
+	init_task_work(&p->gb_work, group_balancing_work);
 #endif
 	return 0;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 10c8dc9b7494..c1bd3f62e39c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1609,6 +1609,9 @@ static void update_numa_stats(struct task_numa_env *env,
 			    !cpumask_test_cpu(cpu, env->p->cpus_ptr))
 				continue;
 
+			if (group_hot(env->p, env->src_cpu, cpu) == 0)
+				continue;
+
 			if (ns->idle_cpu == -1)
 				ns->idle_cpu = cpu;
 
@@ -1944,6 +1947,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
 			continue;
 
+		if (group_hot(env->p, env->src_cpu, cpu) == 0)
+			continue;
+
 		env->dst_cpu = cpu;
 		if (task_numa_compare(env, taskimp, groupimp, maymove))
 			break;
@@ -6050,8 +6056,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
 	schedstat_inc(p->stats.nr_wakeups_affine_attempts);
-	if (target == nr_cpumask_bits)
+	if (target == nr_cpumask_bits) {
+		if (group_hot(p, prev_cpu, this_cpu) == 1)
+			return this_cpu;
 		return prev_cpu;
+	}
 
 	schedstat_inc(sd->ttwu_move_affine);
 	schedstat_inc(p->stats.nr_wakeups_affine);
@@ -7846,6 +7855,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+	tsk_cache_hot = group_hot(p, env->src_cpu, env->dst_cpu);
+	if (tsk_cache_hot != -1)
+		return tsk_cache_hot;
+
 	/*
 	 * Aggressive migration if:
 	 * 1) active balance
@@ -8266,6 +8279,15 @@ static unsigned long task_h_load(struct task_struct *p)
 	return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
 			cfs_rq_load_avg(cfs_rq) + 1);
 }
+
+#ifdef CONFIG_GROUP_BALANCER
+unsigned long cfs_h_load(struct cfs_rq *cfs_rq)
+{
+	update_cfs_rq_h_load(cfs_rq);
+	return cfs_rq->h_load;
+}
+#endif
+
 #else
 static bool __update_blocked_fair(struct rq *rq, bool *done)
 {
@@ -11213,6 +11235,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
+	task_tick_gb(rq, curr);
+
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
 
diff --git a/kernel/sched/gb.c b/kernel/sched/gb.c
index 4a2876ec1cca..f7da96253ad0 100644
--- a/kernel/sched/gb.c
+++ b/kernel/sched/gb.c
@@ -8,7 +8,455 @@
 
 #include "sched.h"
 
+#define DECAY_PERIOD		HZ
+
 /* Params about settle period. (ms) */
 unsigned int sysctl_gb_settle_period = 200;
 unsigned int sysctl_gb_settle_period_max = 10000;
 unsigned int sysctl_gb_global_settle_period = 200;
+
+static unsigned long global_settle_next;
+static unsigned long predict_period_stamp;
+DEFINE_SPINLOCK(settle_lock);
+
+static inline int ctop(struct gb_part_info *pi, int cpu)
+{
+	return pi->ctop[cpu];
+}
+
+static u64 tg_load_of_part(struct gb_part_info *pi, struct task_group *tg, int id)
+{
+	int i;
+	u64 load = 0;
+
+	for_each_cpu(i, part_cpus(pi, id))
+		load += cfs_h_load(tg->cfs_rq[i]);
+
+	return load;
+}
+
+static u64 load_of_part(struct gb_part_info *pi, int id)
+{
+	int i;
+	u64 load = 0;
+
+	for_each_cpu(i, part_cpus(pi, id))
+		load += cpu_rq(i)->cfs.avg.load_avg;
+
+	return load;
+}
+
+static inline int part_mgrt_lock(struct gb_part_info *pi, int src, int dst)
+{
+	struct gb_part *src_part, *dst_part;
+
+	if (src != -1) {
+		src_part = &pi->parts[src];
+		if (READ_ONCE(src_part->mgrt_on))
+			return 0;
+	}
+
+	if (dst != -1) {
+		dst_part = &pi->parts[dst];
+		if (READ_ONCE(dst_part->mgrt_on))
+			return 0;
+	}
+
+	if (src != -1 && xchg(&src_part->mgrt_on, 1))
+		return 0;
+
+	if (dst != -1 && xchg(&dst_part->mgrt_on, 1)) {
+		WRITE_ONCE(src_part->mgrt_on, 0);
+		return 0;
+	}
+
+	return 1;
+}
+
+static inline void part_mgrt_unlock(struct gb_part_info *pi, int src, int dst)
+{
+	struct gb_part *src_part, *dst_part;
+
+	if (src != -1) {
+		src_part = &pi->parts[src];
+		WRITE_ONCE(src_part->mgrt_on, 0);
+	}
+
+	if (dst != -1) {
+		dst_part = &pi->parts[dst];
+		WRITE_ONCE(dst_part->mgrt_on, 0);
+	}
+}
+
+static u64 cap_of_part(struct gb_part_info *pi, int id)
+{
+	int i;
+	u64 cap = 0;
+
+	for_each_cpu(i, part_cpus(pi, id))
+		cap += cpu_rq(i)->cpu_capacity;
+
+	return cap;
+}
+
+static inline void predict_load_add(struct gb_part_info *pi, int id, u64 load)
+{
+	pi->parts[id].predict_load += load;
+}
+
+static void predict_load_decay(struct gb_part_info *pi)
+{
+	int i, fact, passed;
+
+	passed = jiffies - global_settle_next + predict_period_stamp;
+	predict_period_stamp = passed % DECAY_PERIOD;
+	fact = passed / DECAY_PERIOD;
+
+	if (!fact)
+		return;
+
+	for_each_gbpart(i, pi) {
+		struct gb_part *gp = &pi->parts[i];
+
+		/*
+		 * Decay NICE_0_LOAD into zero after 10 seconds
+		 */
+		if (fact > 10)
+			gp->predict_load = 0;
+		else
+			gp->predict_load >>= fact;
+	}
+}
+
+static int try_to_settle(struct gb_part_info *pi, struct gb_info *gi, struct task_group *tg)
+{
+	int i, src, dst, ret;
+	u64 mgrt_load, tg_load, min_load, src_load, dst_load;
+	cpumask_var_t *effective_cpus = gi_cpus(gi);
+
+	src = dst = -1;
+	min_load = U64_MAX;
+	tg_load = 0;
+	for_each_gbpart(i, pi) {
+		u64 mgrt, load;
+
+		/* DO NOT settle in parts out of effective_cpus. */
+		if (!cpumask_intersects(part_cpus(pi, i), effective_cpus[0]))
+			continue;
+
+		mgrt = tg_load_of_part(pi, tg, i);
+		load = load_of_part(pi, i);
+		/* load after migration */
+		if (load > mgrt)
+			load -= mgrt;
+		else
+			load = 0;
+
+		/*
+		 * Try to find the partition containing the
+		 * minimum load, with the load of this task
+		 * group excluded from the comparison.
+		 *
+		 * This helps prevent a partition full of
+		 * tasks from this task group from being
+		 * considered busy.
+		 *
+		 * As for the predicted load, the partition
+		 * this group prefers is excluded, since that
+		 * predicted load could have been introduced
+		 * by the group itself.
+		 *
+		 * This is not precise, but it serves the idea
+		 * of preferring a partition for as long as
+		 * possible, to save the cost of resettling as
+		 * much as possible.
+		 */
+		if (i == gi->gb_prefer) {
+			src = i;
+			src_load = load + mgrt;
+			mgrt_load = mgrt;
+		} else
+			load += pi->parts[i].predict_load;
+
+		if (load < min_load) {
+			dst = i;
+			min_load = load;
+			dst_load = load + mgrt;
+		}
+
+		tg_load += mgrt;
+	}
+
+	if (!tg_load)
+		return 0;
+
+	ret = 0;
+	gi->settle_period *= 2;
+	if (src == -1) {
+		/* First settle */
+		gi->gb_prefer = dst;
+		predict_load_add(pi, dst, tg_load);
+		ret = 1;
+	} else if (src != dst) {
+		/* Resettle will cost, be careful */
+		long dst_imb, src_imb, dst_cap, src_cap;
+
+		src_cap = cap_of_part(pi, src);
+		dst_cap = cap_of_part(pi, dst);
+
+		/*
+		 * src_load        dst_load
+		 * ------------ vs ---------
+		 * src_capacity    dst_capacity
+		 *
+		 * Should not cause further imbalancing after
+		 * resettle.
+		 */
+		src_imb = abs(src_load * dst_cap - dst_load * src_cap);
+		dst_imb = abs((src_load - mgrt_load) * dst_cap - (dst_load + mgrt_load) * src_cap);
+
+		if (dst_imb <= src_imb) {
+			gi->gb_prefer = dst;
+			predict_load_add(pi, dst, tg_load);
+			gi->settle_period = msecs_to_jiffies(sysctl_gb_settle_period) * 2;
+			ret = 1;
+		}
+	}
+
+	if (gi->settle_period > msecs_to_jiffies(sysctl_gb_settle_period_max))
+		gi->settle_period = msecs_to_jiffies(sysctl_gb_settle_period_max);
+
+	return ret;
+}
+
+/*
+ * group_hot() tells us whether src_cpu or dst_cpu lies in the
+ * preferred CPU partition of the task's group.
+ *
+ * return 1 if dst_cpu is preferred
+ * return 0 if src_cpu is preferred
+ * return -1 if either or neither is preferred
+ */
+int group_hot(struct task_struct *p, int src_cpu, int dst_cpu)
+{
+	int ret = -1;
+	struct task_group *tg;
+	struct gb_info *gi, *control_gi;
+	struct gb_part_info *pi;
+
+	rcu_read_lock();
+
+	control_gi = task_gi(p, true);
+
+	if (!control_gi || !control_gi->gb_period)
+		goto out_unlock;
+
+	gi = task_gi(p, false);
+	pi = control_gi->part_info;
+	tg = task_group(p);
+	if (gi->gb_prefer != -1 && ctop(pi, src_cpu) != ctop(pi, dst_cpu))
+		ret = (gi->gb_prefer == ctop(pi, dst_cpu));
+
+out_unlock:
+	rcu_read_unlock();
+	return ret;
+}
+
+void task_tick_gb(struct rq *rq, struct task_struct *curr)
+{
+	struct callback_head *work = &curr->gb_work;
+	struct gb_info *gi, *control_gi;
+	struct gb_part_info *pi;
+	u64 period, now;
+
+	if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
+		return;
+
+	rcu_read_lock();
+
+	control_gi = task_gi(curr, true);
+
+	if (!control_gi || !control_gi->gb_period)
+		goto unlock;
+
+	gi = task_gi(curr, false);
+	pi = control_gi->part_info;
+
+	/* Save it when already satisfied. */
+	if (gi->gb_prefer != -1 &&
+	    gi->gb_prefer == ctop(pi, task_cpu(curr)))
+		goto unlock;
+
+	now = curr->se.sum_exec_runtime;
+	period = control_gi->gb_period;
+
+	if (now > curr->gb_stamp + period) {
+		curr->gb_stamp = now;
+		task_work_add(curr, work, TWA_RESUME);
+	}
+
+unlock:
+	rcu_read_unlock();
+}
+
+void group_balancing_work(struct callback_head *work)
+{
+	int cpu, this_cpu, this_part, this_prefer, best_cpu;
+	struct task_group *this_tg;
+	struct gb_info *this_gi, *control_gi;
+	struct gb_part_info *pi;
+	struct task_struct *best_task;
+	struct cpumask cpus;
+
+	SCHED_WARN_ON(current != container_of(work, struct task_struct, gb_work));
+
+	work->next = work;
+	if (current->flags & PF_EXITING)
+		return;
+
+	rcu_read_lock();
+	/*
+	 * We build the group balancer on the "cpuset" subsys and gather load
+	 * info from the "cpu" subsys, so we need to ensure these two subsystems
+	 * belong to the same cgroup.
+	 */
+	if (task_cgroup(current, cpuset_cgrp_id) != task_group(current)->css.cgroup)
+		goto unlock;
+
+	control_gi = task_gi(current, true);
+
+	if (!control_gi || !control_gi->gb_period)
+		goto unlock;
+
+	this_gi = task_gi(current, false);
+	pi = control_gi->part_info;
+	this_tg = task_group(current);
+
+	/*
+	 * Settling task groups one by one helps prevent the
+	 * situation where multiple groups try to settle on
+	 * the same partition at the same time.
+	 *
+	 * However, when a bunch of groups try to settle at
+	 * the same time, there is no guarantee of fairness;
+	 * some of them may get more chances and settle
+	 * sooner than the others.
+	 *
+	 * So one trick here is to grow the settle_period of
+	 * a settled group, to make sure it yields the next
+	 * chances to others.
+	 *
+	 * Another trick here is about prediction: as a group
+	 * settling is followed by a bunch of task migrations,
+	 * the current load of a CPU partition can't imply its
+	 * busyness in the future, and we may pick a busy one
+	 * in the end.
+	 *
+	 * Thus we maintain the predicted load after each
+	 * settle, so the next try will be able to use the
+	 * prediction and avoid picking those already busy.
+	 */
+	if (spin_trylock(&settle_lock)) {
+		if (time_after(jiffies, global_settle_next) &&
+		    time_after(jiffies, this_gi->settle_next)) {
+			predict_load_decay(pi);
+
+			global_settle_next = jiffies;
+			if (try_to_settle(pi, this_gi, this_tg))
+				global_settle_next +=
+					msecs_to_jiffies(sysctl_gb_global_settle_period);
+
+			this_gi->settle_next = jiffies + this_gi->settle_period;
+		}
+		spin_unlock(&settle_lock);
+	}
+
+	this_cpu = task_cpu(current);
+	this_prefer = this_gi->gb_prefer;
+	this_part = ctop(pi, this_cpu);
+	if (this_prefer == -1 ||
+	    this_part == this_prefer ||
+	    !part_mgrt_lock(pi, this_part, this_prefer)) {
+		goto unlock;
+	}
+
+	cpumask_copy(&cpus, part_cpus(pi, this_prefer));
+
+	/*
+	 * We arrive here when the current task A does not
+	 * prefer its current CPU, but prefers the CPUs of
+	 * partition Y.
+	 *
+	 * In other words, if task A could run on the CPUs of
+	 * partition Y, it has a good chance to reduce conflict
+	 * with the tasks from other groups, and to share hot
+	 * caches with the tasks from the same group.
+	 *
+	 * So here is the main logic of the group balancer to
+	 * achieve its purpose: make sure groups of tasks are
+	 * balanced into groups of CPUs.
+	 *
+	 * If A's current CPU belongs to CPU partition X, try
+	 * to find a CPU in partition Y which is running a
+	 * task that prefers partition X, and swap them.
+	 *
+	 * Otherwise, or if no such CPU and task can be found,
+	 * just find an idle CPU in partition Y and do the
+	 * migration.
+	 *
+	 * Ideally the migration and swap work will finally
+	 * put the tasks into the right places, but the wakeup
+	 * path can easily break that by locating an idle CPU
+	 * out of the range.
+	 *
+	 * However, since the whole idea is to gain cache
+	 * benefit and reduce conflict between groups, if there
+	 * are enough idle CPUs out there then all is fine.
+	 */
+
+	best_cpu = -1;
+	best_task = NULL;
+	for_each_cpu_and(cpu, &cpus, current->cpus_ptr) {
+		struct task_struct *p;
+
+		WARN_ON(cpu == this_cpu);
+
+		if (available_idle_cpu(cpu)) {
+			best_cpu = cpu;
+			continue;
+		}
+
+		if (this_part == -1)
+			continue;
+
+		p = rcu_dereference(cpu_rq(cpu)->curr);
+
+		if (!p || (p->flags & PF_EXITING) || is_idle_task(p)
+		    || p == current || task_gi(p, true) != control_gi)
+			continue;
+
+		if (task_cpu(p) == cpu &&
+		    this_part == task_gi(p, false)->gb_prefer &&
+		    cpumask_test_cpu(this_cpu, p->cpus_ptr)) {
+			get_task_struct(p);
+			best_task = p;
+			best_cpu = cpu;
+			break;
+		}
+	}
+
+	rcu_read_unlock();
+
+	if (best_task) {
+		migrate_swap(current, best_task, best_cpu, this_cpu);
+		put_task_struct(best_task);
+	} else if (best_cpu != -1) {
+		migrate_task_to(current, best_cpu);
+	}
+
+	part_mgrt_unlock(pi, this_part, this_prefer);
+	return;
+
+unlock:
+	rcu_read_unlock();
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d6555aacdf78..97c621bcded3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3148,3 +3148,13 @@ extern int sched_dynamic_mode(const char *str);
 extern void sched_dynamic_update(int mode);
 #endif
 
+#ifdef CONFIG_GROUP_BALANCER
+extern unsigned long cfs_h_load(struct cfs_rq *cfs_rq);
+extern int group_hot(struct task_struct *p, int src_cpu, int dst_cpu);
+extern void task_tick_gb(struct rq *rq, struct task_struct *curr);
+extern void group_balancing_work(struct callback_head *work);
+#else
+static inline int group_hot(struct task_struct *p, int src_cpu, int dst_cpu) { return -1; }
+static inline void task_tick_gb(struct rq *rq, struct task_struct *curr) {};
+static inline void group_balancing_work(struct callback_head *work) {};
+#endif
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH v2 4/4] cpuset, gb: Add stat for group balancer
  2022-03-08  9:26 [RFC PATCH v2 0/4] Introduce group balancer Tianchen Ding
                   ` (2 preceding siblings ...)
  2022-03-08  9:26 ` [RFC PATCH v2 3/4] sched: Introduce " Tianchen Ding
@ 2022-03-08  9:26 ` Tianchen Ding
  2022-03-08 17:13 ` [RFC PATCH v2 0/4] Introduce " Tejun Heo
  2022-04-06  2:47 ` Tianchen Ding
  5 siblings, 0 replies; 12+ messages in thread
From: Tianchen Ding @ 2022-03-08  9:26 UTC (permalink / raw)
  To: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Tejun Heo,
	Johannes Weiner, Tianchen Ding, Michael Wang, Cruz Zhao,
	Masahiro Yamada, Nathan Chancellor, Kees Cook, Andrew Morton,
	Vlastimil Babka, Gustavo A. R. Silva, Arnd Bergmann,
	Miguel Ojeda, Chris Down, Vipin Sharma, Daniel Borkmann
  Cc: linux-kernel, cgroups

When the group balancer is enabled by:
  echo 200000 > $CGROUP_PATH/cpuset.gb.period_us

you can check:
  $CPU_CGROUP_PATH/childX/cpuset.gb.stat

which gives output like:
  PART-0 0-15 1008 1086  *
  PART-1 16-31 0 2
  PART-2 32-47 0 0
  PART-3 48-63 0 1024

Each line shows the partition ID followed by its CPU range, the load of
this group on that partition, the total load of the partition, and a
star marking the preferred partition.
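
A monitoring tool could parse each line with something like the
following sketch, where 'line' is assumed to hold one line of
cpuset.gb.stat and the trailing star is optional:

  char cpus[64], star[4] = "";
  unsigned long long group_load, part_load;
  int id, n;

  n = sscanf(line, "PART-%d %63s %llu %llu %3s",
             &id, cpus, &group_load, &part_load, star);
  if (n >= 4)
          printf("part %d (%s): group %llu, part %llu%s\n",
                 id, cpus, group_load, part_load,
                 star[0] == '*' ? ", preferred" : "");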

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
 include/linux/sched/gb.h |  2 ++
 kernel/cgroup/cpuset.c   | 24 ++++++++++++++++++++++++
 kernel/sched/gb.c        | 25 +++++++++++++++++++++++++
 3 files changed, 51 insertions(+)

diff --git a/include/linux/sched/gb.h b/include/linux/sched/gb.h
index 7af91662b740..ec5a97d8160a 100644
--- a/include/linux/sched/gb.h
+++ b/include/linux/sched/gb.h
@@ -63,6 +63,8 @@ static inline struct cpumask *part_cpus(struct gb_part_info *pi, int id)
 
 #ifdef CONFIG_GROUP_BALANCER
 extern unsigned int sysctl_gb_settle_period;
+int gb_stat_show(struct seq_file *sf, struct cgroup_subsys_state *css,
+		 struct gb_info *gi, struct gb_part_info *pi);
 #endif
 
 #endif
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index de13c22c1921..035606e8fa95 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2863,6 +2863,24 @@ static ssize_t gb_partition_write(struct kernfs_open_file *of, char *buf,
 	cpus_read_unlock();
 	return retval ?: nbytes;
 }
+
+static int gb_stat_seq_show(struct seq_file *sf, void *v)
+{
+	struct cgroup_subsys_state *css = seq_css(sf);
+	struct gb_info *control_gi;
+	int retval = -EINVAL;
+
+	rcu_read_lock();
+	control_gi = css_gi(css, true);
+	if (!control_gi || !control_gi->gb_period)
+		goto out_unlock;
+
+	retval = gb_stat_show(sf, css, css_gi(css, false), control_gi->part_info);
+
+out_unlock:
+	rcu_read_unlock();
+	return retval;
+}
 #else
 static inline void init_gb(struct cpuset *cs) { }
 static inline void remove_gb(struct cpuset *cs) { }
@@ -3179,6 +3197,12 @@ static struct cftype dfl_files[] = {
 		.max_write_len = (100U + 6 * NR_CPUS),
 		.private = FILE_GB_CPULIST,
 	},
+
+	{
+		.name = "gb.stat",
+		.seq_show = gb_stat_seq_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/gb.c b/kernel/sched/gb.c
index f7da96253ad0..8ae1db83b587 100644
--- a/kernel/sched/gb.c
+++ b/kernel/sched/gb.c
@@ -46,6 +46,31 @@ static u64 load_of_part(struct gb_part_info *pi, int id)
 	return load;
 }
 
+int gb_stat_show(struct seq_file *sf, struct cgroup_subsys_state *css,
+		 struct gb_info *gi, struct gb_part_info *pi)
+{
+	struct cgroup_subsys_state *tg_css;
+	struct task_group *tg;
+	int i;
+
+	tg_css = cgroup_e_css(css->cgroup, &cpu_cgrp_subsys);
+	/* Make sure that the "cpu" and "cpuset" subsys belong to the same cgroup. */
+	if (tg_css->cgroup != css->cgroup)
+		return -EINVAL;
+	tg = container_of(tg_css, struct task_group, css);
+
+	for_each_gbpart(i, pi) {
+		seq_printf(sf, "PART-%d ", i);
+		seq_printf(sf, "%*pbl ", cpumask_pr_args(part_cpus(pi, i)));
+		seq_printf(sf, "%llu ", tg_load_of_part(pi, tg, i));
+		seq_printf(sf, "%llu ", load_of_part(pi, i));
+		if (gi->gb_prefer == i)
+			seq_puts(sf, " *");
+		seq_putc(sf, '\n');
+	}
+	return 0;
+}
+
 static inline int part_mgrt_lock(struct gb_part_info *pi, int src, int dst)
 {
 	struct gb_part *src_part, *dst_part;
-- 
2.27.0



* Re: [RFC PATCH v2 0/4] Introduce group balancer
  2022-03-08  9:26 [RFC PATCH v2 0/4] Introduce group balancer Tianchen Ding
                   ` (3 preceding siblings ...)
  2022-03-08  9:26 ` [RFC PATCH v2 4/4] cpuset, gb: Add stat for " Tianchen Ding
@ 2022-03-08 17:13 ` Tejun Heo
  2022-03-09  8:30   ` Tianchen Ding
  2022-04-06  2:47 ` Tianchen Ding
  5 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2022-03-08 17:13 UTC (permalink / raw)
  To: Tianchen Ding
  Cc: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Johannes Weiner,
	Michael Wang, Cruz Zhao, Masahiro Yamada, Nathan Chancellor,
	Kees Cook, Andrew Morton, Vlastimil Babka, Gustavo A. R. Silva,
	Arnd Bergmann, Miguel Ojeda, Chris Down, Vipin Sharma,
	Daniel Borkmann, linux-kernel, cgroups

Hello,

On Tue, Mar 08, 2022 at 05:26:25PM +0800, Tianchen Ding wrote:
> Modern platform are growing fast on CPU numbers. To achieve better
> utility of CPU resource, multiple apps are starting to sharing the CPUs.
> 
> What we need is a way to ease confliction in share mode,
> make groups as exclusive as possible, to gain both performance
> and resource efficiency.
> 
> The main idea of group balancer is to fulfill this requirement
> by balancing groups of tasks among groups of CPUs, consider this
> as a dynamic demi-exclusive mode. Task trigger work to settle it's
> group into a proper partition (minimum predicted load), then try
> migrate itself into it. To gradually settle groups into the most
> exclusively partition.
> 
> GB can be seen as an optimize policy based on load balance,
> it obeys the main idea of load balance and makes adjustment
> based on that.
> 
> Our test on ARM64 platform with 128 CPUs shows that,
> throughput of sysbench memory is improved about 25%,
> and redis-benchmark is improved up to about 10%.

The motivation makes sense to me but I'm not sure this is the right way to
architect it. We already have the framework to do all of this - the sched
domains and the load balancer. Architecturally, what the suggested patchset
is doing is building a separate load balancer on top of cpuset after using
cpuset to disable the existing load balancer, which is rather obviously
convoluted.

* AFAICS, none of what the suggested code does is all that complicated or
  needs a lot of input from userspace. It should be possible to parametrize
  the existing load balancer to behave better.

* If, for some reason, you need more customizable behavior in terms of cpu
  allocation, which is what cpuset is for, maybe it'd be better to build the
  load balancer in userspace. That'd fit way better with how cgroup is used
  in general and with threaded cgroups, it should fit nicely with everything
  else.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 0/4] Introduce group balancer
  2022-03-08 17:13 ` [RFC PATCH v2 0/4] Introduce " Tejun Heo
@ 2022-03-09  8:30   ` Tianchen Ding
  2022-03-09 18:00     ` Tejun Heo
  0 siblings, 1 reply; 12+ messages in thread
From: Tianchen Ding @ 2022-03-09  8:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Johannes Weiner,
	Michael Wang, Cruz Zhao, Masahiro Yamada, Nathan Chancellor,
	Kees Cook, Andrew Morton, Vlastimil Babka, Gustavo A. R. Silva,
	Arnd Bergmann, Miguel Ojeda, Chris Down, Vipin Sharma,
	Daniel Borkmann, linux-kernel, cgroups

On 2022/3/9 01:13, Tejun Heo wrote:
> Hello,
> 
> On Tue, Mar 08, 2022 at 05:26:25PM +0800, Tianchen Ding wrote:
>> Modern platform are growing fast on CPU numbers. To achieve better
>> utility of CPU resource, multiple apps are starting to sharing the CPUs.
>>
>> What we need is a way to ease confliction in share mode,
>> make groups as exclusive as possible, to gain both performance
>> and resource efficiency.
>>
>> The main idea of group balancer is to fulfill this requirement
>> by balancing groups of tasks among groups of CPUs, consider this
>> as a dynamic demi-exclusive mode. Task trigger work to settle it's
>> group into a proper partition (minimum predicted load), then try
>> migrate itself into it. To gradually settle groups into the most
>> exclusively partition.
>>
>> GB can be seen as an optimize policy based on load balance,
>> it obeys the main idea of load balance and makes adjustment
>> based on that.
>>
>> Our test on ARM64 platform with 128 CPUs shows that,
>> throughput of sysbench memory is improved about 25%,
>> and redis-benchmark is improved up to about 10%.
> 
> The motivation makes sense to me but I'm not sure this is the right way to
> architecture it. We already have the framework to do all these - the sched
> domains and the load balancer. Architecturally, what the suggested patchset
> is doing is building a separate load balancer on top of cpuset after using
> cpuset to disable the existing load balancer, which is rather obviously
> convoluted.
> 

"the sched domains and the load balancer" you mentioned are the ways to 
"balance" tasks on each domains. However, this patchset aims to "group" 
them together to win hot cache and less competition, which is different 
from load balancer. See commit log of the patch 3/4 and this link:
https://lore.kernel.org/all/11d4c86a-40ef-6ce5-6d08-e9d0bc9b512a@linux.alibaba.com/

> * AFAICS, none of what the suggested code does is all that complicated or
>    needs a lot of input from userspace. it should be possible to parametrize
>    the existing load balancer to behave better.
> 

The group balancer mainly needs two inputs from userspace: CPU partition 
info and cgroup info.
The CPU partition info does need user input (and may be a bit complicated), 
but as a result the way of dividing CPUs is left entirely to users (they can 
follow NUMA nodes, clusters, caches, etc.).
The cgroup info needs no extra input; it is already configured naturally.

It does, in effect, parametrize the existing load balancer to behave better.
The group balancer is an optimization policy: it obeys the basic policy
(load balancing) and improves on it.
The relationship between the load balancer and the group balancer is 
explained in detail at the link above.

> * If, for some reason, you need more customizable behavior in terms of cpu
>    allocation, which is what cpuset is for, maybe it'd be better to build the
>    load balancer in userspace. That'd fit way better with how cgroup is used
>    in general and with threaded cgroups, it should fit nicely with everything
>    else.
> 

We put the group balancer in kernel space because this new policy does not 
depend on userspace apps; it's a "general" feature.
Doing "dynamic cpuset" in userspace may also introduce performance issues, 
since it may need to bind and unbind different cpusets many times, and it 
is too strict compared with our "soft bind".


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 0/4] Introduce group balancer
  2022-03-09  8:30   ` Tianchen Ding
@ 2022-03-09 18:00     ` Tejun Heo
  2022-03-10  5:47       ` Tianchen Ding
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2022-03-09 18:00 UTC (permalink / raw)
  To: Tianchen Ding
  Cc: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Johannes Weiner,
	Michael Wang, Cruz Zhao, Masahiro Yamada, Nathan Chancellor,
	Kees Cook, Andrew Morton, Vlastimil Babka, Gustavo A. R. Silva,
	Arnd Bergmann, Miguel Ojeda, Chris Down, Vipin Sharma,
	Daniel Borkmann, linux-kernel, cgroups

Hello,

On Wed, Mar 09, 2022 at 04:30:51PM +0800, Tianchen Ding wrote:
> "the sched domains and the load balancer" you mentioned are the ways to
> "balance" tasks on each domains. However, this patchset aims to "group" them
> together to win hot cache and less competition, which is different from load
> balancer. See commit log of the patch 3/4 and this link:
> https://lore.kernel.org/all/11d4c86a-40ef-6ce5-6d08-e9d0bc9b512a@linux.alibaba.com/

I read that but it doesn't make a whole lot of sense to me. As Peter noted, we
already have issues with cross NUMA node balancing interacting with in-node
balancing, which likely indicates that it needs a more unified solution rather
than a more fragmented one. I have a hard time seeing how adding yet another
layer on top helps the situation.

> > * If, for some reason, you need more customizable behavior in terms of cpu
> >    allocation, which is what cpuset is for, maybe it'd be better to build the
> >    load balancer in userspace. That'd fit way better with how cgroup is used
> >    in general and with threaded cgroups, it should fit nicely with everything
> >    else.
> > 
> 
> We put group balancer in kernel space because this new policy does not
> depend on userspace apps. It's a "general" feature.

Well, it's general for use cases which are happy with the two knobs that you
defined for your use case.

> Doing "dynamic cpuset" in userspace may also introduce performance issue,
> since it may need to bind and unbind different cpusets for several times,
> and is too strict(compared with our "soft bind").

My bet is that you're gonna be able to get just about the same bench results
with userspace diddling with thread cgroup membership. Why not try that
first? The interface is already there. I have a hard time seeing the
justification for hard coding this into the kernel at this stage.
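
For reference, the thread-migration part of that is a one-line write from
userspace; a minimal sketch, assuming cgroup v2 with the destination cgroup
in threaded mode (the paths below are examples):

  /*
   * Move one thread into a cgroup by writing its TID to the destination
   * cgroup's cgroup.threads file. Illustration only.
   */
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
          char path[256];
          FILE *f;

          if (argc != 3) {
                  fprintf(stderr, "usage: %s <cgroup path> <tid>\n", argv[0]);
                  return 1;
          }
          snprintf(path, sizeof(path), "%s/cgroup.threads", argv[1]);
          f = fopen(path, "w");
          if (!f) {
                  perror(path);
                  return 1;
          }
          fprintf(f, "%d\n", atoi(argv[2]));
          return fclose(f) ? 1 : 0;
  }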

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 0/4] Introduce group balancer
  2022-03-09 18:00     ` Tejun Heo
@ 2022-03-10  5:47       ` Tianchen Ding
  2022-03-21 18:16         ` Tejun Heo
  0 siblings, 1 reply; 12+ messages in thread
From: Tianchen Ding @ 2022-03-10  5:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Johannes Weiner,
	Michael Wang, Cruz Zhao, Masahiro Yamada, Nathan Chancellor,
	Kees Cook, Andrew Morton, Vlastimil Babka, Gustavo A. R. Silva,
	Arnd Bergmann, Miguel Ojeda, Chris Down, Vipin Sharma,
	Daniel Borkmann, linux-kernel, cgroups

On 2022/3/10 02:00, Tejun Heo wrote:
> Hello,
> 
> On Wed, Mar 09, 2022 at 04:30:51PM +0800, Tianchen Ding wrote:
>> "the sched domains and the load balancer" you mentioned are the ways to
>> "balance" tasks on each domains. However, this patchset aims to "group" them
>> together to win hot cache and less competition, which is different from load
>> balancer. See commit log of the patch 3/4 and this link:
>> https://lore.kernel.org/all/11d4c86a-40ef-6ce5-6d08-e9d0bc9b512a@linux.alibaba.com/
> 
> I read that but it doesn't make whole lot of sense to me. As Peter noted, we
> already have issues with cross NUMA node balancing interacting with in-node
> balancing, which likely indicates that it needs more unified solution rather
> than more fragmented. I have a hard time seeing how adding yet another layer
> on top helps the situation.
> 
>>> * If, for some reason, you need more customizable behavior in terms of cpu
>>>     allocation, which is what cpuset is for, maybe it'd be better to build the
>>>     load balancer in userspace. That'd fit way better with how cgroup is used
>>>     in general and with threaded cgroups, it should fit nicely with everything
>>>     else.
>>>
>>
>> We put group balancer in kernel space because this new policy does not
>> depend on userspace apps. It's a "general" feature.
> 
> Well, it's general for use cases which are happy with the two knobs that you
> defined for your use case.
> 
>> Doing "dynamic cpuset" in userspace may also introduce performance issue,
>> since it may need to bind and unbind different cpusets for several times,
>> and is too strict(compared with our "soft bind").
> 
> My bet is that you're gonna be able to get just about the same bench results
> with userspace diddling with thread cgroup membership. Why not try that
> first? The interface is already there. I have a hard time seeing the
> justification for hard coding this into the kernel at this stage.
> 
> Thanks.
> 

Well, I understand your point is to put this in userspace. However, we've 
considered that and found it hard to do. If you have any better idea, 
please share it with us. :-)

If we want to build the group balancer in userspace, we need to (a rough 
sketch follows this list):
   1) gather load info from each rq periodically
   2) decide how to set cpuset.cpus of each cgroup
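
For illustration only, such a loop could look roughly like the sketch below.
The cgroup paths, the two partition cpulists and the 1-second period are
made-up assumptions, and "load" here is just the usage_usec delta read from
each cgroup's cpu.stat (cgroup v2):

  /* Greedy packing: busiest cgroup first, onto the least-loaded partition. */
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  #define NCG   4
  #define NPART 2

  static const char *cgroups[NCG] = {
          "/sys/fs/cgroup/appA", "/sys/fs/cgroup/appB",
          "/sys/fs/cgroup/appC", "/sys/fs/cgroup/appD",
  };
  static const char *partitions[NPART] = { "0-7", "8-15" };

  static unsigned long long read_usage(const char *cg)
  {
          char path[256], key[64];
          unsigned long long val = 0, usage = 0;
          FILE *f;

          snprintf(path, sizeof(path), "%s/cpu.stat", cg);
          f = fopen(path, "r");
          if (!f)
                  return 0;
          /* cpu.stat is "key value" pairs; keep the usage_usec one. */
          while (fscanf(f, "%63s %llu", key, &val) == 2)
                  if (!strcmp(key, "usage_usec"))
                          usage = val;
          fclose(f);
          return usage;
  }

  static void bind_cgroup(const char *cg, const char *cpus)
  {
          char path[256];
          FILE *f;

          snprintf(path, sizeof(path), "%s/cpuset.cpus", cg);
          f = fopen(path, "w");
          if (!f)
                  return;
          fprintf(f, "%s\n", cpus);
          fclose(f);
  }

  int main(void)
  {
          unsigned long long prev[NCG] = { 0 }, delta[NCG], part_load[NPART];
          int i, j, done[NCG];

          for (;;) {
                  for (i = 0; i < NCG; i++) {
                          unsigned long long now = read_usage(cgroups[i]);

                          delta[i] = now - prev[i];
                          prev[i] = now;
                          done[i] = 0;
                  }
                  memset(part_load, 0, sizeof(part_load));
                  for (i = 0; i < NCG; i++) {
                          int busiest = -1, target = 0;

                          for (j = 0; j < NCG; j++)
                                  if (!done[j] && (busiest < 0 ||
                                                   delta[j] > delta[busiest]))
                                          busiest = j;
                          for (j = 1; j < NPART; j++)
                                  if (part_load[j] < part_load[target])
                                          target = j;
                          part_load[target] += delta[busiest];
                          bind_cgroup(cgroups[busiest], partitions[target]);
                          done[busiest] = 1;
                  }
                  sleep(1); /* the sampling-period tradeoff discussed below */
          }
          return 0;
  }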

However, there are some problems with this approach.

For 1), we need to consider how frequently to collect this info, which 
affects both overhead and accuracy. If the load changes significantly right 
after we sample it, our data are stale and the decision may be wrong. (In 
the kernel, action can be taken faster.)

We believe 2) is harder. The specific policy may be complex and vary across 
scenarios; there is no general method.
For example, with 16 CPUs and 4 cgroups, how do we decide when to set one of 
them to 0-3 (when busy) or to 0-7 (when some of the other cgroups are idle)? 
If cgroupA has far more threads than cgroupB/C/D, and we want to satisfy 
cgroupA as much as possible (while preserving fairness for B/C/D), 
dynamically enlarging (when B/C/D are partly idle) and shrinking (when B/C/D 
are busy) the cpuset of cgroupA requires a complex policy. In this example, 
fairness and throughput can be provided by the existing scheduler, but when 
it comes to keeping a group's cache hot or reducing competition, neither the 
in-kernel scheduler nor userspace action solves it easily.
What's more, in many cloud computing scenarios there may be hundreds or 
thousands of containers, far more than the number of partitions. These 
containers may be dynamically created and destroyed at any time. Managing 
them with a userspace policy will not be practical.

These problems become much easier in kernel space. We get info directly 
from the scheduler, help revise its decisions at some key points, and do 
some supporting work (e.g., task migration when possible).

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 0/4] Introduce group balancer
  2022-03-10  5:47       ` Tianchen Ding
@ 2022-03-21 18:16         ` Tejun Heo
  2022-03-24  6:50           ` Tianchen Ding
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2022-03-21 18:16 UTC (permalink / raw)
  To: Tianchen Ding
  Cc: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Johannes Weiner,
	Michael Wang, Cruz Zhao, Masahiro Yamada, Nathan Chancellor,
	Kees Cook, Andrew Morton, Vlastimil Babka, Gustavo A. R. Silva,
	Arnd Bergmann, Miguel Ojeda, Chris Down, Vipin Sharma,
	Daniel Borkmann, linux-kernel, cgroups

Hello,

On Thu, Mar 10, 2022 at 01:47:34PM +0800, Tianchen Ding wrote:
> If we want to build group balancer in userspace, we need:
>   1) gather load info from each rq periodically
>   2) make decision to set cpuset.cpus of each cgroup
> 
> However, there're some problems about this way.
> 
> For 1), we need to consider how frequently collecting these info, which may
> impact performance and accuracy. If the load changes hugely right after we
> get it once, our data are expired and then the decision may be wrong. (If we
> are in kernel, faster action can be taken.)

We now have a pretty well-established way to transport data to userspace at
really low overhead. If you piggyback on the bpf interfaces, they can usually
be pretty unintrusive and low-effort as long as you have the right kind of
data aggregated already, which shouldn't be that difficult here.

> We believe 2) is harder. The specific policy may be complex and alter
> according to different scenes. There's not a general method.
> e.g., with 16cpus and 4 cgroups, how to decide when we set one of them
> 0-3(when busy)or 0-7(when some of other cgroups are idle)? If there are much
> more threads in cgroupA than cgroupB/C/D , and we want to satisfy cgroupA as
> far as possible(on the premise of fairness of B/C/D), dynamically
> enlarging(when B/C/D partly idle) and shrinking(when B/C/D busy) cpuset of
> cgroupA requires complex policy. In this example, fairness and performance
> can be provided by existing scheduler, but when it comes to grouping hot
> cache or decreasing competion, both scheduler in kernel and action in
> userspace are hard to solve.

So, I get that it's not easy. In fact, we don't even yet know how to
properly compare loads across groups of CPUs - the simple sums that you're
using break down when there are big gaps in weight numbers across tasks and
can become meaningless in the presence of CPU affinities. They can still work
when the configuration is fairly homogeneous and controlled, but the standard
should be far higher for something we bake into the kernel and expose a
userland-visible interface for.

> What's more, in many cloud computing scenes, there may be hundreds or
> thousands of containers, which are much larger than partition number. These
> containers may be dynamically created and destroyed at any time. Making
> policy to manage them from userspace will not be practical.
> 
> These problems become easy when going to kernelspace. We get info directly
> from scheduler, and help revising its decision at some key points, or do
> some support work(e.g., task migration if possible).

I don't think they necessarily become easy. Sure, you can hack up something
which works for some cases by poking into existing code; however, the bar
for acceptance is also way higher for a kernel interface - it should be
generic, consistent with other interfaces (I won't go into cgroup interface
issues here), and work orthogonally with other kernel features (i.e. task /
group weights should work in an explainable way). I don't think the proposed
patches score high on those axes.

I'm not against the goal here. Given that cgroups express the logical
structure of applications running on the system, it does make sense to
factor that into scheduling decisions. However, what's proposed seems
premature, and I have a hard time seeing why this level of functionality
would be difficult to implement from userspace with some additions in
terms of visibility, which are really easy to make these days.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 0/4] Introduce group balancer
  2022-03-21 18:16         ` Tejun Heo
@ 2022-03-24  6:50           ` Tianchen Ding
  0 siblings, 0 replies; 12+ messages in thread
From: Tianchen Ding @ 2022-03-24  6:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Johannes Weiner,
	Michael Wang, Cruz Zhao, Masahiro Yamada, Nathan Chancellor,
	Kees Cook, Andrew Morton, Vlastimil Babka, Gustavo A. R. Silva,
	Arnd Bergmann, Miguel Ojeda, Chris Down, Vipin Sharma,
	Daniel Borkmann, linux-kernel, cgroups

On 2022/3/22 02:16, Tejun Heo wrote:
> Hello,
> 
> On Thu, Mar 10, 2022 at 01:47:34PM +0800, Tianchen Ding wrote:
>> If we want to build group balancer in userspace, we need:
>>    1) gather load info from each rq periodically
>>    2) make decision to set cpuset.cpus of each cgroup
>>
>> However, there're some problems about this way.
>>
>> For 1), we need to consider how frequently collecting these info, which may
>> impact performance and accuracy. If the load changes hugely right after we
>> get it once, our data are expired and then the decision may be wrong. (If we
>> are in kernel, faster action can be taken.)
> 
> We now have a pretty well established way to transport data to userspace at
> really low overhead. If you piggy back on bpf interfaces, they can usually
> be pretty unintrusive and low effort as long as you have the right kind of
> data aggregated already, which shouldn't be that difficult here.
> 

Yes, bpf is a good way to fetch these data.
In fact we may also consider influencing some scheduler decisions this way 
in certain scenarios, but that seems like long-term work...

>> We believe 2) is harder. The specific policy may be complex and alter
>> according to different scenes. There's not a general method.
>> e.g., with 16cpus and 4 cgroups, how to decide when we set one of them
>> 0-3(when busy)or 0-7(when some of other cgroups are idle)? If there are much
>> more threads in cgroupA than cgroupB/C/D , and we want to satisfy cgroupA as
>> far as possible(on the premise of fairness of B/C/D), dynamically
>> enlarging(when B/C/D partly idle) and shrinking(when B/C/D busy) cpuset of
>> cgroupA requires complex policy. In this example, fairness and performance
>> can be provided by existing scheduler, but when it comes to grouping hot
>> cache or decreasing competion, both scheduler in kernel and action in
>> userspace are hard to solve.
> 
> So, I get that it's not easy. In fact, we don't even yet know how to
> properly compare loads across groups of CPUs - simple sums that you're using
> break down when there are big gaps in weight numbers across tasks and can
> become meaningless in the presence of CPU affinities. They can still work

"simple sums" is fine because we only use this load when selecting 
partitions, not migrating.

If I understand correctly, do you mean a scene that:
cgroupA has only one thread with full busy load (1024)
cgroup B/C/D have many threads with small load (maybe 100)
There're two CPU partitons:0-3 and 4-7

If A choose 0-3 (we suppose its thread sticking on cpu0), so the load of 
partition 0-3 becomes 1024 because simple sums. So B/C/D will all choose 
partition 4-7. This will be fine since our group balancer is a kind of 
"soft" bind. If we are waking a thread from cgroupB and find there is no 
free cpu in 4-7, we'll fallback to normal path and select other idle cpu 
(i.e., 1-3) if possible.
Otherwise, if there is no idle cpu totally, meaning the whole system is 
busy. The existing load balance mechanism will handle it, and our 
patches only try to switch two tasks if their cache in their own cgroups 
can be closer after migration.

> when the configuration is fairly homogeenous and controlled but the standard
> should be far higher for something we bake into the kernel and expose
> userland-visible interface for.
> 

Our aim is to improve the performance of apps, so we do not limit CPU usage 
strictly. This kind of soft bind also helps avoid conflicts with the 
existing load balance mechanism. We can explain this in the documentation.

>> What's more, in many cloud computing scenes, there may be hundreds or
>> thousands of containers, which are much larger than partition number. These
>> containers may be dynamically created and destroyed at any time. Making
>> policy to manage them from userspace will not be practical.
>>
>> These problems become easy when going to kernelspace. We get info directly
>> from scheduler, and help revising its decision at some key points, or do
>> some support work(e.g., task migration if possible).
> 
> I don't think they become necessarily easy. Sure, you can hack up something
> which works for some cases by poking into existing code; however, the bar
> for acceptance is also way higher for a kernel interface - it should be
> generic, consistent with other interfaces (I won't go into cgroup interface
> issues here), and work orthogonally with other kernel features (ie. task /
> group weights should work in an explainable way). I don't think the proposed
> patches are scoring high in those axes.
> 

We put the interface in the cpuset subsys because the partition info should 
follow effective_cpus. One issue is that we need to get load info from the 
cpu subsys, so "cpu" and "cpuset" must both be enabled in the same cgroup. 
Fortunately, this is easy in cgroup v2. Do you see any other conflict?

About "work orthogonally with other kernel features", we've explained 
above. Weights are only used in selecting partitions and we just tend to 
trying pulling tasks to their selected partitions, not force.

> I'm not against the goal here. Given that cgroups express the logical
> structure of applications running on the system, it does make sense to
> factor that into scheduling decisions. However, what's proposed seems too
> premature and I have a hard time seeing why this level of functionality
> would be difficult to be implement from userspace with some additions in
> terms of visibility which is really easy to do these days.

As we explained in the last mail, we believe building this in userspace is 
not easy, because the interfaces the kernel provides today are inflexible. 
From the view of a cgroup, it is given a bound "cpuset.cpus": all CPUs in 
this range are treated equally, while all CPUs outside it are strictly 
forbidden. The existing cpuset mechanisms do help with resource limitation 
and isolation, but they are weak at improving performance.

So our point is that doing this in userspace, relying on the existing 
kernel interfaces, is hard. Merely exposing more data to increase 
visibility in userspace is not enough; we need a whole solution, from 
collecting data to influencing the scheduler's decisions. Maybe eBPF can 
cover all of this work in the future, but that would take a long time to 
develop and may be too heavyweight for just a dynamic CPU bind.

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 0/4] Introduce group balancer
  2022-03-08  9:26 [RFC PATCH v2 0/4] Introduce group balancer Tianchen Ding
                   ` (4 preceding siblings ...)
  2022-03-08 17:13 ` [RFC PATCH v2 0/4] Introduce " Tejun Heo
@ 2022-04-06  2:47 ` Tianchen Ding
  5 siblings, 0 replies; 12+ messages in thread
From: Tianchen Ding @ 2022-04-06  2:47 UTC (permalink / raw)
  To: Zefan Li, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Tejun Heo,
	Johannes Weiner, Michael Wang, Cruz Zhao, Masahiro Yamada,
	Nathan Chancellor, Kees Cook, Andrew Morton, Vlastimil Babka,
	Gustavo A. R. Silva, Arnd Bergmann, Miguel Ojeda, Chris Down,
	Vipin Sharma, Daniel Borkmann
  Cc: linux-kernel, cgroups

Ping~
Any thoughts on this patchset? :-)

On 2022/3/8 17:26, Tianchen Ding wrote:
> Modern platform are growing fast on CPU numbers. To achieve better
> utility of CPU resource, multiple apps are starting to sharing the CPUs.
> 
> What we need is a way to ease confliction in share mode,
> make groups as exclusive as possible, to gain both performance
> and resource efficiency.
> 
> The main idea of group balancer is to fulfill this requirement
> by balancing groups of tasks among groups of CPUs, consider this
> as a dynamic demi-exclusive mode. Task trigger work to settle it's
> group into a proper partition (minimum predicted load), then try
> migrate itself into it. To gradually settle groups into the most
> exclusively partition.
> 
> GB can be seen as an optimize policy based on load balance,
> it obeys the main idea of load balance and makes adjustment
> based on that.
> 
> Our test on ARM64 platform with 128 CPUs shows that,
> throughput of sysbench memory is improved about 25%,
> and redis-benchmark is improved up to about 10%.
> 
> See each patch for detail:
> The 1st patch introduces infrastructure.
> The 2nd patch introduces detail about partition info.
> The 3rd patch is the main part of group balancer.
> The 4th patch is about stats.
> 
> v2:
> Put partition info and period settings to cpuset subsys of cgroup_v2.
> 
> v1: https://lore.kernel.org/all/98f41efd-74b2-198a-839c-51b785b748a6@linux.alibaba.com/
> 
> Michael Wang (1):
>    sched: Introduce group balancer
> 
> Tianchen Ding (3):
>    sched, cpuset: Introduce infrastructure of group balancer
>    cpuset: Handle input of partition info for group balancer
>    cpuset, gb: Add stat for group balancer
> 
>   include/linux/cpuset.h   |   5 +
>   include/linux/sched.h    |   5 +
>   include/linux/sched/gb.h |  70 ++++++
>   init/Kconfig             |  12 +
>   kernel/cgroup/cpuset.c   | 405 +++++++++++++++++++++++++++++++-
>   kernel/sched/Makefile    |   1 +
>   kernel/sched/core.c      |   5 +
>   kernel/sched/debug.c     |  10 +-
>   kernel/sched/fair.c      |  26 ++-
>   kernel/sched/gb.c        | 487 +++++++++++++++++++++++++++++++++++++++
>   kernel/sched/sched.h     |  14 ++
>   11 files changed, 1037 insertions(+), 3 deletions(-)
>   create mode 100644 include/linux/sched/gb.h
>   create mode 100644 kernel/sched/gb.c
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-04-06  9:36 UTC | newest]

Thread overview: 12+ messages
2022-03-08  9:26 [RFC PATCH v2 0/4] Introduce group balancer Tianchen Ding
2022-03-08  9:26 ` [RFC PATCH v2 1/4] sched, cpuset: Introduce infrastructure of " Tianchen Ding
2022-03-08  9:26 ` [RFC PATCH v2 2/4] cpuset: Handle input of partition info for " Tianchen Ding
2022-03-08  9:26 ` [RFC PATCH v2 3/4] sched: Introduce " Tianchen Ding
2022-03-08  9:26 ` [RFC PATCH v2 4/4] cpuset, gb: Add stat for " Tianchen Ding
2022-03-08 17:13 ` [RFC PATCH v2 0/4] Introduce " Tejun Heo
2022-03-09  8:30   ` Tianchen Ding
2022-03-09 18:00     ` Tejun Heo
2022-03-10  5:47       ` Tianchen Ding
2022-03-21 18:16         ` Tejun Heo
2022-03-24  6:50           ` Tianchen Ding
2022-04-06  2:47 ` Tianchen Ding
