* [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset
@ 2021-07-14 13:54 Frederic Weisbecker
  2021-07-14 13:54 ` [RFC PATCH 1/6] pci: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch Frederic Weisbecker
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2021-07-14 13:54 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Tejun Heo, Peter Zijlstra, Juri Lelli,
	Alex Belits, Nitesh Lal, Thomas Gleixner, Nicolas Saenz,
	Christoph Lameter, Marcelo Tosatti, Zefan Li, cgroups

The fact that "isolcpus=" behaviour can't be modified at runtime is an
eternal source of discussion and debate, pitting a useful feature
against a terrible interface.

I've long tried to figure out a proper way to control this at runtime
using cpusets, which isn't easy: a single boot-time cpumask is difficult
to map to a hierarchy of cpusets that may even overlap.

The idea here is to map the boot-time isolation behaviour to any cpuset
directory whose cpumask is a subset of "isolcpus=". See the last patch
for the details.
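
As a rough sketch, the mapping rule boils down to the following
(simplified from the last patch, which tracks the not-yet-mapped
"isolcpus=" CPUs in an "unmounted_isolcpus_mask" cpumask):

	/*
	 * A cpuset whose cpus_allowed is a subset of the not-yet-mapped
	 * "isolcpus=" CPUs inherits the boot-time domain isolation, and
	 * those CPUs are then consumed from the boot mask.
	 */
	if (cpumask_subset(cs->cpus_allowed, unmounted_isolcpus_mask)) {
		cpumask_andnot(unmounted_isolcpus_mask,
			       unmounted_isolcpus_mask, cs->cpus_allowed);
		cs->isol_flags |= BIT(CS_ISOL_DOMAIN);
	}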

Note this is still WIP and half-baked, but I figured it's important to
validate the interface early.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	cpuset/isolation

HEAD: 6d3dba1115b7ea464febf3763244c783e87c7baf

Thanks,
	Frederic
---

Frederic Weisbecker (6):
      pci: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch
      workqueue: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch
      net: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch
      sched/isolation: Split domain housekeeping mask from the rest
      sched/isolation: Make HK_FLAG_DOMAIN mutable
      cpuset: Add cpuset.isolation_mask file


 drivers/pci/pci-driver.c        |  21 ++++++--
 include/linux/sched/isolation.h |   4 ++
 kernel/cgroup/cpuset.c          | 111 ++++++++++++++++++++++++++++++++++++++--
 kernel/sched/isolation.c        |  73 ++++++++++++++++++++++----
 kernel/workqueue.c              |   4 +-
 net/core/net-sysfs.c            |   6 +--
 6 files changed, 196 insertions(+), 23 deletions(-)

* [RFC PATCH 1/6] pci: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch
  2021-07-14 13:54 [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset Frederic Weisbecker
@ 2021-07-14 13:54 ` Frederic Weisbecker
  2021-07-14 13:54 ` [RFC PATCH 2/6] workqueue: " Frederic Weisbecker
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2021-07-14 13:54 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Tejun Heo, Peter Zijlstra, Juri Lelli,
	Alex Belits, Nitesh Lal, Thomas Gleixner, Nicolas Saenz,
	Christoph Lameter, Marcelo Tosatti,
	Zefan Li, cgroups

To prepare for supporting each feature of the housekeeping cpumask
through cpuset, prepare for HK_FLAG_DOMAIN to move to its own cpumask.
This will allow modifying the set passed through the "isolcpus=" kernel
boot parameter at runtime.
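
Schematically, this and the two following patches replace a single
lookup keyed on OR'ed flags with one lookup per flag, intersecting the
results (illustration only, not part of the diff below; "buf" is a
caller-allocated cpumask):

	const struct cpumask *hk;
	cpumask_var_t buf;	/* e.g. from zalloc_cpumask_var() */

	/* Before: one combined lookup */
	hk = housekeeping_cpumask(HK_FLAG_WQ | HK_FLAG_DOMAIN);

	/* After: fetch each mask separately and intersect them */
	cpumask_and(buf, housekeeping_cpumask(HK_FLAG_WQ),
		    housekeeping_cpumask(HK_FLAG_DOMAIN));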

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Nitesh Lal <nilal@redhat.com>
Cc: Nicolas Saenz <nsaenzju@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@gentwo.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Alex Belits <abelits@marvell.com>
---
 drivers/pci/pci-driver.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 3a72352aa5cf..ab84fa81c29a 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -336,7 +336,6 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 			  const struct pci_device_id *id)
 {
 	int error, node, cpu;
-	int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
 	struct drv_dev_and_id ddi = { drv, dev, id };
 
 	/*
@@ -354,17 +353,29 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	 * device is probed from work_on_cpu() of the Physical device.
 	 */
 	if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
-	    pci_physfn_is_probed(dev))
+	    pci_physfn_is_probed(dev)) {
 		cpu = nr_cpu_ids;
-	else
+	} else {
+		cpumask_var_t wq_domain_mask;
+
+		if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL)) {
+			error = -ENOMEM;
+			goto out;
+		}
+		cpumask_and(wq_domain_mask,
+			    housekeeping_cpumask(HK_FLAG_WQ),
+			    housekeeping_cpumask(HK_FLAG_DOMAIN));
+
 		cpu = cpumask_any_and(cpumask_of_node(node),
-				      housekeeping_cpumask(hk_flags));
+				      wq_domain_mask);
+		free_cpumask_var(wq_domain_mask);
+	}
 
 	if (cpu < nr_cpu_ids)
 		error = work_on_cpu(cpu, local_pci_probe, &ddi);
 	else
 		error = local_pci_probe(&ddi);
-
+out:
 	dev->is_probed = 0;
 	cpu_hotplug_enable();
 	return error;
-- 
2.25.1


* [RFC PATCH 2/6] workqueue: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch
  2021-07-14 13:54 [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset Frederic Weisbecker
  2021-07-14 13:54 ` [RFC PATCH 1/6] pci: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch Frederic Weisbecker
@ 2021-07-14 13:54 ` Frederic Weisbecker
  2021-07-14 13:54 ` [RFC PATCH 3/6] net: " Frederic Weisbecker
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2021-07-14 13:54 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Tejun Heo, Peter Zijlstra, Juri Lelli,
	Alex Belits, Nitesh Lal, Thomas Gleixner, Nicolas Saenz,
	Christoph Lameter, Marcelo Tosatti, Zefan Li, cgroups

To prepare for supporting each feature of the housekeeping cpumask
through cpuset, prepare for HK_FLAG_DOMAIN to move to its own cpumask.
This will allow modifying the set passed through the "isolcpus=" kernel
boot parameter at runtime.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Nitesh Lal <nilal@redhat.com>
Cc: Nicolas Saenz <nsaenzju@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@gentwo.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Alex Belits <abelits@marvell.com>
---
 kernel/workqueue.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 50142fc08902..d29c5b61a333 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5938,13 +5938,13 @@ static void __init wq_numa_init(void)
 void __init workqueue_init_early(void)
 {
 	int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
-	int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
 	int i, cpu;
 
 	BUILD_BUG_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
 
 	BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
-	cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask(hk_flags));
+	cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask(HK_FLAG_WQ));
+	cpumask_and(wq_unbound_cpumask, wq_unbound_cpumask, housekeeping_cpumask(HK_FLAG_DOMAIN));
 
 	pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
 
-- 
2.25.1


* [RFC PATCH 3/6] net: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch
  2021-07-14 13:54 [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset Frederic Weisbecker
  2021-07-14 13:54 ` [RFC PATCH 1/6] pci: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch Frederic Weisbecker
  2021-07-14 13:54 ` [RFC PATCH 2/6] workqueue: " Frederic Weisbecker
@ 2021-07-14 13:54 ` Frederic Weisbecker
  2021-07-14 13:54 ` [RFC PATCH 4/6] sched/isolation: Split domain housekeeping mask from the rest Frederic Weisbecker
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2021-07-14 13:54 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Tejun Heo, Peter Zijlstra, Juri Lelli,
	Alex Belits, Nitesh Lal, Thomas Gleixner, Nicolas Saenz,
	Christoph Lameter, Marcelo Tosatti, Zefan Li, cgroups

To prepare for supporting each feature of the housekeeping cpumask
through cpuset, prepare for HK_FLAG_DOMAIN to move to its own cpumask.
This will allow modifying the set passed through the "isolcpus=" kernel
boot parameter at runtime.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Nitesh Lal <nilal@redhat.com>
Cc: Nicolas Saenz <nsaenzju@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@gentwo.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Alex Belits <abelits@marvell.com>
---
 net/core/net-sysfs.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index f6197774048b..78ea904e9206 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -782,7 +782,7 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 {
 	struct rps_map *old_map, *map;
 	cpumask_var_t mask;
-	int err, cpu, i, hk_flags;
+	int err, cpu, i;
 	static DEFINE_MUTEX(rps_map_mutex);
 
 	if (!capable(CAP_NET_ADMIN))
@@ -798,8 +798,8 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 	}
 
 	if (!cpumask_empty(mask)) {
-		hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
-		cpumask_and(mask, mask, housekeeping_cpumask(hk_flags));
+		cpumask_and(mask, mask, housekeeping_cpumask(HK_FLAG_DOMAIN));
+		cpumask_and(mask, mask, housekeeping_cpumask(HK_FLAG_WQ));
 		if (cpumask_empty(mask)) {
 			free_cpumask_var(mask);
 			return -EINVAL;
-- 
2.25.1


* [RFC PATCH 4/6] sched/isolation: Split domain housekeeping mask from the rest
  2021-07-14 13:54 [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2021-07-14 13:54 ` [RFC PATCH 3/6] net: " Frederic Weisbecker
@ 2021-07-14 13:54 ` Frederic Weisbecker
  2021-07-14 13:54 ` [RFC PATCH 5/6] sched/isolation: Make HK_FLAG_DOMAIN mutable Frederic Weisbecker
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2021-07-14 13:54 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Tejun Heo, Peter Zijlstra, Juri Lelli,
	Alex Belits, Nitesh Lal, Thomas Gleixner, Nicolas Saenz,
	Christoph Lameter, Marcelo Tosatti, Zefan Li, cgroups

To prepare for supporting each feature of the housekeeping cpumask
through cpuset, move HK_FLAG_DOMAIN to its own cpumask. This will allow
modifying the set passed through the "isolcpus=" kernel boot parameter
at runtime.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Nitesh Lal <nilal@redhat.com>
Cc: Nicolas Saenz <nsaenzju@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@gentwo.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Alex Belits <abelits@marvell.com>
---
 kernel/sched/isolation.c | 54 +++++++++++++++++++++++++++++++++-------
 1 file changed, 45 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 7f06eaf12818..c2bdf7e6dc39 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -12,6 +12,7 @@
 DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
 EXPORT_SYMBOL_GPL(housekeeping_overridden);
 static cpumask_var_t housekeeping_mask;
+static cpumask_var_t hk_domain_mask;
 static unsigned int housekeeping_flags;
 
 bool housekeeping_enabled(enum hk_flags flags)
@@ -26,7 +27,7 @@ int housekeeping_any_cpu(enum hk_flags flags)
 
 	if (static_branch_unlikely(&housekeeping_overridden)) {
 		if (housekeeping_flags & flags) {
-			cpu = sched_numa_find_closest(housekeeping_mask, smp_processor_id());
+			cpu = sched_numa_find_closest(housekeeping_cpumask(flags), smp_processor_id());
 			if (cpu < nr_cpu_ids)
 				return cpu;
 
@@ -39,9 +40,13 @@ EXPORT_SYMBOL_GPL(housekeeping_any_cpu);
 
 const struct cpumask *housekeeping_cpumask(enum hk_flags flags)
 {
-	if (static_branch_unlikely(&housekeeping_overridden))
+	if (static_branch_unlikely(&housekeeping_overridden)) {
+		WARN_ON_ONCE((flags & HK_FLAG_DOMAIN) && (flags & ~HK_FLAG_DOMAIN));
+		if (housekeeping_flags & HK_FLAG_DOMAIN)
+			return hk_domain_mask;
 		if (housekeeping_flags & flags)
 			return housekeeping_mask;
+	}
 	return cpu_possible_mask;
 }
 EXPORT_SYMBOL_GPL(housekeeping_cpumask);
@@ -50,7 +55,7 @@ void housekeeping_affine(struct task_struct *t, enum hk_flags flags)
 {
 	if (static_branch_unlikely(&housekeeping_overridden))
 		if (housekeeping_flags & flags)
-			set_cpus_allowed_ptr(t, housekeeping_mask);
+			set_cpus_allowed_ptr(t, housekeeping_cpumask(flags));
 }
 EXPORT_SYMBOL_GPL(housekeeping_affine);
 
@@ -58,11 +63,13 @@ bool housekeeping_test_cpu(int cpu, enum hk_flags flags)
 {
 	if (static_branch_unlikely(&housekeeping_overridden))
 		if (housekeeping_flags & flags)
-			return cpumask_test_cpu(cpu, housekeeping_mask);
+			return cpumask_test_cpu(cpu, housekeeping_cpumask(flags));
 	return true;
 }
 EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
 
+
+
 void __init housekeeping_init(void)
 {
 	if (!housekeeping_flags)
@@ -91,28 +98,57 @@ static int __init housekeeping_setup(char *str, enum hk_flags flags)
 
 	alloc_bootmem_cpumask_var(&tmp);
 	if (!housekeeping_flags) {
-		alloc_bootmem_cpumask_var(&housekeeping_mask);
-		cpumask_andnot(housekeeping_mask,
-			       cpu_possible_mask, non_housekeeping_mask);
+		if (flags & ~HK_FLAG_DOMAIN) {
+			alloc_bootmem_cpumask_var(&housekeeping_mask);
+			cpumask_andnot(housekeeping_mask,
+				       cpu_possible_mask, non_housekeeping_mask);
+		}
+
+		if (flags & HK_FLAG_DOMAIN) {
+			alloc_bootmem_cpumask_var(&hk_domain_mask);
+			cpumask_andnot(hk_domain_mask,
+				       cpu_possible_mask, non_housekeeping_mask);
+		}
 
 		cpumask_andnot(tmp, cpu_present_mask, non_housekeeping_mask);
 		if (cpumask_empty(tmp)) {
 			pr_warn("Housekeeping: must include one present CPU, "
 				"using boot CPU:%d\n", smp_processor_id());
-			__cpumask_set_cpu(smp_processor_id(), housekeeping_mask);
+			if (flags & ~HK_FLAG_DOMAIN)
+				__cpumask_set_cpu(smp_processor_id(), housekeeping_mask);
+			if (flags & HK_FLAG_DOMAIN)
+				__cpumask_set_cpu(smp_processor_id(), hk_domain_mask);
 			__cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
 		}
 	} else {
+		struct cpumask *prev;
+
 		cpumask_andnot(tmp, cpu_present_mask, non_housekeeping_mask);
 		if (cpumask_empty(tmp))
 			__cpumask_clear_cpu(smp_processor_id(), non_housekeeping_mask);
 		cpumask_andnot(tmp, cpu_possible_mask, non_housekeeping_mask);
-		if (!cpumask_equal(tmp, housekeeping_mask)) {
+
+		if (housekeeping_flags == HK_FLAG_DOMAIN)
+			prev = hk_domain_mask;
+		else
+			prev = housekeeping_mask;
+
+		if (!cpumask_equal(tmp, prev)) {
 			pr_warn("Housekeeping: nohz_full= must match isolcpus=\n");
 			free_bootmem_cpumask_var(tmp);
 			free_bootmem_cpumask_var(non_housekeeping_mask);
 			return 0;
 		}
+
+		if ((housekeeping_flags & HK_FLAG_DOMAIN) && (flags & ~HK_FLAG_DOMAIN)) {
+			alloc_bootmem_cpumask_var(&housekeeping_mask);
+			cpumask_copy(housekeeping_mask, hk_domain_mask);
+		}
+
+		if ((housekeeping_flags & ~HK_FLAG_DOMAIN) && (flags & HK_FLAG_DOMAIN)) {
+			alloc_bootmem_cpumask_var(&hk_domain_mask);
+			cpumask_copy(hk_domain_mask, housekeeping_mask);
+		}
 	}
 	free_bootmem_cpumask_var(tmp);
 
-- 
2.25.1


* [RFC PATCH 5/6] sched/isolation: Make HK_FLAG_DOMAIN mutable
  2021-07-14 13:54 [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2021-07-14 13:54 ` [RFC PATCH 4/6] sched/isolation: Split domain housekeeping mask from the rest Frederic Weisbecker
@ 2021-07-14 13:54 ` Frederic Weisbecker
  2021-07-21 14:28   ` Vincent Donnefort
  2021-07-14 13:54 ` [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file Frederic Weisbecker
  2021-07-16 18:02 ` [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset Waiman Long
  6 siblings, 1 reply; 19+ messages in thread
From: Frederic Weisbecker @ 2021-07-14 13:54 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Tejun Heo, Peter Zijlstra, Juri Lelli,
	Alex Belits, Nitesh Lal, Thomas Gleixner, Nicolas Saenz,
	Christoph Lameter, Marcelo Tosatti, Zefan Li, cgroups

In order to prepare for supporting "isolcpus=" changes through cpuset,
provide an API to modify the "isolcpus=" cpumask passed on boot.

TODO:

* Propagate the change to all interested subsystems (workqueue, net, pci)
* Make sure we can't concurrently change this cpumask (assert cpuset_rwsem
  is held).
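
For illustration, a caller such as cpuset (see the last patch) would
refresh the domain housekeeping mask roughly like this (a sketch only;
"isolated_cpus" stands for whatever set the caller wants excluded from
the domains):

	cpumask_var_t mask;

	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
		return -ENOMEM;
	/* Every CPU outside "isolated_cpus" keeps its sched domains */
	cpumask_andnot(mask, cpu_possible_mask, isolated_cpus);
	housekeeping_cpumask_set(mask, HK_FLAG_DOMAIN);
	free_cpumask_var(mask);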

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Nitesh Lal <nilal@redhat.com>
Cc: Nicolas Saenz <nsaenzju@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@gentwo.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Alex Belits <abelits@marvell.com>
---
 include/linux/sched/isolation.h |  4 ++++
 kernel/sched/isolation.c        | 19 +++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index cc9f393e2a70..a5960cb357d2 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -21,6 +21,7 @@ enum hk_flags {
 DECLARE_STATIC_KEY_FALSE(housekeeping_overridden);
 extern int housekeeping_any_cpu(enum hk_flags flags);
 extern const struct cpumask *housekeeping_cpumask(enum hk_flags flags);
+extern void housekeeping_cpumask_set(struct cpumask *mask, enum hk_flags flags);
 extern bool housekeeping_enabled(enum hk_flags flags);
 extern void housekeeping_affine(struct task_struct *t, enum hk_flags flags);
 extern bool housekeeping_test_cpu(int cpu, enum hk_flags flags);
@@ -38,6 +39,9 @@ static inline const struct cpumask *housekeeping_cpumask(enum hk_flags flags)
 	return cpu_possible_mask;
 }
 
+static inline void housekeeping_cpumask_set(struct cpumask *mask,
+					    enum hk_flags flags) { }
+
 static inline bool housekeeping_enabled(enum hk_flags flags)
 {
 	return false;
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index c2bdf7e6dc39..c071433059cf 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -68,7 +68,26 @@ bool housekeeping_test_cpu(int cpu, enum hk_flags flags)
 }
 EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
 
+// Only support HK_FLAG_DOMAIN for now
+// TODO: propagate the changes through all interested subsystems:
+// workqueues, net, pci; ...
+void housekeeping_cpumask_set(struct cpumask *mask, enum hk_flags flags)
+{
+	/* Only HK_FLAG_DOMAIN change supported for now */
+	if (WARN_ON_ONCE(flags != HK_FLAG_DOMAIN))
+		return;
 
+	if (!static_key_enabled(&housekeeping_overridden.key)) {
+		if (cpumask_equal(mask, cpu_possible_mask))
+			return;
+		if (WARN_ON_ONCE(!alloc_cpumask_var(&hk_domain_mask, GFP_KERNEL)))
+			return;
+		cpumask_copy(hk_domain_mask, mask);
+		static_branch_enable(&housekeeping_overridden);
+	} else {
+		cpumask_copy(hk_domain_mask, mask);
+	}
+}
 
 void __init housekeeping_init(void)
 {
-- 
2.25.1


* [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file
  2021-07-14 13:54 [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2021-07-14 13:54 ` [RFC PATCH 5/6] sched/isolation: Make HK_FLAG_DOMAIN mutable Frederic Weisbecker
@ 2021-07-14 13:54 ` Frederic Weisbecker
  2021-07-14 16:31   ` Marcelo Tosatti
  2021-07-14 16:52   ` Peter Zijlstra
  2021-07-16 18:02 ` [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset Waiman Long
  6 siblings, 2 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2021-07-14 13:54 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Tejun Heo, Peter Zijlstra, Juri Lelli,
	Alex Belits, Nitesh Lal, Thomas Gleixner, Nicolas Saenz,
	Christoph Lameter, Marcelo Tosatti, Zefan Li, cgroups

Add a new cpuset.isolation_mask file in order to be able to modify the
housekeeping cpumask for each individual isolation feature at runtime.
In the future this will include nohz_full, unbound timers,
unbound workqueues, unbound kthreads, managed irqs, etc...

Start with supporting domain exclusion and CPUs passed through
"isolcpus=".

The cpuset.isolation_mask defaults to 0. Setting it to 1 will exclude
the given cpuset from the domains (they will be attached to the NULL domain).
As long as a CPU is part of any cpuset with cpuset.isolation_mask set to
1, it will remain isolated even if it overlaps with another cpuset that
has cpuset.isolation_mask set to 0. The same applies to parent and
subdirectories.

If a cpuset is a subset of "isolcpus=", it is automatically mapped and
cpuset.isolation_mask will be set to 1. This subset is then cleared from
the initial "isolcpus=" mask. The user is then free to override
cpuset.isolation_mask to 0 in order to revert the effect of "isolcpus=".

Here is an example of use where CPU 7 has been isolated on boot and
gets re-attached to domains later from cpuset:

	$ cat /proc/cmdline
		isolcpus=7
	$ cd /sys/fs/cgroup/cpuset
	$ mkdir cpu7
	$ cd cpu7
	$ cat cpuset.cpus
		0-7
	$ cat cpuset.isolation_mask
		0
	$ ls /sys/kernel/debug/domains/cpu7	# empty because isolcpus=7
	$ echo 7 > cpuset.cpus
	$ cat cpuset.isolation_mask	# isolcpus subset automatically mapped
		1
	$ echo 0 > cpuset.isolation_mask
	$ ls /sys/kernel/debug/domains/cpu7/
		domain0  domain1

CHECKME: Should we have individual cpuset.isolation.$feature files for
         each isolation feature instead of a single mask file?

CHECKME: The scheduler is unhappy when _every_ CPU is isolated

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Nitesh Lal <nilal@redhat.com>
Cc: Nicolas Saenz <nsaenzju@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@gentwo.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Alex Belits <abelits@marvell.com>
---
 kernel/cgroup/cpuset.c | 111 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 107 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index adb5190c4429..ecb63be04408 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -82,6 +82,7 @@ struct cpuset {
 	struct cgroup_subsys_state css;
 
 	unsigned long flags;		/* "unsigned long" so bitops work */
+	unsigned long isol_flags;
 
 	/*
 	 * On default hierarchy:
@@ -258,6 +259,17 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+/* bits in struct cpuset isol_flags field */
+typedef enum {
+	CS_ISOL_DOMAIN,
+	CS_ISOL_MAX
+} isol_flagbits_t;
+
+static inline int is_isol_domain(const struct cpuset *cs)
+{
+	return test_bit(CS_ISOL_DOMAIN, &cs->isol_flags);
+}
+
 static inline int is_partition_root(const struct cpuset *cs)
 {
 	return cs->partition_root_state > 0;
@@ -269,6 +281,13 @@ static struct cpuset top_cpuset = {
 	.partition_root_state = PRS_ENABLED,
 };
 
+/*
+ * CPUs passed through "isolcpus=" on boot, waiting to be mounted
+ * as soon as we meet a cpuset directory whose cpus_allowed is a
+ * subset of "isolcpus="
+ */
+static cpumask_var_t unmounted_isolcpus_mask;
+
 /**
  * cpuset_for_each_child - traverse online children of a cpuset
  * @child_cs: loop cursor pointing to the current child
@@ -681,6 +700,39 @@ static inline int nr_cpusets(void)
 	return static_key_count(&cpusets_enabled_key.key) + 1;
 }
 
+static int update_domain_housekeeping_mask(void)
+{
+	struct cpuset *cp;	/* top-down scan of cpusets */
+	struct cgroup_subsys_state *pos_css;
+	cpumask_var_t domain_mask;
+
+	if (!zalloc_cpumask_var(&domain_mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	cpumask_andnot(domain_mask, cpu_possible_mask, unmounted_isolcpus_mask);
+
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
+		if (is_isol_domain(cp))
+			cpumask_andnot(domain_mask, domain_mask, cp->cpus_allowed);
+
+		if (cpumask_subset(cp->cpus_allowed, unmounted_isolcpus_mask)) {
+			unsigned long flags;
+			cpumask_andnot(unmounted_isolcpus_mask, unmounted_isolcpus_mask,
+				       cp->cpus_allowed);
+			spin_lock_irqsave(&callback_lock, flags);
+			cp->isol_flags |= BIT(CS_ISOL_DOMAIN);
+			spin_unlock_irqrestore(&callback_lock, flags);
+		}
+	}
+	rcu_read_unlock();
+
+	housekeeping_cpumask_set(domain_mask, HK_FLAG_DOMAIN);
+	free_cpumask_var(domain_mask);
+
+	return 0;
+}
+
 /*
  * generate_sched_domains()
  *
@@ -741,6 +793,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	struct cpuset **csa;	/* array of all cpuset ptrs */
 	int csn;		/* how many cpuset ptrs in csa so far */
 	int i, j, k;		/* indices for partition finding loops */
+	int err;
 	cpumask_var_t *doms;	/* resulting partition; i.e. sched domains */
 	struct sched_domain_attr *dattr;  /* attributes for custom domains */
 	int ndoms = 0;		/* number of sched domains in result */
@@ -752,6 +805,10 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	dattr = NULL;
 	csa = NULL;
 
+	err = update_domain_housekeeping_mask();
+	if (err < 0)
+		pr_err("Can't update housekeeping cpumask\n");
+
 	/* Special case for the 99% of systems with one, full, sched domain */
 	if (root_load_balance && !top_cpuset.nr_subparts_cpus) {
 		ndoms = 1;
@@ -1449,7 +1506,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		 * root as well.
 		 */
 		if (!cpumask_empty(cp->cpus_allowed) &&
-		    is_sched_load_balance(cp) &&
+		    (is_sched_load_balance(cp) || is_isol_domain(cp)) &&
 		   (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
 		    is_partition_root(cp)))
 			need_rebuild_sched_domains = true;
@@ -1935,6 +1992,30 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	return err;
 }
 
+/*
+ * update_isol_flags - read a 0 or a 1 in a file and update associated isol flag
+ * mask:	the new mask value to apply (see isol_flagbits_t)
+ * cs:		the cpuset to update
+ *
+ * Call with cpuset_mutex held.
+ */
+static int update_isol_flags(struct cpuset *cs, u64 mask)
+{
+	unsigned long old_mask = cs->isol_flags;
+
+	if (mask & ~(BIT_ULL(CS_ISOL_MAX) - 1))
+		return -EINVAL;
+
+	spin_lock_irq(&callback_lock);
+	cs->isol_flags = (unsigned long)mask;
+	spin_unlock_irq(&callback_lock);
+
+	if (mask ^ old_mask)
+		rebuild_sched_domains_locked();
+
+	return 0;
+}
+
 /*
  * update_prstate - update partititon_root_state
  * cs:	the cpuset to update
@@ -2273,6 +2354,9 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+//CHECKME: should we have individual cpuset.isolation.$feature files
+//instead of a mask of features in a single file?
+	FILE_ISOLATION_MASK,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
@@ -2314,6 +2398,9 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	case FILE_SPREAD_SLAB:
 		retval = update_flag(CS_SPREAD_SLAB, cs, val);
 		break;
+	case FILE_ISOLATION_MASK:
+		retval = update_isol_flags(cs, val);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -2481,6 +2568,8 @@ static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
 		return is_spread_page(cs);
 	case FILE_SPREAD_SLAB:
 		return is_spread_slab(cs);
+	case FILE_ISOLATION_MASK:
+		return cs->isol_flags;
 	default:
 		BUG();
 	}
@@ -2658,6 +2747,13 @@ static struct cftype legacy_files[] = {
 		.private = FILE_MEMORY_PRESSURE_ENABLED,
 	},
 
+	{
+		.name = "isolation_mask",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_ISOLATION_MASK,
+	},
+
 	{ }	/* terminate */
 };
 
@@ -2834,9 +2930,12 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	if (is_partition_root(cs))
 		update_prstate(cs, 0);
 
-	if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
-	    is_sched_load_balance(cs))
-		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+	if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+		if (is_sched_load_balance(cs))
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+		if (is_isol_domain(cs))
+			update_isol_flags(cs, cs->isol_flags & ~BIT(CS_ISOL_DOMAIN));
+	}
 
 	if (cs->use_parent_ecpus) {
 		struct cpuset *parent = parent_cs(cs);
@@ -2873,6 +2972,9 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
 		top_cpuset.mems_allowed = top_cpuset.effective_mems;
 	}
 
+	cpumask_andnot(unmounted_isolcpus_mask, cpu_possible_mask,
+		       housekeeping_cpumask(HK_FLAG_DOMAIN));
+
 	spin_unlock_irq(&callback_lock);
 	percpu_up_write(&cpuset_rwsem);
 }
@@ -2932,6 +3034,7 @@ int __init cpuset_init(void)
 	top_cpuset.relax_domain_level = -1;
 
 	BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));
+	BUG_ON(!alloc_cpumask_var(&unmounted_isolcpus_mask, GFP_KERNEL));
 
 	return 0;
 }
-- 
2.25.1


* Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file
  2021-07-14 13:54 ` [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file Frederic Weisbecker
@ 2021-07-14 16:31   ` Marcelo Tosatti
  2021-07-19 13:26     ` Frederic Weisbecker
  2021-07-14 16:52   ` Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: Marcelo Tosatti @ 2021-07-14 16:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Tejun Heo, Peter Zijlstra, Juri Lelli, Alex Belits,
	Nitesh Lal, Thomas Gleixner, Nicolas Saenz, Christoph Lameter,
	Zefan Li, cgroups

On Wed, Jul 14, 2021 at 03:54:20PM +0200, Frederic Weisbecker wrote:
> Add a new cpuset.isolation_mask file in order to be able to modify the
> housekeeping cpumask for each individual isolation feature at runtime.
> In the future this will include nohz_full, unbound timers,
> unbound workqueues, unbound kthreads, managed irqs, etc...
> 
> Start with supporting domain exclusion and CPUs passed through
> "isolcpus=".

Is it possible to just return -ENOTSUPPORTED for the features
whose support is not present?

> The cpuset.isolation_mask defaults to 0. Setting it to 1 will exclude
> the given cpuset from the domains (they will be attached to the NULL domain).
> As long as a CPU is part of any cpuset with cpuset.isolation_mask set to
> 1, it will remain isolated even if it overlaps with another cpuset that
> has cpuset.isolation_mask set to 0. The same applies to parent and
> subdirectories.
> 
> If a cpuset is a subset of "isolcpus=", it is automatically mapped and
> cpuset.isolation_mask will be set to 1. This subset is then cleared from
> the initial "isolcpus=" mask. The user is then free to override
> cpuset.isolation_mask to 0 in order to revert the effect of "isolcpus=".
> 
> Here is an example of use where CPU 7 has been isolated on boot and
> gets re-attached to domains later from cpuset:
> 
> 	$ cat /proc/cmdline
> 		isolcpus=7
> 	$ cd /sys/fs/cgroup/cpuset
> 	$ mkdir cpu7
> 	$ cd cpu7
> 	$ cat cpuset.cpus
> 		0-7
> 	$ cat cpuset.isolation_mask
> 		0
> 	$ ls /sys/kernel/debug/domains/cpu7	# empty because isolcpus=7
> 	$ echo 7 > cpuset.cpus
> 	$ cat cpuset.isolation_mask	# isolcpus subset automatically mapped
> 		1
> 	$ echo 0 > cpuset.isolation_mask
> 	$ ls /sys/kernel/debug/domains/cpu7/
> 		domain0  domain1
> 
> CHECKME: Should we have individual cpuset.isolation.$feature files for
>          each isolation feature instead of a single mask file?

Yes, I guess that is useful, for example due to the -ENOTSUPPORTED
comment above.


Guarantees on updates
=====================

Perhaps start with a document with:

On return from the write to the cpumask file, what are the guarantees?

For example, for kthreads the guarantee is that any kernel threads
created from that point on should start with the new mask. Therefore
userspace should respect the order:

1) Change kthread mask.
2) Move threads.
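
In userspace, that ordering would look roughly like this (purely
illustrative; the per-feature "cpuset.isolation.kthread" file is
hypothetical and does not exist in this series):

	#include <fcntl.h>
	#include <unistd.h>

	static int update_kthread_mask(void)
	{
		/* Hypothetical per-feature isolation file */
		int fd = open("/sys/fs/cgroup/cpuset/cpuset.isolation.kthread",
			      O_WRONLY);

		if (fd < 0)
			return -1;
		/* 1) Change the kthread mask... */
		if (write(fd, "1", 1) != 1) {
			close(fd);
			return -1;
		}
		close(fd);
		/*
		 * 2) ...then move the pre-existing threads; kthreads
		 * created from this point on already start with the
		 * new mask.
		 */
		return 0;
	}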


Updates to interface
====================

Also, thinking about updates to the interface (which today is one
cpumask per isolation feature) might be useful. What can happen:

1) New isolation feature is added, feature name added to the interface.

Userspace must support the new filename. If it is not there, then
that's an old kernel without support for it.

2) If an isolation feature is removed, a file will be gone. What should
be the behaviour there? Remove the file? (userspace should probably 
ignore the failure in that case?) (then feature names should not be
reused, as that can confuse #1 above).

Or maybe have a versioned scheme?

> 
> CHECKME: The scheduler is unhappy when _every_ CPU is isolated
> 
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Marcelo Tosatti <mtosatti@redhat.com>
> Cc: Nitesh Lal <nilal@redhat.com>
> Cc: Nicolas Saenz <nsaenzju@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Christoph Lameter <cl@gentwo.de>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Zefan Li <lizefan.x@bytedance.com>
> Cc: Alex Belits <abelits@marvell.com>
> ---
>  kernel/cgroup/cpuset.c | 111 +++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 107 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index adb5190c4429..ecb63be04408 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -82,6 +82,7 @@ struct cpuset {
>  	struct cgroup_subsys_state css;
>  
>  	unsigned long flags;		/* "unsigned long" so bitops work */
> +	unsigned long isol_flags;
>  
>  	/*
>  	 * On default hierarchy:
> @@ -258,6 +259,17 @@ static inline int is_spread_slab(const struct cpuset *cs)
>  	return test_bit(CS_SPREAD_SLAB, &cs->flags);
>  }
>  
> +/* bits in struct cpuset isol_flags field */
> +typedef enum {
> +	CS_ISOL_DOMAIN,
> +	CS_ISOL_MAX
> +} isol_flagbits_t;
> +
> +static inline int is_isol_domain(const struct cpuset *cs)
> +{
> +	return test_bit(CS_ISOL_DOMAIN, &cs->isol_flags);
> +}
> +
>  static inline int is_partition_root(const struct cpuset *cs)
>  {
>  	return cs->partition_root_state > 0;
> @@ -269,6 +281,13 @@ static struct cpuset top_cpuset = {
>  	.partition_root_state = PRS_ENABLED,
>  };
>  
> +/*
> + * CPUs passed through "isolcpus=" on boot, waiting to be mounted
> + * as soon as we meet a cpuset directory whose cpus_allowed is a
> + * subset of "isolcpus="
> + */
> +static cpumask_var_t unmounted_isolcpus_mask;
> +
>  /**
>   * cpuset_for_each_child - traverse online children of a cpuset
>   * @child_cs: loop cursor pointing to the current child
> @@ -681,6 +700,39 @@ static inline int nr_cpusets(void)
>  	return static_key_count(&cpusets_enabled_key.key) + 1;
>  }
>  
> +static int update_domain_housekeeping_mask(void)
> +{
> +	struct cpuset *cp;	/* top-down scan of cpusets */
> +	struct cgroup_subsys_state *pos_css;
> +	cpumask_var_t domain_mask;
> +
> +	if (!zalloc_cpumask_var(&domain_mask, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	cpumask_andnot(domain_mask, cpu_possible_mask, unmounted_isolcpus_mask);
> +
> +	rcu_read_lock();
> +	cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
> +		if (is_isol_domain(cp))
> +			cpumask_andnot(domain_mask, domain_mask, cp->cpus_allowed);
> +
> +		if (cpumask_subset(cp->cpus_allowed, unmounted_isolcpus_mask)) {
> +			unsigned long flags;
> +			cpumask_andnot(unmounted_isolcpus_mask, unmounted_isolcpus_mask,
> +				       cp->cpus_allowed);
> +			spin_lock_irqsave(&callback_lock, flags);
> +			cp->isol_flags |= BIT(CS_ISOL_DOMAIN);
> +			spin_unlock_irqrestore(&callback_lock, flags);
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	housekeeping_cpumask_set(domain_mask, HK_FLAG_DOMAIN);
> +	free_cpumask_var(domain_mask);
> +
> +	return 0;
> +}
> +
>  /*
>   * generate_sched_domains()
>   *
> @@ -741,6 +793,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
>  	struct cpuset **csa;	/* array of all cpuset ptrs */
>  	int csn;		/* how many cpuset ptrs in csa so far */
>  	int i, j, k;		/* indices for partition finding loops */
> +	int err;
>  	cpumask_var_t *doms;	/* resulting partition; i.e. sched domains */
>  	struct sched_domain_attr *dattr;  /* attributes for custom domains */
>  	int ndoms = 0;		/* number of sched domains in result */
> @@ -752,6 +805,10 @@ static int generate_sched_domains(cpumask_var_t **domains,
>  	dattr = NULL;
>  	csa = NULL;
>  
> +	err = update_domain_housekeeping_mask();
> +	if (err < 0)
> +		pr_err("Can't update housekeeping cpumask\n");
> +
>  	/* Special case for the 99% of systems with one, full, sched domain */
>  	if (root_load_balance && !top_cpuset.nr_subparts_cpus) {
>  		ndoms = 1;
> @@ -1449,7 +1506,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
>  		 * root as well.
>  		 */
>  		if (!cpumask_empty(cp->cpus_allowed) &&
> -		    is_sched_load_balance(cp) &&
> +		    (is_sched_load_balance(cp) || is_isol_domain(cp)) &&
>  		   (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
>  		    is_partition_root(cp)))
>  			need_rebuild_sched_domains = true;
> @@ -1935,6 +1992,30 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
>  	return err;
>  }
>  
> +/*
> + * update_isol_flags - read a 0 or a 1 in a file and update associated isol flag
> + * mask:	the new mask value to apply (see isol_flagbits_t)
> + * cs:		the cpuset to update
> + *
> + * Call with cpuset_mutex held.
> + */
> +static int update_isol_flags(struct cpuset *cs, u64 mask)
> +{
> +	unsigned long old_mask = cs->isol_flags;
> +
> +	if (mask & ~(BIT_ULL(CS_ISOL_MAX) - 1))
> +		return -EINVAL;
> +
> +	spin_lock_irq(&callback_lock);
> +	cs->isol_flags = (unsigned long)mask;
> +	spin_unlock_irq(&callback_lock);
> +
> +	if (mask ^ old_mask)
> +		rebuild_sched_domains_locked();
> +
> +	return 0;
> +}
> +
>  /*
>   * update_prstate - update partititon_root_state
>   * cs:	the cpuset to update
> @@ -2273,6 +2354,9 @@ typedef enum {
>  	FILE_MEMORY_PRESSURE,
>  	FILE_SPREAD_PAGE,
>  	FILE_SPREAD_SLAB,
> +//CHECKME: should we have individual cpuset.isolation.$feature files
> +//instead of a mask of features in a single file?
> +	FILE_ISOLATION_MASK,
>  } cpuset_filetype_t;
>  
>  static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
> @@ -2314,6 +2398,9 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
>  	case FILE_SPREAD_SLAB:
>  		retval = update_flag(CS_SPREAD_SLAB, cs, val);
>  		break;
> +	case FILE_ISOLATION_MASK:
> +		retval = update_isol_flags(cs, val);
> +		break;
>  	default:
>  		retval = -EINVAL;
>  		break;
> @@ -2481,6 +2568,8 @@ static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
>  		return is_spread_page(cs);
>  	case FILE_SPREAD_SLAB:
>  		return is_spread_slab(cs);
> +	case FILE_ISOLATION_MASK:
> +		return cs->isol_flags;
>  	default:
>  		BUG();
>  	}
> @@ -2658,6 +2747,13 @@ static struct cftype legacy_files[] = {
>  		.private = FILE_MEMORY_PRESSURE_ENABLED,
>  	},
>  
> +	{
> +		.name = "isolation_mask",
> +		.read_u64 = cpuset_read_u64,
> +		.write_u64 = cpuset_write_u64,
> +		.private = FILE_ISOLATION_MASK,
> +	},
> +
>  	{ }	/* terminate */
>  };
>  
> @@ -2834,9 +2930,12 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
>  	if (is_partition_root(cs))
>  		update_prstate(cs, 0);
>  
> -	if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
> -	    is_sched_load_balance(cs))
> -		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
> +	if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
> +		if (is_sched_load_balance(cs))
> +			update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
> +		if (is_isol_domain(cs))
> +			update_isol_flags(cs, cs->isol_flags & ~BIT(CS_ISOL_DOMAIN));
> +	}
>  
>  	if (cs->use_parent_ecpus) {
>  		struct cpuset *parent = parent_cs(cs);
> @@ -2873,6 +2972,9 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
>  		top_cpuset.mems_allowed = top_cpuset.effective_mems;
>  	}
>  
> +	cpumask_andnot(unmounted_isolcpus_mask, cpu_possible_mask,
> +		       housekeeping_cpumask(HK_FLAG_DOMAIN));
> +
>  	spin_unlock_irq(&callback_lock);
>  	percpu_up_write(&cpuset_rwsem);
>  }
> @@ -2932,6 +3034,7 @@ int __init cpuset_init(void)
>  	top_cpuset.relax_domain_level = -1;
>  
>  	BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));
> +	BUG_ON(!alloc_cpumask_var(&unmounted_isolcpus_mask, GFP_KERNEL));
>  
>  	return 0;
>  }
> -- 
> 2.25.1
> 
> 


* Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file
  2021-07-14 13:54 ` [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file Frederic Weisbecker
  2021-07-14 16:31   ` Marcelo Tosatti
@ 2021-07-14 16:52   ` Peter Zijlstra
  2021-07-14 23:13     ` Frederic Weisbecker
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2021-07-14 16:52 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Tejun Heo, Juri Lelli, Alex Belits, Nitesh Lal,
	Thomas Gleixner, Nicolas Saenz, Christoph Lameter,
	Marcelo Tosatti, Zefan Li, cgroups

On Wed, Jul 14, 2021 at 03:54:20PM +0200, Frederic Weisbecker wrote:
> Add a new cpuset.isolation_mask file in order to be able to modify the
> housekeeping cpumask for each individual isolation feature at runtime.
> In the future this will include nohz_full, unbound timers,
> unbound workqueues, unbound kthreads, managed irqs, etc...
> 
> Start with supporting domain exclusion and CPUs passed through
> "isolcpus=".
> 
> The cpuset.isolation_mask defaults to 0. Setting it to 1 will exclude
> the given cpuset from the domains (they will be attached to the NULL domain).
> As long as a CPU is part of any cpuset with cpuset.isolation_mask set to
> 1, it will remain isolated even if it overlaps with another cpuset that
> has cpuset.isolation_mask set to 0. The same applies to parent and
> subdirectories.
> 
> If a cpuset is a subset of "isolcpus=", it is automatically mapped and
> cpuset.isolation_mask will be set to 1. This subset is then cleared from
> the initial "isolcpus=" mask. The user is then free to override
> cpuset.isolation_mask to 0 in order to revert the effect of "isolcpus=".
> 
> Here is an example of use where CPU 7 has been isolated on boot and
> gets re-attached to domains later from cpuset:
> 
> 	$ cat /proc/cmdline
> 		isolcpus=7
> 	$ cd /sys/fs/cgroup/cpuset
> 	$ mkdir cpu7
> 	$ cd cpu7
> 	$ cat cpuset.cpus
> 		0-7
> 	$ cat cpuset.isolation_mask
> 		0
> 	$ ls /sys/kernel/debug/domains/cpu7	# empty because isolcpus=7
> 	$ echo 7 > cpuset.cpus
> 	$ cat cpuset.isolation_mask	# isolcpus subset automatically mapped
> 		1
> 	$ echo 0 > cpuset.isolation_mask
> 	$ ls /sys/kernel/debug/domains/cpu7/
> 		domain0  domain1
> 

cpusets already has means to create partitions; why are you creating
something else?

* Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file
  2021-07-14 16:52   ` Peter Zijlstra
@ 2021-07-14 23:13     ` Frederic Weisbecker
  2021-07-14 23:44       ` Valentin Schneider
  2021-07-15  9:04       ` Peter Zijlstra
  0 siblings, 2 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2021-07-14 23:13 UTC (permalink / raw)
  To: Peter Zijlstra, Valentin Schneider
  Cc: LKML, Tejun Heo, Juri Lelli, Alex Belits, Nitesh Lal,
	Thomas Gleixner, Nicolas Saenz, Christoph Lameter,
	Marcelo Tosatti, Zefan Li, cgroups

On Wed, Jul 14, 2021 at 06:52:43PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 14, 2021 at 03:54:20PM +0200, Frederic Weisbecker wrote:
> > Add a new cpuset.isolation_mask file in order to be able to modify the
> > housekeeping cpumask for each individual isolation feature at runtime.
> > In the future this will include nohz_full, unbound timers,
> > unbound workqueues, unbound kthreads, managed irqs, etc...
> > 
> > Start with supporting domain exclusion and CPUs passed through
> > "isolcpus=".
> > 
> > The cpuset.isolation_mask defaults to 0. Setting it to 1 will exclude
> > the given cpuset from the domains (they will be attached to the NULL domain).
> > As long as a CPU is part of any cpuset with cpuset.isolation_mask set to
> > 1, it will remain isolated even if it overlaps with another cpuset that
> > has cpuset.isolation_mask set to 0. The same applies to parent and
> > subdirectories.
> > 
> > If a cpuset is a subset of "isolcpus=", it is automatically mapped and
> > cpuset.isolation_mask will be set to 1. This subset is then cleared from
> > the initial "isolcpus=" mask. The user is then free to override
> > cpuset.isolation_mask to 0 in order to revert the effect of "isolcpus=".
> > 
> > Here is an example of use where CPU 7 has been isolated on boot and
> > gets re-attached to domains later from cpuset:
> > 
> > 	$ cat /proc/cmdline
> > 		isolcpus=7
> > 	$ cd /sys/fs/cgroup/cpuset
> > 	$ mkdir cpu7
> > 	$ cd cpu7
> > 	$ cat cpuset.cpus
> > 		0-7
> > 	$ cat cpuset.isolation_mask
> > 		0
> > 	$ ls /sys/kernel/debug/domains/cpu7	# empty because isolcpus=7
> > 	$ echo 7 > cpuset.cpus
> > 	$ cat cpuset.isolation_mask	# isolcpus subset automatically mapped
> > 		1
> > 	$ echo 0 > cpuset.isolation_mask
> > 	$ ls /sys/kernel/debug/domains/cpu7/
> > 		domain0  domain1
> > 
> 
> cpusets already has means to create partitions; why are you creating
> something else?

I was about to answer that the semantics of isolcpus, which reference
a NULL domain, are different from SD_LOAD_BALANCE implied by
cpuset.sched_load_balance. But then I realized that SD_LOAD_BALANCE has
been removed.

How is cpuset.sched_load_balance implemented then? Commit
e669ac8ab952df2f07dee1e1efbf40647d6de332 ("sched: Remove checks against
SD_LOAD_BALANCE") advertises that setting cpuset.sched_load_balance to 0
ends up creating NULL domain but that's not what I get. For example if I
mount a single cpuset root (no other cpuset mountpoints):

$ mount -t cgroup none ./cpuset -o cpuset
$ cd cpuset
$ cat cpuset.cpus
0-7
$ cat cpuset.sched_load_balance
1
$ echo 0 > cpuset.sched_load_balance
$ ls /sys/kernel/debug/domains/cpu1/
domain0  domain1

I still get the domains on all CPUs...

* Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file
  2021-07-14 23:13     ` Frederic Weisbecker
@ 2021-07-14 23:44       ` Valentin Schneider
  2021-07-15  0:07         ` Frederic Weisbecker
  2021-07-15  9:04       ` Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: Valentin Schneider @ 2021-07-14 23:44 UTC (permalink / raw)
  To: Frederic Weisbecker, Peter Zijlstra
  Cc: LKML, Tejun Heo, Juri Lelli, Alex Belits, Nitesh Lal,
	Thomas Gleixner, Nicolas Saenz, Christoph Lameter,
	Marcelo Tosatti, Zefan Li, cgroups

On 15/07/21 01:13, Frederic Weisbecker wrote:
> On Wed, Jul 14, 2021 at 06:52:43PM +0200, Peter Zijlstra wrote:
>>
>> cpusets already has means to create partitions; why are you creating
>> something else?
>
> I was about to answer that the semantics of isolcpus, which reference
> a NULL domain, are different from SD_LOAD_BALANCE implied by
> cpuset.sched_load_balance. But then I realized that SD_LOAD_BALANCE has
> been removed.
>
> How is cpuset.sched_load_balance implemented then? Commit
> e669ac8ab952df2f07dee1e1efbf40647d6de332 ("sched: Remove checks against
> SD_LOAD_BALANCE") advertises that setting cpuset.sched_load_balance to 0
> ends up creating NULL domain but that's not what I get. For example if I
> mount a single cpuset root (no other cpuset mountpoints):
>
> $ mount -t cgroup none ./cpuset -o cpuset
> $ cd cpuset
> $ cat cpuset.cpus
> 0-7
> $ cat cpuset.sched_load_balance
> 1
> $ echo 0 > cpuset.sched_load_balance
> $ ls /sys/kernel/debug/domains/cpu1/
> domain0  domain1
>
> I still get the domains on all CPUs...

Huh. That's on v5.14-rc1 with an automounted cpuset:

$ cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-5
$ cat /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
1

$ ls /sys/kernel/debug/sched/domains/cpu*
/sys/kernel/debug/sched/domains/cpu0:
domain0  domain1

/sys/kernel/debug/sched/domains/cpu1:
domain0  domain1

/sys/kernel/debug/sched/domains/cpu2:
domain0  domain1

/sys/kernel/debug/sched/domains/cpu3:
domain0  domain1

/sys/kernel/debug/sched/domains/cpu4:
domain0  domain1

/sys/kernel/debug/sched/domains/cpu5:
domain0  domain1

$ echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
$ ls /sys/kernel/debug/sched/domains/cpu*
/sys/kernel/debug/sched/domains/cpu0:

/sys/kernel/debug/sched/domains/cpu1:

/sys/kernel/debug/sched/domains/cpu2:

/sys/kernel/debug/sched/domains/cpu3:

/sys/kernel/debug/sched/domains/cpu4:

/sys/kernel/debug/sched/domains/cpu5:


I also checked that you can keep cpuset.sched_load_balance=0 at the root
and create exclusive child cpusets with different values of
sched_load_balance, giving you some CPUs attached to the NULL domain and
some others with a sched_domain hierarchy that stays within the cpuset span.

* Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file
  2021-07-14 23:44       ` Valentin Schneider
@ 2021-07-15  0:07         ` Frederic Weisbecker
  0 siblings, 0 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2021-07-15  0:07 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Peter Zijlstra, LKML, Tejun Heo, Juri Lelli, Alex Belits,
	Nitesh Lal, Thomas Gleixner, Nicolas Saenz, Christoph Lameter,
	Marcelo Tosatti, Zefan Li, cgroups

On Thu, Jul 15, 2021 at 12:44:08AM +0100, Valentin Schneider wrote:
> On 15/07/21 01:13, Frederic Weisbecker wrote:
> > On Wed, Jul 14, 2021 at 06:52:43PM +0200, Peter Zijlstra wrote:
> >>
> >> cpusets already has means to create partitions; why are you creating
> >> something else?
> >
> > I was about to answer that the semantics of isolcpus, which reference
> > a NULL domain, are different from SD_LOAD_BALANCE implied by
> > cpuset.sched_load_balance. But then I realized that SD_LOAD_BALANCE has
> > been removed.
> >
> > How is cpuset.sched_load_balance implemented then? Commit
> > e669ac8ab952df2f07dee1e1efbf40647d6de332 ("sched: Remove checks against
> > SD_LOAD_BALANCE") advertises that setting cpuset.sched_load_balance to 0
> > ends up creating NULL domain but that's not what I get. For example if I
> > mount a single cpuset root (no other cpuset mountpoints):
> >
> > $ mount -t cgroup none ./cpuset -o cpuset
> > $ cd cpuset
> > $ cat cpuset.cpus
> > 0-7
> > $ cat cpuset.sched_load_balance
> > 1
> > $ echo 0 > cpuset.sched_load_balance
> > $ ls /sys/kernel/debug/domains/cpu1/
> > domain0  domain1
> >
> > I still get the domains on all CPUs...
> 
> Huh. That's on v5.14-rc1 with an automounted cpuset:
> 
> $ cat /sys/fs/cgroup/cpuset/cpuset.cpus
> 0-5
> $ cat /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
> 1
> 
> $ ls /sys/kernel/debug/sched/domains/cpu*
> /sys/kernel/debug/sched/domains/cpu0:
> domain0  domain1
> 
> /sys/kernel/debug/sched/domains/cpu1:
> domain0  domain1
> 
> /sys/kernel/debug/sched/domains/cpu2:
> domain0  domain1
> 
> /sys/kernel/debug/sched/domains/cpu3:
> domain0  domain1
> 
> /sys/kernel/debug/sched/domains/cpu4:
> domain0  domain1
> 
> /sys/kernel/debug/sched/domains/cpu5:
> domain0  domain1
> 
> $ echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
> $ ls /sys/kernel/debug/sched/domains/cpu*
> /sys/kernel/debug/sched/domains/cpu0:
> 
> /sys/kernel/debug/sched/domains/cpu1:
> 
> /sys/kernel/debug/sched/domains/cpu2:
> 
> /sys/kernel/debug/sched/domains/cpu3:
> 
> /sys/kernel/debug/sched/domains/cpu4:
> 
> /sys/kernel/debug/sched/domains/cpu5:
> 
> 
> I also checked that you can keep cpuset.sched_load_balance=0 at the root
> and create exclusive child cpusets with different values of
> sched_load_balance, giving you some CPUs attached to the NULL domain and
> some others with a sched_domain hierarchy that stays within the cpuset span.

Ok I must have done something wrong somewhere, I'll check further tomorrow.

Thanks!

* Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file
  2021-07-14 23:13     ` Frederic Weisbecker
  2021-07-14 23:44       ` Valentin Schneider
@ 2021-07-15  9:04       ` Peter Zijlstra
  2021-07-19 13:17         ` Frederic Weisbecker
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2021-07-15  9:04 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Valentin Schneider, LKML, Tejun Heo, Juri Lelli, Alex Belits,
	Nitesh Lal, Thomas Gleixner, Nicolas Saenz, Christoph Lameter,
	Marcelo Tosatti, Zefan Li, cgroups

On Thu, Jul 15, 2021 at 01:13:38AM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 14, 2021 at 06:52:43PM +0200, Peter Zijlstra wrote:

> > cpusets already has means to create partitions; why are you creating
> > something else?
> 
> I was about to answer that the semantics of isolcpus, which reference
> a NULL domain, are different from SD_LOAD_BALANCE implied by
> cpuset.sched_load_balance. But then I realized that SD_LOAD_BALANCE has
> been removed.
> 
> How is cpuset.sched_load_balance implemented then? Commit
> e669ac8ab952df2f07dee1e1efbf40647d6de332 ("sched: Remove checks against
> SD_LOAD_BALANCE") advertises that setting cpuset.sched_load_balance to 0
> ends up creating NULL domain but that's not what I get. For example if I
> mount a single cpuset root (no other cpuset mountpoints):

SD_LOAD_BALANCE was only for when you wanted to stop balancing inside a
domain tree. That no longer happens (and hasn't for a *long* time).
Cpusets simply creates multiple domain trees (or the empty one if it's
just one CPU).

> $ mount -t cgroup none ./cpuset -o cpuset
> $ cd cpuset
> $ cat cpuset.cpus
> 0-7
> $ cat cpuset.sched_load_balance
> 1
> $ echo 0 > cpuset.sched_load_balance
> $ ls /sys/kernel/debug/domains/cpu1/
> domain0  domain1
> 
> I still get the domains on all CPUs...

(note, that's the cgroup-v1 interface, the cgroup-v2 interface is
significantly different)

I'd suggest doing: echo 1 > /debug/sched/verbose. If I do the above I
get:

[1290784.889705] CPU0 attaching NULL sched-domain.
[1290784.894830] CPU1 attaching NULL sched-domain.
[1290784.899947] CPU2 attaching NULL sched-domain.
[1290784.905056] CPU3 attaching NULL sched-domain.
[1290784.910153] CPU4 attaching NULL sched-domain.
[1290784.915252] CPU5 attaching NULL sched-domain.
[1290784.920338] CPU6 attaching NULL sched-domain.
[1290784.925439] CPU7 attaching NULL sched-domain.
[1290784.930535] CPU8 attaching NULL sched-domain.
[1290784.935660] CPU9 attaching NULL sched-domain.
[1290784.940911] CPU10 attaching NULL sched-domain.
[1290784.946117] CPU11 attaching NULL sched-domain.
[1290784.951317] CPU12 attaching NULL sched-domain.
[1290784.956507] CPU13 attaching NULL sched-domain.
[1290784.961688] CPU14 attaching NULL sched-domain.
[1290784.966876] CPU15 attaching NULL sched-domain.
[1290784.972047] CPU16 attaching NULL sched-domain.
[1290784.977218] CPU17 attaching NULL sched-domain.
[1290784.982383] CPU18 attaching NULL sched-domain.
[1290784.987552] CPU19 attaching NULL sched-domain.
[1290784.992724] CPU20 attaching NULL sched-domain.
[1290784.997893] CPU21 attaching NULL sched-domain.
[1290785.003063] CPU22 attaching NULL sched-domain.
[1290785.008230] CPU23 attaching NULL sched-domain.
[1290785.013400] CPU24 attaching NULL sched-domain.
[1290785.018568] CPU25 attaching NULL sched-domain.
[1290785.023736] CPU26 attaching NULL sched-domain.
[1290785.028905] CPU27 attaching NULL sched-domain.
[1290785.034074] CPU28 attaching NULL sched-domain.
[1290785.039241] CPU29 attaching NULL sched-domain.
[1290785.044409] CPU30 attaching NULL sched-domain.
[1290785.049579] CPU31 attaching NULL sched-domain.
[1290785.054816] CPU32 attaching NULL sched-domain.
[1290785.059986] CPU33 attaching NULL sched-domain.
[1290785.065154] CPU34 attaching NULL sched-domain.
[1290785.070323] CPU35 attaching NULL sched-domain.
[1290785.075492] CPU36 attaching NULL sched-domain.
[1290785.080662] CPU37 attaching NULL sched-domain.
[1290785.085832] CPU38 attaching NULL sched-domain.
[1290785.091001] CPU39 attaching NULL sched-domain.

Then when I do:

# mkdir /cgroup/A
# echo 0,20 > /cgroup/A/cpuset.cpus

I get:

[1291020.749036] CPU0 attaching sched-domain(s):
[1291020.754251]  domain-0: span=0,20 level=SMT
[1291020.759061]   groups: 0:{ span=0 }, 20:{ span=20 }
[1291020.765386] CPU20 attaching sched-domain(s):
[1291020.770399]  domain-0: span=0,20 level=SMT
[1291020.775210]   groups: 20:{ span=20 }, 0:{ span=0 }
[1291020.780831] root domain span: 0,20 (max cpu_capacity = 1024)

IOW, I've created a load-balance domain on just the first core of the
system.

# echo 0-1,20-21 > /cgroup/A/cpuset.cpus

Extends it to the first two cores:

[1291340.260699] CPU0 attaching NULL sched-domain.
[1291340.265820] CPU20 attaching NULL sched-domain.
[1291340.271403] CPU0 attaching sched-domain(s):
[1291340.276315]  domain-0: span=0,20 level=SMT
[1291340.281122]   groups: 0:{ span=0 }, 20:{ span=20 }
[1291340.286719]   domain-1: span=0-1,20-21 level=MC
[1291340.292011]    groups: 0:{ span=0,20 cap=2048 }, 1:{ span=1,21 cap=2048 }
[1291340.299855] CPU1 attaching sched-domain(s):
[1291340.304757]  domain-0: span=1,21 level=SMT
[1291340.309564]   groups: 1:{ span=1 }, 21:{ span=21 }
[1291340.315190]   domain-1: span=0-1,20-21 level=MC
[1291340.320474]    groups: 1:{ span=1,21 cap=2048 }, 0:{ span=0,20 cap=2048 }
[1291340.328307] CPU20 attaching sched-domain(s):
[1291340.333344]  domain-0: span=0,20 level=SMT
[1291340.338136]   groups: 20:{ span=20 }, 0:{ span=0 }
[1291340.343721]   domain-1: span=0-1,20-21 level=MC
[1291340.348980]    groups: 0:{ span=0,20 cap=2048 }, 1:{ span=1,21 cap=2048 }
[1291340.356783] CPU21 attaching sched-domain(s):
[1291340.361755]  domain-0: span=1,21 level=SMT
[1291340.366534]   groups: 21:{ span=21 }, 1:{ span=1 }
[1291340.372099]   domain-1: span=0-1,20-21 level=MC
[1291340.377364]    groups: 1:{ span=1,21 cap=2048 }, 0:{ span=0,20 cap=2048 }
[1291340.385216] root domain span: 0-1,20-21 (max cpu_capacity = 1024)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset
  2021-07-14 13:54 [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2021-07-14 13:54 ` [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file Frederic Weisbecker
@ 2021-07-16 18:02 ` Waiman Long
  2021-07-19 13:57   ` Frederic Weisbecker
  6 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2021-07-16 18:02 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Tejun Heo, Peter Zijlstra, Juri Lelli, Alex Belits, Nitesh Lal,
	Thomas Gleixner, Nicolas Saenz, Christoph Lameter,
	Marcelo Tosatti, Zefan Li, cgroups

On 7/14/21 9:54 AM, Frederic Weisbecker wrote:
> The fact that "isolcpus=" behaviour can't be modified at runtime is an
> eternal source of discussion and debate opposing a useful feature against
> a terrible interface.
>
> I've long since tried to figure out a proper way to control this at
> runtime using cpusets, which isn't easy as a boot time single cpumask
> is difficult to map to a hierarchy of cpusets that can even overlap.

I have a cpuset patch that allows disabling load balancing in a
cgroup-v2 setting:

https://lore.kernel.org/lkml/20210621184924.27493-1-longman@redhat.com/

The idea of a cpuset partition is that there is no overlap of cpus
between different partitions, so there is no confusion about whether a
cpu is load-balanced or not.
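
A minimal sketch of that invariant (the helper and both arguments are
hypothetical, not taken from the patch): a new partition's cpus must not
intersect any sibling partition's cpus.

	/*
	 * Hypothetical sketch of the no-overlap rule. cpumask_intersects()
	 * is the real kernel primitive; the helper name and the idea of
	 * checking one sibling at a time are illustrative only.
	 */
	static int validate_partition(const struct cpumask *new_cpus,
				      const struct cpumask *sibling_cpus)
	{
		/* Overlap would make "is this cpu balanced?" ambiguous. */
		if (cpumask_intersects(new_cpus, sibling_cpus))
			return -EINVAL;
		return 0;
	}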

>
> The idea here is to map the boot-set isolation behaviour to any cpuset
> directory whose cpumask is a subset of "isolcpus=". I let you browse
> for details on the last patch.
>
> Note this is still WIP and half-baked, but I figured it's important to
> validate the interface early.

Using different cpumasks for different isolation properties is the easy
part. The hard part is to make the different subsystems change their
behavior as the isolation masks change dynamically at run time.
Currently, they check the housekeeping cpumask only at boot time or when
certain events happen.
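
To make that concrete, here is a simplified sketch of the usual consumer
pattern (the subsystem, its init function and its kthread are invented,
not a real call site): the mask is sampled once at init and never
re-read, so a runtime change is invisible to it.

	#include <linux/sched.h>
	#include <linux/sched/isolation.h>

	static struct task_struct *my_subsys_kthread; /* created elsewhere */

	/*
	 * Invented one-shot consumer: the housekeeping mask is fetched
	 * at init time only, so later updates to HK_FLAG_DOMAIN never
	 * reach this subsystem unless it is explicitly notified.
	 */
	static int __init my_subsys_init(void)
	{
		const struct cpumask *hk = housekeeping_cpumask(HK_FLAG_DOMAIN);

		/* Affinity decided once, here, and never revisited. */
		return set_cpus_allowed_ptr(my_subsys_kthread, hk);
	}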

Cheers,
Longman



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file
  2021-07-15  9:04       ` Peter Zijlstra
@ 2021-07-19 13:17         ` Frederic Weisbecker
  0 siblings, 0 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2021-07-19 13:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, LKML, Tejun Heo, Juri Lelli, Alex Belits,
	Nitesh Lal, Thomas Gleixner, Nicolas Saenz, Christoph Lameter,
	Marcelo Tosatti, Zefan Li, cgroups

On Thu, Jul 15, 2021 at 11:04:19AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 15, 2021 at 01:13:38AM +0200, Frederic Weisbecker wrote:
> > On Wed, Jul 14, 2021 at 06:52:43PM +0200, Peter Zijlstra wrote:
> 
> > > cpusets already has means to create partitions; why are you creating
> > > something else?
> > 
> > I was about to answer that the semantics of isolcpus, which imply a
> > NULL domain, are different from the SD_LOAD_BALANCE behaviour implied by
> > cpuset.sched_load_balance. But then I realized that SD_LOAD_BALANCE has
> > been removed.
> > 
> > How is cpuset.sched_load_balance implemented then? Commit
> > e669ac8ab952df2f07dee1e1efbf40647d6de332 ("sched: Remove checks against
> > SD_LOAD_BALANCE") advertises that setting cpuset.sched_load_balance to 0
> > ends up creating a NULL domain, but that's not what I get. For example, if I
> > mount a single cpuset root (no other cpuset mountpoints):
> 
> SD_LOAD_BALANCE was only for when you wanted to stop balancing inside a
> domain tree. That no longer happens (and hasn't for a *long* time).
> Cpusets simply creates multiple domain trees (or the empty one if it's
> just one CPU).

Ok.

> 
> > $ mount -t cgroup none ./cpuset -o cpuset
> > $ cd cpuset
> > $ cat cpuset.cpus
> > 0-7
> > $ cat cpuset.sched_load_balance
> > 1
> > $ echo 0 > cpuset.sched_load_balance
> > $ ls /sys/kernel/debug/domains/cpu1/
> > domain0  domain1
> > 
> > I still get the domains on all CPUs...
> 
> (note, that's the cgroup-v1 interface; the cgroup-v2 interface is
> significantly different)
> 
> I'd suggest doing: echo 1 > /debug/sched/verbose. If I do the above, I
> get:
> 
> [1290784.889705] CPU0 attaching NULL sched-domain.
> [1290784.894830] CPU1 attaching NULL sched-domain.

Thanks!

Eventually I uninstalled cgmanager and things seem to work now. I have
no idea why and I'm not sure I'm willing to investigate further :o)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file
  2021-07-14 16:31   ` Marcelo Tosatti
@ 2021-07-19 13:26     ` Frederic Weisbecker
  2021-07-19 15:41       ` Marcelo Tosatti
  0 siblings, 1 reply; 19+ messages in thread
From: Frederic Weisbecker @ 2021-07-19 13:26 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: LKML, Tejun Heo, Peter Zijlstra, Juri Lelli, Alex Belits,
	Nitesh Lal, Thomas Gleixner, Nicolas Saenz, Christoph Lameter,
	Zefan Li, cgroups

On Wed, Jul 14, 2021 at 01:31:57PM -0300, Marcelo Tosatti wrote:
> On Wed, Jul 14, 2021 at 03:54:20PM +0200, Frederic Weisbecker wrote:
> > Add a new cpuset.isolation_mask file in order to be able to modify the
> > housekeeping cpumask for each individual isolation feature at runtime.
> > In the future this will include nohz_full, unbound timers,
> > unbound workqueues, unbound kthreads, managed irqs, etc...
> > 
> > Start with supporting domain exclusion and CPUs passed through
> > "isolcpus=".
> 
> Is it possible to just return -ENOTSUPPORTED for the features
> whose support is not present?

Maybe, although that looks like a specialized error for corner cases.

> > 
> > CHECKME: Should we have individual cpuset.isolation.$feature files for
> >          each isolation feature instead of a single mask file?
> 
> Yes, I guess that is useful, for example due to the -ENOTSUPPORTED
> comment above.
> 
> 
> Guarantees on updates
> =====================
> 
> Perhaps start with a document describing:
> 
> On return from the write to the cpumask file, what are the guarantees?
> 
> For example, for kthreads it is that any kernel threads created from that
> point on should start with the new mask. Therefore userspace should
> respect the order:
> 
> 1) Change kthread mask.
> 2) Move threads.
> 

Yep.
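
In other words, userspace would do something like this (hedged sketch:
the mount path and cpu values are assumptions, and cpuset.isolation_mask
is the file proposed in patch 6/6):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sched.h>
	#include <unistd.h>

	int main(void)
	{
		cpu_set_t set;
		/* 1) change the kthread/housekeeping mask first */
		int fd = open("/sys/fs/cgroup/cpuset/cpuset.isolation_mask",
			      O_WRONLY);
		if (fd >= 0) {
			write(fd, "4-7", 3);	/* isolate cpus 4-7 */
			close(fd);
		}
		/* 2) only then pin the latency-sensitive thread */
		CPU_ZERO(&set);
		CPU_SET(4, &set);
		return sched_setaffinity(0, sizeof(set), &set);
	}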

> Updates to interface
> ====================
> 
> Also, thinking about updates to the interface (which today is one
> cpumask per isolation feature) might be useful. What can happen:
> 
> 1) New isolation feature is added, feature name added to the interface.
> 
> Userspace must support the new filename. If it is not there, then that's
> an old kernel without support for it.
> 
> 2) If an isolation feature is removed, a file will be gone. What should
> the behaviour be there? Remove the file? (Userspace should probably
> ignore the failure in that case.) (Then feature names should not be
> reused, as that can confuse #1 above.)

Heh, yeah that's complicated. I guess we should use one flag per file, as that
fits well within the current cpuset design. But we must carefully choose the new
files to make sure they have the least chance of becoming useless in the long term.
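
A rough sketch of what one flag per file could look like with cpuset's
existing cftype tables (every name below is invented for illustration):

	/* Invented per-feature control files, following the cftype
	 * pattern already used in kernel/cgroup/cpuset.c. */
	static int isolation_domain_show(struct seq_file *sf, void *v);
	static ssize_t isolation_domain_write(struct kernfs_open_file *of,
					      char *buf, size_t nbytes,
					      loff_t off);

	static struct cftype isolation_files[] = {
		{
			.name = "isolation.domain",
			.seq_show = isolation_domain_show,
			.write = isolation_domain_write,
		},
		/* later: "isolation.nohz_full", "isolation.kthread", ... */
		{ }	/* terminator */
	};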

> Or maybe have a versioned scheme?

I suspect we should avoid that at all costs :-)

Thanks!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset
  2021-07-16 18:02 ` [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset Waiman Long
@ 2021-07-19 13:57   ` Frederic Weisbecker
  0 siblings, 0 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2021-07-19 13:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: LKML, Tejun Heo, Peter Zijlstra, Juri Lelli, Alex Belits,
	Nitesh Lal, Thomas Gleixner, Nicolas Saenz, Christoph Lameter,
	Marcelo Tosatti, Zefan Li, cgroups

On Fri, Jul 16, 2021 at 02:02:50PM -0400, Waiman Long wrote:
> On 7/14/21 9:54 AM, Frederic Weisbecker wrote:
> > The fact that "isolcpus=" behaviour can't be modified at runtime is an
> > eternal source of discussion and debate opposing a useful feature against
> > a terrible interface.
> > 
> > I've long since tried to figure out a proper way to control this at
> > runtime using cpusets, which isn't easy as a boot time single cpumask
> > is difficult to map to a hierarchy of cpusets that can even overlap.
> 
> I have a cpuset patch that allows disabling load balancing in a cgroup-v2
> setting:
> 
> https://lore.kernel.org/lkml/20210621184924.27493-1-longman@redhat.com/
> 
> The idea of a cpuset partition is that there is no overlap of cpus
> between different partitions, so there is no confusion about whether a
> cpu is load-balanced or not.

Oh ok, I missed that. Time for me to check your patchset.

Thanks!

> 
> > 
> > The idea here is to map the boot-set isolation behaviour to any cpuset
> > directory whose cpumask is a subset of "isolcpus=". I let you browse
> > for details on the last patch.
> > 
> > Note this is still WIP and half-baked, but I figured it's important to
> > validate the interface early.
> 
> Using different cpumasks for different isolation properties is the easy part.
> The hard part is to make the different subsystems change their behavior as
> the isolation masks change dynamically at run time. Currently, they check
> the housekeeping cpumask only at boot time or when certain events happen.
> 
> Cheers,
> Longman
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file
  2021-07-19 13:26     ` Frederic Weisbecker
@ 2021-07-19 15:41       ` Marcelo Tosatti
  0 siblings, 0 replies; 19+ messages in thread
From: Marcelo Tosatti @ 2021-07-19 15:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Tejun Heo, Peter Zijlstra, Juri Lelli, Alex Belits,
	Nitesh Lal, Thomas Gleixner, Nicolas Saenz, Christoph Lameter,
	Zefan Li, cgroups

On Mon, Jul 19, 2021 at 03:26:49PM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 14, 2021 at 01:31:57PM -0300, Marcelo Tosatti wrote:
> > On Wed, Jul 14, 2021 at 03:54:20PM +0200, Frederic Weisbecker wrote:
> > > Add a new cpuset.isolation_mask file in order to be able to modify the
> > > housekeeping cpumask for each individual isolation feature at runtime.
> > > In the future this will include nohz_full, unbound timers,
> > > unbound workqueues, unbound kthreads, managed irqs, etc...
> > > 
> > > Start with supporting domain exclusion and CPUs passed through
> > > "isolcpus=".
> > 
> > Is it possible to just return -ENOTSUPPORTED for the features
> > whose support is not present?
> 
> Maybe, although that looks like a specialized error for corner cases.

Well, are you going to implement runtime enablement for all features,
including nohz_full, in the first patch set?

From my POV returning -ENOTSUPPORTED would allow for a gradual 
implementation of the features.
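
A minimal sketch of that gradual gating (the handler is invented;
housekeeping_cpumask_set() and HK_FLAG_DOMAIN come from this series, and
the closest existing errno is -EOPNOTSUPP rather than a literal
-ENOTSUPPORTED):

	/*
	 * Invented write-path helper: accept runtime updates only for
	 * features that already have an update path wired up, and
	 * reject the rest instead of silently ignoring them.
	 */
	static int isolation_mask_update(struct cpumask *mask,
					 enum hk_flags flags)
	{
		/* Only domain isolation is mutable so far. */
		if (flags & ~HK_FLAG_DOMAIN)
			return -EOPNOTSUPP;

		housekeeping_cpumask_set(mask, flags);
		return 0;
	}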

> > > CHECKME: Should we have individual cpuset.isolation.$feature files for
> > >          each isolation feature instead of a single mask file?
> > 
> > Yes, I guess that is useful, for example due to the -ENOTSUPPORTED
> > comment above.
> > 
> > 
> > Guarantees on updates
> > =====================
> > 
> > Perhaps start with a document describing:
> > 
> > On return from the write to the cpumask file, what are the guarantees?
> > 
> > For example, for kthreads it is that any kernel threads created from that
> > point on should start with the new mask. Therefore userspace should
> > respect the order:
> > 
> > 1) Change kthread mask.
> > 2) Move threads.
> > 
> 
> Yep.
> 
> > Updates to interface
> > ====================
> > 
> > Also, thinking about updates to the interface (which today is one
> > cpumask per isolation feature) might be useful. What can happen:
> > 
> > 1) New isolation feature is added, feature name added to the interface.
> > 
> > Userspace must support the new filename. If it is not there, then that's
> > an old kernel without support for it.
> > 
> > 2) If an isolation feature is removed, a file will be gone. What should
> > the behaviour be there? Remove the file? (Userspace should probably
> > ignore the failure in that case.) (Then feature names should not be
> > reused, as that can confuse #1 above.)
> 
> Heh, yeah that's complicated. I guess we should use one flag per file, as that
> fits well within the current cpuset design. But we must carefully choose the new
> files to make sure they have the least chance of becoming useless in the long term.
> 
> > Or maybe have a versioned scheme?
> 
> I suspect we should avoid that at all costs :-)
> 
> Thanks!

Makes sense.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 5/6] sched/isolation: Make HK_FLAG_DOMAIN mutable
  2021-07-14 13:54 ` [RFC PATCH 5/6] sched/isolation: Make HK_FLAG_DOMAIN mutable Frederic Weisbecker
@ 2021-07-21 14:28   ` Vincent Donnefort
  0 siblings, 0 replies; 19+ messages in thread
From: Vincent Donnefort @ 2021-07-21 14:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Tejun Heo, Peter Zijlstra, Juri Lelli, Alex Belits,
	Nitesh Lal, Thomas Gleixner, Nicolas Saenz, Christoph Lameter,
	Marcelo Tosatti, Zefan Li, cgroups

Hi Frederic,

[...]

>  
> +// Only support HK_FLAG_DOMAIN for now
> +// TODO: propagate the changes through all interested subsystems:
> +// workqueues, net, pci; ...
> +void housekeeping_cpumask_set(struct cpumask *mask, enum hk_flags flags)
> +{
> +	/* Only HK_FLAG_DOMAIN change supported for now */
> +	if (WARN_ON_ONCE(flags != HK_FLAG_DOMAIN))
> +		return;
>  
> +	if (!static_key_enabled(&housekeeping_overridden.key)) {
> +		if (cpumask_equal(mask, cpu_possible_mask))
> +			return;
> +		if (WARN_ON_ONCE(!alloc_cpumask_var(&hk_domain_mask, GFP_KERNEL)))
> +			return;
> +		cpumask_copy(hk_domain_mask, mask);
> +		static_branch_enable(&housekeeping_overridden);

I get a warning here: static_branch_enable() tries to take cpus_read_lock(),
but the same lock is already held by cpuset_write_u64().

Also, shouldn't it set HK_FLAG_DOMAIN in housekeeping_flags to enable
housekeeping if the kernel started without "isolcpus="?
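
An untested sketch of both points (my guesses only, not code from the
series; static_branch_enable_cpuslocked() and lockdep_assert_cpus_held()
do exist for exactly this held-lock situation):

	/* Inside housekeeping_cpumask_set(), assuming the cpuset write
	 * path keeps holding cpus_read_lock() around the call: */
	lockdep_assert_cpus_held();
	cpumask_copy(hk_domain_mask, mask);
	housekeeping_flags |= HK_FLAG_DOMAIN;	/* mark domain isolation active */
	static_branch_enable_cpuslocked(&housekeeping_overridden);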

-- 
Vincent

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-07-21 14:28 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-14 13:54 [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset Frederic Weisbecker
2021-07-14 13:54 ` [RFC PATCH 1/6] pci: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch Frederic Weisbecker
2021-07-14 13:54 ` [RFC PATCH 2/6] workqueue: " Frederic Weisbecker
2021-07-14 13:54 ` [RFC PATCH 3/6] net: " Frederic Weisbecker
2021-07-14 13:54 ` [RFC PATCH 4/6] sched/isolation: Split domain housekeeping mask from the rest Frederic Weisbecker
2021-07-14 13:54 ` [RFC PATCH 5/6] sched/isolation: Make HK_FLAG_DOMAIN mutable Frederic Weisbecker
2021-07-21 14:28   ` Vincent Donnefort
2021-07-14 13:54 ` [RFC PATCH 6/6] cpuset: Add cpuset.isolation_mask file Frederic Weisbecker
2021-07-14 16:31   ` Marcelo Tosatti
2021-07-19 13:26     ` Frederic Weisbecker
2021-07-19 15:41       ` Marcelo Tosatti
2021-07-14 16:52   ` Peter Zijlstra
2021-07-14 23:13     ` Frederic Weisbecker
2021-07-14 23:44       ` Valentin Schneider
2021-07-15  0:07         ` Frederic Weisbecker
2021-07-15  9:04       ` Peter Zijlstra
2021-07-19 13:17         ` Frederic Weisbecker
2021-07-16 18:02 ` [RFC PATCH 0/6] cpuset: Allow to modify isolcpus through cpuset Waiman Long
2021-07-19 13:57   ` Frederic Weisbecker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).