From: Phil Auld <pauld@redhat.com>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vishal Chourasia <vishalc@linux.vnet.ibm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	vincent.guittot@linaro.org, vschneid@redhat.com,
	srikar@linux.vnet.ibm.com, sshegde@linux.ibm.com,
	linuxppc-dev@lists.ozlabs.org, ritesh.list@gmail.com,
	aneesh.kumar@linux.ibm.com
Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems
Date: Mon, 12 Dec 2022 14:17:58 -0500	[thread overview]
Message-ID: <Y5d+ZqdxeJD2eIHL@lorien.usersys.redhat.com> (raw)
In-Reply-To: <Y2pKh3H0Ukvmfuco@kroah.com>

Hi,

On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> > 
> > Thanks Greg & Peter for your direction. 
> > 
> > While we pursue the idea of having debugfs based on kernfs, we thought about
> > having a boot time parameter which would disable creating and updating of the
> > sched_domain debugfs files and this would also be useful even when the kernfs
> > solution kicks in, as users who may not care about these debugfs files would
> > benefit from a faster CPU hotplug operation.
> 
> Ick, no, you would be adding a new user/kernel api that you will be
> required to support for the next 20+ years.  Just to get over a
> short-term issue before you solve the problem properly.

I'm not convinced moving these files from debugfs to kernfs is the right
fix.  That will take it from ~50 back to ~20 _minutes_ on these systems.
I don't think either of those numbers is reasonable.

The issue as I see it is the full rebuild for every change with no way to
batch the changes. How about something like the below?

This puts the domains/* files under the sched_verbose flag. About the only
thing under that flag now is the detailed topology discovery printks anyway,
so this fits together nicely.

This way the files would be off by default (assuming you don't boot with
sched_verbose) and can be created at runtime by enabling verbose. Multiple
changes can also be batched by disabling verbose, making the changes, and
re-enabling it, as sketched below.
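
As a sketch of the intended workflow (assuming debugfs is mounted at
/sys/kernel/debug and this patch is applied; the cpu range is just an
example):

  # echo N > /sys/kernel/debug/sched/verbose
  # for i in $(seq 100 199); do
  >     echo 0 > /sys/devices/system/cpu/cpu$i/online
  > done
  # echo Y > /sys/kernel/debug/sched/verbose

All 100 offline operations happen without touching the domains files,
which are then rebuilt once when verbose is re-enabled.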

It does not create a new API; it uses one that is already there.

> 
> If you really do not want these debugfs files, just disable debugfs from
> your system.  That should be a better short-term solution, right?

We do find these files useful at times for debugging issues and looking
at what's going on on the system.

> 
> Or better yet, disable SCHED_DEBUG, why can't you do that?

Same with this... it's useful information at (modulo issues like this)
small cost. There are also tuning knobs that are only available
with SCHED_DEBUG.
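
(For example, if I remember the current layout right, knobs like
/sys/kernel/debug/sched/latency_ns and migration_cost_ns only exist with
SCHED_DEBUG; the exact set varies by kernel version.)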


Cheers,
Phil

---------------

sched/debug: Put sched/domains files under verbose flag

The debug files under sched/domains can take a long time to regenerate,
especially when updates are done one at a time. Move these files under
the verbose debug flag. Allow changes to verbose to trigger generation
of the files. This lets a user batch the updates but still have the
information available.  The detailed topology printk messages are also
under verbose.

Signed-off-by: Phil Auld <pauld@redhat.com>
---
 kernel/sched/debug.c | 68 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 66 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..2eb51ee3ccab 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -280,6 +280,31 @@ static const struct file_operations sched_dynamic_fops = {
 
 __read_mostly bool sched_debug_verbose;
 
+static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
+				   size_t cnt, loff_t *ppos);
+
+static int sched_verbose_show(struct seq_file *m, void *v)
+{
+	if (sched_debug_verbose)
+		seq_puts(m, "Y\n");
+	else
+		seq_puts(m, "N\n");
+	return 0;
+}
+
+static int sched_verbose_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_verbose_show, NULL);
+}
+
+static const struct file_operations sched_verbose_fops = {
+	.open		= sched_verbose_open,
+	.write		= sched_verbose_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static const struct seq_operations sched_debug_sops;
 
 static int sched_debug_open(struct inode *inode, struct file *filp)
@@ -303,7 +328,7 @@ static __init int sched_init_debug(void)
 	debugfs_sched = debugfs_create_dir("sched", NULL);
 
 	debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
-	debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
+	debugfs_create_file("verbose", 0644, debugfs_sched, NULL, &sched_verbose_fops);
 #ifdef CONFIG_PREEMPT_DYNAMIC
 	debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
 #endif
@@ -402,15 +427,23 @@ void update_sched_domain_debugfs(void)
 	if (!debugfs_sched)
 		return;
 
+	if (!sched_debug_verbose)
+		return;
+
 	if (!cpumask_available(sd_sysctl_cpus)) {
 		if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
 			return;
 		cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
 	}
 
-	if (!sd_dentry)
+	if (!sd_dentry) {
 		sd_dentry = debugfs_create_dir("domains", debugfs_sched);
 
+		/* rebuild sd_sysctl_cpus if empty since it gets cleared below */
+		if (cpumask_first(sd_sysctl_cpus) >= nr_cpu_ids)
+			cpumask_copy(sd_sysctl_cpus, cpu_online_mask);
+	}
+
 	for_each_cpu(cpu, sd_sysctl_cpus) {
 		struct sched_domain *sd;
 		struct dentry *d_cpu;
@@ -443,6 +476,37 @@ void dirty_sched_domain_sysctl(int cpu)
 
 #endif /* CONFIG_SMP */
 
+static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
+				   size_t cnt, loff_t *ppos)
+{
+	struct dentry *dentry = filp->f_path.dentry;
+	bool orig = sched_debug_verbose;
+	bool bv;
+	int r;
+
+	r = kstrtobool_from_user(ubuf, cnt, &bv);
+	if (!r) {
+		r = debugfs_file_get(dentry);
+		if (unlikely(r))
+			return r;
+		mutex_lock(&sched_domains_mutex);
+		sched_debug_verbose = bv;
+		debugfs_file_put(dentry);
+
+#ifdef CONFIG_SMP
+		if (sched_debug_verbose && !orig)
+			update_sched_domain_debugfs();
+		else if (!sched_debug_verbose && orig) {
+			debugfs_remove(sd_dentry);
+			sd_dentry = NULL;
+		}
+#endif
+
+		mutex_unlock(&sched_domains_mutex);
+	}
+	return cnt;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group *tg)
 {
-- 
2.31.1



-- 


Thread overview: 38+ messages
2022-10-17 13:10 sched/debug: CPU hotplug operation suffers in a large cpu systems Vishal Chourasia
2022-10-17 14:19 ` Peter Zijlstra
2022-10-17 14:54   ` Greg Kroah-Hartman
2022-10-18 10:37     ` Vishal Chourasia
2022-10-18 11:04       ` Greg Kroah-Hartman
2022-10-26  6:37         ` Vishal Chourasia
2022-10-26  7:02           ` Greg Kroah-Hartman
2022-10-26  9:10             ` Peter Zijlstra
2022-11-08 10:00               ` Vishal Chourasia
2022-11-08 12:24                 ` Greg Kroah-Hartman
2022-11-08 14:51                   ` Srikar Dronamraju
2022-11-08 15:38                     ` Greg Kroah-Hartman
2022-12-12 19:17                   ` Phil Auld [this message]
2022-12-13  2:17                     ` kernel test robot
2022-12-13  6:23                     ` Greg Kroah-Hartman
2022-12-13 13:22                       ` Phil Auld
2022-12-13 14:31                         ` Greg Kroah-Hartman
2022-12-13 14:45                           ` Phil Auld
2023-01-19 15:31                           ` Phil Auld
2022-12-13 23:41                         ` Michael Ellerman
2022-12-14  2:26                           ` Phil Auld