linux-kernel.vger.kernel.org archive mirror
* [PATCH RESEND 0/4] /proc/schedstat and /proc/sched_debug fail at 4096
@ 2013-01-15 21:46 Nathan Zimmer
  2013-01-15 21:46 ` [PATCH RESEND 1/4] sched: /proc/sched_stat fails on very very large machines Nathan Zimmer
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Nathan Zimmer @ 2013-01-15 21:46 UTC (permalink / raw)
  To: mingo, peterz, tglx, johnstul; +Cc: linux-kernel, Nathan Zimmer

When running with 4096 cores, attempting to read /proc/schedstat,
/proc/sched_debug, or /proc/timer_list fails with an ENOMEM condition.
On a sufficiently large system the total amount of data is more than 4MB,
so it won't fit into a single buffer.  The failure can also occur on smaller
systems when memory fragmentation is high, as reported by Dave Jones.
Also thanks to Al Viro for pointing me in the right direction with the
iterator.
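
For reference, the open path that fails today allocates one buffer for the
entire file and hands it to single_open(); everything must fit in that one
kmalloc() allocation.  This is the code removed by patch 1
(kernel/sched/stats.c):

static int schedstat_open(struct inode *inode, struct file *file)
{
	unsigned int size = PAGE_SIZE * (1 + num_online_cpus() / 32);
	char *buf = kmalloc(size, GFP_KERNEL);	/* one buffer for all cpus */
	struct seq_file *m;
	int res;

	if (!buf)
		return -ENOMEM;
	res = single_open(file, show_schedstat, NULL);
	if (!res) {
		m = file->private_data;
		m->buf = buf;		/* seq_file reuses our oversized buffer */
		m->size = size;
	} else
		kfree(buf);
	return res;
}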

A solution considered but not tried for sched_stat and sched_debug would be
to create an alternative mechanism to single_open which, rather than calling
*_show once, calls show once per possible cpu.  If someone feels strongly
that this is the way to go I can give it a spin.
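
For illustration only, a rough sketch of what such an alternative could look
like.  The helper name single_open_percpu() and the callback
show_schedstat_cpu() are invented here; nothing like them exists in the tree
today:

/*
 * Hypothetical sketch, not part of this series: a single_open()-style
 * helper whose show callback is invoked once with cpu == -1 for the
 * header and then once per possible cpu, so no single buffer ever has
 * to hold the whole file.
 */
int single_open_percpu(struct file *file,
		       int (*show)(struct seq_file *m, int cpu),
		       void *data);

static int schedstat_open(struct inode *inode, struct file *file)
{
	return single_open_percpu(file, show_schedstat_cpu, NULL);
}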

Nathan Zimmer (4):
  sched: /proc/sched_stat fails on very very large machines.
  sched: /proc/sched_debug fails on very very large machines.
  timer_list: split timer_list_show_tickdevices
  timer_list: Convert timer list to be a proper seq_file.

 kernel/sched/debug.c     |   84 +++++++++++++++++++++++++++++++++-----
 kernel/sched/stats.c     |   73 ++++++++++++++++++++++++---------
 kernel/time/timer_list.c |  100 +++++++++++++++++++++++++++++++++++++---------
 3 files changed, 207 insertions(+), 50 deletions(-)



* [PATCH RESEND 1/4] sched: /proc/sched_stat fails on very very large machines.
  2013-01-15 21:46 [PATCH RESEND 0/4] /proc/schedstat and /proc/sched_debug fail at 4096 Nathan Zimmer
@ 2013-01-15 21:46 ` Nathan Zimmer
  2013-01-16 21:53   ` Andrew Morton
  2013-01-15 21:46 ` [PATCH RESEND 2/4] sched: /proc/sched_debug " Nathan Zimmer
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Nathan Zimmer @ 2013-01-15 21:46 UTC (permalink / raw)
  To: mingo, peterz, tglx, johnstul; +Cc: linux-kernel, Nathan Zimmer

On systems with 4096 cores, cat /proc/sched_stat fails.
We are trying to push all the data into a single kmalloc buffer.
The issue is that on these very large machines all the data will not fit in 4MB.

A better solution is to not use the single_open mechanism but to provide
our own seq_operations.

The output should be identical to the previous version and thus does not
need a new version number.
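
For readers of the diff below, the iterator encodes its position in the
void * cookie handed to show_schedstat() roughly as follows:

/*
 * Cookie values produced by schedstat_start()/schedstat_next():
 *   (void *)1           - emit the "version"/"timestamp" header
 *   (void *)(cpu + 2)   - emit the record for online cpu 'cpu'
 *   NULL                - end of output
 */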

CC: Ingo Molnar <mingo@redhat.com> 
CC: Peter Zijlstra <peterz@infradead.org> 
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
---
 kernel/sched/stats.c |   73 ++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 53 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 903ffa9..33a85c9 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -21,9 +21,13 @@ static int show_schedstat(struct seq_file *seq, void *v)
 	if (mask_str == NULL)
 		return -ENOMEM;
 
-	seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
-	seq_printf(seq, "timestamp %lu\n", jiffies);
-	for_each_online_cpu(cpu) {
+	if (v == (void *)1) {
+		seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
+		seq_printf(seq, "timestamp %lu\n", jiffies);
+	} else {
+
+		cpu = (unsigned long)(v - 2);
+
 		struct rq *rq = cpu_rq(cpu);
 #ifdef CONFIG_SMP
 		struct sched_domain *sd;
@@ -72,35 +76,64 @@ static int show_schedstat(struct seq_file *seq, void *v)
 		}
 		rcu_read_unlock();
 #endif
+		kfree(mask_str);
 	}
-	kfree(mask_str);
 	return 0;
 }
 
-static int schedstat_open(struct inode *inode, struct file *file)
+static void *schedstat_start(struct seq_file *file, loff_t *offset)
 {
-	unsigned int size = PAGE_SIZE * (1 + num_online_cpus() / 32);
-	char *buf = kmalloc(size, GFP_KERNEL);
-	struct seq_file *m;
-	int res;
+	unsigned long n = *offset;
 
-	if (!buf)
-		return -ENOMEM;
-	res = single_open(file, show_schedstat, NULL);
-	if (!res) {
-		m = file->private_data;
-		m->buf = buf;
-		m->size = size;
-	} else
-		kfree(buf);
-	return res;
+	if (n == 0)
+		return (void *) 1;
+
+	n--;
+
+	if (n > 0)
+		n = cpumask_next(n - 1, cpu_online_mask);
+	else
+		n = cpumask_first(cpu_online_mask);
+
+	*offset = n + 1;
+
+	if (n < nr_cpu_ids)
+		return (void *)(unsigned long)(n + 2);
+	return NULL;
+}
+
+static void *schedstat_next(struct seq_file *file, void *data, loff_t *offset)
+{
+	(*offset)++;
+	return schedstat_start(file, offset);
+}
+
+static void schedstat_stop(struct seq_file *file, void *data)
+{
+}
+
+static const struct seq_operations schedstat_sops = {
+	.start = schedstat_start,
+	.next  = schedstat_next,
+	.stop  = schedstat_stop,
+	.show  = show_schedstat,
+};
+
+static int schedstat_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &schedstat_sops);
 }
 
+static int schedstat_release(struct inode *inode, struct file *file)
+{
+	return 0;
+};
+
 static const struct file_operations proc_schedstat_operations = {
 	.open    = schedstat_open,
 	.read    = seq_read,
 	.llseek  = seq_lseek,
-	.release = single_release,
+	.release = schedstat_release,
 };
 
 static int __init proc_schedstat_init(void)
-- 
1.6.0.2



* [PATCH RESEND 2/4] sched: /proc/sched_debug fails on very very large machines.
  2013-01-15 21:46 [PATCH RESEND 0/4] /proc/schedstat and /proc/sched_debug fail at 4096 Nathan Zimmer
  2013-01-15 21:46 ` [PATCH RESEND 1/4] sched: /proc/sched_stat fails on very very large machines Nathan Zimmer
@ 2013-01-15 21:46 ` Nathan Zimmer
  2013-01-16 21:56   ` Andrew Morton
  2013-01-15 21:46 ` [PATCH RESEND 3/4] timer_list: split timer_list_show_tickdevices Nathan Zimmer
  2013-01-15 21:46 ` [PATCH RESEND 4/4] timer_list: Convert timer list to be a proper seq_file Nathan Zimmer
  3 siblings, 1 reply; 9+ messages in thread
From: Nathan Zimmer @ 2013-01-15 21:46 UTC (permalink / raw)
  To: mingo, peterz, tglx, johnstul; +Cc: linux-kernel, Nathan Zimmer

On systems with 4096 cores, attempting to read /proc/sched_debug fails.
We are trying to push all the data into a single kmalloc buffer.
The issue is that on these very large machines all the data will not fit in 4MB.

A better solution is to not use the single_open mechanism but to provide
our own seq_operations and treat each cpu as an individual record.

The output should be identical to the previous version.
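
For readers of the diff below, the cookie encoding mirrors the one used for
/proc/schedstat in patch 1:

/*
 * Cookie values produced by sched_debug_start()/sched_debug_next():
 *   (void *)1           - sched_debug_show() computes cpu == -1 and prints the header
 *   (void *)(cpu + 2)   - per-cpu output via print_cpu()
 *   NULL                - end of output
 */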

CC: Ingo Molnar <mingo@redhat.com> 
CC: Peter Zijlstra <peterz@infradead.org>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
---
 kernel/sched/debug.c |   84 +++++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 73 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 6f79596..1ffdd42 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -253,11 +253,11 @@ static void print_cpu(struct seq_file *m, int cpu)
 	{
 		unsigned int freq = cpu_khz ? : 1;
 
-		SEQ_printf(m, "\ncpu#%d, %u.%03u MHz\n",
+		SEQ_printf(m, "cpu#%d, %u.%03u MHz\n",
 			   cpu, freq / 1000, (freq % 1000));
 	}
 #else
-	SEQ_printf(m, "\ncpu#%d\n", cpu);
+	SEQ_printf(m, "cpu#%d\n", cpu);
 #endif
 
 #define P(x)								\
@@ -314,6 +314,7 @@ do {									\
 	print_rq(m, rq, cpu);
 	rcu_read_unlock();
 	spin_unlock_irqrestore(&sched_debug_lock, flags);
+	SEQ_printf(m, "\n");
 }
 
 static const char *sched_tunable_scaling_names[] = {
@@ -322,11 +323,10 @@ static const char *sched_tunable_scaling_names[] = {
 	"linear"
 };
 
-static int sched_debug_show(struct seq_file *m, void *v)
+static void sched_debug_header(struct seq_file *m)
 {
 	u64 ktime, sched_clk, cpu_clk;
 	unsigned long flags;
-	int cpu;
 
 	local_irq_save(flags);
 	ktime = ktime_to_ns(ktime_get());
@@ -368,33 +368,95 @@ static int sched_debug_show(struct seq_file *m, void *v)
 #undef PN
 #undef P
 
-	SEQ_printf(m, "  .%-40s: %d (%s)\n", "sysctl_sched_tunable_scaling",
+	SEQ_printf(m, "  .%-40s: %d (%s)\n",
+		"sysctl_sched_tunable_scaling",
 		sysctl_sched_tunable_scaling,
 		sched_tunable_scaling_names[sysctl_sched_tunable_scaling]);
+	SEQ_printf(m, "\n");
+}
 
-	for_each_online_cpu(cpu)
-		print_cpu(m, cpu);
+static int sched_debug_show(struct seq_file *m, void *v)
+{
+	int cpu = (unsigned long)(v - 2);
 
-	SEQ_printf(m, "\n");
+	if (cpu != -1)
+		print_cpu(m, cpu);
+	else
+		sched_debug_header(m);
 
 	return 0;
 }
 
 void sysrq_sched_debug_show(void)
 {
-	sched_debug_show(NULL, NULL);
+	int cpu;
+
+	sched_debug_header(NULL);
+	for_each_online_cpu(cpu)
+		print_cpu(NULL, cpu);
+
+}
+
+static void *sched_debug_start(struct seq_file *file, loff_t *offset)
+{
+	unsigned long n = *offset;
+
+	if (n == 0)
+		return (void *) 1;
+
+	n--;
+
+	if (n > 0)
+		n = cpumask_next(n - 1, cpu_online_mask);
+	else
+		n = cpumask_first(cpu_online_mask);
+
+	*offset = n + 1;
+
+	if (n < nr_cpu_ids)
+		return (void *)(unsigned long)(n + 2);
+	return NULL;
+}
+
+static void *sched_debug_next(struct seq_file *file, void *data, loff_t *offset)
+{
+	(*offset)++;
+	return sched_debug_start(file, offset);
+}
+
+static void sched_debug_stop(struct seq_file *file, void *data)
+{
+}
+
+
+static const struct seq_operations sched_debug_sops = {
+	.start = sched_debug_start,
+	.next = sched_debug_next,
+	.stop = sched_debug_stop,
+	.show = sched_debug_show,
+};
+
+static int sched_debug_release(struct inode *inode, struct file *file)
+{
+	seq_release(inode, file);
+
+	return 0;
 }
 
 static int sched_debug_open(struct inode *inode, struct file *filp)
 {
-	return single_open(filp, sched_debug_show, NULL);
+	int ret = 0;
+
+	ret = seq_open(filp, &sched_debug_sops);
+
+	return ret;
 }
 
 static const struct file_operations sched_debug_fops = {
 	.open		= sched_debug_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= single_release,
+	.release	= sched_debug_release,
 };
 
 static int __init init_sched_debug_procfs(void)
-- 
1.6.0.2



* [PATCH RESEND 3/4] timer_list: split timer_list_show_tickdevices
  2013-01-15 21:46 [PATCH RESEND 0/4] /proc/schedstat and /proc/sched_debug fail at 4096 Nathan Zimmer
  2013-01-15 21:46 ` [PATCH RESEND 1/4] sched: /proc/sched_stat fails on very very large machines Nathan Zimmer
  2013-01-15 21:46 ` [PATCH RESEND 2/4] sched: /proc/sched_debug " Nathan Zimmer
@ 2013-01-15 21:46 ` Nathan Zimmer
  2013-01-16 22:09   ` Andrew Morton
  2013-01-15 21:46 ` [PATCH RESEND 4/4] timer_list: Convert timer list to be a proper seq_file Nathan Zimmer
  3 siblings, 1 reply; 9+ messages in thread
From: Nathan Zimmer @ 2013-01-15 21:46 UTC (permalink / raw)
  To: mingo, peterz, tglx, johnstul; +Cc: linux-kernel, Nathan Zimmer

Split the header out of timer_list_show_tickdevices and pull the per-cpu
loop up into timer_list_show.  Also tweak where the blank-line separators
are printed.  This is all prep for the fix in the next patch.

CC: John Stultz <johnstul@us.ibm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
---
 kernel/time/timer_list.c |   20 +++++++++-----------
 1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index af5a7e9..b3dc3d6 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -133,7 +133,6 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
 	struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
 	int i;
 
-	SEQ_printf(m, "\n");
 	SEQ_printf(m, "cpu: %d\n", cpu);
 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
 		SEQ_printf(m, " clock %d:\n", i);
@@ -187,6 +186,7 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
 
 #undef P
 #undef P_ns
+	SEQ_printf(m, "\n");
 }
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
@@ -195,7 +195,6 @@ print_tickdevice(struct seq_file *m, struct tick_device *td, int cpu)
 {
 	struct clock_event_device *dev = td->evtdev;
 
-	SEQ_printf(m, "\n");
 	SEQ_printf(m, "Tick Device: mode:     %d\n", td->mode);
 	if (cpu < 0)
 		SEQ_printf(m, "Broadcast device\n");
@@ -230,12 +229,11 @@ print_tickdevice(struct seq_file *m, struct tick_device *td, int cpu)
 	print_name_offset(m, dev->event_handler);
 	SEQ_printf(m, "\n");
 	SEQ_printf(m, " retries:        %lu\n", dev->retries);
+	SEQ_printf(m, "\n");
 }
 
-static void timer_list_show_tickdevices(struct seq_file *m)
+static void timer_list_show_tickdevices_header(struct seq_file *m)
 {
-	int cpu;
-
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 	print_tickdevice(m, tick_get_broadcast_device(), -1);
 	SEQ_printf(m, "tick_broadcast_mask: %08lx\n",
@@ -246,12 +244,8 @@ static void timer_list_show_tickdevices(struct seq_file *m)
 #endif
 	SEQ_printf(m, "\n");
 #endif
-	for_each_online_cpu(cpu)
-		print_tickdevice(m, tick_get_device(cpu), cpu);
 	SEQ_printf(m, "\n");
 }
-#else
-static void timer_list_show_tickdevices(struct seq_file *m) { }
 #endif
 
 static int timer_list_show(struct seq_file *m, void *v)
@@ -262,12 +256,16 @@ static int timer_list_show(struct seq_file *m, void *v)
 	SEQ_printf(m, "Timer List Version: v0.7\n");
 	SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES);
 	SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now);
+	SEQ_printf(m, "\n");
 
 	for_each_online_cpu(cpu)
 		print_cpu(m, cpu, now);
 
-	SEQ_printf(m, "\n");
-	timer_list_show_tickdevices(m);
+#ifdef CONFIG_GENERIC_CLOCKEVENTS
+	timer_list_show_tickdevices_header(m);
+	for_each_online_cpu(cpu)
+		print_tickdevice(m, tick_get_device(cpu), cpu);
+#endif
 
 	return 0;
 }
-- 
1.6.0.2



* [PATCH RESEND 4/4] timer_list: Convert timer list to be a proper seq_file.
  2013-01-15 21:46 [PATCH RESEND 0/4] /proc/schedstat and /proc/sched_debug fail at 4096 Nathan Zimmer
                   ` (2 preceding siblings ...)
  2013-01-15 21:46 ` [PATCH RESEND 3/4] timer_list: split timer_list_show_tickdevices Nathan Zimmer
@ 2013-01-15 21:46 ` Nathan Zimmer
  3 siblings, 0 replies; 9+ messages in thread
From: Nathan Zimmer @ 2013-01-15 21:46 UTC (permalink / raw)
  To: mingo, peterz, tglx, johnstul; +Cc: linux-kernel, Nathan Zimmer

Convert /proc/timer_list to a proper seq_file with its own iterator.  This
is a little more complex given that we have to make two passes with two
separate headers.
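
For readers of the diff below, the file is laid out in *offset terms as two
per-cpu passes with a header before each (reconstructed from
timer_list_start(), CONFIG_GENERIC_CLOCKEVENTS case):

/*
 *   offset 0                                  hrtimer header          (cookie 1)
 *   offset 1 .. nr_cpu_ids                    per-cpu hrtimer bases   (cookie cpu + 2)
 *   offset nr_cpu_ids + 1                     tick device header      (cookie nr_cpu_ids + 2)
 *   offset nr_cpu_ids + 2 .. 2*nr_cpu_ids+1   per-cpu tick devices    (cookie cpu + nr_cpu_ids + 3)
 */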

CC: John Stultz <johnstul@us.ibm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
---
 kernel/time/timer_list.c |   88 +++++++++++++++++++++++++++++++++++++++------
 1 files changed, 76 insertions(+), 12 deletions(-)

diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index b3dc3d6..6a2cfc2 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -253,38 +253,102 @@ static int timer_list_show(struct seq_file *m, void *v)
 	u64 now = ktime_to_ns(ktime_get());
 	int cpu;
 
-	SEQ_printf(m, "Timer List Version: v0.7\n");
-	SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES);
-	SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now);
-	SEQ_printf(m, "\n");
-
-	for_each_online_cpu(cpu)
+	if (v == (void *)1) {
+		SEQ_printf(m, "Timer List Version: v0.7\n");
+		SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n",
+			   HRTIMER_MAX_CLOCK_BASES);
+		SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now);
+		SEQ_printf(m, "\n");
+	} else if (v < (void *)(nr_cpu_ids + 2)) {
+		cpu = (unsigned long)(v - 2);
 		print_cpu(m, cpu, now);
-
+	}
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
-	timer_list_show_tickdevices_header(m);
-	for_each_online_cpu(cpu)
+	else if (v == (void *)nr_cpu_ids + 2) {
+		timer_list_show_tickdevices_header(m);
+	} else {
+		cpu = (unsigned long)(v - 3 - nr_cpu_ids);
 		print_tickdevice(m, tick_get_device(cpu), cpu);
+	}
 #endif
-
 	return 0;
 }
 
+
+static void *timer_list_start(struct seq_file *file, loff_t *offset)
+{
+	unsigned long n = *offset;
+
+	if (n == 0)
+		return (void *) 1;
+
+	if (n < nr_cpu_ids + 1) {
+		n--;
+		if (n > 0)
+			n = cpumask_next(n - 1, cpu_online_mask);
+		else
+			n = cpumask_first(cpu_online_mask);
+		*offset = n + 1;
+		return (void *)(unsigned long)(n + 2);
+	}
+
+#ifdef CONFIG_GENERIC_CLOCKEVENTS
+	if (n == nr_cpu_ids + 1)
+		return (void *) (nr_cpu_ids + 2);
+
+	if (n < nr_cpu_ids * 2 + 2) {
+		n -= (nr_cpu_ids + 2);
+		if (n > 0)
+			n = cpumask_next(n - 1, cpu_online_mask);
+		else
+			n = cpumask_first(cpu_online_mask);
+		*offset = n + 2 + nr_cpu_ids;
+		return (void *)(unsigned long)(n + 3 + nr_cpu_ids);
+	}
+#endif
+
+	return NULL;
+}
+
+static void *timer_list_next(struct seq_file *file, void *data, loff_t *offset)
+{
+	(*offset)++;
+	return timer_list_start(file, offset);
+}
+
+static void timer_list_stop(struct seq_file *file, void *data)
+{
+}
+
+static const struct seq_operations timer_list_sops = {
+	.start = timer_list_start,
+	.next = timer_list_next,
+	.stop = timer_list_stop,
+	.show = timer_list_show,
+};
+
 void sysrq_timer_list_show(void)
 {
 	timer_list_show(NULL, NULL);
 }
 
+static int timer_list_release(struct inode *inode, struct file *filep)
+{
+	seq_release(inode, filep);
+
+	return 0;
+}
+
 static int timer_list_open(struct inode *inode, struct file *filp)
 {
-	return single_open(filp, timer_list_show, NULL);
+	return seq_open(filp, &timer_list_sops);
 }
 
 static const struct file_operations timer_list_fops = {
 	.open		= timer_list_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= single_release,
+	.release	= timer_list_release,
 };
 
 static int __init init_timer_list_procfs(void)
-- 
1.6.0.2



* Re: [PATCH RESEND 1/4] sched: /proc/sched_stat fails on very very large machines.
  2013-01-15 21:46 ` [PATCH RESEND 1/4] sched: /proc/sched_stat fails on very very large machines Nathan Zimmer
@ 2013-01-16 21:53   ` Andrew Morton
  2013-01-17 22:36     ` Nathan Zimmer
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2013-01-16 21:53 UTC (permalink / raw)
  To: Nathan Zimmer; +Cc: mingo, peterz, tglx, johnstul, linux-kernel

On Tue, 15 Jan 2013 15:46:09 -0600
Nathan Zimmer <nzimmer@sgi.com> wrote:

> On systems with 4096 cores, cat /proc/sched_stat fails.
> We are trying to push all the data into a single kmalloc buffer.
> The issue is that on these very large machines all the data will not fit in 4MB.
> 
> A better solution is to not use the single_open mechanism but to provide
> our own seq_operations.
> 
> The output should be identical to the previous version and thus does not
> need a new version number.
> 
> ...
>
> index 903ffa9..33a85c9 100644
> --- a/kernel/sched/stats.c
> +++ b/kernel/sched/stats.c
> @@ -21,9 +21,13 @@ static int show_schedstat(struct seq_file *seq, void *v)
>  	if (mask_str == NULL)
>  		return -ENOMEM;
>  
> -	seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
> -	seq_printf(seq, "timestamp %lu\n", jiffies);
> -	for_each_online_cpu(cpu) {
> +	if (v == (void *)1) {

The magic-numbers-in-pointers at least need comments, please.  Or nice
and meaningful #defines.
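
(For illustration, the sort of thing that could replace the bare constants;
these names are invented here and are not something in the posted patch:)

#define SCHEDSTAT_HEADER	((void *)1)
#define SCHEDSTAT_CPU(cpu)	((void *)((unsigned long)(cpu) + 2))
#define SCHEDSTAT_TO_CPU(v)	((int)((unsigned long)(v) - 2))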

> +		seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
> +		seq_printf(seq, "timestamp %lu\n", jiffies);

The code leaks the memory at mask_str here.

> +	} else {
> +
> +		cpu = (unsigned long)(v - 2);
> +
>  		struct rq *rq = cpu_rq(cpu);
>  #ifdef CONFIG_SMP
>  		struct sched_domain *sd;
> @@ -72,35 +76,64 @@ static int show_schedstat(struct seq_file *seq, void *v)
>  		}
>  		rcu_read_unlock();
>  #endif
> +		kfree(mask_str);
>  	}
> -	kfree(mask_str);
>  	return 0;
>  }

Undoing this change will fix the leak.

The schedstats code (both the original and after the patch) appears to
be racy against cpu hotplug?  What prevents the rq from vanishing while
we're playing with it?
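
(For illustration, one way the walk over cpu_online_mask could be pinned
against hotplug.  This is only a sketch, not something posted in the series;
schedstat_token() is a hypothetical helper carrying the same offset-to-cookie
mapping the patch has in schedstat_start(), and get_online_cpus() /
put_online_cpus() are the existing hotplug read-side helpers:)

static void *schedstat_token(loff_t *offset)
{
	unsigned long n = *offset;

	if (n == 0)
		return (void *) 1;		/* header record */

	n--;
	n = n ? cpumask_next(n - 1, cpu_online_mask)
	      : cpumask_first(cpu_online_mask);

	*offset = n + 1;			/* remember where we got to, skipping offline cpus */

	return n < nr_cpu_ids ? (void *)(n + 2) : NULL;
}

static void *schedstat_start(struct seq_file *file, loff_t *offset)
{
	get_online_cpus();			/* balanced by put_online_cpus() in ->stop */
	return schedstat_token(offset);
}

static void *schedstat_next(struct seq_file *file, void *data, loff_t *offset)
{
	(*offset)++;
	return schedstat_token(offset);		/* note: no second get_online_cpus() here */
}

static void schedstat_stop(struct seq_file *file, void *data)
{
	put_online_cpus();			/* seq_file always calls ->stop after ->start */
}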




* Re: [PATCH RESEND 2/4] sched: /proc/sched_debug fails on very very large machines.
  2013-01-15 21:46 ` [PATCH RESEND 2/4] sched: /proc/sched_debug " Nathan Zimmer
@ 2013-01-16 21:56   ` Andrew Morton
  0 siblings, 0 replies; 9+ messages in thread
From: Andrew Morton @ 2013-01-16 21:56 UTC (permalink / raw)
  To: Nathan Zimmer; +Cc: mingo, peterz, tglx, johnstul, linux-kernel

On Tue, 15 Jan 2013 15:46:10 -0600
Nathan Zimmer <nzimmer@sgi.com> wrote:

> On systems with 4096 cores, attempting to read /proc/sched_debug fails.
> We are trying to push all the data into a single kmalloc buffer.
> The issue is that on these very large machines all the data will not fit in 4MB.
> 
> A better solution is to not use the single_open mechanism but to provide
> our own seq_operations and treat each cpu as an individual record.
> 
> The output should be identical to the previous version.
> 
> ...
>
> +	int cpu = (unsigned long)(v - 2);

Again, the meaning of the magic offsets are unobvious to readers of
this code.

> -	SEQ_printf(m, "\n");
> +	if (cpu != -1)
> +		print_cpu(m, cpu);

Same concerns with cpu hotplug.

> +	else
> +		sched_debug_header(m);
>  
>  	return 0;
>  }



* Re: [PATCH RESEND 3/4] timer_list: split timer_list_show_tickdevices
  2013-01-15 21:46 ` [PATCH RESEND 3/4] timer_list: split timer_list_show_tickdevices Nathan Zimmer
@ 2013-01-16 22:09   ` Andrew Morton
  0 siblings, 0 replies; 9+ messages in thread
From: Andrew Morton @ 2013-01-16 22:09 UTC (permalink / raw)
  To: Nathan Zimmer; +Cc: mingo, peterz, tglx, johnstul, linux-kernel

On Tue, 15 Jan 2013 15:46:11 -0600
Nathan Zimmer <nzimmer@sgi.com> wrote:

> Split the header out of timer_list_show_tickdevices and pull the per-cpu
> loop up into timer_list_show.  Also tweak where the blank-line separators
> are printed.  This is all prep for the fix in the next patch.

I'm thinking you didn't put a lot of hours into that description.
Wanna try again?


* Re: [PATCH RESEND 1/4] sched: /proc/sched_stat fails on very very large machines.
  2013-01-16 21:53   ` Andrew Morton
@ 2013-01-17 22:36     ` Nathan Zimmer
  0 siblings, 0 replies; 9+ messages in thread
From: Nathan Zimmer @ 2013-01-17 22:36 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mingo, peterz, tglx, johnstul, linux-kernel

On 01/16/2013 03:53 PM, Andrew Morton wrote:
> On Tue, 15 Jan 2013 15:46:09 -0600
> Nathan Zimmer <nzimmer@sgi.com> wrote:
>
>> On systems with 4096 cores, cat /proc/sched_stat fails.
>> We are trying to push all the data into a single kmalloc buffer.
>> The issue is that on these very large machines all the data will not fit in 4MB.
>>
>> A better solution is to not use the single_open mechanism but to provide
>> our own seq_operations.
>>
>> The output should be identical to the previous version and thus does not
>> need a new version number.
>>
>> ...
>>
>> index 903ffa9..33a85c9 100644
>> --- a/kernel/sched/stats.c
>> +++ b/kernel/sched/stats.c
>> @@ -21,9 +21,13 @@ static int show_schedstat(struct seq_file *seq, void *v)
>>   	if (mask_str == NULL)
>>   		return -ENOMEM;
>>   
>> -	seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
>> -	seq_printf(seq, "timestamp %lu\n", jiffies);
>> -	for_each_online_cpu(cpu) {
>> +	if (v == (void *)1) {
> The magic-numbers-in-pointers at least need comments, please.  Or nice
> and meaningful #defines.
>
>> +		seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
>> +		seq_printf(seq, "timestamp %lu\n", jiffies);
> The code leaks the memory at mask_str here.
>
>> +	} else {
>> +
>> +		cpu = (unsigned long)(v - 2);
>> +
>>   		struct rq *rq = cpu_rq(cpu);
>>   #ifdef CONFIG_SMP
>>   		struct sched_domain *sd;
>> @@ -72,35 +76,64 @@ static int show_schedstat(struct seq_file *seq, void *v)
>>   		}
>>   		rcu_read_unlock();
>>   #endif
>> +		kfree(mask_str);
>>   	}
>> -	kfree(mask_str);
>>   	return 0;
>>   }
> Undoing this change will fix the leak.
>
> The schedstats code (both the original and after the patch) appears to
> be racy against cpu hotplug?  What prevents the rq from vanishing while
> we're playing with it?
>
>
Looking at other usages, people seem to be quite willing to just read a
variable here and there without locking.  The structure is a percpu
structure, so I don't believe rq will vanish; perhaps the backing data
might become meaningless though...
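
(For reference, the runqueue is indeed static per-cpu data; roughly, from
kernel/sched/core.c and kernel/sched/sched.h:)

DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

#define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))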



