sched/core: expand sched_getaffinity(2) to return number of CPUs

Message ID 20190403200809.GA13876@avx2
State New, archived
Series
  • sched/core: expand sched_getaffinity(2) to return number of CPUs

Commit Message

Alexey Dobriyan April 3, 2019, 8:08 p.m. UTC
Currently there is no easy way to get the number of CPUs on the system.
Applications fall into two groups:
One group allocates a buffer and calls sched_getaffinity(2) once. This
works, but it either under-allocates or over-allocates, and in the future
such applications will become buggy as Linux starts running on ever
larger SMP systems.

Glibc in particular shipped with support for at most 1024 CPUs at some
point, which is quite surprising, as glibc maintainers should know better.

The other group dynamically grows the buffer until the cpumask fits.
This is inefficient, as multiple system calls are made.

Nobody seems to parse "/sys/devices/system/cpu/possible".
Even if someone does, parsing sysfs is much slower than necessary.

This patch overloads sched_getaffinity(len=0) to simply return "nr_cpu_ids".
This will make getting the CPU mask require at most 2 system calls
and will eliminate unnecessary code.

len=0 is chosen because:
* passing zeroes is the simplest thing:

	syscall(__NR_sched_getaffinity, 0, 0, NULL)

  will simply do the right thing;

* old kernels returned -EINVAL unconditionally.

Note: glibc segfaults upon return from the system call because it tries
to clear the rest of the buffer if the return value is positive, so
applications will have to use syscall(3).
The good news is that this proves no one uses sched_getaffinity(pid, 0, NULL).

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 kernel/sched/core.c |    3 +++
 1 file changed, 3 insertions(+)

Comments

Peter Zijlstra April 4, 2019, 8:42 a.m. UTC | #1
On Wed, Apr 03, 2019 at 11:08:09PM +0300, Alexey Dobriyan wrote:
> Currently there is no easy way to get the number of CPUs on the system.

And this patch doesn't change that :-) Still, it does the right thing
and I like it.

The point is that nr_cpu_ids is the length of the bitmap, but does not
contain information on how many CPUs are in the system. Consider the
case where the bitmap is sparse.

> Applications are divided into 2 groups:
> One group allocates buffer and call sched_getaffinity(2) once. It works
> but either underallocate or overallocates and in the future such application
> will become buggy as Linux will start working on even more SMP-ier systems.
> 
> Glibc in particular shipped with 1024 CPUs support maximum at some point
> which is quite surprising as glibc maitainers should know better.
> 
> Another group dynamically grow buffer until cpumask fits. This is
> inefficient as multiple system calls are done.
> 
> Nobody seems to parse "/sys/devices/system/cpu/possible".
> Even if someone does, parsing sysfs is much slower than necessary.

True; but I suppose glibc already does lots of that anyway, right? It
does contain the right information.

> Patch overloads sched_getaffinity(len=0) to simply return "nr_cpu_ids".
> This will make gettting CPU mask require at most 2 system calls
> and will eliminate unnecessary code.
> 
> len=0 is chosen so that
> * passing zeroes is the simplest thing
> 
> 	syscall(__NR_sched_getaffinity, 0, 0, NULL)
> 
>   will simply do the right thing,
> 
> * old kernels returned -EINVAL unconditionally.
> 
> Note: glibc segfaults upon exiting from system call because it tries to
> clear the rest of the buffer if return value is positive, so
> applications will have to use syscall(3).
> Good news is that it proves noone uses sched_getaffinity(pid, 0, NULL).

This also needs a manpage update. And I'm missing the libc people on Cc.

> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
> ---
> 
>  kernel/sched/core.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4942,6 +4942,9 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
>  	int ret;
>  	cpumask_var_t mask;
>  
> +	if (len == 0)
> +		return nr_cpu_ids;
> +
>  	if ((len * BITS_PER_BYTE) < nr_cpu_ids)
>  		return -EINVAL;
>  	if (len & (sizeof(unsigned long)-1))
Alexey Dobriyan April 4, 2019, 6:02 p.m. UTC | #2
On Thu, Apr 04, 2019 at 10:42:49AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:08:09PM +0300, Alexey Dobriyan wrote:
> > Currently there is no easy way to get the number of CPUs on the system.
> 
> And this patch doesn't change that :-)

It does! An application or a library could do one idempotent system call
in a constructor.

> Still, it does the right thing and I like it.

Thanks.

> The point is that nr_cpu_ids is the length of the bitmap, but does not
> contain information on how many CPUs are in the system. Consider the
> case where the bitmap is sparse.

I understand that but how do you ship number of CPUs _and_ possible mask
in one go?

> > Applications are divided into 2 groups:
> > One group allocates buffer and call sched_getaffinity(2) once. It works
> > but either underallocate or overallocates and in the future such application
> > will become buggy as Linux will start working on even more SMP-ier systems.
> > 
> > Glibc in particular shipped with 1024 CPUs support maximum at some point
> > which is quite surprising as glibc maitainers should know better.
> > 
> > Another group dynamically grow buffer until cpumask fits. This is
> > inefficient as multiple system calls are done.
> > 
> > Nobody seems to parse "/sys/devices/system/cpu/possible".
> > Even if someone does, parsing sysfs is much slower than necessary.
> 
> True; but I suppose glibc already does lots of that anyway, right? It
> does contain the right information.

sched_getaffinity(3) does sched_getaffinity(2) + memset()

sysconf(_SC_NPROCESSORS_ONLN) does "/sys/devices/system/cpu/online"

sysconf(_SC_NPROCESSORS_CONF) does readdir("/sys/devices/system/cpu")
which is 5 syscalls. I'm not sure which cpumask readdir represents.

> > Patch overloads sched_getaffinity(len=0) to simply return "nr_cpu_ids".
> > This will make gettting CPU mask require at most 2 system calls
> > and will eliminate unnecessary code.
> > 
> > len=0 is chosen so that
> > * passing zeroes is the simplest thing
> > 
> > 	syscall(__NR_sched_getaffinity, 0, 0, NULL)
> > 
> >   will simply do the right thing,
> > 
> > * old kernels returned -EINVAL unconditionally.
> > 
> > Note: glibc segfaults upon exiting from system call because it tries to
> > clear the rest of the buffer if return value is positive, so
> > applications will have to use syscall(3).
> > Good news is that it proves noone uses sched_getaffinity(pid, 0, NULL).
> 
> This also needs a manpage update. And I'm missing the libc people on Cc.

[nods]
Shipping "man/2/" with kernel is long overdue. :^)

> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4942,6 +4942,9 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
> >  	int ret;
> >  	cpumask_var_t mask;
> >  
> > +	if (len == 0)
> > +		return nr_cpu_ids;
> > +
> >  	if ((len * BITS_PER_BYTE) < nr_cpu_ids)
> >  		return -EINVAL;
> >  	if (len & (sizeof(unsigned long)-1))
Peter Zijlstra April 5, 2019, 9:26 a.m. UTC | #3
On Thu, Apr 04, 2019 at 09:02:30PM +0300, Alexey Dobriyan wrote:
> On Thu, Apr 04, 2019 at 10:42:49AM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 03, 2019 at 11:08:09PM +0300, Alexey Dobriyan wrote:
> > > Currently there is no easy way to get the number of CPUs on the system.
> > 
> > And this patch doesn't change that :-)
> 
> It does! Application or a library could do one idempotent system call
> in a constructor.

You said: "get the number of CPUs on the system", but what this call
actually returns is: "size of CPU bitmask", these two things are
distinctly not the same.

> > The point is that nr_cpu_ids is the length of the bitmap, but does not
> > contain information on how many CPUs are in the system. Consider the
> > case where the bitmap is sparse.
> 
> I understand that but how do you ship number of CPUs _and_ possible mask
> in one go?

You don't, possible mask is unrelated to number of CPUs and both are
unrelated to bitmap size.

In particular:

  nr_cpus <= nr_possible_cpus <= nr_cpumask_bits

In absurdum an architecture could choose to iterate its CPUs as
1024*cpuid, just for giggles. Then suppose it has two sockets, and 16
CPUs per socket and only the second socket is populated.

Then we get:

  nr_cpumask_bits = 32k
  nr_possible_cpus = 32
  nr_cpus = 16

see what I mean?

Now, luckily we typically don't have crap like that, but I suspect that
with a little creative ACPI table magic we could actually get something
slightly less absurd but still non-trivial even on x86.

Also see how num_possible_cpus() is defined as
cpumask_weight(cpu_possible_mask), which is very much not nr_cpu_ids in
the generic case.

> > > Nobody seems to parse "/sys/devices/system/cpu/possible".
> > > Even if someone does, parsing sysfs is much slower than necessary.
> > 
> > True; but I suppose glibc already does lots of that anyway, right? It
> > does contain the right information.
> 
> sched_getaffinity(3) does sched_getaffinity(2) + memset()
> 
> sysconf(_SC_NPROCESSORS_ONLN) does "/sys/devices/system/cpu/online"
> 
> sysconf(_SC_NPROCESSORS_CONF) does readdir("/sys/devices/system/cpu")
> which is 5 syscalls. I'm not sure which cpumask readdir represents.

It counts the number of CPUs, which is strictly not the same as the
bitmap size.
Florian Weimer April 5, 2019, 10:16 a.m. UTC | #4
* Peter Zijlstra:

> On Wed, Apr 03, 2019 at 11:08:09PM +0300, Alexey Dobriyan wrote:
>> Currently there is no easy way to get the number of CPUs on the system.

The size of the affinity mask is only related to the number of CPUs in
the system in such a way that the number of CPUs cannot be larger than
the number of bits in the affinity mask.

>> Glibc in particular shipped with 1024 CPUs support maximum at some point
>> which is quite surprising as glibc maitainers should know better.

This dates back to a time when the kernel was never going to support
more than 1024 CPUs.

A lot of distribution kernels still enforce a hard limit, which papers
over firmware bugs which tell the kernel that the system can be
hot-plugged to a ridiculous number of sockets/CPUs.

>> Another group dynamically grow buffer until cpumask fits. This is
>> inefficient as multiple system calls are done.
>> 
>> Nobody seems to parse "/sys/devices/system/cpu/possible".
>> Even if someone does, parsing sysfs is much slower than necessary.
>
> True; but I suppose glibc already does lots of that anyway, right? It
> does contain the right information.

If I recall correctly my last investigation,
/sys/devices/system/cpu/possible does not reflect the size of the
affinity mask, either.

>> Patch overloads sched_getaffinity(len=0) to simply return "nr_cpu_ids".
>> This will make gettting CPU mask require at most 2 system calls
>> and will eliminate unnecessary code.
>> 
>> len=0 is chosen so that
>> * passing zeroes is the simplest thing
>> 
>> 	syscall(__NR_sched_getaffinity, 0, 0, NULL)
>> 
>>   will simply do the right thing,
>> 
>> * old kernels returned -EINVAL unconditionally.
>> 
>> Note: glibc segfaults upon exiting from system call because it tries to
>> clear the rest of the buffer if return value is positive, so
>> applications will have to use syscall(3).
>> Good news is that it proves noone uses sched_getaffinity(pid, 0, NULL).

Given that old kernels fail with EINVAL, that evidence is fairly
restricted.

I'm not sure if it's a good idea to overload this interface.  I expect
that users will want to call sched_getaffinity (the system call wrapper)
with cpusetsize == 0 to query the value, so there will be pressure on
glibc to remove the memset.  At that point we have an API that obscurely
fails with old glibc versions, but succeeds with newer ones, which isn't
great.

Thanks,
Florian
Peter Zijlstra April 5, 2019, 11:04 a.m. UTC | #5
On Fri, Apr 05, 2019 at 12:16:39PM +0200, Florian Weimer wrote:

> > True; but I suppose glibc already does lots of that anyway, right? It
> > does contain the right information.
> 
> If I recall correctly my last investigation,
> /sys/devices/system/cpu/possible does not reflect the size of the
> affinity mask, either.

Strictly speaking correct; the bitmap can be longer than the highest
possible cpu number, however the remainder would be 0-padding and could
thus be stripped without issue.

So a bitmap sized to fit the max possible cpu number, should for all
practical purposes suffice.
Florian Weimer April 5, 2019, 11:08 a.m. UTC | #6
* Peter Zijlstra:

> On Fri, Apr 05, 2019 at 12:16:39PM +0200, Florian Weimer wrote:
>
>> > True; but I suppose glibc already does lots of that anyway, right? It
>> > does contain the right information.
>> 
>> If I recall correctly my last investigation,
>> /sys/devices/system/cpu/possible does not reflect the size of the
>> affinity mask, either.
>
> Strictly speaking correct; the bitmap can be longer than the highest
> possible cpu number, however the remainder would be 0-padding and could
> thus be stripped without issue.

Doesn't the kernel still enforce the larger bitmap in sched_getaffinity,
even if the bits are always zero?

Thanks,
Florian
Peter Zijlstra April 5, 2019, 11:49 a.m. UTC | #7
On Fri, Apr 05, 2019 at 01:08:58PM +0200, Florian Weimer wrote:
> * Peter Zijlstra:
> 
> > On Fri, Apr 05, 2019 at 12:16:39PM +0200, Florian Weimer wrote:
> >
> >> > True; but I suppose glibc already does lots of that anyway, right? It
> >> > does contain the right information.
> >> 
> >> If I recall correctly my last investigation,
> >> /sys/devices/system/cpu/possible does not reflect the size of the
> >> affinity mask, either.
> >
> > Strictly speaking correct; the bitmap can be longer than the highest
> > possible cpu number, however the remainder would be 0-padding and could
> > thus be stripped without issue.
> 
> Doesn't the kernel still enforce the larget bitmap in sched_getaffinity,
> even if the bits are always zero?

Oh crap, you're right. That's unfortunate I suppose.
Alexey Dobriyan April 6, 2019, 7:48 p.m. UTC | #8
On Fri, Apr 05, 2019 at 12:16:39PM +0200, Florian Weimer wrote:
> * Peter Zijlstra:
> 
> > On Wed, Apr 03, 2019 at 11:08:09PM +0300, Alexey Dobriyan wrote:
> >> Currently there is no easy way to get the number of CPUs on the system.
> 
> The size of the affinity mask is only related to the number of CPUs in
> the system in such a way that the number of CPUs cannot be larger than
> the number of bits in the affinity mask.
> 
> >> Glibc in particular shipped with 1024 CPUs support maximum at some point
> >> which is quite surprising as glibc maitainers should know better.
> 
> This dates back to a time when the kernel was never going to support
> more than 1024 CPUs.
> 
> A lot of distribution kernels still enforce a hard limit, which papers
> over firmware bugs which tell the kernel that the system can be
> hot-plugged to a ridiculous number of sockets/CPUs.
> 
> >> Another group dynamically grow buffer until cpumask fits. This is
> >> inefficient as multiple system calls are done.
> >> 
> >> Nobody seems to parse "/sys/devices/system/cpu/possible".
> >> Even if someone does, parsing sysfs is much slower than necessary.
> >
> > True; but I suppose glibc already does lots of that anyway, right? It
> > does contain the right information.
> 
> If I recall correctly my last investigation,
> /sys/devices/system/cpu/possible does not reflect the size of the
> affinity mask, either.
> 
> >> Patch overloads sched_getaffinity(len=0) to simply return "nr_cpu_ids".
> >> This will make gettting CPU mask require at most 2 system calls
> >> and will eliminate unnecessary code.
> >> 
> >> len=0 is chosen so that
> >> * passing zeroes is the simplest thing
> >> 
> >> 	syscall(__NR_sched_getaffinity, 0, 0, NULL)
> >> 
> >>   will simply do the right thing,
> >> 
> >> * old kernels returned -EINVAL unconditionally.
> >> 
> >> Note: glibc segfaults upon exiting from system call because it tries to
> >> clear the rest of the buffer if return value is positive, so
> >> applications will have to use syscall(3).
> >> Good news is that it proves noone uses sched_getaffinity(pid, 0, NULL).
> 
> Given that old kernels fail with EINVAL, that evidence is fairly
> restricted.
> 
> I'm not sure if it's a good idea to overload this interface.  I expect
> that users will want to call sched_getaffinity (the system call wrapper)
> with cpusetsize == 0 to query the value, so there will be pressure on
> glibc to remove the memset.  At that point we have an API that obscurely
> fails with old glibc versions, but suceeds with newer ones, which isn't
> great.

I can do "if (len == 536870912)" so that the bit count overflows into
EINVAL on old kernels and the value is unlikely to ever be used.
Florian Weimer April 8, 2019, 7:49 a.m. UTC | #9
* Alexey Dobriyan:

>> >> Patch overloads sched_getaffinity(len=0) to simply return "nr_cpu_ids".
>> >> This will make gettting CPU mask require at most 2 system calls
>> >> and will eliminate unnecessary code.
>> >> 
>> >> len=0 is chosen so that
>> >> * passing zeroes is the simplest thing
>> >> 
>> >> 	syscall(__NR_sched_getaffinity, 0, 0, NULL)
>> >> 
>> >>   will simply do the right thing,
>> >> 
>> >> * old kernels returned -EINVAL unconditionally.
>> >> 
>> >> Note: glibc segfaults upon exiting from system call because it tries to
>> >> clear the rest of the buffer if return value is positive, so
>> >> applications will have to use syscall(3).
>> >> Good news is that it proves noone uses sched_getaffinity(pid, 0, NULL).
>> 
>> Given that old kernels fail with EINVAL, that evidence is fairly
>> restricted.
>> 
>> I'm not sure if it's a good idea to overload this interface.  I expect
>> that users will want to call sched_getaffinity (the system call wrapper)
>> with cpusetsize == 0 to query the value, so there will be pressure on
>> glibc to remove the memset.  At that point we have an API that obscurely
>> fails with old glibc versions, but suceeds with newer ones, which isn't
>> great.
>
> I can do "if (len == 536870912)" so that bit count overflows on old
> kernels into EINVAL and is unlikely to be used ever.

I don't see how this solves this particular issue.  It will still result
in a mysterious crash if programs use an updated system call wrapper.

Thanks,
Florian

Patch

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4942,6 +4942,9 @@  SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
 	int ret;
 	cpumask_var_t mask;
 
+	if (len == 0)
+		return nr_cpu_ids;
+
 	if ((len * BITS_PER_BYTE) < nr_cpu_ids)
 		return -EINVAL;
 	if (len & (sizeof(unsigned long)-1))