linux-kernel.vger.kernel.org archive mirror
* [regression] cpuset: offlined CPUs removed from affinity masks
@ 2020-01-16 17:41 Mathieu Desnoyers
  2020-01-16 18:27 ` Valentin Schneider
  2020-02-17 16:03 ` Mathieu Desnoyers
  0 siblings, 2 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2020-01-16 17:41 UTC (permalink / raw)
  To: Li Zefan; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

Hi,

I noticed the following regression with CONFIG_CPUSET=y. Note that
I am not using cpusets at all (only using the root cpuset I'm given
at boot), it's just configured in. I am currently working on a 5.2.5
kernel. I am simply combining use of taskset(1) (setting the affinity
mask of a process) and cpu hotplug. The result is that with
CONFIG_CPUSET=y, setting an affinity mask that includes an offline CPU number
does not keep that CPU in the affinity mask, and it is never put back when the
CPU comes back online. CONFIG_CPUSET=n behaves as expected, and puts back
the CPU into the affinity mask reported to user-space when it comes back
online.


* With CONFIG_CPUSET=y (unexpected behavior):

# echo 0 > /sys/devices/system/cpu/cpu1/online

% taskset 0x7 ./loop &
% taskset -p $!
pid 1341's current affinity mask: 5

# echo 1 > /sys/devices/system/cpu/cpu1/online

taskset -p $!
pid 1341's current affinity mask: 5

kill $!


* With CONFIG_CPUSET=n (expected behavior):

(Offlining CPU, then start task)

# echo 0 > /sys/devices/system/cpu/cpu1/online

% taskset 0x7 ./loop &
% taskset -p $!
pid 1358's current affinity mask: 5

# echo 1 > /sys/devices/system/cpu/cpu1/online

taskset -p $!
pid 1358's current affinity mask: 7

kill $!


Test system lscpu output:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               60
Model name:          Intel Core Processor (Haswell, no TSX, IBRS)
Stepping:            1
CPU MHz:             2399.996
BogoMIPS:            4799.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
NUMA node0 CPU(s):   0-7,16-23
NUMA node1 CPU(s):   8-15,24-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti ibrs ibpb fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt



-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [regression] cpuset: offlined CPUs removed from affinity masks
  2020-01-16 17:41 [regression] cpuset: offlined CPUs removed from affinity masks Mathieu Desnoyers
@ 2020-01-16 18:27 ` Valentin Schneider
  2020-02-17 16:03 ` Mathieu Desnoyers
  1 sibling, 0 replies; 16+ messages in thread
From: Valentin Schneider @ 2020-01-16 18:27 UTC (permalink / raw)
  To: Mathieu Desnoyers, Li Zefan; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

On 16/01/2020 17:41, Mathieu Desnoyers wrote:
> Hi,
> 
> I noticed the following regression with CONFIG_CPUSET=y. Note that
> I am not using cpusets at all (only using the root cpuset I'm given
> at boot), it's just configured in. I am currently working on a 5.2.5
> kernel. I am simply combining use of taskset(1) (setting the affinity
> mask of a process) and cpu hotplug. The result is that with
> CONFIG_CPUSET=y, setting an affinity mask that includes an offline CPU number
> does not keep that CPU in the affinity mask, and it is never put back when the
> CPU comes back online. CONFIG_CPUSET=n behaves as expected, and puts back
> the CPU into the affinity mask reported to user-space when it comes back
> online.
> 
> 
> * With CONFIG_CPUSET=y (unexpected behavior):
> 
> # echo 0 > /sys/devices/system/cpu/cpu1/online
> 
> % taskset 0x7 ./loop &
> % taskset -p $!
> pid 1341's current affinity mask: 5
> 
> # echo 1 > /sys/devices/system/cpu/cpu1/online
> 
> taskset -p $!
> pid 1341's current affinity mask: 5
> 
> kill $!
> 

As discussed on IRC, this is because we have in sched_setaffinity():

  cpuset_cpus_allowed(p, cpus_allowed);
  cpumask_and(new_mask, in_mask, cpus_allowed);

Another source of issues is that CPUs are taken out of cpusets when
hotplugged out, and not put back in when hotplugged back in (except for the
root cpuset which follows cpu_active_mask).

Both cpuset.effective_cpus and cpuset.cpus seem to only span
online CPUs:

  root@valsch-juno:~# cat /sys/fs/cgroup/cpuset/cpuset.effective_cpus 
  0-5
  root@valsch-juno:~# cat /sys/fs/cgroup/cpuset/cpuset.cpus
  0-5
  root@valsch-juno:~# echo 0 > /sys/devices/system/cpu/cpu3/online                                     
  [93418.733050] CPU3: shutdown
  [93418.735815] psci: CPU3 killed (polled 0 ms)
  root@valsch-juno:~# cat /sys/fs/cgroup/cpuset/cpuset.cpus                                            
  0-2,4-5
  root@valsch-juno:~# cat /sys/fs/cgroup/cpuset/cpuset.effective_cpus                                  
  0-2,4-5

The thing is, with CONFIG_CPUSET=n, we can absolutely cope with p->cpus_ptr
spanning CPUs that are offline because we still check the active/online
mask (is_cpu_allowed()). So one thing I'd like to know is why do cpusets
remove offline cpus from their mask? I could see cpuset.allowed containing
both online & offline CPUs, and cpuset.effective containing just the online
ones.

That way in sched_setaffinity() we can still check for cpuset.allowed, and
we still have the online/active check in __set_cpus_allowed_ptr() to deny
stupid requests.
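
For illustration only (plain Python, not kernel code; the function names merely
mirror the kernel paths discussed above): with CONFIG_CPUSET=y the requested
mask is intersected with the cpuset-allowed mask, which only spans online CPUs,
at sched_setaffinity() time, so the offline CPU is dropped from the stored mask
for good; with CONFIG_CPUSET=n the requested mask is stored as-is and offline
CPUs are filtered later by is_cpu_allowed().

```python
# Hypothetical model of the masking semantics discussed above.

POSSIBLE = 0xFF                     # cpu_possible_mask, CPUs 0-7

def setaffinity_cpuset_y(requested, online):
    # cpuset_cpus_allowed() only spans online CPUs, so the
    # cpumask_and() drops offline CPUs from the stored mask for good.
    return requested & online

def setaffinity_cpuset_n(requested):
    # Without cpusets the requested mask is stored as-is; offline CPUs
    # are rejected later, at placement time, by is_cpu_allowed().
    return requested & POSSIBLE

online = POSSIBLE & ~(1 << 1)       # CPU 1 offlined -> 0xFD

stored_y = setaffinity_cpuset_y(0x7, online)   # taskset 0x7 -> 0x5
stored_n = setaffinity_cpuset_n(0x7)           # taskset 0x7 -> 0x7

online = POSSIBLE                   # CPU 1 comes back online
print(hex(stored_y & online))       # 0x5: CPU 1 is lost for good
print(hex(stored_n & online))       # 0x7: CPU 1 reappears as expected
```

This reproduces the taskset transcripts from the report: mask 5 forever with
CONFIG_CPUSET=y, mask 7 restored with CONFIG_CPUSET=n.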


* Re: [regression] cpuset: offlined CPUs removed from affinity masks
  2020-01-16 17:41 [regression] cpuset: offlined CPUs removed from affinity masks Mathieu Desnoyers
  2020-01-16 18:27 ` Valentin Schneider
@ 2020-02-17 16:03 ` Mathieu Desnoyers
  2020-02-19 15:19   ` Tejun Heo
  1 sibling, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2020-02-17 16:03 UTC (permalink / raw)
  To: Li Zefan, Tejun Heo, cgroups
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Valentin Schneider

Hi,

Adding Tejun and the cgroups mailing list in CC for this cpuset regression I
reported last month.

Thanks,

Mathieu

----- On Jan 16, 2020, at 12:41 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> Hi,
> 
> I noticed the following regression with CONFIG_CPUSET=y. Note that
> I am not using cpusets at all (only using the root cpuset I'm given
> at boot), it's just configured in. I am currently working on a 5.2.5
> kernel. I am simply combining use of taskset(1) (setting the affinity
> mask of a process) and cpu hotplug. The result is that with
> CONFIG_CPUSET=y, setting an affinity mask that includes an offline CPU number
> does not keep that CPU in the affinity mask, and it is never put back when the
> CPU comes back online. CONFIG_CPUSET=n behaves as expected, and puts back
> the CPU into the affinity mask reported to user-space when it comes back
> online.
> 
> 
> * With CONFIG_CPUSET=y (unexpected behavior):
> 
> # echo 0 > /sys/devices/system/cpu/cpu1/online
> 
> % taskset 0x7 ./loop &
> % taskset -p $!
> pid 1341's current affinity mask: 5
> 
> # echo 1 > /sys/devices/system/cpu/cpu1/online
> 
> taskset -p $!
> pid 1341's current affinity mask: 5
> 
> kill $!
> 
> 
> * With CONFIG_CPUSET=n (expected behavior):
> 
> (Offlining CPU, then start task)
> 
> # echo 0 > /sys/devices/system/cpu/cpu1/online
> 
> % taskset 0x7 ./loop &
> % taskset -p $!
> pid 1358's current affinity mask: 5
> 
> # echo 1 > /sys/devices/system/cpu/cpu1/online
> 
> taskset -p $!
> pid 1358's current affinity mask: 7
> 
> kill $!
> 
> 
> Test system lscpu output:
> 
> Architecture:        x86_64
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> CPU(s):              32
> On-line CPU(s) list: 0-31
> Thread(s) per core:  2
> Core(s) per socket:  8
> Socket(s):           2
> NUMA node(s):        2
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               60
> Model name:          Intel Core Processor (Haswell, no TSX, IBRS)
> Stepping:            1
> CPU MHz:             2399.996
> BogoMIPS:            4799.99
> Hypervisor vendor:   KVM
> Virtualization type: full
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            4096K
> NUMA node0 CPU(s):   0-7,16-23
> NUMA node1 CPU(s):   8-15,24-31
> Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc
> rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1
> sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
> hypervisor lahf_lm abm cpuid_fault invpcid_single pti ibrs ibpb fsgsbase bmi1
> avx2 smep bmi2 erms invpcid xsaveopt
> 
> 
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [regression] cpuset: offlined CPUs removed from affinity masks
  2020-02-17 16:03 ` Mathieu Desnoyers
@ 2020-02-19 15:19   ` Tejun Heo
  2020-02-19 15:43     ` Mathieu Desnoyers
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2020-02-19 15:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Li Zefan, cgroups, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider

Hello,

On Mon, Feb 17, 2020 at 11:03:07AM -0500, Mathieu Desnoyers wrote:
> Hi,
> 
> Adding Tejun and the cgroups mailing list in CC for this cpuset regression I
> reported last month.
> 
> Thanks,
> 
> Mathieu
> 
> ----- On Jan 16, 2020, at 12:41 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
> 
> > Hi,
> > 
> > I noticed the following regression with CONFIG_CPUSET=y. Note that
> > I am not using cpusets at all (only using the root cpuset I'm given
> > at boot), it's just configured in. I am currently working on a 5.2.5
> > kernel. I am simply combining use of taskset(1) (setting the affinity
> > mask of a process) and cpu hotplug. The result is that with
> > CONFIG_CPUSET=y, setting an affinity mask that includes an offline CPU number
> > does not keep that CPU in the affinity mask, and it is never put back when the
> > CPU comes back online. CONFIG_CPUSET=n behaves as expected, and puts back
> > the CPU into the affinity mask reported to user-space when it comes back
> > online.

Because cpuset operations irreversibly change task affinity masks
rather than masking them dynamically, the interaction has always been
kinda broken. Hmm... Are there older kernel versions which behave
differently? Off the top of my head, I can't think of anything that could
have changed that behavior recently, but I could easily be missing
something.

Thanks.

-- 
tejun


* Re: [regression] cpuset: offlined CPUs removed from affinity masks
  2020-02-19 15:19   ` Tejun Heo
@ 2020-02-19 15:43     ` Mathieu Desnoyers
  2020-02-19 15:47       ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2020-02-19 15:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, cgroups, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider

----- On Feb 19, 2020, at 10:19 AM, Tejun Heo tj@kernel.org wrote:

> Hello,
> 
> On Mon, Feb 17, 2020 at 11:03:07AM -0500, Mathieu Desnoyers wrote:
>> Hi,
>> 
>> Adding Tejun and the cgroups mailing list in CC for this cpuset regression I
>> reported last month.
>> 
>> Thanks,
>> 
>> Mathieu
>> 
>> ----- On Jan 16, 2020, at 12:41 PM, Mathieu Desnoyers
>> mathieu.desnoyers@efficios.com wrote:
>> 
>> > Hi,
>> > 
>> > I noticed the following regression with CONFIG_CPUSET=y. Note that
>> > I am not using cpusets at all (only using the root cpuset I'm given
>> > at boot), it's just configured in. I am currently working on a 5.2.5
>> > kernel. I am simply combining use of taskset(1) (setting the affinity
>> > mask of a process) and cpu hotplug. The result is that with
>> > CONFIG_CPUSET=y, setting an affinity mask that includes an offline CPU number
>> > does not keep that CPU in the affinity mask, and it is never put back when the
>> > CPU comes back online. CONFIG_CPUSET=n behaves as expected, and puts back
>> > the CPU into the affinity mask reported to user-space when it comes back
>> > online.
> 
> Because cpuset operations irreversibly change task affinity masks
> rather than masking them dynamically, the interaction has always been
> kinda broken. Hmm... Are there older kernel versions which behave
> differently? Off the top of my head, I can't think of anything that could
> have changed that behavior recently, but I could easily be missing
> something.

Hi Tejun,

The regression I'm talking about here is that CONFIG_CPUSET=y changes the
behavior of the sched_setaffinity system call, which existed prior to
cpusets.

sched_setaffinity should behave in the same way for kernels configured with
CONFIG_CPUSET=y or CONFIG_CPUSET=n.

The fact that cpuset decides to irreversibly change the task affinity mask
may not be considered a regression if it has always done that, but changing
the behavior of sched_setaffinity seems to fit the definition of a regression.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [regression] cpuset: offlined CPUs removed from affinity masks
  2020-02-19 15:43     ` Mathieu Desnoyers
@ 2020-02-19 15:47       ` Tejun Heo
  2020-02-19 15:50         ` Mathieu Desnoyers
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2020-02-19 15:47 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Li Zefan, cgroups, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider

On Wed, Feb 19, 2020 at 10:43:05AM -0500, Mathieu Desnoyers wrote:
> The regression I'm talking about here is that CONFIG_CPUSET=y changes the
> behavior of the sched_setaffinity system call, which existed prior to
> cpusets.
> 
> sched_setaffinity should behave in the same way for kernels configured with
> CONFIG_CPUSET=y or CONFIG_CPUSET=n.
> 
> The fact that cpuset decides to irreversibly change the task affinity mask
> may not be considered a regression if it has always done that, but changing
> the behavior of sched_setaffinity seems to fit the definition of a regression.

We generally use "regression" for breakages which weren't in past
versions but then appeared later. It has debugging implications
because if we know something is a regression, we generally can point
to the commit which introduced the bug either through examining the
history or bisection.

It is a silly bug, for sure, but slapping regression name on it just
confuses rather than helping anything.

-- 
tejun


* Re: [regression] cpuset: offlined CPUs removed from affinity masks
  2020-02-19 15:47       ` Tejun Heo
@ 2020-02-19 15:50         ` Mathieu Desnoyers
  2020-02-19 15:52           ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2020-02-19 15:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, cgroups, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider

----- On Feb 19, 2020, at 10:47 AM, Tejun Heo tj@kernel.org wrote:

> On Wed, Feb 19, 2020 at 10:43:05AM -0500, Mathieu Desnoyers wrote:
>> The regression I'm talking about here is that CONFIG_CPUSET=y changes the
>> behavior of the sched_setaffinity system call, which existed prior to
>> cpusets.
>> 
>> sched_setaffinity should behave in the same way for kernels configured with
>> CONFIG_CPUSET=y or CONFIG_CPUSET=n.
>> 
>> The fact that cpuset decides to irreversibly change the task affinity mask
>> may not be considered a regression if it has always done that, but changing
>> the behavior of sched_setaffinity seems to fit the definition of a regression.
> 
> We generally use "regression" for breakages which weren't in past
> versions but then appeared later. It has debugging implications
> because if we know something is a regression, we generally can point
> to the commit which introduced the bug either through examining the
> history or bisection.
> 
> It is a silly bug, for sure, but slapping regression name on it just
> confuses rather than helping anything.

I can look into figuring out the commit introducing this issue, which I
suspect will be close to the introduction of CONFIG_CPUSET into the
kernel (which was ages ago). I'll check and let you know.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [regression] cpuset: offlined CPUs removed from affinity masks
  2020-02-19 15:50         ` Mathieu Desnoyers
@ 2020-02-19 15:52           ` Tejun Heo
  2020-02-19 16:08             ` Mathieu Desnoyers
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2020-02-19 15:52 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Li Zefan, cgroups, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider

On Wed, Feb 19, 2020 at 10:50:35AM -0500, Mathieu Desnoyers wrote:
> I can look into figuring out the commit introducing this issue, which I
> suspect will be close to the introduction of CONFIG_CPUSET into the
> kernel (which was ages ago). I'll check and let you know.

Oh, yeah, I'm pretty sure it goes way back. I don't think tracking
that down would be necessary. I was just wondering whether it was a
recent change because you said it was a regression.

-- 
tejun


* Re: [regression] cpuset: offlined CPUs removed from affinity masks
  2020-02-19 15:52           ` Tejun Heo
@ 2020-02-19 16:08             ` Mathieu Desnoyers
  2020-02-19 16:12               ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2020-02-19 16:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, cgroups, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Thomas Gleixner

----- On Feb 19, 2020, at 10:52 AM, Tejun Heo tj@kernel.org wrote:

> On Wed, Feb 19, 2020 at 10:50:35AM -0500, Mathieu Desnoyers wrote:
>> I can look into figuring out the commit introducing this issue, which I
>> suspect will be close to the introduction of CONFIG_CPUSET into the
>> kernel (which was ages ago). I'll check and let you know.
> 
> Oh, yeah, I'm pretty sure it goes way back. I don't think tracking
> that down would be necessary. I was just wondering whether it was a
> recent change because you said it was a regression.

It's most likely not a recent regression, but it has unfortunate effects
on the affinity mask which directly affects my ongoing work on the
pin_on_cpu() system call [1].

The sched_setaffinity vs cpu hotplug semantic provided by CONFIG_CPUSET=n
is fine for the needs of pin_on_cpu(): when a CPU comes back online, it
reappears in the affinity mask, but that is not the case with
CONFIG_CPUSET=y.

I wonder if applying the online cpu masks to the per-thread affinity mask
is the correct approach? I suspect what we may be looking for here is to keep
the affinity mask independent of cpu hotplug, and look up both the per-thread
affinity mask and the online cpu mask whenever the scheduler needs to perform
"is_cpu_allowed()" to check task placement.

Then whenever sched_getaffinity or cpusets try to query the current set of
cpus on which a task can run right now, it could also look at both the task's
affinity mask and the online cpu mask.
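
A sketch of the proposed semantics (illustrative Python with hypothetical
names, not kernel code): the stored affinity mask is never rewritten by
hotplug; both placement and reporting intersect it with the online mask on
demand.

```python
# Model of "keep the affinity mask independent of cpu hotplug".

online = {0, 1, 2, 3}

class Task:
    def __init__(self, affinity):
        self.affinity = set(affinity)   # touched only by sched_setaffinity

def is_cpu_allowed(task, cpu):
    # placement-time check: affinity AND online, evaluated at use time
    return cpu in task.affinity and cpu in online

def getaffinity(task):
    # reporting also intersects on demand; the stored mask is untouched
    return task.affinity & online

t = Task({0, 1, 2})
online.discard(1)                       # CPU 1 hotplugged out
print(is_cpu_allowed(t, 1))             # False
print(sorted(getaffinity(t)))           # [0, 2]
online.add(1)                           # CPU 1 comes back online
print(sorted(getaffinity(t)))           # [0, 1, 2]: reappears for free
```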

Thanks,

Mathieu

[1] https://lore.kernel.org/r/20200121160312.26545-1-mathieu.desnoyers@efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [regression] cpuset: offlined CPUs removed from affinity masks
  2020-02-19 16:08             ` Mathieu Desnoyers
@ 2020-02-19 16:12               ` Tejun Heo
  2020-03-07 16:06                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2020-02-19 16:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Li Zefan, cgroups, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Thomas Gleixner

On Wed, Feb 19, 2020 at 11:08:39AM -0500, Mathieu Desnoyers wrote:
> I wonder if applying the online cpu masks to the per-thread affinity mask
> is the correct approach? I suspect what we may be looking for here is to keep

Oh, the whole thing is wrong.

> the affinity mask independent of cpu hotplug, and look up both the per-thread
> affinity mask and the online cpu mask whenever the scheduler needs to perform
> "is_cpu_allowed()" to check task placement.

Yes, that's what it should have done from the get-go. The way it's
implemented now, maybe we can avoid some specific cases like cpuset
not being used at all but it'll constantly get in the way if you're
expecting thread affinity to retain its value across offlines.

-- 
tejun


* Re: [regression] cpuset: offlined CPUs removed from affinity masks
  2020-02-19 16:12               ` Tejun Heo
@ 2020-03-07 16:06                 ` Mathieu Desnoyers
  2020-03-12 18:26                   ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2020-03-07 16:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, cgroups, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Thomas Gleixner

----- On Feb 19, 2020, at 11:12 AM, Tejun Heo tj@kernel.org wrote:

> On Wed, Feb 19, 2020 at 11:08:39AM -0500, Mathieu Desnoyers wrote:
>> I wonder if applying the online cpu masks to the per-thread affinity mask
>> is the correct approach? I suspect what we may be looking for here is to keep
> 
> Oh, the whole thing is wrong.
> 
>> the affinity mask independent of cpu hotplug, and look up both the per-thread
>> affinity mask and the online cpu mask whenever the scheduler needs to perform
>> "is_cpu_allowed()" to check task placement.
> 
> Yes, that's what it should have done from the get-go. The way it's
> implemented now, maybe we can avoid some specific cases like cpuset
> not being used at all but it'll constantly get in the way if you're
> expecting thread affinity to retain its value across offlines.

Looking into solving this, one key issue seems to get in the way: cpusets
appear to care about not allowing the creation of a cpuset which has no
currently active CPU on which to run, e.g.:

# it is forbidden to create an empty cpuset if the cpu is offlined first:

mkdir /sys/fs/cgroup/cpuset/test

echo 2 > /sys/fs/cgroup/cpuset/test/cpuset.cpus

cat /sys/fs/cgroup/cpuset/test/cpuset.cpus
2

echo 0 > /sys/devices/system/cpu/cpu1/online

echo 1 > /sys/fs/cgroup/cpuset/test/cpuset.cpus
bash: echo: write error: Invalid argument

cat /sys/fs/cgroup/cpuset/test/cpuset.cpus
2


# but it's perfectly fine to generate this empty cpuset by offlining
# a cpu _after_ creating the cpuset:

echo 0 > /sys/devices/system/cpu/cpu2/online

cat /sys/fs/cgroup/cpuset/test/cpuset.cpus
  <----- empty (nothing)

Some further testing seems to show that tasks belonging to that empty
cpuset are placed anywhere on active cpus.

Clearly, there is an intent that cpusets take the active mask into
account to prohibit creating an empty cpuset, but nothing prevents
cpu hotplug from creating an empty cpuset.

I wonder how to solve this inconsistency?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [regression] cpuset: offlined CPUs removed from affinity masks
  2020-03-07 16:06                 ` Mathieu Desnoyers
@ 2020-03-12 18:26                   ` Tejun Heo
  2020-03-12 19:47                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2020-03-12 18:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Li Zefan, cgroups, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Thomas Gleixner

Hello,

On Sat, Mar 07, 2020 at 11:06:47AM -0500, Mathieu Desnoyers wrote:
> Looking into solving this, one key issue seems to get in the way: cpusets
> appear to care about not allowing the creation of a cpuset which has no
> currently active CPU on which to run, e.g.:
...
> Clearly, there is an intent that cpusets take the active mask into
> account to prohibit creating an empty cpuset, but nothing prevents
> cpu hotplug from creating an empty cpuset.
> 
> I wonder how to solve this inconsistency?

Please try cpuset in cgroup2. It shouldn't have those issues.

Thanks.

-- 
tejun


* Re: [regression] cpuset: offlined CPUs removed from affinity masks
  2020-03-12 18:26                   ` Tejun Heo
@ 2020-03-12 19:47                     ` Mathieu Desnoyers
  2020-03-24 18:01                       ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2020-03-12 19:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, cgroups, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Thomas Gleixner

----- On Mar 12, 2020, at 2:26 PM, Tejun Heo tj@kernel.org wrote:

> Hello,
> 
> On Sat, Mar 07, 2020 at 11:06:47AM -0500, Mathieu Desnoyers wrote:
>> Looking into solving this, one key issue seems to get in the way: cpusets
>> appear to care about not allowing the creation of a cpuset which has no
>> currently active CPU on which to run, e.g.:
> ...
>> Clearly, there is an intent that cpusets take the active mask into
>> account to prohibit creating an empty cpuset, but nothing prevents
>> cpu hotplug from creating an empty cpuset.
>> 
>> I wonder how to solve this inconsistency?
> 
> Please try cpuset in cgroup2. It shouldn't have those issues.

After figuring out how to use cgroup2 (the systemd.unified_cgroup_hierarchy=1 boot
parameter helped tremendously), and testing similar scenarios, it indeed
seems to have a much saner behavior than cgroup1.

Considering that the allowed cpu mask is weird wrt cgroup1 and cpu hotplug,
and that cgroup2 allows thread-level granularity, it does not make much sense
to prevent the pin_on_cpu() system call I am working on from pinning
on cpus which are not present in the allowed mask.

I'm currently investigating approaches that would detect situations
where a thread is pinned onto a CPU which is not part of its allowed
mask, and set the task prio at MAX_PRIO-1 (the lowest fair priority
possible) in those cases.

The basic idea is to allow applications to pin to every possible cpu, but
not allow them to use this to consume a lot of cpu time on CPUs they
are not allowed to run.
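
A hypothetical sketch of the priority-clamping idea above (plain Python, not
the actual pin_on_cpu implementation), using the kernel's priority convention
where fair priorities span [100, MAX_PRIO) and MAX_PRIO is 140:

```python
# Model of demoting a thread pinned outside its cpuset-allowed mask.

MAX_PRIO = 140

def effective_prio(task_prio, pinned_cpu, allowed_cpus):
    # A thread pinned onto a CPU outside its cpuset-allowed mask is
    # demoted to the lowest fair priority, so it can still run there
    # (e.g. to finish an rseq critical section) without consuming
    # significant cpu time from the workloads that own that CPU.
    if pinned_cpu not in allowed_cpus:
        return MAX_PRIO - 1
    return task_prio

print(effective_prio(120, 2, {0, 1, 2, 3}))   # 120: inside allowed mask
print(effective_prio(120, 5, {0, 1, 2, 3}))   # 139: demoted to lowest fair prio
```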

Thoughts?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [regression] cpuset: offlined CPUs removed from affinity masks
  2020-03-12 19:47                     ` Mathieu Desnoyers
@ 2020-03-24 18:01                       ` Tejun Heo
  2020-03-24 19:30                         ` Mathieu Desnoyers
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2020-03-24 18:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Li Zefan, cgroups, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Thomas Gleixner

Sorry about long delay.

On Thu, Mar 12, 2020 at 03:47:50PM -0400, Mathieu Desnoyers wrote:
> The basic idea is to allow applications to pin to every possible cpu, but
> not allow them to use this to consume a lot of cpu time on CPUs they
> are not allowed to run.
> 
> Thoughts?

One thing that we learned is that priority alone isn't enough to isolate cpu
consumption, no matter how low the priority may be, if the workload is latency
sensitive. The actual computation capacity of cpus gets saturated way before cpu
time is saturated, and the latency impact from lowered mips becomes noticeable. So,
depending on the workload, allowing threads to run at the lowest priority on
disallowed cpus might not lead to behaviors that users expect, but I have no idea
what kind of usage models you have in mind for the new system call.

Thanks.

-- 
tejun


* Re: [regression] cpuset: offlined CPUs removed from affinity masks
  2020-03-24 18:01                       ` Tejun Heo
@ 2020-03-24 19:30                         ` Mathieu Desnoyers
  2020-03-30 19:53                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2020-03-24 19:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, cgroups, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Thomas Gleixner

----- On Mar 24, 2020, at 2:01 PM, Tejun Heo tj@kernel.org wrote:

> On Thu, Mar 12, 2020 at 03:47:50PM -0400, Mathieu Desnoyers wrote:
>> The basic idea is to allow applications to pin to every possible cpu, but
>> not allow them to use this to consume a lot of cpu time on CPUs they
>> are not allowed to run.
>> 
>> Thoughts?
> 
> One thing that we learned is that priority alone isn't enough in isolating cpu
> consumptions no matter how low the priority may be if the workload is latency
> sensitive. The actual computation capacity of cpus gets saturated way before cpu
> time is saturated and latency impact from lowered mips becomes noticeable. So,
> depending on workloads, allowing threads to run at the lowest priority on
> disallowed cpus might not lead to behaviors that users expect but I have no idea
> what kind of usage models you have on mind for the new system call.

Let me take a step back and focus on the requirements for the moment. It should
help us navigate more easily through the various solutions available to us.

Two goals are to enable use-cases such as user-space memory allocator migration
of free memory (typically single-process), and issuing operations on each CPU's
data from the consumer of a user-space per-CPU ring buffer (multi-process over
shared memory).

For the memory allocator use-case, one scenario which illustrates the situation well
is related to CPU hotplug: with per-CPU memory pools, what should the application do
when a CPU goes offline? Ideally, it should have a manager thread able to detect
that a CPU is offline, and be able to reclaim free memory or move it into other
CPU's pools. However, considering that user-space has no means to synchronously
do this wrt CPU hotplug, the CPU may very well come back online and start using
those data structures once more, so we cannot presume mutual exclusion from an
offline CPU.

One way to achieve this is by allowing user-space to run rseq critical sections
targeting the per-CPU (user-space) data of any possible CPU.

However, when considering allowing threads to pin themselves on any of the possible
CPUs, three concerns arise:

- CPU hotplug (offline CPUs),
- sched_setaffinity affinity mask, which can be set either internally by the process
  or externally by a manager process,
- cgroup cpuset allowed mask, which can be set either internally or by a manager process.

For offline CPUs, the pin_on_cpu system call ensures that a task can run on
a "backup runqueue" when it pins itself onto an offline CPU. The current algorithm
is to choose the first online CPU's runqueue for this. As soon as the offline CPU
is brought back online, all tasks pinned to that CPU are moved to their rightful
runqueue.
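
An illustrative model of the backup-runqueue placement described above
(hypothetical Python, not the actual pin_on_cpu implementation):

```python
# Model of "first online CPU" backup-runqueue selection.

online = [0, 1, 2, 3]

def runqueue_for(pinned_cpu):
    # If the pinned CPU is offline, fall back to the first online CPU's
    # runqueue; every task pinned to that CPU agrees on the same backup,
    # which preserves mutual exclusion between them.
    if pinned_cpu in online:
        return pinned_cpu
    return min(online)

online.remove(2)                 # CPU 2 goes offline
print(runqueue_for(2))           # 0: deterministic backup runqueue
online.append(2)                 # CPU 2 comes back online
print(runqueue_for(2))           # 2: tasks move back to their runqueue
```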

For sched_setaffinity's affinity mask, I don't think it is such an issue, because
pinning onto specific CPUs does not provide more rights than what could have been
done by setting the affinity mask to a single CPU. The main difference between
sched_setaffinity to a single cpu and pin_on_cpu is the behavior when the target
CPU goes offline: sched_setaffinity then allows the thread to move to any runqueue
(which is really bad for rseq purposes), whereas pin_on_cpu moves the thread to a
runqueue which is guaranteed to be the same for all threads which want to be pinned
on that CPU.

Then there is the issue of cgroup cpuset: AFAIU, cgroup v1's integration with CPU
hotplug removes the offlined CPUs from the cgroup's allowed mask, which basically
breaks the memory allocator free memory migration/reclaim use-case, because there is
then no way to target an offline CPU if we apply the cgroup's allowed mask.

For cgroup v2, AFAIU it allows creating groups which target specific threads within
a process. Therefore, some threads could have an allowed mask which differs from the
others'. In this kind of scenario, it's not possible to have a manager thread allowed
to pin itself onto each CPU which can be accessed by other threads in the same
process.

Also, for the multi-process shared memory use-case (ring buffer), if the various
processes which interact with the same shared memory end up in different cgroups
allowed to run on different subsets of the possible CPUs, it becomes impossible to
have a consumer allowed to pin itself on all the CPUs it needs.

Ideally, I would like to come up with an approach that is not fragile when
combined with cgroups or cpu hotplug.

One approach I have envisioned is to allow pin_on_cpu to target CPUs which are
not part of the cpuset's allowed mask, but lower the priority of the threads to
the lowest possible priority while doing so. That approach would allow threads
to pin themselves on basically any CPU in the possible CPU mask. But as
you point out, maybe this is an issue in terms of workload isolation.

I am welcoming ideas on how to solve this.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [regression] cpuset: offlined CPUs removed from affinity masks
  2020-03-24 19:30                         ` Mathieu Desnoyers
@ 2020-03-30 19:53                           ` Mathieu Desnoyers
  0 siblings, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2020-03-30 19:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, cgroups, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Thomas Gleixner

----- On Mar 24, 2020, at 3:30 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Mar 24, 2020, at 2:01 PM, Tejun Heo tj@kernel.org wrote:
> 
>> On Thu, Mar 12, 2020 at 03:47:50PM -0400, Mathieu Desnoyers wrote:
>>> The basic idea is to allow applications to pin to every possible cpu, but
>>> not allow them to use this to consume a lot of cpu time on CPUs they
>>> are not allowed to run.
>>> 
>>> Thoughts ?
>> 
>> One thing that we learned is that priority alone isn't enough in isolating cpu
>> consumptions no matter how low the priority may be if the workload is latency
>> sensitive. The actual computation capacity of cpus gets saturated way before cpu
>> time is saturated and latency impact from lowered mips becomes noticeable. So,
>> depending on workloads, allowing threads to run at the lowest priority on
>> disallowed cpus might not lead to behaviors that users expect but I have no idea
>> what kind of usage models you have on mind for the new system call.
> 
[...]

One possibility would be to use the SCHED_IDLE scheduling class rather than SCHED_OTHER
with nice +19. The unfortunate side-effect, AFAIU, shows up when a thread requests to
be pinned on a CPU which is continuously overcommitted: it may never run. This could
come as a surprise to the user. The only case where this would happen is if:

- A thread is pinned on CPU N, and
  - CPU N is not part of the allowed mask for the task's cpuset (and is overcommitted), or
  - CPU N is offline, and the fallback CPU is not part of the allowed mask for the
    task's cpuset (and is overcommitted).

Is this an acceptable behavior? How is userspace supposed to detect this kind of
situation and mitigate it?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


end of thread, other threads:[~2020-03-30 19:53 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-16 17:41 [regression] cpuset: offlined CPUs removed from affinity masks Mathieu Desnoyers
2020-01-16 18:27 ` Valentin Schneider
2020-02-17 16:03 ` Mathieu Desnoyers
2020-02-19 15:19   ` Tejun Heo
2020-02-19 15:43     ` Mathieu Desnoyers
2020-02-19 15:47       ` Tejun Heo
2020-02-19 15:50         ` Mathieu Desnoyers
2020-02-19 15:52           ` Tejun Heo
2020-02-19 16:08             ` Mathieu Desnoyers
2020-02-19 16:12               ` Tejun Heo
2020-03-07 16:06                 ` Mathieu Desnoyers
2020-03-12 18:26                   ` Tejun Heo
2020-03-12 19:47                     ` Mathieu Desnoyers
2020-03-24 18:01                       ` Tejun Heo
2020-03-24 19:30                         ` Mathieu Desnoyers
2020-03-30 19:53                           ` Mathieu Desnoyers
