* [PATCH v2] cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
@ 2022-04-25 15:55 Waiman Long
  2022-04-26  3:23 ` Feng Tang
  2022-04-27 13:53 ` Michal Koutný
From: Waiman Long @ 2022-04-25 15:55 UTC
  To: Tejun Heo, Zefan Li, Johannes Weiner
  Cc: cgroups, linux-mm, linux-kernel, Feng Tang, Andrew Morton,
	Michal Hocko, Dave Hansen, ying.huang, Waiman Long, stable

There are 3 places where the cpu and node masks of the top cpuset can
be initialized, listed in the order they are executed:
 1) start_kernel -> cpuset_init()
 2) start_kernel -> cgroup_init() -> cpuset_bind()
 3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp()

The first cpuset_init() function just sets all the bits in the masks.
The last one executed is cpuset_init_smp() which sets up cpu and node
masks suitable for v1, but not v2.  cpuset_bind() does the right setup
for both v1 and v2.
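
For reference, cpuset_bind() sets up the top cpuset's masks roughly as
follows (a simplified sketch of the relevant lines in
kernel/cgroup/cpuset.c; locking and unrelated details are omitted):

	static void cpuset_bind(struct cgroup_subsys_state *root_css)
	{
		if (is_in_v2_mode()) {
			/* v2: top cpuset covers all possible cpus and nodes */
			cpumask_copy(top_cpuset.cpus_allowed, cpu_possible_mask);
			top_cpuset.mems_allowed = node_possible_map;
		} else {
			/* v1: the allowed masks follow the effective masks */
			cpumask_copy(top_cpuset.cpus_allowed,
				     top_cpuset.effective_cpus);
			top_cpuset.mems_allowed = top_cpuset.effective_mems;
		}
	}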

For systems with cgroup v2 setup, cpuset_bind() is called once. For
systems with cgroup v1 setup, cpuset_bind() is called twice. It is
first called before cpuset_init_smp() in cgroup v2 mode.  Then it is
called again in v1 mode when the cgroup v1 filesystem is mounted after
cpuset_init_smp().

  [    2.609781] cpuset_bind() called - v2 = 1
  [    3.079473] cpuset_init_smp() called
  [    7.103710] cpuset_bind() called - v2 = 0

As a result, cpu and memory node hot add may fail to update the cpu and
node masks of the top cpuset to include the newly added cpu or node in
a cgroup v2 environment.

smp_init() is called after the first two init functions.  So we don't
have a complete list of active cpus and memory nodes until later in
cpuset_init_smp() which is the right time to set up effective_cpus
and effective_mems.

To fix this problem, the potentially incorrect cpus_allowed &
mems_allowed setup in cpuset_init_smp() is removed.  For cgroup v2
systems, the initial cpuset_bind() call will set them up correctly.
For cgroup v1 systems, the second call to cpuset_bind() will do the
right setup.

cc: stable@vger.kernel.org
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9390bfd9f1cd..6bd8f5ef40fe 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3390,8 +3390,9 @@ static struct notifier_block cpuset_track_online_nodes_nb = {
  */
 void __init cpuset_init_smp(void)
 {
-	cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
-	top_cpuset.mems_allowed = node_states[N_MEMORY];
+	/*
+	 * cpus_allowed/mems_allowed will be properly set up in cpuset_bind().
+	 */
 	top_cpuset.old_mems_allowed = top_cpuset.mems_allowed;
 
 	cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
-- 
2.27.0




* Re: [PATCH v2] cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
  2022-04-25 15:55 [PATCH v2] cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp() Waiman Long
@ 2022-04-26  3:23 ` Feng Tang
  2022-04-26 14:58   ` Waiman Long
  2022-04-27 13:53 ` Michal Koutný
From: Feng Tang @ 2022-04-26  3:23 UTC
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, cgroups, linux-mm,
	linux-kernel, Andrew Morton, Michal Hocko, Dave Hansen,
	ying.huang, stable

Hi Waiman,

On Mon, Apr 25, 2022 at 11:55:05AM -0400, Waiman Long wrote:
> There are 3 places where the cpu and node masks of the top cpuset can
> be initialized, listed in the order they are executed:
>  1) start_kernel -> cpuset_init()
>  2) start_kernel -> cgroup_init() -> cpuset_bind()
>  3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp()
> 
> The first cpuset_init() function just sets all the bits in the masks.
> The last one executed is cpuset_init_smp() which sets up cpu and node
> masks suitable for v1, but not v2.  cpuset_bind() does the right setup
> for both v1 and v2.
> 
> For systems with cgroup v2 setup, cpuset_bind() is called once. For
> systems with cgroup v1 setup, cpuset_bind() is called twice. It is
> first called before cpuset_init_smp() in cgroup v2 mode.  Then it is
> called again in v1 mode when the cgroup v1 filesystem is mounted after
> cpuset_init_smp().
> 
>   [    2.609781] cpuset_bind() called - v2 = 1
>   [    3.079473] cpuset_init_smp() called
>   [    7.103710] cpuset_bind() called - v2 = 0

I ran some tests. On a server with CentOS, it did happen that
cpuset_bind() was called twice, first as v2 during kernel boot,
and then as v1 post-boot.

However, on QEMU running with a basic Debian rootfs image, the
second call of cpuset_bind() didn't happen.

> As a result, cpu and memory node hot add may fail to update the cpu and
> node masks of the top cpuset to include the newly added cpu or node in
> a cgroup v2 environment.
> 
> smp_init() is called after the first two init functions.  So we don't
> have a complete list of active cpus and memory nodes until later in
> cpuset_init_smp() which is the right time to set up effective_cpus
> and effective_mems.
> 
> To fix this problem, the potentially incorrect cpus_allowed &
> mems_allowed setup in cpuset_init_smp() is removed.  For cgroup v2
> systems, the initial cpuset_bind() call will set them up correctly.
> For cgroup v1 systems, the second call to cpuset_bind() will do the
> right setup.
> 
> cc: stable@vger.kernel.org
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cgroup/cpuset.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 9390bfd9f1cd..6bd8f5ef40fe 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3390,8 +3390,9 @@ static struct notifier_block cpuset_track_online_nodes_nb = {
>   */
>  void __init cpuset_init_smp(void)
>  {
> -	cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
> -	top_cpuset.mems_allowed = node_states[N_MEMORY];

So can we keep line
  cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);

and only remove line 
       top_cpuset.mems_allowed = node_states[N_MEMORY];
?

Thanks,
Feng



* Re: [PATCH v2] cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
  2022-04-26  3:23 ` Feng Tang
@ 2022-04-26 14:58   ` Waiman Long
  2022-04-27  1:06     ` Feng Tang
From: Waiman Long @ 2022-04-26 14:58 UTC
  To: Feng Tang
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, cgroups, linux-mm,
	linux-kernel, Andrew Morton, Michal Hocko, Dave Hansen,
	ying.huang, stable

On 4/25/22 23:23, Feng Tang wrote:
> Hi Waiman,
>
> On Mon, Apr 25, 2022 at 11:55:05AM -0400, Waiman Long wrote:
>> There are 3 places where the cpu and node masks of the top cpuset can
>> be initialized, listed in the order they are executed:
>>   1) start_kernel -> cpuset_init()
>>   2) start_kernel -> cgroup_init() -> cpuset_bind()
>>   3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp()
>>
>> The first cpuset_init() function just sets all the bits in the masks.
>> The last one executed is cpuset_init_smp() which sets up cpu and node
>> masks suitable for v1, but not v2.  cpuset_bind() does the right setup
>> for both v1 and v2.
>>
>> For systems with cgroup v2 setup, cpuset_bind() is called once. For
>> systems with cgroup v1 setup, cpuset_bind() is called twice. It is
>> first called before cpuset_init_smp() in cgroup v2 mode.  Then it is
>> called again in v1 mode when the cgroup v1 filesystem is mounted after
>> cpuset_init_smp().
>>
>>    [    2.609781] cpuset_bind() called - v2 = 1
>>    [    3.079473] cpuset_init_smp() called
>>    [    7.103710] cpuset_bind() called - v2 = 0
> I ran some tests. On a server with CentOS, it did happen that
> cpuset_bind() was called twice, first as v2 during kernel boot,
> and then as v1 post-boot.
>
> However, on QEMU running with a basic Debian rootfs image, the
> second call of cpuset_bind() didn't happen.

The first time cpuset_bind() is called in cgroup_init(), the kernel
doesn't know whether userspace is going to mount a v1 or v2 cgroup
hierarchy. By default, it is assumed to be v2. However, if userspace
mounts the cgroup v1 filesystem for cpuset, cpuset_bind() will be run
again by rebind_subsystems() to set up the cgroup v1 environment, and
cpus_allowed/mems_allowed will be correctly set at that point. Mounting
the cgroup v2 filesystem, however, does not cause rebind_subsystems() to
run, and hence cpuset_bind() is not called again.
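
Roughly, the v1 mount path that re-runs cpuset_bind() looks like this
(simplified; the exact call chain may vary between kernel versions):

	mount -t cgroup -o cpuset ...
	  cgroup1_get_tree
	    cgroup1_root_to_use
	      cgroup_setup_root
	        rebind_subsystems
	          ss->bind(css)	-> cpuset_bind(), now in v1 mode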

Is the QEMU setup not mounting any cgroup filesystem at all? If so, does 
it matter whether v1 or v2 setup is used?

>> As a result, cpu and memory node hot add may fail to update the cpu and
>> node masks of the top cpuset to include the newly added cpu or node in
>> a cgroup v2 environment.
>>
>> smp_init() is called after the first two init functions.  So we don't
>> have a complete list of active cpus and memory nodes until later in
>> cpuset_init_smp() which is the right time to set up effective_cpus
>> and effective_mems.
>>
>> To fix this problem, the potentially incorrect cpus_allowed &
>> mems_allowed setup in cpuset_init_smp() is removed.  For cgroup v2
>> systems, the initial cpuset_bind() call will set them up correctly.
>> For cgroup v1 systems, the second call to cpuset_bind() will do the
>> right setup.
>>
>> cc: stable@vger.kernel.org
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cgroup/cpuset.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 9390bfd9f1cd..6bd8f5ef40fe 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -3390,8 +3390,9 @@ static struct notifier_block cpuset_track_online_nodes_nb = {
>>    */
>>   void __init cpuset_init_smp(void)
>>   {
>> -	cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
>> -	top_cpuset.mems_allowed = node_states[N_MEMORY];
> So can we keep line
>    cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
>
> and only remove line
>         top_cpuset.mems_allowed = node_states[N_MEMORY];
> ?

That may cause cpuset.cpus to be set incorrectly for systems using 
cgroup v2. What is really important is that effective_cpus and 
effective_mems are set correctly.
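
A hypothetical example in v2 mode (4 possible CPUs, only cpu0-1 active
at boot) may make this clearer:

	without this patch:
	  boot:          cpus_allowed = 0-1  (cpu_active_mask in cpuset_init_smp())
	  hot add cpu2:  cpus_allowed = 0-1  (the new cpu is missing)

	with this patch:
	  boot:          cpus_allowed = 0-3  (cpu_possible_mask via cpuset_bind())
	  hot add cpu2:  cpus_allowed = 0-3  (the new cpu is already covered)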

Cheers,
Longman




* Re: [PATCH v2] cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
  2022-04-26 14:58   ` Waiman Long
@ 2022-04-27  1:06     ` Feng Tang
  2022-04-27  2:34       ` Waiman Long
From: Feng Tang @ 2022-04-27  1:06 UTC
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, cgroups, linux-mm,
	linux-kernel, Andrew Morton, Michal Hocko, Hansen, Dave, Huang,
	Ying, stable

On Tue, Apr 26, 2022 at 10:58:21PM +0800, Waiman Long wrote:
> On 4/25/22 23:23, Feng Tang wrote:
> > Hi Waiman,
> >
> > On Mon, Apr 25, 2022 at 11:55:05AM -0400, Waiman Long wrote:
> >> There are 3 places where the cpu and node masks of the top cpuset can
> >> be initialized, listed in the order they are executed:
> >>   1) start_kernel -> cpuset_init()
> >>   2) start_kernel -> cgroup_init() -> cpuset_bind()
> >>   3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp()
> >>
> >> The first cpuset_init() function just sets all the bits in the masks.
> >> The last one executed is cpuset_init_smp() which sets up cpu and node
> >> masks suitable for v1, but not v2.  cpuset_bind() does the right setup
> >> for both v1 and v2.
> >>
> >> For systems with cgroup v2 setup, cpuset_bind() is called once. For
> >> systems with cgroup v1 setup, cpuset_bind() is called twice. It is
> >> first called before cpuset_init_smp() in cgroup v2 mode.  Then it is
> >> called again in v1 mode when the cgroup v1 filesystem is mounted after
> >> cpuset_init_smp().
> >>
> >>    [    2.609781] cpuset_bind() called - v2 = 1
> >>    [    3.079473] cpuset_init_smp() called
> >>    [    7.103710] cpuset_bind() called - v2 = 0
> > I ran some tests. On a server with CentOS, it did happen that
> > cpuset_bind() was called twice, first as v2 during kernel boot,
> > and then as v1 post-boot.
> >
> > However, on QEMU running with a basic Debian rootfs image, the
> > second call of cpuset_bind() didn't happen.
> 
> The first time cpuset_bind() is called in cgroup_init(), the kernel
> doesn't know whether userspace is going to mount a v1 or v2 cgroup
> hierarchy. By default, it is assumed to be v2. However, if userspace
> mounts the cgroup v1 filesystem for cpuset, cpuset_bind() will be run
> again by rebind_subsystems() to set up the cgroup v1 environment, and
> cpus_allowed/mems_allowed will be correctly set at that point. Mounting
> the cgroup v2 filesystem, however, does not cause rebind_subsystems() to
> run, and hence cpuset_bind() is not called again.
> 
> Is the QEMU setup not mounting any cgroup filesystem at all? If so, does 
> it matter whether v1 or v2 setup is used?

When I got the cpuset binding error report, I first tried to reproduce
it on QEMU and failed (because there was no memory hotplug), then I
reproduced it on a real server. For both systems, I used the
"cgroup_no_v1=all" cmdline parameter to test cgroup-v2. Could this be
the reason? (TBH, this is the first time I have used cgroup-v2.)

Here is the info dump:

# mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)

# cat /proc/filesystems | grep cgroup
nodev   cgroup
nodev   cgroup2

Thanks,
Feng

> >> As a result, cpu and memory node hot add may fail to update the cpu and
> >> node masks of the top cpuset to include the newly added cpu or node in
> >> a cgroup v2 environment.
> >>
> >> smp_init() is called after the first two init functions.  So we don't
> >> have a complete list of active cpus and memory nodes until later in
> >> cpuset_init_smp() which is the right time to set up effective_cpus
> >> and effective_mems.
> >>
> >> To fix this problem, the potentially incorrect cpus_allowed &
> >> mems_allowed setup in cpuset_init_smp() is removed.  For cgroup v2
> >> systems, the initial cpuset_bind() call will set them up correctly.
> >> For cgroup v1 systems, the second call to cpuset_bind() will do the
> >> right setup.
> >>
> >> cc: stable@vger.kernel.org
> >> Signed-off-by: Waiman Long <longman@redhat.com>
> >> ---
> >>   kernel/cgroup/cpuset.c | 5 +++--
> >>   1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> >> index 9390bfd9f1cd..6bd8f5ef40fe 100644
> >> --- a/kernel/cgroup/cpuset.c
> >> +++ b/kernel/cgroup/cpuset.c
> >> @@ -3390,8 +3390,9 @@ static struct notifier_block cpuset_track_online_nodes_nb = {
> >>    */
> >>   void __init cpuset_init_smp(void)
> >>   {
> >> -	cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
> >> -	top_cpuset.mems_allowed = node_states[N_MEMORY];
> > So can we keep line
> >    cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
> >
> > and only remove line
> >         top_cpuset.mems_allowed = node_states[N_MEMORY];
> > ?
> 
> That may cause cpuset.cpus to be set incorrectly for systems using
> cgroup v2. What is really important is that effective_cpus and 
> effective_mems are set correctly.
> 
> Cheers,
> Longman
> 



* Re: [PATCH v2] cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
  2022-04-27  1:06     ` Feng Tang
@ 2022-04-27  2:34       ` Waiman Long
  2022-04-27 12:09         ` Feng Tang
From: Waiman Long @ 2022-04-27  2:34 UTC
  To: Feng Tang
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, cgroups, linux-mm,
	linux-kernel, Andrew Morton, Michal Hocko, Hansen, Dave, Huang,
	Ying, stable

On 4/26/22 21:06, Feng Tang wrote:
> On Tue, Apr 26, 2022 at 10:58:21PM +0800, Waiman Long wrote:
>> On 4/25/22 23:23, Feng Tang wrote:
>>> Hi Waiman,
>>>
>>> On Mon, Apr 25, 2022 at 11:55:05AM -0400, Waiman Long wrote:
>>>> There are 3 places where the cpu and node masks of the top cpuset can
>>>> be initialized, listed in the order they are executed:
>>>>    1) start_kernel -> cpuset_init()
>>>>    2) start_kernel -> cgroup_init() -> cpuset_bind()
>>>>    3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp()
>>>>
>>>> The first cpuset_init() function just sets all the bits in the masks.
>>>> The last one executed is cpuset_init_smp() which sets up cpu and node
>>>> masks suitable for v1, but not v2.  cpuset_bind() does the right setup
>>>> for both v1 and v2.
>>>>
>>>> For systems with cgroup v2 setup, cpuset_bind() is called once. For
>>>> systems with cgroup v1 setup, cpuset_bind() is called twice. It is
>>>> first called before cpuset_init_smp() in cgroup v2 mode.  Then it is
>>>> called again in v1 mode when the cgroup v1 filesystem is mounted after
>>>> cpuset_init_smp().
>>>>
>>>>     [    2.609781] cpuset_bind() called - v2 = 1
>>>>     [    3.079473] cpuset_init_smp() called
>>>>     [    7.103710] cpuset_bind() called - v2 = 0
>>> I ran some tests. On a server with CentOS, it did happen that
>>> cpuset_bind() was called twice, first as v2 during kernel boot,
>>> and then as v1 post-boot.
>>>
>>> However, on QEMU running with a basic Debian rootfs image, the
>>> second call of cpuset_bind() didn't happen.
>> The first time cpuset_bind() is called in cgroup_init(), the kernel
>> doesn't know whether userspace is going to mount a v1 or v2 cgroup
>> hierarchy. By default, it is assumed to be v2. However, if userspace
>> mounts the cgroup v1 filesystem for cpuset, cpuset_bind() will be run
>> again by rebind_subsystems() to set up the cgroup v1 environment, and
>> cpus_allowed/mems_allowed will be correctly set at that point. Mounting
>> the cgroup v2 filesystem, however, does not cause rebind_subsystems() to
>> run, and hence cpuset_bind() is not called again.
>>
>> Is the QEMU setup not mounting any cgroup filesystem at all? If so, does
>> it matter whether v1 or v2 setup is used?
> When I got the cpuset binding error report, I first tried to reproduce
> it on QEMU and failed (because there was no memory hotplug), then I
> reproduced it on a real server. For both systems, I used the
> "cgroup_no_v1=all" cmdline parameter to test cgroup-v2. Could this be
> the reason? (TBH, this is the first time I have used cgroup-v2.)
>
> Here is the info dump:
>
> # mount | grep cgroup
> tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
> cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
>
> #cat /proc/filesystems | grep cgroup
> nodev   cgroup
> nodev   cgroup2
>
> Thanks,
> Feng

For cgroup v2, cpus_allowed should be set to cpu_possible_mask and
mems_allowed to node_possible_map as is done in the first invocation of
cpuset_bind(). That is the correct behavior. And with "cgroup_no_v1=all",
the cpuset controller can never be bound to a v1 hierarchy, so the second
cpuset_bind() call never happens and the v2 setup done by the first call
stays in effect.

Cheers,
Longman




* Re: [PATCH v2] cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
  2022-04-27  2:34       ` Waiman Long
@ 2022-04-27 12:09         ` Feng Tang
From: Feng Tang @ 2022-04-27 12:09 UTC
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, cgroups, linux-mm,
	linux-kernel, Andrew Morton, Michal Hocko, Hansen, Dave, Huang,
	Ying, stable

On Tue, Apr 26, 2022 at 10:34:21PM -0400, Waiman Long wrote:
> On 4/26/22 21:06, Feng Tang wrote:
> > On Tue, Apr 26, 2022 at 10:58:21PM +0800, Waiman Long wrote:
> > > On 4/25/22 23:23, Feng Tang wrote:
> > > > Hi Waiman,
> > > > 
> > > > On Mon, Apr 25, 2022 at 11:55:05AM -0400, Waiman Long wrote:
> > > > > There are 3 places where the cpu and node masks of the top cpuset can
> > > > > be initialized, listed in the order they are executed:
> > > > >    1) start_kernel -> cpuset_init()
> > > > >    2) start_kernel -> cgroup_init() -> cpuset_bind()
> > > > >    3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp()
> > > > > 
> > > > > The first cpuset_init() function just sets all the bits in the masks.
> > > > > The last one executed is cpuset_init_smp() which sets up cpu and node
> > > > > masks suitable for v1, but not v2.  cpuset_bind() does the right setup
> > > > > for both v1 and v2.
> > > > > 
> > > > > For systems with cgroup v2 setup, cpuset_bind() is called once. For
> > > > > systems with cgroup v1 setup, cpuset_bind() is called twice. It is
> > > > > first called before cpuset_init_smp() in cgroup v2 mode.  Then it is
> > > > > called again in v1 mode when the cgroup v1 filesystem is mounted after
> > > > > cpuset_init_smp().
> > > > > 
> > > > >     [    2.609781] cpuset_bind() called - v2 = 1
> > > > >     [    3.079473] cpuset_init_smp() called
> > > > >     [    7.103710] cpuset_bind() called - v2 = 0
> > > > I ran some tests. On a server with CentOS, it did happen that
> > > > cpuset_bind() was called twice, first as v2 during kernel boot,
> > > > and then as v1 post-boot.
> > > >
> > > > However, on QEMU running with a basic Debian rootfs image, the
> > > > second call of cpuset_bind() didn't happen.
> > > The first time cpuset_bind() is called in cgroup_init(), the kernel
> > > doesn't know whether userspace is going to mount a v1 or v2 cgroup
> > > hierarchy. By default, it is assumed to be v2. However, if userspace
> > > mounts the cgroup v1 filesystem for cpuset, cpuset_bind() will be run
> > > again by rebind_subsystems() to set up the cgroup v1 environment, and
> > > cpus_allowed/mems_allowed will be correctly set at that point. Mounting
> > > the cgroup v2 filesystem, however, does not cause rebind_subsystems() to
> > > run, and hence cpuset_bind() is not called again.
> > > 
> > > Is the QEMU setup not mounting any cgroup filesystem at all? If so, does
> > > it matter whether v1 or v2 setup is used?
> > When I got the cpuset binding error report, I first tried to reproduce
> > it on QEMU and failed (because there was no memory hotplug), then I
> > reproduced it on a real server. For both systems, I used the
> > "cgroup_no_v1=all" cmdline parameter to test cgroup-v2. Could this be
> > the reason? (TBH, this is the first time I have used cgroup-v2.)
> > 
> > Here is the info dump:
> > 
> > # mount | grep cgroup
> > tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
> > cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
> > 
> > # cat /proc/filesystems | grep cgroup
> > nodev   cgroup
> > nodev   cgroup2
> > 
> > Thanks,
> > Feng
> 
> For cgroup v2, cpus_allowed should be set to cpu_possible_mask and
> mems_allowed to node_possible_map as is done in the first invocation of
> cpuset_bind(). That is the correct behavior.
 
OK. For the cgroup v2 memory binding problem with hot-added nodes, I
retested today, and it can no longer be reproduced with this patch
applied. So feel free to add:
  
  Tested-by: Feng Tang <feng.tang@intel.com>

Thanks,
Feng


> Cheers,
> Longman
> 



* Re: [PATCH v2] cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
  2022-04-25 15:55 [PATCH v2] cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp() Waiman Long
  2022-04-26  3:23 ` Feng Tang
@ 2022-04-27 13:53 ` Michal Koutný
  2022-04-27 14:33   ` Waiman Long
From: Michal Koutný @ 2022-04-27 13:53 UTC
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, cgroups, linux-mm,
	linux-kernel, Feng Tang, Andrew Morton, Michal Hocko,
	Dave Hansen, ying.huang, stable

Hello.

On Mon, Apr 25, 2022 at 11:55:05AM -0400, Waiman Long <longman@redhat.com> wrote:
> smp_init() is called after the first two init functions.  So we don't
> have a complete list of active cpus and memory nodes until later in
> cpuset_init_smp() which is the right time to set up effective_cpus
> and effective_mems.

Yes.

	setup_arch
	  prefill_possible_map
	cpuset_init (1)
	cgroup_init
	  cpuset_bind (2a)
	...
	kernel_init
	  kernel_init_freeable
	    ...
	      cpuset_init_smp (3)
	...
	...
	cpuset_bind (2b)


> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 9390bfd9f1cd..6bd8f5ef40fe 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3390,8 +3390,9 @@ static struct notifier_block cpuset_track_online_nodes_nb = {
>   */
>  void __init cpuset_init_smp(void)
>  {
> -	cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
> -	top_cpuset.mems_allowed = node_states[N_MEMORY];
> +	/*
> +	 * cpus_allowed/mems_allowed will be properly set up in cpuset_bind().
> +	 */

IIUC, the comment should say

> +	 * cpus_allowed/mems_allowed were (v2) or will be (v1) properly set up in cpuset_bind().

(nit)

Reviewed-by: Michal Koutný <mkoutny@suse.com>



* Re: [PATCH v2] cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
  2022-04-27 13:53 ` Michal Koutný
@ 2022-04-27 14:33   ` Waiman Long
From: Waiman Long @ 2022-04-27 14:33 UTC
  To: Michal Koutný
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, cgroups, linux-mm,
	linux-kernel, Feng Tang, Andrew Morton, Michal Hocko,
	Dave Hansen, ying.huang, stable

On 4/27/22 09:53, Michal Koutný wrote:
> Hello.
>
> On Mon, Apr 25, 2022 at 11:55:05AM -0400, Waiman Long <longman@redhat.com> wrote:
>> smp_init() is called after the first two init functions.  So we don't
>> have a complete list of active cpus and memory nodes until later in
>> cpuset_init_smp() which is the right time to set up effective_cpus
>> and effective_mems.
> Yes.
>
> 	setup_arch
> 	  prefill_possible_map
> 	cpuset_init (1)
> 	cgroup_init
> 	  cpuset_bind (2a)
> 	...
> 	kernel_init
> 	  kernel_init_freeable
> 	    ...
> 	      cpuset_init_smp (3)
> 	...
> 	...
> 	cpuset_bind (2b)
>
>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 9390bfd9f1cd..6bd8f5ef40fe 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -3390,8 +3390,9 @@ static struct notifier_block cpuset_track_online_nodes_nb = {
>>    */
>>   void __init cpuset_init_smp(void)
>>   {
>> -	cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
>> -	top_cpuset.mems_allowed = node_states[N_MEMORY];
>> +	/*
>> +	 * cpus_allowed/mems_allowed will be properly set up in cpuset_bind().
>> +	 */
> IIUC, the comment should say
>
>> +	 * cpus_allowed/mems_allowed were (v2) or will be (v1) properly set up in cpuset_bind().
> (nit)
>
> Reviewed-by: Michal Koutný <mkoutny@suse.com>
>
Thanks for the review. I plan to post v3 with updated commit log and 
comment soon.

Cheers,
Longman



