* [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition
@ 2023-04-12 15:37 ` Waiman Long
  0 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2023-04-12 15:37 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker, Waiman Long

This patch series introduces a new "isolcpus" partition type to the
existing list of {member, root, isolated} types. The primary reason
for adding this new "isolcpus" partition is to facilitate the
distribution of isolated CPUs down the cgroup v2 hierarchy.

The other non-member partition types have the limitation that their
parents must be valid partitions too, which makes it hard to create
a partition a few layers down the hierarchy.

It is relatively rare for applications to require the creation of
a separate scheduling domain (root). However, it is more common for
applications to require the use of isolated CPUs (isolated),
e.g. DPDK. One can use the "isolcpus" or "nohz_full" boot command-line
options to get that statically. Of course, the "isolated" partition is
another way to achieve that dynamically.

Modern container orchestration tools like Kubernetes use the cgroup
hierarchy to manage different containers. If a container needs to use
isolated CPUs, it is hard to get those with the existing set of cpuset
partition types. With this patch series, a new "isolcpus" partition
can be created to hold a set of isolated CPUs that can be pulled into
other "isolated" partitions.

The "isolcpus" partition is special in that there can be at most one
instance of it in a system. It serves as a pool for isolated CPUs
and cannot hold tasks or sub-cpusets underneath it. It is also not
cpu-exclusive, so the isolated CPUs can be distributed down the
sibling hierarchies, though those isolated CPUs will not be usable
until the partition type becomes "isolated".

Once isolated CPUs are needed in a cgroup, the administrator can write
a list of isolated CPUs into its "cpuset.cpus" file and change its
partition type to "isolated" to pull those CPUs in from the "isolcpus"
partition and use them in that cgroup. That makes the distribution
of isolated CPUs to the cgroups that need them much easier.
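As a concrete illustration, the intended flow might look like the
following shell session. This is only a sketch: the interface is an RFC
proposal, the cgroup names are made up, and it assumes root privileges
with cgroup v2 mounted at /sys/fs/cgroup:

```shell
# Create the single "isolcpus" pool partition holding CPUs 8-9.
mkdir /sys/fs/cgroup/isolcpus
echo "8-9" > /sys/fs/cgroup/isolcpus/cpuset.cpus
echo isolcpus > /sys/fs/cgroup/isolcpus/cpuset.cpus.partition

# Elsewhere in the tree, a cgroup pulls those CPUs out of the pool by
# listing them and switching its partition type to "isolated".
mkdir -p /sys/fs/cgroup/user.slice/dpdk-app
echo "8-9" > /sys/fs/cgroup/user.slice/dpdk-app/cpuset.cpus
echo isolated > /sys/fs/cgroup/user.slice/dpdk-app/cpuset.cpus.partition
```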

In the future, we may be able to extend this special "isolcpus" partition
type to support other isolation attributes like those that can be
specified with the "isolcpus" boot command line and related options.

Waiman Long (5):
  cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE
    handling
  cgroup/cpuset: Add a new "isolcpus" partition root state
  cgroup/cpuset: Make isolated partition pull CPUs from isolcpus
    partition
  cgroup/cpuset: Documentation update for the new "isolcpus" partition
  cgroup/cpuset: Extend test_cpuset_prs.sh to test isolcpus partition

 Documentation/admin-guide/cgroup-v2.rst       |  89 ++-
 kernel/cgroup/cpuset.c                        | 548 +++++++++++++++---
 .../selftests/cgroup/test_cpuset_prs.sh       | 376 ++++++++----
 3 files changed, 789 insertions(+), 224 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition
  2023-04-12 15:37 ` Waiman Long
@ 2023-04-12 19:28 ` Tejun Heo
       [not found]   ` <1ce6a073-e573-0c32-c3d8-f67f3d389a28@redhat.com>
  -1 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2023-04-12 19:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker

Hello, Waiman.

On Wed, Apr 12, 2023 at 11:37:53AM -0400, Waiman Long wrote:
> This patch series introduces a new "isolcpus" partition type to the
> existing list of {member, root, isolated} types. The primary reason
> for adding this new "isolcpus" partition is to facilitate the
> distribution of isolated CPUs down the cgroup v2 hierarchy.
> 
> The other non-member partition types have the limitation that their
> parents must be valid partitions too, which makes it hard to create
> a partition a few layers down the hierarchy.
> 
> It is relatively rare for applications to require the creation of
> a separate scheduling domain (root). However, it is more common for
> applications to require the use of isolated CPUs (isolated),
> e.g. DPDK. One can use the "isolcpus" or "nohz_full" boot command-line
> options to get that statically. Of course, the "isolated" partition is
> another way to achieve that dynamically.
> 
> Modern container orchestration tools like Kubernetes use the cgroup
> hierarchy to manage different containers. If a container needs to use
> isolated CPUs, it is hard to get those with the existing set of cpuset
> partition types. With this patch series, a new "isolcpus" partition
> can be created to hold a set of isolated CPUs that can be pulled into
> other "isolated" partitions.
> 
> The "isolcpus" partition is special in that there can be at most one
> instance of it in a system. It serves as a pool for isolated CPUs
> and cannot hold tasks or sub-cpusets underneath it. It is also not
> cpu-exclusive, so the isolated CPUs can be distributed down the
> sibling hierarchies, though those isolated CPUs will not be usable
> until the partition type becomes "isolated".
> 
> Once isolated CPUs are needed in a cgroup, the administrator can write
> a list of isolated CPUs into its "cpuset.cpus" file and change its
> partition type to "isolated" to pull those CPUs in from the "isolcpus"
> partition and use them in that cgroup. That makes the distribution
> of isolated CPUs to the cgroups that need them much easier.

I'm not sure about this. It feels really hacky in that it side-steps the
distribution hierarchy completely. I can imagine a non-isolated cpuset
wanting to allow isolated cpusets downstream but that should be done
hierarchically - e.g. by allowing a cgroup to express what isolated cpus are
allowed in the subtree. Also, can you give more details on the targeted use
cases?

Thanks.

-- 
tejun


* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition
       [not found]   ` <1ce6a073-e573-0c32-c3d8-f67f3d389a28@redhat.com>
@ 2023-04-12 20:22     ` Tejun Heo
  2023-04-12 20:33       ` Waiman Long
  0 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2023-04-12 20:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker

Hello, Waiman.

On Wed, Apr 12, 2023 at 03:52:36PM -0400, Waiman Long wrote:
> There is still a distribution hierarchy, as the list of isolated CPUs has
> to be distributed down to the target cgroup through the hierarchy. For
> example,
> 
> cgroup root
>   +- isolcpus  (cpus 8,9; isolcpus)
>   +- user.slice (cpus 1-9; ecpus 1-7; member)
>      +- user-x.slice (cpus 8,9; ecpus 8,9; isolated)
>      +- user-y.slice (cpus 1,2; ecpus 1,2; member)
> 
> OTOH, I do agree that this can be somewhat hacky. That is why I posted it
> as an RFC to solicit feedback.

Wouldn't it be possible to make it hierarchical by adding another cpumask to
cpuset which lists the cpus which are allowed in the hierarchy but not used
unless claimed by an isolated domain?

Thanks.

-- 
tejun


* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition
  2023-04-12 20:22     ` Tejun Heo
@ 2023-04-12 20:33       ` Waiman Long
  2023-04-13  0:03         ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2023-04-12 20:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker

On 4/12/23 16:22, Tejun Heo wrote:
> Hello, Waiman.
>
> On Wed, Apr 12, 2023 at 03:52:36PM -0400, Waiman Long wrote:
>> There is still a distribution hierarchy, as the list of isolated CPUs has
>> to be distributed down to the target cgroup through the hierarchy. For
>> example,
>>
>> cgroup root
>>    +- isolcpus  (cpus 8,9; isolcpus)
>>    +- user.slice (cpus 1-9; ecpus 1-7; member)
>>       +- user-x.slice (cpus 8,9; ecpus 8,9; isolated)
>>       +- user-y.slice (cpus 1,2; ecpus 1,2; member)
>>
>> OTOH, I do agree that this can be somewhat hacky. That is why I posted it
>> as an RFC to solicit feedback.
> Wouldn't it be possible to make it hierarchical by adding another cpumask to
> cpuset which lists the cpus which are allowed in the hierarchy but not used
> unless claimed by an isolated domain?

I think we can. You mean having a new "cpuset.cpus.isolated" cgroupfs 
file. So there will be one in the root cgroup that defines all the 
isolated CPUs one can have. It is then distributed down the hierarchy 
and can be claimed only if a cgroup becomes an "isolated" partition. 
There will be a slight change in the semantics of an "isolated" 
partition, but I doubt there will be many users out there.

If you are OK with this approach, I can modify my patch series to do that.

Cheers,
Longman



* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition
  2023-04-12 20:33       ` Waiman Long
@ 2023-04-13  0:03         ` Tejun Heo
  2023-04-13  0:26             ` Waiman Long
  0 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2023-04-13  0:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker

Hello,

On Wed, Apr 12, 2023 at 04:33:29PM -0400, Waiman Long wrote:
> I think we can. You mean having a new "cpuset.cpus.isolated" cgroupfs file.
> So there will be one in the root cgroup that defines all the isolated CPUs
> one can have. It is then distributed down the hierarchy and can be claimed
> only if a cgroup becomes an "isolated" partition. There will be a slight

Yeah, that seems a lot more congruent with the typical pattern.

> change in the semantics of an "isolated" partition, but I doubt there will
> be many users out there.

I haven't thought through it too hard but what prevents staying compatible
with the current behavior?

> If you are OK with this approach, I can modify my patch series to do that.

Thanks.

-- 
tejun


* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition
  2023-04-13  0:03         ` Tejun Heo
@ 2023-04-13  0:26             ` Waiman Long
  0 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2023-04-13  0:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker


On 4/12/23 20:03, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 12, 2023 at 04:33:29PM -0400, Waiman Long wrote:
>> I think we can. You mean having a new "cpuset.cpus.isolated" cgroupfs file.
>> So there will be one in the root cgroup that defines all the isolated CPUs
>> one can have. It is then distributed down the hierarchy and can be claimed
>> only if a cgroup becomes an "isolated" partition. There will be a slight
> Yeah, that seems a lot more congruent with the typical pattern.
>
>> change in the semantics of an "isolated" partition, but I doubt there will
>> be many users out there.
> I haven't thought through it too hard but what prevents staying compatible
> with the current behavior?

It is possible to stay compatible with existing behavior. It is just
that a break from existing behavior would make the solution cleaner.

So the new behavior will be:

   If the "cpuset.cpus.isolated" file isn't set, the existing rules apply.
If it is set, the new rules will be used.

Does that look reasonable to you?

Cheers,
Longman




* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition
@ 2023-04-13  0:33               ` Tejun Heo
  0 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2023-04-13  0:33 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker

Hello,

On Wed, Apr 12, 2023 at 08:26:03PM -0400, Waiman Long wrote:
>   If the "cpuset.cpus.isolated" isn't set, the existing rules apply. If it
> is set, the new rules will be used.
> 
> Does that look reasonable to you?

Sounds a bit contrived. Does it need to be something defined in the root
cgroup? The only thing that's needed is that a cgroup needs to claim CPUs
exclusively without using them, right? Let's say we add a new interface
file, say, cpuset.cpus.reserve which is always exclusive and can be consumed
by children whichever way they want, wouldn't that be sufficient? Then,
there would be nothing to describe in the root cgroup.
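Under that suggestion, usage might look something like the following.
This is purely illustrative: "cpuset.cpus.reserve" is only a name floated
in this thread, not an implemented interface, and the cgroup paths are
invented:

```shell
# Parent claims CPUs 8-9 exclusively for its subtree without using them.
echo "8-9" > /sys/fs/cgroup/user.slice/cpuset.cpus.reserve

# A child is then free to consume the reserved CPUs, e.g. as an
# isolated partition.
mkdir /sys/fs/cgroup/user.slice/rt-app
echo "8-9" > /sys/fs/cgroup/user.slice/rt-app/cpuset.cpus
echo isolated > /sys/fs/cgroup/user.slice/rt-app/cpuset.cpus.partition
```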

Thanks.

-- 
tejun



* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition
  2023-04-13  0:33               ` Tejun Heo
@ 2023-04-13  0:55                 ` Waiman Long
  -1 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2023-04-13  0:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker


On 4/12/23 20:33, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 12, 2023 at 08:26:03PM -0400, Waiman Long wrote:
>>    If the "cpuset.cpus.isolated" isn't set, the existing rules apply. If it
>> is set, the new rules will be used.
>>
>> Does that look reasonable to you?
> Sounds a bit contrived. Does it need to be something defined in the root
> cgroup?

Yes, because we need to take away the isolated CPUs from the effective 
cpus of the root cgroup. So it needs to start from the root. That is 
also why we have the partition rule that the parent of a partition has 
to be a partition root itself. With the new scheme, we don't need a 
special cgroup to hold the isolated CPUs. The new root cgroup file will 
be enough to inform the system what CPUs will have to be isolated.

My current thinking is that the root's "cpuset.cpus.isolated" will start
with whatever has been set by the "isolcpus" or "nohz_full" boot command
line and can be extended from there but not shrunk below that, as there
can be additional isolation attributes tied to those isolated CPUs.

Cheers,
Longman

> The only thing that's needed is that a cgroup needs to claim CPUs
> exclusively without using them, right? Let's say we add a new interface
> file, say, cpuset.cpus.reserve which is always exclusive and can be consumed
> by children whichever way they want, wouldn't that be sufficient? Then,
> there would be nothing to describe in the root cgroup.
>
> Thanks.
>




* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition
@ 2023-04-13  1:17                   ` Tejun Heo
  0 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2023-04-13  1:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker

Hello, Waiman.

On Wed, Apr 12, 2023 at 08:55:55PM -0400, Waiman Long wrote:
> > Sounds a bit contrived. Does it need to be something defined in the root
> > cgroup?
> 
> Yes, because we need to take away the isolated CPUs from the effective cpus
> of the root cgroup. So it needs to start from the root. That is also why we
> have the partition rule that the parent of a partition has to be a partition
> root itself. With the new scheme, we don't need a special cgroup to hold the

I'm following. The root is already a partition root and the cgroupfs control
knobs are owned by the parent, so the root cgroup would own the first level
cgroups' cpuset.cpus.reserve knobs. If the root cgroup wants to assign some
CPUs exclusively to a first level cgroup, it can then set that cgroup's
reserve knob accordingly (or maybe the better name is
cpuset.cpus.exclusive), which will take those CPUs out of the root cgroup's
partition and give them to the first level cgroup. The first level cgroup
then is free to do whatever with those CPUs that now belong exclusively to
the cgroup subtree.

> isolated CPUs. The new root cgroup file will be enough to inform the system
> what CPUs will have to be isolated.
> 
> My current thinking is that the root's "cpuset.cpus.isolated" will start
> with whatever has been set by the "isolcpus" or "nohz_full" boot command
> line and can be extended from there but not shrunk below that, as there
> can be additional isolation attributes tied to those isolated CPUs.

I'm not sure we wanna tie with those automatically. I think it'd be
more confusing than helpful.

Thanks.

-- 
tejun



* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition
  2023-04-13  1:17                   ` Tejun Heo
@ 2023-04-13  1:55                   ` Waiman Long
  2023-04-14  1:22                       ` Waiman Long
  -1 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2023-04-13  1:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker

On 4/12/23 21:17, Tejun Heo wrote:
> Hello, Waiman.
>
> On Wed, Apr 12, 2023 at 08:55:55PM -0400, Waiman Long wrote:
>>> Sounds a bit contrived. Does it need to be something defined in the root
>>> cgroup?
>> Yes, because we need to take away the isolated CPUs from the effective cpus
>> of the root cgroup. So it needs to start from the root. That is also why we
>> have the partition rule that the parent of a partition has to be a partition
>> root itself. With the new scheme, we don't need a special cgroup to hold the
> I'm following. The root is already a partition root and the cgroupfs control
> knobs are owned by the parent, so the root cgroup would own the first level
> cgroups' cpuset.cpus.reserve knobs. If the root cgroup wants to assign some
> CPUs exclusively to a first level cgroup, it can then set that cgroup's
> reserve knob accordingly (or maybe the better name is
> cpuset.cpus.exclusive), which will take those CPUs out of the root cgroup's
> partition and give them to the first level cgroup. The first level cgroup
> then is free to do whatever with those CPUs that now belong exclusively to
> the cgroup subtree.

I am OK with the cpuset.cpus.reserve name, but not so much with the
cpuset.cpus.exclusive name, as it can get confused with cgroup v1's
cpuset.cpu_exclusive. Of course, I prefer the cpuset.cpus.isolated name
a bit more. Once an isolated CPU gets used in an isolated partition, it
is exclusive and can't be used in another isolated partition.

Since we will allow users to set cpuset.cpus.reserve to whatever value
they want, the distribution of isolated CPUs is only valid if the CPUs
are present in its parent's cpuset.cpus.reserve and all the way up to
the root. That check is a bit expensive, but it should be a relatively
rare operation.
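The check described above amounts to a subset test against each
ancestor's reserve mask, walking from the requesting cgroup up to the
root. A minimal user-space sketch of that test follows; the helper names
are invented for illustration, and the kernel would of course operate on
cpumasks rather than strings:

```shell
# expand_cpulist "1-3,5" -> "1 2 3 5"
expand_cpulist() {
    local part i out="" parts
    IFS=',' read -ra parts <<< "$1"
    for part in "${parts[@]}"; do
        case "$part" in
        *-*) for i in $(seq "${part%-*}" "${part#*-}"); do out="$out $i"; done ;;
        *)   out="$out $part" ;;
        esac
    done
    echo $out
}

# cpus_subset NEEDED AVAILABLE: succeed only if every CPU in NEEDED also
# appears in AVAILABLE.
cpus_subset() {
    local n a found
    for n in $(expand_cpulist "$1"); do
        found=0
        for a in $(expand_cpulist "$2"); do
            if [ "$n" = "$a" ]; then found=1; fi
        done
        if [ "$found" != 1 ]; then return 1; fi
    done
    return 0
}

# The rule sketched above would apply cpus_subset against every
# ancestor's reserve mask, e.g.:
#   cpus_subset "8-9" "$(cat "$ancestor/cpuset.cpus.reserve")"
```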

>
>> isolated CPUs. The new root cgroup file will be enough to inform the system
>> what CPUs will have to be isolated.
>>
>> My current thinking is that the root's "cpuset.cpus.isolated" will start
>> with whatever has been set by the "isolcpus" or "nohz_full" boot command
>> line and can be extended from there but not shrunk below that, as there
>> can be additional isolation attributes tied to those isolated CPUs.
> I'm not sure we wanna tie with those automatically. I think it'd be
> more confusing than helpful.

Yes, I am fine with taking this off for now.

Cheers,
Longman



* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition
@ 2023-04-14  1:22                       ` Waiman Long
  0 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2023-04-14  1:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker


On 4/12/23 21:55, Waiman Long wrote:
> On 4/12/23 21:17, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> On Wed, Apr 12, 2023 at 08:55:55PM -0400, Waiman Long wrote:
>>>> Sounds a bit contrived. Does it need to be something defined in the 
>>>> root
>>>> cgroup?
>>> Yes, because we need to take away the isolated CPUs from the 
>>> effective cpus
>>> of the root cgroup. So it needs to start from the root. That is also 
>>> why we
>>> have the partition rule that the parent of a partition has to be a 
>>> partition
>>> root itself. With the new scheme, we don't need a special cgroup to 
>>> hold the
>> I'm following. The root is already a partition root and the cgroupfs 
>> control
>> knobs are owned by the parent, so the root cgroup would own the first 
>> level
>> cgroups' cpuset.cpus.reserve knobs. If the root cgroup wants to 
>> assign some
>> CPUs exclusively to a first level cgroup, it can then set that cgroup's
>> reserve knob accordingly (or maybe the better name is
>> cpuset.cpus.exclusive), which will take those CPUs out of the root 
>> cgroup's
>> partition and give them to the first level cgroup. The first level 
>> cgroup
>> then is free to do whatever with those CPUs that now belong 
>> exclusively to
>> the cgroup subtree.
>
> I am OK with the cpuset.cpus.reserve name, but not that much with the 
> cpuset.cpus.exclusive name as it can get confused with cgroup v1's 
> cpuset.cpu_exclusive. Of course, I prefer the cpuset.cpus.isolated 
> name a bit more. Once an isolated CPU gets used in an isolated 
> partition, it is exclusive and it can't be used in another isolated 
> partition.
>
> Since we will allow users to set cpuset.cpus.reserve to whatever value 
> they want, the distribution of isolated CPUs is only valid if the CPUs 
> are present in its parent's cpuset.cpus.reserve and all the way up to 
> the root. It is a bit expensive, but it should be a relatively rare 
> operation.

I now have a slightly different idea of how to do that. We already have 
an internal cpumask for partitioning - subparts_cpus. I am thinking 
about exposing it as cpuset.cpus.reserve. The current way of creating 
subpartitions will be called automatic reservation and will require a 
direct parent/child partition relationship. But as soon as a user writes 
anything to it, automatic reservation will be broken and manual 
reservation will be required going forward.

In that way, we can keep the old behavior, but also support new use 
cases. I am going to work on that.
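
The ancestor-walk check described above -- reserve CPUs are only usable
if they are present in the parent's "cpuset.cpus.reserve" and all the
way up to the root, and not yet allocated to another partition -- can be
sketched roughly as follows. The class and field names here are
illustrative only, not the kernel's actual data structures:

```python
# Hypothetical model of the reserve-CPU validity walk discussed in this
# thread; not the kernel implementation.

class Cgroup:
    def __init__(self, name, parent=None, reserve=()):
        self.name = name
        self.parent = parent
        self.reserve = set(reserve)    # cpuset.cpus.reserve
        self.allocated = set()         # CPUs already claimed by a partition

def valid_reserve_cpus(cgroup):
    """CPUs that a partition under `cgroup` may claim: present in every
    ancestor's reserve up to the root and not allocated elsewhere."""
    valid = cgroup.reserve - cgroup.allocated
    node = cgroup.parent
    while node is not None:
        valid &= node.reserve - node.allocated
        node = node.parent
    return valid

root = Cgroup("/", reserve={2, 3, 4, 5})
mid = Cgroup("mid", parent=root, reserve={2, 3, 4})
mid.allocated = {4}                    # CPU 4 already claimed by a sibling
print(valid_reserve_cpus(mid))         # {2, 3}
```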

Cheers,
Longman



* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
@ 2023-04-14 16:54                         ` Tejun Heo
  0 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2023-04-14 16:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker

On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
> I now have a slightly different idea of how to do that. We already have an
> internal cpumask for partitioning - subparts_cpus. I am thinking about
> exposing it as cpuset.cpus.reserve. The current way of creating
> subpartitions will be called automatic reservation and require a direct
> parent/child partition relationship. But as soon as a user write anything to
> it, it will break automatic reservation and require manual reservation going
> forward.
> 
> In that way, we can keep the old behavior, but also support new use cases. I
> am going to work on that.

I'm not sure I fully understand the proposed behavior but it does sound more
quirky.

Thanks.

-- 
tejun


* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-04-14 16:54                         ` Tejun Heo
  (?)
@ 2023-04-14 17:29                         ` Waiman Long
  2023-04-14 17:34                           ` Tejun Heo
  -1 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2023-04-14 17:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker


On 4/14/23 12:54, Tejun Heo wrote:
> On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
>> I now have a slightly different idea of how to do that. We already have an
>> internal cpumask for partitioning - subparts_cpus. I am thinking about
>> exposing it as cpuset.cpus.reserve. The current way of creating
>> subpartitions will be called automatic reservation and require a direct
>> parent/child partition relationship. But as soon as a user write anything to
>> it, it will break automatic reservation and require manual reservation going
>> forward.
>>
>> In that way, we can keep the old behavior, but also support new use cases. I
>> am going to work on that.
> I'm not sure I fully understand the proposed behavior but it does sound more
> quirky.

The idea is to use the existing subparts_cpus for CPU reservation 
instead of adding a new cpumask for that purpose. The current way of 
partition creation does CPU reservation (setting subparts_cpus) 
automatically, with the constraint that the parent of a partition must 
be a partition root itself. One way to relax this constraint is to 
allow a new manual reservation mode where users can set reserve CPUs 
manually and distribute them down the hierarchy before activating a 
partition to use those CPUs.

Now the question is how to enable this new manual reservation mode. One 
way to do it is to enable it whenever the new cpuset.cpus.reserve file 
is modified. Alternatively, we may enable it by a cgroupfs mount option 
or a boot command line option.

Hope this can clarify your confusion.

Cheers,
Longman



* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-04-14 17:29                         ` Waiman Long
@ 2023-04-14 17:34                           ` Tejun Heo
  2023-04-14 17:38                             ` Waiman Long
  0 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2023-04-14 17:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker

On Fri, Apr 14, 2023 at 01:29:25PM -0400, Waiman Long wrote:
> 
> On 4/14/23 12:54, Tejun Heo wrote:
> > On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
> > > I now have a slightly different idea of how to do that. We already have an
> > > internal cpumask for partitioning - subparts_cpus. I am thinking about
> > > exposing it as cpuset.cpus.reserve. The current way of creating
> > > subpartitions will be called automatic reservation and require a direct
> > > parent/child partition relationship. But as soon as a user write anything to
> > > it, it will break automatic reservation and require manual reservation going
> > > forward.
> > > 
> > > In that way, we can keep the old behavior, but also support new use cases. I
> > > am going to work on that.
> > I'm not sure I fully understand the proposed behavior but it does sound more
> > quirky.
> 
> The idea is to use the existing subparts_cpus for cpu reservation instead of
> adding a new cpumask for that purpose. The current way of partition creation
> does cpus reservation (setting subparts_cpus) automatically with the
> constraint that the parent of a partition must be a partition root itself.
> One way to relax this constraint is to allow a new manual reservation mode
> where users can set reserve cpus manually and distribute them down the
> hierarchy before activating a partition to use those cpus.
> 
> Now the question is how to enable this new manual reservation mode. One way
> to do it is to enable it whenever the new cpuset.cpus.reserve file is
> modified. Alternatively, we may enable it by a cgroupfs mount option or a
> boot command line option.

It'd probably be best if we can keep the behavior within cgroupfs if
possible. Would you mind writing up the documentation section describing the
behavior beforehand? I think things would be clearer if we look at it from
the interface documentation side.

Thanks.

-- 
tejun


* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-04-14 17:34                           ` Tejun Heo
@ 2023-04-14 17:38                             ` Waiman Long
  2023-04-14 19:06                               ` Waiman Long
  0 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2023-04-14 17:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker

On 4/14/23 13:34, Tejun Heo wrote:
> On Fri, Apr 14, 2023 at 01:29:25PM -0400, Waiman Long wrote:
>> On 4/14/23 12:54, Tejun Heo wrote:
>>> On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
>>>> I now have a slightly different idea of how to do that. We already have an
>>>> internal cpumask for partitioning - subparts_cpus. I am thinking about
>>>> exposing it as cpuset.cpus.reserve. The current way of creating
>>>> subpartitions will be called automatic reservation and require a direct
>>>> parent/child partition relationship. But as soon as a user write anything to
>>>> it, it will break automatic reservation and require manual reservation going
>>>> forward.
>>>>
>>>> In that way, we can keep the old behavior, but also support new use cases. I
>>>> am going to work on that.
>>> I'm not sure I fully understand the proposed behavior but it does sound more
>>> quirky.
>> The idea is to use the existing subparts_cpus for cpu reservation instead of
>> adding a new cpumask for that purpose. The current way of partition creation
>> does cpus reservation (setting subparts_cpus) automatically with the
>> constraint that the parent of a partition must be a partition root itself.
>> One way to relax this constraint is to allow a new manual reservation mode
>> where users can set reserve cpus manually and distribute them down the
>> hierarchy before activating a partition to use those cpus.
>>
>> Now the question is how to enable this new manual reservation mode. One way
>> to do it is to enable it whenever the new cpuset.cpus.reserve file is
>> modified. Alternatively, we may enable it by a cgroupfs mount option or a
>> boot command line option.
> It'd probably be best if we can keep the behavior within cgroupfs if
> possible. Would you mind writing up the documentation section describing the
> behavior beforehand? I think things would be clearer if we look at it from
> the interface documentation side.

Sure, will do that. I need some time and so it will be early next week.

Cheers,
Longman



* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-04-14 17:38                             ` Waiman Long
@ 2023-04-14 19:06                               ` Waiman Long
  2023-05-02 18:01                                 ` Michal Koutný
  0 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2023-04-14 19:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker

On 4/14/23 13:38, Waiman Long wrote:
> On 4/14/23 13:34, Tejun Heo wrote:
>> On Fri, Apr 14, 2023 at 01:29:25PM -0400, Waiman Long wrote:
>>> On 4/14/23 12:54, Tejun Heo wrote:
>>>> On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
>>>>> I now have a slightly different idea of how to do that. We already 
>>>>> have an
>>>>> internal cpumask for partitioning - subparts_cpus. I am thinking 
>>>>> about
>>>>> exposing it as cpuset.cpus.reserve. The current way of creating
>>>>> subpartitions will be called automatic reservation and require a 
>>>>> direct
>>>>> parent/child partition relationship. But as soon as a user write 
>>>>> anything to
>>>>> it, it will break automatic reservation and require manual 
>>>>> reservation going
>>>>> forward.
>>>>>
>>>>> In that way, we can keep the old behavior, but also support new 
>>>>> use cases. I
>>>>> am going to work on that.
>>>> I'm not sure I fully understand the proposed behavior but it does 
>>>> sound more
>>>> quirky.
>>> The idea is to use the existing subparts_cpus for cpu reservation 
>>> instead of
>>> adding a new cpumask for that purpose. The current way of partition 
>>> creation
>>> does cpus reservation (setting subparts_cpus) automatically with the
>>> constraint that the parent of a partition must be a partition root 
>>> itself.
>>> One way to relax this constraint is to allow a new manual 
>>> reservation mode
>>> where users can set reserve cpus manually and distribute them down the
>>> hierarchy before activating a partition to use those cpus.
>>>
>>> Now the question is how to enable this new manual reservation mode. 
>>> One way
>>> to do it is to enable it whenever the new cpuset.cpus.reserve file is
>>> modified. Alternatively, we may enable it by a cgroupfs mount option 
>>> or a
>>> boot command line option.
>> It'd probably be best if we can keep the behavior within cgroupfs if
>> possible. Would you mind writing up the documentation section 
>> describing the
>> behavior beforehand? I think things would be clearer if we look at it 
>> from
>> the interface documentation side.
>
> Sure, will do that. I need some time and so it will be early next week.

Just kidding :-)

Below is a draft of the new cpuset.cpus.reserve cgroupfs file:

   cpuset.cpus.reserve
         A read-write multiple values file which exists on all
         cpuset-enabled cgroups.

         It lists the reserved CPUs to be used for the creation of
         child partitions.  See the section on "cpuset.cpus.partition"
         below for more information on cpuset partition.  These reserved
         CPUs should be a subset of "cpuset.cpus" and will be mutually
         exclusive of "cpuset.cpus.effective" when used since these
         reserved CPUs cannot be used by tasks in the current cgroup.

         There are two modes for partition CPUs reservation -
         auto or manual.  The system starts up in auto mode where
         "cpuset.cpus.reserve" will be set automatically when valid
         child partitions are created and users don't need to touch the
         file at all.  This mode has the limitation that the parent of a
         partition must be a partition root itself.  So child partitions
         have to be created one-by-one from the cgroup root down.

         To enable the creation of a partition lower down in the
         hierarchy without requiring the intermediate cgroups to be
         partition roots, one has to turn on the manual reservation
         mode by writing directly to "cpuset.cpus.reserve" with a value
         different from its current value.  By distributing the reserve
         CPUs down the cgroup hierarchy to the parent of the target
         cgroup, this target cgroup can be switched to become a
         partition root if its "cpuset.cpus" is a subset of the set of
         valid reserve CPUs in its parent.  The set of valid reserve
         CPUs is the set that are present in all its ancestors'
         "cpuset.cpus.reserve" up to the cgroup root and that have not
         been allocated to another valid partition yet.

         Once manual reservation mode is enabled, a cgroup administrator
         must always set up the "cpuset.cpus.reserve" files properly
         before a valid partition can be created.  So this mode has more
         administrative overhead but offers greater flexibility.
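
As a rough illustration of the mode switch drafted above -- the system
starts in automatic mode, and any direct write to the reserve file ends
automatic reservation for good -- consider this toy model. The class
and method names are hypothetical, not the kernel implementation:

```python
# Toy model of the auto/manual reservation modes for the proposed
# cpuset.cpus.reserve file; illustrative only.

class CpusetReserve:
    def __init__(self):
        self.mode = "auto"
        self.reserve = set()

    def auto_reserve(self, cpus):
        """Implicit reservation done when a valid child partition is
        created; only honored while still in automatic mode."""
        if self.mode == "auto":
            self.reserve |= set(cpus)

    def write(self, cpus):
        """A user writing to cpuset.cpus.reserve breaks automatic mode."""
        self.mode = "manual"
        self.reserve = set(cpus)

r = CpusetReserve()
r.auto_reserve({1, 2})     # creating a child partition reserves CPUs
r.write({1, 2, 3})         # explicit write -> manual mode from now on
r.auto_reserve({4})        # ignored: automatic reservation has ended
print(r.mode, sorted(r.reserve))   # manual [1, 2, 3]
```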

Cheers,
Longman



* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-04-14 19:06                               ` Waiman Long
@ 2023-05-02 18:01                                 ` Michal Koutný
  2023-05-02 21:26                                   ` Waiman Long
  0 siblings, 1 reply; 45+ messages in thread
From: Michal Koutný @ 2023-05-02 18:01 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest,
	Juri Lelli, Valentin Schneider, Frederic Weisbecker

Hello.

The previous thread arrived incomplete to me, so I respond to the last
message only. Point me to a message URL if it was covered.

On Fri, Apr 14, 2023 at 03:06:27PM -0400, Waiman Long <longman@redhat.com> wrote:
> Below is a draft of the new cpuset.cpus.reserve cgroupfs file:
> 
>   cpuset.cpus.reserve
>         A read-write multiple values file which exists on all
>         cpuset-enabled cgroups.
> 
>         It lists the reserved CPUs to be used for the creation of
>         child partitions.  See the section on "cpuset.cpus.partition"
>         below for more information on cpuset partition.  These reserved
>         CPUs should be a subset of "cpuset.cpus" and will be mutually
>         exclusive of "cpuset.cpus.effective" when used since these
>         reserved CPUs cannot be used by tasks in the current cgroup.
> 
>         There are two modes for partition CPUs reservation -
>         auto or manual.  The system starts up in auto mode where
>         "cpuset.cpus.reserve" will be set automatically when valid
>         child partitions are created and users don't need to touch the
>         file at all.  This mode has the limitation that the parent of a
>         partition must be a partition root itself.  So child partition
>         has to be created one-by-one from the cgroup root down.
> 
>         To enable the creation of a partition down in the hierarchy
>         without the intermediate cgroups to be partition roots,

Why would this be needed? Owning a CPU (a resource) must logically be
passed all the way from root to the target cgroup, i.e. this is
expressed by valid partitioning down to the given level.

>         one
>         has to turn on the manual reservation mode by writing directly
>         to "cpuset.cpus.reserve" with a value different from its
>         current value.  By distributing the reserve CPUs down the cgroup
>         hierarchy to the parent of the target cgroup, this target cgroup
>         can be switched to become a partition root if its "cpuset.cpus"
>         is a subset of the set of valid reserve CPUs in its parent.

level n
`- level n+1
   cpuset.cpus	// these are actually configured by the "owner" of level n
   cpuset.cpus.partition // similarly here, level n decides if the child is a partition

I.e. what would be level n/cpuset.cpus.reserve good for when it can
directly control level n+1/cpuset.cpus?

Thanks,
Michal


* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-05-02 18:01                                 ` Michal Koutný
@ 2023-05-02 21:26                                   ` Waiman Long
  2023-05-02 22:27                                       ` Michal Koutný
  0 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2023-05-02 21:26 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest,
	Juri Lelli, Valentin Schneider, Frederic Weisbecker

On 5/2/23 14:01, Michal Koutný wrote:
> Hello.
>
> The previous thread arrived incomplete to me, so I respond to the last
> message only. Point me to a message URL if it was covered.
>
> On Fri, Apr 14, 2023 at 03:06:27PM -0400, Waiman Long <longman@redhat.com> wrote:
>> Below is a draft of the new cpuset.cpus.reserve cgroupfs file:
>>
>>    cpuset.cpus.reserve
>>          A read-write multiple values file which exists on all
>>          cpuset-enabled cgroups.
>>
>>          It lists the reserved CPUs to be used for the creation of
>>          child partitions.  See the section on "cpuset.cpus.partition"
>>          below for more information on cpuset partition.  These reserved
>>          CPUs should be a subset of "cpuset.cpus" and will be mutually
>>          exclusive of "cpuset.cpus.effective" when used since these
>>          reserved CPUs cannot be used by tasks in the current cgroup.
>>
>>          There are two modes for partition CPUs reservation -
>>          auto or manual.  The system starts up in auto mode where
>>          "cpuset.cpus.reserve" will be set automatically when valid
>>          child partitions are created and users don't need to touch the
>>          file at all.  This mode has the limitation that the parent of a
>>          partition must be a partition root itself.  So child partition
>>          has to be created one-by-one from the cgroup root down.
>>
>>          To enable the creation of a partition down in the hierarchy
>>          without the intermediate cgroups to be partition roots,
> Why would be this needed? Owning a CPU (a resource) must logically be
> passed all the way from root to the target cgroup, i.e. this is
> expressed by valid partitioning down to given level.
>
>>          one
>>          has to turn on the manual reservation mode by writing directly
>>          to "cpuset.cpus.reserve" with a value different from its
>>          current value.  By distributing the reserve CPUs down the cgroup
>>          hierarchy to the parent of the target cgroup, this target cgroup
>>          can be switched to become a partition root if its "cpuset.cpus"
>>          is a subset of the set of valid reserve CPUs in its parent.
> level n
> `- level n+1
>     cpuset.cpus	// these are actually configured by "owner" of level n
>     cpuset.cpus.partition // similrly here, level n decides if child is a partition
>
> I.e. what would be level n/cpuset.cpus.reserve good for when it can
> directly control level n+1/cpuset.cpus?

In the new scheme, the available cpus are still directly passed down to 
a descendant cgroup. However, isolated CPUs (or more generally CPUs 
dedicated to a partition) have to be exclusive. So what the 
cpuset.cpus.reserve does is to identify those exclusive CPUs that can be 
excluded from the effective_cpus of the parent cgroups before they are 
claimed by a child partition. Currently this is done automatically when 
a child partition is created off a parent partition root. The new scheme 
will break it into 2 separate steps without the requirement that the 
parent of a partition has to be a partition root itself.
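
The exclusion described above -- reserved CPUs are removed from the
parents' effective_cpus before a child partition claims them -- amounts
to a simple set difference; the names here are illustrative, not the
kernel's:

```python
# Sketch of how reservation shrinks a cgroup's effective CPU set;
# hypothetical helper, not the kernel implementation.

def effective_cpus(cpus, reserve):
    """Tasks in the cgroup can only run on CPUs not reserved for (and
    eventually claimed by) child partitions."""
    return set(cpus) - set(reserve)

parent_cpus = {0, 1, 2, 3, 4, 5}
reserved = {4, 5}                  # cpuset.cpus.reserve
print(sorted(effective_cpus(parent_cpus, reserved)))   # [0, 1, 2, 3]
```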

Cheers,
Longman




* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-05-02 21:26                                   ` Waiman Long
@ 2023-05-02 22:27                                       ` Michal Koutný
  0 siblings, 0 replies; 45+ messages in thread
From: Michal Koutný @ 2023-05-02 22:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest,
	Juri Lelli, Valentin Schneider, Frederic Weisbecker

On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long <longman@redhat.com> wrote:
> In the new scheme, the available cpus are still directly passed down to a
> descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated
> to a partition) have to be exclusive. So what the cpuset.cpus.reserve does
> is to identify those exclusive CPUs that can be excluded from the
> effective_cpus of the parent cgroups before they are claimed by a child
> partition. Currently this is done automatically when a child partition is
> created off a parent partition root. The new scheme will break it into 2
> separate steps without the requirement that the parent of a partition has to
> be a partition root itself.

new scheme
  1st step:
  echo C >p/cpuset.cpus.reserve
  # p/cpuset.cpus.effective == A-C (1)
  2nd step (claim):
  echo C' >p/c/cpuset.cpus # C'⊆C
  echo root >p/c/cpuset.cpus.partition

current scheme
  1st step (configure):
  echo C >p/c/cpuset.cpus
  2nd step (reserve & claim):
  echo root >p/c/cpuset.cpus.partition
  # p/cpuset.cpus.effective == A-C (2)

As long as p/c is unpopulated, (1) and (2) are equal situations.
Why is the (different) two step procedure needed?

Also the relaxation of the requirement that the parent be a partition
confuses me -- if the parent is not a partition, i.e. it has no
exclusive ownership of CPUs but can still "give" them to children -- is
the child partition meant to be exclusive? (IOW can parent siblings
reserve some of the same CPUs?)

Thanks,
Michal


* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-05-02 22:27                                       ` Michal Koutný
  (?)
@ 2023-05-04  3:01                                       ` Waiman Long
  2023-05-05 16:03                                           ` Tejun Heo
  -1 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2023-05-04  3:01 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest,
	Juri Lelli, Valentin Schneider, Frederic Weisbecker


On 5/2/23 18:27, Michal Koutný wrote:
> On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long <longman@redhat.com> wrote:
>> In the new scheme, the available cpus are still directly passed down to a
>> descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated
>> to a partition) have to be exclusive. So what the cpuset.cpus.reserve does
>> is to identify those exclusive CPUs that can be excluded from the
>> effective_cpus of the parent cgroups before they are claimed by a child
>> partition. Currently this is done automatically when a child partition is
>> created off a parent partition root. The new scheme will break it into 2
>> separate steps without the requirement that the parent of a partition has to
>> be a partition root itself.
> new scheme
>    1st step:
>    echo C >p/cpuset.cpus.reserve
>    # p/cpuset.cpus.effective == A-C (1)
>    2nd step (claim):
>    echo C' >p/c/cpuset.cpus # C'⊆C
>    echo root >p/c/cpuset.cpus.partition

It is something like that. However, the current scheme of automatic 
reservation is also supported, i.e. cpuset.cpus.reserve will be set 
automatically when the child cgroup becomes a valid partition, as long 
as the cpuset.cpus.reserve file has not been written to. This is for 
backward compatibility.

Once it is written to, automatic mode will end and users have to 
manually set it afterward.


>
> current scheme
>    1st step (configure):
>    echo C >p/c/cpuset.cpus
>    2nd step (reserve & claim):
>    echo root >p/c/cpuset.cpus.partition
>    # p/cpuset.cpus.effective == A-C (2)
>
> As long as p/c is unpopulated, (1) and (2) are equal situations.
> Why is the (different) two step procedure needed?
>
> Also the relaxation of requirement of a parent being a partition
> confuses me -- if the parent is not a partition, i.e. it has no
> exclusive ownership of CPUs but it can still "give" it to children -- is
> child partition meant to be exclusive? (IOW can parent siblings reserve
> some same CPUs?)

A valid partition root has exclusive ownership of its CPUs. That is a 
rule that won't be changed. As a result, an incoming partition root 
cannot claim CPUs that have been allocated to another partition. To 
simplify things, a transition to a valid partition root is not possible if 
any of the CPUs in its cpuset.cpus are not in the cpuset.cpus.reserve of 
its ancestor or have been allocated to another partition. The partition 
root simply becomes invalid.

The parent can virtually give the reserved CPUs from the root down the 
hierarchy and a child can claim them once it becomes a partition root. 
In manual mode, we need to check all the way up the hierarchy to the 
root to figure out what CPUs in cpuset.cpus.reserve are valid. It has 
higher overhead, but enabling a partition is not a fast operation anyway.

Cheers,
Longman


* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
@ 2023-05-05 16:03                                           ` Tejun Heo
  0 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2023-05-05 16:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker

On Wed, May 03, 2023 at 11:01:36PM -0400, Waiman Long wrote:
> 
> On 5/2/23 18:27, Michal Koutný wrote:
> > On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long <longman@redhat.com> wrote:
> > > In the new scheme, the available cpus are still directly passed down to a
> > > descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated
> > > to a partition) have to be exclusive. So what the cpuset.cpus.reserve does
> > > is to identify those exclusive CPUs that can be excluded from the
> > > effective_cpus of the parent cgroups before they are claimed by a child
> > > partition. Currently this is done automatically when a child partition is
> > > created off a parent partition root. The new scheme will break it into 2
> > > separate steps without the requirement that the parent of a partition has to
> > > be a partition root itself.
> > new scheme
> >    1st step:
> >    echo C >p/cpuset.cpus.reserve
> >    # p/cpuset.cpus.effective == A-C (1)
> >    2nd step (claim):
> >    echo C' >p/c/cpuset.cpus # C'⊆C
> >    echo root >p/c/cpuset.cpus.partition
> 
> It is something like that. However, the current scheme of automatic
> reservation is also supported, i.e. cpuset.cpus.reserve will be set
> automatically when the child cgroup becomes a valid partition as long as the
> cpuset.cpus.reserve file is not written to. This is for backward
> compatibility.
> 
> Once it is written to, automatic mode will end and users have to manually
> set it afterward.

I really don't like the implicit switching behavior. This is interface
behavior modifying internal state that userspace can't view or control
directly. Regardless of how the rest of the discussion develops, this part
should be improved (e.g. would it work to always try to auto-reserve if the
cpu isn't already reserved?).

Thanks.

-- 
tejun

* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
@ 2023-05-05 16:25                                             ` Waiman Long
  0 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2023-05-05 16:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker


On 5/5/23 12:03, Tejun Heo wrote:
> On Wed, May 03, 2023 at 11:01:36PM -0400, Waiman Long wrote:
>> On 5/2/23 18:27, Michal Koutný wrote:
>>> On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long <longman@redhat.com> wrote:
>>>> In the new scheme, the available cpus are still directly passed down to a
>>>> descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated
>>>> to a partition) have to be exclusive. So what the cpuset.cpus.reserve does
>>>> is to identify those exclusive CPUs that can be excluded from the
>>>> effective_cpus of the parent cgroups before they are claimed by a child
>>>> partition. Currently this is done automatically when a child partition is
>>>> created off a parent partition root. The new scheme will break it into 2
>>>> separate steps without the requirement that the parent of a partition has to
>>>> be a partition root itself.
>>> new scheme
>>>     1st step:
>>>     echo C >p/cpuset.cpus.reserve
>>>     # p/cpuset.cpus.effective == A-C (1)
>>>     2nd step (claim):
>>>     echo C' >p/c/cpuset.cpus # C'⊆C
>>>     echo root >p/c/cpuset.cpus.partition
>> It is something like that. However, the current scheme of automatic
>> reservation is also supported, i.e. cpuset.cpus.reserve will be set
>> automatically when the child cgroup becomes a valid partition as long as the
>> cpuset.cpus.reserve file is not written to. This is for backward
>> compatibility.
>>
>> Once it is written to, automatic mode will end and users have to manually
>> set it afterward.
> I really don't like the implicit switching behavior. This is interface
> behavior modifying internal state that userspace can't view or control
> directly. Regardless of how the rest of the discussion develops, this part
> should be improved (e.g. would it work to always try to auto-reserve if the
> cpu isn't already reserved?).

After some more thought yesterday, I have a slight change in my design: 
auto-reserve as it is now will stay for partitions that have a 
partition root parent. For a remote partition that doesn't have a 
partition root parent, its creation will require pre-allocating 
additional CPUs into top_cpuset's cpuset.cpus.reserve first. So there 
will be no change in behavior for existing use cases, whether or not a 
remote partition is created.

Cheers,
Longman


* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-05-05 16:25                                             ` Waiman Long
  (?)
@ 2023-05-08  1:03                                             ` Waiman Long
  2023-05-22 19:49                                               ` Tejun Heo
  -1 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2023-05-08  1:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker, Mrunal Patel,
	Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld

Hi,

The following is the proposed text for "cpuset.cpus.reserve" and 
"cpuset.cpus.partition" of the new cpuset partition in 
Documentation/admin-guide/cgroup-v2.rst.

   cpuset.cpus.reserve
     A read-write multiple values file which exists only on root
     cgroup.

     It lists all the CPUs that are reserved for adjacent and remote
     partitions created in the system.  See the next section for
     more information on what adjacent and remote partitions are.

     Creation of an adjacent partition does not require touching this
     control file as CPU reservation will be done automatically.
     In order to create a remote partition, the CPUs needed by the
     remote partition have to be written to this file first.

     A "+" prefix can be used to indicate a list of additional
     CPUs that are to be added without disturbing the CPUs that are
     originally there.  For example, if its current value is "3-4",
     echoing "+5" to it will change it to "3-5".

     Once a remote partition is destroyed, its CPUs have to be
     removed from this file, or no other process will be able to
     use them.  A "-" prefix can be used to remove a list of CPUs
     from it.  However, removing CPUs that are currently used in
     existing partitions may cause those partitions to become
     invalid.  A single "-" character without any number can be
     used to remove all the free CPUs not allocated to any
     partition, avoiding accidental partition invalidation.
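The intended "+"/"-" update semantics can be modeled in user space. The sketch below is only a toy: it handles comma-separated single CPU numbers, whereas the real file would also accept kernel cpulist ranges such as "3-5", and `update_reserve` is a hypothetical helper name:

```shell
# Toy user-space model of the proposed "+"/"-" update semantics for
# cpuset.cpus.reserve, over comma-separated single CPU numbers.
reserve=""

update_reserve() {
    case "$1" in
    +*) # add CPUs without disturbing the ones already there
        new=$(echo "$reserve,${1#+}" | tr ',' '\n' | grep -v '^$' | sort -nu)
        reserve=$(echo "$new" | paste -sd, -)
        ;;
    -*) # drop the listed CPUs from the reserve
        del=$(echo "${1#-}" | tr ',' '\n')
        new=$(echo "$reserve" | tr ',' '\n' | grep -vxF "$del")
        reserve=$(echo "$new" | paste -sd, -)
        ;;
    *)  # a plain write replaces the whole list
        reserve=$1
        ;;
    esac
}

update_reserve "3,4"   # reserve is now "3,4"
update_reserve "+5"    # reserve is now "3,4,5"
update_reserve "-4"    # reserve is now "3,5"
```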

   cpuset.cpus.partition
     A read-write single value file which exists on non-root
     cpuset-enabled cgroups.  This flag is owned by the parent cgroup
     and is not delegatable.

     It accepts only the following input values when written to.

       ==========    =====================================
       "member"      Non-root member of a partition
       "root"        Partition root
       "isolated"    Partition root without load balancing
       ==========    =====================================

     A cpuset partition is a collection of cgroups with a partition
     root at the top of the hierarchy and its descendants except
     those that are separate partition roots themselves and their
     descendants.  A partition has exclusive access to the set of
     CPUs allocated to it.  Other cgroups outside of that partition
     cannot use any CPUs in that set.

     There are two types of partitions - adjacent and remote.  The
     parent of an adjacent partition must be a valid partition root.
     Partition roots of adjacent partitions are all clustered around
     the root cgroup.  Creation of an adjacent partition is done by
     writing the desired partition type into "cpuset.cpus.partition".

     A remote partition does not require a partition root parent,
     so a remote partition can be formed far from the root cgroup.
     However, its creation is a 2-step process.  The CPUs needed
     by a remote partition ("cpuset.cpus" of the partition root)
     have to be written into "cpuset.cpus.reserve" of the root
     cgroup first.  After that, "isolated" can be written into
     "cpuset.cpus.partition" of the partition root to form a remote
     isolated partition, which is the only supported remote
     partition type for now.

     All remote partitions are terminal, as adjacent partitions
     cannot be created underneath them.
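Putting the two steps together, the creation sequence looks roughly as follows. The cgroup path `a/b/c` and the CPU range are hypothetical, and a mock tree stands in for the cgroup v2 mount so the sketch is runnable; on a real system the kernel validates each write and may reject it:

```shell
# Stand-in for the cgroup v2 mount (/sys/fs/cgroup on a real system).
cgroot=$(mktemp -d)
mkdir -p "$cgroot/a/b/c"
touch "$cgroot/cpuset.cpus.reserve" \
      "$cgroot/a/b/c/cpuset.cpus" "$cgroot/a/b/c/cpuset.cpus.partition"

# Step 1: reserve the CPUs in the root cgroup ("+" adds to the reserve
# without disturbing CPUs already reserved for other partitions).
echo "+8-11" > "$cgroot/cpuset.cpus.reserve"

# Step 2: give the same CPUs to the would-be partition root and switch
# it to "isolated", the only remote partition type supported so far.
echo "8-11" > "$cgroot/a/b/c/cpuset.cpus"
echo "isolated" > "$cgroot/a/b/c/cpuset.cpus.partition"
```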

     The root cgroup is always a partition root and its state cannot
     be changed.  All other non-root cgroups start out as "member".

     When set to "root", the current cgroup is the root of a new
     partition or scheduling domain.

     When set to "isolated", the CPUs in that partition will
     be in an isolated state without any load balancing from the
     scheduler.  Tasks placed in such a partition with multiple
     CPUs should be carefully distributed and bound to each of the
     individual CPUs for optimal performance.

     The value shown in "cpuset.cpus.effective" of a partition root is
     the CPUs that are dedicated to that partition and not available
     to cgroups outside of that partition.

     A partition root ("root" or "isolated") can be in one of the
     two possible states - valid or invalid.  An invalid partition
     root is in a degraded state where some state information may
     be retained, but behaves more like a "member".

     All possible state transitions among "member", "root" and
     "isolated" are allowed.

     On read, the "cpuset.cpus.partition" file can show the following
     values.

       =============================  =====================================
       "member"                       Non-root member of a partition
       "root"                         Partition root
       "isolated"                     Partition root without load balancing
       "root invalid (<reason>)"      Invalid partition root
       "isolated invalid (<reason>)"  Invalid isolated partition root
       =============================  =====================================

     In the case of an invalid partition root, a descriptive string on
     why the partition is invalid is included within parentheses.

     For an adjacent partition root to be valid, the following
     conditions must be met.

     1) The "cpuset.cpus" is exclusive with its siblings, i.e. they
        are not shared by any of its siblings (exclusivity rule).
     2) The parent cgroup is a valid partition root.
     3) The "cpuset.cpus" is not empty and must contain at least
        one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
     4) The "cpuset.cpus.effective" cannot be empty unless there is
        no task associated with this partition.

     For a remote partition root to be valid, the following conditions
     must be met.

     1) The same exclusivity rule as adjacent partition root.
     2) The "cpuset.cpus" is not empty and all the CPUs must be
        present in "cpuset.cpus.reserve" of the root cgroup and none
        of them are allocated to another partition.
     3) The "cpuset.cpus" value must be present in all its ancestors
        to ensure proper hierarchical cpu distribution.

     External events like hotplug or changes to "cpuset.cpus" can
     cause a valid partition root to become invalid and vice versa.
     Note that a task cannot be moved to a cgroup with empty
     "cpuset.cpus.effective".

     For a valid partition root with the sibling cpu exclusivity
     rule enabled, changes made to "cpuset.cpus" that violate the
     exclusivity rule will invalidate the partition as well as its
     sibling partitions with conflicting cpuset.cpus values.  So
     care must be taken when changing "cpuset.cpus".

     A valid non-root parent partition may distribute out all its CPUs
     to its child partitions when there is no task associated with it.

     Care must be taken when changing a valid partition root to
     "member", as all its child partitions, if present, will become
     invalid, causing disruption to tasks running in those child
     partitions.  These invalidated partitions can be recovered if
     their parent is switched back to a partition root with a proper
     set of "cpuset.cpus".

     Poll and inotify events are triggered whenever the state of
     "cpuset.cpus.partition" changes.  That includes changes caused
     by write to "cpuset.cpus.partition", cpu hotplug or other
     changes that modify the validity status of the partition.
     This will allow user space agents to monitor unexpected changes
     to "cpuset.cpus.partition" without the need to do continuous
     polling.
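As a sketch of how such a user-space agent might react to a state change: the loop below watches a mock file standing in for "cpuset.cpus.partition" and reports transitions. A real agent would use poll() or inotify on the actual cgroup file instead of a sleep loop, and the "(cpusempty)" reason string is purely illustrative:

```shell
# Watch a (mock) cpuset.cpus.partition file and report state changes.
# A real agent would poll()/inotify-watch the cgroup file rather than
# re-reading it in a loop; the loop keeps this sketch self-contained.
f=$(mktemp)
log=$(mktemp)
echo "root" > "$f"

watch_partition() {
    last=$(cat "$1")
    while :; do
        cur=$(cat "$1")
        if [ "$cur" != "$last" ]; then
            echo "partition state changed: $last -> $cur"
            last=$cur
        fi
        # Stop once an invalidation has been observed.
        case "$cur" in *invalid*) break ;; esac
        sleep 0.1
    done
}

watch_partition "$f" > "$log" &
sleep 0.3
echo "root invalid (cpusempty)" > "$f"   # simulate an invalidation event
wait
cat "$log"
```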

Cheers,
Longman


* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-05-08  1:03                                             ` Waiman Long
@ 2023-05-22 19:49                                               ` Tejun Heo
  2023-05-28 21:18                                                 ` Waiman Long
  0 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2023-05-22 19:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker, Mrunal Patel,
	Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld

Hello, Waiman.

On Sun, May 07, 2023 at 09:03:44PM -0400, Waiman Long wrote:
...
>   cpuset.cpus.reserve
>     A read-write multiple values file which exists only on root
>     cgroup.
> 
>     It lists all the CPUs that are reserved for adjacent and remote
>     partitions created in the system.  See the next section for
>     more information on what an adjacent or remote partitions is.
> 
>     Creation of adjacent partition does not require touching this
>     control file as CPU reservation will be done automatically.
>     In order to create a remote partition, the CPUs needed by the
>     remote partition has to be written to this file first.
> 
>     A "+" prefix can be used to indicate a list of additional
>     CPUs that are to be added without disturbing the CPUs that are
>     originally there.  For example, if its current value is "3-4",
>     echoing ""+5" to it will change it to "3-5".
>
>     Once a remote partition is destroyed, its CPUs have to be
>     removed from this file or no other process can use them.  A "-"
>     prefix can be used to remove a list of CPUs from it.  However,
>     removing CPUs that are currently used in existing partitions
>     may cause those partitions to become invalid.  A single "-"
>     character without any number can be used to indicate removal
>     of all the free CPUs not allocated to any partitions to avoid
>     accidental partition invalidation.

Why is the syntax different from .cpus? Wouldn't it be better to keep them
the same?

>   cpuset.cpus.partition
>     A read-write single value file which exists on non-root
>     cpuset-enabled cgroups.  This flag is owned by the parent cgroup
>     and is not delegatable.
> 
>     It accepts only the following input values when written to.
> 
>       ==========    =====================================
>       "member"    Non-root member of a partition
>       "root"    Partition root
>       "isolated"    Partition root without load balancing
>       ==========    =====================================
> 
>     A cpuset partition is a collection of cgroups with a partition
>     root at the top of the hierarchy and its descendants except
>     those that are separate partition roots themselves and their
>     descendants.  A partition has exclusive access to the set of
>     CPUs allocated to it.  Other cgroups outside of that partition
>     cannot use any CPUs in that set.
> 
>     There are two types of partitions - adjacent and remote.  The
>     parent of an adjacent partition must be a valid partition root.
>     Partition roots of adjacent partitions are all clustered around
>     the root cgroup.  Creation of adjacent partition is done by
>     writing the desired partition type into "cpuset.cpus.partition".
> 
>     A remote partition does not require a partition root parent.
>     So a remote partition can be formed far from the root cgroup.
>     However, its creation is a 2-step process.  The CPUs needed
>     by a remote partition ("cpuset.cpus" of the partition root)
>     has to be written into "cpuset.cpus.reserve" of the root
>     cgroup first.  After that, "isolated" can be written into
>     "cpuset.cpus.partition" of the partition root to form a remote
>     isolated partition which is the only supported remote partition
>     type for now.
> 
>     All remote partitions are terminal as adjacent partition cannot
>     be created underneath it.

Can you elaborate this extra restriction a bit further?

In general, I think it'd be really helpful if the document explains the
reasoning behind the design decisions. ie. Why is reserving for? What
purpose does it serve that the regular isolated ones cannot? That'd help
clarifying the design decisions.

Thanks.

-- 
tejun

* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-05-22 19:49                                               ` Tejun Heo
@ 2023-05-28 21:18                                                 ` Waiman Long
  2023-06-05 18:03                                                   ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2023-05-28 21:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker, Mrunal Patel,
	Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld

On 5/22/23 15:49, Tejun Heo wrote:
> Hello, Waiman.

Sorry for the late reply as I had been off for almost 2 weeks due to PTO.


>
> On Sun, May 07, 2023 at 09:03:44PM -0400, Waiman Long wrote:
> ...
>>    cpuset.cpus.reserve
>>      A read-write multiple values file which exists only on root
>>      cgroup.
>>
>>      It lists all the CPUs that are reserved for adjacent and remote
>>      partitions created in the system.  See the next section for
>>      more information on what an adjacent or remote partitions is.
>>
>>      Creation of adjacent partition does not require touching this
>>      control file as CPU reservation will be done automatically.
>>      In order to create a remote partition, the CPUs needed by the
>>      remote partition has to be written to this file first.
>>
>>      A "+" prefix can be used to indicate a list of additional
>>      CPUs that are to be added without disturbing the CPUs that are
>>      originally there.  For example, if its current value is "3-4",
>>      echoing ""+5" to it will change it to "3-5".
>>
>>      Once a remote partition is destroyed, its CPUs have to be
>>      removed from this file or no other process can use them.  A "-"
>>      prefix can be used to remove a list of CPUs from it.  However,
>>      removing CPUs that are currently used in existing partitions
>>      may cause those partitions to become invalid.  A single "-"
>>      character without any number can be used to indicate removal
>>      of all the free CPUs not allocated to any partitions to avoid
>>      accidental partition invalidation.
> Why is the syntax different from .cpus? Wouldn't it be better to keep them
> the same?

Unlike cpuset.cpus, cpuset.cpus.reserve is supposed to contain CPUs 
that are used in multiple partitions. Also automatic reservation of 
adjacent partitions can happen in parallel. That is why I think it will 
be safer if we allow incremental increase or decrease of reserve CPUs to 
be used for remote partitions. I will include this reasoning into the 
doc file.


>>    cpuset.cpus.partition
>>      A read-write single value file which exists on non-root
>>      cpuset-enabled cgroups.  This flag is owned by the parent cgroup
>>      and is not delegatable.
>>
>>      It accepts only the following input values when written to.
>>
>>        ==========    =====================================
>>        "member"    Non-root member of a partition
>>        "root"    Partition root
>>        "isolated"    Partition root without load balancing
>>        ==========    =====================================
>>
>>      A cpuset partition is a collection of cgroups with a partition
>>      root at the top of the hierarchy and its descendants except
>>      those that are separate partition roots themselves and their
>>      descendants.  A partition has exclusive access to the set of
>>      CPUs allocated to it.  Other cgroups outside of that partition
>>      cannot use any CPUs in that set.
>>
>>      There are two types of partitions - adjacent and remote.  The
>>      parent of an adjacent partition must be a valid partition root.
>>      Partition roots of adjacent partitions are all clustered around
>>      the root cgroup.  Creation of adjacent partition is done by
>>      writing the desired partition type into "cpuset.cpus.partition".
>>
>>      A remote partition does not require a partition root parent.
>>      So a remote partition can be formed far from the root cgroup.
>>      However, its creation is a 2-step process.  The CPUs needed
>>      by a remote partition ("cpuset.cpus" of the partition root)
>>      has to be written into "cpuset.cpus.reserve" of the root
>>      cgroup first.  After that, "isolated" can be written into
>>      "cpuset.cpus.partition" of the partition root to form a remote
>>      isolated partition which is the only supported remote partition
>>      type for now.
>>
>>      All remote partitions are terminal as adjacent partition cannot
>>      be created underneath it.
> Can you elaborate this extra restriction a bit further?

Are you referring to the fact that only remote isolated partitions are 
supported? I do not preclude the support of load balancing remote 
partitions. I keep it to isolated partitions for now for ease of 
implementation and I am not currently aware of a use case where such a 
remote partition type is needed.

If you are talking about remote partitions being terminal, it is mainly 
because it can be more tricky to support hierarchical adjacent 
partitions underneath them, especially if they are not isolated. We can 
certainly support it if a use case arises. I just don't want to 
implement code that nobody is really going to use.

BTW, with the current way the remote partition is created, it is not 
possible to have another remote partition underneath it.

>
> In general, I think it'd be really helpful if the document explains the
> reasoning behind the design decisions. ie. Why is reserving for? What
> purpose does it serve that the regular isolated ones cannot? That'd help
> clarifying the design decisions.

I understand your concern. If you think it is better to support both 
types of remote partitions, or hierarchical adjacent partitions 
underneath them for symmetry purposes, I can certainly do that. It just 
needs a bit more time.

Cheers,
Longman


* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-05-28 21:18                                                 ` Waiman Long
@ 2023-06-05 18:03                                                   ` Tejun Heo
  2023-06-05 20:00                                                       ` Waiman Long
  0 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2023-06-05 18:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker, Mrunal Patel,
	Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld

Hello, Waiman.

On Sun, May 28, 2023 at 05:18:50PM -0400, Waiman Long wrote:
> On 5/22/23 15:49, Tejun Heo wrote:
> Sorry for the late reply as I had been off for almost 2 weeks due to PTO.

And me too. Just moved.

> > Why is the syntax different from .cpus? Wouldn't it be better to keep them
> > the same?
> 
> Unlike cpuset.cpus, cpuset.cpus.reserve is supposed to contains CPUs that
> are used in multiple partitions. Also automatic reservation of adjacent
> partitions can happen in parallel. That is why I think it will be safer if

Ah, I see, this is because cpu.reserve is only in the root cgroup, so you
can't say that the knob is owned by the parent cgroup and thus access is
controlled that way.

...
> > >      There are two types of partitions - adjacent and remote.  The
> > >      parent of an adjacent partition must be a valid partition root.
> > >      Partition roots of adjacent partitions are all clustered around
> > >      the root cgroup.  Creation of adjacent partition is done by
> > >      writing the desired partition type into "cpuset.cpus.partition".
> > > 
> > >      A remote partition does not require a partition root parent.
> > >      So a remote partition can be formed far from the root cgroup.
> > >      However, its creation is a 2-step process.  The CPUs needed
> > >      by a remote partition ("cpuset.cpus" of the partition root)
> > >      has to be written into "cpuset.cpus.reserve" of the root
> > >      cgroup first.  After that, "isolated" can be written into
> > >      "cpuset.cpus.partition" of the partition root to form a remote
> > >      isolated partition which is the only supported remote partition
> > >      type for now.
> > > 
> > >      All remote partitions are terminal as adjacent partition cannot
> > >      be created underneath it.
> >
> > Can you elaborate this extra restriction a bit further?
> 
> Are you referring to the fact that only remote isolated partitions are
> supported? I do not preclude the support of load balancing remote
> partitions. I keep it to isolated partitions for now for ease of
> implementation and I am not currently aware of a use case where such a
> remote partition type is needed.
>
> If you are talking about remote partition being terminal. It is mainly
> because it can be more tricky to support hierarchical adjacent partitions
> underneath it especially if it is not isolated. We can certainly support it
> if a use case arises. I just don't want to implement code that nobody is
> really going to use.
> 
> BTW, with the current way the remote partition is created, it is not
> possible to have another remote partition underneath it.

The fact that the control is spread across a root-only file and per-cgroup
file seems hacky to me. e.g. How would it interact with namespacing? Are
there reasons why this can't be properly hierarchical other than the amount
of work needed? For example:

  cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs
  that the cgroup holds exclusively. The mask is always a subset of
  cpuset.cpus. The parent loses access to a CPU when the CPU is given to a
  child by setting the CPU in the child's cpus.exclusive and the CPU can't
  be given to more than one child. IOW, exclusive CPUs are available only to
  the leaf cgroups that have them set in their .exclusive file.

  When a cgroup is turned into a partition, its cpuset.cpus and
  cpuset.cpus.exclusive should be the same. For backward compatibility, if
  the cgroup's parent is already a partition, cpuset will automatically
  attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive.

I could well be missing something important but I'd really like to see
something like the above where the reservation feature blends in with the
rest of cpuset.
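
A rough sketch of how this proposed cpuset.cpus.exclusive model would 
be driven from user space (hypothetical cgroup names, and a scratch 
directory standing in for /sys/fs/cgroup, since no kernel enforced this 
file at the time of the discussion):

```shell
# Sketch of the proposed per-cgroup cpuset.cpus.exclusive model.
# A scratch directory stands in for /sys/fs/cgroup; cgroup names are
# made up and no kernel enforcement happens here.
CGROOT=$(mktemp -d)
mkdir -p "$CGROOT/parent/child"

echo "0-7" > "$CGROOT/parent/cpuset.cpus"
echo "2-3" > "$CGROOT/parent/child/cpuset.cpus"

# Hand CPUs 2-3 exclusively to the child. In the proposed model the
# parent would lose these CPUs from its effective set, and no other
# child could claim them; the mask must be a subset of cpuset.cpus.
echo "2-3" > "$CGROOT/parent/child/cpuset.cpus.exclusive"

# With exclusive ownership established, the partition write can succeed.
echo "isolated" > "$CGROOT/parent/child/cpuset.cpus.partition"
```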

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
@ 2023-06-05 20:00                                                       ` Waiman Long
  0 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2023-06-05 20:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker, Mrunal Patel,
	Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld

On 6/5/23 14:03, Tejun Heo wrote:
> Hello, Waiman.
>
> On Sun, May 28, 2023 at 05:18:50PM -0400, Waiman Long wrote:
>> On 5/22/23 15:49, Tejun Heo wrote:
>> Sorry for the late reply as I had been off for almost 2 weeks due to PTO.
> And me too. Just moved.
>
>>> Why is the syntax different from .cpus? Wouldn't it be better to keep them
>>> the same?
>> Unlike cpuset.cpus, cpuset.cpus.reserve is supposed to contains CPUs that
>> are used in multiple partitions. Also automatic reservation of adjacent
>> partitions can happen in parallel. That is why I think it will be safer if
> Ah, I see, this is because cpu.reserve is only in the root cgroup, so you
> can't say that the knob is owned by the parent cgroup and thus access is
> controlled that way.
>
> ...
>>>>       There are two types of partitions - adjacent and remote.  The
>>>>       parent of an adjacent partition must be a valid partition root.
>>>>       Partition roots of adjacent partitions are all clustered around
>>>>       the root cgroup.  Creation of adjacent partition is done by
>>>>       writing the desired partition type into "cpuset.cpus.partition".
>>>>
>>>>       A remote partition does not require a partition root parent.
>>>>       So a remote partition can be formed far from the root cgroup.
>>>>       However, its creation is a 2-step process.  The CPUs needed
>>>>       by a remote partition ("cpuset.cpus" of the partition root)
>>>>       has to be written into "cpuset.cpus.reserve" of the root
>>>>       cgroup first.  After that, "isolated" can be written into
>>>>       "cpuset.cpus.partition" of the partition root to form a remote
>>>>       isolated partition which is the only supported remote partition
>>>>       type for now.
>>>>
>>>>       All remote partitions are terminal as adjacent partition cannot
>>>>       be created underneath it.
>>> Can you elaborate this extra restriction a bit further?
>> Are you referring to the fact that only remote isolated partitions are
>> supported? I do not preclude the support of load balancing remote
>> partitions. I keep it to isolated partitions for now for ease of
>> implementation and I am not currently aware of a use case where such a
>> remote partition type is needed.
>>
>> If you are talking about remote partition being terminal. It is mainly
>> because it can be more tricky to support hierarchical adjacent partitions
>> underneath it especially if it is not isolated. We can certainly support it
>> if a use case arises. I just don't want to implement code that nobody is
>> really going to use.
>>
>> BTW, with the current way the remote partition is created, it is not
>> possible to have another remote partition underneath it.
> The fact that the control is spread across a root-only file and per-cgroup
> file seems hacky to me. e.g. How would it interact with namespacing? Are
> there reasons why this can't be properly hierarchical other than the amount
> of work needed? For example:
>
>    cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs
>    that the cgroup holds exclusively. The mask is always a subset of
>    cpuset.cpus. The parent loses access to a CPU when the CPU is given to a
>    child by setting the CPU in the child's cpus.exclusive and the CPU can't
>    be given to more than one child. IOW, exclusive CPUs are available only to
>    the leaf cgroups that have them set in their .exclusive file.
>
>    When a cgroup is turned into a partition, its cpuset.cpus and
>    cpuset.cpus.exclusive should be the same. For backward compatibility, if
>    the cgroup's parent is already a partition, cpuset will automatically
>    attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive.
>
> I could well be missing something important but I'd really like to see
> something like the above where the reservation feature blends in with the
> rest of cpuset.

It can certainly be made hierarchical as you suggest. It does increase 
complexity from both user and kernel point of view.

From the user point of view, there is one more knob to manage 
hierarchically, and it is not one that is used very often.

From the kernel point of view, we may need one more cpumask per cpuset, 
as the current subparts_cpus is used to track automatic reservation; we 
need another cpumask to hold the extra exclusive CPUs that are not 
allocated through automatic reservation. You describe this new control 
file as the list of CPUs exclusively owned by a cgroup, but creating a 
partition is in fact also an allocation of exclusive CPUs to a cgroup, 
so it overlaps with the cpuset.cpus.partition file. Should a write to 
cpuset.cpus.exclusive fail if those exclusive CPUs cannot be granted, 
or is the exclusive list only valid once a valid partition can be 
formed? Either way, we need to properly manage the dependency between 
these two control files.

Alternatively, I have no problem exposing cpuset.cpus.exclusive as a 
read-only file. It is a bit problematic if we need to make it writable.

As for namespacing, you do raise a good point. I was thinking mostly 
from a whole-system point of view, as the use case that I am aware of 
does not need that. To allow delegation of exclusive CPUs to a child 
cgroup, that cgroup has to be a partition root itself. One compromise 
that I can think of is to allow automatic reservation only in such 
a scenario. In that case, I need to support a remote load-balanced 
partition as well as hierarchical sub-partitions underneath it. That 
can be done with some extra code on top of the existing v2 patchset 
without introducing too much complexity.

IOW, the use of remote partition is only allowed on the whole system 
level where one has access to the cgroup root. Exclusive CPUs 
distribution within a container can only be done via the use of adjacent 
partitions with automatic reservation. Will that be a good enough 
compromise from your point of view?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-06-05 20:00                                                       ` Waiman Long
  (?)
@ 2023-06-05 20:27                                                       ` Tejun Heo
  2023-06-06  2:47                                                         ` Waiman Long
  -1 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2023-06-05 20:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker, Mrunal Patel,
	Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld

Hello,

On Mon, Jun 05, 2023 at 04:00:39PM -0400, Waiman Long wrote:
...
> > file seems hacky to me. e.g. How would it interact with namespacing? Are
> > there reasons why this can't be properly hierarchical other than the amount
> > of work needed? For example:
> > 
> >    cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs
> >    that the cgroup holds exclusively. The mask is always a subset of
> >    cpuset.cpus. The parent loses access to a CPU when the CPU is given to a
> >    child by setting the CPU in the child's cpus.exclusive and the CPU can't
> >    be given to more than one child. IOW, exclusive CPUs are available only to
> >    the leaf cgroups that have them set in their .exclusive file.
> > 
> >    When a cgroup is turned into a partition, its cpuset.cpus and
> >    cpuset.cpus.exclusive should be the same. For backward compatibility, if
> >    the cgroup's parent is already a partition, cpuset will automatically
> >    attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive.
> > 
> > I could well be missing something important but I'd really like to see
> > something like the above where the reservation feature blends in with the
> > rest of cpuset.
> 
> It can certainly be made hierarchical as you suggest. It does increase
> complexity from both user and kernel point of view.
> 
> From the user point of view, there is one more knob to manage hierarchically
> which is not used that often.

From user pov, this only affects them when they want to create partitions
down the tree, right?

> From the kernel point of view, we may need to have one more cpumask per
> cpuset as the current subparts_cpus is used to track automatic reservation.
> We need another cpumask to contain extra exclusive CPUs not allocated
> through automatic reservation. The fact that you mention this new control
> file as a list of exclusively owned CPUs for this cgroup. Creating a
> partition is in fact allocating exclusive CPUs to a cgroup. So it kind of
> overlaps with the cpuset.cpus.partititon file. Can we fail a write to

Yes, it substitutes and expands on cpuset.cpus.partition behavior.

> cpuset.cpus.exclusive if those exclusive CPUs cannot be granted or will this
> exclusive list is only valid if a valid partition can be formed. So we need
> to properly manage the dependency between these 2 control files.

So, I think cpus.exclusive can become the sole mechanism to arbitrate
exclusive ownership of CPUs and .partition can depend on .exclusive.

> Alternatively, I have no problem exposing cpuset.cpus.exclusive as a
> read-only file. It is a bit problematic if we need to make it writable.

I don't follow. How would remote partitions work then?

> As for namespacing, you do raise a good point. I was thinking mostly from a
> whole system point of view as the use case that I am aware of does not needs
> that. To allow delegation of exclusive CPUs to a child cgroup, that cgroup
> has to be a partition root itself. One compromise that I can think of is to
> only allow automatic reservation only in such a scenario. In that case, I
> need to support a remote load balanced partition as well and hierarchical
> sub-partitions underneath it. That can be done with some extra code to the
> existing v2 patchset without introducing too much complexity.
> 
> IOW, the use of remote partition is only allowed on the whole system level
> where one has access to the cgroup root. Exclusive CPUs distribution within
> a container can only be done via the use of adjacent partitions with
> automatic reservation. Will that be a good enough compromise from your point
> of view?

It seems too twisted to me. I'd much prefer it to be better integrated with
the rest of cpuset.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-06-05 20:27                                                       ` Tejun Heo
@ 2023-06-06  2:47                                                         ` Waiman Long
  2023-06-06 19:58                                                             ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Waiman Long @ 2023-06-06  2:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker, Mrunal Patel,
	Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld

On 6/5/23 16:27, Tejun Heo wrote:
> Hello,
>
> On Mon, Jun 05, 2023 at 04:00:39PM -0400, Waiman Long wrote:
> ...
>>> file seems hacky to me. e.g. How would it interact with namespacing? Are
>>> there reasons why this can't be properly hierarchical other than the amount
>>> of work needed? For example:
>>>
>>>     cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs
>>>     that the cgroup holds exclusively. The mask is always a subset of
>>>     cpuset.cpus. The parent loses access to a CPU when the CPU is given to a
>>>     child by setting the CPU in the child's cpus.exclusive and the CPU can't
>>>     be given to more than one child. IOW, exclusive CPUs are available only to
>>>     the leaf cgroups that have them set in their .exclusive file.
>>>
>>>     When a cgroup is turned into a partition, its cpuset.cpus and
>>>     cpuset.cpus.exclusive should be the same. For backward compatibility, if
>>>     the cgroup's parent is already a partition, cpuset will automatically
>>>     attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive.
>>>
>>> I could well be missing something important but I'd really like to see
>>> something like the above where the reservation feature blends in with the
>>> rest of cpuset.
>> It can certainly be made hierarchical as you suggest. It does increase
>> complexity from both user and kernel point of view.
>>
>>  From the user point of view, there is one more knob to manage hierarchically
>> which is not used that often.
>  From user pov, this only affects them when they want to create partitions
> down the tree, right?
>
>>  From the kernel point of view, we may need to have one more cpumask per
>> cpuset as the current subparts_cpus is used to track automatic reservation.
>> We need another cpumask to contain extra exclusive CPUs not allocated
>> through automatic reservation. The fact that you mention this new control
>> file as a list of exclusively owned CPUs for this cgroup. Creating a
>> partition is in fact allocating exclusive CPUs to a cgroup. So it kind of
>> overlaps with the cpuset.cpus.partititon file. Can we fail a write to
> Yes, it substitutes and expands on cpuset.cpus.partition behavior.
>
>> cpuset.cpus.exclusive if those exclusive CPUs cannot be granted or will this
>> exclusive list is only valid if a valid partition can be formed. So we need
>> to properly manage the dependency between these 2 control files.
> So, I think cpus.exclusive can become the sole mechanism to arbitrate
> exclusive owenership of CPUs and .partition can depend on .exclusive.
>
>> Alternatively, I have no problem exposing cpuset.cpus.exclusive as a
>> read-only file. It is a bit problematic if we need to make it writable.
> I don't follow. How would remote partitions work then?

I had a different idea about the semantics of cpuset.cpus.exclusive at 
the beginning. My original thinking was that it holds the actual 
exclusive CPUs that are allocated to the cgroup. If we instead treat it 
as a hint of which exclusive CPUs should be used, one that becomes 
valid only when the cgroup can become a valid partition, then I can see 
it as a value that can be set hierarchically throughout the whole 
cpuset hierarchy.

So a transition to a valid partition is possible iff

1) cpuset.cpus.exclusive is a subset of cpuset.cpus and is a subset of 
cpuset.cpus.exclusive of all its ancestors.
2) If its parent is not a partition root, none of the CPUs in 
cpuset.cpus.exclusive are currently allocated to other partitions. This 
is the same remote partition concept as in my v2 patch. If its parent 
is a partition root, part of the parent's exclusive CPUs will be 
distributed to this child partition, like the current behavior of 
cpuset partitions.
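
Condition 1 above is a plain subset test; in the kernel it would be a 
cpumask_subset() check on cpumasks. The sketch below is an illustrative 
user-space approximation of the same check over the cpuset list format 
(the helper names are made up, and the range parsing is simplified to 
"a-b" ranges and single CPUs):

```shell
# Rough user-space approximation of the subset test in condition 1.
# The kernel works on cpumasks; this sketch parses the cpuset list
# format, handling only simple "a-b" ranges and single CPUs.
expand() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r a b; do
        seq "$a" "${b:-$a}"
    done
}

# is_subset A B: succeed iff every CPU in list A is also in list B.
is_subset() {
    sa=$(mktemp); sb=$(mktemp)
    expand "$1" | sort > "$sa"
    expand "$2" | sort > "$sb"
    extra=$(comm -23 "$sa" "$sb")
    rm -f "$sa" "$sb"
    [ -z "$extra" ]
}

is_subset "2-3" "0-4" && echo "2-3 is a subset of 0-4"
is_subset "2,5" "0-4" || echo "2,5 is not a subset of 0-4"
```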

I can rework my patch to adopt this model if it is what you have in mind.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
@ 2023-06-06 19:58                                                             ` Tejun Heo
  0 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2023-06-06 19:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker, Mrunal Patel,
	Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld

Hello, Waiman.

On Mon, Jun 05, 2023 at 10:47:08PM -0400, Waiman Long wrote:
...
> I had a different idea on the semantics of the cpuset.cpus.exclusive at the
> beginning. My original thinking is that it was the actual exclusive CPUs
> that are allocated to the cgroup. Now if we treat this as a hint of what
> exclusive CPUs should be used and it becomes valid only if the cgroup can

I wouldn't call it a hint. It's still hard allocation of the CPUs to the
cgroups that own them. Setting up a partition requires exclusive CPUs and
thus would depend on exclusive allocations set up accordingly.

> become a valid partition. I can see it as a value that can be hierarchically
> set throughout the whole cpuset hierarchy.
> 
> So a transition to a valid partition is possible iff
> 
> 1) cpuset.cpus.exclusive is a subset of cpuset.cpus and is a subset of
> cpuset.cpus.exclusive of all its ancestors.

Yes.

> 2) If its parent is not a partition root, none of the CPUs in
> cpuset.cpus.exclusive are currently allocated to other partitions. This the

Not just that, the CPUs aren't available to cgroups which don't have them
set in the .exclusive file. IOW, if a CPU is in cpus.exclusive of some
cgroups, it shouldn't appear in cpus.effective of cgroups which don't have
the CPU in their cpus.exclusive.

So, .exclusive explicitly establishes exclusive ownership of CPUs and
partitions depend on that with an implicit "turn CPUs exclusive" behavior in
case the parent is a partition root for backward compatibility.

> same remote partition concept in my v2 patch. If its parent is a partition
> root, part of its exclusive CPUs will be distributed to this child partition
> like the current behavior of cpuset partition.

Yes, similar in a sense. Please do away with the "once .reserve is used, the
behavior is switched" part. Instead, it can be something like "if the parent
is a partition root, cpuset implicitly tries to set all CPUs in its cpus
file in its cpus.exclusive file" so that user-visible behavior stays
unchanged regardless of past history.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
@ 2023-06-06 20:11                                                               ` Waiman Long
  0 siblings, 0 replies; 45+ messages in thread
From: Waiman Long @ 2023-06-06 20:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker, Mrunal Patel,
	Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld

On 6/6/23 15:58, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, Jun 05, 2023 at 10:47:08PM -0400, Waiman Long wrote:
> ...
>> I had a different idea on the semantics of the cpuset.cpus.exclusive at the
>> beginning. My original thinking is that it was the actual exclusive CPUs
>> that are allocated to the cgroup. Now if we treat this as a hint of what
>> exclusive CPUs should be used and it becomes valid only if the cgroup can
> I wouldn't call it a hint. It's still hard allocation of the CPUs to the
> cgroups that own them. Setting up a partition requires exclusive CPUs and
> thus would depend on exclusive allocations set up accordingly.
>
>> become a valid partition. I can see it as a value that can be hierarchically
>> set throughout the whole cpuset hierarchy.
>>
>> So a transition to a valid partition is possible iff
>>
>> 1) cpuset.cpus.exclusive is a subset of cpuset.cpus and is a subset of
>> cpuset.cpus.exclusive of all its ancestors.
> Yes.
>
>> 2) If its parent is not a partition root, none of the CPUs in
>> cpuset.cpus.exclusive are currently allocated to other partitions. This is the
> Not just that, the CPUs aren't available to cgroups which don't have them
> set in the .exclusive file. IOW, if a CPU is in cpus.exclusive of some
> cgroups, it shouldn't appear in cpus.effective of cgroups which don't have
> the CPU in their cpus.exclusive.
>
> So, .exclusive explicitly establishes exclusive ownership of CPUs and
> partitions depend on that with an implicit "turn CPUs exclusive" behavior in
> case the parent is a partition root for backward compatibility.
The current CPU-exclusive behavior is limited to sibling cgroups only. 
Because of the hierarchical nature of cpu distribution, the set of 
exclusive CPUs has to appear in all of its ancestors. When a partition 
is enabled, we do a sibling exclusivity test at that point to verify 
that it is exclusive. It looks like you want to do an exclusivity test 
even when the partition isn't active. I can certainly do that when the 
file is being updated. However, the write will fail if the exclusivity 
test fails, just like with the v1 cpuset.cpu_exclusive flag, if you are 
OK with that.
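The write-time check being proposed could look like the following sketch
(hypothetical helper, not kernel code): a write to cpus.exclusive is rejected
if the requested CPUs fall outside the parent's exclusive set or overlap a
sibling's cpus.exclusive.

```python
# Sketch of a write-time exclusivity test for cpus.exclusive.

def exclusive_write_ok(requested, parent_exclusive, sibling_exclusives):
    """requested: CPUs being written to this cgroup's cpus.exclusive.
    parent_exclusive: the parent's cpus.exclusive set.
    sibling_exclusives: cpus.exclusive sets of the sibling cgroups."""
    if not requested <= parent_exclusive:
        return False  # must stay within the ancestors' exclusive claim
    # Reject any overlap with a sibling's exclusive claim.
    return all(not (requested & sib) for sib in sibling_exclusives)

print(exclusive_write_ok({2, 3}, {0, 1, 2, 3}, [{0, 1}]))  # True
print(exclusive_write_ok({1, 2}, {0, 1, 2, 3}, [{0, 1}]))  # False: CPU 1 taken
```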
>
>> same remote partition concept in my v2 patch. If its parent is a partition
>> root, part of its exclusive CPUs will be distributed to this child partition
>> like the current behavior of cpuset partition.
> Yes, similar in a sense. Please do away with the "once .reserve is used, the
> behavior is switched" part.

That behavior is already gone in my v2 patch.

> Instead, it can be sth like "if the parent is a
> partition root, cpuset implicitly tries to set all CPUs in its cpus file in
> its cpus.exclusive file" so that user-visible behavior stays unchanged
> depending on past history.

If the parent is a partition root, auto-reservation will be done and 
cpus.exclusive will be set automatically, just like before. So existing 
applications using partitions will not be affected.
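The auto-reservation fallback described here can be sketched as a small model
(hypothetical names, not kernel code): an explicit cpus.exclusive wins; with
no explicit value and a partition-root parent, all CPUs in the cpus file are
implicitly reserved, preserving the old user-visible behavior.

```python
# Sketch of implicit reservation when enabling a child partition.

def partition_exclusive_cpus(child_cpus, child_exclusive,
                             parent_is_partition_root):
    """Return the cpus.exclusive set used when the child becomes a partition."""
    if child_exclusive:
        return set(child_exclusive)  # explicit reservation wins
    if parent_is_partition_root:
        return set(child_cpus)       # implicit: reserve everything in cpus
    return set()                     # nothing reserved -> partition invalid

print(partition_exclusive_cpus({4, 5}, set(), True))   # {4, 5}
print(partition_exclusive_cpus({4, 5}, {4}, True))     # {4}
```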

Cheers,
Longman


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-06-06 20:11                                                               ` Waiman Long
  (?)
@ 2023-06-06 20:13                                                               ` Tejun Heo
  -1 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2023-06-06 20:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Michal Koutný,
	Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
	linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
	Valentin Schneider, Frederic Weisbecker, Mrunal Patel,
	Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld

Hello,

On Tue, Jun 06, 2023 at 04:11:02PM -0400, Waiman Long wrote:
...
> The current CPU-exclusive behavior is limited to sibling cgroups only.
> Because of the hierarchical nature of cpu distribution, the set of exclusive
> CPUs has to appear in all of its ancestors. When a partition is enabled, we
> do a sibling exclusivity test at that point to verify that it is exclusive.
> It looks like you want to do an exclusivity test even when the partition
> isn't active. I can certainly do that when the file is being updated.
> However, the write will fail if the exclusivity test fails, just like with
> the v1 cpuset.cpu_exclusive flag, if you are OK with that.

Yeah, doesn't look like there's a way around it if we want to make
.exclusive a feature which is useful on its own.

> > Instead, it can be sth like "if the parent is a
> > partition root, cpuset implicitly tries to set all CPUs in its cpus file in
> > its cpus.exclusive file" so that user-visible behavior stays unchanged
> > depending on past history.
> 
> If parent is a partition root, auto reservation will be done and
> cpus.exclusive will be set automatically just like before. So existing
> applications using partition will not be affected.

Sounds great.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2023-06-06 20:13 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-12 15:37 [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition Waiman Long
2023-04-12 15:37 ` Waiman Long
2023-04-12 19:28 ` Tejun Heo
     [not found]   ` <1ce6a073-e573-0c32-c3d8-f67f3d389a28@redhat.com>
2023-04-12 20:22     ` Tejun Heo
2023-04-12 20:33       ` Waiman Long
2023-04-13  0:03         ` Tejun Heo
2023-04-13  0:26           ` Waiman Long
2023-04-13  0:26             ` Waiman Long
2023-04-13  0:33             ` Tejun Heo
2023-04-13  0:33               ` Tejun Heo
2023-04-13  0:55               ` Waiman Long
2023-04-13  0:55                 ` Waiman Long
2023-04-13  1:17                 ` Tejun Heo
2023-04-13  1:17                   ` Tejun Heo
2023-04-13  1:55                   ` Waiman Long
2023-04-14  1:22                     ` Waiman Long
2023-04-14  1:22                       ` Waiman Long
2023-04-14 16:54                       ` Tejun Heo
2023-04-14 16:54                         ` Tejun Heo
2023-04-14 17:29                         ` Waiman Long
2023-04-14 17:34                           ` Tejun Heo
2023-04-14 17:38                             ` Waiman Long
2023-04-14 19:06                               ` Waiman Long
2023-05-02 18:01                                 ` Michal Koutný
2023-05-02 21:26                                   ` Waiman Long
2023-05-02 22:27                                     ` Michal Koutný
2023-05-02 22:27                                       ` Michal Koutný
2023-05-04  3:01                                       ` Waiman Long
2023-05-05 16:03                                         ` Tejun Heo
2023-05-05 16:03                                           ` Tejun Heo
2023-05-05 16:25                                           ` Waiman Long
2023-05-05 16:25                                             ` Waiman Long
2023-05-08  1:03                                             ` Waiman Long
2023-05-22 19:49                                               ` Tejun Heo
2023-05-28 21:18                                                 ` Waiman Long
2023-06-05 18:03                                                   ` Tejun Heo
2023-06-05 20:00                                                     ` Waiman Long
2023-06-05 20:00                                                       ` Waiman Long
2023-06-05 20:27                                                       ` Tejun Heo
2023-06-06  2:47                                                         ` Waiman Long
2023-06-06 19:58                                                           ` Tejun Heo
2023-06-06 19:58                                                             ` Tejun Heo
2023-06-06 20:11                                                             ` Waiman Long
2023-06-06 20:11                                                               ` Waiman Long
2023-06-06 20:13                                                               ` Tejun Heo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.