* [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
@ 2023-04-12 15:37 Waiman Long
  2023-04-12 19:28 ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2023-04-12 15:37 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan
  Cc: linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
      Valentin Schneider, Frederic Weisbecker, Waiman Long

This patch series introduces a new "isolcpus" partition type to the
existing list of {member, root, isolated} types. The primary reason for
adding this new "isolcpus" partition is to facilitate the distribution
of isolated CPUs down the cgroup v2 hierarchy.

The other non-member partition types have the limitation that their
parents have to be valid partitions too, which makes it hard to create
a partition a few layers down the hierarchy.

It is relatively rare to have applications that require the creation of
a separate scheduling domain (root). However, it is more common to have
applications that require the use of isolated CPUs (isolated), e.g.
DPDK. One can use the "isolcpus" or "nohz_full" boot command line
options to get that statically. Of course, the "isolated" partition is
another way to achieve that dynamically.

Modern container orchestration tools like Kubernetes use the cgroup
hierarchy to manage different containers. If a container needs to use
isolated CPUs, it is hard to get those with the existing set of cpuset
partition types. With this patch series, a new "isolcpus" partition can
be created to hold a set of isolated CPUs that can be pulled into other
"isolated" partitions.

The "isolcpus" partition is special in that there can be at most one
instance of it in a system. It serves as a pool for isolated CPUs and
cannot hold tasks or sub-cpusets underneath it. It is also not
cpu-exclusive, so the isolated CPUs can be distributed down the sibling
hierarchies, though those isolated CPUs will not be usable until the
partition type becomes "isolated".
Once isolated CPUs are needed in a cgroup, the administrator can write
a list of isolated CPUs into its "cpuset.cpus" and change its partition
type to "isolated" to pull in those isolated CPUs from the "isolcpus"
partition and use them in that cgroup. That will make the distribution
of isolated CPUs to the cgroups that need them much easier.

In the future, we may be able to extend this special "isolcpus"
partition type to support other isolation attributes like those that
can be specified with the "isolcpus" boot command line and related
options.

Waiman Long (5):
  cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE
    handling
  cgroup/cpuset: Add a new "isolcpus" paritition root state
  cgroup/cpuset: Make isolated partition pull CPUs from isolcpus
    partition
  cgroup/cpuset: Documentation update for the new "isolcpus" partition
  cgroup/cpuset: Extend test_cpuset_prs.sh to test isolcpus partition

 Documentation/admin-guide/cgroup-v2.rst        |  89 ++-
 kernel/cgroup/cpuset.c                         | 548 +++++++++++++++---
 .../selftests/cgroup/test_cpuset_prs.sh       | 376 ++++++++----
 3 files changed, 789 insertions(+), 224 deletions(-)

--
2.31.1

^ permalink raw reply [flat|nested] 32+ messages in thread
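[Editor's sketch] The workflow the cover letter describes can be sketched as a short shell session. Everything here is illustrative: the "isolcpus" partition type is only what this RFC proposes, the cgroup names are made up, and a scratch directory stands in for the cgroup v2 mount point so the sketch runs without root or a patched kernel:

```shell
# Scratch directory standing in for /sys/fs/cgroup (cgroup v2 root).
CGROOT=$(mktemp -d)

# 1. Create the single "isolcpus" partition that pools the isolated CPUs.
mkdir -p "$CGROOT/isolcpus"
echo "8-9"      > "$CGROOT/isolcpus/cpuset.cpus"
echo "isolcpus" > "$CGROOT/isolcpus/cpuset.cpus.partition"

# 2. A container that needs isolated CPUs lists them in its own
#    "cpuset.cpus" and switches its partition type to "isolated",
#    pulling the CPUs out of the "isolcpus" pool.
mkdir -p "$CGROOT/container1"
echo "8-9"      > "$CGROOT/container1/cpuset.cpus"
echo "isolated" > "$CGROOT/container1/cpuset.cpus.partition"
```

On a kernel carrying this series, the same writes would be issued as root against the real cgroup mount; no boot-time parameter is involved.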
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-04-12 15:37 [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition Waiman Long @ 2023-04-12 19:28 ` Tejun Heo [not found] ` <1ce6a073-e573-0c32-c3d8-f67f3d389a28@redhat.com> 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2023-04-12 19:28 UTC (permalink / raw) To: Waiman Long Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker Hello, Waiman. On Wed, Apr 12, 2023 at 11:37:53AM -0400, Waiman Long wrote: > This patch series introduces a new "isolcpus" partition type to the > existing list of {member, root, isolated} types. The primary reason > of adding this new "isolcpus" partition is to facilitate the > distribution of isolated CPUs down the cgroup v2 hierarchy. > > The other non-member partition types have the limitation that their > parents have to be valid partitions too. It will be hard to create a > partition a few layers down the hierarchy. > > It is relatively rare to have applications that require creation of > a separate scheduling domain (root). However, it is more common to > have applications that require the use of isolated CPUs (isolated), > e.g. DPDK. One can use the "isolcpus" or "nohz_full" boot command options > to get that statically. Of course, the "isolated" partition is another > way to achieve that dynamically. > > Modern container orchestration tools like Kubernetes use the cgroup > hierarchy to manage different containers. If a container needs to use > isolated CPUs, it is hard to get those with existing set of cpuset > partition types. With this patch series, a new "isolcpus" partition > can be created to hold a set of isolated CPUs that can be pull into > other "isolated" partitions. > > The "isolcpus" partition is special that there can have at most one > instance of this in a system. 
It serves as a pool for isolated CPUs > and cannot hold tasks or sub-cpusets underneath it. It is also not > cpu-exclusive so that the isolated CPUs can be distributed down the > sibling hierarchies, though those isolated CPUs will not be useable > until the partition type becomes "isolated". > > Once isolated CPUs are needed in a cgroup, the administrator can write > a list of isolated CPUs into its "cpuset.cpus" and change its partition > type to "isolated" to pull in those isolated CPUs from the "isolcpus" > partition and use them in that cgroup. That will make the distribution > of isolated CPUs to cgroups that need them much easier. I'm not sure about this. It feels really hacky in that it side-steps the distribution hierarchy completely. I can imagine a non-isolated cpuset wanting to allow isolated cpusets downstream but that should be done hierarchically - e.g. by allowing a cgroup to express what isolated cpus are allowed in the subtree. Also, can you give more details on the targeted use cases? Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
[parent not found: <1ce6a073-e573-0c32-c3d8-f67f3d389a28@redhat.com>]
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition [not found] ` <1ce6a073-e573-0c32-c3d8-f67f3d389a28@redhat.com> @ 2023-04-12 20:22 ` Tejun Heo 2023-04-12 20:33 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2023-04-12 20:22 UTC (permalink / raw) To: Waiman Long Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker Hello, Waiman. On Wed, Apr 12, 2023 at 03:52:36PM -0400, Waiman Long wrote: > There is still a distribution hierarchy as the list of isolation CPUs have > to be distributed down to the target cgroup through the hierarchy. For > example, > > cgroup root > +- isolcpus (cpus 8,9; isolcpus) > +- user.slice (cpus 1-9; ecpus 1-7; member) > +- user-x.slice (cpus 8,9; ecpus 8,9; isolated) > +- user-y.slice (cpus 1,2; ecpus 1,2; member) > > OTOH, I do agree that this can be somewhat hacky. That is why I post it as a > RFC to solicit feedback. Wouldn't it be possible to make it hierarchical by adding another cpumask to cpuset which lists the cpus which are allowed in the hierarchy but not used unless claimed by an isolated domain? Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-04-12 20:22 ` Tejun Heo
@ 2023-04-12 20:33 ` Waiman Long
  2023-04-13  0:03 ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2023-04-12 20:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
      linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
      Valentin Schneider, Frederic Weisbecker

On 4/12/23 16:22, Tejun Heo wrote:
> Hello, Waiman.
>
> On Wed, Apr 12, 2023 at 03:52:36PM -0400, Waiman Long wrote:
>> There is still a distribution hierarchy as the list of isolation CPUs have
>> to be distributed down to the target cgroup through the hierarchy. For
>> example,
>>
>> cgroup root
>>   +- isolcpus (cpus 8,9; isolcpus)
>>   +- user.slice (cpus 1-9; ecpus 1-7; member)
>>      +- user-x.slice (cpus 8,9; ecpus 8,9; isolated)
>>      +- user-y.slice (cpus 1,2; ecpus 1,2; member)
>>
>> OTOH, I do agree that this can be somewhat hacky. That is why I post it as a
>> RFC to solicit feedback.
> Wouldn't it be possible to make it hierarchical by adding another cpumask to
> cpuset which lists the cpus which are allowed in the hierarchy but not used
> unless claimed by an isolated domain?

I think we can. You mean having a new "cpuset.cpus.isolated" cgroupfs
file. So there will be one in the root cgroup that defines all the
isolated CPUs one can have. It is then distributed down the hierarchy
and can be claimed only if a cgroup becomes an "isolated" partition.
There will be a slight change in the semantics of an "isolated"
partition, but I doubt there will be many users out there.

If you are OK with this approach, I can modify my patch series to do
that.

Cheers,
Longman

^ permalink raw reply [flat|nested] 32+ messages in thread
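[Editor's sketch] Tejun's alternative, as Waiman restates it, can be sketched against the hierarchy quoted in the example above. The "cpuset.cpus.isolated" file is only an idea from this exchange, not an existing kernel interface, so plain files under a scratch directory stand in for real cgroupfs knobs:

```shell
# Scratch directory standing in for the cgroup v2 root.
CGROOT=$(mktemp -d)
mkdir -p "$CGROOT/user.slice/user-x.slice"

# The root cgroup defines the full pool of isolatable CPUs ...
echo "8-9" > "$CGROOT/cpuset.cpus.isolated"

# ... the pool is distributed down the hierarchy ...
echo "8-9" > "$CGROOT/user.slice/cpuset.cpus.isolated"

# ... and the CPUs are claimed only when a descendant cgroup actually
# becomes an "isolated" partition.
echo "8-9"      > "$CGROOT/user.slice/user-x.slice/cpuset.cpus"
echo "isolated" > "$CGROOT/user.slice/user-x.slice/cpuset.cpus.partition"
```

The point of the design is that no intermediate cgroup has to be a partition root; it only has to pass the isolated-CPU mask down.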
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-04-12 20:33 ` Waiman Long @ 2023-04-13 0:03 ` Tejun Heo 2023-04-13 0:26 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2023-04-13 0:03 UTC (permalink / raw) To: Waiman Long Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker Hello, On Wed, Apr 12, 2023 at 04:33:29PM -0400, Waiman Long wrote: > I think we can. You mean having a new "cpuset.cpus.isolated" cgroupfs file. > So there will be one in the root cgroup that defines all the isolated CPUs > one can have. It is then distributed down the hierarchy and can be claimed > only if a cgroup becomes an "isolated" partition. There will be a slight Yeah, that seems a lot more congruent with the typical pattern. > change in the semantics of an "isolated" partition, but I doubt there will > be much users out there. I haven't thought through it too hard but what prevents staying compatible with the current behavior? > If you are OK with this approach, I can modify my patch series to do that. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-04-13  0:03 ` Tejun Heo
@ 2023-04-13  0:26 ` Waiman Long
  2023-04-13  0:33 ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2023-04-13 0:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
      linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
      Valentin Schneider, Frederic Weisbecker

On 4/12/23 20:03, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 12, 2023 at 04:33:29PM -0400, Waiman Long wrote:
>> I think we can. You mean having a new "cpuset.cpus.isolated" cgroupfs file.
>> So there will be one in the root cgroup that defines all the isolated CPUs
>> one can have. It is then distributed down the hierarchy and can be claimed
>> only if a cgroup becomes an "isolated" partition. There will be a slight
> Yeah, that seems a lot more congruent with the typical pattern.
>
>> change in the semantics of an "isolated" partition, but I doubt there will
>> be much users out there.
> I haven't thought through it too hard but what prevents staying compatible
> with the current behavior?

It is possible to stay compatible with existing behavior. It is just
that a break from existing behavior will make the solution cleaner.

So the new behavior will be: if the "cpuset.cpus.isolated" isn't set,
the existing rules apply; if it is set, the new rules will be used.

Does that look reasonable to you?

Cheers,
Longman

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-04-13 0:26 ` Waiman Long @ 2023-04-13 0:33 ` Tejun Heo 2023-04-13 0:55 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2023-04-13 0:33 UTC (permalink / raw) To: Waiman Long Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker Hello, On Wed, Apr 12, 2023 at 08:26:03PM -0400, Waiman Long wrote: > If the "cpuset.cpus.isolated" isn't set, the existing rules applies. If it > is set, the new rule will be used. > > Does that look reasonable to you? Sounds a bit contrived. Does it need to be something defined in the root cgroup? The only thing that's needed is that a cgroup needs to claim CPUs exclusively without using them, right? Let's say we add a new interface file, say, cpuset.cpus.reserve which is always exclusive and can be consumed by children whichever way they want, wouldn't that be sufficient? Then, there would be nothing to describe in the root cgroup. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-04-13  0:33 ` Tejun Heo
@ 2023-04-13  0:55 ` Waiman Long
  2023-04-13  1:17 ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2023-04-13 0:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
      linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
      Valentin Schneider, Frederic Weisbecker

On 4/12/23 20:33, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 12, 2023 at 08:26:03PM -0400, Waiman Long wrote:
>> If the "cpuset.cpus.isolated" isn't set, the existing rules applies. If it
>> is set, the new rule will be used.
>>
>> Does that look reasonable to you?
> Sounds a bit contrived. Does it need to be something defined in the root
> cgroup?

Yes, because we need to take away the isolated CPUs from the effective
cpus of the root cgroup. So it needs to start from the root. That is
also why we have the partition rule that the parent of a partition has
to be a partition root itself. With the new scheme, we don't need a
special cgroup to hold the isolated CPUs. The new root cgroup file will
be enough to inform the system what CPUs will have to be isolated.

My current thinking is that the root's "cpuset.cpus.isolated" will
start with whatever has been set in the "isolcpus" or "nohz_full" boot
command line and can be extended from there, but not shrunk below that,
as there can be additional isolation attributes with those isolated
CPUs.

Cheers,
Longman

> The only thing that's needed is that a cgroup needs to claim CPUs
> exclusively without using them, right? Let's say we add a new interface
> file, say, cpuset.cpus.reserve which is always exclusive and can be consumed
> by children whichever way they want, wouldn't that be sufficient? Then,
> there would be nothing to describe in the root cgroup.
>
> Thanks.
>

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-04-13 0:55 ` Waiman Long @ 2023-04-13 1:17 ` Tejun Heo 2023-04-13 1:55 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2023-04-13 1:17 UTC (permalink / raw) To: Waiman Long Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker Hello, Waiman. On Wed, Apr 12, 2023 at 08:55:55PM -0400, Waiman Long wrote: > > Sounds a bit contrived. Does it need to be something defined in the root > > cgroup? > > Yes, because we need to take away the isolated CPUs from the effective cpus > of the root cgroup. So it needs to start from the root. That is also why we > have the partition rule that the parent of a partition has to be a partition > root itself. With the new scheme, we don't need a special cgroup to hold the I'm following. The root is already a partition root and the cgroupfs control knobs are owned by the parent, so the root cgroup would own the first level cgroups' cpuset.cpus.reserve knobs. If the root cgroup wants to assign some CPUs exclusively to a first level cgroup, it can then set that cgroup's reserve knob accordingly (or maybe the better name is cpuset.cpus.exclusive), which will take those CPUs out of the root cgroup's partition and give them to the first level cgroup. The first level cgroup then is free to do whatever with those CPUs that now belong exclusively to the cgroup subtree. > isolated CPUs. The new root cgroup file will be enough to inform the system > what CPUs will have to be isolated. > > My current thinking is that the root's "cpuset.cpus.isolated" will start > with whatever have been set in the "isolcpus" or "nohz_full" boot command > line and can be extended from there but not shrank below that as there can > be additional isolation attributes with those isolated CPUs. I'm not sure we wanna tie with those automatically. 
I think it'd be more confusing than helpful.

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 32+ messages in thread
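[Editor's sketch] Tejun's suggestion above can also be sketched as a shell session. The file names (cpuset.cpus.reserve, alternatively cpuset.cpus.exclusive) are proposals from this discussion only, so a scratch directory again stands in for the cgroup v2 mount and the sketch runs without root:

```shell
# Scratch directory standing in for /sys/fs/cgroup (cgroup v2 root).
CGROOT=$(mktemp -d)

# The parent owns a child's cgroupfs knobs, so it is the root cgroup
# that grants CPUs 8-9 exclusively to the first-level "user.slice",
# taking them out of the root partition.
mkdir -p "$CGROOT/user.slice"
echo "8-9" > "$CGROOT/user.slice/cpuset.cpus.reserve"

# The subtree is then free to consume the granted CPUs however it
# wants, e.g. by turning a child into an isolated partition.
mkdir -p "$CGROOT/user.slice/dpdk"
echo "8-9"      > "$CGROOT/user.slice/dpdk/cpuset.cpus"
echo "isolated" > "$CGROOT/user.slice/dpdk/cpuset.cpus.partition"
```

Note how nothing is described in the root cgroup itself here, which is the point Tejun is making.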
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-04-13 1:17 ` Tejun Heo @ 2023-04-13 1:55 ` Waiman Long 2023-04-14 1:22 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Waiman Long @ 2023-04-13 1:55 UTC (permalink / raw) To: Tejun Heo Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker On 4/12/23 21:17, Tejun Heo wrote: > Hello, Waiman. > > On Wed, Apr 12, 2023 at 08:55:55PM -0400, Waiman Long wrote: >>> Sounds a bit contrived. Does it need to be something defined in the root >>> cgroup? >> Yes, because we need to take away the isolated CPUs from the effective cpus >> of the root cgroup. So it needs to start from the root. That is also why we >> have the partition rule that the parent of a partition has to be a partition >> root itself. With the new scheme, we don't need a special cgroup to hold the > I'm following. The root is already a partition root and the cgroupfs control > knobs are owned by the parent, so the root cgroup would own the first level > cgroups' cpuset.cpus.reserve knobs. If the root cgroup wants to assign some > CPUs exclusively to a first level cgroup, it can then set that cgroup's > reserve knob accordingly (or maybe the better name is > cpuset.cpus.exclusive), which will take those CPUs out of the root cgroup's > partition and give them to the first level cgroup. The first level cgroup > then is free to do whatever with those CPUs that now belong exclusively to > the cgroup subtree. I am OK with the cpuset.cpus.reserve name, but not that much with the cpuset.cpus.exclusive name as it can get confused with cgroup v1's cpuset.cpu_exclusive. Of course, I prefer the cpuset.cpus.isolated name a bit more. Once an isolated CPU gets used in an isolated partition, it is exclusive and it can't be used in another isolated partition. 
Since we will allow users to set cpuset.cpus.reserve to whatever value they want. The distribution of isolated CPUs is only valid if the cpus are present in its parent's cpuset.cpus.reserve and all the way up to the root. It is a bit expensive, but it should be a relatively rare operation. > >> isolated CPUs. The new root cgroup file will be enough to inform the system >> what CPUs will have to be isolated. >> >> My current thinking is that the root's "cpuset.cpus.isolated" will start >> with whatever have been set in the "isolcpus" or "nohz_full" boot command >> line and can be extended from there but not shrank below that as there can >> be additional isolation attributes with those isolated CPUs. > I'm not sure we wanna tie with those automatically. I think it'd be > confusing than helpful. Yes, I am fine with taking this off for now. Cheers, Longman ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-04-13 1:55 ` Waiman Long @ 2023-04-14 1:22 ` Waiman Long 2023-04-14 16:54 ` Tejun Heo 0 siblings, 1 reply; 32+ messages in thread From: Waiman Long @ 2023-04-14 1:22 UTC (permalink / raw) To: Tejun Heo Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker On 4/12/23 21:55, Waiman Long wrote: > On 4/12/23 21:17, Tejun Heo wrote: >> Hello, Waiman. >> >> On Wed, Apr 12, 2023 at 08:55:55PM -0400, Waiman Long wrote: >>>> Sounds a bit contrived. Does it need to be something defined in the >>>> root >>>> cgroup? >>> Yes, because we need to take away the isolated CPUs from the >>> effective cpus >>> of the root cgroup. So it needs to start from the root. That is also >>> why we >>> have the partition rule that the parent of a partition has to be a >>> partition >>> root itself. With the new scheme, we don't need a special cgroup to >>> hold the >> I'm following. The root is already a partition root and the cgroupfs >> control >> knobs are owned by the parent, so the root cgroup would own the first >> level >> cgroups' cpuset.cpus.reserve knobs. If the root cgroup wants to >> assign some >> CPUs exclusively to a first level cgroup, it can then set that cgroup's >> reserve knob accordingly (or maybe the better name is >> cpuset.cpus.exclusive), which will take those CPUs out of the root >> cgroup's >> partition and give them to the first level cgroup. The first level >> cgroup >> then is free to do whatever with those CPUs that now belong >> exclusively to >> the cgroup subtree. > > I am OK with the cpuset.cpus.reserve name, but not that much with the > cpuset.cpus.exclusive name as it can get confused with cgroup v1's > cpuset.cpu_exclusive. Of course, I prefer the cpuset.cpus.isolated > name a bit more. 
Once an isolated CPU gets used in an isolated
> partition, it is exclusive and it can't be used in another isolated
> partition.
>
> Since we will allow users to set cpuset.cpus.reserve to whatever value
> they want. The distribution of isolated CPUs is only valid if the cpus
> are present in its parent's cpuset.cpus.reserve and all the way up to
> the root. It is a bit expensive, but it should be a relatively rare
> operation.

I now have a slightly different idea of how to do that. We already have
an internal cpumask for partitioning - subparts_cpus. I am thinking
about exposing it as cpuset.cpus.reserve. The current way of creating
subpartitions will be called automatic reservation and will require a
direct parent/child partition relationship. But as soon as a user
writes anything to it, it will break automatic reservation and require
manual reservation going forward.

In that way, we can keep the old behavior, but also support new use
cases. I am going to work on that.

Cheers,
Longman

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-04-14 1:22 ` Waiman Long @ 2023-04-14 16:54 ` Tejun Heo 2023-04-14 17:29 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2023-04-14 16:54 UTC (permalink / raw) To: Waiman Long Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote: > I now have a slightly different idea of how to do that. We already have an > internal cpumask for partitioning - subparts_cpus. I am thinking about > exposing it as cpuset.cpus.reserve. The current way of creating > subpartitions will be called automatic reservation and require a direct > parent/child partition relationship. But as soon as a user write anything to > it, it will break automatic reservation and require manual reservation going > forward. > > In that way, we can keep the old behavior, but also support new use cases. I > am going to work on that. I'm not sure I fully understand the proposed behavior but it does sound more quirky. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition
  2023-04-14 16:54 ` Tejun Heo
@ 2023-04-14 17:29 ` Waiman Long
  2023-04-14 17:34 ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2023-04-14 17:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
      linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli,
      Valentin Schneider, Frederic Weisbecker

On 4/14/23 12:54, Tejun Heo wrote:
> On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
>> I now have a slightly different idea of how to do that. We already have an
>> internal cpumask for partitioning - subparts_cpus. I am thinking about
>> exposing it as cpuset.cpus.reserve. The current way of creating
>> subpartitions will be called automatic reservation and require a direct
>> parent/child partition relationship. But as soon as a user write anything to
>> it, it will break automatic reservation and require manual reservation going
>> forward.
>>
>> In that way, we can keep the old behavior, but also support new use cases. I
>> am going to work on that.
> I'm not sure I fully understand the proposed behavior but it does sound more
> quirky.

The idea is to use the existing subparts_cpus for cpu reservation
instead of adding a new cpumask for that purpose. The current way of
partition creation does cpus reservation (setting subparts_cpus)
automatically, with the constraint that the parent of a partition must
be a partition root itself. One way to relax this constraint is to
allow a new manual reservation mode where users can set reserve cpus
manually and distribute them down the hierarchy before activating a
partition to use those cpus.

Now the question is how to enable this new manual reservation mode. One
way to do it is to enable it whenever the new cpuset.cpus.reserve file
is modified. Alternatively, we may enable it via a cgroupfs mount
option or a boot command line option.

Hope this clears up the confusion.
Cheers, Longman ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-04-14 17:29 ` Waiman Long @ 2023-04-14 17:34 ` Tejun Heo 2023-04-14 17:38 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2023-04-14 17:34 UTC (permalink / raw) To: Waiman Long Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker On Fri, Apr 14, 2023 at 01:29:25PM -0400, Waiman Long wrote: > > On 4/14/23 12:54, Tejun Heo wrote: > > On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote: > > > I now have a slightly different idea of how to do that. We already have an > > > internal cpumask for partitioning - subparts_cpus. I am thinking about > > > exposing it as cpuset.cpus.reserve. The current way of creating > > > subpartitions will be called automatic reservation and require a direct > > > parent/child partition relationship. But as soon as a user write anything to > > > it, it will break automatic reservation and require manual reservation going > > > forward. > > > > > > In that way, we can keep the old behavior, but also support new use cases. I > > > am going to work on that. > > I'm not sure I fully understand the proposed behavior but it does sound more > > quirky. > > The idea is to use the existing subparts_cpus for cpu reservation instead of > adding a new cpumask for that purpose. The current way of partition creation > does cpus reservation (setting subparts_cpus) automatically with the > constraint that the parent of a partition must be a partition root itself. > One way to relax this constraint is to allow a new manual reservation mode > where users can set reserve cpus manually and distribute them down the > hierarchy before activating a partition to use those cpus. > > Now the question is how to enable this new manual reservation mode. One way > to do it is to enable it whenever the new cpuset.cpus.reserve file is > modified. 
Alternatively, we may enable it by a cgroupfs mount option or a > boot command line option. It'd probably be best if we can keep the behavior within cgroupfs if possible. Would you mind writing up the documentation section describing the behavior beforehand? I think things would be clearer if we look at it from the interface documentation side. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-04-14 17:34 ` Tejun Heo @ 2023-04-14 17:38 ` Waiman Long 2023-04-14 19:06 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Waiman Long @ 2023-04-14 17:38 UTC (permalink / raw) To: Tejun Heo Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker On 4/14/23 13:34, Tejun Heo wrote: > On Fri, Apr 14, 2023 at 01:29:25PM -0400, Waiman Long wrote: >> On 4/14/23 12:54, Tejun Heo wrote: >>> On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote: >>>> I now have a slightly different idea of how to do that. We already have an >>>> internal cpumask for partitioning - subparts_cpus. I am thinking about >>>> exposing it as cpuset.cpus.reserve. The current way of creating >>>> subpartitions will be called automatic reservation and require a direct >>>> parent/child partition relationship. But as soon as a user write anything to >>>> it, it will break automatic reservation and require manual reservation going >>>> forward. >>>> >>>> In that way, we can keep the old behavior, but also support new use cases. I >>>> am going to work on that. >>> I'm not sure I fully understand the proposed behavior but it does sound more >>> quirky. >> The idea is to use the existing subparts_cpus for cpu reservation instead of >> adding a new cpumask for that purpose. The current way of partition creation >> does cpus reservation (setting subparts_cpus) automatically with the >> constraint that the parent of a partition must be a partition root itself. >> One way to relax this constraint is to allow a new manual reservation mode >> where users can set reserve cpus manually and distribute them down the >> hierarchy before activating a partition to use those cpus. >> >> Now the question is how to enable this new manual reservation mode. 
One way >> to do it is to enable it whenever the new cpuset.cpus.reserve file is >> modified. Alternatively, we may enable it by a cgroupfs mount option or a >> boot command line option. > It'd probably be best if we can keep the behavior within cgroupfs if > possible. Would you mind writing up the documentation section describing the > behavior beforehand? I think things would be clearer if we look at it from > the interface documentation side. Sure, will do that. I need some time and so it will be early next week. Cheers, Longman ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-04-14 17:38 ` Waiman Long @ 2023-04-14 19:06 ` Waiman Long 2023-05-02 18:01 ` Michal Koutný 0 siblings, 1 reply; 32+ messages in thread From: Waiman Long @ 2023-04-14 19:06 UTC (permalink / raw) To: Tejun Heo Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker On 4/14/23 13:38, Waiman Long wrote: > On 4/14/23 13:34, Tejun Heo wrote: >> On Fri, Apr 14, 2023 at 01:29:25PM -0400, Waiman Long wrote: >>> On 4/14/23 12:54, Tejun Heo wrote: >>>> On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote: >>>>> I now have a slightly different idea of how to do that. We already >>>>> have an >>>>> internal cpumask for partitioning - subparts_cpus. I am thinking >>>>> about >>>>> exposing it as cpuset.cpus.reserve. The current way of creating >>>>> subpartitions will be called automatic reservation and require a >>>>> direct >>>>> parent/child partition relationship. But as soon as a user write >>>>> anything to >>>>> it, it will break automatic reservation and require manual >>>>> reservation going >>>>> forward. >>>>> >>>>> In that way, we can keep the old behavior, but also support new >>>>> use cases. I >>>>> am going to work on that. >>>> I'm not sure I fully understand the proposed behavior but it does >>>> sound more >>>> quirky. >>> The idea is to use the existing subparts_cpus for cpu reservation >>> instead of >>> adding a new cpumask for that purpose. The current way of partition >>> creation >>> does cpus reservation (setting subparts_cpus) automatically with the >>> constraint that the parent of a partition must be a partition root >>> itself. >>> One way to relax this constraint is to allow a new manual >>> reservation mode >>> where users can set reserve cpus manually and distribute them down the >>> hierarchy before activating a partition to use those cpus. 
>>> >>> Now the question is how to enable this new manual reservation mode. >>> One way >>> to do it is to enable it whenever the new cpuset.cpus.reserve file is >>> modified. Alternatively, we may enable it by a cgroupfs mount option >>> or a >>> boot command line option. >> It'd probably be best if we can keep the behavior within cgroupfs if >> possible. Would you mind writing up the documentation section >> describing the >> behavior beforehand? I think things would be clearer if we look at it >> from >> the interface documentation side. > > Sure, will do that. I need some time and so it will be early next week. Just kidding :-) Below is a draft of the new cpuset.cpus.reserve cgroupfs file: cpuset.cpus.reserve A read-write multiple values file which exists on all cpuset-enabled cgroups. It lists the reserved CPUs to be used for the creation of child partitions. See the section on "cpuset.cpus.partition" below for more information on cpuset partitions. These reserved CPUs should be a subset of "cpuset.cpus" and will be mutually exclusive with "cpuset.cpus.effective" when used, since these reserved CPUs cannot be used by tasks in the current cgroup. There are two modes of partition CPU reservation - auto and manual. The system starts up in auto mode, where "cpuset.cpus.reserve" will be set automatically when valid child partitions are created and users don't need to touch the file at all. This mode has the limitation that the parent of a partition must be a partition root itself. So child partitions have to be created one by one from the cgroup root down. To enable the creation of a partition further down in the hierarchy without the intermediate cgroups being partition roots, one has to turn on the manual reservation mode by writing directly to "cpuset.cpus.reserve" with a value different from its current value.
By distributing the reserve CPUs down the cgroup hierarchy to the parent of the target cgroup, this target cgroup can be switched to become a partition root if its "cpuset.cpus" is a subset of the set of valid reserve CPUs in its parent. The set of valid reserve CPUs is the set of CPUs that are present in all its ancestors' "cpuset.cpus.reserve" up to the cgroup root and that have not been allocated to another valid partition yet. Once manual reservation mode is enabled, a cgroup administrator must always set up the "cpuset.cpus.reserve" files properly before a valid partition can be created. So this mode has more administrative overhead but offers greater flexibility. Cheers, Longman ^ permalink raw reply [flat|nested] 32+ messages in thread
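To make the manual flow in this draft concrete, here is a sketch of the sequence of writes it describes. Everything in it is an assumption taken from the proposal: the hierarchy (p/c), the CPU numbers, and the "cpuset.cpus.reserve" file itself, so the commands are staged in a scratch directory standing in for the cgroup v2 mount point (/sys/fs/cgroup) rather than run against a live kernel.

```shell
# Scratch directory standing in for /sys/fs/cgroup; on a kernel carrying
# this proposal these would be real cgroupfs files.
CG=$(mktemp -d)
mkdir -p "$CG/p/c"

# Writing cpuset.cpus.reserve directly switches from auto to manual mode.
# Reserve CPUs 2-5 at the root, then distribute them down to the parent
# of the target cgroup.
echo 2-5 > "$CG/cpuset.cpus.reserve"
echo 2-5 > "$CG/p/cpuset.cpus.reserve"

# The target cgroup can now become a partition root, since its
# cpuset.cpus is a subset of the valid reserve CPUs in its parent.
echo 2-3 > "$CG/p/c/cpuset.cpus"
echo root > "$CG/p/c/cpuset.cpus.partition"
```

On a real system the last write would either succeed or leave the partition invalid, depending on whether every ancestor's "cpuset.cpus.reserve" covers CPUs 2-3.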
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-04-14 19:06 ` Waiman Long @ 2023-05-02 18:01 ` Michal Koutný 2023-05-02 21:26 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Michal Koutný @ 2023-05-02 18:01 UTC (permalink / raw) To: Waiman Long Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker Hello. The previous thread arrived incomplete to me, so I respond to the last message only. Point me to a message URL if it was covered. On Fri, Apr 14, 2023 at 03:06:27PM -0400, Waiman Long <longman@redhat.com> wrote: > Below is a draft of the new cpuset.cpus.reserve cgroupfs file: > > cpuset.cpus.reserve > A read-write multiple values file which exists on all > cpuset-enabled cgroups. > > It lists the reserved CPUs to be used for the creation of > child partitions. See the section on "cpuset.cpus.partition" > below for more information on cpuset partition. These reserved > CPUs should be a subset of "cpuset.cpus" and will be mutually > exclusive of "cpuset.cpus.effective" when used since these > reserved CPUs cannot be used by tasks in the current cgroup. > > There are two modes for partition CPUs reservation - > auto or manual. The system starts up in auto mode where > "cpuset.cpus.reserve" will be set automatically when valid > child partitions are created and users don't need to touch the > file at all. This mode has the limitation that the parent of a > partition must be a partition root itself. So child partition > has to be created one-by-one from the cgroup root down. > > To enable the creation of a partition down in the hierarchy > without the intermediate cgroups to be partition roots, Why would this be needed? Owning a CPU (a resource) must logically be passed all the way from root to the target cgroup, i.e.
this is expressed by valid partitioning down to the given level. > one > has to turn on the manual reservation mode by writing directly > to "cpuset.cpus.reserve" with a value different from its > current value. By distributing the reserve CPUs down the cgroup > hierarchy to the parent of the target cgroup, this target cgroup > can be switched to become a partition root if its "cpuset.cpus" > is a subset of the set of valid reserve CPUs in its parent. level n `- level n+1 cpuset.cpus // these are actually configured by "owner" of level n cpuset.cpus.partition // similarly here, level n decides if child is a partition I.e. what would be level n/cpuset.cpus.reserve good for when it can directly control level n+1/cpuset.cpus? Thanks, Michal ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-05-02 18:01 ` Michal Koutný @ 2023-05-02 21:26 ` Waiman Long 2023-05-02 22:27 ` Michal Koutný 0 siblings, 1 reply; 32+ messages in thread From: Waiman Long @ 2023-05-02 21:26 UTC (permalink / raw) To: Michal Koutný Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker On 5/2/23 14:01, Michal Koutný wrote: > Hello. > > The previous thread arrived incomplete to me, so I respond to the last > message only. Point me to a message URL if it was covered. > > On Fri, Apr 14, 2023 at 03:06:27PM -0400, Waiman Long <longman@redhat.com> wrote: >> Below is a draft of the new cpuset.cpus.reserve cgroupfs file: >> >> cpuset.cpus.reserve >> A read-write multiple values file which exists on all >> cpuset-enabled cgroups. >> >> It lists the reserved CPUs to be used for the creation of >> child partitions. See the section on "cpuset.cpus.partition" >> below for more information on cpuset partition. These reserved >> CPUs should be a subset of "cpuset.cpus" and will be mutually >> exclusive of "cpuset.cpus.effective" when used since these >> reserved CPUs cannot be used by tasks in the current cgroup. >> >> There are two modes for partition CPUs reservation - >> auto or manual. The system starts up in auto mode where >> "cpuset.cpus.reserve" will be set automatically when valid >> child partitions are created and users don't need to touch the >> file at all. This mode has the limitation that the parent of a >> partition must be a partition root itself. So child partition >> has to be created one-by-one from the cgroup root down. >> >> To enable the creation of a partition down in the hierarchy >> without the intermediate cgroups to be partition roots, > Why would be this needed? Owning a CPU (a resource) must logically be > passed all the way from root to the target cgroup, i.e. 
this is > expressed by valid partitioning down to given level. > >> one >> has to turn on the manual reservation mode by writing directly >> to "cpuset.cpus.reserve" with a value different from its >> current value. By distributing the reserve CPUs down the cgroup >> hierarchy to the parent of the target cgroup, this target cgroup >> can be switched to become a partition root if its "cpuset.cpus" >> is a subset of the set of valid reserve CPUs in its parent. > level n > `- level n+1 > cpuset.cpus // these are actually configured by "owner" of level n > cpuset.cpus.partition // similarly here, level n decides if child is a partition > > I.e. what would be level n/cpuset.cpus.reserve good for when it can > directly control level n+1/cpuset.cpus? In the new scheme, the available cpus are still directly passed down to a descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated to a partition) have to be exclusive. So what cpuset.cpus.reserve does is identify those exclusive CPUs that can be excluded from the effective_cpus of the parent cgroups before they are claimed by a child partition. Currently this is done automatically when a child partition is created off a parent partition root. The new scheme will break it into 2 separate steps without the requirement that the parent of a partition has to be a partition root itself. Cheers, Longman ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-05-02 21:26 ` Waiman Long @ 2023-05-02 22:27 ` Michal Koutný 2023-05-04 3:01 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Michal Koutný @ 2023-05-02 22:27 UTC (permalink / raw) To: Waiman Long Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long <longman@redhat.com> wrote: > In the new scheme, the available cpus are still directly passed down to a > descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated > to a partition) have to be exclusive. So what the cpuset.cpus.reserve does > is to identify those exclusive CPUs that can be excluded from the > effective_cpus of the parent cgroups before they are claimed by a child > partition. Currently this is done automatically when a child partition is > created off a parent partition root. The new scheme will break it into 2 > separate steps without the requirement that the parent of a partition has to > be a partition root itself. new scheme 1st step: echo C >p/cpuset.cpus.reserve # p/cpuset.cpus.effective == A-C (1) 2nd step (claim): echo C' >p/c/cpuset.cpus # C'⊆C echo root >p/c/cpuset.cpus.partition current scheme 1st step (configure): echo C >p/c/cpuset.cpus 2nd step (reserve & claim): echo root >p/c/cpuset.cpus.partition # p/cpuset.cpus.effective == A-C (2) As long as p/c is unpopulated, (1) and (2) are equal situations. Why is the (different) two step procedure needed? Also the relaxation of requirement of a parent being a partition confuses me -- if the parent is not a partition, i.e. it has no exclusive ownership of CPUs but it can still "give" it to children -- is child partition meant to be exclusive? (IOW can parent siblings reserve some same CPUs?) Thanks, Michal ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-05-02 22:27 ` Michal Koutný @ 2023-05-04 3:01 ` Waiman Long 2023-05-05 16:03 ` Tejun Heo 0 siblings, 1 reply; 32+ messages in thread From: Waiman Long @ 2023-05-04 3:01 UTC (permalink / raw) To: Michal Koutný Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker On 5/2/23 18:27, Michal Koutný wrote: > On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long <longman@redhat.com> wrote: >> In the new scheme, the available cpus are still directly passed down to a >> descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated >> to a partition) have to be exclusive. So what the cpuset.cpus.reserve does >> is to identify those exclusive CPUs that can be excluded from the >> effective_cpus of the parent cgroups before they are claimed by a child >> partition. Currently this is done automatically when a child partition is >> created off a parent partition root. The new scheme will break it into 2 >> separate steps without the requirement that the parent of a partition has to >> be a partition root itself. > new scheme > 1st step: > echo C >p/cpuset.cpus.reserve > # p/cpuset.cpus.effective == A-C (1) > 2nd step (claim): > echo C' >p/c/cpuset.cpus # C'⊆C > echo root >p/c/cpuset.cpus.partition It is something like that. However, the current scheme of automatic reservation is also supported, i.e. cpuset.cpus.reserve will be set automatically when the child cgroup becomes a valid partition as long as the cpuset.cpus.reserve file is not written to. This is for backward compatibility. Once it is written to, automatic mode will end and users have to manually set it afterward. 
> > current scheme > 1st step (configure): > echo C >p/c/cpuset.cpus > 2nd step (reserve & claim): > echo root >p/c/cpuset.cpus.partition > # p/cpuset.cpus.effective == A-C (2) > > As long as p/c is unpopulated, (1) and (2) are equal situations. > Why is the (different) two step procedure needed? > > Also the relaxation of requirement of a parent being a partition > confuses me -- if the parent is not a partition, i.e. it has no > exclusive ownership of CPUs but it can still "give" it to children -- is > child partition meant to be exclusive? (IOW can parent siblings reserve > some same CPUs?) A valid partition root has exclusive ownership of its CPUs. That is a rule that won't be changed. As a result, an incoming partition root cannot claim CPUs that have been allocated to another partition. To simplify things, transition to a valid partition root is not possible if any of the CPUs in its cpuset.cpus are not in the cpuset.cpus.reserve of its ancestors or have been allocated to another partition. The partition root simply becomes invalid. The parent can virtually give the reserved CPUs from the root down the hierarchy, and a child can claim them once it becomes a partition root. In manual mode, we need to check all the way up the hierarchy to the root to figure out which CPUs in cpuset.cpus.reserve are valid. It has higher overhead, but enabling a partition is not a fast operation anyway. Cheers, Longman ^ permalink raw reply [flat|nested] 32+ messages in thread
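The ancestor check described above can be sketched as follows. This is illustrative only: the hierarchy (a/b), the CPU numbers, and the proposed "cpuset.cpus.reserve" file are assumptions, so the writes are staged in a scratch directory standing in for /sys/fs/cgroup.

```shell
# Scratch directory standing in for the cgroup v2 mount point.
CG=$(mktemp -d)
mkdir -p "$CG/a/b"        # "a" is NOT a partition root in this sketch

# In manual mode, CPU 3 is a valid reserve CPU for a/b only if it appears
# in every ancestor's cpuset.cpus.reserve (checked all the way up to the
# root) and has not been allocated to another partition.
echo 3 > "$CG/cpuset.cpus.reserve"
echo 3 > "$CG/a/cpuset.cpus.reserve"

# a/b can then claim CPU 3; had any ancestor's reserve omitted it, the
# partition root would simply become invalid on this write.
echo 3 > "$CG/a/b/cpuset.cpus"
echo root > "$CG/a/b/cpuset.cpus.partition"
```

Note that the intermediate cgroup "a" never becomes a partition root itself; it only passes the reserved CPU down, which is exactly the constraint this proposal relaxes.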
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-05-04 3:01 ` Waiman Long @ 2023-05-05 16:03 ` Tejun Heo 2023-05-05 16:25 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2023-05-05 16:03 UTC (permalink / raw) To: Waiman Long Cc: Michal Koutný, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker On Wed, May 03, 2023 at 11:01:36PM -0400, Waiman Long wrote: > > On 5/2/23 18:27, Michal Koutný wrote: > > On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long <longman@redhat.com> wrote: > > > In the new scheme, the available cpus are still directly passed down to a > > > descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated > > > to a partition) have to be exclusive. So what the cpuset.cpus.reserve does > > > is to identify those exclusive CPUs that can be excluded from the > > > effective_cpus of the parent cgroups before they are claimed by a child > > > partition. Currently this is done automatically when a child partition is > > > created off a parent partition root. The new scheme will break it into 2 > > > separate steps without the requirement that the parent of a partition has to > > > be a partition root itself. > > new scheme > > 1st step: > > echo C >p/cpuset.cpus.reserve > > # p/cpuset.cpus.effective == A-C (1) > > 2nd step (claim): > > echo C' >p/c/cpuset.cpus # C'⊆C > > echo root >p/c/cpuset.cpus.partition > > It is something like that. However, the current scheme of automatic > reservation is also supported, i.e. cpuset.cpus.reserve will be set > automatically when the child cgroup becomes a valid partition as long as the > cpuset.cpus.reserve file is not written to. This is for backward > compatibility. > > Once it is written to, automatic mode will end and users have to manually > set it afterward. I really don't like the implicit switching behavior. 
This is interface behavior modifying internal state that userspace can't view or control directly. Regardless of how the rest of the discussion develops, this part should be improved (e.g. would it work to always try to auto-reserve if the cpu isn't already reserved?). Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-05-05 16:03 ` Tejun Heo @ 2023-05-05 16:25 ` Waiman Long 2023-05-08 1:03 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Waiman Long @ 2023-05-05 16:25 UTC (permalink / raw) To: Tejun Heo Cc: Michal Koutný, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker On 5/5/23 12:03, Tejun Heo wrote: > On Wed, May 03, 2023 at 11:01:36PM -0400, Waiman Long wrote: >> On 5/2/23 18:27, Michal Koutný wrote: >>> On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long <longman@redhat.com> wrote: >>>> In the new scheme, the available cpus are still directly passed down to a >>>> descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated >>>> to a partition) have to be exclusive. So what the cpuset.cpus.reserve does >>>> is to identify those exclusive CPUs that can be excluded from the >>>> effective_cpus of the parent cgroups before they are claimed by a child >>>> partition. Currently this is done automatically when a child partition is >>>> created off a parent partition root. The new scheme will break it into 2 >>>> separate steps without the requirement that the parent of a partition has to >>>> be a partition root itself. >>> new scheme >>> 1st step: >>> echo C >p/cpuset.cpus.reserve >>> # p/cpuset.cpus.effective == A-C (1) >>> 2nd step (claim): >>> echo C' >p/c/cpuset.cpus # C'⊆C >>> echo root >p/c/cpuset.cpus.partition >> It is something like that. However, the current scheme of automatic >> reservation is also supported, i.e. cpuset.cpus.reserve will be set >> automatically when the child cgroup becomes a valid partition as long as the >> cpuset.cpus.reserve file is not written to. This is for backward >> compatibility. >> >> Once it is written to, automatic mode will end and users have to manually >> set it afterward. 
> I really don't like the implicit switching behavior. This is interface > behavior modifying internal state that userspace can't view or control > directly. Regardless of how the rest of the discussion develops, this part > should be improved (e.g. would it work to always try to auto-reserve if the > cpu isn't already reserved?). After some more thought yesterday, I have a slight change in my design: auto-reserve, as it works now, will stay for partitions that have a partition root parent. For a remote partition that doesn't have a partition root parent, its creation will require pre-allocating additional CPUs into top_cpuset's cpuset.cpus.reserve first. So there will be no change in behavior for existing use cases, whether a remote partition is created or not. Cheers, Longman ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-05-05 16:25 ` Waiman Long @ 2023-05-08 1:03 ` Waiman Long 2023-05-22 19:49 ` Tejun Heo 0 siblings, 1 reply; 32+ messages in thread From: Waiman Long @ 2023-05-08 1:03 UTC (permalink / raw) To: Tejun Heo Cc: Michal Koutný, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker, Mrunal Patel, Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld Hi, The following is the proposed text for "cpuset.cpus.reserve" and "cpuset.cpus.partition" of the new cpuset partition in Documentation/admin-guide/cgroup-v2.rst. cpuset.cpus.reserve A read-write multiple values file which exists only on the root cgroup. It lists all the CPUs that are reserved for adjacent and remote partitions created in the system. See the next section for more information on what adjacent and remote partitions are. Creation of an adjacent partition does not require touching this control file as CPU reservation will be done automatically. In order to create a remote partition, the CPUs needed by the remote partition have to be written to this file first. A "+" prefix can be used to indicate a list of additional CPUs that are to be added without disturbing the CPUs that are originally there. For example, if its current value is "3-4", echoing "+5" to it will change it to "3-5". Once a remote partition is destroyed, its CPUs have to be removed from this file, or no other process can use them. A "-" prefix can be used to remove a list of CPUs from it. However, removing CPUs that are currently used in existing partitions may cause those partitions to become invalid. A single "-" character without any number can be used to indicate removal of all the free CPUs not allocated to any partitions, to avoid accidental partition invalidation. cpuset.cpus.partition A read-write single value file which exists on non-root cpuset-enabled cgroups.
This flag is owned by the parent cgroup and is not delegatable. It accepts only the following input values when written to. ========== ===================================== "member" Non-root member of a partition "root" Partition root "isolated" Partition root without load balancing ========== ===================================== A cpuset partition is a collection of cgroups with a partition root at the top of the hierarchy and its descendants except those that are separate partition roots themselves and their descendants. A partition has exclusive access to the set of CPUs allocated to it. Other cgroups outside of that partition cannot use any CPUs in that set. There are two types of partitions - adjacent and remote. The parent of an adjacent partition must be a valid partition root. Partition roots of adjacent partitions are all clustered around the root cgroup. Creation of an adjacent partition is done by writing the desired partition type into "cpuset.cpus.partition". A remote partition does not require a partition root parent. So a remote partition can be formed far from the root cgroup. However, its creation is a 2-step process. The CPUs needed by a remote partition ("cpuset.cpus" of the partition root) have to be written into "cpuset.cpus.reserve" of the root cgroup first. After that, "isolated" can be written into "cpuset.cpus.partition" of the partition root to form a remote isolated partition, which is the only supported remote partition type for now. All remote partitions are terminal, as adjacent partitions cannot be created underneath them. The root cgroup is always a partition root and its state cannot be changed. All other non-root cgroups start out as "member". When set to "root", the current cgroup is the root of a new partition or scheduling domain. When set to "isolated", the CPUs in that partition will be in an isolated state without any load balancing from the scheduler.
Tasks placed in such a partition with multiple CPUs should be carefully distributed and bound to each of the individual CPUs for optimal performance. The value shown in "cpuset.cpus.effective" of a partition root is the CPUs that are dedicated to that partition and not available to cgroups outside of that partition. A partition root ("root" or "isolated") can be in one of two possible states - valid or invalid. An invalid partition root is in a degraded state where some state information may be retained, but behaves more like a "member". All possible state transitions among "member", "root" and "isolated" are allowed. On read, the "cpuset.cpus.partition" file can show the following values. ============================= ===================================== "member" Non-root member of a partition "root" Partition root "isolated" Partition root without load balancing "root invalid (<reason>)" Invalid partition root "isolated invalid (<reason>)" Invalid isolated partition root ============================= ===================================== In the case of an invalid partition root, a descriptive string on why the partition is invalid is included within parentheses. For an adjacent partition root to be valid, the following conditions must be met. 1) The "cpuset.cpus" is exclusive with its siblings, i.e. they are not shared by any of its siblings (exclusivity rule). 2) The parent cgroup is a valid partition root. 3) The "cpuset.cpus" is not empty and must contain at least one of the CPUs from the parent's "cpuset.cpus", i.e. they overlap. 4) The "cpuset.cpus.effective" cannot be empty unless there is no task associated with this partition. For a remote partition root to be valid, the following conditions must be met. 1) The same exclusivity rule as an adjacent partition root. 2) The "cpuset.cpus" is not empty and all the CPUs must be present in "cpuset.cpus.reserve" of the root cgroup and none of them are allocated to another partition.
3) The "cpuset.cpus" value must be present in all its ancestors to ensure proper hierarchical cpu distribution. External events like hotplug or changes to "cpuset.cpus" can cause a valid partition root to become invalid and vice versa. Note that a task cannot be moved to a cgroup with empty "cpuset.cpus.effective". For a valid partition root with the sibling cpu exclusivity rule enabled, changes made to "cpuset.cpus" that violate the exclusivity rule will invalidate the partition as well as its sibling partitions with conflicting cpuset.cpus values. So care must be taken when changing "cpuset.cpus". A valid non-root parent partition may distribute out all its CPUs to its child partitions when there is no task associated with it. Care must be taken when changing a valid partition root to "member", as all its child partitions, if present, will become invalid, causing disruption to tasks running in those child partitions. These inactivated partitions could be recovered if their parent is switched back to a partition root with a proper set of "cpuset.cpus". Poll and inotify events are triggered whenever the state of "cpuset.cpus.partition" changes. That includes changes caused by a write to "cpuset.cpus.partition", cpu hotplug or other changes that modify the validity status of the partition. This will allow user space agents to monitor unexpected changes to "cpuset.cpus.partition" without the need to do continuous polling. Cheers, Longman ^ permalink raw reply [flat|nested] 32+ messages in thread
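Putting the proposed 2-step process together, creating and tearing down a remote isolated partition might look like the sketch below. The hierarchy (a/b/c) and CPU numbers are made up, and the commands are staged in a scratch directory standing in for /sys/fs/cgroup, since "cpuset.cpus.reserve" is a proposed interface rather than an existing one.

```shell
# Scratch directory standing in for the cgroup v2 mount point.
CG=$(mktemp -d)
mkdir -p "$CG/a/b/c"      # target cgroup far from the root

# Step 1: reserve the CPUs needed by the remote partition in the root
# cgroup. On a real kernel, a "+" prefix (e.g. "+5") would add CPUs
# without disturbing ones already reserved for adjacent partitions.
echo 5-6 > "$CG/cpuset.cpus.reserve"

# Step 2: write the same CPUs into the target's cpuset.cpus and turn it
# into a remote isolated partition, the only remote type supported so far.
echo 5-6 > "$CG/a/b/c/cpuset.cpus"
echo isolated > "$CG/a/b/c/cpuset.cpus.partition"

# Tear-down runs in reverse: drop the partition first, then return the
# CPUs with a "-" prefix, e.g. echo "-5-6" into cpuset.cpus.reserve.
echo member > "$CG/a/b/c/cpuset.cpus.partition"
```

Note how the validity conditions above show up here: the CPUs written to a/b/c's "cpuset.cpus" must all be present in the root's "cpuset.cpus.reserve" and unallocated, or the write of "isolated" would leave the partition in an invalid state.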
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-05-08 1:03 ` Waiman Long @ 2023-05-22 19:49 ` Tejun Heo 2023-05-28 21:18 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2023-05-22 19:49 UTC (permalink / raw) To: Waiman Long Cc: Michal Koutný, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker, Mrunal Patel, Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld Hello, Waiman. On Sun, May 07, 2023 at 09:03:44PM -0400, Waiman Long wrote: ... > cpuset.cpus.reserve > A read-write multiple values file which exists only on root > cgroup. > > It lists all the CPUs that are reserved for adjacent and remote > partitions created in the system. See the next section for > more information on what an adjacent or remote partitions is. > > Creation of adjacent partition does not require touching this > control file as CPU reservation will be done automatically. > In order to create a remote partition, the CPUs needed by the > remote partition has to be written to this file first. > > A "+" prefix can be used to indicate a list of additional > CPUs that are to be added without disturbing the CPUs that are > originally there. For example, if its current value is "3-4", > echoing ""+5" to it will change it to "3-5". > > Once a remote partition is destroyed, its CPUs have to be > removed from this file or no other process can use them. A "-" > prefix can be used to remove a list of CPUs from it. However, > removing CPUs that are currently used in existing partitions > may cause those partitions to become invalid. A single "-" > character without any number can be used to indicate removal > of all the free CPUs not allocated to any partitions to avoid > accidental partition invalidation. Why is the syntax different from .cpus? Wouldn't it be better to keep them the same? 
> cpuset.cpus.partition > A read-write single value file which exists on non-root > cpuset-enabled cgroups. This flag is owned by the parent cgroup > and is not delegatable. > > It accepts only the following input values when written to. > > ========== ===================================== > "member" Non-root member of a partition > "root" Partition root > "isolated" Partition root without load balancing > ========== ===================================== > > A cpuset partition is a collection of cgroups with a partition > root at the top of the hierarchy and its descendants except > those that are separate partition roots themselves and their > descendants. A partition has exclusive access to the set of > CPUs allocated to it. Other cgroups outside of that partition > cannot use any CPUs in that set. > > There are two types of partitions - adjacent and remote. The > parent of an adjacent partition must be a valid partition root. > Partition roots of adjacent partitions are all clustered around > the root cgroup. Creation of adjacent partition is done by > writing the desired partition type into "cpuset.cpus.partition". > > A remote partition does not require a partition root parent. > So a remote partition can be formed far from the root cgroup. > However, its creation is a 2-step process. The CPUs needed > by a remote partition ("cpuset.cpus" of the partition root) > has to be written into "cpuset.cpus.reserve" of the root > cgroup first. After that, "isolated" can be written into > "cpuset.cpus.partition" of the partition root to form a remote > isolated partition which is the only supported remote partition > type for now. > > All remote partitions are terminal as adjacent partition cannot > be created underneath it. Can you elaborate this extra restriction a bit further? In general, I think it'd be really helpful if the document explains the reasoning behind the design decisions. ie. Why is reserving for? 
What purpose does it serve that the regular isolated ones cannot? That'd help clarifying the design decisions. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-05-22 19:49 ` Tejun Heo @ 2023-05-28 21:18 ` Waiman Long 2023-06-05 18:03 ` Tejun Heo 0 siblings, 1 reply; 32+ messages in thread From: Waiman Long @ 2023-05-28 21:18 UTC (permalink / raw) To: Tejun Heo Cc: Michal Koutný, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker, Mrunal Patel, Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld On 5/22/23 15:49, Tejun Heo wrote: > Hello, Waiman. Sorry for the late reply as I had been off for almost 2 weeks due to PTO. > > On Sun, May 07, 2023 at 09:03:44PM -0400, Waiman Long wrote: > ... >> cpuset.cpus.reserve >> A read-write multiple values file which exists only on root >> cgroup. >> >> It lists all the CPUs that are reserved for adjacent and remote >> partitions created in the system. See the next section for >> more information on what an adjacent or remote partitions is. >> >> Creation of adjacent partition does not require touching this >> control file as CPU reservation will be done automatically. >> In order to create a remote partition, the CPUs needed by the >> remote partition has to be written to this file first. >> >> A "+" prefix can be used to indicate a list of additional >> CPUs that are to be added without disturbing the CPUs that are >> originally there. For example, if its current value is "3-4", >> echoing ""+5" to it will change it to "3-5". >> >> Once a remote partition is destroyed, its CPUs have to be >> removed from this file or no other process can use them. A "-" >> prefix can be used to remove a list of CPUs from it. However, >> removing CPUs that are currently used in existing partitions >> may cause those partitions to become invalid. A single "-" >> character without any number can be used to indicate removal >> of all the free CPUs not allocated to any partitions to avoid >> accidental partition invalidation. 
> Why is the syntax different from .cpus? Wouldn't it be better to keep them > the same? Unlike cpuset.cpus, cpuset.cpus.reserve is supposed to contain CPUs that are used in multiple partitions. Also, automatic reservation of adjacent partitions can happen in parallel. That is why I think it will be safer if we allow incremental increase or decrease of reserve CPUs to be used for remote partitions. I will include this reasoning in the doc file. >> cpuset.cpus.partition >> A read-write single value file which exists on non-root >> cpuset-enabled cgroups. This flag is owned by the parent cgroup >> and is not delegatable. >> >> It accepts only the following input values when written to. >> >> ========== ===================================== >> "member" Non-root member of a partition >> "root" Partition root >> "isolated" Partition root without load balancing >> ========== ===================================== >> >> A cpuset partition is a collection of cgroups with a partition >> root at the top of the hierarchy and its descendants except >> those that are separate partition roots themselves and their >> descendants. A partition has exclusive access to the set of >> CPUs allocated to it. Other cgroups outside of that partition >> cannot use any CPUs in that set. >> >> There are two types of partitions - adjacent and remote. The >> parent of an adjacent partition must be a valid partition root. >> Partition roots of adjacent partitions are all clustered around >> the root cgroup. Creation of an adjacent partition is done by >> writing the desired partition type into "cpuset.cpus.partition". >> >> A remote partition does not require a partition root parent. >> So a remote partition can be formed far from the root cgroup. >> However, its creation is a 2-step process. The CPUs needed >> by a remote partition ("cpuset.cpus" of the partition root) >> have to be written into "cpuset.cpus.reserve" of the root >> cgroup first.
After that, "isolated" can be written into >> "cpuset.cpus.partition" of the partition root to form a remote >> isolated partition which is the only supported remote partition >> type for now. >> >> All remote partitions are terminal as adjacent partitions cannot >> be created underneath them. > Can you elaborate this extra restriction a bit further? Are you referring to the fact that only remote isolated partitions are supported? I do not preclude the support of load balancing remote partitions. I keep it to isolated partitions for now for ease of implementation and I am not currently aware of a use case where such a remote partition type is needed. If you are talking about remote partitions being terminal, it is mainly because it can be more tricky to support hierarchical adjacent partitions underneath them, especially if they are not isolated. We can certainly support it if a use case arises. I just don't want to implement code that nobody is really going to use. BTW, with the current way the remote partition is created, it is not possible to have another remote partition underneath it. > > In general, I think it'd be really helpful if the document explains the > reasoning behind the design decisions. ie. Why is reserving for? What > purpose does it serve that the regular isolated ones cannot? That'd help > clarify the design decisions. I understand your concern. If you think it is better to support both types of remote partitions or hierarchical adjacent partitions underneath them for symmetry purposes, I can certainly do that. It just needs a bit more time. Cheers, Longman ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-05-28 21:18 ` Waiman Long @ 2023-06-05 18:03 ` Tejun Heo 2023-06-05 20:00 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2023-06-05 18:03 UTC (permalink / raw) To: Waiman Long Cc: Michal Koutný, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker, Mrunal Patel, Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld Hello, Waiman. On Sun, May 28, 2023 at 05:18:50PM -0400, Waiman Long wrote: > On 5/22/23 15:49, Tejun Heo wrote: > Sorry for the late reply as I had been off for almost 2 weeks due to PTO. And me too. Just moved. > > Why is the syntax different from .cpus? Wouldn't it be better to keep them > > the same? > > Unlike cpuset.cpus, cpuset.cpus.reserve is supposed to contains CPUs that > are used in multiple partitions. Also automatic reservation of adjacent > partitions can happen in parallel. That is why I think it will be safer if Ah, I see, this is because cpu.reserve is only in the root cgroup, so you can't say that the knob is owned by the parent cgroup and thus access is controlled that way. ... > > > There are two types of partitions - adjacent and remote. The > > > parent of an adjacent partition must be a valid partition root. > > > Partition roots of adjacent partitions are all clustered around > > > the root cgroup. Creation of adjacent partition is done by > > > writing the desired partition type into "cpuset.cpus.partition". > > > > > > A remote partition does not require a partition root parent. > > > So a remote partition can be formed far from the root cgroup. > > > However, its creation is a 2-step process. The CPUs needed > > > by a remote partition ("cpuset.cpus" of the partition root) > > > has to be written into "cpuset.cpus.reserve" of the root > > > cgroup first. 
After that, "isolated" can be written into > > > "cpuset.cpus.partition" of the partition root to form a remote > > > isolated partition which is the only supported remote partition > > > type for now. > > > > > > All remote partitions are terminal as adjacent partition cannot > > > be created underneath it. > > > > Can you elaborate this extra restriction a bit further? > > Are you referring to the fact that only remote isolated partitions are > supported? I do not preclude the support of load balancing remote > partitions. I keep it to isolated partitions for now for ease of > implementation and I am not currently aware of a use case where such a > remote partition type is needed. > > If you are talking about remote partition being terminal. It is mainly > because it can be more tricky to support hierarchical adjacent partitions > underneath it especially if it is not isolated. We can certainly support it > if a use case arises. I just don't want to implement code that nobody is > really going to use. > > BTW, with the current way the remote partition is created, it is not > possible to have another remote partition underneath it. The fact that the control is spread across a root-only file and per-cgroup file seems hacky to me. e.g. How would it interact with namespacing? Are there reasons why this can't be properly hierarchical other than the amount of work needed? For example: cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs that the cgroup holds exclusively. The mask is always a subset of cpuset.cpus. The parent loses access to a CPU when the CPU is given to a child by setting the CPU in the child's cpus.exclusive and the CPU can't be given to more than one child. IOW, exclusive CPUs are available only to the leaf cgroups that have them set in their .exclusive file. When a cgroup is turned into a partition, its cpuset.cpus and cpuset.cpus.exclusive should be the same. 
For backward compatibility, if the cgroup's parent is already a partition, cpuset will automatically attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive. I could well be missing something important but I'd really like to see something like the above where the reservation feature blends in with the rest of cpuset. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
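The exclusive-ownership rules Tejun proposes above can be expressed as a short model (an illustrative sketch with made-up function names, not kernel code): a child's cpus.exclusive must be a subset of its own and the parent's cpus, a CPU can be granted to at most one child, and the parent loses access to granted CPUs.

```python
def exclusive_grants_valid(parent_cpus, children):
    """children: list of (cpus, exclusive) set pairs under one parent.

    A child's exclusive mask must be a subset of its own cpus and of the
    parent's cpus, and no CPU may be granted exclusively to two children.
    """
    granted = set()
    for cpus, exclusive in children:
        if not exclusive <= cpus or not exclusive <= parent_cpus:
            return False
        if granted & exclusive:   # a CPU can't be given to more than one child
            return False
        granted |= exclusive
    return True

def parent_effective(parent_cpus, children):
    """The parent loses access to CPUs granted exclusively to children."""
    granted = set().union(*(ex for _, ex in children)) if children else set()
    return parent_cpus - granted
```

This is only a model of the invariants described in the mail, not of how the kernel would implement them.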
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-06-05 18:03 ` Tejun Heo @ 2023-06-05 20:00 ` Waiman Long 2023-06-05 20:27 ` Tejun Heo 0 siblings, 1 reply; 32+ messages in thread From: Waiman Long @ 2023-06-05 20:00 UTC (permalink / raw) To: Tejun Heo Cc: Michal Koutný, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker, Mrunal Patel, Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld On 6/5/23 14:03, Tejun Heo wrote: > Hello, Waiman. > > On Sun, May 28, 2023 at 05:18:50PM -0400, Waiman Long wrote: >> On 5/22/23 15:49, Tejun Heo wrote: >> Sorry for the late reply as I had been off for almost 2 weeks due to PTO. > And me too. Just moved. > >>> Why is the syntax different from .cpus? Wouldn't it be better to keep them >>> the same? >> Unlike cpuset.cpus, cpuset.cpus.reserve is supposed to contains CPUs that >> are used in multiple partitions. Also automatic reservation of adjacent >> partitions can happen in parallel. That is why I think it will be safer if > Ah, I see, this is because cpu.reserve is only in the root cgroup, so you > can't say that the knob is owned by the parent cgroup and thus access is > controlled that way. > > ... >>>> There are two types of partitions - adjacent and remote. The >>>> parent of an adjacent partition must be a valid partition root. >>>> Partition roots of adjacent partitions are all clustered around >>>> the root cgroup. Creation of adjacent partition is done by >>>> writing the desired partition type into "cpuset.cpus.partition". >>>> >>>> A remote partition does not require a partition root parent. >>>> So a remote partition can be formed far from the root cgroup. >>>> However, its creation is a 2-step process. The CPUs needed >>>> by a remote partition ("cpuset.cpus" of the partition root) >>>> has to be written into "cpuset.cpus.reserve" of the root >>>> cgroup first. 
After that, "isolated" can be written into >>>> "cpuset.cpus.partition" of the partition root to form a remote >>>> isolated partition which is the only supported remote partition >>>> type for now. >>>> >>>> All remote partitions are terminal as adjacent partition cannot >>>> be created underneath it. >>> Can you elaborate this extra restriction a bit further? >> Are you referring to the fact that only remote isolated partitions are >> supported? I do not preclude the support of load balancing remote >> partitions. I keep it to isolated partitions for now for ease of >> implementation and I am not currently aware of a use case where such a >> remote partition type is needed. >> >> If you are talking about remote partition being terminal. It is mainly >> because it can be more tricky to support hierarchical adjacent partitions >> underneath it especially if it is not isolated. We can certainly support it >> if a use case arises. I just don't want to implement code that nobody is >> really going to use. >> >> BTW, with the current way the remote partition is created, it is not >> possible to have another remote partition underneath it. > The fact that the control is spread across a root-only file and per-cgroup > file seems hacky to me. e.g. How would it interact with namespacing? Are > there reasons why this can't be properly hierarchical other than the amount > of work needed? For example: > > cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs > that the cgroup holds exclusively. The mask is always a subset of > cpuset.cpus. The parent loses access to a CPU when the CPU is given to a > child by setting the CPU in the child's cpus.exclusive and the CPU can't > be given to more than one child. IOW, exclusive CPUs are available only to > the leaf cgroups that have them set in their .exclusive file. > > When a cgroup is turned into a partition, its cpuset.cpus and > cpuset.cpus.exclusive should be the same. 
For backward compatibility, if > the cgroup's parent is already a partition, cpuset will automatically > attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive. > > I could well be missing something important but I'd really like to see > something like the above where the reservation feature blends in with the > rest of cpuset. It can certainly be made hierarchical as you suggest. It does increase complexity from both the user and kernel points of view. From the user point of view, there is one more knob to manage hierarchically which is not used that often. From the kernel point of view, we may need to have one more cpumask per cpuset as the current subparts_cpus is used to track automatic reservation. We need another cpumask to contain extra exclusive CPUs not allocated through automatic reservation. You describe this new control file as a list of exclusively owned CPUs for this cgroup. Creating a partition is in fact allocating exclusive CPUs to a cgroup, so it kind of overlaps with the cpuset.cpus.partition file. Can we fail a write to cpuset.cpus.exclusive if those exclusive CPUs cannot be granted, or will the exclusive list only be valid if a valid partition can be formed? We need to properly manage the dependency between these 2 control files. Alternatively, I have no problem exposing cpuset.cpus.exclusive as a read-only file. It is a bit problematic if we need to make it writable. As for namespacing, you do raise a good point. I was thinking mostly from a whole-system point of view as the use case that I am aware of does not need that. To allow delegation of exclusive CPUs to a child cgroup, that cgroup has to be a partition root itself. One compromise that I can think of is to allow automatic reservation only in such a scenario. In that case, I need to support a remote load-balanced partition as well and hierarchical sub-partitions underneath it.
That can be done with some extra code on top of the existing v2 patchset without introducing too much complexity. IOW, the use of remote partitions is only allowed at the whole-system level where one has access to the cgroup root. Exclusive CPU distribution within a container can only be done via the use of adjacent partitions with automatic reservation. Will that be a good enough compromise from your point of view? Cheers, Longman ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-06-05 20:00 ` Waiman Long @ 2023-06-05 20:27 ` Tejun Heo 2023-06-06 2:47 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2023-06-05 20:27 UTC (permalink / raw) To: Waiman Long Cc: Michal Koutný, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker, Mrunal Patel, Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld Hello, On Mon, Jun 05, 2023 at 04:00:39PM -0400, Waiman Long wrote: ... > > file seems hacky to me. e.g. How would it interact with namespacing? Are > > there reasons why this can't be properly hierarchical other than the amount > > of work needed? For example: > > > > cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs > > that the cgroup holds exclusively. The mask is always a subset of > > cpuset.cpus. The parent loses access to a CPU when the CPU is given to a > > child by setting the CPU in the child's cpus.exclusive and the CPU can't > > be given to more than one child. IOW, exclusive CPUs are available only to > > the leaf cgroups that have them set in their .exclusive file. > > > > When a cgroup is turned into a partition, its cpuset.cpus and > > cpuset.cpus.exclusive should be the same. For backward compatibility, if > > the cgroup's parent is already a partition, cpuset will automatically > > attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive. > > > > I could well be missing something important but I'd really like to see > > something like the above where the reservation feature blends in with the > > rest of cpuset. > > It can certainly be made hierarchical as you suggest. It does increase > complexity from both user and kernel point of view. > > From the user point of view, there is one more knob to manage hierarchically > which is not used that often. 
From user pov, this only affects them when they want to create partitions down the tree, right? > From the kernel point of view, we may need to have one more cpumask per > cpuset as the current subparts_cpus is used to track automatic reservation. > We need another cpumask to contain extra exclusive CPUs not allocated > through automatic reservation. The fact that you mention this new control > file as a list of exclusively owned CPUs for this cgroup. Creating a > partition is in fact allocating exclusive CPUs to a cgroup. So it kind of > overlaps with the cpuset.cpus.partition file. Can we fail a write to Yes, it substitutes and expands on cpuset.cpus.partition behavior. > cpuset.cpus.exclusive if those exclusive CPUs cannot be granted or will this > exclusive list is only valid if a valid partition can be formed. So we need > to properly manage the dependency between these 2 control files. So, I think cpus.exclusive can become the sole mechanism to arbitrate exclusive ownership of CPUs and .partition can depend on .exclusive. > Alternatively, I have no problem exposing cpuset.cpus.exclusive as a > read-only file. It is a bit problematic if we need to make it writable. I don't follow. How would remote partitions work then? > As for namespacing, you do raise a good point. I was thinking mostly from a > whole system point of view as the use case that I am aware of does not need > that. To allow delegation of exclusive CPUs to a child cgroup, that cgroup > has to be a partition root itself. One compromise that I can think of is to > only allow automatic reservation only in such a scenario. In that case, I > need to support a remote load balanced partition as well and hierarchical > sub-partitions underneath it. That can be done with some extra code to the > existing v2 patchset without introducing too much complexity. > > IOW, the use of remote partition is only allowed on the whole system level > where one has access to the cgroup root.
Exclusive CPUs distribution within > a container can only be done via the use of adjacent partitions with > automatic reservation. Will that be a good enough compromise from your point > of view? It seems too twisted to me. I'd much prefer it to be better integrated with the rest of cpuset. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-06-05 20:27 ` Tejun Heo @ 2023-06-06 2:47 ` Waiman Long 2023-06-06 19:58 ` Tejun Heo 0 siblings, 1 reply; 32+ messages in thread From: Waiman Long @ 2023-06-06 2:47 UTC (permalink / raw) To: Tejun Heo Cc: Michal Koutný, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker, Mrunal Patel, Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld On 6/5/23 16:27, Tejun Heo wrote: > Hello, > > On Mon, Jun 05, 2023 at 04:00:39PM -0400, Waiman Long wrote: > ... >>> file seems hacky to me. e.g. How would it interact with namespacing? Are >>> there reasons why this can't be properly hierarchical other than the amount >>> of work needed? For example: >>> >>> cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs >>> that the cgroup holds exclusively. The mask is always a subset of >>> cpuset.cpus. The parent loses access to a CPU when the CPU is given to a >>> child by setting the CPU in the child's cpus.exclusive and the CPU can't >>> be given to more than one child. IOW, exclusive CPUs are available only to >>> the leaf cgroups that have them set in their .exclusive file. >>> >>> When a cgroup is turned into a partition, its cpuset.cpus and >>> cpuset.cpus.exclusive should be the same. For backward compatibility, if >>> the cgroup's parent is already a partition, cpuset will automatically >>> attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive. >>> >>> I could well be missing something important but I'd really like to see >>> something like the above where the reservation feature blends in with the >>> rest of cpuset. >> It can certainly be made hierarchical as you suggest. It does increase >> complexity from both user and kernel point of view. >> >> From the user point of view, there is one more knob to manage hierarchically >> which is not used that often. 
> From user pov, this only affects them when they want to create partitions > down the tree, right? > >> From the kernel point of view, we may need to have one more cpumask per >> cpuset as the current subparts_cpus is used to track automatic reservation. >> We need another cpumask to contain extra exclusive CPUs not allocated >> through automatic reservation. The fact that you mention this new control >> file as a list of exclusively owned CPUs for this cgroup. Creating a >> partition is in fact allocating exclusive CPUs to a cgroup. So it kind of >> overlaps with the cpuset.cpus.partition file. Can we fail a write to > Yes, it substitutes and expands on cpuset.cpus.partition behavior. > >> cpuset.cpus.exclusive if those exclusive CPUs cannot be granted or will this >> exclusive list is only valid if a valid partition can be formed. So we need >> to properly manage the dependency between these 2 control files. > So, I think cpus.exclusive can become the sole mechanism to arbitrate > exclusive ownership of CPUs and .partition can depend on .exclusive. > >> Alternatively, I have no problem exposing cpuset.cpus.exclusive as a >> read-only file. It is a bit problematic if we need to make it writable. > I don't follow. How would remote partitions work then? I had a different idea on the semantics of cpuset.cpus.exclusive at the beginning. My original thinking was that it was the actual exclusive CPUs that are allocated to the cgroup. Now, if we treat this as a hint of what exclusive CPUs should be used, one that becomes valid only if the cgroup can become a valid partition, I can see it as a value that can be hierarchically set throughout the whole cpuset hierarchy. So a transition to a valid partition is possible iff 1) cpuset.cpus.exclusive is a subset of cpuset.cpus and is a subset of cpuset.cpus.exclusive of all its ancestors. 2) If its parent is not a partition root, none of the CPUs in cpuset.cpus.exclusive are currently allocated to other partitions.
This is the same remote partition concept as in my v2 patch. If its parent is a partition root, part of its exclusive CPUs will be distributed to this child partition, like the current behavior of cpuset partitions. I can rework my patch to adopt this model if it is what you have in mind. Thanks, Longman ^ permalink raw reply [flat|nested] 32+ messages in thread
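Conditions 1) and 2) above can be captured in a small check (a hypothetical model, not kernel code; `ancestor_exclusives` stands for the chain of cpuset.cpus.exclusive values up to the root):

```python
def can_become_partition(cpus, exclusive, ancestor_exclusives,
                         parent_is_partition_root, cpus_in_other_partitions):
    """Model of the two proposed transition conditions."""
    # Condition 1: exclusive must be a non-empty subset of this cgroup's
    # cpuset.cpus and of every ancestor's cpuset.cpus.exclusive.
    if not exclusive or not exclusive <= cpus:
        return False
    for anc in ancestor_exclusives:
        if not exclusive <= anc:
            return False
    # Condition 2: without a partition-root parent (the remote case), none
    # of these CPUs may already be allocated to another partition.
    if not parent_is_partition_root and exclusive & cpus_in_other_partitions:
        return False
    return True
```

The empty-mask rejection is an assumption here (a partition presumably needs at least one CPU); the mail does not spell that case out.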
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-06-06 2:47 ` Waiman Long @ 2023-06-06 19:58 ` Tejun Heo 2023-06-06 20:11 ` Waiman Long 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2023-06-06 19:58 UTC (permalink / raw) To: Waiman Long Cc: Michal Koutný, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker, Mrunal Patel, Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld Hello, Waiman. On Mon, Jun 05, 2023 at 10:47:08PM -0400, Waiman Long wrote: ... > I had a different idea on the semantics of the cpuset.cpus.exclusive at the > beginning. My original thinking is that it was the actual exclusive CPUs > that are allocated to the cgroup. Now if we treat this as a hint of what > exclusive CPUs should be used and it becomes valid only if the cgroup can I wouldn't call it a hint. It's still hard allocation of the CPUs to the cgroups that own them. Setting up a partition requires exclusive CPUs and thus would depend on exclusive allocations set up accordingly. > become a valid partition. I can see it as a value that can be hierarchically > set throughout the whole cpuset hierarchy. > > So a transition to a valid partition is possible iff > > 1) cpuset.cpus.exclusive is a subset of cpuset.cpus and is a subset of > cpuset.cpus.exclusive of all its ancestors. Yes. > 2) If its parent is not a partition root, none of the CPUs in > cpuset.cpus.exclusive are currently allocated to other partitions. This the Not just that, the CPUs aren't available to cgroups which don't have them set in the .exclusive file. IOW, if a CPU is in cpus.exclusive of some cgroups, it shouldn't appear in cpus.effective of cgroups which don't have the CPU in their cpus.exclusive. 
So, .exclusive explicitly establishes exclusive ownership of CPUs and partitions depend on that with an implicit "turn CPUs exclusive" behavior in case the parent is a partition root for backward compatibility. > same remote partition concept in my v2 patch. If its parent is a partition > root, part of its exclusive CPUs will be distributed to this child partition > like the current behavior of cpuset partition. Yes, similar in a sense. Please do away with the "once .reserve is used, the behavior is switched" part. Instead, it can be sth like "if the parent is a partition root, cpuset implicitly tries to set all CPUs in its cpus file in its cpus.exclusive file" so that user-visible behavior stays unchanged depending on past history. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
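The rule that exclusively claimed CPUs disappear from everyone else's effective mask might be modeled as follows (illustration only, with assumed names):

```python
def effective_cpus(own_cpus, own_exclusive, all_exclusive_masks):
    """A CPU claimed exclusively anywhere is visible only to the claimant:
    it is removed from cpus.effective of every cgroup whose cpus.exclusive
    does not contain it."""
    claimed_elsewhere = set()
    for mask in all_exclusive_masks:
        claimed_elsewhere |= mask - own_exclusive
    return own_cpus - claimed_elsewhere
```

A cgroup that claims CPUs in its own cpus.exclusive keeps them; everyone else sees them vanish from cpus.effective.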
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-06-06 19:58 ` Tejun Heo @ 2023-06-06 20:11 ` Waiman Long 2023-06-06 20:13 ` Tejun Heo 0 siblings, 1 reply; 32+ messages in thread From: Waiman Long @ 2023-06-06 20:11 UTC (permalink / raw) To: Tejun Heo Cc: Michal Koutný, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker, Mrunal Patel, Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld On 6/6/23 15:58, Tejun Heo wrote: > Hello, Waiman. > > On Mon, Jun 05, 2023 at 10:47:08PM -0400, Waiman Long wrote: > ... >> I had a different idea on the semantics of the cpuset.cpus.exclusive at the >> beginning. My original thinking is that it was the actual exclusive CPUs >> that are allocated to the cgroup. Now if we treat this as a hint of what >> exclusive CPUs should be used and it becomes valid only if the cgroup can > I wouldn't call it a hint. It's still hard allocation of the CPUs to the > cgroups that own them. Setting up a partition requires exclusive CPUs and > thus would depend on exclusive allocations set up accordingly. > >> become a valid partition. I can see it as a value that can be hierarchically >> set throughout the whole cpuset hierarchy. >> >> So a transition to a valid partition is possible iff >> >> 1) cpuset.cpus.exclusive is a subset of cpuset.cpus and is a subset of >> cpuset.cpus.exclusive of all its ancestors. > Yes. > >> 2) If its parent is not a partition root, none of the CPUs in >> cpuset.cpus.exclusive are currently allocated to other partitions. This the > Not just that, the CPUs aren't available to cgroups which don't have them > set in the .exclusive file. IOW, if a CPU is in cpus.exclusive of some > cgroups, it shouldn't appear in cpus.effective of cgroups which don't have > the CPU in their cpus.exclusive. 
> > So, .exclusive explicitly establishes exclusive ownership of CPUs and > partitions depend on that with an implicit "turn CPUs exclusive" behavior in > case the parent is a partition root for backward compatibility. The current CPU exclusive behavior is limited to sibling cgroups only. Because of the hierarchical nature of cpu distribution, the set of exclusive CPUs has to appear in all of its ancestors. When a partition is enabled, we do a sibling exclusivity test at that point to verify that it is exclusive. It looks like you want to do an exclusivity test even when the partition isn't active. I can certainly do that when the file is being updated. However, it will fail the write if the exclusivity test fails, just like the v1 cpuset.cpu_exclusive flag, if you are OK with that. > >> same remote partition concept in my v2 patch. If its parent is a partition >> root, part of its exclusive CPUs will be distributed to this child partition >> like the current behavior of cpuset partition. > Yes, similar in a sense. Please do away with the "once .reserve is used, the > behavior is switched" part. That behavior is gone in my v2 patch. > Instead, it can be sth like "if the parent is a > partition root, cpuset implicitly tries to set all CPUs in its cpus file in > its cpus.exclusive file" so that user-visible behavior stays unchanged > depending on past history. If the parent is a partition root, auto reservation will be done and cpus.exclusive will be set automatically just like before. So existing applications using partitions will not be affected. Cheers, Longman ^ permalink raw reply [flat|nested] 32+ messages in thread
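The write-time exclusivity test discussed here could look roughly like the following (a hedged sketch; the kernel would return -EINVAL rather than raise an exception, and the names are invented):

```python
def write_cpus_exclusive(new_exclusive, own_cpus, siblings_exclusive):
    """Reject the write if the requested mask is not a subset of this
    cgroup's cpuset.cpus or overlaps a sibling's exclusive CPUs."""
    if not new_exclusive <= own_cpus:
        raise OSError("EINVAL: not a subset of cpuset.cpus")
    for sib in siblings_exclusive:
        if new_exclusive & sib:
            raise OSError("EINVAL: overlaps a sibling's exclusive CPUs")
    return new_exclusive
```

Failing the write immediately mirrors how the v1 exclusive flag behaves, rather than deferring the check to partition-enable time.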
* Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition 2023-06-06 20:11 ` Waiman Long @ 2023-06-06 20:13 ` Tejun Heo 0 siblings, 0 replies; 32+ messages in thread From: Tejun Heo @ 2023-06-06 20:13 UTC (permalink / raw) To: Waiman Long Cc: Michal Koutný, Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, linux-kernel, cgroups, linux-doc, linux-kselftest, Juri Lelli, Valentin Schneider, Frederic Weisbecker, Mrunal Patel, Ryan Phillips, Brent Rowsell, Peter Hunt, Phil Auld Hello, On Tue, Jun 06, 2023 at 04:11:02PM -0400, Waiman Long wrote: ... > The current CPU exclusive behavior is limited to sibling cgroups only. > Because of the hierarchical nature of cpu distribution, the set of exclusive > CPUs have to appear in all its ancestors. When partition is enabled, we do a > sibling exclusivity test at that point to verify that it is exclusive. It > looks like you want to do an exclusivity test even when the partition isn't > active. I can certainly do that when the file is being updated. However, it > will fail the write if the exclusivity test fails just like the v1 > cpuset.cpus.exclusive flag if you are OK with that. Yeah, doesn't look like there's a way around it if we want to make .exclusive a feature which is useful on its own. > > Instead, it can be sth like "if the parent is a > > partition root, cpuset implicitly tries to set all CPUs in its cpus file in > > its cpus.exclusive file" so that user-visible behavior stays unchanged > > depending on past history. > > If parent is a partition root, auto reservation will be done and > cpus.exclusive will be set automatically just like before. So existing > applications using partition will not be affected. Sounds great. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2023-06-06 20:13 UTC | newest] Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-04-12 15:37 [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition Waiman Long 2023-04-12 19:28 ` Tejun Heo [not found] ` <1ce6a073-e573-0c32-c3d8-f67f3d389a28@redhat.com> 2023-04-12 20:22 ` Tejun Heo 2023-04-12 20:33 ` Waiman Long 2023-04-13 0:03 ` Tejun Heo 2023-04-13 0:26 ` Waiman Long 2023-04-13 0:33 ` Tejun Heo 2023-04-13 0:55 ` Waiman Long 2023-04-13 1:17 ` Tejun Heo 2023-04-13 1:55 ` Waiman Long 2023-04-14 1:22 ` Waiman Long 2023-04-14 16:54 ` Tejun Heo 2023-04-14 17:29 ` Waiman Long 2023-04-14 17:34 ` Tejun Heo 2023-04-14 17:38 ` Waiman Long 2023-04-14 19:06 ` Waiman Long 2023-05-02 18:01 ` Michal Koutný 2023-05-02 21:26 ` Waiman Long 2023-05-02 22:27 ` Michal Koutný 2023-05-04 3:01 ` Waiman Long 2023-05-05 16:03 ` Tejun Heo 2023-05-05 16:25 ` Waiman Long 2023-05-08 1:03 ` Waiman Long 2023-05-22 19:49 ` Tejun Heo 2023-05-28 21:18 ` Waiman Long 2023-06-05 18:03 ` Tejun Heo 2023-06-05 20:00 ` Waiman Long 2023-06-05 20:27 ` Tejun Heo 2023-06-06 2:47 ` Waiman Long 2023-06-06 19:58 ` Tejun Heo 2023-06-06 20:11 ` Waiman Long 2023-06-06 20:13 ` Tejun Heo