* Cache Allocation Technology Design
@ 2014-10-16 18:44 vikas
  2014-10-20 16:18 ` Matt Fleming
  2014-11-03 23:29 ` Vikas Shivappa
  0 siblings, 2 replies; 39+ messages in thread
From: vikas @ 2014-10-16 18:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: matt.fleming, will.auld, tj, vikas.shivappa

Hi All,

We have put together a draft design document for Cache Allocation
Technology below. Please review it and let us know any feedback.

Make sure you cc my email vikas.shivappa@linux.intel.com when replying.

Thanks,
Vikas

What is Cache Allocation Technology ( CAT )
-------------------------------------------

Cache Allocation Technology provides a way for the software (OS/VMM) to
restrict cache allocation to a defined 'subset' of the cache, which may
overlap with other 'subsets'. This feature is used when allocating a
line in the cache, i.e. when pulling new data into the cache. The
hardware is programmed via MSRs.

The different cache subsets are identified by a CLOS identifier (class
of service) and each CLOS has a CBM (cache bit mask). The CBM is a
contiguous set of bits which defines the amount of cache resource that
is available for each 'subset'.

Why is CAT (cache allocation technology) needed
------------------------------------------------

CAT enables more cache resources to be made available for higher
priority applications based on guidance from the execution environment.

The architecture also allows these subsets to be changed dynamically at
runtime to further optimize the performance of the higher priority
application with minimal degradation to the low priority application.
Additionally, resources can be rebalanced for system throughput
benefit. (Refer to Section 17.15 in the Intel SDM
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf)

This technique may be useful in managing large computer systems with a
large LLC, for example large servers running instances of web servers
or database servers. In such complex systems, these subsets can be used
for more careful placement of the available cache resources.

The CAT kernel patch would provide a basic kernel framework for users
to implement such cache subsets.


Kernel implementation Overview
-------------------------------

The kernel implements a cgroup subsystem to support Cache Allocation.

Creating a CAT cgroup would create a new CLOS <-> CBM mapping. Each
cgroup would have one CBM and would represent just one cache 'subset'.

The user would be allowed to create as many directories as there are
CLOSs defined by the hardware. If the user tries to create more than
the available CLOSs, -ENOSPC is returned (a rough sketch of this
mapping is shown after the mode descriptions below). Currently we
support only one level of directory, i.e. directories can be created
only under the root.

There are 2 modes supported:

1. Affinitized mode: Each CAT cgroup is affinitized to a set of CPUs
specified by the 'cpus' file. The tasks in the CAT cgroup would be
constrained to run only on the CPUs in the 'cpus' file. The CPUs in
this file are exclusively used for this cgroup. Requests by a task
using sched_setaffinity() would be filtered through the task's 'cpus'.

These tasks would get to fill the part of the LLC represented by the
cgroup's 'cbm' file. 'cpus' is a cpumask and works the same way as the
existing cpumask data structure.

2. Non-affinitized mode: Each CAT cgroup (in turn a 'subset') would be
for a group of tasks. There is no 'cpus' file, and the CPUs that the
tasks run on are not restricted by the CAT cgroup.
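As a rough illustration of the CLOS <-> CBM mapping and the -ENOSPC
behaviour described in the overview above, a minimal sketch of the
per-cgroup state and the directory-creation path might look as follows.
This is not the posted patch: the names cat_cgroup, css_to_cat and
MAX_CLOS are hypothetical, and MAX_CLOS would really be enumerated via
CPUID.

#include <linux/cgroup.h>
#include <linux/slab.h>
#include <linux/bitmap.h>
#include <linux/err.h>

#define MAX_CLOS 16			/* sketch only; really from CPUID */

struct cat_cgroup {
	struct cgroup_subsys_state css;
	u32 closid;			/* hardware class of service backing this group */
	u64 cbm;			/* cache bit mask of this cache 'subset' */
};

static DECLARE_BITMAP(closid_used, MAX_CLOS);

static struct cat_cgroup *css_to_cat(struct cgroup_subsys_state *css)
{
	return container_of(css, struct cat_cgroup, css);
}

/* mkdir of a new CAT cgroup: one directory consumes one hardware CLOS */
static struct cgroup_subsys_state *
cat_css_alloc(struct cgroup_subsys_state *parent_css)
{
	struct cat_cgroup *cq;
	int id;

	id = find_first_zero_bit(closid_used, MAX_CLOS);
	if (id >= MAX_CLOS)
		return ERR_PTR(-ENOSPC);	/* hardware CLOS ids exhausted */

	cq = kzalloc(sizeof(*cq), GFP_KERNEL);
	if (!cq)
		return ERR_PTR(-ENOMEM);

	set_bit(id, closid_used);
	cq->closid = id;
	/* one possible default: start with the parent's mask (root gets all bits) */
	cq->cbm = parent_css ? css_to_cat(parent_css)->cbm : ~0ULL;
	return &cq->css;
}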
Assignment of CBM, CLOS and modes
---------------------------------

The root directory would have all bits set in its 'cbm' file by
default.

The cbm_max file in the root defines the maximum number of bits
describing the available cache units. For example, if cbm_max is 16
then the 'cbm' cannot have more than 16 bits set.

The 'affinitized' file is either 0 or 1, representing the two modes.
The system would boot in affinitized mode with all bits in 'cbm' set,
meaning all CPUs have 100% of the cache (effectively, cache allocation
is not in effect).

The 'cbm' file is restricted to having no more than its cbm_max least
significant bits set. Any contiguous subset of these bits may be set to
indicate the cache mapping desired. The 'cbm' of 2 directories can
overlap. The 'cbm' would represent the cache 'subset' of the CAT
cgroup. For example, on a system with 16 max cbm bits, if a directory
has the least significant 4 bits set in its 'cbm' file, it would be
allocated the right quarter of the last level cache, which means the
tasks belonging to this CAT cgroup can fill only the right quarter of
the cache. If it has the most significant 8 bits set, it would be
allocated the left half of the cache (8 bits out of 16 represents 50%).

The cache subset would be affinitized to a set of CPUs in affinitized
mode. The CPUs to which this allocation is affinitized are represented
by the 'cpus' file. The 'cpus' of different directories need to be
mutually exclusive.

The cache portion defined in the CBM file is available to all tasks
within the CAT group, and these tasks are not allowed to allocate space
in other parts of the cache.

The 'cbm' file is used in both modes, whereas the 'cpus' file is
relevant only in affinitized mode and would disappear in
non-affinitized mode.


Scheduling and Context Switch
------------------------------

In affinitized mode, the cache 'subset' and the tasks in a CAT cgroup
are affinitized to the CPUs represented by the CAT cgroup's 'cpus'
file, i.e. when the user sets 'cbm' to 'portion', 'cpus' to c and
'tasks' to t, the tasks 't' would always be scheduled on CPUs 'c' and
will get to fill the allocated 'portion' of the last level cache.

As noted above, in affinitized mode the tasks in a CAT cgroup would
also be affinitized to the CPUs in the 'cpus' file of the directory.
The following hooks in the kernel are required to implement this (along
the lines of the cpuset code):
- in sched_setaffinity, to mask the requested cpu mask with what is
  present in the task's 'cpus'
- in migrate_task, to migrate the tasks only to those CPUs in the
  'cpus' file if possible
- in select_task_rq

In non-affinitized mode 'affinitized' is 0, and the 'tasks' file
indicates the tasks the cache subset is affinitized to. When the user
adds tasks to the tasks file, the tasks would get to fill the cache
subset represented by the CAT cgroup's 'cbm' file.

During a context switch the kernel implements this by writing the
corresponding CLOSid (maintained internally by the kernel) of the CAT
cgroup to the CPU's IA32_PQR_ASSOC MSR.
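A minimal sketch of that context-switch hook is shown below. This is
not the posted patch: task_cat_closid() and the per-CPU variable are
hypothetical, and the exact bit layout of IA32_PQR_ASSOC (class of
service in the upper 32 bits, monitoring RMID in the low bits) should
be checked against the SDM.

#include <linux/sched.h>
#include <linux/percpu.h>
#include <asm/msr.h>

#define MSR_IA32_PQR_ASSOC	0x0c8f

static DEFINE_PER_CPU(u32, cat_cur_closid);

/*
 * Hypothetical hook called on the context-switch path when 'next' is
 * about to run.  task_cat_closid() stands in for "CLOSid of the CAT
 * cgroup this task belongs to".
 */
static inline void cat_sched_in(struct task_struct *next)
{
	u32 closid = task_cat_closid(next);

	/* MSR writes are expensive: skip if the CLOS does not change */
	if (closid == this_cpu_read(cat_cur_closid))
		return;

	/* low dword: RMID (0 here, monitoring unused); high dword: CLOS */
	wrmsr(MSR_IA32_PQR_ASSOC, 0, closid);
	this_cpu_write(cat_cur_closid, closid);
}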
Usage and Example
-----------------

The following would mount the cache allocation cgroup subsystem and
create 2 directories. Please refer to Documentation/cgroups/cgroups.txt
for details about how to use cgroups.

  cd /sys/fs/cgroup
  mkdir cachealloc
  mount -t cgroup -ocachealloc cachealloc /sys/fs/cgroup/cachealloc
  cd cachealloc

Create 2 CAT cgroups:

  mkdir group1
  mkdir group2

Following are some of the files in the directory:

  ls
  cachealloc.cbm
  cachealloc.cpus        (the cpus file only appears in affinitized mode)
  cgroup.procs
  tasks
  cbm_max                (root only)
  affinitized            (root only; by default it is affinitized mode)

Say the cache is 2MB and the cbm supports 16 bits; then the setting
below allocates the right quarter (512KB) of the cache to group2.

Edit the CBM for group2 to set the least significant 4 bits. This
allocates the 'right quarter' of the cache:

  cd group2
  /bin/echo 0xf > cachealloc.cbm

Change the cpus of the directory:

  /bin/echo 1-4 > cachealloc.cpus

Edit the CBM for group2 to set the least significant 8 bits. This
allocates the right half of the cache to 'group2':

  /bin/echo 0xff > cachealloc.cbm

Assign tasks to group2:

  /bin/echo PID1 > tasks
  /bin/echo PID2 > tasks

Now threads PID1 and PID2 run on CPUs 1-4 and get to fill the 'right
half' of the cache. The tasks PID1 and PID2 can only have a subset of
the cpu affinity defined in the 'cpus' file.

Set 'affinitized' to 0; the mode is changed in the root directory:

  cd ..
  /bin/echo 0 > cachealloc.affinitized

Now the tasks and the cache allocation are not affinitized to the CPUs,
and the tasks' cpu affinity is no longer restricted to a subset of the
'cpus' cpumask.

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design
  2014-10-16 18:44 Cache Allocation Technology Design vikas
@ 2014-10-20 16:18 ` Matt Fleming
  2014-10-24 10:53   ` Peter Zijlstra
  2014-11-03 23:29 ` Vikas Shivappa
  1 sibling, 1 reply; 39+ messages in thread
From: Matt Fleming @ 2014-10-20 16:18 UTC (permalink / raw)
  To: vikas
  Cc: linux-kernel, matt.fleming, will.auld, tj, vikas.shivappa, Peter Zijlstra

(Cc'ing Peter Zijlstra for comments)

On Thu, 16 Oct, at 11:44:10AM, vikas wrote:
> [...]
--
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-20 16:18 ` Matt Fleming @ 2014-10-24 10:53 ` Peter Zijlstra 2014-10-28 23:22 ` Matt Fleming 2014-10-29 17:26 ` Vikas Shivappa 0 siblings, 2 replies; 39+ messages in thread From: Peter Zijlstra @ 2014-10-24 10:53 UTC (permalink / raw) To: Matt Fleming Cc: vikas, linux-kernel, matt.fleming, will.auld, tj, vikas.shivappa On Mon, Oct 20, 2014 at 05:18:55PM +0100, Matt Fleming wrote: > > What is Cache Allocation Technology ( CAT ) > > ------------------------------------------- Its a horrible name is what it is, please consider using the old name, that at least was clear in purpose. > > Kernel implementation Overview > > ------------------------------- > > > > Kernel implements a cgroup subsystem to support Cache Allocation. > > > > Creating a CAT cgroup would create a new CLOS <-> CBM mapping. Each > > cgroup would have one CBM and would just represent one cache 'subset'. > > > > The user would be allowed to create as many directories as there are > > CLOSs defined by the h/w. If user tries to create more than the > > available CLOSs , -ENOSPC is returned. Currently we support only one > > level of directory, ie directory can be created only under the root. NAK, cgroups must support full hierarchies, simply enforce that the child cgroup's mask is a subset of the parent's. > > There are 2 modes supported > > > > 1. Affinitized mode : Each CAT cgroup is affinitized to a set of CPUs > > specified by the 'cpus' file. The tasks in the CAT cgroup would be > > constrained only on the CPUs in the 'cpus' file. The CPUs in this file > > are exclusively used for this cgroup. Requests by task > > using the sched_setaffinity() would be filtered through the tasks > > 'cpus'. NAK, we will not have yet another cgroup mucking about with task affinities. > > These tasks would get to fill the LLC cache represented by the > > cgroup's 'cbm' file. 'cpus' is a cpumask and works the same way as > > the existing cpumask datastructure. > > > > 2. Non Affinitized mode : Each CAT cgroup(inturn 'subset') would be > > for a group of tasks. There is no 'cpus' file and the CPUs that the > > tasks run are not restricted by the CAT cgroup It appears to me this 'mode' thing is entirely superfluous and can be constructed by voluntary operation of this and cpusets or manual affinity calls. > > Assignment of CBM,CLOS and modes > > --------------------------------- > > > > Root directory would have all bits in 'cbm' file by default. > > > > The cbm_max file in the root defines the maximum number of bits > > describing the available cache units. Say if cbm_max is 16 then the > > 'cbm' cannot have more than 16 bits. This seems redundant, if you've already stated that the root cbm is the full set, there is no need to further provide this. > > The 'cbm' file is restricted to having no more than its cbm_max least > > significant bits set. Any contiguous subset of these bits maybe set to > > indication the cache mapping desired. The 'cbm' between 2 directories > > can overlap. The 'cbm' would represent the cache 'subset' of the CAT > > cgroup. This would follow from the hierarchy requirement/conditions. > > Scheduling and Context Switch > > ------------------------------ > > In non-affinitized mode the 'affinitized' is 0 , and the 'tasks' file > > indicate the tasks the cache subset is affinitized to. When user adds > > tasks to the tasks file , the tasks would get to fill the cache subset > > represented by the CAT cgroup's 'cbm' file. 
> > > > During context switch kernel implements this by writing the > > corresponding CLOSid (internally maintained by kernel) of the CAT > > cgroup to the CPU's IA32_PQR_ASSOC MSR. Right. ^ permalink raw reply [flat|nested] 39+ messages in thread
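For illustration, the kind of check being asked for here -- a 'cbm'
that is non-empty, contiguous, within cbm_max bits and a subset of its
parent's mask -- could be a small helper along these lines (a sketch
only; the function name is hypothetical, not from the posted patches):

#include <linux/types.h>

static bool cat_cbm_valid(u64 cbm, u64 parent_cbm, unsigned int cbm_max)
{
	u64 max_mask = (cbm_max < 64) ? (1ULL << cbm_max) - 1 : ~0ULL;

	if (!cbm || (cbm & ~max_mask))
		return false;		/* empty, or bits above cbm_max */

	if (cbm & ~parent_cbm)
		return false;		/* must be a subset of the parent's mask */

	/*
	 * Contiguity check: adding the lowest set bit to a contiguous run
	 * clears the whole run, so nothing of the original mask survives.
	 */
	return ((cbm + (cbm & -cbm)) & cbm) == 0;
}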
* Re: Cache Allocation Technology Design 2014-10-24 10:53 ` Peter Zijlstra @ 2014-10-28 23:22 ` Matt Fleming 2014-10-29 8:16 ` Peter Zijlstra 2014-10-29 17:26 ` Vikas Shivappa 1 sibling, 1 reply; 39+ messages in thread From: Matt Fleming @ 2014-10-28 23:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, linux-kernel, Matt Fleming, Will Auld, Tejun Heo, Vikas Shivappa On Fri, 24 Oct, at 12:53:06PM, Peter Zijlstra wrote: > > NAK, cgroups must support full hierarchies, simply enforce that the > child cgroup's mask is a subset of the parent's. For the specific case of Cache Allocation, if we're creating hierarchies from bitmasks, there's a very clear limit to how we can divide up the bits - we can't support an indefinite number of cgroup directories. What do you mean by "full hierarchies"? -- Matt Fleming, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design
  2014-10-28 23:22     ` Matt Fleming
@ 2014-10-29  8:16       ` Peter Zijlstra
  2014-10-29 12:48         ` Matt Fleming
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2014-10-29 8:16 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Vikas Shivappa, linux-kernel, Matt Fleming, Will Auld, Tejun Heo, Vikas Shivappa

On Tue, Oct 28, 2014 at 11:22:15PM +0000, Matt Fleming wrote:
> On Fri, 24 Oct, at 12:53:06PM, Peter Zijlstra wrote:
> >
> > NAK, cgroups must support full hierarchies, simply enforce that the
> > child cgroup's mask is a subset of the parent's.
>
> For the specific case of Cache Allocation, if we're creating hierarchies
> from bitmasks, there's a very clear limit to how we can divide up the
> bits - we can't support an indefinite number of cgroup directories.
>
> What do you mean by "full hierarchies"?

Ah, so one way around that is to only assign a (whats the CQE equivalent
of RMIDs again?) once you stick a task in.

But basically it means you need to allow things like:

  root/virt/more/crap/hostA
                     /hostB
                     /sanityA
       /random/other/yunk

Now, the root will have the entire bitmask set, any child, say
virt/more/crap can also have them all set, and you can maybe only start
differentiating in the /host[AB] bits.

Whether or not it makes sense, libvirt likes to create these pointless
deep hierarchies, as do a lot of other people for that matter.

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design
  2014-10-29  8:16       ` Peter Zijlstra
@ 2014-10-29 12:48         ` Matt Fleming
  2014-10-29 13:45           ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Matt Fleming @ 2014-10-29 12:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vikas Shivappa, linux-kernel, Matt Fleming, Will Auld, Tejun Heo, Vikas Shivappa

On Wed, 29 Oct, at 09:16:40AM, Peter Zijlstra wrote:
>
> Ah, so one way around that is to only assign a (whats the CQE equivalent
> of RMIDs again?) once you stick a task in.

I think you're after "Class of Service" (CLOS) ID.

Yeah we can do the CLOS ID assignment on-demand but what we can't do
on-demand is the cache bitmask assignment, i.e. how we carve up the LLC.
These need to persist irrespective of which task is running. And it's
the cache bitmask that I'm specifically talking about not allowing
arbitrarily deep nesting.

So if I create a cgroup directory with a mask of 0x3 in the root cgroup
directory for CAT (meow). Then, create two sub-directories, and split my
0x3 bitmask into 0x2 and 0x1, it's impossible to nest any further, i.e.

  /sys/fs/cgroup/cacheqe    0xffffffff
           |
           |
         meow               0x3
          / \
         /   \
       sub1  sub2           0x1, 0x2

Of course the pathological case is creating a cgroup directory with
bitmask 0x1, so you can't have sub-directories because you can't split
the cache allocation at all.

Does this fly in the face of "full hierarchies"? Or is this a reasonable
limitation?

> But basically it means you need to allow things like:
>
>   root/virt/more/crap/hostA
>                      /hostB
>                      /sanityA
>        /random/other/yunk
>
> Now, the root will have the entire bitmask set, any child, say
> virt/more/crap can also have them all set, and you can maybe only start
> differentiating in the /host[AB] bits.
>
> Whether or not it makes sense, libvirt likes to create these pointless
> deep hierarchies, as do a lot of other people for that matter.

OK, this is something I hadn't considered; that you may *not* want to
split the cache bitmask as you move down the hierarchy.

I think that's something we could do without too much pain, though
actually programming that from a user perspective makes my head hurt.

--
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-29 12:48 ` Matt Fleming @ 2014-10-29 13:45 ` Peter Zijlstra 2014-10-29 16:32 ` Auld, Will 0 siblings, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-29 13:45 UTC (permalink / raw) To: Matt Fleming Cc: Vikas Shivappa, linux-kernel, Matt Fleming, Will Auld, Tejun Heo, Vikas Shivappa On Wed, Oct 29, 2014 at 12:48:34PM +0000, Matt Fleming wrote: > On Wed, 29 Oct, at 09:16:40AM, Peter Zijlstra wrote: > > > > Ah, so one way around that is to only assign a (whats the CQE equivalent > > of RMIDs again?) once you stick a task in. > > I think you're after "Class of Service" (CLOS) ID. > > Yeah we can do the CLOS ID assignment on-demand but what we can't do > on-demand is the cache bitmask assignment, i.e. how we carve up the LLC. > These need to persist irrespective of which task is running. And it's > the cache bitmask that I'm specifically talking about not allowing > arbitrarly deep nesting. > > So if I create a cgroup directory with a mask of 0x3 in the root cgroup > directory for CAT (meow). All we now need is a DOG to go woof :-) and they can have a party. > Then, create two sub-directories, and split my > 0x3 bitmask into 0x2 and 0x1, it's impossible to nest any further, i.e. > > /sys/fs/cgroup/cacheqe 0xffffffff > | > | > meow 0x3 > / \ > / \ > sub1 sub2 0x1, 0x2 > > Of course the pathological case is creating a cgroup directory with > bitmask 0x1, so you can't have sub-directories because you can't split > the cache allocation at all. > > Does this fly in the face of "full hierarchies"? Or is this a reasonable > limitation? I don't see a reason why we should not allow further children of sub1, they'll all have to have 0x1, but that should be fine, pointless perhaps, but perfectly consistent. > > But basically it means you need to allow things like: > > > > root/virt/more/crap/hostA > > /hostB > > /sanityA > > /random/other/yunk > > > > Now, the root will have the entire bitmask set, any child, say > > virt/more/crap can also have them all set, and you can maybe only start > > differentiating in the /host[AB] bits. > > > > Whether or not it makes sense, libvirt likes to create these pointless > > deep hierarchies, as do a lot of other people for that matter. > > OK, this is something I hadn't considered; that you may *not* want to > split the cache bitmask as you move down the hierarchy. > > I think that's something we could do without too much pain, though > actually programming that from a user perspective makes my head hurt. Right, also note that in the libvirt case, most of the intermediate groups are empty (of tasks) and would thus not actually instantiate a CLOS thingy. ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: Cache Allocation Technology Design
  2014-10-29 13:45           ` Peter Zijlstra
@ 2014-10-29 16:32             ` Auld, Will
  2014-10-29 17:28               ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Auld, Will @ 2014-10-29 16:32 UTC (permalink / raw)
  To: Peter Zijlstra, Matt Fleming
  Cc: Vikas Shivappa, linux-kernel, Fleming, Matt, Tejun Heo, Shivappa, Vikas, Auld, Will

I may be repeating what Peter has just said, but for elements in the
hierarchy where the mask is the same as its parent's mask there is no
need for a separate CLOS, even in the case where there are tasks in the
group. So we can inherit the CLOS of the parent until both the mask is
different from the parent's and there are tasks in the group.

Thanks,

Will

> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Wednesday, October 29, 2014 6:45 AM
> To: Matt Fleming
> Cc: Vikas Shivappa; linux-kernel@vger.kernel.org; Fleming, Matt; Auld,
> Will; Tejun Heo; Shivappa, Vikas
> Subject: Re: Cache Allocation Technology Design
>
> [...]
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-29 16:32 ` Auld, Will @ 2014-10-29 17:28 ` Peter Zijlstra 2014-10-29 17:41 ` Vikas Shivappa 0 siblings, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-29 17:28 UTC (permalink / raw) To: Auld, Will Cc: Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt, Tejun Heo, Shivappa, Vikas On Wed, Oct 29, 2014 at 04:32:04PM +0000, Auld, Will wrote: > I maybe repeating what Peter has just said but for elements in the > hierarchy where the mask is the same as its parents mask there is no > need for a separate CLOS even in the case where there are tasks in the > group. So we can inherit the CLOS of the parent until which time both > the mask is different than the parent and there are tasks in the > group. I did not state that explicitly, but I did think about that. We could still wait to allocate a CLOS until at least one such group acquires a task. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-29 17:28 ` Peter Zijlstra @ 2014-10-29 17:41 ` Vikas Shivappa 2014-10-29 18:22 ` Tejun Heo 0 siblings, 1 reply; 39+ messages in thread From: Vikas Shivappa @ 2014-10-29 17:41 UTC (permalink / raw) To: Peter Zijlstra Cc: Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt, Tejun Heo, Shivappa, Vikas On Wed, 29 Oct 2014, Peter Zijlstra wrote: > On Wed, Oct 29, 2014 at 04:32:04PM +0000, Auld, Will wrote: >> I maybe repeating what Peter has just said but for elements in the >> hierarchy where the mask is the same as its parents mask there is no >> need for a separate CLOS even in the case where there are tasks in the >> group. So we can inherit the CLOS of the parent until which time both >> the mask is different than the parent and there are tasks in the >> group. > > I did not state that explicitly, but I did think about that. We could > still wait to allocate a CLOS until at least one such group acquires a > task. > Was wondering if it is a requirement of the 'full hierarchy' for the child to inherit the cbm of parent ? . Alternately we can allocate the CLOSid when a cgroup is created and have an empty cbm - but dont let the tasks to be added unless the user assigns a cbm. Cpuset does something similar where its necessary to set the cpu mask(empty by default) of a cgroup before adding tasks. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-29 17:41 ` Vikas Shivappa @ 2014-10-29 18:22 ` Tejun Heo 2014-10-30 7:07 ` Peter Zijlstra 0 siblings, 1 reply; 39+ messages in thread From: Tejun Heo @ 2014-10-29 18:22 UTC (permalink / raw) To: Vikas Shivappa Cc: Peter Zijlstra, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Wed, Oct 29, 2014 at 10:41:47AM -0700, Vikas Shivappa wrote: > Was wondering if it is a requirement of the 'full hierarchy' for the child > to inherit the cbm of parent ? . > Alternately we can allocate the CLOSid when a cgroup is created and have an > empty cbm - but dont let the tasks to be added unless the user assigns a Please don't do that. All controllers must be fully hierarchical, shouldn't fail task migration and always allow execution of member tasks. Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-29 18:22 ` Tejun Heo @ 2014-10-30 7:07 ` Peter Zijlstra 2014-10-30 7:14 ` Peter Zijlstra 2014-10-30 12:43 ` Tejun Heo 0 siblings, 2 replies; 39+ messages in thread From: Peter Zijlstra @ 2014-10-30 7:07 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Wed, Oct 29, 2014 at 02:22:34PM -0400, Tejun Heo wrote: > On Wed, Oct 29, 2014 at 10:41:47AM -0700, Vikas Shivappa wrote: > > Was wondering if it is a requirement of the 'full hierarchy' for the child > > to inherit the cbm of parent ? . > > Alternately we can allocate the CLOSid when a cgroup is created and have an > > empty cbm - but dont let the tasks to be added unless the user assigns a > > Please don't do that. All controllers must be fully hierarchical, With you so far. > shouldn't fail task migration If this means echo $tid > tasks, then sorry we can't do. There is a limited number of hardware resources backing this thing. At some point they're consumed and something must give. So either we fail mkdir, but that means allocating CLOS IDs for possibly empty cgroups, or we allocate on demand which means failing task assignment. The same -- albeit for a different reason -- is true of the RT sched groups, we simply cannot instantiate them such that tasks can join, sysads _have_ to configure them before we can add tasks to them. > and always allow execution of member tasks. If we accept tasks, they'll run. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 7:07 ` Peter Zijlstra @ 2014-10-30 7:14 ` Peter Zijlstra 2014-10-30 12:44 ` Tejun Heo 2014-10-30 12:43 ` Tejun Heo 1 sibling, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-30 7:14 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote: > > and always allow execution of member tasks. This too btw is not strictly speaking possible for all controllers. Most all sched controllers live by the grace of forcing tasks not to run at times (eg. the bandwidth controls), falsifying the 'always'. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 7:14 ` Peter Zijlstra @ 2014-10-30 12:44 ` Tejun Heo 2014-10-30 13:19 ` Peter Zijlstra 0 siblings, 1 reply; 39+ messages in thread From: Tejun Heo @ 2014-10-30 12:44 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, Oct 30, 2014 at 08:14:24AM +0100, Peter Zijlstra wrote: > On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote: > > > and always allow execution of member tasks. > > This too btw is not strictly speaking possible for all controllers. Most > all sched controllers live by the grace of forcing tasks not to run at > times (eg. the bandwidth controls), falsifying the 'always'. Oh sure, the a task just has to run in a foreseeable future, or rather, a task must not be blocked indefinitely requiring userland intervention to become executable again. Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 12:44 ` Tejun Heo @ 2014-10-30 13:19 ` Peter Zijlstra 2014-10-30 15:25 ` Tejun Heo 0 siblings, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-30 13:19 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, Oct 30, 2014 at 08:44:40AM -0400, Tejun Heo wrote: > On Thu, Oct 30, 2014 at 08:14:24AM +0100, Peter Zijlstra wrote: > > On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote: > > > > and always allow execution of member tasks. > > > > This too btw is not strictly speaking possible for all controllers. Most > > all sched controllers live by the grace of forcing tasks not to run at > > times (eg. the bandwidth controls), falsifying the 'always'. > > Oh sure, the a task just has to run in a foreseeable future, or > rather, a task must not be blocked indefinitely requiring userland > intervention to become executable again. Like the freezer cgroup you mean? ;-) ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 13:19 ` Peter Zijlstra @ 2014-10-30 15:25 ` Tejun Heo 0 siblings, 0 replies; 39+ messages in thread From: Tejun Heo @ 2014-10-30 15:25 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt Hello, Peter. On Thu, Oct 30, 2014 at 02:19:04PM +0100, Peter Zijlstra wrote: > On Thu, Oct 30, 2014 at 08:44:40AM -0400, Tejun Heo wrote: > > On Thu, Oct 30, 2014 at 08:14:24AM +0100, Peter Zijlstra wrote: > > > On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote: > > > > > and always allow execution of member tasks. > > > > > > This too btw is not strictly speaking possible for all controllers. Most > > > all sched controllers live by the grace of forcing tasks not to run at > > > times (eg. the bandwidth controls), falsifying the 'always'. > > > > Oh sure, the a task just has to run in a foreseeable future, or > > rather, a task must not be blocked indefinitely requiring userland > > intervention to become executable again. > > Like the freezer cgroup you mean? ;-) Oh yeah, that's horribly broken. Merging it with jobctl stop is a todo item. This "stuck in a random place in kernel" thing made sense for suspend/hibernation only because the kernel wasn't gonna run anymore. The fact that this got exposed to userland on a running system just shows how little we were thinking while implementing all the controllers. It should be equivalent to layered job control stop so that what's prevented from running is the userland part, not kernel. Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 7:07 ` Peter Zijlstra 2014-10-30 7:14 ` Peter Zijlstra @ 2014-10-30 12:43 ` Tejun Heo 2014-10-30 13:18 ` Peter Zijlstra ` (3 more replies) 1 sibling, 4 replies; 39+ messages in thread From: Tejun Heo @ 2014-10-30 12:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt Hello, Peter. On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote: > If this means echo $tid > tasks, then sorry we can't do. There is a > limited number of hardware resources backing this thing. At some point > they're consumed and something must give. And that something shouldn't be disallowing task migration across cgroups. This simply doesn't work with co-mounting or unified hierarchy. cpuset automatically takes on the nearest ancestor's configuration which has enough execution resources. Maybe that can be an option for this too? One of the problems is that we generally assume that a task can run some point in time in a lot of places in the kernel and can't just not run a task indefinitely because it's in a cgroup configured certain way. > So either we fail mkdir, but that means allocating CLOS IDs for possibly > empty cgroups, or we allocate on demand which means failing task > assignment. Can't fail mkdir or css enabling either. Again, co-mounting and unified hierarchy. Also, the behavior is just horrible to use from userland. > The same -- albeit for a different reason -- is true of the RT sched > groups, we simply cannot instantiate them such that tasks can join, > sysads _have_ to configure them before we can add tasks to them. Yeah, RT is one of the main items which is problematic, more so because it's currently coupled with the normal sched controller and the default config doesn't have any RT slice. Do we completely block RT task w/o slice? Is that okay? Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 12:43 ` Tejun Heo @ 2014-10-30 13:18 ` Peter Zijlstra 2014-10-30 17:03 ` Tejun Heo 2014-10-30 14:14 ` Matt Fleming ` (2 subsequent siblings) 3 siblings, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-30 13:18 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, Oct 30, 2014 at 08:43:33AM -0400, Tejun Heo wrote: > Hello, Peter. > > On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote: > > If this means echo $tid > tasks, then sorry we can't do. There is a > > limited number of hardware resources backing this thing. At some point > > they're consumed and something must give. > > And that something shouldn't be disallowing task migration across > cgroups. This simply doesn't work with co-mounting or unified > hierarchy. cpuset automatically takes on the nearest ancestor's > configuration which has enough execution resources. Maybe that can be > an option for this too? It will give very random and nondeterministic behaviour and basically destroy the entire purpose of the controller (which are the very same reasons I detest that 'new' behaviour in cpusets). > One of the problems is that we generally assume that a task can run > some point in time in a lot of places in the kernel and can't just not > run a task indefinitely because it's in a cgroup configured certain > way. Refusing tasks into a previously empty cgroup creates no such problems. Its already in a cgroup (wherever its parent was) and it can run there, failing to move it to another does not affect things. > > So either we fail mkdir, but that means allocating CLOS IDs for possibly > > empty cgroups, or we allocate on demand which means failing task > > assignment. > > Can't fail mkdir or css enabling either. Again, co-mounting and > unified hierarchy. Also, the behavior is just horrible to use from > userland. In order to fix the co-mounting and unified hierarchy I still need to hear a proposal for that tasks vs processes thing. Traditionally the cgroups were task based, but many controllers are process based (simply because what they control is process wide, not per task), and there was talk (2-3 years ago or so) about making the entire cgroup thing per process, which obviously fails for all scheduler related cgroups. > > The same -- albeit for a different reason -- is true of the RT sched > > groups, we simply cannot instantiate them such that tasks can join, > > sysads _have_ to configure them before we can add tasks to them. > > Yeah, RT is one of the main items which is problematic, more so > because it's currently coupled with the normal sched controller and > the default config doesn't have any RT slice. Simply because you cannot give a slice on creation; or if you did that would mean failing mkdir when a new cgroup would exceed the available time. Also any !0 slice is wrong because it will not match the requirements of the proposed workload, the administrator will have to set it to match the workload. Therefore 0. > Do we completely block RT task w/o slice? Is that okay? We will not allow an RT task in, the write to the tasks file will fail. The same will be true for deadline tasks, we'll fail entry into a cgroup when the combined requirements of the tasks exceed the provisions of the group. There is just no way around that and still provide sane semantics. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 13:18 ` Peter Zijlstra @ 2014-10-30 17:03 ` Tejun Heo 2014-10-30 21:43 ` Peter Zijlstra 0 siblings, 1 reply; 39+ messages in thread From: Tejun Heo @ 2014-10-30 17:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt Hey, Peter. On Thu, Oct 30, 2014 at 02:18:45PM +0100, Peter Zijlstra wrote: > On Thu, Oct 30, 2014 at 08:43:33AM -0400, Tejun Heo wrote: > > And that something shouldn't be disallowing task migration across > > cgroups. This simply doesn't work with co-mounting or unified > > hierarchy. cpuset automatically takes on the nearest ancestor's > > configuration which has enough execution resources. Maybe that can be > > an option for this too? > > It will give very random and nondeterministic behaviour and basically > destroy the entire purpose of the controller (which are the very same > reasons I detest that 'new' behaviour in cpusets). I agree with you that this is a corner case behavior which deviates from the usual behavior; however, the deviation is inherent. This stems from the fact that the kernel in general doesn't allow tasks which cannot be run. You say that you detest the new behaviors of cpuset; however, the old behaviors were just as sucky - bouncing tasks to an ancestor cgroup forcifully and without any indication or way to restore the previous configuration. What's different with the new behavior is that it explicitly distinguishes between the configured and effective configurations as the kernel isn't capable for actually enforcing certain subset of configurations. So, the inherent problem is always there no matter what we do and the question is that of a policy to deal with it. One of the main issues I see with failing cgroup-level operations for controller specific reasons is lack of visibility. All you can get out of a failed operation is a single error return and there's no good way to communicate why something isn't working, well not even who's the culprit. Having "effective" vs "configured" makes it explicit that the kernel isn't capable of honoring all configurations and make the details of the situation visible. Another part is inconsistencies across controllers. This sure is worse when there are multiple controllers involved but inconsistent behaviors across different hierarchies are annoying all the same with single controller multiple hierarchies. Userland often manages some of those hierarchies together and it can get horribly confusing. No matter what, we need to settle on a single policy and having effective configuration seems like the better one. > > One of the problems is that we generally assume that a task can run > > some point in time in a lot of places in the kernel and can't just not > > run a task indefinitely because it's in a cgroup configured certain > > way. > > Refusing tasks into a previously empty cgroup creates no such problems. > Its already in a cgroup (wherever its parent was) and it can run there, > failing to move it to another does not affect things. Yeah, sure, hard failing can work too. It didn't work well for cpuset because a runnable configuration may become not so if the system config changes afterwards but this probably doesn't have an issue like that. I'm not saying something like the above won't work. It'd but I don't think that's the right place to fail. This controller might not even require the distinction between configured and effective tho? 
Can't a new child just inherit the parent's configuration and never allow the config to become completely empty? The problem cpuset faces is that of underlying hardware configuration changing. This one doesn't have that. > > > So either we fail mkdir, but that means allocating CLOS IDs for possibly > > > empty cgroups, or we allocate on demand which means failing task > > > assignment. > > > > Can't fail mkdir or css enabling either. Again, co-mounting and > > unified hierarchy. Also, the behavior is just horrible to use from > > userland. > > In order to fix the co-mounting and unified hierarchy I still need to > hear a proposal for that tasks vs processes thing. > > Traditionally the cgroups were task based, but many controllers are > process based (simply because what they control is process wide, not per > task), and there was talk (2-3 years ago or so) about making the entire > cgroup thing per process, which obviously fails for all scheduler > related cgroups. Yeah, it needs to be a separate interface where a given userland task can access its own knobs in a race-free way (cgroup interface can't even do that) whether that's a pseudo filesystem, say, /proc/self/BLAHBLAH or new syscalls. This one is necessary regardless of what happens with cgroup. cgroup simply isn't a suitable mechanism to expose these types of knobs to individual userland threads. > > Yeah, RT is one of the main items which is problematic, more so > > because it's currently coupled with the normal sched controller and > > the default config doesn't have any RT slice. > > Simply because you cannot give a slice on creation; or if you did that > would mean failing mkdir when a new cgroup would exceed the available > time. > > Also any !0 slice is wrong because it will not match the requirements of > the proposed workload, the administrator will have to set it to match > the workload. > > Therefore 0. As long as RT is separate from normal sched controller, this *could* be fine. The main problem now is that userland which wants to use the cpu controller but doesn't want to fully manage RT slices end up disabling RT slices. It might work if a new child can share the parent's slice till explicitly configured. Another problem is when you wanna change the configuration after the hierarchy is already populated. I don't know. I'd even be happy with cgroup not having anything to do with RT slice distribution. Do you have any ideas which can make RT slice distribution more palatable? If we can't decouple the two, we'd be effectively requiring whoever is managing the cpu controller to also become a full-fledged RT slice arbitrator, which might actually work too. > > Do we completely block RT task w/o slice? Is that okay? > > We will not allow an RT task in, the write to the tasks file will fail. > > The same will be true for deadline tasks, we'll fail entry into a cgroup > when the combined requirements of the tasks exceed the provisions of the > group. > > There is just no way around that and still provide sane semantics. Can't a task just lose RT / deadline properties when migrating into a different RT / deadline domain? We already modify task properties on migration for cpuset after all. It'd be far simpler that way. Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 17:03 ` Tejun Heo @ 2014-10-30 21:43 ` Peter Zijlstra 2014-10-30 22:22 ` Tejun Heo 0 siblings, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-30 21:43 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, Oct 30, 2014 at 01:03:31PM -0400, Tejun Heo wrote: > Hey, Peter. > > On Thu, Oct 30, 2014 at 02:18:45PM +0100, Peter Zijlstra wrote: > > On Thu, Oct 30, 2014 at 08:43:33AM -0400, Tejun Heo wrote: > > > And that something shouldn't be disallowing task migration across > > > cgroups. This simply doesn't work with co-mounting or unified > > > hierarchy. cpuset automatically takes on the nearest ancestor's > > > configuration which has enough execution resources. Maybe that can be > > > an option for this too? > > > > It will give very random and nondeterministic behaviour and basically > > destroy the entire purpose of the controller (which are the very same > > reasons I detest that 'new' behaviour in cpusets). > > I agree with you that this is a corner case behavior which deviates > from the usual behavior; however, the deviation is inherent. This > stems from the fact that the kernel in general doesn't allow tasks > which cannot be run. You say that you detest the new behaviors of > cpuset; however, the old behaviors were just as sucky - bouncing tasks > to an ancestor cgroup forcifully and without any indication or way to > restore the previous configuration. What's different with the new > behavior is that it explicitly distinguishes between the configured > and effective configurations as the kernel isn't capable for actually > enforcing certain subset of configurations. If a cpu bounces (by accident or whatever) then there is no trace left behind that the system didn't in fact observe/obey its constraints. It should have provided an error or failed the hotplug. But we digress, lets not have this discussion (again :) and focus on the new thing. > So, the inherent problem is always there no matter what we do and the > question is that of a policy to deal with it. One of the main issues > I see with failing cgroup-level operations for controller specific > reasons is lack of visibility. All you can get out of a failed > operation is a single error return and there's no good way to > communicate why something isn't working, well not even who's the > culprit. Having "effective" vs "configured" makes it explicit that > the kernel isn't capable of honoring all configurations and make the > details of the situation visible. Right, so that is a short coming of the co-mount idea. Your effective vs configured thing is misleading and surprising though. Operations might 'succeed' and still have failed, without any clear indication/notification of change. > Another part is inconsistencies across controllers. This sure is > worse when there are multiple controllers involved but inconsistent > behaviors across different hierarchies are annoying all the same with > single controller multiple hierarchies. Userland often manages some > of those hierarchies together and it can get horribly confusing. No > matter what, we need to settle on a single policy and having effective > configuration seems like the better one. I'm not entirely sure I follow. Without co-mounting its entirely obvious which one is failing. 
Also, per the previous point, since you need a notification channel anyway, you might as well do the expected fail and report more details through that. > > > One of the problems is that we generally assume that a task can run > > > some point in time in a lot of places in the kernel and can't just not > > > run a task indefinitely because it's in a cgroup configured certain > > > way. > > > > Refusing tasks into a previously empty cgroup creates no such problems. > > Its already in a cgroup (wherever its parent was) and it can run there, > > failing to move it to another does not affect things. > > Yeah, sure, hard failing can work too. It didn't work well for cpuset > because a runnable configuration may become not so if the system > config changes afterwards but this probably doesn't have an issue like > that. I'm not saying something like the above won't work. It'd but I > don't think that's the right place to fail. Right, this thing doesn't suffer that particular problem if its good it stays good. > This controller might not even require the distinction between > configured and effective tho? Can't a new child just inherit the > parent's configuration and never allow the config to become completely > empty? It can do that. But that still has a problem, there is a mapping in hardware which restricts the number of active configurations. The total configuration space is larger than the supported active configurations. So _something_ must fail. The initial proposal was mkdir failing when there were more than the hardware supported active config cgroup directories. The alternative was on-demand activation where we only allocate the hardware resource when the first task gets moved into the group -- which then clearly can fail. > > Traditionally the cgroups were task based, but many controllers are > > process based (simply because what they control is process wide, not per > > task), and there was talk (2-3 years ago or so) about making the entire > > cgroup thing per process, which obviously fails for all scheduler > > related cgroups. > > Yeah, it needs to be a separate interface where a given userland task > can access its own knobs in a race-free way (cgroup interface can't > even do that) whether that's a pseudo filesystem, say, > /proc/self/BLAHBLAH or new syscalls. This one is necessary regardless > of what happens with cgroup. cgroup simply isn't a suitable mechanism > to expose these types of knobs to individual userland threads. I'm not sure what you're saying there. You want to replace the task-controllers with another pseudo filesystem that does it differently but still is a hierarchical controller?, how is that different from just not co-mounting the task and process based controllers, either way you end up with 2 separate hierarchies. > > > Yeah, RT is one of the main items which is problematic, more so > > > because it's currently coupled with the normal sched controller and > > > the default config doesn't have any RT slice. > > > > Simply because you cannot give a slice on creation; or if you did that > > would mean failing mkdir when a new cgroup would exceed the available > > time. > > > > Also any !0 slice is wrong because it will not match the requirements of > > the proposed workload, the administrator will have to set it to match > > the workload. > > > > Therefore 0. > > As long as RT is separate from normal sched controller, this *could* > be fine. 
> The main problem now is that userland which wants to use the > cpu controller but doesn't want to fully manage RT slices end up > disabling RT slices. I don't get this: who but the admin manages things, and how would you accidentally have an RT app and not know about it? And if you're in that situation you're screwed anyhow, since you've no f'ing clue how to configure your system for it anyhow. At which point you're in deep. > It might work if a new child can share the > parent's slice till explicitly configured. Principle of least surprise. That's surprising behaviour. Why move it in the first place? > Another problem is when > you wanna change the configuration after the hierarchy is already > populated. We fail the configuration change. For RR/FIFO we won't allow you to set the slice to 0 if there are tasks. For deadline we would fail everything that tries to lower things below the utilization required by the tasks (and child groups). > I don't know. I'd even be happy with cgroup not having > anything to do with RT slice distribution. Do you have any ideas > which can make RT slice distribution more palatable? If we can't > decouple the two, we'd be effectively requiring whoever is managing > the cpu controller to also become a full-fledged RT slice arbitrator, > which might actually work too. The admin you mean? He had better know what the heck he's doing if he's running RT apps; great fail is otherwise fairly deterministic in his future. The thing is, you cannot arbitrate this stuff; RR/FIFO are horrible pieces of shit interfaces, they don't describe near enough. People need to be involved. > > > Do we completely block RT task w/o slice? Is that okay? > > > > We will not allow an RT task in, the write to the tasks file will fail. > > > > The same will be true for deadline tasks, we'll fail entry into a cgroup > > when the combined requirements of the tasks exceed the provisions of the > > group. > > > > There is just no way around that and still provide sane semantics. > Can't a task just lose RT / deadline properties when migrating into a > different RT / deadline domain? We already modify task properties on > migration for cpuset after all. It'd be far simpler that way. Again, why move it in the first place? This all sounds like whoever is doing this is clueless. You don't move RT tasks about if you're not intimately aware of them and their requirements. ^ permalink raw reply [flat|nested] 39+ messages in thread
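The rejection policy Peter describes for the RR/FIFO and deadline cases can be modelled in a few lines. The following is only an illustrative userspace model, not the scheduler code; the struct, the -EBUSY error value and the utilization arithmetic are assumptions made for the example:

#include <errno.h>
#include <stdio.h>

/* Illustrative model only: a group with an RT provision and some tasks. */
struct rt_group {
	unsigned long long rt_runtime_us;   /* slice within rt_period */
	unsigned long long rt_period_us;
	int nr_rt_tasks;                    /* RR/FIFO tasks in the group */
	double dl_utilization;              /* summed runtime/period of deadline tasks */
};

/* Reject configuration changes that would invalidate already-admitted tasks. */
static int set_rt_runtime(struct rt_group *g, unsigned long long new_runtime_us)
{
	/* RR/FIFO: a zero slice is refused while the group still has RT tasks. */
	if (new_runtime_us == 0 && g->nr_rt_tasks > 0)
		return -EBUSY;

	/* Deadline: refuse lowering the provision below what the tasks already need. */
	if ((double)new_runtime_us / g->rt_period_us < g->dl_utilization)
		return -EBUSY;

	g->rt_runtime_us = new_runtime_us;
	return 0;
}

int main(void)
{
	struct rt_group g = { 500000, 1000000, 2, 0.3 };

	printf("shrink to 40%%: %d\n", set_rt_runtime(&g, 400000));    /* ok */
	printf("shrink below need: %d\n", set_rt_runtime(&g, 200000)); /* rejected */
	printf("zero with RT tasks: %d\n", set_rt_runtime(&g, 0));     /* rejected */
	return 0;
}

The write to the configuration file is where the failure surfaces, so whoever is changing the slice gets the error directly rather than some task elsewhere silently losing its guarantees.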
* Re: Cache Allocation Technology Design 2014-10-30 21:43 ` Peter Zijlstra @ 2014-10-30 22:22 ` Tejun Heo 2014-10-30 22:47 ` Peter Zijlstra 2014-10-31 13:07 ` Peter Zijlstra 0 siblings, 2 replies; 39+ messages in thread From: Tejun Heo @ 2014-10-30 22:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt Hello, On Thu, Oct 30, 2014 at 10:43:53PM +0100, Peter Zijlstra wrote: > If a cpu bounces (by accident or whatever) then there is no trace left > behind that the system didn't in fact observe/obey its constraints. It > should have provided an error or failed the hotplug. But we digress, > lets not have this discussion (again :) and focus on the new thing. Oh, we sure can have notifications / persistent markers to track deviation from the configuration. It's not like the old scheme did much better in this respect. It just wrecked the configuration without telling anyone. If this matters enough, we need error recording / reporting no matter which way we choose. I'm not against that at all. > > So, the inherent problem is always there no matter what we do and the > > question is that of a policy to deal with it. One of the main issues > > I see with failing cgroup-level operations for controller specific > > reasons is lack of visibility. All you can get out of a failed > > operation is a single error return and there's no good way to > > communicate why something isn't working, well not even who's the > > culprit. Having "effective" vs "configured" makes it explicit that > > the kernel isn't capable of honoring all configurations and make the > > details of the situation visible. > > Right, so that is a short coming of the co-mount idea. Your effective vs > configured thing is misleading and surprising though. Operations might > 'succeed' and still have failed, without any clear > indication/notification of change. Hmmm... it gets more pronounced w/ co-mounting but it's also a problem with isolated hierarchies too. How is changing configuration irreversibly without any notification any less surprising? It's the same end result. The only difference is that there's no way to go back when the resource which went offline comes back. I really don't think configuration being silently changed counts as a valid notification mechanism to userland. > > Another part is inconsistencies across controllers. This sure is > > worse when there are multiple controllers involved but inconsistent > > behaviors across different hierarchies are annoying all the same with > > single controller multiple hierarchies. Userland often manages some > > of those hierarchies together and it can get horribly confusing. No > > matter what, we need to settle on a single policy and having effective > > configuration seems like the better one. > > I'm not entirely sure I follow. Without co-mounting its entirely obvious > which one is failing. Sure, "which" is easier w/o co-mounting. "Why" can still be hard tho as migration is an "apply all the configs" event. > Also, per the previous point, since you need a notification channel > anyway, you might as well do the expected fail and report more details > through that. How do you match the failure to the specific migration attempt tho? I really can't think of a good and simple interface for that given the interface that we have. For most controllers, it is fairly straightforward to avoid controller specific migration failures. Sure, cpuset is special but it has to be special one way or the other. 
> > This controller might not even require the distinction between > > configured and effective tho? Can't a new child just inherit the > > parent's configuration and never allow the config to become completely > > empty? > > It can do that. But that still has a problem, there is a mapping in > hardware which restricts the number of active configurations. The total > configuration space is larger than the supported active configurations. > > So _something_ must fail. The initial proposal was mkdir failing when > there were more than the hardware supported active config cgroup > directories. The alternative was on-demand activation where we only > allocate the hardware resource when the first task gets moved into the > group -- which then clearly can fail. Hmmm... why can't it just refuse creating a different configuration when its config space is full? Make children inherit the parent's configuration and refuse config writes which require it to create a new one if the config space is full. Seems pretty straight-forward. What am I missing? > > Yeah, it needs to be a separate interface where a given userland task > > can access its own knobs in a race-free way (cgroup interface can't > > even do that) whether that's a pseudo filesystem, say, > > /proc/self/BLAHBLAH or new syscalls. This one is necessary regardless > > of what happens with cgroup. cgroup simply isn't a suitable mechanism > > to expose these types of knobs to individual userland threads. > > I'm not sure what you're saying there. You want to replace the > task-controllers with another pseudo filesystem that does it differently > but still is a hierarchical controller?, how is that different from just > not co-mounting the task and process based controllers, either way you > end up with 2 separate hierarchies. It doesn't have much to do with co-mounting. The process itself often has to be involved in assigning different properties to its threads. It requires intimate knowledge of which one is doing what, meaning that accessing self's knobs is the most common use case rather than an external entity reaching inside. This means that this should be a programmable interface accessible from each binary. cgroup is horrible for this. A process has to read its path from /proc/self/cgroups and then access the cgroup that it's in, which BTW could have changed in between. It really needs a proper programmable interface which guarantees self access. I don't know what the exact form should be. It can be an extension to sched_setattr(), a new syscall or a pseudo filesystem scoped to the process. > > I don't know. I'd even be happy with cgroup not having > > anything to do with RT slice distribution. Do you have any ideas > > which can make RT slice distribution more palatable? If we can't > > decouple the two, we'd be effectively requiring whoever is managing > > the cpu controller to also become a full-fledged RT slice arbitrator, > > which might actually work too. > > The admin you mean? He had better know what the heck he's doing if he's Resource management is automated in a lot of cases and it's only gonna be more so in the future. It's about having behaviors which are more palatable to that but please read on. > running RT apps, great fail is otherwise fairly deterministic in his > future. > > The thing is, you cannot arbitrate this stuff, RR/FIFO are horrible pieces > of shit interfaces, they don't describe near enough. People need to be > involved. 
So, I think it'd be best if RT/deadline stuff can be separated out so that grouping the usual BE scheduling doesn't affect them, but if that's not feasible, yeah, I agree the only thing which we can do is requiring the entity which is controlling the cpu hierarchy, which may be a human admin or whatever manager, to distribute them explicitly. There doesn't seem to be any way around it. > > Can't a task just lose RT / deadline properties when migrating into a > > different RT / deadline domain? We already modify task properties on > > migration for cpuset after all. It'd be far simpler that way. > > Again, why move it in the first place? This all sounds like whomever is > doing this is clueless. You don't move RT tasks about if you're not > intimately aware of them and their requirements. Oh, seriously, if I could build this thing from ground up, I'd just tie it to process hierarchy and make the associations static. It's just that we can't do that at this point and I'm trying to find a behaviorally simple and acceptable way to deal with task migrations so that neither kernel or userland has to be too complex. So, behaviors which blow configs across migrations and consider them as "fresh" is completely fine by me. I mostly wanna avoid requiring complicated failure handling from the users which most likely won't be tested a lot and crap out when something exceptional happens. If it blows RT/deadline settings reliably on each and every migration and refuse RT priorities or cpu controller configs which can lead the invalid configs, it'd be perfect. This whole thing is really about having consistent behavior patterns which avoid obscure failure modes whenever possible. Unified hierarchy does build on top of those but we do want these consistencies regardless of that. Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 22:22 ` Tejun Heo @ 2014-10-30 22:47 ` Peter Zijlstra 2014-11-06 16:27 ` Matt Fleming 2014-10-31 13:07 ` Peter Zijlstra 1 sibling, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-30 22:47 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt Let me reply to just this one, I'll do the rest tomorrow, need sleeps. On Thu, Oct 30, 2014 at 06:22:36PM -0400, Tejun Heo wrote: > > > This controller might not even require the distinction between > > > configured and effective tho? Can't a new child just inherit the > > > parent's configuration and never allow the config to become completely > > > empty? > > > > It can do that. But that still has a problem, there is a mapping in > > hardware which restricts the number of active configurations. The total > > configuration space is larger than the supported active configurations. > > > > So _something_ must fail. The initial proposal was mkdir failing when > > there were more than the hardware supported active config cgroup > > directories. The alternative was on-demand activation where we only > > allocate the hardware resource when the first task gets moved into the > > group -- which then clearly can fail. > > Hmmm... why can't it just refuse creating a different configuration > when its config space is full? Make children inherit the parent's > configuration and refuse config writes which require it to create a > new one if the config space is full. Seems pretty straight-forward. > What am I missing? We could do that I suppose; there is one corner case it would not allow: intermediate directories with a restricted config that also have priv restrictions but no actual tasks. Not sure that makes sense though. Are there any other cases I might have missed? ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 22:47 ` Peter Zijlstra @ 2014-11-06 16:27 ` Matt Fleming 2014-11-06 17:20 ` Vikas Shivappa 0 siblings, 1 reply; 39+ messages in thread From: Matt Fleming @ 2014-11-06 16:27 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, Vikas Shivappa, Auld, Will, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, 30 Oct, at 11:47:40PM, Peter Zijlstra wrote: > > Let me reply to just this one, I'll do the rest tomorrow, need sleeps. > > On Thu, Oct 30, 2014 at 06:22:36PM -0400, Tejun Heo wrote: > > > > > This controller might not even require the distinction between > > > > configured and effective tho? Can't a new child just inherit the > > > > parent's configuration and never allow the config to become completely > > > > empty? > > > > > > It can do that. But that still has a problem, there is a mapping in > > > hardware which restricts the number of active configurations. The total > > > configuration space is larger than the supported active configurations. > > > > > > So _something_ must fail. The initial proposal was mkdir failing when > > > there were more than the hardware supported active config cgroup > > > directories. The alternative was on-demand activation where we only > > > allocate the hardware resource when the first task gets moved into the > > > group -- which then clearly can fail. > > > > Hmmm... why can't it just refuse creating a different configuration > > when its config space is full? Make children inherit the parent's > > configuration and refuse config writes which require it to create a > > new one if the config space is full. Seems pretty straight-forward. > > What am I missing? > > We could do that I suppose, there is the one corner case that would not > allow, intermediate directories with a restricted config that also have > priv restrictions but no actual tasks. Not sure that makes sense though. Could you elaborate on this configuration? > Are there any other cases I might have missed? I don't think so. So, for the specific CAT case what you're proposing is make the failure case happen when writing to the cache bitmask file instead of failing mkdir() or echo $tid > tasks ? I think that's OK. If we've run out of CLOS ids I would expect to see -ENOSPC returned, whereas if we try and set an invalid bitmask we'd get -EINVAL. Vikas, Will? -- Matt Fleming, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 39+ messages in thread
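As a rough illustration of the -EINVAL half of that, a bitmask check along the lines of the design document could look like the sketch below. This is illustrative userspace C rather than the actual patch; cbm_max is assumed to be 16 as in the earlier example, and the contiguity rule comes from the design text. The -ENOSPC half, running out of CLOS ids, is sketched after Vikas's reply below.

#include <errno.h>
#include <stdio.h>

#define CBM_MAX 16   /* assumed value of the root cgroup's cbm_max file */

/* Validate a proposed cache bitmask: non-empty, within cbm_max bits,
 * and a single contiguous run of set bits. */
static int cbm_validate(unsigned long cbm)
{
	unsigned long filled;

	if (cbm == 0 || cbm >> CBM_MAX)
		return -EINVAL;
	/* Fill in the zeros below the lowest set bit; a contiguous mask then
	 * becomes 2^k - 1, for which x & (x + 1) == 0. */
	filled = cbm | (cbm - 1);
	if ((filled & (filled + 1)) != 0)
		return -EINVAL;
	return 0;
}

int main(void)
{
	printf("0x00f0 -> %d\n", cbm_validate(0x00f0));   /* contiguous: 0 */
	printf("0x0505 -> %d\n", cbm_validate(0x0505));   /* gaps: -EINVAL */
	printf("0x30000 -> %d\n", cbm_validate(0x30000)); /* beyond cbm_max: -EINVAL */
	return 0;
}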
* Re: Cache Allocation Technology Design 2014-11-06 16:27 ` Matt Fleming @ 2014-11-06 17:20 ` Vikas Shivappa 0 siblings, 0 replies; 39+ messages in thread From: Vikas Shivappa @ 2014-11-06 17:20 UTC (permalink / raw) To: Matt Fleming Cc: Peter Zijlstra, Tejun Heo, Vikas Shivappa, Auld, Will, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, 6 Nov 2014, Matt Fleming wrote: > On Thu, 30 Oct, at 11:47:40PM, Peter Zijlstra wrote: >> >> Let me reply to just this one, I'll do the rest tomorrow, need sleeps. >> >> On Thu, Oct 30, 2014 at 06:22:36PM -0400, Tejun Heo wrote: >> >>>>> This controller might not even require the distinction between >>>>> configured and effective tho? Can't a new child just inherit the >>>>> parent's configuration and never allow the config to become completely >>>>> empty? >>>> >>>> It can do that. But that still has a problem, there is a mapping in >>>> hardware which restricts the number of active configurations. The total >>>> configuration space is larger than the supported active configurations. >>>> >>>> So _something_ must fail. The initial proposal was mkdir failing when >>>> there were more than the hardware supported active config cgroup >>>> directories. The alternative was on-demand activation where we only >>>> allocate the hardware resource when the first task gets moved into the >>>> group -- which then clearly can fail. >>> >>> Hmmm... why can't it just refuse creating a different configuration >>> when its config space is full? Make children inherit the parent's >>> configuration and refuse config writes which require it to create a >>> new one if the config space is full. Seems pretty straight-forward. >>> What am I missing? >> >> We could do that I suppose, there is the one corner case that would not >> allow, intermediate directories with a restricted config that also have >> priv restrictions but no actual tasks. Not sure that makes sense though. > > Could you elaborate on this configuration? > >> Are there any other cases I might have missed? > > I don't think so. > > So, for the specific CAT case what you're proposing is make the failure > case happen when writing to the cache bitmask file instead of failing > mkdir() or echo $tid > tasks ? > > I think that's OK. If we've run out of CLOS ids I would expect to see > -ENOSPC returned, whereas if we try and set an invalid bitmask we'd get > -EINVAL. > > Vikas, Will? Yes that is correct. You can always create more cgroups and the new cgroup just inherits the mask from the parent and uses the same CLOSid as its parent , so it wont fail because of lack of CLOSids. The only case of failure as you said is when user tries to modify a cbm to a different one. > > -- > Matt Fleming, Intel Open Source Technology Center > ^ permalink raw reply [flat|nested] 39+ messages in thread
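Vikas's answer condenses into a small model: mkdir shares the parent's CLOSid and therefore cannot run out of ids, and only a 'cbm' write that needs a fresh id can fail with -ENOSPC. The refcounting scheme and the four-id limit below are assumptions made up for the illustration, not the actual implementation:

#include <errno.h>
#include <stdio.h>

#define NR_CLOS 4   /* assumed number of CLOS ids provided by the hardware */

static int clos_refcnt[NR_CLOS];          /* cgroups sharing each CLOS id */
static unsigned long clos_cbm[NR_CLOS];   /* CBM programmed for each id */

struct cat_cgroup { int closid; };

/* mkdir never runs out of CLOS ids: the child simply shares its parent's. */
static void cat_create(struct cat_cgroup *child, const struct cat_cgroup *parent)
{
	child->closid = parent->closid;
	clos_refcnt[child->closid]++;
}

/* Writing 'cbm' is the only operation that may need a fresh CLOS id. */
static int cat_set_cbm(struct cat_cgroup *cg, unsigned long cbm)
{
	int i;

	if (clos_refcnt[cg->closid] == 1) {   /* sole user: reprogram in place */
		clos_cbm[cg->closid] = cbm;
		return 0;
	}
	for (i = 0; i < NR_CLOS; i++) {
		if (clos_refcnt[i] == 0) {    /* grab a free CLOS id */
			clos_refcnt[cg->closid]--;
			cg->closid = i;
			clos_refcnt[i] = 1;
			clos_cbm[i] = cbm;
			return 0;
		}
	}
	return -ENOSPC;                       /* all CLOS ids are in use */
}

int main(void)
{
	struct cat_cgroup root = { 0 }, a, b, c, d;

	clos_refcnt[0] = 1;
	clos_cbm[0] = 0xffff;                 /* root: all of the cache */
	cat_create(&a, &root);
	cat_create(&b, &root);
	cat_create(&c, &root);
	cat_create(&d, &root);                /* none of these can fail */

	printf("a: %d\n", cat_set_cbm(&a, 0x00ff));  /* 0 */
	printf("b: %d\n", cat_set_cbm(&b, 0x0f00));  /* 0 */
	printf("c: %d\n", cat_set_cbm(&c, 0x000f));  /* 0 */
	printf("d: %d\n", cat_set_cbm(&d, 0xf000));  /* -ENOSPC */
	return 0;
}

A real implementation would presumably also reuse an existing CLOS whose mask already matches, but the placement of the failure at the 'cbm' write is the point here.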
* Re: Cache Allocation Technology Design 2014-10-30 22:22 ` Tejun Heo 2014-10-30 22:47 ` Peter Zijlstra @ 2014-10-31 13:07 ` Peter Zijlstra 2014-10-31 15:58 ` Tejun Heo 1 sibling, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-10-31 13:07 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, Oct 30, 2014 at 06:22:36PM -0400, Tejun Heo wrote: > Hello, > > On Thu, Oct 30, 2014 at 10:43:53PM +0100, Peter Zijlstra wrote: > > If a cpu bounces (by accident or whatever) then there is no trace left > > behind that the system didn't in fact observe/obey its constraints. It > > should have provided an error or failed the hotplug. But we digress, > > lets not have this discussion (again :) and focus on the new thing. > > Oh, we sure can have notifications / persistent markers to track > deviation from the configuration. It's not like the old scheme did > much better in this respect. It just wrecked the configuration > without telling anyone. If this matters enough, we need error > recording / reporting no matter which way we choose. I'm not against > that at all. True; then again, hotplug isn't a magical thing, you do it yourself -- with the suspend case being special, I'll grant you that. > > > So, the inherent problem is always there no matter what we do and the > > > question is that of a policy to deal with it. One of the main issues > > > I see with failing cgroup-level operations for controller specific > > > reasons is lack of visibility. All you can get out of a failed > > > operation is a single error return and there's no good way to > > > communicate why something isn't working, well not even who's the > > > culprit. Having "effective" vs "configured" makes it explicit that > > > the kernel isn't capable of honoring all configurations and make the > > > details of the situation visible. > > > > Right, so that is a short coming of the co-mount idea. Your effective vs > > configured thing is misleading and surprising though. Operations might > > 'succeed' and still have failed, without any clear > > indication/notification of change. > > Hmmm... it gets more pronounced w/ co-mounting but it's also problem > with isolated hierarchies too. How is changing configuration > irreversibly without any notificaiton any less surprising? It's the > same end result. The only difference is that there's no way to go > back when the resource which went offline comes back. I really don't > think configuration being silently changed counts as a valid > notification mechanism to userland. I think we're talking past one another here. You said the problem was that failing migrate is that you've no clue which controller failed in the co-mount case. With isolated hierarchies you do know. But then you continue talk about cpuset and hotplug. Now the thing with that is, the only one doing hotplug is the admin (I know there's a few kernel side hotplug but they're BUGs and I even NAKed a few, which didn't stop them from being merged) -- the exception being suspend, suspend is special because 1) there's a guarantee the CPU will actually come back and 2) its unobservable, userspace never sees the CPUs go away and come back because its frozen. The only real way to hotplug is if you do it your damn self, and its also you who setup the cpuset, so its fully on you if shit happens. No real magic there. 
Except now people seem to want to wrap it into magic and hide it all from the admin, pretend it's not there and make it uncontrollable. Kernel side hotplug is broken for a myriad of reasons, but let's not diverge too far here. > > > Another part is inconsistencies across controllers. This sure is > > > worse when there are multiple controllers involved but inconsistent > > > behaviors across different hierarchies are annoying all the same with > > > single controller multiple hierarchies. Userland often manages some > > > of those hierarchies together and it can get horribly confusing. No > > > matter what, we need to settle on a single policy and having effective > > > configuration seems like the better one. > > > > I'm not entirely sure I follow. Without co-mounting its entirely obvious > > which one is failing. > > Sure, "which" is easier w/o co-mounting. "Why" can still be hard tho as > migration is an "apply all the configs" event. Typically controllers don't control too many configs at once and the specific return error could be a good hint there. > > Also, per the previous point, since you need a notification channel > > anyway, you might as well do the expected fail and report more details > > through that. > > How do you match the failure to the specific migration attempt tho? I > really can't think of a good and simple interface for that given the > interface that we have. For most controllers, it is fairly straight > forward to avoid controller specific migration failures. Sure, cpuset > is special but it has to be special one way or the other. You can include in the msg the pid that was just attempted, in the pid namespace of the observer; if the pid is not available in that namespace, discard the message since the observer could not possibly have done the deed. > It doesn't have much to do with co-mounting. > > The process itself often has to be involved in assigning different > properties to its threads. It requires intimate knowledge of which > one is doing what, meaning that accessing self's knobs is the most > common use case rather than an external entity reaching inside. This > means that this should be a programmable interface accessible from > each binary. cgroup is horrible for this. A process has to read its path > from /proc/self/cgroups and then access the cgroup that it's in, which > BTW could have changed in between. > > It really needs a proper programmable interface which guarantees self > access. I don't know what the exact form should be. It can be an > extension to sched_setattr(), a new syscall or a pseudo filesystem > scoped to the process. That's an entirely separate issue; and I don't see that solving the task vs process issue at all. > > The admin you mean? He had better know what the heck he's doing if he's > > Resource management is automated in a lot of cases and it's only gonna > be more so in the future. It's about having behaviors which are more > palatable to that but please read on. > > > running RT apps, great fail is otherwise fairly deterministic in his > > future. > > > > The thing is, you cannot arbitrate this stuff, RR/FIFO are horrible pieces > > of shit interfaces, they don't describe near enough. People need to be > > involved. 
> > So, I think it'd be best if RT/deadline stuff can be separated out so > that grouping the usual BE scheduling doesn't affect them, but if > that's not feasible, yeah, I agree the only thing which we can do is > requiring the entity which is controlling the cpu hierarchy, which may > be a human admin or whatever manager, to distribute them explicitly. > There doesn't seem to be any way around it. Automation is nice and all, but RT is about providing determinism and guarantees. Unless you morph into a full blown RT aware muddleware and have all your RT apps communicate their requirements (ie. rewrite them all) to it, this is a non-starter. Given that the RR/FIFO APIs are not communicating enough and we need to support them anyhow, human intervention it is. > > > Can't a task just lose RT / deadline properties when migrating into a > > > different RT / deadline domain? We already modify task properties on > > > migration for cpuset after all. It'd be far simpler that way. > > > > Again, why move it in the first place? This all sounds like whoever is > > doing this is clueless. You don't move RT tasks about if you're not > > intimately aware of them and their requirements. > > Oh, seriously, if I could build this thing from ground up, I'd just > tie it to process hierarchy and make the associations static. This thing being cgroups? I'm not sure static associations cater for the various use cases that people have. > It's > just that we can't do that at this point and I'm trying to find a > behaviorally simple and acceptable way to deal with task migrations so > that neither kernel or userland has to be too complex. Sure, simple and consistent is all good, but we should also not make it too simple and thereby exclude useful things. > So, behaviors > which blow configs across migrations and consider them as "fresh" is > completely fine by me. It's not by me; it's completely surprising and counter-intuitive. > I mostly wanna avoid requiring complicated > failure handling from the users which most likely won't be tested a > lot and crap out when something exceptional happens. Smells like you just want to pretend nothing bad happens when you do stupid. I prefer to fail early and fail hard over pretend happy and surprise behaviour any day. > This whole thing is really about having consistent behavior patterns > which avoid obscure failure modes whenever possible. Unified > hierarchy does build on top of those but we do want these > consistencies regardless of that. I'm all for consistency, but I abhor make-believe. And while I like the unified hierarchy thing conceptually, I'm by now fairly sure reality is about to ruin it. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-31 13:07 ` Peter Zijlstra @ 2014-10-31 15:58 ` Tejun Heo 2014-11-04 13:13 ` Peter Zijlstra 0 siblings, 1 reply; 39+ messages in thread From: Tejun Heo @ 2014-10-31 15:58 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt Hello, Peter. On Fri, Oct 31, 2014 at 02:07:38PM +0100, Peter Zijlstra wrote: > I think we're talking past one another here. You said the problem was > that failing migrate is that you've no clue which controller failed in > the co-mount case. With isolated hierarchies you do know. Yes, with co-mounting, the issue becomes worse but I think it's still not ideal even without co-mounting because the error reporting ends up conflating task organization operation and application of configurations. More on this later. > But then you continue talk about cpuset and hotplug. Now the thing with > that is, the only one doing hotplug is the admin (I know there's a few > kernel side hotplug but they're BUGs and I even NAKed a few, which > didn't stop them from being merged) -- the exception being suspend, > suspend is special because 1) there's a guarantee the CPU will actually > come back and 2) its unobservable, userspace never sees the CPUs go away > and come back because its frozen. > > The only real way to hotplug is if you do it your damn self, and its > also you who setup the cpuset, so its fully on you if shit happens. > > No real magic there. Except now people seem to want to wrap it into > magic and hide it all from the admin, pretend its not there and make it > uncontrollable. Hmmm... I think a difference is how we perceive userspace is composed and interacts with the various aspects of kernel. But even in the presence of a competent admin that you're suggesting, interactions of different aspects of a system are often compartmentalized. e.g. an admin configuring cpuset to accomodate a given set of persistent and important workload isn't too likely to expect a memory unit soft failure in several weeks and the need to hot-swap the memory module. It just isn't cost-effective enough to lump those two planes of planning into the same activity especially if the admin is hand-crafting the configuration. The issue that I see with the current method is that a much rare exception condition ends up messing up configurations which is on a different plane and that there's no recourse once that happens. If the said workload keeps forking, there's no easy way to recover the previous configuration. Both ways of handling the situation have components of surprise but as I wrote before that surprise is inherent and comes from the fact that the kernel can't afford tasks which aren't runnable. As a policy of handling the surprising situation, having explicit configured / effective settings seems like a better option to me because 1. it makes it explicit that the effective configuration may differ from the requested one 2. it makes handling exception cases easier. I think #1 is important because hard errors which rarely but do happen are very difficult to deal with properly because it's usually nearly invisible. > > Sure, "which" is easier w/o co-mounting. Why can still be hard tho as > > migration is an "apply all the configs" event. > > Typically controllers don;'t control too many configs at once and the > specific return error could be a good hint there. Usually, yeah. I still end up scratching my head with migration rejections w/ cpuset or blkcg tho. 
> > > Also, per the previous point, since you need a notification channel > > > anyway, you might as well do the expected fail and report more details > > > through that. > > > > How do you match the failure to the specific migration attempt tho? I > > really can't think of a good and simple interface for that given the > > interface that we have. For most controllers, it is fairly straight > > forward to avoid controller specific migration failures. Sure, cpuset > > is special but it has to be special one way or the other. > > You can include in the msg with the pid that was just attempted in the > pid namespace of the observer, if the pid is not available in that > namespace discard the message since the observer could not possibly have > done the deed. I don't know. Is that a good interface? If a human admin is echoing and dmesg'ing afterwards, it should work but scraping the log for an unstructured plain text error usually isn't a very good interface to build tools around. For example, for CAT and its limit on the numbers of possible configurations, it can technically be made to work by reporting errors on mkdir or task migration; however, it is *far* better and clearer to report, say, -ENOSPC when you're actually trying to change the configuration. The error is directly tied to the operation requested. That's just how it should be whenever possible. > > It really needs a proper programmable interface which guarantees self > > access. I don't know what the exact form should be. It can be an > > extension to sched_setattr(), a new syscall or a pseudo filesystem > > scoped to the process. > > That's an entirely separate issue; and I don't see that solving the task > vs process issue at all. Hmm... I don't see it that way tho. In-process configuration is primarily something to be done by the process while cgroup management is to be done by external adminy entity. They are on different planes. Individual binaries accessing their own cgroups doesn't make a lot of sense and is actually broken. Likewise, external management entity meddling with individual threads of a process is at best cumbersome. It can be allowed but that's often not how it's useful. I really don't see why cgroup would be involved with per-thread settings. > Automation is nice and all, but RT is about providing determinism and > guarantees. Unless you morph into a full blown RT aware muddleware and > have all your RT apps communicate their requirements to it (ie. rewrite > them all) to it, this is a non starter. > > Given that the RR/FIFO APIs are not communicating enough and we need to > support them anyhow, human intervention it is. Yeah, I fully agree with you there. The issue is not that RT/FIFO requires explicit actions from userland but that they're currently tied to BE scheduling. Conceptually, they don't have to be but they're in practice and that ends up requiring whoever, be that an admin or automated tool, is managing the BE grouping to also manage RT/FIFO slices, which isn't ideal but should be workable. I was mostly curious whether they can be separated with a reasonable amount of effort. That's a no, right? > > Oh, seriously, if I could build this thing from ground up, I'd just > > tie it to process hierarchy and make the associations static. > > This thing being cgroups? I'm not sure static associations cater for the > various use cases that people have. 
Sure, we have no chance of changing it at this point, but I'm pretty sure if we started by tying it to the process hierarchy, we and the userland would have been able to achieve about the same set of functionalities without all these migration business. > > It's > > just that we can't do that at this point and I'm trying to find a > > behaviorally simple and acceptable way to deal with task migrations so > > that neither kernel or userland has to be too complex. > > Sure simple and consistent is all good, but we should also not make it > too simple and thereby exclude useful things. What are we excluding tho? Previously, cgroup didn't have rules, policies or conventions. It just had this skeletal features to group tasks and every controller did its own thing diverging the way they treat hierarchies, errors, migrations, configurations, notifications and so on. It didn't put in the effort to actually identify the required functionalities or characterize what belongs where. Every controller was doing its own brownian motion in the design space. Most of the properties being identified and policies being set up are actually fundamental and inherent. e.g. Creating a subhierarchy and organizing the children in them is fundamentally a task sub-categorizing operation. Conceptually, doing so shouldn't be impeded by or affect the resource configured for the parent of that sub hierarchy and for most controllers this can be achieved in a straight-forward manner by making children not putting further restrictions on the resources from its parent on creation. This is a rule which should be inherent and this type of conventions ultimately lead to better designs and implementations. I think this is evident for the controller in question being discussed on this thread. Task organization - creating cgroups and moving tasks around tasks between them - is an inherently different operation from configuring each controller. They shouldn't be conflated. It doesn't make any sense to fail creation of a cgroup or failing task migration later because controller can't be configured certain way. They should be orthogonal as much as possible. If there's restriction on controller configuration, that should be enforced on controller configuration. > > So, behaviors > > which blow configs across migrations and consider them as "fresh" is > > completely fine by me. > > Its not by me, its completely surprising and counter intuitive. I don't get it. This is one of few cases where controller is distributing hard-walled resources and as you said userland intervention is a must in facilitating such distribution. Isn't this pretty well in line with what you've been saying? The admin is moving a RT / deadline task into a different scheduling domain and if such operation always requires setting scheduling policies again, what's surprising about it? It makes conceptual sense - the task is moving across two scheduling domains with different set of hard resources. It'd work well and reliably too in practice and userland only has one less vector of failure while achieving the same thing. > > I mostly wanna avoid requiring complicated > > failure handling from the users which most likely won't be tested a > > lot and crap out when something exceptional happens. > > Smells like you just want to pretend nothing bad happens when you do > stupid. I prefer to fail early and fail hard over pretend happy and > surprise behaviour any day. But where am I losing anything? 
I'm not saying everything is always better this way but if I look at the overall compromises, it seems like a clear win to me. > > This whole thing is really about having consistent behavior patterns > > which avoid obscure failure modes whenever possible. Unified > > hierarchy does build on top of those but we do want these > > consistencies regardless of that. > > I'm all for consistency, but I abhor make-believe. And while I like the > unified hierarchy thing conceptually, I'm by now fairly sure reality is > about to ruin it. Hmm... I get exactly the opposite feeling. A lot of fundamental properties are being identified and things mostly fall into place. Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
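As for what the configured-versus-effective split argued above amounts to in practice, here is a toy model; it is not the cpuset code, and the names plus the fall-back-to-the-nearest-ancestor rule are paraphrased from the discussion:

#include <stdio.h>

/* Toy model: one bitmask per set, configured by the admin, effective as
 * computed by the kernel. */
struct cs {
	struct cs *parent;
	unsigned long configured;   /* what the admin asked for */
	unsigned long effective;    /* what can currently be honored */
};

/* Recompute 'effective' when the set of online CPUs changes: intersect the
 * configured mask with what is online and, if that leaves nothing, fall back
 * to the nearest ancestor that still has usable CPUs. The configured value
 * is never touched. Parents must be updated before their children. */
static void update_effective(struct cs *c, unsigned long online)
{
	unsigned long eff = c->configured & online;
	struct cs *p = c->parent;

	while (eff == 0 && p) {
		eff = p->effective;
		p = p->parent;
	}
	c->effective = eff;
}

int main(void)
{
	struct cs root  = { NULL,  0xf, 0xf };
	struct cs child = { &root, 0x8, 0x8 };   /* pinned to CPU 3 */

	update_effective(&root, 0x7);            /* CPU 3 goes offline */
	update_effective(&child, 0x7);
	printf("configured=%#lx effective=%#lx\n", child.configured, child.effective);

	update_effective(&root, 0xf);            /* CPU 3 comes back */
	update_effective(&child, 0xf);
	printf("configured=%#lx effective=%#lx\n", child.configured, child.effective);
	return 0;
}

Because the configured mask is never rewritten, the old placement comes back by itself once the CPUs return, which is the reversibility Tejun is after; whether the kernel should instead refuse, or at least loudly report, the intermediate state is exactly what is being argued here.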
* Re: Cache Allocation Technology Design 2014-10-31 15:58 ` Tejun Heo @ 2014-11-04 13:13 ` Peter Zijlstra 2014-11-05 20:41 ` Tejun Heo 0 siblings, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-11-04 13:13 UTC (permalink / raw) To: Tejun Heo Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt, Thomas Gleixner On Fri, Oct 31, 2014 at 11:58:06AM -0400, Tejun Heo wrote: > > No real magic there. Except now people seem to want to wrap it into > > magic and hide it all from the admin, pretend it's not there and make it > > uncontrollable. > > Hmmm... I think a difference is how we perceive userspace is composed > and interacts with the various aspects of kernel. But even in the > presence of a competent admin that you're suggesting, interactions of > different aspects of a system are often compartmentalized. e.g. an > admin configuring cpuset to accomodate a given set of persistent and > important workload isn't too likely to expect a memory unit soft > failure in several weeks and the need to hot-swap the memory module. > It just isn't cost-effective enough to lump those two planes of > planning into the same activity especially if the admin is > hand-crafting the configuration. The issue that I see with the > current method is that a much rare exception condition ends up messing > up configurations which is on a different plane and that there's no > recourse once that happens. If the said workload keeps forking, > there's no easy way to recover the previous configuration. > > Both ways of handling the situation have components of surprise but as > I wrote before that surprise is inherent and comes from the fact that > the kernel can't afford tasks which aren't runnable. As a policy of > handling the surprising situation, having explicit configured / > effective settings seems like a better option to me because 1. it > makes it explicit that the effective configuration may differ from the > requested one 2. it makes handling exception cases easier. I think #1 > is important because hard errors which rarely but do happen are very > difficult to deal with properly because it's usually nearly invisible. So there are scenarios where you want to hard fail the machine if the constraints are not met. It's better to just give up than to pretend. This effective/requested split is policy, a hardcoded kernel policy. One that doesn't work for a number of cases. Fail and let userspace sort it out is a much safer option. Some people want hard guarantees; if you're not willing to cater to them with cgroups they'll go off and invent yet more muck :/ Do you want to shut down the saw, or pretend it's still controlled and lose your fingers because it missed a deadline? Even HPC might not want to pretend and continue, they might want to notify the jobs scheduler and get a different job split, rather than continue half-arsed. A persistent delay on the job completion barrier is way bad for them. > > Typically controllers don't control too many configs at once and the > > specific return error could be a good hint there. > > Usually, yeah. I still end up scratching my head with migration > rejections w/ cpuset or blkcg tho. This means you already need to deal with this, so how about we try and make that work instead of saying we cannot fail migration. 
> > You can include in the msg with the pid that was just attempted in the > > pid namespace of the observer, if the pid is not available in that > > namespace discard the message since the observer could not possibly have > > done the deed. > > I don't know. Is that a good interface? If a human admin is echoing > and dmesg'ing afterwards, it should work but scraping the log for an > unstructured plain text error usually isn't a very good interface to > build tools around. > > For example, for CAT and its limit on the numbers of possible > configurations, it can technically be made to work by reporting errors > on mkdir or task migration; however, it is *far* better and clearer to > report, say, -ENOSPC when you're actually trying to change the > configuration. The error is directly tied to the operation requested. > That's just how it should be whenever possible. I never suggested dmesg, I was thinking of a cgroup.notifier file that reports all 'events' for that cgroup. If you listen to it while performing your operation, you get the msgs: $ cat cgroup.notifier & echo $pid > tasks ; kill -INT $! Or something like that. Seeing how the entire cgroup thing is text based, this would end up spewing text like: $cgroup-path failed attach $pid: $reason Where everything is in the namespace of the observer; and if there is no namespace translation possible, drop the event, because you can't have seen or done anything anyhow. > > That's an entirely separate issue; and I don't see that solving the task > > vs process issue at all. > > Hmm... I don't see it that way tho. In-process configuration is > primarily something to be done by the process while cgroup management > is to be done by external adminy entity. They are on different > planes. Individual binaries accessing their own cgroups doesn't make > a lot of sense and is actually broken. Likewise, external management > entity meddling with individual threads of a process is at best > cumbersome. It can be allowed but that's often not how it's useful. > I really don't see why cgroup would be involved with per-thread > settings. Well, people are doing it now. And it 'works' if you assume nobody is going to do 'crazy' things behind your back, which is a fair assumption (most of the time). Its just that some people seem hell bend on doing crazy things behind your back in the name of progress or whatnot ;-) Take one would be making sure this background crap can be shot in the head. I'm not arguing against an atomic interface, I'm just saying its not required for useful things. > > Automation is nice and all, but RT is about providing determinism and > > guarantees. Unless you morph into a full blown RT aware muddleware and > > have all your RT apps communicate their requirements to it (ie. rewrite > > them all) to it, this is a non starter. > > > > Given that the RR/FIFO APIs are not communicating enough and we need to > > support them anyhow, human intervention it is. > > Yeah, I fully agree with you there. The issue is not that RT/FIFO > requires explicit actions from userland but that they're currently > tied to BE scheduling. Conceptually, they don't have to be but > they're in practice and that ends up requiring whoever, be that an > admin or automated tool, is managing the BE grouping to also manage > RT/FIFO slices, which isn't ideal but should be workable. I was > mostly curious whether they can be separated with a reasonable amount > of effort. That's a no, right? What's a BE? 
Separating them is technically possible (painful maybe), but doesn't make any kind of sense to me. > > > Oh, seriously, if I could build this thing from ground up, I'd just > > > tie it to process hierarchy and make the associations static. > > > > This thing being cgroups? I'm not sure static associations cater for the > > various use cases that people have. > > Sure, we have no chance of changing it at this point, but I'm pretty > sure if we started by tying it to the process hierarchy, we and the > userland would have been able to achieve about the same set of > functionalities without all these migration business. How would we do things like per-cgroup workqueues? We'd need to somehow spawn kthreads outside of the normal kthreadd hierarchy. (this btw is something we need to sort, but lets not have that discussion here -- this email is getting too big as is). > > Sure simple and consistent is all good, but we should also not make it > > too simple and thereby exclude useful things. > > What are we excluding tho? Hard guarantees it seems. > Previously, cgroup didn't have rules, > policies or conventions. It just had this skeletal features to group > tasks and every controller did its own thing diverging the way they > treat hierarchies, errors, migrations, configurations, notifications > and so on. It didn't put in the effort to actually identify the > required functionalities or characterize what belongs where. Every > controller was doing its own brownian motion in the design space. Sure, agreed, we need more sanity there. I do however think we need to put in the effort to map out all use cases. > Most of the properties being identified and policies being set up are > actually fundamental and inherent. e.g. Creating a subhierarchy and > organizing the children in them is fundamentally a task > sub-categorizing operation. > Conceptually, doing so shouldn't be > impeded by or affect the resource configured for the parent of that > sub hierarchy Uh what? No you want exactly that in a hierarchy. You want children to submit to the configuration of the parent. > and for most controllers this can be achieved in a > straight-forward manner by making children not putting further > restrictions on the resources from its parent on creation. The other way around, children can only put further restrictions on, they cannot relax restrictions from the parent. > I think this is evident for the controller in question being discussed > on this thread. Task organization - creating cgroups and moving tasks > around tasks between them - is an inherently different operation from > configuring each controller. They shouldn't be conflated. It doesn't > make any sense to fail creation of a cgroup or failing task migration > later because controller can't be configured certain way. They should > be orthogonal as much as possible. If there's restriction on > controller configuration, that should be enforced on controller > configuration. I'd mostly agree with that, but note how you put it in relative terms :-) I did give one (probably strained) example where putting the fail on the config side was more constrained than placing it at the migrate. > > > So, behaviors > > > which blow configs across migrations and consider them as "fresh" is > > > completely fine by me. > > > > Its not by me, its completely surprising and counter intuitive. > > I don't get it. 
This is one of few cases where controller is > distributing hard-walled resources and as you said userland > intervention is a must in facilitating such distribution. Isn't this > pretty well in line with what you've been saying? The admin is moving > a RT / deadline task into a different scheduling domain and if such > operation always requires setting scheduling policies again, what's > surprising about it? It would make cgroups useless. It would break running applications. You might as well not allow migration at all. But the very fact that migration would destroy configuration of an existing task would surprise me, I would -- like stated before -- much rather refuse the migration than destroy existing state. > It makes conceptual sense - the task is moving across two scheduling > domains with different set of hard resources. It'd work well and > reliably too in practice and userland only has one less vector of > failure while achieving the same thing. No its absolutely certified insane is what. It introduces a massive ton of fail. Tasks that were running fine and predictable are then all of a sudden a complete trainwreck. > > Smells like you just want to pretend nothing bad happens when you do > > stupid. I prefer to fail early and fail hard over pretend happy and > > surprise behaviour any day. > > But where am I losing anything? I'm not saying everything is always > better this way but if I look at the overall compromises, it seems > like a clear win to me. You allow the creation of fail and want to mop up the pieces afterwards -- if at all possible. I want to avoid the creation of fail. By allowing an effective config different from the requested -- be it either using less CPUs than specified, or a different scheduling policy or the forced use of remote memory, you could have lost your finger before you can fix up. Would it not be better to keep your finger? ^ permalink raw reply [flat|nested] 39+ messages in thread
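A consumer of the cgroup.notifier stream floated earlier in this exchange would indeed be small. The file and the message format are a proposal rather than an existing interface, so the following listener is a purely hypothetical sketch:

#include <stdio.h>

/* Hypothetical consumer of the proposed cgroup.notifier stream. Messages are
 * assumed to look like: "<cgroup-path> failed attach <pid>: <reason>". */
int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "cgroup.notifier";
	char cgpath[256], reason[256];
	int pid;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	/* Block on the stream and react to attach failures as they happen,
	 * instead of scraping an unstructured log after the fact. */
	while (fscanf(f, "%255s failed attach %d: %255[^\n]", cgpath, &pid, reason) == 3)
		fprintf(stderr, "attach of %d to %s rejected: %s\n", pid, cgpath, reason);

	fclose(f);
	return 0;
}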
* Re: Cache Allocation Technology Design 2014-11-04 13:13 ` Peter Zijlstra @ 2014-11-05 20:41 ` Tejun Heo 0 siblings, 0 replies; 39+ messages in thread From: Tejun Heo @ 2014-11-05 20:41 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt, Thomas Gleixner Hello, Peter. On Tue, Nov 04, 2014 at 02:13:50PM +0100, Peter Zijlstra wrote: > So there are scenarios where you want to hard fail the machine if the > constraints are not met. Its better to just give up than to pretend. > > This effective/requested split is policy, a hardcoded kernel policy. One > that doesn't work for a number of cases. Fail and let userspace sort it > out is a much safer option. cpuset simply never implemented hard failing. the old policy wasn't a hard fail. It did the same thing as applying the effective setting. The only difference is that the process was irreversible. The kind of hard fail you're talking about would be rejecting CPU down command if downing a CPU would create a non-executable cpuset, which would be a silly conflation of layers. > Some people want hard guarantees, if you're not willing to cater to them > with cgroups they'll go off and invent yet more muck :/ > > Do you want to shut down the saw, or pretend its still controlled and > loose your fingers because it missed a deadline? > > Even HPC might not want to pretend continue, they might want to notify > the jobs scheduler and get a different job split, rather than continue > half-arsed. A persistent delay on the job completion barrier is way bad > for them. Again, we never had hard failures for cpuset. The old behavior was *more* surprising than the new one in that it was all implicit and the actions taken were out of ordinary (no other controller action moves tasks to other cgroups) and irreversible. I agree with your point that things should be as little surprising as possible but the facts you're using aren't in support of that point. One thing which is debatable is whether to allow configuring cpumasks which make the effective set empty. I don't think we fail that now but failing that is completely fine and doesn't create discrepancies with having configured and effective settings. > > > Typically controllers don;'t control too many configs at once and the > > > specific return error could be a good hint there. > > > > Usually, yeah. I still end up scratching my head with migration > > rejections w/ cpuset or blkcg tho. > > This means you already need to deal with this, so how about we try and > make that work instead of saying we cannot fail migration. My point is that failing these types of things at configuration time is a lot better approach. Everything sure is a trade-off but the benefits here seem pretty clear to me. > I never suggested dmesg, I was thinking of a cgroup.notifier file that > reports all 'events' for that cgroup. > > If you listen to it while performing your operation, you get the msgs: > > $ cat cgroup.notifier & echo $pid > tasks ; kill -INT $! > > Or something like that. Seeing how the entire cgroup thing is text > based, this would end up spewing text like: > > $cgroup-path failed attach $pid: $reason > > Where everything is in the namespace of the observer; and if there is > no namespace translation possible, drop the event, because you can't > have seen or done anything anyhow. 
Technically, we can do that or any number of other complex schemes but isn't it obviously better if we can confine controller configuration failures to actual configuration attempts? Simple -errno failures would be enough. > > Yeah, I fully agree with you there. The issue is not that RT/FIFO > > requires explicit actions from userland but that they're currently > > tied to BE scheduling. Conceptually, they don't have to be but > > they're in practice and that ends up requiring whoever, be that an > > admin or automated tool, is managing the BE grouping to also manage > > RT/FIFO slices, which isn't ideal but should be workable. I was > > mostly curious whether they can be separated with a reasonable amount > > of effort. That's a no, right? > > What's a BE? Separating them is technically possible (painful maybe), > but doesn't make any kind of sense to me. Oops, best effort. I was using a term from io scheduling. Sorry about that. I meant fair_sched_class. At least conceptually, the hierarchies of different scheduling classes are orthogonal, so I was wondering whether separating them out would be possible. If that's not practically feasible, I don't think it's a big problem. Userland would just have to adapt to it. > > Sure, we have no chance of changing it at this point, but I'm pretty > > sure if we started by tying it to the process hierarchy, we and the > > userland would have been able to achieve about the same set of > > functionalities without all these migration business. > > How would we do things like per-cgroup workqueues? We'd need to somehow > spawn kthreads outside of the normal kthreadd hierarchy. We can either have proxy kthreadd's or just reparent tasks once they're created. We already reparent after all. > (this btw is something we need to sort, but lets not have that > discussion here -- this email is getting too big as is). I don't think discussing this is meaningful. This train left a long time ago and I don't see any realistic chance of backtracking to this route. > Sure, agreed, we need more sanity there. I do however think we need to > put in the effort to map out all use cases. I've been doing that for over a year now. I haven't mapped out *all* use cases but I do have pretty clear ideas on what matters in achieving the core functionalities. > > Conceptually, doing so shouldn't be > > impeded by or affect the resource configured for the parent of that > > sub hierarchy > > Uh what? No you want exactly that in a hierarchy. You want children to > submit to the configuration of the parent. You misunderstood. Yes, children should submit to the configuration of the parent but the act of merely creating a new child or moving tasks there shouldn't deviate the configuration from what the parent has. Using CAT as an example, creating a child shouldn't create a new configuration. It should in effect have the same configuration as its parent. As such, moving tasks in there shouldn't fail as long as tasks can be moved to the parent, which is a property we want to maintain. This is really fundamental: the operation of sub-categorization shouldn't affect controller configuration. They should and can remain orthogonal. > > and for most controllers this can be achieved in a > > straight-forward manner by making children not putting further > > restrictions on the resources from its parent on creation. > > The other way around, children can only put further restrictions on, > they cannot relax restrictions from the parent. I meant on creation. 
Putting further restrictions is the only thing a child can do, but on creation it should have the same effective configuration as its parent.

> > I think this is evident for the controller in question being discussed on this thread. Task organization - creating cgroups and moving tasks between them - is an inherently different operation from configuring each controller. They shouldn't be conflated. It doesn't make any sense to fail creation of a cgroup or to fail task migration later because a controller can't be configured a certain way. They should be orthogonal as much as possible. If there's a restriction on controller configuration, that should be enforced on controller configuration.
>
> I'd mostly agree with that, but note how you put it in relative terms :-)

But everything is relative. The moment we lose sight of that, we lose the ability to make sensible and healthy trade-offs. I could have written the above in absolutes but I actively avoid that whenever possible.

> I did give one (probably strained) example where putting the fail on the config side was more constrained than placing it at the migrate.

If you're referring to cpuset, it wasn't a good example.

> > I don't get it. This is one of the few cases where a controller is distributing hard-walled resources and, as you said, userland intervention is a must in facilitating such distribution. Isn't this pretty well in line with what you've been saying? The admin is moving a RT / deadline task into a different scheduling domain and if such an operation always requires setting scheduling policies again, what's surprising about it?
>
> It would make cgroups useless. It would break running applications. You might as well not allow migration at all.

Task migrations will be a low-priority managerial operation. It's mostly used to set up the initial hierarchy. Tasks should be put in a logical structure on startup and resource control changes should happen through specific controller enable/disable and configuration changes. This is inherent in the unified hierarchy design and the reason why controllers are individually enabled and disabled at each level. Task categorization is an orthogonal operation to resource restriction. Tasks are logically organized and resource controls are dynamically configured over the logical structure. So, yes, the role of migration is diminished in the unified hierarchy and that's by design. We can't go full static process hierarchy at this point but this way we can get reasonably close while accommodating a gradual transition.

> But the very fact that migration would destroy the configuration of an existing task would surprise me, I would -- like stated before -- much rather refuse the migration than destroy existing state.

I suppose this depends on the perspective, but if the RT config is reliably reset on migration, I don't see why it'd be surprising. It's a well-defined behavior which happens without exception, and we already have a precedent for changing per-task settings according to a task's cgroup membership - cpuset overrides the cpu and node masks on migration.

> By allowing an effective config different from the requested -- be it using fewer CPUs than specified, a different scheduling policy or the forced use of remote memory, you could have lost your finger before you can fix up.

I don't get why you're lumping the cpuset and cpu situations together. They're different and cpu doesn't deal with any "effective" settings.
Thanks. -- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
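[Editorial note: to make the configuration-time-failure model described above concrete, here is a minimal shell sketch. The 'cat.cbm' file name, layout and exact errno are illustrative assumptions, not an existing interface; the point is only where a failure would surface.]

cd /sys/fs/cgroup/cat           # assumed CAT-style controller hierarchy

mkdir low_prio                  # never fails for lack of CLOS IDs
cat low_prio/cat.cbm            # child starts with the parent's mask, e.g. 0xffff

echo $$ > low_prio/tasks        # migration never fails for CAT reasons either

# Only an explicit configuration change can fail, and it fails with a plain
# errno at write time once the hardware runs out of CLOS IDs:
echo 0x00ff > low_prio/cat.cbm  # may return e.g. -ENOSPC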
* Re: Cache Allocation Technology Design 2014-10-30 12:43 ` Tejun Heo 2014-10-30 13:18 ` Peter Zijlstra @ 2014-10-30 14:14 ` Matt Fleming [not found] ` <CAAAKZwvJOKsrj_yczDGaNLaNYo+_=HzsTLwDdcaTJqO2VMy8uA@mail.gmail.com> 2014-10-30 23:18 ` Vikas Shivappa 3 siblings, 0 replies; 39+ messages in thread From: Matt Fleming @ 2014-10-30 14:14 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Vikas Shivappa, Auld, Will, Vikas Shivappa, linux-kernel, Fleming, Matt On Thu, 30 Oct, at 08:43:33AM, Tejun Heo wrote: > Hello, Peter. > > On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote: > > If this means echo $tid > tasks, then sorry we can't do. There is a > > limited number of hardware resources backing this thing. At some point > > they're consumed and something must give. > > And that something shouldn't be disallowing task migration across > cgroups. This simply doesn't work with co-mounting or unified > hierarchy. cpuset automatically takes on the nearest ancestor's > configuration which has enough execution resources. Maybe that can be > an option for this too? Oh, you can always add more tasks to a cgroup, or move tasks between cgroups. What you can't always do is create more cgroups. -- Matt Fleming, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design [not found] ` <CAAAKZwvJOKsrj_yczDGaNLaNYo+_=HzsTLwDdcaTJqO2VMy8uA@mail.gmail.com> @ 2014-10-30 17:12 ` Tejun Heo 2014-10-30 22:35 ` Tim Hockin 0 siblings, 1 reply; 39+ messages in thread From: Tejun Heo @ 2014-10-30 17:12 UTC (permalink / raw) To: Tim Hockin Cc: linux-kernel, Auld, Will, Matt Fleming, Vikas Shivappa, Peter Zijlstra, Fleming, Matt, Vikas Shivappa

On Thu, Oct 30, 2014 at 07:58:34AM -0700, Tim Hockin wrote:
> Another reason unified hierarchy is a bad model.

Things wrong with this message:

1. Top posted. It isn't clear which part you're referring to, and this was pointed out to you multiple times in the past.

2. No real thoughts or technical details. Maybe you had some in your head but nothing was elaborated. This forces me to guess what you had in mind when you produced the above sentence and, of course, me not being you, this takes a considerable amount of brain cycles and I'd still end up with multiple alternative scenarios that I'll have to cover.

3. Needlessly loaded expression, which forces me to respond.

Combined, this is just rude and you've been showing this type of behavior multiple times. Behave yourself.

-- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 17:12 ` Tejun Heo @ 2014-10-30 22:35 ` Tim Hockin 2014-10-31 16:57 ` Tejun Heo 0 siblings, 1 reply; 39+ messages in thread From: Tim Hockin @ 2014-10-30 22:35 UTC (permalink / raw) To: Tejun Heo Cc: linux-kernel, Auld, Will, Matt Fleming, Vikas Shivappa, Peter Zijlstra, Fleming, Matt, Vikas Shivappa

On Thu, Oct 30, 2014 at 10:12 AM, Tejun Heo <tj@kernel.org> wrote:
> On Thu, Oct 30, 2014 at 07:58:34AM -0700, Tim Hockin wrote:
>> Another reason unified hierarchy is a bad model.
>
> Things wrong with this message.
>
> 1. Top posted. It isn't clear which part you're referring to and this was pointed out to you multiple times in the past.

I occasionally fall victim to gmail's defaults. I apologize for that.

> 2. No real thoughts or technical details. Maybe you had some in your head but nothing was elaborated. This forces me to guess what you had in mind when you produced the above sentence and, of course, me not being you, this takes a considerable amount of brain cycles and I'd still end up with multiple alternative scenarios that I'll have to cover.

I think the conversation is well enough understood by the people for whom this bit of snark was intended that reading my mind was not that hard. That said, it was overly snark-tastic, and sent in haste.

My point, of course, was that here is an example of something which maps very well to the idea of cgroups (a set of processes that share some controller) but DOES NOT map well to the unified hierarchy model. It must be managed more carefully than an arbitrary hierarchy can enforce. The result is the mish-mash of workarounds proposed in this thread to force it into arbitrary-hierarchy mode, including this no-win situation of running out of hardware resources - it is going to fail. Will it fail at cgroup creation time (doesn't scale to arbitrary hierarchy), will it fail when you add processes to it (awkward at best), or will it fail when you flip some control file to enable the feature?

I know the unified hierarchy ship has sailed, so there's no non-snarky way to argue the point any further, but this is such an obvious case, to me, that I had to say something.

Tim ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 22:35 ` Tim Hockin @ 2014-10-31 16:57 ` Tejun Heo 0 siblings, 0 replies; 39+ messages in thread From: Tejun Heo @ 2014-10-31 16:57 UTC (permalink / raw) To: Tim Hockin Cc: linux-kernel, Auld, Will, Matt Fleming, Vikas Shivappa, Peter Zijlstra, Fleming, Matt, Vikas Shivappa

Hello, Tim.

On Thu, Oct 30, 2014 at 03:35:44PM -0700, Tim Hockin wrote:
> I think the conversation is well enough understood by the people for whom this bit of snark was intended that reading my mind was not that

I really don't think it is. cgroups in general isn't that well understood, and while some may be familiar with what they've been working on, most aren't too well acquainted with what changes are being made and why. I surely am responsible for not communicating this better, but it took me quite a while and I'm still in the process of crystallizing those ideas myself.

> hard. That said, it was overly snark-tastic, and sent in haste.

The problem with this type of snarky one-liner is that it undermines the fundamentals of technical discussion on the mailing list. It requires too much effort from the other party for speculation, and if the other party doesn't respond, the snarky comment succeeds at establishing the vague negativity that it carried. If you have a technical opinion, form and communicate it properly so that it can be analyzed and discussed properly. I think my wording in my previous messages was too strong and apologize for that, but please don't do this.

> My point, of course, was that here is an example of something which maps very well to the idea of cgroups (a set of processes that share some controller) but DOES NOT map well to the unified hierarchy model.

I'm pretty sure that conclusion is premature. As I wrote in my reply to Peter, I strongly believe that a set of reasonable constraints and conventions leads to a much better and more functional design, interface and implementation. It sure can feel like an annoyance if one is used to doing whatever and now has to follow these new constraints, but we were paying heavily elsewhere for the lack of consistency and, in general, sense.

I could have communicated it more clearly, but the fundamental issue that I see with the original proposal is that it conflates task organization and controller configuration. They belong to different planes of control and should be orthogonal as much as possible. This shows up evidently, for example, in how errors are reported. A write to a knob of the involved controller failing with the proper error code is a far superior way compared to failing mkdir or task migration. The only reason we even think that doing anything else is fine is because we've never thought about what the right thing to do is all along and just did whatever was convenient in terms of immediate implementation for each individual case.

> It must be managed more carefully than arbitrary hierarchy can enforce. The result is the mish-mash of workarounds proposed in this thread to force it into arbitrary-hierarchy mode, including this no-win situation of running out of hardware resources - it is going to fail. Will it fail at cgroup creation time (doesn't scale to arbitrary hierarchy) or will it fail when you add processes to it (awkward at best) or will it fail when you flip some control file to enable the feature?

Please see above. It's more a matter of finding the *right* place to put operations and their failures.
Task migration sure can fail due to memory pressure or basic cgroup organizational constraints; however, it's outright wrong to fail it because a given controller can support only a limited number of configurations. Again, being able to do whatever one wants often doesn't lead to a good design.

> I know the unified hierarchy ship has sailed, so there's no non-snarky way to argue the point any further, but this is such an obvious case, to me, that I had to say something.

If you properly compose your ideas and concerns, I can think about and discuss them and make adjustments where appropriate, and it seems to me that your impression, at least in this instance, isn't very well warranted. The snarky comment can achieve none of the productive things which can come from proper discussion. All it can do is aggravate the tone of the discussion, so, again, please refrain from it in the future.

Thanks.

-- tejun ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-30 12:43 ` Tejun Heo ` (2 preceding siblings ...) [not found] ` <CAAAKZwvJOKsrj_yczDGaNLaNYo+_=HzsTLwDdcaTJqO2VMy8uA@mail.gmail.com> @ 2014-10-30 23:18 ` Vikas Shivappa 2014-11-04 13:17 ` Peter Zijlstra 3 siblings, 1 reply; 39+ messages in thread From: Vikas Shivappa @ 2014-10-30 23:18 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Vikas Shivappa, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt, h.peter.anvin

On Thu, 30 Oct 2014, Tejun Heo wrote:
> Hello, Peter.
>
> On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote:
>> If this means echo $tid > tasks, then sorry we can't do. There is a limited number of hardware resources backing this thing. At some point they're consumed and something must give.
>
> And that something shouldn't be disallowing task migration across cgroups. This simply doesn't work with co-mounting or unified hierarchy. cpuset automatically takes on the nearest ancestor's configuration which has enough execution resources. Maybe that can be an option for this too?

One way to do it is to merge the CAT cgroups into the cpuset. In essence there is no separate CAT cgroup and we just have a new file 'cbm' in the cpuset. This would be visible only when the system has Cache Allocation support, and the user can manipulate the cache bit mask there. The user can use the already existing cpu_exclusive file in the cpuset to mark the cgroups that should use exclusive CPUs. That way we simplify and reuse the cpuset code/hierarchy?

Thanks, Vikas

> One of the problems is that we generally assume that a task can run at some point in time in a lot of places in the kernel and can't just not run a task indefinitely because it's in a cgroup configured a certain way.
>
>> So either we fail mkdir, but that means allocating CLOS IDs for possibly empty cgroups, or we allocate on demand which means failing task assignment.
>
> Can't fail mkdir or css enabling either. Again, co-mounting and unified hierarchy. Also, the behavior is just horrible to use from userland.
>
>> The same -- albeit for a different reason -- is true of the RT sched groups, we simply cannot instantiate them such that tasks can join, sysads _have_ to configure them before we can add tasks to them.
>
> Yeah, RT is one of the main items which is problematic, more so because it's currently coupled with the normal sched controller and the default config doesn't have any RT slice. Do we completely block RT tasks w/o a slice? Is that okay?
>
> Thanks.
>
> --
> tejun
>

^ permalink raw reply [flat|nested] 39+ messages in thread
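[Editorial note: a rough shell sketch of what the interface proposed above could look like from userspace. The 'cpuset.cbm' file is only the proposal, not something that exists; a fuller example appears later in the thread.]

cd /sys/fs/cgroup/cpuset
mkdir hipri
/bin/echo 2-3 > hipri/cpuset.cpus
/bin/echo 0   > hipri/cpuset.mems
/bin/echo 1   > hipri/cpuset.cpu_exclusive   # reuse the existing exclusivity knob
/bin/echo 0xff00 > hipri/cpuset.cbm          # proposed new file: upper half of the cache
/bin/echo PID1 > hipri/tasks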
* Re: Cache Allocation Technology Design 2014-10-30 23:18 ` Vikas Shivappa @ 2014-11-04 13:17 ` Peter Zijlstra 2014-11-06 17:03 ` Matt Fleming 0 siblings, 1 reply; 39+ messages in thread From: Peter Zijlstra @ 2014-11-04 13:17 UTC (permalink / raw) To: Vikas Shivappa Cc: Tejun Heo, Auld, Will, Matt Fleming, Vikas Shivappa, linux-kernel, Fleming, Matt, h.peter.anvin

On Thu, Oct 30, 2014 at 04:18:33PM -0700, Vikas Shivappa wrote:
> One way to do it is to merge the CAT cgroups into the cpuset. In essence there is no separate CAT cgroup and we just have a new file 'cbm' in the cpuset. This would be visible only when the system has Cache Allocation support, and the user can manipulate the cache bit mask there.
> The user can use the already existing cpu_exclusive file in the cpuset to mark the cgroups that should use exclusive CPUs.
> That way we simplify and reuse the cpuset code/hierarchy?

I don't like extending cpusets further. It's already weird and too big a controller.

What is wrong with having a specific CQM controller and using it together with cpusets where desired?

^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-11-04 13:17 ` Peter Zijlstra @ 2014-11-06 17:03 ` Matt Fleming 2014-11-10 15:50 ` Peter Zijlstra 0 siblings, 1 reply; 39+ messages in thread From: Matt Fleming @ 2014-11-06 17:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Vikas Shivappa, Tejun Heo, Auld, Will, Vikas Shivappa, linux-kernel, Fleming, Matt, h.peter.anvin On Tue, 04 Nov, at 02:17:14PM, Peter Zijlstra wrote: > > I don't like extending cpusets further. Its already a weird and too big > controller. > > What is wrong with having a specific CQM controller and using it > together with cpusets where desired? The specific problem that conflating cpusets and the CAT controller is trying to solve is that on some platforms the CLOS ID doesn't move with data that travels up the cache hierarchy, i.e. we lose the CLOS ID when data moves from LLC to L2. I think the idea with pinning CLOS IDs to a specific cpu and any tasks that are using that ID is that it works around this problem out of the box, rather than requiring sysadmins to configure things. -- Matt Fleming, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-11-06 17:03 ` Matt Fleming @ 2014-11-10 15:50 ` Peter Zijlstra 0 siblings, 0 replies; 39+ messages in thread From: Peter Zijlstra @ 2014-11-10 15:50 UTC (permalink / raw) To: Matt Fleming Cc: Vikas Shivappa, Tejun Heo, Auld, Will, Vikas Shivappa, linux-kernel, Fleming, Matt, h.peter.anvin On Thu, Nov 06, 2014 at 05:03:23PM +0000, Matt Fleming wrote: > On Tue, 04 Nov, at 02:17:14PM, Peter Zijlstra wrote: > > > > I don't like extending cpusets further. Its already a weird and too big > > controller. > > > > What is wrong with having a specific CQM controller and using it > > together with cpusets where desired? > > The specific problem that conflating cpusets and the CAT controller is > trying to solve is that on some platforms the CLOS ID doesn't move with > data that travels up the cache hierarchy, i.e. we lose the CLOS ID when > data moves from LLC to L2. > > I think the idea with pinning CLOS IDs to a specific cpu and any tasks > that are using that ID is that it works around this problem out of the > box, rather than requiring sysadmins to configure things. So either the user needs to set that mode _and_ set cpu masks, or the user needs to use cpusets and set masks, same difference to me. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Cache Allocation Technology Design 2014-10-24 10:53 ` Peter Zijlstra 2014-10-28 23:22 ` Matt Fleming @ 2014-10-29 17:26 ` Vikas Shivappa 2014-10-29 18:16 ` Peter Zijlstra 1 sibling, 1 reply; 39+ messages in thread From: Vikas Shivappa @ 2014-10-29 17:26 UTC (permalink / raw) To: Peter Zijlstra Cc: Matt Fleming, vikas, linux-kernel, matt.fleming, will.auld, tj, vikas.shivappa On Fri, 24 Oct 2014, Peter Zijlstra wrote: > On Mon, Oct 20, 2014 at 05:18:55PM +0100, Matt Fleming wrote: >>> What is Cache Allocation Technology ( CAT ) >>> ------------------------------------------- > > Its a horrible name is what it is, please consider using the old name, > that at least was clear in purpose. > >>> Kernel implementation Overview >>> ------------------------------- >>> >>> Kernel implements a cgroup subsystem to support Cache Allocation. >>> >>> Creating a CAT cgroup would create a new CLOS <-> CBM mapping. Each >>> cgroup would have one CBM and would just represent one cache 'subset'. >>> >>> The user would be allowed to create as many directories as there are >>> CLOSs defined by the h/w. If user tries to create more than the >>> available CLOSs , -ENOSPC is returned. Currently we support only one >>> level of directory, ie directory can be created only under the root. > > NAK, cgroups must support full hierarchies, simply enforce that the > child cgroup's mask is a subset of the parent's. > >>> There are 2 modes supported >>> >>> 1. Affinitized mode : Each CAT cgroup is affinitized to a set of CPUs >>> specified by the 'cpus' file. The tasks in the CAT cgroup would be >>> constrained only on the CPUs in the 'cpus' file. The CPUs in this file >>> are exclusively used for this cgroup. Requests by task >>> using the sched_setaffinity() would be filtered through the tasks >>> 'cpus'. > > NAK, we will not have yet another cgroup mucking about with task > affinities. > >>> These tasks would get to fill the LLC cache represented by the >>> cgroup's 'cbm' file. 'cpus' is a cpumask and works the same way as >>> the existing cpumask datastructure. >>> >>> 2. Non Affinitized mode : Each CAT cgroup(inturn 'subset') would be >>> for a group of tasks. There is no 'cpus' file and the CPUs that the >>> tasks run are not restricted by the CAT cgroup > > It appears to me this 'mode' thing is entirely superfluous and can be > constructed by voluntary operation of this and cpusets or manual > affinity calls. Do you mean user would would just user the cpusets for cpu affinity and CAT cgroup for cache allocation as shown in example below ? In other words say affinitize the PID1 and PID2 to CPUs 1 and 2 and then set the desired cache allocation as well like below - then we have the desired cpu affinity and cache allocation for these PIDs.. cd /sys/fs/cgroup/cpuset mkdir group1_specialuse /bin/echo 1-2 > cpuset.cpus /bin/echo PID1 > tasks /bin/echo PID2 > tasks Now come to CAT and do the cache allocation for the same tasks PID1 and PID2. cd /sys/fs/cgroup/cat (CAT cgroup) mkdir group1_specialuse (keeping same name just for understanding) /bin/echo 0xf > cat.cbm (set the cache bit mask) /bin/echo PID1 > tasks /bin/echo PID2 > tasks > >>> Assignment of CBM,CLOS and modes >>> --------------------------------- >>> >>> Root directory would have all bits in 'cbm' file by default. >>> >>> The cbm_max file in the root defines the maximum number of bits >>> describing the available cache units. Say if cbm_max is 16 then the >>> 'cbm' cannot have more than 16 bits. 
> > This seems redundant, if you've already stated that the root cbm is the > full set, there is no need to further provide this. > >>> The 'cbm' file is restricted to having no more than its cbm_max least >>> significant bits set. Any contiguous subset of these bits maybe set to >>> indication the cache mapping desired. The 'cbm' between 2 directories >>> can overlap. The 'cbm' would represent the cache 'subset' of the CAT >>> cgroup. > > This would follow from the hierarchy requirement/conditions. > >>> Scheduling and Context Switch >>> ------------------------------ > >>> In non-affinitized mode the 'affinitized' is 0 , and the 'tasks' file >>> indicate the tasks the cache subset is affinitized to. When user adds >>> tasks to the tasks file , the tasks would get to fill the cache subset >>> represented by the CAT cgroup's 'cbm' file. >>> >>> During context switch kernel implements this by writing the >>> corresponding CLOSid (internally maintained by kernel) of the CAT >>> cgroup to the CPU's IA32_PQR_ASSOC MSR. > > Right. > ^ permalink raw reply [flat|nested] 39+ messages in thread
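[Editorial note: as a concrete reading of the 'cbm' constraints quoted above (a contiguous mask no wider than cbm_max and, per the hierarchy requirement, a subset of the parent's mask), a validation helper might look like the C sketch below. Function names and the exact error codes are assumptions for illustration, not code from the posted patches.]

#include <errno.h>
#include <stdint.h>

/* Hypothetical check applied when userspace writes a new 'cbm' value. */
static int validate_cbm(uint64_t cbm, uint64_t parent_cbm, unsigned int cbm_max)
{
	uint64_t max_mask = (cbm_max >= 64) ? ~0ULL : (1ULL << cbm_max) - 1;
	uint64_t chunk;

	if (cbm == 0 || (cbm & ~max_mask))
		return -EINVAL;		/* empty mask or bits above cbm_max */

	/* Contiguity: shifting out trailing zeroes must leave 2^k - 1. */
	chunk = cbm >> __builtin_ctzll(cbm);
	if (chunk & (chunk + 1))
		return -EINVAL;		/* holes in the mask */

	if ((cbm & parent_cbm) != cbm)
		return -EPERM;		/* must be a subset of the parent's cbm */

	return 0;
}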
* Re: Cache Allocation Technology Design 2014-10-29 17:26 ` Vikas Shivappa @ 2014-10-29 18:16 ` Peter Zijlstra 0 siblings, 0 replies; 39+ messages in thread From: Peter Zijlstra @ 2014-10-29 18:16 UTC (permalink / raw) To: Vikas Shivappa Cc: Matt Fleming, vikas, linux-kernel, matt.fleming, will.auld, tj On Wed, Oct 29, 2014 at 10:26:16AM -0700, Vikas Shivappa wrote: > >It appears to me this 'mode' thing is entirely superfluous and can be > >constructed by voluntary operation of this and cpusets or manual > >affinity calls. > > Do you mean user would would just user the cpusets for cpu affinity and CAT > cgroup for cache allocation as shown in example below ? > > In other words say affinitize the PID1 and PID2 to CPUs 1 and 2 > and then set the desired cache allocation as well like below - then we have > the desired cpu affinity and cache allocation for these PIDs.. > > cd /sys/fs/cgroup/cpuset > > mkdir group1_specialuse > /bin/echo 1-2 > cpuset.cpus > /bin/echo PID1 > tasks > /bin/echo PID2 > tasks > > Now come to CAT and do the cache allocation for the same tasks PID1 and > PID2. > > cd /sys/fs/cgroup/cat (CAT cgroup) > > mkdir group1_specialuse (keeping same name just for understanding) > /bin/echo 0xf > cat.cbm (set the cache bit mask) > /bin/echo PID1 > tasks > /bin/echo PID2 > tasks > Yah, except I have a strong urge to mount cpusets under /dog when you put it like that ;-) Or co-mount cpusets and pets and do it that way. ^ permalink raw reply [flat|nested] 39+ messages in thread
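[Editorial note: for reference, co-mounting two cgroup v1 controllers as suggested above would look roughly like the following; 'cacheqos' is only a stand-in name for the separate cache controller being discussed, not an existing controller.]

mkdir -p /sys/fs/cgroup/dog
mount -t cgroup -o cpuset,cacheqos none /sys/fs/cgroup/dog
cd /sys/fs/cgroup/dog
mkdir group1    # one directory now carries both the cpuset and the cache knobs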
* Re: Cache Allocation Technology Design 2014-10-16 18:44 Cache Allocation Technology Design vikas 2014-10-20 16:18 ` Matt Fleming @ 2014-11-03 23:29 ` Vikas Shivappa 1 sibling, 0 replies; 39+ messages in thread From: Vikas Shivappa @ 2014-11-03 23:29 UTC (permalink / raw) To: vikas Cc: linux-kernel, matt.fleming, will.auld, tj, vikas.shivappa, hpa, tglx, mingo

Hello All,

Thanks for all the feedback so far. Below is the modified 'Kernel Implementation' section for review. The rest of the sections are the same as before apart from minor text changes reflecting the new implementation, so they can be skipped. Also adding Peter Anvin, Thomas Gleixner and Ingo Molnar for comments.

Kernel implementation Overview
-------------------------------

The kernel adds a file 'cbm' (cache bit mask) to the existing cpuset cgroup subsystem to support Cache Allocation. A CLOS (Class of Service) is represented by a CLOSid. The CLOSid is internal to the kernel and not exposed to the user. Each cgroup has one CBM and represents exactly one cache 'subset'.

The cgroup follows the normal cgroup hierarchy; mkdir and adding tasks to the cgroup never fail (these operations already exist in cpuset). When a child cgroup is created it inherits the CLOSid and the CBM from its parent. When a user changes the default CBM for a cgroup, a new CLOSid is allocated. Changing the 'cbm' may fail once the kernel runs out of the maximum number of CLOSids it can support. The tasks in the cgroup get to fill the portion of the LLC represented by the cgroup's 'cbm' file.

The user can use the existing 'cpu_exclusive' file in the cpuset cgroup to affinitize the tasks in a cgroup to an exclusive set of CPUs.

The root directory has all bits set in its 'cbm' file by default. Since all children inherit the parent's 'cbm', the feature effectively does not take effect until the user changes a cbm - in other words, the 'cbm' of every cgroup stays all 1s if the user never modifies any 'cbm' file, which means all tasks get to fill the entire cache and cache allocation is not in effect.

Assignment of CBM, CLOS
---------------------------------

The 'cbm' needs to be a subset of the parent node's 'cbm'. Any contiguous subset of these bits may be set to indicate the cache mapping desired. The 'cbm' of two directories can overlap. The 'cbm' represents the cache 'subset' of the CAT cgroup. For example, on a system with 16 max cbm bits, if a directory has the least significant 4 bits set in its 'cbm' file (i.e. the 'cbm' is 0xf), it is allocated the right quarter of the last level cache, which means the tasks belonging to this CAT cgroup can fill only that right quarter. If it has the most significant 8 bits set, it is allocated the left half of the cache (8 bits out of 16 represents 50%).

The cache portion defined in the CBM file is available for all tasks within the cgroup to fill, and these tasks are not allowed to allocate space in other parts of the cache.

Scheduling and Context Switch
------------------------------

During a context switch the kernel implements this by writing the CLOSid (internally maintained by the kernel) of the cgroup to which the task belongs into the CPU's IA32_PQR_ASSOC MSR.

Usage and Example
-----------------

With this patch the cpuset cgroup shows a new file, cpuset.cbm.

cd /sys/fs/cgroup/cpuset

Create 2 cpuset cgroups

mkdir group1
mkdir group2

Following are some of the files in the directory

ls
cpuset.cpus
cpuset.cpu_exclusive
cpuset.mems
cpuset.mem_exclusive
...
cpuset.cbm
...
Say if the cache is 2MB and cbm supports 16 bits, then setting the below allocates the 'right 1/4th(512KB)' of the cache to group2 Assign cpus and memory node to the group2. cd group2 /bin/echo 1-2 > cpuset.cpus /bin/echo 0 > cpuset.mems Make the CPUs exclusive for the cgroup /bin/echo 1 > cpuset.cpus_exclusive Edit the CBM for group2 to set the least significant 4 bits. This allocates 'right quarter' of the cache. /bin/echo 0xf > cpuset.cbm Change cpus in the directory. /bin/echo 1-4 > cpuset.cpus Edit the CBM for group2 to set the least significant 8 bits.This allocates the right half of the cache to 'group2'. cd group2 /bin/echo 0xff > cpuset.cbm Assign tasks to the group2 /bin/echo PID1 > tasks /bin/echo PID2 > tasks Meaning now threads PID1 and PID2 runs on CPUs 1-2 , and get to fill the 'right half' of the cache. Thanks, Vikas On Thu, 16 Oct 2014, vikas wrote: > Hi All , We have put together a draft design document for cache > allocation technology below. Please review the same and let us know any > feedback. > > Make sure you cc my email vikas.shivappa@linux.intel.com when replying > > Thanks, > Vikas > > What is Cache Allocation Technology ( CAT ) > ------------------------------------------- > > Cache Allocation Technology provides a way for the Software (OS/VMM) > to restrict cache allocation to a defined 'subset' of cache which may > be overlapping with other 'subsets'. This feature is used when > allocating a line in cache ie when pulling new data into the cache. > The programming of the h/w is done via programming MSRs. > > The different cache subsets are identified by CLOS identifier (class > of service) and each CLOS has a CBM (cache bit mask). The CBM is a > contiguous set of bits which defines the amount of cache resource that > is available for each 'subset'. > > Why is CAT (cache allocation technology) needed > ------------------------------------------------ > > The CAT enables more cache resources to be made available for higher > priority applications based on guidance from the execution > environment. > > The architecture also allows dynamically changing these subsets during > runtime to further optimize the performance of the higher priority > application with minimal degradation to the low priority app. > Additionally, resources can be rebalanced for system throughput > benefit. (Refer to Section 17.15 in the Intel SDM > http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf) > > This technique may be useful in managing large computer systems which > large LLC. Examples may be large servers running instances of > webservers or database servers. In such complex systems, these subsets > can be used for more careful placing of the available cache > resources. > > The CAT kernel patch would provide a basic kernel framework for users > to be able to implement such cache subsets. > > > Kernel implementation Overview > ------------------------------- > > Kernel implements a cgroup subsystem to support Cache Allocation. > > Creating a CAT cgroup would create a new CLOS <-> CBM mapping. Each > cgroup would have one CBM and would just represent one cache 'subset'. > > The user would be allowed to create as many directories as there are > CLOSs defined by the h/w. If user tries to create more than the > available CLOSs , -ENOSPC is returned. Currently we support only one > level of directory, ie directory can be created only under the root. > > There are 2 modes supported > > 1. 
Affinitized mode : Each CAT cgroup is affinitized to a set of CPUs > specified by the 'cpus' file. The tasks in the CAT cgroup would be > constrained only on the CPUs in the 'cpus' file. The CPUs in this file > are exclusively used for this cgroup. Requests by task > using the sched_setaffinity() would be filtered through the tasks > 'cpus'. > > These tasks would get to fill the LLC cache represented by the > cgroup's 'cbm' file. 'cpus' is a cpumask and works the same way as > the existing cpumask datastructure. > > 2. Non Affinitized mode : Each CAT cgroup(inturn 'subset') would be > for a group of tasks. There is no 'cpus' file and the CPUs that the > tasks run are not restricted by the CAT cgroup > > > Assignment of CBM,CLOS and modes > --------------------------------- > > Root directory would have all bits in 'cbm' file by default. > > The cbm_max file in the root defines the maximum number of bits > describing the available cache units. Say if cbm_max is 16 then the > 'cbm' cannot have more than 16 bits. > > The 'affinitized' file is either 0 or 1 which represent the two modes. > System would boot with affinitized mode and all CPUs would have all > bits in cbm set meaning all CPUs have 100% cache(effectively cache > allocation is not in effect). > > The 'cbm' file is restricted to having no more than its cbm_max least > significant bits set. Any contiguous subset of these bits maybe set to > indication the cache mapping desired. The 'cbm' between 2 directories > can overlap. The 'cbm' would represent the cache 'subset' of the CAT > cgroup. For ex: on a system with 16 bits of max cbm bits , if the > directory has the least significant 4 bits set in its 'cbm' file, it > would be allocated the right quarter of the Last level cache which > means the tasks belonging to this CAT cgroup can use the right quarter > of the cache to fill. If it has the most significant 8 bits set ,it > would be allocated the left half of the cache(8 bits out of 16 > represents 50%). > > The cache subset would be affinitized to a set of cpus in affinitized > mode. The CPUs to which this allocation is affinitized to is > represented by the 'cpus' file. The 'cpus' need to be mutually > exclusive from cpus of other directories. > > The cache portion defined in the CBM file is available to all tasks > within the CAT group and these task are not allowed to allocate space > in other parts of the cache. > > 'cbm' file is used in both modes where as the 'cpus' file is relevant > in affinitized mode and would disappear in non-affinitized mode. > > > Scheduling and Context Switch > ------------------------------ > > In affinitized mode , the cache 'subset' and the tasks in a CAT cgroup > are affinitized to the CPUs represented by the CAT cgroup's 'cpus' > file i.e when user sets the 'cbm' to 'portion' and 'cpus' to c and > 'tasks' to t, the tasks 't' would always be scheduled on cpus 'c' and > will get to fill in the allocated 'portion' in last level cache. > > As noted above ,in the affinitized mode the tasks in a CAT cgroup > would also be affinitized to the CPUs in the 'cpus' file of the > directory. Following hooks in the kernel are required to implement > this (on the lines of cpuset code) > - in sched_setaffinity to mask the requested cpu mask with what is > present in the task's 'cpus' > - in migrate_task to migrate the tasks only to those CPUs in the > 'cpus' file if possible. 
> - in select_task_rq > > In non-affinitized mode the 'affinitized' is 0 , and the 'tasks' file > indicate the tasks the cache subset is affinitized to. When user adds > tasks to the tasks file , the tasks would get to fill the cache subset > represented by the CAT cgroup's 'cbm' file. > > During context switch kernel implements this by writing the > corresponding CLOSid (internally maintained by kernel) of the CAT > cgroup to the CPU's IA32_PQR_ASSOC MSR. > > Usage and Example > ----------------- > > > Following would mount the cache allocation cgroup subsystem and create > 2 directories. Please refer to Documentation/cgroups/cgroups.txt on > details about how to use cgroups. > > cd /sys/fs/cgroup > mkdir cachealloc > mount -t cgroup -ocachealloc cachealloc /sys/fs/cgroup/cachealloc > cd cachealloc > > Create 2 cat cgroups > > mkdir group1 > mkdir group2 > > Following are some of the Files in the directory > > ls > cachea.cbm > cachea.cpus . cpus file only appears in the affinitized mode > cgroup.procs > tasks > cbm_max (root only) > affinitized (root only) . by default itsaffinitized mode > > Say if the cache is 2MB and cbm supports 16 bits, then setting the > below allocates the 'right 1/4th(512KB)' of the cache to group2 > > Edit the CBM for group2 to set the least significant 4 bits. This > allocates 'right quarter' of the cache. > > cd group2 > /bin/echo 0xf > cachealloc.cbm > > Change cpus in the directory. > > /bin/echo 1-4 > cachealloc.cpus > > Edit the CBM for group2 to set the least significant 8 bits.This > allocates the right half of the cache to 'group2'. > > cd group2 > /bin/echo 0xff > cachea.cbm > > Assign tasks to the group2 > > /bin/echo PID1 > tasks > /bin/echo PID2 > tasks > Meaning now threads > PID1 and PID2 runs on CPUs 1-4 , and get to fill the 'right half' of > the cache. The tasks PID1 and PID2 can only have a subset of the cpu > affinity defined in the 'cpus' file > > Edit the affinitized to 0.mode is changed in root directory cd .. > > /bin/echo 0 > cachealloc.affinitized > > Now the tasks and the cache allocation is not affinitized to the CPUs > and the task's cpu affinity is not restricted to being with the subset > of 'cpus' cpumask. > > > > > > > ^ permalink raw reply [flat|nested] 39+ messages in thread
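[Editorial note: to illustrate the context-switch step described in the revised section above, a minimal kernel-style C sketch of the MSR update might look like this. The helper and field names (task_closid, current_closid, cat_sched_in) are assumptions; only the MSR itself (IA32_PQR_ASSOC, 0xc8f, with the class of service in its upper 32 bits) comes from the SDM section referenced earlier in the thread.]

#include <linux/percpu.h>
#include <linux/sched.h>
#include <asm/msr.h>

#define MSR_IA32_PQR_ASSOC	0x0c8f

static DEFINE_PER_CPU(u32, current_closid);

/* Sketch of a hook run when 'next' is switched in (not the posted patch). */
static inline void cat_sched_in(struct task_struct *next)
{
	u32 closid = task_closid(next);	/* assumed lookup via the task's cgroup */

	if (closid == this_cpu_read(current_closid))
		return;			/* avoid a redundant, costly MSR write */

	this_cpu_write(current_closid, closid);
	/* RMID (low 32 bits) left at 0 here; the CLOS goes in the high 32 bits. */
	wrmsr(MSR_IA32_PQR_ASSOC, 0, closid);
}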