linux-kernel.vger.kernel.org archive mirror
* [RFD] CAT user space interface revisited
@ 2015-11-18 18:25 Thomas Gleixner
  2015-11-18 19:38 ` Luiz Capitulino
                   ` (4 more replies)
  0 siblings, 5 replies; 32+ messages in thread
From: Thomas Gleixner @ 2015-11-18 18:25 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, x86, Marcelo Tosatti, Luiz Capitulino,
	Vikas Shivappa, Tejun Heo, Yu Fenghua

Folks!

After rereading the mail flood on CAT and staring into the SDM for a
while, I think we all should sit back and look at it from scratch
again w/o our preconceptions - I certainly had to put my own away.

Let's look at the properties of CAT again:

   - It's a per socket facility

   - CAT slots can be associated to external hardware. This
     association is per socket as well, so different sockets can have
     different behaviour. I missed that detail when staring at it the
     first time, thanks for the pointer!

   - The association itself is per cpu. The COS selection happens on a
     CPU while the set of masks which are selected via COS are shared
     by all CPUs on a socket.

There are restrictions which CAT imposes in terms of configurability:

   - The bits which select a cache partition need to be consecutive

   - The number of possible cache association masks is limited

Let's look at the configurations (CDP omitted and size restricted)

Default:   1 1 1 1 1 1 1 1
	   1 1 1 1 1 1 1 1
	   1 1 1 1 1 1 1 1
	   1 1 1 1 1 1 1 1

Shared:	   1 1 1 1 1 1 1 1
	   0 0 1 1 1 1 1 1
	   0 0 0 0 1 1 1 1
	   0 0 0 0 0 0 1 1

Isolated:  1 1 1 1 0 0 0 0
	   0 0 0 0 1 1 0 0
	   0 0 0 0 0 0 1 0
	   0 0 0 0 0 0 0 1

Or any combination thereof. Surely some combinations will not make any
sense, but we really should not make any restrictions on the stupidity
of a sysadmin. The worst outcome might be L3 disabled for everything,
so what?
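
To make the contiguity restriction concrete, here is a minimal
standalone sketch (plain C, illustration only, not kernel code) which
checks whether a proposed cache bit mask (CBM) is valid, i.e. non-empty,
within max_maskbits and with all set bits consecutive. The sample masks
are the "Isolated" rows above, read with the leftmost bit as the
highest way:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A CBM is valid if it is non-zero, fits into max_maskbits and its
 * set bits form one contiguous block. */
static bool cbm_is_valid(uint32_t cbm, unsigned int max_maskbits)
{
	if (cbm == 0 || (cbm >> max_maskbits) != 0)
		return false;
	cbm >>= __builtin_ctz(cbm);	/* drop trailing zeros */
	return (cbm & (cbm + 1)) == 0;	/* rest must look like 0...01...1 */
}

int main(void)
{
	uint32_t masks[] = { 0xf0, 0x0c, 0x02, 0x01, 0x5a /* not contiguous */ };

	for (unsigned int i = 0; i < sizeof(masks) / sizeof(masks[0]); i++)
		printf("%#04x -> %s\n", masks[i],
		       cbm_is_valid(masks[i], 8) ? "valid" : "invalid");
	return 0;
}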

Now that gets even more convoluted if CDP comes into play and we
really need to look at CDP right now. We might end up with something
which looks like this:

   	   1 1 1 1 0 0 0 0	Code
	   1 1 1 1 0 0 0 0	Data
	   0 0 0 0 0 0 1 0	Code
	   0 0 0 0 1 1 0 0	Data
	   0 0 0 0 0 0 0 1	Code
	   0 0 0 0 1 1 0 0	Data
or 
	   0 0 0 0 0 0 0 1	Code
	   0 0 0 0 1 1 0 0	Data
	   0 0 0 0 0 0 0 1	Code
	   0 0 0 0 0 1 1 0	Data

Let's look at partitioning itself. We have two options:

   1) Per task partitioning

   2) Per CPU partitioning

So far we only talked about #1, but I think that #2 has a value as
well. Let me give you a simple example.

Assume that you have isolated a CPU and run your important task on
it. You give that task a slice of cache. Now that task needs kernel
services which run in kernel threads on that CPU. We really don't want
to (and cannot) hunt down random kernel threads (think cpu bound
worker threads, softirq threads ....) and give them another slice of
cache. What we really want is:

    	 1 1 1 1 0 0 0 0    <- Default cache
	 0 0 0 0 1 1 1 0    <- Cache for important task
	 0 0 0 0 0 0 0 1    <- Cache for CPU of important task

It would even be sufficient for particular use cases to just associate
a piece of cache to a given CPU and not bother with tasks at all.
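
As a standalone illustration (not the actual scheduler code) of the
lookup order this implies: a task with an explicit per-socket cosid
uses it, and everything else falls back to the default cosid of the CPU
it happens to run on. On real hardware the selected cosid is what would
be programmed into the per-CPU IA32_PQR_ASSOC MSR at context switch
time. The names and numbers below are made up:

#include <stdio.h>

#define COSID_DEFAULT	-1	/* no explicit assignment, use the CPU default */

struct task {
	const char	*comm;
	int		cosid[2];	/* per-socket cosid, COSID_DEFAULT if unset */
};

static int select_cosid(const struct task *t, int socket,
			const int *cpu_default_cosid, int cpu)
{
	if (t->cosid[socket] != COSID_DEFAULT)
		return t->cosid[socket];
	return cpu_default_cosid[cpu];
}

int main(void)
{
	int cpu_default_cosid[4] = { 0, 0, 0, 2 };	/* CPU 3 isolated, cosid 2 */
	struct task important = { "rt-task", { 1, 1 } };	/* explicit cosid 1 */
	struct task kworker = { "kworker/3:1", { COSID_DEFAULT, COSID_DEFAULT } };

	/* Both run on the isolated CPU 3, which sits on socket 0 */
	printf("%-12s -> cosid %d\n", important.comm,
	       select_cosid(&important, 0, cpu_default_cosid, 3));
	printf("%-12s -> cosid %d\n", kworker.comm,
	       select_cosid(&kworker, 0, cpu_default_cosid, 3));
	return 0;
}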

We really need to make this as configurable as possible from userspace
without imposing random restrictions on it. I played around with it on
my new Intel toy and the restriction to 16 COS ids (that's 8 with CDP
enabled) makes it really useless if we force the ids to have the same
meaning on all sockets and restrict it to per task partitioning.

Even if next generation systems have more COS ids available,
there are not going to be enough to have a system wide consistent
view unless we have COS ids > nr_cpus.

Aside from that I don't think that a system wide consistent view is
useful at all.

 - If a task migrates between sockets, it's going to suffer anyway.
   Real sensitive applications will simply pin tasks on a socket to
   avoid that in the first place. If we make the whole thing
   configurable enough then the sysadmin can set it up to support
   even the nonsensical case of identical cache partitions on all
   sockets and let tasks use the corresponding partitions when
   migrating.

 - The number of cache slices is going to be limited no matter what,
   so one still has to come up with a sensible partitioning scheme.

 - Even if we have enough cos ids the system wide view will not make
   the configuration problem any simpler as it remains per socket.

It's hard. Policies are hard by definition, but this one is harder
than most other policies due to the inherent limitations.

So now to the interface part. Unfortunately we need to expose this
very close to the hardware implementation as there are really no
abstractions which allow us to express the various bitmap
combinations. Any abstraction I tried to come up with renders that
thing completely useless.

I was not able to identify any existing infrastructure where this
really fits in. I chose a directory/file based representation. We
certainly could do the same with a syscall, but that's just an
implementation detail.

At top level:

   xxxxxxx/cat/max_cosids		<- Assume that all CPUs are the same
   xxxxxxx/cat/max_maskbits		<- Assume that all CPUs are the same
   xxxxxxx/cat/cdp_enable		<- Depends on CDP availability

Per socket data:

   xxxxxxx/cat/socket-0/
   ...
   xxxxxxx/cat/socket-N/l3_size
   xxxxxxx/cat/socket-N/hwsharedbits

Per socket mask data:

   xxxxxxx/cat/socket-N/cos-id-0/
   ...
   xxxxxxx/cat/socket-N/cos-id-N/inuse
				/cat_mask	
				/cdp_mask	<- Data mask if CDP enabled

Per cpu default cos id for the cpus on that socket:

   xxxxxxx/cat/socket-N/cpu-x/default_cosid
   ...
   xxxxxxx/cat/socket-N/cpu-N/default_cosid

The above allows a simple cpu based partitioning. All tasks which do
not have a cache partition assigned on a particular socket use the
default one of the cpu they are running on.

Now for the task(s) partitioning:

   xxxxxxx/cat/partitions/

Under that directory one can create partitions

   xxxxxxx/cat/partitions/p1/tasks
			    /socket-0/cosid
			    ...
			    /socket-n/cosid

   The default value for the per socket cosid is COSID_DEFAULT, which
   causes the task(s) to use the per cpu default id.
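
To illustrate how this would be driven from userspace (illustration
only: the mount point is left open above, so the /sys/fs/cat prefix,
the mask and cosid values and the pid below are all made up):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF) {
		perror(path);
		exit(1);
	}
	fclose(f);
}

int main(void)
{
	/* Claim cos-id 1 on socket 0 and give it the low four ways */
	write_str("/sys/fs/cat/socket-0/cos-id-1/cat_mask", "0x0f");
	write_str("/sys/fs/cat/socket-0/cos-id-1/inuse", "1");

	/* Everything running on CPU 3 uses cos-id 2 by default */
	write_str("/sys/fs/cat/socket-0/cpu-3/default_cosid", "2");

	/* Create a task partition and point it at cos-id 1 on socket 0 */
	if (mkdir("/sys/fs/cat/partitions/p1", 0755) && errno != EEXIST) {
		perror("mkdir");
		exit(1);
	}
	write_str("/sys/fs/cat/partitions/p1/socket-0/cosid", "1");
	write_str("/sys/fs/cat/partitions/p1/tasks", "4711");
	return 0;
}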

Thoughts?

Thanks,

	tglx


* Re: [RFD] CAT user space interface revisited
  2015-11-18 18:25 [RFD] CAT user space interface revisited Thomas Gleixner
@ 2015-11-18 19:38 ` Luiz Capitulino
  2015-11-18 19:55   ` Auld, Will
  2015-11-18 22:34 ` Marcelo Tosatti
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 32+ messages in thread
From: Luiz Capitulino @ 2015-11-18 19:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, x86, Marcelo Tosatti, Vikas Shivappa,
	Tejun Heo, Yu Fenghua, will.auld, donald.d.dugger, riel

On Wed, 18 Nov 2015 19:25:03 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:

> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on
> my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same
> meaning on all sockets and restrict it to per task partitioning.
> 
> Even if next generation systems will have more COS ids available,
> there are not going to be enough to have a system wide consistent
> view unless we have COS ids > nr_cpus.
> 
> Aside of that I don't think that a system wide consistent view is
> useful at all.

This is a great writeup! I agree with everything you said.

> So now to the interface part. Unfortunately we need to expose this
> very close to the hardware implementation as there are really no
> abstractions which allow us to express the various bitmap
> combinations. Any abstraction I tried to come up with renders that
> thing completely useless.
> 
> I was not able to identify any existing infrastructure where this
> really fits in. I chose a directory/file based representation. We
> certainly could do the same with a syscall, but that's just an
> implementation detail.
> 
> At top level:
> 
>    xxxxxxx/cat/max_cosids		<- Assume that all CPUs are the same
>    xxxxxxx/cat/max_maskbits		<- Assume that all CPUs are the same
>    xxxxxxx/cat/cdp_enable		<- Depends on CDP availability
> 
> Per socket data:
> 
>    xxxxxxx/cat/socket-0/
>    ...
>    xxxxxxx/cat/socket-N/l3_size
>    xxxxxxx/cat/socket-N/hwsharedbits
> 
> Per socket mask data:
> 
>    xxxxxxx/cat/socket-N/cos-id-0/
>    ...
>    xxxxxxx/cat/socket-N/cos-id-N/inuse
> 				/cat_mask	
> 				/cdp_mask	<- Data mask if CDP enabled
> 
> Per cpu default cos id for the cpus on that socket:
> 
>    xxxxxxx/cat/socket-N/cpu-x/default_cosid
>    ...
>    xxxxxxx/cat/socket-N/cpu-N/default_cosid
> 
> The above allows a simple cpu based partitioning. All tasks which do
> not have a cache partition assigned on a particular socket use the
> default one of the cpu they are running on.
> 
> Now for the task(s) partitioning:
> 
>    xxxxxxx/cat/partitions/
> 
> Under that directory one can create partitions
> 
>    xxxxxxx/cat/partitions/p1/tasks
> 			    /socket-0/cosid
> 			    ...
> 			    /socket-n/cosid
> 
>    The default value for the per socket cosid is COSID_DEFAULT, which
>    causes the task(s) to use the per cpu default id.

I hope I've got all the details right, but this proposal looks awesome.
There are more people who seem to agree with something like this.

Btw, I think it should be possible to implement this with cgroups. But
I too don't care that much about cgroups vs. syscalls.


* RE: [RFD] CAT user space interface revisited
  2015-11-18 19:38 ` Luiz Capitulino
@ 2015-11-18 19:55   ` Auld, Will
  0 siblings, 0 replies; 32+ messages in thread
From: Auld, Will @ 2015-11-18 19:55 UTC (permalink / raw)
  To: Luiz Capitulino, Thomas Gleixner
  Cc: LKML, Peter Zijlstra, x86, Marcelo Tosatti, Shivappa, Vikas,
	Tejun Heo, Yu, Fenghua, Dugger, Donald D, riel, Luck, Tony

+Tony

> -----Original Message-----
> From: Luiz Capitulino [mailto:lcapitulino@redhat.com]
> Sent: Wednesday, November 18, 2015 11:38 AM
> To: Thomas Gleixner
> Cc: LKML; Peter Zijlstra; x86@kernel.org; Marcelo Tosatti; Shivappa, Vikas; Tejun
> Heo; Yu, Fenghua; Auld, Will; Dugger, Donald D; riel@redhat.com
> Subject: Re: [RFD] CAT user space interface revisited
> 
> On Wed, 18 Nov 2015 19:25:03 +0100 (CET) Thomas Gleixner
> <tglx@linutronix.de> wrote:
> 
> > We really need to make this as configurable as possible from userspace
> > without imposing random restrictions to it. I played around with it on
> > my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> > enabled) makes it really useless if we force the ids to have the same
> > meaning on all sockets and restrict it to per task partitioning.
> >
> > Even if next generation systems will have more COS ids available,
> > there are not going to be enough to have a system wide consistent view
> > unless we have COS ids > nr_cpus.
> >
> > Aside of that I don't think that a system wide consistent view is
> > useful at all.
> 
> This is a great writeup! I agree with everything you said.
> 
> > So now to the interface part. Unfortunately we need to expose this
> > very close to the hardware implementation as there are really no
> > abstractions which allow us to express the various bitmap
> > combinations. Any abstraction I tried to come up with renders that
> > thing completely useless.
> >
> > I was not able to identify any existing infrastructure where this
> > really fits in. I chose a directory/file based representation. We
> > certainly could do the same with a syscall, but that's just an
> > implementation detail.
> >
> > At top level:
> >
> >    xxxxxxx/cat/max_cosids		<- Assume that all CPUs are the same
> >    xxxxxxx/cat/max_maskbits		<- Assume that all CPUs are the same
> >    xxxxxxx/cat/cdp_enable		<- Depends on CDP availability
> >
> > Per socket data:
> >
> >    xxxxxxx/cat/socket-0/
> >    ...
> >    xxxxxxx/cat/socket-N/l3_size
> >    xxxxxxx/cat/socket-N/hwsharedbits
> >
> > Per socket mask data:
> >
> >    xxxxxxx/cat/socket-N/cos-id-0/
> >    ...
> >    xxxxxxx/cat/socket-N/cos-id-N/inuse
> > 				/cat_mask
> > 				/cdp_mask	<- Data mask if CDP enabled
> >
> > Per cpu default cos id for the cpus on that socket:
> >
> >    xxxxxxx/cat/socket-N/cpu-x/default_cosid
> >    ...
> >    xxxxxxx/cat/socket-N/cpu-N/default_cosid
> >
> > The above allows a simple cpu based partitioning. All tasks which do
> > not have a cache partition assigned on a particular socket use the
> > default one of the cpu they are running on.
> >
> > Now for the task(s) partitioning:
> >
> >    xxxxxxx/cat/partitions/
> >
> > Under that directory one can create partitions
> >
> >    xxxxxxx/cat/partitions/p1/tasks
> > 			    /socket-0/cosid
> > 			    ...
> > 			    /socket-n/cosid
> >
> >    The default value for the per socket cosid is COSID_DEFAULT, which
> >    causes the task(s) to use the per cpu default id.
> 
> I hope I've got all the details right, but this proposal looks awesome.
> There's more people who seem to agree with something like this.
> 
> Btw, I think it should be possible to implement this with cgroups. But I too don't
> care that much on cgroups vs. syscalls.


* Re: [RFD] CAT user space interface revisited
  2015-11-18 18:25 [RFD] CAT user space interface revisited Thomas Gleixner
  2015-11-18 19:38 ` Luiz Capitulino
@ 2015-11-18 22:34 ` Marcelo Tosatti
  2015-11-19  0:34   ` Marcelo Tosatti
  2015-11-19  8:11   ` Thomas Gleixner
  2015-11-19  0:01 ` Marcelo Tosatti
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2015-11-18 22:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, x86, Luiz Capitulino, Vikas Shivappa,
	Tejun Heo, Yu Fenghua

On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> Folks!
> 
> After rereading the mail flood on CAT and staring into the SDM for a
> while, I think we all should sit back and look at it from scratch
> again w/o our preconceptions - I certainly had to put my own away.
> 
> Let's look at the properties of CAT again:
> 
>    - It's a per socket facility
> 
>    - CAT slots can be associated to external hardware. This
>      association is per socket as well, so different sockets can have
>      different behaviour. I missed that detail when staring the first
>      time, thanks for the pointer!
> 
>    - The association ifself is per cpu. The COS selection happens on a
>      CPU while the set of masks which are selected via COS are shared
>      by all CPUs on a socket.
> 
> There are restrictions which CAT imposes in terms of configurability:
> 
>    - The bits which select a cache partition need to be consecutive
> 
>    - The number of possible cache association masks is limited
> 
> Let's look at the configurations (CDP omitted and size restricted)
> 
> Default:   1 1 1 1 1 1 1 1
> 	   1 1 1 1 1 1 1 1
> 	   1 1 1 1 1 1 1 1
> 	   1 1 1 1 1 1 1 1
> 
> Shared:	   1 1 1 1 1 1 1 1
> 	   0 0 1 1 1 1 1 1
> 	   0 0 0 0 1 1 1 1
> 	   0 0 0 0 0 0 1 1
> 
> Isolated:  1 1 1 1 0 0 0 0
> 	   0 0 0 0 1 1 0 0
> 	   0 0 0 0 0 0 1 0
> 	   0 0 0 0 0 0 0 1
> 
> Or any combination thereof. Surely some combinations will not make any
> sense, but we really should not make any restrictions on the stupidity
> of a sysadmin. The worst outcome might be L3 disabled for everything,
> so what?
> 
> Now that gets even more convoluted if CDP comes into play and we
> really need to look at CDP right now. We might end up with something
> which looks like this:
> 
>    	   1 1 1 1 0 0 0 0	Code
> 	   1 1 1 1 0 0 0 0	Data
> 	   0 0 0 0 0 0 1 0	Code
> 	   0 0 0 0 1 1 0 0	Data
> 	   0 0 0 0 0 0 0 1	Code
> 	   0 0 0 0 1 1 0 0	Data
> or 
> 	   0 0 0 0 0 0 0 1	Code
> 	   0 0 0 0 1 1 0 0	Data
> 	   0 0 0 0 0 0 0 1	Code
> 	   0 0 0 0 0 1 1 0	Data
> 
> Let's look at partitioning itself. We have two options:
> 
>    1) Per task partitioning
> 
>    2) Per CPU partitioning
> 
> So far we only talked about #1, but I think that #2 has a value as
> well. Let me give you a simple example.
> 
> Assume that you have isolated a CPU and run your important task on
> it. You give that task a slice of cache. Now that task needs kernel
> services which run in kernel threads on that CPU. We really don't want
> to (and cannot) hunt down random kernel threads (think cpu bound
> worker threads, softirq threads ....) and give them another slice of
> cache. What we really want is:
> 
>     	 1 1 1 1 0 0 0 0    <- Default cache
> 	 0 0 0 0 1 1 1 0    <- Cache for important task
> 	 0 0 0 0 0 0 0 1    <- Cache for CPU of important task
> 
> It would even be sufficient for particular use cases to just associate
> a piece of cache to a given CPU and do not bother with tasks at all.
> 
> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on
> my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same
> meaning on all sockets and restrict it to per task partitioning.
> 
> Even if next generation systems will have more COS ids available,
> there are not going to be enough to have a system wide consistent
> view unless we have COS ids > nr_cpus.
> 
> Aside of that I don't think that a system wide consistent view is
> useful at all.
> 
>  - If a task migrates between sockets, it's going to suffer anyway.
>    Real sensitive applications will simply pin tasks on a socket to
>    avoid that in the first place. If we make the whole thing
>    configurable enough then the sysadmin can set it up to support
>    even the nonsensical case of identical cache partitions on all
>    sockets and let tasks use the corresponding partitions when
>    migrating.
> 
>  - The number of cache slices is going to be limited no matter what,
>    so one still has to come up with a sensible partitioning scheme.
> 
>  - Even if we have enough cos ids the system wide view will not make
>    the configuration problem any simpler as it remains per socket.
> 
> It's hard. Policies are hard by definition, but this one is harder
> than most other policies due to the inherent limitations.
> 
> So now to the interface part. Unfortunately we need to expose this
> very close to the hardware implementation as there are really no
> abstractions which allow us to express the various bitmap
> combinations. Any abstraction I tried to come up with renders that
> thing completely useless.
> 
> I was not able to identify any existing infrastructure where this
> really fits in. I chose a directory/file based representation. We
> certainly could do the same with a syscall, but that's just an
> implementation detail.
> 
> At top level:
> 
>    xxxxxxx/cat/max_cosids		<- Assume that all CPUs are the same
>    xxxxxxx/cat/max_maskbits		<- Assume that all CPUs are the same
>    xxxxxxx/cat/cdp_enable		<- Depends on CDP availability
> 
> Per socket data:
> 
>    xxxxxxx/cat/socket-0/
>    ...
>    xxxxxxx/cat/socket-N/l3_size
>    xxxxxxx/cat/socket-N/hwsharedbits
> 
> Per socket mask data:
> 
>    xxxxxxx/cat/socket-N/cos-id-0/
>    ...
>    xxxxxxx/cat/socket-N/cos-id-N/inuse
> 				/cat_mask	
> 				/cdp_mask	<- Data mask if CDP enabled
> 
> Per cpu default cos id for the cpus on that socket:
> 
>    xxxxxxx/cat/socket-N/cpu-x/default_cosid
>    ...
>    xxxxxxx/cat/socket-N/cpu-N/default_cosid
> 
> The above allows a simple cpu based partitioning. All tasks which do
> not have a cache partition assigned on a particular socket use the
> default one of the cpu they are running on.
> 
> Now for the task(s) partitioning:
> 
>    xxxxxxx/cat/partitions/
> 
> Under that directory one can create partitions
> 
>    xxxxxxx/cat/partitions/p1/tasks
> 			    /socket-0/cosid
> 			    ...
> 			    /socket-n/cosid
> 
>    The default value for the per socket cosid is COSID_DEFAULT, which
>    causes the task(s) to use the per cpu default id.
> 
> Thoughts?
> 
> Thanks,
> 
> 	tglx

The cgroups interface works, but moves the problem of contiguous
allocation to userspace, and is incompatible with cache allocations
on demand.

We still have to solve the kernel threads vs. cgroups issue...



* Re: [RFD] CAT user space interface revisited
  2015-11-18 18:25 [RFD] CAT user space interface revisited Thomas Gleixner
  2015-11-18 19:38 ` Luiz Capitulino
  2015-11-18 22:34 ` Marcelo Tosatti
@ 2015-11-19  0:01 ` Marcelo Tosatti
  2015-11-19  1:05   ` Marcelo Tosatti
                     ` (2 more replies)
  2015-11-24  7:31 ` Chao Peng
  2015-12-22 18:12 ` Yu, Fenghua
  4 siblings, 3 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2015-11-19  0:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, x86, Luiz Capitulino, Vikas Shivappa,
	Tejun Heo, Yu Fenghua

On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> Folks!
> 
> After rereading the mail flood on CAT and staring into the SDM for a
> while, I think we all should sit back and look at it from scratch
> again w/o our preconceptions - I certainly had to put my own away.
> 
> Let's look at the properties of CAT again:
> 
>    - It's a per socket facility
> 
>    - CAT slots can be associated to external hardware. This
>      association is per socket as well, so different sockets can have
>      different behaviour. I missed that detail when staring the first
>      time, thanks for the pointer!
> 
>    - The association ifself is per cpu. The COS selection happens on a
>      CPU while the set of masks which are selected via COS are shared
>      by all CPUs on a socket.
> 
> There are restrictions which CAT imposes in terms of configurability:
> 
>    - The bits which select a cache partition need to be consecutive
> 
>    - The number of possible cache association masks is limited
> 
> Let's look at the configurations (CDP omitted and size restricted)
> 
> Default:   1 1 1 1 1 1 1 1
> 	   1 1 1 1 1 1 1 1
> 	   1 1 1 1 1 1 1 1
> 	   1 1 1 1 1 1 1 1
> 
> Shared:	   1 1 1 1 1 1 1 1
> 	   0 0 1 1 1 1 1 1
> 	   0 0 0 0 1 1 1 1
> 	   0 0 0 0 0 0 1 1
> 
> Isolated:  1 1 1 1 0 0 0 0
> 	   0 0 0 0 1 1 0 0
> 	   0 0 0 0 0 0 1 0
> 	   0 0 0 0 0 0 0 1
> 
> Or any combination thereof. Surely some combinations will not make any
> sense, but we really should not make any restrictions on the stupidity
> of a sysadmin. The worst outcome might be L3 disabled for everything,
> so what?
> 
> Now that gets even more convoluted if CDP comes into play and we
> really need to look at CDP right now. We might end up with something
> which looks like this:
> 
>    	   1 1 1 1 0 0 0 0	Code
> 	   1 1 1 1 0 0 0 0	Data
> 	   0 0 0 0 0 0 1 0	Code
> 	   0 0 0 0 1 1 0 0	Data
> 	   0 0 0 0 0 0 0 1	Code
> 	   0 0 0 0 1 1 0 0	Data
> or 
> 	   0 0 0 0 0 0 0 1	Code
> 	   0 0 0 0 1 1 0 0	Data
> 	   0 0 0 0 0 0 0 1	Code
> 	   0 0 0 0 0 1 1 0	Data
> 
> Let's look at partitioning itself. We have two options:
> 
>    1) Per task partitioning
> 
>    2) Per CPU partitioning
> 
> So far we only talked about #1, but I think that #2 has a value as
> well. Let me give you a simple example.
> 
> Assume that you have isolated a CPU and run your important task on
> it. You give that task a slice of cache. Now that task needs kernel
> services which run in kernel threads on that CPU. We really don't want
> to (and cannot) hunt down random kernel threads (think cpu bound
> worker threads, softirq threads ....) and give them another slice of
> cache. What we really want is:
> 
>     	 1 1 1 1 0 0 0 0    <- Default cache
> 	 0 0 0 0 1 1 1 0    <- Cache for important task
> 	 0 0 0 0 0 0 0 1    <- Cache for CPU of important task
> 
> It would even be sufficient for particular use cases to just associate
> a piece of cache to a given CPU and do not bother with tasks at all.
> 
> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on
> my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same
> meaning on all sockets and restrict it to per task partitioning.
> 
> Even if next generation systems will have more COS ids available,
> there are not going to be enough to have a system wide consistent
> view unless we have COS ids > nr_cpus.
> 
> Aside of that I don't think that a system wide consistent view is
> useful at all.
> 
>  - If a task migrates between sockets, it's going to suffer anyway.
>    Real sensitive applications will simply pin tasks on a socket to
>    avoid that in the first place. If we make the whole thing
>    configurable enough then the sysadmin can set it up to support
>    even the nonsensical case of identical cache partitions on all
>    sockets and let tasks use the corresponding partitions when
>    migrating.
> 
>  - The number of cache slices is going to be limited no matter what,
>    so one still has to come up with a sensible partitioning scheme.
> 
>  - Even if we have enough cos ids the system wide view will not make
>    the configuration problem any simpler as it remains per socket.
> 
> It's hard. Policies are hard by definition, but this one is harder
> than most other policies due to the inherent limitations.
> 
> So now to the interface part. Unfortunately we need to expose this
> very close to the hardware implementation as there are really no
> abstractions which allow us to express the various bitmap
> combinations. Any abstraction I tried to come up with renders that
> thing completely useless.

No you don't.

> I was not able to identify any existing infrastructure where this
> really fits in. I chose a directory/file based representation. We
> certainly could do the same with a syscall, but that's just an
> implementation detail.
> 
> At top level:
> 
>    xxxxxxx/cat/max_cosids		<- Assume that all CPUs are the same
>    xxxxxxx/cat/max_maskbits		<- Assume that all CPUs are the same
>    xxxxxxx/cat/cdp_enable		<- Depends on CDP availability
> 
> Per socket data:
> 
>    xxxxxxx/cat/socket-0/
>    ...
>    xxxxxxx/cat/socket-N/l3_size
>    xxxxxxx/cat/socket-N/hwsharedbits
> 
> Per socket mask data:
> 
>    xxxxxxx/cat/socket-N/cos-id-0/
>    ...
>    xxxxxxx/cat/socket-N/cos-id-N/inuse
> 				/cat_mask	
> 				/cdp_mask	<- Data mask if CDP enabled

There is no need to expose all this to userspace, but for some unknown 
reason people seem to be fond of that, so let's pretend it's necessary.

> Per cpu default cos id for the cpus on that socket:
> 
>    xxxxxxx/cat/socket-N/cpu-x/default_cosid
>    ...
>    xxxxxxx/cat/socket-N/cpu-N/default_cosid
> 
> The above allows a simple cpu based partitioning. All tasks which do
> not have a cache partition assigned on a particular socket use the
> default one of the cpu they are running on.

A task which does not have a partition assigned to it
has to use the "other tasks" group (COSid0), so that it does 
not interfere with the cache reservations of other tasks.

All that is necessary are reservations {size, type} and lists of reservations
per task. This is the right level at which to expose this to userspace,
without userspace having to care about unnecessary HW details.

> Now for the task(s) partitioning:
> 
>    xxxxxxx/cat/partitions/
> 
> Under that directory one can create partitions
> 
>    xxxxxxx/cat/partitions/p1/tasks
> 			    /socket-0/cosid
> 			    ...
> 			    /socket-n/cosid
> 
>    The default value for the per socket cosid is COSID_DEFAULT, which
>    causes the task(s) to use the per cpu default id.
> 
> Thoughts?
> 
> Thanks,
> 
> 	tglx

Again: you don't need to look into the MSR table and relate it 
to tasks if you store the data as:

	task group 1 = {
			reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
			reservation-2 = {size = 100Kb, type = code, socketmask = 0xffff}
	}
	
	task group 2 = {
			reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
			reservation-3 = {size = 200Kb, type = code, socketmask = 0xffff}
	}

Task group 1 and task group 2 share reservation-1.

This is what userspace is going to expose to users, of course.
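
As a rough standalone sketch of that data model (the types and field
names here are made up, just to illustrate that the kernel, not
userspace, would map this to per-socket COS ids and CBMs):

#include <stdint.h>
#include <stdio.h>

enum res_type { RES_DATA, RES_CODE };

struct reservation {
	unsigned int	kbytes;		/* requested size */
	enum res_type	type;		/* data or code (with CDP) */
	uint64_t	socketmask;	/* sockets this reservation applies to */
};

struct task_group {
	const char			*name;
	const struct reservation	*res[2];	/* shared by reference */
};

static const struct reservation res1 = {  80, RES_DATA, 0xffff };
static const struct reservation res2 = { 100, RES_CODE, 0xffff };
static const struct reservation res3 = { 200, RES_CODE, 0xffff };

static const struct task_group tg1 = { "task group 1", { &res1, &res2 } };
static const struct task_group tg2 = { "task group 2", { &res1, &res3 } };

int main(void)
{
	printf("%s and %s share reservation-1: %s\n", tg1.name, tg2.name,
	       tg1.res[0] == tg2.res[0] ? "yes" : "no");
	return 0;
}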

If you expose the MSRs to userspace, you force userspace to convert
from this format to the MSRs (minding whether there
are contiguous regions available, and the region shared with HW).

    - The bits which select a cache partition need to be consecutive

BUT, for our use case the cgroups interface works as well, so let's
go with that (Tejun apparently had a use case where tasks were allowed to
set reservations themselves, in response to external events).




* Re: [RFD] CAT user space interface revisited
  2015-11-18 22:34 ` Marcelo Tosatti
@ 2015-11-19  0:34   ` Marcelo Tosatti
  2015-11-19  8:35     ` Thomas Gleixner
  2015-11-19  8:11   ` Thomas Gleixner
  1 sibling, 1 reply; 32+ messages in thread
From: Marcelo Tosatti @ 2015-11-19  0:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, x86, Luiz Capitulino, Vikas Shivappa,
	Tejun Heo, Yu Fenghua

On Wed, Nov 18, 2015 at 08:34:07PM -0200, Marcelo Tosatti wrote:
> On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > Folks!
> > 
> > After rereading the mail flood on CAT and staring into the SDM for a
> > while, I think we all should sit back and look at it from scratch
> > again w/o our preconceptions - I certainly had to put my own away.
> > 
> > Let's look at the properties of CAT again:
> > 
> >    - It's a per socket facility
> > 
> >    - CAT slots can be associated to external hardware. This
> >      association is per socket as well, so different sockets can have
> >      different behaviour. I missed that detail when staring the first
> >      time, thanks for the pointer!
> > 
> >    - The association ifself is per cpu. The COS selection happens on a
> >      CPU while the set of masks which are selected via COS are shared
> >      by all CPUs on a socket.
> > 
> > There are restrictions which CAT imposes in terms of configurability:
> > 
> >    - The bits which select a cache partition need to be consecutive
> > 
> >    - The number of possible cache association masks is limited
> > 
> > Let's look at the configurations (CDP omitted and size restricted)
> > 
> > Default:   1 1 1 1 1 1 1 1
> > 	   1 1 1 1 1 1 1 1
> > 	   1 1 1 1 1 1 1 1
> > 	   1 1 1 1 1 1 1 1
> > 
> > Shared:	   1 1 1 1 1 1 1 1
> > 	   0 0 1 1 1 1 1 1
> > 	   0 0 0 0 1 1 1 1
> > 	   0 0 0 0 0 0 1 1
> > 
> > Isolated:  1 1 1 1 0 0 0 0
> > 	   0 0 0 0 1 1 0 0
> > 	   0 0 0 0 0 0 1 0
> > 	   0 0 0 0 0 0 0 1
> > 
> > Or any combination thereof. Surely some combinations will not make any
> > sense, but we really should not make any restrictions on the stupidity
> > of a sysadmin. The worst outcome might be L3 disabled for everything,
> > so what?
> > 
> > Now that gets even more convoluted if CDP comes into play and we
> > really need to look at CDP right now. We might end up with something
> > which looks like this:
> > 
> >    	   1 1 1 1 0 0 0 0	Code
> > 	   1 1 1 1 0 0 0 0	Data
> > 	   0 0 0 0 0 0 1 0	Code
> > 	   0 0 0 0 1 1 0 0	Data
> > 	   0 0 0 0 0 0 0 1	Code
> > 	   0 0 0 0 1 1 0 0	Data
> > or 
> > 	   0 0 0 0 0 0 0 1	Code
> > 	   0 0 0 0 1 1 0 0	Data
> > 	   0 0 0 0 0 0 0 1	Code
> > 	   0 0 0 0 0 1 1 0	Data
> > 
> > Let's look at partitioning itself. We have two options:
> > 
> >    1) Per task partitioning
> > 
> >    2) Per CPU partitioning
> > 
> > So far we only talked about #1, but I think that #2 has a value as
> > well. Let me give you a simple example.
> > 
> > Assume that you have isolated a CPU and run your important task on
> > it. You give that task a slice of cache. Now that task needs kernel
> > services which run in kernel threads on that CPU. We really don't want
> > to (and cannot) hunt down random kernel threads (think cpu bound
> > worker threads, softirq threads ....) and give them another slice of
> > cache. What we really want is:
> > 
> >     	 1 1 1 1 0 0 0 0    <- Default cache
> > 	 0 0 0 0 1 1 1 0    <- Cache for important task
> > 	 0 0 0 0 0 0 0 1    <- Cache for CPU of important task
> > 
> > It would even be sufficient for particular use cases to just associate
> > a piece of cache to a given CPU and do not bother with tasks at all.

Well, any work on behalf of the important task should have its cache
protected as well (for example, IRQ handling threads).

But for certain kernel tasks for which L3 cache is not beneficial
(e.g. kernel samepage merging), it might be useful to exclude such tasks
from the "important, do not flush" L3 cache portion.

> > We really need to make this as configurable as possible from userspace
> > without imposing random restrictions to it. I played around with it on
> > my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> > enabled) makes it really useless if we force the ids to have the same
> > meaning on all sockets and restrict it to per task partitioning.
> > 
> > Even if next generation systems will have more COS ids available,
> > there are not going to be enough to have a system wide consistent
> > view unless we have COS ids > nr_cpus.
> > 
> > Aside of that I don't think that a system wide consistent view is
> > useful at all.
> > 
> >  - If a task migrates between sockets, it's going to suffer anyway.
> >    Real sensitive applications will simply pin tasks on a socket to
> >    avoid that in the first place. If we make the whole thing
> >    configurable enough then the sysadmin can set it up to support
> >    even the nonsensical case of identical cache partitions on all
> >    sockets and let tasks use the corresponding partitions when
> >    migrating.
> > 
> >  - The number of cache slices is going to be limited no matter what,
> >    so one still has to come up with a sensible partitioning scheme.
> > 
> >  - Even if we have enough cos ids the system wide view will not make
> >    the configuration problem any simpler as it remains per socket.
> > 
> > It's hard. Policies are hard by definition, but this one is harder
> > than most other policies due to the inherent limitations.

That is exactly why software should be allowed to configure the
policies automatically.

> > So now to the interface part. Unfortunately we need to expose this
> > very close to the hardware implementation as there are really no
> > abstractions which allow us to express the various bitmap
> > combinations. Any abstraction I tried to come up with renders that
> > thing completely useless.
> > 
> > I was not able to identify any existing infrastructure where this
> > really fits in. I chose a directory/file based representation. We
> > certainly could do the same with a syscall, but that's just an
> > implementation detail.
> > 
> > At top level:
> > 
> >    xxxxxxx/cat/max_cosids		<- Assume that all CPUs are the same
> >    xxxxxxx/cat/max_maskbits		<- Assume that all CPUs are the same
> >    xxxxxxx/cat/cdp_enable		<- Depends on CDP availability
> > 
> > Per socket data:
> > 
> >    xxxxxxx/cat/socket-0/
> >    ...
> >    xxxxxxx/cat/socket-N/l3_size
> >    xxxxxxx/cat/socket-N/hwsharedbits
> > 
> > Per socket mask data:
> > 
> >    xxxxxxx/cat/socket-N/cos-id-0/
> >    ...
> >    xxxxxxx/cat/socket-N/cos-id-N/inuse
> > 				/cat_mask	
> > 				/cdp_mask	<- Data mask if CDP enabled
> > 
> > Per cpu default cos id for the cpus on that socket:
> > 
> >    xxxxxxx/cat/socket-N/cpu-x/default_cosid
> >    ...
> >    xxxxxxx/cat/socket-N/cpu-N/default_cosid
> > 
> > The above allows a simple cpu based partitioning. All tasks which do
> > not have a cache partition assigned on a particular socket use the
> > default one of the cpu they are running on.
> > 
> > Now for the task(s) partitioning:
> > 
> >    xxxxxxx/cat/partitions/
> > 
> > Under that directory one can create partitions
> > 
> >    xxxxxxx/cat/partitions/p1/tasks
> > 			    /socket-0/cosid
> > 			    ...
> > 			    /socket-n/cosid
> > 
> >    The default value for the per socket cosid is COSID_DEFAULT, which
> >    causes the task(s) to use the per cpu default id.
> > 
> > Thoughts?
> > 
> > Thanks,
> > 
> > 	tglx
> 
> The cgroups interface works, but moves the problem of contiguous
> allocation to userspace, and is incompatible with cache allocations
> on demand.
> 
> Have to solve the kernel threads VS cgroups issue...
> 


* Re: [RFD] CAT user space interface revisited
  2015-11-19  0:01 ` Marcelo Tosatti
@ 2015-11-19  1:05   ` Marcelo Tosatti
  2015-11-19  9:09     ` Thomas Gleixner
  2015-11-19 20:30     ` Marcelo Tosatti
  2015-11-19  9:07   ` Thomas Gleixner
  2015-11-24  8:27   ` Chao Peng
  2 siblings, 2 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2015-11-19  1:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, x86, Luiz Capitulino, Vikas Shivappa,
	Tejun Heo, Yu Fenghua

On Wed, Nov 18, 2015 at 10:01:53PM -0200, Marcelo Tosatti wrote:
> On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > Folks!
> > 
> > After rereading the mail flood on CAT and staring into the SDM for a
> > while, I think we all should sit back and look at it from scratch
> > again w/o our preconceptions - I certainly had to put my own away.
> > 
> > Let's look at the properties of CAT again:
> > 
> >    - It's a per socket facility
> > 
> >    - CAT slots can be associated to external hardware. This
> >      association is per socket as well, so different sockets can have
> >      different behaviour. I missed that detail when staring the first
> >      time, thanks for the pointer!
> > 
> >    - The association ifself is per cpu. The COS selection happens on a
> >      CPU while the set of masks which are selected via COS are shared
> >      by all CPUs on a socket.
> > 
> > There are restrictions which CAT imposes in terms of configurability:
> > 
> >    - The bits which select a cache partition need to be consecutive
> > 
> >    - The number of possible cache association masks is limited
> > 
> > Let's look at the configurations (CDP omitted and size restricted)
> > 
> > Default:   1 1 1 1 1 1 1 1
> > 	   1 1 1 1 1 1 1 1
> > 	   1 1 1 1 1 1 1 1
> > 	   1 1 1 1 1 1 1 1
> > 
> > Shared:	   1 1 1 1 1 1 1 1
> > 	   0 0 1 1 1 1 1 1
> > 	   0 0 0 0 1 1 1 1
> > 	   0 0 0 0 0 0 1 1
> > 
> > Isolated:  1 1 1 1 0 0 0 0
> > 	   0 0 0 0 1 1 0 0
> > 	   0 0 0 0 0 0 1 0
> > 	   0 0 0 0 0 0 0 1
> > 
> > Or any combination thereof. Surely some combinations will not make any
> > sense, but we really should not make any restrictions on the stupidity
> > of a sysadmin. The worst outcome might be L3 disabled for everything,
> > so what?
> > 
> > Now that gets even more convoluted if CDP comes into play and we
> > really need to look at CDP right now. We might end up with something
> > which looks like this:
> > 
> >    	   1 1 1 1 0 0 0 0	Code
> > 	   1 1 1 1 0 0 0 0	Data
> > 	   0 0 0 0 0 0 1 0	Code
> > 	   0 0 0 0 1 1 0 0	Data
> > 	   0 0 0 0 0 0 0 1	Code
> > 	   0 0 0 0 1 1 0 0	Data
> > or 
> > 	   0 0 0 0 0 0 0 1	Code
> > 	   0 0 0 0 1 1 0 0	Data
> > 	   0 0 0 0 0 0 0 1	Code
> > 	   0 0 0 0 0 1 1 0	Data
> > 
> > Let's look at partitioning itself. We have two options:
> > 
> >    1) Per task partitioning
> > 
> >    2) Per CPU partitioning
> > 
> > So far we only talked about #1, but I think that #2 has a value as
> > well. Let me give you a simple example.
> > 
> > Assume that you have isolated a CPU and run your important task on
> > it. You give that task a slice of cache. Now that task needs kernel
> > services which run in kernel threads on that CPU. We really don't want
> > to (and cannot) hunt down random kernel threads (think cpu bound
> > worker threads, softirq threads ....) and give them another slice of
> > cache. What we really want is:
> > 
> >     	 1 1 1 1 0 0 0 0    <- Default cache
> > 	 0 0 0 0 1 1 1 0    <- Cache for important task
> > 	 0 0 0 0 0 0 0 1    <- Cache for CPU of important task
> > 
> > It would even be sufficient for particular use cases to just associate
> > a piece of cache to a given CPU and do not bother with tasks at all.
> > 
> > We really need to make this as configurable as possible from userspace
> > without imposing random restrictions to it. I played around with it on
> > my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> > enabled) makes it really useless if we force the ids to have the same
> > meaning on all sockets and restrict it to per task partitioning.
> > 
> > Even if next generation systems will have more COS ids available,
> > there are not going to be enough to have a system wide consistent
> > view unless we have COS ids > nr_cpus.
> > 
> > Aside of that I don't think that a system wide consistent view is
> > useful at all.
> > 
> >  - If a task migrates between sockets, it's going to suffer anyway.
> >    Real sensitive applications will simply pin tasks on a socket to
> >    avoid that in the first place. If we make the whole thing
> >    configurable enough then the sysadmin can set it up to support
> >    even the nonsensical case of identical cache partitions on all
> >    sockets and let tasks use the corresponding partitions when
> >    migrating.
> > 
> >  - The number of cache slices is going to be limited no matter what,
> >    so one still has to come up with a sensible partitioning scheme.
> > 
> >  - Even if we have enough cos ids the system wide view will not make
> >    the configuration problem any simpler as it remains per socket.
> > 
> > It's hard. Policies are hard by definition, but this one is harder
> > than most other policies due to the inherent limitations.
> > 
> > So now to the interface part. Unfortunately we need to expose this
> > very close to the hardware implementation as there are really no
> > abstractions which allow us to express the various bitmap
> > combinations. Any abstraction I tried to come up with renders that
> > thing completely useless.
> 
> No you don't.

Actually, there is a point that is useful: you might want the important
application to share the L3 portion with HW (that HW DMAs into), and
have only the application and the HW use that region.

So it's a good point that controlling the exact position of the reservation
is important.



* Re: [RFD] CAT user space interface revisited
  2015-11-18 22:34 ` Marcelo Tosatti
  2015-11-19  0:34   ` Marcelo Tosatti
@ 2015-11-19  8:11   ` Thomas Gleixner
  1 sibling, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2015-11-19  8:11 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: LKML, Peter Zijlstra, x86, Luiz Capitulino, Vikas Shivappa,
	Tejun Heo, Yu Fenghua

Marcelo,

On Wed, 18 Nov 2015, Marcelo Tosatti wrote:

Can you please trim your replies? It's really annoying having to
search for a single line of reply.

> The cgroups interface works, but moves the problem of contiguous
> allocation to userspace, and is incompatible with cache allocations
> on demand.
>
> Have to solve the kernel threads VS cgroups issue...

Sorry, I have no idea what you want to tell me.

Thanks,

	tglx


* Re: [RFD] CAT user space interface revisited
  2015-11-19  0:34   ` Marcelo Tosatti
@ 2015-11-19  8:35     ` Thomas Gleixner
  2015-11-19 13:44       ` Luiz Capitulino
  2015-11-20 14:15       ` Marcelo Tosatti
  0 siblings, 2 replies; 32+ messages in thread
From: Thomas Gleixner @ 2015-11-19  8:35 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: LKML, Peter Zijlstra, x86, Luiz Capitulino, Vikas Shivappa,
	Tejun Heo, Yu Fenghua

On Wed, 18 Nov 2015, Marcelo Tosatti wrote:
> On Wed, Nov 18, 2015 at 08:34:07PM -0200, Marcelo Tosatti wrote:
> > On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > > Assume that you have isolated a CPU and run your important task on
> > > it. You give that task a slice of cache. Now that task needs kernel
> > > services which run in kernel threads on that CPU. We really don't want
> > > to (and cannot) hunt down random kernel threads (think cpu bound
> > > worker threads, softirq threads ....) and give them another slice of
> > > cache. What we really want is:
> > > 
> > >     	 1 1 1 1 0 0 0 0    <- Default cache
> > > 	 0 0 0 0 1 1 1 0    <- Cache for important task
> > > 	 0 0 0 0 0 0 0 1    <- Cache for CPU of important task
> > > 
> > > It would even be sufficient for particular use cases to just associate
> > > a piece of cache to a given CPU and do not bother with tasks at all.
> 
> Well any work on behalf of the important task, should have its cache
> protected as well (example irq handling threads). 

Right, but that's nothing you can do automatically and certainly not
from a random application.

> But for certain kernel tasks for which L3 cache is not beneficial
> (eg: kernel samepage merging), it might useful to exclude such tasks
> from the "important, do not flush" L3 cache portion.

Sure it might be useful, but this needs to be done on a case by case
basis and there is no way to do this in any automated way.
 
> > > It's hard. Policies are hard by definition, but this one is harder
> > > than most other policies due to the inherent limitations.
> 
> That is exactly why it should be allowed for software to automatically 
> configure the policies.

There is nothing you can do automatically. If you want to allow
applications to set the policies themselves, then you need to assign a
portion of the bitmask space and a portion of the cos id space to that
application and then let it do with that space what it wants.

That's where cgroups come into play. But that does not solve the other
issues of "global" configuration, i.e. CPU defaults etc.

Thanks,

	tglx


* Re: [RFD] CAT user space interface revisited
  2015-11-19  0:01 ` Marcelo Tosatti
  2015-11-19  1:05   ` Marcelo Tosatti
@ 2015-11-19  9:07   ` Thomas Gleixner
  2015-11-24  8:27   ` Chao Peng
  2 siblings, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2015-11-19  9:07 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: LKML, Peter Zijlstra, x86, Luiz Capitulino, Vikas Shivappa,
	Tejun Heo, Yu Fenghua

On Wed, 18 Nov 2015, Marcelo Tosatti wrote:
> On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > So now to the interface part. Unfortunately we need to expose this
> > very close to the hardware implementation as there are really no
> > abstractions which allow us to express the various bitmap
> > combinations. Any abstraction I tried to come up with renders that
> > thing completely useless.
> 
> No you don't.

Because you have a use case which allows you to write some policy
translator? I seriously doubt that it is general enough.
 
> Again: you don't need to look into the MSR table and relate it 
> to tasks if you store the data as:
> 
> 	task group 1 = {
> 			reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
> 			reservation-2 = {size = 100Kb, type = code, socketmask = 0xffff}
> 	}
> 	
> 	task group 2 = {
> 			reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
> 			reservation-3 = {size = 200Kb, type = code, socketmask = 0xffff}
> 	}
> 
> Task group 1 and task group 2 share reservation-1.
> 
> This is what userspace is going to expose to users, of course.


 
> If you expose the MSRs to userspace, you force userspace to convert
> from this format to the MSRs (minding whether there
> are contiguous regions available, and the region shared with HW).

Fair enough. I'm not too fond of exposing the MSRs, but I
chose this just to explain the full problem space and the various
requirements we might have across the full application space.

If we can come up with an abstract way which does not impose
restrictions on the overall configuration abilities, I'm all for it.

>     - The bits which select a cache partition need to be consecutive
> 
> BUT, for our usecase the cgroups interface works as well, so lets
> go with that (Tejun apparently had a usecase where tasks were allowed to 
> set reservations themselves, on response to external events).

Can you please set aside your narrow use case view for a moment and
just think about the full application space? We are not designing such
an interface for a single use case.

Thanks,

	tglx


* Re: [RFD] CAT user space interface revisited
  2015-11-19  1:05   ` Marcelo Tosatti
@ 2015-11-19  9:09     ` Thomas Gleixner
  2015-11-19 20:59       ` Marcelo Tosatti
  2015-11-19 20:30     ` Marcelo Tosatti
  1 sibling, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2015-11-19  9:09 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: LKML, Peter Zijlstra, x86, Luiz Capitulino, Vikas Shivappa,
	Tejun Heo, Yu Fenghua

On Wed, 18 Nov 2015, Marcelo Tosatti wrote:
> Actually, there is a point that is useful: you might want the important
> application to share the L3 portion with HW (that HW DMAs into), and
> have only the application and the HW use that region.
> 
> So its a good point that controlling the exact position of the reservation 
> is important.

I'm glad you figured that out yourself. :)

Thanks,

	tglx


* Re: [RFD] CAT user space interface revisited
  2015-11-19  8:35     ` Thomas Gleixner
@ 2015-11-19 13:44       ` Luiz Capitulino
  2015-11-20 14:15       ` Marcelo Tosatti
  1 sibling, 0 replies; 32+ messages in thread
From: Luiz Capitulino @ 2015-11-19 13:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marcelo Tosatti, LKML, Peter Zijlstra, x86, Vikas Shivappa,
	Tejun Heo, Yu Fenghua

On Thu, 19 Nov 2015 09:35:34 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:

> > Well any work on behalf of the important task, should have its cache
> > protected as well (example irq handling threads). 
> 
> Right, but that's nothing you can do automatically and certainly not
> from a random application.

Right, and that's not a problem. For the use-cases CAT is intended for,
manual and per-workload system setup is very common. Things like
thread pinning, hugepage reservation, CPU isolation, nohz_full, etc.
require manual setup too.


* Re: [RFD] CAT user space interface revisited
  2015-11-19  1:05   ` Marcelo Tosatti
  2015-11-19  9:09     ` Thomas Gleixner
@ 2015-11-19 20:30     ` Marcelo Tosatti
  1 sibling, 0 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2015-11-19 20:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, x86, Luiz Capitulino, Vikas Shivappa,
	Tejun Heo, Yu Fenghua

On Wed, Nov 18, 2015 at 11:05:35PM -0200, Marcelo Tosatti wrote:
> On Wed, Nov 18, 2015 at 10:01:53PM -0200, Marcelo Tosatti wrote:
> > On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > > Folks!
> > > 
> > > After rereading the mail flood on CAT and staring into the SDM for a
> > > while, I think we all should sit back and look at it from scratch
> > > again w/o our preconceptions - I certainly had to put my own away.
> > > 
> > > Let's look at the properties of CAT again:
> > > 
> > >    - It's a per socket facility
> > > 
> > >    - CAT slots can be associated to external hardware. This
> > >      association is per socket as well, so different sockets can have
> > >      different behaviour. I missed that detail when staring the first
> > >      time, thanks for the pointer!
> > > 
> > >    - The association ifself is per cpu. The COS selection happens on a
> > >      CPU while the set of masks which are selected via COS are shared
> > >      by all CPUs on a socket.
> > > 
> > > There are restrictions which CAT imposes in terms of configurability:
> > > 
> > >    - The bits which select a cache partition need to be consecutive
> > > 
> > >    - The number of possible cache association masks is limited
> > > 
> > > Let's look at the configurations (CDP omitted and size restricted)
> > > 
> > > Default:   1 1 1 1 1 1 1 1
> > > 	   1 1 1 1 1 1 1 1
> > > 	   1 1 1 1 1 1 1 1
> > > 	   1 1 1 1 1 1 1 1
> > > 
> > > Shared:	   1 1 1 1 1 1 1 1
> > > 	   0 0 1 1 1 1 1 1
> > > 	   0 0 0 0 1 1 1 1
> > > 	   0 0 0 0 0 0 1 1
> > > 
> > > Isolated:  1 1 1 1 0 0 0 0
> > > 	   0 0 0 0 1 1 0 0
> > > 	   0 0 0 0 0 0 1 0
> > > 	   0 0 0 0 0 0 0 1
> > > 
> > > Or any combination thereof. Surely some combinations will not make any
> > > sense, but we really should not make any restrictions on the stupidity
> > > of a sysadmin. The worst outcome might be L3 disabled for everything,
> > > so what?
> > > 
> > > Now that gets even more convoluted if CDP comes into play and we
> > > really need to look at CDP right now. We might end up with something
> > > which looks like this:
> > > 
> > >    	   1 1 1 1 0 0 0 0	Code
> > > 	   1 1 1 1 0 0 0 0	Data
> > > 	   0 0 0 0 0 0 1 0	Code
> > > 	   0 0 0 0 1 1 0 0	Data
> > > 	   0 0 0 0 0 0 0 1	Code
> > > 	   0 0 0 0 1 1 0 0	Data
> > > or 
> > > 	   0 0 0 0 0 0 0 1	Code
> > > 	   0 0 0 0 1 1 0 0	Data
> > > 	   0 0 0 0 0 0 0 1	Code
> > > 	   0 0 0 0 0 1 1 0	Data
> > > 
> > > Let's look at partitioning itself. We have two options:
> > > 
> > >    1) Per task partitioning
> > > 
> > >    2) Per CPU partitioning
> > > 
> > > So far we only talked about #1, but I think that #2 has a value as
> > > well. Let me give you a simple example.
> > > 
> > > Assume that you have isolated a CPU and run your important task on
> > > it. You give that task a slice of cache. Now that task needs kernel
> > > services which run in kernel threads on that CPU. We really don't want
> > > to (and cannot) hunt down random kernel threads (think cpu bound
> > > worker threads, softirq threads ....) and give them another slice of
> > > cache. What we really want is:
> > > 
> > >     	 1 1 1 1 0 0 0 0    <- Default cache
> > > 	 0 0 0 0 1 1 1 0    <- Cache for important task
> > > 	 0 0 0 0 0 0 0 1    <- Cache for CPU of important task
> > > 
> > > It would even be sufficient for particular use cases to just associate
> > > a piece of cache to a given CPU and do not bother with tasks at all.
> > > 
> > > We really need to make this as configurable as possible from userspace
> > > without imposing random restrictions to it. I played around with it on
> > > my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> > > enabled) makes it really useless if we force the ids to have the same
> > > meaning on all sockets and restrict it to per task partitioning.
> > > 
> > > Even if next generation systems will have more COS ids available,
> > > there are not going to be enough to have a system wide consistent
> > > view unless we have COS ids > nr_cpus.
> > > 
> > > Aside of that I don't think that a system wide consistent view is
> > > useful at all.
> > > 
> > >  - If a task migrates between sockets, it's going to suffer anyway.
> > >    Real sensitive applications will simply pin tasks on a socket to
> > >    avoid that in the first place. If we make the whole thing
> > >    configurable enough then the sysadmin can set it up to support
> > >    even the nonsensical case of identical cache partitions on all
> > >    sockets and let tasks use the corresponding partitions when
> > >    migrating.
> > > 
> > >  - The number of cache slices is going to be limited no matter what,
> > >    so one still has to come up with a sensible partitioning scheme.
> > > 
> > >  - Even if we have enough cos ids the system wide view will not make
> > >    the configuration problem any simpler as it remains per socket.
> > > 
> > > It's hard. Policies are hard by definition, but this one is harder
> > > than most other policies due to the inherent limitations.
> > > 
> > > So now to the interface part. Unfortunately we need to expose this
> > > very close to the hardware implementation as there are really no
> > > abstractions which allow us to express the various bitmap
> > > combinations. Any abstraction I tried to come up with renders that
> > > thing completely useless.
> > 
> > No you don't.
> 
> Actually, there is a point that is useful: you might want the important
> application to share the L3 portion with HW (that HW DMAs into), and
> have only the application and the HW use that region.

Actually, I don't see why that makes sense.

So "sharing the L3 portion" means being allowed to reclaim data from that
portion of the L3 cache.

Why would you want to allow the application and the HW to reclaim from
the same region? I don't know.

But exposing the HW interface allows you to do that, if some reason
for doing so exists.

Exposing the HW interface:
--------------------------

Pros: *1) Can do whatever combination necessary.
Cons: *2) Userspace has to deal with the contiguity issue
      (example: upon an allocation request, "compacting" the CBM bits
       can allow the request to succeed, that is, make enough
       contiguous bits available; but "compacting" means moving CBM
       bits around, which means applications lose their reservation
       at the time the CBM bit positions are moved, so it can affect
       running code).
      *3) Userspace has to deal with the conversion from kbytes to
       cache ways (see the sketch after these two lists).
      *4) Userspace has to deal with locking access to the interface.
      * Userspace has no access to the timing of schedins/schedouts,
        so it cannot perform optimizations based on that information.

Not exposing the HW interface:
------------------------------

Pros: *10) Can use whatever combination necessary, provided that you
      extend the interface.
      *11) Allows the kernel to optimize usage of the reservations,
      because only the kernel knows the times of scheduling.
      *12) Allows the kernel to handle items 2, 3 and 4 above, rather
      than having userspace handle them.
      *13) Allows applications to set cache reservations themselves,
      directly via an ioctl or system call.

Cons:
      * There are users of the cgroups interface today; they will have
        to change.
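
For illustration, a minimal userspace sketch of *2 and *3 (the L3
size, number of ways and the existing reservation are made-up example
values; on a real system they would come from CPUID or sysfs):

#include <stdio.h>
#include <stdint.h>

#define L3_SIZE_KB	20480	/* assumed 20MB L3 per socket */
#define CBM_BITS	20	/* assumed 20 way CBM */

/* *3: round a kbytes request up to a number of cache ways */
static unsigned int kbytes_to_ways(unsigned int kbytes)
{
	unsigned int way_kb = L3_SIZE_KB / CBM_BITS;

	return (kbytes + way_kb - 1) / way_kb;
}

/*
 * *2: find the lowest run of 'ways' contiguous bits which are free in
 * 'used'; returns the candidate CBM, or 0 if only "compacting" the
 * existing reservations could make the request fit.
 */
static uint32_t find_free_run(uint32_t used, unsigned int ways)
{
	uint32_t mask = (1u << ways) - 1;
	unsigned int pos;

	for (pos = 0; pos + ways <= CBM_BITS; pos++)
		if (!(used & (mask << pos)))
			return mask << pos;
	return 0;
}

int main(void)
{
	uint32_t used = 0x000ff;	/* one existing reservation */
	unsigned int ways = kbytes_to_ways(8192);

	printf("8192 kbytes -> %u ways, candidate CBM: %#x\n",
	       ways, find_free_run(used, ways));
	return 0;
}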


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2015-11-19  9:09     ` Thomas Gleixner
@ 2015-11-19 20:59       ` Marcelo Tosatti
  2015-11-20  7:53         ` Thomas Gleixner
  0 siblings, 1 reply; 32+ messages in thread
From: Marcelo Tosatti @ 2015-11-19 20:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, x86, Luiz Capitulino, Vikas Shivappa,
	Tejun Heo, Yu Fenghua

On Thu, Nov 19, 2015 at 10:09:03AM +0100, Thomas Gleixner wrote:
> On Wed, 18 Nov 2015, Marcelo Tosatti wrote
> > Actually, there is a point that is useful: you might want the important
> > application to share the L3 portion with HW (that HW DMAs into), and
> > have only the application and the HW use that region.
> > 
> > So its a good point that controlling the exact position of the reservation 
> > is important.
> 
> I'm glad you figured that out yourself. :)
> 
> Thanks,
> 
> 	tglx

The HW is a reclaimer of the L3 region shared with HW.

You might want to remove any threads from reclaiming from 
that region.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2015-11-19 20:59       ` Marcelo Tosatti
@ 2015-11-20  7:53         ` Thomas Gleixner
  2015-11-20 17:51           ` Marcelo Tosatti
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2015-11-20  7:53 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: LKML, Peter Zijlstra, x86, Luiz Capitulino, Vikas Shivappa,
	Tejun Heo, Yu Fenghua

On Thu, 19 Nov 2015, Marcelo Tosatti wrote:
> On Thu, Nov 19, 2015 at 10:09:03AM +0100, Thomas Gleixner wrote:
> > On Wed, 18 Nov 2015, Marcelo Tosatti wrote
> > > Actually, there is a point that is useful: you might want the important
> > > application to share the L3 portion with HW (that HW DMAs into), and
> > > have only the application and the HW use that region.
> > > 
> > > So its a good point that controlling the exact position of the reservation 
> > > is important.
> > 
> > I'm glad you figured that out yourself. :)
> > 
> > Thanks,
> > 
> > 	tglx
> 
> The HW is a reclaimer of the L3 region shared with HW.
> 
> You might want to remove any threads from reclaiming from 
> that region.

I might for some threads, but certainly not for those which need to
access DMA buffers. Throwing away 10% of L3 just because you don't
want to deal with it at the interface level is hilarious.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2015-11-19  8:35     ` Thomas Gleixner
  2015-11-19 13:44       ` Luiz Capitulino
@ 2015-11-20 14:15       ` Marcelo Tosatti
  1 sibling, 0 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2015-11-20 14:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, x86, Luiz Capitulino, Vikas Shivappa,
	Tejun Heo, Yu Fenghua

On Thu, Nov 19, 2015 at 09:35:34AM +0100, Thomas Gleixner wrote:
> On Wed, 18 Nov 2015, Marcelo Tosatti wrote:
> > On Wed, Nov 18, 2015 at 08:34:07PM -0200, Marcelo Tosatti wrote:
> > > On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > > > Assume that you have isolated a CPU and run your important task on
> > > > it. You give that task a slice of cache. Now that task needs kernel
> > > > services which run in kernel threads on that CPU. We really don't want
> > > > to (and cannot) hunt down random kernel threads (think cpu bound
> > > > worker threads, softirq threads ....) and give them another slice of
> > > > cache. What we really want is:
> > > > 
> > > >     	 1 1 1 1 0 0 0 0    <- Default cache
> > > > 	 0 0 0 0 1 1 1 0    <- Cache for important task
> > > > 	 0 0 0 0 0 0 0 1    <- Cache for CPU of important task
> > > > 
> > > > It would even be sufficient for particular use cases to just associate
> > > > a piece of cache to a given CPU and do not bother with tasks at all.
> > 
> > Well any work on behalf of the important task, should have its cache
> > protected as well (example irq handling threads). 
> 
> Right, but that's nothing you can do automatically and certainly not
> from a random application.
> 
> > But for certain kernel tasks for which L3 cache is not beneficial
> > (eg: kernel samepage merging), it might useful to exclude such tasks
> > from the "important, do not flush" L3 cache portion.
> 
> Sure it might be useful, but this needs to be done on a case by case
> basis and there is no way to do this in any automated way.
>  
> > > > It's hard. Policies are hard by definition, but this one is harder
> > > > than most other policies due to the inherent limitations.
> > 
> > That is exactly why it should be allowed for software to automatically 
> > configure the policies.
> 
> There is nothing you can do automatically. 

Every cacheline brought into the L3 has a reaccess time (the interval
from when it was first brought in to when it is reaccessed).

Assume you have a single threaded app, i.e. a sequence of cacheline
accesses.

Now if there are groups of accesses which have long reaccess times
(meaning that keeping them in L3 is not beneficial), and which are large
enough to justify the OS notification, the application can notify the OS
to switch to a constrained COSid (so that L3 misses reclaim only from
that small portion of the L3 cache).
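
From the application side that could look roughly like the sketch
below. This is purely hypothetical: no such prctl command exists
today, the command number and COSid values are made up, it is only
meant to show what "notify the OS" would mean in practice.

#include <sys/prctl.h>

#define PR_SET_CACHE_COSID	0x43415401	/* made-up command number */
#define COSID_DEFAULT		0
#define COSID_CONSTRAINED	1	/* small CBM, set up by the admin */

/* accesses with long reaccess times: keeping them in L3 is not beneficial */
static void streaming_phase(char *buf, long len)
{
	long i;

	for (i = 0; i < len; i++)
		buf[i]++;
}

void process(char *buf, long len)
{
	/* switch to the constrained COSid for the streaming phase ... */
	prctl(PR_SET_CACHE_COSID, COSID_CONSTRAINED, 0, 0, 0);
	streaming_phase(buf, len);
	/* ... and back once the cache-friendly work resumes */
	prctl(PR_SET_CACHE_COSID, COSID_DEFAULT, 0, 0, 0);
}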

> If you want to allow
> applications to set the policies themself, then you need to assign a
> portion of the bitmask space and a portion of the cos id space to that
> application and then let it do with that space what it wants.

That's why you should specify the requirements independently of each
other (the requirement in this case being the size and type of the
reservation, which is tied to the application), and let something else
figure out how they all fit together.

> That's where cgroups come into play. But that does not solve the other
> issues of "global" configuration, i.e. CPU defaults etc.

I don't understand what you mean by issues of global configuration.

CPU defaults: A task is associated with a COSid. A COSid points to 
a set of CBMs (one CBM per socket). What defaults are you talking about?
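
In other words, something like the following sketch of that data model
(names and sizes are made up, the real numbers come from enumeration):

#include <stdint.h>

#define NR_SOCKETS	2
#define NR_COSIDS	16	/* 8 with CDP enabled */

struct cos_masks {
	uint32_t cbm[NR_SOCKETS];	/* one CBM per socket */
};

static struct cos_masks cos_table[NR_COSIDS];

struct task {
	int cosid;			/* index into cos_table */
};

/* the CBM which applies when 'tsk' runs on a CPU of 'socket' */
static uint32_t effective_cbm(const struct task *tsk, int socket)
{
	return cos_table[tsk->cosid].cbm[socket];
}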

But the interfaces do not exclude each other (the ioctl or syscall
interfaces and the manual direct MSR interface can coexist). There is
time pressure to integrate something workable for the present use cases
(none of which are in the class "applications set reservations themselves").

Peter has some objections against ioctls. So for something workable,
we'll have to handle the numbered issues pointed out in the other e-mail
(2, 3, 4) in userspace.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2015-11-20  7:53         ` Thomas Gleixner
@ 2015-11-20 17:51           ` Marcelo Tosatti
  0 siblings, 0 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2015-11-20 17:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, x86, Luiz Capitulino, Vikas Shivappa,
	Tejun Heo, Yu Fenghua

On Fri, Nov 20, 2015 at 08:53:34AM +0100, Thomas Gleixner wrote:
> On Thu, 19 Nov 2015, Marcelo Tosatti wrote:
> > On Thu, Nov 19, 2015 at 10:09:03AM +0100, Thomas Gleixner wrote:
> > > On Wed, 18 Nov 2015, Marcelo Tosatti wrote
> > > > Actually, there is a point that is useful: you might want the important
> > > > application to share the L3 portion with HW (that HW DMAs into), and
> > > > have only the application and the HW use that region.
> > > > 
> > > > So its a good point that controlling the exact position of the reservation 
> > > > is important.
> > > 
> > > I'm glad you figured that out yourself. :)
> > > 
> > > Thanks,
> > > 
> > > 	tglx
> > 
> > The HW is a reclaimer of the L3 region shared with HW.
> > 
> > You might want to remove any threads from reclaiming from 
> > that region.
> 
> I might for some threads, but certainly not for those which need to
> access DMA buffers.

Yes, when I wrote "its a good point that controlling the exact position
of the reservation is important" I had that in mind as well.

But it's wrong: not having a bit set in the CBM for the portion of L3
cache which is shared with HW only means "for cacheline misses of the
application, evict cachelines from the portions that are set in its
CBM" (and not from the HW shared portion).

> Throwing away 10% of L3 just because you don't
> want to deal with it at the interface level is hillarious.

If there is interest in per-application configuration then it can
be integrated as well.

Thanks for your time.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2015-11-18 18:25 [RFD] CAT user space interface revisited Thomas Gleixner
                   ` (2 preceding siblings ...)
  2015-11-19  0:01 ` Marcelo Tosatti
@ 2015-11-24  7:31 ` Chao Peng
  2015-11-24 23:06   ` Marcelo Tosatti
  2015-12-22 18:12 ` Yu, Fenghua
  4 siblings, 1 reply; 32+ messages in thread
From: Chao Peng @ 2015-11-24  7:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, x86, Marcelo Tosatti, Luiz Capitulino,
	Vikas Shivappa, Tejun Heo, Yu Fenghua

On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> 
> Let's look at partitioning itself. We have two options:
> 
>    1) Per task partitioning
> 
>    2) Per CPU partitioning
> 
> So far we only talked about #1, but I think that #2 has a value as
> well. Let me give you a simple example.

I would second this. In practice per CPU partitioning is useful for
realtime as well. And I can see three possible solutions:

     1) What you suggested below, to address both problems in one
        framework. But I wonder if it would end up being too complex.

     2) Achieve per CPU partitioning with per task partitioning. For
        example, if the current CAT patch can solve the kernel threads
        problem, then together with CPU pinning we can set the same CBM
        for all the tasks/kernel threads that run on an isolated CPU.

     3) I wonder if it is feasible to separate the two requirements? For
        example, divide the work into three components: rdt-base, a
        per task interface (the current cgroup interface/IOCTL or something)
        and a per CPU interface. The two interfaces are exclusive and
        selected at build time. One argument against this option is that
        even with per CPU partitioning we may still need per task
        partitioning, in which case we are back to option 1) again.

Thanks,
Chao

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2015-11-19  0:01 ` Marcelo Tosatti
  2015-11-19  1:05   ` Marcelo Tosatti
  2015-11-19  9:07   ` Thomas Gleixner
@ 2015-11-24  8:27   ` Chao Peng
       [not found]     ` <20151124212543.GA11303@amt.cnet>
  2 siblings, 1 reply; 32+ messages in thread
From: Chao Peng @ 2015-11-24  8:27 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, x86, Luiz Capitulino,
	Vikas Shivappa, Tejun Heo, Yu Fenghua

On Wed, Nov 18, 2015 at 10:01:54PM -0200, Marcelo Tosatti wrote:
> > 	tglx
> 
> Again: you don't need to look into the MSR table and relate it 
> to tasks if you store the data as:
> 
> 	task group 1 = {
> 			reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
> 			reservation-2 = {size = 100Kb, type = code, socketmask = 0xffff}
> 	}
> 	
> 	task group 2 = {
> 			reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
> 			reservation-3 = {size = 200Kb, type = code, socketmask = 0xffff}
> 	}
> 
> Task group 1 and task group 2 share reservation-1.

Because there is only size but no CBM position info, I guess
different reservations will not overlap each other, right?

Personally I like this way of exposing minimal information to userspace.
I can see it working well except for one concern about losing flexibility:

For instance, there is a box for which the full CBM is 0xfffff. After
creating/freeing cache reservations for a while we then have reservations:

reservation1: 0xf0000
reservation2: 0x00ff0

Now people want to request a reservation whose size is 0xff (8 contiguous
bits), so what will the kernel do at this point? It could just return an
error, or do some moving/merging (e.g. reservation2: 0x00ff0 => 0x0ff00)
and then satisfy the request. But I don't know if the moving/merging will
cause delays for the tasks that are using it.
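
To make the numbers concrete, here is a minimal sketch (assuming the
20 bit CBM above) of why the 8 bit request only fits after moving
reservation2:

#include <stdio.h>
#include <stdint.h>

/* length of the longest contiguous run of set bits in 'mask' */
static unsigned int longest_run(uint32_t mask)
{
	unsigned int run = 0;

	/* each AND with a shifted copy shortens every run by one bit */
	while (mask) {
		mask &= mask << 1;
		run++;
	}
	return run;
}

int main(void)
{
	uint32_t full = 0xfffff;
	uint32_t before = full & ~(0xf0000 | 0x00ff0);	/* 0x0f00f */
	uint32_t after  = full & ~(0xf0000 | 0x0ff00);	/* 0x000ff */

	printf("free before: %#07x, longest run %u\n", before, longest_run(before));
	printf("free after : %#07x, longest run %u\n", after, longest_run(after));
	return 0;
}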

Thanks,
Chao

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2015-11-24  7:31 ` Chao Peng
@ 2015-11-24 23:06   ` Marcelo Tosatti
  0 siblings, 0 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2015-11-24 23:06 UTC (permalink / raw)
  To: Chao Peng
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, x86, Luiz Capitulino,
	Vikas Shivappa, Tejun Heo, Yu Fenghua

On Tue, Nov 24, 2015 at 03:31:24PM +0800, Chao Peng wrote:
> On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > 
> > Let's look at partitioning itself. We have two options:
> > 
> >    1) Per task partitioning
> > 
> >    2) Per CPU partitioning
> > 
> > So far we only talked about #1, but I think that #2 has a value as
> > well. Let me give you a simple example.
> 
> I would second this. In practice per CPU partitioning is useful for
> realtime as well. And I can see three possible solutions:
> 
>      1) What you suggested below, to address both problems in one
>         framework. But I wonder if it would end with too complex.
> 
>      2) Achieve per CPU partitioning with per task partitioning. For
>         example, if current CAT patch can solve the kernel threads
> 	problem, together with CPU pinning, we then can set a same CBM
> 	for all the tasks/kernel threads run on an isolated CPU. 

As for the kernel threads problem, it seems it's a silly limitation of
the code which handles writes to cgroups:

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f89d929..0603652 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2466,16 +2466,6 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
        if (threadgroup)
                tsk = tsk->group_leader;

-       /*
-        * Workqueue threads may acquire PF_NO_SETAFFINITY and become
-        * trapped in a cpuset, or RT worker may be born in a cgroup
-        * with no rt_runtime allocated.  Just say no.
-        */
-       if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
-               ret = -EINVAL;
-               goto out_unlock_rcu;
-       }
-
        get_task_struct(tsk);
        rcu_read_unlock();

For a cgroup hierarchy with no cpusets (such as CAT only) this
limitation makes no sense (I am looking for a better place to move this
check to).

Any ETA on per-socket bitmasks? 

> 
>      3) I wonder if it feasible to separate the two requirements? For
>         example, divides the work into three components: rdt-base,
> 	per task interface (current cgroup interface/IOCTL or something)
> 	and per CPU interface. The two interfaces are exclusive and
> 	selected at build time. One thing to reject this option would be
> 	even with per CPU partitioning, we still need per task partitioning,
> 	in that case we will go to option 1) again.
> 
> Thanks,
> Chao

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
       [not found]     ` <20151124212543.GA11303@amt.cnet>
@ 2015-11-25  1:29       ` Marcelo Tosatti
  0 siblings, 0 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2015-11-25  1:29 UTC (permalink / raw)
  To: Chao Peng, f
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, x86, Luiz Capitulino,
	Vikas Shivappa, Tejun Heo, Yu Fenghua

On Tue, Nov 24, 2015 at 07:25:43PM -0200, Marcelo Tosatti wrote:
> On Tue, Nov 24, 2015 at 04:27:54PM +0800, Chao Peng wrote:
> > On Wed, Nov 18, 2015 at 10:01:54PM -0200, Marcelo Tosatti wrote:
> > > > 	tglx
> > > 
> > > Again: you don't need to look into the MSR table and relate it 
> > > to tasks if you store the data as:
> > > 
> > > 	task group 1 = {
> > > 			reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
> > > 			reservation-2 = {size = 100Kb, type = code, socketmask = 0xffff}
> > > 	}
> > > 	
> > > 	task group 2 = {
> > > 			reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
> > > 			reservation-3 = {size = 200Kb, type = code, socketmask = 0xffff}
> > > 	}
> > > 
> > > Task group 1 and task group 2 share reservation-1.
> > 
> > Because there is only size but not CBM position info, I guess for
> > different reservations they will not overlap each other, right?
> 
> Reservation 1 is shared between task group 1 and task group 2 
> so the CBMs overlap (by 80Kb, rounded).
> 
> > Personally I like this way of exposing minimal information to userspace.
> > I can think it working well except for one concern of losing flexibility:
> > 
> > For instance, there is a box for which the full CBM is 0xfffff. After
> > cache reservation creating/freeing for a while we then have reservations:
> > 
> > reservation1: 0xf0000
> > reservation2: 0x00ff0
> > 
> > Now people want to request a reservation which size is 0xff, so how
> > will kernel do at this time? It could return just error or do some
> > moving/merging (e.g. for reservation2: 0x00ff0 => 0x0ff00) and then
> > satisfy the request. But I don't know if the moving/merging will cause
> > delay for tasks that is using it.
> 
> Right, i was thinking of adding a "force" parameter. 
> 
> So, default behaviour of attach: do not merge.
> "force" behaviour of attach: move reservations around and merge if
> necessary.

To make the decision userspace would need to know whether a merge can
be performed if particular reservations are moved (that is, the
movable property is per-reservation, depending on whether it is ok
for the given app to take cacheline faults or not).
Anyway, that's for later.







^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [RFD] CAT user space interface revisited
  2015-11-18 18:25 [RFD] CAT user space interface revisited Thomas Gleixner
                   ` (3 preceding siblings ...)
  2015-11-24  7:31 ` Chao Peng
@ 2015-12-22 18:12 ` Yu, Fenghua
  2015-12-23 10:28   ` Marcelo Tosatti
  4 siblings, 1 reply; 32+ messages in thread
From: Yu, Fenghua @ 2015-12-22 18:12 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, x86, Marcelo Tosatti, Luiz Capitulino, Shivappa,
	Vikas, Tejun Heo, Shankar, Ravi V, Luck, Tony

> From: Thomas Gleixner [mailto:tglx@linutronix.de]
> Sent: Wednesday, November 18, 2015 10:25 AM
> Folks!
> 
> After rereading the mail flood on CAT and staring into the SDM for a while, I
> think we all should sit back and look at it from scratch again w/o our
> preconceptions - I certainly had to put my own away.
> 
> Let's look at the properties of CAT again:
> 
>    - It's a per socket facility
> 
>    - CAT slots can be associated to external hardware. This
>      association is per socket as well, so different sockets can have
>      different behaviour. I missed that detail when staring the first
>      time, thanks for the pointer!
> 
>    - The association ifself is per cpu. The COS selection happens on a
>      CPU while the set of masks which are selected via COS are shared
>      by all CPUs on a socket.
> 
> There are restrictions which CAT imposes in terms of configurability:
> 
>    - The bits which select a cache partition need to be consecutive
> 
>    - The number of possible cache association masks is limited
> 
> Let's look at the configurations (CDP omitted and size restricted)
> 
> Default:   1 1 1 1 1 1 1 1
> 	   1 1 1 1 1 1 1 1
> 	   1 1 1 1 1 1 1 1
> 	   1 1 1 1 1 1 1 1
> 
> Shared:	   1 1 1 1 1 1 1 1
> 	   0 0 1 1 1 1 1 1
> 	   0 0 0 0 1 1 1 1
> 	   0 0 0 0 0 0 1 1
> 
> Isolated:  1 1 1 1 0 0 0 0
> 	   0 0 0 0 1 1 0 0
> 	   0 0 0 0 0 0 1 0
> 	   0 0 0 0 0 0 0 1
> 
> Or any combination thereof. Surely some combinations will not make any
> sense, but we really should not make any restrictions on the stupidity of a
> sysadmin. The worst outcome might be L3 disabled for everything, so what?
> 
> Now that gets even more convoluted if CDP comes into play and we really
> need to look at CDP right now. We might end up with something which looks
> like this:
> 
>    	   1 1 1 1 0 0 0 0	Code
> 	   1 1 1 1 0 0 0 0	Data
> 	   0 0 0 0 0 0 1 0	Code
> 	   0 0 0 0 1 1 0 0	Data
> 	   0 0 0 0 0 0 0 1	Code
> 	   0 0 0 0 1 1 0 0	Data
> or
> 	   0 0 0 0 0 0 0 1	Code
> 	   0 0 0 0 1 1 0 0	Data
> 	   0 0 0 0 0 0 0 1	Code
> 	   0 0 0 0 0 1 1 0	Data
> 
> Let's look at partitioning itself. We have two options:
> 
>    1) Per task partitioning
> 
>    2) Per CPU partitioning
> 
> So far we only talked about #1, but I think that #2 has a value as well. Let me
> give you a simple example.
> 
> Assume that you have isolated a CPU and run your important task on it. You
> give that task a slice of cache. Now that task needs kernel services which run
> in kernel threads on that CPU. We really don't want to (and cannot) hunt
> down random kernel threads (think cpu bound worker threads, softirq
> threads ....) and give them another slice of cache. What we really want is:
> 
>     	 1 1 1 1 0 0 0 0    <- Default cache
> 	 0 0 0 0 1 1 1 0    <- Cache for important task
> 	 0 0 0 0 0 0 0 1    <- Cache for CPU of important task
> 
> It would even be sufficient for particular use cases to just associate a piece of
> cache to a given CPU and do not bother with tasks at all.
> 
> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on my new
> intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same meaning
> on all sockets and restrict it to per task partitioning.
> 
> Even if next generation systems will have more COS ids available, there are
> not going to be enough to have a system wide consistent view unless we
> have COS ids > nr_cpus.
> 
> Aside of that I don't think that a system wide consistent view is useful at all.
> 
>  - If a task migrates between sockets, it's going to suffer anyway.
>    Real sensitive applications will simply pin tasks on a socket to
>    avoid that in the first place. If we make the whole thing
>    configurable enough then the sysadmin can set it up to support
>    even the nonsensical case of identical cache partitions on all
>    sockets and let tasks use the corresponding partitions when
>    migrating.
> 
>  - The number of cache slices is going to be limited no matter what,
>    so one still has to come up with a sensible partitioning scheme.
> 
>  - Even if we have enough cos ids the system wide view will not make
>    the configuration problem any simpler as it remains per socket.
> 
> It's hard. Policies are hard by definition, but this one is harder than most
> other policies due to the inherent limitations.
> 
> So now to the interface part. Unfortunately we need to expose this very
> close to the hardware implementation as there are really no abstractions
> which allow us to express the various bitmap combinations. Any abstraction I
> tried to come up with renders that thing completely useless.
> 
> I was not able to identify any existing infrastructure where this really fits in. I
> chose a directory/file based representation. We certainly could do the same

Is this /sys/devices/system/?
Then create a qos/cat directory there? In the future, other directories may be
created, e.g. qos/mbm?

Thanks.

-Fenghua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2015-12-22 18:12 ` Yu, Fenghua
@ 2015-12-23 10:28   ` Marcelo Tosatti
  2015-12-29 12:44     ` Thomas Gleixner
  0 siblings, 1 reply; 32+ messages in thread
From: Marcelo Tosatti @ 2015-12-23 10:28 UTC (permalink / raw)
  To: Yu, Fenghua, Thomas Gleixner
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, x86, Luiz Capitulino,
	Shivappa, Vikas, Tejun Heo, Shankar, Ravi V, Luck, Tony

On Tue, Dec 22, 2015 at 06:12:05PM +0000, Yu, Fenghua wrote:
> > From: Thomas Gleixner [mailto:tglx@linutronix.de]
> > Sent: Wednesday, November 18, 2015 10:25 AM
> > Folks!
> > 
> > After rereading the mail flood on CAT and staring into the SDM for a while, I
> > think we all should sit back and look at it from scratch again w/o our
> > preconceptions - I certainly had to put my own away.
> > 
> > Let's look at the properties of CAT again:
> > 
> >    - It's a per socket facility
> > 
> >    - CAT slots can be associated to external hardware. This
> >      association is per socket as well, so different sockets can have
> >      different behaviour. I missed that detail when staring the first
> >      time, thanks for the pointer!
> > 
> >    - The association ifself is per cpu. The COS selection happens on a
> >      CPU while the set of masks which are selected via COS are shared
> >      by all CPUs on a socket.
> > 
> > There are restrictions which CAT imposes in terms of configurability:
> > 
> >    - The bits which select a cache partition need to be consecutive
> > 
> >    - The number of possible cache association masks is limited
> > 
> > Let's look at the configurations (CDP omitted and size restricted)
> > 
> > Default:   1 1 1 1 1 1 1 1
> > 	   1 1 1 1 1 1 1 1
> > 	   1 1 1 1 1 1 1 1
> > 	   1 1 1 1 1 1 1 1
> > 
> > Shared:	   1 1 1 1 1 1 1 1
> > 	   0 0 1 1 1 1 1 1
> > 	   0 0 0 0 1 1 1 1
> > 	   0 0 0 0 0 0 1 1
> > 
> > Isolated:  1 1 1 1 0 0 0 0
> > 	   0 0 0 0 1 1 0 0
> > 	   0 0 0 0 0 0 1 0
> > 	   0 0 0 0 0 0 0 1
> > 
> > Or any combination thereof. Surely some combinations will not make any
> > sense, but we really should not make any restrictions on the stupidity of a
> > sysadmin. The worst outcome might be L3 disabled for everything, so what?
> > 
> > Now that gets even more convoluted if CDP comes into play and we really
> > need to look at CDP right now. We might end up with something which looks
> > like this:
> > 
> >    	   1 1 1 1 0 0 0 0	Code
> > 	   1 1 1 1 0 0 0 0	Data
> > 	   0 0 0 0 0 0 1 0	Code
> > 	   0 0 0 0 1 1 0 0	Data
> > 	   0 0 0 0 0 0 0 1	Code
> > 	   0 0 0 0 1 1 0 0	Data
> > or
> > 	   0 0 0 0 0 0 0 1	Code
> > 	   0 0 0 0 1 1 0 0	Data
> > 	   0 0 0 0 0 0 0 1	Code
> > 	   0 0 0 0 0 1 1 0	Data
> > 
> > Let's look at partitioning itself. We have two options:
> > 
> >    1) Per task partitioning
> > 
> >    2) Per CPU partitioning
> > 
> > So far we only talked about #1, but I think that #2 has a value as well. Let me
> > give you a simple example.
> > 
> > Assume that you have isolated a CPU and run your important task on it. You
> > give that task a slice of cache. Now that task needs kernel services which run
> > in kernel threads on that CPU. We really don't want to (and cannot) hunt
> > down random kernel threads (think cpu bound worker threads, softirq
> > threads ....) and give them another slice of cache. What we really want is:
> > 
> >     	 1 1 1 1 0 0 0 0    <- Default cache
> > 	 0 0 0 0 1 1 1 0    <- Cache for important task
> > 	 0 0 0 0 0 0 0 1    <- Cache for CPU of important task
> > 
> > It would even be sufficient for particular use cases to just associate a piece of
> > cache to a given CPU and do not bother with tasks at all.
> > 
> > We really need to make this as configurable as possible from userspace
> > without imposing random restrictions to it. I played around with it on my new
> > intel toy and the restriction to 16 COS ids (that's 8 with CDP
> > enabled) makes it really useless if we force the ids to have the same meaning
> > on all sockets and restrict it to per task partitioning.
> > 
> > Even if next generation systems will have more COS ids available, there are
> > not going to be enough to have a system wide consistent view unless we
> > have COS ids > nr_cpus.
> > 
> > Aside of that I don't think that a system wide consistent view is useful at all.
> > 
> >  - If a task migrates between sockets, it's going to suffer anyway.
> >    Real sensitive applications will simply pin tasks on a socket to
> >    avoid that in the first place. If we make the whole thing
> >    configurable enough then the sysadmin can set it up to support
> >    even the nonsensical case of identical cache partitions on all
> >    sockets and let tasks use the corresponding partitions when
> >    migrating.
> > 
> >  - The number of cache slices is going to be limited no matter what,
> >    so one still has to come up with a sensible partitioning scheme.
> > 
> >  - Even if we have enough cos ids the system wide view will not make
> >    the configuration problem any simpler as it remains per socket.
> > 
> > It's hard. Policies are hard by definition, but this one is harder than most
> > other policies due to the inherent limitations.
> > 
> > So now to the interface part. Unfortunately we need to expose this very
> > close to the hardware implementation as there are really no abstractions
> > which allow us to express the various bitmap combinations. Any abstraction I
> > tried to come up with renders that thing completely useless.
> > 
> > I was not able to identify any existing infrastructure where this really fits in. I
> > chose a directory/file based representation. We certainly could do the same
> 
> Is this be /sys/devices/system/?
> Then create qos/cat directory. In the future, other directories may be created
> e.g. qos/mbm?
> 
> Thanks.
> 
> -Fenghua

Fenghua, 

I suppose Thomas is talking about the socketmask only, as discussed in
the call with Intel.

Thomas, is that correct? (If you want a change in the directory structure,
please explain why, because we don't need that change in directory
structure.)




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2015-12-23 10:28   ` Marcelo Tosatti
@ 2015-12-29 12:44     ` Thomas Gleixner
  2015-12-31 19:22       ` Marcelo Tosatti
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2015-12-29 12:44 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Yu, Fenghua, LKML, Peter Zijlstra, x86, Luiz Capitulino,
	Shivappa, Vikas, Tejun Heo, Shankar, Ravi V, Luck, Tony

Marcelo,

On Wed, 23 Dec 2015, Marcelo Tosatti wrote:
> On Tue, Dec 22, 2015 at 06:12:05PM +0000, Yu, Fenghua wrote:
> > > From: Thomas Gleixner [mailto:tglx@linutronix.de]
> > >
> > > I was not able to identify any existing infrastructure where this really fits in. I
> > > chose a directory/file based representation. We certainly could do the same
> > 
> > Is this be /sys/devices/system/?
> > Then create qos/cat directory. In the future, other directories may be created
> > e.g. qos/mbm?
> 
> I suppose Thomas is talking about the socketmask only, as discussed in
> the call with Intel.

I have no idea what you talked about in a RH/Intel call.
 
> Thomas, is that correct? (if you want a change in directory structure,
> please explain the whys, because we don't need that change in directory 
> structure).

Can you please start to write coherent and understandable mails? I have no
idea which directory structure, which does not need to be changed, you are
talking about.

I described a directory structure for that qos/cat stuff in my proposal and
that's complete AFAICT.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2015-12-29 12:44     ` Thomas Gleixner
@ 2015-12-31 19:22       ` Marcelo Tosatti
  2015-12-31 22:30         ` Thomas Gleixner
  0 siblings, 1 reply; 32+ messages in thread
From: Marcelo Tosatti @ 2015-12-31 19:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Yu, Fenghua, LKML, Peter Zijlstra, x86, Luiz Capitulino,
	Shivappa, Vikas, Tejun Heo, Shankar, Ravi V, Luck, Tony

On Tue, Dec 29, 2015 at 01:44:16PM +0100, Thomas Gleixner wrote:
> Marcelo,
> 
> On Wed, 23 Dec 2015, Marcelo Tosatti wrote:
> > On Tue, Dec 22, 2015 at 06:12:05PM +0000, Yu, Fenghua wrote:
> > > > From: Thomas Gleixner [mailto:tglx@linutronix.de]
> > > >
> > > > I was not able to identify any existing infrastructure where this really fits in. I
> > > > chose a directory/file based representation. We certainly could do the same
> > > 
> > > Is this be /sys/devices/system/?
> > > Then create qos/cat directory. In the future, other directories may be created
> > > e.g. qos/mbm?
> > 
> > I suppose Thomas is talking about the socketmask only, as discussed in
> > the call with Intel.
> 
> I have no idea about what you talked in a RH/Intel call.
>  
> > Thomas, is that correct? (if you want a change in directory structure,
> > please explain the whys, because we don't need that change in directory 
> > structure).
> 
> Can you please start to write coherent and understandable mails? I have no
> idea of which directory structure, which does not need to be changed, you are
> talking.

Thomas,

There is one directory structure in this topic, CAT. That is the
directory structure which is exposed to userspace to control the 
CAT HW. 

With the current patchset posted by Intel ("Subject: [PATCH V16 00/11]
x86: Intel Cache Allocation Technology Support"), the directory
structure there (the files and directories exposed by that patchset)
(*1) does not allow one to configure different CBM masks on each socket
(that is, it forces the user to configure the same CBM mask on every
socket). This is a blocker for us, and it is one of the points in your
proposal.

There was a call between Red Hat and Intel where it was communicated
to Intel, and Intel agreed, that it was necessary to fix this (fix this
== allow different CBM masks on different sockets).

Now, that is one change to the current directory structure (*1).

(*1) modified to allow for different CBM masks on different sockets, 
let's say (*2), is what we have been waiting for Intel to post.
It would handle our usecase, and all use-cases which the current
patchset from Intel already handles (Vikas posted emails mentioning
there are happy users of the current interface, feel free to ask 
him for more details).

What I have asked you, and you replied "to go Google read my previous
post", is this:
What are the advantages of your proposal (which is a completely
different directory structure, requiring a complete rewrite)
over (*2) ?

(What is my reason behind this: the reason is that if you, with
maintainer veto power, force your proposal to be accepted, it will be
necessary to wait for another rewrite (a new set of problems, fully
thinking through your proposal, testing it, ...) rather than simply
modify an already known, reviewed, already used directory structure.)

And functionally, your proposal adds nothing to (*2) (other than, well,
being a different directory structure).

If Fenghua or you post a patchset, say in 2 weeks, with your proposal,
I am fine with that. But since I doubt that will be the case, I am
pushing for the interface which requires the least amount of changes
(and therefore the least amount of time) to be integrated.

From your email:

"It would even be sufficient for particular use cases to just associate
a piece of cache to a given CPU and do not bother with tasks at all.

We really need to make this as configurable as possible from userspace
without imposing random restrictions to it. I played around with it on
my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
enabled) makes it really useless if we force the ids to have the same
meaning on all sockets and restrict it to per task partitioning."

Yes, that's the issue we hit, that is the modification that was agreed
with Intel, and that's what we are waiting for them to post.

> I described a directory structure for that qos/cat stuff in my proposal and
> that's complete AFAICT.

Ok, let's make the job easier for the submitter. You are the maintainer,
so you decide.

Is it enough for you to have (*2) (which was agreed with Intel), or 
would you prefer to integrate the directory structure at
"[RFD] CAT user space interface revisited" ?

Thanks.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2015-12-31 19:22       ` Marcelo Tosatti
@ 2015-12-31 22:30         ` Thomas Gleixner
  2016-01-04 17:20           ` Marcelo Tosatti
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2015-12-31 22:30 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Yu, Fenghua, LKML, Peter Zijlstra, x86, Luiz Capitulino,
	Shivappa, Vikas, Tejun Heo, Shankar, Ravi V, Luck, Tony

Marcelo,

On Thu, 31 Dec 2015, Marcelo Tosatti wrote:

First of all thanks for the explanation.

> There is one directory structure in this topic, CAT. That is the
> directory structure which is exposed to userspace to control the 
> CAT HW. 
> 
> With the current patchset posted by Intel ("Subject: [PATCH V16 00/11]
> x86: Intel Cache Allocation Technology Support"), the directory
> structure there (the files and directories exposed by that patchset)
> (*1) does not allow one to configure different CBM masks on each socket
> (that is, it forces the user to configure the same mask CBM on every
> socket). This is a blocker for us, and it is one of the points in your
> proposal.
> 
> There was a call between Red Hat and Intel where it was communicated
> to Intel, and Intel agreed, that it was necessary to fix this (fix this
> == allow different CBM masks on different sockets).
> 
> Now, that is one change to the current directory structure (*1).

I don't have an idea how that would look. The current structure is a
cgroups-based, hierarchy-oriented approach, which does not allow simple things
like

T1	00001111
T2	00111100

at least not in a way which is natural to the problem at hand.

> (*1) modified to allow for different CBM masks on different sockets, 
> lets say (*2), is what we have been waiting for Intel to post. 
> It would handle our usecase, and all use-cases which the current
> patchset from Intel already handles (Vikas posted emails mentioning
> there are happy users of the current interface, feel free to ask 
> him for more details).

I cannot imagine how that modification to the current interface would solve
that. Not to mention per CPU associations, which are not related to tasks at
all.

> What i have asked you, and you replied "to go Google read my previous
> post" is this:
> What are the advantages over you proposal (which is a completely
> different directory structure, requiring a complete rewrite),
> over (*2) ?
> 
> (what is my reason behind this: the reason is that if you, with
> maintainer veto power, forces your proposal to be accepted, it will be
> necessary to wait for another rewrite (a new set of problems, fully
> think through your proposal, test it, ...) rather than simply modify an
> already known, reviewed, already used directory structure.
> 
> And functionally, your proposal adds nothing to (*2) (other than, well,
> being a different directory structure).

Sorry. I cannot see at all how a modification to the existing interface would
cover all the sensible use cases I described in a coherent way. I really want
to see a proper description of the interface before people start hacking on it
in a frenzy. What you described is a "let's say (*2)" modification. That's
pretty meager.

> If Fenghua or you post a patchset, say in 2 weeks, with your proposal,
> i am fine with that. But i since i doubt that will be the case, i am 
> pushing for the interface which requires the least amount of changes
> (and therefore the least amount of time) to be integrated.
> 
> >From your email:
> 
> "It would even be sufficient for particular use cases to just associate
> a piece of cache to a given CPU and do not bother with tasks at all.
> 
> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on
> my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same
> meaning on all sockets and restrict it to per task partitioning."
> 
> Yes, thats the issue we hit, that is the modification that was agreed
> with Intel, and thats what we are waiting for them to post.

How do you implement the above - especially that part:

 "It would even be sufficient for particular use cases to just associate a
  piece of cache to a given CPU and do not bother with tasks at all."

as a "simple" modification to (*1) ?
 
> > I described a directory structure for that qos/cat stuff in my proposal and
> > that's complete AFAICT.
> 
> Ok, lets make the job for the submitter easier. You are the maintainer,
> so you decide.
> 
> Is it enough for you to have (*2) (which was agreed with Intel), or 
> would you rather prefer to integrate the directory structure at 
> "[RFD] CAT user space interface revisited" ?

The only thing I care about as a maintainer is, that we merge something which
actually reflects the properties of the hardware and gives the admin the
required flexibility to utilize it fully. I don't care at all if it's my
proposal or something else which allows to do the same.

Let me copy the relevant bits from my proposal here once more and let me ask
questions about the various points so you can tell me how that modification to
(*1) is going to deal with that.

>> At top level:
>>
>>  xxxxxxx/cat/max_cosids <- Assume that all CPUs are the same
>>  xxxxxxx/cat/max_maskbits <- Assume that all CPUs are the same
>>  xxxxxxx/cat/cdp_enable <- Depends on CDP availability

 Where is that information in (*2) and how is that related to (*1)? If you
 think it's not required, please explain why.

>> Per socket data:
>>
>>  xxxxxxx/cat/socket-0/
>>  ...
>>  xxxxxxx/cat/socket-N/l3_size
>>  xxxxxxx/cat/socket-N/hwsharedbits

 Where is that information in (*2) and how is that related to (*1)? If you
 think it's not required, please explain why.

>> Per socket mask data:
>>
>>  xxxxxxx/cat/socket-N/cos-id-0/
>>  ...
>>  xxxxxxx/cat/socket-N/cos-id-N/inuse
>>                               /cat_mask
>>                               /cdp_mask <- Data mask if CDP enabled

 Where is that information in (*2) and how is that related to (*1)? If you
 think it's not required, please explain why.

>> Per cpu default cos id for the cpus on that socket:
>> 
>>  xxxxxxx/cat/socket-N/cpu-x/default_cosid
>>  ...
>>  xxxxxxx/cat/socket-N/cpu-N/default_cosid
>>
>> The above allows a simple cpu based partitioning. All tasks which do
>> not have a cache partition assigned on a particular socket use the
>> default one of the cpu they are running on.

 Where is that information in (*2) and how is that related to (*1)? If you
 think it's not required, please explain why.

>> Now for the task(s) partitioning:
>>
>>  xxxxxxx/cat/partitions/
>>
>> Under that directory one can create partitions
>> 
>>  xxxxxxx/cat/partitions/p1/tasks
>>                           /socket-0/cosid
>>                           ...
>>                           /socket-n/cosid
>> 
>> The default value for the per socket cosid is COSID_DEFAULT, which
>> causes the task(s) to use the per cpu default id. 

 Where is that information in (*2) and how is that related to (*1)? If you
 think it's not required, please explain why.

Yes. I asked the same question several times and I really want to see the
directory/interface structure which solves all of the above before anyone
starts to implement it. We already have a completely useless interface (*1)
and there is no point in implementing another one based on it (*2) just because
it solves your particular issue and is the fastest way forward. User space
interfaces are hard and we really do not need some half-baked solution which
we have to support forever.

Let me enumerate the required points again:

    1) Information about the hardware properties

    2) Integration of CAT and CDP 

    3) Per socket cos-id partitioning

    4) Per cpu default cos-id association

    5) Task association to cos-id

Can you please explain, in a simple directory based scheme like the one I gave
you, how all of these points are going to be solved with a modification to (*1)?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2015-12-31 22:30         ` Thomas Gleixner
@ 2016-01-04 17:20           ` Marcelo Tosatti
  2016-01-04 17:44             ` Marcelo Tosatti
  2016-01-05 23:09             ` Thomas Gleixner
  0 siblings, 2 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2016-01-04 17:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Yu, Fenghua, LKML, Peter Zijlstra, x86, Luiz Capitulino,
	Shivappa, Vikas, Tejun Heo, Shankar, Ravi V, Luck, Tony

On Thu, Dec 31, 2015 at 11:30:57PM +0100, Thomas Gleixner wrote:
> Marcelo,
> 
> On Thu, 31 Dec 2015, Marcelo Tosatti wrote:
> 
> First of all thanks for the explanation.
> 
> > There is one directory structure in this topic, CAT. That is the
> > directory structure which is exposed to userspace to control the 
> > CAT HW. 
> > 
> > With the current patchset posted by Intel ("Subject: [PATCH V16 00/11]
> > x86: Intel Cache Allocation Technology Support"), the directory
> > structure there (the files and directories exposed by that patchset)
> > (*1) does not allow one to configure different CBM masks on each socket
> > (that is, it forces the user to configure the same mask CBM on every
> > socket). This is a blocker for us, and it is one of the points in your
> > proposal.
> > 
> > There was a call between Red Hat and Intel where it was communicated
> > to Intel, and Intel agreed, that it was necessary to fix this (fix this
> > == allow different CBM masks on different sockets).
> > 
> > Now, that is one change to the current directory structure (*1).
> 
> I don't have an idea how that would look like. The current structure is a
> cgroups based hierarchy oriented approach, which does not allow simple things
> like
> 
> T1	00001111
> T2	00111100
> 
> at least not in a way which is natural to the problem at hand.



	cgroupA/

		cbm_mask  (if set, set for all CPUs)

		socket1/cbm_mask
		socket2/cbm_mask
		...
		socketN/cbm_mask (if set, overrides the global cbm_mask).

Something along those lines.
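
A rough sketch of the lookup semantics that layout implies
(hypothetical structure and names; a per-socket mask, if set, wins
over the global one):

#include <stdint.h>

#define NR_SOCKETS	4
#define CBM_NOT_SET	0

struct rdt_group {
	uint32_t global_cbm;			/* cbm_mask */
	uint32_t socket_cbm[NR_SOCKETS];	/* socketN/cbm_mask */
};

/* the CBM that is programmed for this group on 'socket' */
static uint32_t effective_cbm(const struct rdt_group *grp, int socket)
{
	if (grp->socket_cbm[socket] != CBM_NOT_SET)
		return grp->socket_cbm[socket];
	return grp->global_cbm;
}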

Do you see any problem with it?

> > (*1) modified to allow for different CBM masks on different sockets, 
> > lets say (*2), is what we have been waiting for Intel to post. 
> > It would handle our usecase, and all use-cases which the current
> > patchset from Intel already handles (Vikas posted emails mentioning
> > there are happy users of the current interface, feel free to ask 
> > him for more details).
> 
> I cannot imagine how that modification to the current interface would solve
> that. Not to talk about per CPU associations which are not related to tasks at
> all.

Not sure what you mean by per CPU associations.

If you fix a cbmmask on a given pCPU, say CPU1, and control which tasks
run on that pCPU, then you control the cbmmask for all tasks (say
tasklist-1) on that CPU, fine.

You can achieve the same by putting all tasks from tasklist-1 into a
cgroup.

> > What i have asked you, and you replied "to go Google read my previous
> > post" is this:
> > What are the advantages over you proposal (which is a completely
> > different directory structure, requiring a complete rewrite),
> > over (*2) ?
> > 
> > (what is my reason behind this: the reason is that if you, with
> > maintainer veto power, forces your proposal to be accepted, it will be
> > necessary to wait for another rewrite (a new set of problems, fully
> > think through your proposal, test it, ...) rather than simply modify an
> > already known, reviewed, already used directory structure.
> > 
> > And functionally, your proposal adds nothing to (*2) (other than, well,
> > being a different directory structure).
> 
> Sorry. I cannot see at all how a modification to the existing interface would
> cover all the sensible use cases I described in a coherent way. I really want
> to see a proper description of the interface before people start hacking on it
> in a frenzy. What you described is: "let's say (*2)" modification. That's
> pretty meager.
> 
> > If Fenghua or you post a patchset, say in 2 weeks, with your proposal,
> > i am fine with that. But i since i doubt that will be the case, i am 
> > pushing for the interface which requires the least amount of changes
> > (and therefore the least amount of time) to be integrated.
> > 
> > >From your email:
> > 
> > "It would even be sufficient for particular use cases to just associate
> > a piece of cache to a given CPU and do not bother with tasks at all.
> > 
> > We really need to make this as configurable as possible from userspace
> > without imposing random restrictions to it. I played around with it on
> > my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> > enabled) makes it really useless if we force the ids to have the same
> > meaning on all sockets and restrict it to per task partitioning."
> > 
> > Yes, thats the issue we hit, that is the modification that was agreed
> > with Intel, and thats what we are waiting for them to post.
> 
> How do you implement the above - especially that part:
> 
>  "It would even be sufficient for particular use cases to just associate a
>   piece of cache to a given CPU and do not bother with tasks at all."
> 
> as a "simple" modification to (*1) ?

As noted above.
>  
> > > I described a directory structure for that qos/cat stuff in my proposal and
> > > that's complete AFAICT.
> > 
> > Ok, lets make the job for the submitter easier. You are the maintainer,
> > so you decide.
> > 
> > Is it enough for you to have (*2) (which was agreed with Intel), or 
> > would you rather prefer to integrate the directory structure at 
> > "[RFD] CAT user space interface revisited" ?
> 
> The only thing I care about as a maintainer is, that we merge something which
> actually reflects the properties of the hardware and gives the admin the
> required flexibility to utilize it fully. I don't care at all if it's my
> proposal or something else which allows to do the same.
> 
> Let me copy the relevant bits from my proposal here once more and let me ask
> questions to the various points so you can tell me how that modification to
> (*1) is going to deal with that.
> 
> >> At top level:
> >>
> >>  xxxxxxx/cat/max_cosids <- Assume that all CPUs are the same
> >>  xxxxxxx/cat/max_maskbits <- Assume that all CPUs are the same

This can be exposed to userspace via a file.

> >>  xxxxxxx/cat/cdp_enable <- Depends on CDP availability
> 
>  Where is that information in (*2) and how is that related to (*1)? If you
>  think it's not required, please explain why.

Intel has come up with a scheme to implement CDP. I'll go read 
that and reply to this email afterwards.

> >> Per socket data:
> >>
> >>  xxxxxxx/cat/socket-0/
> >>  ...
> >>  xxxxxxx/cat/socket-N/l3_size
> >>  xxxxxxx/cat/socket-N/hwsharedbits
> 
>  Where is that information in (*2) and how is that related to (*1)? If you
>  think it's not required, please explain why.

l3_size: userspace can figure that out by itself (it is exposed
somewhere in sysfs).

hwsharedbits: All userspace needs to know is
which bits are shared with HW, to decide whether or not to use that
region of a given socket for a given cbmmask.

So expose that to userspace, fine. That can be done in cgroups.
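
I.e. all userspace has to do with it is something like the sketch
below (the example values are made up):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t full_cbm     = 0xfffff; /* e.g. from max_maskbits / l3_cbm */
	uint32_t hwsharedbits = 0x00003; /* ways shared with HW on this socket */
	uint32_t usable       = full_cbm & ~hwsharedbits;

	printf("CBM bits usable without touching the HW region: %#x\n", usable);
	return 0;
}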

> >> Per socket mask data:
> >>
> >>  xxxxxxx/cat/socket-N/cos-id-0/
> >>  ...
> >>  xxxxxxx/cat/socket-N/cos-id-N/inuse
> >>                               /cat_mask
> >>                               /cdp_mask <- Data mask if CDP enabled
> 
>  Where is that information in (*2) and how is that related to (*1)? If you
>  think it's not required, please explain why.

Unsure - will reply in next email (but per-socket information seems
independent of that).

> 
> >> Per cpu default cos id for the cpus on that socket:
> >> 
> >>  xxxxxxx/cat/socket-N/cpu-x/default_cosid
> >>  ...
> >>  xxxxxxx/cat/socket-N/cpu-N/default_cosid
> >>
> >> The above allows a simple cpu based partitioning. All tasks which do
> >> not have a cache partition assigned on a particular socket use the
> >> default one of the cpu they are running on.
> 
>  Where is that information in (*2) and how is that related to (*1)? If you
>  think it's not required, please explain why.

Not required, because with the current Intel patchset you'd do:


# mount | grep rdt
cgroup on /sys/fs/cgroup/intel_rdt type cgroup
(rw,nosuid,nodev,noexec,relatime,intel_rdt)
# cd /sys/fs/cgroup/intel_rdt
# ls
cgroupALL              cgroup.procs  cgroup.sane_behavior
notify_on_release  tasks
cgroup.clone_children  cgroupRSVD    intel_rdt.l3_cbm	   release_agent
# cat tasks
1042
1066
1067
1069
...
# cd cgroupALL/
# ps auxw | while read i; do echo $i; done | cut -f 2 -d " " | grep -v PID | while read x; do echo $x > tasks; done
-bash: echo: write error: No such process
-bash: echo: write error: No such process
-bash: echo: write error: No such process
-bash: echo: write error: No such process

# cat ../tasks | while read i; do echo $i > tasks; done
# cat ../tasks  | wc -l
0
(no tasks on root cgroup)

# cd ../cgroupRSVD
# cat tasks
# ps auxw | grep postfix
root	   1942  0.0  0.0  91136  4860 ?        Ss   Nov25   0:00
/usr/libexec/postfix/master -w
postfix    1981  0.0  0.0  91308  6520 ?        S    Nov25   0:00 qmgr
-l -t unix -u
postfix    4416  0.0  0.0  91240  6296 ?        S    17:05   0:00 pickup
-l -t unix -u
root	   4486  0.0  0.0 112652  2304 pts/0    S+   17:31   0:00 grep
--color=auto postfix
# echo 4416 > tasks
# cat intel_rdt.l3_cbm
000ffff0
# cat ../cgroupALL/intel_rdt.l3_cbm
000000ff

Bits f0 are shared between cgroupRSVD and cgroupALL. Lets change:
# echo 0xf > ../cgroupALL/intel_rdt.l3_cbm
# cat ../cgroupALL/intel_rdt.l3_cbm
0000000f

Now they share none.

------

So I have created "cgroupALL" (what you call default_cosid) and
"cgroupRSVD".

> 
> >> Now for the task(s) partitioning:
> >>
> >>  xxxxxxx/cat/partitions/
> >>
> >> Under that directory one can create partitions
> >> 
> >>  xxxxxxx/cat/partitions/p1/tasks
> >>                           /socket-0/cosid
> >>                           ...
> >>                           /socket-n/cosid
> >> 
> >> The default value for the per socket cosid is COSID_DEFAULT, which
> >> causes the task(s) to use the per cpu default id. 
> 
>  Where is that information in (*2) and how is that related to (*1)? If you
>  think it's not required, please explain why.
> 
> Yes. I ask the same question several times and I really want to see the
> directory/interface structure which solves all of the above before anyone
> starts to implement it. 

I don't see the problem; the sequence of commands above shows how
to set up a directory structure which is useful and does what the HW
interface is supposed to do.

> We already have a completely useless interface (*1)
> and there is no point to implement another one based on it (*2) just because
> it solves your particular issue and is the fastest way forward. User space
> interfaces are hard and we really do not need some half baken solution which
> we have to support forever.

Fine. Can you please tell me what I can't do with the current interface?
AFAICS everything can be done (except missing support for (*2)).

> 
> Let me enumerate the required points again:
> 
>     1) Information about the hardware properties

Fine. Intel should expose that information.

>     2) Integration of CAT and CDP 

Fine, Intel has come up with a directory structure for that;
let me read the patchset again and I'll reply to you.

>     3) Per socket cos-id partitioning

(*2) as listed in the beginning of this e-mail.

>     4) Per cpu default cos-id association

This already exists, and as noted in the command sequence above,
works just fine. Please explain what problem you are seeing.

>     5) Task association to cos-id

Not sure what that means. Please explain.

> 
> Can you please explain in a simple directory based scheme, like the one I gave
> you how all of these points are going to be solved with a modification to (*1)?
> 
> Thanks,
> 
> 	tglx

Thanks Thomas, this style of discussion is quite useful.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2016-01-04 17:20           ` Marcelo Tosatti
@ 2016-01-04 17:44             ` Marcelo Tosatti
  2016-01-05 23:09             ` Thomas Gleixner
  1 sibling, 0 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2016-01-04 17:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Yu, Fenghua, LKML, Peter Zijlstra, x86, Luiz Capitulino,
	Shivappa, Vikas, Tejun Heo, Shankar, Ravi V, Luck, Tony

On Mon, Jan 04, 2016 at 03:20:54PM -0200, Marcelo Tosatti wrote:
> On Thu, Dec 31, 2015 at 11:30:57PM +0100, Thomas Gleixner wrote:
> > Marcelo,
> > 
> > On Thu, 31 Dec 2015, Marcelo Tosatti wrote:
> > 
> > First of all thanks for the explanation.
> > 
> > > There is one directory structure in this topic, CAT. That is the
> > > directory structure which is exposed to userspace to control the 
> > > CAT HW. 
> > > 
> > > With the current patchset posted by Intel ("Subject: [PATCH V16 00/11]
> > > x86: Intel Cache Allocation Technology Support"), the directory
> > > structure there (the files and directories exposed by that patchset)
> > > (*1) does not allow one to configure different CBM masks on each socket
> > > (that is, it forces the user to configure the same mask CBM on every
> > > socket). This is a blocker for us, and it is one of the points in your
> > > proposal.
> > > 
> > > There was a call between Red Hat and Intel where it was communicated
> > > to Intel, and Intel agreed, that it was necessary to fix this (fix this
> > > == allow different CBM masks on different sockets).
> > > 
> > > Now, that is one change to the current directory structure (*1).
> > 
> > I don't have an idea how that would look like. The current structure is a
> > cgroups based hierarchy oriented approach, which does not allow simple things
> > like
> > 
> > T1	00001111
> > T2	00111100
> > 
> > at least not in a way which is natural to the problem at hand.
> 
> 
> 
> 	cgroupA/
> 
> 		cbm_mask  (if set, set for all CPUs)
> 
> 		socket1/cbm_mask
> 		socket2/cbm_mask
> 		...
> 		socketN/cbm_mask (if set, overrides global
> 		cbm_mask).
> 
> Something along those lines.
> 
> Do you see any problem with it?
> 
> > > (*1) modified to allow for different CBM masks on different sockets, 
> > > lets say (*2), is what we have been waiting for Intel to post. 
> > > It would handle our usecase, and all use-cases which the current
> > > patchset from Intel already handles (Vikas posted emails mentioning
> > > there are happy users of the current interface, feel free to ask 
> > > him for more details).
> > 
> > I cannot imagine how that modification to the current interface would solve
> > that. Not to talk about per CPU associations which are not related to tasks at
> > all.
> 
> Not sure what you mean by per CPU associations.
> 
> If you fix a cbmmask on a given pCPU, say CPU1, and control which tasks
> run on that pCPU, then you control the cbmmask for all tasks (say
> tasklist-1) on that CPU, fine.
> 
> Can achieve the same by putting all tasks from tasklist-1 into a
> cgroup.
> 
> > > What i have asked you, and you replied "to go Google read my previous
> > > post" is this:
> > > What are the advantages over you proposal (which is a completely
> > > different directory structure, requiring a complete rewrite),
> > > over (*2) ?
> > > 
> > > (what is my reason behind this: the reason is that if you, with
> > > maintainer veto power, forces your proposal to be accepted, it will be
> > > necessary to wait for another rewrite (a new set of problems, fully
> > > think through your proposal, test it, ...) rather than simply modify an
> > > already known, reviewed, already used directory structure.
> > > 
> > > And functionally, your proposal adds nothing to (*2) (other than, well,
> > > being a different directory structure).
> > 
> > Sorry. I cannot see at all how a modification to the existing interface would
> > cover all the sensible use cases I described in a coherent way. I really want
> > to see a proper description of the interface before people start hacking on it
> > in a frenzy. What you described is: "let's say (*2)" modification. That's
> > pretty meager.
> > 
> > > If Fenghua or you post a patchset, say in 2 weeks, with your proposal,
> > > i am fine with that. But i since i doubt that will be the case, i am 
> > > pushing for the interface which requires the least amount of changes
> > > (and therefore the least amount of time) to be integrated.
> > > 
> > > >From your email:
> > > 
> > > "It would even be sufficient for particular use cases to just associate
> > > a piece of cache to a given CPU and do not bother with tasks at all.
> > > 
> > > We really need to make this as configurable as possible from userspace
> > > without imposing random restrictions to it. I played around with it on
> > > my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> > > enabled) makes it really useless if we force the ids to have the same
> > > meaning on all sockets and restrict it to per task partitioning."
> > > 
> > > Yes, thats the issue we hit, that is the modification that was agreed
> > > with Intel, and thats what we are waiting for them to post.
> > 
> > How do you implement the above - especially that part:
> > 
> >  "It would even be sufficient for particular use cases to just associate a
> >   piece of cache to a given CPU and do not bother with tasks at all."
> > 
> > as a "simple" modification to (*1) ?
> 
> As noted above.
> >  
> > > > I described a directory structure for that qos/cat stuff in my proposal and
> > > > that's complete AFAICT.
> > > 
> > > Ok, lets make the job for the submitter easier. You are the maintainer,
> > > so you decide.
> > > 
> > > Is it enough for you to have (*2) (which was agreed with Intel), or 
> > > would you rather prefer to integrate the directory structure at 
> > > "[RFD] CAT user space interface revisited" ?
> > 
> > The only thing I care about as a maintainer is, that we merge something which
> > actually reflects the properties of the hardware and gives the admin the
> > required flexibility to utilize it fully. I don't care at all if it's my
> > proposal or something else which allows to do the same.
> > 
> > Let me copy the relevant bits from my proposal here once more and let me ask
> > questions to the various points so you can tell me how that modification to
> > (*1) is going to deal with that.
> > 
> > >> At top level:
> > >>
> > >>  xxxxxxx/cat/max_cosids <- Assume that all CPUs are the same
> > >>  xxxxxxx/cat/max_maskbits <- Assume that all CPUs are the same
> 
> This can be exposed to userspace via a file.
> 
> > >>  xxxxxxx/cat/cdp_enable <- Depends on CDP availability
> > 
> >  Where is that information in (*2) and how is that related to (*1)? If you
> >  think it's not required, please explain why.
> 
> Intel has come up with a scheme to implement CDP. I'll go read 
> that and reply to this email afterwards.

Pasting the relevant parts of the patchset submission.
Looks fine to me: two files, one for the data cache cbm mask, another
for the instruction cache cbm mask.
Those two files would be moved to "socket-N" directories, as sketched
below.
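
Roughly, that would turn the per-cgroup layout into something like this
(the directory names are purely illustrative, combining the CDP files
with the per-socket cgroup layout proposed earlier in this thread):

 cgroupA/socket-0/dcache_cbm
 cgroupA/socket-0/icache_cbm
 ...
 cgroupA/socket-N/dcache_cbm
 cgroupA/socket-N/icache_cbm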

(will review the CDP patchset...).

Subject: [PATCH V2 0/5] x86: Intel Code Data Prioritization Support

This patch set supports Intel code data prioritization, which is an
extension of cache allocation and allows code and data cache to be
allocated separately. It also includes a cgroup interface for the user as
separate patches. The cgroup interface for cache alloc is also resent.

This patch adds enumeration support for the Code Data Prioritization (CDP)
feature found in future Intel Xeon processors. It includes CPUID
enumeration routines for CDP.

CDP is an extension to Cache Allocation and lets threads allocate a subset
of the L3 cache for code and data separately. The allocation is represented
by the code or data cache capacity bit mask (cbm) MSRs
IA32_L3_QOS_MASK_n. Each class of service would be associated with one
dcache_cbm and one icache_cbm MSR, and hence the number of available
CLOSids is halved with CDP. The association for a CLOSid 'n' is shown
below:

data_cbm_address(n) = base + (n << 1)
code_cbm_address(n) = base + (n << 1) + 1

During scheduling the kernel writes the CLOSid of the thread to
IA32_PQR_ASSOC_MSR.
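
Purely as an illustration of that address scheme (this assumes the
IA32_L3_QOS_MASK_0 base of 0xc90 from the SDM; the kernel manages these
MSRs itself, so this is just a sketch with msr-tools), CLOSid 2 maps to:

# modprobe msr
# rdmsr 0xc94        <- data cbm for CLOSid 2: 0xc90 + (2 << 1)
# rdmsr 0xc95        <- code cbm for CLOSid 2: 0xc90 + (2 << 1) + 1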

Adds two files to the intel_rdt cgroup, 'dcache_cbm' and 'icache_cbm',
when code data prioritization (cdp) support is present. The files
represent the data capacity bit mask (cbm) and the instruction cbm for the
L3 cache. The user can specify the data and code cbm, and the threads
belonging to the cgroup get to fill the L3 cache represented by the cbm
with data or code.

For example, consider a scenario where the max cbm bits is 10 and the L3
cache size is 10MB: specifying dcache_cbm = 0x3 and icache_cbm = 0xc
would give the tasks 2MB of exclusive cache each for data and for code
to fill in.
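
A minimal sketch of that example with the cgroup files described above
(the file names are assumed to carry the usual intel_rdt. prefix like the
existing intel_rdt.l3_cbm; the cgroup name is made up and the PID is just
the one from the example earlier in this thread):

# mkdir /sys/fs/cgroup/intel_rdt/rtgroup
# cd /sys/fs/cgroup/intel_rdt/rtgroup
# echo 0x3 > intel_rdt.dcache_cbm    <- 2 of 10 bits for data (~2MB)
# echo 0xc > intel_rdt.icache_cbm    <- 2 of 10 bits for code (~2MB)
# echo 4416 > tasks                  <- this task now fills only those ways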

This feature is an extension to cache allocation and lets the user specify
a capacity for code and data separately. Initially these cbms would have
the same value as the l3_cbm (which represents the common cbm for code
and data). Once the user tries to write to either the dcache_cbm or
icache_cbm, the kernel tries to enable the cdp mode in hardware by
writing to the IA32_PQOS_CFG MSR. The switch is only possible if the
number of class of service IDs (CLOSids) in use is less than half of the
total CLOSids available at the time of the switch. This is because the
CLOSids are halved once CDP is enabled and each CLOSid now maps to a data
IA32_L3_QOS_n MSR and a code IA32_L3_QOS_n MSR.
Once CDP is enabled the user can use the dcache_cbm and icache_cbm just
like the l3_cbm. The CLOSids are not exposed to the user and are
maintained by the kernel internally.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2016-01-04 17:20           ` Marcelo Tosatti
  2016-01-04 17:44             ` Marcelo Tosatti
@ 2016-01-05 23:09             ` Thomas Gleixner
  2016-01-06 12:46               ` Marcelo Tosatti
  2016-01-08 20:21               ` Thomas Gleixner
  1 sibling, 2 replies; 32+ messages in thread
From: Thomas Gleixner @ 2016-01-05 23:09 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Yu, Fenghua, LKML, Peter Zijlstra, x86, Luiz Capitulino,
	Shivappa, Vikas, Tejun Heo, Shankar, Ravi V, Luck, Tony

Marcelo,

On Mon, 4 Jan 2016, Marcelo Tosatti wrote:
> On Thu, Dec 31, 2015 at 11:30:57PM +0100, Thomas Gleixner wrote:
> > I don't have an idea how that would look like. The current structure is a
> > cgroups based hierarchy oriented approach, which does not allow simple things
> > like
> > 
> > T1	00001111
> > T2	00111100
> > 
> > at least not in a way which is natural to the problem at hand.
> 
> 
> 
> 	cgroupA/
> 
> 		cbm_mask  (if set, set for all CPUs)

You mean sockets, right?

> 
> 		socket1/cbm_mask
> 		socket2/cbm_mask
> 		...
> 		socketN/cbm_mask (if set, overrides global
> 		cbm_mask).
> 
> Something along those lines.
> 
> Do you see any problem with it?

So for that case:

task1:	 cbm_mask 00001111
task2:	 cbm_mask 00111100

i.e. task1 and task2 share bit 2/3 of the mask. 

I need to have two cgroups: cgroup1 and cgroup2, task1 is member of cgroup1
and task2 is member of cgroup2, right?

So now add some more of this and then figure out which cbm_masks are in use
on which socket. That means I need to go through all cgroups and find the
cbm_masks there.

With my proposed directory structure you get a very clear view about the
in-use closids and the associated cbm_masks. That view represents the hardware
in the best way. With the cgroups stuff we get an artificial representation
which does not tell us anything about the in-use closids and the associated
cbm_masks.
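
Roughly, that walk looks like this with the current cgroup interface
(assuming the /sys/fs/cgroup/intel_rdt mount from the example earlier in
the thread):

# for g in /sys/fs/cgroup/intel_rdt/*/; do
#    echo "$g: $(cat $g/intel_rdt.l3_cbm)"
# done

and even then it tells you nothing about which closid each group ended
up with.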
 
> > I cannot imagine how that modification to the current interface would solve
> > that. Not to talk about per CPU associations which are not related to tasks at
> > all.
> 
> Not sure what you mean by per CPU associations.

As I wrote before:

 "It would even be sufficient for particular use cases to just associate
  a piece of cache to a given CPU and do not bother with tasks at all."

> If you fix a cbmmask on a given pCPU, say CPU1, and control which tasks
> run on that pCPU, then you control the cbmmask for all tasks (say
> tasklist-1) on that CPU, fine.
> 
> Can achieve the same by putting all tasks from tasklist-1 into a
> cgroup.

Which means, that I need to go and find everything including kernel threads
and put them into a particular cgroup. That's really not useful and it simply
does not work:

To which cgroup belongs a dynamically created per cpu worker thread? To the
cgroup of the parent. But is the parent necessarily in the proper cgroup? No,
there is no guarantee. So it ends up in some random cgroup unless I start
chasing every new thread, instead of letting it use the default cosid of the
CPU.

Having a per cpu default cos-id which is used when the task does not have a
cos-id associated makes a lot of sense and makes it simpler to utilize that
facility.

> > >> Per cpu default cos id for the cpus on that socket:
> > >> 
> > >>  xxxxxxx/cat/socket-N/cpu-x/default_cosid
> > >>  ...
> > >>  xxxxxxx/cat/socket-N/cpu-N/default_cosid
> > >>
> > >> The above allows a simple cpu based partitioning. All tasks which do
> > >> not have a cache partition assigned on a particular socket use the
> > >> default one of the cpu they are running on.
> > 
> >  Where is that information in (*2) and how is that related to (*1)? If you
> >  think it's not required, please explain why.
> 
> Not required because with current Intel patchset you'd do:

<SNIP>
...
</SNIP>

> # cat intel_rdt.l3_cbm
> 000ffff0
> # cat ../cgroupALL/intel_rdt.l3_cbm
> 000000ff
> 
> Bits f0 are shared between cgroupRSVD and cgroupALL. Lets change:
> # echo 0xf > ../cgroupALL/intel_rdt.l3_cbm
> # cat ../cgroupALL/intel_rdt.l3_cbm
> 0000000f
> 
> Now they share none.

Well, you changed ALL and everything, but you still did not assign a
particular cos-id to a particular CPU as its default.
 
> > >> Now for the task(s) partitioning:
> > >>
> > >>  xxxxxxx/cat/partitions/
> > >>
> > >> Under that directory one can create partitions
> > >> 
> > >>  xxxxxxx/cat/partitions/p1/tasks
> > >>                           /socket-0/cosid
> > >>                           ...
> > >>                           /socket-n/cosid
> > >> 
> > >> The default value for the per socket cosid is COSID_DEFAULT, which
> > >> causes the task(s) to use the per cpu default id. 
> > 
> >  Where is that information in (*2) and how is that related to (*1)? If you
> >  think it's not required, please explain why.
> > 
> > Yes. I ask the same question several times and I really want to see the
> > directory/interface structure which solves all of the above before anyone
> > starts to implement it. 
> 
> I don't see the problem, have a sequence of commands above which shows
> to set a directory structure which is useful and does what the HW 
> interface is supposed to do.

Well, you have a sequence of commands, which gives you the result which you
need for your particular problem.

> > We already have a completely useless interface (*1)
> > and there is no point to implement another one based on it (*2) just because
> > it solves your particular issue and is the fastest way forward. User space
> > interfaces are hard and we really do not need some half baken solution which
> > we have to support forever.
> 
> Fine. Can you please tell me what i can't do with the current interface?
> AFAICS everything can be done (except missing support for (*2)).

1) There is no consistent view of the facility. Is a sysadmin supposed to add
   printks to the kernel to figure that out or should he keep track of that
   information on a piece of paper? Neither option is useful if you have to
   analyze a system which was set up 3 months ago.

2) Simple and easy default settings for tasks and CPUs

   Rather than forcing the admin to find everything which might be related,
   it's way better to have configurable defaults.

   The cgroup interface allows you to set a task default, because everything
   which is in the root group is going to use that, but that default is
   useless. See #3

   That still does not give me a simple and easy to use way to set a per
   cpu default.
 
3) Non hierarchical setup

   The current interface has a rdt_root_group, which is set up at init. That
   group uses a closid (one of the few we have). And you cannot use that thing
   for anything else than having all bits set in the mask because all groups
   you create underneath must be a subset of the parent group.

   That is simply crap.

   We force something which is entirely not hierarchical into a structure
   which is designed for hierarchical problems and thereby waste one of the
   scarce and precious resources.

   That technology is straightforward partitioning and has nothing
   hierarchical at all.

The decision to use cgroups was wrong in the beginning and it does not become
better by pretending that it solves some particular use cases and by repeating
that everything can be solved with it.

If all I have is a hammer I certainly can pretend that everything is a
nail. We all know how well that works ....

> >     4) Per cpu default cos-id association
> 
> This already exists, and as noted in the command sequence above,
> works just fine. Please explain what problem are you seeing.

No it does not exist. You emulate it by forcing stuff into cgroups which is
not at all required if you have a proper and well thought out interface.

> >     5) Task association to cos-id
> 
> Not sure what that means. Please explain.

> > >>  xxxxxxx/cat/partitions/p1/tasks
> > >>                           /socket-0/cosid
> > >>                           ...
> > >>                           /socket-n/cosid

Yes, I agree that this is very similar to the cgroup mechanism, but it is not
in a pointless hierarchy. It's just the last step of the mechanism which I
proposed to represent the hardware in the best way and give the admin the
required flexibility. Again:

This is general information:

xxxxxxx/cat/max_cosids
xxxxxxx/cat/max_maskbits
xxxxxxx/cat/cdp_enable

This is per socket information and per socket cos-id management

xxxxxxx/cat/socket-0/...
xxxxxxx/cat/socket-N/hwsharedbits
		    /cos-id-0/...
		    /cos-id-N/in-use
			     /cat_mask
			     /cdp_mask

This is per cpu default cos-id association

xxxxxxx/cat/socket-0/...
xxxxxxx/cat/socket-N/cpu-x/default_cosid
xxxxxxx/cat/socket-N/cpu-N/default_cosid

This is partitioning, where tasks are associated.

xxxxxxx/cat/partitions/
xxxxxxx/cat/partitions/p1/tasks
			 /socket-0/cosid
			 /socket-N/cosid

That's the part which can be expressed with cgroups somehow, but for the price
of losing a cosid and having a pointless hierarchy. Again, there is nothing
hierarchical in RDT/CAT/CDP. It's all about partitioning and unfortunately the
number of possible partitions is massively limited.

I asked you last time already, but you just gave me random shell commands to
show that it can be done. I ask again:

Can you please explain in a simple directory based scheme, like the one I
gave you above how all of these points are going to be solved with "some
modifications" to the existing cgroup thingy.

And just for completeness, lets look at a practical real world use case:

   1  Socket
  18  CPUs
   4  COSIds (Yes, that's all that hardware gives me)
  32  mask bits
   2  hw shared bits at position 30 and 31

  Have 4 CPU partitions:
  
  CPU  0 -  5   general tasks
  CPU  6 -  9   RT1
  CPU 10 - 13   RT2
  CPU 14 - 17   RT3

  Let each CPU partition have 1/4 of the cache.

  Here is my solution:

  # echo 0xff000000 > xxxx/cat/socket-0/cosid-0
  # echo 0x00ff0000 > xxxx/cat/socket-0/cosid-1
  # echo 0x0000ff00 > xxxx/cat/socket-0/cosid-2
  # echo 0x000000ff > xxxx/cat/socket-0/cosid-3

  # for CPU in 0 1 2 3 4 5; do
  #  echo 0 > xxxx/cat/socket-0/cpu-$CPU/default_cosid;
  # done

  # CPU=6
  # while [ $CPU -lt 18 ]; do
  #   let "ID = 1 + (($CPU - 6) / 4)"
  #   echo $ID > xxxx/cat/socket-0/cpu-$CPU/default_cosid;
  #   let "CPU += 1"
  # done

That's it. Simple, right?
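
And reading the whole configuration back (point #1 above) is just as
simple; a sketch against the proposed layout ('grep .' merely prints
each file name with its content):

  # grep . xxxx/cat/socket-0/cosid-*
  # grep . xxxx/cat/socket-0/cpu-*/default_cosid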

The really interesting thing here is that you can't do that at all with the
current cgroup thingy. Simply because you run out of cosids.

Even if you have enough COSids on a newer CPU, your solution will be way
more complex and you still have not solved the issues of chasing kernel
threads etc.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2016-01-05 23:09             ` Thomas Gleixner
@ 2016-01-06 12:46               ` Marcelo Tosatti
  2016-01-06 13:10                 ` Tejun Heo
  2016-01-08 20:21               ` Thomas Gleixner
  1 sibling, 1 reply; 32+ messages in thread
From: Marcelo Tosatti @ 2016-01-06 12:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Yu, Fenghua, LKML, Peter Zijlstra, x86, Luiz Capitulino,
	Shivappa, Vikas, Tejun Heo, Shankar, Ravi V, Luck, Tony

On Wed, Jan 06, 2016 at 12:09:50AM +0100, Thomas Gleixner wrote:
> Marcelo,
> 
> On Mon, 4 Jan 2016, Marcelo Tosatti wrote:
> > On Thu, Dec 31, 2015 at 11:30:57PM +0100, Thomas Gleixner wrote:
> > > I don't have an idea how that would look like. The current structure is a
> > > cgroups based hierarchy oriented approach, which does not allow simple things
> > > like
> > > 
> > > T1	00001111
> > > T2	00111100
> > > 
> > > at least not in a way which is natural to the problem at hand.
> > 
> > 
> > 
> > 	cgroupA/
> > 
> > 		cbm_mask  (if set, set for all CPUs)
> 
> You mean sockets, right?
> 
> > 
> > 		socket1/cbm_mask
> > 		socket2/cbm_mask
> > 		...
> > 		socketN/cbm_mask (if set, overrides global
> > 		cbm_mask).
> > 
> > Something along those lines.
> > 
> > Do you see any problem with it?
> 
> So for that case:
> 
> task1:	 cbm_mask 00001111
> task2:	 cbm_mask 00111100
> 
> i.e. task1 and task2 share bit 2/3 of the mask. 
> 
> I need to have two cgroups: cgroup1 and cgroup2, task1 is member of cgroup1
> and task2 is member of cgroup2, right?
> 
> So now add some more of this and then figure out, which cbm_masks are in use
> on which socket. That means I need to go through all cgroups and find the
> cbm_masks there.

Yes.

> With my proposed directory structure you get a very clear view about the
> in-use closids and the associated cbm_masks. That view represents the hardware
> in the best way. With the cgroups stuff we get an artificial representation
> which does not tell us anything about the in-use closids and the associated
> cbm_masks.

Because you expose cos-ID ---> cbm / cdp masks.

Fine, I agree that's nice.

> > > I cannot imagine how that modification to the current interface would solve
> > > that. Not to talk about per CPU associations which are not related to tasks at
> > > all.
> > 
> > Not sure what you mean by per CPU associations.
> 
> As I wrote before:
> 
>  "It would even be sufficient for particular use cases to just associate
>   a piece of cache to a given CPU and do not bother with tasks at all."
> 
> > If you fix a cbmmask on a given pCPU, say CPU1, and control which tasks
> > run on that pCPU, then you control the cbmmask for all tasks (say
> > tasklist-1) on that CPU, fine.
> > 
> > Can achieve the same by putting all tasks from tasklist-1 into a
> > cgroup.
> 
> Which means, that I need to go and find everything including kernel threads
> and put them into a particular cgroup. That's really not useful and it simply
> does not work:
> 
> To which cgroup belongs a dynamically created per cpu worker thread? To the
> cgroup of the parent. But is the parent necessarily in the proper cgroup? No,
> there is no guarantee. So it ends up in some random cgroup unless I start
> chasing every new thread, instead of letting it use the default cosid of the
> CPU.

Well, i suppose cgroups has facilities to handle this? That is, what is
required is:

On task creation, move the new task to a particular cgroup, based on
some visible characteristic of the task: (process name matching OR explicit
kernel thread creator specification OR ...).

Because there are two cases. Consider a kernel thread T, which contains 
code that is timing sensitive therefore requires to use a COSID (which
means use a reserved portion of cache).

Case 1) kernel thread T starts kernel thread R, which is also timing
sensitive (and wants to use the same COSID as kernel thread T). 
In that case, the cgroup's default (inherit cgroup from parent)
behaviour is correct.

Case 2) kernel thread T starts kernel thread X, which is not timing
sensitive, therefore kernel thread X should use "default cosid".
In the case of cgroups, in the example used elsewhere in this thread,
kernel thread X should be moved to "cgroupALL".

Strictly speaking there is a third case:

Case 3) kernel thread T starts kernel thread Z, which wants to
be moved to a different COSID other than kernel thread T's COSID.

So using the default COSID is not necessarily the correct thing to do
(this should be configurable on a per-case basis).
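
With the interface from the example used elsewhere in this thread, the
move for case 2 would simply be (the PID is purely illustrative):

# echo $NEW_KTHREAD_PID > /sys/fs/cgroup/intel_rdt/cgroupALL/tasks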

> Having a per cpu default cos-id which is used when the task does not have a
> cos-id associated makes a lot of sense and makes it simpler to utilize that
> facility.

You would need a facility to switch to "inherit cgroup from parent"
mode, and also to handle case 3 (which I supposed cgroups did, because
the same problem exists for other cgroup controllers).

> > > >> Per cpu default cos id for the cpus on that socket:
> > > >> 
> > > >>  xxxxxxx/cat/socket-N/cpu-x/default_cosid
> > > >>  ...
> > > >>  xxxxxxx/cat/socket-N/cpu-N/default_cosid
> > > >>
> > > >> The above allows a simple cpu based partitioning. All tasks which do
> > > >> not have a cache partition assigned on a particular socket use the
> > > >> default one of the cpu they are running on.
> > > 
> > >  Where is that information in (*2) and how is that related to (*1)? If you
> > >  think it's not required, please explain why.
> > 
> > Not required because with current Intel patchset you'd do:
> 
> <SNIP>
> ...
> </SNIP>
> 
> > # cat intel_rdt.l3_cbm
> > 000ffff0
> > # cat ../cgroupALL/intel_rdt.l3_cbm
> > 000000ff
> > 
> > Bits f0 are shared between cgroupRSVD and cgroupALL. Lets change:
> > # echo 0xf > ../cgroupALL/intel_rdt.l3_cbm
> > # cat ../cgroupALL/intel_rdt.l3_cbm
> > 0000000f
> > 
> > Now they share none.
> 
> Well, you changed ALL and everything, but you still did not assign a
> particular cos-id to a particular CPU as their default.
>  
> > > >> Now for the task(s) partitioning:
> > > >>
> > > >>  xxxxxxx/cat/partitions/
> > > >>
> > > >> Under that directory one can create partitions
> > > >> 
> > > >>  xxxxxxx/cat/partitions/p1/tasks
> > > >>                           /socket-0/cosid
> > > >>                           ...
> > > >>                           /socket-n/cosid
> > > >> 
> > > >> The default value for the per socket cosid is COSID_DEFAULT, which
> > > >> causes the task(s) to use the per cpu default id. 
> > > 
> > >  Where is that information in (*2) and how is that related to (*1)? If you
> > >  think it's not required, please explain why.
> > > 
> > > Yes. I ask the same question several times and I really want to see the
> > > directory/interface structure which solves all of the above before anyone
> > > starts to implement it. 
> > 
> > I don't see the problem, have a sequence of commands above which shows
> > to set a directory structure which is useful and does what the HW 
> > interface is supposed to do.
> 
> Well, you have a sequence of commands, which gives you the result which you
> need for your particular problem.
> 
> > > We already have a completely useless interface (*1)
> > > and there is no point to implement another one based on it (*2) just because
> > > it solves your particular issue and is the fastest way forward. User space
> > > interfaces are hard and we really do not need some half baken solution which
> > > we have to support forever.
> > 
> > Fine. Can you please tell me what i can't do with the current interface?
> > AFAICS everything can be done (except missing support for (*2)).
> 
> 1) There is no consistent view of the facility. Is a sysadmin supposed to add
>    printks to the kernel to figure that out or should he keep track of that
>    information on a piece of paper? Neither option is useful if you have to
>    analyze a system which was set up 3 month ago.

Parse the cgroups CAT directory.

> 2) Simple and easy default settings for tasks and CPUs
> 
>    Rather than forcing the admin to find everything which might be related,
>    it's way better to have configurable defaults.
> 
>    The cgroup interface allows you to set a task default, because everything
>    which is in the root group is going to use that, but that default is
>    useless. See #3
> 
>    That still does not give me the an simple and easy to use way to set a per
>    cpu default.

I also dislike the cgroups interface; your proposal is indeed nicer.

>  
> 3) Non hierarchical setup
> 
>    The current interface has a rdt_root_group, which is set up at init. That
>    group uses a closid (one of the few we have). And you cannot use that thing
>    for anything else than having all bits set in the mask because all groups
>    you create underneath must be a subset of the parent group.
> 
>    That is simply crap.
> 
>    We force something which is entirely not hierarchical into a structure
>    which is designed for hierarchical problems and thereby waste one of the
>    scarce and pretious resources.

Agree.

>    That technology is straight forward partitioning and has nothing
>    hierarchical at all.
> 
> The decision to use cgroups was wrong in the beginning and it does not become
> better by pretending that it solves some particular use cases and by repeating
> that everything can be solved with it.
> 
> If all I have is a hammer I certainly can pretend that everything is a
> nail. We all know how well that works ....
> 
> > >     4) Per cpu default cos-id association
> > 
> > This already exists, and as noted in the command sequence above,
> > works just fine. Please explain what problem are you seeing.
> 
> No it does not exist. You emulate it by forcing stuff into cgroups which is
> not at all required if you have a proper and well thought out interface.
> 
> > >     5) Task association to cos-id
> > 
> > Not sure what that means. Please explain.
> 
> > > >>  xxxxxxx/cat/partitions/p1/tasks
> > > >>                           /socket-0/cosid
> > > >>                           ...
> > > >>                           /socket-n/cosid
> 
> Yes, I agree, that this is very similar to the cgroup mechanism, but it is not
> in a pointless hierarchy. It's just the last step of the mechanism which I
> proposed to represent the hardware in the best way and give the admin the
> required flexibility. Again:
> 
> This is general information:
> 
> xxxxxxx/cat/max_cosids
> xxxxxxx/cat/max_maskbits
> xxxxxxx/cat/cdp_enable
> 
> This is per socket information and per socket cos-id management
> 
> xxxxxxx/cat/socket-0/...
> xxxxxxx/cat/socket-N/hwsharedbits
> 		    /cos-id-0/...
> 		    /cos-id-N/in-use
> 			     /cat_mask
> 			     /cdp_mask
> 
> This is per cpu default cos-id association
> 
> xxxxxxx/cat/socket-0/...
> xxxxxxx/cat/socket-N/cpu-x/default_cosid
> xxxxxxx/cat/socket-N/cpu-N/default_cosid
> 
> This is partitioning, where tasks are associated.
> 
> xxxxxxx/cat/partitions/
> xxxxxxx/cat/partitions/p1/tasks
> 			 /socket-0/cosid
> 			 /socket-N/cosid
> 
> That's the part which can be expressed with cgroups somehow, but for the price
> of losing a cosid and having a pointless hierarchy. Again, there is nothing
> hierarchical in RDT/CAT/CDP. It's all about partitioning and unfortunately the
> number of possible partitions is massively limited.
> 
> I asked you last time already, but you just gave me random shell commands to
> show that it can be done. I ask again:
> 
> Can you please explain in a simple directory based scheme, like the one I
> gave you above how all of these points are going to be solved with "some
> modifications" to the existing cgroup thingy.
> 
> And just for completeness, lets look at a practical real world use case:
> 
>    1  Socket
>   18  CPUs
>    4  COSIds (Yes, that's all that hardware gives me)
>   32  mask bits
>    2  hw shared bits at position 30 and 31
> 
>   Have 4 CPU partitions:
>   
>   CPU  0 -  5   general tasks
>   CPU  6 -  9   RT1
>   CPU 10 - 13   RT2
>   CPU 14 - 17   RT3
> 
>   Let each CPU partition have 1/4 of the cache.
> 
>   Here is my solution:
> 
>   # echo 0xff000000 > xxxx/cat/socket-0/cosid-0
>   # echo 0x00ff0000 > xxxx/cat/socket-0/cosid-1
>   # echo 0x0000ff00 > xxxx/cat/socket-0/cosid-2
>   # echo 0x000000ff > xxxx/cat/socket-0/cosid-3
> 
>   # for CPU in 0 1 2 3 4 5; do
>   #  echo 0 > xxxx/cat/socket-0/cpu-$CPU/default_cosid;
>   # done
> 
>   # CPU=6
>   # while [ $CPU -lt 18 ]; do
>   #   let "ID = 1 + (($CPU - 6) / 4)"
>   #   echo $ID > xxxx/cat/socket-0/cpu-$CPU/default_cosid;
>   #   let "CPU += 1"
>   # done
> 
> That's it. Simple, right?
> 
> The real interesting thing here is, that you can't do that at all with that
> current cgroup thingy. Simply because you run out of cosids.
> 
> Even if you have enough COSids on a newer CPU then your solution will be way
> more complex and you still have not solved the issues of chasing kernel
> threads etc.
> 
> Thanks,
> 
> 	tglx

Fine, I agree. We need to solve the problem of assigning cosids on
creation of kernel threads, as discussed above (the three cases).

Fenghua, what are your thoughts?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2016-01-06 12:46               ` Marcelo Tosatti
@ 2016-01-06 13:10                 ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2016-01-06 13:10 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Thomas Gleixner, Yu, Fenghua, LKML, Peter Zijlstra, x86,
	Luiz Capitulino, Shivappa, Vikas, Shankar, Ravi V, Luck, Tony

Hello, Marcelo.

On Wed, Jan 06, 2016 at 10:46:15AM -0200, Marcelo Tosatti wrote:
> Well, i suppose cgroups has facilities to handle this? That is, what is
> required is:

No, it doesn't.

> On task creation, move the new task to a particular cgroup, based on
> some visible characteristic of the task: (process name matching OR explicit
> kernel thread creator specification OR ...).

cgroup's primary goal is resource tracking and control.  For userland
processes, following fork / clone is enough; however, for a lot of
kthread tasks, a task isn't even the right unit.  E.g. think of CPU
cycles spent on packet reception: spawning per-cgroup kthreads to
handle packet rx separately isn't a realistic option.  The granularity
needs to be higher.  Except for a handful of cases, this pattern
holds.  Another example is IO resources spent during journal write.
Most in-kernel resource tracking can't be split per-kthread.

While assigning kthreads to specific cgroups can be useful for a few
specific use cases, in terms of in-kernel resource tracking, it's more
of a distraction.

Please stop using cgroup for random task grouping.  Supporting the
level of flexibility to support arbitrary grouping gets in the way of
implementing proper resource control.  You won't be happy because
cgroup's rules get in the way and cgroup won't be happy because your
random stuff gets in the way of proper resource control.

Thomas's proposal obviously works better for the task at hand.  Maybe
there's something which can be extracted out of cgroup and shared for
task group tracking, if nothing else, hooks and synchronization, but
please don't tack it on top of cgroup when it doesn't really fit the
hierarchical resource distribution model.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFD] CAT user space interface revisited
  2016-01-05 23:09             ` Thomas Gleixner
  2016-01-06 12:46               ` Marcelo Tosatti
@ 2016-01-08 20:21               ` Thomas Gleixner
  1 sibling, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2016-01-08 20:21 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Yu, Fenghua, LKML, Peter Zijlstra, x86, Luiz Capitulino,
	Shivappa, Vikas, Tejun Heo, Shankar, Ravi V, Luck, Tony

On Wed, 6 Jan 2016, Thomas Gleixner wrote:
> 
> This is general information:
> 
> xxxxxxx/cat/max_cosids
> xxxxxxx/cat/max_maskbits
> xxxxxxx/cat/cdp_enable
> 
> This is per socket information and per socket cos-id management

Just had a discussion with HPA about this and he pointed out that socket might
not be the proper solution. The correct partitioning unit is cache domains. So
we should change socket-X to L3-domain-X. That'll allow us to integrate future
things like CAT on L2.
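
i.e. the per-domain part of the layout would become something like this
(the names are purely illustrative):

xxxxxxx/cat/l3-domain-0/cos-id-0/cat_mask
xxxxxxx/cat/l3-domain-0/cpu-0/default_cosid
xxxxxxx/cat/l2-domain-0/...     <- room for a future L2 CAT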

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2016-01-08 20:22 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-18 18:25 [RFD] CAT user space interface revisited Thomas Gleixner
2015-11-18 19:38 ` Luiz Capitulino
2015-11-18 19:55   ` Auld, Will
2015-11-18 22:34 ` Marcelo Tosatti
2015-11-19  0:34   ` Marcelo Tosatti
2015-11-19  8:35     ` Thomas Gleixner
2015-11-19 13:44       ` Luiz Capitulino
2015-11-20 14:15       ` Marcelo Tosatti
2015-11-19  8:11   ` Thomas Gleixner
2015-11-19  0:01 ` Marcelo Tosatti
2015-11-19  1:05   ` Marcelo Tosatti
2015-11-19  9:09     ` Thomas Gleixner
2015-11-19 20:59       ` Marcelo Tosatti
2015-11-20  7:53         ` Thomas Gleixner
2015-11-20 17:51           ` Marcelo Tosatti
2015-11-19 20:30     ` Marcelo Tosatti
2015-11-19  9:07   ` Thomas Gleixner
2015-11-24  8:27   ` Chao Peng
     [not found]     ` <20151124212543.GA11303@amt.cnet>
2015-11-25  1:29       ` Marcelo Tosatti
2015-11-24  7:31 ` Chao Peng
2015-11-24 23:06   ` Marcelo Tosatti
2015-12-22 18:12 ` Yu, Fenghua
2015-12-23 10:28   ` Marcelo Tosatti
2015-12-29 12:44     ` Thomas Gleixner
2015-12-31 19:22       ` Marcelo Tosatti
2015-12-31 22:30         ` Thomas Gleixner
2016-01-04 17:20           ` Marcelo Tosatti
2016-01-04 17:44             ` Marcelo Tosatti
2016-01-05 23:09             ` Thomas Gleixner
2016-01-06 12:46               ` Marcelo Tosatti
2016-01-06 13:10                 ` Tejun Heo
2016-01-08 20:21               ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).