[01/14] x86/cqm: Intel Resource Monitoring Documentation

Message ID 1481929988-31569-2-git-send-email-vikas.shivappa@linux.intel.com
State New, archived
Series
  • Cqm2: Intel Cache Monitoring fixes and enhancements

Commit Message

Vikas Shivappa Dec. 16, 2016, 11:12 p.m. UTC
Add documentation on the usage of cqm and mbm events, continuous monitoring,
and lazy and non-lazy monitoring.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 Documentation/x86/intel_rdt_mon_ui.txt | 91 ++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)
 create mode 100644 Documentation/x86/intel_rdt_mon_ui.txt

Comments

Peter Zijlstra Dec. 23, 2016, 12:32 p.m. UTC | #1
On Fri, Dec 16, 2016 at 03:12:55PM -0800, Vikas Shivappa wrote:
> +Continuous monitoring
> +---------------------
> +A new file cont_monitoring is added to perf_cgroup which helps to enable
> +cqm continuous monitoring. Enabling this field would start monitoring of
> +the cgroup without perf being launched. This can be used for long term
> +light weight monitoring of tasks/cgroups.
> +
> +To enable continuous monitoring of cgroup p1.
> +#echo 1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
> +
> +To disable continuous monitoring of cgroup p1.
> +#echo 0 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
> +
> +To read the counters at the end of monitoring perf can be used.
> +
> +LAZY and NOLAZY Monitoring
> +--------------------------
> +LAZY:
> +By default when monitoring is enabled, the RMIDs are not allocated
> +immediately and allocated lazily only at the first sched_in.
> +There are 2-4 RMIDs per logical processor on each package. So if a dual
> +package has 48 logical processors, there would be upto 192 RMIDs on each
> +package = total of 192x2 RMIDs.
> +There is a possibility that RMIDs can runout and in that case the read
> +reports an error since there was no RMID available to monitor for an
> +event.
> +
> +NOLAZY:
> +When user wants guaranteed monitoring, he can enable the 'monitoring
> +mask' which is basically used to specify the packages he wants to
> +monitor. The RMIDs are statically allocated at open and failure is
> +indicated if RMIDs are not available.
> +
> +To specify monitoring on package 0 and package 1:
> +#echo 0-1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_mon_mask
> +
> +An error is thrown if packages not online are specified.

I very much dislike both those for adding files to the perf cgroup.
Drivers should really not do that.

I absolutely hate the second because events already have affinity.

I can't see this happening.
Vikas Shivappa Dec. 23, 2016, 7:35 p.m. UTC | #2
Hello Peterz,

On Fri, 23 Dec 2016, Peter Zijlstra wrote:

> On Fri, Dec 16, 2016 at 03:12:55PM -0800, Vikas Shivappa wrote:
>> +Continuous monitoring
>> +---------------------
>> +A new file cont_monitoring is added to perf_cgroup which helps to enable
>> +cqm continuous monitoring. Enabling this field would start monitoring of
>> +the cgroup without perf being launched. This can be used for long term
>> +light weight monitoring of tasks/cgroups.
>> +
>> +To enable continuous monitoring of cgroup p1.
>> +#echo 1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
>> +
>> +To disable continuous monitoring of cgroup p1.
>> +#echo 0 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
>> +
>> +To read the counters at the end of monitoring perf can be used.
>> +
>> +LAZY and NOLAZY Monitoring
>> +--------------------------
>> +LAZY:
>> +By default when monitoring is enabled, the RMIDs are not allocated
>> +immediately and allocated lazily only at the first sched_in.
>> +There are 2-4 RMIDs per logical processor on each package. So if a dual
>> +package has 48 logical processors, there would be upto 192 RMIDs on each
>> +package = total of 192x2 RMIDs.
>> +There is a possibility that RMIDs can runout and in that case the read
>> +reports an error since there was no RMID available to monitor for an
>> +event.
>> +
>> +NOLAZY:
>> +When user wants guaranteed monitoring, he can enable the 'monitoring
>> +mask' which is basically used to specify the packages he wants to
>> +monitor. The RMIDs are statically allocated at open and failure is
>> +indicated if RMIDs are not available.
>> +
>> +To specify monitoring on package 0 and package 1:
>> +#echo 0-1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_mon_mask
>> +
>> +An error is thrown if packages not online are specified.
>
> I very much dislike both those for adding files to the perf cgroup.
> Drivers should really not do that.

Is the continuous monitoring the issue or the interface (adding a file in 
perf_cgroup)? I have not mentioned it in the documentation, but this continuous 
monitoring / monitoring mask applies only to cgroups in this patch, and hence we 
thought a good place for it is in the cgroup itself because it is per cgroup.

For task events, this won't apply, and we are thinking of providing a prctl-based 
interface for the user to toggle continuous monitoring.

>
> I absolutely hate the second because events already have affinity.

This applies to continuous monitoring as well, when there are no events 
associated. Meaning, if the monitoring mask is chosen and the user tries to enable 
continuous monitoring using cgrp->cont_mon, all RMIDs are allocated 
immediately. The mon_mask provides a way for the user to have guaranteed RMIDs 
both for cgroups that have events and for continuous monitoring (no perf event 
associated), assuming the user enables it only when he knows he would definitely 
use it; otherwise there is LAZY mode.

Again, this is cgroup specific, won't apply to task events, and is needed when 
there are no events associated.

Thanks,
Vikas

>
> I can't see this happening.
>
Peter Zijlstra Dec. 23, 2016, 8:33 p.m. UTC | #3
On Fri, Dec 23, 2016 at 11:35:03AM -0800, Shivappa Vikas wrote:
> 
> Hello Peterz,
> 
> On Fri, 23 Dec 2016, Peter Zijlstra wrote:
> 
> >On Fri, Dec 16, 2016 at 03:12:55PM -0800, Vikas Shivappa wrote:
> >>+Continuous monitoring
> >>+---------------------
> >>+A new file cont_monitoring is added to perf_cgroup which helps to enable
> >>+cqm continuous monitoring. Enabling this field would start monitoring of
> >>+the cgroup without perf being launched. This can be used for long term
> >>+light weight monitoring of tasks/cgroups.
> >>+
> >>+To enable continuous monitoring of cgroup p1.
> >>+#echo 1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
> >>+
> >>+To disable continuous monitoring of cgroup p1.
> >>+#echo 0 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
> >>+
> >>+To read the counters at the end of monitoring perf can be used.
> >>+
> >>+LAZY and NOLAZY Monitoring
> >>+--------------------------
> >>+LAZY:
> >>+By default when monitoring is enabled, the RMIDs are not allocated
> >>+immediately and allocated lazily only at the first sched_in.
> >>+There are 2-4 RMIDs per logical processor on each package. So if a dual
> >>+package has 48 logical processors, there would be upto 192 RMIDs on each
> >>+package = total of 192x2 RMIDs.
> >>+There is a possibility that RMIDs can runout and in that case the read
> >>+reports an error since there was no RMID available to monitor for an
> >>+event.
> >>+
> >>+NOLAZY:
> >>+When user wants guaranteed monitoring, he can enable the 'monitoring
> >>+mask' which is basically used to specify the packages he wants to
> >>+monitor. The RMIDs are statically allocated at open and failure is
> >>+indicated if RMIDs are not available.
> >>+
> >>+To specify monitoring on package 0 and package 1:
> >>+#echo 0-1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_mon_mask
> >>+
> >>+An error is thrown if packages not online are specified.
> >
> >I very much dislike both those for adding files to the perf cgroup.
> >Drivers should really not do that.
> 
> Is the continuous monitoring the issue or the interface (adding a file in
> perf_cgroup) ? I have not mentioned in the documentaion but this continuous
> monitoring/ monitoring mask applies only to cgroup in this patch and hence
> we thought a good place for that is in the cgroup itself because its per
> cgroup.
> 
> For task events , this wont apply and we are thinking of providing a prctl
> based interface for user to toggle the continous monitoring ..

More fail..

> >
> >I absolutely hate the second because events already have affinity.
> 
> This applies to continuous monitoring as well when there are no events
> associated. Meaning if the monitoring mask is chosen and user tries to
> enable continuous monitoring using the cgrp->cont_mon - all RMIDs are
> allocated immediately. the mon_mask provides a way for the user to have
> guarenteed RMIDs for both that have events and for continoous monitoring(no
> perf event associated) (assuming user uses it when user knows he would
> definitely use it.. or else there is LAZY mode)
> 
> Again this is cgroup specific and wont apply to task events and is needed
> when there are no events associated.

So no, the problem is that a driver introduces special ABI and behaviour
that radically departs from the regular behaviour.

Also, the 'whoops you ran out of RMIDs, please reboot' thing totally and
completely blows.
Vikas Shivappa Dec. 23, 2016, 9:41 p.m. UTC | #4
On Fri, 23 Dec 2016, Peter Zijlstra wrote:
>
> Also, the 'whoops you ran out of RMIDs, please reboot' thing totally and
> completely blows.

Well, this is really a hardware limitation. The user cannot monitor more events 
on a package than the # of RMIDs at the *same time*. Patch 10/14 reuses the RMIDs 
that are not monitored anymore. The user can monitor more events once he stops 
monitoring some.

So we return an error at read (LAZY mode) or open (NOLAZY mode).


>
>
Vikas Shivappa Dec. 25, 2016, 1:51 a.m. UTC | #5
On Fri, 23 Dec 2016, Peter Zijlstra wrote:

> On Fri, Dec 23, 2016 at 11:35:03AM -0800, Shivappa Vikas wrote:
>>
>> Hello Peterz,
>>
>> On Fri, 23 Dec 2016, Peter Zijlstra wrote:
>>
>>> On Fri, Dec 16, 2016 at 03:12:55PM -0800, Vikas Shivappa wrote:
>>>> +Continuous monitoring
>>>> +---------------------
>>>> +A new file cont_monitoring is added to perf_cgroup which helps to enable
>>>> +cqm continuous monitoring. Enabling this field would start monitoring of
>>>> +the cgroup without perf being launched. This can be used for long term
>>>> +light weight monitoring of tasks/cgroups.
>>>> +
>>>> +To enable continuous monitoring of cgroup p1.
>>>> +#echo 1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
>>>> +
>>>> +To disable continuous monitoring of cgroup p1.
>>>> +#echo 0 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
>>>> +
>>>> +To read the counters at the end of monitoring perf can be used.
>>>> +
>>>> +LAZY and NOLAZY Monitoring
>>>> +--------------------------
>>>> +LAZY:
>>>> +By default when monitoring is enabled, the RMIDs are not allocated
>>>> +immediately and allocated lazily only at the first sched_in.
>>>> +There are 2-4 RMIDs per logical processor on each package. So if a dual
>>>> +package has 48 logical processors, there would be upto 192 RMIDs on each
>>>> +package = total of 192x2 RMIDs.
>>>> +There is a possibility that RMIDs can runout and in that case the read
>>>> +reports an error since there was no RMID available to monitor for an
>>>> +event.
>>>> +
>>>> +NOLAZY:
>>>> +When user wants guaranteed monitoring, he can enable the 'monitoring
>>>> +mask' which is basically used to specify the packages he wants to
>>>> +monitor. The RMIDs are statically allocated at open and failure is
>>>> +indicated if RMIDs are not available.
>>>> +
>>>> +To specify monitoring on package 0 and package 1:
>>>> +#echo 0-1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_mon_mask
>>>> +
>>>> +An error is thrown if packages not online are specified.
>>>
>>> I very much dislike both those for adding files to the perf cgroup.
>>> Drivers should really not do that.
>>
>> Is the continuous monitoring the issue or the interface (adding a file in
>> perf_cgroup) ? I have not mentioned in the documentaion but this continuous
>> monitoring/ monitoring mask applies only to cgroup in this patch and hence
>> we thought a good place for that is in the cgroup itself because its per
>> cgroup.
>>
>> For task events , this wont apply and we are thinking of providing a prctl
>> based interface for user to toggle the continous monitoring ..
>
> More fail..
>
>>>
>>> I absolutely hate the second because events already have affinity.
>>
>> This applies to continuous monitoring as well when there are no events
>> associated. Meaning if the monitoring mask is chosen and user tries to
>> enable continuous monitoring using the cgrp->cont_mon - all RMIDs are
>> allocated immediately. the mon_mask provides a way for the user to have
>> guarenteed RMIDs for both that have events and for continoous monitoring(no
>> perf event associated) (assuming user uses it when user knows he would
>> definitely use it.. or else there is LAZY mode)
>>
>> Again this is cgroup specific and wont apply to task events and is needed
>> when there are no events associated.
>
> So no, the problem is that a driver introduces special ABI and behaviour
> that radically departs from the regular behaviour.

Ok, looks like the interface is the problem. Will try to 
fix this. We are just trying to have a lightweight monitoring
option so that it's reasonable to monitor for a
very long time (like the lifetime of a process etc.). Mainly to not have all the 
perf scheduling overhead.
Maybe a perf event attr option is a more reasonable approach for the user to 
choose this (rather than some new interface like a prctl / cgroup file..)?

Thanks,
Vikas
David Carrillo-Cisneros Dec. 27, 2016, 7:13 a.m. UTC | #6
>>>>> +LAZY and NOLAZY Monitoring
>>>>> +--------------------------
>>>>> +LAZY:
>>>>> +By default when monitoring is enabled, the RMIDs are not allocated
>>>>> +immediately and allocated lazily only at the first sched_in.
>>>>> +There are 2-4 RMIDs per logical processor on each package. So if a
>>>>> dual
>>>>> +package has 48 logical processors, there would be upto 192 RMIDs on
>>>>> each
>>>>> +package = total of 192x2 RMIDs.
>>>>> +There is a possibility that RMIDs can runout and in that case the read
>>>>> +reports an error since there was no RMID available to monitor for an
>>>>> +event.
>>>>> +
>>>>> +NOLAZY:
>>>>> +When user wants guaranteed monitoring, he can enable the 'monitoring
>>>>> +mask' which is basically used to specify the packages he wants to
>>>>> +monitor. The RMIDs are statically allocated at open and failure is
>>>>> +indicated if RMIDs are not available.
>>>>> +
>>>>> +To specify monitoring on package 0 and package 1:
>>>>> +#echo 0-1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_mon_mask
>>>>> +
>>>>> +An error is thrown if packages not online are specified.
>>>>
>>>>
>>>> I very much dislike both those for adding files to the perf cgroup.
>>>> Drivers should really not do that.
>>>
>>>
>>> Is the continuous monitoring the issue or the interface (adding a file in
>>> perf_cgroup) ? I have not mentioned in the documentaion but this
>>> continuous
>>> monitoring/ monitoring mask applies only to cgroup in this patch and
>>> hence
>>> we thought a good place for that is in the cgroup itself because its per
>>> cgroup.
>>>
>>> For task events , this wont apply and we are thinking of providing a
>>> prctl
>>> based interface for user to toggle the continous monitoring ..
>>
>>
>> More fail..
>>
>>>>
>>>> I absolutely hate the second because events already have affinity.
>>>

The per-package NOLAZY flags are distinct from affinity. They modify
the behavior of something already running on that package. Besides
that, this is intended to work when there are no perf_events, and the
perf_event cpu field is already used in cgroup events.

>>>
>>> This applies to continuous monitoring as well when there are no events
>>> associated. Meaning if the monitoring mask is chosen and user tries to
>>> enable continuous monitoring using the cgrp->cont_mon - all RMIDs are
>>> allocated immediately. the mon_mask provides a way for the user to have
>>> guarenteed RMIDs for both that have events and for continoous
>>> monitoring(no
>>> perf event associated) (assuming user uses it when user knows he would
>>> definitely use it.. or else there is LAZY mode)
>>>
>>> Again this is cgroup specific and wont apply to task events and is needed
>>> when there are no events associated.
>>
>>
>> So no, the problem is that a driver introduces special ABI and behaviour
>> that radically departs from the regular behaviour.
>
>
> Ok , looks like the interface  is the problem. Will try to fix this. We are
> just trying to have a light weight monitoring
> option so that its reasonable to monitor for a
> very long time (like lifetime of process etc). Mainly to not have all the
> perf scheduling overhead.
> May be a perf event attr option is a more reasonable approach for the user
> to choose the option ? (rather than some new interface like prctl / cgroup
> file..)

I don't see how a perf event attr option would work, since the goal of
continuous monitoring is to start CQM/CMT without a perf event.

An alternative is to add a single file to the cqm pmu directory. The
file lists which cgroups must be continuously monitored (optionally
with per-package flags):

  $ cat /sys/devices/intel_cmt/cgroup_cont_monitoring
  cgroup      per-pkg flags
  /           0;1;0;0
  g1          0;0;0;0
  g1/g1_1     0;0;0;0
  g2          0;1;0;0

to start continuous monitoring in a cgroup (flags optional, default to all 0's):
  $ echo "g2/g2_1 0;1;0;0" > /sys/devices/intel_cmt/cgroup_cont_monitoring
to stop it:
  $ echo "-g2/g2_1" > /sys/devices/intel_cmt/cgroup_cont_monitoring

Note that the cgroup name is what perf_event_attr takes now, so it's
not that different from creating a perf event.


Another option is to create a directory per cgroup to monitor, so:
  $ mkdir /sys/devices/intel_cmt/cgroup_cont_monitoring/g1
starts continuous monitoring in g1.

This approach is problematic, though, because the cont_monitoring
property is not hierarchical, i.e. a cgroup g1/g1_1 may need
cont_monitoring while g1 doesn't. Supporting this would require either
doing something funny with the cgroup name or adding extra files to
each folder and exposing all cgroups. None of these options seem good to
me.

So, my money is on a single file
"/sys/devices/intel_cmt/cgroup_cont_monitoring". Thoughts?

Thanks,
David
Andi Kleen Dec. 27, 2016, 8 p.m. UTC | #7
Shivappa Vikas <vikas.shivappa@intel.com> writes:
>
> Ok , looks like the interface  is the problem. Will try to fix
> this. We are just trying to have a light weight monitoring
> option so that its reasonable to monitor for a
> very long time (like lifetime of process etc). Mainly to not have all
> the perf scheduling overhead.

That seems like an odd reason to define a completely new user interface.
This is to avoid one MSR write for an RMID change per context switch
in/out of a cgroup, or is it other code too?

Is there some number you can put to the overhead?
Or is there some other overhead other than the MSR write
you're concerned about?

Do you have an ftrace or better PT trace with the overhead before-after?

Perhaps some optimization could be done in the code to make it faster,
then the new interface wouldn't be needed.

FWIW there are some pending changes to context switch that will
eliminate at least one common MSR write [1]. If that was fixed
you could do the RMID MSR write "for free"

-Andi

[1] https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/fsgsbase
Vikas Shivappa Dec. 27, 2016, 8:21 p.m. UTC | #8
On Tue, 27 Dec 2016, Andi Kleen wrote:

> Shivappa Vikas <vikas.shivappa@intel.com> writes:
>>
>> Ok , looks like the interface  is the problem. Will try to fix
>> this. We are just trying to have a light weight monitoring
>> option so that its reasonable to monitor for a
>> very long time (like lifetime of process etc). Mainly to not have all
>> the perf scheduling overhead.
>
> That seems like an odd reason to define a completely new user interface.
> This is to avoid one MSR write for a RMID change per context switch
> in/out cgroup or is it other code too?
>
> Is there some number you can put to the overhead?
> Or is there some other overhead other than the MSR write
> you're concerned about?

Yes, it seems the interface of having a file is odd, as even Peterz thinks.

It's actually the perf overhead we are trying to avoid.

The MSR writes (the driver/cqm overhead really, not perf..) we try to optimize 
by having a per-cpu cache, grouping the RMIDs, and having a common write for 
RMID/CLOSID etc.

The perf overhead I was thinking of, at least, was during the context switch, which 
is the more constant overhead (the event creation is just one time).

I was trying to see an alternative where:
1. user specifies continuous monitoring with a perf attr at open
2. driver allocates the task/cgroup RMID and stores the RMID in the cgroup or 
task_struct
3. turns off the event (hence no perf ctx switch overhead - all the perf hook 
calls for start/stop/add: we don't need any of those. I was still finding out 
whether this route works, basically whether, if I turn off the event, there is 
minimal overhead for the event and no start/stop/add calls for the event)
4. but during switch_to the driver writes the RMID MSR, so we still monitor
5. read -> calls the driver -> driver just returns the count by reading the 
RMID
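
For illustration, a minimal kernel-style sketch of what steps 2, 4 and 5 could
look like in the driver. The MSR numbers and the QM_EVTSEL/QM_CTR usage are the
architectural CQM ones, but the helper names and the cont_rmid field are
assumptions for the sketch, not the code actually posted in this series:

#define MSR_IA32_PQR_ASSOC      0x0c8f  /* RMID in bits [9:0] */
#define MSR_IA32_QM_EVTSEL      0x0c8d  /* event id [7:0], RMID [41:32] */
#define MSR_IA32_QM_CTR         0x0c8e
#define QOS_L3_OCCUP_EVENT_ID   0x01

/* Step 2: RMID stored directly in the task (or cgroup) when the user
 * asks for continuous monitoring; no perf event is kept active. */
static void cqm_assign_cont_rmid(struct task_struct *tsk, u32 rmid)
{
        tsk->cont_rmid = rmid;                  /* hypothetical field */
}

/* Step 4: called from switch_to(); only the RMID MSR is written,
 * none of the perf start/stop/add hooks run. */
static inline void cqm_cont_sched_in(struct task_struct *next)
{
        wrmsr(MSR_IA32_PQR_ASSOC, next->cont_rmid, 0);
}

/* Step 5: read path -- select the RMID/event in QM_EVTSEL and read the
 * free-running counter from QM_CTR. */
static u64 cqm_read_llc_occupancy(u32 rmid)
{
        u64 val;

        wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
        rdmsrl(MSR_IA32_QM_CTR, val);

        return val & ~(3ULL << 62);             /* strip error/unavailable bits */
}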

>
> Do you have an ftrace or better PT trace with the overhead before-after?
>
> Perhaps some optimization could be done in the code to make it faster,
> then the new interface wouldn't be needed.
>
> FWIW there are some pending changes to context switch that will
> eliminate at least one common MSR write [1]. If that was fixed
> you could do the RMID MSR write "for free"

I see, that's good to know.

Thanks,
Vikas

>
> -Andi
>
> [1] https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/fsgsbase
>
>
David Carrillo-Cisneros Dec. 27, 2016, 9:33 p.m. UTC | #9
On Tue, Dec 27, 2016 at 12:00 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Shivappa Vikas <vikas.shivappa@intel.com> writes:
>>
>> Ok , looks like the interface  is the problem. Will try to fix
>> this. We are just trying to have a light weight monitoring
>> option so that its reasonable to monitor for a
>> very long time (like lifetime of process etc). Mainly to not have all
>> the perf scheduling overhead.
>
> That seems like an odd reason to define a completely new user interface.
> This is to avoid one MSR write for a RMID change per context switch
> in/out cgroup or is it other code too?
>
> Is there some number you can put to the overhead?

I obtained some timing by manually instrumenting the kernel in a Haswell EP.

When using one intel_cmt/llc_occupancy/ cgroup perf_event on one CPU, the
avg time to do __perf_event_task_sched_out + __perf_event_task_sched_in is
~1170ns.

Most of the time is spent in the cgroup ctx switch (~1120ns).

When using continuous monitoring in the CQM driver, the avg time to
find the RMID to write inside the PQR context switch is ~16ns.

Note that this excludes the MSR write. It's only the overhead of finding
the RMID to write into PQR_ASSOC. Both paths call the same routine to find
the RMID, so there are about 1100 ns of overhead in perf_cgroup_switch.
By inspection I assume most of it comes from iterating over the pmu list.
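
A rough sketch of the kind of manual instrumentation such numbers could be
collected with -- per-cpu accumulation around the perf sched hooks at the
context switch call sites. This is an assumption about the method, not the
actual instrumentation used:

DEFINE_PER_CPU(u64, cgrp_switch_ns);
DEFINE_PER_CPU(u64, cgrp_switch_cnt);

/* Wrapper that replaces the plain sched_out/sched_in calls while measuring. */
static void timed_perf_task_sched(struct task_struct *prev,
                                  struct task_struct *next)
{
        u64 t0 = ktime_get_ns();

        __perf_event_task_sched_out(prev, next);
        __perf_event_task_sched_in(prev, next);

        /* avg ns per switch = cgrp_switch_ns / cgrp_switch_cnt */
        this_cpu_add(cgrp_switch_ns, ktime_get_ns() - t0);
        this_cpu_inc(cgrp_switch_cnt);
}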

> Or is there some other overhead other than the MSR write
> you're concerned about?

No, that problem is solved with the PQR software cache introduced in the series.


> Perhaps some optimization could be done in the code to make it faster,
> then the new interface wouldn't be needed.

There are some. One in my list is to create a list of pmus with at
least one cgroup event
and use it to iterate over in perf_cgroup_switch, instead of using the
"pmus" list.
The pmus list has grown a lot recently with the addition of all the uncore pmus.
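
Something along these lines, roughly -- a sketch with hypothetical list/field
names, just to show the idea:

/* Keep a separate list of pmus that currently have at least one cgroup
 * event, and walk only that list on a cgroup switch instead of the full
 * "pmus" list. */
static LIST_HEAD(cgrp_active_pmus);

static void pmu_cgroup_event_added(struct pmu *pmu)
{
        /* nr_cgroup_events and cgrp_active_entry are hypothetical fields */
        if (atomic_inc_return(&pmu->nr_cgroup_events) == 1)
                list_add_rcu(&pmu->cgrp_active_entry, &cgrp_active_pmus);
}

static void perf_cgroup_switch(struct task_struct *task, int mode)
{
        struct pmu *pmu;

        list_for_each_entry_rcu(pmu, &cgrp_active_pmus, cgrp_active_entry)
                cgroup_switch_one_pmu(pmu, task, mode); /* hypothetical helper */
}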

Despite this optimization, it's unlikely that the whole sched_out +
sched_in gets that
close to the 15 ns of the non perf_event approach.

Please note that context switch time for llc_occupancy events has more
impact than for
other events because in order to obtain reliable measurements, the
RMID switch must
be active _all_ the time, not only while the event is read.

>
> FWIW there are some pending changes to context switch that will
> eliminate at least one common MSR write [1]. If that was fixed
> you could do the RMID MSR write "for free"

That may save the need for the PQR software cache in this series, but
won't speed up
the context switch.

Thanks,
David
David Carrillo-Cisneros Dec. 27, 2016, 9:38 p.m. UTC | #10
> The perf overhead i was thinking atleast was during the context switch which
> is the more constant overhead (the event creation is just one time).
>
> -I was trying to see an alternative where
> 1.user specifies the continuous monitor with perf-attr in open
> 2.driver allocates the task/cgroup RMID and stores the RMID in cgroup or
> task_struct
> 3.turns off the event. (hence no perf ctx switch overhead? (all the perf
> hook calls for start/stop/add we dont need any of those -
> i was still finding out if this route works basically if i turn off the
> event there is minimal overhead for the event and not start/stop/add calls
> for the event.)
> 4.but during switch_to driver writes the RMID MSR, so we still monitor.
> 5.read -> calls the driver -> driver just returns the count by reading the
> RMID.

This option breaks user expectations about an event. If an event is
closed, it's gone.
It shouldn't leave some state behind.

Do you have thoughts about adding the one cgroup file to
the intel_cmt pmu directory?

Thanks,
David
Andi Kleen Dec. 27, 2016, 11:10 p.m. UTC | #11
On Tue, Dec 27, 2016 at 01:33:46PM -0800, David Carrillo-Cisneros wrote:
> When using one intel_cmt/llc_occupancy/ cgroup perf_event in one CPU, the
> avg time to do __perf_event_task_sched_out + __perf_event_task_sched_in is
> ~1170ns
> 
> most of the time is spend in cgroup ctx switch (~1120ns) .
> 
> When using continuous monitoring in CQM driver, the avg time to
> find the rmid to write inside of pqr_context switch  is ~16ns
> 
> Note that this excludes the MSR write. It's only the overhead of
> finding the RMID
> to write in PQR_ASSOC. Both paths call the same routine to find the
> RMID, so there are
> about 1100 ns of overhead in perf_cgroup_switch. By inspection I assume most
> of it comes from iterating over the pmu list.

Do Kan's pmu list patches help? 

https://patchwork.kernel.org/patch/9420035/

> 
> > Or is there some other overhead other than the MSR write
> > you're concerned about?
> 
> No, that problem is solved with the PQR software cache introduced in the series.

So it's already fixed?

How much is the cost with your cache?

> 
> 
> > Perhaps some optimization could be done in the code to make it faster,
> > then the new interface wouldn't be needed.
> 
> There are some. One in my list is to create a list of pmus with at
> least one cgroup event
> and use it to iterate over in perf_cgroup_switch, instead of using the
> "pmus" list.
> The pmus list has grown a lot recently with the addition of all the uncore pmus.

Kan's patches above already do that I believe.

> 
> Despite this optimization, it's unlikely that the whole sched_out +
> sched_in gets that
> close to the 15 ns of the non perf_event approach.

It would be good to see how close we can get. I assume
there is more potential for optimizations and fast pathing.

-Andi
David Carrillo-Cisneros Dec. 28, 2016, 1:23 a.m. UTC | #12
On Tue, Dec 27, 2016 at 3:10 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Tue, Dec 27, 2016 at 01:33:46PM -0800, David Carrillo-Cisneros wrote:
>> When using one intel_cmt/llc_occupancy/ cgroup perf_event in one CPU, the
>> avg time to do __perf_event_task_sched_out + __perf_event_task_sched_in is
>> ~1170ns
>>
>> most of the time is spend in cgroup ctx switch (~1120ns) .
>>
>> When using continuous monitoring in CQM driver, the avg time to
>> find the rmid to write inside of pqr_context switch  is ~16ns
>>
>> Note that this excludes the MSR write. It's only the overhead of
>> finding the RMID
>> to write in PQR_ASSOC. Both paths call the same routine to find the
>> RMID, so there are
>> about 1100 ns of overhead in perf_cgroup_switch. By inspection I assume most
>> of it comes from iterating over the pmu list.
>
> Do Kan's pmu list patches help?
>
> https://patchwork.kernel.org/patch/9420035/

I think these are independent problems. Kan's patches aim to reduce the overhead
of multiple events in the same task context. The overhead numbers I posted
measure only _one_ event in the cpu's context.

>
>>
>> > Or is there some other overhead other than the MSR write
>> > you're concerned about?
>>
>> No, that problem is solved with the PQR software cache introduced in the series.
>
> So it's already fixed?

Sort of. With the PQR sw cache there is only one write to the MSR, and
only when either the RMID or the CLOSID actually changes.

>
> How much is the cost with your cache?

If there is no change of CLOSID or RMID, the hook and comparison take
about 60 ns.
If there is a change, the write to the MSR + other overhead is about
610 ns (dominated by the MSR write).
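
As a rough illustration of that fast/slow split, a simplified sketch of the
PQR software cache idea -- write the MSR only when the value to program
differs from what is already there. The struct/field names here are
assumptions, not necessarily those used in the series:

struct intel_pqr_state {
        u32 rmid;       /* RMID currently programmed in MSR_IA32_PQR_ASSOC */
        u32 closid;     /* CLOSID currently programmed in MSR_IA32_PQR_ASSOC */
};
static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);

static inline void pqr_update(u32 rmid, u32 closid)
{
        struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);

        if (state->rmid == rmid && state->closid == closid)
                return;                         /* the ~60 ns path: no MSR access */

        state->rmid = rmid;
        state->closid = closid;
        wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid); /* the ~610 ns path */
}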

>
>>
>>
>> > Perhaps some optimization could be done in the code to make it faster,
>> > then the new interface wouldn't be needed.
>>
>> There are some. One in my list is to create a list of pmus with at
>> least one cgroup event
>> and use it to iterate over in perf_cgroup_switch, instead of using the
>> "pmus" list.
>> The pmus list has grown a lot recently with the addition of all the uncore pmus.
>
> Kan's patches above already do that I believe.

see previous answer.

>
>>
>> Despite this optimization, it's unlikely that the whole sched_out +
>> sched_in gets that
>> close to the 15 ns of the non perf_event approach.
>
> It would be good to see how close we can get. I assume
> there is more potential for optimizations and fast pathing.

I will work on the optimization I described earlier that avoids iterating
over all pmus on the cgroup switch. That should remove the bulk of the
overhead, but still more work will probably be needed to get close to the
15ns overhead.

Thanks,
David
Vikas Shivappa Dec. 28, 2016, 8:03 p.m. UTC | #13
On Tue, 27 Dec 2016, David Carrillo-Cisneros wrote:

> On Tue, Dec 27, 2016 at 3:10 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> On Tue, Dec 27, 2016 at 01:33:46PM -0800, David Carrillo-Cisneros wrote:
>>> When using one intel_cmt/llc_occupancy/ cgroup perf_event in one CPU, the
>>> avg time to do __perf_event_task_sched_out + __perf_event_task_sched_in is
>>> ~1170ns
>>>
>>> most of the time is spend in cgroup ctx switch (~1120ns) .
>>>
>>> When using continuous monitoring in CQM driver, the avg time to
>>> find the rmid to write inside of pqr_context switch  is ~16ns
>>>
>>> Note that this excludes the MSR write. It's only the overhead of
>>> finding the RMID
>>> to write in PQR_ASSOC. Both paths call the same routine to find the
>>> RMID, so there are
>>> about 1100 ns of overhead in perf_cgroup_switch. By inspection I assume most
>>> of it comes from iterating over the pmu list.
>>
>> Do Kan's pmu list patches help?
>>
>> https://patchwork.kernel.org/patch/9420035/
>
> I think these are independent problems. Kan's patches aim to reduce the overhead
> of multiples events in the same task context. The overhead numbers I posted
> measure only _one_ event in the cpu's context.
>
>>
>>>
>>>> Or is there some other overhead other than the MSR write
>>>> you're concerned about?
>>>
>>> No, that problem is solved with the PQR software cache introduced in the series.
>>
>> So it's already fixed?
>
> Sort of, with PQR sw cache there is only one write to MSR and is only
> when either the
> RMID or the CLOSID actually changes.
>
>>
>> How much is the cost with your cache?
>
> If there is no change on CLOSID or RMID, the hook and comparison takes
> about 60 ns.
> If there is a change, the write to the MSR + other overhead is about
> 610 ns (dominated by the MSR write).

The MSR read and write we measured were close to 250 - 300 cycles. 
The issue was that even the read was as costly, which is why the caching 
helps, as it avoids all reads. The grouping of RMIDs using cgroups and
multiple events etc. helps the cache because it increases the 
hit probability.

>
>>
>>>
>>>
>>>> Perhaps some optimization could be done in the code to make it faster,
>>>> then the new interface wouldn't be needed.
>>>
>>> There are some. One in my list is to create a list of pmus with at
>>> least one cgroup event
>>> and use it to iterate over in perf_cgroup_switch, instead of using the
>>> "pmus" list.
>>> The pmus list has grown a lot recently with the addition of all the uncore pmus.
>>
>> Kan's patches above already do that I believe.
>
> see previous answer.
>
>>
>>>
>>> Despite this optimization, it's unlikely that the whole sched_out +
>>> sched_in gets that
>>> close to the 15 ns of the non perf_event approach.
>>
>> It would be good to see how close we can get. I assume
>> there is more potential for optimizations and fast pathing.
>
> I will work on the optimization I described earlier that avoids iterating
> over all pmus on the cgroup switch. That should take the bulk of the
> overhead, but still more work will probably be needed to get close to the
> 15ns overhead.

This seems the best option as it's more generic, so we really don't need our 
event-specific change and the added file interface, which wasn't liked by 
Peterz/Andi anyway.
Will remove/clean up the continuous monitoring parts and resend the series.

Thanks,
Vikas

>
> Thanks,
> David
>

Patch

diff --git a/Documentation/x86/intel_rdt_mon_ui.txt b/Documentation/x86/intel_rdt_mon_ui.txt
new file mode 100644
index 0000000..7d68a65
--- /dev/null
+++ b/Documentation/x86/intel_rdt_mon_ui.txt
@@ -0,0 +1,91 @@ 
+User Interface for Resource Monitoring in Intel Resource Director Technology
+
+Vikas Shivappa <vikas.shivappa@intel.com>
+David Carrillo-Cisneros <davidcc@google.com>
+Stephane Eranian <eranian@google.com>
+
+This feature is enabled by the CONFIG_INTEL_RDT_M Kconfig option and the
+x86 /proc/cpuinfo flag bits cqm_llc, cqm_occup_llc, cqm_mbm_total, cqm_mbm_local.
+
+Resource Monitoring
+-------------------
+Resource Monitoring includes cqm (cache quality of service monitoring) and
+mbm (memory bandwidth monitoring) and uses the perf interface. A
+lightweight interface to enable monitoring without perf is provided as well.
+
+CQM provides OS/VMM a way to monitor llc occupancy. It measures the
+amount of L3 cache fills per task or cgroup.
+
+MBM provides OS/VMM a way to monitor bandwidth from one level of cache
+to another. The current patches support L3 external bandwidth
+monitoring. It supports both 'local bandwidth' and 'total bandwidth'
+monitoring for the socket. Local bandwidth measures the amount of data
+sent through the memory controller on the socket and total b/w measures
+the total system bandwidth.
+
+To check the monitoring events enabled:
+
+# ./tools/perf/perf list | grep -i cqm
+intel_cqm/llc_occupancy/                           [Kernel PMU event]
+intel_cqm/local_bytes/                             [Kernel PMU event]
+intel_cqm/total_bytes/                             [Kernel PMU event]
+
+Monitoring tasks and cgroups using perf
+---------------------------------------
+Monitoring tasks and cgroup is like using any other perf event.
+
+#perf stat -I 1000 -e intel_cqm_llc/local_bytes/ -p PID1
+
+This will monitor the local_bytes event of PID1 and report once
+every 1000ms.
+
+#mkdir /sys/fs/cgroup/perf_event/p1
+#echo PID1 > /sys/fs/cgroup/perf_event/p1/tasks
+#echo PID2 > /sys/fs/cgroup/perf_event/p1/tasks
+
+#perf stat -I 1000 -e intel_cqm_llc/llc_occupancy/ -a -G p1
+
+This will monitor the llc_occupancy event of the perf cgroup p1 in
+interval mode.
+
+Hierarchical monitoring should work just like other events: users can
+also monitor a task within a cgroup and the cgroup together, or
+different cgroups in the same hierarchy can be monitored together.
+
+Continuous monitoring
+---------------------
+A new file, cont_monitoring, is added to perf_cgroup which helps to enable
+cqm continuous monitoring. Enabling this field starts monitoring of
+the cgroup without perf being launched. This can be used for long-term,
+lightweight monitoring of tasks/cgroups.
+
+To enable continuous monitoring of cgroup p1:
+#echo 1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
+
+To disable continuous monitoring of cgroup p1:
+#echo 0 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
+
+To read the counters at the end of monitoring, perf can be used.
+
+LAZY and NOLAZY Monitoring
+--------------------------
+LAZY:
+By default when monitoring is enabled, the RMIDs are not allocated
+immediately but are allocated lazily only at the first sched_in.
+There are 2-4 RMIDs per logical processor on each package. So if a dual
+package system has 48 logical processors, there would be up to 192 RMIDs on
+each package = a total of 192x2 RMIDs.
+There is a possibility that RMIDs can run out, and in that case the read
+reports an error since there was no RMID available to monitor for an
+event.
+
+NOLAZY:
+When the user wants guaranteed monitoring, he can set the 'monitoring
+mask', which is used to specify the packages he wants to
+monitor. The RMIDs are statically allocated at open, and failure is
+indicated if RMIDs are not available.
+
+To specify monitoring on package 0 and package 1:
+#echo 0-1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_mon_mask
+
+An error is returned if packages that are not online are specified.