* Why the need to do a perf_event_open syscall for each cpu on the system?
@ 2015-03-13 18:49 William Cohen
  2015-03-13 21:14 ` Vince Weaver
  2015-03-15  5:15 ` Elazar Leibovich
  0 siblings, 2 replies; 7+ messages in thread
From: William Cohen @ 2015-03-13 18:49 UTC (permalink / raw)
  To: linux-perf-users

Hi All,

I have a design question about the Linux kernel perf support. A number of /proc statistics aggregate data across all the cpus in the system.  Why does perf require the user-space application to enumerate all the processors and do a perf_event_open syscall for each of the processors?  Why not have a perf_event_open with pid=-1 and cpu=-1 mean a system-wide event and aggregate it in the kernel when the value is read?  The line below from design.txt specifically says it is invalid.

(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)
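
For concreteness, the enumeration I am describing looks roughly like
this (an untested sketch counting cycles; error handling omitted, and
perf_event_open() wrapped by hand since glibc provides no wrapper):

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    /* inside main(): one event per cpu, pid == -1 (all threads) */
    struct perf_event_attr attr = {
        .type   = PERF_TYPE_HARDWARE,
        .size   = sizeof(struct perf_event_attr),
        .config = PERF_COUNT_HW_CPU_CYCLES,
    };
    int ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    int fds[ncpus];
    for (int cpu = 0; cpu < ncpus; cpu++)
        fds[cpu] = perf_event_open(&attr, -1, cpu, -1, 0);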

-Will


* Re: Why the need to do a perf_event_open syscall for each cpu on the system?
  2015-03-13 18:49 Why the need to do a perf_event_open syscall for each cpu on the system? William Cohen
@ 2015-03-13 21:14 ` Vince Weaver
  2015-03-15  5:15 ` Elazar Leibovich
  1 sibling, 0 replies; 7+ messages in thread
From: Vince Weaver @ 2015-03-13 21:14 UTC (permalink / raw)
  To: William Cohen; +Cc: linux-perf-users

On Fri, 13 Mar 2015, William Cohen wrote:

> Why not have a perf_event_open with pid=-1 and cpu=-1 mean a
> system-wide event and aggregate it in the kernel when the value is read?
> The line below from design.txt specifically says it is invalid.

you might have more luck asking questions like this on linux-kernel; I'm
not sure many of the actual kernel developers hang out on
linux-perf-users.

From what I gather, having the kernel aggregate system-wide events adds
a lot of kernel overhead for not much benefit, as you can always aggregate
yourself in userspace (which is what I think the perf tool does).
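
For example, given one counting fd per cpu, the aggregation is just a
read() and a sum; an untested sketch (assuming fds[] and ncpus hold the
per-cpu event fds and the cpu count):

    long long total = 0;
    for (int cpu = 0; cpu < ncpus; cpu++) {
        long long count;   /* default read_format: a single u64 count */
        if (read(fds[cpu], &count, sizeof(count)) == sizeof(count))
            total += count;
    }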

Vince


* Re: Why the need to do a perf_event_open syscall for each cpu on the system?
  2015-03-13 18:49 Why the need to do a perf_event_open syscall for each cpu on the system? William Cohen
  2015-03-13 21:14 ` Vince Weaver
@ 2015-03-15  5:15 ` Elazar Leibovich
  2015-03-16 14:47   ` William Cohen
  1 sibling, 1 reply; 7+ messages in thread
From: Elazar Leibovich @ 2015-03-15  5:15 UTC (permalink / raw)
  To: William Cohen; +Cc: linux-perf-users

Hi,

Not an expert, but my understanding is that it's just a technical
difficulty. Performance metrics are saved in per-cpu buffers.
Having pid==-1 and cpu==-1 means that something would have to
aggregate the buffers from multiple CPUs into a single buffer. That
code must exist, either in userspace or in the kernel.

The kernel developers preferred that this code live in userspace.

On Fri, Mar 13, 2015 at 8:49 PM, William Cohen <wcohen@redhat.com> wrote:
> Hi All,
>
> I have a design question about the Linux kernel perf support. A number of /proc statistics aggregate data across all the cpus in the system.  Why does perf require the user-space application to enumerate all the processors and do a perf_event_open syscall for each of the processors?  Why not have a perf_event_open with pid=-1 and cpu=-1 mean a system-wide event and aggregate it in the kernel when the value is read?  The line below from design.txt specifically says it is invalid.
>
> (Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)
>
> -Will


* Re: Why the need to do a perf_event_open syscall for each cpu on the system?
  2015-03-15  5:15 ` Elazar Leibovich
@ 2015-03-16 14:47   ` William Cohen
  2015-03-17  0:51     ` Stephane Eranian
  2015-03-17 14:40     ` Andi Kleen
  0 siblings, 2 replies; 7+ messages in thread
From: William Cohen @ 2015-03-16 14:47 UTC (permalink / raw)
  To: Elazar Leibovich; +Cc: linux-perf-users, Stephane Eranian

On 03/15/2015 01:15 AM, Elazar Leibovich wrote:
> Hi,
> 
> Not an expert, but my understanding is that it's just a technical
> difficulty. Performance metrics are saved in per-cpu buffers.
> Having pid==-1 and cpu==-1 means that something would have to
> aggregate the buffers from multiple CPUs into a single buffer. That
> code must exist, either in userspace or in the kernel.
> 
> The kernel developers preferred that this code live in userspace.

Hi Elazar,

I suspected the reasoning was something along those lines.  I was hoping that someone could point to archived email threads with earlier discussions showing the complications that would arise by having system-wide perf event setup and reading handled in the kernel.  Looking through the earlier versions of perf, I see that pid==-1 and cpu==-1 were not allowed in the very early proposed patches (http://thread.gmane.org/gmane.linux.kernel.cross-arch/2578).  However, there is not much in the way of explanation of the design tradeoffs there.

Making user-space set up performance events for each cpu certainly simplifies the kernel code for system-wide monitoring. The cgroup support is essentially system-wide monitoring with additional filtering on the cgroup, and things get more complicated with the perf cgroup support when the cgroups are not pinned to a particular processor: O(cgroups*cpus) opens and reads.  If the number of cgroups is scaled up at the same rate as the number of cpus, this would be O(cpus^2).  I am wondering if handling the system-wide case (pid==-1 and cpu==-1) in the kernel would make cgroup and system-wide monitoring more efficient or if the complications in the kernel are just too much.

-Will
>
> On Fri, Mar 13, 2015 at 8:49 PM, William Cohen <wcohen@redhat.com> wrote:
>> Hi All,
>>
>> I have a design question about the Linux kernel perf support. A number of /proc statistics aggregate data across all the cpus in the system.  Why does perf require the user-space application to enumerate all the processors and do a perf_event_open syscall for each of the processors?  Why not have a perf_event_open with pid=-1 and cpu=-1 mean a system-wide event and aggregate it in the kernel when the value is read?  The line below from design.txt specifically says it is invalid.
>>
>> (Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)
>>
>> -Will


* Re: Why the need to do a perf_event_open syscall for each cpu on the system?
  2015-03-16 14:47   ` William Cohen
@ 2015-03-17  0:51     ` Stephane Eranian
  2015-03-17 14:40     ` Andi Kleen
  1 sibling, 0 replies; 7+ messages in thread
From: Stephane Eranian @ 2015-03-17  0:51 UTC (permalink / raw)
  To: William Cohen; +Cc: Elazar Leibovich, linux-perf-users

On Mon, Mar 16, 2015 at 10:47 AM, William Cohen <wcohen@redhat.com> wrote:
> On 03/15/2015 01:15 AM, Elazar Leibovich wrote:
>> Hi,
>>
>> Not an expert, but my understanding is that it's just a technical
>> difficulty. Performance metrics are saved in per-cpu buffers.
>> Having pid==-1 and cpu==-1 means that something would have to
>> aggregate the buffers from multiple CPUs into a single buffer. That
>> code must exist, either in userspace or in the kernel.
>>
>> The kernel developers preferred that this code live in userspace.
>
> Hi Elazar,
>
> I suspected the reasoning was something along those lines.  I was hoping that someone could point to archived email threads with earlier discussions showing the complications that would arise by having system-wide perf event setup and reading handled in the kernel.  Looking through the earlier versions of perf, I see that pid==-1 and cpu==-1 were not allowed in the very early proposed patches (http://thread.gmane.org/gmane.linux.kernel.cross-arch/2578).  However, there is not much in the way of explanation of the design tradeoffs there.
>
The perf_event interface is an event-driven interface. Users
manipulate individual events. Events can be attached to a thread
(per-process mode) or a CPU (per-cpu mode). To monitor all CPUs in a
system, multiple instances of an event must be created, one attached
to each monitored CPU. The kernel does not program an event across
all CPUs at once.

To collect a system-wide profile of cycles on a 12-way machine, you
need to create 12 events with the cycle event encoding and attach
each one to a CPU. Perf record/stat do this automatically for you. As
for the samples, there is a sampling buffer associated with each
event: to sample cycles across 12 CPUs, 12 per-cpu sampling buffers
are created and mapped into the tool. When a buffer fills up, perf
record dumps the contents unmodified into the perf.data file. As
such, the aggregation of the samples is deferred until perf
report/annotate are used.
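
Very roughly, for one cpu it looks like this untested sketch (needs
<sys/mman.h>; attr and the perf_event_open() wrapper as earlier in the
thread; the buffer must be one header page plus 2^n data pages):

    attr.sample_period = 100000;   /* one sample every 100k occurrences */
    attr.sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
    int fd = perf_event_open(&attr, -1, cpu, -1, 0);
    size_t len = (1 + 8) * sysconf(_SC_PAGESIZE);
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);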

> Making user-space set up performance events for each cpu certainly simplifies the kernel code for system-wide monitoring. The cgroup support is essentially system-wide monitoring with additional filtering on the cgroup, and things get more complicated with the perf cgroup support when the cgroups are not pinned to a particular processor: O(cgroups*cpus) opens and reads.  If the number of cgroups is scaled up at the same rate as the number of cpus, this would be O(cpus^2).  I am wondering if handling the system-wide case (pid==-1 and cpu==-1) in the kernel would make cgroup and system-wide monitoring more efficient or if the complications in the kernel are just too much.

As Will explained, cgroup mode is just a filtered form of per-CPU
monitoring: occurrences of an event are filtered based on cgroup. You
can say "monitor cycles on CPU0 only when a thread from cgroup foo is
running on CPU0". Usually in per-cpu mode, the event is programmed
into the PMU counter and remains there until monitoring stops,
regardless of context switches. In cgroup mode, the event is
programmed on the CPU and remains enabled as long as the current
thread is from the cgroup of interest. If it is not, the event is
descheduled, and it is rescheduled once a thread from the cgroup is
active on that CPU again.
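
In terms of the syscall, a cgroup event is opened per cpu like any
other event, except that the pid argument is a file descriptor for the
cgroup directory and PERF_FLAG_PID_CGROUP is passed. An untested
sketch (needs <fcntl.h>; the cgroup-v1 mount path and the cgroup name
foo are assumptions):

    int cgrp_fd = open("/sys/fs/cgroup/perf_event/foo", O_RDONLY);
    /* "count cycles on this cpu only while a thread from foo runs" */
    int fd = perf_event_open(&attr, cgrp_fd, cpu, -1, PERF_FLAG_PID_CGROUP);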


Hope this helps.

>
> -Will
>>
>> On Fri, Mar 13, 2015 at 8:49 PM, William Cohen <wcohen@redhat.com> wrote:
>>> Hi All,
>>>
>>> I have a design question about the Linux kernel perf support. A number of /proc statistics aggregate data across all the cpus in the system.  Why does perf require the user-space application to enumerate all the processors and do a perf_event_open syscall for each of the processors?  Why not have a perf_event_open with pid=-1 and cpu=-1 mean a system-wide event and aggregate it in the kernel when the value is read?  The line below from design.txt specifically says it is invalid.
>>>
>>> (Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)
>>>
>>> -Will


* Re: Why the need to do a perf_event_open syscall for each cpu on the system?
  2015-03-16 14:47   ` William Cohen
  2015-03-17  0:51     ` Stephane Eranian
@ 2015-03-17 14:40     ` Andi Kleen
  2015-03-17 15:30       ` William Cohen
  1 sibling, 1 reply; 7+ messages in thread
From: Andi Kleen @ 2015-03-17 14:40 UTC (permalink / raw)
  To: William Cohen; +Cc: Elazar Leibovich, linux-perf-users, Stephane Eranian

William Cohen <wcohen@redhat.com> writes:
>
> Making user-space set up performance events for each cpu certainly
> simplifies the kernel code for system-wide monitoring. The cgroup
> support is essentially system-wide monitoring with additional
> filtering on the cgroup, and things get more complicated with the
> perf cgroup support when the cgroups are not pinned to a particular
> processor: O(cgroups*cpus) opens and reads.  If the number of cgroups
> is scaled up at the same rate as the number of cpus, this would be
> O(cpus^2).  I am wondering

Using O() notation here is misleading because a perf event 
is not an algorithmic step. It's just a data structure in memory,
associated with a file descriptor.  But the number of active
events at a time is always limited by the number of counters
in the CPU (ignoring software events here) and is comparably
small.

The memory usage is not a significant problem; it is dwarfed by other
per-CPU data structures.  Usually the main problem people run into is
running out of file descriptors, because most systems still run with a
ulimit -n default of 1024, which is easy to reach with even a small
number of event groups on a system with a moderate number of CPUs.

However, ulimit -n can easily be fixed: just increase it. Arguably,
the distribution defaults should be increased.
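
A monitoring tool can also raise its own soft limit up to the hard
limit at startup, with no system-wide change; a minimal sketch:

    struct rlimit rl;   /* needs <sys/resource.h> */
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
        rl.rlim_cur = rl.rlim_max;   /* soft limit -> hard limit */
        setrlimit(RLIMIT_NOFILE, &rl);
    }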

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only


* Re: Why the need to do a perf_event_open syscall for each cpu on the system?
  2015-03-17 14:40     ` Andi Kleen
@ 2015-03-17 15:30       ` William Cohen
  0 siblings, 0 replies; 7+ messages in thread
From: William Cohen @ 2015-03-17 15:30 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Elazar Leibovich, linux-perf-users, Stephane Eranian

On 03/17/2015 10:40 AM, Andi Kleen wrote:
> William Cohen <wcohen@redhat.com> writes:
>>
>> Making user-space set up performance events for each cpu certainly
>> simplifies the kernel code for system-wide monitoring. The cgroup
>> support is essentially like system-wide monitoring with additional
>> filtering on the cgroup and things get more complicated using the perf
>> cgroup support when the cgroups are not pinned to a particular
>> processor, O(cgroups*cpus) opens and reads.  If the cgroups is scaled
>> up at the same rate as cpus, this would be O(cpus^2).  I am wondering
> 
> Using O() notation here is misleading because a perf event 
> is not an algorithmic step. It's just a data structure in memory,
> associated with a file descriptor.  But the number of active
> events at a time is always limited by the number of counters
> in the CPU (ignoring software events here) and is comparably
> small.
> 
> The memory usage is not a significant problem; it is dwarfed by other
> per-CPU data structures.  Usually the main problem people run into is
> running out of file descriptors, because most systems still run with a
> ulimit -n default of 1024, which is easy to reach with even a small
> number of event groups on a system with a moderate number of CPUs.
> 
> However, ulimit -n can easily be fixed: just increase it. Arguably,
> the distribution defaults should be increased.
> 
> -Andi
> 

Hi Andi,

O() notation can be used to describe both time and space.  Reading the perf counters is O(cpus^2) in time.  As mentioned, the number of file descriptors required is going to grow large pretty quickly as the number of cpus/cgroups increases: 32 cpus and 32 cgroups would be 1024 file descriptors; 80 cpus and 80 cgroups would be 6400 file descriptors.  There are machines with more than 80 processors.  Does it make sense to have multiple thousands of file descriptors open for performance monitoring?

Making user-space responsible for opening and reading the counters on each processor simplifies the kernel code.  However, is making user-space do this a better solution than doing the system-wide setup and aggregation in the kernel?  How much overhead is there in all the user-space/kernel-space transitions needed to read hundreds of values out of the kernel, versus doing the aggregation in the kernel?

-Will


end of thread

Thread overview: 7+ messages
2015-03-13 18:49 Why the need to do a perf_event_open syscall for each cpu on the system? William Cohen
2015-03-13 21:14 ` Vince Weaver
2015-03-15  5:15 ` Elazar Leibovich
2015-03-16 14:47   ` William Cohen
2015-03-17  0:51     ` Stephane Eranian
2015-03-17 14:40     ` Andi Kleen
2015-03-17 15:30       ` William Cohen
