Kernel Newbies archive on lore.kernel.org
 help / color / Atom feed
* perf_event wakeup_events = 0
@ 2019-09-06 23:28 Theodore Dubois
  2019-09-07 13:40 ` Valdis Klētnieks
  0 siblings, 1 reply; 6+ messages in thread
From: Theodore Dubois @ 2019-09-06 23:28 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: linux-kernel, kernelnewbies

The man page for perf_event_open(2) says that recent kernels treat a 0
value for wakeup_events the same as 1, which I believe means it will
notify after a single sample. However, strace on perf(1) shows that it
uses wakeup_events=0, and it's definitely not waking up on every
sample (it seems to be waking up every few seconds.)
tools/perf/design.txt says "Normally a notification is generated for
every page filled". Is the documentation wrong, or am I
misunderstanding something?

~Theodore

_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: perf_event wakeup_events = 0
  2019-09-06 23:28 perf_event wakeup_events = 0 Theodore Dubois
@ 2019-09-07 13:40 ` Valdis Klētnieks
  2019-09-07 16:14   ` Theodore Dubois
  0 siblings, 1 reply; 6+ messages in thread
From: Valdis Klētnieks @ 2019-09-07 13:40 UTC (permalink / raw)
  To: Theodore Dubois; +Cc: kernelnewbies, a.p.zijlstra, linux-kernel

[-- Attachment #1.1: Type: text/plain, Size: 1827 bytes --]

On Fri, 06 Sep 2019 16:28:24 -0700, Theodore Dubois said:
> The man page for perf_event_open(2) says that recent kernels treat a 0
> value for wakeup_events the same as 1, which I believe means it will
> notify after a single sample. However, strace on perf(1) shows that it
> uses wakeup_events=0, and it's definitely not waking up on every
> sample (it seems to be waking up every few seconds.)
> tools/perf/design.txt says "Normally a notification is generated for
> every page filled". Is the documentation wrong, or am I
> misunderstanding something?

       wakeup_events, wakeup_watermark
              This  union sets how many samples (wakeup_events) or bytes (wakeup_watermark) happen before an overflow
              notification happens.  Which one is used is selected by the watermark bit flag.

              wakeup_events counts only PERF_RECORD_SAMPLE record types.  To receive overflow  notification  for  all
              PERF_RECORD types choose watermark and set wakeup_watermark to 1.

              Prior  to Linux 3.0, setting wakeup_events to 0 resulted in no overflow notifications; more recent ker?
              nels treat 0 the same as 1.

My reading of that is that in pre-3.0 kernels, you could choose to not get overflow
notifications, and now you'll get them whether or not you wanted them.

Under "overflow handling", we see:

       Overflows are generated only by sampling events (sample_period must have a nonzero value).

So the reason strace says perf is only waking up every few seconds is probably
because you either launched perf with options that only create trace events, or
it takes several seconds for an overflow to happen on a sampling event. A lot
of those fields are u64 counters, and won't overflow anytime soon.  Even the
u32 counters can take a few seconds to overflow....


[-- Attachment #1.2: Type: application/pgp-signature, Size: 832 bytes --]

[-- Attachment #2: Type: text/plain, Size: 170 bytes --]

_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: perf_event wakeup_events = 0
  2019-09-07 13:40 ` Valdis Klētnieks
@ 2019-09-07 16:14   ` Theodore Dubois
  2019-09-07 22:00     ` Valdis Klētnieks
  2019-09-07 22:45     ` Valdis Klētnieks
  0 siblings, 2 replies; 6+ messages in thread
From: Theodore Dubois @ 2019-09-07 16:14 UTC (permalink / raw)
  To: Valdis Klētnieks; +Cc: kernelnewbies, a.p.zijlstra, linux-kernel

>> The man page for perf_event_open(2) says that recent kernels treat a 0
>> value for wakeup_events the same as 1, which I believe means it will
>> notify after a single sample. However, strace on perf(1) shows that it
>> uses wakeup_events=0, and it's definitely not waking up on every
>> sample (it seems to be waking up every few seconds.)
>> tools/perf/design.txt says "Normally a notification is generated for
>> every page filled". Is the documentation wrong, or am I
>> misunderstanding something?
> 
>       wakeup_events, wakeup_watermark
>              This  union sets how many samples (wakeup_events) or bytes (wakeup_watermark) happen before an overflow
>              notification happens.  Which one is used is selected by the watermark bit flag.
> 
>              wakeup_events counts only PERF_RECORD_SAMPLE record types.  To receive overflow  notification  for  all
>              PERF_RECORD types choose watermark and set wakeup_watermark to 1.
> 
>              Prior  to Linux 3.0, setting wakeup_events to 0 resulted in no overflow notifications; more recent ker?
>              nels treat 0 the same as 1.
> 
> My reading of that is that in pre-3.0 kernels, you could choose to not get overflow
> notifications, and now you'll get them whether or not you wanted them.
> 
> Under "overflow handling", we see:
> 
>       Overflows are generated only by sampling events (sample_period must have a nonzero value).
> 
> So the reason strace says perf is only waking up every few seconds is probably
> because you either launched perf with options that only create trace events, or
> it takes several seconds for an overflow to happen on a sampling event. A lot
> of those fields are u64 counters, and won't overflow anytime soon.  Even the
> u32 counters can take a few seconds to overflow....

I launched perf record with the default options. Here’s one of the perf_event_open calls from strace:

08:57:37.083733 perf_event_open({type=PERF_TYPE_HARDWARE, size=PERF_ATTR_SIZE_VER5, config=PERF_COUNT_HW_CPU_CYCLES, sample_freq=4000, sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_PERIOD, read_format=PERF_FORMAT_ID, disabled=1, inherit=1, pinned=0, exclusive=0, exclusive_user=0, exclude_kernel=1, exclude_hv=0, exclude_idle=0, mmap=1, comm=1, freq=1, inherit_stat=0, enable_on_exec=1, task=1, watermark=0, precise_ip=3 /* must have 0 skid */, mmap_data=0, sample_id_all=1, exclude_host=0, exclude_guest=1, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=1, comm_exec=1, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, wakeup_events=0, config1=0, config2=0, sample_regs_user=0, sample_regs_intr=0, aux_watermark=0, sample_max_stack=0}, 134206, 18, -1, PERF_FLAG_FD_CLOEXEC) = 23 <0.000023>

sample_freq is 4000 (and freq is 1). Here’s the man page on this field:

       sample_period, sample_freq
              A "sampling" event is one that generates an  overflow  notifica‐
              tion  every N events, where N is given by sample_period.  A sam‐
              pling event has sample_period > 0.   When  an  overflow  occurs,
              requested  data is recorded in the mmap buffer.  The sample_type
              field controls what data is recorded on each overflow.

              sample_freq can be used if you wish to use frequency rather than
              period.   In  this case, you set the freq flag.  The kernel will
              adjust the sampling period to try and achieve the desired  rate.
              The rate of adjustment is a timer tick.

If I’m reading this right, this is a sampling event which overflows 4000 times a second. But perf then does a poll call which wakes up on this FD with POLLIN after 1.637 seconds, instead of 0.00025 seconds.

The man page snippet you pasted seems to be a different definition of an overflow event, but that doesn’t make sense either: the event generates a sample 4000 times a second, and 0 (supposedly the same as 1) would mean to wake up the FD after 1 sample, which would be 0.00025 seconds.

~Theodore
_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: perf_event wakeup_events = 0
  2019-09-07 16:14   ` Theodore Dubois
@ 2019-09-07 22:00     ` Valdis Klētnieks
  2019-09-07 22:45     ` Valdis Klētnieks
  1 sibling, 0 replies; 6+ messages in thread
From: Valdis Klētnieks @ 2019-09-07 22:00 UTC (permalink / raw)
  To: Theodore Dubois; +Cc: kernelnewbies, a.p.zijlstra, linux-kernel

[-- Warning: decoded text below may be mangled --]
[-- Attachment #1.1: Type: text/plain; charset=us-ascii, Size: 546 bytes --]

On Sat, 07 Sep 2019 09:14:49 -0700, Theodore Dubois said:

> If I’m reading this right, this is a sampling event which overflows 4000
> times a second. But perf then does a poll call which wakes up on this FD with
> POLLIN after 1.637 seconds, instead of 0.00025 seconds.

No, it *takes a sample* 4,000 times a second.  For instance, number of cache line
misses since the last sample.  You get an overflow when the counter wraps because
there have been more than 2^32 events since you read the counter.

At least that's my understanding of it.

[-- Attachment #1.2: Type: application/pgp-signature, Size: 832 bytes --]

[-- Attachment #2: Type: text/plain, Size: 170 bytes --]

_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: perf_event wakeup_events = 0
  2019-09-07 16:14   ` Theodore Dubois
  2019-09-07 22:00     ` Valdis Klētnieks
@ 2019-09-07 22:45     ` Valdis Klētnieks
  2019-09-07 23:27       ` Theodore Dubois
  1 sibling, 1 reply; 6+ messages in thread
From: Valdis Klētnieks @ 2019-09-07 22:45 UTC (permalink / raw)
  To: Theodore Dubois; +Cc: kernelnewbies, a.p.zijlstra, linux-kernel

[-- Warning: decoded text below may be mangled --]
[-- Attachment #1.1: Type: text/plain; charset=us-ascii, Size: 2292 bytes --]

On Sat, 07 Sep 2019 09:14:49 -0700, Theodore Dubois said:

Reading what it actually says rather than what I thought it said.. :)

       Events come in two flavors: counting and sampled.  A counting event  is
       one  that  is  used  for  counting  the aggregate number of events that
       occur.  In general, counting event results are gathered with a  read(2)
       call.   A  sampling  event periodically writes measurements to a buffer
       that can then be accessed via mmap(2).

For some reason, I was thinking counting events.  -ENOCAFFEINE. :)

> sample_freq is 4000 (and freq is 1). Here’s the man page on this field:
>
>        sample_period, sample_freq
>               A "sampling" event is one that generates an  overflow  notifica‐
>               tion  every N events, where N is given by sample_period.  A sam‐
>               pling event has sample_period > 0.

There's this part:
>               pling event has sample_period > 0.   When  an  overflow  occurs,
>               requested  data is recorded in the mmap buffer.  The sample_type
>               field controls what data is recorded on each overflow.

So an entry is made in the buffer. It's not clear that this immediately triggers
a signal...

   MMAP layout
       When using perf_event_open() in sampled mode, asynchronous events (like
       counter overflow or PROT_EXEC mmap tracking) are logged  into  a  ring-
       buffer.  This ring-buffer is created and accessed through mmap(2).

       The mmap size should be 1+2^n pages, where the first page is a metadata
       page (struct perf_event_mmap_page) that contains various bits of infor?
       mation such as where the ring-buffer head is.

So you need to look at what size mmap buffer is being allocated.  It's *probably*
on the order of megabytes, so that you can buffer a fairly large number of entries
and not take several user/kernel transitions on every single entry...

> If I’m reading this right, this is a sampling event which overflows 4000 times a second.

And 4,000 entries are made in the buffer per second..

> But perf then does a poll call which wakes up on this FD with POLLIN after
> 1.637 seconds, instead of 0.00025 seconds

At which point perf goes and looks at several thousand entries in the ring buffer...

[-- Attachment #1.2: Type: application/pgp-signature, Size: 832 bytes --]

[-- Attachment #2: Type: text/plain, Size: 170 bytes --]

_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: perf_event wakeup_events = 0
  2019-09-07 22:45     ` Valdis Klētnieks
@ 2019-09-07 23:27       ` Theodore Dubois
  0 siblings, 0 replies; 6+ messages in thread
From: Theodore Dubois @ 2019-09-07 23:27 UTC (permalink / raw)
  To: Valdis Klētnieks; +Cc: kernelnewbies, a.p.zijlstra, linux-kernel

On Sep 7, 2019, at 3:45 PM, Valdis Klētnieks <valdis.kletnieks@vt.edu> wrote:

> So an entry is made in the buffer. It's not clear that this immediately triggers
> a signal…

I think the documentation says it does when wakeup_events is 1. The code for
perf backs this up:
https://github.com/torvalds/linux/blob/a9815a4fa2fd297cab9fa7a12161b16657290293/tools/perf/util/evsel.c#L1051-L1054
The puzzle is what happens when wakeup_events is 0. The documentation saying
"more recent kernels treat 0 the same as 1" suggests it should behave the same,
but then why would perf set it to 1 after zero-initializing it?

> So you need to look at what size mmap buffer is being allocated.  It's *probably*
> on the order of megabytes, so that you can buffer a fairly large number of entries
> and not take several user/kernel transitions on every single entry…

It’s 512 KiB. Each sample is 40 bytes (the sample_type is IP | TID | TIME |
PERIOD, and each one of those 8 bytes). 40 bytes per sample * 4000 samples per
second * 1.637 seconds is 261920 which is almost exactly half the buffer.

So does wakeup_events = 0 means it causes a wakeup when the buffer is half
full? I don't see anything in the man page about this....

If you'd like to try yourself, this is the strace command I've been using:
strace -ttTv -eperf_event_open,mmap,poll -operf.strace perf record stress --cpu 1 --timeout 1

~Theodore

> 
> On Sat, 07 Sep 2019 09:14:49 -0700, Theodore Dubois said:
> 
> Reading what it actually says rather than what I thought it said.. :)
> 
>       Events come in two flavors: counting and sampled.  A counting event  is
>       one  that  is  used  for  counting  the aggregate number of events that
>       occur.  In general, counting event results are gathered with a  read(2)
>       call.   A  sampling  event periodically writes measurements to a buffer
>       that can then be accessed via mmap(2).
> 
> For some reason, I was thinking counting events.  -ENOCAFFEINE. :)
> 
>> sample_freq is 4000 (and freq is 1). Here’s the man page on this field:
>> 
>>       sample_period, sample_freq
>>              A "sampling" event is one that generates an  overflow  notifica‐
>>              tion  every N events, where N is given by sample_period.  A sam‐
>>              pling event has sample_period > 0.
> 
> There's this part:
>>              pling event has sample_period > 0.   When  an  overflow  occurs,
>>              requested  data is recorded in the mmap buffer.  The sample_type
>>              field controls what data is recorded on each overflow.
> 
> So an entry is made in the buffer. It's not clear that this immediately triggers
> a signal...
> 
>   MMAP layout
>       When using perf_event_open() in sampled mode, asynchronous events (like
>       counter overflow or PROT_EXEC mmap tracking) are logged  into  a  ring-
>       buffer.  This ring-buffer is created and accessed through mmap(2).
> 
>       The mmap size should be 1+2^n pages, where the first page is a metadata
>       page (struct perf_event_mmap_page) that contains various bits of infor?
>       mation such as where the ring-buffer head is.
> 
> So you need to look at what size mmap buffer is being allocated.  It's *probably*
> on the order of megabytes, so that you can buffer a fairly large number of entries
> and not take several user/kernel transitions on every single entry...
> 
>> If I’m reading this right, this is a sampling event which overflows 4000 times a second.
> 
> And 4,000 entries are made in the buffer per second..
> 
>> But perf then does a poll call which wakes up on this FD with POLLIN after
>> 1.637 seconds, instead of 0.00025 seconds
> 
> At which point perf goes and looks at several thousand entries in the ring buffer...


_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, back to index

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-06 23:28 perf_event wakeup_events = 0 Theodore Dubois
2019-09-07 13:40 ` Valdis Klētnieks
2019-09-07 16:14   ` Theodore Dubois
2019-09-07 22:00     ` Valdis Klētnieks
2019-09-07 22:45     ` Valdis Klētnieks
2019-09-07 23:27       ` Theodore Dubois

Kernel Newbies archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/kernelnewbies/0 kernelnewbies/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 kernelnewbies kernelnewbies/ https://lore.kernel.org/kernelnewbies \
		kernelnewbies@kernelnewbies.org kernelnewbies@archiver.kernel.org
	public-inbox-index kernelnewbies

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernelnewbies.kernelnewbies


AGPL code for this site: git clone https://public-inbox.org/ public-inbox