* [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
@ 2012-10-16 10:13 Stephane Eranian
  2012-10-16 17:23 ` Peter Zijlstra
  0 siblings, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2012-10-16 10:13 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, mingo, Paul Mackerras, Anton Blanchard,
	Will Deacon, ak, Pekka Enberg, Steven Rostedt, Robert Richter

Hi,

There are many situations where we want to correlate events happening at
the user level with samples recorded in the perf_event kernel sampling buffer.
For instance, we might want to correlate the call to a function or creation of
a file with samples. Similarly, when we want to monitor a JVM with jitted code,
we need to be able to correlate jitted code mappings with perf event samples
for symbolization.

Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
That causes each PERF_RECORD_SAMPLE to include a timestamp
generated by calling the local_clock() -> sched_clock_cpu() function.

To make correlating user vs. kernel samples easy, we would need to
access that sched_clock() functionality. However, none of the existing
clock calls permit this at this point. They all return timestamps which are
not using the same source and/or offset as sched_clock.

I believe a similar issue exists with the ftrace subsystem.

The problem needs to be addressed in a portable manner. Solutions
based on reading the TSC from user level to reconstruct sched_clock()
don't seem appropriate to me.

One possibility to address this limitation would be to extend clock_gettime()
with a new clock id, e.g., CLOCK_PERF.

However, I understand that sched_clock_cpu() provides ordering guarantees only
when invoked on the same CPU repeatedly, i.e., it's not globally synchronized.
But we already have to deal with this problem when merging samples obtained
from different CPU sampling buffers in per-thread mode. So this is not
necessarily a showstopper.

Alternatives could be to use uprobes, but that's less practical to set up.

Anyone with better ideas?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2012-10-16 10:13 [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples Stephane Eranian
@ 2012-10-16 17:23 ` Peter Zijlstra
  2012-10-18 19:33   ` Stephane Eranian
                     ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Peter Zijlstra @ 2012-10-16 17:23 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: LKML, mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter, tglx, John Stultz

On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
> Hi,
> 
> There are many situations where we want to correlate events happening at
> the user level with samples recorded in the perf_event kernel sampling buffer.
> For instance, we might want to correlate the call to a function or creation of
> a file with samples. Similarly, when we want to monitor a JVM with jitted code,
> we need to be able to correlate jitted code mappings with perf event samples
> for symbolization.
> 
> Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
> That causes each PERF_RECORD_SAMPLE to include a timestamp
> generated by calling the local_clock() -> sched_clock_cpu() function.
> 
> To make correlating user vs. kernel samples easy, we would need to
> access that sched_clock() functionality. However, none of the existing
> clock calls permit this at this point. They all return timestamps which are
> not using the same source and/or offset as sched_clock.
> 
> I believe a similar issue exists with the ftrace subsystem.
> 
> The problem needs to be addressed in a portable manner. Solutions
> based on reading the TSC from user level to reconstruct sched_clock()
> don't seem appropriate to me.
> 
> One possibility to address this limitation would be to extend clock_gettime()
> with a new clock id, e.g., CLOCK_PERF.
> 
> However, I understand that sched_clock_cpu() provides ordering guarantees only
> when invoked on the same CPU repeatedly, i.e., it's not globally synchronized.
> But we already have to deal with this problem when merging samples obtained
> from different CPU sampling buffers in per-thread mode. So this is not
> necessarily a showstopper.
> 
> Alternatives could be to use uprobes, but that's less practical to set up.
> 
> Anyone with better ideas?

You forgot to CC the time people ;-)

I've no problem with adding CLOCK_PERF (or another/better name).

Thomas, John?



* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2012-10-16 17:23 ` Peter Zijlstra
@ 2012-10-18 19:33   ` Stephane Eranian
  2012-11-10  2:04   ` John Stultz
  2013-02-01 14:18   ` Pawel Moll
  2 siblings, 0 replies; 65+ messages in thread
From: Stephane Eranian @ 2012-10-18 19:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter, tglx, John Stultz

On Tue, Oct 16, 2012 at 7:23 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>> [...]
>
> You forgot to CC the time people ;-)
>
I did not know where they were.

> I've no problem with adding CLOCK_PERF (or another/better name).
>
Ok, good.

> Thomas, John?
>
Any comment?


* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2012-10-16 17:23 ` Peter Zijlstra
  2012-10-18 19:33   ` Stephane Eranian
@ 2012-11-10  2:04   ` John Stultz
  2012-11-11 20:32     ` Stephane Eranian
  2012-11-13 20:58     ` Steven Rostedt
  2013-02-01 14:18   ` Pawel Moll
  2 siblings, 2 replies; 65+ messages in thread
From: John Stultz @ 2012-11-10  2:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, LKML, mingo, Paul Mackerras, Anton Blanchard,
	Will Deacon, ak, Pekka Enberg, Steven Rostedt, Robert Richter,
	tglx

On 10/16/2012 10:23 AM, Peter Zijlstra wrote:
> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>> [...]
> You forgot to CC the time people ;-)
>
> I've no problem with adding CLOCK_PERF (or another/better name).
Hrm. I'm not excited about exporting that sort of internal kernel
detail to userland.

The behavior and expectations of sched_clock() have changed over the
years, so I'm not sure it's wise to export it, since we'd have to
preserve its behavior from then on.

Also I worry that it will be abused in the same way that direct TSC
access is, where the seemingly better performance compared with the more
careful/correct CLOCK_MONOTONIC would cause developers to write fragile
userland code that will break when moved from one machine to the next.

I'd probably rather perf output timestamps to userland using sane clocks
(CLOCK_MONOTONIC), rather than trying to introduce a new time domain to
userland.   But I probably could be convinced I'm wrong.

thanks
-john



* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2012-11-10  2:04   ` John Stultz
@ 2012-11-11 20:32     ` Stephane Eranian
  2012-11-12 18:53       ` John Stultz
  2012-11-13 20:58     ` Steven Rostedt
  1 sibling, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2012-11-11 20:32 UTC (permalink / raw)
  To: John Stultz
  Cc: Peter Zijlstra, LKML, mingo, Paul Mackerras, Anton Blanchard,
	Will Deacon, ak, Pekka Enberg, Steven Rostedt, Robert Richter,
	tglx

On Sat, Nov 10, 2012 at 3:04 AM, John Stultz <john.stultz@linaro.org> wrote:
> On 10/16/2012 10:23 AM, Peter Zijlstra wrote:
>>
>> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>>> [...]
>>
>> You forgot to CC the time people ;-)
>>
>> I've no problem with adding CLOCK_PERF (or another/better name).
>
> Hrm. I'm not excited about exporting that sort of internal kernel detail to
> userland.
>
> The behavior and expectations of sched_clock() have changed over the years,
> so I'm not sure it's wise to export it, since we'd have to preserve its
> behavior from then on.
>
It's not about just exposing sched_clock(). We need to expose a time source
that is exactly equivalent to what perf_event uses internally. If sched_clock()
changes, then the perf_event clock will change too, and so would that new time
source for clock_gettime(). As long as everything remains consistent, we are
good.

> Also I worry that it will be abused in the same way that direct TSC access
> is, where the seemingly better performance compared with the more
> careful/correct CLOCK_MONOTONIC would cause developers to write fragile
> userland code that will break when moved from one machine to the next.
>
The only goal for this new time source is correlating user-level samples
with kernel-level samples, i.e., application-level events with a PMU
counter overflow, for instance. Anybody trying anything else would be on
their own.

clock_gettime(CLOCK_PERF): guaranteed to return the same time source as
that used by the perf_event subsystem to timestamp samples when
PERF_SAMPLE_TIME is requested in attr->sample_type.

> I'd probably rather perf output timestamps to userland using sane clocks
> (CLOCK_MONOTONIC), rather than trying to introduce a new time domain to
> userland.   But I probably could be convinced I'm wrong.
>
Can you get CLOCK_MONOTONIC efficiently and in ALL circumstances without
grabbing any locks? Because it would need to run from NMI context.


* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2012-11-11 20:32     ` Stephane Eranian
@ 2012-11-12 18:53       ` John Stultz
  2012-11-12 20:54         ` Stephane Eranian
  0 siblings, 1 reply; 65+ messages in thread
From: John Stultz @ 2012-11-12 18:53 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, LKML, mingo, Paul Mackerras, Anton Blanchard,
	Will Deacon, ak, Pekka Enberg, Steven Rostedt, Robert Richter,
	tglx

On 11/11/2012 12:32 PM, Stephane Eranian wrote:
> On Sat, Nov 10, 2012 at 3:04 AM, John Stultz <john.stultz@linaro.org> wrote:
>> On 10/16/2012 10:23 AM, Peter Zijlstra wrote:
>>> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>>>> [...]
>>> You forgot to CC the time people ;-)
>>>
>>> I've no problem with adding CLOCK_PERF (or another/better name).
>> Hrm. I'm not excited about exporting that sort of internal kernel details to
>> userland.
>>
>> The behavior and expectations from sched_clock() has changed over the years,
>> so I'm not sure its wise to export it, since we'd have to preserve its
>> behavior from then on.
>>
> It's not about just exposing sched_clock(). We need to expose a time source
> that is exactly equivalent to what perf_event uses internally. If sched_clock()
> changes, then perf_event clock will change too and so would that new time
> source for clock_gettime(). As long as everything remains consistent, we are
> good.

Sure, but I'm just hesitant to expose that sort of internal detail. If
we change it later, it's not just perf_events but also any other
applications that have come to depend on the particular behavior we
exposed.  We can claim "that was never promised", but it still leads to
a bad situation.

>> Also I worry that it will be abused in the same way that direct TSC access
>> is, where the seemingly better performance from the more careful/correct
>> CLOCK_MONOTONIC would cause developers to write fragile userland code that
>> will break when moved from one machine to the next.
>>
> The only goal for this new time source is for correlating user-level
> samples with
> kernel level samples, i.e., application level events with a PMU counter overflow
> for instance. Anybody trying anything else would be on their own.
>
> clock_gettime(CLOCK_PERF): guarantee to return the same time source as
> that used by the perf_event subsystem to timestamp samples when
> PERF_SAMPLE_TIME is requested in attr->sample_type.

I'm not familiar enough with perf's interfaces, but if you are going to
make this clockid bound so tightly with perf, could you maybe export a
perf timestamp from one of perf's interfaces rather than using the more
generic clock_gettime() interface?


>
>> I'd probably rather perf output timestamps to userland using sane clocks
>> (CLOCK_MONOTONIC), rather then trying to introduce a new time domain to
>> userland.   But I probably could be convinced I'm wrong.
>>
> Can you get CLOCK_MONOTONIC efficiently and in ALL circumstances without
> grabbing any locks because that would need to run from NMI context?
No, of course not; that's why we have sched_clock. But I'm suggesting we
consider changing what perf exports (via maybe interpolation/translation)
to be CLOCK_MONOTONIC-ish.


I'm not strongly objecting here, I just want to make sure other 
alternatives are explored before we start giving applications another 
internal kernel behavior dependent interface to hang themselves with.  :)

thanks
-john



* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2012-11-12 18:53       ` John Stultz
@ 2012-11-12 20:54         ` Stephane Eranian
  2012-11-12 22:39           ` John Stultz
  0 siblings, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2012-11-12 20:54 UTC (permalink / raw)
  To: John Stultz
  Cc: Peter Zijlstra, LKML, mingo, Paul Mackerras, Anton Blanchard,
	Will Deacon, ak, Pekka Enberg, Steven Rostedt, Robert Richter,
	tglx

On Mon, Nov 12, 2012 at 7:53 PM, John Stultz <john.stultz@linaro.org> wrote:
> On 11/11/2012 12:32 PM, Stephane Eranian wrote:
>>
>> On Sat, Nov 10, 2012 at 3:04 AM, John Stultz <john.stultz@linaro.org>
>> wrote:
>>>
>>> On 10/16/2012 10:23 AM, Peter Zijlstra wrote:
>>>>
>>>> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>>>>> [...]
>>>>
>>>> You forgot to CC the time people ;-)
>>>>
>>>> I've no problem with adding CLOCK_PERF (or another/better name).
>>>
>>> Hrm. I'm not excited about exporting that sort of internal kernel details
>>> to
>>> userland.
>>>
>>> The behavior and expectations from sched_clock() has changed over the
>>> years,
>>> so I'm not sure its wise to export it, since we'd have to preserve its
>>> behavior from then on.
>>>
>> It's not about just exposing sched_clock(). We need to expose a time
>> source
>> that is exactly equivalent to what perf_event uses internally. If
>> sched_clock()
>> changes, then perf_event clock will change too and so would that new time
>> source for clock_gettime(). As long as everything remains consistent, we
>> are
>> good.
>
>
> Sure, but I'm just hesitant to expose that sort of internal detail. If we
> change it later, it's not just perf_events but also any other applications
> that have come to depend on the particular behavior we exposed.  We can claim
> "that was never promised", but it still leads to a bad situation.
>
>
>>> Also I worry that it will be abused in the same way that direct TSC
>>> access
>>> is, where the seemingly better performance from the more careful/correct
>>> CLOCK_MONOTONIC would cause developers to write fragile userland code
>>> that
>>> will break when moved from one machine to the next.
>>>
>> The only goal for this new time source is for correlating user-level
>> samples with
>> kernel level samples, i.e., application level events with a PMU counter
>> overflow
>> for instance. Anybody trying anything else would be on their own.
>>
>> clock_gettime(CLOCK_PERF): guarantee to return the same time source as
>> that used by the perf_event subsystem to timestamp samples when
>> PERF_SAMPLE_TIME is requested in attr->sample_type.
>
>
> I'm not familiar enough with perf's interfaces, but if you are going to make
> this clockid bound so tightly with perf, could you maybe export a perf
> timestamp from one of perf's interfaces rather than using the more generic
> clock_gettime() interface?
>
Yeah, I considered that as well. But it is more complicated. The only syscall
we could extend for perf_events is ioctl(). But that one requires that an
event be created so we obtain a file descriptor for the ioctl() call.
So we'd have to pretend to program a dummy event just for the purpose
of obtaining a timestamp.
We could do that but that's not so nice. But more amenable to the

Keep in mind that the clock_gettime() would be used by programs which are not
self-monitoring but may be monitored externally by a tool such as perf. We just
need them to emit their events with a timestamp that can be correlated
offline with those of perf_events.

>
>
>>
>>> I'd probably rather perf output timestamps to userland using sane clocks
>>> (CLOCK_MONOTONIC), rather then trying to introduce a new time domain to
>>> userland.   But I probably could be convinced I'm wrong.
>>>
>> Can you get CLOCK_MONOTONIC efficiently and in ALL circumstances without
>> grabbing any locks because that would need to run from NMI context?
>
> No,  of course why we have sched_clock. But I'm suggesting we consider
> changing what perf exports (via maybe interpolation/translation) to be
> CLOCK_MONOTONIC-ish.
>
Explain to me the key difference between monotonic and what sched_clock()
is returning today. Does this have to do with the global monotonic vs.
the cpu-wide monotonic?

>
> I'm not strongly objecting here, I just want to make sure other alternatives
> are explored before we start giving applications another internal kernel
> behavior dependent interface to hang themselves with.  :)
>
> thanks
> -john
>


* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2012-11-12 20:54         ` Stephane Eranian
@ 2012-11-12 22:39           ` John Stultz
  0 siblings, 0 replies; 65+ messages in thread
From: John Stultz @ 2012-11-12 22:39 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, LKML, mingo, Paul Mackerras, Anton Blanchard,
	Will Deacon, ak, Pekka Enberg, Steven Rostedt, Robert Richter,
	tglx

On 11/12/2012 12:54 PM, Stephane Eranian wrote:
> On Mon, Nov 12, 2012 at 7:53 PM, John Stultz <john.stultz@linaro.org> wrote:
>> On 11/11/2012 12:32 PM, Stephane Eranian wrote:
>>> On Sat, Nov 10, 2012 at 3:04 AM, John Stultz <john.stultz@linaro.org>
>>> wrote:
>>>> Also I worry that it will be abused in the same way that direct TSC
>>>> access
>>>> is, where the seemingly better performance from the more careful/correct
>>>> CLOCK_MONOTONIC would cause developers to write fragile userland code
>>>> that
>>>> will break when moved from one machine to the next.
>>>>
>>> The only goal for this new time source is for correlating user-level
>>> samples with
>>> kernel level samples, i.e., application level events with a PMU counter
>>> overflow
>>> for instance. Anybody trying anything else would be on their own.
>>>
>>> clock_gettime(CLOCK_PERF): guarantee to return the same time source as
>>> that used by the perf_event subsystem to timestamp samples when
>>> PERF_SAMPLE_TIME is requested in attr->sample_type.
>>
>> I'm not familiar enough with perf's interfaces, but if you are going to make
>> this clockid bound so tightly with perf, could you maybe export a perf
>> timestamp from one of perf's interfaces rather then using the more generic
>> clock_gettime() interface?
>>
> Yeah, I considered that as well. But it is more complicated. The only syscall
> we could extend for perf_events is ioctl(). But that one requires that an
> event be created so we obtain a file descriptor for the ioctl() call.
> So we'd have to pretend to program a dummy event just for the purpose
> of obtaining a timestamp.
> We could do that but that's not so nice. But more amenable to the

Sorry, you trailed off.   Did you want to finish that thought? (I do 
that all the time.  :)

> Keep in mind that the clock_gettime() would be used by programs which are not
> self-monitoring but may be monitored externally by a tool such as perf. We just
> need them to emit their events with a timestamp that can be correlated
> offline with those of perf_events.

Again, forgive me for not really knowing much about perf here, but could
you have perf log an event when clock_gettime() was called, possibly
recording the returned value, so you could correlate that data yourself?


>>>> I'd probably rather perf output timestamps to userland using sane clocks
>>>> (CLOCK_MONOTONIC), rather then trying to introduce a new time domain to
>>>> userland.   But I probably could be convinced I'm wrong.
>>>>
>>> Can you get CLOCK_MONOTONIC efficiently and in ALL circumstances without
>>> grabbing any locks because that would need to run from NMI context?
>> No,  of course why we have sched_clock. But I'm suggesting we consider
>> changing what perf exports (via maybe interpolation/translation) to be
>> CLOCK_MONOTONIC-ish.
>>
> Explain to me the key difference between monotonic and what sched_clock()
> is returning today? Does this have to do with the global monotonic vs.
> the cpu-wide
> monotonic?

So CLOCK_MONOTONIC is the number of NTP-corrected (for accuracy) seconds
+ nsecs that the machine has been up for (so that doesn't include time
in suspend). It's promised to be globally monotonic across cpus.

In my understanding, sched_clock's definition has changed over time. It
used to be a fast but possibly inaccurate count of nanoseconds since
boot, but with suspend and other events it could reset/overflow, and its
users (then only the scheduler) would be able to deal with that. It also
wasn't guaranteed to be consistent across cpus.  So it was limited to
calculating approximate time intervals on a single cpu.

However, with cfs (and Peter or Ingo could probably hop in and clarify
further) I believe it started to require some cross-cpu consistency, and
reset events would cause problems with the scheduler, so additional
layers have been added to try to enforce these additional requirements.

I suspect they aren't that far off, except that calibration frequency
errors go uncorrected with sched_clock. But I was thinking you could get
periodic timestamps in perf that correlated CLOCK_MONOTONIC with
sched_clock and then allow the kernel to interpolate the sched_clock
times out to something pretty close to CLOCK_MONOTONIC. That way perf
wouldn't leak the sched_clock time domain to userland.

Again, sorry for being a pain here. CLOCK_PERF would be an easy
solution, but I just want to make sure it's really the best one long term.

thanks
-john





* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2012-11-10  2:04   ` John Stultz
  2012-11-11 20:32     ` Stephane Eranian
@ 2012-11-13 20:58     ` Steven Rostedt
  2012-11-14 22:26       ` John Stultz
  1 sibling, 1 reply; 65+ messages in thread
From: Steven Rostedt @ 2012-11-13 20:58 UTC (permalink / raw)
  To: John Stultz
  Cc: Peter Zijlstra, Stephane Eranian, LKML, mingo, Paul Mackerras,
	Anton Blanchard, Will Deacon, ak, Pekka Enberg, Robert Richter,
	tglx

On Fri, 2012-11-09 at 18:04 -0800, John Stultz wrote:
> On 10/16/2012 10:23 AM, Peter Zijlstra wrote:

> > I've no problem with adding CLOCK_PERF (or another/better name).
> Hrm. I'm not excited about exporting that sort of internal kernel 
> details to userland.
> 
> The behavior and expectations from sched_clock() have changed over the 
> years, so I'm not sure it's wise to export it, since we'd have to 
> preserve its behavior from then on.
> 
> Also I worry that it will be abused in the same way that direct TSC 
> access is, where the seemingly better performance from the more 
> careful/correct CLOCK_MONOTONIC would cause developers to write fragile 
> userland code that will break when moved from one machine to the next.
> 
> I'd probably rather perf output timestamps to userland using sane clocks 
> (CLOCK_MONOTONIC), rather than trying to introduce a new time domain to 
> userland.   But I probably could be convinced I'm wrong.

I'm surprised that perf has its own clock anyway. But I would like to
export the tracing clocks. We have three (well four) of them:

trace_clock_local() which is defined to be a very fast clock but may not
be synced with other cpus (basically, it just calls sched_clock).

trace_clock() which is not totally serialized, but also not totally off
(between local and global). This uses local_clock() which is the same
thing that perf_clock() uses.

trace_clock_global() which is a monotonic clock across CPUs. It's much
slower than the above, but works well when you require synced
timestamps.

There's also trace_clock_counter() which isn't even a clock :-)  It's
just an incremental atomic counter that goes up every time it's called.
This is the most synced clock, but is absolutely meaningless for
timestamps. It's just a way to show ordered events.

-- Steve



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2012-11-13 20:58     ` Steven Rostedt
@ 2012-11-14 22:26       ` John Stultz
  2012-11-14 23:30         ` Steven Rostedt
  0 siblings, 1 reply; 65+ messages in thread
From: John Stultz @ 2012-11-14 22:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Stephane Eranian, LKML, mingo, Paul Mackerras,
	Anton Blanchard, Will Deacon, ak, Pekka Enberg, Robert Richter,
	tglx

On 11/13/2012 12:58 PM, Steven Rostedt wrote:
> On Fri, 2012-11-09 at 18:04 -0800, John Stultz wrote:
>> On 10/16/2012 10:23 AM, Peter Zijlstra wrote:
>>> I've no problem with adding CLOCK_PERF (or another/better name).
>> Hrm. I'm not excited about exporting that sort of internal kernel
>> details to userland.
>>
>> The behavior and expectations from sched_clock() has changed over the
>> years, so I'm not sure its wise to export it, since we'd have to
>> preserve its behavior from then on.
>>
>> Also I worry that it will be abused in the same way that direct TSC
>> access is, where the seemingly better performance from the more
>> careful/correct CLOCK_MONOTONIC would cause developers to write fragile
>> userland code that will break when moved from one machine to the next.
>>
>> I'd probably rather perf output timestamps to userland using sane clocks
>> (CLOCK_MONOTONIC), rather then trying to introduce a new time domain to
>> userland.   But I probably could be convinced I'm wrong.
> I'm surprised that perf has its own clock anyway. But I would like to
> export the tracing clocks. We have three (well four) of them:
>
> trace_clock_local() which is defined to be a very fast clock but may not
> be synced with other cpus (basically, it just calls sched_clock).
>
> trace_clock() which is not totally serialized, but also not totally off
> (between local and global). This uses local_clock() which is the same
> thing that perf_clock() uses.
>
> trace_clock_global() which is a monotonic clock across CPUs. It's much
> slower than the above, but works well when you require synced
> timestamps.
>
> There's also trace_clock_counter() which isn't even a clock :-)  It's
> just an incremental atomic counter that goes up every time it's called.
> This is the most synced clock, but is absolutely meaningless for
> timestamps. It's just a way to show ordered events.

Oof. This is getting uglier. I'd really prefer not to expose all these 
different internal clocks out to userland, especially via clock_gettime().

thanks
-john


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2012-11-14 22:26       ` John Stultz
@ 2012-11-14 23:30         ` Steven Rostedt
  0 siblings, 0 replies; 65+ messages in thread
From: Steven Rostedt @ 2012-11-14 23:30 UTC (permalink / raw)
  To: John Stultz
  Cc: Peter Zijlstra, Stephane Eranian, LKML, mingo, Paul Mackerras,
	Anton Blanchard, Will Deacon, ak, Pekka Enberg, Robert Richter,
	tglx

On Wed, 2012-11-14 at 14:26 -0800, John Stultz wrote:

> > There's also trace_clock_counter() which isn't even a clock :-)  It's
> > just an incremental atomic counter that goes up every time it's called.
> > This is the most synced clock, but is absolutely meaningless for
> > timestamps. It's just a way to show ordered events.
> 
> Oof. This is getting uglier. I'd really prefer not to expose all these 
> different internal clocks out to userland, especially via clock_gettime().

Actually, I would be happy to just expose them to modules, since things
like hwlat_detect could use them.

-- Steve



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2012-10-16 17:23 ` Peter Zijlstra
  2012-10-18 19:33   ` Stephane Eranian
  2012-11-10  2:04   ` John Stultz
@ 2013-02-01 14:18   ` Pawel Moll
  2013-02-05 21:18     ` David Ahern
                       ` (2 more replies)
  2 siblings, 3 replies; 65+ messages in thread
From: Pawel Moll @ 2013-02-01 14:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, LKML, mingo, Paul Mackerras, Anton Blanchard,
	Will Deacon, ak, Pekka Enberg, Steven Rostedt, Robert Richter,
	tglx, John Stultz

Hello,

I'd like to revive the topic...

On Tue, 2012-10-16 at 18:23 +0100, Peter Zijlstra wrote:
> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
> > Hi,
> > 
> > There are many situations where we want to correlate events happening at
> > the user level with samples recorded in the perf_event kernel sampling buffer.
> > For instance, we might want to correlate the call to a function or creation of
> > a file with samples. Similarly, when we want to monitor a JVM with jitted code,
> > we need to be able to correlate jitted code mappings with perf event samples
> > for symbolization.
> > 
> > Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
> > That causes each PERF_RECORD_SAMPLE to include a timestamp
> > generated by calling the local_clock() -> sched_clock_cpu() function.
> > 
> > To make correlating user vs. kernel samples easy, we would need to
> > access that sched_clock() functionality. However, none of the existing
> > clock calls permit this at this point. They all return timestamps which are
> > not using the same source and/or offset as sched_clock.
> > 
> > I believe a similar issue exists with the ftrace subsystem.
> > 
> > The problem needs to be addressed in a portable manner. Solutions
> > based on reading TSC for the user level to reconstruct sched_clock()
> > don't seem appropriate to me.
> > 
> > One possibility to address this limitation would be to extend clock_gettime()
> > with a new clock time, e.g., CLOCK_PERF.
> > 
> > However, I understand that sched_clock_cpu() provides ordering guarantees only
> > when invoked on the same CPU repeatedly, i.e., it's not globally synchronized.
> > But we already have to deal with this problem when merging samples obtained
> > from different CPU sampling buffer in per-thread mode. So this is not
> > necessarily
> > a showstopper.
> > 
> > Alternatives could be to use uprobes but that's less practical to setup.
> > 
> > Anyone with better ideas?
> 
> You forgot to CC the time people ;-)
> 
> I've no problem with adding CLOCK_PERF (or another/better name).
> 
> Thomas, John?

I've just faced the same issue - correlating an event in userspace with
data from the perf stream, but to my mind what I want to get is a value
returned by perf_clock() _in the current "session" context_.

Stephane didn't like the idea of opening a "fake" perf descriptor in
order to get the timestamp, but surely one must have the "session"
already running to be interested in such data in the first place? So I
think the ioctl() idea is not out of place here... How about the simple
change below?

Regards

Pawel

8<---
>From 2ad51a27fbf64bf98cee190efc3fbd7002819692 Mon Sep 17 00:00:00 2001
From: Pawel Moll <pawel.moll@arm.com>
Date: Fri, 1 Feb 2013 14:03:56 +0000
Subject: [PATCH] perf: Add ioctl to return current time value

To correlate user space events with the perf events stream,
a current (as in: "what time(stamp) is it now?") time value
must be made available.

This patch adds a perf ioctl that makes this possible.

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---
 include/uapi/linux/perf_event.h |    1 +
 kernel/events/core.c            |    8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 4f63c05..b745fb0 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -316,6 +316,7 @@ struct perf_event_attr {
 #define PERF_EVENT_IOC_PERIOD		_IOW('$', 4, __u64)
 #define PERF_EVENT_IOC_SET_OUTPUT	_IO ('$', 5)
 #define PERF_EVENT_IOC_SET_FILTER	_IOW('$', 6, char *)
+#define PERF_EVENT_IOC_GET_TIME		_IOR('$', 7, __u64)
 
 enum perf_event_ioc_flags {
 	PERF_IOC_FLAG_GROUP		= 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 301079d..4202b1c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3298,6 +3298,14 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case PERF_EVENT_IOC_SET_FILTER:
 		return perf_event_set_filter(event, (void __user *)arg);
 
+	case PERF_EVENT_IOC_GET_TIME:
+	{
+		u64 time = perf_clock();
+		if (copy_to_user((void __user *)arg, &time, sizeof(time)))
+			return -EFAULT;
+		return 0;
+	}
+
 	default:
 		return -ENOTTY;
 	}
-- 
1.7.10.4



^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-01 14:18   ` Pawel Moll
@ 2013-02-05 21:18     ` David Ahern
  2013-02-05 22:13     ` Stephane Eranian
  2013-06-26 16:49     ` David Ahern
  2 siblings, 0 replies; 65+ messages in thread
From: David Ahern @ 2013-02-05 21:18 UTC (permalink / raw)
  To: Pawel Moll, Peter Zijlstra, mingo
  Cc: Stephane Eranian, LKML, Paul Mackerras, Anton Blanchard,
	Will Deacon, ak, Pekka Enberg, Steven Rostedt, Robert Richter,
	tglx, John Stultz

On 2/1/13 7:18 AM, Pawel Moll wrote:
> 8<---
>  From 2ad51a27fbf64bf98cee190efc3fbd7002819692 Mon Sep 17 00:00:00 2001
> From: Pawel Moll <pawel.moll@arm.com>
> Date: Fri, 1 Feb 2013 14:03:56 +0000
> Subject: [PATCH] perf: Add ioctl to return current time value
>
> To co-relate user space events with the perf events stream
> a current (as in: "what time(stamp) is it now?") time value
> must be made available.
>
> This patch adds a perf ioctl that makes this possible.

Are there any objections to this approach? I need a solution for my 
time-of-day patch as well.

David

>
> Signed-off-by: Pawel Moll <pawel.moll@arm.com>
> ---
>   include/uapi/linux/perf_event.h |    1 +
>   kernel/events/core.c            |    8 ++++++++
>   2 files changed, 9 insertions(+)
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 4f63c05..b745fb0 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -316,6 +316,7 @@ struct perf_event_attr {
>   #define PERF_EVENT_IOC_PERIOD		_IOW('$', 4, __u64)
>   #define PERF_EVENT_IOC_SET_OUTPUT	_IO ('$', 5)
>   #define PERF_EVENT_IOC_SET_FILTER	_IOW('$', 6, char *)
> +#define PERF_EVENT_IOC_GET_TIME		_IOR('$', 7, __u64)
>
>   enum perf_event_ioc_flags {
>   	PERF_IOC_FLAG_GROUP		= 1U << 0,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 301079d..4202b1c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3298,6 +3298,14 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>   	case PERF_EVENT_IOC_SET_FILTER:
>   		return perf_event_set_filter(event, (void __user *)arg);
>
> +	case PERF_EVENT_IOC_GET_TIME:
> +	{
> +		u64 time = perf_clock();
> +		if (copy_to_user((void __user *)arg, &time, sizeof(time)))
> +			return -EFAULT;
> +		return 0;
> +	}
> +
>   	default:
>   		return -ENOTTY;
>   	}
>


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-01 14:18   ` Pawel Moll
  2013-02-05 21:18     ` David Ahern
@ 2013-02-05 22:13     ` Stephane Eranian
  2013-02-05 22:28       ` John Stultz
  2013-02-06 18:17       ` Pawel Moll
  2013-06-26 16:49     ` David Ahern
  2 siblings, 2 replies; 65+ messages in thread
From: Stephane Eranian @ 2013-02-05 22:13 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Peter Zijlstra, LKML, mingo, Paul Mackerras, Anton Blanchard,
	Will Deacon, ak, Pekka Enberg, Steven Rostedt, Robert Richter,
	tglx, John Stultz

On Fri, Feb 1, 2013 at 3:18 PM, Pawel Moll <pawel.moll@arm.com> wrote:
> Hello,
>
> I'd like to revive the topic...
>
> On Tue, 2012-10-16 at 18:23 +0100, Peter Zijlstra wrote:
>> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>> > Hi,
>> >
>> > There are many situations where we want to correlate events happening at
>> > the user level with samples recorded in the perf_event kernel sampling buffer.
>> > For instance, we might want to correlate the call to a function or creation of
>> > a file with samples. Similarly, when we want to monitor a JVM with jitted code,
>> > we need to be able to correlate jitted code mappings with perf event samples
>> > for symbolization.
>> >
>> > Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
>> > That causes each PERF_RECORD_SAMPLE to include a timestamp
>> > generated by calling the local_clock() -> sched_clock_cpu() function.
>> >
>> > To make correlating user vs. kernel samples easy, we would need to
>> > access that sched_clock() functionality. However, none of the existing
>> > clock calls permit this at this point. They all return timestamps which are
>> > not using the same source and/or offset as sched_clock.
>> >
>> > I believe a similar issue exists with the ftrace subsystem.
>> >
>> > The problem needs to be addressed in a portable manner. Solutions
>> > based on reading TSC for the user level to reconstruct sched_clock()
>> > don't seem appropriate to me.
>> >
>> > One possibility to address this limitation would be to extend clock_gettime()
>> > with a new clock time, e.g., CLOCK_PERF.
>> >
>> > However, I understand that sched_clock_cpu() provides ordering guarantees only
>> > when invoked on the same CPU repeatedly, i.e., it's not globally synchronized.
>> > But we already have to deal with this problem when merging samples obtained
>> > from different CPU sampling buffer in per-thread mode. So this is not
>> > necessarily
>> > a showstopper.
>> >
>> > Alternatives could be to use uprobes but that's less practical to setup.
>> >
>> > Anyone with better ideas?
>>
>> You forgot to CC the time people ;-)
>>
>> I've no problem with adding CLOCK_PERF (or another/better name).
>>
>> Thomas, John?
>
> I've just faced the same issue - correlating an event in userspace with
> data from the perf stream, but to my mind what I want to get is a value
> returned by perf_clock() _in the current "session" context_.
>
> Stephane didn't like the idea of opening a "fake" perf descriptor in
> order to get the timestamp, but surely one must have the "session"
> already running to be interested in such data in the first place? So I
> think the ioctl() idea is not out of place here... How about the simple
> change below?
>
The app requesting the timestamp may not necessarily have an active
perf session. And by that I mean, it may not be self-monitoring. But it
could be monitored by an external tool such as perf, without necessarily
knowing it.

The timestamp is global or at least per-cpu. It is not tied to a particular
active event.

The thing I did not like about ioctl() is that it now means that the app
needs to become a user of the perf_event API. It needs to program
a dummy event just to get a timestamp, as opposed to just calling
clock_gettime(CLOCK_PERF), which guarantees a clock
source identical to that used by perf_events. In that case, the app
timestamps its events in such a way that if it was monitored externally,
that external tool would be able to correlate all the samples, because they
would all have the same time source.

But if people are strongly opposed to the clock_gettime() approach, then
I can go with the ioctl(), because the functionality is definitely needed
ASAP.



> 8<---
> From 2ad51a27fbf64bf98cee190efc3fbd7002819692 Mon Sep 17 00:00:00 2001
> From: Pawel Moll <pawel.moll@arm.com>
> Date: Fri, 1 Feb 2013 14:03:56 +0000
> Subject: [PATCH] perf: Add ioctl to return current time value
>
> To co-relate user space events with the perf events stream
> a current (as in: "what time(stamp) is it now?") time value
> must be made available.
>
> This patch adds a perf ioctl that makes this possible.
>
> Signed-off-by: Pawel Moll <pawel.moll@arm.com>
> ---
>  include/uapi/linux/perf_event.h |    1 +
>  kernel/events/core.c            |    8 ++++++++
>  2 files changed, 9 insertions(+)
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 4f63c05..b745fb0 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -316,6 +316,7 @@ struct perf_event_attr {
>  #define PERF_EVENT_IOC_PERIOD          _IOW('$', 4, __u64)
>  #define PERF_EVENT_IOC_SET_OUTPUT      _IO ('$', 5)
>  #define PERF_EVENT_IOC_SET_FILTER      _IOW('$', 6, char *)
> +#define PERF_EVENT_IOC_GET_TIME                _IOR('$', 7, __u64)
>
>  enum perf_event_ioc_flags {
>         PERF_IOC_FLAG_GROUP             = 1U << 0,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 301079d..4202b1c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3298,6 +3298,14 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>         case PERF_EVENT_IOC_SET_FILTER:
>                 return perf_event_set_filter(event, (void __user *)arg);
>
> +       case PERF_EVENT_IOC_GET_TIME:
> +       {
> +               u64 time = perf_clock();
> +               if (copy_to_user((void __user *)arg, &time, sizeof(time)))
> +                       return -EFAULT;
> +               return 0;
> +       }
> +
>         default:
>                 return -ENOTTY;
>         }
> --
> 1.7.10.4
>
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-05 22:13     ` Stephane Eranian
@ 2013-02-05 22:28       ` John Stultz
  2013-02-06  1:19         ` Steven Rostedt
  2013-02-18 20:35         ` Thomas Gleixner
  2013-02-06 18:17       ` Pawel Moll
  1 sibling, 2 replies; 65+ messages in thread
From: John Stultz @ 2013-02-05 22:28 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Pawel Moll, Peter Zijlstra, LKML, mingo, Paul Mackerras,
	Anton Blanchard, Will Deacon, ak, Pekka Enberg, Steven Rostedt,
	Robert Richter, tglx

On 02/05/2013 02:13 PM, Stephane Eranian wrote:
> On Fri, Feb 1, 2013 at 3:18 PM, Pawel Moll <pawel.moll@arm.com> wrote:
>> Hello,
>>
>> I'd like to revive the topic...
>>
>> On Tue, 2012-10-16 at 18:23 +0100, Peter Zijlstra wrote:
>>> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>>>> Hi,
>>>>
>>>> There are many situations where we want to correlate events happening at
>>>> the user level with samples recorded in the perf_event kernel sampling buffer.
>>>> For instance, we might want to correlate the call to a function or creation of
>>>> a file with samples. Similarly, when we want to monitor a JVM with jitted code,
>>>> we need to be able to correlate jitted code mappings with perf event samples
>>>> for symbolization.
>>>>
>>>> Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
>>>> That causes each PERF_RECORD_SAMPLE to include a timestamp
>>>> generated by calling the local_clock() -> sched_clock_cpu() function.
>>>>
>>>> To make correlating user vs. kernel samples easy, we would need to
>>>> access that sched_clock() functionality. However, none of the existing
>>>> clock calls permit this at this point. They all return timestamps which are
>>>> not using the same source and/or offset as sched_clock.
>>>>
>>>> I believe a similar issue exists with the ftrace subsystem.
>>>>
>>>> The problem needs to be addressed in a portable manner. Solutions
>>>> based on reading TSC for the user level to reconstruct sched_clock()
>>>> don't seem appropriate to me.
>>>>
>>>> One possibility to address this limitation would be to extend clock_gettime()
>>>> with a new clock time, e.g., CLOCK_PERF.
>>>>
>>>> However, I understand that sched_clock_cpu() provides ordering guarantees only
>>>> when invoked on the same CPU repeatedly, i.e., it's not globally synchronized.
>>>> But we already have to deal with this problem when merging samples obtained
>>>> from different CPU sampling buffer in per-thread mode. So this is not
>>>> necessarily
>>>> a showstopper.
>>>>
>>>> Alternatives could be to use uprobes but that's less practical to setup.
>>>>
>>>> Anyone with better ideas?
>>> You forgot to CC the time people ;-)
>>>
>>> I've no problem with adding CLOCK_PERF (or another/better name).
>>>
>>> Thomas, John?
>> I've just faced the same issue - correlating an event in userspace with
>> data from the perf stream, but to my mind what I want to get is a value
>> returned by perf_clock() _in the current "session" context_.
>>
>> Stephane didn't like the idea of opening a "fake" perf descriptor in
>> order to get the timestamp, but surely one must have the "session"
>> already running to be interested in such data in the first place? So I
>> think the ioctl() idea is not out of place here... How about the simple
>> change below?
>>
> The app requesting the timestamp may not necessarily have an active
> perf session. And by that I mean, it may not be self-monitoring. But it
> could be monitored by an external tool such as perf, without necessarily
> knowing it.
>
> The timestamp is global or at least per-cpu. It is not tied to a particular
> active event.
>
> The thing I did not like about ioctl() is that it now means that the app
> needs to become a user of the perf_event API. It needs to program
> a dummy event just to get a timestamp. As opposed to just calling
> a clock_gettime(CLOCK_PERF) function which guarantees a clock
> source identical to that used by perf_events. In that case, the app
> timestamps its events in such a way that if it was monitored externally,
> that external tool would be able to correlate all the samples because they
> would all have the same time source.
>
> But if people are strongly opposed to the clock_gettime() approach, then
> I can go with the ioctl() because the functionality is definitely needed
> ASAP.

I prefer the ioctl method, since it's less likely to be re-purposed/misused.

Though I'd be most comfortable with finding some way for perf timestamps 
to be CLOCK_MONOTONIC based (or maybe CLOCK_MONOTONIC_RAW if it would be 
easier), and just avoid altogether adding another time domain that 
doesn't really have a clear definition (other than "what perf uses").

thanks
-john

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-05 22:28       ` John Stultz
@ 2013-02-06  1:19         ` Steven Rostedt
  2013-02-06 18:17           ` Pawel Moll
  2013-02-18 20:35         ` Thomas Gleixner
  1 sibling, 1 reply; 65+ messages in thread
From: Steven Rostedt @ 2013-02-06  1:19 UTC (permalink / raw)
  To: John Stultz
  Cc: Stephane Eranian, Pawel Moll, Peter Zijlstra, LKML, mingo,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Robert Richter, tglx

On Tue, 2013-02-05 at 14:28 -0800, John Stultz wrote:
> I prefer the ioctl method, since it's less likely to be re-purposed/misused.
> 
> Though I'd be most comfortable with finding some way for perf-timestamps 
> to be CLOCK_MONOTONIC based (or maybe CLOCK_MONOTONIC_RAW if it would be 
> easier), and just avoid altogether adding another time domain that 
> doesn't really have a clear definition (other than "what perf uses").

Perhaps add a new perf system call? Does everything need to go through
the one great mighty perf_ioctl aka sys_perf_event_open()? I mean, if
there's something that can be agnostic to an event, but still very much
related to perf, perhaps another perf syscall should be created.

If people are worried about adding a bunch of new perf syscalls, maybe
add a sys_perf_control() system call that works like an ioctl but
without a file descriptor: something for operations that don't require
an event attached to them, like retrieving the timestamp that perf
uses, but done in a way that could be used for other perf-related
things that do not require an event.

-- Steve



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-06  1:19         ` Steven Rostedt
@ 2013-02-06 18:17           ` Pawel Moll
  2013-02-13 20:00             ` Stephane Eranian
  0 siblings, 1 reply; 65+ messages in thread
From: Pawel Moll @ 2013-02-06 18:17 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: John Stultz, Stephane Eranian, Peter Zijlstra, LKML, mingo,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Robert Richter, tglx

On Wed, 2013-02-06 at 01:19 +0000, Steven Rostedt wrote:
> If people are worried about adding a bunch of new perf syscalls, maybe
> add a sys_perf_control() system call that works like an ioctl but
> without a file descriptor. Something for things that don't require an
> event attached to it, like to retrieve a time stamp counter that perf
> uses, but done in a way that it could be used for other things perf
> related that does not require an event.

Something along these lines? (completely untested and of course missing
all the #defines __NR_perf_control xxx)

8<-----------------
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 4f63c05..be7409b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -322,6 +322,11 @@ enum perf_event_ioc_flags {
 };
 
 /*
+ * Command codes for ioctl-like sys_perf_control interface:
+ */
+#define PERF_CONTROL_GET_TIME		_IOR('$', 0, __u64)
+
+/*
  * Structure of the page that can be mapped via mmap
  */
 struct perf_event_mmap_page {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 301079d..750404d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6678,6 +6678,29 @@ err_fd:
 }
 
 /**
+ * sys_perf_control - ioctl-like interface to control system-wide
+ *		      perf behaviour
+ *
+ * @cmd:	one of the PERF_CONTROL_* commands
+ * @arg:	command-specific argument
+ */
+SYSCALL_DEFINE2(perf_control, unsigned int, cmd, unsigned long, arg)
+{
+	switch (cmd) {
+	case PERF_CONTROL_GET_TIME:
+	{
+		u64 time = perf_clock();
+		if (copy_to_user((void __user *)arg, &time, sizeof(time)))
+			return -EFAULT;
+		return 0;
+	}
+
+	default:
+		return -ENOTTY;
+	}
+}
+
+/**
  * perf_event_create_kernel_counter
  *
  * @attr: attributes of the counter to create
8<-----------------

Cheers!

Pawel



^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-05 22:13     ` Stephane Eranian
  2013-02-05 22:28       ` John Stultz
@ 2013-02-06 18:17       ` Pawel Moll
  1 sibling, 0 replies; 65+ messages in thread
From: Pawel Moll @ 2013-02-06 18:17 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, LKML, mingo, Paul Mackerras, Anton Blanchard,
	Will Deacon, ak, Pekka Enberg, Steven Rostedt, Robert Richter,
	tglx, John Stultz

On Tue, 2013-02-05 at 22:13 +0000, Stephane Eranian wrote:
> The app requesting the timestamp may not necessarily have an active
> perf session. And by that I mean, it may not be self-monitoring. But it
> could be monitored by an external tool such as perf, without necessarily
> knowing it.

Fair enough - I guess e.g. a JIT engine could generate timestamped
information about generated symbols, which could then be merged with a
stream recorded by the perf tool by yet another tool.

> The timestamp is global or at least per-cpu. It is not tied to a particular
> active event.

I know that the implementation does it now, but is it actually specified
as such? I could imagine a situation where perf_clock() returns time
elapsed since the "current" sys_perf_event_open()... (not that I'd like
it ;-)

Paweł



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-06 18:17           ` Pawel Moll
@ 2013-02-13 20:00             ` Stephane Eranian
  2013-02-14 10:33               ` Pawel Moll
  0 siblings, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2013-02-13 20:00 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Steven Rostedt, John Stultz, Peter Zijlstra, LKML, mingo,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Robert Richter, tglx

On Wed, Feb 6, 2013 at 7:17 PM, Pawel Moll <pawel.moll@arm.com> wrote:
> On Wed, 2013-02-06 at 01:19 +0000, Steven Rostedt wrote:
>> If people are worried about adding a bunch of new perf syscalls, maybe
>> add a sys_perf_control() system call that works like an ioctl but
>> without a file descriptor. Something for things that don't require an
>> event attached to it, like to retrieve a time stamp counter that perf
>> uses, but done in a way that it could be used for other things perf
>> related that does not require an event.
>
> Something along these lines? (completely untested and of course missing
> all the #defines __NR_perf_control xxx)
>
> 8<-----------------
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 4f63c05..be7409b 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -322,6 +322,11 @@ enum perf_event_ioc_flags {
>  };
>
>  /*
> + * Command codes for ioctl-like sys_perf_control interface:
> + */
> +#define PERF_CONTROL_GET_TIME          _IOR('$', 0, __u64)
> +
> +/*
>   * Structure of the page that can be mapped via mmap
>   */
>  struct perf_event_mmap_page {
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 301079d..750404d 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6678,6 +6678,29 @@ err_fd:
>  }
>
>  /**
> + * sys_perf_control - ioctl-like interface to control system-wide
> + *                   perf behaviour
> + *
> + * @cmd:       one of the PERF_CONTROL_* commands
> + * @arg:       command-specific argument
> + */
> +SYSCALL_DEFINE2(perf_control, unsigned int, cmd, unsigned long, arg)
> +{
> +       switch (cmd) {
> +       case PERF_CONTROL_GET_TIME:
> +       {
> +               u64 time = perf_clock();
> +               if (copy_to_user((void __user *)arg, &time, sizeof(time)))
> +                       return -EFAULT;
> +               return 0;
> +       }
> +
> +       default:
> +               return -ENOTTY;
> +       }
> +}
> +
> +/**
>   * perf_event_create_kernel_counter
>   *
>   * @attr: attributes of the counter to create
> 8<-----------------
>
> Cheers!

So what would be the role of this new syscall besides GET_TIME?
What other fd-less controls could there be? We already pass a lot of
control through perf_event_open(), some in the attr struct, some as
arguments.

The only advantage of this "disguised" ioctl() is that it does not require
an fd. But is it worth adding a syscall just to avoid creating an fd?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-13 20:00             ` Stephane Eranian
@ 2013-02-14 10:33               ` Pawel Moll
  2013-02-18 15:16                 ` Stephane Eranian
  0 siblings, 1 reply; 65+ messages in thread
From: Pawel Moll @ 2013-02-14 10:33 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Steven Rostedt, John Stultz, Peter Zijlstra, LKML, mingo,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Robert Richter, tglx

On Wed, 2013-02-13 at 20:00 +0000, Stephane Eranian wrote:
> On Wed, Feb 6, 2013 at 7:17 PM, Pawel Moll <pawel.moll@arm.com> wrote:
> > On Wed, 2013-02-06 at 01:19 +0000, Steven Rostedt wrote:
> >> If people are worried about adding a bunch of new perf syscalls, maybe
> >> add a sys_perf_control() system call that works like an ioctl but
> >> without a file descriptor. Something for things that don't require an
> >> event attached to it, like to retrieve a time stamp counter that perf
> >> uses, but done in a way that it could be used for other things perf
> >> related that does not require an event.
> >
> >  /**
> > + * sys_perf_control - ioctl-like interface to control system-wide
> > + *                   perf behaviour
> > + *
> > + * @cmd:       one of the PERF_CONTROL_* commands
> > + * @arg:       command-specific argument
> > + */
> > +SYSCALL_DEFINE2(perf_control, unsigned int, cmd, unsigned long, arg)
> > +{
> > +       switch (cmd) {
> > +       case PERF_CONTROL_GET_TIME:
> > +       {
> > +               u64 time = perf_clock();
> > +               if (copy_to_user((void __user *)arg, &time, sizeof(time)))
> > +                       return -EFAULT;
> > +               return 0;
> > +       }
> > +
> > +       default:
> > +               return -ENOTTY;
> > +       }
> > +}
>
> So what would be the role of this new syscall besides GET_TIME?
> What other controls without a fd could be done? We are already passing
> a lot of control thru the perf_event_open() some in the attr struct others
> as arguments.

I think Steven was thinking about an "extensible" fd-less interface.
Whether we'll need any other fd-less control in the future, I don't
know...

> The only advantage of this "disguised" ioctl() is that it does not require
> a fd. But it is worth adding a syscall just to avoid creating a fd?

Frankly speaking, I have some doubts here as well; I call sys_perf_open()
anyway, so this was mainly an attempt to address your situation.

One way or another I'd like to get the timestamp, so how about picking
one solution and trying to make it happen? Seems that my previous
"standard ioctl()" patch would be the best compromise?

Pawel



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-14 10:33               ` Pawel Moll
@ 2013-02-18 15:16                 ` Stephane Eranian
  2013-02-18 18:59                   ` David Ahern
  0 siblings, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2013-02-18 15:16 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Steven Rostedt, John Stultz, Peter Zijlstra, LKML, mingo,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Robert Richter, tglx

Hi,


I think the advantage of the ioctl() is that it reuses existing infrastructure.
The downside is that to get the timestamp you need at a minimum:

uint64_t get_perf_timestamp(void)
{
    struct perf_event_attr attr;
    uint64_t ts = 0;
    int fd;

    memset(&attr, 0, sizeof(attr));

    /* pick a dummy SW event (no PMU HW resource allocated), keep it disabled */
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CPU_CLOCK; /* dummy event */
    attr.disabled = 1;

    /* attach to self in per-thread mode */
    fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd == -1)
        return 0;

    ioctl(fd, PERF_EVENT_IOC_GET_TIME, &ts);
    close(fd);

    return ts;
}

This function is likely to be called multiple times, so we could cache
the fd and reuse it. Only the first call would pay the bigger setup
penalty; thereafter each call would probably cost just a bit more than
the sys_perf_control() approach, because of the need to go through VFS.
So I am fine with this.


On Thu, Feb 14, 2013 at 11:33 AM, Pawel Moll <pawel.moll@arm.com> wrote:
> On Wed, 2013-02-13 at 20:00 +0000, Stephane Eranian wrote:
>> On Wed, Feb 6, 2013 at 7:17 PM, Pawel Moll <pawel.moll@arm.com> wrote:
>> > On Wed, 2013-02-06 at 01:19 +0000, Steven Rostedt wrote:
>> >> If people are worried about adding a bunch of new perf syscalls, maybe
>> >> add a sys_perf_control() system call that works like an ioctl but
>> >> without a file descriptor. Something for things that don't require an
>> >> event attached to it, like to retrieve a time stamp counter that perf
>> >> uses, but done in a way that it could be used for other things perf
>> >> related that does not require an event.
>> >
>> >  /**
>> > + * sys_perf_control - ioctl-like interface to control system-wide
>> > + *                   perf behaviour
>> > + *
>> > + * @cmd:       one of the PERF_CONTROL_* commands
>> > + * @arg:       command-specific argument
>> > + */
>> > +SYSCALL_DEFINE2(perf_control, unsigned int, cmd, unsigned long, arg)
>> > +{
>> > +       switch (cmd) {
>> > +       case PERF_CONTROL_GET_TIME:
>> > +       {
>> > +               u64 time = perf_clock();
>> > +               if (copy_to_user((void __user *)arg, &time, sizeof(time)))
>> > +                       return -EFAULT;
>> > +               return 0;
>> > +       }
>> > +
>> > +       default:
>> > +               return -ENOTTY;
>> > +       }
>> > +}
>>
>> So what would be the role of this new syscall besides GET_TIME?
>> What other controls without a fd could be done? We are already passing
>> a lot of control thru the perf_event_open() some in the attr struct others
>> as arguments.
>
> I think Steven was thinking about an "extensible" fd-less interface.
> Whether we'll need any other fd-less control in the future, I don't
> know...
>
>> The only advantage of this "disguised" ioctl() is that it does not require
>> a fd. But it is worth adding a syscall just to avoid creating a fd?
>
> Frankly speaking, I have some doubts here, but I do sys_perf_open()
> anyway, so it was mainly trying to address your situation.
>
> One way or another I'd like to get the timestamp, so how about picking
> one solution and trying to make it happen? Seems that my previous
> "standard ioctl()" patch would be the best compromise?
>
> Pawel
>
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-18 15:16                 ` Stephane Eranian
@ 2013-02-18 18:59                   ` David Ahern
  0 siblings, 0 replies; 65+ messages in thread
From: David Ahern @ 2013-02-18 18:59 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Pawel Moll, Steven Rostedt, John Stultz, Peter Zijlstra, LKML,
	mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Robert Richter, tglx

On 2/18/13 8:16 AM, Stephane Eranian wrote:
> Hi,
>
>
> I think the advantage of the ioctl() is that is reuses existing infrastructure.
> The downside is that to get the timestamp you need at a minimum:
>
> uint64_t get_perf_timestamp(void)
> {
>     struct perf_event_attr attr;
>     uint64_t ts = 0;
>     int fd;
>
>     memset(&attr, 0, sizeof(attr));
>
>     /* pick a dummy SW event (no PMU HW resource allocated), keep it disabled */
>     attr.type = PERF_TYPE_SOFTWARE;
>     attr.config =  PERF_COUNT_SW_CPU_CLOCK; /* dummy event */
>     attr.disabled = 1;
>
>     /* attach to self in per-thread mode */
>     fd = perf_event_open(&attr, 0, -1, -1, 0);
>     if (fd == -1)
>         return 0;
>
>    ioctl(fd, PERF_EVENT_IOC_GET_TIME, &ts);
>    close(fd);
>
>    return ts;
> }
>

That's the approach I took with an update to my perf_clock to 
time-of-day series. Specific patch:
 
https://github.com/dsahern/linux/commit/7e6f40fca5f8cdbee1cd46d42b11aee71d0ffd34

and series:
https://github.com/dsahern/linux/commits/perf-time-of-day-3.7

David

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-05 22:28       ` John Stultz
  2013-02-06  1:19         ` Steven Rostedt
@ 2013-02-18 20:35         ` Thomas Gleixner
  2013-02-19 18:25           ` John Stultz
  1 sibling, 1 reply; 65+ messages in thread
From: Thomas Gleixner @ 2013-02-18 20:35 UTC (permalink / raw)
  To: John Stultz
  Cc: Stephane Eranian, Pawel Moll, Peter Zijlstra, LKML, mingo,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Steven Rostedt, Robert Richter

On Tue, 5 Feb 2013, John Stultz wrote:
> On 02/05/2013 02:13 PM, Stephane Eranian wrote:
> > But if people are strongly opposed to the clock_gettime() approach, then
> > I can go with the ioctl() because the functionality is definitively needed
> > ASAP.
> 
> I prefer the ioctl method, since its less likely to be re-purposed/misused.

Urgh. No! With a dedicated CLOCK_PERF we might have a decent chance to
put this into a vsyscall. With an ioctl not so much.
 
> Though I'd be most comfortable with finding some way for perf-timestamps to be
> CLOCK_MONOTONIC based (or maybe CLOCK_MONOTONIC_RAW if it would be easier),
> and just avoid all together adding another time domain that doesn't really
> have clear definition (other then "what perf uses").

What's wrong with that? We already have the infrastructure to create
dynamic time domains which can be completely disconnected from
everything else.

Tracing/perf/instrumentation is a different domain and the main issue
there is performance. So going for a vsyscall enabled clock_gettime()
approach is definitely the best thing to do.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-18 20:35         ` Thomas Gleixner
@ 2013-02-19 18:25           ` John Stultz
  2013-02-19 19:55             ` Thomas Gleixner
  2013-02-20 10:29             ` Peter Zijlstra
  0 siblings, 2 replies; 65+ messages in thread
From: John Stultz @ 2013-02-19 18:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Stephane Eranian, Pawel Moll, Peter Zijlstra, LKML, mingo,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Steven Rostedt, Robert Richter

On 02/18/2013 12:35 PM, Thomas Gleixner wrote:
> On Tue, 5 Feb 2013, John Stultz wrote:
>> On 02/05/2013 02:13 PM, Stephane Eranian wrote:
>>> But if people are strongly opposed to the clock_gettime() approach, then
>>> I can go with the ioctl() because the functionality is definitively needed
>>> ASAP.
>> I prefer the ioctl method, since its less likely to be re-purposed/misused.
> Urgh. No! With a dedicated CLOCK_PERF we might have a decent chance to
> put this into a vsyscall. With an ioctl not so much.
>   
>> Though I'd be most comfortable with finding some way for perf-timestamps to be
>> CLOCK_MONOTONIC based (or maybe CLOCK_MONOTONIC_RAW if it would be easier),
>> and just avoid all together adding another time domain that doesn't really
>> have clear definition (other then "what perf uses").
> What's wrong with that. We already have the infrastructure to create
> dynamic time domains which can be completely disconnected from
> everything else.

Right, but those are for actual hardware domains that we had no other 
way of interacting with.


> Tracing/perf/instrumentation is a different domain and the main issue
> there is performance. So going for a vsyscall enabled clock_gettime()
> approach is definitely the best thing to do.

So describe how the perf time domain is different from CLOCK_MONOTONIC_RAW.


My concern here is that we're basically creating a kernel interface that
exports implementation-defined semantics (again: whatever perf does
right now). And I think folks want to do this because adding CLOCK_PERF
is easier than trying to:

1) Get a lock-free method for accessing CLOCK_MONOTONIC_RAW

2) Have perf interpolate its timestamps to CLOCK_MONOTONIC, or
CLOCK_MONOTONIC_RAW, when it exports the data


The semantics of sched_clock() have been very flexible and hand-wavy in
the past. And I agree with the need for the kernel to have a
"fast-and-loose" clock, as well as with the benefits that flexibility
has brought as the scheduler code has evolved. But nonetheless, the
changes in its semantics have bitten us badly a few times.

So I totally understand why the vsyscall is attractive. I'm just very
cautious about exporting a similarly fuzzily defined interface to
userland. So until it's clear what the semantics will need to be going
forward (forever!), my preference is that we not add it.


thanks
-john


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-19 18:25           ` John Stultz
@ 2013-02-19 19:55             ` Thomas Gleixner
  2013-02-19 20:15               ` Thomas Gleixner
  2013-02-20 10:29             ` Peter Zijlstra
  1 sibling, 1 reply; 65+ messages in thread
From: Thomas Gleixner @ 2013-02-19 19:55 UTC (permalink / raw)
  To: John Stultz
  Cc: Stephane Eranian, Pawel Moll, Peter Zijlstra, LKML, mingo,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Steven Rostedt, Robert Richter

On Tue, 19 Feb 2013, John Stultz wrote:
> On 02/18/2013 12:35 PM, Thomas Gleixner wrote:
> > On Tue, 5 Feb 2013, John Stultz wrote:
> > > On 02/05/2013 02:13 PM, Stephane Eranian wrote:
> > > > But if people are strongly opposed to the clock_gettime() approach, then
> > > > I can go with the ioctl() because the functionality is definitively
> > > > needed
> > > > ASAP.
> > > I prefer the ioctl method, since its less likely to be
> > > re-purposed/misused.
> > Urgh. No! With a dedicated CLOCK_PERF we might have a decent chance to
> > put this into a vsyscall. With an ioctl not so much.
> >   
> > > Though I'd be most comfortable with finding some way for perf-timestamps
> > > to be
> > > CLOCK_MONOTONIC based (or maybe CLOCK_MONOTONIC_RAW if it would be
> > > easier),
> > > and just avoid all together adding another time domain that doesn't really
> > > have clear definition (other then "what perf uses").
> > What's wrong with that. We already have the infrastructure to create
> > dynamic time domains which can be completely disconnected from
> > everything else.
> 
> Right, but those are for actual hardware domains that we had no other way of
> interacting with.

Agreed.

> > Tracing/perf/instrumentation is a different domain and the main issue
> > there is performance. So going for a vsyscall enabled clock_gettime()
> > approach is definitely the best thing to do.
> 
> So describe how the perf time domain is different then CLOCK_MONOTONIC_RAW.

Mostly not, except for x86 :)
 
> My concern here is that we're basically creating a kernel interface that
> exports implementation-defined semantics (again: whatever perf does right
> now). And I think folks want to do this, because adding CLOCK_PERF is easier
> then trying to:
> 
> 1) Get a lock-free method for accessing CLOCK_MONOTONIC_RAW

Well, you can't. We already guarantee that CLOCK_MONOTONIC_RAW is
monotonic and we can't break that when adding a vsyscall
implementation. So the gtod->seq or any equivalent needs to stay, no
matter what.

OTOH, thinking more about it:

If you look at the dance sched_clock_local() is doing, it is doubtful
that a vsyscall-based access to CLOCK_MONOTONIC_RAW is going to have
massively more overhead.

The periodic update of sched_clock will cause the same issues as the
gtod->seq write hold time, which is extremely short. It's not a
seqcount in sched_clock_local(), it's a cmpxchg-based retry loop.

Would be interesting to compare and contrast that. Though you can't do
that in the kernel, as the write hold time of the timekeeper seq is way
larger than the gtod->seq write hold time. I have a patch series in the
works which makes the timekeeper seq hold time almost as short as that
of gtod->seq.

Even if sched_clock is stable, the overhead of dealing with short
counters or calling mult_frac() is probably in the same range as what
we do with the timekeeping-related vsyscalls.

> 2) Having perf interpolate its timestamps to CLOCK_MONOTONIC, or
> CLOCKMONOTONIC_RAW when it exports the data

If we can get the timekeeper seq write hold time down to the bare
minimum (comparable to gtod->seq) I doubt that sched_clock will have
any reasonable advantage.
 
> The semantics on sched_clock() have been very flexible and hand-wavy in the
> past. And I agree with the need for the kernel to have a "fast-and-loose"
> clock as well as the benefits to that flexibility as the scheduler code has
> evolved.  But non-the-less, the changes in its semantics have bitten us badly
> a few times.
> 
> So I totally understand why the vsyscall is attractive. I'm just very cautious
> about exporting a similarly fuzzily defined interface to userland. So until
> its clear what the semantics will need to be going forward (forever!), my
> preference will be that we not add it.

Fair enough.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-19 19:55             ` Thomas Gleixner
@ 2013-02-19 20:15               ` Thomas Gleixner
  2013-02-19 20:35                 ` John Stultz
  0 siblings, 1 reply; 65+ messages in thread
From: Thomas Gleixner @ 2013-02-19 20:15 UTC (permalink / raw)
  To: John Stultz
  Cc: Stephane Eranian, Pawel Moll, Peter Zijlstra, LKML, Ingo Molnar,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Steven Rostedt, Robert Richter

On Tue, 19 Feb 2013, Thomas Gleixner wrote:
> On Tue, 19 Feb 2013, John Stultz wrote:
> Would be interesting to compare and contrast that. Though you can't do
> that in the kernel as the write hold time of the timekeeper seq is way
> larger than the gtod->seq write hold time. I have a patch series in
> work which makes the timekeeper seq hold time almost as short as that
> of gtod->seq.

As a side note. There is a really interesting corner case
vs. virtualization.

  VCPU0						VCPU1

  update_wall_time()
    write_seqlock_irqsave(&tk->lock, flags);
    ....

Host schedules out VCPU0

Arbitrary delay

Host schedules in VCPU0
						__vdso_clock_gettime()#1
    update_vsyscall();
						__vdso_clock_gettime()#2

Depending on the length of the delay which kept VCPU0 away from
executing and depending on the direction of the ntp update of the
timekeeping variables __vdso_clock_gettime()#2 can observe time going
backwards.

You can reproduce that by pinning VCPU0 to physical core 0 and VCPU1
to physical core 1. Now remove all load from physical core 1 except
VCPU1 and put massive load on physical core 0 and make sure that the
NTP adjustment lowers the mult factor.

Fun, isn't it ?

     tglx



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-19 20:15               ` Thomas Gleixner
@ 2013-02-19 20:35                 ` John Stultz
  2013-02-19 21:50                   ` Thomas Gleixner
  0 siblings, 1 reply; 65+ messages in thread
From: John Stultz @ 2013-02-19 20:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Stephane Eranian, Pawel Moll, Peter Zijlstra, LKML, Ingo Molnar,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Steven Rostedt, Robert Richter

On 02/19/2013 12:15 PM, Thomas Gleixner wrote:
> On Tue, 19 Feb 2013, Thomas Gleixner wrote:
>> On Tue, 19 Feb 2013, John Stultz wrote:
>> Would be interesting to compare and contrast that. Though you can't do
>> that in the kernel as the write hold time of the timekeeper seq is way
>> larger than the gtod->seq write hold time. I have a patch series in
>> work which makes the timekeeper seq hold time almost as short as that
>> of gtod->seq.
> As a side note. There is a really interesting corner case
> vs. virtualization.
>
>    VCPU0						VCPU1
>
>    update_wall_time()
>      write_seqlock_irqsave(&tk->lock, flags);
>      ....
>
> Host schedules out VCPU0
>
> Arbitrary delay
>
> Host schedules in VCPU0
> 						__vdso_clock_gettime()#1
>      update_vsyscall();
> 						__vdso_clock_gettime()#2
>
> Depending on the length of the delay which kept VCPU0 away from
> executing and depending on the direction of the ntp update of the
> timekeeping variables __vdso_clock_gettime()#2 can observe time going
> backwards.
>
> You can reproduce that by pinning VCPU0 to physical core 0 and VCPU1
> to physical core 1. Now remove all load from physical core 1 except
> VCPU1 and put massive load on physical core 0 and make sure that the
> NTP adjustment lowers the mult factor.
>
> Fun, isn't it ?

Yea, this has always worried me. I had a patch for this way way back,
blocking vdso readers for the entire timekeeping update.
But it was ugly, hurt performance, and no one seemed to be hitting the
window you describe above. Nonetheless, you're probably right, we should
find a way to do it right. I'll try to revive those patches.

thanks
-john




^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-19 20:35                 ` John Stultz
@ 2013-02-19 21:50                   ` Thomas Gleixner
  2013-02-19 22:20                     ` John Stultz
  0 siblings, 1 reply; 65+ messages in thread
From: Thomas Gleixner @ 2013-02-19 21:50 UTC (permalink / raw)
  To: John Stultz
  Cc: Stephane Eranian, Pawel Moll, Peter Zijlstra, LKML, Ingo Molnar,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Steven Rostedt, Robert Richter

On Tue, 19 Feb 2013, John Stultz wrote:
> On 02/19/2013 12:15 PM, Thomas Gleixner wrote:
> > Depending on the length of the delay which kept VCPU0 away from
> > executing and depending on the direction of the ntp update of the
> > timekeeping variables __vdso_clock_gettime()#2 can observe time going
> > backwards.
> > 
> > You can reproduce that by pinning VCPU0 to physical core 0 and VCPU1
> > to physical core 1. Now remove all load from physical core 1 except
> > VCPU1 and put massive load on physical core 0 and make sure that the
> > NTP adjustment lowers the mult factor.
> > 
> > Fun, isn't it ?
> 
> Yea, this has always worried me. I had a patch for this way way back, blocking
> vdso readers for the entire timekeeping update.
> But it was ugly, hurt performance and no one seemed to be hitting the window
> you hit above.  None the less, you're probably right, we should find a way to
> do it right. I'll try to revive those patches.

Let me summarize the IRC discussion we just had about that:

1) We really want to reduce the seq write hold time of the timekeeper
   to the bare minimum.

   That's doable and I have working patches for this by splitting the
   timekeeper seqlock into a spin_lock and a seqcount and doing the
   update calculations on a shadow timekeeper structure. The seq write
   hold time then gets reduced to switching a pointer and updating the
   gtod data.

   So the sequence would look like:

   raw_spin_lock(&timekeeper_lock);
   copy_shadow_data(current_timekeeper, shadow_timekeeper);
   do_timekeeping_and_ntp_update(shadow_timekeeper);
   write_seqcount_begin(&timekeeper_seq);
   switch_pointers(current_timekeeper, shadow_timekeeper);
   update_vsyscall();
   write_seqcount_end(&timekeeper_seq);
   raw_spin_unlock(&timekeeper_lock);

   It's really worth the trouble. On one of my optimized RT systems I
   get the maximum latency of the non timekeeping cores (timekeeping
   duty is pinned to core 0) down from 8us to 4 us. That's a whopping
   factor of 2.

2) Doing #1 will allow to observe the described time going backwards
   scenario in kernel as well.

   The reason why we did not get complaints about that scenario at all
   (yet) is that the window and the probability to hit it are small
   enough. Nevertheless it's a real issue for virtualized systems.

   Now you came up with the great idea that the timekeeping core can
   calculate an approximate safe bound for the clocksource readout,
   such that wreckage relative to the last update of the clocksource is
   not observable, no matter how long the scheduled-out delay is or in
   which direction the NTP update is going.

   So the writer side would still look like described in #1, but the
   reader side would grow another sanity check:

   Note, that's not relevant for CLOCK_MONOTONIC_RAW!

--- linux-2.6.orig/arch/x86/vdso/vclock_gettime.c
+++ linux-2.6/arch/x86/vdso/vclock_gettime.c
@@ -193,7 +193,7 @@ notrace static int __always_inline do_re
 notrace static int do_monotonic(struct timespec *ts)
 {
 	unsigned long seq;
-	u64 ns;
+	u64 ns, d;
 	int mode;
 
 	ts->tv_nsec = 0;
@@ -202,9 +202,10 @@ notrace static int do_monotonic(struct t
 		mode = gtod->clock.vclock_mode;
 		ts->tv_sec = gtod->monotonic_time_sec;
 		ns = gtod->monotonic_time_snsec;
-		ns += vgetsns(&mode);
+		d = vgetsns(&mode);
+		ns += d;
 		ns >>= gtod->clock.shift;
-	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+	} while (read_seqcount_retry(&gtod->seq, seq) || d > gtod->safe_delta);
 	timespec_add_ns(ts, ns);
 
 	return mode;

   Note, that this sanity check also needs to be applied to all in
   kernel and real syscall interfaces.

I think that's a proper solution for this issue, unless you want to go
down the ugly road of expanding the vsyscall seq write hold time to the
full timekeeper_lock hold time. The factor-2 reduction of latencies on
RT is argument enough for me to try that approach.

I'll polish up the shadow timekeeper patches in the next few days, so
you can have a go at the tk/gtod->safe_delta calculation, OK?

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-19 21:50                   ` Thomas Gleixner
@ 2013-02-19 22:20                     ` John Stultz
  2013-02-20 10:06                       ` Thomas Gleixner
  0 siblings, 1 reply; 65+ messages in thread
From: John Stultz @ 2013-02-19 22:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Stephane Eranian, Pawel Moll, Peter Zijlstra, LKML, Ingo Molnar,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Steven Rostedt, Robert Richter

On 02/19/2013 01:50 PM, Thomas Gleixner wrote:
> On Tue, 19 Feb 2013, John Stultz wrote:
>> On 02/19/2013 12:15 PM, Thomas Gleixner wrote:
>>> Depending on the length of the delay which kept VCPU0 away from
>>> executing and depending on the direction of the ntp update of the
>>> timekeeping variables __vdso_clock_gettime()#2 can observe time going
>>> backwards.
>>>
>>> You can reproduce that by pinning VCPU0 to physical core 0 and VCPU1
>>> to physical core 1. Now remove all load from physical core 1 except
>>> VCPU1 and put massive load on physical core 0 and make sure that the
>>> NTP adjustment lowers the mult factor.
>>>
>>> Fun, isn't it ?
>> Yea, this has always worried me. I had a patch for this way way back, blocking
>> vdso readers for the entire timekeeping update.
>> But it was ugly, hurt performance and no one seemed to be hitting the window
>> you hit above.  None the less, you're probably right, we should find a way to
>> do it right. I'll try to revive those patches.
> Let me summarize the IRC discussion we just had about that:
>
> 1) We really want to reduce the seq write hold time of the timekeeper
>     to the bare minimum.
>
>     That's doable and I have working patches for this by splitting the
>     timekeeper seqlock into a spin_lock and a seqcount and doing the
>     update calculations on a shadow timekeeper structure. The seq write
>     hold time then gets reduced to switching a pointer and updating the
>     gtod data.
>
>     So the sequence would look like:
>
>     raw_spin_lock(&timekeeper_lock);
>     copy_shadow_data(current_timekeeper, shadow_timekeeper);
>     do_timekeeping_and_ntp_update(shadow_timekeeper);
>     write_seqcount_begin(&timekeeper_seq);
>     switch_pointers(current_timekeeper, shadow_timekeeper);
>     update_vsyscall();
>     write_seqcount_end(&timekeeper_seq);
>     raw_spin_unlock(&timekeeper_lock);
>
>     It's really worth the trouble. On one of my optimized RT systems I
>     get the maximum latency of the non timekeeping cores (timekeeping
>     duty is pinned to core 0) down from 8us to 4 us. That's a whopping
>     factor of 2.
>
> 2) Doing #1 will allow to observe the described time going backwards
>     scenario in kernel as well.
>
>     The reason why we did not get complaints about that scenario at all
>     (yet) is that the window and the probability to hit it are small
>     enough. Nevertheless it's a real issue for virtualized systems.
>
>     Now you came up with the great idea, that the timekeeping core is
>     able to calculate what the approximate safe value is for the
>     clocksource readout to be in a state where wreckage relative to the
>     last update of the clocksource is not observable, not matter how
>     long the scheduled out delay is and in which direction the NTP
>     update is going.

So the other bit of caution here: I realize my idea of "valid cycle
ranges" has the potential for deadlock.

While it should be fine for use with the vdso, we have to be careful if
we use this in-kernel, because if we're in the update path, the valid
interval check could cause the ktime_get() in hrtimer_interrupt() to
spin forever. So we need to be sure we don't use this method anywhere
in the code paths that trigger the update_wall_time() code.

So some additional thinking may be necessary here. Though it may be as
simple as making sure we don't loop on the CPU that does the
timekeeping update.

thanks
-john


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-19 22:20                     ` John Stultz
@ 2013-02-20 10:06                       ` Thomas Gleixner
  0 siblings, 0 replies; 65+ messages in thread
From: Thomas Gleixner @ 2013-02-20 10:06 UTC (permalink / raw)
  To: John Stultz
  Cc: Stephane Eranian, Pawel Moll, Peter Zijlstra, LKML, Ingo Molnar,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Steven Rostedt, Robert Richter

On Tue, 19 Feb 2013, John Stultz wrote:
> On 02/19/2013 01:50 PM, Thomas Gleixner wrote:
> > 2) Doing #1 will allow to observe the described time going backwards
> >     scenario in kernel as well.
> > 
> >     The reason why we did not get complaints about that scenario at all
> >     (yet) is that the window and the probability to hit it are small
> >     enough. Nevertheless it's a real issue for virtualized systems.
> > 
> >     Now you came up with the great idea, that the timekeeping core is
> >     able to calculate what the approximate safe value is for the
> >     clocksource readout to be in a state where wreckage relative to the
> >     last update of the clocksource is not observable, not matter how
> >     long the scheduled out delay is and in which direction the NTP
> >     update is going.
> 
> So the other bit of caution here, is I realize my idea of "valid cycle ranges"
> has the potential for deadlock.
> 
> While it should be fine for use with vdso, we have to be careful if we use
> this in-kernel, because if we're in the update path, the valid interval check
> could trigger the ktime_get() in hrtimer_interrupt() to spin forever. So we
> need to be sure we don't use this method anywhere in the code paths that
> trigger the update_wall_time() code.

Hmm, right.
 
> So some additional thinking may be necessary here. Though it may be as simple
> as making sure we don't loop on the cpu that does the timekeeping update.

Either that or make sure to use ktime_get_nocheck() in those code paths.

Thanks,

	tglx


* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-19 18:25           ` John Stultz
  2013-02-19 19:55             ` Thomas Gleixner
@ 2013-02-20 10:29             ` Peter Zijlstra
  2013-02-23  6:04               ` John Stultz
  1 sibling, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2013-02-20 10:29 UTC (permalink / raw)
  To: John Stultz
  Cc: Thomas Gleixner, Stephane Eranian, Pawel Moll, LKML, mingo,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Steven Rostedt, Robert Richter

On Tue, 2013-02-19 at 10:25 -0800, John Stultz wrote:
> So describe how the perf time domain is different then
> CLOCK_MONOTONIC_RAW.

The primary difference is that the trace/sched/perf time domain is not
strictly monotonic, it is only locally monotonic -- that is two time
stamps taken on the same cpu are guaranteed to be monotonic.

Furthermore, to make it useful, there's an actual bound on the inter-cpu
drift (implemented by limiting the drift to CLOCK_MONOTONIC).

Additionally -- to increase use -- we also added a monotonic sync point
when cpu A queries time of cpu B.

> My concern here is that we're basically creating a kernel interface
> that 
> exports implementation-defined semantics (again: whatever perf does 
> right now). And I think folks want to do this, because adding
> CLOCK_PERF 
> is easier then trying to:
> 
> 1) Get a lock-free method for accessing CLOCK_MONOTONIC_RAW
> 
> 2) Having perf interpolate its timestamps to CLOCK_MONOTONIC, or 
> CLOCKMONOTONIC_RAW when it exports the data

Mostly cheaper, not easier. Given unstable TSC, MONOTONIC will have to
fall back to another clock source (hpet, acpi_pm and other assorted
crap).

In order to avoid this, we'd have to relax the requirements. Using
anything other than TSC is simply not an option.



* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-20 10:29             ` Peter Zijlstra
@ 2013-02-23  6:04               ` John Stultz
  2013-02-25 14:10                 ` Peter Zijlstra
  0 siblings, 1 reply; 65+ messages in thread
From: John Stultz @ 2013-02-23  6:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Stephane Eranian, Pawel Moll, LKML, mingo,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Steven Rostedt, Robert Richter

On 02/20/2013 02:29 AM, Peter Zijlstra wrote:
> On Tue, 2013-02-19 at 10:25 -0800, John Stultz wrote:
>> So describe how the perf time domain is different then
>> CLOCK_MONOTONIC_RAW.
> The primary difference is that the trace/sched/perf time domain is not
> strictly monotonic, it is only locally monotonic -- that is two time
> stamps taken on the same cpu are guaranteed to be monotonic.

So how would a clock_gettime(CLOCK_PERF,...) interface help you figure 
out which cpu you got your timestamp from?


> Furthermore, to make it useful, there's an actual bound on the inter-cpu
> drift (implemented by limiting the drift to CLOCK_MONOTONIC).

So this sounds like you're already sort of interpolating to 
CLOCK_MONOTONIC, or am I just misunderstanding you?

> Additionally -- to increase use -- we also added a monotonic sync point
> when cpu A queries time of cpu B.

Not sure I'm following this bit. But I'll have to go look at the code on 
Monday.

>
>> My concern here is that we're basically creating a kernel interface
>> that
>> exports implementation-defined semantics (again: whatever perf does
>> right now). And I think folks want to do this, because adding
>> CLOCK_PERF
>> is easier then trying to:
>>
>> 1) Get a lock-free method for accessing CLOCK_MONOTONIC_RAW
>>
>> 2) Having perf interpolate its timestamps to CLOCK_MONOTONIC, or
>> CLOCKMONOTONIC_RAW when it exports the data
> Mostly cheaper, not easier. Given unstable TSC, MONOTONIC will have to
> fall back to another clock source (hpet, acpi_pm and other assorted
> crap).
>
> In order to avoid this, we'd had to relax the requirements. Using
> anything other than TSC is simply not an option.

Right, and this I understand. We can play a little fast and loose 
with the rules for in-kernel uses, given the variety of hardware and the 
fact that performance is more critical than perfect accuracy. Since 
we're in-kernel we also have more information than userland does about 
what cpu we're running on, so we can get away with only 
locally-monotonic timestamps.

But I want to be careful, if we're exporting this out to userland, that 
it's both useful and that there's an actual specification for how 
CLOCK_PERF behaves which applications can rely upon not changing in the
future.

thanks
-john




* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-23  6:04               ` John Stultz
@ 2013-02-25 14:10                 ` Peter Zijlstra
  2013-03-14 15:34                   ` Stephane Eranian
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2013-02-25 14:10 UTC (permalink / raw)
  To: John Stultz
  Cc: Thomas Gleixner, Stephane Eranian, Pawel Moll, LKML, mingo,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Steven Rostedt, Robert Richter

On Fri, 2013-02-22 at 22:04 -0800, John Stultz wrote:
> On 02/20/2013 02:29 AM, Peter Zijlstra wrote:
> > On Tue, 2013-02-19 at 10:25 -0800, John Stultz wrote:
> >> So describe how the perf time domain is different then
> >> CLOCK_MONOTONIC_RAW.
> > The primary difference is that the trace/sched/perf time domain is not
> > strictly monotonic, it is only locally monotonic -- that is two time
> > stamps taken on the same cpu are guaranteed to be monotonic.
> 
> So how would a clock_gettime(CLOCK_PERF,...) interface help you figure 
> out which cpu you got your timestamp from?

I'm not sure we want to expose it that far... The reason people want
this clock exposed is to be able to do logging on the same time-line so
we can correlate events from both sources (kernel and user-space).

In case of parallel execution we cannot guarantee order, and reading
logs/reconstructing events requires a bit of human intelligence.

> > Furthermore, to make it useful, there's an actual bound on the inter-cpu
> > drift (implemented by limiting the drift to CLOCK_MONOTONIC).
> 
> So this sounds like you're already sort of interpolating to 
> CLOCK_MONOTONIC, or am I just misunderstanding you?

That's right, although there are modes where the TSC is guaranteed
stable in which we don't do this (it avoids some expensive bits), so we
cannot rely on it.

> > Additionally -- to increase use -- we also added a monotonic sync point
> > when cpu A queries time of cpu B.
> 
> Not sure I'm following this bit. But I'll have to go look at the code
> on Monday.

It will basically pull the 'slowest' cpu forward so that for that
'event' we can say the two time-lines have a common point.

> Right, and this I understand. We can can play a little fast and lose 
> with the rules for in-kernel uses, given the variety of hardware and the 
> fact that performance is more critical then perfect accuracy. Since 
> we're in-kernel we also have more information then userland does about 
> what cpu we're running on, so we can get away with only 
> locally-monotonic timestamps.
> 
> But I want to be careful if we're exporting this out to userland that 
> its both useful and that there's an actual specification for how 
> CLOCK_PERF behaves, applications can rely upon not changing in the future.

Well, the timestamps themselves are already exposed to userspace
through the ftrace and perf data logs. All people want is to add a
secondary data stream on the same time-line.



* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-25 14:10                 ` Peter Zijlstra
@ 2013-03-14 15:34                   ` Stephane Eranian
  2013-03-14 19:57                     ` Pawel Moll
  0 siblings, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2013-03-14 15:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, Thomas Gleixner, Pawel Moll, LKML, mingo,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Steven Rostedt, Robert Richter

On Mon, Feb 25, 2013 at 3:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2013-02-22 at 22:04 -0800, John Stultz wrote:
>> On 02/20/2013 02:29 AM, Peter Zijlstra wrote:
>> > On Tue, 2013-02-19 at 10:25 -0800, John Stultz wrote:
>> >> So describe how the perf time domain is different then
>> >> CLOCK_MONOTONIC_RAW.
>> > The primary difference is that the trace/sched/perf time domain is not
>> > strictly monotonic, it is only locally monotonic -- that is two time
>> > stamps taken on the same cpu are guaranteed to be monotonic.
>>
>> So how would a clock_gettime(CLOCK_PERF,...) interface help you figure
>> out which cpu you got your timestamp from?
>
> I'm not sure we want to expose it that far.. The reason people want
> this clock exposed is to be able to do logging on the same time-line so
> we can correlate events from both sources (kernel and user-space).
>
> In case of parallel execution we cannot guarantee order and reading
> logs/reconstructing events things require a bit of human intelligence.
>
>> > Furthermore, to make it useful, there's an actual bound on the inter-cpu
>> > drift (implemented by limiting the drift to CLOCK_MONOTONIC).
>>
>> So this sounds like you're already sort of interpolating to
>> CLOCK_MONOTONIC, or am I just misunderstanding you?
>
> That's right, although there's modes where the TSC is guaranteed stable
> where we don't do this (it avoids some expensive bits), so we can not
> rely on this.
>
>> > Additionally -- to increase use -- we also added a monotonic sync point
>> > when cpu A queries time of cpu B.
>>
>> Not sure I'm following this bit. But I'll have to go look at the code
>> on Monday.
>
> It will basically pull the 'slowest' cpu forward so that for that
> 'event' we can say the two time-lines have a common point.
>
>> Right, and this I understand. We can can play a little fast and lose
>> with the rules for in-kernel uses, given the variety of hardware and the
>> fact that performance is more critical then perfect accuracy. Since
>> we're in-kernel we also have more information then userland does about
>> what cpu we're running on, so we can get away with only
>> locally-monotonic timestamps.
>>
>> But I want to be careful if we're exporting this out to userland that
>> its both useful and that there's an actual specification for how
>> CLOCK_PERF behaves, applications can rely upon not changing in the future.
>
> Well, the timestamps themselves are already exposed to userspace
> through the ftrace and perf data logs. All people want is to add
> secondary data stream in the same time-line.
>
I agree with Peter on this. The timestamps are already visible.
All we need is the ability to generate them for another, user-level,
data stream.


* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-03-14 15:34                   ` Stephane Eranian
@ 2013-03-14 19:57                     ` Pawel Moll
  2013-03-31 16:23                       ` David Ahern
  0 siblings, 1 reply; 65+ messages in thread
From: Pawel Moll @ 2013-03-14 19:57 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, John Stultz, Thomas Gleixner, LKML, mingo,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Steven Rostedt, Robert Richter

On Thu, 2013-03-14 at 15:34 +0000, Stephane Eranian wrote:
> > Well, the timestamps themselves are already exposed to userspace
> > through the ftrace and perf data logs. All people want is to add
> > secondary data stream in the same time-line.
> >
> I agree with Peter on this. The timestamps are already visible.
> All we need is the ability to generate them for another user-level
> level data stream.

Ok, how about the code below? I must say I have some doubts about the
resolution, as there seems to be no generic way of figuring it out for
the sched_clock (the arch/arm/kernel/sched_clock.c is actually
calculating it, but then just prints it out and nothing more).

And, to summarize, we went through 3 ideas:

1. ioctl() - http://article.gmane.org/gmane.linux.kernel/1433933
2. syscall - http://article.gmane.org/gmane.linux.kernel/1437057
3. POSIX clock - below

John also suggested that maybe the perf could use CLOCK_MONOTONIC_RAW
instead of local/sched_clock().

How about a final decision?

Regards

Pawel

8<--------------------
>From c986492d38156f1fc25ab3182f0a494bb13389ce Mon Sep 17 00:00:00 2001
From: Pawel Moll <pawel.moll@arm.com>
Date: Thu, 14 Mar 2013 19:49:09 +0000
Subject: [PATCH] perf: POSIX CLOCK_PERF to report current time value

To correlate user space events with the perf events stream
a current (as in: "what time(stamp) is it now?") time value
must be made available.

This patch adds a POSIX clock returning the perf_clock()
value and accessible from userspace:

	#include <time.h>

	struct timespec ts;

	clock_gettime(CLOCK_PERF, &ts);

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---
 include/uapi/linux/time.h |    1 +
 kernel/events/core.c      |   20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/uapi/linux/time.h b/include/uapi/linux/time.h
index 0d3c0ed..cea16b0 100644
--- a/include/uapi/linux/time.h
+++ b/include/uapi/linux/time.h
@@ -54,6 +54,7 @@ struct itimerval {
 #define CLOCK_BOOTTIME			7
 #define CLOCK_REALTIME_ALARM		8
 #define CLOCK_BOOTTIME_ALARM		9
+#define CLOCK_PERF			10
 
 /*
  * The IDs of various hardware clocks:
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b0cd865..81ca459 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -37,6 +37,7 @@
 #include <linux/ftrace_event.h>
 #include <linux/hw_breakpoint.h>
 #include <linux/mm_types.h>
+#include <linux/posix-timers.h>
 
 #include "internal.h"
 
@@ -209,6 +210,19 @@ static inline u64 perf_clock(void)
 	return local_clock();
 }
 
+static int perf_posix_clock_getres(const clockid_t which_clock,
+		struct timespec *tp)
+{
+	*tp = ns_to_timespec(TICK_NSEC);
+	return 0;
+}
+
+static int perf_posix_clock_get(clockid_t which_clock, struct timespec *tp)
+{
+	*tp = ns_to_timespec(perf_clock());
+	return 0;
+}
+
 static inline struct perf_cpu_context *
 __get_cpu_context(struct perf_event_context *ctx)
 {
@@ -7391,6 +7405,10 @@ perf_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu)
 
 void __init perf_event_init(void)
 {
+	struct k_clock perf_posix_clock = {
+		.clock_getres = perf_posix_clock_getres,
+		.clock_get = perf_posix_clock_get,
+	};
 	int ret;
 
 	idr_init(&pmu_idr);
@@ -7407,6 +7425,8 @@ void __init perf_event_init(void)
 	ret = init_hw_breakpoint();
 	WARN(ret, "hw_breakpoint initialization failed with: %d", ret);
 
+	posix_timers_register_clock(CLOCK_PERF, &perf_posix_clock);
+
 	/* do not patch jump label more than once per second */
 	jump_label_rate_limit(&perf_sched_events, HZ);
 
-- 
1.7.10.4






* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-03-14 19:57                     ` Pawel Moll
@ 2013-03-31 16:23                       ` David Ahern
  2013-04-01 18:29                         ` John Stultz
  0 siblings, 1 reply; 65+ messages in thread
From: David Ahern @ 2013-03-31 16:23 UTC (permalink / raw)
  To: Pawel Moll, Stephane Eranian, Peter Zijlstra, John Stultz,
	Thomas Gleixner
  Cc: LKML, mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter

On 3/14/13 1:57 PM, Pawel Moll wrote:
> Ok, how about the code below? I must say I have some doubts about the
> resolution, as there seem to be no generic way of figuring it out for
> the sched_clock (the arch/arm/kernel/sched_clock.c is actually
> calculating it, but than just prints it out and nothing more).
>
> And, to summarize, we went through 3 ideas:
>
> 1. ioctl() - http://article.gmane.org/gmane.linux.kernel/1433933
> 2. syscall - http://article.gmane.org/gmane.linux.kernel/1437057
> 3. POSIX clock - below
>
> John also suggested that maybe the perf could use CLOCK_MONOTONIC_RAW
> instead of local/sched_clock().

Any chance a decision can be reached in time for 3.10? Seems like the 
simplest option is the perf event based ioctl.

Converting/correlating perf_clock timestamps to time-of-day is a feature 
I have been trying to get into perf for over 2 years. This is a big 
piece needed for that goal -- along with the xtime tracepoints:
   https://lkml.org/lkml/2013/3/19/433

David



* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-03-31 16:23                       ` David Ahern
@ 2013-04-01 18:29                         ` John Stultz
  2013-04-01 22:29                           ` David Ahern
  2013-04-02  7:54                           ` Peter Zijlstra
  0 siblings, 2 replies; 65+ messages in thread
From: John Stultz @ 2013-04-01 18:29 UTC (permalink / raw)
  To: David Ahern
  Cc: Pawel Moll, Stephane Eranian, Peter Zijlstra, Thomas Gleixner,
	LKML, mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter

On 03/31/2013 09:23 AM, David Ahern wrote:
> On 3/14/13 1:57 PM, Pawel Moll wrote:
>> Ok, how about the code below? I must say I have some doubts about the
>> resolution, as there seem to be no generic way of figuring it out for
>> the sched_clock (the arch/arm/kernel/sched_clock.c is actually
>> calculating it, but than just prints it out and nothing more).
>>
>> And, to summarize, we went through 3 ideas:
>>
>> 1. ioctl() - http://article.gmane.org/gmane.linux.kernel/1433933
>> 2. syscall - http://article.gmane.org/gmane.linux.kernel/1437057
>> 3. POSIX clock - below
>>
>> John also suggested that maybe the perf could use CLOCK_MONOTONIC_RAW
>> instead of local/sched_clock().
>
> Any chance a decision can be reached in time for 3.10? Seems like the 
> simplest option is the perf event based ioctl.

I'm still not sold on the CLOCK_PERF posix clock. The semantics are 
still too hand-wavy and implementation specific.

While I'd prefer perf to export some existing semi-sane time domain 
(using interpolation if necessary), I realize the hardware constraints 
and performance optimizations make this unlikely (though I'm 
disappointed I've not seen any attempt or proof point that it won't work).

Thus if we must expose this kernel detail to userland, I think we should 
be careful about how publicly we expose such an interface, as it has the 
potential for misuse and eventual user-land breakage.

So while having a perf specific ioctl is still exposing what I expect 
will be non-static kernel internal behavior to userland, at least it 
exposes it in a less generic fashion, which is preferable to me.



The next point of conflict is likely whether the ioctl method will be 
sufficient given performance concerns. Something I'd be interested in 
hearing about from the folks pushing this. Right now it seems any method 
is preferable to not having an interface -- but I want to make sure 
that's really true.

For example, if the ioctl interface is really too slow, it's likely folks 
will end up using periodic perf ioctl samples and interpolating using 
normal vdso clock_gettime() timestamps.

If that is acceptable, then why not invert the solution and just have 
perf inject periodic CLOCK_MONOTONIC timestamps into the log, then 
have perf report fast, but less-accurate, sched_clock deltas from that 
CLOCK_MONOTONIC boundary?

Another alternative that might be a reasonable compromise: have perf 
register a dynamic posix clock id, which would be a driver specific, 
less public interface. That would provide the initial method to access 
the perf time domain. Then when it came time to optimize further, 
someone would have to sort out the difficulties of creating a vdso 
method for accessing dynamic posix clocks. It wouldn't be easy, but it 
wouldn't be impossible to do.


> Converting/correlating perf_clock timestamps to time-of-day is a 
> feature I have been trying to get into perf for over 2 years. This is 
> a big piece needed for that goal -- along with the xtime tracepoints:
>   https://lkml.org/lkml/2013/3/19/433

I sympathize with how long this process can take.  Having maintainers 
disagree without resolution can be a tar-pit. That said, it's only been 
a few months that this has had proper visibility, and the discussion has 
paused for months at a time. Despite how long and slow this probably 
feels, the idea of maintaining a bad interface for the next decade seems 
much longer. ;)  So don't get discouraged yet.

thanks
-john


* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-01 18:29                         ` John Stultz
@ 2013-04-01 22:29                           ` David Ahern
  2013-04-01 23:12                             ` John Stultz
  2013-04-03  9:17                             ` Stephane Eranian
  2013-04-02  7:54                           ` Peter Zijlstra
  1 sibling, 2 replies; 65+ messages in thread
From: David Ahern @ 2013-04-01 22:29 UTC (permalink / raw)
  To: John Stultz
  Cc: Pawel Moll, Stephane Eranian, Peter Zijlstra, Thomas Gleixner,
	LKML, mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter

On 4/1/13 12:29 PM, John Stultz wrote:
>> Any chance a decision can be reached in time for 3.10? Seems like the
>> simplest option is the perf event based ioctl.
>
> I'm still not sold on the CLOCK_PERF posix clock. The semantics are
> still too hand-wavy and implementation specific.
>
> While I'd prefer perf to export some existing semi-sane time domain
> (using interpolation if necessary), I realize the hardware constraints
> and performance optimizations make this unlikely (though I'm
> disappointed I've not seen any attempt or proof point that it won't work).
>
> Thus if we must expose this kernel detail to userland, I think we should
> be careful about how publicly we expose such an interface, as it has the
> potential for misuse and eventual user-land breakage.

But perf_clock timestamps are already exposed to userland. This new API 
-- be it a posix clock or an ioctl -- just allows retrieval of a 
timestamp outside of a generated event.

>
> So while having a perf specific ioctl is still exposing what I expect
> will be non-static kernel internal behavior to userland, it at least it
> exposes it in a less generic fashion, which is preferable to me.
>
>
>
> The next point of conflict is likely if the ioctl method will be
> sufficient given performance concerns. Something I'd be interested in
> hearing about from the folks pushing this. Right now it seems any method
> is preferable then not having an interface - but I want to make sure
> that's really true.
>
> For example, if the ioctl interface is really too slow, its likely folks
> will end up using periodic perf ioctl samples and interpolating using
> normal vdso clock_gettime() timestamps.

The performance/speed depends on how often it is called. I have no idea 
what Stephane's use case is, but for me it is to correlate perf_clock 
timestamps to timeofday. In my perf-based daemon that tracks process 
schedulings, I update the correlation every 5-10 minutes.

>
> If that is acceptable, then why not invert the solution and just have
> perf injecting periodic CLOCK_MONOTONIC timestamps into the log, then
> have perf report fast, but less-accurate sched_clock deltas from that
> CLOCK_MONOTONIC boundary.

Something similar to that approach has been discussed as well. i.e, add 
a realtime clock event and have it injected into the stream e.g.,
https://lkml.org/lkml/2011/2/27/158

But there are cons to this approach -- e.g., you need that first event 
generated that tells you the realtime-to-perf_clock correlation, and you 
don't want to have to scan an unknown length of events looking for the 
first one to get the correlation, only to back up and process the events.

And an ioctl to generate that first event was shot down as well...
   https://lkml.org/lkml/2011/3/1/174
   https://lkml.org/lkml/2011/3/2/186

David

>
> Another alternative that might be a reasonable compromise: have perf
> register a dynamic posix clock id, which would be a driver specific,
> less public interface. That would provide the initial method to access
> the perf time domain. Then when it came time to optimize further,
> someone would have to sort out the difficulties of creating a vdso
> method for accessing dynamic posix clocks. It wouldn't be easy, but it
> wouldn't be impossible to do.
>
>
>> Converting/correlating perf_clock timestamps to time-of-day is a
>> feature I have been trying to get into perf for over 2 years. This is
>> a big piece needed for that goal -- along with the xtime tracepoints:
>>   https://lkml.org/lkml/2013/3/19/433
>
> I sympathize with how long this process can take.  Having maintainers
> disagree without resolution can be a tar-pit. That said, its only been a
> few months where this has had proper visibility, and the discussion has
> paused for months at a time. Despite how long and slow this probably
> feels, the idea of maintaining a bad interface for the next decade seems
> much longer. ;)  So don't get discouraged yet.
>
> thanks
> -john



* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-01 22:29                           ` David Ahern
@ 2013-04-01 23:12                             ` John Stultz
  2013-04-03  9:17                             ` Stephane Eranian
  1 sibling, 0 replies; 65+ messages in thread
From: John Stultz @ 2013-04-01 23:12 UTC (permalink / raw)
  To: David Ahern
  Cc: Pawel Moll, Stephane Eranian, Peter Zijlstra, Thomas Gleixner,
	LKML, mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter

On 04/01/2013 03:29 PM, David Ahern wrote:
> On 4/1/13 12:29 PM, John Stultz wrote:
>>> Any chance a decision can be reached in time for 3.10? Seems like the
>>> simplest option is the perf event based ioctl.
>>
>> I'm still not sold on the CLOCK_PERF posix clock. The semantics are
>> still too hand-wavy and implementation specific.
>>
>> While I'd prefer perf to export some existing semi-sane time domain
>> (using interpolation if necessary), I realize the hardware constraints
>> and performance optimizations make this unlikely (though I'm
>> disappointed I've not seen any attempt or proof point that it won't 
>> work).
>>
>> Thus if we must expose this kernel detail to userland, I think we should
>> be careful about how publicly we expose such an interface, as it has the
>> potential for misuse and eventual user-land breakage.
>
> But perf_clock timestamps are already exposed to userland. This new 
> API -- be it a posix clock or an ioctl -- just allows retrieval of a 
> timestamp outside of a generated event.

Although perf_clock timestamps are not exposed to applications in a way 
they can use for their own purposes, no? Just as timestamp data 
correlated with other perf data.

My big concern here is that if applications can retrieve these 
timestamps for their own use, folks will see CLOCK_PERF as a cheaper 
alternative to CLOCK_MONOTONIC, and then end up getting bitten when the 
CLOCK_PERF semantics change. So until someone hammers out exactly 
the behavior CLOCK_PERF should have forever going forward, I'd rather 
not add it as a generic clockid.

If we're going to have to expose the perf timestamps to userland, then 
I'd prefer we do it in a less public way, where its clearly tied to the 
perf interface and not as a generic clockid. (And either the ioctl or 
dynamic posix clock id would be a way to go there).


>
>>
>> So while having a perf specific ioctl is still exposing what I expect
>> will be non-static kernel internal behavior to userland, it at least it
>> exposes it in a less generic fashion, which is preferable to me.
>>
>>
>>
>> The next point of conflict is likely if the ioctl method will be
>> sufficient given performance concerns. Something I'd be interested in
>> hearing about from the folks pushing this. Right now it seems any method
>> is preferable then not having an interface - but I want to make sure
>> that's really true.
>>
>> For example, if the ioctl interface is really too slow, its likely folks
>> will end up using periodic perf ioctl samples and interpolating using
>> normal vdso clock_gettime() timestamps.
>
> The performance/speed depends on how often is called. I have no idea 
> what Stephane's use case is but for me it is to correlate perf_clock 
> timestamps to timeofday. In my perf-based daemon that tracks process 
> schedulings, I update the correlation every 5-10 minutes.

So that sounds like the ioctl approach would have no penalty from a 
performance perspective.


>
>>
>> If that is acceptable, then why not invert the solution and just have
>> perf injecting periodic CLOCK_MONOTONIC timestamps into the log, then
>> have perf report fast, but less-accurate sched_clock deltas from that
>> CLOCK_MONOTONIC boundary.
>
> Something similar to that approach has been discussed as well. i.e, 
> add a realtime clock event and have it injected into the stream e.g.,
> https://lkml.org/lkml/2011/2/27/158
>
> But there are cons to this approach -- e.g, you need that first event 
> generated that tells you realtime to perf_clock correlation and you 
> don't want to have to scan an unknown length of events looking for the 
> first one to get the correlation only to backup and process the events.
>
> And an ioctl to generate that first event was shot down as well...
>   https://lkml.org/lkml/2011/3/1/174
>   https://lkml.org/lkml/2011/3/2/186

Hrm.

So from my quick read of that thread, what it seems Thomas is getting at 
there is that the periodic CLOCK_REALTIME injection isn't valuable 
without the tracepoints on timekeeping changes. But once those are 
there, the periodic timestamp injection would seemingly provide _most_ 
of what you need.

The missing bit is the desire to inject arbitrary timestamp "fences" 
into the log, which Peter and Thomas apparently don't like, but actually 
sounds useful to me.

But maybe there is a way to do this without adding an ioctl?

For instance, and this is just spitballing here, if you had a tracepoint 
for clock_gettime() which returned the clockid and value, you could 
create these fences just by requesting the time from userland. The VDSO 
clock_gettime() would bypass the syscall (and with it the tracepoint), 
so you'd have to call the syscall directly from userland. But this 
would have the added 
benefit of not slowing down normal userspace that doesn't want to cause 
these fences in the log.

I realize these half-baked "brainstorming" ideas are probably a bit 
unwelcome after you've spent quite a bit of time trying different 
approaches without any decisive resolution from maintainers. But maybe 
Thomas and Peter can chime in here and maybe help to clarify their 
objections?

thanks
-john


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-01 18:29                         ` John Stultz
  2013-04-01 22:29                           ` David Ahern
@ 2013-04-02  7:54                           ` Peter Zijlstra
  2013-04-02 16:05                             ` Pawel Moll
  2013-04-02 16:19                             ` John Stultz
  1 sibling, 2 replies; 65+ messages in thread
From: Peter Zijlstra @ 2013-04-02  7:54 UTC (permalink / raw)
  To: John Stultz
  Cc: David Ahern, Pawel Moll, Stephane Eranian, Thomas Gleixner, LKML,
	mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter

On Mon, 2013-04-01 at 11:29 -0700, John Stultz wrote:
> I'm still not sold on the CLOCK_PERF posix clock. The semantics are 
> still too hand-wavy and implementation specific.

How about we define the semantics as: match whatever comes out of perf
(and preferably ftrace by default) stuff?

Since that stuff is already exposed to userspace, doesn't it make sense
to have a user accessible time source that generates the same time-line
so that people can create logs that can be properly interleaved?



* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-02  7:54                           ` Peter Zijlstra
@ 2013-04-02 16:05                             ` Pawel Moll
  2013-04-02 16:19                             ` John Stultz
  1 sibling, 0 replies; 65+ messages in thread
From: Pawel Moll @ 2013-04-02 16:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, David Ahern, Stephane Eranian, Thomas Gleixner,
	LKML, mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter

On Tue, 2013-04-02 at 08:54 +0100, Peter Zijlstra wrote:
> On Mon, 2013-04-01 at 11:29 -0700, John Stultz wrote:
> > I'm still not sold on the CLOCK_PERF posix clock. The semantics are 
> > still too hand-wavy and implementation specific.
> 
> How about we define the semantics as: match whatever comes out of perf
> (and preferably ftrace by default) stuff?

My thought exactly. Maybe if we defined it as "CLOCK_TRACE" and had an
equivalent "trace_clock()" function used by both perf (instead of
perf_clock()) and ftrace, the semantics would become clearer? This
clock could then be described as "the source of timestamps used by the
Linux trace infrastructure, in particular by ftrace and perf".

Paweł




* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-02  7:54                           ` Peter Zijlstra
  2013-04-02 16:05                             ` Pawel Moll
@ 2013-04-02 16:19                             ` John Stultz
  2013-04-02 16:34                               ` Pawel Moll
  2013-04-03 17:19                               ` Pawel Moll
  1 sibling, 2 replies; 65+ messages in thread
From: John Stultz @ 2013-04-02 16:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Ahern, Pawel Moll, Stephane Eranian, Thomas Gleixner, LKML,
	mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter

On 04/02/2013 12:54 AM, Peter Zijlstra wrote:
> On Mon, 2013-04-01 at 11:29 -0700, John Stultz wrote:
>> I'm still not sold on the CLOCK_PERF posix clock. The semantics are
>> still too hand-wavy and implementation specific.
> How about we define the semantics as: match whatever comes out of perf
> (and preferably ftrace by default) stuff?

That's not a sane interface. We've already been bitten by semantic 
changes in sched_clock affecting in-kernel users. How are we going to 
handle this with userland in the future? What happens when applications 
depend on "what comes out of perf" on one system and that ends up being 
different on another? "Oh, it's just broken, the application shouldn't be 
using that."

I'm sort of amazed that folks are so careful and hesitant to add an 
ioctl to inject a timestamp fence into perf, but then so cavalier about 
adding an ill-defined clockid as a generic interface.


> Since that stuff is already exposed to userspace, doesn't it make sense
> to have a user accessible time source that generates the same time-line
> so that people can create logs that can be properly interleaved?

It's exposed to userspace as timestamps correlated with specific data, 
not timestamps for any purpose. We export kernel function addresses via 
WARN_ON messages to dmesg, it doesn't mean we might as well allow 
userland to jump and execute those addresses. ;)

I still think exposing the perf clock to userland is a bad idea, and 
would much rather the kernel provide timestamp data in the logs 
themselves to make the logs useful. But if we're going to have to do 
this via a clockid, I'm going to want it to be done via a dynamic posix 
clockid, so it's clear it's tightly tied with perf and not considered a 
generic interface (and I can clearly point folks having problems to the 
perf maintainers ;).


thanks
-john



* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-02 16:19                             ` John Stultz
@ 2013-04-02 16:34                               ` Pawel Moll
  2013-04-03 17:19                               ` Pawel Moll
  1 sibling, 0 replies; 65+ messages in thread
From: Pawel Moll @ 2013-04-02 16:34 UTC (permalink / raw)
  To: John Stultz
  Cc: Peter Zijlstra, David Ahern, Stephane Eranian, Thomas Gleixner,
	LKML, mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter

On Tue, 2013-04-02 at 17:19 +0100, John Stultz wrote:
> I still think exposing the perf clock to userland is a bad idea, and 
> would much rather the kernel provide timestamp data in the logs 
> themselves to make the logs useful. But if we're going to have to do 
> this via a clockid, I'm going to want it to be done via a dynamic posix 
> clockid, so it's clear it's tightly tied with perf and not considered a 
> generic interface (and I can clearly point folks having problems to the 
> perf maintainers ;).

Hm. 15 mins ago I didn't know about the existence of dynamic posix
clocks at all ;-)

I feel that the idea of opening a magic character device to obtain a
magic number to be used with clock_gettime() to get the timestamp may
not be popular, but maybe (just a thought) we could somehow use the
file descriptor obtained by sys_perf_event_open() itself? How different
it would be from the ioctl(*_GET_TIME) I'm not sure, but I'll try to
research the idea. Counts as number 4 (5?) on my list ;-)

Paweł




* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-01 22:29                           ` David Ahern
  2013-04-01 23:12                             ` John Stultz
@ 2013-04-03  9:17                             ` Stephane Eranian
  2013-04-03 13:55                               ` David Ahern
  1 sibling, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2013-04-03  9:17 UTC (permalink / raw)
  To: David Ahern
  Cc: John Stultz, Pawel Moll, Peter Zijlstra, Thomas Gleixner, LKML,
	mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter

On Tue, Apr 2, 2013 at 12:29 AM, David Ahern <dsahern@gmail.com> wrote:
> On 4/1/13 12:29 PM, John Stultz wrote:
>>>
>>> Any chance a decision can be reached in time for 3.10? Seems like the
>>> simplest option is the perf event based ioctl.
>>
>>
>> I'm still not sold on the CLOCK_PERF posix clock. The semantics are
>> still too hand-wavy and implementation specific.
>>
>> While I'd prefer perf to export some existing semi-sane time domain
>> (using interpolation if necessary), I realize the hardware constraints
>> and performance optimizations make this unlikely (though I'm
>> disappointed I've not seen any attempt or proof point that it won't work).
>>
>> Thus if we must expose this kernel detail to userland, I think we should
>> be careful about how publicly we expose such an interface, as it has the
>> potential for misuse and eventual user-land breakage.
>
>
> But perf_clock timestamps are already exposed to userland. This new API --
> be it a posix clock or an ioctl -- just allows retrieval of a timestamp
> outside of a generated event.
>
Agreed.
>
>>
>> So while having a perf specific ioctl is still exposing what I expect
>> will be non-static kernel internal behavior to userland, at least it 
>> exposes it in a less generic fashion, which is preferable to me.
>>
>>
>>
>> The next point of conflict is likely if the ioctl method will be
>> sufficient given performance concerns. Something I'd be interested in
>> hearing about from the folks pushing this. Right now it seems any method
>> is preferable than not having an interface - but I want to make sure
>> that's really true.
>>
>> For example, if the ioctl interface is really too slow, it's likely folks
>> will end up using periodic perf ioctl samples and interpolating using
>> normal vdso clock_gettime() timestamps.
>
I haven't done any specific testing with either approach yet. The goal
is to use this perf timestamp to correlate user-level events to
hardware events recorded by the kernel. I would assume there would be
situations where those user events could be on the critical path, and
thus the timestamp operation would have to be as efficient as possible.
The vdso approach would be ideal.

>
> The performance/speed depends on how often it is called. I have no idea what
> Stephane's use case is but for me it is to correlate perf_clock timestamps
> to timeofday. In my perf-based daemon that tracks process schedulings, I
> update the correlation every 5-10 minutes.
>
I was thinking more along the lines of runtime environments like Java,
where a JIT compiler is invoked frequently and you need to correlate
samples in the native code with the Java source. For that, the JIT
compiler has to emit mapping tables, which have to be timestamped as
address ranges may be re-used.

>
>>
>> If that is acceptable, then why not invert the solution and just have
>> perf inject periodic CLOCK_MONOTONIC timestamps into the log, then
>> have perf report fast, but less-accurate sched_clock deltas from that
>> CLOCK_MONOTONIC boundary.
>
>
> Something similar to that approach has been discussed as well, i.e., add a
> realtime clock event and have it injected into the stream e.g.,
> https://lkml.org/lkml/2011/2/27/158
>
> But there are cons to this approach -- e.g., you need that first event
> generated that tells you the realtime to perf_clock correlation, and you
> don't want to have to scan an unknown length of events looking for the
> first one to get the correlation only to back up and process the events.
>
> And an ioctl to generate that first event was shot down as well...
>   https://lkml.org/lkml/2011/3/1/174
>   https://lkml.org/lkml/2011/3/2/186
>
> David
>
>
>>
>> Another alternative that might be a reasonable compromise: have perf
>> register a dynamic posix clock id, which would be a driver specific,
>> less public interface. That would provide the initial method to access
>> the perf time domain. Then when it came time to optimize further,
>> someone would have to sort out the difficulties of creating a vdso
>> method for accessing dynamic posix clocks. It wouldn't be easy, but it
>> wouldn't be impossible to do.
>>
>>
>>> Converting/correlating perf_clock timestamps to time-of-day is a
>>> feature I have been trying to get into perf for over 2 years. This is
>>> a big piece needed for that goal -- along with the xtime tracepoints:
>>>   https://lkml.org/lkml/2013/3/19/433
>>
>>
>> I sympathize with how long this process can take.  Having maintainers
>> disagree without resolution can be a tar-pit. That said, it's only been a 
>> few months where this has had proper visibility, and the discussion has
>> paused for months at a time. Despite how long and slow this probably
>> feels, the idea of maintaining a bad interface for the next decade seems
>> much longer. ;)  So don't get discouraged yet.
>>
>> thanks
>> -john
>
>


* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-03  9:17                             ` Stephane Eranian
@ 2013-04-03 13:55                               ` David Ahern
  2013-04-03 14:00                                 ` Stephane Eranian
  0 siblings, 1 reply; 65+ messages in thread
From: David Ahern @ 2013-04-03 13:55 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: John Stultz, Pawel Moll, Peter Zijlstra, Thomas Gleixner, LKML,
	mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter

On 4/3/13 3:17 AM, Stephane Eranian wrote:
> I haven't done any specific testing with either approach yet. The goal is to
> use this perf timestamp to correlate user level events to hardware
> events recorded
> by the kernel. I would assume there would be situations where those user events
> could be on the critical path, and thus the timestamp operation would have to be
> as efficient as possible. The vdso approach would be ideal.
>
>>
>> The performance/speed depends on how often it is called. I have no idea what
>> Stephane's use case is but for me it is to correlate perf_clock timestamps
>> to timeofday. In my perf-based daemon that tracks process schedulings, I
>> update the correlation every 5-10 minutes.
>>
> I was more thinking along the lines of runtime environments like Java where
> a JIT compiler is invoked frequently and you need to correlate samples in the
> native code with Java source. For that, the JIT compiler has to emit mapping
> tables which have to be timestamped as address ranges may be re-used.

What's the advantage of changing apps -- like the JIT compiler -- to 
emit perf-based timestamps versus having perf emit existing timestamps? 
I.e., monotonic and realtime clocks already have vdso mappings for 
userspace with well-known performance characteristics. Why not have perf 
convert its perf_clock timestamps into monotonic or realtime when 
dumping events?

David



* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-03 13:55                               ` David Ahern
@ 2013-04-03 14:00                                 ` Stephane Eranian
  2013-04-03 14:14                                   ` David Ahern
  0 siblings, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2013-04-03 14:00 UTC (permalink / raw)
  To: David Ahern
  Cc: John Stultz, Pawel Moll, Peter Zijlstra, Thomas Gleixner, LKML,
	mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter

On Wed, Apr 3, 2013 at 3:55 PM, David Ahern <dsahern@gmail.com> wrote:
> On 4/3/13 3:17 AM, Stephane Eranian wrote:
>>
>> I haven't done any specific testing with either approach yet. The goal is
>> to
>> use this perf timestamp to correlate user level events to hardware
>> events recorded
>> by the kernel. I would assume there would be situations where those user
>> events
>> could be on the critical path, and thus the timestamp operation would have
>> to be
>> as efficient as possible. The vdso approach would be ideal.
>>
>>>
>>> The performance/speed depends on how often it is called. I have no idea what
>>> Stephane's use case is but for me it is to correlate perf_clock
>>> timestamps
>>> to timeofday. In my perf-based daemon that tracks process schedulings, I
>>> update the correlation every 5-10 minutes.
>>>
>> I was more thinking along the lines of runtime environments like Java
>> where
>> a JIT compiler is invoked frequently and you need to correlate samples in
>> the
>> native code with Java source. For that, the JIT compiler has to emit
>> mapping
>> tables which have to be timestamped as address ranges may be re-used.
>
>
> What's the advantage of changing apps -- like the JIT compiler -- to emit
> perf-based timestamps versus having perf emit existing timestamps? I.e.,
> monotonic and realtime clocks already have vdso mappings for userspace with
> well-known performance characteristics. Why not have perf convert its
> perf_clock timestamps into monotonic or realtime when dumping events?
>
Can monotonic timestamps be obtained from NMI context in the kernel?


* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-03 14:00                                 ` Stephane Eranian
@ 2013-04-03 14:14                                   ` David Ahern
  2013-04-03 14:22                                     ` Stephane Eranian
  0 siblings, 1 reply; 65+ messages in thread
From: David Ahern @ 2013-04-03 14:14 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: John Stultz, Pawel Moll, Peter Zijlstra, Thomas Gleixner, LKML,
	mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter

On 4/3/13 8:00 AM, Stephane Eranian wrote:
>> What's the advantage of changing apps -- like the JIT compiler -- to emit
>> perf-based timestamps versus having perf emit existing timestamps? I.e.,
>> monotonic and realtime clocks already have vdso mappings for userspace with
>> well-known performance characteristics. Why not have perf convert its
>> perf_clock timestamps into monotonic or realtime when dumping events?
>>
> Can monotonic timestamps be obtained from NMI context in the kernel?

I don't understand the context of the question.

I am not suggesting perf_clock be changed. I am working on correlating 
existing perf_clock timestamps to clocks typically used by apps 
(REALTIME and time-of-day but also applies to MONOTONIC).

You are wanting the reverse -- have apps emit perf_clock timestamps. I 
was just wondering what is the advantage of this approach?

David



* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-03 14:14                                   ` David Ahern
@ 2013-04-03 14:22                                     ` Stephane Eranian
  2013-04-03 17:57                                       ` John Stultz
  0 siblings, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2013-04-03 14:22 UTC (permalink / raw)
  To: David Ahern
  Cc: John Stultz, Pawel Moll, Peter Zijlstra, Thomas Gleixner, LKML,
	mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt

On Wed, Apr 3, 2013 at 4:14 PM, David Ahern <dsahern@gmail.com> wrote:
> On 4/3/13 8:00 AM, Stephane Eranian wrote:
>>>
>>> What's the advantage of changing apps -- like the JIT compiler -- to emit
>>> perf based timestamps versus having perf emit existing timestamps? ie.,
>>> monotonic and realtime clocks already have vdso mappings for userspace
>>> with
>>> well known performance characteristics. Why not have perf convert its
>>> perf_clock timestamps into monotonic or realtime when dumping events?
>>>
>> Can monotonic timestamps be obtained from NMI context in the kernel?
>
>
> I don't understand the context of the question.
>
> I am not suggesting perf_clock be changed. I am working on correlating
> existing perf_clock timestamps to clocks typically used by apps (REALTIME
> and time-of-day but also applies to MONOTONIC).
>
But for that, you'd need to expose to users the correlation between
the two clocks. And now you'd have fixed two clock source definitions,
not just one.

> You are wanting the reverse -- have apps emit perf_clock timestamps. I was
> just wondering what is the advantage of this approach?
>
Well, that's how I interpreted your question ;-<

If you could have perf_clock use monotonic then we would not have this
discussion. The correlation would be trivial.


* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-02 16:19                             ` John Stultz
  2013-04-02 16:34                               ` Pawel Moll
@ 2013-04-03 17:19                               ` Pawel Moll
  2013-04-03 17:29                                 ` John Stultz
  1 sibling, 1 reply; 65+ messages in thread
From: Pawel Moll @ 2013-04-03 17:19 UTC (permalink / raw)
  To: John Stultz
  Cc: Peter Zijlstra, David Ahern, Stephane Eranian, Thomas Gleixner,
	LKML, mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter

On Tue, 2013-04-02 at 17:19 +0100, John Stultz wrote:
> But if we're going to have to do 
> this via a clockid, I'm going to want it to be done via a dynamic posix 
> clockid, so it's clear it's tightly tied with perf and not considered a 
> generic interface (and I can clearly point folks having problems to the 
> perf maintainers ;).

Ok, so how about the code below?

There are two distinct parts of the "solution":

1. The dynamic posix clock, as you suggested. Then one can get the perf
timestamp by doing:

	clock_fd = open("/dev/perf-clock", O_RDONLY);
	clock_gettime(FD_TO_CLOCKID(clock_fd), &ts) 

2. A sort-of-hack in the get_posix_clock() function making it possible
to do the same using the perf event file descriptor, eg.:

	fd = sys_perf_event_open(&attr, -1, 0, -1, 0);
	clock_gettime(FD_TO_CLOCKID(fd), &ts) 

Any (either strong or not) opinions?

Pawel

8<--------------
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e47ee46..b2127e3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -52,6 +52,7 @@ struct perf_guest_info_callbacks {
 #include <linux/atomic.h>
 #include <linux/sysfs.h>
 #include <linux/perf_regs.h>
+#include <linux/posix-clock.h>
 #include <asm/local.h>
 
 struct perf_callchain_entry {
@@ -845,4 +846,6 @@ _name##_show(struct device *dev,					\
 									\
 static struct device_attribute format_attr_##_name = __ATTR_RO(_name)
 
+struct posix_clock *perf_get_posix_clock(struct file *fp);
+
 #endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b0cd865..534cb43 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7446,6 +7446,49 @@ unlock:
 }
 device_initcall(perf_event_sysfs_init);
 
+static int perf_posix_clock_getres(struct posix_clock *pc, struct timespec *tp)
+{
+	*tp = ns_to_timespec(TICK_NSEC);
+	return 0;
+}
+
+static int perf_posix_clock_gettime(struct posix_clock *pc, struct timespec *tp)
+{
+	*tp = ns_to_timespec(perf_clock());
+	return 0;
+}
+
+static const struct posix_clock_operations perf_posix_clock_ops = {
+	.clock_getres = perf_posix_clock_getres,
+	.clock_gettime = perf_posix_clock_gettime,
+};
+
+static struct posix_clock perf_posix_clock;
+
+struct posix_clock *perf_get_posix_clock(struct file *fp)
+{
+	if (!fp || fp->f_op != &perf_fops)
+		return NULL;
+
+	down_read(&perf_posix_clock.rwsem);
+
+	return &perf_posix_clock;
+}
+
+static int __init perf_posix_clock_init(void)
+{
+	dev_t devt;
+	int ret;
+
+	ret = alloc_chrdev_region(&devt, 0, 1, "perf-clock");
+	if (ret)
+		return ret;
+
+	perf_posix_clock.ops = perf_posix_clock_ops;
+	return posix_clock_register(&perf_posix_clock, devt);
+}
+device_initcall(perf_posix_clock_init);
+
 #ifdef CONFIG_CGROUP_PERF
 static struct cgroup_subsys_state *perf_cgroup_css_alloc(struct cgroup *cont)
 {
diff --git a/kernel/time/posix-clock.c b/kernel/time/posix-clock.c
index ce033c7..e2a40a5 100644
--- a/kernel/time/posix-clock.c
+++ b/kernel/time/posix-clock.c
@@ -20,6 +20,7 @@
 #include <linux/device.h>
 #include <linux/export.h>
 #include <linux/file.h>
+#include <linux/perf_event.h>
 #include <linux/posix-clock.h>
 #include <linux/slab.h>
 #include <linux/syscalls.h>
@@ -249,16 +250,21 @@ struct posix_clock_desc {
 static int get_clock_desc(const clockid_t id, struct posix_clock_desc *cd)
 {
 	struct file *fp = fget(CLOCKID_TO_FD(id));
+	struct posix_clock *perf_clk = NULL;
 	int err = -EINVAL;
 
 	if (!fp)
 		return err;
 
-	if (fp->f_op->open != posix_clock_open || !fp->private_data)
+#if defined(CONFIG_PERF_EVENTS)
+	perf_clk = perf_get_posix_clock(fp);
+#endif
+	if ((fp->f_op->open != posix_clock_open || !fp->private_data) &&
+			!perf_clk)
 		goto out;
 
 	cd->fp = fp;
-	cd->clk = get_posix_clock(fp);
+	cd->clk = perf_clk ? perf_clk : get_posix_clock(fp);
 
 	err = cd->clk ? 0 : -ENODEV;
 out:





* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-03 17:19                               ` Pawel Moll
@ 2013-04-03 17:29                                 ` John Stultz
  2013-04-03 17:35                                   ` Pawel Moll
  0 siblings, 1 reply; 65+ messages in thread
From: John Stultz @ 2013-04-03 17:29 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Peter Zijlstra, David Ahern, Stephane Eranian, Thomas Gleixner,
	LKML, mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter

On 04/03/2013 10:19 AM, Pawel Moll wrote:
> On Tue, 2013-04-02 at 17:19 +0100, John Stultz wrote:
>> But if we're going to have to do
>> this via a clockid, I'm going to want it to be done via a dynamic posix
>> clockid, so it's clear it's tightly tied with perf and not considered a
>> generic interface (and I can clearly point folks having problems to the
>> perf maintainers ;).
> Ok, so how about the code below?
>
> There are two distinct parts of the "solution":
>
> 1. The dynamic posix clock, as you suggested. Then one can get the perf
> timestamp by doing:
>
> 	clock_fd = open("/dev/perf-clock", O_RDONLY);
> 	clock_gettime(FD_TO_CLOCKID(clock_fd), &ts)
>
> 2. A sort-of-hack in the get_posix_clock() function making it possible
> to do the same using the perf event file descriptor, eg.:
>
> 	fd = sys_perf_event_open(&attr, -1, 0, -1, 0);
> 	clock_gettime(FD_TO_CLOCKID(fd), &ts)

#2 makes my nose wrinkle. Forgive me for being somewhat ignorant on the 
perf interfaces, but why is the second portion necessary or beneficial?

thanks
-john



* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-03 17:29                                 ` John Stultz
@ 2013-04-03 17:35                                   ` Pawel Moll
  2013-04-03 17:50                                     ` John Stultz
  0 siblings, 1 reply; 65+ messages in thread
From: Pawel Moll @ 2013-04-03 17:35 UTC (permalink / raw)
  To: John Stultz
  Cc: Peter Zijlstra, David Ahern, Stephane Eranian, Thomas Gleixner,
	LKML, mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter

On Wed, 2013-04-03 at 18:29 +0100, John Stultz wrote:
> On 04/03/2013 10:19 AM, Pawel Moll wrote:
> > On Tue, 2013-04-02 at 17:19 +0100, John Stultz wrote:
> >> But if we're going to have to do
> >> this via a clockid, I'm going to want it to be done via a dynamic posix
> >> clockid, so it's clear it's tightly tied with perf and not considered a
> >> generic interface (and I can clearly point folks having problems to the
> >> perf maintainers ;).
> > Ok, so how about the code below?
> >
> > There are two distinct parts of the "solution":
> >
> > 1. The dynamic posix clock, as you suggested. Then one can get the perf
> > timestamp by doing:
> >
> > 	clock_fd = open("/dev/perf-clock", O_RDONLY);
> > 	clock_gettime(FD_TO_CLOCKID(clock_fd), &ts)
> >
> > 2. A sort-of-hack in the get_posix_clock() function making it possible
> > to do the same using the perf event file descriptor, eg.:
> >
> > 	fd = sys_perf_event_open(&attr, -1, 0, -1, 0);
> > 	clock_gettime(FD_TO_CLOCKID(fd), &ts)
> 
> #2 makes my nose wrinkle. 

To make myself clear: I consider the code as it is a hack.

> Forgive me for being somewhat ignorant on the 
> perf interfaces, but why is the second portion necessary or beneficial?

My thinking: the perf syscall returns a file descriptor already, so it
would make sense to re-use it in the clock_gettime() call instead of
jumping through hoops to open a character device file, which may not
exist at all (eg. no udev) or may be placed or named in a random way
(eg. some local udev rule).

I'm open for different opinions :-)

Pawel




* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-03 17:35                                   ` Pawel Moll
@ 2013-04-03 17:50                                     ` John Stultz
  2013-04-04  7:37                                       ` Richard Cochran
  2013-04-04 16:29                                       ` Pawel Moll
  0 siblings, 2 replies; 65+ messages in thread
From: John Stultz @ 2013-04-03 17:50 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Peter Zijlstra, David Ahern, Stephane Eranian, Thomas Gleixner,
	LKML, mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter, Richard Cochran

On 04/03/2013 10:35 AM, Pawel Moll wrote:
> On Wed, 2013-04-03 at 18:29 +0100, John Stultz wrote:
>> On 04/03/2013 10:19 AM, Pawel Moll wrote:
>>> On Tue, 2013-04-02 at 17:19 +0100, John Stultz wrote:
>>>> But if we're going to have to do
>>>> this via a clockid, I'm going to want it to be done via a dynamic posix
>>>> clockid, so it's clear it's tightly tied with perf and not considered a
>>>> generic interface (and I can clearly point folks having problems to the
>>>> perf maintainers ;).
>>> Ok, so how about the code below?
>>>
>>> There are two distinct parts of the "solution":
>>>
>>> 1. The dynamic posix clock, as you suggested. Then one can get the perf
>>> timestamp by doing:
>>>
>>> 	clock_fd = open("/dev/perf-clock", O_RDONLY);
>>> 	clock_gettime(FD_TO_CLOCKID(clock_fd), &ts)
>>>
>>> 2. A sort-of-hack in the get_posix_clock() function making it possible
>>> to do the same using the perf event file descriptor, eg.:
>>>
>>> 	fd = sys_perf_event_open(&attr, -1, 0, -1, 0);
>>> 	clock_gettime(FD_TO_CLOCKID(fd), &ts)
>> #2 makes my nose wrinkle.
> To make myself clear: I consider the code as it is a hack.
>
>> Forgive me for being somewhat ignorant on the
>> perf interfaces, but why is the second portion necessary or beneficial?
> My thinking: the perf syscall returns a file descriptor already, so it
> would make sense to re-use it in the clock_gettime() call instead of
> jumping through hoops to open a character device file, which may not
> exist at all (eg. no udev) or may be placed or named in a random way
> (eg. some local udev rule).
>
> I'm open for different opinions :-)

Cc'ing Richard for his thoughts here.


I get the reasoning around reusing the fd we already have, but is the 
possibility of a dynamic chardev pathname really a big concern?

I'm guessing the private_data on the perf file is already used?

Maybe we can extend the dynamic posix clock code to work on more than 
just the chardev? Although I worry about multiplexing too much 
functionality on the file.

thanks
-john


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-03 14:22                                     ` Stephane Eranian
@ 2013-04-03 17:57                                       ` John Stultz
  2013-04-04  8:12                                         ` Stephane Eranian
  0 siblings, 1 reply; 65+ messages in thread
From: John Stultz @ 2013-04-03 17:57 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: David Ahern, Pawel Moll, Peter Zijlstra, Thomas Gleixner, LKML,
	mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt

On 04/03/2013 07:22 AM, Stephane Eranian wrote:
> On Wed, Apr 3, 2013 at 4:14 PM, David Ahern <dsahern@gmail.com> wrote:
>> On 4/3/13 8:00 AM, Stephane Eranian wrote:
>>>> Why not have perf convert its
>>>> perf_clock timestamps into monotonic or realtime when dumping events?

So this is exactly what I've been wondering through all this.

Perf can keep track of events using its own time domain (which is 
understandably required due to performance and locking issues), but when 
exporting those timestamps to userland, could it not do the same (likely 
imperfect) conversion to existing userland time domains (like 
CLOCK_MONOTONIC)?


>>> Can monotonic timestamps be obtained from NMI context in the kernel?
>>
>> I don't understand the context of the question.
>>
>> I am not suggesting perf_clock be changed. I am working on correlating
>> existing perf_clock timestamps to clocks typically used by apps (REALTIME
>> and time-of-day but also applies to MONOTONIC).
>>
> But for that, you'd need to expose to users the correlation between
> the two clocks.
> And now you'd have fixed two clock source definitions, not just one.

I'm not sure I follow this. If perf exported data came with 
CLOCK_MONOTONIC timestamps, no correlation would need to be exposed.  
perf would just have to do the extra overhead of doing the conversion on 
export.


>> You are wanting the reverse -- have apps emit perf_clock timestamps. I was
>> just wondering what is the advantage of this approach?
>>
> Well, that's how I interpreted your question ;-<
>
> If you could have perf_clock use monotonic then we would not have this
> discussion.
> The correlation would be trivial.

I think the suggestion is not to have perf_clock use 
CLOCK_MONOTONIC, but to have the perf interfaces export CLOCK_MONOTONIC.

thanks
-john


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-03 17:50                                     ` John Stultz
@ 2013-04-04  7:37                                       ` Richard Cochran
  2013-04-04 16:33                                         ` Pawel Moll
  2013-04-04 16:29                                       ` Pawel Moll
  1 sibling, 1 reply; 65+ messages in thread
From: Richard Cochran @ 2013-04-04  7:37 UTC (permalink / raw)
  To: John Stultz
  Cc: Pawel Moll, Peter Zijlstra, David Ahern, Stephane Eranian,
	Thomas Gleixner, LKML, mingo, Paul Mackerras, Anton Blanchard,
	Will Deacon, ak, Pekka Enberg, Steven Rostedt, Robert Richter

On Wed, Apr 03, 2013 at 10:50:57AM -0700, John Stultz wrote:
> 
> I get the reasoning around reusing the fd we already have, but is
> the possibility of a dynamic chardev pathname really a big concern?

I have been following this thread, and, not knowing very much about
perf, I would think that the userland can easily open a second file
(the dynamic posix clock chardev) in order to get these time stamps.

> Maybe we can extend the dynamic posix clock code to work on more
> than just the chardev? Although I worry about multiplexing too much
> functionality on the file.

I don't yet see a need for that, but if we do, then it should work in
a generic way, and not as a list of special cases, like we saw in the
patch.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-03 17:57                                       ` John Stultz
@ 2013-04-04  8:12                                         ` Stephane Eranian
  2013-04-04 22:26                                           ` John Stultz
  0 siblings, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2013-04-04  8:12 UTC (permalink / raw)
  To: John Stultz
  Cc: David Ahern, Pawel Moll, Peter Zijlstra, Thomas Gleixner, LKML,
	mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt

On Wed, Apr 3, 2013 at 7:57 PM, John Stultz <john.stultz@linaro.org> wrote:
> On 04/03/2013 07:22 AM, Stephane Eranian wrote:
>>
>> On Wed, Apr 3, 2013 at 4:14 PM, David Ahern <dsahern@gmail.com> wrote:
>>>
>>> On 4/3/13 8:00 AM, Stephane Eranian wrote:
>>>>>
>>>>> Why not have perf convert its
>>>>> perf_clock timestamps into monotonic or realtime when dumping events?
>
>
> So this is exactly what I've been wondering through all this.
>
> Perf can keep track of events using its own time domain (which is
> understandably required due to performance and locking issues), but when
> exporting those timestamps to userland, could it not do the same (likely
> imperfect) conversion to existing userland time domains (like
> CLOCK_MONOTONIC)?
>
>
>
>>>> Can monotonic timestamps be obtained from NMI context in the kernel?
>>>
>>>
>>> I don't understand the context of the question.
>>>
>>> I am not suggesting perf_clock be changed. I am working on correlating
>>> existing perf_clock timestamps to clocks typically used by apps (REALTIME
>>> and time-of-day but also applies to MONOTONIC).
>>>
>> But for that, you'd need to expose to users the correlation between
>> the two clocks.
>> And now you'd have fixed two clock source definitions, not just one.
>
>
> I'm not sure I follow this. If perf exported data came with CLOCK_MONOTONIC
> timestamps, no correlation would need to be exposed.  perf would just have
> to do the extra overhead of doing the conversion on export.
>
There is no explicit export operation in perf.  You record a sample when
the counter overflows and generates an NMI interrupt. In the NMI interrupt
handler, the sample record is written to the sampling buffer. That is when
the timestamp is generated. The sampling buffer is directly accessible to
users via mmap(). The perf tool just dumps the raw sampling buffer into
a file, no sample record is modified or even looked at. The processing
of the samples is done offline (via perf report) and could be done on
another machine. In other words, the perf.data file is self-contained.

Are you suggesting that the perf tool or kernel could expose a constant
correlation factor between the perf timestamp and MONOTONIC, and that
this constant could be recorded by the perf tool in the perf.data file and
used later on by the perf report command?



>
>
>>> You are wanting the reverse -- have apps emit perf_clock timestamps. I
>>> was
>>> just wondering what is the advantage of this approach?
>>>
>> Well, that's how I interpreted your question ;-<
>>
>> If you could have perf_clock use monotonic then we would not have this
>> discussion.
>> The correlation would be trivial.
>
>
> I think the suggestion is not to have the perf_clock use CLOCK_MONOTONIC,
> but the perf interfaces export CLOCK_MONOTONIC.
>
> thanks
> -john
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-03 17:50                                     ` John Stultz
  2013-04-04  7:37                                       ` Richard Cochran
@ 2013-04-04 16:29                                       ` Pawel Moll
  2013-04-05 18:16                                         ` Pawel Moll
  1 sibling, 1 reply; 65+ messages in thread
From: Pawel Moll @ 2013-04-04 16:29 UTC (permalink / raw)
  To: John Stultz
  Cc: Peter Zijlstra, David Ahern, Stephane Eranian, Thomas Gleixner,
	LKML, mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter, Richard Cochran

On Wed, 2013-04-03 at 18:50 +0100, John Stultz wrote:
> I get the reasoning around reusing the fd we already have, but is the 
> possibility of a dynamic chardev pathname really a big concern?

Well, on my particular development system I have no udev, so I had to
do "mknod" manually. The perf syscall works out of the box. Of course one
could say it's my problem...

> I'm guessing the private_data on the perf file is already used?

Of course.

> Maybe we can extend the dynamic posix clock code to work on more than
> just the chardev? 

The idea I'm following now is to make the dynamic clock framework even
more generic, so there could be a clock associated with an arbitrary
struct file * (the perf syscall is getting one with
anon_inode_getfile()). I don't know how to get this done yet, but I'll
give it a try and report.

Paweł



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-04  7:37                                       ` Richard Cochran
@ 2013-04-04 16:33                                         ` Pawel Moll
  0 siblings, 0 replies; 65+ messages in thread
From: Pawel Moll @ 2013-04-04 16:33 UTC (permalink / raw)
  To: Richard Cochran
  Cc: John Stultz, Peter Zijlstra, David Ahern, Stephane Eranian,
	Thomas Gleixner, LKML, mingo, Paul Mackerras, Anton Blanchard,
	Will Deacon, ak, Pekka Enberg, Steven Rostedt, Robert Richter

On Thu, 2013-04-04 at 08:37 +0100, Richard Cochran wrote:
> > I get the reasoning around reusing the fd we already have, but is
> > the possibility of a dynamic chardev pathname really a big concern?
> 
> I have been following this thread, and, not knowing very much about
> perf, I would think that the userland can easily open a second file
> (the dynamic posix clock chardev) in order to get these time stamps.

Sure it can - I've tested it. It's just a bit cumbersome in my opinion
(there is nothing else perf-related in /dev). I can agree to disagree if
you think otherwise :-)

> > Maybe we can extend the dynamic posix clock code to work on more
> > than just the chardev? Although I worry about multiplexing too much
> > functionality on the file.
> 
> I don't yet see a need for that, but if we do, then it should work in
> a generic way, and not as a list of special cases, like we saw in the
> patch.

By all means - and in an even more generic way than it is now (why
character devices and not any other file?). I'll give it a try.

Paweł




^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-04  8:12                                         ` Stephane Eranian
@ 2013-04-04 22:26                                           ` John Stultz
  0 siblings, 0 replies; 65+ messages in thread
From: John Stultz @ 2013-04-04 22:26 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: David Ahern, Pawel Moll, Peter Zijlstra, Thomas Gleixner, LKML,
	mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt

On 04/04/2013 01:12 AM, Stephane Eranian wrote:
> On Wed, Apr 3, 2013 at 7:57 PM, John Stultz <john.stultz@linaro.org> wrote:
>> I'm not sure I follow this. If perf exported data came with CLOCK_MONOTONIC
>> timestamps, no correlation would need to be exposed.  perf would just have
>> to do the extra overhead of doing the conversion on export.
> There is no explicit export operation in perf.  You record a sample when
> the counter overflows and generates an NMI interrupt. In the NMI interrupt
> handler, the sample record is written to the sampling buffer. That is when
> the timestamp is generated. The sampling buffer is directly accessible to
> users via mmap(). The perf tool just dumps the raw sampling buffer into
> a file, no sample record is modified or even looked at. The processing
> of the samples is done offline (via perf report) and could be done on
> another machine. In other words, the perf.data file is self-contained.
Ah. Ok, I didn't realize perf's buffers were directly mmapped.  I was 
thinking perf could do the translation not at NMI time but when the 
buffer was later read by the application.  That helps explain some of 
the constraints.

thanks
-john

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-04 16:29                                       ` Pawel Moll
@ 2013-04-05 18:16                                         ` Pawel Moll
  2013-04-06 11:05                                           ` Richard Cochran
  0 siblings, 1 reply; 65+ messages in thread
From: Pawel Moll @ 2013-04-05 18:16 UTC (permalink / raw)
  To: John Stultz
  Cc: Peter Zijlstra, David Ahern, Stephane Eranian, Thomas Gleixner,
	LKML, mingo, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter, Richard Cochran

On Thu, 2013-04-04 at 17:29 +0100, Pawel Moll wrote:
> > Maybe we can extend the dynamic posix clock code to work on more than
> > just the chardev? 
> 
> The idea I'm following now is to make the dynamic clock framework even
> more generic, so there could be a clock associated with an arbitrary
> struct file * (the perf syscall is getting one with
> anon_inode_getfile()). I don't know how to get this done yet, but I'll
> give it a try and report.

Ok, so how about the code below? Disclaimer: this is just a proposal.
I'm not sure how welcome an extra field in struct file would be, but
this makes the clocks ultimately flexible - one can "attach" the clock
to any arbitrary struct file. Alternatively we could mark a "clocked"
file with a special flag in f_mode and have some kind of lookup.

Also, I can't stop thinking that the posix-clock.c shouldn't actually do
anything about the character device... The PTP core (as the model of
using character device seems to me just one of possible choices) could
do this on its own and have simple open/release attaching/detaching the
clock. This would remove a lot of "generic dev" code in the
posix-clock.c and all the optional cdev methods in struct posix_clock.
It's just a thought, though...

And a couple of questions to Richard... Isn't the kref_put() in
posix_clock_unregister() a bug? I'm not 100% sure, but it looks like a
simple register->unregister sequence leaves the ref count at -1, so
delete_clock() won't be called. And was there any particular reason that
the ops in struct posix_clock are *not* a pointer? This makes static
clock declaration a bit cumbersome (I'm not a C language lawyer, but gcc
doesn't let me do simply .ops = other_static_struct_with_ops).

Regards

Pawel

8<-------------------------------------------
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c28271..4090500 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -804,6 +804,9 @@ struct file {
 #ifdef CONFIG_DEBUG_WRITECOUNT
 	unsigned long f_mnt_write_state;
 #endif
+
+	/* for clock_gettime(FD_TO_CLOCKID(fd)) and friends */
+	struct posix_clock	*posix_clock;
 };
 
 struct file_handle {
diff --git a/include/linux/posix-clock.h b/include/linux/posix-clock.h
index 34c4498..85df2c5 100644
--- a/include/linux/posix-clock.h
+++ b/include/linux/posix-clock.h
@@ -123,6 +123,10 @@ struct posix_clock {
 	void (*release)(struct posix_clock *clk);
 };
 
+void posix_clock_init(struct posix_clock *clk);
+void posix_clock_attach(struct posix_clock *clk, struct file *fp);
+void posix_clock_detach(struct file *fp);
+
 /**
  * posix_clock_register() - register a new clock
  * @clk:   Pointer to the clock. Caller must provide 'ops' and 'release'
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b0cd865..0b70ad1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -34,6 +34,7 @@
 #include <linux/anon_inodes.h>
 #include <linux/kernel_stat.h>
 #include <linux/perf_event.h>
+#include <linux/posix-clock.h>
 #include <linux/ftrace_event.h>
 #include <linux/hw_breakpoint.h>
 #include <linux/mm_types.h>
@@ -627,6 +628,25 @@ perf_cgroup_mark_enabled(struct perf_event *event,
 }
 #endif
 
+static int perf_posix_clock_getres(struct posix_clock *pc, struct timespec *tp)
+{
+	*tp = ns_to_timespec(TICK_NSEC);
+	return 0;
+}
+
+static int perf_posix_clock_gettime(struct posix_clock *pc, struct timespec *tp)
+{
+	*tp = ns_to_timespec(perf_clock());
+	return 0;
+}
+
+static struct posix_clock perf_posix_clock = {
+	.ops = (struct posix_clock_operations) {
+		.clock_getres = perf_posix_clock_getres,
+		.clock_gettime = perf_posix_clock_gettime,
+	},
+};
+
 void perf_pmu_disable(struct pmu *pmu)
 {
 	int *count = this_cpu_ptr(pmu->pmu_disable_count);
@@ -2992,6 +3012,7 @@ static void put_event(struct perf_event *event)
 
 static int perf_release(struct inode *inode, struct file *file)
 {
+	posix_clock_detach(file);
 	put_event(file->private_data);
 	return 0;
 }
@@ -6671,6 +6692,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	 * perf_group_detach().
 	 */
 	fdput(group);
+	posix_clock_attach(&perf_posix_clock, event_file);
 	fd_install(event_fd, event_file);
 	return event_fd;
 
@@ -7416,6 +7438,8 @@ void __init perf_event_init(void)
 	 */
 	BUILD_BUG_ON((offsetof(struct perf_event_mmap_page, data_head))
 		     != 1024);
+
+	posix_clock_init(&perf_posix_clock);
 }
 
 static int __init perf_event_sysfs_init(void)
diff --git a/kernel/time/posix-clock.c b/kernel/time/posix-clock.c
index ce033c7..525fa44 100644
--- a/kernel/time/posix-clock.c
+++ b/kernel/time/posix-clock.c
@@ -25,14 +25,44 @@
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 
-static void delete_clock(struct kref *kref);
+void posix_clock_init(struct posix_clock *clk)
+{
+	kref_init(&clk->kref);
+	init_rwsem(&clk->rwsem);
+}
+EXPORT_SYMBOL_GPL(posix_clock_init);
+
+void posix_clock_attach(struct posix_clock *clk, struct file *fp)
+{
+	kref_get(&clk->kref);
+	fp->posix_clock = clk;
+}
+EXPORT_SYMBOL_GPL(posix_clock_attach);
+
+static void delete_clock(struct kref *kref)
+{
+	struct posix_clock *clk = container_of(kref, struct posix_clock, kref);
+
+	if (clk->release)
+		clk->release(clk);
+}
+
+void posix_clock_detach(struct file *fp)
+{
+	kref_put(&fp->posix_clock->kref, delete_clock);
+	fp->posix_clock = NULL;
+}
+EXPORT_SYMBOL_GPL(posix_clock_detach);
 
 /*
  * Returns NULL if the posix_clock instance attached to 'fp' is old and stale.
  */
 static struct posix_clock *get_posix_clock(struct file *fp)
 {
-	struct posix_clock *clk = fp->private_data;
+	struct posix_clock *clk = fp->posix_clock;
+
+	if (!clk)
+		return NULL;
 
 	down_read(&clk->rwsem);
 
@@ -167,10 +197,8 @@ static int posix_clock_open(struct inode *inode, struct file *fp)
 	else
 		err = 0;
 
-	if (!err) {
-		kref_get(&clk->kref);
-		fp->private_data = clk;
-	}
+	if (!err)
+		posix_clock_attach(clk, fp);
 out:
 	up_read(&clk->rwsem);
 	return err;
@@ -178,15 +206,13 @@ out:
 
 static int posix_clock_release(struct inode *inode, struct file *fp)
 {
-	struct posix_clock *clk = fp->private_data;
+	struct posix_clock *clk = fp->posix_clock;
 	int err = 0;
 
 	if (clk->ops.release)
 		err = clk->ops.release(clk);
 
-	kref_put(&clk->kref, delete_clock);
-
-	fp->private_data = NULL;
+	posix_clock_detach(fp);
 
 	return err;
 }
@@ -210,8 +236,7 @@ int posix_clock_register(struct posix_clock *clk, dev_t devid)
 {
 	int err;
 
-	kref_init(&clk->kref);
-	init_rwsem(&clk->rwsem);
+	posix_clock_init(clk);
 
 	cdev_init(&clk->cdev, &posix_clock_file_operations);
 	clk->cdev.owner = clk->ops.owner;
@@ -221,14 +246,6 @@ int posix_clock_register(struct posix_clock *clk, dev_t devid)
 }
 EXPORT_SYMBOL_GPL(posix_clock_register);
 
-static void delete_clock(struct kref *kref)
-{
-	struct posix_clock *clk = container_of(kref, struct posix_clock, kref);
-
-	if (clk->release)
-		clk->release(clk);
-}
-
 void posix_clock_unregister(struct posix_clock *clk)
 {
 	cdev_del(&clk->cdev);
@@ -249,22 +266,19 @@ struct posix_clock_desc {
 static int get_clock_desc(const clockid_t id, struct posix_clock_desc *cd)
 {
 	struct file *fp = fget(CLOCKID_TO_FD(id));
-	int err = -EINVAL;
 
 	if (!fp)
-		return err;
-
-	if (fp->f_op->open != posix_clock_open || !fp->private_data)
-		goto out;
+		return -EINVAL;
 
 	cd->fp = fp;
 	cd->clk = get_posix_clock(fp);
 
-	err = cd->clk ? 0 : -ENODEV;
-out:
-	if (err)
+	if (!cd->clk) {
 		fput(fp);
-	return err;
+		return -ENODEV;
+	}
+
+	return 0;
 }
 
 static void put_clock_desc(struct posix_clock_desc *cd)





^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-05 18:16                                         ` Pawel Moll
@ 2013-04-06 11:05                                           ` Richard Cochran
  2013-04-08 17:58                                             ` Pawel Moll
  0 siblings, 1 reply; 65+ messages in thread
From: Richard Cochran @ 2013-04-06 11:05 UTC (permalink / raw)
  To: Pawel Moll
  Cc: John Stultz, Peter Zijlstra, David Ahern, Stephane Eranian,
	Thomas Gleixner, LKML, mingo, Paul Mackerras, Anton Blanchard,
	Will Deacon, ak, Pekka Enberg, Steven Rostedt, Robert Richter

On Fri, Apr 05, 2013 at 07:16:53PM +0100, Pawel Moll wrote:
 
> Ok, so how about the code below? Disclaimer: this is just a proposal.
> I'm not sure how welcomed would be an extra field in struct file, but
> this makes the clocks ultimately flexible - one can "attach" the clock
> to any arbitrary struct file. Alternatively we could mark a "clocked"
> file with a special flag in f_mode and have some kind of lookup.

Only a tiny minority of file instances will want to be clocks.
Therefore I think adding the extra field will be a hard sell.

The flag idea sounds harmless, but how do you perform the lookup?
 
> Also, I can't stop thinking that the posix-clock.c shouldn't actually do
> anything about the character device... The PTP core (as the model of
> using character device seems to me just one of possible choices) could
> do this on its own and have simple open/release attaching/detaching the
> clock. This would remove a lot of "generic dev" code in the
> posix-clock.c and all the optional cdev methods in struct posix_clock.
> It's just a thought, though...

Right, the chardev could be pushed into the PHC layer. The original
idea of chardev clocks did have precedents, though, like hpet and rtc.
 
> And a couple of questions to Richard... Isn't the kref_put() in
> posix_clock_unregister() a bug? I'm not 100% but it looks like a simple
> register->unregister sequence was making the ref count == -1, so the
> delete_clock() won't be called.

Well,

	posix_clock_register() -> kref_init() ->
		atomic_set(&kref->refcount, 1);

So refcount is now 1 ...

	posix_clock_unregister() -> kref_put() -> kref_sub(count=1) ->
		atomic_sub_and_test((int) count, &kref->refcount)

and refcount is now 0. Can't see how you would get -1 here.

> And was there any particular reason that the ops in struct
> posix_clock are *not* a pointer?

One less run time indirection maybe? I don't really remember why or
how we arrived at this. The whole PHC review took a year, with
something like fifteen revisions.

Thanks,
Richard


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-06 11:05                                           ` Richard Cochran
@ 2013-04-08 17:58                                             ` Pawel Moll
  2013-04-08 19:05                                               ` John Stultz
  0 siblings, 1 reply; 65+ messages in thread
From: Pawel Moll @ 2013-04-08 17:58 UTC (permalink / raw)
  To: Richard Cochran, Peter Zijlstra, John Stultz, Stephane Eranian
  Cc: David Ahern, Thomas Gleixner, LKML, mingo, Paul Mackerras,
	Anton Blanchard, Will Deacon, ak, Pekka Enberg, Steven Rostedt,
	Robert Richter

On Sat, 2013-04-06 at 12:05 +0100, Richard Cochran wrote:
> On Fri, Apr 05, 2013 at 07:16:53PM +0100, Pawel Moll wrote:
> > Ok, so how about the code below? Disclaimer: this is just a proposal.
> > I'm not sure how welcomed would be an extra field in struct file, but
> > this makes the clocks ultimately flexible - one can "attach" the clock
> > to any arbitrary struct file. Alternatively we could mark a "clocked"
> > file with a special flag in f_mode and have some kind of lookup.
> 
> Only a tiny minority of file instances will want to be clocks.
> Therefore I think adding the extra field will be a hard sell.
> 
> The flag idea sounds harmless, but how do you perform the lookup?

Hash table. I'll get some code typed and post it tomorrow.
 
> > Also, I can't stop thinking that the posix-clock.c shouldn't actually do
> > anything about the character device... The PTP core (as the model of
> > using character device seems to me just one of possible choices) could
> > do this on its own and have simple open/release attaching/detaching the
> > clock. This would remove a lot of "generic dev" code in the
> > posix-clock.c and all the optional cdev methods in struct posix_clock.
> > It's just a thought, though...
> 
> Right, the chardev could be pushed into the PHC layer. The original
> idea of chardev clocks did have precedents, though, like hpet and rtc.

I'm not arguing about the use of cdev for PTP clocks, it's perfectly
fine with me. I'm just not convinced that the "more generic" clock layer
should enforce cdevs and nothing more. But I think we're more-or-less
talking the same language here, so I'll simply create a patch and send
it as RFC.

> > And a couple of questions to Richard... Isn't the kref_put() in
> > posix_clock_unregister() a bug? I'm not 100% but it looks like a simple
> > register->unregister sequence was making the ref count == -1, so the
> > delete_clock() won't be called.
> 
> Well,
> 
> 	posix_clock_register() -> kref_init() ->
> 		atomic_set(&kref->refcount, 1);
> 
> So refcount is now 1 ...
> 
> 	posix_clock_unregister() -> kref_put() -> kref_sub(count=1) ->
> 		atomic_sub_and_test((int) count, &kref->refcount)
> 
> and refcount is now 0. Can't see how you would get -1 here.

Eh. For some reason I was convinced that kref_init() sets the counter to
0 not 1. My bad.

> > And was there any particular reason that the ops in struct
> > posix_clock are *not* a pointer?
> 
> One less run time indirection maybe? I don't really remember why or
> how we arrived at this. The whole PHC review took a year, with
> something like fifteen revisions.

Ok. As most of the *_ops seem to be referenced via pointers (including
file ops, which are pretty heavily used ;-) and this makes it much
easier to define static clocks, I'll propose a change in a separate
patch.

Now, before I spend time doing all this, a question to John, Peter,
Stephane and the rest of the public - would a solution providing such
a userspace interface:

	fd = sys_perf_event_open(&attr, -1, 0, -1, 0)
	clock_gettime(FD_TO_CLOCKID(fd), &ts)

be acceptable to all?

Paweł



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-08 17:58                                             ` Pawel Moll
@ 2013-04-08 19:05                                               ` John Stultz
  2013-04-09  5:02                                                 ` Richard Cochran
  0 siblings, 1 reply; 65+ messages in thread
From: John Stultz @ 2013-04-08 19:05 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Peter Zijlstra, Stephane Eranian, David Ahern,
	Thomas Gleixner, LKML, mingo, Paul Mackerras, Anton Blanchard,
	Will Deacon, ak, Pekka Enberg, Steven Rostedt, Robert Richter

On 04/08/2013 10:58 AM, Pawel Moll wrote:
> Now, before I spend time doing all this, a question to John, Peter,
> Stephane and the rest of the public - would a solution providing such
> userspace interface:
>
> 	fd = sys_perf_event_open(&attr, -1, 0, -1, 0)
> 	clock_gettime(FD_TO_CLOCKID(fd), &ts)
>
> be acceptable to all?

So thinking this through further, I'm worried we may _not_ be able to 
eventually implement this in the vDSO as I had earlier hoped. Mostly 
because I'm not sure how the fd -> file -> clock lookup could be done in 
userland (any ideas?).

So this makes this approach mostly equivalent long term to the ioctl 
method, from a performance perspective. And makes the dynamic posix 
clockid somewhat less of a middle-ground compromise between the ioctl 
and generic constant clockid approach.

So while I'm not opposed to the sort of extension proposed above, I want 
to make sure introducing the new approach is worth the effort when 
compared with just adding an ioctl.

thanks
-john



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-04-08 19:05                                               ` John Stultz
@ 2013-04-09  5:02                                                 ` Richard Cochran
  0 siblings, 0 replies; 65+ messages in thread
From: Richard Cochran @ 2013-04-09  5:02 UTC (permalink / raw)
  To: John Stultz
  Cc: Pawel Moll, Peter Zijlstra, Stephane Eranian, David Ahern,
	Thomas Gleixner, LKML, mingo, Paul Mackerras, Anton Blanchard,
	Will Deacon, ak, Pekka Enberg, Steven Rostedt, Robert Richter

On Mon, Apr 08, 2013 at 12:05:52PM -0700, John Stultz wrote:
> 
> So thinking this through further, I'm worried we may _not_ be able
> to eventually enable this to be a vdso as I had earlier hoped.
> Mostly because I'm not sure how the fd -> file -> clock lookup could
> be done in userland (any ideas?).

How about a new clock operation, clock_install_vdso(), that lets the
process arrange for one dynamic clock to be reflected in its vdso
page?

Thanks,
Richard

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-02-01 14:18   ` Pawel Moll
  2013-02-05 21:18     ` David Ahern
  2013-02-05 22:13     ` Stephane Eranian
@ 2013-06-26 16:49     ` David Ahern
  2013-07-15 10:44       ` Pawel Moll
  2 siblings, 1 reply; 65+ messages in thread
From: David Ahern @ 2013-06-26 16:49 UTC (permalink / raw)
  To: Pawel Moll, Peter Zijlstra, Stephane Eranian, Ingo Molnar
  Cc: LKML, Paul Mackerras, Anton Blanchard, Will Deacon, ak,
	Pekka Enberg, Steven Rostedt, Robert Richter, tglx, John Stultz

With all the perf ioctl extensions tossed out the past day or so I 
wanted to revive this request. Still need a solution to the problem of 
correlating perf_clock to other clocks ...

On 2/1/13 7:18 AM, Pawel Moll wrote:
> Hello,
>
> I'd like to revive the topic...
>
> On Tue, 2012-10-16 at 18:23 +0100, Peter Zijlstra wrote:
>> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>>> Hi,
>>>
>>> There are many situations where we want to correlate events happening at
>>> the user level with samples recorded in the perf_event kernel sampling buffer.
>>> For instance, we might want to correlate the call to a function or creation of
>>> a file with samples. Similarly, when we want to monitor a JVM with jitted code,
>>> we need to be able to correlate jitted code mappings with perf event samples
>>> for symbolization.
>>>
>>> Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
>>> That causes each PERF_RECORD_SAMPLE to include a timestamp
>>> generated by calling the local_clock() -> sched_clock_cpu() function.
>>>
>>> To make correlating user vs. kernel samples easy, we would need to
>>> access that sched_clock() functionality. However, none of the existing
>>> clock calls permit this at this point. They all return timestamps which are
>>> not using the same source and/or offset as sched_clock.
>>>
>>> I believe a similar issue exists with the ftrace subsystem.
>>>
> >>> The problem needs to be addressed in a portable manner. Solutions
> >>> based on reading the TSC at the user level to reconstruct sched_clock()
> >>> don't seem appropriate to me.
>>>
>>> One possibility to address this limitation would be to extend clock_gettime()
>>> with a new clock time, e.g., CLOCK_PERF.
>>>
> >>> However, I understand that sched_clock_cpu() provides ordering guarantees only
> >>> when invoked on the same CPU repeatedly, i.e., it's not globally synchronized.
> >>> But we already have to deal with this problem when merging samples obtained
> >>> from different CPU sampling buffers in per-thread mode. So this is not
> >>> necessarily a showstopper.
>>>
> >>> Alternatives could be to use uprobes, but that's less practical to set up.
>>>
>>> Anyone with better ideas?
>>
>> You forgot to CC the time people ;-)
>>
>> I've no problem with adding CLOCK_PERF (or another/better name).
>>
>> Thomas, John?
>
> I've just faced the same issue - correlating an event in userspace with
> data from the perf stream, but to my mind what I want to get is a value
> returned by perf_clock() _in the current "session" context_.
>
> Stephane didn't like the idea of opening a "fake" perf descriptor in
> order to get the timestamp, but surely one must have the "session"
> already running to be interested in such data in the first place? So I
> think the ioctl() idea is not out of place here... How about the simple
> change below?
>
> Regards
>
> Pawel
>
> 8<---
>  From 2ad51a27fbf64bf98cee190efc3fbd7002819692 Mon Sep 17 00:00:00 2001
> From: Pawel Moll <pawel.moll@arm.com>
> Date: Fri, 1 Feb 2013 14:03:56 +0000
> Subject: [PATCH] perf: Add ioctl to return current time value
>
> To correlate user-space events with the perf events stream,
> a current (as in: "what time(stamp) is it now?") time value
> must be made available.
>
> This patch adds a perf ioctl that makes this possible.
>
> Signed-off-by: Pawel Moll <pawel.moll@arm.com>
> ---
>   include/uapi/linux/perf_event.h |    1 +
>   kernel/events/core.c            |    8 ++++++++
>   2 files changed, 9 insertions(+)
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 4f63c05..b745fb0 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -316,6 +316,7 @@ struct perf_event_attr {
>   #define PERF_EVENT_IOC_PERIOD		_IOW('$', 4, __u64)
>   #define PERF_EVENT_IOC_SET_OUTPUT	_IO ('$', 5)
>   #define PERF_EVENT_IOC_SET_FILTER	_IOW('$', 6, char *)
> +#define PERF_EVENT_IOC_GET_TIME		_IOR('$', 7, __u64)
>
>   enum perf_event_ioc_flags {
>   	PERF_IOC_FLAG_GROUP		= 1U << 0,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 301079d..4202b1c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3298,6 +3298,14 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>   	case PERF_EVENT_IOC_SET_FILTER:
>   		return perf_event_set_filter(event, (void __user *)arg);
>
> +	case PERF_EVENT_IOC_GET_TIME:
> +	{
> +		u64 time = perf_clock();
> +		if (copy_to_user((void __user *)arg, &time, sizeof(time)))
> +			return -EFAULT;
> +		return 0;
> +	}
> +
>   	default:
>   		return -ENOTTY;
>   	}
>



* Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
  2013-06-26 16:49     ` David Ahern
@ 2013-07-15 10:44       ` Pawel Moll
  0 siblings, 0 replies; 65+ messages in thread
From: Pawel Moll @ 2013-07-15 10:44 UTC (permalink / raw)
  To: David Ahern
  Cc: Peter Zijlstra, Stephane Eranian, Ingo Molnar, LKML,
	Paul Mackerras, Anton Blanchard, Will Deacon, ak, Pekka Enberg,
	Steven Rostedt, Robert Richter, tglx, John Stultz

On Wed, 2013-06-26 at 17:49 +0100, David Ahern wrote:
> With all the perf ioctl extensions tossed out over the past day or so, I
> wanted to revive this request. We still need a solution to the problem of
> correlating perf_clock with other clocks ...

And I second that. We've been trying to squeeze the solution into the posix
clock framework (and vdso), but it didn't get anywhere, really. I spoke to
John last week and although there is one more potential "solution" (and the
quotes are meaningful ;-), it seems that the perf-specific ioctl would work
for all interested individuals for now.

Paweł




end of thread, other threads:[~2013-07-15 10:45 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-16 10:13 [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples Stephane Eranian
2012-10-16 17:23 ` Peter Zijlstra
2012-10-18 19:33   ` Stephane Eranian
2012-11-10  2:04   ` John Stultz
2012-11-11 20:32     ` Stephane Eranian
2012-11-12 18:53       ` John Stultz
2012-11-12 20:54         ` Stephane Eranian
2012-11-12 22:39           ` John Stultz
2012-11-13 20:58     ` Steven Rostedt
2012-11-14 22:26       ` John Stultz
2012-11-14 23:30         ` Steven Rostedt
2013-02-01 14:18   ` Pawel Moll
2013-02-05 21:18     ` David Ahern
2013-02-05 22:13     ` Stephane Eranian
2013-02-05 22:28       ` John Stultz
2013-02-06  1:19         ` Steven Rostedt
2013-02-06 18:17           ` Pawel Moll
2013-02-13 20:00             ` Stephane Eranian
2013-02-14 10:33               ` Pawel Moll
2013-02-18 15:16                 ` Stephane Eranian
2013-02-18 18:59                   ` David Ahern
2013-02-18 20:35         ` Thomas Gleixner
2013-02-19 18:25           ` John Stultz
2013-02-19 19:55             ` Thomas Gleixner
2013-02-19 20:15               ` Thomas Gleixner
2013-02-19 20:35                 ` John Stultz
2013-02-19 21:50                   ` Thomas Gleixner
2013-02-19 22:20                     ` John Stultz
2013-02-20 10:06                       ` Thomas Gleixner
2013-02-20 10:29             ` Peter Zijlstra
2013-02-23  6:04               ` John Stultz
2013-02-25 14:10                 ` Peter Zijlstra
2013-03-14 15:34                   ` Stephane Eranian
2013-03-14 19:57                     ` Pawel Moll
2013-03-31 16:23                       ` David Ahern
2013-04-01 18:29                         ` John Stultz
2013-04-01 22:29                           ` David Ahern
2013-04-01 23:12                             ` John Stultz
2013-04-03  9:17                             ` Stephane Eranian
2013-04-03 13:55                               ` David Ahern
2013-04-03 14:00                                 ` Stephane Eranian
2013-04-03 14:14                                   ` David Ahern
2013-04-03 14:22                                     ` Stephane Eranian
2013-04-03 17:57                                       ` John Stultz
2013-04-04  8:12                                         ` Stephane Eranian
2013-04-04 22:26                                           ` John Stultz
2013-04-02  7:54                           ` Peter Zijlstra
2013-04-02 16:05                             ` Pawel Moll
2013-04-02 16:19                             ` John Stultz
2013-04-02 16:34                               ` Pawel Moll
2013-04-03 17:19                               ` Pawel Moll
2013-04-03 17:29                                 ` John Stultz
2013-04-03 17:35                                   ` Pawel Moll
2013-04-03 17:50                                     ` John Stultz
2013-04-04  7:37                                       ` Richard Cochran
2013-04-04 16:33                                         ` Pawel Moll
2013-04-04 16:29                                       ` Pawel Moll
2013-04-05 18:16                                         ` Pawel Moll
2013-04-06 11:05                                           ` Richard Cochran
2013-04-08 17:58                                             ` Pawel Moll
2013-04-08 19:05                                               ` John Stultz
2013-04-09  5:02                                                 ` Richard Cochran
2013-02-06 18:17       ` Pawel Moll
2013-06-26 16:49     ` David Ahern
2013-07-15 10:44       ` Pawel Moll
