From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752258Ab2KJCFM (ORCPT <rfc822;w@1wt.eu>);
	Fri, 9 Nov 2012 21:05:12 -0500
Received: from e5.ny.us.ibm.com ([32.97.182.145]:49915 "EHLO e5.ny.us.ibm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751166Ab2KJCFH (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Fri, 9 Nov 2012 21:05:07 -0500
Message-ID: <509DB632.7070305@linaro.org>
Date: Fri, 09 Nov 2012 18:04:34 -0800
From: John Stultz <john.stultz@linaro.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:16.0) Gecko/20121011 Thunderbird/16.0.1
MIME-Version: 1.0
To: Peter Zijlstra <peterz@infradead.org>
CC: Stephane Eranian <eranian@google.com>, LKML <linux-kernel@vger.kernel.org>,
        "mingo@elte.hu" <mingo@elte.hu>, Paul Mackerras <paulus@samba.org>,
        Anton Blanchard <anton@samba.org>, Will Deacon <will.deacon@arm.com>,
        "ak@linux.intel.com" <ak@linux.intel.com>,
        Pekka Enberg <penberg@gmail.com>, Steven Rostedt <rostedt@goodmis.org>,
        Robert Richter <robert.richter@amd.com>, tglx <tglx@linutronix.de>
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples
 with kernel samples
References: <CABPqkBQALeD=iO9x-N0nw+shhqa1kmUaj=sCvx+MvoAPGQ-y9A@mail.gmail.com> <1350408232.2336.42.camel@laptop>
In-Reply-To: <1350408232.2336.42.camel@laptop>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 12111002-5930-0000-0000-00000E00BB06
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 10/16/2012 10:23 AM, Peter Zijlstra wrote:
> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>> Hi,
>>
>> There are many situations where we want to correlate events happening at
>> the user level with samples recorded in the perf_event kernel sampling buffer.
>> For instance, we might want to correlate the call to a function or creation of
>> a file with samples. Similarly, when we want to monitor a JVM with jitted code,
>> we need to be able to correlate jitted code mappings with perf event samples
>> for symbolization.
>>
>> Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
>> That causes each PERF_RECORD_SAMPLE to include a timestamp
>> generated by calling the local_clock() -> sched_clock_cpu() function.
>>
>> To make correlating user vs. kernel samples easy, we would need to
>> access that sched_clock() functionality. However, none of the existing
>> clock calls permit this at this point. They all return timestamps which are
>> not using the same source and/or offset as sched_clock.
>>
>> I believe a similar issue exists with the ftrace subsystem.
>>
>> The problem needs to be addressed in a portable manner. Solutions
>> based on reading TSC for the user level to reconstruct sched_clock()
>> don't seem appropriate to me.
>>
>> One possibility to address this limitation would be to extend clock_gettime()
>> with a new clock time, e.g., CLOCK_PERF.
>>
>> However, I understand that sched_clock_cpu() provides ordering guarantees only
>> when invoked on the same CPU repeatedly, i.e., it's not globally synchronized.
>> But we already have to deal with this problem when merging samples obtained
>> from different CPU sampling buffers in per-thread mode. So this is not
>> necessarily a showstopper.
>>
>> Alternatives could be to use uprobes but that's less practical to set up.
>>
>> Anyone with better ideas?
> You forgot to CC the time people ;-)
>
> I've no problem with adding CLOCK_PERF (or another/better name).
Hrm. I'm not excited about exporting that sort of internal kernel 
detail to userland.

The behavior of and expectations for sched_clock() have changed over the 
years, so I'm not sure it's wise to export it, since we'd have to 
preserve its behavior from then on.

Also I worry that it will be abused in the same way that direct TSC 
access is, where the seemingly better performance compared to the more 
careful/correct CLOCK_MONOTONIC would cause developers to write fragile 
userland code that will break when moved from one machine to the next.

I'd probably rather have perf output timestamps to userland using sane 
clocks (CLOCK_MONOTONIC) than try to introduce a new time domain to 
userland.   But I probably could be convinced I'm wrong.
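
As a rough illustration of what the userland side could look like, here 
is a minimal sketch. It assumes perf would report sample times in the 
CLOCK_MONOTONIC domain, which it does not do today. The names 
monotonic_ns(), struct user_event and stamp_event() are made up for 
this example and are not an existing API.

```c
#define _POSIX_C_SOURCE 200809L

#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper: current CLOCK_MONOTONIC time in nanoseconds.
 * Returns 0 if clock_gettime() fails.
 */
static uint64_t monotonic_ns(void)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
		return 0;

	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Hypothetical record for a user-level event (e.g. a JIT code mapping). */
struct user_event {
	uint64_t ts_ns;		/* CLOCK_MONOTONIC timestamp, in ns */
	const char *label;	/* free-form description of the event */
};

/* Stamp a user-level event with the current CLOCK_MONOTONIC time. */
static struct user_event stamp_event(const char *label)
{
	struct user_event ev;

	ev.ts_ns = monotonic_ns();
	ev.label = label;
	return ev;
}
```

An application would call stamp_event() at each point of interest and 
log the result. If the kernel samples carried timestamps from the same 
clock, a tool could merge the two streams by sorting on ts_ns, with no 
need for userland to know anything about sched_clock().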

thanks
-john