From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753456Ab2KLUzD (ORCPT );
	Mon, 12 Nov 2012 15:55:03 -0500
Received: from mail-lb0-f174.google.com ([209.85.217.174]:43113 "EHLO
	mail-lb0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753392Ab2KLUzA (ORCPT );
	Mon, 12 Nov 2012 15:55:00 -0500
MIME-Version: 1.0
In-Reply-To: <50A145A5.7060402@linaro.org>
References: <1350408232.2336.42.camel@laptop> <509DB632.7070305@linaro.org>
	<50A145A5.7060402@linaro.org>
Date: Mon, 12 Nov 2012 21:54:58 +0100
Message-ID:
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples
	with kernel samples
From: Stephane Eranian
To: John Stultz
Cc: Peter Zijlstra, LKML, "mingo@elte.hu", Paul Mackerras,
	Anton Blanchard, Will Deacon, "ak@linux.intel.com", Pekka Enberg,
	Steven Rostedt, Robert Richter, tglx
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Nov 12, 2012 at 7:53 PM, John Stultz wrote:
> On 11/11/2012 12:32 PM, Stephane Eranian wrote:
>>
>> On Sat, Nov 10, 2012 at 3:04 AM, John Stultz wrote:
>>>
>>> On 10/16/2012 10:23 AM, Peter Zijlstra wrote:
>>>>
>>>> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> There are many situations where we want to correlate events
>>>>> happening at the user level with samples recorded in the perf_event
>>>>> kernel sampling buffer. For instance, we might want to correlate
>>>>> the call to a function, or the creation of a file, with samples.
>>>>> Similarly, when we want to monitor a JVM with jitted code, we need
>>>>> to be able to correlate jitted code mappings with perf_event
>>>>> samples for symbolization.
>>>>>
>>>>> Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
>>>>> That causes each PERF_RECORD_SAMPLE to include a timestamp
>>>>> generated by calling the local_clock() -> sched_clock_cpu()
>>>>> function.
>>>>>
>>>>> To make correlating user vs. kernel samples easy, we would need
>>>>> access to that sched_clock() functionality. However, none of the
>>>>> existing clock calls permit this at this point. They all return
>>>>> timestamps which do not use the same source and/or offset as
>>>>> sched_clock.
>>>>>
>>>>> I believe a similar issue exists with the ftrace subsystem.
>>>>>
>>>>> The problem needs to be addressed in a portable manner. Solutions
>>>>> based on reading the TSC at the user level to reconstruct
>>>>> sched_clock() don't seem appropriate to me.
>>>>>
>>>>> One possibility to address this limitation would be to extend
>>>>> clock_gettime() with a new clock id, e.g., CLOCK_PERF.
>>>>>
>>>>> However, I understand that sched_clock_cpu() provides ordering
>>>>> guarantees only when invoked repeatedly on the same CPU, i.e., it
>>>>> is not globally synchronized. But we already have to deal with
>>>>> this problem when merging samples obtained from different CPU
>>>>> sampling buffers in per-thread mode. So this is not necessarily a
>>>>> showstopper.
>>>>>
>>>>> Alternatives could be to use uprobes, but that's less practical to
>>>>> set up.
>>>>>
>>>>> Anyone with better ideas?
>>>>
>>>> You forgot to CC the time people ;-)
>>>>
>>>> I've no problem with adding CLOCK_PERF (or another/better name).
>>>
>>> Hrm. I'm not excited about exporting that sort of internal kernel
>>> detail to userland.
>>>
>>> The behavior and expectations of sched_clock() have changed over the
>>> years, so I'm not sure it's wise to export it, since we'd have to
>>> preserve its behavior from then on.
>>>
>> It's not about just exposing sched_clock(). We need to expose a time
>> source that is exactly equivalent to what perf_event uses internally.
>> If sched_clock() changes, then the perf_event clock will change too,
>> and so would that new time source for clock_gettime(). As long as
>> everything remains consistent, we are good.
>
> Sure, but I'm just hesitant to expose that sort of internal detail. If
> we change it later, it's not just perf_events, but any other
> applications that have come to depend on the particular behavior we
> expose. We can claim "that was never promised", but it still leads to
> a bad situation.
>
>>> Also I worry that it will be abused in the same way that direct TSC
>>> access is, where the seemingly better performance over the more
>>> careful/correct CLOCK_MONOTONIC would cause developers to write
>>> fragile userland code that will break when moved from one machine to
>>> the next.
>>>
>> The only goal for this new time source is correlating user-level
>> samples with kernel-level samples, i.e., application-level events
>> with a PMU counter overflow, for instance. Anybody trying anything
>> else would be on their own.
>>
>> clock_gettime(CLOCK_PERF): guaranteed to return the same time source
>> as that used by the perf_event subsystem to timestamp samples when
>> PERF_SAMPLE_TIME is requested in attr->sample_type.
>
> I'm not familiar enough with perf's interfaces, but if you are going
> to make this clockid bound so tightly with perf, could you maybe
> export a perf timestamp from one of perf's interfaces rather than
> using the more generic clock_gettime() interface?
>
Yeah, I considered that as well. But it is more complicated. The only
syscall we could extend for perf_events is ioctl(). But that one
requires that an event be created so we obtain a file descriptor for
the ioctl() call. So we'd have to pretend to program a dummy event just
for the purpose of obtaining a timestamp. We could do that, but it's
not so nice, though it would be more amenable to the self-monitoring
case.
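
To make that concrete, here is roughly what a self-monitoring program
would have to do under that scheme. This is only a sketch:
PERF_EVENT_IOC_GET_TIME is a made-up name for the ioctl we would have
to add (it does not exist today); only perf_event_open() and the dummy
event setup are real.

#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* perf_event_open() has no glibc wrapper. */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical ioctl returning the perf_event timestamp; the request
 * number is made up for illustration. */
#define PERF_EVENT_IOC_GET_TIME _IOR('$', 32, __u64)

static __u64 perf_get_time(void)
{
    struct perf_event_attr attr;
    __u64 now = 0;
    int fd;

    /* Program a dummy, never-enabled event: we only want the fd. */
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CPU_CLOCK;
    attr.disabled = 1;

    fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0)
        return 0;

    ioctl(fd, PERF_EVENT_IOC_GET_TIME, &now); /* hypothetical */
    close(fd);
    return now;
}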

Keep in mind that the clock_gettime() approach would be used by
programs which are not self-monitoring but may be monitored externally
by a tool such as perf. We just need them to emit their events with a
timestamp that can be correlated offline with those of perf_events.

>>> I'd probably rather perf output timestamps to userland using sane
>>> clocks (CLOCK_MONOTONIC), rather than trying to introduce a new time
>>> domain to userland. But I probably could be convinced I'm wrong.
>>>
>> Can you get CLOCK_MONOTONIC efficiently and in ALL circumstances,
>> without grabbing any locks? Because it would need to run from NMI
>> context.
>
> No, and that is of course why we have sched_clock. But I'm suggesting
> we consider changing what perf exports (maybe via
> interpolation/translation) to be CLOCK_MONOTONIC-ish.
>
Explain to me the key difference between CLOCK_MONOTONIC and what
sched_clock() is returning today. Does this have to do with global
monotonicity vs. cpu-wide monotonicity?

> I'm not strongly objecting here, I just want to make sure other
> alternatives are explored before we start giving applications another
> internal-kernel-behavior-dependent interface to hang themselves
> with. :)
>
> thanks
> -john
>
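
P.S.: for clarity, here is the kind of application-side usage I have in
mind for the new clockid. This is hypothetical: CLOCK_PERF does not
exist in any kernel yet, and the clockid value below is made up purely
for illustration.

#include <stdio.h>
#include <time.h>

#ifndef CLOCK_PERF
#define CLOCK_PERF 16 /* hypothetical clockid, not a real constant */
#endif

int main(void)
{
    struct timespec ts;

    /* An application monitored externally (e.g. by perf) stamps its
     * own events with the same time source perf uses for
     * PERF_SAMPLE_TIME, so both streams can be merged offline. */
    if (clock_gettime(CLOCK_PERF, &ts) == -1) {
        perror("clock_gettime(CLOCK_PERF)");
        return 1;
    }
    printf("app event at %lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
    return 0;
}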