Message-ID: <50A145A5.7060402@linaro.org>
Date: Mon, 12 Nov 2012 10:53:25 -0800
From: John Stultz
To: Stephane Eranian
CC: Peter Zijlstra, LKML, "mingo@elte.hu", Paul Mackerras, Anton Blanchard,
    Will Deacon, "ak@linux.intel.com", Pekka Enberg, Steven Rostedt,
    Robert Richter, tglx
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples
References: <1350408232.2336.42.camel@laptop> <509DB632.7070305@linaro.org>

On 11/11/2012 12:32 PM, Stephane Eranian wrote:
> On Sat, Nov 10, 2012 at 3:04 AM, John Stultz wrote:
>> On 10/16/2012 10:23 AM, Peter Zijlstra wrote:
>>> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>>>> Hi,
>>>>
>>>> There are many situations where we want to correlate events happening at
>>>> the user level with samples recorded in the perf_event kernel sampling
>>>> buffer. For instance, we might want to correlate the call to a function
>>>> or creation of a file with samples. Similarly, when we want to monitor a
>>>> JVM with jitted code, we need to be able to correlate jitted code
>>>> mappings with perf event samples for symbolization.
>>>>
>>>> Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
>>>> That causes each PERF_RECORD_SAMPLE to include a timestamp
>>>> generated by calling the local_clock() -> sched_clock_cpu() function.
>>>>
>>>> To make correlating user vs. kernel samples easy, we would need to
>>>> access that sched_clock() functionality. However, none of the existing
>>>> clock calls permit this at this point. They all return timestamps which
>>>> are not using the same source and/or offset as sched_clock.
>>>>
>>>> I believe a similar issue exists with the ftrace subsystem.
>>>>
>>>> The problem needs to be addressed in a portable manner. Solutions
>>>> based on reading TSC for the user level to reconstruct sched_clock()
>>>> don't seem appropriate to me.
>>>>
>>>> One possibility to address this limitation would be to extend
>>>> clock_gettime() with a new clock time, e.g., CLOCK_PERF.
>>>>
>>>> However, I understand that sched_clock_cpu() provides ordering
>>>> guarantees only when invoked on the same CPU repeatedly, i.e., it's not
>>>> globally synchronized. But we already have to deal with this problem
>>>> when merging samples obtained from different CPU sampling buffers in
>>>> per-thread mode. So this is not necessarily a showstopper.
>>>>
>>>> Alternatives could be to use uprobes but that's less practical to set up.
>>>>
>>>> Anyone with better ideas?
>>> You forgot to CC the time people ;-)
>>>
>>> I've no problem with adding CLOCK_PERF (or another/better name).
>> Hrm. I'm not excited about exporting that sort of internal kernel detail
>> to userland.
>>
>> The behavior and expectations of sched_clock() have changed over the
>> years, so I'm not sure it's wise to export it, since we'd have to
>> preserve its behavior from then on.
>>
> It's not about just exposing sched_clock(). We need to expose a time source
> that is exactly equivalent to what perf_event uses internally. If
> sched_clock() changes, then perf_event clock will change too and so would
> that new time source for clock_gettime().
> As long as everything remains consistent, we are good.

Sure, but I'm just hesitant to expose that sort of internal detail. If
we change it later, it's not just perf_events, but any other applications
that have come to depend on the particular behavior we expose. We can
claim "that was never promised" but it still leads to a bad situation.

>> Also I worry that it will be abused in the same way that direct TSC access
>> is, where the seemingly better performance from the more careful/correct
>> CLOCK_MONOTONIC would cause developers to write fragile userland code that
>> will break when moved from one machine to the next.
>>
> The only goal for this new time source is for correlating user-level
> samples with kernel level samples, i.e., application level events with a
> PMU counter overflow for instance. Anybody trying anything else would be
> on their own.
>
> clock_gettime(CLOCK_PERF): guarantee to return the same time source as
> that used by the perf_event subsystem to timestamp samples when
> PERF_SAMPLE_TIME is requested in attr->sample_type.

I'm not familiar enough with perf's interfaces, but if you are going to
make this clockid bound so tightly with perf, could you maybe export a
perf timestamp from one of perf's interfaces rather than using the more
generic clock_gettime() interface?

>
>> I'd probably rather perf output timestamps to userland using sane clocks
>> (CLOCK_MONOTONIC), rather than trying to introduce a new time domain to
>> userland. But I probably could be convinced I'm wrong.
>>
> Can you get CLOCK_MONOTONIC efficiently and in ALL circumstances without
> grabbing any locks because that would need to run from NMI context?

No, of course, which is why we have sched_clock. But I'm suggesting we
consider changing what perf exports (via maybe interpolation/translation)
to be CLOCK_MONOTONIC-ish.
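
To make the "interpolation/translation" idea concrete, here is a minimal
sketch in C, not anything perf actually implements. It assumes that
something can record paired readings of the perf clock and CLOCK_MONOTONIC
at two points in time; the struct clock_pair type and the perf_to_mono()
helper are hypothetical names invented for this illustration. A sample
timestamp is then mapped into the CLOCK_MONOTONIC domain by linear
interpolation between the two pairs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: one reading of each clock, taken as close together as
 * possible, both in nanoseconds. Not a real perf or kernel structure. */
struct clock_pair {
    uint64_t perf_ns;  /* timestamp in the perf (sched_clock-like) domain */
    uint64_t mono_ns;  /* matching CLOCK_MONOTONIC timestamp */
};

/* Translate perf-domain timestamp t into the CLOCK_MONOTONIC domain by
 * linear interpolation between reference pairs a and b.
 * Requires a.perf_ns < b.perf_ns. Values of t outside [a, b] are
 * extrapolated along the same line. */
uint64_t perf_to_mono(struct clock_pair a, struct clock_pair b, uint64_t t)
{
    assert(b.perf_ns > a.perf_ns);

    /* Work with signed deltas so t may lie before a. */
    int64_t dperf = (int64_t)(b.perf_ns - a.perf_ns);
    int64_t dmono = (int64_t)(b.mono_ns - a.mono_ns);
    int64_t off   = (int64_t)(t - a.perf_ns);

    /* Scale in double to avoid 64-bit overflow in off * dmono; this is
     * exact only while the deltas stay well below 2^53 ns. */
    double scaled = (double)off * (double)dmono / (double)dperf;

    return (uint64_t)((int64_t)a.mono_ns + (int64_t)scaled);
}
```

With reference pairs (1000, 5000) and (2000, 6010), a perf timestamp of
1500 maps to 5505. The sketch ignores the per-CPU ordering caveat raised
earlier in the thread: reference pairs would presumably have to be kept
per CPU for the result to be meaningful.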

I'm not strongly objecting here, I just want to make sure other
alternatives are explored before we start giving applications another
internal-kernel-behavior-dependent interface to hang themselves with. :)

thanks
-john
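
For reference, the userland half of the CLOCK_MONOTONIC alternative could
look like the sketch below. It only uses the standard POSIX
clock_gettime() call; the struct user_event type and the mono_now_ns() and
mark_event() helpers are hypothetical names made up for this example. It
is useful for correlation only under the assumption discussed above, that
perf's sample timestamps are exported in, or can be translated to, the
CLOCK_MONOTONIC domain.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Return the current CLOCK_MONOTONIC reading in nanoseconds. */
uint64_t mono_now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;  /* silence unused warning when built with NDEBUG */
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Hypothetical record for a user-level event (e.g. a JIT code mapping)
 * that a tool would later line up against kernel samples. */
struct user_event {
    uint64_t mono_ns;   /* when the event happened, CLOCK_MONOTONIC ns */
    const char *what;   /* free-form label for the event */
};

/* Stamp a user-level event with the current CLOCK_MONOTONIC time. */
struct user_event mark_event(const char *what)
{
    struct user_event ev = { mono_now_ns(), what };
    return ev;
}
```

An application would call mark_event() at each point of interest and log
the records; nothing here reads perf's own clock, which is exactly the gap
this thread is about.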