From: Stephane Eranian
To: John Stultz
Cc: Peter Zijlstra, LKML, "mingo@elte.hu", Paul Mackerras, Anton Blanchard,
 Will Deacon, "ak@linux.intel.com", Pekka Enberg, Steven Rostedt,
 Robert Richter, tglx
Subject: Re: [RFC] perf: need to expose sched_clock to correlate user samples
 with kernel samples
Date: Sun, 11 Nov 2012 21:32:43 +0100
In-Reply-To: <509DB632.7070305@linaro.org>
References: <1350408232.2336.42.camel@laptop> <509DB632.7070305@linaro.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Nov 10, 2012 at 3:04 AM, John Stultz wrote:
> On 10/16/2012 10:23 AM, Peter Zijlstra wrote:
>>
>> On Tue, 2012-10-16 at 12:13 +0200, Stephane Eranian wrote:
>>>
>>> Hi,
>>>
>>> There are many situations where we want to correlate events happening
>>> at the user level with samples recorded in the perf_event kernel
>>> sampling buffer. For instance, we might want to correlate the call to
>>> a function or creation of a file with samples. Similarly, when we want
>>> to monitor a JVM with jitted code, we need to be able to correlate
>>> jitted code mappings with perf event samples for symbolization.
>>>
>>> Perf_events allows timestamping of samples with PERF_SAMPLE_TIME.
>>> That causes each PERF_RECORD_SAMPLE to include a timestamp
>>> generated by calling the local_clock() -> sched_clock_cpu() function.
>>>
>>> To make correlating user vs. kernel samples easy, we would need to
>>> access that sched_clock() functionality.
>>> However, none of the existing clock calls permit this at this point.
>>> They all return timestamps which are not using the same source and/or
>>> offset as sched_clock.
>>>
>>> I believe a similar issue exists with the ftrace subsystem.
>>>
>>> The problem needs to be addressed in a portable manner. Solutions
>>> based on reading TSC for the user level to reconstruct sched_clock()
>>> don't seem appropriate to me.
>>>
>>> One possibility to address this limitation would be to extend
>>> clock_gettime() with a new clock time, e.g., CLOCK_PERF.
>>>
>>> However, I understand that sched_clock_cpu() provides ordering
>>> guarantees only when invoked on the same CPU repeatedly, i.e., it's
>>> not globally synchronized. But we already have to deal with this
>>> problem when merging samples obtained from different CPU sampling
>>> buffers in per-thread mode. So this is not necessarily a showstopper.
>>>
>>> Alternatives could be to use uprobes but that's less practical to
>>> set up.
>>>
>>> Anyone with better ideas?
>>
>> You forgot to CC the time people ;-)
>>
>> I've no problem with adding CLOCK_PERF (or another/better name).
>
> Hrm. I'm not excited about exporting that sort of internal kernel
> details to userland.
>
> The behavior and expectations from sched_clock() have changed over the
> years, so I'm not sure it's wise to export it, since we'd have to
> preserve its behavior from then on.
>
It's not about just exposing sched_clock(). We need to expose a time
source that is exactly equivalent to what perf_event uses internally.
If sched_clock() changes, then the perf_event clock will change too,
and so would that new time source for clock_gettime(). As long as
everything remains consistent, we are good.
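
To illustrate why consistency is the only property that matters here:
if user-level events and kernel samples are stamped from the same time
source, correlating them reduces to merging two timestamp-ordered
streams. A minimal sketch follows. The record layout and the
merge_by_time() name are invented for illustration and are not part of
perf_event or any existing tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record type, NOT a perf_event structure. */
enum rec_origin { REC_USER, REC_KERNEL };

struct rec {
	uint64_t time_ns;        /* timestamp; both streams share one time base */
	enum rec_origin origin;  /* which stream the record came from */
};

/*
 * Merge two streams, each already sorted by time_ns, into out[].
 * out[] must have room for na + nb records. On equal timestamps the
 * record from a[] is emitted first (an arbitrary tie-break).
 * Returns the number of records written.
 */
static size_t merge_by_time(const struct rec *a, size_t na,
			    const struct rec *b, size_t nb,
			    struct rec *out)
{
	size_t i = 0, j = 0, k = 0;

	assert(out != NULL);

	while (i < na && j < nb) {
		if (a[i].time_ns <= b[j].time_ns)
			out[k++] = a[i++];
		else
			out[k++] = b[j++];
	}
	/* Drain whichever stream still has records left. */
	while (i < na)
		out[k++] = a[i++];
	while (j < nb)
		out[k++] = b[j++];

	return k;
}
```

The merge only gives a meaningful order if both inputs use the same
source and offset, which is exactly what the existing clock calls do
not provide.
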
> Also I worry that it will be abused in the same way that direct TSC
> access is, where the seemingly better performance from the more
> careful/correct CLOCK_MONOTONIC would cause developers to write fragile
> userland code that will break when moved from one machine to the next.
>
The only goal for this new time source is to correlate user-level
samples with kernel-level samples, i.e., application-level events with
a PMU counter overflow, for instance. Anybody trying anything else
would be on their own.

clock_gettime(CLOCK_PERF): guaranteed to return the same time source as
that used by the perf_event subsystem to timestamp samples when
PERF_SAMPLE_TIME is requested in attr->sample_type.

> I'd probably rather perf output timestamps to userland using sane
> clocks (CLOCK_MONOTONIC), rather than trying to introduce a new time
> domain to userland. But I probably could be convinced I'm wrong.
>
Can you get CLOCK_MONOTONIC efficiently and in ALL circumstances
without grabbing any locks? It would need to run from NMI context.
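
Going back to the CLOCK_PERF definition above, here is a rough sketch
of what the user-level side could look like. CLOCK_PERF is only the
clock id proposed in this thread and is not defined by any header, so
the sketch falls back to CLOCK_MONOTONIC purely to be compilable. With
that fallback the values are NOT comparable with PERF_SAMPLE_TIME
timestamps. The user_event_now_ns() helper name is made up.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * CLOCK_PERF is hypothetical (proposed in this thread only).
 * CLOCK_MONOTONIC is a stand-in so that the sketch builds and runs.
 */
#ifdef CLOCK_PERF
#define USER_EVENT_CLOCK CLOCK_PERF
#else
#define USER_EVENT_CLOCK CLOCK_MONOTONIC
#endif

/*
 * Timestamp a user-level event (function call, file creation, jitted
 * code mapping, ...) in nanoseconds. Returns 0 if the clock cannot be
 * read.
 */
static uint64_t user_event_now_ns(void)
{
	struct timespec ts;

	if (clock_gettime(USER_EVENT_CLOCK, &ts) != 0)
		return 0;

	/* POSIX guarantees tv_nsec is in [0, 1e9). */
	assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);

	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}
```

A tool would store the returned value next to each user-level event
and later match it against the sample timestamps, which only works if
both come from the same time source.
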