linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/3] perf: User/kernel time correlation and event generation
@ 2014-11-04  0:28 Pawel Moll
  2014-11-04  0:28 ` [PATCH v3 1/3] perf: Use monotonic clock as a source for timestamps Pawel Moll
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Pawel Moll @ 2014-11-04  0:28 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel, linux-api, Pawel Moll

From: Pawel Moll <mail@pawelmoll.com>

Hello again,

Back to the subject, this time with a slightly different angle...

I've organised a session on the subject during the tracing
minisummit at LPC 2014 in Dusseldorf. Notes taken from the discussion
taken by Steven Rostedt (thanks Steve!)

http://www.linuxplumbersconf.org/2014/wp-content/uploads/2014/10/LPC2014_Tracing.txt

The following three patches address three main topics. They are pretty
much orthogonal and (subject to small and obvious modifications) could
be applied independently from each other.

An executive summary of both discussion and the patches:

1. User/kernel perf timestamps correlation

Thomas suggested that, instead of jumping through loops of correlation,
perf should simply generate monotonic clock timestamps, instead of
using sched clock. Peter and I pointed out that Ingo didn't like this
idea as monotonic can be slow, but apparently the cases when it is are
irrelevant. Thomas offered to fly to Budapest to physically convince
Ingo - I hope it won't be necessary and he will be able to achieve
this here, on mailing lists :-)

Setting aside potential performance problems, it would be a really
great solution, unifying all trace systems (perf, ftrace and LTTng)
in this respect. I'm more than happy to work on potential improvements
in the are of monotonic clock if it was to help the cause.

If it is a definite no-go, we still have the third patch, allowing post-
capture correlation based on synchronisation events.

2. User event generation

Everyone present agreed that it would be a very-nice-to-have feature.
There was some discussion about implementation details, so I welcome
feedback and comments regarding my take on the matter.

3. Correlation with external timestamps

This is a new issue, which surfaced recently while I was working on
hardware trace infrastructure. It can timestamp trace packets, but is
using yet another, completely different time source to do this.

Thomas suggested solution which gets down to my original proposal for
sched/monotonic clock correlation - an additional sample type so events
can be "double stamped" using different clock sources providing
synchronisation points for later time approximation. I've just extended
the implementation with configuration value to select the clock source.
If the first patch (making perf timestamps monotonic) gets accepted,
there will be no immediate need for this one, but I'd like to gain some
feedback anyway.


That's all for this series. Previous versions:

- RFC: http://www.spinics.net/lists/kernel/msg1824419.html
- v1: http://thread.gmane.org/gmane.linux.kernel/1790231
- v2: http://thread.gmane.org/gmane.linux.kernel/1793272

Pawel Moll (3):
  perf: Use monotonic clock as a source for timestamps
  perf: Userspace event
  perf: Sample additional clock value

 include/linux/perf_event.h      |   7 +++
 include/uapi/linux/perf_event.h |  37 +++++++++++++-
 include/uapi/linux/prctl.h      |  10 ++++
 kernel/events/core.c            | 105 +++++++++++++++++++++++++++++++++++++++-
 kernel/sys.c                    |   5 ++
 kernel/sysctl.c                 |   7 +++
 6 files changed, 168 insertions(+), 3 deletions(-)

-- 
1.8.3.2


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 1/3] perf: Use monotonic clock as a source for timestamps
  2014-11-04  0:28 [PATCH v3 0/3] perf: User/kernel time correlation and event generation Pawel Moll
@ 2014-11-04  0:28 ` Pawel Moll
  2014-11-04  7:23   ` Peter Zijlstra
  2014-11-04  0:28 ` [PATCH v3 2/3] perf: Userspace event Pawel Moll
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Pawel Moll @ 2014-11-04  0:28 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel, linux-api, Pawel Moll

Until now, perf framework never defined the meaning of the timestampt
captured as PERF_SAMPLE_TIME sample type. The values were obtaining
from local (sched) clock, which is unavailable in userspace. This made
it impossible to correlate perf data with any other events. Other
tracing solutions have the source configurable (ftrace) or just share
a common time domain between kernel and userspace (LTTng).

Follow the trend by using monotonic clock, which is readily available
as POSIX CLOCK_MONOTONIC.

Also add a sysctl "perf_sample_time_clk_id" attribute which can be used
by the user to obtain the clk_id to be used with POSIX clock API (eg.
clock_gettime()) to obtain a time value comparable with perf samples.

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---

Ingo, I remember your comments about this approach in the past, but
during discussions at LPC Thomas was convinced that it's the right
thing to do - see cover letter for the series...

 include/linux/perf_event.h | 1 +
 kernel/events/core.c       | 4 +++-
 kernel/sysctl.c            | 7 +++++++
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 893a0d0..ba490d5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -738,6 +738,7 @@ extern int sysctl_perf_event_paranoid;
 extern int sysctl_perf_event_mlock;
 extern int sysctl_perf_event_sample_rate;
 extern int sysctl_perf_cpu_time_max_percent;
+extern int sysctl_perf_sample_time_clk_id;
 
 extern void perf_sample_event_took(u64 sample_len_ns);
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2b02c9f..ea3d6d3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -234,6 +234,8 @@ int perf_cpu_time_max_percent_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+int sysctl_perf_sample_time_clk_id = CLOCK_MONOTONIC;
+
 /*
  * perf samples are done in some very critical code paths (NMIs).
  * If they take too much CPU time, the system can lock up and not
@@ -324,7 +326,7 @@ extern __weak const char *perf_pmu_name(void)
 
 static inline u64 perf_clock(void)
 {
-	return local_clock();
+	return ktime_get_mono_fast_ns();
 }
 
 static inline struct perf_cpu_context *
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 15f2511..cb75f5b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1094,6 +1094,13 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+	{
+		.procname	= "perf_sample_time_clk_id",
+		.data		= &sysctl_perf_sample_time_clk_id,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0444,
+		.proc_handler	= proc_dointvec,
+	},
 #endif
 #ifdef CONFIG_KMEMCHECK
 	{
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 2/3] perf: Userspace event
  2014-11-04  0:28 [PATCH v3 0/3] perf: User/kernel time correlation and event generation Pawel Moll
  2014-11-04  0:28 ` [PATCH v3 1/3] perf: Use monotonic clock as a source for timestamps Pawel Moll
@ 2014-11-04  0:28 ` Pawel Moll
  2014-11-04  6:33   ` Namhyung Kim
  2014-11-04  0:28 ` [PATCH v3 3/3] perf: Sample additional clock value Pawel Moll
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Pawel Moll @ 2014-11-04  0:28 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel, linux-api, Pawel Moll, Pawel Moll

From: Pawel Moll <mail@pawelmoll.com>

This patch adds a PR_TASK_PERF_UEVENT prctl call which can be used by
any process to inject custom data into perf data stream as a new
PERF_RECORD_UEVENT record, if such process is being observed or if it
is running on a CPU being observed by the perf framework.

The prctl call takes the following arguments:

	prctl(PR_TASK_PERF_UEVENT, type, size, data, flags);

- type: a number meaning to describe content of the following data.
  Kernel does not pay attention to it and merely passes it further in
  the perf data, therefore its use must be agreed between the events
  producer (the process being observed) and the consumer (performance
  analysis tool). The perf userspace tool will contain a repository of
  "well known" types and reference implementation of their decoders.
- size: Length in bytes of the data.
- data: Pointer to the data.
- flags: Reserved for future use. Always pass zero.

Perf context that are supposed to receive events generated with the
prctl above must be opened with perf_event_attr.uevent set to 1. The
PERF_RECORD_UEVENT records consist of a standard perf event header,
32-bit type value, 32-bit data size and the data itself, followed by
padding to align the overall record size to 8 bytes and optional,
standard sample_id field.

Example use cases:

- "perf_printf" like mechanism to add logging messages to perf data;
  in the simplest case it can be just

	prctl(PR_TASK_PERF_UEVENT, 0, 8, "Message", 0);

- synchronisation of performance data generated in user space with the
  perf stream coming from the kernel. For example, the marker can be
  inserted by a JIT engine after it generated portion of the code, but
  before the code is executed for the first time, allowing the
  post-processor to pick the correct debugging information.

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---
 include/linux/perf_event.h      |  4 +++
 include/uapi/linux/perf_event.h | 23 ++++++++++++-
 include/uapi/linux/prctl.h      | 10 ++++++
 kernel/events/core.c            | 71 +++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c                    |  5 +++
 5 files changed, 112 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ba490d5..867415d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -721,6 +721,8 @@ extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks
 extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
 extern void perf_event_fork(struct task_struct *tsk);
+extern int perf_uevent(struct task_struct *tsk, u32 type, u32 size,
+		       const char __user *data);
 
 /* Callchains */
 DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
@@ -830,6 +832,8 @@ static inline void perf_event_mmap(struct vm_area_struct *vma)		{ }
 static inline void perf_event_exec(void)				{ }
 static inline void perf_event_comm(struct task_struct *tsk, bool exec)	{ }
 static inline void perf_event_fork(struct task_struct *tsk)		{ }
+static inline int perf_uevent(struct task_struct *tsk, u32 type, u32 size,
+		              const char __user *data)			{ return -1; };
 static inline void perf_event_init(void)				{ }
 static inline int  perf_swevent_get_recursion_context(void)		{ return -1; }
 static inline void perf_swevent_put_recursion_context(int rctx)		{ }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9d84540..9a64eb1 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -303,7 +303,8 @@ struct perf_event_attr {
 				exclude_callchain_user   : 1, /* exclude user callchains */
 				mmap2          :  1, /* include mmap with inode data     */
 				comm_exec      :  1, /* flag comm events that are due to an exec */
-				__reserved_1   : 39;
+				uevents        :  1, /* allow uevents into the buffer */
+				__reserved_1   : 38;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -712,6 +713,26 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_MMAP2			= 10,
 
+	/*
+	 * Data in userspace event record is transparent for the kernel
+	 *
+	 * Userspace perf tool code maintains a list of known types with
+	 * reference implementations of parsers for the data field.
+	 *
+	 * Overall size of the record (including type and size fields)
+	 * is always aligned to 8 bytes by adding padding after the data.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u32				type;
+	 *	u32				size;
+	 *	char				data[size];
+	 *	char				__padding[-size & 7];
+	 * 	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_UEVENT			= 11,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 513df75..2a6852f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -179,4 +179,14 @@ struct prctl_mm_map {
 #define PR_SET_THP_DISABLE	41
 #define PR_GET_THP_DISABLE	42
 
+/*
+ * Perf userspace event generation
+ *
+ * second argument: event type
+ * third argument:  data size
+ * fourth argument: pointer to data
+ * fifth argument:  flags (currently unused, pass 0)
+ */
+#define PR_TASK_PERF_UEVENT	43
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ea3d6d3..3738e9c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5565,6 +5565,77 @@ static void perf_log_throttle(struct perf_event *event, int enable)
 }
 
 /*
+ * Userspace-generated event
+ */
+
+struct perf_uevent {
+	struct perf_event_header	header;
+	u32				type;
+	u32				size;
+	u8				data[0];
+};
+
+static void perf_uevent_output(struct perf_event *event, void *data)
+{
+	struct perf_uevent *uevent = data;
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	int size = uevent->header.size;
+
+	if (!event->attr.uevents)
+		return;
+
+	perf_event_header__init_id(&uevent->header, &sample, event);
+
+	if (perf_output_begin(&handle, event, uevent->header.size) != 0)
+		goto out;
+	perf_output_put(&handle, uevent->header);
+	perf_output_put(&handle, uevent->type);
+	perf_output_put(&handle, uevent->size);
+	__output_copy(&handle, uevent->data, uevent->size);
+
+	/* Padding to align overall data size to 8 bytes */
+	perf_output_skip(&handle, -uevent->size & (sizeof(u64) - 1));
+
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+out:
+	uevent->header.size = size;
+}
+
+int perf_uevent(struct task_struct *tsk, u32 type, u32 size,
+		const char __user *data)
+{
+	struct perf_uevent *uevent;
+
+	/* Need some reasonable limit */
+	if (size > PAGE_SIZE)
+		return -E2BIG;
+
+	uevent = kmalloc(sizeof(*uevent) + size, GFP_KERNEL);
+	if (!uevent)
+		return -ENOMEM;
+
+	uevent->header.type = PERF_RECORD_UEVENT;
+	uevent->header.size = sizeof(*uevent) + ALIGN(size, sizeof(u64));
+
+	uevent->type = type;
+	uevent->size = size;
+	if (copy_from_user(uevent->data, data, size)) {
+		kfree(uevent);
+		return -EFAULT;
+	}
+
+	perf_event_aux(perf_uevent_output, uevent, NULL);
+
+	kfree(uevent);
+
+	return 0;
+}
+
+
+/*
  * Generic event overflow handling, sampling.
  */
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 1eaa2f0..1c83677 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2121,6 +2121,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_TASK_PERF_EVENTS_ENABLE:
 		error = perf_event_task_enable();
 		break;
+	case PR_TASK_PERF_UEVENT:
+		if (arg5 != 0)
+			return -EINVAL;
+		error = perf_uevent(me, arg2, arg3, (char __user *)arg4);
+		break;
 	case PR_GET_TIMERSLACK:
 		error = current->timer_slack_ns;
 		break;
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 3/3] perf: Sample additional clock value
  2014-11-04  0:28 [PATCH v3 0/3] perf: User/kernel time correlation and event generation Pawel Moll
  2014-11-04  0:28 ` [PATCH v3 1/3] perf: Use monotonic clock as a source for timestamps Pawel Moll
  2014-11-04  0:28 ` [PATCH v3 2/3] perf: Userspace event Pawel Moll
@ 2014-11-04  0:28 ` Pawel Moll
  2014-11-04  0:58 ` [PATCH v3 0/3] perf: User/kernel time correlation and event generation Andy Lutomirski
  2014-11-04  9:24 ` Masami Hiramatsu
  4 siblings, 0 replies; 24+ messages in thread
From: Pawel Moll @ 2014-11-04  0:28 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel, linux-api, Pawel Moll, Pawel Moll

From: Pawel Moll <mail@pawelmoll.com>

This patch adds an option to sample value of an additional clock with
any perf event, with the the aim of allowing time correlation between
data coming from perf and other sources like hardware trace which is
timestamped with an external clock.

The idea is to generate periodic perf record containing timestamps from
two different sources, sampled as close to each other as possible. This
allows performing simple linear approximation:

        perf event    other event
       -----O--------------+-------------O------> t_other
            :              |             :
            :              V             :
       -----O----------------------------O------> t_perf

User can request such samples for any standard perf event, with the most
obvious examples of cpu-clock (hrtimer) at selected frequency or other
periodic events like sched:sched_switch.

In order to do this, PERF_SAMPLE_CLOCK has to be set in struct
perf_event_attr.sample_type and a type of the clock to be sampled
selected by setting perf_event_attr.clock to a value corresponding to
a POSIX clock clk_id (see "man 2 clock_gettime")

Currently three clocks are implemented: CLOCK_REALITME = 0,
CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 2. The clock field is
5 bits wide to allow for future extension to custom, non-POSIX clock
sources(MAX_CLOCK for those is 16, see include/uapi/linux/time.h) like
ARM CoreSight (hardware trace) timestamp generator.

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---
 include/linux/perf_event.h      |  2 ++
 include/uapi/linux/perf_event.h | 16 ++++++++++++++--
 kernel/events/core.c            | 30 ++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 867415d..395d6ed 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -607,6 +607,8 @@ struct perf_sample_data {
 	 * Transaction flags for abort events:
 	 */
 	u64				txn;
+	/* Clock value (additional timestamp for time correlation) */
+	u64				clock;
 };
 
 /* default value for data source */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9a64eb1..53a7a72 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -137,8 +137,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_DATA_SRC			= 1U << 15,
 	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
+	PERF_SAMPLE_CLOCK			= 1U << 18,
 
-	PERF_SAMPLE_MAX = 1U << 18,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
 };
 
 /*
@@ -304,7 +305,16 @@ struct perf_event_attr {
 				mmap2          :  1, /* include mmap with inode data     */
 				comm_exec      :  1, /* flag comm events that are due to an exec */
 				uevents        :  1, /* allow uevents into the buffer */
-				__reserved_1   : 38;
+
+				/*
+				 * clock: one of the POSIX clock IDs:
+				 *
+				 * 0 - CLOCK_REALTIME
+				 * 1 - CLOCK_MONOTONIC
+				 * 4 - CLOCK_MONOTONIC_RAW
+				 */
+				clock          :  5, /* clock type */
+				__reserved_1   : 33;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -544,6 +554,7 @@ enum perf_event_type {
 	 * 	{ u64			id;       } && PERF_SAMPLE_ID
 	 * 	{ u64			stream_id;} && PERF_SAMPLE_STREAM_ID
 	 * 	{ u32			cpu, res; } && PERF_SAMPLE_CPU
+	 *	{ u64			clock;    } && PERF_SAMPLE_CLOCK
 	 *	{ u64			id;	  } && PERF_SAMPLE_IDENTIFIER
 	 * } && perf_event_attr::sample_id_all
 	 *
@@ -687,6 +698,7 @@ enum perf_event_type {
 	 *	{ u64			weight;   } && PERF_SAMPLE_WEIGHT
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
+	 *	{ u64			clock;    } && PERF_SAMPLE_CLOCK
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3738e9c..611e2f7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1232,6 +1232,9 @@ static void perf_event__header_size(struct perf_event *event)
 	if (sample_type & PERF_SAMPLE_TRANSACTION)
 		size += sizeof(data->txn);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		size += sizeof(data->clock);
+
 	event->header_size = size;
 }
 
@@ -1259,6 +1262,9 @@ static void perf_event__id_header_size(struct perf_event *event)
 	if (sample_type & PERF_SAMPLE_CPU)
 		size += sizeof(data->cpu_entry);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		size += sizeof(data->clock);
+
 	event->id_header_size = size;
 }
 
@@ -4599,6 +4605,24 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
 		data->cpu_entry.cpu	 = raw_smp_processor_id();
 		data->cpu_entry.reserved = 0;
 	}
+
+	if (sample_type & PERF_SAMPLE_CLOCK) {
+		switch (event->attr.clock) {
+		case CLOCK_REALTIME:
+			data->clock = ktime_get_real_ns();
+			break;
+		case CLOCK_MONOTONIC:
+			data->clock = ktime_get_mono_fast_ns();
+			break;
+		case CLOCK_MONOTONIC_RAW:
+			data->clock = ktime_get_raw_ns();
+			break;
+		default:
+			data->clock = 0;
+			break;
+		}
+	}
+
 }
 
 void perf_event_header__init_id(struct perf_event_header *header,
@@ -4629,6 +4653,9 @@ static void __perf_event__output_id_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_CPU)
 		perf_output_put(handle, data->cpu_entry);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		perf_output_put(handle, data->clock);
+
 	if (sample_type & PERF_SAMPLE_IDENTIFIER)
 		perf_output_put(handle, data->id);
 }
@@ -4857,6 +4884,9 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_TRANSACTION)
 		perf_output_put(handle, data->txn);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		perf_output_put(handle, data->clock);
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
  2014-11-04  0:28 [PATCH v3 0/3] perf: User/kernel time correlation and event generation Pawel Moll
                   ` (2 preceding siblings ...)
  2014-11-04  0:28 ` [PATCH v3 3/3] perf: Sample additional clock value Pawel Moll
@ 2014-11-04  0:58 ` Andy Lutomirski
  2014-11-04  1:11   ` John Stultz
  2014-11-04  9:24 ` Masami Hiramatsu
  4 siblings, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2014-11-04  0:58 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso, linux-kernel,
	Linux API, Pawel Moll

On Mon, Nov 3, 2014 at 4:28 PM, Pawel Moll <pawel.moll@arm.com> wrote:
> From: Pawel Moll <mail@pawelmoll.com>
> Thomas suggested solution which gets down to my original proposal for
> sched/monotonic clock correlation - an additional sample type so events
> can be "double stamped" using different clock sources providing
> synchronisation points for later time approximation. I've just extended
> the implementation with configuration value to select the clock source.
> If the first patch (making perf timestamps monotonic) gets accepted,
> there will be no immediate need for this one, but I'd like to gain some
> feedback anyway.
>

I have nothing intelligent to add to the potentional Thomas/Ingo
showdown, but I do have a related thought. :)

If you're going to add double-stamped packets, can you also add a
syscall to read multiple clocks at once, atomically?  Or can you
otherwise add a non-perf mechanism to get at this data?

Because the realtime to monotonic offset is really quite useful for
things like this, and it seems silly to make people actually open a
perf_event to get at it.

--Andy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
  2014-11-04  0:58 ` [PATCH v3 0/3] perf: User/kernel time correlation and event generation Andy Lutomirski
@ 2014-11-04  1:11   ` John Stultz
  2014-11-04  1:25     ` Andy Lutomirski
                       ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: John Stultz @ 2014-11-04  1:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Pawel Moll, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso, linux-kernel,
	Linux API, Pawel Moll

On Mon, Nov 3, 2014 at 4:58 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Nov 3, 2014 at 4:28 PM, Pawel Moll <pawel.moll@arm.com> wrote:
>> From: Pawel Moll <mail@pawelmoll.com>
>> Thomas suggested solution which gets down to my original proposal for
>> sched/monotonic clock correlation - an additional sample type so events
>> can be "double stamped" using different clock sources providing
>> synchronisation points for later time approximation. I've just extended
>> the implementation with configuration value to select the clock source.
>> If the first patch (making perf timestamps monotonic) gets accepted,
>> there will be no immediate need for this one, but I'd like to gain some
>> feedback anyway.
>>
>
> I have nothing intelligent to add to the potentional Thomas/Ingo
> showdown, but I do have a related thought. :)
>
> If you're going to add double-stamped packets, can you also add a
> syscall to read multiple clocks at once, atomically?  Or can you
> otherwise add a non-perf mechanism to get at this data?
>
> Because the realtime to monotonic offset is really quite useful for
> things like this, and it seems silly to make people actually open a
> perf_event to get at it.

So this comes up periodically, but I don't think I've seen a interface
proposal that was decent yet.

Also, if you want to read multiple clocks at once, do you stop at two,
or three, or... there's possibly quite a few.  Additionally some
clocks may not be possible to read atomically (perf/sched clock and
system time for example may be based on different underlying
clocksources). The general idea feels like its creeping towards some
"atomically expose all timekeeping state" mega-interface.

I've got some thoughts on what a possible interface that wouldn't be
awful could look like, but I'm still hesitant because I don't really
know if exposing this sort of data is actually a good idea long term.

thanks
-john

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
  2014-11-04  1:11   ` John Stultz
@ 2014-11-04  1:25     ` Andy Lutomirski
  2014-11-04 15:07       ` Pawel Moll
  2014-11-04  8:01     ` Arnd Bergmann
  2014-11-04  8:24     ` Richard Cochran
  2 siblings, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2014-11-04  1:25 UTC (permalink / raw)
  To: John Stultz
  Cc: Pawel Moll, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso, linux-kernel,
	Linux API, Pawel Moll

On Mon, Nov 3, 2014 at 5:11 PM, John Stultz <john.stultz@linaro.org> wrote:
> On Mon, Nov 3, 2014 at 4:58 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mon, Nov 3, 2014 at 4:28 PM, Pawel Moll <pawel.moll@arm.com> wrote:
>>> From: Pawel Moll <mail@pawelmoll.com>
>>> Thomas suggested solution which gets down to my original proposal for
>>> sched/monotonic clock correlation - an additional sample type so events
>>> can be "double stamped" using different clock sources providing
>>> synchronisation points for later time approximation. I've just extended
>>> the implementation with configuration value to select the clock source.
>>> If the first patch (making perf timestamps monotonic) gets accepted,
>>> there will be no immediate need for this one, but I'd like to gain some
>>> feedback anyway.
>>>
>>
>> I have nothing intelligent to add to the potentional Thomas/Ingo
>> showdown, but I do have a related thought. :)
>>
>> If you're going to add double-stamped packets, can you also add a
>> syscall to read multiple clocks at once, atomically?  Or can you
>> otherwise add a non-perf mechanism to get at this data?
>>
>> Because the realtime to monotonic offset is really quite useful for
>> things like this, and it seems silly to make people actually open a
>> perf_event to get at it.
>
> So this comes up periodically, but I don't think I've seen a interface
> proposal that was decent yet.
>
> Also, if you want to read multiple clocks at once, do you stop at two,
> or three, or... there's possibly quite a few.  Additionally some
> clocks may not be possible to read atomically (perf/sched clock and
> system time for example may be based on different underlying
> clocksources). The general idea feels like its creeping towards some
> "atomically expose all timekeeping state" mega-interface.
>
> I've got some thoughts on what a possible interface that wouldn't be
> awful could look like, but I'm still hesitant because I don't really
> know if exposing this sort of data is actually a good idea long term.

My only real thought here is that, if perf is going to try to do this,
then presumably it should be reasonably integrated w/ the core timing
code.  I.e. if perf does this, then presumably the core code should
know about it and there should be a core interface to it.

--Andy

>
> thanks
> -john



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/3] perf: Userspace event
  2014-11-04  0:28 ` [PATCH v3 2/3] perf: Userspace event Pawel Moll
@ 2014-11-04  6:33   ` Namhyung Kim
  2014-11-04 16:42     ` Pawel Moll
  0 siblings, 1 reply; 24+ messages in thread
From: Namhyung Kim @ 2014-11-04  6:33 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api,
	Pawel Moll

Hi Pawel,

On Tue,  4 Nov 2014 00:28:37 +0000, Pawel Moll wrote:
> +	/*
> +	 * Data in userspace event record is transparent for the kernel
> +	 *
> +	 * Userspace perf tool code maintains a list of known types with
> +	 * reference implementations of parsers for the data field.
> +	 *
> +	 * Overall size of the record (including type and size fields)
> +	 * is always aligned to 8 bytes by adding padding after the data.
> +	 *
> +	 * struct {
> +	 *	struct perf_event_header	header;
> +	 *	u32				type;
> +	 *	u32				size;

The struct perf_event_header also has 'size' field and it has the entire
length of the record so it's redundant.  Also there's 'misc' field in the
perf_event_header and I guess it can be used as 'type' info as it's
mostly for cpumode and we are in user mode by definition.

Thanks,
Namhyung


> +	 *	char				data[size];
> +	 *	char				__padding[-size & 7];
> +	 * 	struct sample_id		sample_id;
> +	 * };
> +	 */
> +	PERF_RECORD_UEVENT			= 11,
> +
>  	PERF_RECORD_MAX,			/* non-ABI */
>  };

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/3] perf: Use monotonic clock as a source for timestamps
  2014-11-04  0:28 ` [PATCH v3 1/3] perf: Use monotonic clock as a source for timestamps Pawel Moll
@ 2014-11-04  7:23   ` Peter Zijlstra
  2014-11-04 15:25     ` Pawel Moll
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2014-11-04  7:23 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Tue, Nov 04, 2014 at 12:28:36AM +0000, Pawel Moll wrote:

> +int sysctl_perf_sample_time_clk_id = CLOCK_MONOTONIC;

const ?

>  /*
>   * perf samples are done in some very critical code paths (NMIs).
>   * If they take too much CPU time, the system can lock up and not
> @@ -324,7 +326,7 @@ extern __weak const char *perf_pmu_name(void)
>  
>  static inline u64 perf_clock(void)
>  {
> -	return local_clock();
> +	return ktime_get_mono_fast_ns();
>  }

Do we maybe want to make it boot-time switchable back to local_clock for
people with bad systems and or backwards compat issues?

Everybody using Core2 and older will very much not want to have this
unless they've got a very good reason for wanting it.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
  2014-11-04  1:11   ` John Stultz
  2014-11-04  1:25     ` Andy Lutomirski
@ 2014-11-04  8:01     ` Arnd Bergmann
  2014-11-04  8:27       ` Richard Cochran
  2014-11-04 16:04       ` John Stultz
  2014-11-04  8:24     ` Richard Cochran
  2 siblings, 2 replies; 24+ messages in thread
From: Arnd Bergmann @ 2014-11-04  8:01 UTC (permalink / raw)
  To: John Stultz
  Cc: Andy Lutomirski, Pawel Moll, Richard Cochran, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, Linux API,
	Pawel Moll

On Monday 03 November 2014 17:11:53 John Stultz wrote:
> I've got some thoughts on what a possible interface that wouldn't be
> awful could look like, but I'm still hesitant because I don't really
> know if exposing this sort of data is actually a good idea long term.
 
I was also thinking (while working on an unrelated patch) we could use
a system call like

int clock_getoffset(clockid_t clkid, struct timespec *offs);

that returns the current offset between CLOCK_REALTIME and the
requested timebase. It is of course racy, but so is every use
of CLOCK_REALTIME. We could also use a reference other than
CLOCK_REALTIME that might be more stable, but passing two arbitrary
clocks as input would make this much more complex to implement.

	Arnd

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
  2014-11-04  1:11   ` John Stultz
  2014-11-04  1:25     ` Andy Lutomirski
  2014-11-04  8:01     ` Arnd Bergmann
@ 2014-11-04  8:24     ` Richard Cochran
  2 siblings, 0 replies; 24+ messages in thread
From: Richard Cochran @ 2014-11-04  8:24 UTC (permalink / raw)
  To: John Stultz
  Cc: Andy Lutomirski, Pawel Moll, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso, linux-kernel,
	Linux API, Pawel Moll

On Mon, Nov 03, 2014 at 05:11:53PM -0800, John Stultz wrote:
> On Mon, Nov 3, 2014 at 4:58 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > If you're going to add double-stamped packets, can you also add a
> > syscall to read multiple clocks at once, atomically?  Or can you
> > otherwise add a non-perf mechanism to get at this data?

Does not need to be "atomic". In fact it cannot be atomic in the
general case. Some clocks are read over memory mapped registers, but
others are read over slow and sleepy buses like PCIe or MDIO.

> > Because the realtime to monotonic offset is really quite useful for
> > things like this, and it seems silly to make people actually open a
> > perf_event to get at it.
> 
> So this comes up periodically, but I don't think I've seen a interface
> proposal that was decent yet.

We have ioctl PTP_SYS_OFFSET that alternately reads a dynamic clock
and CLOCK_REALTIME a given number of times. This is done without locks
or any kind of "atomic" guarantees, and it works quite well in
practice. The user can pick the number of repetitions to deal with
noisy run time environments, and usually it is a simple matter of
picking the reading with the shortest duration. However, the user is
free to do statistics over the readings in any way he wants.

It would be nice (and people have requested) a syscall that takes two
clockid_t arguments but otherwise works like PTP_SYS_OFFSET.

We really will never have to support more than two clocks. The
application will pick one clock as the reference and then measure each
of the other clocks relative to it, one at a time. The performance
should be perfectly adequate, even better than reading three or more
at once (with the understanding that these are "software" time
stamps).

Thanks,
Richard





^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
  2014-11-04  8:01     ` Arnd Bergmann
@ 2014-11-04  8:27       ` Richard Cochran
  2014-11-04 10:49         ` Thomas Gleixner
  2014-11-04 16:04       ` John Stultz
  1 sibling, 1 reply; 24+ messages in thread
From: Richard Cochran @ 2014-11-04  8:27 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: John Stultz, Andy Lutomirski, Pawel Moll, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, Linux API,
	Pawel Moll

On Tue, Nov 04, 2014 at 09:01:31AM +0100, Arnd Bergmann wrote:
> On Monday 03 November 2014 17:11:53 John Stultz wrote:
> > I've got some thoughts on what a possible interface that wouldn't be
> > awful could look like, but I'm still hesitant because I don't really
> > know if exposing this sort of data is actually a good idea long term.
>  
> I was also thinking (while working on an unrelated patch) we could use
> a system call like
> 
> int clock_getoffset(clockid_t clkid, struct timespec *offs);
> 
> that returns the current offset between CLOCK_REALTIME and the
> requested timebase. It is of course racy, but so is every use
> of CLOCK_REALTIME. We could also use a reference other than
> CLOCK_REALTIME that might be more stable, but passing two arbitrary
> clocks as input would make this much more complex to implement.

No, it is really easy to implement. Just drop the idea of "atomic". It
really is not necessary or even possible.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
  2014-11-04  0:28 [PATCH v3 0/3] perf: User/kernel time correlation and event generation Pawel Moll
                   ` (3 preceding siblings ...)
  2014-11-04  0:58 ` [PATCH v3 0/3] perf: User/kernel time correlation and event generation Andy Lutomirski
@ 2014-11-04  9:24 ` Masami Hiramatsu
  2014-11-04 15:51   ` Pawel Moll
  4 siblings, 1 reply; 24+ messages in thread
From: Masami Hiramatsu @ 2014-11-04  9:24 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api,
	Pawel Moll

Hello,

(2014/11/04 9:28), Pawel Moll wrote:
> 2. User event generation
> 
> Everyone present agreed that it would be a very-nice-to-have feature.
> There was some discussion about implementation details, so I welcome
> feedback and comments regarding my take on the matter.

Hmm, I'm trying to make a similar thing, dynamic event definition via
ftrace, which is already done by kprobes/uprobes. And this will be shown
as dynamic events from perf too.

What I'd like to do is the binary version of ftrace-marker, the text
version is already supported by qemu (see below).
https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg00505.html

But since that is just a string data (not structured data), it is hard to
analyze via perf-script or some other useful filters/triggers in ftrace.

In my idea, the new event will be defined via a special file in debugfs like
kprobe-events, like below.

  # cd $debugfs/tracing
  # echo "newgrp/newevent signarg:s32 flag:u64" >> marker_events
  # cat events/newgrp/newevent/format
  name: newevent
  ID: 2048
  format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1;signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:s32 signarg;      offset:8;      size:4; signed:1;
        field:u64 flag; offset:12;      size:8; signed:0;

  print fmt: "signarg=%d flag=0x%Lx", REC->signarg, REC->flag

Then, users will write the data (excluded common fields) when the event happens
via trace_marker which start with '\0'ID(in u32). Kernel just checks the ID and
its data size, but doesn't parse, filter/trigger it and log it into the kernel buffer.

Of course, this has a downside that the user must have a privilege to access to debugfs.
Thus maybe we need both of prctl() IF for perf and this IF for ftrace.

Thank you,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
  2014-11-04  8:27       ` Richard Cochran
@ 2014-11-04 10:49         ` Thomas Gleixner
  2014-11-04 16:11           ` Arnd Bergmann
  0 siblings, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2014-11-04 10:49 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Arnd Bergmann, John Stultz, Andy Lutomirski, Pawel Moll,
	Steven Rostedt, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern, Tomeu Vizoso,
	linux-kernel, Linux API, Pawel Moll

On Tue, 4 Nov 2014, Richard Cochran wrote:

> On Tue, Nov 04, 2014 at 09:01:31AM +0100, Arnd Bergmann wrote:
> > On Monday 03 November 2014 17:11:53 John Stultz wrote:
> > > I've got some thoughts on what a possible interface that wouldn't be
> > > awful could look like, but I'm still hesitant because I don't really
> > > know if exposing this sort of data is actually a good idea long term.
> >  
> > I was also thinking (while working on an unrelated patch) we could use
> > a system call like
> > 
> > int clock_getoffset(clockid_t clkid, struct timespec *offs);

We might make *offs a timespec64 or u64 :)

> > that returns the current offset between CLOCK_REALTIME and the
> > requested timebase. It is of course racy, but so is every use
> > of CLOCK_REALTIME. We could also use a reference other than
> > CLOCK_REALTIME that might be more stable, but passing two arbitrary
> > clocks as input would make this much more complex to implement.
> 
> No, it is really easy to implement. Just drop the idea of "atomic". It
> really is not necessary or even possible.

If the two clocks have the same underlying hardware then you get an
'atomic' snapshot of their relationship. That's true for any
combination of CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_BOOTTIME and
CLOCK_TAI. So we can and should expose these 'atomic' snapshots.

There is another reason why we want to support the notion of 'atomic'
snapshots: There exists hardware which gives you 'atomic' samples of
two different hardware clocks and there is more of that coming soon.

If that's not the case then you need two seperate readouts which of
course cannot provide any guarantee, but I agree that we could do
something like what you do in the PTP_SYS_OFFSET ioctl and let user
space analyze the samples. But that should be optional.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
  2014-11-04  1:25     ` Andy Lutomirski
@ 2014-11-04 15:07       ` Pawel Moll
  0 siblings, 0 replies; 24+ messages in thread
From: Pawel Moll @ 2014-11-04 15:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: John Stultz, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso, linux-kernel,
	Linux API

On Tue, 2014-11-04 at 01:25 +0000, Andy Lutomirski wrote:
> >> If you're going to add double-stamped packets, can you also add a
> >> syscall to read multiple clocks at once, atomically?  Or can you
> >> otherwise add a non-perf mechanism to get at this data?
> >
> > I've got some thoughts on what a possible interface that wouldn't be
> > awful could look like, but I'm still hesitant because I don't really
> > know if exposing this sort of data is actually a good idea long term.
> 
> My only real thought here is that, if perf is going to try to do this,
> then presumably it should be reasonably integrated w/ the core timing
> code.  I.e. if perf does this, then presumably the core code should
> know about it and there should be a core interface to it.

I think I understand where you're coming from. Arnd's idea for the API
seems reasonable, although I can't promise implementing a proposal
(don't make me stop you from doing it :-).

As to the perf-specific correlation, I'm assuming limited accuracy.
Others already mentioned that in the absence of hardware support, the
time values are never really "atomic". The best what can be done is to
access them as near to each other in the code as possible and make sure
it happens in a non-preemptible section. In my tests I've achieved, on
average, sub-microsecond accuracy, which was good enough from my
perspective, but it's far from ideal 42ns resolution for my (just an
example) time source clocked at 24MHz.

Paweł


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/3] perf: Use monotonic clock as a source for timestamps
  2014-11-04  7:23   ` Peter Zijlstra
@ 2014-11-04 15:25     ` Pawel Moll
  2014-11-04 15:30       ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Pawel Moll @ 2014-11-04 15:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Tue, 2014-11-04 at 07:23 +0000, Peter Zijlstra wrote:
> On Tue, Nov 04, 2014 at 12:28:36AM +0000, Pawel Moll wrote:
> 
> > +int sysctl_perf_sample_time_clk_id = CLOCK_MONOTONIC;
> 
> const ?

Sure (unless we have to change it as mentioned below)

> >  /*
> >   * perf samples are done in some very critical code paths (NMIs).
> >   * If they take too much CPU time, the system can lock up and not
> > @@ -324,7 +326,7 @@ extern __weak const char *perf_pmu_name(void)
> >  
> >  static inline u64 perf_clock(void)
> >  {
> > -	return local_clock();
> > +	return ktime_get_mono_fast_ns();
> >  }
> 
> Do we maybe want to make it boot-time switchable back to local_clock for
> people with bad systems and or backwards compat issues?

Very good idea, should have came up with it myself :-)

Does __setup("perf_use_local_clock") sound reasonable? Then we have to
decide whether to hide the sysctl "perf_sample_time_clk_id" (my
preferred option, will see how difficult it is) or just provide an
invalid clock_id (eg. -1) in it.

Cheers!

Pawel


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/3] perf: Use monotonic clock as a source for timestamps
  2014-11-04 15:25     ` Pawel Moll
@ 2014-11-04 15:30       ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2014-11-04 15:30 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Tue, Nov 04, 2014 at 03:25:27PM +0000, Pawel Moll wrote:
> Very good idea, should have came up with it myself :-)
> 
> Does __setup("perf_use_local_clock") sound reasonable?

Would not the 'module' already prefix a perf.? I never quite know how
all that works out. But sure, that works.

> Then we have to
> decide whether to hide the sysctl "perf_sample_time_clk_id" (my
> preferred option, will see how difficult it is) or just provide an
> invalid clock_id (eg. -1) in it.

invalid clock id is fine with me, although I suppose the clock people
might have an oh-pinion ;-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
  2014-11-04  9:24 ` Masami Hiramatsu
@ 2014-11-04 15:51   ` Pawel Moll
  2014-11-05  8:06     ` Masami Hiramatsu
  0 siblings, 1 reply; 24+ messages in thread
From: Pawel Moll @ 2014-11-04 15:51 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Tue, 2014-11-04 at 09:24 +0000, Masami Hiramatsu wrote:
> What I'd like to do is the binary version of ftrace-marker, the text
> version is already supported by qemu (see below).
> https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg00505.html
> 
> But since that is just a string data (not structured data), it is hard to
> analyze via perf-script or some other useful filters/triggers in ftrace.
> 
> In my idea, the new event will be defined via a special file in debugfs like
> kprobe-events, like below.
> 
>   # cd $debugfs/tracing
>   # echo "newgrp/newevent signarg:s32 flag:u64" >> marker_events
>   # cat events/newgrp/newevent/format
>   name: newevent
>   ID: 2048
>   format:
>         field:unsigned short common_type;       offset:0;       size:2; signed:0;
>         field:unsigned char common_flags;       offset:2;       size:1; signed:0;
>         field:unsigned char common_preempt_count;       offset:3;       size:1;signed:0;
>         field:int common_pid;   offset:4;       size:4; signed:1;
> 
>         field:s32 signarg;      offset:8;      size:4; signed:1;
>         field:u64 flag; offset:12;      size:8; signed:0;
> 
>   print fmt: "signarg=%d flag=0x%Lx", REC->signarg, REC->flag
> 
> Then, users will write the data (excluded common fields) when the event happens
> via trace_marker which start with '\0'ID(in u32). Kernel just checks the ID and
> its data size, but doesn't parse, filter/trigger it and log it into the kernel buffer.

Very neat, I like it! Certainly useful with scripting. Any gut feeling
regarding the kernel version it will be ready for? 3.19 or later than
that?

> Of course, this has a downside that the user must have a privilege to access to debugfs.
> Thus maybe we need both of prctl() IF for perf and this IF for ftrace.

I don't have any particularly strong feelings about the solution as long
as I'm able to create this "synchronisation point" of mine in the perf
data. In one of this patch's previous incarnations I was also doing a
write() to the perf fd to achieve pretty much the same result.

In my personal use case root access to debugfs isn't a problem (I need
it for other ftrace operations anyway). However Ingo and some other guys
seemed interested in prctl() approach because: 1. it's much simpler to
use even comparing with simple trace_marker's open(path)/write()/close()
and 2. because any process can do it at any time and the results are
quietly discarded if no one is listening. I also remember that when I
proposed sort of "unification" between trace_marker and the uevents,
Ingo straight away "suggested" keeping it separate.

Pawel


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
  2014-11-04  8:01     ` Arnd Bergmann
  2014-11-04  8:27       ` Richard Cochran
@ 2014-11-04 16:04       ` John Stultz
  1 sibling, 0 replies; 24+ messages in thread
From: John Stultz @ 2014-11-04 16:04 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andy Lutomirski, Pawel Moll, Richard Cochran, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, Linux API,
	Pawel Moll

On Tue, Nov 4, 2014 at 12:01 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday 03 November 2014 17:11:53 John Stultz wrote:
>> I've got some thoughts on what a possible interface that wouldn't be
>> awful could look like, but I'm still hesitant because I don't really
>> know if exposing this sort of data is actually a good idea long term.
>
> I was also thinking (while working on an unrelated patch) we could use
> a system call like
>
> int clock_getoffset(clockid_t clkid, struct timespec *offs);
>
> that returns the current offset between CLOCK_REALTIME and the
> requested timebase. It is of course racy, but so is every use
> of CLOCK_REALTIME. We could also use a reference other than
> CLOCK_REALTIME that might be more stable, but passing two arbitrary
> clocks as input would make this much more complex to implement.

Yea, this is too racy for me, at least for it to be useful. You get an
offset, but you don't get any sense of what it was actually valid for.

I think to be at all useful, you'll have to return both a timestamp
for a given clockid, and an offset to the second clockid. That way you
can generate a valid point in time on two clocks (as best as possible,
given possible non-atomic reads of separately backed clockids).

But again, I'm not totally sure exposing this provides that much value
over userspace reading the two clocks itself (in ABA fashion) to sort
this out.

And I also don't see it as particularly related to this perf extension
that Pawel is doing (since we are trying to avoid making the perf
clock a directly accessible clockid).

thanks
-john

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
  2014-11-04 10:49         ` Thomas Gleixner
@ 2014-11-04 16:11           ` Arnd Bergmann
  0 siblings, 0 replies; 24+ messages in thread
From: Arnd Bergmann @ 2014-11-04 16:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Richard Cochran, John Stultz, Andy Lutomirski, Pawel Moll,
	Steven Rostedt, Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern, Tomeu Vizoso,
	linux-kernel, Linux API, Pawel Moll

On Tuesday 04 November 2014 11:49:04 Thomas Gleixner wrote:
> On Tue, 4 Nov 2014, Richard Cochran wrote:
> 
> > On Tue, Nov 04, 2014 at 09:01:31AM +0100, Arnd Bergmann wrote:
> > > On Monday 03 November 2014 17:11:53 John Stultz wrote:
> > > > I've got some thoughts on what a possible interface that wouldn't be
> > > > awful could look like, but I'm still hesitant because I don't really
> > > > know if exposing this sort of data is actually a good idea long term.
> > >  
> > > I was also thinking (while working on an unrelated patch) we could use
> > > a system call like
> > > 
> > > int clock_getoffset(clockid_t clkid, struct timespec *offs);
> 
> We might make *offs a timespec64 or u64 

I don't think we are ready yet to introduce timespec64 in the uapi
headers, this needs some more careful planning. Otherwise I agree
it's bad to introduce syscalls that we already know will become
obsolete soon.

	Arnd

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/3] perf: Userspace event
  2014-11-04  6:33   ` Namhyung Kim
@ 2014-11-04 16:42     ` Pawel Moll
  2014-11-04 18:40       ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Pawel Moll @ 2014-11-04 16:42 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Tue, 2014-11-04 at 06:33 +0000, Namhyung Kim wrote:
> Hi Pawel,
> 
> On Tue,  4 Nov 2014 00:28:37 +0000, Pawel Moll wrote:
> > +	/*
> > +	 * Data in userspace event record is transparent for the kernel
> > +	 *
> > +	 * Userspace perf tool code maintains a list of known types with
> > +	 * reference implementations of parsers for the data field.
> > +	 *
> > +	 * Overall size of the record (including type and size fields)
> > +	 * is always aligned to 8 bytes by adding padding after the data.
> > +	 *
> > +	 * struct {
> > +	 *	struct perf_event_header	header;
> > +	 *	u32				type;
> > +	 *	u32				size;
> 
> The struct perf_event_header also has 'size' field and it has the entire
> length of the record so it's redundant. 

Well, is it? Correct me if I'm wrong, but as far as I remember the
record size must be always aligned to 8 bytes. Thus you can't reliably
derive the data size from it - if I my data is 3 bytes long, I have to
add 5 bytes of padding thus making the header.size = 24 (I'm ignoring
sample_id here), right? So now, decoding the record, all I can do is:
header.size - sizeof(header) - sizeof(type) - sizeof(size) = 24 - 8 - 8
= 8. So, basing on the header.size the data is 8 bytes long. But only 3
first bytes are valid...

To summarize, there are three options:

1. I'm wrong and the record doesn't have to be padded to make it 8 bytes
aligned. Then I can drop the additional size field.

2. I could impose a limitation on the prctl API that the data size must
be 8 bytes aligned. Bad idea in my opinion, I'd rather not.

3. The additional size (for the data part) field stays. Notice that
PERF_SAMPLE_RAW has it as well :-)

>  Also there's 'misc' field in the
> perf_event_header and I guess it can be used as 'type' info as it's
> mostly for cpumode and we are in user mode by definition.

Hm. First of all, I don't really like the idea of "overloading" the misc
meaning. It's a set of flags and I'd rather see it staying like this.

Secondly, I'm not sure that it can be reused - we are in user mode,
true, but it can be either PERF_RECORD_MISC_USER or
PERF_RECORD_MISC_GUEST_USER.

Thirdly, misc is "only" 16 bits wide, and someone even asked for the
type to be 64 bit long! (I suspect he wanted to use it in some special,
hacky way though :-) 32 bit length seems like a reasonable choice,
though.

Do you feel that the "unnecessary" type field is a big problem?

Thanks for your time!

Pawel


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/3] perf: Userspace event
  2014-11-04 16:42     ` Pawel Moll
@ 2014-11-04 18:40       ` Peter Zijlstra
  2014-11-05  6:36         ` Namhyung Kim
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2014-11-04 18:40 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Namhyung Kim, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Tue, Nov 04, 2014 at 04:42:11PM +0000, Pawel Moll wrote:
> 
> 1. I'm wrong and the record doesn't have to be padded to make it 8 bytes
> aligned. Then I can drop the additional size field.

No, you're right, we're supposed to stay 8 byte aligned.

> 2. I could impose a limitation on the prctl API that the data size must
> be 8 bytes aligned. Bad idea in my opinion, I'd rather not.

Agreed.

> 3. The additional size (for the data part) field stays. Notice that
> PERF_SAMPLE_RAW has it as well :-)

Right, with binary data there is no other day. With \0 terminated
strings there won't be a problem, but I think we decided we wanted to
allow any binary blow.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/3] perf: Userspace event
  2014-11-04 18:40       ` Peter Zijlstra
@ 2014-11-05  6:36         ` Namhyung Kim
  0 siblings, 0 replies; 24+ messages in thread
From: Namhyung Kim @ 2014-11-05  6:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pawel Moll, Namhyung Kim, Richard Cochran, Steven Rostedt,
	Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	John Stultz, Masami Hiramatsu, Christopher Covington,
	David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org



Hi Peter and Pawel,

On Tue, 4 Nov 2014 19:40:31 +0100, Peter Zijlstra wrote:
> On Tue, Nov 04, 2014 at 04:42:11PM +0000, Pawel Moll wrote:
>> 
>> 1. I'm wrong and the record doesn't have to be padded to make it 8 bytes
>> aligned. Then I can drop the additional size field.
>
> No, you're right, we're supposed to stay 8 byte aligned.
>
>> 2. I could impose a limitation on the prctl API that the data size must
>> be 8 bytes aligned. Bad idea in my opinion, I'd rather not.
>
> Agreed.
>
>> 3. The additional size (for the data part) field stays. Notice that
>> PERF_SAMPLE_RAW has it as well :-)
>
> Right, with binary data there is no other day. With \0 terminated
> strings there won't be a problem, but I think we decided we wanted to
> allow any binary blow.

Ah, I missed that.  Thank you guys for explanation.

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
  2014-11-04 15:51   ` Pawel Moll
@ 2014-11-05  8:06     ` Masami Hiramatsu
  0 siblings, 0 replies; 24+ messages in thread
From: Masami Hiramatsu @ 2014-11-05  8:06 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api,
	yrl.pp-manager.tt

(2014/11/05 0:51), Pawel Moll wrote:
> On Tue, 2014-11-04 at 09:24 +0000, Masami Hiramatsu wrote:
>> What I'd like to do is the binary version of ftrace-marker, the text
>> version is already supported by qemu (see below).
>> https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg00505.html
>>
>> But since that is just a string data (not structured data), it is hard to
>> analyze via perf-script or some other useful filters/triggers in ftrace.
>>
>> In my idea, the new event will be defined via a special file in debugfs like
>> kprobe-events, like below.
>>
>>   # cd $debugfs/tracing
>>   # echo "newgrp/newevent signarg:s32 flag:u64" >> marker_events
>>   # cat events/newgrp/newevent/format
>>   name: newevent
>>   ID: 2048
>>   format:
>>         field:unsigned short common_type;       offset:0;       size:2; signed:0;
>>         field:unsigned char common_flags;       offset:2;       size:1; signed:0;
>>         field:unsigned char common_preempt_count;       offset:3;       size:1;signed:0;
>>         field:int common_pid;   offset:4;       size:4; signed:1;
>>
>>         field:s32 signarg;      offset:8;      size:4; signed:1;
>>         field:u64 flag; offset:12;      size:8; signed:0;
>>
>>   print fmt: "signarg=%d flag=0x%Lx", REC->signarg, REC->flag
>>
>> Then, users will write the data (excluded common fields) when the event happens
>> via trace_marker which start with '\0'ID(in u32). Kernel just checks the ID and
>> its data size, but doesn't parse, filter/trigger it and log it into the kernel buffer.
> 
> Very neat, I like it! Certainly useful with scripting. Any gut feeling
> regarding the kernel version it will be ready for? 3.19 or later than
> that?

Thanks, and not yet implemented, I'd like to ask people about the format etc.
before that :)

>> Of course, this has a downside that the user must have a privilege to access to debugfs.
>> Thus maybe we need both of prctl() IF for perf and this IF for ftrace.
> 
> I don't have any particularly strong feelings about the solution as long
> as I'm able to create this "synchronisation point" of mine in the perf
> data. In one of this patch's previous incarnations I was also doing a
> write() to the perf fd to achieve pretty much the same result.
> 
> In my personal use case root access to debugfs isn't a problem (I need
> it for other ftrace operations anyway). However Ingo and some other guys
> seemed interested in prctl() approach because: 1. it's much simpler to
> use even comparing with simple trace_marker's open(path)/write()/close()
> and 2. because any process can do it at any time and the results are
> quietly discarded if no one is listening. I also remember that when I
> proposed sort of "unification" between trace_marker and the uevents,
> Ingo straight away "suggested" keeping it separate.

Agreed, I think we can keep trace_marker opened (so application will just
need to write() the events), but for the second reason, prctl will be
better for per-application usage. Actually, ftrace is "system-wide" oriented,
but the perf is not.

Thank you,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2014-11-05  8:07 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-04  0:28 [PATCH v3 0/3] perf: User/kernel time correlation and event generation Pawel Moll
2014-11-04  0:28 ` [PATCH v3 1/3] perf: Use monotonic clock as a source for timestamps Pawel Moll
2014-11-04  7:23   ` Peter Zijlstra
2014-11-04 15:25     ` Pawel Moll
2014-11-04 15:30       ` Peter Zijlstra
2014-11-04  0:28 ` [PATCH v3 2/3] perf: Userspace event Pawel Moll
2014-11-04  6:33   ` Namhyung Kim
2014-11-04 16:42     ` Pawel Moll
2014-11-04 18:40       ` Peter Zijlstra
2014-11-05  6:36         ` Namhyung Kim
2014-11-04  0:28 ` [PATCH v3 3/3] perf: Sample additional clock value Pawel Moll
2014-11-04  0:58 ` [PATCH v3 0/3] perf: User/kernel time correlation and event generation Andy Lutomirski
2014-11-04  1:11   ` John Stultz
2014-11-04  1:25     ` Andy Lutomirski
2014-11-04 15:07       ` Pawel Moll
2014-11-04  8:01     ` Arnd Bergmann
2014-11-04  8:27       ` Richard Cochran
2014-11-04 10:49         ` Thomas Gleixner
2014-11-04 16:11           ` Arnd Bergmann
2014-11-04 16:04       ` John Stultz
2014-11-04  8:24     ` Richard Cochran
2014-11-04  9:24 ` Masami Hiramatsu
2014-11-04 15:51   ` Pawel Moll
2014-11-05  8:06     ` Masami Hiramatsu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).