All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/3] perf: User/kernel time correlation and event generation
@ 2014-11-06 16:51 Pawel Moll
  2014-11-06 16:51 ` [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps Pawel Moll
                   ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Pawel Moll @ 2014-11-06 16:51 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel, linux-api, Pawel Moll

This is a second version of the post-LPC series. The changes are limited
to the first patch, adding a backward-compatibility kernel parameter.

The v3 cover letter:

I've organised a session on the subject during the tracing minisummit
at LPC 2014 in Dusseldorf. Notes taken from the discussion taken by
Steven Rostedt (thanks Steve!)

http://www.linuxplumbersconf.org/2014/wp-content/uploads/2014/10/LPC2014_Tracing.txt

The following three patches address three main topics. They are pretty
much orthogonal and (subject to small and obvious modifications) could
be applied independently from each other.

An executive summary of both discussion and the patches:

1. User/kernel perf timestamps correlation

Thomas suggested that, instead of jumping through loops of correlation,
perf should simply generate monotonic clock timestamps, instead of
using sched clock. Peter and I pointed out that Ingo didn't like this
idea as monotonic can be slow, but apparently the cases when it is are
irrelevant. Thomas offered to fly to Budapest to physically convince
Ingo - I hope it won't be necessary and he will be able to achieve
this here, on mailing lists :-)

Setting aside potential performance problems, it would be a really
great solution, unifying all trace systems (perf, ftrace and LTTng)
in this respect. I'm more than happy to work on potential improvements
in the are of monotonic clock if it was to help the cause.

If it is a definite no-go, we still have the third patch, allowing post-
capture correlation based on synchronisation events.

2. User event generation

Everyone present agreed that it would be a very-nice-to-have feature.
There was some discussion about implementation details, so I welcome
feedback and comments regarding my take on the matter.

3. Correlation with external timestamps

This is a new issue, which surfaced recently while I was working on
hardware trace infrastructure. It can timestamp trace packets, but is
using yet another, completely different time source to do this.

Thomas suggested solution which gets down to my original proposal for
sched/monotonic clock correlation - an additional sample type so events
can be "double stamped" using different clock sources providing
synchronisation points for later time approximation. I've just extended
the implementation with configuration value to select the clock source.
If the first patch (making perf timestamps monotonic) gets accepted,
there will be no immediate need for this one, but I'd like to gain some
feedback anyway.


That's all for this series. Previous versions:

- RFC: http://www.spinics.net/lists/kernel/msg1824419.html
- v1: http://thread.gmane.org/gmane.linux.kernel/1790231
- v2: http://thread.gmane.org/gmane.linux.kernel/1793272
- v3: http://thread.gmane.org/gmane.linux.kernel.api/5681


Pawel Moll (3):
  perf: Use monotonic clock as a source for timestamps
  perf: Userspace event
  perf: Sample additional clock value

 Documentation/kernel-parameters.txt |   9 +++
 include/linux/perf_event.h          |   6 ++
 include/uapi/linux/perf_event.h     |  37 +++++++++-
 include/uapi/linux/prctl.h          |  10 +++
 kernel/events/core.c                | 138 ++++++++++++++++++++++++++++++++++++
 kernel/sys.c                        |   5 ++
 6 files changed, 203 insertions(+), 2 deletions(-)

-- 
2.1.0


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
  2014-11-06 16:51 [PATCH v4 0/3] perf: User/kernel time correlation and event generation Pawel Moll
@ 2014-11-06 16:51 ` Pawel Moll
  2014-11-27 15:05     ` Pawel Moll
                     ` (3 more replies)
  2014-11-06 16:51   ` Pawel Moll
  2014-11-06 16:51 ` [PATCH v4 3/3] perf: Sample additional clock value Pawel Moll
  2 siblings, 4 replies; 58+ messages in thread
From: Pawel Moll @ 2014-11-06 16:51 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel, linux-api, Pawel Moll

Until now, perf framework never defined the meaning of the timestamps
captured as PERF_SAMPLE_TIME sample type. The values were obtaining
from local (sched) clock, which is unavailable in userspace. This made
it impossible to correlate perf data with any other events. Other
tracing solutions have the source configurable (ftrace) or just share
a common time domain between kernel and userspace (LTTng).

Follow the trend by using monotonic clock, which is readily available
as POSIX CLOCK_MONOTONIC.

Also add a sysctl "perf_sample_time_clk_id" attribute which can be used
by the user to obtain the clk_id to be used with POSIX clock API (eg.
clock_gettime()) to obtain a time value comparable with perf samples.

Old behaviour can be restored by using "perf_use_local_clock" kernel
parameter.

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---
Ingo, I remember your comments about this approach in the past, but
during discussions at LPC Thomas was convinced that it's the right
thing to do - see cover letter for the series...

Changes since v3:

- Added "perf_use_lock_clock" parameter...

- ... and creating the sysctl value only when it's not defined (turned
  out that negative clk_ids are not invalid - they are used to describe
  dynamic clocks)

- Had to keep sysctl_perf_sample_time_clk_id non-const, because struct
  ctl_table.data is non-const

 Documentation/kernel-parameters.txt |  9 +++++++++
 kernel/events/core.c                | 37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4c81a86..8ead8d8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -91,6 +91,7 @@ parameter is applicable:
 	NUMA	NUMA support is enabled.
 	NFS	Appropriate NFS support is enabled.
 	OSS	OSS sound support is enabled.
+	PERF	Performance events and counters support is enabled.
 	PV_OPS	A paravirtualized kernel is enabled.
 	PARIDE	The ParIDE (parallel port IDE) subsystem is enabled.
 	PARISC	The PA-RISC architecture is enabled.
@@ -2763,6 +2764,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			allocator.  This parameter is primarily	for debugging
 			and performance comparison.
 
+	perf_use_local_clock
+			[PERF]
+			Use local_clock() as a source for perf timestamps
+			generation. This was be the default behaviour and
+			this parameter can be used to maintain backward
+			compatibility or on older hardware with expensive
+			monotonic clock source.
+
 	pf.		[PARIDE]
 			See Documentation/blockdev/paride.txt.
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2b02c9f..5d0aa03 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -42,6 +42,7 @@
 #include <linux/module.h>
 #include <linux/mman.h>
 #include <linux/compat.h>
+#include <linux/sysctl.h>
 
 #include "internal.h"
 
@@ -322,8 +323,41 @@ extern __weak const char *perf_pmu_name(void)
 	return "pmu";
 }
 
+static bool perf_use_local_clock;
+static int __init perf_use_local_clock_setup(char *__unused)
+{
+	perf_use_local_clock = true;
+	return 1;
+}
+__setup("perf_use_local_clock", perf_use_local_clock_setup);
+
+static int sysctl_perf_sample_time_clk_id = CLOCK_MONOTONIC;
+
+static struct ctl_table perf_sample_time_kern_table[] = {
+	{
+		.procname       = "perf_sample_time_clk_id",
+		.data           = &sysctl_perf_sample_time_clk_id,
+		.maxlen         = sizeof(int),
+		.mode           = 0444,
+		.proc_handler   = proc_dointvec,
+	},
+	{}
+};
+
+static struct ctl_table perf_sample_time_root_table[] = {
+	{
+		.procname	= "kernel",
+		.mode		= 0555,
+		.child		= perf_sample_time_kern_table,
+	},
+	{}
+};
+
 static inline u64 perf_clock(void)
 {
+	if (likely(!perf_use_local_clock))
+		return ktime_get_mono_fast_ns();
+
 	return local_clock();
 }
 
@@ -8230,6 +8264,9 @@ void __init perf_event_init(void)
 	 */
 	BUILD_BUG_ON((offsetof(struct perf_event_mmap_page, data_head))
 		     != 1024);
+
+	if (!perf_use_local_clock)
+		register_sysctl_table(perf_sample_time_root_table);
 }
 
 static int __init perf_event_sysfs_init(void)
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 2/3] perf: Userspace event
@ 2014-11-06 16:51   ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2014-11-06 16:51 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel, linux-api, Pawel Moll

This patch adds a PR_TASK_PERF_UEVENT prctl call which can be used by
any process to inject custom data into perf data stream as a new
PERF_RECORD_UEVENT record, if such process is being observed or if it
is running on a CPU being observed by the perf framework.

The prctl call takes the following arguments:

        prctl(PR_TASK_PERF_UEVENT, type, size, data, flags);

- type: a number meaning to describe content of the following data.
  Kernel does not pay attention to it and merely passes it further in
  the perf data, therefore its use must be agreed between the events
  producer (the process being observed) and the consumer (performance
  analysis tool). The perf userspace tool will contain a repository of
  "well known" types and reference implementation of their decoders.
- size: Length in bytes of the data.
- data: Pointer to the data.
- flags: Reserved for future use. Always pass zero.

Perf context that are supposed to receive events generated with the
prctl above must be opened with perf_event_attr.uevent set to 1. The
PERF_RECORD_UEVENT records consist of a standard perf event header,
32-bit type value, 32-bit data size and the data itself, followed by
padding to align the overall record size to 8 bytes and optional,
standard sample_id field.

Example use cases:

- "perf_printf" like mechanism to add logging messages to perf data;
  in the simplest case it can be just

        prctl(PR_TASK_PERF_UEVENT, 0, 8, "Message", 0);

- synchronisation of performance data generated in user space with the
  perf stream coming from the kernel. For example, the marker can be
  inserted by a JIT engine after it generated portion of the code, but
  before the code is executed for the first time, allowing the
  post-processor to pick the correct debugging information.

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---

Changes since v3:

- none

 include/linux/perf_event.h      |  4 +++
 include/uapi/linux/perf_event.h | 23 ++++++++++++-
 include/uapi/linux/prctl.h      | 10 ++++++
 kernel/events/core.c            | 71 +++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c                    |  5 +++
 5 files changed, 112 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 893a0d0..680767d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -721,6 +721,8 @@ extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks
 extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
 extern void perf_event_fork(struct task_struct *tsk);
+extern int perf_uevent(struct task_struct *tsk, u32 type, u32 size,
+		       const char __user *data);
 
 /* Callchains */
 DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
@@ -829,6 +831,8 @@ static inline void perf_event_mmap(struct vm_area_struct *vma)		{ }
 static inline void perf_event_exec(void)				{ }
 static inline void perf_event_comm(struct task_struct *tsk, bool exec)	{ }
 static inline void perf_event_fork(struct task_struct *tsk)		{ }
+static inline int perf_uevent(struct task_struct *tsk, u32 type, u32 size,
+		              const char __user *data)			{ return -1; };
 static inline void perf_event_init(void)				{ }
 static inline int  perf_swevent_get_recursion_context(void)		{ return -1; }
 static inline void perf_swevent_put_recursion_context(int rctx)		{ }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9d84540..9a64eb1 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -303,7 +303,8 @@ struct perf_event_attr {
 				exclude_callchain_user   : 1, /* exclude user callchains */
 				mmap2          :  1, /* include mmap with inode data     */
 				comm_exec      :  1, /* flag comm events that are due to an exec */
-				__reserved_1   : 39;
+				uevents        :  1, /* allow uevents into the buffer */
+				__reserved_1   : 38;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -712,6 +713,26 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_MMAP2			= 10,
 
+	/*
+	 * Data in userspace event record is transparent for the kernel
+	 *
+	 * Userspace perf tool code maintains a list of known types with
+	 * reference implementations of parsers for the data field.
+	 *
+	 * Overall size of the record (including type and size fields)
+	 * is always aligned to 8 bytes by adding padding after the data.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u32				type;
+	 *	u32				size;
+	 *	char				data[size];
+	 *	char				__padding[-size & 7];
+	 * 	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_UEVENT			= 11,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 513df75..2a6852f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -179,4 +179,14 @@ struct prctl_mm_map {
 #define PR_SET_THP_DISABLE	41
 #define PR_GET_THP_DISABLE	42
 
+/*
+ * Perf userspace event generation
+ *
+ * second argument: event type
+ * third argument:  data size
+ * fourth argument: pointer to data
+ * fifth argument:  flags (currently unused, pass 0)
+ */
+#define PR_TASK_PERF_UEVENT	43
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5d0aa03..9a2d082 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5597,6 +5597,77 @@ static void perf_log_throttle(struct perf_event *event, int enable)
 }
 
 /*
+ * Userspace-generated event
+ */
+
+struct perf_uevent {
+	struct perf_event_header	header;
+	u32				type;
+	u32				size;
+	u8				data[0];
+};
+
+static void perf_uevent_output(struct perf_event *event, void *data)
+{
+	struct perf_uevent *uevent = data;
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	int size = uevent->header.size;
+
+	if (!event->attr.uevents)
+		return;
+
+	perf_event_header__init_id(&uevent->header, &sample, event);
+
+	if (perf_output_begin(&handle, event, uevent->header.size) != 0)
+		goto out;
+	perf_output_put(&handle, uevent->header);
+	perf_output_put(&handle, uevent->type);
+	perf_output_put(&handle, uevent->size);
+	__output_copy(&handle, uevent->data, uevent->size);
+
+	/* Padding to align overall data size to 8 bytes */
+	perf_output_skip(&handle, -uevent->size & (sizeof(u64) - 1));
+
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+out:
+	uevent->header.size = size;
+}
+
+int perf_uevent(struct task_struct *tsk, u32 type, u32 size,
+		const char __user *data)
+{
+	struct perf_uevent *uevent;
+
+	/* Need some reasonable limit */
+	if (size > PAGE_SIZE)
+		return -E2BIG;
+
+	uevent = kmalloc(sizeof(*uevent) + size, GFP_KERNEL);
+	if (!uevent)
+		return -ENOMEM;
+
+	uevent->header.type = PERF_RECORD_UEVENT;
+	uevent->header.size = sizeof(*uevent) + ALIGN(size, sizeof(u64));
+
+	uevent->type = type;
+	uevent->size = size;
+	if (copy_from_user(uevent->data, data, size)) {
+		kfree(uevent);
+		return -EFAULT;
+	}
+
+	perf_event_aux(perf_uevent_output, uevent, NULL);
+
+	kfree(uevent);
+
+	return 0;
+}
+
+
+/*
  * Generic event overflow handling, sampling.
  */
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 1eaa2f0..1c83677 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2121,6 +2121,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_TASK_PERF_EVENTS_ENABLE:
 		error = perf_event_task_enable();
 		break;
+	case PR_TASK_PERF_UEVENT:
+		if (arg5 != 0)
+			return -EINVAL;
+		error = perf_uevent(me, arg2, arg3, (char __user *)arg4);
+		break;
 	case PR_GET_TIMERSLACK:
 		error = current->timer_slack_ns;
 		break;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 2/3] perf: Userspace event
@ 2014-11-06 16:51   ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2014-11-06 16:51 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Pawel Moll

This patch adds a PR_TASK_PERF_UEVENT prctl call which can be used by
any process to inject custom data into perf data stream as a new
PERF_RECORD_UEVENT record, if such process is being observed or if it
is running on a CPU being observed by the perf framework.

The prctl call takes the following arguments:

        prctl(PR_TASK_PERF_UEVENT, type, size, data, flags);

- type: a number meaning to describe content of the following data.
  Kernel does not pay attention to it and merely passes it further in
  the perf data, therefore its use must be agreed between the events
  producer (the process being observed) and the consumer (performance
  analysis tool). The perf userspace tool will contain a repository of
  "well known" types and reference implementation of their decoders.
- size: Length in bytes of the data.
- data: Pointer to the data.
- flags: Reserved for future use. Always pass zero.

Perf context that are supposed to receive events generated with the
prctl above must be opened with perf_event_attr.uevent set to 1. The
PERF_RECORD_UEVENT records consist of a standard perf event header,
32-bit type value, 32-bit data size and the data itself, followed by
padding to align the overall record size to 8 bytes and optional,
standard sample_id field.

Example use cases:

- "perf_printf" like mechanism to add logging messages to perf data;
  in the simplest case it can be just

        prctl(PR_TASK_PERF_UEVENT, 0, 8, "Message", 0);

- synchronisation of performance data generated in user space with the
  perf stream coming from the kernel. For example, the marker can be
  inserted by a JIT engine after it generated portion of the code, but
  before the code is executed for the first time, allowing the
  post-processor to pick the correct debugging information.

Signed-off-by: Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org>
---

Changes since v3:

- none

 include/linux/perf_event.h      |  4 +++
 include/uapi/linux/perf_event.h | 23 ++++++++++++-
 include/uapi/linux/prctl.h      | 10 ++++++
 kernel/events/core.c            | 71 +++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c                    |  5 +++
 5 files changed, 112 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 893a0d0..680767d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -721,6 +721,8 @@ extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks
 extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
 extern void perf_event_fork(struct task_struct *tsk);
+extern int perf_uevent(struct task_struct *tsk, u32 type, u32 size,
+		       const char __user *data);
 
 /* Callchains */
 DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
@@ -829,6 +831,8 @@ static inline void perf_event_mmap(struct vm_area_struct *vma)		{ }
 static inline void perf_event_exec(void)				{ }
 static inline void perf_event_comm(struct task_struct *tsk, bool exec)	{ }
 static inline void perf_event_fork(struct task_struct *tsk)		{ }
+static inline int perf_uevent(struct task_struct *tsk, u32 type, u32 size,
+		              const char __user *data)			{ return -1; };
 static inline void perf_event_init(void)				{ }
 static inline int  perf_swevent_get_recursion_context(void)		{ return -1; }
 static inline void perf_swevent_put_recursion_context(int rctx)		{ }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9d84540..9a64eb1 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -303,7 +303,8 @@ struct perf_event_attr {
 				exclude_callchain_user   : 1, /* exclude user callchains */
 				mmap2          :  1, /* include mmap with inode data     */
 				comm_exec      :  1, /* flag comm events that are due to an exec */
-				__reserved_1   : 39;
+				uevents        :  1, /* allow uevents into the buffer */
+				__reserved_1   : 38;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -712,6 +713,26 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_MMAP2			= 10,
 
+	/*
+	 * Data in userspace event record is transparent for the kernel
+	 *
+	 * Userspace perf tool code maintains a list of known types with
+	 * reference implementations of parsers for the data field.
+	 *
+	 * Overall size of the record (including type and size fields)
+	 * is always aligned to 8 bytes by adding padding after the data.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u32				type;
+	 *	u32				size;
+	 *	char				data[size];
+	 *	char				__padding[-size & 7];
+	 * 	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_UEVENT			= 11,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 513df75..2a6852f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -179,4 +179,14 @@ struct prctl_mm_map {
 #define PR_SET_THP_DISABLE	41
 #define PR_GET_THP_DISABLE	42
 
+/*
+ * Perf userspace event generation
+ *
+ * second argument: event type
+ * third argument:  data size
+ * fourth argument: pointer to data
+ * fifth argument:  flags (currently unused, pass 0)
+ */
+#define PR_TASK_PERF_UEVENT	43
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5d0aa03..9a2d082 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5597,6 +5597,77 @@ static void perf_log_throttle(struct perf_event *event, int enable)
 }
 
 /*
+ * Userspace-generated event
+ */
+
+struct perf_uevent {
+	struct perf_event_header	header;
+	u32				type;
+	u32				size;
+	u8				data[0];
+};
+
+static void perf_uevent_output(struct perf_event *event, void *data)
+{
+	struct perf_uevent *uevent = data;
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	int size = uevent->header.size;
+
+	if (!event->attr.uevents)
+		return;
+
+	perf_event_header__init_id(&uevent->header, &sample, event);
+
+	if (perf_output_begin(&handle, event, uevent->header.size) != 0)
+		goto out;
+	perf_output_put(&handle, uevent->header);
+	perf_output_put(&handle, uevent->type);
+	perf_output_put(&handle, uevent->size);
+	__output_copy(&handle, uevent->data, uevent->size);
+
+	/* Padding to align overall data size to 8 bytes */
+	perf_output_skip(&handle, -uevent->size & (sizeof(u64) - 1));
+
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+out:
+	uevent->header.size = size;
+}
+
+int perf_uevent(struct task_struct *tsk, u32 type, u32 size,
+		const char __user *data)
+{
+	struct perf_uevent *uevent;
+
+	/* Need some reasonable limit */
+	if (size > PAGE_SIZE)
+		return -E2BIG;
+
+	uevent = kmalloc(sizeof(*uevent) + size, GFP_KERNEL);
+	if (!uevent)
+		return -ENOMEM;
+
+	uevent->header.type = PERF_RECORD_UEVENT;
+	uevent->header.size = sizeof(*uevent) + ALIGN(size, sizeof(u64));
+
+	uevent->type = type;
+	uevent->size = size;
+	if (copy_from_user(uevent->data, data, size)) {
+		kfree(uevent);
+		return -EFAULT;
+	}
+
+	perf_event_aux(perf_uevent_output, uevent, NULL);
+
+	kfree(uevent);
+
+	return 0;
+}
+
+
+/*
  * Generic event overflow handling, sampling.
  */
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 1eaa2f0..1c83677 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2121,6 +2121,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_TASK_PERF_EVENTS_ENABLE:
 		error = perf_event_task_enable();
 		break;
+	case PR_TASK_PERF_UEVENT:
+		if (arg5 != 0)
+			return -EINVAL;
+		error = perf_uevent(me, arg2, arg3, (char __user *)arg4);
+		break;
 	case PR_GET_TIMERSLACK:
 		error = current->timer_slack_ns;
 		break;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 3/3] perf: Sample additional clock value
  2014-11-06 16:51 [PATCH v4 0/3] perf: User/kernel time correlation and event generation Pawel Moll
  2014-11-06 16:51 ` [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps Pawel Moll
  2014-11-06 16:51   ` Pawel Moll
@ 2014-11-06 16:51 ` Pawel Moll
  2015-01-05 13:45     ` Peter Zijlstra
  2 siblings, 1 reply; 58+ messages in thread
From: Pawel Moll @ 2014-11-06 16:51 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel, linux-api, Pawel Moll

This patch adds an option to sample value of an additional clock with
any perf event, with the the aim of allowing time correlation between
data coming from perf and other sources like hardware trace which is
timestamped with an external clock.

The idea is to generate periodic perf record containing timestamps from
two different sources, sampled as close to each other as possible. This
allows performing simple linear approximation:

                      other event
       -----O--------------+-------------O------> t_other
            :              |             :
            :              V             :
       -----O----------------------------O------> t_perf
        perf event

User can request such samples for any standard perf event, with the most
obvious examples of cpu-clock (hrtimer) at selected frequency or other
periodic events like sched:sched_switch.

In order to do this, PERF_SAMPLE_CLOCK has to be set in struct
perf_event_attr.sample_type and a type of the clock to be sampled
selected by setting perf_event_attr.clock to a value corresponding to
a POSIX clock clk_id (see "man 2 clock_gettime")

Currently three clocks are implemented: CLOCK_REALITME = 0,
CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 2. The clock field is
5 bits wide to allow for future extension to custom, non-POSIX clock
sources(MAX_CLOCK for those is 16, see include/uapi/linux/time.h) like
ARM CoreSight (hardware trace) timestamp generator.

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---

Changes since v3:

- none

 include/linux/perf_event.h      |  2 ++
 include/uapi/linux/perf_event.h | 16 ++++++++++++++--
 kernel/events/core.c            | 30 ++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 680767d..b690a0d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -607,6 +607,8 @@ struct perf_sample_data {
 	 * Transaction flags for abort events:
 	 */
 	u64				txn;
+	/* Clock value (additional timestamp for time correlation) */
+	u64				clock;
 };
 
 /* default value for data source */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9a64eb1..53a7a72 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -137,8 +137,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_DATA_SRC			= 1U << 15,
 	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
+	PERF_SAMPLE_CLOCK			= 1U << 18,
 
-	PERF_SAMPLE_MAX = 1U << 18,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
 };
 
 /*
@@ -304,7 +305,16 @@ struct perf_event_attr {
 				mmap2          :  1, /* include mmap with inode data     */
 				comm_exec      :  1, /* flag comm events that are due to an exec */
 				uevents        :  1, /* allow uevents into the buffer */
-				__reserved_1   : 38;
+
+				/*
+				 * clock: one of the POSIX clock IDs:
+				 *
+				 * 0 - CLOCK_REALTIME
+				 * 1 - CLOCK_MONOTONIC
+				 * 4 - CLOCK_MONOTONIC_RAW
+				 */
+				clock          :  5, /* clock type */
+				__reserved_1   : 33;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -544,6 +554,7 @@ enum perf_event_type {
 	 * 	{ u64			id;       } && PERF_SAMPLE_ID
 	 * 	{ u64			stream_id;} && PERF_SAMPLE_STREAM_ID
 	 * 	{ u32			cpu, res; } && PERF_SAMPLE_CPU
+	 *	{ u64			clock;    } && PERF_SAMPLE_CLOCK
 	 *	{ u64			id;	  } && PERF_SAMPLE_IDENTIFIER
 	 * } && perf_event_attr::sample_id_all
 	 *
@@ -687,6 +698,7 @@ enum perf_event_type {
 	 *	{ u64			weight;   } && PERF_SAMPLE_WEIGHT
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
+	 *	{ u64			clock;    } && PERF_SAMPLE_CLOCK
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9a2d082..816baa6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1264,6 +1264,9 @@ static void perf_event__header_size(struct perf_event *event)
 	if (sample_type & PERF_SAMPLE_TRANSACTION)
 		size += sizeof(data->txn);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		size += sizeof(data->clock);
+
 	event->header_size = size;
 }
 
@@ -1291,6 +1294,9 @@ static void perf_event__id_header_size(struct perf_event *event)
 	if (sample_type & PERF_SAMPLE_CPU)
 		size += sizeof(data->cpu_entry);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		size += sizeof(data->clock);
+
 	event->id_header_size = size;
 }
 
@@ -4631,6 +4637,24 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
 		data->cpu_entry.cpu	 = raw_smp_processor_id();
 		data->cpu_entry.reserved = 0;
 	}
+
+	if (sample_type & PERF_SAMPLE_CLOCK) {
+		switch (event->attr.clock) {
+		case CLOCK_REALTIME:
+			data->clock = ktime_get_real_ns();
+			break;
+		case CLOCK_MONOTONIC:
+			data->clock = ktime_get_mono_fast_ns();
+			break;
+		case CLOCK_MONOTONIC_RAW:
+			data->clock = ktime_get_raw_ns();
+			break;
+		default:
+			data->clock = 0;
+			break;
+		}
+	}
+
 }
 
 void perf_event_header__init_id(struct perf_event_header *header,
@@ -4661,6 +4685,9 @@ static void __perf_event__output_id_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_CPU)
 		perf_output_put(handle, data->cpu_entry);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		perf_output_put(handle, data->clock);
+
 	if (sample_type & PERF_SAMPLE_IDENTIFIER)
 		perf_output_put(handle, data->id);
 }
@@ -4889,6 +4916,9 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_TRANSACTION)
 		perf_output_put(handle, data->txn);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		perf_output_put(handle, data->clock);
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
@ 2014-11-27 15:05     ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2014-11-27 15:05 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Thu, 2014-11-06 at 16:51 +0000, Pawel Moll wrote:
> Until now, perf framework never defined the meaning of the timestamps
> captured as PERF_SAMPLE_TIME sample type. The values were obtaining
> from local (sched) clock, which is unavailable in userspace. This made
> it impossible to correlate perf data with any other events. Other
> tracing solutions have the source configurable (ftrace) or just share
> a common time domain between kernel and userspace (LTTng).
> 
> Follow the trend by using monotonic clock, which is readily available
> as POSIX CLOCK_MONOTONIC.
> 
> Also add a sysctl "perf_sample_time_clk_id" attribute which can be used
> by the user to obtain the clk_id to be used with POSIX clock API (eg.
> clock_gettime()) to obtain a time value comparable with perf samples.
> 
> Old behaviour can be restored by using "perf_use_local_clock" kernel
> parameter.
> 
> Signed-off-by: Pawel Moll <pawel.moll@arm.com>

It's been 3 weeks without any negative feedback (no feedback at all, but
I take the optimistic view :-)...

How about queuing at least this patch alone for the incoming merge
window? Or at least getting it into -next, with the view at 3.20?

Pawel


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
@ 2014-11-27 15:05     ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2014-11-27 15:05 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Thu, 2014-11-06 at 16:51 +0000, Pawel Moll wrote:
> Until now, perf framework never defined the meaning of the timestamps
> captured as PERF_SAMPLE_TIME sample type. The values were obtaining
> from local (sched) clock, which is unavailable in userspace. This made
> it impossible to correlate perf data with any other events. Other
> tracing solutions have the source configurable (ftrace) or just share
> a common time domain between kernel and userspace (LTTng).
> 
> Follow the trend by using monotonic clock, which is readily available
> as POSIX CLOCK_MONOTONIC.
> 
> Also add a sysctl "perf_sample_time_clk_id" attribute which can be used
> by the user to obtain the clk_id to be used with POSIX clock API (eg.
> clock_gettime()) to obtain a time value comparable with perf samples.
> 
> Old behaviour can be restored by using "perf_use_local_clock" kernel
> parameter.
> 
> Signed-off-by: Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org>

It's been 3 weeks without any negative feedback (no feedback at all, but
I take the optimistic view :-)...

How about queuing at least this patch alone for the incoming merge
window? Or at least getting it into -next, with the view at 3.20?

Pawel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
@ 2014-12-11 13:39       ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2014-12-11 13:39 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Thu, 2014-11-27 at 15:05 +0000, Pawel Moll wrote:
> On Thu, 2014-11-06 at 16:51 +0000, Pawel Moll wrote:
> > Until now, perf framework never defined the meaning of the timestamps
> > captured as PERF_SAMPLE_TIME sample type. The values were obtaining
> > from local (sched) clock, which is unavailable in userspace. This made
> > it impossible to correlate perf data with any other events. Other
> > tracing solutions have the source configurable (ftrace) or just share
> > a common time domain between kernel and userspace (LTTng).
> > 
> > Follow the trend by using monotonic clock, which is readily available
> > as POSIX CLOCK_MONOTONIC.
> > 
> > Also add a sysctl "perf_sample_time_clk_id" attribute which can be used
> > by the user to obtain the clk_id to be used with POSIX clock API (eg.
> > clock_gettime()) to obtain a time value comparable with perf samples.
> > 
> > Old behaviour can be restored by using "perf_use_local_clock" kernel
> > parameter.
> > 
> > Signed-off-by: Pawel Moll <pawel.moll@arm.com>
> 
> It's been 3 weeks without any negative feedback (no feedback at all, but
> I take the optimistic view :-)...
> 
> How about queuing at least this patch alone for the incoming merge
> window? Or at least getting it into -next, with the view at 3.20?

It's been another two weeks... How about getting it queued for v3.20
then?

Pawel


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
@ 2014-12-11 13:39       ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2014-12-11 13:39 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Thu, 2014-11-27 at 15:05 +0000, Pawel Moll wrote:
> On Thu, 2014-11-06 at 16:51 +0000, Pawel Moll wrote:
> > Until now, perf framework never defined the meaning of the timestamps
> > captured as PERF_SAMPLE_TIME sample type. The values were obtaining
> > from local (sched) clock, which is unavailable in userspace. This made
> > it impossible to correlate perf data with any other events. Other
> > tracing solutions have the source configurable (ftrace) or just share
> > a common time domain between kernel and userspace (LTTng).
> > 
> > Follow the trend by using monotonic clock, which is readily available
> > as POSIX CLOCK_MONOTONIC.
> > 
> > Also add a sysctl "perf_sample_time_clk_id" attribute which can be used
> > by the user to obtain the clk_id to be used with POSIX clock API (eg.
> > clock_gettime()) to obtain a time value comparable with perf samples.
> > 
> > Old behaviour can be restored by using "perf_use_local_clock" kernel
> > parameter.
> > 
> > Signed-off-by: Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org>
> 
> It's been 3 weeks without any negative feedback (no feedback at all, but
> I take the optimistic view :-)...
> 
> How about queuing at least this patch alone for the incoming merge
> window? Or at least getting it into -next, with the view at 3.20?

It's been another two weeks... How about getting it queued for v3.20
then?

Pawel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
  2014-11-06 16:51 ` [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps Pawel Moll
  2014-11-27 15:05     ` Pawel Moll
@ 2015-01-05 13:00   ` Peter Zijlstra
  2015-01-21 15:52       ` Pawel Moll
  2015-01-16 12:41     ` Adrian Hunter
  2015-01-21 20:27     ` Pawel Moll
  3 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2015-01-05 13:00 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Thu, Nov 06, 2014 at 04:51:56PM +0000, Pawel Moll wrote:
>  Documentation/kernel-parameters.txt |  9 +++++++++
>  kernel/events/core.c                | 37 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 46 insertions(+)

> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 4c81a86..8ead8d8 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt

> @@ -2763,6 +2764,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  			allocator.  This parameter is primarily	for debugging
>  			and performance comparison.
>  
> +	perf_use_local_clock
> +			[PERF]
> +			Use local_clock() as a source for perf timestamps
> +			generation. This was be the default behaviour and
> +			this parameter can be used to maintain backward
> +			compatibility or on older hardware with expensive
> +			monotonic clock source.
> +
>  	pf.		[PARIDE]
>  			See Documentation/blockdev/paride.txt.

So I'm always terminally confused on the naming of kernel parameters,
sometimes things are modules (even when they're not actually =m capable)
and get a module::foo naming or so and sometimes they're not.

So we want to use the module naming thing or not?

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 2b02c9f..5d0aa03 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c

> @@ -322,8 +323,41 @@ extern __weak const char *perf_pmu_name(void)
>  	return "pmu";
>  }
>  
> +static bool perf_use_local_clock;
> +static int __init perf_use_local_clock_setup(char *__unused)
> +{
> +	perf_use_local_clock = true;
> +	return 1;
> +}
> +__setup("perf_use_local_clock", perf_use_local_clock_setup);

>  static inline u64 perf_clock(void)
>  {
> +	if (likely(!perf_use_local_clock))
> +		return ktime_get_mono_fast_ns();
> +
>  	return local_clock();
>  }

Since this all is boot time, should we not use things like static_key
and avoid the 'pointless' conditional at runtime?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
@ 2015-01-05 13:01         ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2015-01-05 13:01 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Ingo Molnar, Richard Cochran, Steven Rostedt, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Thu, Dec 11, 2014 at 01:39:13PM +0000, Pawel Moll wrote:
> On Thu, 2014-11-27 at 15:05 +0000, Pawel Moll wrote:
> > It's been 3 weeks without any negative feedback (no feedback at all, but
> > I take the optimistic view :-)...
> > 
> > How about queuing at least this patch alone for the incoming merge
> > window? Or at least getting it into -next, with the view at 3.20?
> 
> It's been another two weeks... How about getting it queued for v3.20
> then?

Yes sorry about that, I had me a kid and took a wee bit of time away
from computers, then xmas and new years happened.

In any case, best wishes for the new year to all.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
@ 2015-01-05 13:01         ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2015-01-05 13:01 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Ingo Molnar, Richard Cochran, Steven Rostedt, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Thu, Dec 11, 2014 at 01:39:13PM +0000, Pawel Moll wrote:
> On Thu, 2014-11-27 at 15:05 +0000, Pawel Moll wrote:
> > It's been 3 weeks without any negative feedback (no feedback at all, but
> > I take the optimistic view :-)...
> > 
> > How about queuing at least this patch alone for the incoming merge
> > window? Or at least getting it into -next, with the view at 3.20?
> 
> It's been another two weeks... How about getting it queued for v3.20
> then?

Yes sorry about that, I had me a kid and took a wee bit of time away
from computers, then xmas and new years happened.

In any case, best wishes for the new year to all.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 2/3] perf: Userspace event
@ 2015-01-05 13:12     ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2015-01-05 13:12 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Thu, Nov 06, 2014 at 04:51:57PM +0000, Pawel Moll wrote:
> This patch adds a PR_TASK_PERF_UEVENT prctl call which can be used by
> any process to inject custom data into perf data stream as a new
> PERF_RECORD_UEVENT record, if such process is being observed or if it
> is running on a CPU being observed by the perf framework.
> 
> The prctl call takes the following arguments:
> 
>         prctl(PR_TASK_PERF_UEVENT, type, size, data, flags);
> 
> - type: a number meaning to describe content of the following data.
>   Kernel does not pay attention to it and merely passes it further in
>   the perf data, therefore its use must be agreed between the events
>   producer (the process being observed) and the consumer (performance
>   analysis tool). The perf userspace tool will contain a repository of
>   "well known" types and reference implementation of their decoders.
> - size: Length in bytes of the data.
> - data: Pointer to the data.
> - flags: Reserved for future use. Always pass zero.
> 
> Perf context that are supposed to receive events generated with the
> prctl above must be opened with perf_event_attr.uevent set to 1. The
> PERF_RECORD_UEVENT records consist of a standard perf event header,
> 32-bit type value, 32-bit data size and the data itself, followed by
> padding to align the overall record size to 8 bytes and optional,
> standard sample_id field.
> 
> Example use cases:
> 
> - "perf_printf" like mechanism to add logging messages to perf data;
>   in the simplest case it can be just
> 
>         prctl(PR_TASK_PERF_UEVENT, 0, 8, "Message", 0);
> 
> - synchronisation of performance data generated in user space with the
>   perf stream coming from the kernel. For example, the marker can be
>   inserted by a JIT engine after it generated portion of the code, but
>   before the code is executed for the first time, allowing the
>   post-processor to pick the correct debugging information.
> 

The think I remember being raised was a unified means of these msgs
across perf/ftrace/lttng. I am not seeing that mentioned.

Also, I would like a stronger rationale for the @type argument, if it
has no actual meaning why is it separate from the binary msg data?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 2/3] perf: Userspace event
@ 2015-01-05 13:12     ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2015-01-05 13:12 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Thu, Nov 06, 2014 at 04:51:57PM +0000, Pawel Moll wrote:
> This patch adds a PR_TASK_PERF_UEVENT prctl call which can be used by
> any process to inject custom data into perf data stream as a new
> PERF_RECORD_UEVENT record, if such process is being observed or if it
> is running on a CPU being observed by the perf framework.
> 
> The prctl call takes the following arguments:
> 
>         prctl(PR_TASK_PERF_UEVENT, type, size, data, flags);
> 
> - type: a number meaning to describe content of the following data.
>   Kernel does not pay attention to it and merely passes it further in
>   the perf data, therefore its use must be agreed between the events
>   producer (the process being observed) and the consumer (performance
>   analysis tool). The perf userspace tool will contain a repository of
>   "well known" types and reference implementation of their decoders.
> - size: Length in bytes of the data.
> - data: Pointer to the data.
> - flags: Reserved for future use. Always pass zero.
> 
> Perf context that are supposed to receive events generated with the
> prctl above must be opened with perf_event_attr.uevent set to 1. The
> PERF_RECORD_UEVENT records consist of a standard perf event header,
> 32-bit type value, 32-bit data size and the data itself, followed by
> padding to align the overall record size to 8 bytes and optional,
> standard sample_id field.
> 
> Example use cases:
> 
> - "perf_printf" like mechanism to add logging messages to perf data;
>   in the simplest case it can be just
> 
>         prctl(PR_TASK_PERF_UEVENT, 0, 8, "Message", 0);
> 
> - synchronisation of performance data generated in user space with the
>   perf stream coming from the kernel. For example, the marker can be
>   inserted by a JIT engine after it generated portion of the code, but
>   before the code is executed for the first time, allowing the
>   post-processor to pick the correct debugging information.
> 

The think I remember being raised was a unified means of these msgs
across perf/ftrace/lttng. I am not seeing that mentioned.

Also, I would like a stronger rationale for the @type argument, if it
has no actual meaning why is it separate from the binary msg data?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 3/3] perf: Sample additional clock value
@ 2015-01-05 13:45     ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2015-01-05 13:45 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Thu, Nov 06, 2014 at 04:51:58PM +0000, Pawel Moll wrote:
> Currently three clocks are implemented: CLOCK_REALITME = 0,
> CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 2. The clock field is
> 5 bits wide to allow for future extension to custom, non-POSIX clock
> sources(MAX_CLOCK for those is 16, see include/uapi/linux/time.h) like
> ARM CoreSight (hardware trace) timestamp generator.

> @@ -304,7 +305,16 @@ struct perf_event_attr {
>  				mmap2          :  1, /* include mmap with inode data     */
>  				comm_exec      :  1, /* flag comm events that are due to an exec */
>  				uevents        :  1, /* allow uevents into the buffer */
> -				__reserved_1   : 38;
> +
> +				/*
> +				 * clock: one of the POSIX clock IDs:
> +				 *
> +				 * 0 - CLOCK_REALTIME
> +				 * 1 - CLOCK_MONOTONIC
> +				 * 4 - CLOCK_MONOTONIC_RAW
> +				 */
> +				clock          :  5, /* clock type */
> +				__reserved_1   : 33;
>  
>  	union {
>  		__u32		wakeup_events;	  /* wakeup every n events */

This would put a constraint on actually changing MAX_CLOCKS, are the
time people OK with that? Thomas, John?

I'm also not quite sure of using the >MAX_CLOCKS space for 'special'
clocks, preferably those would register themselves with the POSIX clock
interface.

> @@ -4631,6 +4637,24 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
>  		data->cpu_entry.cpu	 = raw_smp_processor_id();
>  		data->cpu_entry.reserved = 0;
>  	}
> +
> +	if (sample_type & PERF_SAMPLE_CLOCK) {
> +		switch (event->attr.clock) {
> +		case CLOCK_REALTIME:
> +			data->clock = ktime_get_real_ns();
> +			break;
> +		case CLOCK_MONOTONIC:
> +			data->clock = ktime_get_mono_fast_ns();
> +			break;
> +		case CLOCK_MONOTONIC_RAW:
> +			data->clock = ktime_get_raw_ns();
> +			break;
> +		default:
> +			data->clock = 0;
> +			break;
> +		}
> +	}
> +
>  }

This is broken. There is nothing stopping this from being called from
NMI context and only the MONO one is NMI safe.

Also, one would expect something like:

	default: {
		struct k_clock *kc = clockid_to_kclock(event->attr.clock);
		struct timespec ts;
		if (kc) {
			kc->clock_get(event->attr.clock, &ts);
			data->clock = ktime_to_ns(timespec_to_ktime(ts));
		} else {
			data->clock = 0;
		}
	}

Albeit preferably slightly less horrible -- of course, one would first
need to deal with the NMI issue.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 3/3] perf: Sample additional clock value
@ 2015-01-05 13:45     ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2015-01-05 13:45 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Thu, Nov 06, 2014 at 04:51:58PM +0000, Pawel Moll wrote:
> Currently three clocks are implemented: CLOCK_REALITME = 0,
> CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 2. The clock field is
> 5 bits wide to allow for future extension to custom, non-POSIX clock
> sources(MAX_CLOCK for those is 16, see include/uapi/linux/time.h) like
> ARM CoreSight (hardware trace) timestamp generator.

> @@ -304,7 +305,16 @@ struct perf_event_attr {
>  				mmap2          :  1, /* include mmap with inode data     */
>  				comm_exec      :  1, /* flag comm events that are due to an exec */
>  				uevents        :  1, /* allow uevents into the buffer */
> -				__reserved_1   : 38;
> +
> +				/*
> +				 * clock: one of the POSIX clock IDs:
> +				 *
> +				 * 0 - CLOCK_REALTIME
> +				 * 1 - CLOCK_MONOTONIC
> +				 * 4 - CLOCK_MONOTONIC_RAW
> +				 */
> +				clock          :  5, /* clock type */
> +				__reserved_1   : 33;
>  
>  	union {
>  		__u32		wakeup_events;	  /* wakeup every n events */

This would put a constraint on actually changing MAX_CLOCKS, are the
time people OK with that? Thomas, John?

I'm also not quite sure of using the >MAX_CLOCKS space for 'special'
clocks, preferably those would register themselves with the POSIX clock
interface.

> @@ -4631,6 +4637,24 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
>  		data->cpu_entry.cpu	 = raw_smp_processor_id();
>  		data->cpu_entry.reserved = 0;
>  	}
> +
> +	if (sample_type & PERF_SAMPLE_CLOCK) {
> +		switch (event->attr.clock) {
> +		case CLOCK_REALTIME:
> +			data->clock = ktime_get_real_ns();
> +			break;
> +		case CLOCK_MONOTONIC:
> +			data->clock = ktime_get_mono_fast_ns();
> +			break;
> +		case CLOCK_MONOTONIC_RAW:
> +			data->clock = ktime_get_raw_ns();
> +			break;
> +		default:
> +			data->clock = 0;
> +			break;
> +		}
> +	}
> +
>  }

This is broken. There is nothing stopping this from being called from
NMI context and only the MONO one is NMI safe.

Also, one would expect something like:

	default: {
		struct k_clock *kc = clockid_to_kclock(event->attr.clock);
		struct timespec ts;
		if (kc) {
			kc->clock_get(event->attr.clock, &ts);
			data->clock = ktime_to_ns(timespec_to_ktime(ts));
		} else {
			data->clock = 0;
		}
	}

Albeit preferably slightly less horrible -- of course, one would first
need to deal with the NMI issue.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 3/3] perf: Sample additional clock value
  2015-01-05 13:45     ` Peter Zijlstra
  (?)
@ 2015-01-05 19:17     ` John Stultz
  -1 siblings, 0 replies; 58+ messages in thread
From: John Stultz @ 2015-01-05 19:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pawel Moll, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, lkml, Linux API

On Mon, Jan 5, 2015 at 5:45 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Nov 06, 2014 at 04:51:58PM +0000, Pawel Moll wrote:
>> Currently three clocks are implemented: CLOCK_REALITME = 0,
>> CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 2. The clock field is
>> 5 bits wide to allow for future extension to custom, non-POSIX clock
>> sources(MAX_CLOCK for those is 16, see include/uapi/linux/time.h) like
>> ARM CoreSight (hardware trace) timestamp generator.
>
>> @@ -304,7 +305,16 @@ struct perf_event_attr {
>>                               mmap2          :  1, /* include mmap with inode data     */
>>                               comm_exec      :  1, /* flag comm events that are due to an exec */
>>                               uevents        :  1, /* allow uevents into the buffer */
>> -                             __reserved_1   : 38;
>> +
>> +                             /*
>> +                              * clock: one of the POSIX clock IDs:
>> +                              *
>> +                              * 0 - CLOCK_REALTIME
>> +                              * 1 - CLOCK_MONOTONIC
>> +                              * 4 - CLOCK_MONOTONIC_RAW
>> +                              */
>> +                             clock          :  5, /* clock type */
>> +                             __reserved_1   : 33;
>>
>>       union {
>>               __u32           wakeup_events;    /* wakeup every n events */
>
> This would put a constraint on actually changing MAX_CLOCKS, are the
> time people OK with that? Thomas, John?

Yea, I'd not want to enshrine MAX_CLOCKS if we can avoid it. We're
getting a bit close to it these days.

thanks
-john

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
@ 2015-01-16 12:41     ` Adrian Hunter
  0 siblings, 0 replies; 58+ messages in thread
From: Adrian Hunter @ 2015-01-16 12:41 UTC (permalink / raw)
  To: Pawel Moll, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	John Stultz, Masami Hiramatsu, Christopher Covington,
	Namhyung Kim, David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel, linux-api

On 06/11/14 18:51, Pawel Moll wrote:
> Until now, perf framework never defined the meaning of the timestamps
> captured as PERF_SAMPLE_TIME sample type. The values were obtaining
> from local (sched) clock, which is unavailable in userspace. This made
> it impossible to correlate perf data with any other events. Other
> tracing solutions have the source configurable (ftrace) or just share
> a common time domain between kernel and userspace (LTTng).
> 
> Follow the trend by using monotonic clock, which is readily available
> as POSIX CLOCK_MONOTONIC.
> 
> Also add a sysctl "perf_sample_time_clk_id" attribute which can be used
> by the user to obtain the clk_id to be used with POSIX clock API (eg.
> clock_gettime()) to obtain a time value comparable with perf samples.
> 
> Old behaviour can be restored by using "perf_use_local_clock" kernel
> parameter.
> 

Don't forget this breaks the relationship to TSC. So you will need something
like below. Also patch 3 needs to be done first and extended to cover TSC so
there is no gap when we cannot get TSC.

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 143e5f5..b6e833d 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1922,21 +1922,6 @@ void arch_perf_update_userpage(struct
perf_event_mmap_page *userpg, u64 now)
        userpg->cap_user_time_zero = 0;
        userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
        userpg->pmc_width = x86_pmu.cntval_bits;
-
-       if (!sched_clock_stable())
-               return;
-
-       data = cyc2ns_read_begin();
-
-       userpg->cap_user_time = 1;
-       userpg->time_mult = data->cyc2ns_mul;
-       userpg->time_shift = data->cyc2ns_shift;
-       userpg->time_offset = data->cyc2ns_offset - now;
-
-       userpg->cap_user_time_zero = 1;
-       userpg->time_zero = data->cyc2ns_offset;
-
-       cyc2ns_read_end(data);
 }

 /*



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
@ 2015-01-16 12:41     ` Adrian Hunter
  0 siblings, 0 replies; 58+ messages in thread
From: Adrian Hunter @ 2015-01-16 12:41 UTC (permalink / raw)
  To: Pawel Moll, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	John Stultz, Masami Hiramatsu, Christopher Covington,
	Namhyung Kim, David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA

On 06/11/14 18:51, Pawel Moll wrote:
> Until now, perf framework never defined the meaning of the timestamps
> captured as PERF_SAMPLE_TIME sample type. The values were obtaining
> from local (sched) clock, which is unavailable in userspace. This made
> it impossible to correlate perf data with any other events. Other
> tracing solutions have the source configurable (ftrace) or just share
> a common time domain between kernel and userspace (LTTng).
> 
> Follow the trend by using monotonic clock, which is readily available
> as POSIX CLOCK_MONOTONIC.
> 
> Also add a sysctl "perf_sample_time_clk_id" attribute which can be used
> by the user to obtain the clk_id to be used with POSIX clock API (eg.
> clock_gettime()) to obtain a time value comparable with perf samples.
> 
> Old behaviour can be restored by using "perf_use_local_clock" kernel
> parameter.
> 

Don't forget this breaks the relationship to TSC. So you will need something
like below. Also patch 3 needs to be done first and extended to cover TSC so
there is no gap when we cannot get TSC.

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 143e5f5..b6e833d 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1922,21 +1922,6 @@ void arch_perf_update_userpage(struct
perf_event_mmap_page *userpg, u64 now)
        userpg->cap_user_time_zero = 0;
        userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
        userpg->pmc_width = x86_pmu.cntval_bits;
-
-       if (!sched_clock_stable())
-               return;
-
-       data = cyc2ns_read_begin();
-
-       userpg->cap_user_time = 1;
-       userpg->time_mult = data->cyc2ns_mul;
-       userpg->time_shift = data->cyc2ns_shift;
-       userpg->time_offset = data->cyc2ns_offset - now;
-
-       userpg->cap_user_time_zero = 1;
-       userpg->time_zero = data->cyc2ns_offset;
-
-       cyc2ns_read_end(data);
 }

 /*

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
  2015-01-05 13:01         ` Peter Zijlstra
  (?)
@ 2015-01-21 15:47         ` Pawel Moll
  -1 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-21 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Richard Cochran, Steven Rostedt, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Mon, 2015-01-05 at 13:01 +0000, Peter Zijlstra wrote:
> On Thu, Dec 11, 2014 at 01:39:13PM +0000, Pawel Moll wrote:
> > On Thu, 2014-11-27 at 15:05 +0000, Pawel Moll wrote:
> > > It's been 3 weeks without any negative feedback (no feedback at all, but
> > > I take the optimistic view :-)...
> > > 
> > > How about queuing at least this patch alone for the incoming merge
> > > window? Or at least getting it into -next, with the view at 3.20?
> > 
> > It's been another two weeks... How about getting it queued for v3.20
> > then?
> 
> Yes sorry about that, I had me a kid and took a wee bit of time away
> from computers, then xmas and new years happened.

"a kid" as in: you have forked? :-) It will happen to me in 1.5 month
time (or earlier). All the best luck, we'll both need it ;-)

> In any case, best wishes for the new year to all.

Indeed.

Pawel



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
@ 2015-01-21 15:52       ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-21 15:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Mon, 2015-01-05 at 13:00 +0000, Peter Zijlstra wrote:
> On Thu, Nov 06, 2014 at 04:51:56PM +0000, Pawel Moll wrote:
> >  Documentation/kernel-parameters.txt |  9 +++++++++
> >  kernel/events/core.c                | 37 +++++++++++++++++++++++++++++++++++++
> >  2 files changed, 46 insertions(+)
> 
> > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > index 4c81a86..8ead8d8 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> 
> > @@ -2763,6 +2764,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >  			allocator.  This parameter is primarily	for debugging
> >  			and performance comparison.
> >  
> > +	perf_use_local_clock
> > +			[PERF]
> > +			Use local_clock() as a source for perf timestamps
> > +			generation. This was be the default behaviour and
> > +			this parameter can be used to maintain backward
> > +			compatibility or on older hardware with expensive
> > +			monotonic clock source.
> > +
> >  	pf.		[PARIDE]
> >  			See Documentation/blockdev/paride.txt.
> 
> So I'm always terminally confused on the naming of kernel parameters,
> sometimes things are modules (even when they're not actually =m capable)
> and get a module::foo naming or so and sometimes they're not.

I guess you mean module.foo?

> So we want to use the module naming thing or not?

Honestly, I don't mind either way. For one thing ftrace doesn't bother
and just uses __setup() as well.

> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 2b02c9f..5d0aa03 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> 
> > @@ -322,8 +323,41 @@ extern __weak const char *perf_pmu_name(void)
> >  	return "pmu";
> >  }
> >  
> > +static bool perf_use_local_clock;
> > +static int __init perf_use_local_clock_setup(char *__unused)
> > +{
> > +	perf_use_local_clock = true;
> > +	return 1;
> > +}
> > +__setup("perf_use_local_clock", perf_use_local_clock_setup);
> 
> >  static inline u64 perf_clock(void)
> >  {
> > +	if (likely(!perf_use_local_clock))
> > +		return ktime_get_mono_fast_ns();
> > +
> >  	return local_clock();
> >  }
> 
> Since this all is boot time, should we not use things like static_key
> and avoid the 'pointless' conditional at runtime?

Right. it's good to learn new stuff - I didn't know there was
architecture-agnostic support for jump labels. Definitely worth the
effort, will give it a try and spin the patch.

Pawel



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
@ 2015-01-21 15:52       ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-21 15:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Mon, 2015-01-05 at 13:00 +0000, Peter Zijlstra wrote:
> On Thu, Nov 06, 2014 at 04:51:56PM +0000, Pawel Moll wrote:
> >  Documentation/kernel-parameters.txt |  9 +++++++++
> >  kernel/events/core.c                | 37 +++++++++++++++++++++++++++++++++++++
> >  2 files changed, 46 insertions(+)
> 
> > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > index 4c81a86..8ead8d8 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> 
> > @@ -2763,6 +2764,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >  			allocator.  This parameter is primarily	for debugging
> >  			and performance comparison.
> >  
> > +	perf_use_local_clock
> > +			[PERF]
> > +			Use local_clock() as a source for perf timestamps
> > +			generation. This was be the default behaviour and
> > +			this parameter can be used to maintain backward
> > +			compatibility or on older hardware with expensive
> > +			monotonic clock source.
> > +
> >  	pf.		[PARIDE]
> >  			See Documentation/blockdev/paride.txt.
> 
> So I'm always terminally confused on the naming of kernel parameters,
> sometimes things are modules (even when they're not actually =m capable)
> and get a module::foo naming or so and sometimes they're not.

I guess you mean module.foo?

> So we want to use the module naming thing or not?

Honestly, I don't mind either way. For one thing ftrace doesn't bother
and just uses __setup() as well.

> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 2b02c9f..5d0aa03 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> 
> > @@ -322,8 +323,41 @@ extern __weak const char *perf_pmu_name(void)
> >  	return "pmu";
> >  }
> >  
> > +static bool perf_use_local_clock;
> > +static int __init perf_use_local_clock_setup(char *__unused)
> > +{
> > +	perf_use_local_clock = true;
> > +	return 1;
> > +}
> > +__setup("perf_use_local_clock", perf_use_local_clock_setup);
> 
> >  static inline u64 perf_clock(void)
> >  {
> > +	if (likely(!perf_use_local_clock))
> > +		return ktime_get_mono_fast_ns();
> > +
> >  	return local_clock();
> >  }
> 
> Since this all is boot time, should we not use things like static_key
> and avoid the 'pointless' conditional at runtime?

Right. it's good to learn new stuff - I didn't know there was
architecture-agnostic support for jump labels. Definitely worth the
effort, will give it a try and spin the patch.

Pawel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 2/3] perf: Userspace event
@ 2015-01-21 16:01       ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-21 16:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Mon, 2015-01-05 at 13:12 +0000, Peter Zijlstra wrote:
> On Thu, Nov 06, 2014 at 04:51:57PM +0000, Pawel Moll wrote:
> > This patch adds a PR_TASK_PERF_UEVENT prctl call which can be used by
> > any process to inject custom data into perf data stream as a new
> > PERF_RECORD_UEVENT record, if such process is being observed or if it
> > is running on a CPU being observed by the perf framework.
> > 
> > The prctl call takes the following arguments:
> > 
> >         prctl(PR_TASK_PERF_UEVENT, type, size, data, flags);
> > 
> > - type: a number meaning to describe content of the following data.
> >   Kernel does not pay attention to it and merely passes it further in
> >   the perf data, therefore its use must be agreed between the events
> >   producer (the process being observed) and the consumer (performance
> >   analysis tool). The perf userspace tool will contain a repository of
> >   "well known" types and reference implementation of their decoders.
> > - size: Length in bytes of the data.
> > - data: Pointer to the data.
> > - flags: Reserved for future use. Always pass zero.
> > 
> > Perf context that are supposed to receive events generated with the
> > prctl above must be opened with perf_event_attr.uevent set to 1. The
> > PERF_RECORD_UEVENT records consist of a standard perf event header,
> > 32-bit type value, 32-bit data size and the data itself, followed by
> > padding to align the overall record size to 8 bytes and optional,
> > standard sample_id field.
> > 
> > Example use cases:
> > 
> > - "perf_printf" like mechanism to add logging messages to perf data;
> >   in the simplest case it can be just
> > 
> >         prctl(PR_TASK_PERF_UEVENT, 0, 8, "Message", 0);
> > 
> > - synchronisation of performance data generated in user space with the
> >   perf stream coming from the kernel. For example, the marker can be
> >   inserted by a JIT engine after it generated portion of the code, but
> >   before the code is executed for the first time, allowing the
> >   post-processor to pick the correct debugging information.
> 
> The think I remember being raised was a unified means of these msgs
> across perf/ftrace/lttng. I am not seeing that mentioned.

Right. I was considering the "well known types repository" an attempt in
this direction. Having said that - ftrace also takes a random blob as
the trace marker, so the unification has to happen in userspace anyway.
I'll have a look what LTTng has to say in this respect.

> Also, I would like a stronger rationale for the @type argument, if it
> has no actual meaning why is it separate from the binary msg data?

Valid point. Without type 0 defined as a string, it doesn't bring
anything into the equation. I just have a gut feeling that sooner than
later we will want to split the messages somehow. Maybe we should make
it a "reserved for future use, use 0 now" field?

        * struct {
        *      struct perf_event_header        header;
        *      u32                             __reserved; /* always 0 */
        *      u32                             size;
        *      char                            data[size];
        *      char                            __padding[-size & 7];
        *      struct sample_id                sample_id;
        * };

or, probably even better, make it a version value at a known offset
(currently always 1, with just size and random sized data following).

        * struct {
        *      struct perf_event_header        header;
        *      u32                             version; /* use 1 */
        *      u32                             size;
        *      char                            data[size];
        *      char                            __padding[-size & 7];
        *      struct sample_id                sample_id;
        * };

So that we can mutate the user events format without too much of the
pain - the parsers will simply complain about unknown format if such
occurs and with the size of the record in the header, it is possible to
skip it.

Pawel



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 2/3] perf: Userspace event
@ 2015-01-21 16:01       ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-21 16:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Mon, 2015-01-05 at 13:12 +0000, Peter Zijlstra wrote:
> On Thu, Nov 06, 2014 at 04:51:57PM +0000, Pawel Moll wrote:
> > This patch adds a PR_TASK_PERF_UEVENT prctl call which can be used by
> > any process to inject custom data into perf data stream as a new
> > PERF_RECORD_UEVENT record, if such process is being observed or if it
> > is running on a CPU being observed by the perf framework.
> > 
> > The prctl call takes the following arguments:
> > 
> >         prctl(PR_TASK_PERF_UEVENT, type, size, data, flags);
> > 
> > - type: a number meaning to describe content of the following data.
> >   Kernel does not pay attention to it and merely passes it further in
> >   the perf data, therefore its use must be agreed between the events
> >   producer (the process being observed) and the consumer (performance
> >   analysis tool). The perf userspace tool will contain a repository of
> >   "well known" types and reference implementation of their decoders.
> > - size: Length in bytes of the data.
> > - data: Pointer to the data.
> > - flags: Reserved for future use. Always pass zero.
> > 
> > Perf context that are supposed to receive events generated with the
> > prctl above must be opened with perf_event_attr.uevent set to 1. The
> > PERF_RECORD_UEVENT records consist of a standard perf event header,
> > 32-bit type value, 32-bit data size and the data itself, followed by
> > padding to align the overall record size to 8 bytes and optional,
> > standard sample_id field.
> > 
> > Example use cases:
> > 
> > - "perf_printf" like mechanism to add logging messages to perf data;
> >   in the simplest case it can be just
> > 
> >         prctl(PR_TASK_PERF_UEVENT, 0, 8, "Message", 0);
> > 
> > - synchronisation of performance data generated in user space with the
> >   perf stream coming from the kernel. For example, the marker can be
> >   inserted by a JIT engine after it generated portion of the code, but
> >   before the code is executed for the first time, allowing the
> >   post-processor to pick the correct debugging information.
> 
> The think I remember being raised was a unified means of these msgs
> across perf/ftrace/lttng. I am not seeing that mentioned.

Right. I was considering the "well known types repository" an attempt in
this direction. Having said that - ftrace also takes a random blob as
the trace marker, so the unification has to happen in userspace anyway.
I'll have a look what LTTng has to say in this respect.

> Also, I would like a stronger rationale for the @type argument, if it
> has no actual meaning why is it separate from the binary msg data?

Valid point. Without type 0 defined as a string, it doesn't bring
anything into the equation. I just have a gut feeling that sooner than
later we will want to split the messages somehow. Maybe we should make
it a "reserved for future use, use 0 now" field?

        * struct {
        *      struct perf_event_header        header;
        *      u32                             __reserved; /* always 0 */
        *      u32                             size;
        *      char                            data[size];
        *      char                            __padding[-size & 7];
        *      struct sample_id                sample_id;
        * };

or, probably even better, make it a version value at a known offset
(currently always 1, with just size and random sized data following).

        * struct {
        *      struct perf_event_header        header;
        *      u32                             version; /* use 1 */
        *      u32                             size;
        *      char                            data[size];
        *      char                            __padding[-size & 7];
        *      struct sample_id                sample_id;
        * };

So that we can mutate the user events format without too much of the
pain - the parsers will simply complain about unknown format if such
occurs and with the size of the record in the header, it is possible to
skip it.

Pawel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 3/3] perf: Sample additional clock value
@ 2015-01-21 17:12       ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-21 17:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Mon, 2015-01-05 at 13:45 +0000, Peter Zijlstra wrote:
> On Thu, Nov 06, 2014 at 04:51:58PM +0000, Pawel Moll wrote:
> > Currently three clocks are implemented: CLOCK_REALITME = 0,
> > CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 2. The clock field is
> > 5 bits wide to allow for future extension to custom, non-POSIX clock
> > sources(MAX_CLOCK for those is 16, see include/uapi/linux/time.h) like
> > ARM CoreSight (hardware trace) timestamp generator.
> 
> > @@ -304,7 +305,16 @@ struct perf_event_attr {
> >  				mmap2          :  1, /* include mmap with inode data     */
> >  				comm_exec      :  1, /* flag comm events that are due to an exec */
> >  				uevents        :  1, /* allow uevents into the buffer */
> > -				__reserved_1   : 38;
> > +
> > +				/*
> > +				 * clock: one of the POSIX clock IDs:
> > +				 *
> > +				 * 0 - CLOCK_REALTIME
> > +				 * 1 - CLOCK_MONOTONIC
> > +				 * 4 - CLOCK_MONOTONIC_RAW
> > +				 */
> > +				clock          :  5, /* clock type */
> > +				__reserved_1   : 33;
> >  
> >  	union {
> >  		__u32		wakeup_events;	  /* wakeup every n events */
> 
> This would put a constraint on actually changing MAX_CLOCKS, are the
> time people OK with that? Thomas, John?

I admit I have some doubts about it myself. But I don't think we can
afford full int for the clock id, can we?

> I'm also not quite sure of using the >MAX_CLOCKS space for 'special'
> clocks, preferably those would register themselves with the POSIX clock
> interface.

That may be a hard sell - John never was particularly keen on extending
the hard-coded clocks with random sources. And the hardware trace clock
I had on mind would be probably one of them - its meaning depends a lot
on the . Of course I'm looking forward to being surprised :-)

> > @@ -4631,6 +4637,24 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
> >  		data->cpu_entry.cpu	 = raw_smp_processor_id();
> >  		data->cpu_entry.reserved = 0;
> >  	}
> > +
> > +	if (sample_type & PERF_SAMPLE_CLOCK) {
> > +		switch (event->attr.clock) {
> > +		case CLOCK_REALTIME:
> > +			data->clock = ktime_get_real_ns();
> > +			break;
> > +		case CLOCK_MONOTONIC:
> > +			data->clock = ktime_get_mono_fast_ns();
> > +			break;
> > +		case CLOCK_MONOTONIC_RAW:
> > +			data->clock = ktime_get_raw_ns();
> > +			break;
> > +		default:
> > +			data->clock = 0;
> > +			break;
> > +		}
> > +	}
> > +
> >  }
> 
> This is broken. There is nothing stopping this from being called from
> NMI context and only the MONO one is NMI safe.

Uh, right. 

> Also, one would expect something like:
> 
> 	default: {
> 		struct k_clock *kc = clockid_to_kclock(event->attr.clock);
> 		struct timespec ts;
> 		if (kc) {
> 			kc->clock_get(event->attr.clock, &ts);
> 			data->clock = ktime_to_ns(timespec_to_ktime(ts));
> 		} else {
> 			data->clock = 0;
> 		}
> 	}
> 
> Albeit preferably slightly less horrible -- of course, one would first
> need to deal with the NMI issue.

Ok, generally this patch clearly needs more work. I'll have to revisit
what Thomas did with the monotonic clock to understand the potential
solutions.

Maybe, in this case, giving such a wide choice of clocks to be sampled
is not the right thing after all? Maybe, when we face the issue of a
trace clock for real, simple "PERF_SAMPLE_TRACE_CLOCK" will suffice?
With the monotonic clock as a normal time base merged (hopefully
soon :-) the correlation between kernel and user space (which was the
original reason for having this change) won't be an issue any more.

Pawel


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 3/3] perf: Sample additional clock value
@ 2015-01-21 17:12       ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-21 17:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Mon, 2015-01-05 at 13:45 +0000, Peter Zijlstra wrote:
> On Thu, Nov 06, 2014 at 04:51:58PM +0000, Pawel Moll wrote:
> > Currently three clocks are implemented: CLOCK_REALITME = 0,
> > CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 2. The clock field is
> > 5 bits wide to allow for future extension to custom, non-POSIX clock
> > sources(MAX_CLOCK for those is 16, see include/uapi/linux/time.h) like
> > ARM CoreSight (hardware trace) timestamp generator.
> 
> > @@ -304,7 +305,16 @@ struct perf_event_attr {
> >  				mmap2          :  1, /* include mmap with inode data     */
> >  				comm_exec      :  1, /* flag comm events that are due to an exec */
> >  				uevents        :  1, /* allow uevents into the buffer */
> > -				__reserved_1   : 38;
> > +
> > +				/*
> > +				 * clock: one of the POSIX clock IDs:
> > +				 *
> > +				 * 0 - CLOCK_REALTIME
> > +				 * 1 - CLOCK_MONOTONIC
> > +				 * 4 - CLOCK_MONOTONIC_RAW
> > +				 */
> > +				clock          :  5, /* clock type */
> > +				__reserved_1   : 33;
> >  
> >  	union {
> >  		__u32		wakeup_events;	  /* wakeup every n events */
> 
> This would put a constraint on actually changing MAX_CLOCKS, are the
> time people OK with that? Thomas, John?

I admit I have some doubts about it myself. But I don't think we can
afford full int for the clock id, can we?

> I'm also not quite sure of using the >MAX_CLOCKS space for 'special'
> clocks, preferably those would register themselves with the POSIX clock
> interface.

That may be a hard sell - John never was particularly keen on extending
the hard-coded clocks with random sources. And the hardware trace clock
I had on mind would be probably one of them - its meaning depends a lot
on the . Of course I'm looking forward to being surprised :-)

> > @@ -4631,6 +4637,24 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
> >  		data->cpu_entry.cpu	 = raw_smp_processor_id();
> >  		data->cpu_entry.reserved = 0;
> >  	}
> > +
> > +	if (sample_type & PERF_SAMPLE_CLOCK) {
> > +		switch (event->attr.clock) {
> > +		case CLOCK_REALTIME:
> > +			data->clock = ktime_get_real_ns();
> > +			break;
> > +		case CLOCK_MONOTONIC:
> > +			data->clock = ktime_get_mono_fast_ns();
> > +			break;
> > +		case CLOCK_MONOTONIC_RAW:
> > +			data->clock = ktime_get_raw_ns();
> > +			break;
> > +		default:
> > +			data->clock = 0;
> > +			break;
> > +		}
> > +	}
> > +
> >  }
> 
> This is broken. There is nothing stopping this from being called from
> NMI context and only the MONO one is NMI safe.

Uh, right. 

> Also, one would expect something like:
> 
> 	default: {
> 		struct k_clock *kc = clockid_to_kclock(event->attr.clock);
> 		struct timespec ts;
> 		if (kc) {
> 			kc->clock_get(event->attr.clock, &ts);
> 			data->clock = ktime_to_ns(timespec_to_ktime(ts));
> 		} else {
> 			data->clock = 0;
> 		}
> 	}
> 
> Albeit preferably slightly less horrible -- of course, one would first
> need to deal with the NMI issue.

Ok, generally this patch clearly needs more work. I'll have to revisit
what Thomas did with the monotonic clock to understand the potential
solutions.

Maybe, in this case, giving such a wide choice of clocks to be sampled
is not the right thing after all? Maybe, when we face the issue of a
trace clock for real, simple "PERF_SAMPLE_TRACE_CLOCK" will suffice?
With the monotonic clock as a normal time base merged (hopefully
soon :-) the correlation between kernel and user space (which was the
original reason for having this change) won't be an issue any more.

Pawel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 3/3] perf: Sample additional clock value
@ 2015-01-21 17:44         ` John Stultz
  0 siblings, 0 replies; 58+ messages in thread
From: John Stultz @ 2015-01-21 17:44 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Peter Zijlstra, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Wed, Jan 21, 2015 at 9:12 AM, Pawel Moll <pawel.moll@arm.com> wrote:
> On Mon, 2015-01-05 at 13:45 +0000, Peter Zijlstra wrote:
>> On Thu, Nov 06, 2014 at 04:51:58PM +0000, Pawel Moll wrote:
>> > Currently three clocks are implemented: CLOCK_REALITME = 0,
>> > CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 2. The clock field is
>> > 5 bits wide to allow for future extension to custom, non-POSIX clock
>> > sources(MAX_CLOCK for those is 16, see include/uapi/linux/time.h) like
>> > ARM CoreSight (hardware trace) timestamp generator.
>>
>> > @@ -304,7 +305,16 @@ struct perf_event_attr {
>> >                             mmap2          :  1, /* include mmap with inode data     */
>> >                             comm_exec      :  1, /* flag comm events that are due to an exec */
>> >                             uevents        :  1, /* allow uevents into the buffer */
>> > -                           __reserved_1   : 38;
>> > +
>> > +                           /*
>> > +                            * clock: one of the POSIX clock IDs:
>> > +                            *
>> > +                            * 0 - CLOCK_REALTIME
>> > +                            * 1 - CLOCK_MONOTONIC
>> > +                            * 4 - CLOCK_MONOTONIC_RAW
>> > +                            */
>> > +                           clock          :  5, /* clock type */
>> > +                           __reserved_1   : 33;
>> >
>> >     union {
>> >             __u32           wakeup_events;    /* wakeup every n events */
>>
>> This would put a constraint on actually changing MAX_CLOCKS, are the
>> time people OK with that? Thomas, John?
>
> I admit I have some doubts about it myself. But I don't think we can
> afford full int for the clock id, can we?
>
>> I'm also not quite sure of using the >MAX_CLOCKS space for 'special'
>> clocks, preferably those would register themselves with the POSIX clock
>> interface.
>
> That may be a hard sell - John never was particularly keen on extending
> the hard-coded clocks with random sources. And the hardware trace clock
> I had on mind would be probably one of them - its meaning depends a lot
> on the . Of course I'm looking forward to being surprised :-)

Yea. I'm definitely still wanting to be cautious about adding new
clockids. Basically if there's a new well defined time domain, then
I'm open to it,  (for example, I'm expecting there will be a smeared
leapsecond time domain at some point in the future), but we've already
grown more then I'm comfortable with given the existing MAX_CLOCKS
limit.

For example, I regret adding the _ALARM clockids. Those are basically
duplicative time domains from the readers perspective, and only have
unique value for setting timers, which should have been handled via a
flag to the timer interfaces.

I suspect we'll have to bump MAX_CLOCKS that at some point and hope it
doesn't break anyone.

That said, there is the dynamic posix clockids. I'm not sure if it
would make sense, but even if we don't bump MAX_CLOCKS, might there
be some case where someone wants to use a dynamic posix clock for the
perf reference?

thanks
-john

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 3/3] perf: Sample additional clock value
@ 2015-01-21 17:44         ` John Stultz
  0 siblings, 0 replies; 58+ messages in thread
From: John Stultz @ 2015-01-21 17:44 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Peter Zijlstra, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Wed, Jan 21, 2015 at 9:12 AM, Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org> wrote:
> On Mon, 2015-01-05 at 13:45 +0000, Peter Zijlstra wrote:
>> On Thu, Nov 06, 2014 at 04:51:58PM +0000, Pawel Moll wrote:
>> > Currently three clocks are implemented: CLOCK_REALITME = 0,
>> > CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 2. The clock field is
>> > 5 bits wide to allow for future extension to custom, non-POSIX clock
>> > sources(MAX_CLOCK for those is 16, see include/uapi/linux/time.h) like
>> > ARM CoreSight (hardware trace) timestamp generator.
>>
>> > @@ -304,7 +305,16 @@ struct perf_event_attr {
>> >                             mmap2          :  1, /* include mmap with inode data     */
>> >                             comm_exec      :  1, /* flag comm events that are due to an exec */
>> >                             uevents        :  1, /* allow uevents into the buffer */
>> > -                           __reserved_1   : 38;
>> > +
>> > +                           /*
>> > +                            * clock: one of the POSIX clock IDs:
>> > +                            *
>> > +                            * 0 - CLOCK_REALTIME
>> > +                            * 1 - CLOCK_MONOTONIC
>> > +                            * 4 - CLOCK_MONOTONIC_RAW
>> > +                            */
>> > +                           clock          :  5, /* clock type */
>> > +                           __reserved_1   : 33;
>> >
>> >     union {
>> >             __u32           wakeup_events;    /* wakeup every n events */
>>
>> This would put a constraint on actually changing MAX_CLOCKS, are the
>> time people OK with that? Thomas, John?
>
> I admit I have some doubts about it myself. But I don't think we can
> afford full int for the clock id, can we?
>
>> I'm also not quite sure of using the >MAX_CLOCKS space for 'special'
>> clocks, preferably those would register themselves with the POSIX clock
>> interface.
>
> That may be a hard sell - John never was particularly keen on extending
> the hard-coded clocks with random sources. And the hardware trace clock
> I had on mind would be probably one of them - its meaning depends a lot
> on the . Of course I'm looking forward to being surprised :-)

Yea. I'm definitely still wanting to be cautious about adding new
clockids. Basically if there's a new well defined time domain, then
I'm open to it,  (for example, I'm expecting there will be a smeared
leapsecond time domain at some point in the future), but we've already
grown more then I'm comfortable with given the existing MAX_CLOCKS
limit.

For example, I regret adding the _ALARM clockids. Those are basically
duplicative time domains from the readers perspective, and only have
unique value for setting timers, which should have been handled via a
flag to the timer interfaces.

I suspect we'll have to bump MAX_CLOCKS that at some point and hope it
doesn't break anyone.

That said, there is the dynamic posix clockids. I'm not sure if it
would make sense, but even if we don't bump MAX_CLOCKS, might there
be some case where someone wants to use a dynamic posix clock for the
perf reference?

thanks
-john

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 3/3] perf: Sample additional clock value
@ 2015-01-21 17:54           ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-21 17:54 UTC (permalink / raw)
  To: John Stultz
  Cc: Peter Zijlstra, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Wed, 2015-01-21 at 17:44 +0000, John Stultz wrote:
> That said, there is the dynamic posix clockids. I'm not sure if it
> would make sense, but even if we don't bump MAX_CLOCKS, might there
> be some case where someone wants to use a dynamic posix clock for the
> perf reference?

If I remember correctly, last time I tried to use dynamic posix clocks
in the perf context, one needed to open a ptp character device in order
to get a file descriptor, than translated into a clockid_t value -
correct me if I'm wrong. But here you get the fd from the
sys_perf_open() and clock_*() doesn't know anything about such
descriptor.

I was looking into a way of associating a random clock with a random fd,
so that perf could "attach" itself to the clock API at will, but it
turned out not to be trivial (I'd have to dig through old threads to
remember all the nasty details).

The good thing is that it looks like the immediate need for this was no
more, with perf using monotonic clock as the clock source. It will come
back when we get into hardware trace correlation, but one step at a
time...

Pawel


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 3/3] perf: Sample additional clock value
@ 2015-01-21 17:54           ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-21 17:54 UTC (permalink / raw)
  To: John Stultz
  Cc: Peter Zijlstra, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Wed, 2015-01-21 at 17:44 +0000, John Stultz wrote:
> That said, there is the dynamic posix clockids. I'm not sure if it
> would make sense, but even if we don't bump MAX_CLOCKS, might there
> be some case where someone wants to use a dynamic posix clock for the
> perf reference?

If I remember correctly, last time I tried to use dynamic posix clocks
in the perf context, one needed to open a ptp character device in order
to get a file descriptor, than translated into a clockid_t value -
correct me if I'm wrong. But here you get the fd from the
sys_perf_open() and clock_*() doesn't know anything about such
descriptor.

I was looking into a way of associating a random clock with a random fd,
so that perf could "attach" itself to the clock API at will, but it
turned out not to be trivial (I'd have to dig through old threads to
remember all the nasty details).

The good thing is that it looks like the immediate need for this was no
more, with perf using monotonic clock as the clock source. It will come
back when we get into hardware trace correlation, but one step at a
time...

Pawel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 3/3] perf: Sample additional clock value
@ 2015-01-21 18:05             ` John Stultz
  0 siblings, 0 replies; 58+ messages in thread
From: John Stultz @ 2015-01-21 18:05 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Peter Zijlstra, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Wed, Jan 21, 2015 at 9:54 AM, Pawel Moll <pawel.moll@arm.com> wrote:
> On Wed, 2015-01-21 at 17:44 +0000, John Stultz wrote:
>> That said, there is the dynamic posix clockids. I'm not sure if it
>> would make sense, but even if we don't bump MAX_CLOCKS, might there
>> be some case where someone wants to use a dynamic posix clock for the
>> perf reference?
>
> If I remember correctly, last time I tried to use dynamic posix clocks
> in the perf context, one needed to open a ptp character device in order
> to get a file descriptor, than translated into a clockid_t value -
> correct me if I'm wrong. But here you get the fd from the
> sys_perf_open() and clock_*() doesn't know anything about such
> descriptor.

Sorry for losing context here then. Yes, the dynamic clockid has to be
exported via some other interface to userland (likely via the driver
that provides it), but once the id is known it can be used via the
clock_*() functions.   I was thinking here since the perf_event_attr
wants to associate with a clockid, including the possibility for
dynamic clockids would be wise, but I didn't read closely enough to
see how that clockid was specified.

> I was looking into a way of associating a random clock with a random fd,
> so that perf could "attach" itself to the clock API at will, but it
> turned out not to be trivial (I'd have to dig through old threads to
> remember all the nasty details).
>
> The good thing is that it looks like the immediate need for this was no
> more, with perf using monotonic clock as the clock source. It will come
> back when we get into hardware trace correlation, but one step at a
> time...

Ok. I'm eager to see this settled (and the current approach I don't
have any objections to, although I'm not paying super close attention
now that its all in the perf core).   I know this has taken far longer
then you'd have liked, so thanks for your persistence!

-john

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 3/3] perf: Sample additional clock value
@ 2015-01-21 18:05             ` John Stultz
  0 siblings, 0 replies; 58+ messages in thread
From: John Stultz @ 2015-01-21 18:05 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Peter Zijlstra, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Wed, Jan 21, 2015 at 9:54 AM, Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org> wrote:
> On Wed, 2015-01-21 at 17:44 +0000, John Stultz wrote:
>> That said, there is the dynamic posix clockids. I'm not sure if it
>> would make sense, but even if we don't bump MAX_CLOCKS, might there
>> be some case where someone wants to use a dynamic posix clock for the
>> perf reference?
>
> If I remember correctly, last time I tried to use dynamic posix clocks
> in the perf context, one needed to open a ptp character device in order
> to get a file descriptor, than translated into a clockid_t value -
> correct me if I'm wrong. But here you get the fd from the
> sys_perf_open() and clock_*() doesn't know anything about such
> descriptor.

Sorry for losing context here then. Yes, the dynamic clockid has to be
exported via some other interface to userland (likely via the driver
that provides it), but once the id is known it can be used via the
clock_*() functions.   I was thinking here since the perf_event_attr
wants to associate with a clockid, including the possibility for
dynamic clockids would be wise, but I didn't read closely enough to
see how that clockid was specified.

> I was looking into a way of associating a random clock with a random fd,
> so that perf could "attach" itself to the clock API at will, but it
> turned out not to be trivial (I'd have to dig through old threads to
> remember all the nasty details).
>
> The good thing is that it looks like the immediate need for this was no
> more, with perf using monotonic clock as the clock source. It will come
> back when we get into hardware trace correlation, but one step at a
> time...

Ok. I'm eager to see this settled (and the current approach I don't
have any objections to, although I'm not paying super close attention
now that its all in the perf core).   I know this has taken far longer
then you'd have liked, so thanks for your persistence!

-john

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
@ 2015-01-21 19:48         ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-21 19:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Wed, 2015-01-21 at 15:52 +0000, Pawel Moll wrote:
> On Mon, 2015-01-05 at 13:00 +0000, Peter Zijlstra wrote:
> > On Thu, Nov 06, 2014 at 04:51:56PM +0000, Pawel Moll wrote:
> > >  Documentation/kernel-parameters.txt |  9 +++++++++
> > >  kernel/events/core.c                | 37 +++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 46 insertions(+)
> > 
> > > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > > index 4c81a86..8ead8d8 100644
> > > --- a/Documentation/kernel-parameters.txt
> > > +++ b/Documentation/kernel-parameters.txt
> > 
> > > @@ -2763,6 +2764,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> > >  			allocator.  This parameter is primarily	for debugging
> > >  			and performance comparison.
> > >  
> > > +	perf_use_local_clock
> > > +			[PERF]
> > > +			Use local_clock() as a source for perf timestamps
> > > +			generation. This was be the default behaviour and
> > > +			this parameter can be used to maintain backward
> > > +			compatibility or on older hardware with expensive
> > > +			monotonic clock source.
> > > +
> > >  	pf.		[PARIDE]
> > >  			See Documentation/blockdev/paride.txt.
> > 
> > So I'm always terminally confused on the naming of kernel parameters,
> > sometimes things are modules (even when they're not actually =m capable)
> > and get a module::foo naming or so and sometimes they're not.
> 
> I guess you mean module.foo?
> 
> > So we want to use the module naming thing or not?
> 
> Honestly, I don't mind either way. For one thing ftrace doesn't bother
> and just uses __setup() as well.

There's one more thing to this - as far as I remember, the module name
is actually derived from the compilation unit name (at some level of
Kbuild). I may be wrong (will have to double check), but a module
parameter defined in kernel/events/core.c may be called something like
"core.parameter" ;-).

Pawel


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
@ 2015-01-21 19:48         ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-21 19:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Wed, 2015-01-21 at 15:52 +0000, Pawel Moll wrote:
> On Mon, 2015-01-05 at 13:00 +0000, Peter Zijlstra wrote:
> > On Thu, Nov 06, 2014 at 04:51:56PM +0000, Pawel Moll wrote:
> > >  Documentation/kernel-parameters.txt |  9 +++++++++
> > >  kernel/events/core.c                | 37 +++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 46 insertions(+)
> > 
> > > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > > index 4c81a86..8ead8d8 100644
> > > --- a/Documentation/kernel-parameters.txt
> > > +++ b/Documentation/kernel-parameters.txt
> > 
> > > @@ -2763,6 +2764,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> > >  			allocator.  This parameter is primarily	for debugging
> > >  			and performance comparison.
> > >  
> > > +	perf_use_local_clock
> > > +			[PERF]
> > > +			Use local_clock() as a source for perf timestamps
> > > +			generation. This was be the default behaviour and
> > > +			this parameter can be used to maintain backward
> > > +			compatibility or on older hardware with expensive
> > > +			monotonic clock source.
> > > +
> > >  	pf.		[PARIDE]
> > >  			See Documentation/blockdev/paride.txt.
> > 
> > So I'm always terminally confused on the naming of kernel parameters,
> > sometimes things are modules (even when they're not actually =m capable)
> > and get a module::foo naming or so and sometimes they're not.
> 
> I guess you mean module.foo?
> 
> > So we want to use the module naming thing or not?
> 
> Honestly, I don't mind either way. For one thing ftrace doesn't bother
> and just uses __setup() as well.

There's one more thing to this - as far as I remember, the module name
is actually derived from the compilation unit name (at some level of
Kbuild). I may be wrong (will have to double check), but a module
parameter defined in kernel/events/core.c may be called something like
"core.parameter" ;-).

Pawel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
@ 2015-01-21 20:07           ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-21 20:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Wed, 2015-01-21 at 19:48 +0000, Pawel Moll wrote:
> On Wed, 2015-01-21 at 15:52 +0000, Pawel Moll wrote:
> > On Mon, 2015-01-05 at 13:00 +0000, Peter Zijlstra wrote:
> > > On Thu, Nov 06, 2014 at 04:51:56PM +0000, Pawel Moll wrote:
> > > >  Documentation/kernel-parameters.txt |  9 +++++++++
> > > >  kernel/events/core.c                | 37 +++++++++++++++++++++++++++++++++++++
> > > >  2 files changed, 46 insertions(+)
> > > 
> > > > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > > > index 4c81a86..8ead8d8 100644
> > > > --- a/Documentation/kernel-parameters.txt
> > > > +++ b/Documentation/kernel-parameters.txt
> > > 
> > > > @@ -2763,6 +2764,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> > > >  			allocator.  This parameter is primarily	for debugging
> > > >  			and performance comparison.
> > > >  
> > > > +	perf_use_local_clock
> > > > +			[PERF]
> > > > +			Use local_clock() as a source for perf timestamps
> > > > +			generation. This was be the default behaviour and
> > > > +			this parameter can be used to maintain backward
> > > > +			compatibility or on older hardware with expensive
> > > > +			monotonic clock source.
> > > > +
> > > >  	pf.		[PARIDE]
> > > >  			See Documentation/blockdev/paride.txt.
> > > 
> > > So I'm always terminally confused on the naming of kernel parameters,
> > > sometimes things are modules (even when they're not actually =m capable)
> > > and get a module::foo naming or so and sometimes they're not.
> > 
> > I guess you mean module.foo?
> > 
> > > So we want to use the module naming thing or not?
> > 
> > Honestly, I don't mind either way. For one thing ftrace doesn't bother
> > and just uses __setup() as well.
> 
> There's one more thing to this - as far as I remember, the module name
> is actually derived from the compilation unit name (at some level of
> Kbuild). I may be wrong (will have to double check), but a module
> parameter defined in kernel/events/core.c may be called something like
> "core.parameter" ;-).

Ok, so it's possible to enforce "perf." prefix:

/* You can override this manually, but generally this should match the
   module name. */
#ifdef MODULE
#define MODULE_PARAM_PREFIX /* empty */
#else
#define MODULE_PARAM_PREFIX KBUILD_MODNAME "."
#endif

So, perf_use_local_clock or perf.use_local_clock? :-)

Pawel


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps
@ 2015-01-21 20:07           ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-21 20:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Wed, 2015-01-21 at 19:48 +0000, Pawel Moll wrote:
> On Wed, 2015-01-21 at 15:52 +0000, Pawel Moll wrote:
> > On Mon, 2015-01-05 at 13:00 +0000, Peter Zijlstra wrote:
> > > On Thu, Nov 06, 2014 at 04:51:56PM +0000, Pawel Moll wrote:
> > > >  Documentation/kernel-parameters.txt |  9 +++++++++
> > > >  kernel/events/core.c                | 37 +++++++++++++++++++++++++++++++++++++
> > > >  2 files changed, 46 insertions(+)
> > > 
> > > > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > > > index 4c81a86..8ead8d8 100644
> > > > --- a/Documentation/kernel-parameters.txt
> > > > +++ b/Documentation/kernel-parameters.txt
> > > 
> > > > @@ -2763,6 +2764,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> > > >  			allocator.  This parameter is primarily	for debugging
> > > >  			and performance comparison.
> > > >  
> > > > +	perf_use_local_clock
> > > > +			[PERF]
> > > > +			Use local_clock() as a source for perf timestamps
> > > > +			generation. This was be the default behaviour and
> > > > +			this parameter can be used to maintain backward
> > > > +			compatibility or on older hardware with expensive
> > > > +			monotonic clock source.
> > > > +
> > > >  	pf.		[PARIDE]
> > > >  			See Documentation/blockdev/paride.txt.
> > > 
> > > So I'm always terminally confused on the naming of kernel parameters,
> > > sometimes things are modules (even when they're not actually =m capable)
> > > and get a module::foo naming or so and sometimes they're not.
> > 
> > I guess you mean module.foo?
> > 
> > > So we want to use the module naming thing or not?
> > 
> > Honestly, I don't mind either way. For one thing ftrace doesn't bother
> > and just uses __setup() as well.
> 
> There's one more thing to this - as far as I remember, the module name
> is actually derived from the compilation unit name (at some level of
> Kbuild). I may be wrong (will have to double check), but a module
> parameter defined in kernel/events/core.c may be called something like
> "core.parameter" ;-).

Ok, so it's possible to enforce "perf." prefix:

/* You can override this manually, but generally this should match the
   module name. */
#ifdef MODULE
#define MODULE_PARAM_PREFIX /* empty */
#else
#define MODULE_PARAM_PREFIX KBUILD_MODNAME "."
#endif

So, perf_use_local_clock or perf.use_local_clock? :-)

Pawel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-01-21 20:27     ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-21 20:27 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel, linux-api, Pawel Moll

Until now, perf framework never defined the meaning of the timestamps
captured as PERF_SAMPLE_TIME sample type. The values were obtaining
from local (sched) clock, which is unavailable in userspace. This made
it impossible to correlate perf data with any other events. Other
tracing solutions have the source configurable (ftrace) or just share
a common time domain between kernel and userspace (LTTng).

Follow the trend by using monotonic clock, which is readily available
as POSIX CLOCK_MONOTONIC.

Also add a sysctl "perf_sample_time_clk_id" attribute (usually available
as "/proc/sys/kernel/perf_sample_time_clk_id") which can be used by the
user to obtain the clk_id to be used with POSIX clock API (eg.
clock_gettime()) to obtain a time value comparable with perf samples.

Old behaviour can be restored by using "perf_use_local_clock" kernel
parameter.

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---
Changes since v4:

- using jump labels to reduce runtime overhead of the local/mono check

 Documentation/kernel-parameters.txt |  9 ++++++++
 kernel/events/core.c                | 43 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4df73da..3cccd0e 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -91,6 +91,7 @@ parameter is applicable:
 	NUMA	NUMA support is enabled.
 	NFS	Appropriate NFS support is enabled.
 	OSS	OSS sound support is enabled.
+	PERF	Performance events and counters support is enabled.
 	PV_OPS	A paravirtualized kernel is enabled.
 	PARIDE	The ParIDE (parallel port IDE) subsystem is enabled.
 	PARISC	The PA-RISC architecture is enabled.
@@ -2795,6 +2796,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			allocator.  This parameter is primarily	for debugging
 			and performance comparison.
 
+	perf_use_local_clock
+			[PERF]
+			Use local_clock() as a source for perf timestamps
+			generation. This was be the default behaviour and
+			this parameter can be used to maintain backward
+			compatibility or on older hardware with expensive
+			monotonic clock source.
+
 	pf.		[PARIDE]
 			See Documentation/blockdev/paride.txt.
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 882f835..f45434d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -42,6 +42,8 @@
 #include <linux/module.h>
 #include <linux/mman.h>
 #include <linux/compat.h>
+#include <linux/sysctl.h>
+#include <linux/jump_label.h>
 
 #include "internal.h"
 
@@ -322,9 +324,43 @@ extern __weak const char *perf_pmu_name(void)
 	return "pmu";
 }
 
+static struct static_key perf_use_local_clock_key = STATIC_KEY_INIT_FALSE;
+static bool perf_use_local_clock_param __initdata;
+static int __init perf_use_local_clock_setup(char *__unused)
+{
+	perf_use_local_clock_param = true;
+	return 1;
+}
+__setup("perf_use_local_clock", perf_use_local_clock_setup);
+
+static int sysctl_perf_sample_time_clk_id = CLOCK_MONOTONIC;
+
+static struct ctl_table perf_sample_time_kern_table[] = {
+	{
+		.procname       = "perf_sample_time_clk_id",
+		.data           = &sysctl_perf_sample_time_clk_id,
+		.maxlen         = sizeof(int),
+		.mode           = 0444,
+		.proc_handler   = proc_dointvec,
+	},
+	{}
+};
+
+static struct ctl_table perf_sample_time_root_table[] = {
+	{
+		.procname	= "kernel",
+		.mode		= 0555,
+		.child		= perf_sample_time_kern_table,
+	},
+	{}
+};
+
 static inline u64 perf_clock(void)
 {
-	return local_clock();
+	if (static_key_false(&perf_use_local_clock_key))
+		return local_clock();
+	else
+		return ktime_get_mono_fast_ns();
 }
 
 static inline struct perf_cpu_context *
@@ -8271,6 +8307,11 @@ void __init perf_event_init(void)
 	 */
 	BUILD_BUG_ON((offsetof(struct perf_event_mmap_page, data_head))
 		     != 1024);
+
+	if (perf_use_local_clock_param)
+		static_key_slow_inc(&perf_use_local_clock_key);
+	else
+		register_sysctl_table(perf_sample_time_root_table);
 }
 
 static int __init perf_event_sysfs_init(void)
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-01-21 20:27     ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-21 20:27 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Pawel Moll

Until now, perf framework never defined the meaning of the timestamps
captured as PERF_SAMPLE_TIME sample type. The values were obtaining
from local (sched) clock, which is unavailable in userspace. This made
it impossible to correlate perf data with any other events. Other
tracing solutions have the source configurable (ftrace) or just share
a common time domain between kernel and userspace (LTTng).

Follow the trend by using monotonic clock, which is readily available
as POSIX CLOCK_MONOTONIC.

Also add a sysctl "perf_sample_time_clk_id" attribute (usually available
as "/proc/sys/kernel/perf_sample_time_clk_id") which can be used by the
user to obtain the clk_id to be used with POSIX clock API (eg.
clock_gettime()) to obtain a time value comparable with perf samples.

Old behaviour can be restored by using "perf_use_local_clock" kernel
parameter.

Signed-off-by: Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org>
---
Changes since v4:

- using jump labels to reduce runtime overhead of the local/mono check

 Documentation/kernel-parameters.txt |  9 ++++++++
 kernel/events/core.c                | 43 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4df73da..3cccd0e 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -91,6 +91,7 @@ parameter is applicable:
 	NUMA	NUMA support is enabled.
 	NFS	Appropriate NFS support is enabled.
 	OSS	OSS sound support is enabled.
+	PERF	Performance events and counters support is enabled.
 	PV_OPS	A paravirtualized kernel is enabled.
 	PARIDE	The ParIDE (parallel port IDE) subsystem is enabled.
 	PARISC	The PA-RISC architecture is enabled.
@@ -2795,6 +2796,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			allocator.  This parameter is primarily	for debugging
 			and performance comparison.
 
+	perf_use_local_clock
+			[PERF]
+			Use local_clock() as a source for perf timestamps
+			generation. This was be the default behaviour and
+			this parameter can be used to maintain backward
+			compatibility or on older hardware with expensive
+			monotonic clock source.
+
 	pf.		[PARIDE]
 			See Documentation/blockdev/paride.txt.
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 882f835..f45434d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -42,6 +42,8 @@
 #include <linux/module.h>
 #include <linux/mman.h>
 #include <linux/compat.h>
+#include <linux/sysctl.h>
+#include <linux/jump_label.h>
 
 #include "internal.h"
 
@@ -322,9 +324,43 @@ extern __weak const char *perf_pmu_name(void)
 	return "pmu";
 }
 
+static struct static_key perf_use_local_clock_key = STATIC_KEY_INIT_FALSE;
+static bool perf_use_local_clock_param __initdata;
+static int __init perf_use_local_clock_setup(char *__unused)
+{
+	perf_use_local_clock_param = true;
+	return 1;
+}
+__setup("perf_use_local_clock", perf_use_local_clock_setup);
+
+static int sysctl_perf_sample_time_clk_id = CLOCK_MONOTONIC;
+
+static struct ctl_table perf_sample_time_kern_table[] = {
+	{
+		.procname       = "perf_sample_time_clk_id",
+		.data           = &sysctl_perf_sample_time_clk_id,
+		.maxlen         = sizeof(int),
+		.mode           = 0444,
+		.proc_handler   = proc_dointvec,
+	},
+	{}
+};
+
+static struct ctl_table perf_sample_time_root_table[] = {
+	{
+		.procname	= "kernel",
+		.mode		= 0555,
+		.child		= perf_sample_time_kern_table,
+	},
+	{}
+};
+
 static inline u64 perf_clock(void)
 {
-	return local_clock();
+	if (static_key_false(&perf_use_local_clock_key))
+		return local_clock();
+	else
+		return ktime_get_mono_fast_ns();
 }
 
 static inline struct perf_cpu_context *
@@ -8271,6 +8307,11 @@ void __init perf_event_init(void)
 	 */
 	BUILD_BUG_ON((offsetof(struct perf_event_mmap_page, data_head))
 		     != 1024);
+
+	if (perf_use_local_clock_param)
+		static_key_slow_inc(&perf_use_local_clock_key);
+	else
+		register_sysctl_table(perf_sample_time_root_table);
 }
 
 static int __init perf_event_sysfs_init(void)
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 3/3] perf: Sample additional clock value
@ 2015-01-23 17:06       ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-23 17:06 UTC (permalink / raw)
  To: Peter Zijlstra, John Stultz
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Mon, 2015-01-05 at 13:45 +0000, Peter Zijlstra wrote:
> Also, one would expect something like:
> 
> 	default: {
> 		struct k_clock *kc = clockid_to_kclock(event->attr.clock);
> 		struct timespec ts;
> 		if (kc) {
> 			kc->clock_get(event->attr.clock, &ts);
> 			data->clock = ktime_to_ns(timespec_to_ktime(ts));
> 		} else {
> 			data->clock = 0;
> 		}
> 	}
> 
> Albeit preferably slightly less horrible -- of course, one would first
> need to deal with the NMI issue.

I was thinking about it... Maybe the solution is approaching the problem
in a completely different way.

As far as I understand (John?) POSIX timers can be used on any clockid?
So it would be possible to obtain a dynamic clock id, for example for my
exotic trace hardware (by any means necessary, like opening a char
device) and create a timer firing every 1 ms (in the trace time domain).
Than this event would be somehow associated with a perf session (for
example, by passing the timerid via perf's ioctl) and then, every when
timer fires, a perf record (something like PERF_RECORD_TIMER?)
containing the timer/clock's value *and* the normal perf timestamp,
would be injected into the circular buffer.

No issue with NMI, no issue with passing clockid through
perf_event_attr...

Does it make any sense to anyone else but me? ;-)

Pawel


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 3/3] perf: Sample additional clock value
@ 2015-01-23 17:06       ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-01-23 17:06 UTC (permalink / raw)
  To: Peter Zijlstra, John Stultz
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Mon, 2015-01-05 at 13:45 +0000, Peter Zijlstra wrote:
> Also, one would expect something like:
> 
> 	default: {
> 		struct k_clock *kc = clockid_to_kclock(event->attr.clock);
> 		struct timespec ts;
> 		if (kc) {
> 			kc->clock_get(event->attr.clock, &ts);
> 			data->clock = ktime_to_ns(timespec_to_ktime(ts));
> 		} else {
> 			data->clock = 0;
> 		}
> 	}
> 
> Albeit preferably slightly less horrible -- of course, one would first
> need to deal with the NMI issue.

I was thinking about it... Maybe the solution is approaching the problem
in a completely different way.

As far as I understand (John?) POSIX timers can be used on any clockid?
So it would be possible to obtain a dynamic clock id, for example for my
exotic trace hardware (by any means necessary, like opening a char
device) and create a timer firing every 1 ms (in the trace time domain).
Than this event would be somehow associated with a perf session (for
example, by passing the timerid via perf's ioctl) and then, every when
timer fires, a perf record (something like PERF_RECORD_TIMER?)
containing the timer/clock's value *and* the normal perf timestamp,
would be injected into the circular buffer.

No issue with NMI, no issue with passing clockid through
perf_event_attr...

Does it make any sense to anyone else but me? ;-)

Pawel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 3/3] perf: Sample additional clock value
@ 2015-01-23 18:05         ` David Ahern
  0 siblings, 0 replies; 58+ messages in thread
From: David Ahern @ 2015-01-23 18:05 UTC (permalink / raw)
  To: Pawel Moll, Peter Zijlstra, John Stultz
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, Thomas Gleixner,
	Tomeu Vizoso, linux-kernel, linux-api

On 1/23/15 10:06 AM, Pawel Moll wrote:
> As far as I understand (John?) POSIX timers can be used on any clockid?
> So it would be possible to obtain a dynamic clock id, for example for my
> exotic trace hardware (by any means necessary, like opening a char
> device) and create a timer firing every 1 ms (in the trace time domain).
> Than this event would be somehow associated with a perf session (for
> example, by passing the timerid via perf's ioctl) and then, every when
> timer fires, a perf record (something like PERF_RECORD_TIMER?)
> containing the timer/clock's value*and*  the normal perf timestamp,
> would be injected into the circular buffer.

Like this: https://lkml.org/lkml/2011/2/27/158 ? note the date -- 4 
years ago. This is has been dragging on for a long time.

A few problems with that approach:
1. I would like to see a sample generated immediately to get the 
perf_clock -> timeofday correlation immediately rather than have to wait 
N (milli)seconds and have perf scan forward through an M-(giga)byte file 
looking for the one sample that gives the correlation.

I tried to address that problem with an ioctl to force a sample into the 
stream:
    https://lkml.org/lkml/2011/2/27/159
it did not go over very well.

2. there is a risk that the realtime samples dominate a stream.

Another issue that has been raised is updates to xtime by ntp / user. I 
have suggested tracepoints to catch those:
     https://lkml.org/lkml/2011/6/7/636
I don't believe there were ever any comments on the tracepoints.

David

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 3/3] perf: Sample additional clock value
@ 2015-01-23 18:05         ` David Ahern
  0 siblings, 0 replies; 58+ messages in thread
From: David Ahern @ 2015-01-23 18:05 UTC (permalink / raw)
  To: Pawel Moll, Peter Zijlstra, John Stultz
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, Thomas Gleixner,
	Tomeu Vizoso, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On 1/23/15 10:06 AM, Pawel Moll wrote:
> As far as I understand (John?) POSIX timers can be used on any clockid?
> So it would be possible to obtain a dynamic clock id, for example for my
> exotic trace hardware (by any means necessary, like opening a char
> device) and create a timer firing every 1 ms (in the trace time domain).
> Than this event would be somehow associated with a perf session (for
> example, by passing the timerid via perf's ioctl) and then, every when
> timer fires, a perf record (something like PERF_RECORD_TIMER?)
> containing the timer/clock's value*and*  the normal perf timestamp,
> would be injected into the circular buffer.

Like this: https://lkml.org/lkml/2011/2/27/158 ? note the date -- 4 
years ago. This is has been dragging on for a long time.

A few problems with that approach:
1. I would like to see a sample generated immediately to get the 
perf_clock -> timeofday correlation immediately rather than have to wait 
N (milli)seconds and have perf scan forward through an M-(giga)byte file 
looking for the one sample that gives the correlation.

I tried to address that problem with an ioctl to force a sample into the 
stream:
    https://lkml.org/lkml/2011/2/27/159
it did not go over very well.

2. there is a risk that the realtime samples dominate a stream.

Another issue that has been raised is updates to xtime by ntp / user. I 
have suggested tracepoints to catch those:
     https://lkml.org/lkml/2011/6/7/636
I don't believe there were ever any comments on the tracepoints.

David

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-02 16:52       ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-02-02 16:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

Afternoon, Peter,

On Wed, 2015-01-21 at 20:27 +0000, Pawel Moll wrote:
> Until now, perf framework never defined the meaning of the timestamps
> captured as PERF_SAMPLE_TIME sample type. The values were obtaining
> from local (sched) clock, which is unavailable in userspace. This made
> it impossible to correlate perf data with any other events. Other
> tracing solutions have the source configurable (ftrace) or just share
> a common time domain between kernel and userspace (LTTng).
> 
> Follow the trend by using monotonic clock, which is readily available
> as POSIX CLOCK_MONOTONIC.
> 
> Also add a sysctl "perf_sample_time_clk_id" attribute (usually available
> as "/proc/sys/kernel/perf_sample_time_clk_id") which can be used by the
> user to obtain the clk_id to be used with POSIX clock API (eg.
> clock_gettime()) to obtain a time value comparable with perf samples.
> 
> Old behaviour can be restored by using "perf_use_local_clock" kernel
> parameter.
> 
> Signed-off-by: Pawel Moll <pawel.moll@arm.com>

I know that you're busy with other stuff, but it's already rc7 time
again... We can leave the other two patches from the series for later,
but how about getting this one merged for 3.20 and ending the 2 or 3
years long struggle? I'm not saying that everyone is happy about it, but
no one seems to be unhappy enough to speak :-)

Cheers!

Pawel


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-02 16:52       ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-02-02 16:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

Afternoon, Peter,

On Wed, 2015-01-21 at 20:27 +0000, Pawel Moll wrote:
> Until now, perf framework never defined the meaning of the timestamps
> captured as PERF_SAMPLE_TIME sample type. The values were obtaining
> from local (sched) clock, which is unavailable in userspace. This made
> it impossible to correlate perf data with any other events. Other
> tracing solutions have the source configurable (ftrace) or just share
> a common time domain between kernel and userspace (LTTng).
> 
> Follow the trend by using monotonic clock, which is readily available
> as POSIX CLOCK_MONOTONIC.
> 
> Also add a sysctl "perf_sample_time_clk_id" attribute (usually available
> as "/proc/sys/kernel/perf_sample_time_clk_id") which can be used by the
> user to obtain the clk_id to be used with POSIX clock API (eg.
> clock_gettime()) to obtain a time value comparable with perf samples.
> 
> Old behaviour can be restored by using "perf_use_local_clock" kernel
> parameter.
> 
> Signed-off-by: Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org>

I know that you're busy with other stuff, but it's already rc7 time
again... We can leave the other two patches from the series for later,
but how about getting this one merged for 3.20 and ending the 2 or 3
years long struggle? I'm not saying that everyone is happy about it, but
no one seems to be unhappy enough to speak :-)

Cheers!

Pawel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-03  9:20           ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-02-03  9:20 UTC (permalink / raw)
  To: ajh mls
  Cc: Peter Zijlstra, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso, linux-kernel,
	linux-api, adrian.hunter

On Tue, 2015-02-03 at 08:30 +0000, ajh mls wrote:
> There is still
>
> http://marc.info/?l=linux-kernel&m=142141223902303

Uh. I have no idea why, but I haven't got this mail at all :-(

Thanks for pointing it out!

Pawel



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-03  9:20           ` Pawel Moll
  0 siblings, 0 replies; 58+ messages in thread
From: Pawel Moll @ 2015-02-03  9:20 UTC (permalink / raw)
  To: ajh mls
  Cc: Peter Zijlstra, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	adrian.hunter-ral2JQCrhuEAvxtiuMwx3w

On Tue, 2015-02-03 at 08:30 +0000, ajh mls wrote:
> There is still
>
> http://marc.info/?l=linux-kernel&m=142141223902303

Uh. I have no idea why, but I haven't got this mail at all :-(

Thanks for pointing it out!

Pawel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-11 16:12             ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2015-02-11 16:12 UTC (permalink / raw)
  To: Pawel Moll
  Cc: ajh mls, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso, linux-kernel,
	linux-api, adrian.hunter


How about something like the below? I _think_ it should mostly work for
x86, where the tsc is a 64bit wide cycle counter.

I suppose we should extend the perf userpage time data with
time_last_cycle and time_mask if/when we want to make this work on
something with a short counter.

Of course, at that time we also need to somehow deal with that counter
wrapping, its hardly practical to go iterate all possible userpg
instances from a timer handler.


---
 Documentation/kernel-parameters.txt |  9 +++++++
 arch/x86/kernel/cpu/perf_event.c    | 44 ++++++++++++++++++++++++---------
 include/linux/perf_event.h          |  6 +++++
 kernel/events/core.c                | 49 ++++++++++++++++++++++++++++++++++---
 kernel/time/timekeeping.c           | 30 +++++++++++++++++++++++
 5 files changed, 123 insertions(+), 15 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 176d4fe4f076..52255676b6e2 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -91,6 +91,7 @@ the beginning of each description states the restrictions within which a
 	NUMA	NUMA support is enabled.
 	NFS	Appropriate NFS support is enabled.
 	OSS	OSS sound support is enabled.
+	PERF	Performance events and counters support is enabled.
 	PV_OPS	A paravirtualized kernel is enabled.
 	PARIDE	The ParIDE (parallel port IDE) subsystem is enabled.
 	PARISC	The PA-RISC architecture is enabled.
@@ -2796,6 +2797,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			allocator.  This parameter is primarily	for debugging
 			and performance comparison.
 
+	perf_use_local_clock
+			[PERF]
+			Use local_clock() as a source for perf timestamps
+			generation. This was be the default behaviour and
+			this parameter can be used to maintain backward
+			compatibility or on older hardware with expensive
+			monotonic clock source.
+
 	pf.		[PARIDE]
 			See Documentation/blockdev/paride.txt.
 
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index b71a7f86d68a..436a66632f76 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1952,6 +1952,35 @@ static struct pmu pmu = {
 	.flush_branch_stack	= x86_pmu_flush_branch_stack,
 };
 
+static void local_clock_user_time(struct perf_event_mmap_page *userpg, u64 now)
+{
+	data = cyc2ns_read_begin();
+
+	userpg->cap_user_time = 1;
+	userpg->time_mult = data->cyc2ns_mul;
+	userpg->time_shift = data->cyc2ns_shift;
+	userpg->time_offset = data->cyc2ns_offset - now;
+
+	userpg->cap_user_time_zero = 1;
+	userpg->time_zero = data->cyc2ns_offset;
+
+	cyc2ns_read_end(data);
+}
+
+extern void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift);
+
+static void ktime_fast_mono_user_time(struct perf_event_mmap_page *userpg, u64 now)
+{
+	userpg->cap_user_time = 1;
+	userpg->cap_user_time_zero = 1;
+
+	__ktime_get_mono_fast(&userpg->time_zero,
+			      &userpg->time_mult,
+			      &userpg->time_shift);
+
+	userpg->offset = userpg->time_zero - now;
+}
+
 void arch_perf_update_userpage(struct perf_event *event,
 			       struct perf_event_mmap_page *userpg, u64 now)
 {
@@ -1966,17 +1995,10 @@ void arch_perf_update_userpage(struct perf_event *event,
 	if (!sched_clock_stable())
 		return;
 
-	data = cyc2ns_read_begin();
-
-	userpg->cap_user_time = 1;
-	userpg->time_mult = data->cyc2ns_mul;
-	userpg->time_shift = data->cyc2ns_shift;
-	userpg->time_offset = data->cyc2ns_offset - now;
-
-	userpg->cap_user_time_zero = 1;
-	userpg->time_zero = data->cyc2ns_offset;
-
-	cyc2ns_read_end(data);
+	if (static_key_false(&perf_use_local_clock_key))
+		local_clock_user_time(userpg, now);
+	else
+		ktime_fast_mono_user_time(userpg, now);
 }
 
 /*
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 33262004c310..1d61f968113a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -561,6 +561,12 @@ extern void perf_pmu_enable(struct pmu *pmu);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);
 extern int perf_event_refresh(struct perf_event *event, int refresh);
+
+extern struct static_key perf_use_local_clock_key = STATIC_KEY_INIT_FALSE;
+extern void __weak
+arch_perf_update_userpage(struct perf_event *event,
+			  struct perf_event_mmap_page *userpg, u64 now);
+
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
 extern struct perf_event *
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 13209a90b751..7bad385103ea 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -42,6 +42,8 @@
 #include <linux/module.h>
 #include <linux/mman.h>
 #include <linux/compat.h>
+#include <linux/sysctl.h>
+#include <linux/jump_label.h>
 
 #include "internal.h"
 
@@ -322,9 +324,43 @@ extern __weak const char *perf_pmu_name(void)
 	return "pmu";
 }
 
+struct static_key perf_use_local_clock_key = STATIC_KEY_INIT_FALSE;
+static bool perf_use_local_clock_param __initdata;
+static int __init perf_use_local_clock_setup(char *__unused)
+{
+	perf_use_local_clock_param = true;
+	return 1;
+}
+__setup("perf_use_local_clock", perf_use_local_clock_setup);
+
+static int sysctl_perf_sample_time_clk_id = CLOCK_MONOTONIC;
+
+static struct ctl_table perf_sample_time_kern_table[] = {
+	{
+		.procname       = "perf_sample_time_clk_id",
+		.data           = &sysctl_perf_sample_time_clk_id,
+		.maxlen         = sizeof(int),
+		.mode           = 0444,
+		.proc_handler   = proc_dointvec,
+	},
+	{}
+};
+
+static struct ctl_table perf_sample_time_root_table[] = {
+	{
+		.procname	= "kernel",
+		.mode		= 0555,
+		.child		= perf_sample_time_kern_table,
+	},
+	{}
+};
+
 static inline u64 perf_clock(void)
 {
-	return local_clock();
+	if (static_key_false(&perf_use_local_clock_key))
+		return local_clock();
+	else
+		return ktime_get_mono_fast_ns();
 }
 
 static inline struct perf_cpu_context *
@@ -4101,8 +4137,8 @@ static void perf_event_init_userpage(struct perf_event *event)
 	rcu_read_unlock();
 }
 
-void __weak arch_perf_update_userpage(
-	struct perf_event *event, struct perf_event_mmap_page *userpg, u64 now)
+void __weak arch_perf_update_userpage(struct perf_event *event,
+				      struct perf_event_mmap_page *userpg, u64 now)
 {
 }
 
@@ -4487,7 +4523,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	if (vma->vm_flags & VM_WRITE)
 		flags |= RING_BUFFER_WRITABLE;
 
-	rb = rb_alloc(nr_pages, 
+	rb = rb_alloc(nr_pages,
 		event->attr.watermark ? event->attr.wakeup_watermark : 0,
 		event->cpu, flags);
 
@@ -8516,6 +8552,11 @@ void __init perf_event_init(void)
 	 */
 	BUILD_BUG_ON((offsetof(struct perf_event_mmap_page, data_head))
 		     != 1024);
+
+	if (perf_use_local_clock_param)
+		static_key_slow_inc(&perf_use_local_clock_key);
+	else
+		register_sysctl_table(perf_sample_time_root_table);
 }
 
 static int __init perf_event_sysfs_init(void)
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index b124af259800..37bed5931a91 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -334,6 +334,36 @@ u64 notrace ktime_get_mono_fast_ns(void)
 }
 EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns);
 
+void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift)
+{
+	struct tk_read_base *tkr;
+	unsigned int seq;
+	cycle_t cycle_now, delta;
+	u64 nsecs, now;
+
+	do {
+		seq = raw_read_seqcount(&tk_fast_mono.seq);
+		tkr = tk_fast_mono.base + (seq & 0x01);
+
+		cycle_now = tkr->read(tkr->clock);
+		delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);
+
+		nsec = delta * tkr->mult + tkr->xtime_nsec;
+		nsec >>= tkr->shift;
+		nsec += arch_gettimeoffset();
+
+		now = ktime_to_ns(tkr->base_mono) + nsec;
+
+		*mult = tkr->mult;
+		*shift = tkr->shift;
+
+		nsec = mul_u64_u32_shr(cycle_now, tkr->mult, tkr->shift);
+
+		*offset = now - nsec;
+
+	} while (read_seqcount_retry(&tk_fast_mono.seq, seq));
+}
+
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL_OLD
 
 static inline void update_vsyscall(struct timekeeper *tk)

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-11 16:12             ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2015-02-11 16:12 UTC (permalink / raw)
  To: Pawel Moll
  Cc: ajh mls, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	adrian.hunter-ral2JQCrhuEAvxtiuMwx3w


How about something like the below? I _think_ it should mostly work for
x86, where the tsc is a 64bit wide cycle counter.

I suppose we should extend the perf userpage time data with
time_last_cycle and time_mask if/when we want to make this work on
something with a short counter.

Of course, at that time we also need to somehow deal with that counter
wrapping, its hardly practical to go iterate all possible userpg
instances from a timer handler.


---
 Documentation/kernel-parameters.txt |  9 +++++++
 arch/x86/kernel/cpu/perf_event.c    | 44 ++++++++++++++++++++++++---------
 include/linux/perf_event.h          |  6 +++++
 kernel/events/core.c                | 49 ++++++++++++++++++++++++++++++++++---
 kernel/time/timekeeping.c           | 30 +++++++++++++++++++++++
 5 files changed, 123 insertions(+), 15 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 176d4fe4f076..52255676b6e2 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -91,6 +91,7 @@ the beginning of each description states the restrictions within which a
 	NUMA	NUMA support is enabled.
 	NFS	Appropriate NFS support is enabled.
 	OSS	OSS sound support is enabled.
+	PERF	Performance events and counters support is enabled.
 	PV_OPS	A paravirtualized kernel is enabled.
 	PARIDE	The ParIDE (parallel port IDE) subsystem is enabled.
 	PARISC	The PA-RISC architecture is enabled.
@@ -2796,6 +2797,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			allocator.  This parameter is primarily	for debugging
 			and performance comparison.
 
+	perf_use_local_clock
+			[PERF]
+			Use local_clock() as a source for perf timestamps
+			generation. This was be the default behaviour and
+			this parameter can be used to maintain backward
+			compatibility or on older hardware with expensive
+			monotonic clock source.
+
 	pf.		[PARIDE]
 			See Documentation/blockdev/paride.txt.
 
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index b71a7f86d68a..436a66632f76 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1952,6 +1952,35 @@ static struct pmu pmu = {
 	.flush_branch_stack	= x86_pmu_flush_branch_stack,
 };
 
+static void local_clock_user_time(struct perf_event_mmap_page *userpg, u64 now)
+{
+	data = cyc2ns_read_begin();
+
+	userpg->cap_user_time = 1;
+	userpg->time_mult = data->cyc2ns_mul;
+	userpg->time_shift = data->cyc2ns_shift;
+	userpg->time_offset = data->cyc2ns_offset - now;
+
+	userpg->cap_user_time_zero = 1;
+	userpg->time_zero = data->cyc2ns_offset;
+
+	cyc2ns_read_end(data);
+}
+
+extern void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift);
+
+static void ktime_fast_mono_user_time(struct perf_event_mmap_page *userpg, u64 now)
+{
+	userpg->cap_user_time = 1;
+	userpg->cap_user_time_zero = 1;
+
+	__ktime_get_mono_fast(&userpg->time_zero,
+			      &userpg->time_mult,
+			      &userpg->time_shift);
+
+	userpg->offset = userpg->time_zero - now;
+}
+
 void arch_perf_update_userpage(struct perf_event *event,
 			       struct perf_event_mmap_page *userpg, u64 now)
 {
@@ -1966,17 +1995,10 @@ void arch_perf_update_userpage(struct perf_event *event,
 	if (!sched_clock_stable())
 		return;
 
-	data = cyc2ns_read_begin();
-
-	userpg->cap_user_time = 1;
-	userpg->time_mult = data->cyc2ns_mul;
-	userpg->time_shift = data->cyc2ns_shift;
-	userpg->time_offset = data->cyc2ns_offset - now;
-
-	userpg->cap_user_time_zero = 1;
-	userpg->time_zero = data->cyc2ns_offset;
-
-	cyc2ns_read_end(data);
+	if (static_key_false(&perf_use_local_clock_key))
+		local_clock_user_time(userpg, now);
+	else
+		ktime_fast_mono_user_time(userpg, now);
 }
 
 /*
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 33262004c310..1d61f968113a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -561,6 +561,12 @@ extern void perf_pmu_enable(struct pmu *pmu);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);
 extern int perf_event_refresh(struct perf_event *event, int refresh);
+
+extern struct static_key perf_use_local_clock_key = STATIC_KEY_INIT_FALSE;
+extern void __weak
+arch_perf_update_userpage(struct perf_event *event,
+			  struct perf_event_mmap_page *userpg, u64 now);
+
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
 extern struct perf_event *
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 13209a90b751..7bad385103ea 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -42,6 +42,8 @@
 #include <linux/module.h>
 #include <linux/mman.h>
 #include <linux/compat.h>
+#include <linux/sysctl.h>
+#include <linux/jump_label.h>
 
 #include "internal.h"
 
@@ -322,9 +324,43 @@ extern __weak const char *perf_pmu_name(void)
 	return "pmu";
 }
 
+struct static_key perf_use_local_clock_key = STATIC_KEY_INIT_FALSE;
+static bool perf_use_local_clock_param __initdata;
+static int __init perf_use_local_clock_setup(char *__unused)
+{
+	perf_use_local_clock_param = true;
+	return 1;
+}
+__setup("perf_use_local_clock", perf_use_local_clock_setup);
+
+static int sysctl_perf_sample_time_clk_id = CLOCK_MONOTONIC;
+
+static struct ctl_table perf_sample_time_kern_table[] = {
+	{
+		.procname       = "perf_sample_time_clk_id",
+		.data           = &sysctl_perf_sample_time_clk_id,
+		.maxlen         = sizeof(int),
+		.mode           = 0444,
+		.proc_handler   = proc_dointvec,
+	},
+	{}
+};
+
+static struct ctl_table perf_sample_time_root_table[] = {
+	{
+		.procname	= "kernel",
+		.mode		= 0555,
+		.child		= perf_sample_time_kern_table,
+	},
+	{}
+};
+
 static inline u64 perf_clock(void)
 {
-	return local_clock();
+	if (static_key_false(&perf_use_local_clock_key))
+		return local_clock();
+	else
+		return ktime_get_mono_fast_ns();
 }
 
 static inline struct perf_cpu_context *
@@ -4101,8 +4137,8 @@ static void perf_event_init_userpage(struct perf_event *event)
 	rcu_read_unlock();
 }
 
-void __weak arch_perf_update_userpage(
-	struct perf_event *event, struct perf_event_mmap_page *userpg, u64 now)
+void __weak arch_perf_update_userpage(struct perf_event *event,
+				      struct perf_event_mmap_page *userpg, u64 now)
 {
 }
 
@@ -4487,7 +4523,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	if (vma->vm_flags & VM_WRITE)
 		flags |= RING_BUFFER_WRITABLE;
 
-	rb = rb_alloc(nr_pages, 
+	rb = rb_alloc(nr_pages,
 		event->attr.watermark ? event->attr.wakeup_watermark : 0,
 		event->cpu, flags);
 
@@ -8516,6 +8552,11 @@ void __init perf_event_init(void)
 	 */
 	BUILD_BUG_ON((offsetof(struct perf_event_mmap_page, data_head))
 		     != 1024);
+
+	if (perf_use_local_clock_param)
+		static_key_slow_inc(&perf_use_local_clock_key);
+	else
+		register_sysctl_table(perf_sample_time_root_table);
 }
 
 static int __init perf_event_sysfs_init(void)
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index b124af259800..37bed5931a91 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -334,6 +334,36 @@ u64 notrace ktime_get_mono_fast_ns(void)
 }
 EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns);
 
+void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift)
+{
+	struct tk_read_base *tkr;
+	unsigned int seq;
+	cycle_t cycle_now, delta;
+	u64 nsecs, now;
+
+	do {
+		seq = raw_read_seqcount(&tk_fast_mono.seq);
+		tkr = tk_fast_mono.base + (seq & 0x01);
+
+		cycle_now = tkr->read(tkr->clock);
+		delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);
+
+		nsec = delta * tkr->mult + tkr->xtime_nsec;
+		nsec >>= tkr->shift;
+		nsec += arch_gettimeoffset();
+
+		now = ktime_to_ns(tkr->base_mono) + nsec;
+
+		*mult = tkr->mult;
+		*shift = tkr->shift;
+
+		nsec = mul_u64_u32_shr(cycle_now, tkr->mult, tkr->shift);
+
+		*offset = now - nsec;
+
+	} while (read_seqcount_retry(&tk_fast_mono.seq, seq));
+}
+
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL_OLD
 
 static inline void update_vsyscall(struct timekeeper *tk)

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-12 10:04               ` Adrian Hunter
  0 siblings, 0 replies; 58+ messages in thread
From: Adrian Hunter @ 2015-02-12 10:04 UTC (permalink / raw)
  To: Peter Zijlstra, Pawel Moll
  Cc: ajh mls, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso, linux-kernel,
	linux-api

On 11/02/15 18:12, Peter Zijlstra wrote:
> 
> How about something like the below? I _think_ it should mostly work for
> x86, where the tsc is a 64bit wide cycle counter.

It would have to be based on CLOCK_MONOTONIC_RAW not CLOCK_MONOTONIC and you
would have to check the clocksource is TSC.

Why is CLOCK_MONOTONIC preferred anyway - I would have thought any
adjustment would skew performance timings?

> 
> I suppose we should extend the perf userpage time data with
> time_last_cycle and time_mask if/when we want to make this work on
> something with a short counter.
> 
> Of course, at that time we also need to somehow deal with that counter
> wrapping, its hardly practical to go iterate all possible userpg
> instances from a timer handler.
> 
> 
> ---
>  Documentation/kernel-parameters.txt |  9 +++++++
>  arch/x86/kernel/cpu/perf_event.c    | 44 ++++++++++++++++++++++++---------
>  include/linux/perf_event.h          |  6 +++++
>  kernel/events/core.c                | 49 ++++++++++++++++++++++++++++++++++---
>  kernel/time/timekeeping.c           | 30 +++++++++++++++++++++++
>  5 files changed, 123 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 176d4fe4f076..52255676b6e2 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -91,6 +91,7 @@ the beginning of each description states the restrictions within which a
>  	NUMA	NUMA support is enabled.
>  	NFS	Appropriate NFS support is enabled.
>  	OSS	OSS sound support is enabled.
> +	PERF	Performance events and counters support is enabled.
>  	PV_OPS	A paravirtualized kernel is enabled.
>  	PARIDE	The ParIDE (parallel port IDE) subsystem is enabled.
>  	PARISC	The PA-RISC architecture is enabled.
> @@ -2796,6 +2797,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  			allocator.  This parameter is primarily	for debugging
>  			and performance comparison.
>  
> +	perf_use_local_clock
> +			[PERF]
> +			Use local_clock() as a source for perf timestamps
> +			generation. This was be the default behaviour and
> +			this parameter can be used to maintain backward
> +			compatibility or on older hardware with expensive
> +			monotonic clock source.
> +
>  	pf.		[PARIDE]
>  			See Documentation/blockdev/paride.txt.
>  
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index b71a7f86d68a..436a66632f76 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1952,6 +1952,35 @@ static struct pmu pmu = {
>  	.flush_branch_stack	= x86_pmu_flush_branch_stack,
>  };
>  
> +static void local_clock_user_time(struct perf_event_mmap_page *userpg, u64 now)
> +{
> +	data = cyc2ns_read_begin();
> +
> +	userpg->cap_user_time = 1;
> +	userpg->time_mult = data->cyc2ns_mul;
> +	userpg->time_shift = data->cyc2ns_shift;
> +	userpg->time_offset = data->cyc2ns_offset - now;
> +
> +	userpg->cap_user_time_zero = 1;
> +	userpg->time_zero = data->cyc2ns_offset;
> +
> +	cyc2ns_read_end(data);
> +}
> +
> +extern void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift);
> +
> +static void ktime_fast_mono_user_time(struct perf_event_mmap_page *userpg, u64 now)
> +{
> +	userpg->cap_user_time = 1;
> +	userpg->cap_user_time_zero = 1;
> +
> +	__ktime_get_mono_fast(&userpg->time_zero,
> +			      &userpg->time_mult,
> +			      &userpg->time_shift);
> +
> +	userpg->offset = userpg->time_zero - now;
> +}
> +
>  void arch_perf_update_userpage(struct perf_event *event,
>  			       struct perf_event_mmap_page *userpg, u64 now)
>  {
> @@ -1966,17 +1995,10 @@ void arch_perf_update_userpage(struct perf_event *event,
>  	if (!sched_clock_stable())
>  		return;
>  
> -	data = cyc2ns_read_begin();
> -
> -	userpg->cap_user_time = 1;
> -	userpg->time_mult = data->cyc2ns_mul;
> -	userpg->time_shift = data->cyc2ns_shift;
> -	userpg->time_offset = data->cyc2ns_offset - now;
> -
> -	userpg->cap_user_time_zero = 1;
> -	userpg->time_zero = data->cyc2ns_offset;
> -
> -	cyc2ns_read_end(data);
> +	if (static_key_false(&perf_use_local_clock_key))
> +		local_clock_user_time(userpg, now);
> +	else
> +		ktime_fast_mono_user_time(userpg, now);
>  }
>  
>  /*
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 33262004c310..1d61f968113a 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -561,6 +561,12 @@ extern void perf_pmu_enable(struct pmu *pmu);
>  extern int perf_event_task_disable(void);
>  extern int perf_event_task_enable(void);
>  extern int perf_event_refresh(struct perf_event *event, int refresh);
> +
> +extern struct static_key perf_use_local_clock_key = STATIC_KEY_INIT_FALSE;
> +extern void __weak
> +arch_perf_update_userpage(struct perf_event *event,
> +			  struct perf_event_mmap_page *userpg, u64 now);
> +
>  extern void perf_event_update_userpage(struct perf_event *event);
>  extern int perf_event_release_kernel(struct perf_event *event);
>  extern struct perf_event *
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 13209a90b751..7bad385103ea 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -42,6 +42,8 @@
>  #include <linux/module.h>
>  #include <linux/mman.h>
>  #include <linux/compat.h>
> +#include <linux/sysctl.h>
> +#include <linux/jump_label.h>
>  
>  #include "internal.h"
>  
> @@ -322,9 +324,43 @@ extern __weak const char *perf_pmu_name(void)
>  	return "pmu";
>  }
>  
> +struct static_key perf_use_local_clock_key = STATIC_KEY_INIT_FALSE;
> +static bool perf_use_local_clock_param __initdata;
> +static int __init perf_use_local_clock_setup(char *__unused)
> +{
> +	perf_use_local_clock_param = true;
> +	return 1;
> +}
> +__setup("perf_use_local_clock", perf_use_local_clock_setup);
> +
> +static int sysctl_perf_sample_time_clk_id = CLOCK_MONOTONIC;
> +
> +static struct ctl_table perf_sample_time_kern_table[] = {
> +	{
> +		.procname       = "perf_sample_time_clk_id",
> +		.data           = &sysctl_perf_sample_time_clk_id,
> +		.maxlen         = sizeof(int),
> +		.mode           = 0444,
> +		.proc_handler   = proc_dointvec,
> +	},
> +	{}
> +};
> +
> +static struct ctl_table perf_sample_time_root_table[] = {
> +	{
> +		.procname	= "kernel",
> +		.mode		= 0555,
> +		.child		= perf_sample_time_kern_table,
> +	},
> +	{}
> +};
> +
>  static inline u64 perf_clock(void)
>  {
> -	return local_clock();
> +	if (static_key_false(&perf_use_local_clock_key))
> +		return local_clock();
> +	else
> +		return ktime_get_mono_fast_ns();
>  }
>  
>  static inline struct perf_cpu_context *
> @@ -4101,8 +4137,8 @@ static void perf_event_init_userpage(struct perf_event *event)
>  	rcu_read_unlock();
>  }
>  
> -void __weak arch_perf_update_userpage(
> -	struct perf_event *event, struct perf_event_mmap_page *userpg, u64 now)
> +void __weak arch_perf_update_userpage(struct perf_event *event,
> +				      struct perf_event_mmap_page *userpg, u64 now)
>  {
>  }
>  
> @@ -4487,7 +4523,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  	if (vma->vm_flags & VM_WRITE)
>  		flags |= RING_BUFFER_WRITABLE;
>  
> -	rb = rb_alloc(nr_pages, 
> +	rb = rb_alloc(nr_pages,
>  		event->attr.watermark ? event->attr.wakeup_watermark : 0,
>  		event->cpu, flags);
>  
> @@ -8516,6 +8552,11 @@ void __init perf_event_init(void)
>  	 */
>  	BUILD_BUG_ON((offsetof(struct perf_event_mmap_page, data_head))
>  		     != 1024);
> +
> +	if (perf_use_local_clock_param)
> +		static_key_slow_inc(&perf_use_local_clock_key);
> +	else
> +		register_sysctl_table(perf_sample_time_root_table);
>  }
>  
>  static int __init perf_event_sysfs_init(void)
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index b124af259800..37bed5931a91 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -334,6 +334,36 @@ u64 notrace ktime_get_mono_fast_ns(void)
>  }
>  EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns);
>  
> +void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift)
> +{
> +	struct tk_read_base *tkr;
> +	unsigned int seq;
> +	cycle_t cycle_now, delta;
> +	u64 nsecs, now;
> +
> +	do {
> +		seq = raw_read_seqcount(&tk_fast_mono.seq);
> +		tkr = tk_fast_mono.base + (seq & 0x01);
> +
> +		cycle_now = tkr->read(tkr->clock);
> +		delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);
> +
> +		nsec = delta * tkr->mult + tkr->xtime_nsec;
> +		nsec >>= tkr->shift;
> +		nsec += arch_gettimeoffset();
> +
> +		now = ktime_to_ns(tkr->base_mono) + nsec;
> +
> +		*mult = tkr->mult;
> +		*shift = tkr->shift;
> +
> +		nsec = mul_u64_u32_shr(cycle_now, tkr->mult, tkr->shift);
> +
> +		*offset = now - nsec;
> +
> +	} while (read_seqcount_retry(&tk_fast_mono.seq, seq));
> +}
> +
>  #ifdef CONFIG_GENERIC_TIME_VSYSCALL_OLD
>  
>  static inline void update_vsyscall(struct timekeeper *tk)
> 
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-12 10:04               ` Adrian Hunter
  0 siblings, 0 replies; 58+ messages in thread
From: Adrian Hunter @ 2015-02-12 10:04 UTC (permalink / raw)
  To: Peter Zijlstra, Pawel Moll
  Cc: ajh mls, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On 11/02/15 18:12, Peter Zijlstra wrote:
> 
> How about something like the below? I _think_ it should mostly work for
> x86, where the tsc is a 64bit wide cycle counter.

It would have to be based on CLOCK_MONOTONIC_RAW not CLOCK_MONOTONIC and you
would have to check the clocksource is TSC.

Why is CLOCK_MONOTONIC preferred anyway - I would have thought any
adjustment would skew performance timings?

> 
> I suppose we should extend the perf userpage time data with
> time_last_cycle and time_mask if/when we want to make this work on
> something with a short counter.
> 
> Of course, at that time we also need to somehow deal with that counter
> wrapping, its hardly practical to go iterate all possible userpg
> instances from a timer handler.
> 
> 
> ---
>  Documentation/kernel-parameters.txt |  9 +++++++
>  arch/x86/kernel/cpu/perf_event.c    | 44 ++++++++++++++++++++++++---------
>  include/linux/perf_event.h          |  6 +++++
>  kernel/events/core.c                | 49 ++++++++++++++++++++++++++++++++++---
>  kernel/time/timekeeping.c           | 30 +++++++++++++++++++++++
>  5 files changed, 123 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 176d4fe4f076..52255676b6e2 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -91,6 +91,7 @@ the beginning of each description states the restrictions within which a
>  	NUMA	NUMA support is enabled.
>  	NFS	Appropriate NFS support is enabled.
>  	OSS	OSS sound support is enabled.
> +	PERF	Performance events and counters support is enabled.
>  	PV_OPS	A paravirtualized kernel is enabled.
>  	PARIDE	The ParIDE (parallel port IDE) subsystem is enabled.
>  	PARISC	The PA-RISC architecture is enabled.
> @@ -2796,6 +2797,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  			allocator.  This parameter is primarily	for debugging
>  			and performance comparison.
>  
> +	perf_use_local_clock
> +			[PERF]
> +			Use local_clock() as a source for perf timestamps
> +			generation. This was be the default behaviour and
> +			this parameter can be used to maintain backward
> +			compatibility or on older hardware with expensive
> +			monotonic clock source.
> +
>  	pf.		[PARIDE]
>  			See Documentation/blockdev/paride.txt.
>  
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index b71a7f86d68a..436a66632f76 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1952,6 +1952,35 @@ static struct pmu pmu = {
>  	.flush_branch_stack	= x86_pmu_flush_branch_stack,
>  };
>  
> +static void local_clock_user_time(struct perf_event_mmap_page *userpg, u64 now)
> +{
> +	data = cyc2ns_read_begin();
> +
> +	userpg->cap_user_time = 1;
> +	userpg->time_mult = data->cyc2ns_mul;
> +	userpg->time_shift = data->cyc2ns_shift;
> +	userpg->time_offset = data->cyc2ns_offset - now;
> +
> +	userpg->cap_user_time_zero = 1;
> +	userpg->time_zero = data->cyc2ns_offset;
> +
> +	cyc2ns_read_end(data);
> +}
> +
> +extern void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift);
> +
> +static void ktime_fast_mono_user_time(struct perf_event_mmap_page *userpg, u64 now)
> +{
> +	userpg->cap_user_time = 1;
> +	userpg->cap_user_time_zero = 1;
> +
> +	__ktime_get_mono_fast(&userpg->time_zero,
> +			      &userpg->time_mult,
> +			      &userpg->time_shift);
> +
> +	userpg->offset = userpg->time_zero - now;
> +}
> +
>  void arch_perf_update_userpage(struct perf_event *event,
>  			       struct perf_event_mmap_page *userpg, u64 now)
>  {
> @@ -1966,17 +1995,10 @@ void arch_perf_update_userpage(struct perf_event *event,
>  	if (!sched_clock_stable())
>  		return;
>  
> -	data = cyc2ns_read_begin();
> -
> -	userpg->cap_user_time = 1;
> -	userpg->time_mult = data->cyc2ns_mul;
> -	userpg->time_shift = data->cyc2ns_shift;
> -	userpg->time_offset = data->cyc2ns_offset - now;
> -
> -	userpg->cap_user_time_zero = 1;
> -	userpg->time_zero = data->cyc2ns_offset;
> -
> -	cyc2ns_read_end(data);
> +	if (static_key_false(&perf_use_local_clock_key))
> +		local_clock_user_time(userpg, now);
> +	else
> +		ktime_fast_mono_user_time(userpg, now);
>  }
>  
>  /*
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 33262004c310..1d61f968113a 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -561,6 +561,12 @@ extern void perf_pmu_enable(struct pmu *pmu);
>  extern int perf_event_task_disable(void);
>  extern int perf_event_task_enable(void);
>  extern int perf_event_refresh(struct perf_event *event, int refresh);
> +
> +extern struct static_key perf_use_local_clock_key = STATIC_KEY_INIT_FALSE;
> +extern void __weak
> +arch_perf_update_userpage(struct perf_event *event,
> +			  struct perf_event_mmap_page *userpg, u64 now);
> +
>  extern void perf_event_update_userpage(struct perf_event *event);
>  extern int perf_event_release_kernel(struct perf_event *event);
>  extern struct perf_event *
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 13209a90b751..7bad385103ea 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -42,6 +42,8 @@
>  #include <linux/module.h>
>  #include <linux/mman.h>
>  #include <linux/compat.h>
> +#include <linux/sysctl.h>
> +#include <linux/jump_label.h>
>  
>  #include "internal.h"
>  
> @@ -322,9 +324,43 @@ extern __weak const char *perf_pmu_name(void)
>  	return "pmu";
>  }
>  
> +struct static_key perf_use_local_clock_key = STATIC_KEY_INIT_FALSE;
> +static bool perf_use_local_clock_param __initdata;
> +static int __init perf_use_local_clock_setup(char *__unused)
> +{
> +	perf_use_local_clock_param = true;
> +	return 1;
> +}
> +__setup("perf_use_local_clock", perf_use_local_clock_setup);
> +
> +static int sysctl_perf_sample_time_clk_id = CLOCK_MONOTONIC;
> +
> +static struct ctl_table perf_sample_time_kern_table[] = {
> +	{
> +		.procname       = "perf_sample_time_clk_id",
> +		.data           = &sysctl_perf_sample_time_clk_id,
> +		.maxlen         = sizeof(int),
> +		.mode           = 0444,
> +		.proc_handler   = proc_dointvec,
> +	},
> +	{}
> +};
> +
> +static struct ctl_table perf_sample_time_root_table[] = {
> +	{
> +		.procname	= "kernel",
> +		.mode		= 0555,
> +		.child		= perf_sample_time_kern_table,
> +	},
> +	{}
> +};
> +
>  static inline u64 perf_clock(void)
>  {
> -	return local_clock();
> +	if (static_key_false(&perf_use_local_clock_key))
> +		return local_clock();
> +	else
> +		return ktime_get_mono_fast_ns();
>  }
>  
>  static inline struct perf_cpu_context *
> @@ -4101,8 +4137,8 @@ static void perf_event_init_userpage(struct perf_event *event)
>  	rcu_read_unlock();
>  }
>  
> -void __weak arch_perf_update_userpage(
> -	struct perf_event *event, struct perf_event_mmap_page *userpg, u64 now)
> +void __weak arch_perf_update_userpage(struct perf_event *event,
> +				      struct perf_event_mmap_page *userpg, u64 now)
>  {
>  }
>  
> @@ -4487,7 +4523,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  	if (vma->vm_flags & VM_WRITE)
>  		flags |= RING_BUFFER_WRITABLE;
>  
> -	rb = rb_alloc(nr_pages, 
> +	rb = rb_alloc(nr_pages,
>  		event->attr.watermark ? event->attr.wakeup_watermark : 0,
>  		event->cpu, flags);
>  
> @@ -8516,6 +8552,11 @@ void __init perf_event_init(void)
>  	 */
>  	BUILD_BUG_ON((offsetof(struct perf_event_mmap_page, data_head))
>  		     != 1024);
> +
> +	if (perf_use_local_clock_param)
> +		static_key_slow_inc(&perf_use_local_clock_key);
> +	else
> +		register_sysctl_table(perf_sample_time_root_table);
>  }
>  
>  static int __init perf_event_sysfs_init(void)
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index b124af259800..37bed5931a91 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -334,6 +334,36 @@ u64 notrace ktime_get_mono_fast_ns(void)
>  }
>  EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns);
>  
> +void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift)
> +{
> +	struct tk_read_base *tkr;
> +	unsigned int seq;
> +	cycle_t cycle_now, delta;
> +	u64 nsecs, now;
> +
> +	do {
> +		seq = raw_read_seqcount(&tk_fast_mono.seq);
> +		tkr = tk_fast_mono.base + (seq & 0x01);
> +
> +		cycle_now = tkr->read(tkr->clock);
> +		delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);
> +
> +		nsec = delta * tkr->mult + tkr->xtime_nsec;
> +		nsec >>= tkr->shift;
> +		nsec += arch_gettimeoffset();
> +
> +		now = ktime_to_ns(tkr->base_mono) + nsec;
> +
> +		*mult = tkr->mult;
> +		*shift = tkr->shift;
> +
> +		nsec = mul_u64_u32_shr(cycle_now, tkr->mult, tkr->shift);
> +
> +		*offset = now - nsec;
> +
> +	} while (read_seqcount_retry(&tk_fast_mono.seq, seq));
> +}
> +
>  #ifdef CONFIG_GENERIC_TIME_VSYSCALL_OLD
>  
>  static inline void update_vsyscall(struct timekeeper *tk)
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-12 10:28                 ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2015-02-12 10:28 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Pawel Moll, ajh mls, Richard Cochran, Steven Rostedt,
	Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	John Stultz, Masami Hiramatsu, Christopher Covington,
	Namhyung Kim, David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel, linux-api

On Thu, Feb 12, 2015 at 12:04:54PM +0200, Adrian Hunter wrote:
> On 11/02/15 18:12, Peter Zijlstra wrote:
> > 
> > How about something like the below? I _think_ it should mostly work for
> > x86, where the tsc is a 64bit wide cycle counter.
> 
> It would have to be based on CLOCK_MONOTONIC_RAW not CLOCK_MONOTONIC 

Why?

> and you would have to check the clocksource is TSC.

It implicitly does that; it has that sched_clock_stable() thing, but
yeah I suppose someone could change the clocksource even though the tsc
is stable.

Not using TSC when its available is quite crazy though.. but sure.

> Why is CLOCK_MONOTONIC preferred anyway - I would have thought any
> adjustment would skew performance timings?

Because you can do inter-machine stuff with MONOTONIC and that's
entirely impossible with MONO_RAW.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-12 10:28                 ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2015-02-12 10:28 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Pawel Moll, ajh mls, Richard Cochran, Steven Rostedt,
	Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	John Stultz, Masami Hiramatsu, Christopher Covington,
	Namhyung Kim, David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Thu, Feb 12, 2015 at 12:04:54PM +0200, Adrian Hunter wrote:
> On 11/02/15 18:12, Peter Zijlstra wrote:
> > 
> > How about something like the below? I _think_ it should mostly work for
> > x86, where the tsc is a 64bit wide cycle counter.
> 
> It would have to be based on CLOCK_MONOTONIC_RAW not CLOCK_MONOTONIC 

Why?

> and you would have to check the clocksource is TSC.

It implicitly does that; it has that sched_clock_stable() thing, but
yeah I suppose someone could change the clocksource even though the tsc
is stable.

Not using TSC when its available is quite crazy though.. but sure.

> Why is CLOCK_MONOTONIC preferred anyway - I would have thought any
> adjustment would skew performance timings?

Because you can do inter-machine stuff with MONOTONIC and that's
entirely impossible with MONO_RAW.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-12 15:38                   ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2015-02-12 15:38 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Pawel Moll, ajh mls, Richard Cochran, Steven Rostedt,
	Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	John Stultz, Masami Hiramatsu, Christopher Covington,
	Namhyung Kim, David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel, linux-api

On Thu, Feb 12, 2015 at 11:28:14AM +0100, Peter Zijlstra wrote:
> > and you would have to check the clocksource is TSC.
> 
> It implicitly does that; it has that sched_clock_stable() thing, but
> yeah I suppose someone could change the clocksource even though the tsc
> is stable.
> 
> Not using TSC when its available is quite crazy though.. but sure.

Something like this on top then.. it might have a few header issues, the
whole asm/tsc.h vs clocksource.h thing looks like pain.

I haven't tried to compile it, maybe we can move cycle_t into types and
fwd declare struct clocksource or whatnot.

Of course, all this is quite horrible on the timekeeping side; it might
be tglx and/or jstutlz are having spasms just reading it :-)

---
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1967,17 +1967,19 @@ static void local_clock_user_time(struct
 	cyc2ns_read_end(data);
 }
 
-extern void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift);
+extern bool notrace __ktime_get_mono_fast(cycle_t (*read)(struct clocksource *cs),
+		u64 *offset, u32 *mult, u16 *shift);
 
 static void ktime_fast_mono_user_time(struct perf_event_mmap_page *userpg, u64 now)
 {
+	if (!__ktime_get_mono_fast(read_tsc, &userpg->time_zero,
+				   &userpg->time_mult,
+				   &userpg->time_shift))
+		return;
+
 	userpg->cap_user_time = 1;
 	userpg->cap_user_time_zero = 1;
 
-	__ktime_get_mono_fast(&userpg->time_zero,
-			      &userpg->time_mult,
-			      &userpg->time_shift);
-
 	userpg->offset = userpg->time_zero - now;
 }
 
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -334,7 +334,8 @@ u64 notrace ktime_get_mono_fast_ns(void)
 }
 EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns);
 
-void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift)
+bool notrace __ktime_get_mono_fast(cycle_t (*read)(struct clocksource *),
+				   u64 *offset, u32 *mult, u16 *shift)
 {
 	struct tk_read_base *tkr;
 	unsigned int seq;
@@ -345,6 +346,9 @@ void notrace __ktime_get_mono_fast(u64 *
 		seq = raw_read_seqcount(&tk_fast_mono.seq);
 		tkr = tk_fast_mono.base + (seq & 0x01);
 
+		if (tkr->read != read)
+			return false;
+
 		cycle_now = tkr->read(tkr->clock);
 		delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);
 
@@ -362,6 +366,8 @@ void notrace __ktime_get_mono_fast(u64 *
 		*offset = now - nsec;
 
 	} while (read_seqcount_retry(&tk_fast_mono.seq, seq));
+
+	return true;
 }
 
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL_OLD
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 94605c0e9cee..68e4039a58ea 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -32,6 +32,8 @@ static inline cycles_t get_cycles(void)
 	return ret;
 }
 
+extern void cycle_t read_tsc(struct clocksource *);
+
 static __always_inline cycles_t vget_cycles(void)
 {
 	/*
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 505449700e0c..c580998f0160 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -965,7 +965,7 @@ static struct clocksource clocksource_tsc;
  * checking the result of read_tsc() - cycle_last for being negative.
  * That works because CLOCKSOURCE_MASK(64) does not mask out any bit.
  */
-static cycle_t read_tsc(struct clocksource *cs)
+cycle_t read_tsc(struct clocksource *cs)
 {
 	return (cycle_t)get_cycles();
 }

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-12 15:38                   ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2015-02-12 15:38 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Pawel Moll, ajh mls, Richard Cochran, Steven Rostedt,
	Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	John Stultz, Masami Hiramatsu, Christopher Covington,
	Namhyung Kim, David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Thu, Feb 12, 2015 at 11:28:14AM +0100, Peter Zijlstra wrote:
> > and you would have to check the clocksource is TSC.
> 
> It implicitly does that; it has that sched_clock_stable() thing, but
> yeah I suppose someone could change the clocksource even though the tsc
> is stable.
> 
> Not using TSC when its available is quite crazy though.. but sure.

Something like this on top then.. it might have a few header issues, the
whole asm/tsc.h vs clocksource.h thing looks like pain.

I haven't tried to compile it, maybe we can move cycle_t into types and
fwd declare struct clocksource or whatnot.

Of course, all this is quite horrible on the timekeeping side; it might
be tglx and/or jstutlz are having spasms just reading it :-)

---
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1967,17 +1967,19 @@ static void local_clock_user_time(struct
 	cyc2ns_read_end(data);
 }
 
-extern void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift);
+extern bool notrace __ktime_get_mono_fast(cycle_t (*read)(struct clocksource *cs),
+		u64 *offset, u32 *mult, u16 *shift);
 
 static void ktime_fast_mono_user_time(struct perf_event_mmap_page *userpg, u64 now)
 {
+	if (!__ktime_get_mono_fast(read_tsc, &userpg->time_zero,
+				   &userpg->time_mult,
+				   &userpg->time_shift))
+		return;
+
 	userpg->cap_user_time = 1;
 	userpg->cap_user_time_zero = 1;
 
-	__ktime_get_mono_fast(&userpg->time_zero,
-			      &userpg->time_mult,
-			      &userpg->time_shift);
-
 	userpg->offset = userpg->time_zero - now;
 }
 
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -334,7 +334,8 @@ u64 notrace ktime_get_mono_fast_ns(void)
 }
 EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns);
 
-void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift)
+bool notrace __ktime_get_mono_fast(cycle_t (*read)(struct clocksource *),
+				   u64 *offset, u32 *mult, u16 *shift)
 {
 	struct tk_read_base *tkr;
 	unsigned int seq;
@@ -345,6 +346,9 @@ void notrace __ktime_get_mono_fast(u64 *
 		seq = raw_read_seqcount(&tk_fast_mono.seq);
 		tkr = tk_fast_mono.base + (seq & 0x01);
 
+		if (tkr->read != read)
+			return false;
+
 		cycle_now = tkr->read(tkr->clock);
 		delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);
 
@@ -362,6 +366,8 @@ void notrace __ktime_get_mono_fast(u64 *
 		*offset = now - nsec;
 
 	} while (read_seqcount_retry(&tk_fast_mono.seq, seq));
+
+	return true;
 }
 
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL_OLD
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 94605c0e9cee..68e4039a58ea 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -32,6 +32,8 @@ static inline cycles_t get_cycles(void)
 	return ret;
 }
 
+extern void cycle_t read_tsc(struct clocksource *);
+
 static __always_inline cycles_t vget_cycles(void)
 {
 	/*
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 505449700e0c..c580998f0160 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -965,7 +965,7 @@ static struct clocksource clocksource_tsc;
  * checking the result of read_tsc() - cycle_last for being negative.
  * That works because CLOCKSOURCE_MASK(64) does not mask out any bit.
  */
-static cycle_t read_tsc(struct clocksource *cs)
+cycle_t read_tsc(struct clocksource *cs)
 {
 	return (cycle_t)get_cycles();
 }

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-13  0:25                     ` John Stultz
  0 siblings, 0 replies; 58+ messages in thread
From: John Stultz @ 2015-02-13  0:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Adrian Hunter, Pawel Moll, ajh mls, Richard Cochran,
	Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api

On Thu, Feb 12, 2015 at 11:38 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Feb 12, 2015 at 11:28:14AM +0100, Peter Zijlstra wrote:
>> > and you would have to check the clocksource is TSC.
>>
>> It implicitly does that; it has that sched_clock_stable() thing, but
>> yeah I suppose someone could change the clocksource even though the tsc
>> is stable.
>>
>> Not using TSC when its available is quite crazy though.. but sure.
>
> Something like this on top then.. it might have a few header issues, the
> whole asm/tsc.h vs clocksource.h thing looks like pain.
>
> I haven't tried to compile it, maybe we can move cycle_t into types and
> fwd declare struct clocksource or whatnot.
>
> Of course, all this is quite horrible on the timekeeping side; it might
> be tglx and/or jstutlz are having spasms just reading it :-)


Oof.. Yea, this exposes all sorts of timekeeping internals out to the
rest of the kernel that I'd rather not have out there.

> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1967,17 +1967,19 @@ static void local_clock_user_time(struct
>         cyc2ns_read_end(data);
>  }
>
> -extern void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift);
> +extern bool notrace __ktime_get_mono_fast(cycle_t (*read)(struct clocksource *cs),
> +               u64 *offset, u32 *mult, u16 *shift);
>
>  static void ktime_fast_mono_user_time(struct perf_event_mmap_page *userpg, u64 now)
>  {
> +       if (!__ktime_get_mono_fast(read_tsc, &userpg->time_zero,
> +                                  &userpg->time_mult,
> +                                  &userpg->time_shift))

Soo.. instead of hard-coding read_tsc here, can we instead use a
clocksource flag that we can check that the current clocksource is
valid for this sort of use?

thanks
-john

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-13  0:25                     ` John Stultz
  0 siblings, 0 replies; 58+ messages in thread
From: John Stultz @ 2015-02-13  0:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Adrian Hunter, Pawel Moll, ajh mls, Richard Cochran,
	Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern,
	Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Thu, Feb 12, 2015 at 11:38 PM, Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
> On Thu, Feb 12, 2015 at 11:28:14AM +0100, Peter Zijlstra wrote:
>> > and you would have to check the clocksource is TSC.
>>
>> It implicitly does that; it has that sched_clock_stable() thing, but
>> yeah I suppose someone could change the clocksource even though the tsc
>> is stable.
>>
>> Not using TSC when its available is quite crazy though.. but sure.
>
> Something like this on top then.. it might have a few header issues, the
> whole asm/tsc.h vs clocksource.h thing looks like pain.
>
> I haven't tried to compile it, maybe we can move cycle_t into types and
> fwd declare struct clocksource or whatnot.
>
> Of course, all this is quite horrible on the timekeeping side; it might
> be tglx and/or jstutlz are having spasms just reading it :-)


Oof.. Yea, this exposes all sorts of timekeeping internals out to the
rest of the kernel that I'd rather not have out there.

> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1967,17 +1967,19 @@ static void local_clock_user_time(struct
>         cyc2ns_read_end(data);
>  }
>
> -extern void notrace __ktime_get_mono_fast(u64 *offset, u32 *mult, u16 *shift);
> +extern bool notrace __ktime_get_mono_fast(cycle_t (*read)(struct clocksource *cs),
> +               u64 *offset, u32 *mult, u16 *shift);
>
>  static void ktime_fast_mono_user_time(struct perf_event_mmap_page *userpg, u64 now)
>  {
> +       if (!__ktime_get_mono_fast(read_tsc, &userpg->time_zero,
> +                                  &userpg->time_mult,
> +                                  &userpg->time_shift))

Soo.. instead of hard-coding read_tsc here, can we instead use a
clocksource flag that we can check that the current clocksource is
valid for this sort of use?

thanks
-john

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-13  7:07                   ` Adrian Hunter
  0 siblings, 0 replies; 58+ messages in thread
From: Adrian Hunter @ 2015-02-13  7:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pawel Moll, ajh mls, Richard Cochran, Steven Rostedt,
	Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	John Stultz, Masami Hiramatsu, Christopher Covington,
	Namhyung Kim, David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel, linux-api

On 12/02/15 12:28, Peter Zijlstra wrote:
> On Thu, Feb 12, 2015 at 12:04:54PM +0200, Adrian Hunter wrote:
>> On 11/02/15 18:12, Peter Zijlstra wrote:
>>>
>>> How about something like the below? I _think_ it should mostly work for
>>> x86, where the tsc is a 64bit wide cycle counter.
>>
>> It would have to be based on CLOCK_MONOTONIC_RAW not CLOCK_MONOTONIC 
> 
> Why?

In the CLOCK_MONOTONIC case, the components of the calculation (mult and
shift etc) are subject to change, so the calculation would be increasingly
inaccurate the greater the time between reading those values the reading TSC
or capturing perf events.

Accuracy is important for matching sideband events against Intel PT. e.g.
did a mmap event happen before or after a given TSC timestamp.

Adding another sample value (Pawel's patch 3) is more accurate and simpler
to understand. It just needs to be extended to allow TSC.

> 
>> and you would have to check the clocksource is TSC.
> 
> It implicitly does that; it has that sched_clock_stable() thing, but
> yeah I suppose someone could change the clocksource even though the tsc
> is stable.
> 
> Not using TSC when its available is quite crazy though.. but sure.
> 
>> Why is CLOCK_MONOTONIC preferred anyway - I would have thought any
>> adjustment would skew performance timings?
> 
> Because you can do inter-machine stuff with MONOTONIC and that's
> entirely impossible with MONO_RAW.

Ok, the man page does not make it sound as bad as that.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
@ 2015-02-13  7:07                   ` Adrian Hunter
  0 siblings, 0 replies; 58+ messages in thread
From: Adrian Hunter @ 2015-02-13  7:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pawel Moll, ajh mls, Richard Cochran, Steven Rostedt,
	Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	John Stultz, Masami Hiramatsu, Christopher Covington,
	Namhyung Kim, David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On 12/02/15 12:28, Peter Zijlstra wrote:
> On Thu, Feb 12, 2015 at 12:04:54PM +0200, Adrian Hunter wrote:
>> On 11/02/15 18:12, Peter Zijlstra wrote:
>>>
>>> How about something like the below? I _think_ it should mostly work for
>>> x86, where the tsc is a 64bit wide cycle counter.
>>
>> It would have to be based on CLOCK_MONOTONIC_RAW not CLOCK_MONOTONIC 
> 
> Why?

In the CLOCK_MONOTONIC case, the components of the calculation (mult and
shift etc) are subject to change, so the calculation would be increasingly
inaccurate the greater the time between reading those values the reading TSC
or capturing perf events.

Accuracy is important for matching sideband events against Intel PT. e.g.
did a mmap event happen before or after a given TSC timestamp.

Adding another sample value (Pawel's patch 3) is more accurate and simpler
to understand. It just needs to be extended to allow TSC.

> 
>> and you would have to check the clocksource is TSC.
> 
> It implicitly does that; it has that sched_clock_stable() thing, but
> yeah I suppose someone could change the clocksource even though the tsc
> is stable.
> 
> Not using TSC when its available is quite crazy though.. but sure.
> 
>> Why is CLOCK_MONOTONIC preferred anyway - I would have thought any
>> adjustment would skew performance timings?
> 
> Because you can do inter-machine stuff with MONOTONIC and that's
> entirely impossible with MONO_RAW.

Ok, the man page does not make it sound as bad as that.

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2015-02-13  7:09 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-06 16:51 [PATCH v4 0/3] perf: User/kernel time correlation and event generation Pawel Moll
2014-11-06 16:51 ` [PATCH v4 1/3] perf: Use monotonic clock as a source for timestamps Pawel Moll
2014-11-27 15:05   ` Pawel Moll
2014-11-27 15:05     ` Pawel Moll
2014-12-11 13:39     ` Pawel Moll
2014-12-11 13:39       ` Pawel Moll
2015-01-05 13:01       ` Peter Zijlstra
2015-01-05 13:01         ` Peter Zijlstra
2015-01-21 15:47         ` Pawel Moll
2015-01-05 13:00   ` Peter Zijlstra
2015-01-21 15:52     ` Pawel Moll
2015-01-21 15:52       ` Pawel Moll
2015-01-21 19:48       ` Pawel Moll
2015-01-21 19:48         ` Pawel Moll
2015-01-21 20:07         ` Pawel Moll
2015-01-21 20:07           ` Pawel Moll
2015-01-16 12:41   ` Adrian Hunter
2015-01-16 12:41     ` Adrian Hunter
2015-01-21 20:27   ` [PATCH v5] " Pawel Moll
2015-01-21 20:27     ` Pawel Moll
2015-02-02 16:52     ` Pawel Moll
2015-02-02 16:52       ` Pawel Moll
     [not found]       ` <CAN+dfcT_6zZZ4oeyngUE5N0Wtx2B9CvXsfU71m+cuyXpq2KBdw@mail.gmail.com>
2015-02-03  9:20         ` Pawel Moll
2015-02-03  9:20           ` Pawel Moll
2015-02-11 16:12           ` Peter Zijlstra
2015-02-11 16:12             ` Peter Zijlstra
2015-02-12 10:04             ` Adrian Hunter
2015-02-12 10:04               ` Adrian Hunter
2015-02-12 10:28               ` Peter Zijlstra
2015-02-12 10:28                 ` Peter Zijlstra
2015-02-12 15:38                 ` Peter Zijlstra
2015-02-12 15:38                   ` Peter Zijlstra
2015-02-13  0:25                   ` John Stultz
2015-02-13  0:25                     ` John Stultz
2015-02-13  7:07                 ` Adrian Hunter
2015-02-13  7:07                   ` Adrian Hunter
2014-11-06 16:51 ` [PATCH v4 2/3] perf: Userspace event Pawel Moll
2014-11-06 16:51   ` Pawel Moll
2015-01-05 13:12   ` Peter Zijlstra
2015-01-05 13:12     ` Peter Zijlstra
2015-01-21 16:01     ` Pawel Moll
2015-01-21 16:01       ` Pawel Moll
2014-11-06 16:51 ` [PATCH v4 3/3] perf: Sample additional clock value Pawel Moll
2015-01-05 13:45   ` Peter Zijlstra
2015-01-05 13:45     ` Peter Zijlstra
2015-01-05 19:17     ` John Stultz
2015-01-21 17:12     ` Pawel Moll
2015-01-21 17:12       ` Pawel Moll
2015-01-21 17:44       ` John Stultz
2015-01-21 17:44         ` John Stultz
2015-01-21 17:54         ` Pawel Moll
2015-01-21 17:54           ` Pawel Moll
2015-01-21 18:05           ` John Stultz
2015-01-21 18:05             ` John Stultz
2015-01-23 17:06     ` Pawel Moll
2015-01-23 17:06       ` Pawel Moll
2015-01-23 18:05       ` David Ahern
2015-01-23 18:05         ` David Ahern

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.