linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/2] perf: User/kernel time correlation and event generation
@ 2014-09-23 17:03 Pawel Moll
  2014-09-23 17:03 ` [PATCH v2 1/2] perf: Add sampling of the raw monotonic clock Pawel Moll
  2014-09-23 17:03 ` [PATCH v2 2/2] perf: Userspace event Pawel Moll
  0 siblings, 2 replies; 18+ messages in thread
From: Pawel Moll @ 2014-09-23 17:03 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz
  Cc: linux-kernel, linux-api, Pawel Moll

Greetings,

Another week, another series. Previous versions:

- RFC: http://www.spinics.net/lists/kernel/msg1824419.html
- v1: http://thread.gmane.org/gmane.linux.kernel/1790231

The first patch adds an additional timestamp field in the perf
sample data, which can be requested for any perf event along
with normal PERF_SAMPLE_TIME. Events with both values appearing
periodically in the perf data allow user code to translate
raw monotonic time (obtained via POSIX clock API) to sched_clock
domain. Although any perf event can be used, the natural choice
would be a sched_switch trace event (for processes with root
permissions) or a hrtimer-based PERF_COUNT_SW_CPU_CLOCK.

It didn't attract any comments previously, so is just re-posted
without any changes.

The second patch, functionally orthogonal but complementing
the first one, builds on the ftrace "trace_maker" idea. It adds
a write syscall handler for the perf file descriptor, that can
be used to inject a userspace-generated data into the perf buffer.
It provides base for printf-like functionality in perf world.
If used with the previous patch, it can be also used to provide
synchronisation points for sched vs. raw monotonic time stamps
correlation.

The "uevent", besides the size and data from the write syscall,
contains a 32 bit integer value which can be used to distinguish
different types of the events. The type of following events can
be set with an additional ioctl.

Type value "0" is defined as a zero-terminated string (which
makes it trivial to generate with dprintf() library function),
but meaning of data for other types is of no interest for the
kernel. The intention is to host a list of "well known" types
(with reference parsers for them) in the user perf tool code.

Pawel Moll (2):
  perf: Add sampling of the raw monotonic clock
  perf: Userspace event

 include/linux/perf_event.h      | 16 +++++++
 include/uapi/linux/perf_event.h | 26 +++++++++++-
 kernel/events/core.c            | 93 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 134 insertions(+), 1 deletion(-)

-- 
1.9.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/2] perf: Add sampling of the raw monotonic clock
  2014-09-23 17:03 [PATCH v2 0/2] perf: User/kernel time correlation and event generation Pawel Moll
@ 2014-09-23 17:03 ` Pawel Moll
  2014-09-24  5:41   ` Namhyung Kim
  2014-09-23 17:03 ` [PATCH v2 2/2] perf: Userspace event Pawel Moll
  1 sibling, 1 reply; 18+ messages in thread
From: Pawel Moll @ 2014-09-23 17:03 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz
  Cc: linux-kernel, linux-api, Pawel Moll

This patch adds an option to sample raw monotonic clock
value with any perf event, with the the aim of allowing
time correlation between data coming from perf and
additional performance-related information generated in
userspace.

In order to correlate timestamps in perf data stream
with events happening in userspace (be it JITed debug
symbols or hwmon-originating environment data), user
requests a more or less periodic event (sched_switch
trace event of a hrtimer-based cpu-clock being the
most obvious examples) with PERF_SAMPLE_TIME *and*
PERF_SAMPLE_CLOCK_RAW_MONOTONIC and stamps
user-originating data with values obtained from
clock_gettime(CLOCK_MONOTONIC_RAW). Then, during
analysis, one looks at the perf events immediately
preceding and following (in terms of the
clock_raw_monotonic sample) the userspace event and
does simple linear approximation to get the equivalent
perf time.

        perf event     user event
       -----O--------------+-------------O------> t_mono
            :              |             :
            :              V             :
       -----O----------------------------O------> t_perf

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---
Changes since v1:

- none

 include/linux/perf_event.h      |  2 ++
 include/uapi/linux/perf_event.h |  4 +++-
 kernel/events/core.c            | 13 +++++++++++++
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 707617a..28b73b2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -602,6 +602,8 @@ struct perf_sample_data {
 	 * Transaction flags for abort events:
 	 */
 	u64				txn;
+	/* Raw monotonic timestamp, for userspace time correlation */
+	u64				clock_raw_monotonic;
 };
 
 static inline void perf_sample_data_init(struct perf_sample_data *data,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9269de2..e5a75c5 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -137,8 +137,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_DATA_SRC			= 1U << 15,
 	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
+	PERF_SAMPLE_CLOCK_RAW_MONOTONIC		= 1U << 18,
 
-	PERF_SAMPLE_MAX = 1U << 18,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
 };
 
 /*
@@ -686,6 +687,7 @@ enum perf_event_type {
 	 *	{ u64			weight;   } && PERF_SAMPLE_WEIGHT
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
+	 *	{ u64			clock_raw_monotonic; } && PERF_SAMPLE_CLOCK_RAW_MONOTONIC
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f9c1ed0..f6df547 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1216,6 +1216,9 @@ static void perf_event__header_size(struct perf_event *event)
 	if (sample_type & PERF_SAMPLE_TRANSACTION)
 		size += sizeof(data->txn);
 
+	if (sample_type & PERF_SAMPLE_CLOCK_RAW_MONOTONIC)
+		size += sizeof(data->clock_raw_monotonic);
+
 	event->header_size = size;
 }
 
@@ -4456,6 +4459,13 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
 		data->cpu_entry.cpu	 = raw_smp_processor_id();
 		data->cpu_entry.reserved = 0;
 	}
+
+	if (sample_type & PERF_SAMPLE_CLOCK_RAW_MONOTONIC) {
+		struct timespec now;
+
+		getrawmonotonic(&now);
+		data->clock_raw_monotonic = timespec_to_ns(&now);
+	}
 }
 
 void perf_event_header__init_id(struct perf_event_header *header,
@@ -4714,6 +4724,9 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_TRANSACTION)
 		perf_output_put(handle, data->txn);
 
+	if (sample_type & PERF_SAMPLE_CLOCK_RAW_MONOTONIC)
+		perf_output_put(handle, data->clock_raw_monotonic);
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 2/2] perf: Userspace event
  2014-09-23 17:03 [PATCH v2 0/2] perf: User/kernel time correlation and event generation Pawel Moll
  2014-09-23 17:03 ` [PATCH v2 1/2] perf: Add sampling of the raw monotonic clock Pawel Moll
@ 2014-09-23 17:03 ` Pawel Moll
  2014-09-24  6:07   ` Namhyung Kim
  1 sibling, 1 reply; 18+ messages in thread
From: Pawel Moll @ 2014-09-23 17:03 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz
  Cc: linux-kernel, linux-api, Pawel Moll

This patch adds a new PERF_COUNT_SW_UEVENT software event
and a related PERF_SAMPLE_UEVENT sample. User can now
write to the the perf file descriptor, injecting such
event in the perf buffer.

The UEVENT sample begins with a 32 bit unsigned integer
value describing type of the generated event. The type
can be set with PERF_EVENT_IOC_SET_UEVENT_TYPE ioctl
(zero is the default value). Then follows the 32 bit
unsigned size of the data (provided as the "count" argument
of the write syscall) and the data itself plus padding
aligning the overall sample size to 8 bytes.

Data Events with type equal 0 are defined as zero-terminated
strings, other types are defined by userspace (the perf tool
will contain a list of known values with reference
implementation of data content parsers).

Possible use cases for this feature:

- "perf_printf" like mechanism to add logging messages
  to one's perf session; in the simplest case it can be just

	uint32_t type = 0;
	ioctl(perf_fd, PERF_EVENT_IOC_SET_UEVENT_TYPE, &type);
	dprintf(perf_fd, "Message");

  (note that dprintf does *not* write the terminating '\0'; for
   users' convenience kernel add it when type is set to zero)

- "perf_printf" used by for perf trace tool,
  where certain traced process' calls are intercepted
  (eg. using LD_PRELOAD) and treated as logging
  requests, with it output redirected into the
  perf buffer

- synchronisation of performance data generated in
  user space with the perf stream coming from the kernel.
  For example, the marker can be inserted by a JIT engine
  after it generated portion of the code, but before the
  code is executed for the first time, allowing the
  post-processor to pick the correct debugging
  information.

- other example is a system profiling tool taking data
  from other sources than just perf, which generates a marker
  at the beginning at at the end of the session
  (also possibly periodically during the session) to
  synchronise kernel timestamps with clock values
  obtained in userspace (gtod or raw_monotonic).

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---
Changes since v1:

- replaced ioctl-based interface with write syscall
  (there's still a ioctl to set an event type)

- replaced all "USERSPACE_EVENT" and alike strings
  with much shorter "UEVENT"

 include/linux/perf_event.h      | 14 ++++++++
 include/uapi/linux/perf_event.h | 24 ++++++++++++-
 kernel/events/core.c            | 80 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 117 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 28b73b2..c130579 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -64,6 +64,12 @@ struct perf_raw_record {
 	void				*data;
 };
 
+struct perf_uevent {
+	u32				type;
+	u32				size;
+	u8				data[0];
+};
+
 /*
  * branch stack layout:
  *  nr: number of taken branches stored in entries[]
@@ -433,6 +439,7 @@ struct perf_event {
 
 	struct pid_namespace		*ns;
 	u64				id;
+	u32				uevent_type;
 
 	perf_overflow_handler_t		overflow_handler;
 	void				*overflow_handler_context;
@@ -604,6 +611,8 @@ struct perf_sample_data {
 	u64				txn;
 	/* Raw monotonic timestamp, for userspace time correlation */
 	u64				clock_raw_monotonic;
+	/* Userspace-originating event */
+	struct perf_uevent		*uevent;
 };
 
 static inline void perf_sample_data_init(struct perf_sample_data *data,
@@ -685,6 +694,9 @@ perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
 	}
 }
 
+int perf_uevent_write(struct perf_event *event, u32 type, u32 size,
+		const char __user *data);
+
 extern struct static_key_deferred perf_sched_events;
 
 static inline void perf_event_task_sched_in(struct task_struct *prev,
@@ -807,6 +819,8 @@ static inline int perf_event_refresh(struct perf_event *event, int refresh)
 
 static inline void
 perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)	{ }
+static inline int perf_uevent_write(struct perf_event *event, u32 type,
+		u32 size, const char __user *data)			{ return -EINVAL; }
 static inline void
 perf_bp_event(struct perf_event *event, void *data)			{ }
 
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index e5a75c5..1fabc2c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -110,6 +110,7 @@ enum perf_sw_ids {
 	PERF_COUNT_SW_ALIGNMENT_FAULTS		= 7,
 	PERF_COUNT_SW_EMULATION_FAULTS		= 8,
 	PERF_COUNT_SW_DUMMY			= 9,
+	PERF_COUNT_SW_UEVENT			= 10,
 
 	PERF_COUNT_SW_MAX,			/* non-ABI */
 };
@@ -138,8 +139,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
 	PERF_SAMPLE_CLOCK_RAW_MONOTONIC		= 1U << 18,
+	PERF_SAMPLE_UEVENT			= 1U << 19,
 
-	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
 };
 
 /*
@@ -350,6 +352,7 @@ struct perf_event_attr {
 #define PERF_EVENT_IOC_SET_OUTPUT	_IO ('$', 5)
 #define PERF_EVENT_IOC_SET_FILTER	_IOW('$', 6, char *)
 #define PERF_EVENT_IOC_ID		_IOR('$', 7, __u64 *)
+#define PERF_EVENT_IOC_SET_UEVENT_TYPE	_IOW('$', 8, __u32)
 
 enum perf_event_ioc_flags {
 	PERF_IOC_FLAG_GROUP		= 1U << 0,
@@ -688,6 +691,25 @@ enum perf_event_type {
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
 	 *	{ u64			clock_raw_monotonic; } && PERF_SAMPLE_CLOCK_RAW_MONOTONIC
+	 *
+	 *	#
+	 *	# Contents of UEVENT sample data depend on its type.
+	 *	#
+	 *	# Type 0 means that the data is a zero-terminated string that
+	 *	# can be printf-ed in the normal way.
+	 *	#
+	 *	# Meaning of other type values depends on the userspace
+	 *	# and the perf tool code contains a list of those with
+	 *	# reference implementations of parsers.
+	 *	#
+	 *	# Overall size of the sample (including type and size fields)
+	 *	# is always aligned to 8 bytes by adding padding after
+	 *	# the data.
+	 *	#
+	 *	{ u32			type;
+	 *	  u32			size;
+	 *	  char			data[size];
+	 *	  char                  __padding[] } && PERF_SAMPLE_UEVENT
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f6df547..69ca8c9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3526,6 +3526,15 @@ perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 	return perf_read_hw(event, buf, count);
 }
 
+static ssize_t
+perf_write(struct file *file, const char __user *buf, size_t count,
+		loff_t *ppos)
+{
+	struct perf_event *event = file->private_data;
+
+	return perf_uevent_write(event, event->uevent_type, count, buf);
+}
+
 static unsigned int perf_poll(struct file *file, poll_table *wait)
 {
 	struct perf_event *event = file->private_data;
@@ -3636,6 +3645,17 @@ unlock:
 	return ret;
 }
 
+static int perf_event_set_uevent_type(struct perf_event *event, u32 __user *arg)
+{
+	if (!arg)
+		return -EINVAL;
+
+	if (copy_from_user(&event->uevent_type, arg, sizeof(*arg)))
+		return -EFAULT;
+
+	return 0;
+}
+
 static const struct file_operations perf_fops;
 
 static inline int perf_fget_light(int fd, struct fd *p)
@@ -3709,6 +3729,9 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case PERF_EVENT_IOC_SET_FILTER:
 		return perf_event_set_filter(event, (void __user *)arg);
 
+	case PERF_EVENT_IOC_SET_UEVENT_TYPE:
+		return perf_event_set_uevent_type(event, (u32 __user *)arg);
+
 	default:
 		return -ENOTTY;
 	}
@@ -4244,6 +4267,7 @@ static const struct file_operations perf_fops = {
 	.llseek			= no_llseek,
 	.release		= perf_release,
 	.read			= perf_read,
+	.write			= perf_write,
 	.poll			= perf_poll,
 	.unlocked_ioctl		= perf_ioctl,
 	.compat_ioctl		= perf_compat_ioctl,
@@ -4727,6 +4751,16 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_CLOCK_RAW_MONOTONIC)
 		perf_output_put(handle, data->clock_raw_monotonic);
 
+	if (sample_type & PERF_SAMPLE_UEVENT) {
+		int size = data->uevent->size;
+		int padding = ALIGN(size, sizeof(u64)) - size;
+
+		perf_output_put(handle, data->uevent->type);
+		perf_output_put(handle, size);
+		__output_copy(handle, data->uevent->data, size);
+		perf_output_skip(handle, padding);
+	};
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -4834,6 +4868,10 @@ void perf_prepare_sample(struct perf_event_header *header,
 		data->stack_user_size = stack_size;
 		header->size += size;
 	}
+
+	if (sample_type & PERF_SAMPLE_UEVENT)
+		header->size += sizeof(u32) + sizeof(u32) +
+				ALIGN(data->uevent->size, sizeof(u64));
 }
 
 static void perf_event_output(struct perf_event *event,
@@ -5961,6 +5999,48 @@ static struct pmu perf_swevent = {
 	.event_idx	= perf_swevent_event_idx,
 };
 
+int perf_uevent_write(struct perf_event *event, u32 type, u32 size,
+		const char __user *data)
+{
+	struct perf_uevent *uevent;
+	struct perf_sample_data sample;
+	struct pt_regs *regs = current_pt_regs();
+
+	/* Need some sane limit */
+	if (size > PAGE_SIZE)
+		return -EFBIG;
+
+	/*
+	 * Type 0 means zero-terminated string, but standard dprintf()
+	 * doesn't write the zero character. Let's allocate one more byte
+	 * for such event...
+	 */
+	uevent = kmalloc(sizeof(*uevent) + size + (type == 0 ? 1 : 0),
+			GFP_KERNEL);
+	if (!uevent)
+		return -ENOMEM;
+
+	if (copy_from_user(uevent->data, data, size)) {
+		kfree(uevent);
+		return -EFAULT;
+	}
+
+	/* ... and then zero it, if necessary. */
+	if (type == 0 && uevent->data[size - 1])
+		uevent->data[size++] = '\0';
+
+	uevent->type = type;
+	uevent->size = size;
+
+	perf_sample_data_init(&sample, 0, 0);
+	sample.uevent = uevent;
+	perf_event_output(event, &sample, regs);
+
+	kfree(uevent);
+
+	return size;
+}
+
 #ifdef CONFIG_EVENT_TRACING
 
 static int perf_tp_filter_match(struct perf_event *event,
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] perf: Add sampling of the raw monotonic clock
  2014-09-23 17:03 ` [PATCH v2 1/2] perf: Add sampling of the raw monotonic clock Pawel Moll
@ 2014-09-24  5:41   ` Namhyung Kim
  2014-09-25 10:49     ` Pawel Moll
  0 siblings, 1 reply; 18+ messages in thread
From: Namhyung Kim @ 2014-09-24  5:41 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	linux-kernel, linux-api

Hi Pawel,

On Tue, 23 Sep 2014 18:03:06 +0100, Pawel Moll wrote:
> This patch adds an option to sample raw monotonic clock
> value with any perf event, with the the aim of allowing
> time correlation between data coming from perf and
> additional performance-related information generated in
> userspace.
>
> In order to correlate timestamps in perf data stream
> with events happening in userspace (be it JITed debug
> symbols or hwmon-originating environment data), user
> requests a more or less periodic event (sched_switch
> trace event of a hrtimer-based cpu-clock being the
> most obvious examples) with PERF_SAMPLE_TIME *and*
> PERF_SAMPLE_CLOCK_RAW_MONOTONIC and stamps
> user-originating data with values obtained from
> clock_gettime(CLOCK_MONOTONIC_RAW). Then, during
> analysis, one looks at the perf events immediately
> preceding and following (in terms of the
> clock_raw_monotonic sample) the userspace event and
> does simple linear approximation to get the equivalent
> perf time.
>
>         perf event     user event
>        -----O--------------+-------------O------> t_mono
>             :              |             :
>             :              V             :
>        -----O----------------------------O------> t_perf

Probably a dumb question: why not make PERF_SAMPLE_TIME being monotonic
clock instead of adding a new PERF_SAMPLE_CLOCK_XXX flag?  Maybe we can
add a new ioctl command like PERF_EVENT_IOC_SET_CLOCK so that one can
pass a clock id.

Thanks,
Namhyung


>
> Signed-off-by: Pawel Moll <pawel.moll@arm.com>
> ---
> Changes since v1:
>
> - none
>
>  include/linux/perf_event.h      |  2 ++
>  include/uapi/linux/perf_event.h |  4 +++-
>  kernel/events/core.c            | 13 +++++++++++++
>  3 files changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 707617a..28b73b2 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -602,6 +602,8 @@ struct perf_sample_data {
>  	 * Transaction flags for abort events:
>  	 */
>  	u64				txn;
> +	/* Raw monotonic timestamp, for userspace time correlation */
> +	u64				clock_raw_monotonic;
>  };
>  
>  static inline void perf_sample_data_init(struct perf_sample_data *data,
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 9269de2..e5a75c5 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -137,8 +137,9 @@ enum perf_event_sample_format {
>  	PERF_SAMPLE_DATA_SRC			= 1U << 15,
>  	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
>  	PERF_SAMPLE_TRANSACTION			= 1U << 17,
> +	PERF_SAMPLE_CLOCK_RAW_MONOTONIC		= 1U << 18,
>  
> -	PERF_SAMPLE_MAX = 1U << 18,		/* non-ABI */
> +	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
>  };
>  
>  /*
> @@ -686,6 +687,7 @@ enum perf_event_type {
>  	 *	{ u64			weight;   } && PERF_SAMPLE_WEIGHT
>  	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
>  	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
> +	 *	{ u64			clock_raw_monotonic; } && PERF_SAMPLE_CLOCK_RAW_MONOTONIC
>  	 * };
>  	 */
>  	PERF_RECORD_SAMPLE			= 9,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f9c1ed0..f6df547 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1216,6 +1216,9 @@ static void perf_event__header_size(struct perf_event *event)
>  	if (sample_type & PERF_SAMPLE_TRANSACTION)
>  		size += sizeof(data->txn);
>  
> +	if (sample_type & PERF_SAMPLE_CLOCK_RAW_MONOTONIC)
> +		size += sizeof(data->clock_raw_monotonic);
> +
>  	event->header_size = size;
>  }
>  
> @@ -4456,6 +4459,13 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
>  		data->cpu_entry.cpu	 = raw_smp_processor_id();
>  		data->cpu_entry.reserved = 0;
>  	}
> +
> +	if (sample_type & PERF_SAMPLE_CLOCK_RAW_MONOTONIC) {
> +		struct timespec now;
> +
> +		getrawmonotonic(&now);
> +		data->clock_raw_monotonic = timespec_to_ns(&now);
> +	}
>  }
>  
>  void perf_event_header__init_id(struct perf_event_header *header,
> @@ -4714,6 +4724,9 @@ void perf_output_sample(struct perf_output_handle *handle,
>  	if (sample_type & PERF_SAMPLE_TRANSACTION)
>  		perf_output_put(handle, data->txn);
>  
> +	if (sample_type & PERF_SAMPLE_CLOCK_RAW_MONOTONIC)
> +		perf_output_put(handle, data->clock_raw_monotonic);
> +
>  	if (!event->attr.watermark) {
>  		int wakeup_events = event->attr.wakeup_events;

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] perf: Userspace event
  2014-09-23 17:03 ` [PATCH v2 2/2] perf: Userspace event Pawel Moll
@ 2014-09-24  6:07   ` Namhyung Kim
  2014-09-24  7:20     ` Ingo Molnar
  2014-09-25 12:45     ` Pawel Moll
  0 siblings, 2 replies; 18+ messages in thread
From: Namhyung Kim @ 2014-09-24  6:07 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	linux-kernel, linux-api

On Tue, 23 Sep 2014 18:03:07 +0100, Pawel Moll wrote:
> This patch adds a new PERF_COUNT_SW_UEVENT software event
> and a related PERF_SAMPLE_UEVENT sample. User can now
> write to the the perf file descriptor, injecting such
> event in the perf buffer.

It seems the PERF_SAMPLE_UEVENT sample can be injected to any event.  So
why the PERF_COUNT_SW_UEVENT is needed?  At least one can use the
SW_DUMMY event for that purpose.

Also I think it'd be better to be a record type (PERF_RECORD_XXX)
instead of a sample flag (PERF_SAMPLE_XXX).  In perf tools, we already
use perf_user_event_type for synthesized userspace events.  This way it
can avoid unnecessary sample processing for userspace events.

For contents, I prefer to give complete control to users - kernel
doesn't need to care about it other than its size.  If one just wants to
use strings only, she can write them directly.  If others want to mix
different types of data, they might need to define a data format for
their use.

Thanks,
Namhyung


>
> The UEVENT sample begins with a 32 bit unsigned integer
> value describing type of the generated event. The type
> can be set with PERF_EVENT_IOC_SET_UEVENT_TYPE ioctl
> (zero is the default value). Then follows the 32 bit
> unsigned size of the data (provided as the "count" argument
> of the write syscall) and the data itself plus padding
> aligning the overall sample size to 8 bytes.
>
> Data Events with type equal 0 are defined as zero-terminated
> strings, other types are defined by userspace (the perf tool
> will contain a list of known values with reference
> implementation of data content parsers).
>
> Possible use cases for this feature:
>
> - "perf_printf" like mechanism to add logging messages
>   to one's perf session; in the simplest case it can be just
>
> 	uint32_t type = 0;
> 	ioctl(perf_fd, PERF_EVENT_IOC_SET_UEVENT_TYPE, &type);
> 	dprintf(perf_fd, "Message");
>
>   (note that dprintf does *not* write the terminating '\0'; for
>    users' convenience kernel add it when type is set to zero)
>
> - "perf_printf" used by for perf trace tool,
>   where certain traced process' calls are intercepted
>   (eg. using LD_PRELOAD) and treated as logging
>   requests, with it output redirected into the
>   perf buffer
>
> - synchronisation of performance data generated in
>   user space with the perf stream coming from the kernel.
>   For example, the marker can be inserted by a JIT engine
>   after it generated portion of the code, but before the
>   code is executed for the first time, allowing the
>   post-processor to pick the correct debugging
>   information.
>
> - other example is a system profiling tool taking data
>   from other sources than just perf, which generates a marker
>   at the beginning at at the end of the session
>   (also possibly periodically during the session) to
>   synchronise kernel timestamps with clock values
>   obtained in userspace (gtod or raw_monotonic).
>
> Signed-off-by: Pawel Moll <pawel.moll@arm.com>
> ---
> Changes since v1:
>
> - replaced ioctl-based interface with write syscall
>   (there's still a ioctl to set an event type)
>
> - replaced all "USERSPACE_EVENT" and alike strings
>   with much shorter "UEVENT"
>
>  include/linux/perf_event.h      | 14 ++++++++
>  include/uapi/linux/perf_event.h | 24 ++++++++++++-
>  kernel/events/core.c            | 80 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 117 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 28b73b2..c130579 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -64,6 +64,12 @@ struct perf_raw_record {
>  	void				*data;
>  };
>  
> +struct perf_uevent {
> +	u32				type;
> +	u32				size;
> +	u8				data[0];
> +};
> +
>  /*
>   * branch stack layout:
>   *  nr: number of taken branches stored in entries[]
> @@ -433,6 +439,7 @@ struct perf_event {
>  
>  	struct pid_namespace		*ns;
>  	u64				id;
> +	u32				uevent_type;
>  
>  	perf_overflow_handler_t		overflow_handler;
>  	void				*overflow_handler_context;
> @@ -604,6 +611,8 @@ struct perf_sample_data {
>  	u64				txn;
>  	/* Raw monotonic timestamp, for userspace time correlation */
>  	u64				clock_raw_monotonic;
> +	/* Userspace-originating event */
> +	struct perf_uevent		*uevent;
>  };
>  
>  static inline void perf_sample_data_init(struct perf_sample_data *data,
> @@ -685,6 +694,9 @@ perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
>  	}
>  }
>  
> +int perf_uevent_write(struct perf_event *event, u32 type, u32 size,
> +		const char __user *data);
> +
>  extern struct static_key_deferred perf_sched_events;
>  
>  static inline void perf_event_task_sched_in(struct task_struct *prev,
> @@ -807,6 +819,8 @@ static inline int perf_event_refresh(struct perf_event *event, int refresh)
>  
>  static inline void
>  perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)	{ }
> +static inline int perf_uevent_write(struct perf_event *event, u32 type,
> +		u32 size, const char __user *data)			{ return -EINVAL; }
>  static inline void
>  perf_bp_event(struct perf_event *event, void *data)			{ }
>  
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index e5a75c5..1fabc2c 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -110,6 +110,7 @@ enum perf_sw_ids {
>  	PERF_COUNT_SW_ALIGNMENT_FAULTS		= 7,
>  	PERF_COUNT_SW_EMULATION_FAULTS		= 8,
>  	PERF_COUNT_SW_DUMMY			= 9,
> +	PERF_COUNT_SW_UEVENT			= 10,
>  
>  	PERF_COUNT_SW_MAX,			/* non-ABI */
>  };
> @@ -138,8 +139,9 @@ enum perf_event_sample_format {
>  	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
>  	PERF_SAMPLE_TRANSACTION			= 1U << 17,
>  	PERF_SAMPLE_CLOCK_RAW_MONOTONIC		= 1U << 18,
> +	PERF_SAMPLE_UEVENT			= 1U << 19,
>  
> -	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
> +	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
>  };
>  
>  /*
> @@ -350,6 +352,7 @@ struct perf_event_attr {
>  #define PERF_EVENT_IOC_SET_OUTPUT	_IO ('$', 5)
>  #define PERF_EVENT_IOC_SET_FILTER	_IOW('$', 6, char *)
>  #define PERF_EVENT_IOC_ID		_IOR('$', 7, __u64 *)
> +#define PERF_EVENT_IOC_SET_UEVENT_TYPE	_IOW('$', 8, __u32)
>  
>  enum perf_event_ioc_flags {
>  	PERF_IOC_FLAG_GROUP		= 1U << 0,
> @@ -688,6 +691,25 @@ enum perf_event_type {
>  	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
>  	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
>  	 *	{ u64			clock_raw_monotonic; } && PERF_SAMPLE_CLOCK_RAW_MONOTONIC
> +	 *
> +	 *	#
> +	 *	# Contents of UEVENT sample data depend on its type.
> +	 *	#
> +	 *	# Type 0 means that the data is a zero-terminated string that
> +	 *	# can be printf-ed in the normal way.
> +	 *	#
> +	 *	# Meaning of other type values depends on the userspace
> +	 *	# and the perf tool code contains a list of those with
> +	 *	# reference implementations of parsers.
> +	 *	#
> +	 *	# Overall size of the sample (including type and size fields)
> +	 *	# is always aligned to 8 bytes by adding padding after
> +	 *	# the data.
> +	 *	#
> +	 *	{ u32			type;
> +	 *	  u32			size;
> +	 *	  char			data[size];
> +	 *	  char                  __padding[] } && PERF_SAMPLE_UEVENT
>  	 * };
>  	 */
>  	PERF_RECORD_SAMPLE			= 9,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f6df547..69ca8c9 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3526,6 +3526,15 @@ perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
>  	return perf_read_hw(event, buf, count);
>  }
>  
> +static ssize_t
> +perf_write(struct file *file, const char __user *buf, size_t count,
> +		loff_t *ppos)
> +{
> +	struct perf_event *event = file->private_data;
> +
> +	return perf_uevent_write(event, event->uevent_type, count, buf);
> +}
> +
>  static unsigned int perf_poll(struct file *file, poll_table *wait)
>  {
>  	struct perf_event *event = file->private_data;
> @@ -3636,6 +3645,17 @@ unlock:
>  	return ret;
>  }
>  
> +static int perf_event_set_uevent_type(struct perf_event *event, u32 __user *arg)
> +{
> +	if (!arg)
> +		return -EINVAL;
> +
> +	if (copy_from_user(&event->uevent_type, arg, sizeof(*arg)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
>  static const struct file_operations perf_fops;
>  
>  static inline int perf_fget_light(int fd, struct fd *p)
> @@ -3709,6 +3729,9 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>  	case PERF_EVENT_IOC_SET_FILTER:
>  		return perf_event_set_filter(event, (void __user *)arg);
>  
> +	case PERF_EVENT_IOC_SET_UEVENT_TYPE:
> +		return perf_event_set_uevent_type(event, (u32 __user *)arg);
> +
>  	default:
>  		return -ENOTTY;
>  	}
> @@ -4244,6 +4267,7 @@ static const struct file_operations perf_fops = {
>  	.llseek			= no_llseek,
>  	.release		= perf_release,
>  	.read			= perf_read,
> +	.write			= perf_write,
>  	.poll			= perf_poll,
>  	.unlocked_ioctl		= perf_ioctl,
>  	.compat_ioctl		= perf_compat_ioctl,
> @@ -4727,6 +4751,16 @@ void perf_output_sample(struct perf_output_handle *handle,
>  	if (sample_type & PERF_SAMPLE_CLOCK_RAW_MONOTONIC)
>  		perf_output_put(handle, data->clock_raw_monotonic);
>  
> +	if (sample_type & PERF_SAMPLE_UEVENT) {
> +		int size = data->uevent->size;
> +		int padding = ALIGN(size, sizeof(u64)) - size;
> +
> +		perf_output_put(handle, data->uevent->type);
> +		perf_output_put(handle, size);
> +		__output_copy(handle, data->uevent->data, size);
> +		perf_output_skip(handle, padding);
> +	};
> +
>  	if (!event->attr.watermark) {
>  		int wakeup_events = event->attr.wakeup_events;
>  
> @@ -4834,6 +4868,10 @@ void perf_prepare_sample(struct perf_event_header *header,
>  		data->stack_user_size = stack_size;
>  		header->size += size;
>  	}
> +
> +	if (sample_type & PERF_SAMPLE_UEVENT)
> +		header->size += sizeof(u32) + sizeof(u32) +
> +				ALIGN(data->uevent->size, sizeof(u64));
>  }
>  
>  static void perf_event_output(struct perf_event *event,
> @@ -5961,6 +5999,48 @@ static struct pmu perf_swevent = {
>  	.event_idx	= perf_swevent_event_idx,
>  };
>  
> +int perf_uevent_write(struct perf_event *event, u32 type, u32 size,
> +		const char __user *data)
> +{
> +	struct perf_uevent *uevent;
> +	struct perf_sample_data sample;
> +	struct pt_regs *regs = current_pt_regs();
> +
> +	/* Need some sane limit */
> +	if (size > PAGE_SIZE)
> +		return -EFBIG;
> +
> +	/*
> +	 * Type 0 means zero-terminated string, but standard dprintf()
> +	 * doesn't write the zero character. Let's allocate one more byte
> +	 * for such event...
> +	 */
> +	uevent = kmalloc(sizeof(*uevent) + size + (type == 0 ? 1 : 0),
> +			GFP_KERNEL);
> +	if (!uevent)
> +		return -ENOMEM;
> +
> +	if (copy_from_user(uevent->data, data, size)) {
> +		kfree(uevent);
> +		return -EFAULT;
> +	}
> +
> +	/* ... and then zero it, if necessary. */
> +	if (type == 0 && uevent->data[size - 1])
> +		uevent->data[size++] = '\0';
> +
> +	uevent->type = type;
> +	uevent->size = size;
> +
> +	perf_sample_data_init(&sample, 0, 0);
> +	sample.uevent = uevent;
> +	perf_event_output(event, &sample, regs);
> +
> +	kfree(uevent);
> +
> +	return size;
> +}
> +
>  #ifdef CONFIG_EVENT_TRACING
>  
>  static int perf_tp_filter_match(struct perf_event *event,

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] perf: Userspace event
  2014-09-24  6:07   ` Namhyung Kim
@ 2014-09-24  7:20     ` Ingo Molnar
  2014-09-26  6:21       ` Namhyung Kim
  2014-09-25 12:45     ` Pawel Moll
  1 sibling, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2014-09-24  7:20 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Pawel Moll, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	John Stultz, linux-kernel, linux-api


* Namhyung Kim <namhyung@kernel.org> wrote:

> On Tue, 23 Sep 2014 18:03:07 +0100, Pawel Moll wrote:
> > This patch adds a new PERF_COUNT_SW_UEVENT software event
> > and a related PERF_SAMPLE_UEVENT sample. User can now
> > write to the the perf file descriptor, injecting such
> > event in the perf buffer.
> 
> It seems the PERF_SAMPLE_UEVENT sample can be injected to any event.  So
> why the PERF_COUNT_SW_UEVENT is needed?  At least one can use the
> SW_DUMMY event for that purpose.
> 
> Also I think it'd be better to be a record type (PERF_RECORD_XXX)
> instead of a sample flag (PERF_SAMPLE_XXX).  In perf tools, we already
> use perf_user_event_type for synthesized userspace events.  This way it
> can avoid unnecessary sample processing for userspace events.
> 
> For contents, I prefer to give complete control to users - kernel
> doesn't need to care about it other than its size.  If one just wants to
> use strings only, she can write them directly.  If others want to mix
> different types of data, they might need to define a data format for
> their use.

It would also be nice to add support for this to tools/perf/ (so 
that 'trace' displays such entries in a perf.data), with a 
minimum testcase for 'perf test' as well.

Perhaps also add a small sub-utility to inject such events from 
the command line, such as:

  trace user-event "this is a test message"

('trace' is a shortcut command for 'perf trace'.)

It would have a usecase straight away: perf could be used to 
easily trace script execution for example.

For that probably another mode of user event generation would be 
needed as well: a process that has no access to any perf fds 
should still be able to generate user events, if the 
profiling/tracing context has permitted that. In this case we'd 
inject the event either into the first, or all currently active 
events (but only once per output buffer, or so).

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] perf: Add sampling of the raw monotonic clock
  2014-09-24  5:41   ` Namhyung Kim
@ 2014-09-25 10:49     ` Pawel Moll
  2014-09-26  6:16       ` Namhyung Kim
  0 siblings, 1 reply; 18+ messages in thread
From: Pawel Moll @ 2014-09-25 10:49 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	linux-kernel, linux-api

On Wed, 2014-09-24 at 06:41 +0100, Namhyung Kim wrote:
> Hi Pawel,
> 
> On Tue, 23 Sep 2014 18:03:06 +0100, Pawel Moll wrote:
> > This patch adds an option to sample raw monotonic clock
> > value with any perf event, with the the aim of allowing
> > time correlation between data coming from perf and
> > additional performance-related information generated in
> > userspace.
> >
> > In order to correlate timestamps in perf data stream
> > with events happening in userspace (be it JITed debug
> > symbols or hwmon-originating environment data), user
> > requests a more or less periodic event (sched_switch
> > trace event of a hrtimer-based cpu-clock being the
> > most obvious examples) with PERF_SAMPLE_TIME *and*
> > PERF_SAMPLE_CLOCK_RAW_MONOTONIC and stamps
> > user-originating data with values obtained from
> > clock_gettime(CLOCK_MONOTONIC_RAW). Then, during
> > analysis, one looks at the perf events immediately
> > preceding and following (in terms of the
> > clock_raw_monotonic sample) the userspace event and
> > does simple linear approximation to get the equivalent
> > perf time.
> >
> >         perf event     user event
> >        -----O--------------+-------------O------> t_mono
> >             :              |             :
> >             :              V             :
> >        -----O----------------------------O------> t_perf
> 
> Probably a dumb question: why not make PERF_SAMPLE_TIME being monotonic
> clock instead of adding a new PERF_SAMPLE_CLOCK_XXX flag? 

It's a valid question. And it was asked before, in this thread:

http://thread.gmane.org/gmane.linux.kernel/1611683

A summary of the answer would be:

> On Wed, 2013-12-11 at 12:07 +0000, Ingo Molnar wrote: 
> > * John Stultz <john.stultz@linaro.org> wrote:
> > 
> > > [...]
> > > 
> > > I'd much rather see perf export CLOCK_MONOTONIC_RAW timestamps, 
> > > since that clockid is well defined. [...]
> > 
> > So the problem with that clock is that it does the following for every 
> > timestamp:
> > 
> >         cycle_now = clock->read(clock);
> > 
> > ... which is impossibly slow if something like the HPET is used, which 
> > is rather common - so this is a non-starter to timestamp perf events 
> > with. We use the scheduler clock as a reasonable compromise between 
> > scalability and clock globality.

Now, for your other comment:

>  Maybe we can
> add a new ioctl command like PERF_EVENT_IOC_SET_CLOCK so that one can
> pass a clock id.

Did you mean selecting a time source for PERF_SAMPLE_TIME (so we don't
need two timestamps in a sample)? It would be doable, I guess, but what
if someone *wants* to have sched clock as the timestamps source (because
it's cheap) but still be able to correlate them with userspace? In this
case two separate timestamps are required to do the approximation.

Now, if you meant it to select a time source for the "other timestamp",
let's call it in a more generic way: PERF_SAMPLE_CLOCK_VALUE this could
work, yes. I see ALSA doing something similar (see
SNDRV_PCM_TSTAMP_TYPE_* enum in include/uapi/sound/asound.h). One
question would be: how does it work with groups? Does setting a
timesource for the leader selects them for all members? I don't think
you want a bunch of samples with different timestamp sources in the same
buffer.

Anyway, great stuff, thanks!

Pawel


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] perf: Userspace event
  2014-09-24  6:07   ` Namhyung Kim
  2014-09-24  7:20     ` Ingo Molnar
@ 2014-09-25 12:45     ` Pawel Moll
  2014-09-26  6:23       ` Namhyung Kim
  1 sibling, 1 reply; 18+ messages in thread
From: Pawel Moll @ 2014-09-25 12:45 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	linux-kernel, linux-api

On Wed, 2014-09-24 at 07:07 +0100, Namhyung Kim wrote:
> On Tue, 23 Sep 2014 18:03:07 +0100, Pawel Moll wrote:
> > This patch adds a new PERF_COUNT_SW_UEVENT software event
> > and a related PERF_SAMPLE_UEVENT sample. User can now
> > write to the the perf file descriptor, injecting such
> > event in the perf buffer.
> 
> It seems the PERF_SAMPLE_UEVENT sample can be injected to any event.  So
> why the PERF_COUNT_SW_UEVENT is needed?  At least one can use the
> SW_DUMMY event for that purpose.

You're right. I needed a different SW type in one of my early
prototypes, but it's not the case any more. Consider it gone.

> Also I think it'd be better to be a record type (PERF_RECORD_XXX)
> instead of a sample flag (PERF_SAMPLE_XXX).  In perf tools, we already
> use perf_user_event_type for synthesized userspace events.  This way it
> can avoid unnecessary sample processing for userspace events.

Fine with me. If no one objects, I'm more than happy to use
PERF_RECORD_UEVENT = 11 for it.

> For contents, I prefer to give complete control to users - kernel
> doesn't need to care about it other than its size.  If one just wants to
> use strings only, she can write them directly.  If others want to mix
> different types of data, they might need to define a data format for
> their use.

Are you saying to drop even the "type 0 means zero-terminated string"
definition, even if everything else is up to the user? I quite like that
idea, especially combined with write()ing to the perf_fd (it is very
much like trace_marker then, which is beautiful in its simplicity), but
the feelings are not that strong to fight a war over it.

Pawel


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] perf: Add sampling of the raw monotonic clock
  2014-09-25 10:49     ` Pawel Moll
@ 2014-09-26  6:16       ` Namhyung Kim
  2014-09-26 10:58         ` Pawel Moll
  0 siblings, 1 reply; 18+ messages in thread
From: Namhyung Kim @ 2014-09-26  6:16 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Namhyung Kim, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	John Stultz,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org



Hi Pawel,

On Thu, 25 Sep 2014 11:49:58 +0100, Pawel Moll wrote:
> On Wed, 2014-09-24 at 06:41 +0100, Namhyung Kim wrote:
>> Hi Pawel,
>> 
>> On Tue, 23 Sep 2014 18:03:06 +0100, Pawel Moll wrote:
>> > This patch adds an option to sample raw monotonic clock
>> > value with any perf event, with the the aim of allowing
>> > time correlation between data coming from perf and
>> > additional performance-related information generated in
>> > userspace.
>> >
>> > In order to correlate timestamps in perf data stream
>> > with events happening in userspace (be it JITed debug
>> > symbols or hwmon-originating environment data), user
>> > requests a more or less periodic event (sched_switch
>> > trace event of a hrtimer-based cpu-clock being the
>> > most obvious examples) with PERF_SAMPLE_TIME *and*
>> > PERF_SAMPLE_CLOCK_RAW_MONOTONIC and stamps
>> > user-originating data with values obtained from
>> > clock_gettime(CLOCK_MONOTONIC_RAW). Then, during
>> > analysis, one looks at the perf events immediately
>> > preceding and following (in terms of the
>> > clock_raw_monotonic sample) the userspace event and
>> > does simple linear approximation to get the equivalent
>> > perf time.
>> >
>> >         perf event     user event
>> >        -----O--------------+-------------O------> t_mono
>> >             :              |             :
>> >             :              V             :
>> >        -----O----------------------------O------> t_perf
>> 
>> Probably a dumb question: why not make PERF_SAMPLE_TIME being monotonic
>> clock instead of adding a new PERF_SAMPLE_CLOCK_XXX flag? 
>
> It's a valid question. And it was asked before, in this thread:
>
> http://thread.gmane.org/gmane.linux.kernel/1611683
>
> A summary of the answer would be:
>
>> On Wed, 2013-12-11 at 12:07 +0000, Ingo Molnar wrote: 
>> > * John Stultz <john.stultz-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
>> > 
>> > > [...]
>> > > 
>> > > I'd much rather see perf export CLOCK_MONOTONIC_RAW timestamps, 
>> > > since that clockid is well defined. [...]
>> > 
>> > So the problem with that clock is that it does the following for every 
>> > timestamp:
>> > 
>> >         cycle_now = clock->read(clock);
>> > 
>> > ... which is impossibly slow if something like the HPET is used, which 
>> > is rather common - so this is a non-starter to timestamp perf events 
>> > with. We use the scheduler clock as a reasonable compromise between 
>> > scalability and clock globality.

Hmm.. but wouldn't it be up to user?  If one suffers from a slow clock
she will use default and fast one.  Well, if perf can know it'd be bad
for performance, it can warn users about the issue.


>
> Now, for your other comment:
>
>>  Maybe we can
>> add a new ioctl command like PERF_EVENT_IOC_SET_CLOCK so that one can
>> pass a clock id.
>
> Did you mean selecting a time source for PERF_SAMPLE_TIME (so we don't
> need two timestamps in a sample)?

Yes.


> It would be doable, I guess, but what
> if someone *wants* to have sched clock as the timestamps source (because
> it's cheap) but still be able to correlate them with userspace? In this
> case two separate timestamps are required to do the approximation.

But by collecting two timestamps, you'll loose the win of the first
timestamp, no?


>
> Now, if you meant it to select a time source for the "other timestamp",
> let's call it in a more generic way: PERF_SAMPLE_CLOCK_VALUE this could
> work, yes. I see ALSA doing something similar (see
> SNDRV_PCM_TSTAMP_TYPE_* enum in include/uapi/sound/asound.h). One
> question would be: how does it work with groups? Does setting a
> timesource for the leader selects them for all members? I don't think
> you want a bunch of samples with different timestamp sources in the same
> buffer.

My answer would be: if you don't want it, don't do that. :)

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] perf: Userspace event
  2014-09-24  7:20     ` Ingo Molnar
@ 2014-09-26  6:21       ` Namhyung Kim
  2014-09-26 10:59         ` Pawel Moll
  0 siblings, 1 reply; 18+ messages in thread
From: Namhyung Kim @ 2014-09-26  6:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Namhyung Kim, Pawel Moll, Richard Cochran, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz,
	public-linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	public-linux-api-u79uwXL29TY76Z2rM5mHXA



Hi Ingo,

On Wed, 24 Sep 2014 09:20:17 +0200, Ingo Molnar wrote:
> * Namhyung Kim <namhyung-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
>> On Tue, 23 Sep 2014 18:03:07 +0100, Pawel Moll wrote:
>> > This patch adds a new PERF_COUNT_SW_UEVENT software event
>> > and a related PERF_SAMPLE_UEVENT sample. User can now
>> > write to the the perf file descriptor, injecting such
>> > event in the perf buffer.
>> 
>> It seems the PERF_SAMPLE_UEVENT sample can be injected to any event.  So
>> why the PERF_COUNT_SW_UEVENT is needed?  At least one can use the
>> SW_DUMMY event for that purpose.
>> 
>> Also I think it'd be better to be a record type (PERF_RECORD_XXX)
>> instead of a sample flag (PERF_SAMPLE_XXX).  In perf tools, we already
>> use perf_user_event_type for synthesized userspace events.  This way it
>> can avoid unnecessary sample processing for userspace events.
>> 
>> For contents, I prefer to give complete control to users - kernel
>> doesn't need to care about it other than its size.  If one just wants to
>> use strings only, she can write them directly.  If others want to mix
>> different types of data, they might need to define a data format for
>> their use.
>
> It would also be nice to add support for this to tools/perf/ (so 
> that 'trace' displays such entries in a perf.data), with a 
> minimum testcase for 'perf test' as well.
>
> Perhaps also add a small sub-utility to inject such events from 
> the command line, such as:
>
>   trace user-event "this is a test message"
>
> ('trace' is a shortcut command for 'perf trace'.)
>
> It would have a usecase straight away: perf could be used to 
> easily trace script execution for example.

It looks like what trace-marker in ftrace does..  We might connect
output of the trace marker into a perf event somehow.

Thanks,
Namhyung


>
> For that probably another mode of user event generation would be 
> needed as well: a process that has no access to any perf fds 
> should still be able to generate user events, if the 
> profiling/tracing context has permitted that. In this case we'd 
> inject the event either into the first, or all currently active 
> events (but only once per output buffer, or so).
>
> Thanks,
>
> 	Ingo


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] perf: Userspace event
  2014-09-25 12:45     ` Pawel Moll
@ 2014-09-26  6:23       ` Namhyung Kim
  0 siblings, 0 replies; 18+ messages in thread
From: Namhyung Kim @ 2014-09-26  6:23 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	linux-kernel, linux-api

On Thu, 25 Sep 2014 13:45:05 +0100, Pawel Moll wrote:
> On Wed, 2014-09-24 at 07:07 +0100, Namhyung Kim wrote:
>> On Tue, 23 Sep 2014 18:03:07 +0100, Pawel Moll wrote:
>> > This patch adds a new PERF_COUNT_SW_UEVENT software event
>> > and a related PERF_SAMPLE_UEVENT sample. User can now
>> > write to the the perf file descriptor, injecting such
>> > event in the perf buffer.
>> 
>> It seems the PERF_SAMPLE_UEVENT sample can be injected to any event.  So
>> why the PERF_COUNT_SW_UEVENT is needed?  At least one can use the
>> SW_DUMMY event for that purpose.
>
> You're right. I needed a different SW type in one of my early
> prototypes, but it's not the case any more. Consider it gone.

Okay.


>
>> Also I think it'd be better to be a record type (PERF_RECORD_XXX)
>> instead of a sample flag (PERF_SAMPLE_XXX).  In perf tools, we already
>> use perf_user_event_type for synthesized userspace events.  This way it
>> can avoid unnecessary sample processing for userspace events.
>
> Fine with me. If no one objects, I'm more than happy to use
> PERF_RECORD_UEVENT = 11 for it.
>
>> For contents, I prefer to give complete control to users - kernel
>> doesn't need to care about it other than its size.  If one just wants to
>> use strings only, she can write them directly.  If others want to mix
>> different types of data, they might need to define a data format for
>> their use.
>
> Are you saying to drop even the "type 0 means zero-terminated string"
> definition, even if everything else is up to the user? I quite like that
> idea, especially combined with write()ing to the perf_fd (it is very
> much like trace_marker then, which is beautiful in its simplicity), but
> the feelings are not that strong to fight a war over it.

:)


Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] perf: Add sampling of the raw monotonic clock
  2014-09-26  6:16       ` Namhyung Kim
@ 2014-09-26 10:58         ` Pawel Moll
  2014-09-26 14:38           ` Namhyung Kim
  0 siblings, 1 reply; 18+ messages in thread
From: Pawel Moll @ 2014-09-26 10:58 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	linux-kernel, linux-api


On Fri, 2014-09-26 at 07:16 +0100, Namhyung Kim wrote:
> > It would be doable, I guess, but what
> > if someone *wants* to have sched clock as the timestamps source (because
> > it's cheap) but still be able to correlate them with userspace? In this
> > case two separate timestamps are required to do the approximation.
> 
> But by collecting two timestamps, you'll loose the win of the first
> timestamp, no?

But I can ask for both timestamps only being collected on "low
bandwidth" events, in particular context switches and/or periodic (eg.
10ms hrtimer) software events.

Then I have loads of normal normal samples, timestamped with sched clock
only, and every now and then one with both timestamps which then I can
use for time correlation. The whole point is that the frequency of such
"synchronisation" event can be much (much!) lower than of the normal
samples, but it still allows pretty good approximation (I was getting
accuracy of ~1 microsecond and better with sched_switch trace event
marked with additional raw monotonic timestamp).

Pawel

PS. Have you sent a couple of the messages via some kind of gmane's
proxy? All the mail addresses got rather messed up...


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] perf: Userspace event
  2014-09-26  6:21       ` Namhyung Kim
@ 2014-09-26 10:59         ` Pawel Moll
  2014-09-26 11:24           ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Pawel Moll @ 2014-09-26 10:59 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	linux-kernel, linux-api

On Fri, 2014-09-26 at 07:21 +0100, Namhyung Kim wrote:
> It looks like what trace-marker in ftrace does..  We might connect
> output of the trace marker into a perf event somehow.

I can probably trace_marker's write handler do the same as the new
prctl() would do. But this means that we really want the pre-defined
"zero terminated string" type (0). Otherwise, what type would be
assigned to a record originating from it?

Pawel


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/2] perf: Userspace event
  2014-09-26 10:59         ` Pawel Moll
@ 2014-09-26 11:24           ` Ingo Molnar
  0 siblings, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2014-09-26 11:24 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Namhyung Kim, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	John Stultz, linux-kernel, linux-api


* Pawel Moll <pawel.moll@arm.com> wrote:

> On Fri, 2014-09-26 at 07:21 +0100, Namhyung Kim wrote:
>
> > It looks like what trace-marker in ftrace does..  We might 
> > connect output of the trace marker into a perf event somehow.
> 
> I can probably trace_marker's write handler do the same as the 
> new prctl() would do. [...]

Please keep this new facility separate, so that !ftrace kernels 
that have perf events enabled still have this facility, etc.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] perf: Add sampling of the raw monotonic clock
  2014-09-26 10:58         ` Pawel Moll
@ 2014-09-26 14:38           ` Namhyung Kim
  2014-09-26 15:05             ` Pawel Moll
  0 siblings, 1 reply; 18+ messages in thread
From: Namhyung Kim @ 2014-09-26 14:38 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	linux-kernel, linux-api

2014-09-26 (금), 11:58 +0100, Pawel Moll:
> On Fri, 2014-09-26 at 07:16 +0100, Namhyung Kim wrote:
> > > It would be doable, I guess, but what
> > > if someone *wants* to have sched clock as the timestamps source (because
> > > it's cheap) but still be able to correlate them with userspace? In this
> > > case two separate timestamps are required to do the approximation.
> > 
> > But by collecting two timestamps, you'll loose the win of the first
> > timestamp, no?
> 
> But I can ask for both timestamps only being collected on "low
> bandwidth" events, in particular context switches and/or periodic (eg.
> 10ms hrtimer) software events.
> 
> Then I have loads of normal normal samples, timestamped with sched clock
> only, and every now and then one with both timestamps which then I can
> use for time correlation. The whole point is that the frequency of such
> "synchronisation" event can be much (much!) lower than of the normal
> samples, but it still allows pretty good approximation (I was getting
> accuracy of ~1 microsecond and better with sched_switch trace event
> marked with additional raw monotonic timestamp).

Okay.  But in that case wouldn't it be enough to use just a single
timestamp for each event - sched_clock for cpu-cycles and monotonic raw
for sched_switch?


> 
> Pawel
> 
> PS. Have you sent a couple of the messages via some kind of gmane's
> proxy? All the mail addresses got rather messed up...
> 

Yes, I use gnus to communicate with lkml, but it seems it messed up mail
addresses if it contains some list (like linux-api ?).  Sorry about
that.

Thanks,
Namhyung



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] perf: Add sampling of the raw monotonic clock
  2014-09-26 14:38           ` Namhyung Kim
@ 2014-09-26 15:05             ` Pawel Moll
  2014-09-26 19:25               ` David Ahern
  0 siblings, 1 reply; 18+ messages in thread
From: Pawel Moll @ 2014-09-26 15:05 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	linux-kernel, linux-api

On Fri, 2014-09-26 at 15:38 +0100, Namhyung Kim wrote:
> > Then I have loads of normal normal samples, timestamped with sched clock
> > only, and every now and then one with both timestamps which then I can
> > use for time correlation. The whole point is that the frequency of such
> > "synchronisation" event can be much (much!) lower than of the normal
> > samples, but it still allows pretty good approximation (I was getting
> > accuracy of ~1 microsecond and better with sched_switch trace event
> > marked with additional raw monotonic timestamp).
> 
> Okay.  But in that case wouldn't it be enough to use just a single
> timestamp for each event - sched_clock for cpu-cycles and monotonic raw
> for sched_switch?

To do the correlation you need both timestamps to be "taken"
simultaneously:

        perf event     user event
       -----O--------------+-------------O------> t_mono
            :              |             :
            :              V             :
       -----O----------------------------O------> t_perf

Of course it's not possible get both values literally at the same time,
but placing them in a atomic context a couple of instructions from each
other still gives pretty good results. The larger this distance is, the
lower the accuracy will be. I must admit I haven't done such
experiments, but let me remind that I in my test I was getting results
in the range of 1000ns, with a single cycle of a 2GHz taking 0.5ns, so
moving the t_mono/t_perf value sampling further aside will reduce it
significantly...

Pawel

Pawel


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] perf: Add sampling of the raw monotonic clock
  2014-09-26 15:05             ` Pawel Moll
@ 2014-09-26 19:25               ` David Ahern
  2014-09-29 14:47                 ` Pawel Moll
  0 siblings, 1 reply; 18+ messages in thread
From: David Ahern @ 2014-09-26 19:25 UTC (permalink / raw)
  To: Pawel Moll, Namhyung Kim
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	linux-kernel, linux-api

On 9/26/14, 9:05 AM, Pawel Moll wrote:
> To do the correlation you need both timestamps to be "taken"
> simultaneously:
>
>          perf event     user event
>         -----O--------------+-------------O------> t_mono
>              :              |             :
>              :              V             :
>         -----O----------------------------O------> t_perf
>
> Of course it's not possible get both values literally at the same time,
> but placing them in a atomic context a couple of instructions from each
> other still gives pretty good results. The larger this distance is, the

An early patchset on this topic added the realtime clock as an event and 
an ioctl was used to push a sample into the event stream. In that case 
you have wall clock and perf-clock samples taken in the same kernel 
context and about as close together as you can get.

https://lkml.org/lkml/2011/2/27/158
https://lkml.org/lkml/2011/2/27/159

David

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] perf: Add sampling of the raw monotonic clock
  2014-09-26 19:25               ` David Ahern
@ 2014-09-29 14:47                 ` Pawel Moll
  0 siblings, 0 replies; 18+ messages in thread
From: Pawel Moll @ 2014-09-29 14:47 UTC (permalink / raw)
  To: David Ahern
  Cc: Namhyung Kim, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	John Stultz, linux-kernel, linux-api

On Fri, 2014-09-26 at 20:25 +0100, David Ahern wrote:
> On 9/26/14, 9:05 AM, Pawel Moll wrote:
> > To do the correlation you need both timestamps to be "taken"
> > simultaneously:
> >
> >          perf event     user event
> >         -----O--------------+-------------O------> t_mono
> >              :              |             :
> >              :              V             :
> >         -----O----------------------------O------> t_perf
> >
> > Of course it's not possible get both values literally at the same time,
> > but placing them in a atomic context a couple of instructions from each
> > other still gives pretty good results. The larger this distance is, the
> 
> An early patchset on this topic added the realtime clock as an event and 
> an ioctl was used to push a sample into the event stream. 

Yeah, I remember. If I remember correctly correctly the pushback was on
a custom event type, right?

Generally speaking I don't mind any solution that we'll get us to the
place both you and I want to be (just being able to time stamp some
performance data in userspace, how difficult can this be! ;-) but I like
the flexibility of an extra sample - one can pick and mix events and
samples at one's leisure.

> In that case 
> you have wall clock and perf-clock samples taken in the same kernel 
> context and about as close together as you can get.

Yep, that's what I was saying - we can't quite get two timestamps at the
*same*, but getting them within a single atomic block of instructions
gives reasonable accuracy.

Thanks!

Pawel


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2014-09-29 14:47 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-23 17:03 [PATCH v2 0/2] perf: User/kernel time correlation and event generation Pawel Moll
2014-09-23 17:03 ` [PATCH v2 1/2] perf: Add sampling of the raw monotonic clock Pawel Moll
2014-09-24  5:41   ` Namhyung Kim
2014-09-25 10:49     ` Pawel Moll
2014-09-26  6:16       ` Namhyung Kim
2014-09-26 10:58         ` Pawel Moll
2014-09-26 14:38           ` Namhyung Kim
2014-09-26 15:05             ` Pawel Moll
2014-09-26 19:25               ` David Ahern
2014-09-29 14:47                 ` Pawel Moll
2014-09-23 17:03 ` [PATCH v2 2/2] perf: Userspace event Pawel Moll
2014-09-24  6:07   ` Namhyung Kim
2014-09-24  7:20     ` Ingo Molnar
2014-09-26  6:21       ` Namhyung Kim
2014-09-26 10:59         ` Pawel Moll
2014-09-26 11:24           ` Ingo Molnar
2014-09-25 12:45     ` Pawel Moll
2014-09-26  6:23       ` Namhyung Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).