* [PATCH V2 00/11] perf intel-pt: Add perf event clocks to better support VM tracing
@ 2022-02-14 11:09 Adrian Hunter
  2022-02-14 11:09 ` [PATCH V2 01/11] perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset Adrian Hunter
                   ` (11 more replies)
  0 siblings, 12 replies; 46+ messages in thread
From: Adrian Hunter @ 2022-02-14 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

Hi

These patches add two new perf event clocks, based on TSC, for use with VMs.

The first patch is a minor fix; the next two patches add the two new
clocks.  The remaining patches add minimal tools support and are based on
top of the Intel PT Event Trace tools' patches.

Future work includes adding the ability to use perf inject to inject perf
events from a VM guest perf.data file into a VM host perf.data file.


Changes in V2:
      perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset
	  Add __sched_clock_offset unconditionally

      perf/x86: Add support for TSC as a perf event clock
	  Use an attribute bit 'ns_clockid' to identify non-standard clockids

      perf/x86: Add support for TSC in nanoseconds as a perf event clock
	  Do not affect use of __sched_clock_offset
	  Adjust to use 'ns_clockid'

      perf tools: Add new perf clock IDs
      perf tools: Add API probes for new clock IDs
      perf tools: Add new clock IDs to "perf time to TSC" test
      perf tools: Add perf_read_tsc_conv_for_clockid()
      perf intel-pt: Add support for new clock IDs
      perf intel-pt: Use CLOCK_PERF_HW_CLOCK_NS by default
      perf intel-pt: Add config variables for timing parameters
      perf intel-pt: Add documentation for new clock IDs
	  Adjust to use 'ns_clockid'


Adrian Hunter (11):
      perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset
      perf/x86: Add support for TSC as a perf event clock
      perf/x86: Add support for TSC in nanoseconds as a perf event clock
      perf tools: Add new perf clock IDs
      perf tools: Add API probes for new clock IDs
      perf tools: Add new clock IDs to "perf time to TSC" test
      perf tools: Add perf_read_tsc_conv_for_clockid()
      perf intel-pt: Add support for new clock IDs
      perf intel-pt: Use CLOCK_PERF_HW_CLOCK_NS by default
      perf intel-pt: Add config variables for timing parameters
      perf intel-pt: Add documentation for new clock IDs

 arch/x86/events/core.c                     | 39 ++++++++++--
 arch/x86/include/asm/perf_event.h          |  5 ++
 arch/x86/kernel/tsc.c                      |  2 +-
 include/uapi/linux/perf_event.h            | 18 +++++-
 kernel/events/core.c                       | 63 +++++++++++++-------
 tools/include/uapi/linux/perf_event.h      | 18 +++++-
 tools/perf/Documentation/perf-config.txt   | 18 ++++++
 tools/perf/Documentation/perf-intel-pt.txt | 47 +++++++++++++++
 tools/perf/Documentation/perf-record.txt   |  9 ++-
 tools/perf/arch/x86/util/intel-pt.c        | 95 ++++++++++++++++++++++++++++--
 tools/perf/builtin-record.c                |  2 +-
 tools/perf/tests/perf-time-to-tsc.c        | 42 ++++++++++---
 tools/perf/util/clockid.c                  | 14 +++++
 tools/perf/util/evsel.c                    |  1 +
 tools/perf/util/intel-pt.c                 | 27 +++++++--
 tools/perf/util/intel-pt.h                 |  7 ++-
 tools/perf/util/perf_api_probe.c           | 24 ++++++++
 tools/perf/util/perf_api_probe.h           |  2 +
 tools/perf/util/perf_event_attr_fprintf.c  |  1 +
 tools/perf/util/record.h                   |  2 +
 tools/perf/util/tsc.c                      | 58 ++++++++++++++++++
 tools/perf/util/tsc.h                      |  2 +
 22 files changed, 444 insertions(+), 52 deletions(-)


Regards
Adrian

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH V2 01/11] perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset
  2022-02-14 11:09 [PATCH V2 00/11] perf intel-pt: Add perf event clocks to better support VM tracing Adrian Hunter
@ 2022-02-14 11:09 ` Adrian Hunter
  2022-02-14 11:09 ` [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock Adrian Hunter
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Adrian Hunter @ 2022-02-14 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

native_sched_clock_from_tsc() is used to produce a time value that can be
consistent with perf_clock().  Consequently, it should be adjusted by
__sched_clock_offset, the same as perf_clock() is.

Fixes: 698eff6355f735 ("sched/clock, x86/perf: Fix perf test tsc")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/kernel/tsc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a698196377be..d9fe277c2471 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -242,7 +242,7 @@ u64 native_sched_clock(void)
  */
 u64 native_sched_clock_from_tsc(u64 tsc)
 {
-	return cycles_2_ns(tsc);
+	return cycles_2_ns(tsc) + __sched_clock_offset;
 }
 
 /* We need to define a real function for sched_clock, to override the
-- 
2.25.1



* [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock
  2022-02-14 11:09 [PATCH V2 00/11] perf intel-pt: Add perf event clocks to better support VM tracing Adrian Hunter
  2022-02-14 11:09 ` [PATCH V2 01/11] perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset Adrian Hunter
@ 2022-02-14 11:09 ` Adrian Hunter
  2022-03-04 12:30   ` Peter Zijlstra
                     ` (2 more replies)
  2022-02-14 11:09 ` [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds " Adrian Hunter
                   ` (9 subsequent siblings)
  11 siblings, 3 replies; 46+ messages in thread
From: Adrian Hunter @ 2022-02-14 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

Currently, using Intel PT to trace a VM guest is limited to kernel space
because decoding requires side band events such as MMAP and CONTEXT_SWITCH.
While these events can be collected for the host, there is not yet a way
to do that for a guest. One approach would be to collect them inside the
guest, but that would require the ability to synchronize with host
timestamps.

The motivation for this patch is to provide a clock that can be used within
a VM guest, and that correlates to a VM host clock. In the case of TSC, if
the hypervisor leaves rdtsc alone, the TSC value will be subject only to
the VMCS TSC Offset and Scaling. Adjusting for that would make it possible
to inject events from a guest perf.data file into a host perf.data file,
which in turn would make it possible to collect VM guest side band events
for Intel PT decoding.

There are other potential benefits of TSC as a perf event clock:
	- ability to work directly with TSC
	- ability to inject non-Intel-PT-related events from a guest

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/events/core.c            | 16 +++++++++
 arch/x86/include/asm/perf_event.h |  3 ++
 include/uapi/linux/perf_event.h   | 12 ++++++-
 kernel/events/core.c              | 57 +++++++++++++++++++------------
 4 files changed, 65 insertions(+), 23 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e686c5e0537b..51d5345de30a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2728,6 +2728,17 @@ void arch_perf_update_userpage(struct perf_event *event,
 		!!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
 	userpg->pmc_width = x86_pmu.cntval_bits;
 
+	if (event->attr.use_clockid &&
+	    event->attr.ns_clockid &&
+	    event->attr.clockid == CLOCK_PERF_HW_CLOCK) {
+		userpg->cap_user_time_zero = 1;
+		userpg->time_mult = 1;
+		userpg->time_shift = 0;
+		userpg->time_offset = 0;
+		userpg->time_zero = 0;
+		return;
+	}
+
 	if (!using_native_sched_clock() || !sched_clock_stable())
 		return;
 
@@ -2980,6 +2991,11 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
 	return misc;
 }
 
+u64 perf_hw_clock(void)
+{
+	return rdtsc_ordered();
+}
+
 void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 {
 	cap->version		= x86_pmu.version;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 58d9e4b1fa0a..5288ea1ae2ba 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -451,6 +451,9 @@ extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
 extern unsigned long perf_misc_flags(struct pt_regs *regs);
 #define perf_misc_flags(regs)	perf_misc_flags(regs)
 
+extern u64 perf_hw_clock(void);
+#define perf_hw_clock		perf_hw_clock
+
 #include <asm/stacktrace.h>
 
 /*
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 82858b697c05..e8617efd552b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -290,6 +290,15 @@ enum {
 	PERF_TXN_ABORT_SHIFT = 32,
 };
 
+/*
+ * If supported, clockid value to select an architecture dependent hardware
+ * clock. Note this means the unit of time is ticks not nanoseconds.
+ * Requires ns_clockid to be set in addition to use_clockid.
+ * On x86, this clock is provided by the rdtsc instruction, and is not
+ * paravirtualized.
+ */
+#define CLOCK_PERF_HW_CLOCK		0x10000000
+
 /*
  * The format of the data returned by read() on a perf event fd,
  * as specified by attr.read_format:
@@ -409,7 +418,8 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				ns_clockid     :  1, /* non-standard clockid */
+				__reserved_1   : 25;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 57249f37c37d..15dee265a5b9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12008,35 +12008,48 @@ static void mutex_lock_double(struct mutex *a, struct mutex *b)
 	mutex_lock_nested(b, SINGLE_DEPTH_NESTING);
 }
 
-static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
+static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id, bool ns_clockid)
 {
 	bool nmi_safe = false;
 
-	switch (clk_id) {
-	case CLOCK_MONOTONIC:
-		event->clock = &ktime_get_mono_fast_ns;
-		nmi_safe = true;
-		break;
+	if (ns_clockid) {
+		switch (clk_id) {
+#ifdef perf_hw_clock
+		case CLOCK_PERF_HW_CLOCK:
+			event->clock = &perf_hw_clock;
+			nmi_safe = true;
+			break;
+#endif
+		default:
+			return -EINVAL;
+		}
+	} else {
+		switch (clk_id) {
+		case CLOCK_MONOTONIC:
+			event->clock = &ktime_get_mono_fast_ns;
+			nmi_safe = true;
+			break;
 
-	case CLOCK_MONOTONIC_RAW:
-		event->clock = &ktime_get_raw_fast_ns;
-		nmi_safe = true;
-		break;
+		case CLOCK_MONOTONIC_RAW:
+			event->clock = &ktime_get_raw_fast_ns;
+			nmi_safe = true;
+			break;
 
-	case CLOCK_REALTIME:
-		event->clock = &ktime_get_real_ns;
-		break;
+		case CLOCK_REALTIME:
+			event->clock = &ktime_get_real_ns;
+			break;
 
-	case CLOCK_BOOTTIME:
-		event->clock = &ktime_get_boottime_ns;
-		break;
+		case CLOCK_BOOTTIME:
+			event->clock = &ktime_get_boottime_ns;
+			break;
 
-	case CLOCK_TAI:
-		event->clock = &ktime_get_clocktai_ns;
-		break;
+		case CLOCK_TAI:
+			event->clock = &ktime_get_clocktai_ns;
+			break;
 
-	default:
-		return -EINVAL;
+		default:
+			return -EINVAL;
+		}
 	}
 
 	if (!nmi_safe && !(event->pmu->capabilities & PERF_PMU_CAP_NO_NMI))
@@ -12245,7 +12258,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	pmu = event->pmu;
 
 	if (attr.use_clockid) {
-		err = perf_event_set_clock(event, attr.clockid);
+		err = perf_event_set_clock(event, attr.clockid, attr.ns_clockid);
 		if (err)
 			goto err_alloc;
 	}
-- 
2.25.1



* [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-02-14 11:09 [PATCH V2 00/11] perf intel-pt: Add perf event clocks to better support VM tracing Adrian Hunter
  2022-02-14 11:09 ` [PATCH V2 01/11] perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset Adrian Hunter
  2022-02-14 11:09 ` [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock Adrian Hunter
@ 2022-02-14 11:09 ` Adrian Hunter
  2022-03-04 13:41   ` Peter Zijlstra
  2022-02-14 11:09 ` [PATCH V2 04/11] perf tools: Add new perf clock IDs Adrian Hunter
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Adrian Hunter @ 2022-02-14 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

Currently, when Intel PT is used within a VM guest, it is not possible to
make use of TSC because the perf clock is subject to paravirtualization.

If the hypervisor leaves rdtsc alone, the TSC value will be subject only to
the VMCS TSC Offset and Scaling, the same as the TSC packet from Intel PT.
The new clock is based on rdtsc and not subject to paravirtualization.

Hence it would be possible to use this new clock for Intel PT decoding
within a VM guest.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/events/core.c            | 41 ++++++++++++++++++++-----------
 arch/x86/include/asm/perf_event.h |  2 ++
 include/uapi/linux/perf_event.h   |  6 +++++
 kernel/events/core.c              |  6 +++++
 4 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 51d5345de30a..905975a7d475 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -41,6 +41,7 @@
 #include <asm/desc.h>
 #include <asm/ldt.h>
 #include <asm/unwind.h>
+#include <asm/tsc.h>
 
 #include "perf_event.h"
 
@@ -2728,18 +2729,26 @@ void arch_perf_update_userpage(struct perf_event *event,
 		!!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
 	userpg->pmc_width = x86_pmu.cntval_bits;
 
-	if (event->attr.use_clockid &&
-	    event->attr.ns_clockid &&
-	    event->attr.clockid == CLOCK_PERF_HW_CLOCK) {
-		userpg->cap_user_time_zero = 1;
-		userpg->time_mult = 1;
-		userpg->time_shift = 0;
-		userpg->time_offset = 0;
-		userpg->time_zero = 0;
-		return;
+	if (event->attr.use_clockid && event->attr.ns_clockid) {
+		if (event->attr.clockid == CLOCK_PERF_HW_CLOCK) {
+			userpg->cap_user_time_zero = 1;
+			userpg->time_mult = 1;
+			userpg->time_shift = 0;
+			userpg->time_offset = 0;
+			userpg->time_zero = 0;
+			return;
+		}
+		if (event->attr.clockid == CLOCK_PERF_HW_CLOCK_NS)
+			userpg->cap_user_time_zero = 1;
+	}
+
+	if (using_native_sched_clock() && sched_clock_stable()) {
+		userpg->cap_user_time = 1;
+		if (!event->attr.use_clockid)
+			userpg->cap_user_time_zero = 1;
 	}
 
-	if (!using_native_sched_clock() || !sched_clock_stable())
+	if (!userpg->cap_user_time && !userpg->cap_user_time_zero)
 		return;
 
 	cyc2ns_read_begin(&data);
@@ -2750,19 +2759,16 @@ void arch_perf_update_userpage(struct perf_event *event,
 	 * Internal timekeeping for enabled/running/stopped times
 	 * is always in the local_clock domain.
 	 */
-	userpg->cap_user_time = 1;
 	userpg->time_mult = data.cyc2ns_mul;
 	userpg->time_shift = data.cyc2ns_shift;
 	userpg->time_offset = offset - now;
 
 	/*
 	 * cap_user_time_zero doesn't make sense when we're using a different
-	 * time base for the records.
+	 * time base for the records, except for CLOCK_PERF_HW_CLOCK_NS.
 	 */
-	if (!event->attr.use_clockid) {
-		userpg->cap_user_time_zero = 1;
+	if (userpg->cap_user_time_zero)
 		userpg->time_zero = offset;
-	}
 
 	cyc2ns_read_end();
 }
@@ -2996,6 +3002,11 @@ u64 perf_hw_clock(void)
 	return rdtsc_ordered();
 }
 
+u64 perf_hw_clock_ns(void)
+{
+	return native_sched_clock_from_tsc(perf_hw_clock());
+}
+
 void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 {
 	cap->version		= x86_pmu.version;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 5288ea1ae2ba..46cbca90cdd1 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -453,6 +453,8 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
 
 extern u64 perf_hw_clock(void);
 #define perf_hw_clock		perf_hw_clock
+extern u64 perf_hw_clock_ns(void);
+#define perf_hw_clock_ns	perf_hw_clock_ns
 
 #include <asm/stacktrace.h>
 
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index e8617efd552b..0edc005f8ddf 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -298,6 +298,12 @@ enum {
  * paravirtualized.
  */
 #define CLOCK_PERF_HW_CLOCK		0x10000000
+/*
+ * Same as CLOCK_PERF_HW_CLOCK but in nanoseconds. Note support of
+ * CLOCK_PERF_HW_CLOCK_NS does not necessarily imply support of
+ * CLOCK_PERF_HW_CLOCK or vice versa.
+ */
+#define CLOCK_PERF_HW_CLOCK_NS	0x10000001
 
 /*
  * The format of the data returned by read() on a perf event fd,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 15dee265a5b9..65e70fb669fd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12019,6 +12019,12 @@ static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id, bool
 			event->clock = &perf_hw_clock;
 			nmi_safe = true;
 			break;
+#endif
+#ifdef perf_hw_clock_ns
+		case CLOCK_PERF_HW_CLOCK_NS:
+			event->clock = &perf_hw_clock_ns;
+			nmi_safe = true;
+			break;
 #endif
 		default:
 			return -EINVAL;
-- 
2.25.1



* [PATCH V2 04/11] perf tools: Add new perf clock IDs
  2022-02-14 11:09 [PATCH V2 00/11] perf intel-pt: Add perf event clocks to better support VM tracing Adrian Hunter
                   ` (2 preceding siblings ...)
  2022-02-14 11:09 ` [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds " Adrian Hunter
@ 2022-02-14 11:09 ` Adrian Hunter
  2022-02-14 11:09 ` [PATCH V2 05/11] perf tools: Add API probes for new " Adrian Hunter
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Adrian Hunter @ 2022-02-14 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

Add support for new clock IDs CLOCK_PERF_HW_CLOCK and
CLOCK_PERF_HW_CLOCK_NS.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/include/uapi/linux/perf_event.h     | 18 +++++++++++++++++-
 tools/perf/Documentation/perf-record.txt  |  9 ++++++++-
 tools/perf/builtin-record.c               |  2 +-
 tools/perf/util/clockid.c                 | 13 +++++++++++++
 tools/perf/util/evsel.c                   |  1 +
 tools/perf/util/perf_event_attr_fprintf.c |  1 +
 tools/perf/util/record.h                  |  1 +
 7 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 1b65042ab1db..7b3455dfda23 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -290,6 +290,21 @@ enum {
 	PERF_TXN_ABORT_SHIFT = 32,
 };
 
+/*
+ * If supported, clockid value to select an architecture dependent hardware
+ * clock. Note this means the unit of time is ticks not nanoseconds.
+ * Requires ns_clockid to be set in addition to use_clockid.
+ * On x86, this clock is provided by the rdtsc instruction, and is not
+ * paravirtualized.
+ */
+#define CLOCK_PERF_HW_CLOCK		0x10000000
+/*
+ * Same as CLOCK_PERF_HW_CLOCK but in nanoseconds. Note support of
+ * CLOCK_PERF_HW_CLOCK_NS does not necessarily imply support of
+ * CLOCK_PERF_HW_CLOCK or vice versa.
+ */
+#define CLOCK_PERF_HW_CLOCK_NS	0x10000001
+
 /*
  * The format of the data returned by read() on a perf event fd,
  * as specified by attr.read_format:
@@ -409,7 +424,8 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				ns_clockid     :  1, /* non-standard clockid */
+				__reserved_1   : 25;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 9ccc75935bc5..a5ef4813093a 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -444,7 +444,14 @@ Record running and enabled time for read events (:S)
 Sets the clock id to use for the various time fields in the perf_event_type
 records. See clock_gettime(). In particular CLOCK_MONOTONIC and
 CLOCK_MONOTONIC_RAW are supported, some events might also allow
-CLOCK_BOOTTIME, CLOCK_REALTIME and CLOCK_TAI.
+CLOCK_BOOTTIME, CLOCK_REALTIME and CLOCK_TAI. In addition, the kernel might
+support CLOCK_PERF_HW_CLOCK to select an architecture dependent hardware
+clock, for which the unit of time is ticks not nanoseconds. On x86,
+CLOCK_PERF_HW_CLOCK is provided by the rdtsc instruction, and is not
+paravirtualized. There is also CLOCK_PERF_HW_CLOCK_NS which is the same as
+CLOCK_PERF_HW_CLOCK, but converted to nanoseconds. Note support of
+CLOCK_PERF_HW_CLOCK_NS does not necessarily imply support of
+CLOCK_PERF_HW_CLOCK or vice versa.
 
 -S::
 --snapshot::
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index bb716c953d02..febb51bac6ac 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1553,7 +1553,7 @@ static int record__init_clock(struct record *rec)
 	struct timeval ref_tod;
 	u64 ref;
 
-	if (!rec->opts.use_clockid)
+	if (!rec->opts.use_clockid || rec->opts.ns_clockid)
 		return 0;
 
 	if (rec->opts.use_clockid && rec->opts.clockid_res_ns)
diff --git a/tools/perf/util/clockid.c b/tools/perf/util/clockid.c
index 74365a5d99c1..2fcffee690e1 100644
--- a/tools/perf/util/clockid.c
+++ b/tools/perf/util/clockid.c
@@ -12,11 +12,15 @@
 struct clockid_map {
 	const char *name;
 	int clockid;
+	bool non_standard;
 };
 
 #define CLOCKID_MAP(n, c)	\
 	{ .name = n, .clockid = (c), }
 
+#define CLOCKID_MAP_NS(n, c)	\
+	{ .name = n, .clockid = (c), .non_standard = true, }
+
 #define CLOCKID_END	{ .name = NULL, }
 
 
@@ -49,6 +53,10 @@ static const struct clockid_map clockids[] = {
 	CLOCKID_MAP("real", CLOCK_REALTIME),
 	CLOCKID_MAP("boot", CLOCK_BOOTTIME),
 
+	/* non-standard clocks */
+	CLOCKID_MAP_NS("perf_hw_clock", CLOCK_PERF_HW_CLOCK),
+	CLOCKID_MAP_NS("perf_hw_clock_ns", CLOCK_PERF_HW_CLOCK_NS),
+
 	CLOCKID_END,
 };
 
@@ -97,6 +105,11 @@ int parse_clockid(const struct option *opt, const char *str, int unset)
 	for (cm = clockids; cm->name; cm++) {
 		if (!strcasecmp(str, cm->name)) {
 			opts->clockid = cm->clockid;
+			if (cm->non_standard) {
+				opts->ns_clockid = true;
+				opts->clockid_res_ns = 0;
+				return 0;
+			}
 			return get_clockid_res(opts->clockid,
 					       &opts->clockid_res_ns);
 		}
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 22d3267ce294..be1d30490a43 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1294,6 +1294,7 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
 	clockid = opts->clockid;
 	if (opts->use_clockid) {
 		attr->use_clockid = 1;
+		attr->ns_clockid = opts->ns_clockid;
 		attr->clockid = opts->clockid;
 	}
 
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 98af3fa4ea35..398f05f2e5b3 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -128,6 +128,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
 	PRINT_ATTRf(mmap2, p_unsigned);
 	PRINT_ATTRf(comm_exec, p_unsigned);
 	PRINT_ATTRf(use_clockid, p_unsigned);
+	PRINT_ATTRf(ns_clockid, p_unsigned);
 	PRINT_ATTRf(context_switch, p_unsigned);
 	PRINT_ATTRf(write_backward, p_unsigned);
 	PRINT_ATTRf(namespaces, p_unsigned);
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index ef6c2715fdd9..1dbbf6b314dc 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -67,6 +67,7 @@ struct record_opts {
 	bool	      sample_transaction;
 	int	      initial_delay;
 	bool	      use_clockid;
+	bool	      ns_clockid;
 	clockid_t     clockid;
 	u64	      clockid_res_ns;
 	int	      nr_cblocks;
-- 
2.25.1



* [PATCH V2 05/11] perf tools: Add API probes for new clock IDs
  2022-02-14 11:09 [PATCH V2 00/11] perf intel-pt: Add perf event clocks to better support VM tracing Adrian Hunter
                   ` (3 preceding siblings ...)
  2022-02-14 11:09 ` [PATCH V2 04/11] perf tools: Add new perf clock IDs Adrian Hunter
@ 2022-02-14 11:09 ` Adrian Hunter
  2022-02-14 11:09 ` [PATCH V2 06/11] perf tools: Add new clock IDs to "perf time to TSC" test Adrian Hunter
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Adrian Hunter @ 2022-02-14 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

Add the ability to check whether the kernel supports the new clock IDs
CLOCK_PERF_HW_CLOCK and CLOCK_PERF_HW_CLOCK_NS. They will be used in a
subsequent patch.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/util/perf_api_probe.c | 24 ++++++++++++++++++++++++
 tools/perf/util/perf_api_probe.h |  2 ++
 2 files changed, 26 insertions(+)

diff --git a/tools/perf/util/perf_api_probe.c b/tools/perf/util/perf_api_probe.c
index c28dd50bd571..33d3dd858ecc 100644
--- a/tools/perf/util/perf_api_probe.c
+++ b/tools/perf/util/perf_api_probe.c
@@ -109,6 +109,20 @@ static void perf_probe_cgroup(struct evsel *evsel)
 	evsel->core.attr.cgroup = 1;
 }
 
+static void perf_probe_hw_clock(struct evsel *evsel)
+{
+	evsel->core.attr.use_clockid = 1;
+	evsel->core.attr.ns_clockid = 1;
+	evsel->core.attr.clockid = CLOCK_PERF_HW_CLOCK;
+}
+
+static void perf_probe_hw_clock_ns(struct evsel *evsel)
+{
+	evsel->core.attr.use_clockid = 1;
+	evsel->core.attr.ns_clockid = 1;
+	evsel->core.attr.clockid = CLOCK_PERF_HW_CLOCK_NS;
+}
+
 bool perf_can_sample_identifier(void)
 {
 	return perf_probe_api(perf_probe_sample_identifier);
@@ -195,3 +209,13 @@ bool perf_can_record_cgroup(void)
 {
 	return perf_probe_api(perf_probe_cgroup);
 }
+
+bool perf_can_perf_clock_hw_clock(void)
+{
+	return perf_probe_api(perf_probe_hw_clock);
+}
+
+bool perf_can_perf_clock_hw_clock_ns(void)
+{
+	return perf_probe_api(perf_probe_hw_clock_ns);
+}
diff --git a/tools/perf/util/perf_api_probe.h b/tools/perf/util/perf_api_probe.h
index b104168efb15..5b30cbd260cf 100644
--- a/tools/perf/util/perf_api_probe.h
+++ b/tools/perf/util/perf_api_probe.h
@@ -13,5 +13,7 @@ bool perf_can_record_text_poke_events(void);
 bool perf_can_sample_identifier(void);
 bool perf_can_record_build_id(void);
 bool perf_can_record_cgroup(void);
+bool perf_can_perf_clock_hw_clock(void);
+bool perf_can_perf_clock_hw_clock_ns(void);
 
 #endif // __PERF_API_PROBE_H
-- 
2.25.1



* [PATCH V2 06/11] perf tools: Add new clock IDs to "perf time to TSC" test
  2022-02-14 11:09 [PATCH V2 00/11] perf intel-pt: Add perf event clocks to better support VM tracing Adrian Hunter
                   ` (4 preceding siblings ...)
  2022-02-14 11:09 ` [PATCH V2 05/11] perf tools: Add API probes for new " Adrian Hunter
@ 2022-02-14 11:09 ` Adrian Hunter
  2022-02-14 11:09 ` [PATCH V2 07/11] perf tools: Add perf_read_tsc_conv_for_clockid() Adrian Hunter
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Adrian Hunter @ 2022-02-14 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

The same "Convert perf time to TSC" test can be used with the new clock IDs
CLOCK_PERF_HW_CLOCK and CLOCK_PERF_HW_CLOCK_NS.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/tests/perf-time-to-tsc.c | 42 ++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 9 deletions(-)

diff --git a/tools/perf/tests/perf-time-to-tsc.c b/tools/perf/tests/perf-time-to-tsc.c
index d12d0ad81801..9b75c029bb9d 100644
--- a/tools/perf/tests/perf-time-to-tsc.c
+++ b/tools/perf/tests/perf-time-to-tsc.c
@@ -22,6 +22,7 @@
 #include "tests.h"
 #include "pmu.h"
 #include "pmu-hybrid.h"
+#include "perf_api_probe.h"
 
 /*
  * Except x86_64/i386 and Arm64, other archs don't support TSC in perf.  Just
@@ -47,15 +48,7 @@
 	}					\
 }
 
-/**
- * test__perf_time_to_tsc - test converting perf time to TSC.
- *
- * This function implements a test that checks that the conversion of perf time
- * to and from TSC is consistent with the order of events.  If the test passes
- * %0 is returned, otherwise %-1 is returned.  If TSC conversion is not
- * supported then then the test passes but " (not supported)" is printed.
- */
-static int test__perf_time_to_tsc(struct test_suite *test __maybe_unused, int subtest __maybe_unused)
+static int perf_time_to_tsc_test(bool use_clockid, bool ns_clockid, s32 clockid)
 {
 	struct record_opts opts = {
 		.mmap_pages	     = UINT_MAX,
@@ -104,6 +97,9 @@ static int test__perf_time_to_tsc(struct test_suite *test __maybe_unused, int su
 	evsel->core.attr.comm = 1;
 	evsel->core.attr.disabled = 1;
 	evsel->core.attr.enable_on_exec = 0;
+	evsel->core.attr.use_clockid = use_clockid;
+	evsel->core.attr.ns_clockid = ns_clockid;
+	evsel->core.attr.clockid = clockid;
 
 	/*
 	 * For hybrid "cycles:u", it creates two events.
@@ -200,4 +196,32 @@ static int test__perf_time_to_tsc(struct test_suite *test __maybe_unused, int su
 	return err;
 }
 
+/**
+ * test__perf_time_to_tsc - test converting perf time to TSC.
+ *
+ * This function implements a test that checks that the conversion of perf time
+ * to and from TSC is consistent with the order of events.  If the test passes
+ * %0 is returned, otherwise %-1 is returned.  If TSC conversion is not
+ * supported then the test passes but " (not supported)" is printed.
+ */
+static int test__perf_time_to_tsc(struct test_suite *test __maybe_unused,
+				  int subtest __maybe_unused)
+{
+	int err;
+
+	err = perf_time_to_tsc_test(false, false, 0);
+
+	if (!err && perf_can_perf_clock_hw_clock()) {
+		pr_debug("Testing CLOCK_PERF_HW_CLOCK\n");
+		err = perf_time_to_tsc_test(true, true, CLOCK_PERF_HW_CLOCK);
+	}
+
+	if (!err && perf_can_perf_clock_hw_clock_ns()) {
+		pr_debug("Testing CLOCK_PERF_HW_CLOCK_NS\n");
+		err = perf_time_to_tsc_test(true, true, CLOCK_PERF_HW_CLOCK_NS);
+	}
+
+	return err;
+}
+
 DEFINE_SUITE("Convert perf time to TSC", perf_time_to_tsc);
-- 
2.25.1



* [PATCH V2 07/11] perf tools: Add perf_read_tsc_conv_for_clockid()
From: Adrian Hunter @ 2022-02-14 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

Add a function to read TSC conversion information for a particular clock
ID. It will be used in a subsequent patch.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/util/tsc.c | 58 +++++++++++++++++++++++++++++++++++++++++++
 tools/perf/util/tsc.h |  2 ++
 2 files changed, 60 insertions(+)

diff --git a/tools/perf/util/tsc.c b/tools/perf/util/tsc.c
index f19791d46e99..0bcedaa5fa5d 100644
--- a/tools/perf/util/tsc.c
+++ b/tools/perf/util/tsc.c
@@ -3,6 +3,8 @@
 #include <inttypes.h>
 #include <string.h>
 
+#include <sys/mman.h>
+
 #include <linux/compiler.h>
 #include <linux/perf_event.h>
 #include <linux/stddef.h>
@@ -14,6 +16,9 @@
 #include "synthetic-events.h"
 #include "debug.h"
 #include "tsc.h"
+#include "cpumap.h"
+#include "perf-sys.h"
+#include <internal/lib.h> /* page_size */
 
 u64 perf_time_to_tsc(u64 ns, struct perf_tsc_conversion *tc)
 {
@@ -71,6 +76,59 @@ int perf_read_tsc_conversion(const struct perf_event_mmap_page *pc,
 	return 0;
 }
 
+static int perf_read_tsc_conv_attr_cpu(struct perf_event_attr *attr,
+				       struct perf_cpu cpu,
+				       struct perf_tsc_conversion *tc)
+{
+	size_t len = 2 * page_size;
+	int fd, err = -EINVAL;
+	void *addr;
+
+	fd = sys_perf_event_open(attr, 0, cpu.cpu, -1, 0);
+	if (fd == -1)
+		return -EINVAL;
+
+	addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
+	if (addr == MAP_FAILED)
+		goto out_close;
+
+	err = perf_read_tsc_conversion(addr, tc);
+
+	munmap(addr, len);
+out_close:
+	close(fd);
+	return err;
+}
+
+static struct perf_cpu find_a_cpu(void)
+{
+	struct perf_cpu_map *cpus;
+	struct perf_cpu cpu = { .cpu = 0 };
+
+	cpus = perf_cpu_map__new(NULL);
+	if (!cpus)
+		return cpu;
+	cpu = cpus->map[0];
+	perf_cpu_map__put(cpus);
+	return cpu;
+}
+
+int perf_read_tsc_conv_for_clockid(s32 clockid, bool ns_clock,
+				   struct perf_tsc_conversion *tc)
+{
+	struct perf_event_attr attr = {
+		.size		= sizeof(attr),
+		.type		= PERF_TYPE_SOFTWARE,
+		.config		= PERF_COUNT_SW_DUMMY,
+		.exclude_kernel	= 1,
+		.use_clockid	= 1,
+		.ns_clockid	= ns_clock,
+		.clockid	= clockid,
+	};
+
+	return perf_read_tsc_conv_attr_cpu(&attr, find_a_cpu(), tc);
+}
+
 int perf_event__synth_time_conv(const struct perf_event_mmap_page *pc,
 				struct perf_tool *tool,
 				perf_event__handler_t process,
diff --git a/tools/perf/util/tsc.h b/tools/perf/util/tsc.h
index 7d83a31732a7..af3f9e6a1beb 100644
--- a/tools/perf/util/tsc.h
+++ b/tools/perf/util/tsc.h
@@ -21,6 +21,8 @@ struct perf_event_mmap_page;
 
 int perf_read_tsc_conversion(const struct perf_event_mmap_page *pc,
 			     struct perf_tsc_conversion *tc);
+int perf_read_tsc_conv_for_clockid(s32 clockid, bool ns_clockid,
+				   struct perf_tsc_conversion *tc);
 
 u64 perf_time_to_tsc(u64 ns, struct perf_tsc_conversion *tc);
 u64 tsc_to_perf_time(u64 cyc, struct perf_tsc_conversion *tc);
-- 
2.25.1



* [PATCH V2 08/11] perf intel-pt: Add support for new clock IDs
From: Adrian Hunter @ 2022-02-14 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

Add support for new clock IDs CLOCK_PERF_HW_CLOCK and
CLOCK_PERF_HW_CLOCK_NS. Mainly this means also keeping TSC conversion
information for CLOCK_PERF_HW_CLOCK_NS when CLOCK_PERF_HW_CLOCK is
being used, so that conversions from nanoseconds can still be done when
the perf event clock is TSC.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/arch/x86/util/intel-pt.c | 37 ++++++++++++++++++++++++++---
 tools/perf/util/intel-pt.c          | 21 ++++++++++++----
 tools/perf/util/intel-pt.h          |  2 +-
 3 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/tools/perf/arch/x86/util/intel-pt.c b/tools/perf/arch/x86/util/intel-pt.c
index 8c31578d6f4a..5424c42337e7 100644
--- a/tools/perf/arch/x86/util/intel-pt.c
+++ b/tools/perf/arch/x86/util/intel-pt.c
@@ -290,6 +290,21 @@ static const char *intel_pt_find_filter(struct evlist *evlist,
 	return NULL;
 }
 
+static bool intel_pt_clockid(struct evlist *evlist, struct perf_pmu *intel_pt_pmu, s32 clockid)
+{
+	struct evsel *evsel;
+
+	evlist__for_each_entry(evlist, evsel) {
+		if (evsel->core.attr.type == intel_pt_pmu->type &&
+		    evsel->core.attr.use_clockid &&
+		    evsel->core.attr.ns_clockid &&
+		    evsel->core.attr.clockid == clockid)
+			return true;
+	}
+
+	return false;
+}
+
 static size_t intel_pt_filter_bytes(const char *filter)
 {
 	size_t len = filter ? strlen(filter) : 0;
@@ -304,9 +319,11 @@ intel_pt_info_priv_size(struct auxtrace_record *itr, struct evlist *evlist)
 			container_of(itr, struct intel_pt_recording, itr);
 	const char *filter = intel_pt_find_filter(evlist, ptr->intel_pt_pmu);
 
-	ptr->priv_size = (INTEL_PT_AUXTRACE_PRIV_MAX * sizeof(u64)) +
+	ptr->priv_size = (INTEL_PT_AUXTRACE_PRIV_FIXED * sizeof(u64)) +
 			 intel_pt_filter_bytes(filter);
 	ptr->priv_size += sizeof(u64); /* Cap Event Trace */
+	ptr->priv_size += sizeof(u64); /* ns Time Shift */
+	ptr->priv_size += sizeof(u64); /* ns Time Multiplier */
 
 	return ptr->priv_size;
 }
@@ -414,6 +431,18 @@ static int intel_pt_info_fill(struct auxtrace_record *itr,
 
 	*info++ = event_trace;
 
+	if (intel_pt_clockid(session->evlist, ptr->intel_pt_pmu, CLOCK_PERF_HW_CLOCK)) {
+		struct perf_tsc_conversion ns_tc;
+
+		if (perf_read_tsc_conv_for_clockid(CLOCK_PERF_HW_CLOCK_NS, true, &ns_tc))
+			return -EINVAL;
+		*info++ = ns_tc.time_shift;
+		*info++ = ns_tc.time_mult;
+	} else {
+		*info++ = tc.time_shift;
+		*info++ = tc.time_mult;
+	}
+
 	return 0;
 }
 
@@ -664,8 +693,10 @@ static int intel_pt_recording_options(struct auxtrace_record *itr,
 		return -EINVAL;
 	}
 
-	if (opts->use_clockid) {
-		pr_err("Cannot use clockid (-k option) with " INTEL_PT_PMU_NAME "\n");
+	if (opts->use_clockid && opts->clockid != CLOCK_PERF_HW_CLOCK_NS &&
+	    opts->clockid != CLOCK_PERF_HW_CLOCK) {
+		pr_err("Cannot use clockid (-k option) with " INTEL_PT_PMU_NAME
+		       " except CLOCK_PERF_HW_CLOCK_NS and CLOCK_PERF_HW_CLOCK\n");
 		return -EINVAL;
 	}
 
diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index ec43d364d0de..10d47759a41e 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -89,6 +89,8 @@ struct intel_pt {
 
 	struct perf_tsc_conversion tc;
 	bool cap_user_time_zero;
+	u16 ns_time_shift;
+	u32 ns_time_mult;
 
 	struct itrace_synth_opts synth_opts;
 
@@ -1100,10 +1102,10 @@ static u64 intel_pt_ns_to_ticks(const struct intel_pt *pt, u64 ns)
 {
 	u64 quot, rem;
 
-	quot = ns / pt->tc.time_mult;
-	rem  = ns % pt->tc.time_mult;
-	return (quot << pt->tc.time_shift) + (rem << pt->tc.time_shift) /
-		pt->tc.time_mult;
+	quot = ns / pt->ns_time_mult;
+	rem  = ns % pt->ns_time_mult;
+	return (quot << pt->ns_time_shift) + (rem << pt->ns_time_shift) /
+		pt->ns_time_mult;
 }
 
 static struct ip_callchain *intel_pt_alloc_chain(struct intel_pt *pt)
@@ -3987,6 +3989,17 @@ int intel_pt_process_auxtrace_info(union perf_event *event,
 				pt->cap_event_trace);
 	}
 
+	if ((void *)info < info_end) {
+		pt->ns_time_shift = *info++;
+		pt->ns_time_mult = *info++;
+		if (dump_trace) {
+			fprintf(stdout, "  ns Time Shift       %d\n", pt->ns_time_shift);
+			fprintf(stdout, "  ns Time Multiplier  %d\n", pt->ns_time_mult);
+		}
+	}
+	if (!pt->ns_time_mult)
+		pt->ns_time_mult = 1;
+
 	pt->timeless_decoding = intel_pt_timeless_decoding(pt);
 	if (pt->timeless_decoding && !pt->tc.time_mult)
 		pt->tc.time_mult = 1;
diff --git a/tools/perf/util/intel-pt.h b/tools/perf/util/intel-pt.h
index c7d6068e3a6b..a2c4474641c0 100644
--- a/tools/perf/util/intel-pt.h
+++ b/tools/perf/util/intel-pt.h
@@ -27,7 +27,7 @@ enum {
 	INTEL_PT_CYC_BIT,
 	INTEL_PT_MAX_NONTURBO_RATIO,
 	INTEL_PT_FILTER_STR_LEN,
-	INTEL_PT_AUXTRACE_PRIV_MAX,
+	INTEL_PT_AUXTRACE_PRIV_FIXED,
 };
 
 struct auxtrace_record;
-- 
2.25.1



* [PATCH V2 09/11] perf intel-pt: Use CLOCK_PERF_HW_CLOCK_NS by default
From: Adrian Hunter @ 2022-02-14 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

Make CLOCK_PERF_HW_CLOCK_NS the default clock for Intel PT if it is
supported.  To allow that default to be overridden, also support --no-clockid.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/arch/x86/util/intel-pt.c | 6 ++++++
 tools/perf/util/clockid.c           | 1 +
 tools/perf/util/record.h            | 1 +
 3 files changed, 8 insertions(+)

diff --git a/tools/perf/arch/x86/util/intel-pt.c b/tools/perf/arch/x86/util/intel-pt.c
index 5424c42337e7..bba55b6f75b6 100644
--- a/tools/perf/arch/x86/util/intel-pt.c
+++ b/tools/perf/arch/x86/util/intel-pt.c
@@ -927,6 +927,12 @@ static int intel_pt_recording_options(struct auxtrace_record *itr,
 		evsel__reset_sample_bit(tracking_evsel, BRANCH_STACK);
 	}
 
+	if (!opts->use_clockid && !opts->no_clockid && perf_can_perf_clock_hw_clock_ns()) {
+		opts->use_clockid = true;
+		opts->ns_clockid = true;
+		opts->clockid = CLOCK_PERF_HW_CLOCK_NS;
+	}
+
 	/*
 	 * Warn the user when we do not have enough information to decode i.e.
 	 * per-cpu with no sched_switch (except workload-only).
diff --git a/tools/perf/util/clockid.c b/tools/perf/util/clockid.c
index 2fcffee690e1..f9c0200e1ec2 100644
--- a/tools/perf/util/clockid.c
+++ b/tools/perf/util/clockid.c
@@ -81,6 +81,7 @@ int parse_clockid(const struct option *opt, const char *str, int unset)
 
 	if (unset) {
 		opts->use_clockid = 0;
+		opts->no_clockid = true;
 		return 0;
 	}
 
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index 1dbbf6b314dc..9a1dabfd158b 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -68,6 +68,7 @@ struct record_opts {
 	int	      initial_delay;
 	bool	      use_clockid;
 	bool	      ns_clockid;
+	bool	      no_clockid;
 	clockid_t     clockid;
 	u64	      clockid_res_ns;
 	int	      nr_cblocks;
-- 
2.25.1



* [PATCH V2 10/11] perf intel-pt: Add config variables for timing parameters
From: Adrian Hunter @ 2022-02-14 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

Parameters needed to correctly interpret timing packets might be missing
in a virtual machine because the CPUID leaf or MSR is not supported by the
hypervisor / KVM.

Add perf config variables to overcome that for max_nonturbo_ratio (missing
from MSR_PLATFORM_INFO) and tsc_art_ratio (missing from CPUID leaf 0x15),
both of which were observed to be missing under QEMU / KVM.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/Documentation/perf-config.txt | 18 ++++++++
 tools/perf/arch/x86/util/intel-pt.c      | 52 +++++++++++++++++++++++-
 tools/perf/util/intel-pt.c               |  6 +++
 tools/perf/util/intel-pt.h               |  5 +++
 4 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-config.txt b/tools/perf/Documentation/perf-config.txt
index 0420e71698ee..3c4fc641fde7 100644
--- a/tools/perf/Documentation/perf-config.txt
+++ b/tools/perf/Documentation/perf-config.txt
@@ -709,7 +709,11 @@ stat.*::
 
 intel-pt.*::
 
+	Variables that affect Intel PT.
+
 	intel-pt.cache-divisor::
+		If set, the decoder instruction cache size is based on DSO size
+		divided by this number.
 
 	intel-pt.mispred-all::
 		If set, Intel PT decoder will set the mispred flag on all
@@ -721,6 +725,20 @@ intel-pt.*::
 		the maximum is exceeded there will be a "Never-ending loop"
 		error. The default is 100000.
 
+	intel-pt.max_nonturbo_ratio::
+		The kernel provides /sys/bus/event_source/devices/intel_pt/max_nonturbo_ratio
+		which can be zero in a virtual machine.  The decoder needs this
+		information to correctly interpret timing packets, so the value
+		can be provided by this variable in that case. Note in the absence
+		of VMCS TSC Scaling, this is probably the same as the host value.
+
+	intel-pt.tsc_art_ratio::
+		The kernel provides /sys/bus/event_source/devices/intel_pt/tsc_art_ratio
+		which can be 0:0 in a virtual machine.  The decoder needs this
+		information to correctly interpret timing packets, so the value
+		can be provided by this variable in that case. Note in the absence
+		of VMCS TSC Scaling, this is probably the same as the host value.
+
 auxtrace.*::
 
 	auxtrace.dumpdir::
diff --git a/tools/perf/arch/x86/util/intel-pt.c b/tools/perf/arch/x86/util/intel-pt.c
index bba55b6f75b6..d7f6a73095e3 100644
--- a/tools/perf/arch/x86/util/intel-pt.c
+++ b/tools/perf/arch/x86/util/intel-pt.c
@@ -24,6 +24,7 @@
 #include "../../../util/parse-events.h"
 #include "../../../util/pmu.h"
 #include "../../../util/debug.h"
+#include "../../../util/config.h"
 #include "../../../util/auxtrace.h"
 #include "../../../util/perf_api_probe.h"
 #include "../../../util/record.h"
@@ -328,15 +329,60 @@ intel_pt_info_priv_size(struct auxtrace_record *itr, struct evlist *evlist)
 	return ptr->priv_size;
 }
 
+struct tsc_art_ratio {
+	u32 *n;
+	u32 *d;
+};
+
+static int intel_pt_tsc_art_ratio(const char *var, const char *value, void *data)
+{
+	if (!strcmp(var, "intel-pt.tsc_art_ratio")) {
+		struct tsc_art_ratio *r = data;
+
+		if (sscanf(value, "%u:%u", r->n, r->d) != 2)
+			return -EINVAL;
+	}
+	return 0;
+}
+
+void intel_pt_tsc_ctc_ratio_from_config(u32 *n, u32 *d)
+{
+	struct tsc_art_ratio data = { .n = n, .d = d };
+
+	*n = 0;
+	*d = 0;
+	perf_config(intel_pt_tsc_art_ratio, &data);
+}
+
 static void intel_pt_tsc_ctc_ratio(u32 *n, u32 *d)
 {
 	unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
 
 	__get_cpuid(0x15, &eax, &ebx, &ecx, &edx);
+	if (!eax || !ebx) {
+		intel_pt_tsc_ctc_ratio_from_config(n, d);
+		return;
+	}
 	*n = ebx;
 	*d = eax;
 }
 
+static int intel_pt_max_nonturbo_ratio(const char *var, const char *value, void *data)
+{
+	if (!strcmp(var, "intel-pt.max_nonturbo_ratio")) {
+		unsigned int *max_nonturbo_ratio = data;
+
+		if (sscanf(value, "%u", max_nonturbo_ratio) != 1)
+			return -EINVAL;
+	}
+	return 0;
+}
+
+void intel_pt_max_nonturbo_ratio_from_config(unsigned int *max_non_turbo_ratio)
+{
+	perf_config(intel_pt_max_nonturbo_ratio, max_non_turbo_ratio);
+}
+
 static int intel_pt_info_fill(struct auxtrace_record *itr,
 			      struct perf_session *session,
 			      struct perf_record_auxtrace_info *auxtrace_info,
@@ -350,7 +396,7 @@ static int intel_pt_info_fill(struct auxtrace_record *itr,
 	bool cap_user_time_zero = false, per_cpu_mmaps;
 	u64 tsc_bit, mtc_bit, mtc_freq_bits, cyc_bit, noretcomp_bit;
 	u32 tsc_ctc_ratio_n, tsc_ctc_ratio_d;
-	unsigned long max_non_turbo_ratio;
+	unsigned int max_non_turbo_ratio;
 	size_t filter_str_len;
 	const char *filter;
 	int event_trace;
@@ -374,8 +420,10 @@ static int intel_pt_info_fill(struct auxtrace_record *itr,
 	intel_pt_tsc_ctc_ratio(&tsc_ctc_ratio_n, &tsc_ctc_ratio_d);
 
 	if (perf_pmu__scan_file(intel_pt_pmu, "max_nonturbo_ratio",
-				"%lu", &max_non_turbo_ratio) != 1)
+				"%u", &max_non_turbo_ratio) != 1)
 		max_non_turbo_ratio = 0;
+	if (!max_non_turbo_ratio)
+		intel_pt_max_nonturbo_ratio_from_config(&max_non_turbo_ratio);
 	if (perf_pmu__scan_file(intel_pt_pmu, "caps/event_trace",
 				"%d", &event_trace) != 1)
 		event_trace = 0;
diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index 10d47759a41e..6fa76b584537 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -3934,6 +3934,9 @@ int intel_pt_process_auxtrace_info(union perf_event *event,
 				    INTEL_PT_CYC_BIT);
 	}
 
+	if (!pt->tsc_ctc_ratio_n || !pt->tsc_ctc_ratio_d)
+		intel_pt_tsc_ctc_ratio_from_config(&pt->tsc_ctc_ratio_n, &pt->tsc_ctc_ratio_d);
+
 	if (intel_pt_has(auxtrace_info, INTEL_PT_MAX_NONTURBO_RATIO)) {
 		pt->max_non_turbo_ratio =
 			auxtrace_info->priv[INTEL_PT_MAX_NONTURBO_RATIO];
@@ -3942,6 +3945,9 @@ int intel_pt_process_auxtrace_info(union perf_event *event,
 				    INTEL_PT_MAX_NONTURBO_RATIO);
 	}
 
+	if (!pt->max_non_turbo_ratio)
+		intel_pt_max_nonturbo_ratio_from_config(&pt->max_non_turbo_ratio);
+
 	info = &auxtrace_info->priv[INTEL_PT_FILTER_STR_LEN] + 1;
 	info_end = (void *)auxtrace_info + auxtrace_info->header.size;
 
diff --git a/tools/perf/util/intel-pt.h b/tools/perf/util/intel-pt.h
index a2c4474641c0..99ac73f4a648 100644
--- a/tools/perf/util/intel-pt.h
+++ b/tools/perf/util/intel-pt.h
@@ -7,6 +7,8 @@
 #ifndef INCLUDE__PERF_INTEL_PT_H__
 #define INCLUDE__PERF_INTEL_PT_H__
 
+#include <linux/types.h>
+
 #define INTEL_PT_PMU_NAME "intel_pt"
 
 enum {
@@ -44,4 +46,7 @@ int intel_pt_process_auxtrace_info(union perf_event *event,
 
 struct perf_event_attr *intel_pt_pmu_default_config(struct perf_pmu *pmu);
 
+void intel_pt_tsc_ctc_ratio_from_config(u32 *n, u32 *d);
+void intel_pt_max_nonturbo_ratio_from_config(unsigned int *max_non_turbo_ratio);
+
 #endif
-- 
2.25.1
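As a usage sketch, the new variables could be set in a ~/.perfconfig fragment like the one below. The numbers are invented examples only; on a real setup they would be taken from the host's sysfs files named in the commit message (max_nonturbo_ratio as a single unsigned value, tsc_art_ratio in the "numerator:denominator" form that the sscanf in the patch expects):

```
[intel-pt]
	max_nonturbo_ratio = 24
	tsc_art_ratio = 250:3
```

Because the decoder falls back to these values only when the sysfs values are zero or 0:0, setting them on a bare-metal host should have no effect.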



* [PATCH V2 11/11] perf intel-pt: Add documentation for new clock IDs
From: Adrian Hunter @ 2022-02-14 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

Add brief documentation for new clock IDs CLOCK_PERF_HW_CLOCK and
CLOCK_PERF_HW_CLOCK_NS, as well as new config variables
intel-pt.max_nonturbo_ratio and intel-pt.tsc_art_ratio.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/Documentation/perf-intel-pt.txt | 47 ++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/tools/perf/Documentation/perf-intel-pt.txt b/tools/perf/Documentation/perf-intel-pt.txt
index ff58bd4c381b..45f750024e3d 100644
--- a/tools/perf/Documentation/perf-intel-pt.txt
+++ b/tools/perf/Documentation/perf-intel-pt.txt
@@ -509,6 +509,31 @@ notnt		Disable TNT packets.  Without TNT packets, it is not possible to walk
 		"0" otherwise.
 
 
+perf event clock
+~~~~~~~~~~~~~~~~
+
+Newer kernels and tools support 2 special clocks: CLOCK_PERF_HW_CLOCK, which is
+TSC, and CLOCK_PERF_HW_CLOCK_NS, which is TSC converted to nanoseconds.
+CLOCK_PERF_HW_CLOCK_NS is the same as the default perf event clock, but it is
+not subject to paravirtualization, so it still works with Intel PT in a VM
+guest.  CLOCK_PERF_HW_CLOCK_NS is used by default if it is supported.
+
+To use TSC instead of nanoseconds, use the option:
+
+	--clockid CLOCK_PERF_HW_CLOCK
+
+Beware that the time stamp of events will then show TSC ticks
+(divided by 1,000,000,000), not seconds.
+
+To use the default perf event clock instead of CLOCK_PERF_HW_CLOCK_NS when
+CLOCK_PERF_HW_CLOCK_NS is supported, use the option:
+
+	--no-clockid
+
+Other clocks are not supported for use with Intel PT because they cannot be
+converted to/from TSC.
+
+
 AUX area sampling option
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -1398,6 +1423,28 @@ There were none.
           :17006 17006 [001] 11500.262869216:  ffffffff8220116e error_entry+0xe ([guest.kernel.kallsyms])               pushq  %rax
 
 
+Tracing within a Virtual Machine
+--------------------------------
+
+Tracing with Intel PT within a virtual machine normally cannot use TSC
+because the perf event clock is subject to paravirtualization.  That is
+overcome by the new CLOCK_PERF_HW_CLOCK_NS clock - refer to 'perf event clock'
+above.  In addition, in a VM, the following might be zero:
+
+	/sys/bus/event_source/devices/intel_pt/max_nonturbo_ratio
+	/sys/bus/event_source/devices/intel_pt/tsc_art_ratio
+
+The decoder needs this information to correctly interpret timing packets,
+so the values can be provided by config variables in that case. Note in
+the absence of VMCS TSC Scaling, these are probably the same as the host values.
+The config variables are:
+
+	intel-pt.max_nonturbo_ratio
+	intel-pt.tsc_art_ratio
+
+For more information about perf config variables, refer to linkperf:perf-config[1].
+
+
 Event Trace
 -----------
 
-- 
2.25.1



* Re: [PATCH V2 00/11] perf intel-pt: Add perf event clocks to better support VM tracing
From: Adrian Hunter @ 2022-02-21  6:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

On 14/02/2022 13:09, Adrian Hunter wrote:
> Hi
> 
> These patches add 2 new perf event clocks based on TSC for use with VMs.
> 
> The first patch is a minor fix, the next 2 patches add each of the 2 new
> clocks.  The remaining patches add minimal tools support and are based on
> top of the Intel PT Event Trace tools' patches.
> 
> The future work, to add the ability to use perf inject to inject perf
> events from a VM guest perf.data file into a VM host perf.data file,
> has yet to be implemented.
> 
> 
> Changes in V2:
>       perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset
> 	  Add __sched_clock_offset unconditionally
> 
>       perf/x86: Add support for TSC as a perf event clock
> 	  Use an attribute bit 'ns_clockid' to identify non-standard clockids
> 
>       perf/x86: Add support for TSC in nanoseconds as a perf event clock
> 	  Do not affect use of __sched_clock_offset
> 	  Adjust to use 'ns_clockid'

Any comments on version 2?

> 
>       perf tools: Add new perf clock IDs
>       perf tools: Add API probes for new clock IDs
>       perf tools: Add new clock IDs to "perf time to TSC" test
>       perf tools: Add perf_read_tsc_conv_for_clockid()
>       perf intel-pt: Add support for new clock IDs
>       perf intel-pt: Use CLOCK_PERF_HW_CLOCK_NS by default
>       perf intel-pt: Add config variables for timing parameters
>       perf intel-pt: Add documentation for new clock IDs
> 	  Adjust to use 'ns_clockid'
> 
> 
> Adrian Hunter (11):
>       perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset
>       perf/x86: Add support for TSC as a perf event clock
>       perf/x86: Add support for TSC in nanoseconds as a perf event clock
>       perf tools: Add new perf clock IDs
>       perf tools: Add API probes for new clock IDs
>       perf tools: Add new clock IDs to "perf time to TSC" test
>       perf tools: Add perf_read_tsc_conv_for_clockid()
>       perf intel-pt: Add support for new clock IDs
>       perf intel-pt: Use CLOCK_PERF_HW_CLOCK_NS by default
>       perf intel-pt: Add config variables for timing parameters
>       perf intel-pt: Add documentation for new clock IDs
> 
>  arch/x86/events/core.c                     | 39 ++++++++++--
>  arch/x86/include/asm/perf_event.h          |  5 ++
>  arch/x86/kernel/tsc.c                      |  2 +-
>  include/uapi/linux/perf_event.h            | 18 +++++-
>  kernel/events/core.c                       | 63 +++++++++++++-------
>  tools/include/uapi/linux/perf_event.h      | 18 +++++-
>  tools/perf/Documentation/perf-config.txt   | 18 ++++++
>  tools/perf/Documentation/perf-intel-pt.txt | 47 +++++++++++++++
>  tools/perf/Documentation/perf-record.txt   |  9 ++-
>  tools/perf/arch/x86/util/intel-pt.c        | 95 ++++++++++++++++++++++++++++--
>  tools/perf/builtin-record.c                |  2 +-
>  tools/perf/tests/perf-time-to-tsc.c        | 42 ++++++++++---
>  tools/perf/util/clockid.c                  | 14 +++++
>  tools/perf/util/evsel.c                    |  1 +
>  tools/perf/util/intel-pt.c                 | 27 +++++++--
>  tools/perf/util/intel-pt.h                 |  7 ++-
>  tools/perf/util/perf_api_probe.c           | 24 ++++++++
>  tools/perf/util/perf_api_probe.h           |  2 +
>  tools/perf/util/perf_event_attr_fprintf.c  |  1 +
>  tools/perf/util/record.h                   |  2 +
>  tools/perf/util/tsc.c                      | 58 ++++++++++++++++++
>  tools/perf/util/tsc.h                      |  2 +
>  22 files changed, 444 insertions(+), 52 deletions(-)
> 
> 
> Regards
> Adrian



* Re: [PATCH V2 00/11] perf intel-pt: Add perf event clocks to better support VM tracing
From: Adrian Hunter @ 2022-03-01 11:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

On 21/02/2022 08:54, Adrian Hunter wrote:
> On 14/02/2022 13:09, Adrian Hunter wrote:
>> Hi
>>
>> These patches add 2 new perf event clocks based on TSC for use with VMs.
>>
>> The first patch is a minor fix, the next 2 patches add each of the 2 new
>> clocks.  The remaining patches add minimal tools support and are based on
>> top of the Intel PT Event Trace tools' patches.
>>
>> The future work, to add the ability to use perf inject to inject perf
>> events from a VM guest perf.data file into a VM host perf.data file,
>> has yet to be implemented.
>>
>>
>> Changes in V2:
>>       perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset
>> 	  Add __sched_clock_offset unconditionally
>>
>>       perf/x86: Add support for TSC as a perf event clock
>> 	  Use an attribute bit 'ns_clockid' to identify non-standard clockids
>>
>>       perf/x86: Add support for TSC in nanoseconds as a perf event clock
>> 	  Do not affect use of __sched_clock_offset
>> 	  Adjust to use 'ns_clockid'
> 
> Any comments on version 2?

☺/


* Re: [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock
From: Peter Zijlstra @ 2022-03-04 12:30 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
> Currently, using Intel PT to trace a VM guest is limited to kernel space
> because decoding requires side band events such as MMAP and CONTEXT_SWITCH.
> While these events can be collected for the host, there is not a way to do
> that yet for a guest. One approach, would be to collect them inside the
> guest, but that would require being able to synchronize with host
> timestamps.
> 
> The motivation for this patch is to provide a clock that can be used within
> a VM guest, and that correlates to a VM host clock. In the case of TSC, if
> the hypervisor leaves rdtsc alone, the TSC value will be subject only to
> the VMCS TSC Offset and Scaling. Adjusting for that would make it possible
> to inject events from a guest perf.data file, into a host perf.data file.
> 
> Thus making possible the collection of VM guest side band for Intel PT
> decoding.
> 
> There are other potential benefits of TSC as a perf event clock:
> 	- ability to work directly with TSC
> 	- ability to inject non-Intel-PT-related events from a guest
> 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
>  arch/x86/events/core.c            | 16 +++++++++
>  arch/x86/include/asm/perf_event.h |  3 ++
>  include/uapi/linux/perf_event.h   | 12 ++++++-
>  kernel/events/core.c              | 57 +++++++++++++++++++------------
>  4 files changed, 65 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index e686c5e0537b..51d5345de30a 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2728,6 +2728,17 @@ void arch_perf_update_userpage(struct perf_event *event,
>  		!!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
>  	userpg->pmc_width = x86_pmu.cntval_bits;
>  
> +	if (event->attr.use_clockid &&
> +	    event->attr.ns_clockid &&
> +	    event->attr.clockid == CLOCK_PERF_HW_CLOCK) {
> +		userpg->cap_user_time_zero = 1;
> +		userpg->time_mult = 1;
> +		userpg->time_shift = 0;
> +		userpg->time_offset = 0;
> +		userpg->time_zero = 0;
> +		return;
> +	}
> +
>  	if (!using_native_sched_clock() || !sched_clock_stable())
>  		return;

This looks the wrong way around. If TSC is found unstable, we should
never expose it.

And I'm not at all sure about the whole virt thing. Last time I looked
at pvclock it made no sense at all.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock
  2022-02-14 11:09 ` [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock Adrian Hunter
  2022-03-04 12:30   ` Peter Zijlstra
@ 2022-03-04 12:32   ` Peter Zijlstra
  2022-03-04 17:51     ` Thomas Gleixner
  2022-03-04 12:33   ` Peter Zijlstra
  2 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2022-03-04 12:32 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 82858b697c05..e8617efd552b 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -290,6 +290,15 @@ enum {
>  	PERF_TXN_ABORT_SHIFT = 32,
>  };
>  
> +/*
> + * If supported, clockid value to select an architecture dependent hardware
> + * clock. Note this means the unit of time is ticks not nanoseconds.
> + * Requires ns_clockid to be set in addition to use_clockid.
> + * On x86, this clock is provided by the rdtsc instruction, and is not
> + * paravirtualized.
> + */
> +#define CLOCK_PERF_HW_CLOCK		0x10000000
> +
>  /*
>   * The format of the data returned by read() on a perf event fd,
>   * as specified by attr.read_format:
> @@ -409,7 +418,8 @@ struct perf_event_attr {
>  				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
>  				remove_on_exec :  1, /* event is removed from task on exec */
>  				sigtrap        :  1, /* send synchronous SIGTRAP on event */
> -				__reserved_1   : 26;
> +				ns_clockid     :  1, /* non-standard clockid */
> +				__reserved_1   : 25;
>  
>  	union {
>  		__u32		wakeup_events;	  /* wakeup every n events */

Thomas, do we want to gate this behind this magic flag, or can that
CLOCKID be granted unconditionally?


* Re: [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock
  2022-02-14 11:09 ` [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock Adrian Hunter
  2022-03-04 12:30   ` Peter Zijlstra
  2022-03-04 12:32   ` Peter Zijlstra
@ 2022-03-04 12:33   ` Peter Zijlstra
  2022-03-04 12:41     ` Adrian Hunter
  2 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2022-03-04 12:33 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
> +u64 perf_hw_clock(void)
> +{
> +	return rdtsc_ordered();
> +}

Why the _ordered ?


* Re: [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock
  2022-03-04 12:33   ` Peter Zijlstra
@ 2022-03-04 12:41     ` Adrian Hunter
  0 siblings, 0 replies; 46+ messages in thread
From: Adrian Hunter @ 2022-03-04 12:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

On 04/03/2022 14:33, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
>> +u64 perf_hw_clock(void)
>> +{
>> +	return rdtsc_ordered();
>> +}
> 
> Why the _ordered ?

To be on the safe side, in case it matters.  trace_clock_x86_tsc() also uses the ordered variant.
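[Editorial note: the distinction at issue is that a plain RDTSC may execute speculatively, out of order with surrounding instructions, while the kernel's rdtsc_ordered() pairs the read with a barrier (LFENCE on Intel; AMD uses a different barrier). A user-space approximation, purely illustrative and not the kernel code, with a fallback so the sketch builds on non-x86:]

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
/* Plain rdtsc: the CPU may reorder it with earlier instructions. */
static inline uint64_t tsc_plain(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

/*
 * Ordered variant: the lfence keeps the TSC read from completing
 * before earlier loads, roughly mirroring rdtsc_ordered() on Intel.
 */
static inline uint64_t tsc_ordered(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("lfence; rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}
#else
/* Fallback counter so this sketch compiles on non-x86 builds. */
static uint64_t fake_tsc;
static inline uint64_t tsc_plain(void)   { return ++fake_tsc; }
static inline uint64_t tsc_ordered(void) { return ++fake_tsc; }
#endif
```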


* Re: [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock
  2022-03-04 12:30   ` Peter Zijlstra
@ 2022-03-04 13:03     ` Adrian Hunter
  0 siblings, 0 replies; 46+ messages in thread
From: Adrian Hunter @ 2022-03-04 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

On 04/03/2022 14:30, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
>> Currently, using Intel PT to trace a VM guest is limited to kernel space
>> because decoding requires side band events such as MMAP and CONTEXT_SWITCH.
>> While these events can be collected for the host, there is not a way to do
>> that yet for a guest. One approach would be to collect them inside the
>> guest, but that would require being able to synchronize with host
>> timestamps.
>>
>> The motivation for this patch is to provide a clock that can be used within
>> a VM guest, and that correlates to a VM host clock. In the case of TSC, if
>> the hypervisor leaves rdtsc alone, the TSC value will be subject only to
>> the VMCS TSC Offset and Scaling. Adjusting for that would make it possible
>> to inject events from a guest perf.data file into a host perf.data file.
>>
>> Thus making possible the collection of VM guest side band for Intel PT
>> decoding.
>>
>> There are other potential benefits of TSC as a perf event clock:
>> 	- ability to work directly with TSC
>> 	- ability to inject non-Intel-PT-related events from a guest
>>
>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
>> ---
>>  arch/x86/events/core.c            | 16 +++++++++
>>  arch/x86/include/asm/perf_event.h |  3 ++
>>  include/uapi/linux/perf_event.h   | 12 ++++++-
>>  kernel/events/core.c              | 57 +++++++++++++++++++------------
>>  4 files changed, 65 insertions(+), 23 deletions(-)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index e686c5e0537b..51d5345de30a 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -2728,6 +2728,17 @@ void arch_perf_update_userpage(struct perf_event *event,
>>  		!!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
>>  	userpg->pmc_width = x86_pmu.cntval_bits;
>>  
>> +	if (event->attr.use_clockid &&
>> +	    event->attr.ns_clockid &&
>> +	    event->attr.clockid == CLOCK_PERF_HW_CLOCK) {
>> +		userpg->cap_user_time_zero = 1;
>> +		userpg->time_mult = 1;
>> +		userpg->time_shift = 0;
>> +		userpg->time_offset = 0;
>> +		userpg->time_zero = 0;
>> +		return;
>> +	}
>> +
>>  	if (!using_native_sched_clock() || !sched_clock_stable())
>>  		return;
> 
> This looks the wrong way around. If TSC is found unstable, we should
> never expose it.

Intel PT traces contain TSC whether or not it is stable, and it could
still be usable in some cases, e.g. short traces on a single CPU.

Ftrace seems to offer x86-tsc unconditionally as a clock.

We could add warnings to comments and documentation about its potential
pitfalls.

> 
> And I'm not at all sure about the whole virt thing. Last time I looked
> at pvclock it made no sense at all.

It is certainly not useful for synchronizing events against TSC.


* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-02-14 11:09 ` [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds " Adrian Hunter
@ 2022-03-04 13:41   ` Peter Zijlstra
  2022-03-04 18:27     ` Adrian Hunter
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2022-03-04 13:41 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

On Mon, Feb 14, 2022 at 01:09:06PM +0200, Adrian Hunter wrote:
> Currently, when Intel PT is used within a VM guest, it is not possible to
> make use of TSC because perf clock is subject to paravirtualization.

Yeah, so how much of that still makes sense, or ever did? AFAIK the
whole pv_clock thing is utter crazy. Should we not fix that instead?


* Re: [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock
  2022-03-04 12:32   ` Peter Zijlstra
@ 2022-03-04 17:51     ` Thomas Gleixner
  0 siblings, 0 replies; 46+ messages in thread
From: Thomas Gleixner @ 2022-03-04 17:51 UTC (permalink / raw)
  To: Peter Zijlstra, Adrian Hunter
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	kvm, H Peter Anvin, Mathieu Poirier, Suzuki K Poulose, Leo Yan

On Fri, Mar 04 2022 at 13:32, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 01:09:05PM +0200, Adrian Hunter wrote:
>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>> index 82858b697c05..e8617efd552b 100644
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -290,6 +290,15 @@ enum {
>>  	PERF_TXN_ABORT_SHIFT = 32,
>>  };
>>  
>> +/*
>> + * If supported, clockid value to select an architecture dependent hardware
>> + * clock. Note this means the unit of time is ticks not nanoseconds.
>> + * Requires ns_clockid to be set in addition to use_clockid.
>> + * On x86, this clock is provided by the rdtsc instruction, and is not
>> + * paravirtualized.
>> + */
>> +#define CLOCK_PERF_HW_CLOCK		0x10000000
>> +
>>  /*
>>   * The format of the data returned by read() on a perf event fd,
>>   * as specified by attr.read_format:
>> @@ -409,7 +418,8 @@ struct perf_event_attr {
>>  				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
>>  				remove_on_exec :  1, /* event is removed from task on exec */
>>  				sigtrap        :  1, /* send synchronous SIGTRAP on event */
>> -				__reserved_1   : 26;
>> +				ns_clockid     :  1, /* non-standard clockid */
>> +				__reserved_1   : 25;
>>  
>>  	union {
>>  		__u32		wakeup_events;	  /* wakeup every n events */
>
> Thomas, do we want to gate this behind this magic flag, or can that
> CLOCKID be granted unconditionally?

I'm not seeing a point in that flag, and please define the clockid where
the other clockids are defined. We want a proper ID range for such
magically defined clocks.

We use INT_MIN < id < 16 today. I have plans to expand the ID space past
16, so using something like the above is fine.

Thanks,

        tglx
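[Editorial note: to make the ID-space point concrete: standard clockids are small non-negative integers, dynamic (per-fd) clockids are negative, and the proposed magic value sits far above both. A purely illustrative sketch of such a gate; CLOCK_PERF_HW_CLOCK matches the patch, while the _NS value is an assumption, not taken from the patches:]

```c
#include <stdbool.h>
#include <stdint.h>

#define CLOCK_PERF_HW_CLOCK	0x10000000	/* value from the patch */
#define CLOCK_PERF_HW_CLOCK_NS	0x10000001	/* ns-scaled variant; assumed value */

/*
 * Standard clockids today satisfy 0 <= id < 16; negative ids
 * (INT_MIN < id < 0) encode dynamic per-process clocks.
 */
static bool is_standard_clockid(int32_t id)
{
	return id >= 0 && id < 16;
}

static bool is_perf_hw_clockid(int32_t id)
{
	return id == CLOCK_PERF_HW_CLOCK || id == CLOCK_PERF_HW_CLOCK_NS;
}
```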



* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-03-04 13:41   ` Peter Zijlstra
@ 2022-03-04 18:27     ` Adrian Hunter
  2022-03-07  9:50         ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Adrian Hunter @ 2022-03-04 18:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan

On 04/03/2022 15:41, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 01:09:06PM +0200, Adrian Hunter wrote:
>> Currently, when Intel PT is used within a VM guest, it is not possible to
>> make use of TSC because perf clock is subject to paravirtualization.
> 
> Yeah, so how much of that still makes sense, or ever did? AFAIK the
> whole pv_clock thing is utter crazy. Should we not fix that instead?

Presumably pv_clock must work with different host operating systems.
Similarly, KVM must work with different guest operating systems.
Perhaps I'm wrong, but I imagine re-engineering time virtualization
might be a pretty big deal, far exceeding the scope of these patches.

While it is not something that I really need, it is also not obvious
that the virtualization people would see any benefit.

My primary goal is to be able to make a trace covering the host and
(Linux) guests.  Intel PT can do that.  It can trace straight through
VM-Entries/Exits, politely noting them on the way past.  Perf tools
already supports decoding that, but only for tracing the kernel because
it needs more information (so-called side-band events) to decode guest
user space.  The simplest way to get that is to run perf inside the
guests to capture the side-band events, and then inject them into the
host perf.data file during post processing.  That, however, requires a
clock that works for both host and guests.  TSC is suitable because
KVM largely leaves it alone, except for VMX TSC Offset and Scaling,
but that has to be dealt with anyway because it also affects the
Intel PT trace.
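[Editorial note: the VMX adjustment mentioned here is plain fixed-point arithmetic: with scaling enabled, the guest observes roughly guest_tsc = ((host_tsc * multiplier) >> 48) + offset, where the multiplier carries 48 fractional bits. A hedged sketch of the conversion a post-processing step would need; the helper names are illustrative, not kernel code, and the compiler's unsigned __int128 is assumed (64-bit build):]

```c
#include <stdint.h>

#define VMX_TSC_MULT_FRAC_BITS	48	/* TSC multiplier is 16.48 fixed point */

/* Guest TSC as derived from host TSC under VMX TSC offsetting/scaling. */
static uint64_t host_to_guest_tsc(uint64_t host_tsc, uint64_t mult, int64_t offset)
{
	return (uint64_t)(((unsigned __int128)host_tsc * mult)
			  >> VMX_TSC_MULT_FRAC_BITS) + offset;
}

/* Inverse, as needed when injecting guest-side events into a host trace. */
static uint64_t guest_to_host_tsc(uint64_t guest_tsc, uint64_t mult, int64_t offset)
{
	return (uint64_t)((((unsigned __int128)(guest_tsc - offset))
			   << VMX_TSC_MULT_FRAC_BITS) / mult);
}
```

[With the identity multiplier (1 << 48) only the offset applies; a multiplier of 3 << 47 scales the host TSC by 1.5 before the offset is added.]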


* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-03-04 18:27     ` Adrian Hunter
@ 2022-03-07  9:50         ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2022-03-07  9:50 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: kvm, Alexander Shishkin, Dave Hansen, virtualization,
	H Peter Anvin, Jiri Olsa, sthemmin, x86, pv-drivers, Ingo Molnar,
	Suzuki K Poulose, Arnaldo Carvalho de Melo, Andrew.Cooper3,
	Borislav Petkov, Thomas Gleixner, jgross, Mathieu Poirier,
	seanjc, linux-kernel, Leo Yan, pbonzini

On Fri, Mar 04, 2022 at 08:27:45PM +0200, Adrian Hunter wrote:
> On 04/03/2022 15:41, Peter Zijlstra wrote:
> > On Mon, Feb 14, 2022 at 01:09:06PM +0200, Adrian Hunter wrote:
> >> Currently, when Intel PT is used within a VM guest, it is not possible to
> >> make use of TSC because perf clock is subject to paravirtualization.
> > 
> > Yeah, so how much of that still makes sense, or ever did? AFAIK the
> > whole pv_clock thing is utter crazy. Should we not fix that instead?
> 
> Presumably pv_clock must work with different host operating systems.
> Similarly, KVM must work with different guest operating systems.
> Perhaps I'm wrong, but I imagine re-engineering time virtualization
>> might be a pretty big deal, far exceeding the scope of these patches.

I think not; on both counts. That is, I don't think it's going to be
hard, and even if it were, it would still be the right thing to do.

We're not going to add an interface just to work around a known broken
piece of crap just because we don't want to fix it.

So I'm thinking we should do the below and simply ignore any paravirt
sched clock offered when there's ART on.

---
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 4420499f7bb4..a1f179ed39bf 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
 
 void paravirt_set_sched_clock(u64 (*func)(void))
 {
+	/*
+	 * Anything with ART on promises to have sane TSC, otherwise the whole
+	 * ART thing is useless. In order to make ART useful for guests, we
+	 * should continue to use the TSC. As such, ignore any paravirt
+	 * muckery.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_ART))
+		return;
+
 	static_call_update(pv_sched_clock, func);
 }
 
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-03-07  9:50         ` Peter Zijlstra
@ 2022-03-07 10:06           ` Juergen Gross
  -1 siblings, 0 replies; 46+ messages in thread
From: Juergen Gross via Virtualization @ 2022-03-07 10:06 UTC (permalink / raw)
  To: Peter Zijlstra, Adrian Hunter
  Cc: kvm, Alexander Shishkin, Dave Hansen, virtualization,
	H Peter Anvin, Jiri Olsa, sthemmin, x86, pv-drivers, Ingo Molnar,
	Suzuki K Poulose, Leo Yan, Arnaldo Carvalho de Melo,
	Borislav Petkov, Thomas Gleixner, Mathieu Poirier, seanjc,
	linux-kernel, Andrew.Cooper3, pbonzini


On 07.03.22 10:50, Peter Zijlstra wrote:
> On Fri, Mar 04, 2022 at 08:27:45PM +0200, Adrian Hunter wrote:
>> On 04/03/2022 15:41, Peter Zijlstra wrote:
>>> On Mon, Feb 14, 2022 at 01:09:06PM +0200, Adrian Hunter wrote:
>>>> Currently, when Intel PT is used within a VM guest, it is not possible to
>>>> make use of TSC because perf clock is subject to paravirtualization.
>>>
>>> Yeah, so how much of that still makes sense, or ever did? AFAIK the
>>> whole pv_clock thing is utter crazy. Should we not fix that instead?
>>
>> Presumably pv_clock must work with different host operating systems.
>> Similarly, KVM must work with different guest operating systems.
>> Perhaps I'm wrong, but I imagine re-engineering time virtualization
>> might be a pretty big deal, far exceeding the scope of these patches.
> 
> I think not; on both counts. That is, I don't think it's going to be
> hard, and even if it were, it would still be the right thing to do.
> 
> We're not going to add an interface just to work around a known broken
> piece of crap just because we don't want to fix it.
> 
> So I'm thinking we should do the below and simply ignore any paravirt
> sched clock offered when there's ART on.
> 
> ---
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index 4420499f7bb4..a1f179ed39bf 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>   
>   void paravirt_set_sched_clock(u64 (*func)(void))
>   {
> +	/*
> +	 * Anything with ART on promises to have sane TSC, otherwise the whole
> +	 * ART thing is useless. In order to make ART useful for guests, we
> +	 * should continue to use the TSC. As such, ignore any paravirt
> +	 * muckery.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_ART))
> +		return;
> +
>   	static_call_update(pv_sched_clock, func);
>   }
>   
> 

NAK, this will break live migration of a guest coming from a host
without this feature.


Juergen



* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-03-07 10:06           ` Juergen Gross
@ 2022-03-07 10:38             ` Peter Zijlstra
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2022-03-07 10:38 UTC (permalink / raw)
  To: Juergen Gross
  Cc: kvm, Alexander Shishkin, Dave Hansen, virtualization,
	H Peter Anvin, Jiri Olsa, sthemmin, x86, pv-drivers, Ingo Molnar,
	Suzuki K Poulose, Arnaldo Carvalho de Melo, Andrew.Cooper3,
	Borislav Petkov, Thomas Gleixner, Mathieu Poirier, seanjc,
	Adrian Hunter, linux-kernel, Leo Yan, pbonzini


On Mon, Mar 07, 2022 at 11:06:46AM +0100, Juergen Gross wrote:

> > diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> > index 4420499f7bb4..a1f179ed39bf 100644
> > --- a/arch/x86/kernel/paravirt.c
> > +++ b/arch/x86/kernel/paravirt.c
> > @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
> >   void paravirt_set_sched_clock(u64 (*func)(void))
> >   {
> > +	/*
> > +	 * Anything with ART on promises to have sane TSC, otherwise the whole
> > +	 * ART thing is useless. In order to make ART useful for guests, we
> > +	 * should continue to use the TSC. As such, ignore any paravirt
> > +	 * muckery.
> > +	 */
> > +	if (cpu_feature_enabled(X86_FEATURE_ART))
> > +		return;
> > +
> >   	static_call_update(pv_sched_clock, func);
> >   }
> > 
> 
> NAK, this will break live migration of a guest coming from a host
> without this feature.

I thought the whole live-migration nonsense made sure to equalize crud
like that. That is, then don't expose ART to the guest.




* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-03-07 10:38             ` Peter Zijlstra
@ 2022-03-07 10:58               ` Juergen Gross
  -1 siblings, 0 replies; 46+ messages in thread
From: Juergen Gross via Virtualization @ 2022-03-07 10:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kvm, Alexander Shishkin, Dave Hansen, virtualization,
	H Peter Anvin, Jiri Olsa, sthemmin, x86, pv-drivers, Ingo Molnar,
	Suzuki K Poulose, Arnaldo Carvalho de Melo, Andrew.Cooper3,
	Borislav Petkov, Thomas Gleixner, Mathieu Poirier, seanjc,
	Adrian Hunter, linux-kernel, Leo Yan, pbonzini


On 07.03.22 11:38, Peter Zijlstra wrote:
> On Mon, Mar 07, 2022 at 11:06:46AM +0100, Juergen Gross wrote:
> 
>>> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
>>> index 4420499f7bb4..a1f179ed39bf 100644
>>> --- a/arch/x86/kernel/paravirt.c
>>> +++ b/arch/x86/kernel/paravirt.c
>>> @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>>>    void paravirt_set_sched_clock(u64 (*func)(void))
>>>    {
>>> +	/*
>>> +	 * Anything with ART on promises to have sane TSC, otherwise the whole
>>> +	 * ART thing is useless. In order to make ART useful for guests, we
>>> +	 * should continue to use the TSC. As such, ignore any paravirt
>>> +	 * muckery.
>>> +	 */
>>> +	if (cpu_feature_enabled(X86_FEATURE_ART))
>>> +		return;
>>> +
>>>    	static_call_update(pv_sched_clock, func);
>>>    }
>>>
>>
>> NAK, this will break live migration of a guest coming from a host
>> without this feature.
> 
> I thought the whole live-migration nonsense made sure to equalize crud
> like that. That is, then don't expose ART to the guest.

Oh, right. I managed to confuse host-side and guest-side usage.

Sorry for the noise.


Juergen



* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-03-07  9:50         ` Peter Zijlstra
  (?)
  (?)
@ 2022-03-07 12:36         ` Adrian Hunter
  2022-03-07 14:42             ` Peter Zijlstra
  -1 siblings, 1 reply; 46+ messages in thread
From: Adrian Hunter @ 2022-03-07 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan, jgross, sdeep, pv-drivers, pbonzini,
	seanjc, kys, sthemmin, virtualization, Andrew.Cooper3

On 07/03/2022 11:50, Peter Zijlstra wrote:
> On Fri, Mar 04, 2022 at 08:27:45PM +0200, Adrian Hunter wrote:
>> On 04/03/2022 15:41, Peter Zijlstra wrote:
>>> On Mon, Feb 14, 2022 at 01:09:06PM +0200, Adrian Hunter wrote:
>>>> Currently, when Intel PT is used within a VM guest, it is not possible to
>>>> make use of TSC because perf clock is subject to paravirtualization.
>>>
>>> Yeah, so how much of that still makes sense, or ever did? AFAIK the
>>> whole pv_clock thing is utter crazy. Should we not fix that instead?
>>
>> Presumably pv_clock must work with different host operating systems.
>> Similarly, KVM must work with different guest operating systems.
>> Perhaps I'm wrong, but I imagine re-engineering time virtualization
might be a pretty big deal, far exceeding the scope of these patches.
> 
> I think not; on both counts. That is, I don't think it's going to be
> hard, and even it if were, it would still be the right thing to do.
> 
> We're not going to add interface just to work around a known broken
> piece of crap just because we don't want to fix it.
> 
> So I'm thinking we should do the below and simply ignore any paravirt
> sched clock offered when there's ART on.
> 
> ---
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index 4420499f7bb4..a1f179ed39bf 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>  
>  void paravirt_set_sched_clock(u64 (*func)(void))
>  {
> +	/*
> +	 * Anything with ART on promises to have sane TSC, otherwise the whole
> +	 * ART thing is useless. In order to make ART useful for guests, we
> +	 * should continue to use the TSC. As such, ignore any paravirt
> +	 * muckery.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_ART))

This does not seem to work because X86_FEATURE_ART does not get set,
possibly because detect_art() excludes anything running on a hypervisor.

> +		return;
> +
>  	static_call_update(pv_sched_clock, func);
>  }
>  


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-03-07 12:36         ` Adrian Hunter
@ 2022-03-07 14:42             ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2022-03-07 14:42 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan, jgross, sdeep, pv-drivers, pbonzini,
	seanjc, kys, sthemmin, virtualization, Andrew.Cooper3,
	christopher.s.hall

On Mon, Mar 07, 2022 at 02:36:03PM +0200, Adrian Hunter wrote:

> > diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> > index 4420499f7bb4..a1f179ed39bf 100644
> > --- a/arch/x86/kernel/paravirt.c
> > +++ b/arch/x86/kernel/paravirt.c
> > @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
> >  
> >  void paravirt_set_sched_clock(u64 (*func)(void))
> >  {
> > +	/*
> > +	 * Anything with ART on promises to have sane TSC, otherwise the whole
> > +	 * ART thing is useless. In order to make ART useful for guests, we
> > +	 * should continue to use the TSC. As such, ignore any paravirt
> > +	 * muckery.
> > +	 */
> > +	if (cpu_feature_enabled(X86_FEATURE_ART))
> 
> Does not seem to work because the feature X86_FEATURE_ART does not seem to get set.
> Possibly because detect_art() excludes anything running on a hypervisor.

Simple enough to delete that clause I suppose. Christopher, what is
needed to make that go away? I suppose the guest needs to be aware of
the active TSC scaling parameters to make it work?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-03-07 14:42             ` Peter Zijlstra
  (?)
@ 2022-03-08 14:23             ` Adrian Hunter
  2022-03-08 21:06               ` Hall, Christopher S
  -1 siblings, 1 reply; 46+ messages in thread
From: Adrian Hunter @ 2022-03-08 14:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan, jgross, sdeep, pv-drivers, pbonzini,
	seanjc, kys, sthemmin, virtualization, Andrew.Cooper3,
	christopher.s.hall

On 7.3.2022 16.42, Peter Zijlstra wrote:
> On Mon, Mar 07, 2022 at 02:36:03PM +0200, Adrian Hunter wrote:
> 
>>> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
>>> index 4420499f7bb4..a1f179ed39bf 100644
>>> --- a/arch/x86/kernel/paravirt.c
>>> +++ b/arch/x86/kernel/paravirt.c
>>> @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>>>  
>>>  void paravirt_set_sched_clock(u64 (*func)(void))
>>>  {
>>> +	/*
>>> +	 * Anything with ART on promises to have sane TSC, otherwise the whole
>>> +	 * ART thing is useless. In order to make ART useful for guests, we
>>> +	 * should continue to use the TSC. As such, ignore any paravirt
>>> +	 * muckery.
>>> +	 */
>>> +	if (cpu_feature_enabled(X86_FEATURE_ART))
>>
>> Does not seem to work because the feature X86_FEATURE_ART does not seem to get set.
>> Possibly because detect_art() excludes anything running on a hypervisor.
> 
> Simple enough to delete that clause I suppose. Christopher, what is
> needed to make that go away? I suppose the guest needs to be aware of
> the active TSC scaling parameters to make it work ?

There is also no X86_FEATURE_NONSTOP_TSC, nor values for art_to_tsc_denominator
or art_to_tsc_numerator.  Also, from the VM's point of view, TSC will jump
forwards every VM-Exit / VM-Entry unless the hypervisor changes the offset
every VM-Entry, which KVM does not, so it still cannot be used as a stable
clocksource.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-03-08 14:23             ` Adrian Hunter
@ 2022-03-08 21:06               ` Hall, Christopher S
  2022-03-14 11:50                 ` Adrian Hunter
  0 siblings, 1 reply; 46+ messages in thread
From: Hall, Christopher S @ 2022-03-08 21:06 UTC (permalink / raw)
  To: Hunter, Adrian, Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan, jgross, sdeep, pv-drivers, pbonzini,
	seanjc, kys, sthemmin, virtualization, Andrew.Cooper3

Adrian Hunter wrote:
> On 7.3.2022 16.42, Peter Zijlstra wrote:
> > On Mon, Mar 07, 2022 at 02:36:03PM +0200, Adrian Hunter wrote:
> >
> >>> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> >>> index 4420499f7bb4..a1f179ed39bf 100644
> >>> --- a/arch/x86/kernel/paravirt.c
> >>> +++ b/arch/x86/kernel/paravirt.c
> >>> @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
> >>>
> >>>  void paravirt_set_sched_clock(u64 (*func)(void))
> >>>  {
> >>> +	/*
> >>> +	 * Anything with ART on promises to have sane TSC, otherwise the whole
> >>> +	 * ART thing is useless. In order to make ART useful for guests, we
> >>> +	 * should continue to use the TSC. As such, ignore any paravirt
> >>> +	 * muckery.
> >>> +	 */
> >>> +	if (cpu_feature_enabled(X86_FEATURE_ART))
> >>
> >> Does not seem to work because the feature X86_FEATURE_ART does not seem to get set.
> >> Possibly because detect_art() excludes anything running on a hypervisor.
> >
> > Simple enough to delete that clause I suppose. Christopher, what is
> > needed to make that go away? I suppose the guest needs to be aware of
> > the active TSC scaling parameters to make it work ?
> 
> There is also not X86_FEATURE_NONSTOP_TSC nor values for art_to_tsc_denominator
> or art_to_tsc_numerator.  Also, from the VM's point of view, TSC will jump
> forwards every VM-Exit / VM-Entry unless the hypervisor changes the offset
> every VM-Entry, which KVM does not, so it still cannot be used as a stable
> clocksource.

Translating between ART and the guest TSC can be a difficult problem and ART software
support is disabled by default in a VM.

There are two major issues translating ART to TSC in a VM:

The range of the TSC scaling field in the VMCS is much larger than the range of values
that can be represented using CPUID[15H], i.e., it is not possible to communicate this
to the VM using the current CPUID interface. Either the range of scaling would need
to be restricted, or another para-virtualized method (preferably OS/hypervisor
agnostic) for communicating the scaling factor to the guest would need to be invented.

TSC offsetting may also be a problem. The VMCS TSC offset must be discoverable by the
guest. This can be done via TSC_ADJUST MSR. The offset in the VMCS and the guest
TSC_ADJUST MSR must always be equivalent, i.e. a write to TSC_ADJUST in the guest
must be reflected in the VMCS and any changes to the offset in the VMCS must be
reflected in the TSC_ADJUST MSR. Otherwise a para-virtualized method must
be invented to communicate an arbitrary VMCS TSC offset to the guest.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-03-08 21:06               ` Hall, Christopher S
@ 2022-03-14 11:50                 ` Adrian Hunter
  2022-04-25  5:30                   ` Adrian Hunter
  0 siblings, 1 reply; 46+ messages in thread
From: Adrian Hunter @ 2022-03-14 11:50 UTC (permalink / raw)
  To: Hall, Christopher S, Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan, jgross, sdeep, pv-drivers, pbonzini,
	seanjc, kys, sthemmin, virtualization, Andrew.Cooper3

On 08/03/2022 23:06, Hall, Christopher S wrote:
> Adrian Hunter wrote:
>> On 7.3.2022 16.42, Peter Zijlstra wrote:
>>> On Mon, Mar 07, 2022 at 02:36:03PM +0200, Adrian Hunter wrote:
>>>
>>>>> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
>>>>> index 4420499f7bb4..a1f179ed39bf 100644
>>>>> --- a/arch/x86/kernel/paravirt.c
>>>>> +++ b/arch/x86/kernel/paravirt.c
>>>>> @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>>>>>
>>>>>  void paravirt_set_sched_clock(u64 (*func)(void))
>>>>>  {
>>>>> +	/*
>>>>> +	 * Anything with ART on promises to have sane TSC, otherwise the whole
>>>>> +	 * ART thing is useless. In order to make ART useful for guests, we
>>>>> +	 * should continue to use the TSC. As such, ignore any paravirt
>>>>> +	 * muckery.
>>>>> +	 */
>>>>> +	if (cpu_feature_enabled(X86_FEATURE_ART))
>>>>
>>>> Does not seem to work because the feature X86_FEATURE_ART does not seem to get set.
>>>> Possibly because detect_art() excludes anything running on a hypervisor.
>>>
>>> Simple enough to delete that clause I suppose. Christopher, what is
>>> needed to make that go away? I suppose the guest needs to be aware of
>>> the active TSC scaling parameters to make it work ?
>>
>> There is also not X86_FEATURE_NONSTOP_TSC nor values for art_to_tsc_denominator
>> or art_to_tsc_numerator.  Also, from the VM's point of view, TSC will jump
>> forwards every VM-Exit / VM-Entry unless the hypervisor changes the offset
>> every VM-Entry, which KVM does not, so it still cannot be used as a stable
>> clocksource.
> 
> Translating between ART and the guest TSC can be a difficult problem and ART software
> support is disabled by default in a VM.
> 
> There are two major issues translating ART to TSC in a VM:
> 
> The range of the TSC scaling field in the VMCS is much larger than the range of values
> that can be represented using CPUID[15H], i.e., it is not possible to communicate this
> to the VM using the current CPUID interface. The range of scaling would need to be
> restricted or another para-virtualized method - preferably OS/hypervisor agnostic - to
> communicate the scaling factor to the guest needs to be invented.
> 
> TSC offsetting may also be a problem. The VMCS TSC offset must be discoverable by the
> guest. This can be done via TSC_ADJUST MSR. The offset in the VMCS and the guest
> TSC_ADJUST MSR must always be equivalent, i.e. a write to TSC_ADJUST in the guest
> must be reflected in the VMCS and any changes to the offset in the VMCS must be
> reflected in the TSC_ADJUST MSR. Otherwise a para-virtualized method must
> be invented to communicate an arbitrary VMCS TSC offset to the guest.
> 

In my view it is reasonable for perf to support TSC as a perf clock in any case
because:
	a) it allows users to work entirely with TSC if they wish
	b) other kernel performance / debug facilities like ftrace already support TSC
	c) the patches to add TSC support are relatively small and straight-forward

May we have support for TSC as a perf event clock?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-03-14 11:50                 ` Adrian Hunter
@ 2022-04-25  5:30                   ` Adrian Hunter
  2022-04-25  9:32                       ` Thomas Gleixner
  0 siblings, 1 reply; 46+ messages in thread
From: Adrian Hunter @ 2022-04-25  5:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, kvm, H Peter Anvin, Mathieu Poirier,
	Suzuki K Poulose, Leo Yan, jgross, sdeep, pv-drivers, pbonzini,
	seanjc, kys, sthemmin, virtualization, Andrew.Cooper3, Hall,
	Christopher S

On 14/03/22 13:50, Adrian Hunter wrote:
> On 08/03/2022 23:06, Hall, Christopher S wrote:
>> Adrian Hunter wrote:
>>> On 7.3.2022 16.42, Peter Zijlstra wrote:
>>>> On Mon, Mar 07, 2022 at 02:36:03PM +0200, Adrian Hunter wrote:
>>>>
>>>>>> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
>>>>>> index 4420499f7bb4..a1f179ed39bf 100644
>>>>>> --- a/arch/x86/kernel/paravirt.c
>>>>>> +++ b/arch/x86/kernel/paravirt.c
>>>>>> @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>>>>>>
>>>>>>  void paravirt_set_sched_clock(u64 (*func)(void))
>>>>>>  {
>>>>>> +	/*
>>>>>> +	 * Anything with ART on promises to have sane TSC, otherwise the whole
>>>>>> +	 * ART thing is useless. In order to make ART useful for guests, we
>>>>>> +	 * should continue to use the TSC. As such, ignore any paravirt
>>>>>> +	 * muckery.
>>>>>> +	 */
>>>>>> +	if (cpu_feature_enabled(X86_FEATURE_ART))
>>>>>
>>>>> Does not seem to work because the feature X86_FEATURE_ART does not seem to get set.
>>>>> Possibly because detect_art() excludes anything running on a hypervisor.
>>>>
>>>> Simple enough to delete that clause I suppose. Christopher, what is
>>>> needed to make that go away? I suppose the guest needs to be aware of
>>>> the active TSC scaling parameters to make it work ?
>>>
>>> There is also not X86_FEATURE_NONSTOP_TSC nor values for art_to_tsc_denominator
>>> or art_to_tsc_numerator.  Also, from the VM's point of view, TSC will jump
>>> forwards every VM-Exit / VM-Entry unless the hypervisor changes the offset
>>> every VM-Entry, which KVM does not, so it still cannot be used as a stable
>>> clocksource.
>>
>> Translating between ART and the guest TSC can be a difficult problem and ART software
>> support is disabled by default in a VM.
>>
>> There are two major issues translating ART to TSC in a VM:
>>
>> The range of the TSC scaling field in the VMCS is much larger than the range of values
>> that can be represented using CPUID[15H], i.e., it is not possible to communicate this
>> to the VM using the current CPUID interface. The range of scaling would need to be
>> restricted or another para-virtualized method - preferably OS/hypervisor agnostic - to
>> communicate the scaling factor to the guest needs to be invented.
>>
>> TSC offsetting may also be a problem. The VMCS TSC offset must be discoverable by the
>> guest. This can be done via TSC_ADJUST MSR. The offset in the VMCS and the guest
>> TSC_ADJUST MSR must always be equivalent, i.e. a write to TSC_ADJUST in the guest
>> must be reflected in the VMCS and any changes to the offset in the VMCS must be
>> reflected in the TSC_ADJUST MSR. Otherwise a para-virtualized method must
>> be invented to communicate an arbitrary VMCS TSC offset to the guest.
>>
> 
> In my view it is reasonable for perf to support TSC as a perf clock in any case
> because:
> 	a) it allows users to work entirely with TSC if they wish
> 	b) other kernel performance / debug facilities like ftrace already support TSC
> 	c) the patches to add TSC support are relatively small and straight-forward
> 
> May we have support for TSC as a perf event clock?

Any update on this?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-04-25  5:30                   ` Adrian Hunter
@ 2022-04-25  9:32                       ` Thomas Gleixner
  0 siblings, 0 replies; 46+ messages in thread
From: Thomas Gleixner @ 2022-04-25  9:32 UTC (permalink / raw)
  To: Adrian Hunter, Peter Zijlstra
  Cc: kvm, Alexander Shishkin, Dave Hansen, virtualization,
	H Peter Anvin, Jiri Olsa, Hall, Christopher S, sthemmin, x86,
	pv-drivers, Ingo Molnar, Suzuki K Poulose, Leo Yan,
	Arnaldo Carvalho de Melo, Borislav Petkov, jgross,
	Mathieu Poirier, seanjc, linux-kernel, Andrew.Cooper3, pbonzini

On Mon, Apr 25 2022 at 08:30, Adrian Hunter wrote:
> On 14/03/22 13:50, Adrian Hunter wrote:
>>> TSC offsetting may also be a problem. The VMCS TSC offset must be discoverable by the
>>> guest. This can be done via TSC_ADJUST MSR. The offset in the VMCS and the guest
>>> TSC_ADJUST MSR must always be equivalent, i.e. a write to TSC_ADJUST in the guest
>>> must be reflected in the VMCS and any changes to the offset in the VMCS must be
>>> reflected in the TSC_ADJUST MSR. Otherwise a para-virtualized method must
>>> be invented to communicate an arbitrary VMCS TSC offset to the guest.
>>>
>> 
>> In my view it is reasonable for perf to support TSC as a perf clock in any case
>> because:
>> 	a) it allows users to work entirely with TSC if they wish
>> 	b) other kernel performance / debug facilities like ftrace already support TSC
>> 	c) the patches to add TSC support are relatively small and straight-forward
>> 
>> May we have support for TSC as a perf event clock?
>
> Any update on this?

If TSC is reliable on the host, then there is absolutely no reason not
to use it in the guest all over the place. And that is independent of
exposing ART to the guest.

So why do we need extra solutions for PT and perf, ftrace and whatever?

Can we just fix the underlying problem and make the hypervisor tell the
guest that TSC is stable, reliable and good to use?

Then everything else just falls into place and using TSC is a
substantial performance gain in general. Just look at the VDSO
implementation of __arch_get_hw_counter() -> vread_pvclock():

Instead of just reading the TSC, this needs to take a nested seqcount,
read TSC and do yet another mult/shift, which makes clock_gettime() ~20%
slower than necessary.

It's hilarious that we still cling to this pvclock abomination, while
we happily expose TSC deadline timer to the guest. TSC virt scaling was
implemented in hardware for a reason.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-04-25  9:32                       ` Thomas Gleixner
  (?)
@ 2022-04-25 13:15                       ` Adrian Hunter
  2022-04-25 17:05                           ` Thomas Gleixner
  -1 siblings, 1 reply; 46+ messages in thread
From: Adrian Hunter @ 2022-04-25 13:15 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	kvm, H Peter Anvin, Mathieu Poirier, Suzuki K Poulose, Leo Yan,
	jgross, sdeep, pv-drivers, pbonzini, seanjc, kys, sthemmin,
	virtualization, Andrew.Cooper3, Hall, Christopher S

On 25/04/22 12:32, Thomas Gleixner wrote:
> On Mon, Apr 25 2022 at 08:30, Adrian Hunter wrote:
>> On 14/03/22 13:50, Adrian Hunter wrote:
>>>> TSC offsetting may also be a problem. The VMCS TSC offset must be discoverable by the
>>>> guest. This can be done via TSC_ADJUST MSR. The offset in the VMCS and the guest
>>>> TSC_ADJUST MSR must always be equivalent, i.e. a write to TSC_ADJUST in the guest
>>>> must be reflected in the VMCS and any changes to the offset in the VMCS must be
>>>> reflected in the TSC_ADJUST MSR. Otherwise a para-virtualized method must
>>>> be invented to communicate an arbitrary VMCS TSC offset to the guest.
>>>>
>>>
>>> In my view it is reasonable for perf to support TSC as a perf clock in any case
>>> because:
>>> 	a) it allows users to work entirely with TSC if they wish
>>> 	b) other kernel performance / debug facilities like ftrace already support TSC
>>> 	c) the patches to add TSC support are relatively small and straight-forward
>>>
>>> May we have support for TSC as a perf event clock?
>>
>> Any update on this?
> 
> If TSC is reliable on the host, then there is absolutely no reason not
> to use it in the guest all over the place. And that is independent of
> exposing ART to the guest.
> 
> So why do we need extra solutions for PT and perf, ftrace and whatever?
> 
> Can we just fix the underlying problem and make the hypervisor tell the
> guest that TSC is stable, reliable and good to use?
> 
> Then everything else just falls into place and using TSC is a
> substantial performance gain in general. Just look at the VDSO
> implementation of __arch_get_hw_counter() -> vread_pvclock():
> 
> Instead of just reading the TSC, this needs to take a nested seqcount,
> read TSC and do yet another mult/shift, which makes clock_gettime() ~20%
> slower than necessary.
> 
> It's hilarious that we still cling to this pvclock abomination, while
> we happily expose TSC deadline timer to the guest. TSC virt scaling was
> implemented in hardware for a reason.

So you are talking about changing VMX TSC Offset on every VM-Entry to try to hide
the time jumps when the VM is scheduled out?  Or neglect that and just let the time
jumps happen?

If changing VMX TSC Offset, how can TSC be kept consistent between each VCPU, i.e.
wouldn't that mean each VCPU has to have the same VMX TSC Offset?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-04-25 13:15                       ` Adrian Hunter
@ 2022-04-25 17:05                           ` Thomas Gleixner
  0 siblings, 0 replies; 46+ messages in thread
From: Thomas Gleixner @ 2022-04-25 17:05 UTC (permalink / raw)
  To: Adrian Hunter, Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	kvm, H Peter Anvin, Mathieu Poirier, Suzuki K Poulose, Leo Yan,
	jgross, sdeep, pv-drivers, pbonzini, seanjc, kys, sthemmin,
	virtualization, Andrew.Cooper3, Hall, Christopher S

On Mon, Apr 25 2022 at 16:15, Adrian Hunter wrote:
> On 25/04/22 12:32, Thomas Gleixner wrote:
>> It's hilarious that we still cling to this pvclock abomination, while
>> we happily expose TSC deadline timer to the guest. TSC virt scaling was
>> implemented in hardware for a reason.
>
> So you are talking about changing VMX TSC Offset on every VM-Entry to try to hide
> the time jumps when the VM is scheduled out?  Or neglect that and just let the time
> jumps happen?
>
> If changing VMX TSC Offset, how can TSC be kept consistent between each VCPU, i.e.
> wouldn't that mean each VCPU has to have the same VMX TSC Offset?

Obviously so. That's the only thing which makes sense, no?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-04-25 17:05                           ` Thomas Gleixner
  (?)
@ 2022-04-26  6:51                           ` Adrian Hunter
  2022-04-27 23:10                               ` Thomas Gleixner
  -1 siblings, 1 reply; 46+ messages in thread
From: Adrian Hunter @ 2022-04-26  6:51 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	kvm, H Peter Anvin, Mathieu Poirier, Suzuki K Poulose, Leo Yan,
	jgross, sdeep, pv-drivers, pbonzini, seanjc, kys, sthemmin,
	virtualization, Andrew.Cooper3, Hall, Christopher S

On 25/04/22 20:05, Thomas Gleixner wrote:
> On Mon, Apr 25 2022 at 16:15, Adrian Hunter wrote:
>> On 25/04/22 12:32, Thomas Gleixner wrote:
>>> It's hilarious that we still cling to this pvclock abomination, while
>>> we happily expose TSC deadline timer to the guest. TSC virt scaling was
>>> implemented in hardware for a reason.
>>
>> So you are talking about changing VMX TSC Offset on every VM-Entry to try to hide
>> the time jumps when the VM is scheduled out?  Or neglect that and just let the time
>> jumps happen?
>>
>> If changing VMX TSC Offset, how can TSC be kept consistent between each VCPU i.e.
>> wouldn't that mean each VCPU has to have the same VMX TSC Offset?
> 
> Obviously so. That's the only thing which makes sense, no?

[ Sending this again, because I notice I messed up the email "From" ]

But wouldn't that mean changing all the VCPUs' VMX TSC Offsets at the same time,
which means when none are currently executing?  How could that be done?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-04-26  6:51                           ` Adrian Hunter
@ 2022-04-27 23:10                               ` Thomas Gleixner
  0 siblings, 0 replies; 46+ messages in thread
From: Thomas Gleixner @ 2022-04-27 23:10 UTC (permalink / raw)
  To: Adrian Hunter, Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	kvm, H Peter Anvin, Mathieu Poirier, Suzuki K Poulose, Leo Yan,
	jgross, sdeep, pv-drivers, pbonzini, seanjc, kys, sthemmin,
	virtualization, Andrew.Cooper3, Hall, Christopher S

On Tue, Apr 26 2022 at 09:51, Adrian Hunter wrote:
> On 25/04/22 20:05, Thomas Gleixner wrote:
>> On Mon, Apr 25 2022 at 16:15, Adrian Hunter wrote:
>>> On 25/04/22 12:32, Thomas Gleixner wrote:
>>>> It's hilarious that we still cling to this pvclock abomination, while
>>>> we happily expose TSC deadline timer to the guest. TSC virt scaling was
>>>> implemented in hardware for a reason.
>>>
>>> So you are talking about changing VMX TSC Offset on every VM-Entry to try to hide
>>> the time jumps when the VM is scheduled out?  Or neglect that and just let the time
>>> jumps happen?
>>>
>>> If changing VMX TSC Offset, how can TSC be kept consistent between each VCPU i.e.
>>> wouldn't that mean each VCPU has to have the same VMX TSC Offset?
>> 
>> Obviously so. That's the only thing which makes sense, no?
>
> [ Sending this again, because I notice I messed up the email "From" ]
>
> But wouldn't that mean changing all the VCPUs' VMX TSC Offsets at the same time,
> which means when none are currently executing?  How could that be done?

Why would you change TSC offset after the point where a VM is started
and why would it be different per vCPU?

Time is global and time moves on when a vCPU is scheduled out. Anything
else is bonkers, really. If the hypervisor tries to screw with that then
how does the guest do timekeeping in a consistent way?

    CLOCK_REALTIME = CLOCK_MONOTONIC + offset

That offset changes when something sets the clock, e.g. clock_settime(),
settimeofday() or adjtimex() in case that NTP cannot compensate or for
the beloved leap seconds adjustment. At any other time the offset is
constant.

CLOCK_MONOTONIC is derived from the underlying clocksource which is
expected to increment with constant frequency and that has to be
consistent across _all_ vCPUs of a particular VM.

So how would a hypervisor 'hide' scheduled out time w/o screwing up
timekeeping completely?

The guest TSC which is based on the host TSC is:

    guestTSC = offset + hostTSC * factor;

If you make offset different between guest vCPUs then timekeeping in the
guest is screwed.

The whole point of that paravirt clock was to handle migration between
hosts which did not have the VMCS TSC scaling/offset mechanism. The CPUs
which did not have that went EOL at least 10 years ago.

So what are you concerned about?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds as a perf event clock
  2022-04-27 23:10                               ` Thomas Gleixner
  (?)
@ 2022-05-16  7:20                               ` Adrian Hunter
  -1 siblings, 0 replies; 46+ messages in thread
From: Adrian Hunter @ 2022-05-16  7:20 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra
  Cc: Alexander Shishkin, Arnaldo Carvalho de Melo, Jiri Olsa,
	linux-kernel, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	kvm, H Peter Anvin, Mathieu Poirier, Suzuki K Poulose, Leo Yan,
	jgross, sdeep, pv-drivers, pbonzini, seanjc, kys, sthemmin,
	virtualization, Andrew.Cooper3, Hall, Christopher S

On 28/04/22 02:10, Thomas Gleixner wrote:
> On Tue, Apr 26 2022 at 09:51, Adrian Hunter wrote:
>> On 25/04/22 20:05, Thomas Gleixner wrote:
>>> On Mon, Apr 25 2022 at 16:15, Adrian Hunter wrote:
>>>> On 25/04/22 12:32, Thomas Gleixner wrote:
>>>>> It's hilarious that we still cling to this pvclock abomination, while
>>>>> we happily expose TSC deadline timer to the guest. TSC virt scaling was
>>>>> implemented in hardware for a reason.
>>>>
>>>> So you are talking about changing VMX TSC Offset on every VM-Entry to try to hide
>>>> the time jumps when the VM is scheduled out?  Or neglect that and just let the time
>>>> jumps happen?
>>>>
>>>> If changing VMX TSC Offset, how can TSC be kept consistent between each VCPU i.e.
>>>> wouldn't that mean each VCPU has to have the same VMX TSC Offset?
>>>
>>> Obviously so. That's the only thing which makes sense, no?
>>
>> [ Sending this again, because I notice I messed up the email "From" ]
>>
>> But wouldn't that mean changing all the VCPUs' VMX TSC Offsets at the same time,
>> which means when none are currently executing?  How could that be done?
> 
> Why would you change TSC offset after the point where a VM is started
> and why would it be different per vCPU?
> 
> Time is global and time moves on when a vCPU is scheduled out. Anything
> else is bonkers, really. If the hypervisor tries to screw with that then
> how does the guest do timekeeping in a consistent way?
> 
>     CLOCK_REALTIME = CLOCK_MONOTONIC + offset
> 
> That offset changes when something sets the clock, e.g. clock_settime(),
> settimeofday() or adjtimex() in case that NTP cannot compensate or for
> the beloved leap seconds adjustment. At any other time the offset is
> constant.
> 
> CLOCK_MONOTONIC is derived from the underlying clocksource which is
> expected to increment with constant frequency and that has to be
> consistent across _all_ vCPUs of a particular VM.
> 
> So how would a hypervisor 'hide' scheduled out time w/o screwing up
> timekeeping completely?
> 
> The guest TSC which is based on the host TSC is:
> 
>     guestTSC = offset + hostTSC * factor;
> 
> If you make offset different between guest vCPUs then timekeeping in the
> guest is screwed.
> 
> The whole point of that paravirt clock was to handle migration between
> hosts which did not have the VMCS TSC scaling/offset mechanism. The CPUs
> which did not have that went EOL at least 10 years ago.
> 
> So what are you concerned about?

Thanks for the explanation.

Changing the TSC offset / scaling would make it much harder for Intel PT on
the host to use, so there is no sense in my pushing for that at this
time, given there is anyway the kernel option no-kvmclock.


^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2022-05-16  7:21 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-14 11:09 [PATCH V2 00/11] perf intel-pt: Add perf event clocks to better support VM tracing Adrian Hunter
2022-02-14 11:09 ` [PATCH V2 01/11] perf/x86: Fix native_perf_sched_clock_from_tsc() with __sched_clock_offset Adrian Hunter
2022-02-14 11:09 ` [PATCH V2 02/11] perf/x86: Add support for TSC as a perf event clock Adrian Hunter
2022-03-04 12:30   ` Peter Zijlstra
2022-03-04 13:03     ` Adrian Hunter
2022-03-04 12:32   ` Peter Zijlstra
2022-03-04 17:51     ` Thomas Gleixner
2022-03-04 12:33   ` Peter Zijlstra
2022-03-04 12:41     ` Adrian Hunter
2022-02-14 11:09 ` [PATCH V2 03/11] perf/x86: Add support for TSC in nanoseconds " Adrian Hunter
2022-03-04 13:41   ` Peter Zijlstra
2022-03-04 18:27     ` Adrian Hunter
2022-03-07  9:50       ` Peter Zijlstra
2022-03-07 10:06         ` Juergen Gross via Virtualization
2022-03-07 10:38           ` Peter Zijlstra
2022-03-07 10:58             ` Juergen Gross via Virtualization
2022-03-07 12:36         ` Adrian Hunter
2022-03-07 14:42           ` Peter Zijlstra
2022-03-08 14:23             ` Adrian Hunter
2022-03-08 21:06               ` Hall, Christopher S
2022-03-14 11:50                 ` Adrian Hunter
2022-04-25  5:30                   ` Adrian Hunter
2022-04-25  9:32                     ` Thomas Gleixner
2022-04-25 13:15                       ` Adrian Hunter
2022-04-25 17:05                         ` Thomas Gleixner
2022-04-26  6:51                           ` Adrian Hunter
2022-04-27 23:10                             ` Thomas Gleixner
2022-05-16  7:20                               ` Adrian Hunter
2022-02-14 11:09 ` [PATCH V2 04/11] perf tools: Add new perf clock IDs Adrian Hunter
2022-02-14 11:09 ` [PATCH V2 05/11] perf tools: Add API probes for new " Adrian Hunter
2022-02-14 11:09 ` [PATCH V2 06/11] perf tools: Add new clock IDs to "perf time to TSC" test Adrian Hunter
2022-02-14 11:09 ` [PATCH V2 07/11] perf tools: Add perf_read_tsc_conv_for_clockid() Adrian Hunter
2022-02-14 11:09 ` [PATCH V2 08/11] perf intel-pt: Add support for new clock IDs Adrian Hunter
2022-02-14 11:09 ` [PATCH V2 09/11] perf intel-pt: Use CLOCK_PERF_HW_CLOCK_NS by default Adrian Hunter
2022-02-14 11:09 ` [PATCH V2 10/11] perf intel-pt: Add config variables for timing parameters Adrian Hunter
2022-02-14 11:09 ` [PATCH V2 11/11] perf intel-pt: Add documentation for new clock IDs Adrian Hunter
2022-02-21  6:54 ` [PATCH V2 00/11] perf intel-pt: Add perf event clocks to better support VM tracing Adrian Hunter
2022-03-01 11:06   ` Adrian Hunter
