* [PATCH 0/2] x86/BPF: Add new BPF helper call bpf_rdtsc
@ 2023-07-03 10:57 Tero Kristo
  2023-07-03 10:57 ` [PATCH 1/2] x86/tsc: " Tero Kristo
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Tero Kristo @ 2023-07-03 10:57 UTC (permalink / raw)
  To: shuah, tglx, x86, bp, dave.hansen, mingo
  Cc: ast, linux-kselftest, linux-kernel, andrii, daniel, bpf

Hello,

This patch series adds a new x86 arch-specific BPF helper, bpf_rdtsc(),
which can be used for reading the hardware time stamp counter (TSC).
Currently the same counter is directly accessible from userspace
(using the RDTSC instruction) and from kernel space (using the various
rdtsc_*() APIs); however, eBPF lacks equivalent support.

The main usage for the TSC counter is various profiling and timing
purposes where accurate cycle counter values are needed. The counter can
currently be read from BPF programs by using the existing perf subsystem
services (bpf_perf_event_read()), however its usage is cumbersome at
best. Additionally, the perf subsystem provides only a relative value
for the counter, while absolute values are desired by some use cases
like Wult [1]. The absolute value of the TSC can currently be read from
BPF programs via some kprobe / bpf_core_read() magic (see [2], [3], [4]
for examples), but this relies on accessing kernel internals, is not a
stable API, and is pretty cumbersome. Thus, this series proposes a new
x86 arch-specific BPF helper to avoid the above issues.
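
For reference, the perf-based approach on the BPF side looks roughly
like the sketch below (map sizing, attach point and names are
illustrative only; userspace must additionally open a cycles perf event
per CPU and store the FDs into the map):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(max_entries, 128);		/* >= number of CPUs */
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(__u32));
} cycles SEC(".maps");

SEC("kprobe/try_to_wake_up")
int measure(void *ctx)
{
	/* Value is relative to when the perf event was opened */
	__u64 cyc = bpf_perf_event_read(&cycles, BPF_F_CURRENT_CPU);

	bpf_printk("cycles: %llu", cyc);
	return 0;
}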

-Tero

[1] https://github.com/intel/wult
[2] https://github.com/intel/wult/blob/c92237c95b898498faf41e6644983102d1fe5156/helpers/wult-tdt-helper/tdt-bpf.c#L102
[3] https://github.com/intel/wult/blob/c92237c95b898498faf41e6644983102d1fe5156/helpers/wult-tdt-helper/tdt-bpf.c#L133
[4] https://github.com/intel/wult/blob/c92237c95b898498faf41e6644983102d1fe5156/helpers/wult-tdt-helper/tdt-bpf.c#L488




* [PATCH 1/2] x86/tsc: Add new BPF helper call bpf_rdtsc
  2023-07-03 10:57 [PATCH 0/2] x86/BPF: Add new BPF helper call bpf_rdtsc Tero Kristo
@ 2023-07-03 10:57 ` Tero Kristo
  2023-07-04  4:49   ` Yonghong Song
  2023-07-06  3:02   ` Alexei Starovoitov
  2023-07-03 10:57 ` [PATCH 2/2] selftests/bpf: Add test for bpf_rdtsc Tero Kristo
  2023-07-03 21:55 ` [PATCH 0/2] x86/BPF: Add new BPF helper call bpf_rdtsc John Fastabend
  2 siblings, 2 replies; 17+ messages in thread
From: Tero Kristo @ 2023-07-03 10:57 UTC (permalink / raw)
  To: shuah, tglx, x86, bp, dave.hansen, mingo
  Cc: ast, linux-kselftest, linux-kernel, andrii, daniel, bpf

Currently the raw TSC counter can be read within the kernel via
rdtsc_ordered() and friends, and additionally even userspace has access
to it via the RDTSC assembly instruction. BPF programs, on the other
hand, don't have direct access to the TSC counter, but must instead go
through the perf subsystem (bpf_perf_event_read), which only provides a
value relative to the start point of the program, and is also much
slower than a direct read. Add a new BPF helper definition for
bpf_rdtsc() which can be used for any accurate profiling needs.

A use case for the new API is for example wakeup latency tracing via
eBPF on Intel architectures, where it is extremely beneficial to be able
to get raw TSC timestamps and compare them directly to the value
programmed into the MSR_IA32_TSC_DEADLINE register. This way a direct
latency value from the hardware interrupt to the execution of the
interrupt handler can be calculated. Having the functionality within
eBPF also has the added benefits of allowing other relevant data (like
C-state residency values) to be used for filtering, and of dropping any
irrelevant data points directly in kernel context, without passing all
the data to userspace for post-processing.
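
With the new helper, the BPF side of such a measurement reduces to
something like the sketch below (the attach point and the way the
deadline value gets captured are illustrative only):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

extern u64 bpf_rdtsc(void) __ksym;

u64 tsc_deadline;	/* recorded when MSR_IA32_TSC_DEADLINE is armed */
u64 last_latency;

SEC("tp_btf/local_timer_entry")
int BPF_PROG(timer_entry, int vector)
{
	/* Raw TSC delta from programmed deadline to handler entry */
	last_latency = bpf_rdtsc() - tsc_deadline;
	return 0;
}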

Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
---
 arch/x86/include/asm/msr.h |  1 +
 arch/x86/kernel/tsc.c      | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 65ec1965cd28..3dde673cb563 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -309,6 +309,7 @@ struct msr *msrs_alloc(void);
 void msrs_free(struct msr *msrs);
 int msr_set_bit(u32 msr, u8 bit);
 int msr_clear_bit(u32 msr, u8 bit);
+u64 bpf_rdtsc(void);
 
 #ifdef CONFIG_SMP
 int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 344698852146..ded857abef81 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -15,6 +15,8 @@
 #include <linux/timex.h>
 #include <linux/static_key.h>
 #include <linux/static_call.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
 
 #include <asm/hpet.h>
 #include <asm/timer.h>
@@ -29,6 +31,7 @@
 #include <asm/intel-family.h>
 #include <asm/i8259.h>
 #include <asm/uv/uv.h>
+#include <asm/tlbflush.h>
 
 unsigned int __read_mostly cpu_khz;	/* TSC clocks / usec, not used here */
 EXPORT_SYMBOL(cpu_khz);
@@ -1551,6 +1554,24 @@ void __init tsc_early_init(void)
 	tsc_enable_sched_clock();
 }
 
+u64 bpf_rdtsc(void)
+{
+	/* Check if Time Stamp is enabled only in ring 0 */
+	if (cr4_read_shadow() & X86_CR4_TSD)
+		return 0;
+
+	return rdtsc_ordered();
+}
+
+BTF_SET8_START(tsc_bpf_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_rdtsc)
+BTF_SET8_END(tsc_bpf_kfunc_ids)
+
+static const struct btf_kfunc_id_set tsc_bpf_kfunc_set = {
+	.owner		= THIS_MODULE,
+	.set		= &tsc_bpf_kfunc_ids,
+};
+
 void __init tsc_init(void)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_TSC)) {
@@ -1594,6 +1615,8 @@ void __init tsc_init(void)
 
 	clocksource_register_khz(&clocksource_tsc_early, tsc_khz);
 	detect_art();
+
+	register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &tsc_bpf_kfunc_set);
 }
 
 #ifdef CONFIG_SMP
-- 
2.25.1



* [PATCH 2/2] selftests/bpf: Add test for bpf_rdtsc
  2023-07-03 10:57 [PATCH 0/2] x86/BPF: Add new BPF helper call bpf_rdtsc Tero Kristo
  2023-07-03 10:57 ` [PATCH 1/2] x86/tsc: " Tero Kristo
@ 2023-07-03 10:57 ` Tero Kristo
  2023-07-03 22:00   ` John Fastabend
  2023-07-03 21:55 ` [PATCH 0/2] x86/BPF: Add new BPF helper call bpf_rdtsc John Fastabend
  2 siblings, 1 reply; 17+ messages in thread
From: Tero Kristo @ 2023-07-03 10:57 UTC (permalink / raw)
  To: shuah, tglx, x86, bp, dave.hansen, mingo
  Cc: ast, linux-kselftest, linux-kernel, andrii, daniel, bpf

Add a selftest for bpf_rdtsc(), which reads the TSC (Time Stamp Counter)
on x86_64 architectures. The test reads the TSC from both userspace and
the BPF program, and verifies that the TSC values are in increasing
order as expected. The test is automatically skipped on architectures
that do not support the feature.

Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
---
 .../selftests/bpf/prog_tests/test_rdtsc.c     | 67 +++++++++++++++++++
 .../testing/selftests/bpf/progs/test_rdtsc.c  | 21 ++++++
 2 files changed, 88 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_rdtsc.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_rdtsc.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_rdtsc.c b/tools/testing/selftests/bpf/prog_tests/test_rdtsc.c
new file mode 100644
index 000000000000..2b26deb5b35a
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_rdtsc.c
@@ -0,0 +1,67 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2023 Intel Corporation */
+
+#include "test_progs.h"
+#include "test_rdtsc.skel.h"
+
+#ifdef __x86_64__
+
+static inline u64 _rdtsc(void)
+{
+	u32 low, high, ecx;
+
+	__asm__ __volatile__("rdtscp" : "=a" (low), "=d" (high), "=c" (ecx));
+	return ((u64)high << 32) | low;
+}
+
+static int rdtsc(struct test_rdtsc *skel)
+{
+	int err, prog_fd;
+	u64 user_c1, user_c2;
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+
+	err = test_rdtsc__attach(skel);
+	if (!ASSERT_OK(err, "test_rdtsc_attach"))
+		return err;
+
+	user_c1 = _rdtsc();
+
+	prog_fd = bpf_program__fd(skel->progs.test1);
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+
+	user_c2 = _rdtsc();
+
+	ASSERT_OK(err, "test_run");
+	ASSERT_EQ(topts.retval, 0, "test_run");
+
+	test_rdtsc__detach(skel);
+
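+	/* Expected ordering: user_c1 <= bpf c1 <= bpf c2 <= user_c2 */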
+	ASSERT_GE(skel->bss->c1, user_c1, "bpf c1 > user c1");
+	ASSERT_GE(user_c2, skel->bss->c2, "user c2 > bpf c2");
+	ASSERT_GE(skel->bss->c2, skel->bss->c1, "bpf c2 > bpf c1");
+	ASSERT_GE(user_c2, user_c1, "user c2 > user c1");
+
+	return 0;
+}
+#endif
+
+void test_rdtsc(void)
+{
+#ifdef __x86_64__
+	struct test_rdtsc *skel;
+	int err;
+
+	skel = test_rdtsc__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "test_rdtsc_skel_load"))
+		goto cleanup;
+	err = rdtsc(skel);
+	ASSERT_OK(err, "rdtsc");
+
+cleanup:
+	test_rdtsc__destroy(skel);
+#else
+	printf("%s:SKIP:bpf_rdtsc() kfunc not supported\n", __func__);
+	test__skip();
+#endif
+}
diff --git a/tools/testing/selftests/bpf/progs/test_rdtsc.c b/tools/testing/selftests/bpf/progs/test_rdtsc.c
new file mode 100644
index 000000000000..14776b83bd3e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_rdtsc.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2023 Intel Corporation */
+#include <linux/bpf.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+__u64 c1;
+__u64 c2;
+
+extern __u64 bpf_rdtsc(void) __ksym;
+
+SEC("fentry/bpf_fentry_test1")
+int BPF_PROG2(test1, int, a)
+{
+	c1 = bpf_rdtsc();
+	c2 = bpf_rdtsc();
+
+	return 0;
+}
-- 
2.25.1



* RE: [PATCH 0/2] x86/BPF: Add new BPF helper call bpf_rdtsc
  2023-07-03 10:57 [PATCH 0/2] x86/BPF: Add new BPF helper call bpf_rdtsc Tero Kristo
  2023-07-03 10:57 ` [PATCH 1/2] x86/tsc: " Tero Kristo
  2023-07-03 10:57 ` [PATCH 2/2] selftests/bpf: Add test for bpf_rdtsc Tero Kristo
@ 2023-07-03 21:55 ` John Fastabend
  2 siblings, 0 replies; 17+ messages in thread
From: John Fastabend @ 2023-07-03 21:55 UTC (permalink / raw)
  To: Tero Kristo, shuah, tglx, x86, bp, dave.hansen, mingo
  Cc: ast, linux-kselftest, linux-kernel, andrii, daniel, bpf

Tero Kristo wrote:
> Hello,
> 
> This patch series adds a new x86 arch specific BPF helper, bpf_rdtsc()
> which can be used for reading the hardware time stamp counter (TSC.)
> Currently the same counter is directly accessible from userspace
> (using RDTSC instruction), and kernel space using various rdtsc_*()
> APIs, however eBPF lacks the support.
> 
> The main usage for the TSC counter is for various profiling and timing
> purposes, getting accurate cycle counter values. The counter can be
> currently read from BPF programs by using the existing perf subsystem
> services (bpf_perf_event_read()), however its usage is cumbersome at
> best. Additionally, the perf subsystem provides relative value only
> for the counter, but absolute values are desired by some use cases
> like Wult [1]. The absolute value of TSC can be read with BPF programs
> currently via some kprobe / bpf_core_read() magic (see [2], [3], [4] for
> example), but this relies on accessing kernel internals and is not
> stable API, and is pretty cumbersome. Thus, this patch proposes a new
> arch x86 specific BPF helper to avoid the above issues.
> 
> -Tero
> 
> [1] https://github.com/intel/wult
> [2] https://github.com/intel/wult/blob/c92237c95b898498faf41e6644983102d1fe5156/helpers/wult-tdt-helper/tdt-bpf.c#L102
> [3] https://github.com/intel/wult/blob/c92237c95b898498faf41e6644983102d1fe5156/helpers/wult-tdt-helper/tdt-bpf.c#L133
> [4] https://github.com/intel/wult/blob/c92237c95b898498faf41e6644983102d1fe5156/helpers/wult-tdt-helper/tdt-bpf.c#L488
> 
> 
> 

Makes a lot of sense to me.

Acked-by: John Fastabend <john.fastabend@gmail.com>


* RE: [PATCH 2/2] selftests/bpf: Add test for bpf_rdtsc
  2023-07-03 10:57 ` [PATCH 2/2] selftests/bpf: Add test for bpf_rdtsc Tero Kristo
@ 2023-07-03 22:00   ` John Fastabend
  2023-07-04  8:55     ` Tero Kristo
  0 siblings, 1 reply; 17+ messages in thread
From: John Fastabend @ 2023-07-03 22:00 UTC (permalink / raw)
  To: Tero Kristo, shuah, tglx, x86, bp, dave.hansen, mingo
  Cc: ast, linux-kselftest, linux-kernel, andrii, daniel, bpf

Tero Kristo wrote:
> Add selftest for bpf_rdtsc() which reads the TSC (Time Stamp Counter) on
> x86_64 architectures. The test reads the TSC from both userspace and the
> BPF program, and verifies the TSC values are in incremental order as
> expected. The test is automatically skipped on architectures that do not
> support the feature.
> 
> Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
> ---
>  .../selftests/bpf/prog_tests/test_rdtsc.c     | 67 +++++++++++++++++++
>  .../testing/selftests/bpf/progs/test_rdtsc.c  | 21 ++++++
>  2 files changed, 88 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/test_rdtsc.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_rdtsc.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/test_rdtsc.c b/tools/testing/selftests/bpf/prog_tests/test_rdtsc.c
> new file mode 100644
> index 000000000000..2b26deb5b35a
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/test_rdtsc.c
> @@ -0,0 +1,67 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2023 Intel Corporation */
> +
> +#include "test_progs.h"
> +#include "test_rdtsc.skel.h"
> +
> +#ifdef __x86_64__
> +
> +static inline u64 _rdtsc(void)
> +{
> +	u32 low, high, ecx;
> +
> +	__asm__ __volatile__("rdtscp" : "=a" (low), "=d" (high), "=c" (ecx));

I think it's ok, but note this could fail if the user doesn't have
access to RDTSCP, and iirc that can be restricted?

> +	return ((u64)high << 32) | low;
> +}


* Re: [PATCH 1/2] x86/tsc: Add new BPF helper call bpf_rdtsc
  2023-07-03 10:57 ` [PATCH 1/2] x86/tsc: " Tero Kristo
@ 2023-07-04  4:49   ` Yonghong Song
  2023-07-06 12:00     ` Tero Kristo
  2023-07-06  3:02   ` Alexei Starovoitov
  1 sibling, 1 reply; 17+ messages in thread
From: Yonghong Song @ 2023-07-04  4:49 UTC (permalink / raw)
  To: Tero Kristo, shuah, tglx, x86, bp, dave.hansen, mingo
  Cc: ast, linux-kselftest, linux-kernel, andrii, daniel, bpf



On 7/3/23 3:57 AM, Tero Kristo wrote:
> Currently the raw TSC counter can be read within kernel via rdtsc_ordered()
> and friends, and additionally even userspace has access to it via the
> RDTSC assembly instruction. BPF programs on the other hand don't have
> direct access to the TSC counter, but alternatively must go through the
> performance subsystem (bpf_perf_event_read), which only provides relative
> value compared to the start point of the program, and is also much slower
> than the direct read. Add a new BPF helper definition for bpf_rdtsc() which
> can be used for any accurate profiling needs.
> 
> A use-case for the new API is for example wakeup latency tracing via
> eBPF on Intel architecture, where it is extremely beneficial to be able
> to get raw TSC timestamps and compare these directly to the value
> programmed to the MSR_IA32_TSC_DEADLINE register. This way a direct
> latency value from the hardware interrupt to the execution of the
> interrupt handler can be calculated. Having the functionality within
> eBPF also has added benefits of allowing to filter any other relevant
> data like C-state residency values, and also to drop any irrelevant
> data points directly in the kernel context, without passing all the
> data to userspace for post-processing.
> 
> Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
> ---
>   arch/x86/include/asm/msr.h |  1 +
>   arch/x86/kernel/tsc.c      | 23 +++++++++++++++++++++++
>   2 files changed, 24 insertions(+)
> 
> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> index 65ec1965cd28..3dde673cb563 100644
> --- a/arch/x86/include/asm/msr.h
> +++ b/arch/x86/include/asm/msr.h
> @@ -309,6 +309,7 @@ struct msr *msrs_alloc(void);
>   void msrs_free(struct msr *msrs);
>   int msr_set_bit(u32 msr, u8 bit);
>   int msr_clear_bit(u32 msr, u8 bit);
> +u64 bpf_rdtsc(void);
>   
>   #ifdef CONFIG_SMP
>   int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 344698852146..ded857abef81 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -15,6 +15,8 @@
>   #include <linux/timex.h>
>   #include <linux/static_key.h>
>   #include <linux/static_call.h>
> +#include <linux/btf.h>
> +#include <linux/btf_ids.h>
>   
>   #include <asm/hpet.h>
>   #include <asm/timer.h>
> @@ -29,6 +31,7 @@
>   #include <asm/intel-family.h>
>   #include <asm/i8259.h>
>   #include <asm/uv/uv.h>
> +#include <asm/tlbflush.h>
>   
>   unsigned int __read_mostly cpu_khz;	/* TSC clocks / usec, not used here */
>   EXPORT_SYMBOL(cpu_khz);
> @@ -1551,6 +1554,24 @@ void __init tsc_early_init(void)
>   	tsc_enable_sched_clock();
>   }
>   
> +u64 bpf_rdtsc(void)

Please see kernel/bpf/helpers.c. For kfunc definitions, we should have

__diag_push();
__diag_ignore_all("-Wmissing-prototypes",
		  "Global functions as their definitions will be in vmlinux BTF");

__bpf_kfunc u64 bpf_rdtsc(void)
{
	...
}

__diag_pop();


> +{
> +	/* Check if Time Stamp is enabled only in ring 0 */
> +	if (cr4_read_shadow() & X86_CR4_TSD)
> +		return 0;
> +
> +	return rdtsc_ordered();
> +}
> +
> +BTF_SET8_START(tsc_bpf_kfunc_ids)
> +BTF_ID_FLAGS(func, bpf_rdtsc)
> +BTF_SET8_END(tsc_bpf_kfunc_ids)
> +
> +static const struct btf_kfunc_id_set tsc_bpf_kfunc_set = {
> +	.owner		= THIS_MODULE,
> +	.set		= &tsc_bpf_kfunc_ids,
> +};
> +
>   void __init tsc_init(void)
>   {
>   	if (!cpu_feature_enabled(X86_FEATURE_TSC)) {
> @@ -1594,6 +1615,8 @@ void __init tsc_init(void)
>   
>   	clocksource_register_khz(&clocksource_tsc_early, tsc_khz);
>   	detect_art();
> +
> +	register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &tsc_bpf_kfunc_set);

register_btf_kfunc_id_set() could fail; maybe you at least want to
have a warning so bpf prog users may be aware that the kfunc
bpf_rdtsc() is not really available to bpf programs?
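
E.g. something along these lines (the exact message wording is just a
suggestion):

	int ret;

	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &tsc_bpf_kfunc_set);
	if (ret)
		pr_warn("tsc: registering bpf_rdtsc kfunc failed: %d\n", ret);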

>   }
>   
>   #ifdef CONFIG_SMP


* Re: [PATCH 2/2] selftests/bpf: Add test for bpf_rdtsc
  2023-07-03 22:00   ` John Fastabend
@ 2023-07-04  8:55     ` Tero Kristo
  2023-07-06  4:57       ` John Fastabend
  0 siblings, 1 reply; 17+ messages in thread
From: Tero Kristo @ 2023-07-04  8:55 UTC (permalink / raw)
  To: John Fastabend, shuah, tglx, x86, bp, dave.hansen, mingo
  Cc: ast, linux-kselftest, linux-kernel, andrii, daniel, bpf


On 04/07/2023 01:00, John Fastabend wrote:
> Tero Kristo wrote:
>> Add selftest for bpf_rdtsc() which reads the TSC (Time Stamp Counter) on
>> x86_64 architectures. The test reads the TSC from both userspace and the
>> BPF program, and verifies the TSC values are in incremental order as
>> expected. The test is automatically skipped on architectures that do not
>> support the feature.
>>
>> Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
>> ---
>>   .../selftests/bpf/prog_tests/test_rdtsc.c     | 67 +++++++++++++++++++
>>   .../testing/selftests/bpf/progs/test_rdtsc.c  | 21 ++++++
>>   2 files changed, 88 insertions(+)
>>   create mode 100644 tools/testing/selftests/bpf/prog_tests/test_rdtsc.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/test_rdtsc.c
>>
>> diff --git a/tools/testing/selftests/bpf/prog_tests/test_rdtsc.c b/tools/testing/selftests/bpf/prog_tests/test_rdtsc.c
>> new file mode 100644
>> index 000000000000..2b26deb5b35a
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/test_rdtsc.c
>> @@ -0,0 +1,67 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright(c) 2023 Intel Corporation */
>> +
>> +#include "test_progs.h"
>> +#include "test_rdtsc.skel.h"
>> +
>> +#ifdef __x86_64__
>> +
>> +static inline u64 _rdtsc(void)
>> +{
>> +	u32 low, high, ecx;
>> +
>> +	__asm__ __volatile__("rdtscp" : "=a" (low), "=d" (high), "=c" (ecx));
> I think its ok but note this could fail if user doesn't have
> access to rdtscp and iirc that can be restricted?

It is possible to restrict RDTSC access from userspace by enabling the
TSD bit in the CR4 register, which causes the userspace process to trap
with a general protection fault.

However, RDTSC usage appears to be built in to the C standard libraries
(probably some timer routines), and enabling CR4 TSD makes the system
nearly unusable. Things like sshd + systemd also start generating the
same general protection faults if RDTSC is blocked. Also, attempting to
run anything at all with the BPF selftest suite causes the same general
protection fault; not only the rdtsc test.

I tried this with a couple of setups, one system running a minimalistic
buildroot and another running a Fedora 37 installation, and the results
were similar.

-Tero

>
>> +	return ((u64)high << 32) | low;
>> +}


* Re: [PATCH 1/2] x86/tsc: Add new BPF helper call bpf_rdtsc
  2023-07-03 10:57 ` [PATCH 1/2] x86/tsc: " Tero Kristo
  2023-07-04  4:49   ` Yonghong Song
@ 2023-07-06  3:02   ` Alexei Starovoitov
  2023-07-06  5:16     ` John Fastabend
  1 sibling, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2023-07-06  3:02 UTC (permalink / raw)
  To: Tero Kristo
  Cc: Shuah Khan, Thomas Gleixner, X86 ML, Borislav Petkov,
	Dave Hansen, Ingo Molnar, Alexei Starovoitov,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, Andrii Nakryiko,
	Daniel Borkmann, bpf

On Mon, Jul 3, 2023 at 3:58 AM Tero Kristo <tero.kristo@linux.intel.com> wrote:
>
> Currently the raw TSC counter can be read within kernel via rdtsc_ordered()
> and friends, and additionally even userspace has access to it via the
> RDTSC assembly instruction. BPF programs on the other hand don't have
> direct access to the TSC counter, but alternatively must go through the
> performance subsystem (bpf_perf_event_read), which only provides relative
> value compared to the start point of the program, and is also much slower
> than the direct read. Add a new BPF helper definition for bpf_rdtsc() which
> can be used for any accurate profiling needs.
>
> A use-case for the new API is for example wakeup latency tracing via
> eBPF on Intel architecture, where it is extremely beneficial to be able
> to get raw TSC timestamps and compare these directly to the value
> programmed to the MSR_IA32_TSC_DEADLINE register. This way a direct
> latency value from the hardware interrupt to the execution of the
> interrupt handler can be calculated. Having the functionality within
> eBPF also has added benefits of allowing to filter any other relevant
> data like C-state residency values, and also to drop any irrelevant
> data points directly in the kernel context, without passing all the
> data to userspace for post-processing.
>
> Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
> ---
>  arch/x86/include/asm/msr.h |  1 +
>  arch/x86/kernel/tsc.c      | 23 +++++++++++++++++++++++
>  2 files changed, 24 insertions(+)
>
> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> index 65ec1965cd28..3dde673cb563 100644
> --- a/arch/x86/include/asm/msr.h
> +++ b/arch/x86/include/asm/msr.h
> @@ -309,6 +309,7 @@ struct msr *msrs_alloc(void);
>  void msrs_free(struct msr *msrs);
>  int msr_set_bit(u32 msr, u8 bit);
>  int msr_clear_bit(u32 msr, u8 bit);
> +u64 bpf_rdtsc(void);
>
>  #ifdef CONFIG_SMP
>  int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 344698852146..ded857abef81 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -15,6 +15,8 @@
>  #include <linux/timex.h>
>  #include <linux/static_key.h>
>  #include <linux/static_call.h>
> +#include <linux/btf.h>
> +#include <linux/btf_ids.h>
>
>  #include <asm/hpet.h>
>  #include <asm/timer.h>
> @@ -29,6 +31,7 @@
>  #include <asm/intel-family.h>
>  #include <asm/i8259.h>
>  #include <asm/uv/uv.h>
> +#include <asm/tlbflush.h>
>
>  unsigned int __read_mostly cpu_khz;    /* TSC clocks / usec, not used here */
>  EXPORT_SYMBOL(cpu_khz);
> @@ -1551,6 +1554,24 @@ void __init tsc_early_init(void)
>         tsc_enable_sched_clock();
>  }
>
> +u64 bpf_rdtsc(void)
> +{
> +       /* Check if Time Stamp is enabled only in ring 0 */
> +       if (cr4_read_shadow() & X86_CR4_TSD)
> +               return 0;

Why check this? It's always enabled in the kernel, no?

> +
> +       return rdtsc_ordered();

Why _ordered? Why not just rdtsc?
Especially since you want to trace latency, the extra lfence will ruin
the measurements.


* Re: [PATCH 2/2] selftests/bpf: Add test for bpf_rdtsc
  2023-07-04  8:55     ` Tero Kristo
@ 2023-07-06  4:57       ` John Fastabend
  0 siblings, 0 replies; 17+ messages in thread
From: John Fastabend @ 2023-07-06  4:57 UTC (permalink / raw)
  To: Tero Kristo, John Fastabend, shuah, tglx, x86, bp, dave.hansen, mingo
  Cc: ast, linux-kselftest, linux-kernel, andrii, daniel, bpf

Tero Kristo wrote:
> 
> On 04/07/2023 01:00, John Fastabend wrote:
> > Tero Kristo wrote:
> >> Add selftest for bpf_rdtsc() which reads the TSC (Time Stamp Counter) on
> >> x86_64 architectures. The test reads the TSC from both userspace and the
> >> BPF program, and verifies the TSC values are in incremental order as
> >> expected. The test is automatically skipped on architectures that do not
> >> support the feature.
> >>
> >> Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
> >> ---
> >>   .../selftests/bpf/prog_tests/test_rdtsc.c     | 67 +++++++++++++++++++
> >>   .../testing/selftests/bpf/progs/test_rdtsc.c  | 21 ++++++
> >>   2 files changed, 88 insertions(+)
> >>   create mode 100644 tools/testing/selftests/bpf/prog_tests/test_rdtsc.c
> >>   create mode 100644 tools/testing/selftests/bpf/progs/test_rdtsc.c
> >>
> >> diff --git a/tools/testing/selftests/bpf/prog_tests/test_rdtsc.c b/tools/testing/selftests/bpf/prog_tests/test_rdtsc.c
> >> new file mode 100644
> >> index 000000000000..2b26deb5b35a
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/bpf/prog_tests/test_rdtsc.c
> >> @@ -0,0 +1,67 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/* Copyright(c) 2023 Intel Corporation */
> >> +
> >> +#include "test_progs.h"
> >> +#include "test_rdtsc.skel.h"
> >> +
> >> +#ifdef __x86_64__
> >> +
> >> +static inline u64 _rdtsc(void)
> >> +{
> >> +	u32 low, high, ecx;
> >> +
> >> +	__asm__ __volatile__("rdtscp" : "=a" (low), "=d" (high), "=c" (ecx));
> > I think its ok but note this could fail if user doesn't have
> > access to rdtscp and iirc that can be restricted?
> 
> It is possible to restrict RDTSC access from userspace by enabling the 
> TSD bit in CR4 register, and it will cause the userspace process to trap 
> with general protection fault.
> 
> However, the usage of RDTSC appears to be built-in to C standard 
> libraries (probably some timer routines) and enabling the CR4 TSD makes 
> the system near unusable. Things like sshd + systemd also start 
> generating the same general protection faults if RDTSC is blocked. Also, 
> attempting to run anything at all with the BPF selftest suite causes the 
> same general protection fault; not only the rdtsc test.
> 
> I tried this with couple of setups, one system running a minimalistic 
> buildroot and another one running a fedora37 installation and the 
> results were similar.

Thanks. Good enough for me.

> 
> -Tero


* Re: [PATCH 1/2] x86/tsc: Add new BPF helper call bpf_rdtsc
  2023-07-06  3:02   ` Alexei Starovoitov
@ 2023-07-06  5:16     ` John Fastabend
  2023-07-06 11:59       ` Tero Kristo
  0 siblings, 1 reply; 17+ messages in thread
From: John Fastabend @ 2023-07-06  5:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Tero Kristo
  Cc: Shuah Khan, Thomas Gleixner, X86 ML, Borislav Petkov,
	Dave Hansen, Ingo Molnar, Alexei Starovoitov,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, Andrii Nakryiko,
	Daniel Borkmann, bpf

Alexei Starovoitov wrote:
> On Mon, Jul 3, 2023 at 3:58 AM Tero Kristo <tero.kristo@linux.intel.com> wrote:
> >
> > Currently the raw TSC counter can be read within kernel via rdtsc_ordered()
> > and friends, and additionally even userspace has access to it via the
> > RDTSC assembly instruction. BPF programs on the other hand don't have
> > direct access to the TSC counter, but alternatively must go through the
> > performance subsystem (bpf_perf_event_read), which only provides relative
> > value compared to the start point of the program, and is also much slower
> > than the direct read. Add a new BPF helper definition for bpf_rdtsc() which
> > can be used for any accurate profiling needs.
> >
> > A use-case for the new API is for example wakeup latency tracing via
> > eBPF on Intel architecture, where it is extremely beneficial to be able
> > to get raw TSC timestamps and compare these directly to the value
> > programmed to the MSR_IA32_TSC_DEADLINE register. This way a direct
> > latency value from the hardware interrupt to the execution of the
> > interrupt handler can be calculated. Having the functionality within
> > eBPF also has added benefits of allowing to filter any other relevant
> > data like C-state residency values, and also to drop any irrelevant
> > data points directly in the kernel context, without passing all the
> > data to userspace for post-processing.
> >
> > Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
> > ---
> >  arch/x86/include/asm/msr.h |  1 +
> >  arch/x86/kernel/tsc.c      | 23 +++++++++++++++++++++++
> >  2 files changed, 24 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> > index 65ec1965cd28..3dde673cb563 100644
> > --- a/arch/x86/include/asm/msr.h
> > +++ b/arch/x86/include/asm/msr.h
> > @@ -309,6 +309,7 @@ struct msr *msrs_alloc(void);
> >  void msrs_free(struct msr *msrs);
> >  int msr_set_bit(u32 msr, u8 bit);
> >  int msr_clear_bit(u32 msr, u8 bit);
> > +u64 bpf_rdtsc(void);
> >
> >  #ifdef CONFIG_SMP
> >  int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index 344698852146..ded857abef81 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -15,6 +15,8 @@
> >  #include <linux/timex.h>
> >  #include <linux/static_key.h>
> >  #include <linux/static_call.h>
> > +#include <linux/btf.h>
> > +#include <linux/btf_ids.h>
> >
> >  #include <asm/hpet.h>
> >  #include <asm/timer.h>
> > @@ -29,6 +31,7 @@
> >  #include <asm/intel-family.h>
> >  #include <asm/i8259.h>
> >  #include <asm/uv/uv.h>
> > +#include <asm/tlbflush.h>
> >
> >  unsigned int __read_mostly cpu_khz;    /* TSC clocks / usec, not used here */
> >  EXPORT_SYMBOL(cpu_khz);
> > @@ -1551,6 +1554,24 @@ void __init tsc_early_init(void)
> >         tsc_enable_sched_clock();
> >  }
> >
> > +u64 bpf_rdtsc(void)
> > +{
> > +       /* Check if Time Stamp is enabled only in ring 0 */
> > +       if (cr4_read_shadow() & X86_CR4_TSD)
> > +               return 0;
> 
> Why check this? It's always enabled in the kernel, no?
> 
> > +
> > +       return rdtsc_ordered();
> 
> Why _ordered? Why not just rdtsc ?
> Especially since you want to trace latency. Extra lfence will ruin
> the measurements.
> 

If we used it as a fast way to order events on multiple CPUs I
guess we need the lfence? We use ktime_get_ns() now for things
like this when we just need an ordering counter. We have also
observed time going backwards with this and have heuristics
to correct it, but it's rare.


* Re: [PATCH 1/2] x86/tsc: Add new BPF helper call bpf_rdtsc
  2023-07-06  5:16     ` John Fastabend
@ 2023-07-06 11:59       ` Tero Kristo
  2023-07-06 19:51         ` Alexei Starovoitov
  0 siblings, 1 reply; 17+ messages in thread
From: Tero Kristo @ 2023-07-06 11:59 UTC (permalink / raw)
  To: John Fastabend, Alexei Starovoitov
  Cc: Shuah Khan, Thomas Gleixner, X86 ML, Borislav Petkov,
	Dave Hansen, Ingo Molnar, Alexei Starovoitov,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, Andrii Nakryiko,
	Daniel Borkmann, bpf


On 06/07/2023 08:16, John Fastabend wrote:
> Alexei Starovoitov wrote:
>> On Mon, Jul 3, 2023 at 3:58 AM Tero Kristo <tero.kristo@linux.intel.com> wrote:
>>> Currently the raw TSC counter can be read within kernel via rdtsc_ordered()
>>> and friends, and additionally even userspace has access to it via the
>>> RDTSC assembly instruction. BPF programs on the other hand don't have
>>> direct access to the TSC counter, but alternatively must go through the
>>> performance subsystem (bpf_perf_event_read), which only provides relative
>>> value compared to the start point of the program, and is also much slower
>>> than the direct read. Add a new BPF helper definition for bpf_rdtsc() which
>>> can be used for any accurate profiling needs.
>>>
>>> A use-case for the new API is for example wakeup latency tracing via
>>> eBPF on Intel architecture, where it is extremely beneficial to be able
>>> to get raw TSC timestamps and compare these directly to the value
>>> programmed to the MSR_IA32_TSC_DEADLINE register. This way a direct
>>> latency value from the hardware interrupt to the execution of the
>>> interrupt handler can be calculated. Having the functionality within
>>> eBPF also has added benefits of allowing to filter any other relevant
>>> data like C-state residency values, and also to drop any irrelevant
>>> data points directly in the kernel context, without passing all the
>>> data to userspace for post-processing.
>>>
>>> Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
>>> ---
>>>   arch/x86/include/asm/msr.h |  1 +
>>>   arch/x86/kernel/tsc.c      | 23 +++++++++++++++++++++++
>>>   2 files changed, 24 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
>>> index 65ec1965cd28..3dde673cb563 100644
>>> --- a/arch/x86/include/asm/msr.h
>>> +++ b/arch/x86/include/asm/msr.h
>>> @@ -309,6 +309,7 @@ struct msr *msrs_alloc(void);
>>>   void msrs_free(struct msr *msrs);
>>>   int msr_set_bit(u32 msr, u8 bit);
>>>   int msr_clear_bit(u32 msr, u8 bit);
>>> +u64 bpf_rdtsc(void);
>>>
>>>   #ifdef CONFIG_SMP
>>>   int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
>>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
>>> index 344698852146..ded857abef81 100644
>>> --- a/arch/x86/kernel/tsc.c
>>> +++ b/arch/x86/kernel/tsc.c
>>> @@ -15,6 +15,8 @@
>>>   #include <linux/timex.h>
>>>   #include <linux/static_key.h>
>>>   #include <linux/static_call.h>
>>> +#include <linux/btf.h>
>>> +#include <linux/btf_ids.h>
>>>
>>>   #include <asm/hpet.h>
>>>   #include <asm/timer.h>
>>> @@ -29,6 +31,7 @@
>>>   #include <asm/intel-family.h>
>>>   #include <asm/i8259.h>
>>>   #include <asm/uv/uv.h>
>>> +#include <asm/tlbflush.h>
>>>
>>>   unsigned int __read_mostly cpu_khz;    /* TSC clocks / usec, not used here */
>>>   EXPORT_SYMBOL(cpu_khz);
>>> @@ -1551,6 +1554,24 @@ void __init tsc_early_init(void)
>>>          tsc_enable_sched_clock();
>>>   }
>>>
>>> +u64 bpf_rdtsc(void)
>>> +{
>>> +       /* Check if Time Stamp is enabled only in ring 0 */
>>> +       if (cr4_read_shadow() & X86_CR4_TSD)
>>> +               return 0;
>> Why check this? It's always enabled in the kernel, no?

It is always enabled, but there are certain syscalls that can be used to
disable TSC access for oneself: prctl(PR_SET_TSC, ...) and
seccomp(SECCOMP_SET_MODE_STRICT, ...). Not having the check in place
would in theory allow a restricted BPF program to circumvent this (if
there ever was such a thing). But yes, I do agree this part is a bit
debatable, whether it should be there at all.
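
For reference, a process can drop its own TSC access along these lines
(minimal userspace sketch):

#include <stdio.h>
#include <sys/prctl.h>

int main(void)
{
	/* Make RDTSC/RDTSCP deliver SIGSEGV for this process */
	if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV, 0, 0, 0))
		perror("prctl");

	__builtin_ia32_rdtsc();	/* traps with SIGSEGV from here on */
	return 0;
}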


>>> +
>>> +       return rdtsc_ordered();
>> Why _ordered? Why not just rdtsc ?
>> Especially since you want to trace latency. Extra lfence will ruin
>> the measurements.
>>
> If we used it as a fast way to order events on multiple CPUs I
> guess we need the lfence? We use ktime_get_ns() now for things
> like this when we just need an order counter. We have also
> observed time going backwards with this and have heuristics
> to correct it but its rare.

Yeah, I think it is better to induce some extra latency instead of 
having some weird ordering issues with the timestamps.

Also, things like ftrace use rdtsc_ordered() as their underlying
clock, if you use x86-tsc as the trace clock (see
arch/x86/kernel/trace_clock.c).

-Tero



* Re: [PATCH 1/2] x86/tsc: Add new BPF helper call bpf_rdtsc
  2023-07-04  4:49   ` Yonghong Song
@ 2023-07-06 12:00     ` Tero Kristo
  0 siblings, 0 replies; 17+ messages in thread
From: Tero Kristo @ 2023-07-06 12:00 UTC (permalink / raw)
  To: Yonghong Song, shuah, tglx, x86, bp, dave.hansen, mingo
  Cc: ast, linux-kselftest, linux-kernel, andrii, daniel, bpf


On 04/07/2023 07:49, Yonghong Song wrote:
>
>
> On 7/3/23 3:57 AM, Tero Kristo wrote:
>> Currently the raw TSC counter can be read within kernel via 
>> rdtsc_ordered()
>> and friends, and additionally even userspace has access to it via the
>> RDTSC assembly instruction. BPF programs on the other hand don't have
>> direct access to the TSC counter, but alternatively must go through the
>> performance subsystem (bpf_perf_event_read), which only provides 
>> relative
>> value compared to the start point of the program, and is also much 
>> slower
>> than the direct read. Add a new BPF helper definition for bpf_rdtsc() 
>> which
>> can be used for any accurate profiling needs.
>>
>> A use-case for the new API is for example wakeup latency tracing via
>> eBPF on Intel architecture, where it is extremely beneficial to be able
>> to get raw TSC timestamps and compare these directly to the value
>> programmed to the MSR_IA32_TSC_DEADLINE register. This way a direct
>> latency value from the hardware interrupt to the execution of the
>> interrupt handler can be calculated. Having the functionality within
>> eBPF also has added benefits of allowing to filter any other relevant
>> data like C-state residency values, and also to drop any irrelevant
>> data points directly in the kernel context, without passing all the
>> data to userspace for post-processing.
>>
>> Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
>> ---
>>   arch/x86/include/asm/msr.h |  1 +
>>   arch/x86/kernel/tsc.c      | 23 +++++++++++++++++++++++
>>   2 files changed, 24 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
>> index 65ec1965cd28..3dde673cb563 100644
>> --- a/arch/x86/include/asm/msr.h
>> +++ b/arch/x86/include/asm/msr.h
>> @@ -309,6 +309,7 @@ struct msr *msrs_alloc(void);
>>   void msrs_free(struct msr *msrs);
>>   int msr_set_bit(u32 msr, u8 bit);
>>   int msr_clear_bit(u32 msr, u8 bit);
>> +u64 bpf_rdtsc(void);
>>     #ifdef CONFIG_SMP
>>   int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
>> index 344698852146..ded857abef81 100644
>> --- a/arch/x86/kernel/tsc.c
>> +++ b/arch/x86/kernel/tsc.c
>> @@ -15,6 +15,8 @@
>>   #include <linux/timex.h>
>>   #include <linux/static_key.h>
>>   #include <linux/static_call.h>
>> +#include <linux/btf.h>
>> +#include <linux/btf_ids.h>
>>     #include <asm/hpet.h>
>>   #include <asm/timer.h>
>> @@ -29,6 +31,7 @@
>>   #include <asm/intel-family.h>
>>   #include <asm/i8259.h>
>>   #include <asm/uv/uv.h>
>> +#include <asm/tlbflush.h>
>>     unsigned int __read_mostly cpu_khz;    /* TSC clocks / usec, not 
>> used here */
>>   EXPORT_SYMBOL(cpu_khz);
>> @@ -1551,6 +1554,24 @@ void __init tsc_early_init(void)
>>       tsc_enable_sched_clock();
>>   }
>>   +u64 bpf_rdtsc(void)
>
> Please see kernel/bpf/helpers.c. For kfunc definition, we should have
>
> __diag_push();
> __diag_ignore_all("-Wmissing-prototypes",
>                   "Global functions as their definitions will be in vmlinux BTF");
>
> __bpf_kfunc u64 bpf_rdtsc(void)
> {
>     ...
> }
>
> __diag_pop();
Thanks, I'll modify this for next rev.
>
>
>> +{
>> +    /* Check if Time Stamp is enabled only in ring 0 */
>> +    if (cr4_read_shadow() & X86_CR4_TSD)
>> +        return 0;
>> +
>> +    return rdtsc_ordered();
>> +}
>> +
>> +BTF_SET8_START(tsc_bpf_kfunc_ids)
>> +BTF_ID_FLAGS(func, bpf_rdtsc)
>> +BTF_SET8_END(tsc_bpf_kfunc_ids)
>> +
>> +static const struct btf_kfunc_id_set tsc_bpf_kfunc_set = {
>> +    .owner        = THIS_MODULE,
>> +    .set        = &tsc_bpf_kfunc_ids,
>> +};
>> +
>>   void __init tsc_init(void)
>>   {
>>       if (!cpu_feature_enabled(X86_FEATURE_TSC)) {
>> @@ -1594,6 +1615,8 @@ void __init tsc_init(void)
>>         clocksource_register_khz(&clocksource_tsc_early, tsc_khz);
>>       detect_art();
>> +
>> +    register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, 
>> &tsc_bpf_kfunc_set);
>
> register_btf_kfunc_id_set() could fail, maybe you at least wants to
> have a warning so bpf prog users may be aware that kfunc bpf_rdtsc()
> not really available to bpf programs?

Yes, I'll add a warning print.

-Tero

>
>>   }
>>     #ifdef CONFIG_SMP


* Re: [PATCH 1/2] x86/tsc: Add new BPF helper call bpf_rdtsc
  2023-07-06 11:59       ` Tero Kristo
@ 2023-07-06 19:51         ` Alexei Starovoitov
  2023-07-07  5:41           ` John Fastabend
  0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2023-07-06 19:51 UTC (permalink / raw)
  To: Tero Kristo
  Cc: John Fastabend, Shuah Khan, Thomas Gleixner, X86 ML,
	Borislav Petkov, Dave Hansen, Ingo Molnar, Alexei Starovoitov,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, Andrii Nakryiko,
	Daniel Borkmann, bpf

On Thu, Jul 6, 2023 at 4:59 AM Tero Kristo <tero.kristo@linux.intel.com> wrote:
>
>
> On 06/07/2023 08:16, John Fastabend wrote:
> > Alexei Starovoitov wrote:
> >> On Mon, Jul 3, 2023 at 3:58 AM Tero Kristo <tero.kristo@linux.intel.com> wrote:
> >>> Currently the raw TSC counter can be read within kernel via rdtsc_ordered()
> >>> and friends, and additionally even userspace has access to it via the
> >>> RDTSC assembly instruction. BPF programs on the other hand don't have
> >>> direct access to the TSC counter, but alternatively must go through the
> >>> performance subsystem (bpf_perf_event_read), which only provides relative
> >>> value compared to the start point of the program, and is also much slower
> >>> than the direct read. Add a new BPF helper definition for bpf_rdtsc() which
> >>> can be used for any accurate profiling needs.
> >>>
> >>> A use-case for the new API is for example wakeup latency tracing via
> >>> eBPF on Intel architecture, where it is extremely beneficial to be able
> >>> to get raw TSC timestamps and compare these directly to the value
> >>> programmed to the MSR_IA32_TSC_DEADLINE register. This way a direct
> >>> latency value from the hardware interrupt to the execution of the
> >>> interrupt handler can be calculated. Having the functionality within
> >>> eBPF also has added benefits of allowing to filter any other relevant
> >>> data like C-state residency values, and also to drop any irrelevant
> >>> data points directly in the kernel context, without passing all the
> >>> data to userspace for post-processing.
> >>>
> >>> Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
> >>> ---
> >>>   arch/x86/include/asm/msr.h |  1 +
> >>>   arch/x86/kernel/tsc.c      | 23 +++++++++++++++++++++++
> >>>   2 files changed, 24 insertions(+)
> >>>
> >>> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> >>> index 65ec1965cd28..3dde673cb563 100644
> >>> --- a/arch/x86/include/asm/msr.h
> >>> +++ b/arch/x86/include/asm/msr.h
> >>> @@ -309,6 +309,7 @@ struct msr *msrs_alloc(void);
> >>>   void msrs_free(struct msr *msrs);
> >>>   int msr_set_bit(u32 msr, u8 bit);
> >>>   int msr_clear_bit(u32 msr, u8 bit);
> >>> +u64 bpf_rdtsc(void);
> >>>
> >>>   #ifdef CONFIG_SMP
> >>>   int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
> >>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> >>> index 344698852146..ded857abef81 100644
> >>> --- a/arch/x86/kernel/tsc.c
> >>> +++ b/arch/x86/kernel/tsc.c
> >>> @@ -15,6 +15,8 @@
> >>>   #include <linux/timex.h>
> >>>   #include <linux/static_key.h>
> >>>   #include <linux/static_call.h>
> >>> +#include <linux/btf.h>
> >>> +#include <linux/btf_ids.h>
> >>>
> >>>   #include <asm/hpet.h>
> >>>   #include <asm/timer.h>
> >>> @@ -29,6 +31,7 @@
> >>>   #include <asm/intel-family.h>
> >>>   #include <asm/i8259.h>
> >>>   #include <asm/uv/uv.h>
> >>> +#include <asm/tlbflush.h>
> >>>
> >>>   unsigned int __read_mostly cpu_khz;    /* TSC clocks / usec, not used here */
> >>>   EXPORT_SYMBOL(cpu_khz);
> >>> @@ -1551,6 +1554,24 @@ void __init tsc_early_init(void)
> >>>          tsc_enable_sched_clock();
> >>>   }
> >>>
> >>> +u64 bpf_rdtsc(void)
> >>> +{
> >>> +       /* Check if Time Stamp is enabled only in ring 0 */
> >>> +       if (cr4_read_shadow() & X86_CR4_TSD)
> >>> +               return 0;
> >> Why check this? It's always enabled in the kernel, no?
>
> It is always enabled, but there are certain syscalls that can be used to
> disable the TSC access for oneself. prctl(PR_SET_TSC, ...) and
> seccomp(SET_MODE_STRICT,...). Not having the check in place would in
> theory allow a restricted BPF program to circumvent this (if there ever
> was such a thing.) But yes, I do agree this part is a bit debatable
> whether it should be there at all.

What do you mean by 'circumvent'?
It's a tracing bpf prog running in the kernel loaded by root
and reading tsc for the purpose of the kernel.
There is no unprivileged access to tsc here.

>
> >>> +
> >>> +       return rdtsc_ordered();
> >> Why _ordered? Why not just rdtsc ?
> >> Especially since you want to trace latency. Extra lfence will ruin
> >> the measurements.
> >>
> > If we used it as a fast way to order events on multiple CPUs I
> > guess we need the lfence? We use ktime_get_ns() now for things
> > like this when we just need an order counter. We have also
> > observed time going backwards with this and have heuristics
> > to correct it but its rare.
>
> Yeah, I think it is better to induce some extra latency instead of
> having some weird ordering issues with the timestamps.

lfence is not 'some extra latency'.
I suspect rdtsc_ordered() will be slower than bpf_ktime_get_ns().
What's the point of using it then?

>
> Also, things like the ftrace also use rdtsc_ordered() as its underlying
> clock, if you use x86-tsc as the trace clock (see
> arch/x86/kernel/trace_clock.c.)
>
> -Tero
>


* Re: [PATCH 1/2] x86/tsc: Add new BPF helper call bpf_rdtsc
  2023-07-06 19:51         ` Alexei Starovoitov
@ 2023-07-07  5:41           ` John Fastabend
  2023-07-07  8:27             ` Tero Kristo
  0 siblings, 1 reply; 17+ messages in thread
From: John Fastabend @ 2023-07-07  5:41 UTC (permalink / raw)
  To: Alexei Starovoitov, Tero Kristo
  Cc: John Fastabend, Shuah Khan, Thomas Gleixner, X86 ML,
	Borislav Petkov, Dave Hansen, Ingo Molnar, Alexei Starovoitov,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, Andrii Nakryiko,
	Daniel Borkmann, bpf

Alexei Starovoitov wrote:
> On Thu, Jul 6, 2023 at 4:59 AM Tero Kristo <tero.kristo@linux.intel.com> wrote:
> >
> >
> > On 06/07/2023 08:16, John Fastabend wrote:
> > > Alexei Starovoitov wrote:
> > >> On Mon, Jul 3, 2023 at 3:58 AM Tero Kristo <tero.kristo@linux.intel.com> wrote:
> > >>> Currently the raw TSC counter can be read within kernel via rdtsc_ordered()
> > >>> and friends, and additionally even userspace has access to it via the
> > >>> RDTSC assembly instruction. BPF programs on the other hand don't have
> > >>> direct access to the TSC counter, but alternatively must go through the
> > >>> performance subsystem (bpf_perf_event_read), which only provides relative
> > >>> value compared to the start point of the program, and is also much slower
> > >>> than the direct read. Add a new BPF helper definition for bpf_rdtsc() which
> > >>> can be used for any accurate profiling needs.
> > >>>
> > >>> A use-case for the new API is for example wakeup latency tracing via
> > >>> eBPF on Intel architecture, where it is extremely beneficial to be able
> > >>> to get raw TSC timestamps and compare these directly to the value
> > >>> programmed to the MSR_IA32_TSC_DEADLINE register. This way a direct
> > >>> latency value from the hardware interrupt to the execution of the
> > >>> interrupt handler can be calculated. Having the functionality within
> > >>> eBPF also has added benefits of allowing to filter any other relevant
> > >>> data like C-state residency values, and also to drop any irrelevant
> > >>> data points directly in the kernel context, without passing all the
> > >>> data to userspace for post-processing.
> > >>>
> > >>> Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
> > >>> ---
> > >>>   arch/x86/include/asm/msr.h |  1 +
> > >>>   arch/x86/kernel/tsc.c      | 23 +++++++++++++++++++++++
> > >>>   2 files changed, 24 insertions(+)
> > >>>
> > >>> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> > >>> index 65ec1965cd28..3dde673cb563 100644
> > >>> --- a/arch/x86/include/asm/msr.h
> > >>> +++ b/arch/x86/include/asm/msr.h
> > >>> @@ -309,6 +309,7 @@ struct msr *msrs_alloc(void);
> > >>>   void msrs_free(struct msr *msrs);
> > >>>   int msr_set_bit(u32 msr, u8 bit);
> > >>>   int msr_clear_bit(u32 msr, u8 bit);
> > >>> +u64 bpf_rdtsc(void);
> > >>>
> > >>>   #ifdef CONFIG_SMP
> > >>>   int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
> > >>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > >>> index 344698852146..ded857abef81 100644
> > >>> --- a/arch/x86/kernel/tsc.c
> > >>> +++ b/arch/x86/kernel/tsc.c
> > >>> @@ -15,6 +15,8 @@
> > >>>   #include <linux/timex.h>
> > >>>   #include <linux/static_key.h>
> > >>>   #include <linux/static_call.h>
> > >>> +#include <linux/btf.h>
> > >>> +#include <linux/btf_ids.h>
> > >>>
> > >>>   #include <asm/hpet.h>
> > >>>   #include <asm/timer.h>
> > >>> @@ -29,6 +31,7 @@
> > >>>   #include <asm/intel-family.h>
> > >>>   #include <asm/i8259.h>
> > >>>   #include <asm/uv/uv.h>
> > >>> +#include <asm/tlbflush.h>
> > >>>
> > >>>   unsigned int __read_mostly cpu_khz;    /* TSC clocks / usec, not used here */
> > >>>   EXPORT_SYMBOL(cpu_khz);
> > >>> @@ -1551,6 +1554,24 @@ void __init tsc_early_init(void)
> > >>>          tsc_enable_sched_clock();
> > >>>   }
> > >>>
> > >>> +u64 bpf_rdtsc(void)
> > >>> +{
> > >>> +       /* Check if Time Stamp is enabled only in ring 0 */
> > >>> +       if (cr4_read_shadow() & X86_CR4_TSD)
> > >>> +               return 0;
> > >> Why check this? It's always enabled in the kernel, no?
> >
> > It is always enabled, but there are certain syscalls that can be used to
> > disable the TSC access for oneself. prctl(PR_SET_TSC, ...) and
> > seccomp(SET_MODE_STRICT,...). Not having the check in place would in
> > theory allow a restricted BPF program to circumvent this (if there ever
> > was such a thing.) But yes, I do agree this part is a bit debatable
> > whether it should be there at all.
> 
> What do you mean 'circumvent' ?
> It's a tracing bpf prog running in the kernel loaded by root
> and reading tsc for the purpose of the kernel.
> There is no unprivileged access to tsc here.
> 
> >
> > >>> +
> > >>> +       return rdtsc_ordered();
> > >> Why _ordered? Why not just rdtsc ?
> > >> Especially since you want to trace latency. Extra lfence will ruin
> > >> the measurements.
> > >>
> > > If we used it as a fast way to order events on multiple CPUs I
> > > guess we need the lfence? We use ktime_get_ns() now for things
> > > like this when we just need an order counter. We have also
> > > observed time going backwards with this and have heuristics
> > > to correct it but its rare.
> >
> > Yeah, I think it is better to induce some extra latency instead of
> > having some weird ordering issues with the timestamps.
> 
> lfence is not 'some extra latency'.
> I suspect rdtsc_ordered() will be slower than bpf_ktime_get_ns().
> What's the point of using it then?

I would only use it if it's faster than bpf_ktime_get_ns() and we
have already figured out how to handle rare unordered events,
so I think it's OK to relax the somewhat strict ordering.

> 
> >
> > Also, things like the ftrace also use rdtsc_ordered() as its underlying
> > clock, if you use x86-tsc as the trace clock (see
> > arch/x86/kernel/trace_clock.c.)
> >
> > -Tero
> >




* Re: [PATCH 1/2] x86/tsc: Add new BPF helper call bpf_rdtsc
  2023-07-07  5:41           ` John Fastabend
@ 2023-07-07  8:27             ` Tero Kristo
  2023-07-07 14:42               ` Alexei Starovoitov
  0 siblings, 1 reply; 17+ messages in thread
From: Tero Kristo @ 2023-07-07  8:27 UTC (permalink / raw)
  To: John Fastabend, Alexei Starovoitov
  Cc: Shuah Khan, Thomas Gleixner, X86 ML, Borislav Petkov,
	Dave Hansen, Ingo Molnar, Alexei Starovoitov,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, Andrii Nakryiko,
	Daniel Borkmann, bpf


On 07/07/2023 08:41, John Fastabend wrote:
> Alexei Starovoitov wrote:
>> On Thu, Jul 6, 2023 at 4:59 AM Tero Kristo <tero.kristo@linux.intel.com> wrote:
>>>
>>> On 06/07/2023 08:16, John Fastabend wrote:
>>>> Alexei Starovoitov wrote:
>>>>> On Mon, Jul 3, 2023 at 3:58 AM Tero Kristo <tero.kristo@linux.intel.com> wrote:
>>>>>> Currently the raw TSC counter can be read within kernel via rdtsc_ordered()
>>>>>> and friends, and additionally even userspace has access to it via the
>>>>>> RDTSC assembly instruction. BPF programs on the other hand don't have
>>>>>> direct access to the TSC counter, but alternatively must go through the
>>>>>> performance subsystem (bpf_perf_event_read), which only provides relative
>>>>>> value compared to the start point of the program, and is also much slower
>>>>>> than the direct read. Add a new BPF helper definition for bpf_rdtsc() which
>>>>>> can be used for any accurate profiling needs.
>>>>>>
>>>>>> A use-case for the new API is for example wakeup latency tracing via
>>>>>> eBPF on Intel architecture, where it is extremely beneficial to be able
>>>>>> to get raw TSC timestamps and compare these directly to the value
>>>>>> programmed to the MSR_IA32_TSC_DEADLINE register. This way a direct
>>>>>> latency value from the hardware interrupt to the execution of the
>>>>>> interrupt handler can be calculated. Having the functionality within
>>>>>> eBPF also has added benefits of allowing to filter any other relevant
>>>>>> data like C-state residency values, and also to drop any irrelevant
>>>>>> data points directly in the kernel context, without passing all the
>>>>>> data to userspace for post-processing.
>>>>>>
>>>>>> Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
>>>>>> ---
>>>>>>    arch/x86/include/asm/msr.h |  1 +
>>>>>>    arch/x86/kernel/tsc.c      | 23 +++++++++++++++++++++++
>>>>>>    2 files changed, 24 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
>>>>>> index 65ec1965cd28..3dde673cb563 100644
>>>>>> --- a/arch/x86/include/asm/msr.h
>>>>>> +++ b/arch/x86/include/asm/msr.h
>>>>>> @@ -309,6 +309,7 @@ struct msr *msrs_alloc(void);
>>>>>>    void msrs_free(struct msr *msrs);
>>>>>>    int msr_set_bit(u32 msr, u8 bit);
>>>>>>    int msr_clear_bit(u32 msr, u8 bit);
>>>>>> +u64 bpf_rdtsc(void);
>>>>>>
>>>>>>    #ifdef CONFIG_SMP
>>>>>>    int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
>>>>>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
>>>>>> index 344698852146..ded857abef81 100644
>>>>>> --- a/arch/x86/kernel/tsc.c
>>>>>> +++ b/arch/x86/kernel/tsc.c
>>>>>> @@ -15,6 +15,8 @@
>>>>>>    #include <linux/timex.h>
>>>>>>    #include <linux/static_key.h>
>>>>>>    #include <linux/static_call.h>
>>>>>> +#include <linux/btf.h>
>>>>>> +#include <linux/btf_ids.h>
>>>>>>
>>>>>>    #include <asm/hpet.h>
>>>>>>    #include <asm/timer.h>
>>>>>> @@ -29,6 +31,7 @@
>>>>>>    #include <asm/intel-family.h>
>>>>>>    #include <asm/i8259.h>
>>>>>>    #include <asm/uv/uv.h>
>>>>>> +#include <asm/tlbflush.h>
>>>>>>
>>>>>>    unsigned int __read_mostly cpu_khz;    /* TSC clocks / usec, not used here */
>>>>>>    EXPORT_SYMBOL(cpu_khz);
>>>>>> @@ -1551,6 +1554,24 @@ void __init tsc_early_init(void)
>>>>>>           tsc_enable_sched_clock();
>>>>>>    }
>>>>>>
>>>>>> +u64 bpf_rdtsc(void)
>>>>>> +{
>>>>>> +       /* Check if Time Stamp is enabled only in ring 0 */
>>>>>> +       if (cr4_read_shadow() & X86_CR4_TSD)
>>>>>> +               return 0;
>>>>> Why check this? It's always enabled in the kernel, no?
>>> It is always enabled, but there are certain syscalls that a task can
>>> use to disable its own TSC access: prctl(PR_SET_TSC, ...) and
>>> seccomp(SECCOMP_SET_MODE_STRICT, ...). Not having the check in place
>>> would in theory allow a restricted BPF program to circumvent this (if
>>> there ever were such a thing). But yes, I do agree it is debatable
>>> whether this part should be there at all.
>> What do you mean by 'circumvent'?
>> It's a tracing bpf prog running in the kernel, loaded by root,
>> and reading tsc for the purpose of the kernel.
>> There is no unprivileged access to tsc here.
This was based on some discussions with the security team at Intel; I 
don't pretend to know anything about security myself. But I can drop the 
check. It is probably not needed, because it is already possible to read 
the TSC counter with the approach I mention in the cover letter, via 
perf and bpf_core_read().
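
For reference, the userspace side of that restriction looks roughly like 
this (a minimal sketch, not part of the patch; after the prctl() call, 
RDTSC faults in userspace, while ring-0 code is unaffected by CR4.TSD):

        #include <stdio.h>
        #include <stdint.h>
        #include <sys/prctl.h>

        static inline uint64_t rdtsc(void)
        {
                uint32_t lo, hi;

                __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
                return ((uint64_t)hi << 32) | lo;
        }

        int main(void)
        {
                printf("tsc: %llu\n", (unsigned long long)rdtsc());

                /* Ask the kernel to deliver SIGSEGV on further RDTSC;
                 * this is what sets CR4.TSD for the task. */
                if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV, 0, 0, 0))
                        perror("prctl(PR_SET_TSC)");

                rdtsc(); /* now raises SIGSEGV in this task */
                return 0;
        }
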
>>
>>>>>> +
>>>>>> +       return rdtsc_ordered();
>>>>> Why _ordered? Why not just rdtsc?
>>>>> Especially since you want to trace latency. Extra lfence will ruin
>>>>> the measurements.
>>>>>
>>>> If we used it as a fast way to order events on multiple CPUs, I
>>>> guess we need the lfence? We use ktime_get_ns() now for things
>>>> like this when we just need an ordering counter. We have also
>>>> observed time going backwards with this and have heuristics
>>>> to correct it, but it's rare.
>>> Yeah, I think it is better to induce some extra latency instead of
>>> having some weird ordering issues with the timestamps.
>> lfence is not 'some extra latency'.
>> I suspect rdtsc_ordered() will be slower than bpf_ktime_get_ns().
>> What's the point of using it then?
> I would only use it if it's faster than bpf_ktime_get_ns(), and
> we have already figured out how to handle rare unordered events,
> so I think it's OK to relax the somewhat strict ordering.

I believe that on the x86 arch, using bpf_ktime_get_ns() also ends up 
calling rdtsc_ordered() under the hood.

I just did some measurements on an Intel(R) Xeon(R) Platinum 8360Y CPU @ 
2.40GHz, with a simple piece of BPF code:

         u64 t1, t2;
         int i;

         t1 = bpf_ktime_get_ns();

         for (i = 0; i < NUM_CYC; i++) {
                 bpf_rdtsc(); /* or bpf_ktime_get_ns() here */
         }

         t2 = bpf_ktime_get_ns();

The results I got with the CPU locked at 2.4GHz (average execution time 
per call within the loop, over some 10M executions):

bpf_rdtsc() ordered   : 45ns
bpf_rdtsc() unordered : 23ns
bpf_ktime_get_ns()    : 49ns

Locking the CPU at 800MHz, the results are:

bpf_rdtsc() ordered   : 55ns
bpf_rdtsc() unordered : 33ns
bpf_ktime_get_ns()    : 71ns

The bpf_rdtsc() in these results carries some extra latency caused by 
conditional execution: I added a flag to the call to select whether it 
should use the _ordered() variant or not, and it also still has the 
CR4.TSD check in place.
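
Roughly, the flagged variant looks like this (a sketch only, with an 
assumed BPF_RDTSC_ORDERED flag name; this is not the code as posted):

        __bpf_kfunc u64 bpf_rdtsc(u64 flags)
        {
                /* Honor prctl(PR_SET_TSC) / seccomp strict mode */
                if (cr4_read_shadow() & X86_CR4_TSD)
                        return 0;

                if (flags & BPF_RDTSC_ORDERED)
                        return rdtsc_ordered(); /* lfence + rdtsc */

                return rdtsc(); /* plain read, may execute early */
        }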

-Tero

>
>>> Also, ftrace uses rdtsc_ordered() as its underlying clock if you use
>>> x86-tsc as the trace clock (see arch/x86/kernel/trace_clock.c).
>>>
>>> -Tero
>>>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] x86/tsc: Add new BPF helper call bpf_rdtsc
  2023-07-07  8:27             ` Tero Kristo
@ 2023-07-07 14:42               ` Alexei Starovoitov
  2023-08-09 11:31                 ` Tero Kristo
  0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2023-07-07 14:42 UTC (permalink / raw)
  To: Tero Kristo
  Cc: John Fastabend, Shuah Khan, Thomas Gleixner, X86 ML,
	Borislav Petkov, Dave Hansen, Ingo Molnar, Alexei Starovoitov,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, Andrii Nakryiko,
	Daniel Borkmann, bpf

On Fri, Jul 7, 2023 at 1:28 AM Tero Kristo <tero.kristo@linux.intel.com> wrote:
>
>
> [...]
>
> The results I got with the CPU locked at 2.4GHz (average execution time
> per call within the loop, over some 10M executions):
>
> bpf_rdtsc() ordered   : 45ns
> bpf_rdtsc() unordered : 23ns
> bpf_ktime_get_ns()    : 49ns

Thanks for crunching the numbers.
Based on them it's hard to justify adding the ordered variant.
We already have ktime_get_ns, ktime_get_boot_ns, ktime_get_coarse_ns,
ktime_get_tai_ns with pretty close performance and different time
constraints. rdtsc_ordered doesn't bring anything new to the table.
bpf_rdtsc() would be justified if it's significantly faster
than traditional ktime*() helpers.
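
For comparison, that helper family is already trivial to use from a 
tracing program, something like the following (a libbpf-style sketch; 
the attach point is arbitrary and any would do):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>

        SEC("tracepoint/power/cpu_idle")
        int trace_idle(void *ctx)
        {
                u64 mono = bpf_ktime_get_ns();      /* CLOCK_MONOTONIC */
                u64 boot = bpf_ktime_get_boot_ns(); /* counts suspend too */
                u64 tai  = bpf_ktime_get_tai_ns();  /* CLOCK_TAI */

                bpf_printk("mono=%llu boot=%llu tai=%llu", mono, boot, tai);
                return 0;
        }

        char LICENSE[] SEC("license") = "GPL";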

> [...]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] x86/tsc: Add new BPF helper call bpf_rdtsc
  2023-07-07 14:42               ` Alexei Starovoitov
@ 2023-08-09 11:31                 ` Tero Kristo
  0 siblings, 0 replies; 17+ messages in thread
From: Tero Kristo @ 2023-08-09 11:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Shuah Khan, Thomas Gleixner, X86 ML,
	Borislav Petkov, Dave Hansen, Ingo Molnar, Alexei Starovoitov,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, Andrii Nakryiko,
	Daniel Borkmann, bpf

Hi,

Coming back to this a bit late; I was on vacation for a few weeks.

On 07/07/2023 17:42, Alexei Starovoitov wrote:
> On Fri, Jul 7, 2023 at 1:28 AM Tero Kristo <tero.kristo@linux.intel.com> wrote:
>> [...]
>>
>> The results I got with the CPU locked at 2.4GHz (average execution time
>> per call within the loop, over some 10M executions):
>>
>> bpf_rdtsc() ordered   : 45ns
>> bpf_rdtsc() unordered : 23ns
>> bpf_ktime_get_ns()    : 49ns
> Thanks for crunching the numbers.
> Based on them it's hard to justify adding the ordered variant.
> We already have ktime_get_ns, ktime_get_boot_ns, ktime_get_coarse_ns,
> ktime_get_tai_ns with pretty close performance and different time
> constraints. rdtsc_ordered doesn't bring anything new to the table.
> bpf_rdtsc() would be justified if it's significantly faster
> than traditional ktime*() helpers.

The only other justification I can offer here is that the TSC counter is 
useful if you are dealing with other counters that use TSC as a 
reference; mainly, the Intel power management residency counters use the 
same time base / resolution as TSC.

Converting between TSC and ktime values can get cumbersome, and you 
would need to get the magic conversion factors from somewhere.
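
For example, even the basic cycles-to-nanoseconds conversion needs the 
boot-time calibrated frequency; a minimal sketch, assuming tsc_khz has 
already been obtained somehow (which is exactly the inconvenient part):

        /* ns = cycles * 1,000,000 / tsc_khz, with the multiply split
         * to avoid 64-bit overflow for large cycle counts. */
        static u64 tsc_to_ns(u64 cycles, u64 tsc_khz)
        {
                u64 quot = cycles / tsc_khz;
                u64 rem  = cycles % tsc_khz;

                return quot * 1000000ULL + rem * 1000000ULL / tsc_khz;
        }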

-Tero

>
>> [...]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-08-09 11:31 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-03 10:57 [PATCH 0/2] x86/BPF: Add new BPF helper call bpf_rdtsc Tero Kristo
2023-07-03 10:57 ` [PATCH 1/2] x86/tsc: " Tero Kristo
2023-07-04  4:49   ` Yonghong Song
2023-07-06 12:00     ` Tero Kristo
2023-07-06  3:02   ` Alexei Starovoitov
2023-07-06  5:16     ` John Fastabend
2023-07-06 11:59       ` Tero Kristo
2023-07-06 19:51         ` Alexei Starovoitov
2023-07-07  5:41           ` John Fastabend
2023-07-07  8:27             ` Tero Kristo
2023-07-07 14:42               ` Alexei Starovoitov
2023-08-09 11:31                 ` Tero Kristo
2023-07-03 10:57 ` [PATCH 2/2] selftests/bpf: Add test for bpf_rdtsc Tero Kristo
2023-07-03 22:00   ` John Fastabend
2023-07-04  8:55     ` Tero Kristo
2023-07-06  4:57       ` John Fastabend
2023-07-03 21:55 ` [PATCH 0/2] x86/BPF: Add new BPF helper call bpf_rdtsc John Fastabend
