* [PATCH v2 bpf-next 0/5] bpf: volatile compare
@ 2023-12-21  3:38 Alexei Starovoitov
  2023-12-21  3:38 ` [PATCH v2 bpf-next 1/5] selftests/bpf: Attempt to build BPF programs with -Wsign-compare Alexei Starovoitov
                   ` (5 more replies)
  0 siblings, 6 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2023-12-21  3:38 UTC (permalink / raw)
  To: daniel; +Cc: andrii, martin.lau, dxu, memxor, john.fastabend, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

v1->v2:
Fixed issues pointed out by Daniel, added more tests, and attempted to convert profiler.c,
but barrier_var() wins vs bpf_cmp(). To be investigated.
Patches 1-4 are good to go, but patch 5 needs more work.

Alexei Starovoitov (5):
  selftests/bpf: Attempt to build BPF programs with -Wsign-compare
  bpf: Introduce "volatile compare" macro
  selftests/bpf: Convert exceptions_assert.c to bpf_cmp
  selftests/bpf: Remove bpf_assert_eq-like macros.
  selftests/bpf: Attempt to convert profiler.c to bpf_cmp.

 tools/testing/selftests/bpf/Makefile          |   1 +
 .../testing/selftests/bpf/bpf_experimental.h  | 194 ++++--------------
 .../bpf/progs/bpf_iter_bpf_percpu_hash_map.c  |   2 +-
 .../selftests/bpf/progs/bpf_iter_task_vmas.c  |   2 +-
 .../selftests/bpf/progs/bpf_iter_tasks.c      |   2 +-
 .../selftests/bpf/progs/bpf_iter_test_kern4.c |   2 +-
 .../progs/cgroup_getset_retval_setsockopt.c   |   2 +-
 .../selftests/bpf/progs/cgrp_ls_sleepable.c   |   2 +-
 .../selftests/bpf/progs/cpumask_success.c     |   2 +-
 .../testing/selftests/bpf/progs/exceptions.c  |  20 +-
 .../selftests/bpf/progs/exceptions_assert.c   |  80 ++++----
 tools/testing/selftests/bpf/progs/iters.c     |   4 +-
 .../selftests/bpf/progs/iters_task_vma.c      |   3 +-
 .../selftests/bpf/progs/linked_funcs1.c       |   2 +-
 .../selftests/bpf/progs/linked_funcs2.c       |   2 +-
 .../testing/selftests/bpf/progs/linked_list.c |   2 +-
 .../selftests/bpf/progs/local_storage.c       |   2 +-
 tools/testing/selftests/bpf/progs/lsm.c       |   2 +-
 .../selftests/bpf/progs/normal_map_btf.c      |   2 +-
 .../selftests/bpf/progs/profiler.inc.h        |  71 ++-----
 tools/testing/selftests/bpf/progs/profiler2.c |   1 +
 tools/testing/selftests/bpf/progs/profiler3.c |   1 +
 .../selftests/bpf/progs/sockopt_inherit.c     |   2 +-
 .../selftests/bpf/progs/sockopt_multi.c       |   2 +-
 .../selftests/bpf/progs/sockopt_qos_to_cc.c   |   2 +-
 .../testing/selftests/bpf/progs/test_bpf_ma.c |   2 +-
 .../bpf/progs/test_core_reloc_kernel.c        |   2 +-
 .../bpf/progs/test_core_reloc_module.c        |   8 +-
 .../selftests/bpf/progs/test_fsverity.c       |   2 +-
 .../bpf/progs/test_skc_to_unix_sock.c         |   2 +-
 .../bpf/progs/test_xdp_do_redirect.c          |   2 +-
 31 files changed, 146 insertions(+), 279 deletions(-)

-- 
2.34.1



* [PATCH v2 bpf-next 1/5] selftests/bpf: Attempt to build BPF programs with -Wsign-compare
  2023-12-21  3:38 [PATCH v2 bpf-next 0/5] bpf: volatile compare Alexei Starovoitov
@ 2023-12-21  3:38 ` Alexei Starovoitov
  2023-12-21  3:38 ` [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro Alexei Starovoitov
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2023-12-21  3:38 UTC (permalink / raw)
  To: daniel; +Cc: andrii, martin.lau, dxu, memxor, john.fastabend, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

GCC's -Wall includes -Wsign-compare while clang's does not.
Since BPF programs are built with clang, we need to add this flag explicitly
to catch problematic comparisons like:

  int i = -1;
  unsigned int j = 1;
  if (i < j) // this is false.

  long i = -1;
  unsigned int j = 1;
  if (i < j) // this is true.

C standard for reference:

- If either operand is unsigned long the other shall be converted to unsigned long.

- Otherwise, if one operand is a long int and the other unsigned int, then if a
long int can represent all the values of an unsigned int, the unsigned int
shall be converted to a long int; otherwise both operands shall be converted to
unsigned long int.

- Otherwise, if either operand is long, the other shall be converted to long.

- Otherwise, if either operand is unsigned, the other shall be converted to unsigned.

Unfortunately clang's -Wsign-compare is very noisy.
It complains about (s32)a == (u32)b, which is safe and doesn't have surprising behavior.

This patch fixes some of the issues. It needs a follow-up to fix the rest.
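
The typical fix in this patch is either an explicit cast on one side of the
comparison or switching a variable to the matching signedness. An illustrative
sketch (modeled on the hunks below, variable names simplified):

  __u32 tid;   /* set from user space before the prog runs */
  ...
  if (task->pid != (pid_t)tid)   /* compare as signed, matching task->pid */
          num_unknown_tid++;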

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/Makefile                      | 1 +
 .../selftests/bpf/progs/bpf_iter_bpf_percpu_hash_map.c    | 2 +-
 tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c    | 2 +-
 tools/testing/selftests/bpf/progs/bpf_iter_tasks.c        | 2 +-
 tools/testing/selftests/bpf/progs/bpf_iter_test_kern4.c   | 2 +-
 .../selftests/bpf/progs/cgroup_getset_retval_setsockopt.c | 2 +-
 tools/testing/selftests/bpf/progs/cgrp_ls_sleepable.c     | 2 +-
 tools/testing/selftests/bpf/progs/cpumask_success.c       | 2 +-
 tools/testing/selftests/bpf/progs/iters.c                 | 4 ++--
 tools/testing/selftests/bpf/progs/linked_funcs1.c         | 2 +-
 tools/testing/selftests/bpf/progs/linked_funcs2.c         | 2 +-
 tools/testing/selftests/bpf/progs/linked_list.c           | 2 +-
 tools/testing/selftests/bpf/progs/local_storage.c         | 2 +-
 tools/testing/selftests/bpf/progs/lsm.c                   | 2 +-
 tools/testing/selftests/bpf/progs/normal_map_btf.c        | 2 +-
 tools/testing/selftests/bpf/progs/profiler.inc.h          | 4 ++--
 tools/testing/selftests/bpf/progs/sockopt_inherit.c       | 2 +-
 tools/testing/selftests/bpf/progs/sockopt_multi.c         | 2 +-
 tools/testing/selftests/bpf/progs/sockopt_qos_to_cc.c     | 2 +-
 tools/testing/selftests/bpf/progs/test_bpf_ma.c           | 2 +-
 .../testing/selftests/bpf/progs/test_core_reloc_kernel.c  | 2 +-
 .../testing/selftests/bpf/progs/test_core_reloc_module.c  | 8 ++++----
 tools/testing/selftests/bpf/progs/test_fsverity.c         | 2 +-
 tools/testing/selftests/bpf/progs/test_skc_to_unix_sock.c | 2 +-
 tools/testing/selftests/bpf/progs/test_xdp_do_redirect.c  | 2 +-
 25 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 617ae55c3bb5..fd15017ed3b1 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -383,6 +383,7 @@ CLANG_SYS_INCLUDES = $(call get_sys_includes,$(CLANG),$(CLANG_TARGET_ARCH))
 BPF_CFLAGS = -g -Wall -Werror -D__TARGET_ARCH_$(SRCARCH) $(MENDIAN)	\
 	     -I$(INCLUDE_DIR) -I$(CURDIR) -I$(APIDIR)			\
 	     -I$(abspath $(OUTPUT)/../usr/include)
+# TODO: enable me -Wsign-compare
 
 CLANG_CFLAGS = $(CLANG_SYS_INCLUDES) \
 	       -Wno-compare-distinct-pointer-types
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_bpf_percpu_hash_map.c b/tools/testing/selftests/bpf/progs/bpf_iter_bpf_percpu_hash_map.c
index feaaa2b89c57..5014a17d6c02 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter_bpf_percpu_hash_map.c
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_bpf_percpu_hash_map.c
@@ -20,7 +20,7 @@ struct {
 } hashmap1 SEC(".maps");
 
 /* will set before prog run */
-volatile const __u32 num_cpus = 0;
+volatile const __s32 num_cpus = 0;
 
 /* will collect results during prog run */
 __u32 key_sum_a = 0, key_sum_b = 0, key_sum_c = 0;
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c b/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c
index dd923dc637d5..423b39e60b6f 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c
@@ -35,7 +35,7 @@ SEC("iter/task_vma") int proc_maps(struct bpf_iter__task_vma *ctx)
 		return 0;
 
 	file = vma->vm_file;
-	if (task->tgid != pid) {
+	if (task->tgid != (pid_t)pid) {
 		if (one_task)
 			one_task_error = 1;
 		return 0;
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_tasks.c b/tools/testing/selftests/bpf/progs/bpf_iter_tasks.c
index 96131b9a1caa..6cbb3393f243 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter_tasks.c
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_tasks.c
@@ -22,7 +22,7 @@ int dump_task(struct bpf_iter__task *ctx)
 		return 0;
 	}
 
-	if (task->pid != tid)
+	if (task->pid != (pid_t)tid)
 		num_unknown_tid++;
 	else
 		num_known_tid++;
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_test_kern4.c b/tools/testing/selftests/bpf/progs/bpf_iter_test_kern4.c
index 400fdf8d6233..dbf61c44acac 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter_test_kern4.c
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_test_kern4.c
@@ -45,7 +45,7 @@ int dump_bpf_map(struct bpf_iter__bpf_map *ctx)
 	}
 
 	/* fill seq_file buffer */
-	for (i = 0; i < print_len; i++)
+	for (i = 0; i < (int)print_len; i++)
 		bpf_seq_write(seq, &seq_num, sizeof(seq_num));
 
 	return ret;
diff --git a/tools/testing/selftests/bpf/progs/cgroup_getset_retval_setsockopt.c b/tools/testing/selftests/bpf/progs/cgroup_getset_retval_setsockopt.c
index b7fa8804e19d..45a0e9f492a9 100644
--- a/tools/testing/selftests/bpf/progs/cgroup_getset_retval_setsockopt.c
+++ b/tools/testing/selftests/bpf/progs/cgroup_getset_retval_setsockopt.c
@@ -11,7 +11,7 @@
 __u32 invocations = 0;
 __u32 assertion_error = 0;
 __u32 retval_value = 0;
-__u32 page_size = 0;
+__s32 page_size = 0;
 
 SEC("cgroup/setsockopt")
 int get_retval(struct bpf_sockopt *ctx)
diff --git a/tools/testing/selftests/bpf/progs/cgrp_ls_sleepable.c b/tools/testing/selftests/bpf/progs/cgrp_ls_sleepable.c
index facedd8b8250..5e282c16eadc 100644
--- a/tools/testing/selftests/bpf/progs/cgrp_ls_sleepable.c
+++ b/tools/testing/selftests/bpf/progs/cgrp_ls_sleepable.c
@@ -15,7 +15,7 @@ struct {
 	__type(value, long);
 } map_a SEC(".maps");
 
-__u32 target_pid;
+__s32 target_pid;
 __u64 cgroup_id;
 int target_hid;
 bool is_cgroup1;
diff --git a/tools/testing/selftests/bpf/progs/cpumask_success.c b/tools/testing/selftests/bpf/progs/cpumask_success.c
index fc3666edf456..7a1e64c6c065 100644
--- a/tools/testing/selftests/bpf/progs/cpumask_success.c
+++ b/tools/testing/selftests/bpf/progs/cpumask_success.c
@@ -332,7 +332,7 @@ SEC("tp_btf/task_newtask")
 int BPF_PROG(test_copy_any_anyand, struct task_struct *task, u64 clone_flags)
 {
 	struct bpf_cpumask *mask1, *mask2, *dst1, *dst2;
-	u32 cpu;
+	int cpu;
 
 	if (!is_test_task())
 		return 0;
diff --git a/tools/testing/selftests/bpf/progs/iters.c b/tools/testing/selftests/bpf/progs/iters.c
index 3aca3dc145b5..fe971992e635 100644
--- a/tools/testing/selftests/bpf/progs/iters.c
+++ b/tools/testing/selftests/bpf/progs/iters.c
@@ -6,7 +6,7 @@
 #include <bpf/bpf_helpers.h>
 #include "bpf_misc.h"
 
-#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#define ARRAY_SIZE(x) (int)(sizeof(x) / sizeof((x)[0]))
 
 static volatile int zero = 0;
 
@@ -676,7 +676,7 @@ static __noinline int sum(struct bpf_iter_num *it, int *arr, __u32 n)
 
 	while ((t = bpf_iter_num_next(it))) {
 		i = *t;
-		if (i >= n)
+		if ((__u32)i >= n)
 			break;
 		sum += arr[i];
 	}
diff --git a/tools/testing/selftests/bpf/progs/linked_funcs1.c b/tools/testing/selftests/bpf/progs/linked_funcs1.c
index c4b49ceea967..cc79dddac182 100644
--- a/tools/testing/selftests/bpf/progs/linked_funcs1.c
+++ b/tools/testing/selftests/bpf/progs/linked_funcs1.c
@@ -8,7 +8,7 @@
 #include "bpf_misc.h"
 
 /* weak and shared between two files */
-const volatile int my_tid __weak;
+const volatile __u32 my_tid __weak;
 long syscall_id __weak;
 
 int output_val1;
diff --git a/tools/testing/selftests/bpf/progs/linked_funcs2.c b/tools/testing/selftests/bpf/progs/linked_funcs2.c
index 013ff0645f0c..942cc5526ddf 100644
--- a/tools/testing/selftests/bpf/progs/linked_funcs2.c
+++ b/tools/testing/selftests/bpf/progs/linked_funcs2.c
@@ -68,7 +68,7 @@ int BPF_PROG(handler2, struct pt_regs *regs, long id)
 {
 	static volatile int whatever;
 
-	if (my_tid != (u32)bpf_get_current_pid_tgid() || id != syscall_id)
+	if (my_tid != (s32)bpf_get_current_pid_tgid() || id != syscall_id)
 		return 0;
 
 	/* make sure we have CO-RE relocations in main program */
diff --git a/tools/testing/selftests/bpf/progs/linked_list.c b/tools/testing/selftests/bpf/progs/linked_list.c
index 84d1777a9e6c..26205ca80679 100644
--- a/tools/testing/selftests/bpf/progs/linked_list.c
+++ b/tools/testing/selftests/bpf/progs/linked_list.c
@@ -6,7 +6,7 @@
 #include "bpf_experimental.h"
 
 #ifndef ARRAY_SIZE
-#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#define ARRAY_SIZE(x) (int)(sizeof(x) / sizeof((x)[0]))
 #endif
 
 #include "linked_list.h"
diff --git a/tools/testing/selftests/bpf/progs/local_storage.c b/tools/testing/selftests/bpf/progs/local_storage.c
index bc8ea56671a1..e5e3a8b8dd07 100644
--- a/tools/testing/selftests/bpf/progs/local_storage.c
+++ b/tools/testing/selftests/bpf/progs/local_storage.c
@@ -13,7 +13,7 @@ char _license[] SEC("license") = "GPL";
 
 #define DUMMY_STORAGE_VALUE 0xdeadbeef
 
-int monitored_pid = 0;
+__u32 monitored_pid = 0;
 int inode_storage_result = -1;
 int sk_storage_result = -1;
 int task_storage_result = -1;
diff --git a/tools/testing/selftests/bpf/progs/lsm.c b/tools/testing/selftests/bpf/progs/lsm.c
index fadfdd98707c..0c13b7409947 100644
--- a/tools/testing/selftests/bpf/progs/lsm.c
+++ b/tools/testing/selftests/bpf/progs/lsm.c
@@ -92,7 +92,7 @@ int BPF_PROG(test_int_hook, struct vm_area_struct *vma,
 	if (ret != 0)
 		return ret;
 
-	__u32 pid = bpf_get_current_pid_tgid() >> 32;
+	__s32 pid = bpf_get_current_pid_tgid() >> 32;
 	int is_stack = 0;
 
 	is_stack = (vma->vm_start <= vma->vm_mm->start_stack &&
diff --git a/tools/testing/selftests/bpf/progs/normal_map_btf.c b/tools/testing/selftests/bpf/progs/normal_map_btf.c
index 66cde82aa86d..a45c9299552c 100644
--- a/tools/testing/selftests/bpf/progs/normal_map_btf.c
+++ b/tools/testing/selftests/bpf/progs/normal_map_btf.c
@@ -36,7 +36,7 @@ int add_to_list_in_array(void *ctx)
 	struct node_data *new;
 	int zero = 0;
 
-	if (done || (u32)bpf_get_current_pid_tgid() != pid)
+	if (done || (int)bpf_get_current_pid_tgid() != pid)
 		return 0;
 
 	value = bpf_map_lookup_elem(&array, &zero);
diff --git a/tools/testing/selftests/bpf/progs/profiler.inc.h b/tools/testing/selftests/bpf/progs/profiler.inc.h
index 897061930cb7..ba99d17dac54 100644
--- a/tools/testing/selftests/bpf/progs/profiler.inc.h
+++ b/tools/testing/selftests/bpf/progs/profiler.inc.h
@@ -132,7 +132,7 @@ struct {
 } disallowed_exec_inodes SEC(".maps");
 
 #ifndef ARRAY_SIZE
-#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof(arr[0]))
+#define ARRAY_SIZE(arr) (int)(sizeof(arr) / sizeof(arr[0]))
 #endif
 
 static INLINE bool IS_ERR(const void* ptr)
@@ -645,7 +645,7 @@ int raw_tracepoint__sched_process_exit(void* ctx)
 	for (int i = 0; i < ARRAY_SIZE(arr_struct->array); i++) {
 		struct var_kill_data_t* past_kill_data = &arr_struct->array[i];
 
-		if (past_kill_data != NULL && past_kill_data->kill_target_pid == tpid) {
+		if (past_kill_data != NULL && past_kill_data->kill_target_pid == (pid_t)tpid) {
 			bpf_probe_read_kernel(kill_data, sizeof(*past_kill_data),
 					      past_kill_data);
 			void* payload = kill_data->payload;
diff --git a/tools/testing/selftests/bpf/progs/sockopt_inherit.c b/tools/testing/selftests/bpf/progs/sockopt_inherit.c
index c8f59caa4639..a3434b840928 100644
--- a/tools/testing/selftests/bpf/progs/sockopt_inherit.c
+++ b/tools/testing/selftests/bpf/progs/sockopt_inherit.c
@@ -9,7 +9,7 @@ char _license[] SEC("license") = "GPL";
 #define CUSTOM_INHERIT2			1
 #define CUSTOM_LISTENER			2
 
-__u32 page_size = 0;
+__s32 page_size = 0;
 
 struct sockopt_inherit {
 	__u8 val;
diff --git a/tools/testing/selftests/bpf/progs/sockopt_multi.c b/tools/testing/selftests/bpf/progs/sockopt_multi.c
index 96f29fce050b..db67278e12d4 100644
--- a/tools/testing/selftests/bpf/progs/sockopt_multi.c
+++ b/tools/testing/selftests/bpf/progs/sockopt_multi.c
@@ -5,7 +5,7 @@
 
 char _license[] SEC("license") = "GPL";
 
-__u32 page_size = 0;
+__s32 page_size = 0;
 
 SEC("cgroup/getsockopt")
 int _getsockopt_child(struct bpf_sockopt *ctx)
diff --git a/tools/testing/selftests/bpf/progs/sockopt_qos_to_cc.c b/tools/testing/selftests/bpf/progs/sockopt_qos_to_cc.c
index dbe235ede7f3..83753b00a556 100644
--- a/tools/testing/selftests/bpf/progs/sockopt_qos_to_cc.c
+++ b/tools/testing/selftests/bpf/progs/sockopt_qos_to_cc.c
@@ -9,7 +9,7 @@
 
 char _license[] SEC("license") = "GPL";
 
-__u32 page_size = 0;
+__s32 page_size = 0;
 
 SEC("cgroup/setsockopt")
 int sockopt_qos_to_cc(struct bpf_sockopt *ctx)
diff --git a/tools/testing/selftests/bpf/progs/test_bpf_ma.c b/tools/testing/selftests/bpf/progs/test_bpf_ma.c
index 069db9085e78..b78f4f702ae0 100644
--- a/tools/testing/selftests/bpf/progs/test_bpf_ma.c
+++ b/tools/testing/selftests/bpf/progs/test_bpf_ma.c
@@ -21,7 +21,7 @@ const unsigned int data_sizes[] = {16, 32, 64, 96, 128, 192, 256, 512, 1024, 204
 const volatile unsigned int data_btf_ids[ARRAY_SIZE(data_sizes)] = {};
 
 int err = 0;
-int pid = 0;
+u32 pid = 0;
 
 #define DEFINE_ARRAY_WITH_KPTR(_size) \
 	struct bin_data_##_size { \
diff --git a/tools/testing/selftests/bpf/progs/test_core_reloc_kernel.c b/tools/testing/selftests/bpf/progs/test_core_reloc_kernel.c
index a17dd83eae67..ee4a601dcb06 100644
--- a/tools/testing/selftests/bpf/progs/test_core_reloc_kernel.c
+++ b/tools/testing/selftests/bpf/progs/test_core_reloc_kernel.c
@@ -53,7 +53,7 @@ int test_core_kernel(void *ctx)
 	struct task_struct *task = (void *)bpf_get_current_task();
 	struct core_reloc_kernel_output *out = (void *)&data.out;
 	uint64_t pid_tgid = bpf_get_current_pid_tgid();
-	uint32_t real_tgid = (uint32_t)pid_tgid;
+	int32_t real_tgid = (int32_t)pid_tgid;
 	int pid, tgid;
 
 	if (data.my_pid_tgid != pid_tgid)
diff --git a/tools/testing/selftests/bpf/progs/test_core_reloc_module.c b/tools/testing/selftests/bpf/progs/test_core_reloc_module.c
index f59f175c7baf..bcb31ff92dcc 100644
--- a/tools/testing/selftests/bpf/progs/test_core_reloc_module.c
+++ b/tools/testing/selftests/bpf/progs/test_core_reloc_module.c
@@ -43,8 +43,8 @@ int BPF_PROG(test_core_module_probed,
 #if __has_builtin(__builtin_preserve_enum_value)
 	struct core_reloc_module_output *out = (void *)&data.out;
 	__u64 pid_tgid = bpf_get_current_pid_tgid();
-	__u32 real_tgid = (__u32)(pid_tgid >> 32);
-	__u32 real_pid = (__u32)pid_tgid;
+	__s32 real_tgid = (__s32)(pid_tgid >> 32);
+	__s32 real_pid = (__s32)pid_tgid;
 
 	if (data.my_pid_tgid != pid_tgid)
 		return 0;
@@ -77,8 +77,8 @@ int BPF_PROG(test_core_module_direct,
 #if __has_builtin(__builtin_preserve_enum_value)
 	struct core_reloc_module_output *out = (void *)&data.out;
 	__u64 pid_tgid = bpf_get_current_pid_tgid();
-	__u32 real_tgid = (__u32)(pid_tgid >> 32);
-	__u32 real_pid = (__u32)pid_tgid;
+	__s32 real_tgid = (__s32)(pid_tgid >> 32);
+	__s32 real_pid = (__s32)pid_tgid;
 
 	if (data.my_pid_tgid != pid_tgid)
 		return 0;
diff --git a/tools/testing/selftests/bpf/progs/test_fsverity.c b/tools/testing/selftests/bpf/progs/test_fsverity.c
index 3975495b75c8..9e0f73e8189c 100644
--- a/tools/testing/selftests/bpf/progs/test_fsverity.c
+++ b/tools/testing/selftests/bpf/progs/test_fsverity.c
@@ -38,7 +38,7 @@ int BPF_PROG(test_file_open, struct file *f)
 		return 0;
 	got_fsverity = 1;
 
-	for (i = 0; i < sizeof(digest); i++) {
+	for (i = 0; i < (int)sizeof(digest); i++) {
 		if (digest[i] != expected_digest[i])
 			return 0;
 	}
diff --git a/tools/testing/selftests/bpf/progs/test_skc_to_unix_sock.c b/tools/testing/selftests/bpf/progs/test_skc_to_unix_sock.c
index eacda9fe07eb..4cfa42aa9436 100644
--- a/tools/testing/selftests/bpf/progs/test_skc_to_unix_sock.c
+++ b/tools/testing/selftests/bpf/progs/test_skc_to_unix_sock.c
@@ -29,7 +29,7 @@ int BPF_PROG(unix_listen, struct socket *sock, int backlog)
 	len = unix_sk->addr->len - sizeof(short);
 	path[0] = '@';
 	for (i = 1; i < len; i++) {
-		if (i >= sizeof(struct sockaddr_un))
+		if (i >= (int)sizeof(struct sockaddr_un))
 			break;
 
 		path[i] = unix_sk->addr->name->sun_path[i];
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_do_redirect.c b/tools/testing/selftests/bpf/progs/test_xdp_do_redirect.c
index 5baaafed0d2d..3abf068b8446 100644
--- a/tools/testing/selftests/bpf/progs/test_xdp_do_redirect.c
+++ b/tools/testing/selftests/bpf/progs/test_xdp_do_redirect.c
@@ -38,7 +38,7 @@ int xdp_redirect(struct xdp_md *xdp)
 	if (payload + 1 > data_end)
 		return XDP_ABORTED;
 
-	if (xdp->ingress_ifindex != ifindex_in)
+	if (xdp->ingress_ifindex != (__u32)ifindex_in)
 		return XDP_ABORTED;
 
 	if (metadata + 1 > data)
-- 
2.34.1



* [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2023-12-21  3:38 [PATCH v2 bpf-next 0/5] bpf: volatile compare Alexei Starovoitov
  2023-12-21  3:38 ` [PATCH v2 bpf-next 1/5] selftests/bpf: Attempt to build BPF programs with -Wsign-compare Alexei Starovoitov
@ 2023-12-21  3:38 ` Alexei Starovoitov
  2023-12-21  4:27   ` Kumar Kartikeya Dwivedi
  2023-12-21  3:38 ` [PATCH v2 bpf-next 3/5] selftests/bpf: Convert exceptions_assert.c to bpf_cmp Alexei Starovoitov
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2023-12-21  3:38 UTC (permalink / raw)
  To: daniel; +Cc: andrii, martin.lau, dxu, memxor, john.fastabend, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Compilers optimize conditional operators at will, but often BPF programmers
want to force the compiler to keep the same operator in asm as it is written in C.
Introduce a bpf_cmp(var1, conditional_op, var2) macro that can be used as:

-               if (seen >= 1000)
+               if (bpf_cmp(seen, >=, 1000))

The macro takes advantage of BPF assembly, which is C-like.

The macro checks the sign of variable 'seen' and emits either
signed or unsigned compare.

For example:
int a;
bpf_cmp(a, >, 0) will be translated to 'if rX s> 0 goto' in BPF assembly.

unsigned int a;
bpf_cmp(a, >, 0) will be translated to 'if rX > 0 goto' in BPF assembly.

C type conversions coupled with comparison operator are tricky.
  int i = -1;
  unsigned int j = 1;
  if (i < j) // this is false.

  long i = -1;
  unsigned int j = 1;
  if (i < j) // this is true.

Make sure the BPF program is compiled with -Wsign-compare; then the macro will
catch the mistake.

The macro checks only the LHS (left-hand side) to figure out the sign of the compare.

'if 0 < rX goto' is not allowed in the assembly, so the users
have to use a variable on LHS anyway.

The patch updates a few tests to demonstrate the use of the macro.

The macro also allows using BPF_JSET in C code, since LLVM doesn't generate it at
present. For example:

if (i & j) compiles into r0 &= r1; if r0 == 0 goto

while

if (bpf_cmp(i, &, j)) compiles into if r0 & r1 goto

Note that the macro has to be careful with the RHS assembly predicate.
Given:
u64 __rhs = 1ull << 42;
asm goto("if r0 < %[rhs] goto +1" :: [rhs] "ri" (__rhs));
LLVM will silently truncate the 64-bit constant into a s32 imm.
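
A minimal usage sketch (hypothetical program, assuming the usual selftest
includes; not part of this patch):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include "bpf_experimental.h"

  char _license[] SEC("license") = "GPL";

  SEC("tc")
  int example_prog(struct __sk_buff *skb)
  {
          __u32 len = skb->len;

          /* 'len' is unsigned, so the macro emits an unsigned 'if rX <= 1500 goto' */
          if (bpf_cmp(len, <=, 1500))
                  return 1;
          return 0;
  }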

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 .../testing/selftests/bpf/bpf_experimental.h  | 44 +++++++++++++++++++
 .../testing/selftests/bpf/progs/exceptions.c  | 20 ++++-----
 .../selftests/bpf/progs/iters_task_vma.c      |  3 +-
 3 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index 1386baf9ae4a..b78a9449793d 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -254,6 +254,50 @@ extern void bpf_throw(u64 cookie) __ksym;
 		}									\
 	 })
 
+#define __cmp_cannot_be_signed(x) \
+	__builtin_strcmp(#x, "==") == 0 || __builtin_strcmp(#x, "!=") == 0 || \
+	__builtin_strcmp(#x, "&") == 0
+
+#define __is_signed_type(type) (((type)(-1)) < (type)1)
+
+#define __bpf_cmp(LHS, OP, SIGN, PRED, RHS) \
+	({ \
+		__label__ l_true; \
+		bool ret = true; \
+		asm volatile goto("if %[lhs] " SIGN #OP " %[rhs] goto %l[l_true]" \
+				  :: [lhs] "r"(LHS), [rhs] PRED (RHS) :: l_true); \
+		ret = false; \
+l_true:\
+		ret;\
+       })
+
+/* C type conversions coupled with comparison operator are tricky.
+ * Make sure BPF program is compiled with -Wsign-compare then
+ * __lhs OP __rhs below will catch the mistake.
+ * Be aware that we check only __lhs to figure out the sign of compare.
+ */
+#ifndef bpf_cmp
+#define bpf_cmp(LHS, OP, RHS) \
+	({ \
+		typeof(LHS) __lhs = (LHS); \
+		typeof(RHS) __rhs = (RHS); \
+		bool ret; \
+		(void)(__lhs OP __rhs); \
+		if (__cmp_cannot_be_signed(OP) || !__is_signed_type(typeof(__lhs))) {\
+			if (sizeof(__rhs) == 8) \
+				ret = __bpf_cmp(__lhs, OP, "", "r", __rhs); \
+			else \
+				ret = __bpf_cmp(__lhs, OP, "", "i", __rhs); \
+		} else { \
+			if (sizeof(__rhs) == 8) \
+				ret = __bpf_cmp(__lhs, OP, "s", "r", __rhs); \
+			else \
+				ret = __bpf_cmp(__lhs, OP, "s", "i", __rhs); \
+		} \
+		ret; \
+       })
+#endif
+
 /* Description
  *	Assert that a conditional expression is true.
  * Returns
diff --git a/tools/testing/selftests/bpf/progs/exceptions.c b/tools/testing/selftests/bpf/progs/exceptions.c
index 2811ee842b01..7acb859f57c7 100644
--- a/tools/testing/selftests/bpf/progs/exceptions.c
+++ b/tools/testing/selftests/bpf/progs/exceptions.c
@@ -210,7 +210,7 @@ __noinline int assert_zero_gfunc(u64 c)
 {
 	volatile u64 cookie = c;
 
-	bpf_assert_eq(cookie, 0);
+	bpf_assert(bpf_cmp(cookie, ==, 0));
 	return 0;
 }
 
@@ -218,7 +218,7 @@ __noinline int assert_neg_gfunc(s64 c)
 {
 	volatile s64 cookie = c;
 
-	bpf_assert_lt(cookie, 0);
+	bpf_assert(bpf_cmp(cookie, <, 0));
 	return 0;
 }
 
@@ -226,7 +226,7 @@ __noinline int assert_pos_gfunc(s64 c)
 {
 	volatile s64 cookie = c;
 
-	bpf_assert_gt(cookie, 0);
+	bpf_assert(bpf_cmp(cookie, >, 0));
 	return 0;
 }
 
@@ -234,7 +234,7 @@ __noinline int assert_negeq_gfunc(s64 c)
 {
 	volatile s64 cookie = c;
 
-	bpf_assert_le(cookie, -1);
+	bpf_assert(bpf_cmp(cookie, <=, -1));
 	return 0;
 }
 
@@ -242,7 +242,7 @@ __noinline int assert_poseq_gfunc(s64 c)
 {
 	volatile s64 cookie = c;
 
-	bpf_assert_ge(cookie, 1);
+	bpf_assert(bpf_cmp(cookie, >=, 1));
 	return 0;
 }
 
@@ -258,7 +258,7 @@ __noinline int assert_zero_gfunc_with(u64 c)
 {
 	volatile u64 cookie = c;
 
-	bpf_assert_eq_with(cookie, 0, cookie + 100);
+	bpf_assert_with(bpf_cmp(cookie, ==, 0), cookie + 100);
 	return 0;
 }
 
@@ -266,7 +266,7 @@ __noinline int assert_neg_gfunc_with(s64 c)
 {
 	volatile s64 cookie = c;
 
-	bpf_assert_lt_with(cookie, 0, cookie + 100);
+	bpf_assert_with(bpf_cmp(cookie, <, 0), cookie + 100);
 	return 0;
 }
 
@@ -274,7 +274,7 @@ __noinline int assert_pos_gfunc_with(s64 c)
 {
 	volatile s64 cookie = c;
 
-	bpf_assert_gt_with(cookie, 0, cookie + 100);
+	bpf_assert_with(bpf_cmp(cookie, >, 0), cookie + 100);
 	return 0;
 }
 
@@ -282,7 +282,7 @@ __noinline int assert_negeq_gfunc_with(s64 c)
 {
 	volatile s64 cookie = c;
 
-	bpf_assert_le_with(cookie, -1, cookie + 100);
+	bpf_assert_with(bpf_cmp(cookie, <=, -1), cookie + 100);
 	return 0;
 }
 
@@ -290,7 +290,7 @@ __noinline int assert_poseq_gfunc_with(s64 c)
 {
 	volatile s64 cookie = c;
 
-	bpf_assert_ge_with(cookie, 1, cookie + 100);
+	bpf_assert_with(bpf_cmp(cookie, >=, 1), cookie + 100);
 	return 0;
 }
 
diff --git a/tools/testing/selftests/bpf/progs/iters_task_vma.c b/tools/testing/selftests/bpf/progs/iters_task_vma.c
index e085a51d153e..7af9d6bca03b 100644
--- a/tools/testing/selftests/bpf/progs/iters_task_vma.c
+++ b/tools/testing/selftests/bpf/progs/iters_task_vma.c
@@ -28,9 +28,8 @@ int iter_task_vma_for_each(const void *ctx)
 		return 0;
 
 	bpf_for_each(task_vma, vma, task, 0) {
-		if (seen >= 1000)
+		if (bpf_cmp(seen, >=, 1000))
 			break;
-		barrier_var(seen);
 
 		vm_ranges[seen].vm_start = vma->vm_start;
 		vm_ranges[seen].vm_end = vma->vm_end;
-- 
2.34.1



* [PATCH v2 bpf-next 3/5] selftests/bpf: Convert exceptions_assert.c to bpf_cmp
  2023-12-21  3:38 [PATCH v2 bpf-next 0/5] bpf: volatile compare Alexei Starovoitov
  2023-12-21  3:38 ` [PATCH v2 bpf-next 1/5] selftests/bpf: Attempt to build BPF programs with -Wsign-compare Alexei Starovoitov
  2023-12-21  3:38 ` [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro Alexei Starovoitov
@ 2023-12-21  3:38 ` Alexei Starovoitov
  2023-12-21  3:38 ` [PATCH v2 bpf-next 4/5] selftests/bpf: Remove bpf_assert_eq-like macros Alexei Starovoitov
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2023-12-21  3:38 UTC (permalink / raw)
  To: daniel; +Cc: andrii, martin.lau, dxu, memxor, john.fastabend, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Convert exceptions_assert.c to bpf_cmp() macro.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 .../selftests/bpf/progs/exceptions_assert.c   | 80 +++++++++----------
 1 file changed, 40 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/exceptions_assert.c b/tools/testing/selftests/bpf/progs/exceptions_assert.c
index 0ef81040da59..84374c3ef23b 100644
--- a/tools/testing/selftests/bpf/progs/exceptions_assert.c
+++ b/tools/testing/selftests/bpf/progs/exceptions_assert.c
@@ -11,51 +11,51 @@
 #define check_assert(type, op, name, value)				\
 	SEC("?tc")							\
 	__log_level(2) __failure					\
-	int check_assert_##op##_##name(void *ctx)			\
+	int check_assert_##name(void *ctx)				\
 	{								\
 		type num = bpf_ktime_get_ns();				\
-		bpf_assert_##op(num, value);				\
+		bpf_assert(bpf_cmp(num, op, value));			\
 		return *(u64 *)num;					\
 	}
 
-__msg(": R0_w=0xffffffff80000000 R10=fp0")
-check_assert(s64, eq, int_min, INT_MIN);
-__msg(": R0_w=0x7fffffff R10=fp0")
-check_assert(s64, eq, int_max, INT_MAX);
-__msg(": R0_w=0 R10=fp0")
-check_assert(s64, eq, zero, 0);
-__msg(": R0_w=0x8000000000000000 R1_w=0x8000000000000000 R10=fp0")
-check_assert(s64, eq, llong_min, LLONG_MIN);
-__msg(": R0_w=0x7fffffffffffffff R1_w=0x7fffffffffffffff R10=fp0")
-check_assert(s64, eq, llong_max, LLONG_MAX);
-
-__msg(": R0_w=scalar(smax=0x7ffffffe) R10=fp0")
-check_assert(s64, lt, pos, INT_MAX);
-__msg(": R0_w=scalar(smax=-1,umin=0x8000000000000000,var_off=(0x8000000000000000; 0x7fffffffffffffff))")
-check_assert(s64, lt, zero, 0);
-__msg(": R0_w=scalar(smax=0xffffffff7fffffff,umin=0x8000000000000000,umax=0xffffffff7fffffff,var_off=(0x8000000000000000; 0x7fffffffffffffff))")
-check_assert(s64, lt, neg, INT_MIN);
-
-__msg(": R0_w=scalar(smax=0x7fffffff) R10=fp0")
-check_assert(s64, le, pos, INT_MAX);
-__msg(": R0_w=scalar(smax=0) R10=fp0")
-check_assert(s64, le, zero, 0);
-__msg(": R0_w=scalar(smax=0xffffffff80000000,umin=0x8000000000000000,umax=0xffffffff80000000,var_off=(0x8000000000000000; 0x7fffffffffffffff))")
-check_assert(s64, le, neg, INT_MIN);
-
-__msg(": R0_w=scalar(smin=umin=0x80000000,umax=0x7fffffffffffffff,var_off=(0x0; 0x7fffffffffffffff))")
-check_assert(s64, gt, pos, INT_MAX);
-__msg(": R0_w=scalar(smin=umin=1,umax=0x7fffffffffffffff,var_off=(0x0; 0x7fffffffffffffff))")
-check_assert(s64, gt, zero, 0);
-__msg(": R0_w=scalar(smin=0xffffffff80000001) R10=fp0")
-check_assert(s64, gt, neg, INT_MIN);
-
-__msg(": R0_w=scalar(smin=umin=0x7fffffff,umax=0x7fffffffffffffff,var_off=(0x0; 0x7fffffffffffffff))")
-check_assert(s64, ge, pos, INT_MAX);
-__msg(": R0_w=scalar(smin=0,umax=0x7fffffffffffffff,var_off=(0x0; 0x7fffffffffffffff)) R10=fp0")
-check_assert(s64, ge, zero, 0);
-__msg(": R0_w=scalar(smin=0xffffffff80000000) R10=fp0")
-check_assert(s64, ge, neg, INT_MIN);
+__msg(": R0_w=0xffffffff80000000")
+check_assert(s64, ==, eq_int_min, INT_MIN);
+__msg(": R0_w=0x7fffffff")
+check_assert(s64, ==, eq_int_max, INT_MAX);
+__msg(": R0_w=0")
+check_assert(s64, ==, eq_zero, 0);
+__msg(": R0_w=0x8000000000000000 R1_w=0x8000000000000000")
+check_assert(s64, ==, eq_llong_min, LLONG_MIN);
+__msg(": R0_w=0x7fffffffffffffff R1_w=0x7fffffffffffffff")
+check_assert(s64, ==, eq_llong_max, LLONG_MAX);
+
+__msg(": R0_w=scalar(id=1,smax=0x7ffffffe)")
+check_assert(s64, <, lt_pos, INT_MAX);
+__msg(": R0_w=scalar(id=1,smax=-1,umin=0x8000000000000000,var_off=(0x8000000000000000; 0x7fffffffffffffff))")
+check_assert(s64, <, lt_zero, 0);
+__msg(": R0_w=scalar(id=1,smax=0xffffffff7fffffff")
+check_assert(s64, <, lt_neg, INT_MIN);
+
+__msg(": R0_w=scalar(id=1,smax=0x7fffffff)")
+check_assert(s64, <=, le_pos, INT_MAX);
+__msg(": R0_w=scalar(id=1,smax=0)")
+check_assert(s64, <=, le_zero, 0);
+__msg(": R0_w=scalar(id=1,smax=0xffffffff80000000")
+check_assert(s64, <=, le_neg, INT_MIN);
+
+__msg(": R0_w=scalar(id=1,smin=umin=0x80000000,umax=0x7fffffffffffffff,var_off=(0x0; 0x7fffffffffffffff))")
+check_assert(s64, >, gt_pos, INT_MAX);
+__msg(": R0_w=scalar(id=1,smin=umin=1,umax=0x7fffffffffffffff,var_off=(0x0; 0x7fffffffffffffff))")
+check_assert(s64, >, gt_zero, 0);
+__msg(": R0_w=scalar(id=1,smin=0xffffffff80000001")
+check_assert(s64, >, gt_neg, INT_MIN);
+
+__msg(": R0_w=scalar(id=1,smin=umin=0x7fffffff,umax=0x7fffffffffffffff,var_off=(0x0; 0x7fffffffffffffff))")
+check_assert(s64, >=, ge_pos, INT_MAX);
+__msg(": R0_w=scalar(id=1,smin=0,umax=0x7fffffffffffffff,var_off=(0x0; 0x7fffffffffffffff))")
+check_assert(s64, >=, ge_zero, 0);
+__msg(": R0_w=scalar(id=1,smin=0xffffffff80000000")
+check_assert(s64, >=, ge_neg, INT_MIN);
 
 SEC("?tc")
 __log_level(2) __failure
-- 
2.34.1



* [PATCH v2 bpf-next 4/5] selftests/bpf: Remove bpf_assert_eq-like macros.
  2023-12-21  3:38 [PATCH v2 bpf-next 0/5] bpf: volatile compare Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2023-12-21  3:38 ` [PATCH v2 bpf-next 3/5] selftests/bpf: Convert exceptions_assert.c to bpf_cmp Alexei Starovoitov
@ 2023-12-21  3:38 ` Alexei Starovoitov
  2023-12-21  3:38 ` [RFC PATCH v2 bpf-next 5/5] selftests/bpf: Attempt to convert profiler.c to bpf_cmp Alexei Starovoitov
  2023-12-21 18:01 ` [PATCH v2 bpf-next 0/5] bpf: volatile compare Jiri Olsa
  5 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2023-12-21  3:38 UTC (permalink / raw)
  To: daniel; +Cc: andrii, martin.lau, dxu, memxor, john.fastabend, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Since the last user was converted to bpf_cmp, remove bpf_assert_eq/ne/... macros.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 .../testing/selftests/bpf/bpf_experimental.h  | 150 ------------------
 1 file changed, 150 deletions(-)

diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index b78a9449793d..0f94c266c35b 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -316,156 +316,6 @@ l_true:\
  */
 #define bpf_assert_with(cond, value) if (!(cond)) bpf_throw(value);
 
-/* Description
- *	Assert that LHS is equal to RHS. This statement updates the known value
- *	of LHS during verification. Note that RHS must be a constant value, and
- *	must fit within the data type of LHS.
- * Returns
- *	Void.
- * Throws
- *	An exception with the value zero when the assertion fails.
- */
-#define bpf_assert_eq(LHS, RHS)						\
-	({								\
-		barrier_var(LHS);					\
-		__bpf_assert_op(LHS, ==, RHS, 0, true);			\
-	})
-
-/* Description
- *	Assert that LHS is equal to RHS. This statement updates the known value
- *	of LHS during verification. Note that RHS must be a constant value, and
- *	must fit within the data type of LHS.
- * Returns
- *	Void.
- * Throws
- *	An exception with the specified value when the assertion fails.
- */
-#define bpf_assert_eq_with(LHS, RHS, value)				\
-	({								\
-		barrier_var(LHS);					\
-		__bpf_assert_op(LHS, ==, RHS, value, true);		\
-	})
-
-/* Description
- *	Assert that LHS is less than RHS. This statement updates the known
- *	bounds of LHS during verification. Note that RHS must be a constant
- *	value, and must fit within the data type of LHS.
- * Returns
- *	Void.
- * Throws
- *	An exception with the value zero when the assertion fails.
- */
-#define bpf_assert_lt(LHS, RHS)						\
-	({								\
-		barrier_var(LHS);					\
-		__bpf_assert_op(LHS, <, RHS, 0, false);			\
-	})
-
-/* Description
- *	Assert that LHS is less than RHS. This statement updates the known
- *	bounds of LHS during verification. Note that RHS must be a constant
- *	value, and must fit within the data type of LHS.
- * Returns
- *	Void.
- * Throws
- *	An exception with the specified value when the assertion fails.
- */
-#define bpf_assert_lt_with(LHS, RHS, value)				\
-	({								\
-		barrier_var(LHS);					\
-		__bpf_assert_op(LHS, <, RHS, value, false);		\
-	})
-
-/* Description
- *	Assert that LHS is greater than RHS. This statement updates the known
- *	bounds of LHS during verification. Note that RHS must be a constant
- *	value, and must fit within the data type of LHS.
- * Returns
- *	Void.
- * Throws
- *	An exception with the value zero when the assertion fails.
- */
-#define bpf_assert_gt(LHS, RHS)						\
-	({								\
-		barrier_var(LHS);					\
-		__bpf_assert_op(LHS, >, RHS, 0, false);			\
-	})
-
-/* Description
- *	Assert that LHS is greater than RHS. This statement updates the known
- *	bounds of LHS during verification. Note that RHS must be a constant
- *	value, and must fit within the data type of LHS.
- * Returns
- *	Void.
- * Throws
- *	An exception with the specified value when the assertion fails.
- */
-#define bpf_assert_gt_with(LHS, RHS, value)				\
-	({								\
-		barrier_var(LHS);					\
-		__bpf_assert_op(LHS, >, RHS, value, false);		\
-	})
-
-/* Description
- *	Assert that LHS is less than or equal to RHS. This statement updates the
- *	known bounds of LHS during verification. Note that RHS must be a
- *	constant value, and must fit within the data type of LHS.
- * Returns
- *	Void.
- * Throws
- *	An exception with the value zero when the assertion fails.
- */
-#define bpf_assert_le(LHS, RHS)						\
-	({								\
-		barrier_var(LHS);					\
-		__bpf_assert_op(LHS, <=, RHS, 0, false);		\
-	})
-
-/* Description
- *	Assert that LHS is less than or equal to RHS. This statement updates the
- *	known bounds of LHS during verification. Note that RHS must be a
- *	constant value, and must fit within the data type of LHS.
- * Returns
- *	Void.
- * Throws
- *	An exception with the specified value when the assertion fails.
- */
-#define bpf_assert_le_with(LHS, RHS, value)				\
-	({								\
-		barrier_var(LHS);					\
-		__bpf_assert_op(LHS, <=, RHS, value, false);		\
-	})
-
-/* Description
- *	Assert that LHS is greater than or equal to RHS. This statement updates
- *	the known bounds of LHS during verification. Note that RHS must be a
- *	constant value, and must fit within the data type of LHS.
- * Returns
- *	Void.
- * Throws
- *	An exception with the value zero when the assertion fails.
- */
-#define bpf_assert_ge(LHS, RHS)						\
-	({								\
-		barrier_var(LHS);					\
-		__bpf_assert_op(LHS, >=, RHS, 0, false);		\
-	})
-
-/* Description
- *	Assert that LHS is greater than or equal to RHS. This statement updates
- *	the known bounds of LHS during verification. Note that RHS must be a
- *	constant value, and must fit within the data type of LHS.
- * Returns
- *	Void.
- * Throws
- *	An exception with the specified value when the assertion fails.
- */
-#define bpf_assert_ge_with(LHS, RHS, value)				\
-	({								\
-		barrier_var(LHS);					\
-		__bpf_assert_op(LHS, >=, RHS, value, false);		\
-	})
-
 /* Description
  *	Assert that LHS is in the range [BEG, END] (inclusive of both). This
  *	statement updates the known bounds of LHS during verification. Note
-- 
2.34.1



* [RFC PATCH v2 bpf-next 5/5] selftests/bpf: Attempt to convert profiler.c to bpf_cmp.
  2023-12-21  3:38 [PATCH v2 bpf-next 0/5] bpf: volatile compare Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2023-12-21  3:38 ` [PATCH v2 bpf-next 4/5] selftests/bpf: Remove bpf_assert_eq-like macros Alexei Starovoitov
@ 2023-12-21  3:38 ` Alexei Starovoitov
  2023-12-21 18:01 ` [PATCH v2 bpf-next 0/5] bpf: volatile compare Jiri Olsa
  5 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2023-12-21  3:38 UTC (permalink / raw)
  To: daniel; +Cc: andrii, martin.lau, dxu, memxor, john.fastabend, bpf, kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Convert profiler.c to the bpf_cmp() macro to compare the barrier_var() approach vs bpf_cmp().

It works, but the results are not good:

./veristat -C -e prog,insns,states before after
Program                               Insns (A)  Insns (B)  Insns       (DIFF)  States (A)  States (B)  States     (DIFF)
------------------------------------  ---------  ---------  ------------------  ----------  ----------  -----------------
kprobe__proc_sys_write                     1603      19606  +18003 (+1123.08%)         123        1678  +1555 (+1264.23%)
kprobe__vfs_link                          11815      70305   +58490 (+495.05%)         971        4967   +3996 (+411.53%)
kprobe__vfs_symlink                        5464      42896   +37432 (+685.07%)         434        3126   +2692 (+620.28%)
kprobe_ret__do_filp_open                   5641      44578   +38937 (+690.25%)         446        3162   +2716 (+608.97%)
raw_tracepoint__sched_process_exec         2770      35962  +33192 (+1198.27%)         226        3121  +2895 (+1280.97%)
raw_tracepoint__sched_process_exit         1526       2135      +609 (+39.91%)         133         208      +75 (+56.39%)
raw_tracepoint__sched_process_fork          265        337       +72 (+27.17%)          19          24       +5 (+26.32%)
tracepoint__syscalls__sys_enter_kill      18782     140407  +121625 (+647.56%)        1286       12176  +10890 (+846.81%)

To be investigated.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 .../selftests/bpf/progs/profiler.inc.h        | 67 ++++++-------------
 tools/testing/selftests/bpf/progs/profiler2.c |  1 +
 tools/testing/selftests/bpf/progs/profiler3.c |  1 +
 3 files changed, 21 insertions(+), 48 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/profiler.inc.h b/tools/testing/selftests/bpf/progs/profiler.inc.h
index ba99d17dac54..c7546ed341e5 100644
--- a/tools/testing/selftests/bpf/progs/profiler.inc.h
+++ b/tools/testing/selftests/bpf/progs/profiler.inc.h
@@ -7,6 +7,7 @@
 
 #include "profiler.h"
 #include "err.h"
+#include "bpf_experimental.h"
 
 #ifndef NULL
 #define NULL 0
@@ -221,8 +222,7 @@ static INLINE void* read_full_cgroup_path(struct kernfs_node* cgroup_node,
 			return payload;
 		if (cgroup_node == cgroup_root_node)
 			*root_pos = payload - payload_start;
-		if (filepart_length <= MAX_PATH) {
-			barrier_var(filepart_length);
+		if (bpf_cmp(filepart_length, <=, MAX_PATH)) {
 			payload += filepart_length;
 		}
 		cgroup_node = BPF_CORE_READ(cgroup_node, parent);
@@ -305,9 +305,7 @@ static INLINE void* populate_cgroup_info(struct cgroup_data_t* cgroup_data,
 	size_t cgroup_root_length =
 		bpf_probe_read_kernel_str(payload, MAX_PATH,
 					  BPF_CORE_READ(root_kernfs, name));
-	barrier_var(cgroup_root_length);
-	if (cgroup_root_length <= MAX_PATH) {
-		barrier_var(cgroup_root_length);
+	if (bpf_cmp(cgroup_root_length, <=, MAX_PATH)) {
 		cgroup_data->cgroup_root_length = cgroup_root_length;
 		payload += cgroup_root_length;
 	}
@@ -315,9 +313,7 @@ static INLINE void* populate_cgroup_info(struct cgroup_data_t* cgroup_data,
 	size_t cgroup_proc_length =
 		bpf_probe_read_kernel_str(payload, MAX_PATH,
 					  BPF_CORE_READ(proc_kernfs, name));
-	barrier_var(cgroup_proc_length);
-	if (cgroup_proc_length <= MAX_PATH) {
-		barrier_var(cgroup_proc_length);
+	if (bpf_cmp(cgroup_proc_length, <=, MAX_PATH)) {
 		cgroup_data->cgroup_proc_length = cgroup_proc_length;
 		payload += cgroup_proc_length;
 	}
@@ -347,9 +343,7 @@ static INLINE void* populate_var_metadata(struct var_metadata_t* metadata,
 	metadata->comm_length = 0;
 
 	size_t comm_length = bpf_core_read_str(payload, TASK_COMM_LEN, &task->comm);
-	barrier_var(comm_length);
-	if (comm_length <= TASK_COMM_LEN) {
-		barrier_var(comm_length);
+	if (bpf_cmp(comm_length, <=, TASK_COMM_LEN)) {
 		metadata->comm_length = comm_length;
 		payload += comm_length;
 	}
@@ -484,7 +478,7 @@ static INLINE size_t
 read_absolute_file_path_from_dentry(struct dentry* filp_dentry, void* payload)
 {
 	size_t length = 0;
-	size_t filepart_length;
+	u32 filepart_length;
 	struct dentry* parent_dentry;
 
 #ifdef UNROLL
@@ -494,10 +488,8 @@ read_absolute_file_path_from_dentry(struct dentry* filp_dentry, void* payload)
 		filepart_length =
 			bpf_probe_read_kernel_str(payload, MAX_PATH,
 						  BPF_CORE_READ(filp_dentry, d_name.name));
-		barrier_var(filepart_length);
-		if (filepart_length > MAX_PATH)
+		if (bpf_cmp(filepart_length, >, MAX_PATH))
 			break;
-		barrier_var(filepart_length);
 		payload += filepart_length;
 		length += filepart_length;
 
@@ -579,9 +571,7 @@ ssize_t BPF_KPROBE(kprobe__proc_sys_write,
 
 	size_t sysctl_val_length = bpf_probe_read_kernel_str(payload,
 							     CTL_MAXNAME, buf);
-	barrier_var(sysctl_val_length);
-	if (sysctl_val_length <= CTL_MAXNAME) {
-		barrier_var(sysctl_val_length);
+	if (bpf_cmp(sysctl_val_length, <=, CTL_MAXNAME)) {
 		sysctl_data->sysctl_val_length = sysctl_val_length;
 		payload += sysctl_val_length;
 	}
@@ -590,9 +580,7 @@ ssize_t BPF_KPROBE(kprobe__proc_sys_write,
 		bpf_probe_read_kernel_str(payload, MAX_PATH,
 					  BPF_CORE_READ(filp, f_path.dentry,
 							d_name.name));
-	barrier_var(sysctl_path_length);
-	if (sysctl_path_length <= MAX_PATH) {
-		barrier_var(sysctl_path_length);
+	if (bpf_cmp(sysctl_path_length, <=, MAX_PATH)) {
 		sysctl_data->sysctl_path_length = sysctl_path_length;
 		payload += sysctl_path_length;
 	}
@@ -658,9 +646,7 @@ int raw_tracepoint__sched_process_exit(void* ctx)
 			kill_data->kill_target_cgroup_proc_length = 0;
 
 			size_t comm_length = bpf_core_read_str(payload, TASK_COMM_LEN, &task->comm);
-			barrier_var(comm_length);
-			if (comm_length <= TASK_COMM_LEN) {
-				barrier_var(comm_length);
+			if (bpf_cmp(comm_length, <=, TASK_COMM_LEN)) {
 				kill_data->kill_target_name_length = comm_length;
 				payload += comm_length;
 			}
@@ -669,9 +655,7 @@ int raw_tracepoint__sched_process_exit(void* ctx)
 				bpf_probe_read_kernel_str(payload,
 							  KILL_TARGET_LEN,
 							  BPF_CORE_READ(proc_kernfs, name));
-			barrier_var(cgroup_proc_length);
-			if (cgroup_proc_length <= KILL_TARGET_LEN) {
-				barrier_var(cgroup_proc_length);
+			if (bpf_cmp(cgroup_proc_length, <=, KILL_TARGET_LEN)) {
 				kill_data->kill_target_cgroup_proc_length = cgroup_proc_length;
 				payload += cgroup_proc_length;
 			}
@@ -731,9 +715,7 @@ int raw_tracepoint__sched_process_exec(struct bpf_raw_tracepoint_args* ctx)
 	const char* filename = BPF_CORE_READ(bprm, filename);
 	size_t bin_path_length =
 		bpf_probe_read_kernel_str(payload, MAX_FILENAME_LEN, filename);
-	barrier_var(bin_path_length);
-	if (bin_path_length <= MAX_FILENAME_LEN) {
-		barrier_var(bin_path_length);
+	if (bpf_cmp(bin_path_length, <=, MAX_FILENAME_LEN)) {
 		proc_exec_data->bin_path_length = bin_path_length;
 		payload += bin_path_length;
 	}
@@ -743,8 +725,7 @@ int raw_tracepoint__sched_process_exec(struct bpf_raw_tracepoint_args* ctx)
 	unsigned int cmdline_length = probe_read_lim(payload, arg_start,
 						     arg_end - arg_start, MAX_ARGS_LEN);
 
-	if (cmdline_length <= MAX_ARGS_LEN) {
-		barrier_var(cmdline_length);
+	if (bpf_cmp(cmdline_length, <=, MAX_ARGS_LEN)) {
 		proc_exec_data->cmdline_length = cmdline_length;
 		payload += cmdline_length;
 	}
@@ -820,10 +801,8 @@ int kprobe_ret__do_filp_open(struct pt_regs* ctx)
 					      filemod_data->payload);
 	payload = populate_cgroup_info(&filemod_data->cgroup_data, task, payload);
 
-	size_t len = read_absolute_file_path_from_dentry(filp_dentry, payload);
-	barrier_var(len);
-	if (len <= MAX_FILEPATH_LENGTH) {
-		barrier_var(len);
+	u32 len = read_absolute_file_path_from_dentry(filp_dentry, payload);
+	if (bpf_cmp(len, <=, MAX_FILEPATH_LENGTH)) {
 		payload += len;
 		filemod_data->dst_filepath_length = len;
 	}
@@ -876,17 +855,13 @@ int BPF_KPROBE(kprobe__vfs_link,
 	payload = populate_cgroup_info(&filemod_data->cgroup_data, task, payload);
 
 	size_t len = read_absolute_file_path_from_dentry(old_dentry, payload);
-	barrier_var(len);
-	if (len <= MAX_FILEPATH_LENGTH) {
-		barrier_var(len);
+	if (bpf_cmp(len, <=, MAX_FILEPATH_LENGTH)) {
 		payload += len;
 		filemod_data->src_filepath_length = len;
 	}
 
 	len = read_absolute_file_path_from_dentry(new_dentry, payload);
-	barrier_var(len);
-	if (len <= MAX_FILEPATH_LENGTH) {
-		barrier_var(len);
+	if (bpf_cmp(len, <=, MAX_FILEPATH_LENGTH)) {
 		payload += len;
 		filemod_data->dst_filepath_length = len;
 	}
@@ -936,16 +911,12 @@ int BPF_KPROBE(kprobe__vfs_symlink, struct inode* dir, struct dentry* dentry,
 
 	size_t len = bpf_probe_read_kernel_str(payload, MAX_FILEPATH_LENGTH,
 					       oldname);
-	barrier_var(len);
-	if (len <= MAX_FILEPATH_LENGTH) {
-		barrier_var(len);
+	if (bpf_cmp(len, <=, MAX_FILEPATH_LENGTH)) {
 		payload += len;
 		filemod_data->src_filepath_length = len;
 	}
 	len = read_absolute_file_path_from_dentry(dentry, payload);
-	barrier_var(len);
-	if (len <= MAX_FILEPATH_LENGTH) {
-		barrier_var(len);
+	if (bpf_cmp(len, <=, MAX_FILEPATH_LENGTH)) {
 		payload += len;
 		filemod_data->dst_filepath_length = len;
 	}
diff --git a/tools/testing/selftests/bpf/progs/profiler2.c b/tools/testing/selftests/bpf/progs/profiler2.c
index 0f32a3cbf556..2e1193cc4fae 100644
--- a/tools/testing/selftests/bpf/progs/profiler2.c
+++ b/tools/testing/selftests/bpf/progs/profiler2.c
@@ -3,4 +3,5 @@
 #define barrier_var(var) /**/
 /* undef #define UNROLL */
 #define INLINE /**/
+#define bpf_cmp(lhs, op, rhs) lhs op rhs
 #include "profiler.inc.h"
diff --git a/tools/testing/selftests/bpf/progs/profiler3.c b/tools/testing/selftests/bpf/progs/profiler3.c
index 6249fc31ccb0..bf08523c1744 100644
--- a/tools/testing/selftests/bpf/progs/profiler3.c
+++ b/tools/testing/selftests/bpf/progs/profiler3.c
@@ -3,4 +3,5 @@
 #define barrier_var(var) /**/
 #define UNROLL
 #define INLINE __noinline
+#define bpf_cmp(lhs, op, rhs) lhs op rhs
 #include "profiler.inc.h"
-- 
2.34.1



* Re: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2023-12-21  3:38 ` [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro Alexei Starovoitov
@ 2023-12-21  4:27   ` Kumar Kartikeya Dwivedi
  2023-12-22 22:59     ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-12-21  4:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: daniel, andrii, martin.lau, dxu, john.fastabend, bpf, kernel-team

On Thu, 21 Dec 2023 at 04:39, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Compilers optimize conditional operators at will, but often bpf programmers
> want to force compilers to keep the same operator in asm as it's written in C.
> Introduce bpf_cmp(var1, conditional_op, var2) macro that can be used as:
>
> -               if (seen >= 1000)
> +               if (bpf_cmp(seen, >=, 1000))
>
> The macro takes advantage of BPF assembly that is C like.
>
> The macro checks the sign of variable 'seen' and emits either
> signed or unsigned compare.
>
> For example:
> int a;
> bpf_cmp(a, >, 0) will be translated to 'if rX s> 0 goto' in BPF assembly.
>
> unsigned int a;
> bpf_cmp(a, >, 0) will be translated to 'if rX > 0 goto' in BPF assembly.
>
> C type conversions coupled with comparison operator are tricky.
>   int i = -1;
>   unsigned int j = 1;
>   if (i < j) // this is false.
>
>   long i = -1;
>   unsigned int j = 1;
>   if (i < j) // this is true.
>
> Make sure BPF program is compiled with -Wsign-compare then the macro will catch
> the mistake.
>
> The macro checks LHS (left hand side) only to figure out the sign of compare.
>
> 'if 0 < rX goto' is not allowed in the assembly, so the users
> have to use a variable on LHS anyway.
>
> The patch updates few tests to demonstrate the use of the macro.
>
> The macro allows to use BPF_JSET in C code, since LLVM doesn't generate it at
> present. For example:
>
> if (i & j) compiles into r0 &= r1; if r0 == 0 goto
>
> while
>
> if (bpf_cmp(i, &, j)) compiles into if r0 & r1 goto
>
> Note that the macro has to be careful with RHS assembly predicate.
> Since:
> u64 __rhs = 1ull << 42;
> asm goto("if r0 < %[rhs] goto +1" :: [rhs] "ri" (__rhs));
> LLVM will silently truncate 64-bit constant into s32 imm.
>
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

Sorry about the delayed feedback on this; I'm knocked out due to the flu.

The macro looks very useful and I like the simplification for the
assertion macros, +1, but I would just like to make sure one important
point about this is not missed.

When I was writing the _eq/lt/... variants I noticed that unless I use
an LHS that is a full 8-byte register, the compiler can still play
shenanigans even with inline assembly, like choosing a different input
register for LHS than the one already allocated for a variable before
the assembly is emitted, or doing left/right shifts to mask the upper
bits before the inline assembly logic. Since that breaks the scalar id
association, the cmp does not propagate to what are logical copies.

I noticed it often when, say, an 8-byte value (known to be small) was
saved as 'int len = something', asserted upon, and then used later in
the program.
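
A rough sketch of the kind of pattern I mean (names made up for
illustration):

  int len; /* 4-byte variable receiving an 8-byte helper return value */

  len = bpf_probe_read_kernel_str(payload, MAX_PATH, src);
  bpf_assert(bpf_cmp(len, <=, MAX_PATH));
  /* the compiler may emit a 32->64 bit copy or shifts to prepare the asm
   * input register, so the learned bound may not propagate to this later
   * use of 'len' */
  payload += len;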

One of the contracts of the bpf_assert_eq* macros compared to just
plain bpf_assert was that the former was supposed to guarantee the
assertion resulted in the verifier preserving the new knowledge for
the LHS variable.

So maybe, in addition to bpf_cmp, we should have a bpf_assert_op as a
replacement that enforces this rule of the LHS being 8 bytes (basically
just perform __bpf_assert_check) before dispatching to bpf_cmp, as
sketched below.
WDYT? Or just enforce that in bpf_cmp (but that might be too strict
for general use of bpf_cmp on its own, so it might be better to have a
separate macro).
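
Something along these lines (sketch only, reusing the existing
__bpf_assert_check):

  /* reuse the existing compile-time checks (8-byte lvalue LHS, constant RHS),
   * then emit the volatile compare
   */
  #define bpf_assert_op(LHS, OP, RHS)			\
  	({						\
  		__bpf_assert_check(LHS, OP, RHS);	\
  		bpf_assert(bpf_cmp(LHS, OP, RHS));	\
  	})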

Apart from that, consider
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
for the whole set.


* Re: [PATCH v2 bpf-next 0/5] bpf: volatile compare
  2023-12-21  3:38 [PATCH v2 bpf-next 0/5] bpf: volatile compare Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2023-12-21  3:38 ` [RFC PATCH v2 bpf-next 5/5] selftests/bpf: Attempt to convert profiler.c to bpf_cmp Alexei Starovoitov
@ 2023-12-21 18:01 ` Jiri Olsa
  5 siblings, 0 replies; 32+ messages in thread
From: Jiri Olsa @ 2023-12-21 18:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: daniel, andrii, martin.lau, dxu, memxor, john.fastabend, bpf,
	kernel-team

On Wed, Dec 20, 2023 at 07:38:49PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> v1->v2:
> Fixed issues pointed out by Daniel, added more tests, attempted to convert profiler.c,
> but barrier_var() wins vs bpf_cmp(). To be investigated.
> Patches 1-4 are good to go, but 5 needs more work.
> 
> Alexei Starovoitov (5):
>   selftests/bpf: Attempt to build BPF programs with -Wsign-compare
>   bpf: Introduce "volatile compare" macro
>   selftests/bpf: Convert exceptions_assert.c to bpf_cmp
>   selftests/bpf: Remove bpf_assert_eq-like macros.
>   selftests/bpf: Attempt to convert profiler.c to bpf_cmp.

lgtm, for patches 1-4:

Acked-by: Jiri Olsa <jolsa@kernel.org>

jirka

> 
>  tools/testing/selftests/bpf/Makefile          |   1 +
>  .../testing/selftests/bpf/bpf_experimental.h  | 194 ++++--------------
>  .../bpf/progs/bpf_iter_bpf_percpu_hash_map.c  |   2 +-
>  .../selftests/bpf/progs/bpf_iter_task_vmas.c  |   2 +-
>  .../selftests/bpf/progs/bpf_iter_tasks.c      |   2 +-
>  .../selftests/bpf/progs/bpf_iter_test_kern4.c |   2 +-
>  .../progs/cgroup_getset_retval_setsockopt.c   |   2 +-
>  .../selftests/bpf/progs/cgrp_ls_sleepable.c   |   2 +-
>  .../selftests/bpf/progs/cpumask_success.c     |   2 +-
>  .../testing/selftests/bpf/progs/exceptions.c  |  20 +-
>  .../selftests/bpf/progs/exceptions_assert.c   |  80 ++++----
>  tools/testing/selftests/bpf/progs/iters.c     |   4 +-
>  .../selftests/bpf/progs/iters_task_vma.c      |   3 +-
>  .../selftests/bpf/progs/linked_funcs1.c       |   2 +-
>  .../selftests/bpf/progs/linked_funcs2.c       |   2 +-
>  .../testing/selftests/bpf/progs/linked_list.c |   2 +-
>  .../selftests/bpf/progs/local_storage.c       |   2 +-
>  tools/testing/selftests/bpf/progs/lsm.c       |   2 +-
>  .../selftests/bpf/progs/normal_map_btf.c      |   2 +-
>  .../selftests/bpf/progs/profiler.inc.h        |  71 ++-----
>  tools/testing/selftests/bpf/progs/profiler2.c |   1 +
>  tools/testing/selftests/bpf/progs/profiler3.c |   1 +
>  .../selftests/bpf/progs/sockopt_inherit.c     |   2 +-
>  .../selftests/bpf/progs/sockopt_multi.c       |   2 +-
>  .../selftests/bpf/progs/sockopt_qos_to_cc.c   |   2 +-
>  .../testing/selftests/bpf/progs/test_bpf_ma.c |   2 +-
>  .../bpf/progs/test_core_reloc_kernel.c        |   2 +-
>  .../bpf/progs/test_core_reloc_module.c        |   8 +-
>  .../selftests/bpf/progs/test_fsverity.c       |   2 +-
>  .../bpf/progs/test_skc_to_unix_sock.c         |   2 +-
>  .../bpf/progs/test_xdp_do_redirect.c          |   2 +-
>  31 files changed, 146 insertions(+), 279 deletions(-)
> 
> -- 
> 2.34.1
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2023-12-21  4:27   ` Kumar Kartikeya Dwivedi
@ 2023-12-22 22:59     ` Alexei Starovoitov
  2023-12-25 20:33       ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2023-12-22 22:59 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Daniel Xu,
	John Fastabend, bpf, Kernel Team

On Wed, Dec 20, 2023 at 8:28 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> When I was writing the _eq/lt/... variants I noticed that unless I use
> LHS to be a full 8-byte register the compiler can still play
> shenanigans even with inline assembly, like choosing a different input
> register for LHS than the one already allocated for a variable before
> the assembly is emitted, doing left/right shifts to mask upper bits
> before the inline assembly logic, and thus since the scalar id
> association is broken on that, the cmp does not propagate to what are
> logical copies.

I saw that llvm can add a redundant r1 = w2 and use r1 as LHS in some cases,
but I haven't seen extra shifts.
When it's an assignment like this, the register link is there and
the verifier correctly propagates the range knowledge.
Could you share an example where you saw shifts?

I also don't understand the point of:
_Static_assert(__builtin_constant_p((RHS)), "2nd argument must be a
constant expression")
in your macro.
bpf_assert(bpf_cmp(foo, ==, bar));
seems useful too. Restricting RHS to a constant looks odd.

> So maybe in addition to using bpf_cmp, we should instead do a
> bpf_assert_op as a replacement that enforces this rule of LHS being 8
> byte (basically just perform __bpf_assert_check) before dispatching to
> the bpf_cmp.

I don't see the need to restrict LHS to 8 bytes.

I considered making the bpf_cmp macro smarter and using "w" registers
when sizeof(lhs) <= 4, but I suspect it won't help the verifier much,
since 32-bit compares update the lower 32 bits only and the range of the
upper 32 bits may stay unknown (depending on what was in the code before the 'if').
So I kept "r" only.

I still need to investigate why barrier_var is better in profiler.c.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2023-12-22 22:59     ` Alexei Starovoitov
@ 2023-12-25 20:33       ` Alexei Starovoitov
  2024-01-04 20:06         ` Eduard Zingerman
  2024-01-05 21:47         ` Eduard Zingerman
  0 siblings, 2 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2023-12-25 20:33 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, Yonghong Song, Eddy Z
  Cc: Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Daniel Xu,
	John Fastabend, bpf, Kernel Team

On Fri, Dec 22, 2023 at 2:59 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Dec 20, 2023 at 8:28 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > When I was writing the _eq/lt/... variants I noticed that unless I use
> > LHS to be a full 8-byte register the compiler can still play
> > shenanigans even with inline assembly, like choosing a different input
> > register for LHS than the one already allocated for a variable before
> > the assembly is emitted, doing left/right shifts to mask upper bits
> > before the inline assembly logic, and thus since the scalar id
> > association is broken on that, the cmp does not propagate to what are
> > logical copies.
>
> I saw that llvm can add a redundant r1 = w2 and use r1 as LHS in some cases,
> but I haven't seen extra shifts.

It turned out there are indeed a bunch of redundant shifts
when u32 or s32 is passed into "r" asm constraint.

Strangely the shifts are there when compiled with -mcpu=v3 or v4
and no shifts with -mcpu=v1 and v2.

Also weird that u8 and u16 are passed into "r" without redundant shifts.
Hence I found a "workaround": cast u32 into u16 while passing.
The truncation of the u32 doesn't happen, and the shifts that zero the
upper 32 bits are gone as well.

https://godbolt.org/z/Kqszr6q3v

Another issue is llvm removes asm() completely when output "+r"
constraint is used. It has to be asm volatile to convince llvm
to keep that asm block. That's bad1() case.
asm() stays in place when input "r" constraint is used.
That's bad2().
asm() removal happens with x86 backend too. So maybe it's a feature?

I was worried whether:
asm("mov %[reg],%[reg]"::[reg]"r"((short)var));
is buggy too. Meaning whether the compiler should truncate before
allocating an input register.
It turns out it's not.
At least there is no truncation on x86 and on arm64.

https://godbolt.org/z/ovMdPxnj5

So I'm planning to unconditionally use "(short)var" workaround in bpf_cmp.
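
For illustration, the cast applied to a plain asm operand looks like
this (just the pattern, not the actual bpf_cmp body):

    /* Per the godbolt link above: with the (short) cast no truncation
     * is emitted and the <<32 / >>32 pair disappears.
     */
    void use_u32_in_asm(unsigned int v)
    {
            asm volatile("%[reg] += 1" :: [reg]"r"((short)v));
    }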

Anyway, none of that is urgent.
Yonghong or Eduard,
Please take a look at redundant shifts issue with mcpu=v3 after the holidays.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2023-12-25 20:33       ` Alexei Starovoitov
@ 2024-01-04 20:06         ` Eduard Zingerman
  2024-01-04 21:02           ` Alexei Starovoitov
  2024-01-05 21:47         ` Eduard Zingerman
  1 sibling, 1 reply; 32+ messages in thread
From: Eduard Zingerman @ 2024-01-04 20:06 UTC (permalink / raw)
  To: Alexei Starovoitov, Kumar Kartikeya Dwivedi, Yonghong Song
  Cc: Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Daniel Xu,
	John Fastabend, bpf, Kernel Team

On Mon, 2023-12-25 at 12:33 -0800, Alexei Starovoitov wrote:
[...]

Regarding disappearing asm blocks.

> https://godbolt.org/z/Kqszr6q3v
> 
> Another issue is llvm removes asm() completely when output "+r"
> constraint is used. It has to be asm volatile to convince llvm
> to keep that asm block. That's bad1() case.
> asm() stays in place when input "r" constraint is used.
> That's bad2().
> asm() removal happens with x86 backend too. So maybe it's a feature?

The difference between bad1() and bad2() is small:

                             .---- output operand
                             v
bad1:    asm("%[reg] += 1":[reg]"+r"(var));
bad2:    asm("%[reg] += 1"::[reg]"r"(var));
                              ^
                              '--- input operand
                              
This leads to different IR generation at the beginning of translation:

  %1 = call i32 asm "$0 += 1", "=r,0"(i32 %0)

vs.

  call void asm sideeffect "$0 += 1", "r"(i32 %0)

Whether or not to add the "sideeffect" property to an asm call seems to be
controlled by the following condition in CodeGenFunction::EmitAsmStmt():

  bool HasSideEffect = S.isVolatile() || S.getNumOutputs() == 0;

See [1].
This is documented in [2] (assuming that clang and gcc are compatible):

>  asm statements that have no output operands and asm goto statements,
>  are implicitly volatile.

An asm block w/o sideeffect whose output value is not used
is removed from the selection DAG at early stages of instruction generation.
If "bad1" is modified to use "var" after the asm block (e.g. return it),
the asm block is not removed.
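
For example (a variant of bad1() from the godbolt link; the good1 name
is just for illustration):

    /* The asm is not volatile, but its output feeds the return value,
     * so it is not dropped from the selection DAG.
     */
    int good1(int var)
    {
            asm("%[reg] += 1" : [reg]"+r"(var));
            return var;
    }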

So, this looks like regular clang behavior, not specific to BPF target.

[1] https://github.com/llvm/llvm-project/blob/166bd4e1f18da221621953bd5943c1a8d17201fe/clang/lib/CodeGen/CGStmt.cpp#L2843
[2]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-04 20:06         ` Eduard Zingerman
@ 2024-01-04 21:02           ` Alexei Starovoitov
  0 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2024-01-04 21:02 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Daniel Xu, John Fastabend,
	bpf, Kernel Team

On Thu, Jan 4, 2024 at 12:06 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Mon, 2023-12-25 at 12:33 -0800, Alexei Starovoitov wrote:
> [...]
>
> Regarding disappearing asm blocks.
>
> > https://godbolt.org/z/Kqszr6q3v
> >
> > Another issue is llvm removes asm() completely when output "+r"
> > constraint is used. It has to be asm volatile to convince llvm
> > to keep that asm block. That's bad1() case.
> > asm() stays in place when input "r" constraint is used.
> > That's bad2().
> > asm() removal happens with x86 backend too. So maybe it's a feature?
>
> The difference between bad1() and bad2() is small:
>
>                              .---- output operand
>                              v
> bad1:    asm("%[reg] += 1":[reg]"+r"(var));
> bad2:    asm("%[reg] += 1"::[reg]"r"(var));
>                               ^
>                               '--- input operand
>
> This leads to different IR generation at the beginning of translation:
>
>   %1 = call i32 asm "$0 += 1", "=r,0"(i32 %0)
>
> vs.
>
>   call void asm sideeffect "$0 += 1", "r"(i32 %0)
>
> Whether or not to add "sideeffect" property to asm call seem to be
> controlled by the following condition in CodeGenFunction::EmitAsmStmt():
>
>   bool HasSideEffect = S.isVolatile() || S.getNumOutputs() == 0;
>
> See [1].
> This is documented in [2] (assuming that clang and gcc are compatible):
>
> >  asm statements that have no output operands and asm goto statements,
> >  are implicitly volatile.
>
> Asm block w/o sideeffect, output value of which is not used,
> is removed from selection DAG at early stages of instruction generation.
> If "bad1" is modified to use "var" after asm block (e.g. return it)
> the asm block is not removed.
>
> So, this looks like regular clang behavior, not specific to BPF target.

Wow. Thanks for those details.
I didn't know that getNumOutputs() == 0 is equivalent to volatile
in that sense. Sounds like we should always add volatile to
avoid surprises.
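
I.e. something along these lines (a minimal sketch of that convention,
reusing the bad1() snippet from the godbolt link):

    /* Keep the output constraint, but mark the asm volatile so it
     * survives even if the result ends up unused.
     */
    void bump(int var)
    {
            asm volatile("%[reg] += 1" : [reg]"+r"(var));
    }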

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2023-12-25 20:33       ` Alexei Starovoitov
  2024-01-04 20:06         ` Eduard Zingerman
@ 2024-01-05 21:47         ` Eduard Zingerman
  2024-01-08 21:21           ` Yonghong Song
  2024-01-08 21:33           ` asm register constraint. Was: " Alexei Starovoitov
  1 sibling, 2 replies; 32+ messages in thread
From: Eduard Zingerman @ 2024-01-05 21:47 UTC (permalink / raw)
  To: Alexei Starovoitov, Kumar Kartikeya Dwivedi, Yonghong Song
  Cc: Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Daniel Xu,
	John Fastabend, bpf, Kernel Team

On Mon, 2023-12-25 at 12:33 -0800, Alexei Starovoitov wrote:
[...]
> It turned out there are indeed a bunch of redundant shifts
> when u32 or s32 is passed into "r" asm constraint.
> 
> Strangely the shifts are there when compiled with -mcpu=v3 or v4
> and no shifts with -mcpu=v1 and v2.
> 
> Also weird that u8 and u16 are passed into "r" without redundant shifts.
> Hence I found a "workaround": cast u32 into u16 while passing.
> The truncation of u32 doesn't happen and shifts to zero upper 32-bit
> are gone as well.
> 
> https://godbolt.org/z/Kqszr6q3v

Regarding unnecessary shifts.
Sorry, a long email about a minor feature/defect.

So, currently the following C program
(and its variations with implicit casts):

    extern unsigned long bar(void);
    void foo(void) {
      asm volatile ("%[reg] += 1"::[reg]"r"((unsigned)bar()));
    }

Is translated to the following BPF:

    $ clang -mcpu=v3 -O2 --target=bpf -mcpu=v3 -c -o - t.c | llvm-objdump --no-show-raw-insn -d -
    
    <stdin>:	file format elf64-bpf
    
    Disassembly of section .text:
    
    0000000000000000 <foo>:
           0:	call -0x1
           1:	r0 <<= 0x20
           2:	r0 >>= 0x20
           3:	r0 += 0x1
           4:	exit
           
Note: no additional shifts are generated when "w" (32-bit register)
      constraint is used instead of "r".

First, is this right or wrong?
------------------------------

C language spec [1] paragraph 6.5.4.6 (Cast operators -> Semantics) says
the following:

  If the value of the expression is represented with greater range or
  precision than required by the type named by the cast (6.3.1.8),
  then the cast specifies a conversion even if the type of the
  expression is the same as the named type and removes any extra range
  and precision.                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ^^^^^^^^^^^^^
  
What other LLVM backends do in such situations?
Consider the following program translated to amd64 [2] and aarch64 [3]:

    void foo(void) {
      asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned long)  bar())); // 1
      asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned int)   bar())); // 2
      asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned short) bar())); // 3
    }

- for amd64 register of proper size is selected for `reg`;
- for aarch64 warnings about wrong operand size are emitted at (2) and (3)
  and 64-bit register is used w/o generating any additional instructions.

(Note, however, that 'arm' silently ignores the issue and uses 32-bit
 registers for all three points).

So, it looks like something of this sort should be done:
- either extra precision should be removed via additional instructions;
- or 32-bit register should be picked for `reg`;
- or warning should be emitted as in aarch64 case.

[1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3088.pdf
[2] https://godbolt.org/z/9nKxaMc5j
[3] https://godbolt.org/z/1zxEr5b3f


Second, what to do?
-------------------

I think that the following steps are needed:
- Investigation described in the next section shows that currently two
  shifts are generated accidentally w/o real intent to shed precision.
  I have a patch [6] that removes shifts generation, it should be applied.
- When 32-bit value is passed to "r" constraint:
  - for cpu v3/v4 a 32-bit register should be selected;
  - for cpu v1/v2 a warning should be reported.


Third, why two shifts are generated?
------------------------------------

(Details here might be interesting to Yonghong, regular reader could
 skip this section).

The two shifts are result of interaction between two IR constructs
`trunc` and `asm`. The C program above looks as follows in LLVM IR
before machine code generation:

    declare dso_local i64 @bar()
    define dso_local void @foo(i32 %p) {
    entry:
      %call = call i64 @bar()
      %v32 = trunc i64 %call to i32
      tail call void asm sideeffect "$0 += 1", "r"(i32 %v32)
      ret void
    }

Initial selection DAG:

    $ llc -debug-only=isel -march=bpf -mcpu=v3 --filetype=asm -o - t.ll
    SelectionDAG has 21 nodes:
      ...
      t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
   !     t11: i32 = truncate t10
   !    t15: i64 = zero_extend t11
      t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t15
        t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
                         TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
      ...

Note values t11 and t15 marked with (!).

Optimized lowered selection DAG for this fragment:

    t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
  !   t22: i64 = and t10, Constant:i64<4294967295>
    t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
      t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
                       TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1

Note (zext (truncate ...)) converted to (and ... 0xffff_ffff).

DAG after instruction selection:

    t10: i64,ch,glue = CopyFromReg t8:1, Register:i64 $r0, t8:2
  !     t25: i64 = SLL_ri t10, TargetConstant:i64<32>
  !   t22: i64 = SRL_ri t25, TargetConstant:i64<32>
    t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
      t23: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
                       TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1

Note (and ... 0xffff_ffff) converted to (SRL_ri (SLL_ri ...)).
This happens because of the following pattern from BPFInstrInfo.td:

    // 0xffffFFFF doesn't fit into simm32, optimize common case
    def : Pat<(i64 (and (i64 GPR:$src), 0xffffFFFF)),
              (SRL_ri (SLL_ri (i64 GPR:$src), 32), 32)>;

So, the two shift instructions are result of translation of (zext (trunc ...)).
However, closer examination shows that zext DAG node was generated
almost by accident. Here is the backtrace for when this node was created:

    Breakpoint 1, llvm::SelectionDAG::getNode (... Opcode=202) ;; 202 is opcode for ZERO_EXTEND
        at .../SelectionDAG.cpp:5605
    (gdb) bt
    #0  llvm::SelectionDAG::getNode (...)
        at ...SelectionDAG.cpp:5605
    #1  0x... in getCopyToParts (..., ExtendKind=llvm::ISD::ZERO_EXTEND)
        at .../SelectionDAGBuilder.cpp:537
    #2  0x... in llvm::RegsForValue::getCopyToRegs (... PreferredExtendType=llvm::ISD::ANY_EXTEND)
        at .../SelectionDAGBuilder.cpp:958
    #3  0x... in llvm::SelectionDAGBuilder::visitInlineAsm(...)
        at .../SelectionDAGBuilder.cpp:9640
        ...

The stack frame #2 is interesting, here is the code for it [4]:

    void RegsForValue::getCopyToRegs(SDValue Val, SelectionDAG &DAG,
                                     const SDLoc &dl, SDValue &Chain, SDValue *Glue,
                                     const Value *V,
                                     ISD::NodeType PreferredExtendType) const {
                                                   ^
                                                   '-- this is ANY_EXTEND
      ...
      for (unsigned Value = 0, Part = 0, e = ValueVTs.size(); Value != e; ++Value) {
        ...
                                                   .-- this returns true
                                                   v
        if (ExtendKind == ISD::ANY_EXTEND && TLI.isZExtFree(Val, RegisterVT))
          ExtendKind = ISD::ZERO_EXTEND;
    
                               .-- this is ZERO_EXTEND
                               v
        getCopyToParts(..., ExtendKind);
        Part += NumParts;
      }
      ...
    }

The getCopyToRegs() function was called with ANY_EXTEND preference,
but switched to ZERO_EXTEND because TLI.isZExtFree() currently returns
true for any 32 to 64-bit conversion [5].
However, in this case this is clearly a mistake, as zero extension of 
(zext i64 (truncate i32 ...)) costs two instructions.

The isZExtFree() behavior could be changed to report false for such
situations, as in my patch [6]. This in turn removes zext =>
removes two shifts from final asm.
Here is how DAG/asm look after patch [6]:

    Initial selection DAG:
      ...
      t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
  !   t11: i32 = truncate t10
      t16: ch,glue = CopyToReg t10:1, Register:i64 %1, t10
        t18: ch,glue = inlineasm t16, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
                         TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t16:1
      ...
    
Final asm:

    ...
    # %bb.0:
    	call bar
    	#APP
    	r0 += 1
    	#NO_APP
    	exit
    ...
    
Note that [6] is a very minor change: it does not affect code
generation for selftests at all, and I was unable to conjure examples
where it has an effect aside from inline asm parameters.

[4] https://github.com/llvm/llvm-project/blob/365fbbfbcfefb8766f7716109b9c3767b58e6058/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L937C10-L937C10
[5] https://github.com/llvm/llvm-project/blob/6f4cc1310b12bc59210e4596a895db4cb9ad6075/llvm/lib/Target/BPF/BPFISelLowering.cpp#L213C21-L213C21
[6] https://github.com/llvm/llvm-project/commit/cf8e142e5eac089cc786c671a40fef022d08b0ef


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-05 21:47         ` Eduard Zingerman
@ 2024-01-08 21:21           ` Yonghong Song
  2024-01-08 23:16             ` Eduard Zingerman
  2024-01-08 21:33           ` asm register constraint. Was: " Alexei Starovoitov
  1 sibling, 1 reply; 32+ messages in thread
From: Yonghong Song @ 2024-01-08 21:21 UTC (permalink / raw)
  To: Eduard Zingerman, Alexei Starovoitov, Kumar Kartikeya Dwivedi
  Cc: Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Daniel Xu,
	John Fastabend, bpf, Kernel Team


On 1/5/24 1:47 PM, Eduard Zingerman wrote:
> On Mon, 2023-12-25 at 12:33 -0800, Alexei Starovoitov wrote:
> [...]
>> It turned out there are indeed a bunch of redundant shifts
>> when u32 or s32 is passed into "r" asm constraint.
>>
>> Strangely the shifts are there when compiled with -mcpu=v3 or v4
>> and no shifts with -mcpu=v1 and v2.
>>
>> Also weird that u8 and u16 are passed into "r" without redundant shifts.
>> Hence I found a "workaround": cast u32 into u16 while passing.
>> The truncation of u32 doesn't happen and shifts to zero upper 32-bit
>> are gone as well.
>>
>> https://godbolt.org/z/Kqszr6q3v
> Regarding unnecessary shifts.
> Sorry, a long email about minor feature/defect.
>
> So, currently the following C program
> (and it's variations with implicit casts):
>
>      extern unsigned long bar(void);
>      void foo(void) {
>        asm volatile ("%[reg] += 1"::[reg]"r"((unsigned)bar()));
>      }
>
> Is translated to the following BPF:
>
>      $ clang -mcpu=v3 -O2 --target=bpf -mcpu=v3 -c -o - t.c | llvm-objdump --no-show-raw-insn -d -
>      
>      <stdin>:	file format elf64-bpf
>      
>      Disassembly of section .text:
>      
>      0000000000000000 <foo>:
>             0:	call -0x1
>             1:	r0 <<= 0x20
>             2:	r0 >>= 0x20
>             3:	r0 += 0x1
>             4:	exit
>             
> Note: no additional shifts are generated when "w" (32-bit register)
>        constraint is used instead of "r".
>
> First, is this right or wrong?
> ------------------------------
>
> C language spec [1] paragraph 6.5.4.6 (Cast operators -> Semantics) says
> the following:
>
>    If the value of the expression is represented with greater range or
>    precision than required by the type named by the cast (6.3.1.8),
>    then the cast specifies a conversion even if the type of the
>    expression is the same as the named type and removes any extra range
>    and precision.                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>    ^^^^^^^^^^^^^
>    
> What other LLVM backends do in such situations?
> Consider the following program translated to amd64 [2] and aarch64 [3]:
>
>      void foo(void) {
>        asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned long)  bar())); // 1
>        asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned int)   bar())); // 2
>        asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned short) bar())); // 3
>      }
>
> - for amd64 register of proper size is selected for `reg`;
> - for aarch64 warnings about wrong operand size are emitted at (2) and (3)
>    and 64-bit register is used w/o generating any additional instructions.
>
> (Note, however, that 'arm' silently ignores the issue and uses 32-bit
>   registers for all three points).
>
> So, it looks like that something of this sort should be done:
> - either extra precision should be removed via additional instructions;
> - or 32-bit register should be picked for `reg`;
> - or warning should be emitted as in aarch64 case.
>
> [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3088.pdf
> [2] https://godbolt.org/z/9nKxaMc5j
> [3] https://godbolt.org/z/1zxEr5b3f
>
>
> Second, what to do?
> -------------------
>
> I think that the following steps are needed:
> - Investigation described in the next section shows that currently two
>    shifts are generated accidentally w/o real intent to shed precision.
>    I have a patch [6] that removes shifts generation, it should be applied.
> - When 32-bit value is passed to "r" constraint:
>    - for cpu v3/v4 a 32-bit register should be selected;
>    - for cpu v1/v2 a warning should be reported.
>
>
> Third, why two shifts are generated?
> ------------------------------------
>
> (Details here might be interesting to Yonghong, regular reader could
>   skip this section).
>
> The two shifts are result of interaction between two IR constructs
> `trunc` and `asm`. The C program above looks as follows in LLVM IR
> before machine code generation:
>
>      declare dso_local i64 @bar()
>      define dso_local void @foo(i32 %p) {
>      entry:
>        %call = call i64 @bar()
>        %v32 = trunc i64 %call to i32
>        tail call void asm sideeffect "$0 += 1", "r"(i32 %v32)
>        ret void
>      }
>
> Initial selection DAG:
>
>      $ llc -debug-only=isel -march=bpf -mcpu=v3 --filetype=asm -o - t.ll
>      SelectionDAG has 21 nodes:
>        ...
>        t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>     !     t11: i32 = truncate t10
>     !    t15: i64 = zero_extend t11
>        t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t15
>          t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>                           TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>        ...
>
> Note values t11 and t15 marked with (!).
>
> Optimized lowered selection DAG for this fragment:
>
>      t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>    !   t22: i64 = and t10, Constant:i64<4294967295>
>      t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>        t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>                         TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>
> Note (zext (truncate ...)) converted to (and ... 0xffff_ffff).
>
> DAG after instruction selection:
>
>      t10: i64,ch,glue = CopyFromReg t8:1, Register:i64 $r0, t8:2
>    !     t25: i64 = SLL_ri t10, TargetConstant:i64<32>
>    !   t22: i64 = SRL_ri t25, TargetConstant:i64<32>
>      t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>        t23: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>                         TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>
> Note (and ... 0xffff_ffff) converted to (SRL_ri (SLL_ri ...)).
> This happens because of the following pattern from BPFInstrInfo.td:
>
>      // 0xffffFFFF doesn't fit into simm32, optimize common case
>      def : Pat<(i64 (and (i64 GPR:$src), 0xffffFFFF)),
>                (SRL_ri (SLL_ri (i64 GPR:$src), 32), 32)>;
>
> So, the two shift instructions are result of translation of (zext (trunc ...)).
> However, closer examination shows that zext DAG node was generated
> almost by accident. Here is the backtrace for when this node was created:
>
>      Breakpoint 1, llvm::SelectionDAG::getNode (... Opcode=202) ;; 202 is opcode for ZERO_EXTEND
>          at .../SelectionDAG.cpp:5605
>      (gdb) bt
>      #0  llvm::SelectionDAG::getNode (...)
>          at ...SelectionDAG.cpp:5605
>      #1  0x... in getCopyToParts (..., ExtendKind=llvm::ISD::ZERO_EXTEND)
>          at .../SelectionDAGBuilder.cpp:537
>      #2  0x... in llvm::RegsForValue::getCopyToRegs (... PreferredExtendType=llvm::ISD::ANY_EXTEND)
>          at .../SelectionDAGBuilder.cpp:958
>      #3  0x... in llvm::SelectionDAGBuilder::visitInlineAsm(...)
>          at .../SelectionDAGBuilder.cpp:9640
>          ...
>
> The stack frame #2 is interesting, here is the code for it [4]:
>
>      void RegsForValue::getCopyToRegs(SDValue Val, SelectionDAG &DAG,
>                                       const SDLoc &dl, SDValue &Chain, SDValue *Glue,
>                                       const Value *V,
>                                       ISD::NodeType PreferredExtendType) const {
>                                                     ^
>                                                     '-- this is ANY_EXTEND
>        ...
>        for (unsigned Value = 0, Part = 0, e = ValueVTs.size(); Value != e; ++Value) {
>          ...
>                                                     .-- this returns true
>                                                     v
>          if (ExtendKind == ISD::ANY_EXTEND && TLI.isZExtFree(Val, RegisterVT))
>            ExtendKind = ISD::ZERO_EXTEND;
>      
>                                 .-- this is ZERO_EXTEND
>                                 v
>          getCopyToParts(..., ExtendKind);
>          Part += NumParts;
>        }
>        ...
>      }
>
> The getCopyToRegs() function was called with ANY_EXTEND preference,
> but switched to ZERO_EXTEND because TLI.isZExtFree() currently returns
> true for any 32 to 64-bit conversion [5].
> However, in this case this is clearly a mistake, as zero extension of
> (zext i64 (truncate i32 ...)) costs two instructions.
>
> The isZExtFree() behavior could be changed to report false for such
> situations, as in my patch [6]. This in turn removes zext =>
> removes two shifts from final asm.
> Here is how DAG/asm look after patch [6]:
>
>      Initial selection DAG:
>        ...
>        t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>    !   t11: i32 = truncate t10
>        t16: ch,glue = CopyToReg t10:1, Register:i64 %1, t10
>          t18: ch,glue = inlineasm t16, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>                           TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t16:1
>        ...
>      
> Final asm:
>
>      ...
>      # %bb.0:
>      	call bar
>      	#APP
>      	r0 += 1
>      	#NO_APP
>      	exit
>      ...

Thanks for the detailed analysis! Previously we intended to do the following:

- When 32-bit value is passed to "r" constraint:
   - for cpu v3/v4 a 32-bit register should be selected;
   - for cpu v1/v2 a warning should be reported.

So in the above, the desired asm code should be

     ...
     # %bb.0:
     	call bar
     	#APP
     	w0 += 1
     	#NO_APP
     	exit
     ...

for cpuv3/cpuv4. I guess some more work in llvm is needed
to achieve that.

On the other hand, for cpuv3/v4, for regular C code,
I think the compiler might already be omitting the conversion and
using the w register. So I am not sure whether the patch [6]
is needed or not. Could you double check?

>      
> Note that [6] is a very minor change, it does not affect code
> generation for selftests at all and I was unable to conjure examples
> where it has effect aside from inline asm parameters.
>
> [4] https://github.com/llvm/llvm-project/blob/365fbbfbcfefb8766f7716109b9c3767b58e6058/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L937C10-L937C10
> [5] https://github.com/llvm/llvm-project/blob/6f4cc1310b12bc59210e4596a895db4cb9ad6075/llvm/lib/Target/BPF/BPFISelLowering.cpp#L213C21-L213C21
> [6] https://github.com/llvm/llvm-project/commit/cf8e142e5eac089cc786c671a40fef022d08b0ef
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-05 21:47         ` Eduard Zingerman
  2024-01-08 21:21           ` Yonghong Song
@ 2024-01-08 21:33           ` Alexei Starovoitov
  2024-01-08 23:22             ` Yonghong Song
                               ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2024-01-08 21:33 UTC (permalink / raw)
  To: Eduard Zingerman, Jose E. Marchesi, Jose E. Marchesi
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Daniel Xu, John Fastabend,
	bpf, Kernel Team

On Fri, Jan 5, 2024 at 1:47 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Mon, 2023-12-25 at 12:33 -0800, Alexei Starovoitov wrote:
> [...]
> > It turned out there are indeed a bunch of redundant shifts
> > when u32 or s32 is passed into "r" asm constraint.
> >
> > Strangely the shifts are there when compiled with -mcpu=v3 or v4
> > and no shifts with -mcpu=v1 and v2.
> >
> > Also weird that u8 and u16 are passed into "r" without redundant shifts.
> > Hence I found a "workaround": cast u32 into u16 while passing.
> > The truncation of u32 doesn't happen and shifts to zero upper 32-bit
> > are gone as well.
> >
> > https://godbolt.org/z/Kqszr6q3v
>
> Regarding unnecessary shifts.
> Sorry, a long email about minor feature/defect.
>
> So, currently the following C program
> (and it's variations with implicit casts):
>
>     extern unsigned long bar(void);
>     void foo(void) {
>       asm volatile ("%[reg] += 1"::[reg]"r"((unsigned)bar()));
>     }
>
> Is translated to the following BPF:
>
>     $ clang -mcpu=v3 -O2 --target=bpf -mcpu=v3 -c -o - t.c | llvm-objdump --no-show-raw-insn -d -
>
>     <stdin>:    file format elf64-bpf
>
>     Disassembly of section .text:
>
>     0000000000000000 <foo>:
>            0:   call -0x1
>            1:   r0 <<= 0x20
>            2:   r0 >>= 0x20
>            3:   r0 += 0x1
>            4:   exit
>
> Note: no additional shifts are generated when "w" (32-bit register)
>       constraint is used instead of "r".
>
> First, is this right or wrong?
> ------------------------------
>
> C language spec [1] paragraph 6.5.4.6 (Cast operators -> Semantics) says
> the following:
>
>   If the value of the expression is represented with greater range or
>   precision than required by the type named by the cast (6.3.1.8),
>   then the cast specifies a conversion even if the type of the
>   expression is the same as the named type and removes any extra range
>   and precision.                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   ^^^^^^^^^^^^^
>
> What other LLVM backends do in such situations?
> Consider the following program translated to amd64 [2] and aarch64 [3]:
>
>     void foo(void) {
>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned long)  bar())); // 1
>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned int)   bar())); // 2
>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned short) bar())); // 3
>     }
>
> - for amd64 register of proper size is selected for `reg`;
> - for aarch64 warnings about wrong operand size are emitted at (2) and (3)
>   and 64-bit register is used w/o generating any additional instructions.
>
> (Note, however, that 'arm' silently ignores the issue and uses 32-bit
>  registers for all three points).
>
> So, it looks like that something of this sort should be done:
> - either extra precision should be removed via additional instructions;
> - or 32-bit register should be picked for `reg`;
> - or warning should be emitted as in aarch64 case.
>
> [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3088.pdf
> [2] https://godbolt.org/z/9nKxaMc5j
> [3] https://godbolt.org/z/1zxEr5b3f
>
>
> Second, what to do?
> -------------------
>
> I think that the following steps are needed:
> - Investigation described in the next section shows that currently two
>   shifts are generated accidentally w/o real intent to shed precision.
>   I have a patch [6] that removes shifts generation, it should be applied.
> - When 32-bit value is passed to "r" constraint:
>   - for cpu v3/v4 a 32-bit register should be selected;
>   - for cpu v1/v2 a warning should be reported.

Thank you for the detailed analysis.

Agree that the llvm fix [6] is a necessary step; then
using 'w' in v3/v4 and warning on v1/v2 makes sense too,
but we have this macro:
#define barrier_var(var) asm volatile("" : "+r"(var))
that is used in various bpf production programs.
tetragon is also a heavy user of inline asm.

Right now a full 64-bit register is allocated,
so switching to 'w' might cause unexpected behavior
including rejection by the verifier.
I think it makes sense to align the bpf backend with arm64 and x86,
but we need to broadcast this change widely.

Also need to align with GCC. (Jose cc-ed)

And, most importantly, we need a way to go back to the old behavior,
since u32 var; asm("...":: "r"(var)); will now
allocate a "w" register or warn.

What should be the workaround?

I've tried:
u32 var; asm("...":: "r"((u64)var));

https://godbolt.org/z/n4ejvWj7v

and x86/arm64 generate a 32-bit truncation.
Sounds like the bpf backend has to do it as well.
We should be doing 'wX=wX' in such case (just like x86)
instead of <<=32 >>=32.

I think this should be done as a separate diff.
Our current pattern of using shifts is inefficient and guaranteed
to screw up verifier range analysis, while wX=wX is faster
and more verifier-friendly.
Yes, it's still not going to be 1-1 with the old (our current) behavior.

We probably need some macro (like we did for __BPF_CPU_VERSION__)
to identify such a fixed llvm, so existing users with the '(short)'
workaround and other tricks can detect new vs old compilers.

Looks like we opened a can of worms.
Aligning with x86/arm64 makes sense, but the cost of doing
the right thing is hard to estimate.

>
> Third, why two shifts are generated?
> ------------------------------------
>
> (Details here might be interesting to Yonghong, regular reader could
>  skip this section).
>
> The two shifts are result of interaction between two IR constructs
> `trunc` and `asm`. The C program above looks as follows in LLVM IR
> before machine code generation:
>
>     declare dso_local i64 @bar()
>     define dso_local void @foo(i32 %p) {
>     entry:
>       %call = call i64 @bar()
>       %v32 = trunc i64 %call to i32
>       tail call void asm sideeffect "$0 += 1", "r"(i32 %v32)
>       ret void
>     }
>
> Initial selection DAG:
>
>     $ llc -debug-only=isel -march=bpf -mcpu=v3 --filetype=asm -o - t.ll
>     SelectionDAG has 21 nodes:
>       ...
>       t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>    !     t11: i32 = truncate t10
>    !    t15: i64 = zero_extend t11
>       t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t15
>         t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>                          TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>       ...
>
> Note values t11 and t15 marked with (!).
>
> Optimized lowered selection DAG for this fragment:
>
>     t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>   !   t22: i64 = and t10, Constant:i64<4294967295>
>     t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>       t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>                        TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>
> Note (zext (truncate ...)) converted to (and ... 0xffff_ffff).
>
> DAG after instruction selection:
>
>     t10: i64,ch,glue = CopyFromReg t8:1, Register:i64 $r0, t8:2
>   !     t25: i64 = SLL_ri t10, TargetConstant:i64<32>
>   !   t22: i64 = SRL_ri t25, TargetConstant:i64<32>
>     t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>       t23: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>                        TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>
> Note (and ... 0xffff_ffff) converted to (SRL_ri (SLL_ri ...)).
> This happens because of the following pattern from BPFInstrInfo.td:
>
>     // 0xffffFFFF doesn't fit into simm32, optimize common case
>     def : Pat<(i64 (and (i64 GPR:$src), 0xffffFFFF)),
>               (SRL_ri (SLL_ri (i64 GPR:$src), 32), 32)>;
>
> So, the two shift instructions are result of translation of (zext (trunc ...)).
> However, closer examination shows that zext DAG node was generated
> almost by accident. Here is the backtrace for when this node was created:
>
>     Breakpoint 1, llvm::SelectionDAG::getNode (... Opcode=202) ;; 202 is opcode for ZERO_EXTEND
>         at .../SelectionDAG.cpp:5605
>     (gdb) bt
>     #0  llvm::SelectionDAG::getNode (...)
>         at ...SelectionDAG.cpp:5605
>     #1  0x... in getCopyToParts (..., ExtendKind=llvm::ISD::ZERO_EXTEND)
>         at .../SelectionDAGBuilder.cpp:537
>     #2  0x... in llvm::RegsForValue::getCopyToRegs (... PreferredExtendType=llvm::ISD::ANY_EXTEND)
>         at .../SelectionDAGBuilder.cpp:958
>     #3  0x... in llvm::SelectionDAGBuilder::visitInlineAsm(...)
>         at .../SelectionDAGBuilder.cpp:9640
>         ...
>
> The stack frame #2 is interesting, here is the code for it [4]:
>
>     void RegsForValue::getCopyToRegs(SDValue Val, SelectionDAG &DAG,
>                                      const SDLoc &dl, SDValue &Chain, SDValue *Glue,
>                                      const Value *V,
>                                      ISD::NodeType PreferredExtendType) const {
>                                                    ^
>                                                    '-- this is ANY_EXTEND
>       ...
>       for (unsigned Value = 0, Part = 0, e = ValueVTs.size(); Value != e; ++Value) {
>         ...
>                                                    .-- this returns true
>                                                    v
>         if (ExtendKind == ISD::ANY_EXTEND && TLI.isZExtFree(Val, RegisterVT))
>           ExtendKind = ISD::ZERO_EXTEND;
>
>                                .-- this is ZERO_EXTEND
>                                v
>         getCopyToParts(..., ExtendKind);
>         Part += NumParts;
>       }
>       ...
>     }
>
> The getCopyToRegs() function was called with ANY_EXTEND preference,
> but switched to ZERO_EXTEND because TLI.isZExtFree() currently returns
> true for any 32 to 64-bit conversion [5].
> However, in this case this is clearly a mistake, as zero extension of
> (zext i64 (truncate i32 ...)) costs two instructions.
>
> The isZExtFree() behavior could be changed to report false for such
> situations, as in my patch [6]. This in turn removes zext =>
> removes two shifts from final asm.
> Here is how DAG/asm look after patch [6]:
>
>     Initial selection DAG:
>       ...
>       t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>   !   t11: i32 = truncate t10
>       t16: ch,glue = CopyToReg t10:1, Register:i64 %1, t10
>         t18: ch,glue = inlineasm t16, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>                          TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t16:1
>       ...
>
> Final asm:
>
>     ...
>     # %bb.0:
>         call bar
>         #APP
>         r0 += 1
>         #NO_APP
>         exit
>     ...
>
> Note that [6] is a very minor change, it does not affect code
> generation for selftests at all and I was unable to conjure examples
> where it has effect aside from inline asm parameters.
>
> [4] https://github.com/llvm/llvm-project/blob/365fbbfbcfefb8766f7716109b9c3767b58e6058/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L937C10-L937C10
> [5] https://github.com/llvm/llvm-project/blob/6f4cc1310b12bc59210e4596a895db4cb9ad6075/llvm/lib/Target/BPF/BPFISelLowering.cpp#L213C21-L213C21
> [6] https://github.com/llvm/llvm-project/commit/cf8e142e5eac089cc786c671a40fef022d08b0ef
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-08 21:21           ` Yonghong Song
@ 2024-01-08 23:16             ` Eduard Zingerman
  0 siblings, 0 replies; 32+ messages in thread
From: Eduard Zingerman @ 2024-01-08 23:16 UTC (permalink / raw)
  To: Yonghong Song, Alexei Starovoitov, Kumar Kartikeya Dwivedi
  Cc: Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Daniel Xu,
	John Fastabend, bpf, Kernel Team

On Mon, 2024-01-08 at 13:21 -0800, Yonghong Song wrote:
[...]
> Thanks for the detailed analysis! Previously we intend to do the following:
> 
> - When 32-bit value is passed to "r" constraint:
>    - for cpu v3/v4 a 32-bit register should be selected;
>    - for cpu v1/v2 a warning should be reported.
> 
> So in the above, the desired asm code should be
> 
>      ...
>      # %bb.0:
>      	call bar
>      	#APP
>      	w0 += 1
>      	#NO_APP
>      	exit
>      ...
> 
> for cpuv3/cpuv4. I guess some more work in llvm is needed
> to achieve that.

To make clang emit w0 the following modification is needed:

diff --git a/llvm/lib/Target/BPF/BPFISelLowering.cpp b/llvm/lib/Target/BPF/BPFISelLowering.cpp
index b20e09c7f95f..4c504d587ce6 100644
--- a/llvm/lib/Target/BPF/BPFISelLowering.cpp
+++ b/llvm/lib/Target/BPF/BPFISelLowering.cpp
@@ -265,6 +265,8 @@ BPFTargetLowering::getRegForInlineAsmConstraint(const TargetRegisterInfo *TRI,
     // GCC Constraint Letters
     switch (Constraint[0]) {
     case 'r': // GENERAL_REGS
+      if (HasAlu32 && VT == MVT::i32)
+        return std::make_pair(0U, &BPF::GPR32RegClass);
       return std::make_pair(0U, &BPF::GPRRegClass);
     case 'w':
       if (HasAlu32)

However, as Alexei notes in the sibling thread,
this leads to incompatibility with some existing inline assembly.
E.g. there are two compilation errors in selftests.
I'll write in some more detail in the sibling thread.

> On the other hand, for cpuv3/v4, for regular C code,
> I think the compiler might be already omitting the conversion and use w
> register already. So I am not sure whether the patch [6]
> is needed or not. Could you double check?

Yes, for regular C code, generated assembly uses 32-bit registers as expected:

echo $(cat <<EOF
extern unsigned long bar(unsigned);
void foo(void) {
  bar(bar(7));
}
EOF
) | clang -mcpu=v3 -O2 --target=bpf -mcpu=v3 -x c -S -o - -

...

foo:                                    # @foo
# %bb.0:
	w1 = 7
	call bar
	w1 = w0
	call bar
	exit

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-08 21:33           ` asm register constraint. Was: " Alexei Starovoitov
@ 2024-01-08 23:22             ` Yonghong Song
  2024-01-09 10:49             ` Jose E. Marchesi
  2024-01-11 16:46             ` Eduard Zingerman
  2 siblings, 0 replies; 32+ messages in thread
From: Yonghong Song @ 2024-01-08 23:22 UTC (permalink / raw)
  To: Alexei Starovoitov, Eduard Zingerman, Jose E. Marchesi, Jose E. Marchesi
  Cc: Kumar Kartikeya Dwivedi, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Daniel Xu, John Fastabend, bpf, Kernel Team


On 1/8/24 1:33 PM, Alexei Starovoitov wrote:
> On Fri, Jan 5, 2024 at 1:47 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>> On Mon, 2023-12-25 at 12:33 -0800, Alexei Starovoitov wrote:
>> [...]
>>> It turned out there are indeed a bunch of redundant shifts
>>> when u32 or s32 is passed into "r" asm constraint.
>>>
>>> Strangely the shifts are there when compiled with -mcpu=v3 or v4
>>> and no shifts with -mcpu=v1 and v2.
>>>
>>> Also weird that u8 and u16 are passed into "r" without redundant shifts.
>>> Hence I found a "workaround": cast u32 into u16 while passing.
>>> The truncation of u32 doesn't happen and shifts to zero upper 32-bit
>>> are gone as well.
>>>
>>> https://godbolt.org/z/Kqszr6q3v
>> Regarding unnecessary shifts.
>> Sorry, a long email about minor feature/defect.
>>
>> So, currently the following C program
>> (and it's variations with implicit casts):
>>
>>      extern unsigned long bar(void);
>>      void foo(void) {
>>        asm volatile ("%[reg] += 1"::[reg]"r"((unsigned)bar()));
>>      }
>>
>> Is translated to the following BPF:
>>
>>      $ clang -mcpu=v3 -O2 --target=bpf -mcpu=v3 -c -o - t.c | llvm-objdump --no-show-raw-insn -d -
>>
>>      <stdin>:    file format elf64-bpf
>>
>>      Disassembly of section .text:
>>
>>      0000000000000000 <foo>:
>>             0:   call -0x1
>>             1:   r0 <<= 0x20
>>             2:   r0 >>= 0x20
>>             3:   r0 += 0x1
>>             4:   exit
>>
>> Note: no additional shifts are generated when "w" (32-bit register)
>>        constraint is used instead of "r".
>>
>> First, is this right or wrong?
>> ------------------------------
>>
>> C language spec [1] paragraph 6.5.4.6 (Cast operators -> Semantics) says
>> the following:
>>
>>    If the value of the expression is represented with greater range or
>>    precision than required by the type named by the cast (6.3.1.8),
>>    then the cast specifies a conversion even if the type of the
>>    expression is the same as the named type and removes any extra range
>>    and precision.                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>    ^^^^^^^^^^^^^
>>
>> What other LLVM backends do in such situations?
>> Consider the following program translated to amd64 [2] and aarch64 [3]:
>>
>>      void foo(void) {
>>        asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned long)  bar())); // 1
>>        asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned int)   bar())); // 2
>>        asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned short) bar())); // 3
>>      }
>>
>> - for amd64 register of proper size is selected for `reg`;
>> - for aarch64 warnings about wrong operand size are emitted at (2) and (3)
>>    and 64-bit register is used w/o generating any additional instructions.
>>
>> (Note, however, that 'arm' silently ignores the issue and uses 32-bit
>>   registers for all three points).
>>
>> So, it looks like that something of this sort should be done:
>> - either extra precision should be removed via additional instructions;
>> - or 32-bit register should be picked for `reg`;
>> - or warning should be emitted as in aarch64 case.
>>
>> [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3088.pdf
>> [2] https://godbolt.org/z/9nKxaMc5j
>> [3] https://godbolt.org/z/1zxEr5b3f
>>
>>
>> Second, what to do?
>> -------------------
>>
>> I think that the following steps are needed:
>> - Investigation described in the next section shows that currently two
>>    shifts are generated accidentally w/o real intent to shed precision.
>>    I have a patch [6] that removes shifts generation, it should be applied.
>> - When 32-bit value is passed to "r" constraint:
>>    - for cpu v3/v4 a 32-bit register should be selected;
>>    - for cpu v1/v2 a warning should be reported.
> Thank you for the detailed analysis.
>
> Agree that llvm fix [6] is a necessary step, then

[6] is probably not needed for normal C code. Eduard may help
to confirm. But yes, it is needed to simplify asm
with 'r' registers.

> using 'w' in v3/v4 and warn on v1/v2 makes sense too,
> but we have this macro:
> #define barrier_var(var) asm volatile("" : "+r"(var))
> that is used in various bpf production programs.
> tetragon is also a heavy user of inline asm.

asm volatile("" : "+r"(var)) is expected to be
used only in compiler and the code will be generated
in final binary. This is true for all the instances
I see. If in unusual cases,
we have code generated like (r1 = r1, w1.= w1),
we can remove that inline asm in bpf backend at
the very late peephole optimization.

>
> Right now a full 64-bit register is allocated,
> so switching to 'w' might cause unexpected behavior
> including rejection by the verifier.
> I think it makes sense to align the bpf backend with arm64 and x86,
> but we need to broadcast this change widely.

Indeed, just applying [6] might cause wrong results in certain cases.
Or, if [6] landed, some code might need to change from 'r' to 'w'
in their source code.

Agree we should do a similar thing to arm64/x86.
For cpuv3/cpuv4 4-byte values, change 'r' to 'w'.
For other 4-byte values, or 1/2-byte values, issue warnings.

>
> Also need to align with GCC. (Jose cc-ed)
>
> And, the most importantly, we need a way to go back to old behavior,
> since u32 var; asm("...":: "r"(var)); will now
> allocate "w" register or warn.
>
> What should be the workaround?
>
> I've tried:
> u32 var; asm("...":: "r"((u64)var));
>
> https://godbolt.org/z/n4ejvWj7v
>
> and x86/arm64 generate 32-bit truction.
> Sounds like the bpf backend has to do it as well.
> We should be doing 'wX=wX' in such case (just like x86)
> instead of <<=32 >>=32.

Indeed, this is preferred.

>
> I think this should be done as a separate diff.
> Our current pattern of using shifts is inefficient and guaranteed
> to screw up verifier range analysis while wX=wX is faster
> and more verifier friendly.
> Yes it's still not going to be 1-1 to old (our current) behavior.

With [6] and 'r' replaced by 'w' in inline asm, it will be a
new behavior.

>
> We probably need some macro (like we did for __BPF_CPU_VERSION__)
> to identify such fixed llvm, so existing users with '(short)'
> workaround and other tricks can detect new vs old compiler.

We can add a feature macro like
https://github.com/llvm/llvm-project/blob/main/clang/lib/Basic/Targets/BPF.cpp#L55

>
> Looks like we opened a can of worms.
> Aligning with x86/arm64 makes sense, but the cost of doing
> the right thing is hard to estimate.
>
>> Third, why two shifts are generated?
>> ------------------------------------
>>
>> (Details here might be interesting to Yonghong, regular reader could
>>   skip this section).
>>
>> The two shifts are result of interaction between two IR constructs
>> `trunc` and `asm`. The C program above looks as follows in LLVM IR
>> before machine code generation:
>>
>>      declare dso_local i64 @bar()
>>      define dso_local void @foo(i32 %p) {
>>      entry:
>>        %call = call i64 @bar()
>>        %v32 = trunc i64 %call to i32
>>        tail call void asm sideeffect "$0 += 1", "r"(i32 %v32)
>>        ret void
>>      }
>>
>> Initial selection DAG:
>>
>>      $ llc -debug-only=isel -march=bpf -mcpu=v3 --filetype=asm -o - t.ll
>>      SelectionDAG has 21 nodes:
>>        ...
>>        t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>>     !     t11: i32 = truncate t10
>>     !    t15: i64 = zero_extend t11
>>        t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t15
>>          t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>                           TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>        ...
>>
>> Note values t11 and t15 marked with (!).
>>
>> Optimized lowered selection DAG for this fragment:
>>
>>      t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>>    !   t22: i64 = and t10, Constant:i64<4294967295>
>>      t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>>        t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>                         TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>
>> Note (zext (truncate ...)) converted to (and ... 0xffff_ffff).
>>
>> DAG after instruction selection:
>>
>>      t10: i64,ch,glue = CopyFromReg t8:1, Register:i64 $r0, t8:2
>>    !     t25: i64 = SLL_ri t10, TargetConstant:i64<32>
>>    !   t22: i64 = SRL_ri t25, TargetConstant:i64<32>
>>      t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>>        t23: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>                         TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>
>> Note (and ... 0xffff_ffff) converted to (SRL_ri (SLL_ri ...)).
>> This happens because of the following pattern from BPFInstrInfo.td:
>>
>>      // 0xffffFFFF doesn't fit into simm32, optimize common case
>>      def : Pat<(i64 (and (i64 GPR:$src), 0xffffFFFF)),
>>                (SRL_ri (SLL_ri (i64 GPR:$src), 32), 32)>;
>>
>> So, the two shift instructions are result of translation of (zext (trunc ...)).
>> However, closer examination shows that zext DAG node was generated
>> almost by accident. Here is the backtrace for when this node was created:
>>
>>      Breakpoint 1, llvm::SelectionDAG::getNode (... Opcode=202) ;; 202 is opcode for ZERO_EXTEND
>>          at .../SelectionDAG.cpp:5605
>>      (gdb) bt
>>      #0  llvm::SelectionDAG::getNode (...)
>>          at ...SelectionDAG.cpp:5605
>>      #1  0x... in getCopyToParts (..., ExtendKind=llvm::ISD::ZERO_EXTEND)
>>          at .../SelectionDAGBuilder.cpp:537
>>      #2  0x... in llvm::RegsForValue::getCopyToRegs (... PreferredExtendType=llvm::ISD::ANY_EXTEND)
>>          at .../SelectionDAGBuilder.cpp:958
>>      #3  0x... in llvm::SelectionDAGBuilder::visitInlineAsm(...)
>>          at .../SelectionDAGBuilder.cpp:9640
>>          ...
>>
>> The stack frame #2 is interesting, here is the code for it [4]:
>>
>>      void RegsForValue::getCopyToRegs(SDValue Val, SelectionDAG &DAG,
>>                                       const SDLoc &dl, SDValue &Chain, SDValue *Glue,
>>                                       const Value *V,
>>                                       ISD::NodeType PreferredExtendType) const {
>>                                                     ^
>>                                                     '-- this is ANY_EXTEND
>>        ...
>>        for (unsigned Value = 0, Part = 0, e = ValueVTs.size(); Value != e; ++Value) {
>>          ...
>>                                                     .-- this returns true
>>                                                     v
>>          if (ExtendKind == ISD::ANY_EXTEND && TLI.isZExtFree(Val, RegisterVT))
>>            ExtendKind = ISD::ZERO_EXTEND;
>>
>>                                 .-- this is ZERO_EXTEND
>>                                 v
>>          getCopyToParts(..., ExtendKind);
>>          Part += NumParts;
>>        }
>>        ...
>>      }
>>
>> The getCopyToRegs() function was called with ANY_EXTEND preference,
>> but switched to ZERO_EXTEND because TLI.isZExtFree() currently returns
>> true for any 32 to 64-bit conversion [5].
>> However, in this case this is clearly a mistake, as zero extension of
>> (zext i64 (truncate i32 ...)) costs two instructions.
>>
>> The isZExtFree() behavior could be changed to report false for such
>> situations, as in my patch [6]. This in turn removes zext =>
>> removes two shifts from final asm.
>> Here is how DAG/asm look after patch [6]:
>>
>>      Initial selection DAG:
>>        ...
>>        t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>>    !   t11: i32 = truncate t10
>>        t16: ch,glue = CopyToReg t10:1, Register:i64 %1, t10
>>          t18: ch,glue = inlineasm t16, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>                           TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t16:1
>>        ...
>>
>> Final asm:
>>
>>      ...
>>      # %bb.0:
>>          call bar
>>          #APP
>>          r0 += 1
>>          #NO_APP
>>          exit
>>      ...
>>
>> Note that [6] is a very minor change, it does not affect code
>> generation for selftests at all and I was unable to conjure examples
>> where it has effect aside from inline asm parameters.
>>
>> [4] https://github.com/llvm/llvm-project/blob/365fbbfbcfefb8766f7716109b9c3767b58e6058/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L937C10-L937C10
>> [5] https://github.com/llvm/llvm-project/blob/6f4cc1310b12bc59210e4596a895db4cb9ad6075/llvm/lib/Target/BPF/BPFISelLowering.cpp#L213C21-L213C21
>> [6] https://github.com/llvm/llvm-project/commit/cf8e142e5eac089cc786c671a40fef022d08b0ef
>>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-08 21:33           ` asm register constraint. Was: " Alexei Starovoitov
  2024-01-08 23:22             ` Yonghong Song
@ 2024-01-09 10:49             ` Jose E. Marchesi
  2024-01-09 12:09               ` Jose E. Marchesi
  2024-01-11  2:46               ` Alexei Starovoitov
  2024-01-11 16:46             ` Eduard Zingerman
  2 siblings, 2 replies; 32+ messages in thread
From: Jose E. Marchesi @ 2024-01-09 10:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eduard Zingerman, Jose E. Marchesi, Kumar Kartikeya Dwivedi,
	Yonghong Song, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Daniel Xu, John Fastabend, bpf, Kernel Team


> On Fri, Jan 5, 2024 at 1:47 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>>
>> On Mon, 2023-12-25 at 12:33 -0800, Alexei Starovoitov wrote:
>> [...]
>> > It turned out there are indeed a bunch of redundant shifts
>> > when u32 or s32 is passed into "r" asm constraint.
>> >
>> > Strangely the shifts are there when compiled with -mcpu=v3 or v4
>> > and no shifts with -mcpu=v1 and v2.
>> >
>> > Also weird that u8 and u16 are passed into "r" without redundant shifts.
>> > Hence I found a "workaround": cast u32 into u16 while passing.
>> > The truncation of u32 doesn't happen and shifts to zero upper 32-bit
>> > are gone as well.
>> >
>> > https://godbolt.org/z/Kqszr6q3v
>>
>> Regarding unnecessary shifts.
>> Sorry, a long email about minor feature/defect.
>>
>> So, currently the following C program
>> (and it's variations with implicit casts):
>>
>>     extern unsigned long bar(void);
>>     void foo(void) {
>>       asm volatile ("%[reg] += 1"::[reg]"r"((unsigned)bar()));
>>     }
>>
>> Is translated to the following BPF:
>>
>>     $ clang -mcpu=v3 -O2 --target=bpf -mcpu=v3 -c -o - t.c | llvm-objdump --no-show-raw-insn -d -
>>
>>     <stdin>:    file format elf64-bpf
>>
>>     Disassembly of section .text:
>>
>>     0000000000000000 <foo>:
>>            0:   call -0x1
>>            1:   r0 <<= 0x20
>>            2:   r0 >>= 0x20
>>            3:   r0 += 0x1
>>            4:   exit
>>
>> Note: no additional shifts are generated when "w" (32-bit register)
>>       constraint is used instead of "r".
>>
>> First, is this right or wrong?
>> ------------------------------
>>
>> C language spec [1] paragraph 6.5.4.6 (Cast operators -> Semantics) says
>> the following:
>>
>>   If the value of the expression is represented with greater range or
>>   precision than required by the type named by the cast (6.3.1.8),
>>   then the cast specifies a conversion even if the type of the
>>   expression is the same as the named type and removes any extra range
>>   and precision.                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>   ^^^^^^^^^^^^^
>>
>> What other LLVM backends do in such situations?
>> Consider the following program translated to amd64 [2] and aarch64 [3]:
>>
>>     void foo(void) {
>>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned long)  bar())); // 1
>>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned int)   bar())); // 2
>>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned short) bar())); // 3
>>     }
>>
>> - for amd64 register of proper size is selected for `reg`;
>> - for aarch64 warnings about wrong operand size are emitted at (2) and (3)
>>   and 64-bit register is used w/o generating any additional instructions.
>>
>> (Note, however, that 'arm' silently ignores the issue and uses 32-bit
>>  registers for all three points).
>>
>> So, it looks like that something of this sort should be done:
>> - either extra precision should be removed via additional instructions;
>> - or 32-bit register should be picked for `reg`;
>> - or warning should be emitted as in aarch64 case.
>>
>> [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3088.pdf
>> [2] https://godbolt.org/z/9nKxaMc5j
>> [3] https://godbolt.org/z/1zxEr5b3f
>>
>>
>> Second, what to do?
>> -------------------
>>
>> I think that the following steps are needed:
>> - Investigation described in the next section shows that currently two
>>   shifts are generated accidentally w/o real intent to shed precision.
>>   I have a patch [6] that removes shifts generation, it should be applied.
>> - When 32-bit value is passed to "r" constraint:
>>   - for cpu v3/v4 a 32-bit register should be selected;
>>   - for cpu v1/v2 a warning should be reported.
>
> Thank you for the detailed analysis.
>
> Agree that llvm fix [6] is a necessary step, then
> using 'w' in v3/v4 and warn on v1/v2 makes sense too,
> but we have this macro:
> #define barrier_var(var) asm volatile("" : "+r"(var))
> that is used in various bpf production programs.
> tetragon is also a heavy user of inline asm.
>
> Right now a full 64-bit register is allocated,
> so switching to 'w' might cause unexpected behavior
> including rejection by the verifier.
> I think it makes sense to align the bpf backend with arm64 and x86,
> but we need to broadcast this change widely.
>
> Also need to align with GCC. (Jose cc-ed)

GCC doesn't have an integrated assembler, so using -masm=pseudoc it just
compiles the program above to:

  foo:
  	call bar
  	r0 += 1
	exit

Also, at the moment we don't support a "w" constraint, because the
assembly-like assembly syntax we started with implies different
instructions that interpret the values stored in the BPF 64-bit
registers as 32-bit or 64-bit values, i.e.

  mov %r1, 1
  mov32 %r1, 1

But then the pseudo-c assembly syntax (that we also support) translates
some of the semantics of the instructions to the register names,
creating the notion that BPF actually has both 32-bit registers and
64-bit registers, i.e.

  r1 += 1
  w1 += 1

In GCC we support both assembly syntaxes and currently we lack the
ability to emit 32-bit variants in templates like "%[reg] += 1", so I
suppose we can introduce a "w" constraint to:

1. When assembly-like assembly syntax is used, expect a 32-bit mode to
   match the operand and warn about operand size overflow whenever
   necessary.  Always emit "%r" as the register name.

2. When pseudo-c assembly syntax is used, expect a 32-bit mode to match
   the operand and warn about operand size overflow whenever necessary,
   and then emit "w" instead of "r" as the register name.

> And, the most importantly, we need a way to go back to old behavior,
> since u32 var; asm("...":: "r"(var)); will now
> allocate "w" register or warn.

Is it really necessary to change the meaning of "r"?  You can write
templates like the one triggering this problem like:

  asm volatile ("%[reg] += 1"::[reg]"w"((unsigned)bar()));

Then the checks above will be performed, driven by the particular
constraint explicitly specified by the user, not driven by the type of
the value passed as the operand.

Or am I misunderstanding?

> What should be the workaround?
>
> I've tried:
> u32 var; asm("...":: "r"((u64)var));
>
> https://godbolt.org/z/n4ejvWj7v
>
> and x86/arm64 generate 32-bit truncation.
> Sounds like the bpf backend has to do it as well.
> We should be doing 'wX=wX' in such case (just like x86)
> instead of <<=32 >>=32.
>
> I think this should be done as a separate diff.
> Our current pattern of using shifts is inefficient and guaranteed
> to screw up verifier range analysis while wX=wX is faster
> and more verifier friendly.
> Yes it's still not going to be 1-1 to old (our current) behavior.
>
> We probably need some macro (like we did for __BPF_CPU_VERSION__)
> to identify such fixed llvm, so existing users with '(short)'
> workaround and other tricks can detect new vs old compiler.
>
> Looks like we opened a can of worms.
> Aligning with x86/arm64 makes sense, but the cost of doing
> the right thing is hard to estimate.
>
>>
>> Third, why two shifts are generated?
>> ------------------------------------
>>
>> (Details here might be interesting to Yonghong, regular reader could
>>  skip this section).
>>
>> The two shifts are result of interaction between two IR constructs
>> `trunc` and `asm`. The C program above looks as follows in LLVM IR
>> before machine code generation:
>>
>>     declare dso_local i64 @bar()
>>     define dso_local void @foo(i32 %p) {
>>     entry:
>>       %call = call i64 @bar()
>>       %v32 = trunc i64 %call to i32
>>       tail call void asm sideeffect "$0 += 1", "r"(i32 %v32)
>>       ret void
>>     }
>>
>> Initial selection DAG:
>>
>>     $ llc -debug-only=isel -march=bpf -mcpu=v3 --filetype=asm -o - t.ll
>>     SelectionDAG has 21 nodes:
>>       ...
>>       t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>>    !     t11: i32 = truncate t10
>>    !    t15: i64 = zero_extend t11
>>       t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t15
>>         t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>                          TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>       ...
>>
>> Note values t11 and t15 marked with (!).
>>
>> Optimized lowered selection DAG for this fragment:
>>
>>     t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>>   !   t22: i64 = and t10, Constant:i64<4294967295>
>>     t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>>       t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>                        TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>
>> Note (zext (truncate ...)) converted to (and ... 0xffff_ffff).
>>
>> DAG after instruction selection:
>>
>>     t10: i64,ch,glue = CopyFromReg t8:1, Register:i64 $r0, t8:2
>>   !     t25: i64 = SLL_ri t10, TargetConstant:i64<32>
>>   !   t22: i64 = SRL_ri t25, TargetConstant:i64<32>
>>     t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>>       t23: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>                        TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>
>> Note (and ... 0xffff_ffff) converted to (SRL_ri (SLL_ri ...)).
>> This happens because of the following pattern from BPFInstrInfo.td:
>>
>>     // 0xffffFFFF doesn't fit into simm32, optimize common case
>>     def : Pat<(i64 (and (i64 GPR:$src), 0xffffFFFF)),
>>               (SRL_ri (SLL_ri (i64 GPR:$src), 32), 32)>;
>>
>> So, the two shift instructions are result of translation of (zext (trunc ...)).
>> However, closer examination shows that zext DAG node was generated
>> almost by accident. Here is the backtrace for when this node was created:
>>
>>     Breakpoint 1, llvm::SelectionDAG::getNode (... Opcode=202) ;; 202 is opcode for ZERO_EXTEND
>>         at .../SelectionDAG.cpp:5605
>>     (gdb) bt
>>     #0  llvm::SelectionDAG::getNode (...)
>>         at ...SelectionDAG.cpp:5605
>>     #1  0x... in getCopyToParts (..., ExtendKind=llvm::ISD::ZERO_EXTEND)
>>         at .../SelectionDAGBuilder.cpp:537
>>     #2  0x... in llvm::RegsForValue::getCopyToRegs (... PreferredExtendType=llvm::ISD::ANY_EXTEND)
>>         at .../SelectionDAGBuilder.cpp:958
>>     #3  0x... in llvm::SelectionDAGBuilder::visitInlineAsm(...)
>>         at .../SelectionDAGBuilder.cpp:9640
>>         ...
>>
>> The stack frame #2 is interesting, here is the code for it [4]:
>>
>>     void RegsForValue::getCopyToRegs(SDValue Val, SelectionDAG &DAG,
>>                                      const SDLoc &dl, SDValue &Chain, SDValue *Glue,
>>                                      const Value *V,
>>                                      ISD::NodeType PreferredExtendType) const {
>>                                                    ^
>>                                                    '-- this is ANY_EXTEND
>>       ...
>>       for (unsigned Value = 0, Part = 0, e = ValueVTs.size(); Value != e; ++Value) {
>>         ...
>>                                                    .-- this returns true
>>                                                    v
>>         if (ExtendKind == ISD::ANY_EXTEND && TLI.isZExtFree(Val, RegisterVT))
>>           ExtendKind = ISD::ZERO_EXTEND;
>>
>>                                .-- this is ZERO_EXTEND
>>                                v
>>         getCopyToParts(..., ExtendKind);
>>         Part += NumParts;
>>       }
>>       ...
>>     }
>>
>> The getCopyToRegs() function was called with ANY_EXTEND preference,
>> but switched to ZERO_EXTEND because TLI.isZExtFree() currently returns
>> true for any 32 to 64-bit conversion [5].
>> However, in this case this is clearly a mistake, as zero extension of
>> (zext i64 (truncate i32 ...)) costs two instructions.
>>
>> The isZExtFree() behavior could be changed to report false for such
>> situations, as in my patch [6]. This in turn removes zext =>
>> removes two shifts from final asm.
>> Here is how DAG/asm look after patch [6]:
>>
>>     Initial selection DAG:
>>       ...
>>       t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>>   !   t11: i32 = truncate t10
>>       t16: ch,glue = CopyToReg t10:1, Register:i64 %1, t10
>>         t18: ch,glue = inlineasm t16, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>                          TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t16:1
>>       ...
>>
>> Final asm:
>>
>>     ...
>>     # %bb.0:
>>         call bar
>>         #APP
>>         r0 += 1
>>         #NO_APP
>>         exit
>>     ...
>>
>> Note that [6] is a very minor change, it does not affect code
>> generation for selftests at all and I was unable to conjure examples
>> where it has effect aside from inline asm parameters.
>>
>> [4] https://github.com/llvm/llvm-project/blob/365fbbfbcfefb8766f7716109b9c3767b58e6058/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L937C10-L937C10
>> [5] https://github.com/llvm/llvm-project/blob/6f4cc1310b12bc59210e4596a895db4cb9ad6075/llvm/lib/Target/BPF/BPFISelLowering.cpp#L213C21-L213C21
>> [6] https://github.com/llvm/llvm-project/commit/cf8e142e5eac089cc786c671a40fef022d08b0ef
>>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-09 10:49             ` Jose E. Marchesi
@ 2024-01-09 12:09               ` Jose E. Marchesi
  2024-01-11 18:33                 ` Jose E. Marchesi
  2024-01-11  2:46               ` Alexei Starovoitov
  1 sibling, 1 reply; 32+ messages in thread
From: Jose E. Marchesi @ 2024-01-09 12:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eduard Zingerman, Jose E. Marchesi, Kumar Kartikeya Dwivedi,
	Yonghong Song, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Daniel Xu, John Fastabend, bpf, Kernel Team


>> On Fri, Jan 5, 2024 at 1:47 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>>>
>>> On Mon, 2023-12-25 at 12:33 -0800, Alexei Starovoitov wrote:
>>> [...]
>>> > It turned out there are indeed a bunch of redundant shifts
>>> > when u32 or s32 is passed into "r" asm constraint.
>>> >
>>> > Strangely the shifts are there when compiled with -mcpu=v3 or v4
>>> > and no shifts with -mcpu=v1 and v2.
>>> >
>>> > Also weird that u8 and u16 are passed into "r" without redundant shifts.
>>> > Hence I found a "workaround": cast u32 into u16 while passing.
>>> > The truncation of u32 doesn't happen and shifts to zero upper 32-bit
>>> > are gone as well.
>>> >
>>> > https://godbolt.org/z/Kqszr6q3v
>>>
>>> Regarding unnecessary shifts.
>>> Sorry, a long email about minor feature/defect.
>>>
>>> So, currently the following C program
>>> (and it's variations with implicit casts):
>>>
>>>     extern unsigned long bar(void);
>>>     void foo(void) {
>>>       asm volatile ("%[reg] += 1"::[reg]"r"((unsigned)bar()));
>>>     }
>>>
>>> Is translated to the following BPF:
>>>
>>>     $ clang -mcpu=v3 -O2 --target=bpf -mcpu=v3 -c -o - t.c | llvm-objdump --no-show-raw-insn -d -
>>>
>>>     <stdin>:    file format elf64-bpf
>>>
>>>     Disassembly of section .text:
>>>
>>>     0000000000000000 <foo>:
>>>            0:   call -0x1
>>>            1:   r0 <<= 0x20
>>>            2:   r0 >>= 0x20
>>>            3:   r0 += 0x1
>>>            4:   exit
>>>
>>> Note: no additional shifts are generated when "w" (32-bit register)
>>>       constraint is used instead of "r".
>>>
>>> First, is this right or wrong?
>>> ------------------------------
>>>
>>> C language spec [1] paragraph 6.5.4.6 (Cast operators -> Semantics) says
>>> the following:
>>>
>>>   If the value of the expression is represented with greater range or
>>>   precision than required by the type named by the cast (6.3.1.8),
>>>   then the cast specifies a conversion even if the type of the
>>>   expression is the same as the named type and removes any extra range
>>>   and precision.                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>   ^^^^^^^^^^^^^
>>>
>>> What other LLVM backends do in such situations?
>>> Consider the following program translated to amd64 [2] and aarch64 [3]:
>>>
>>>     void foo(void) {
>>>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned long)  bar())); // 1
>>>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned int)   bar())); // 2
>>>       asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned short) bar())); // 3
>>>     }
>>>
>>> - for amd64 register of proper size is selected for `reg`;
>>> - for aarch64 warnings about wrong operand size are emitted at (2) and (3)
>>>   and 64-bit register is used w/o generating any additional instructions.
>>>
>>> (Note, however, that 'arm' silently ignores the issue and uses 32-bit
>>>  registers for all three points).
>>>
>>> So, it looks like that something of this sort should be done:
>>> - either extra precision should be removed via additional instructions;
>>> - or 32-bit register should be picked for `reg`;
>>> - or warning should be emitted as in aarch64 case.
>>>
>>> [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3088.pdf
>>> [2] https://godbolt.org/z/9nKxaMc5j
>>> [3] https://godbolt.org/z/1zxEr5b3f
>>>
>>>
>>> Second, what to do?
>>> -------------------
>>>
>>> I think that the following steps are needed:
>>> - Investigation described in the next section shows that currently two
>>>   shifts are generated accidentally w/o real intent to shed precision.
>>>   I have a patch [6] that removes shifts generation, it should be applied.
>>> - When 32-bit value is passed to "r" constraint:
>>>   - for cpu v3/v4 a 32-bit register should be selected;
>>>   - for cpu v1/v2 a warning should be reported.
>>
>> Thank you for the detailed analysis.
>>
>> Agree that llvm fix [6] is a necessary step, then
>> using 'w' in v3/v4 and warn on v1/v2 makes sense too,
>> but we have this macro:
>> #define barrier_var(var) asm volatile("" : "+r"(var))
>> that is used in various bpf production programs.
>> tetragon is also a heavy user of inline asm.
>>
>> Right now a full 64-bit register is allocated,
>> so switching to 'w' might cause unexpected behavior
>> including rejection by the verifier.
>> I think it makes sense to align the bpf backend with arm64 and x86,
>> but we need to broadcast this change widely.
>>
>> Also need to align with GCC. (Jose cc-ed)
>
> GCC doesn't have an integrated assembler, so using -masm=pseudoc it just
> compiles the program above to:
>
>   foo:
>   	call bar
>   	r0 += 1
> 	exit
>
> Also, at the moment we don't support a "w" constraint, because the
> assembly-like assembly syntax we started with implies different
> instructions that interpret the values stored in the BPF 64-bit
> registers as 32-bit or 64-bit values, i.e.
>
>   mov %r1, 1
>   mov32 %r1, 1
>
> But then the pseudo-c assembly syntax (that we also support) translates
> some of the semantics of the instructions to the register names,
> creating the notion that BPF actually has both 32-bit registers and
> 64-bit registers, i.e.
>
>   r1 += 1
>   w1 += 1
>
> In GCC we support both assembly syntaxes and currently we lack the
> ability to emit 32-bit variants in templates like "%[reg] += 1", so I
> suppose we can introduce a "w" constraint to:
>
> 1. When assembly-like assembly syntax is used, expect a 32-bit mode to
>    match the operand and warn about operand size overflow whenever
>    necessary.  Always emit "%r" as the register name.
>
> 2. When pseudo-c assembly syntax is used, expect a 32-bit mode to match
>    the operand and warn about operand size overflow whenever necessary,
>    and then emit "w" instead of "r" as the register name.
>
>> And, the most importantly, we need a way to go back to old behavior,
>> since u32 var; asm("...":: "r"(var)); will now
>> allocate "w" register or warn.
>
> Is it really necessary to change the meaning of "r"?  You can write
> templates like the one triggering this problem like:
>
>   asm volatile ("%[reg] += 1"::[reg]"w"((unsigned)bar()));
>
> Then the checks above will be performed, driven by the particular
> constraint explicitly specified by the user, not driven by the type of
> the value passed as the operand.
>
> Or am I misunderstanding?

[I have just added a proposal for an agenda item to this week's BPF
 Office Hours so we can discuss BPF sub-registers and compiler
 constraints, to complement this thread.]

>
>> What should be the workaround?
>>
>> I've tried:
>> u32 var; asm("...":: "r"((u64)var));
>>
>> https://godbolt.org/z/n4ejvWj7v
>>
>> and x86/arm64 generate 32-bit truncation.
>> Sounds like the bpf backend has to do it as well.
>> We should be doing 'wX=wX' in such case (just like x86)
>> instead of <<=32 >>=32.
>>
>> I think this should be done as a separate diff.
>> Our current pattern of using shifts is inefficient and guaranteed
>> to screw up verifier range analysis while wX=wX is faster
>> and more verifier friendly.
>> Yes it's still not going to be 1-1 to old (our current) behavior.
>>
>> We probably need some macro (like we did for __BPF_CPU_VERSION__)
>> to identify such fixed llvm, so existing users with '(short)'
>> workaround and other tricks can detect new vs old compiler.
>>
>> Looks like we opened a can of worms.
>> Aligning with x86/arm64 makes sense, but the cost of doing
>> the right thing is hard to estimate.
>>
>>>
>>> Third, why two shifts are generated?
>>> ------------------------------------
>>>
>>> (Details here might be interesting to Yonghong, regular reader could
>>>  skip this section).
>>>
>>> The two shifts are result of interaction between two IR constructs
>>> `trunc` and `asm`. The C program above looks as follows in LLVM IR
>>> before machine code generation:
>>>
>>>     declare dso_local i64 @bar()
>>>     define dso_local void @foo(i32 %p) {
>>>     entry:
>>>       %call = call i64 @bar()
>>>       %v32 = trunc i64 %call to i32
>>>       tail call void asm sideeffect "$0 += 1", "r"(i32 %v32)
>>>       ret void
>>>     }
>>>
>>> Initial selection DAG:
>>>
>>>     $ llc -debug-only=isel -march=bpf -mcpu=v3 --filetype=asm -o - t.ll
>>>     SelectionDAG has 21 nodes:
>>>       ...
>>>       t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>>>    !     t11: i32 = truncate t10
>>>    !    t15: i64 = zero_extend t11
>>>       t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t15
>>>         t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>>                          TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>>       ...
>>>
>>> Note values t11 and t15 marked with (!).
>>>
>>> Optimized lowered selection DAG for this fragment:
>>>
>>>     t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>>>   !   t22: i64 = and t10, Constant:i64<4294967295>
>>>     t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>>>       t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>>                        TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>>
>>> Note (zext (truncate ...)) converted to (and ... 0xffff_ffff).
>>>
>>> DAG after instruction selection:
>>>
>>>     t10: i64,ch,glue = CopyFromReg t8:1, Register:i64 $r0, t8:2
>>>   !     t25: i64 = SLL_ri t10, TargetConstant:i64<32>
>>>   !   t22: i64 = SRL_ri t25, TargetConstant:i64<32>
>>>     t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>>>       t23: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>>                        TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>>
>>> Note (and ... 0xffff_ffff) converted to (SRL_ri (SLL_ri ...)).
>>> This happens because of the following pattern from BPFInstrInfo.td:
>>>
>>>     // 0xffffFFFF doesn't fit into simm32, optimize common case
>>>     def : Pat<(i64 (and (i64 GPR:$src), 0xffffFFFF)),
>>>               (SRL_ri (SLL_ri (i64 GPR:$src), 32), 32)>;
>>>
>>> So, the two shift instructions are result of translation of (zext (trunc ...)).
>>> However, closer examination shows that zext DAG node was generated
>>> almost by accident. Here is the backtrace for when this node was created:
>>>
>>>     Breakpoint 1, llvm::SelectionDAG::getNode (... Opcode=202) ;; 202 is opcode for ZERO_EXTEND
>>>         at .../SelectionDAG.cpp:5605
>>>     (gdb) bt
>>>     #0  llvm::SelectionDAG::getNode (...)
>>>         at ...SelectionDAG.cpp:5605
>>>     #1  0x... in getCopyToParts (..., ExtendKind=llvm::ISD::ZERO_EXTEND)
>>>         at .../SelectionDAGBuilder.cpp:537
>>>     #2  0x... in llvm::RegsForValue::getCopyToRegs (... PreferredExtendType=llvm::ISD::ANY_EXTEND)
>>>         at .../SelectionDAGBuilder.cpp:958
>>>     #3  0x... in llvm::SelectionDAGBuilder::visitInlineAsm(...)
>>>         at .../SelectionDAGBuilder.cpp:9640
>>>         ...
>>>
>>> The stack frame #2 is interesting, here is the code for it [4]:
>>>
>>>     void RegsForValue::getCopyToRegs(SDValue Val, SelectionDAG &DAG,
>>>                                      const SDLoc &dl, SDValue &Chain, SDValue *Glue,
>>>                                      const Value *V,
>>>                                      ISD::NodeType PreferredExtendType) const {
>>>                                                    ^
>>>                                                    '-- this is ANY_EXTEND
>>>       ...
>>>       for (unsigned Value = 0, Part = 0, e = ValueVTs.size(); Value != e; ++Value) {
>>>         ...
>>>                                                    .-- this returns true
>>>                                                    v
>>>         if (ExtendKind == ISD::ANY_EXTEND && TLI.isZExtFree(Val, RegisterVT))
>>>           ExtendKind = ISD::ZERO_EXTEND;
>>>
>>>                                .-- this is ZERO_EXTEND
>>>                                v
>>>         getCopyToParts(..., ExtendKind);
>>>         Part += NumParts;
>>>       }
>>>       ...
>>>     }
>>>
>>> The getCopyToRegs() function was called with ANY_EXTEND preference,
>>> but switched to ZERO_EXTEND because TLI.isZExtFree() currently returns
>>> true for any 32 to 64-bit conversion [5].
>>> However, in this case this is clearly a mistake, as zero extension of
>>> (zext i64 (truncate i32 ...)) costs two instructions.
>>>
>>> The isZExtFree() behavior could be changed to report false for such
>>> situations, as in my patch [6]. This in turn removes zext =>
>>> removes two shifts from final asm.
>>> Here is how DAG/asm look after patch [6]:
>>>
>>>     Initial selection DAG:
>>>       ...
>>>       t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>>>   !   t11: i32 = truncate t10
>>>       t16: ch,glue = CopyToReg t10:1, Register:i64 %1, t10
>>>         t18: ch,glue = inlineasm t16, TargetExternalSymbol:i64'$0 += 1', MDNode:ch<null>,
>>>                          TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t16:1
>>>       ...
>>>
>>> Final asm:
>>>
>>>     ...
>>>     # %bb.0:
>>>         call bar
>>>         #APP
>>>         r0 += 1
>>>         #NO_APP
>>>         exit
>>>     ...
>>>
>>> Note that [6] is a very minor change, it does not affect code
>>> generation for selftests at all and I was unable to conjure examples
>>> where it has effect aside from inline asm parameters.
>>>
>>> [4] https://github.com/llvm/llvm-project/blob/365fbbfbcfefb8766f7716109b9c3767b58e6058/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L937C10-L937C10
>>> [5] https://github.com/llvm/llvm-project/blob/6f4cc1310b12bc59210e4596a895db4cb9ad6075/llvm/lib/Target/BPF/BPFISelLowering.cpp#L213C21-L213C21
>>> [6] https://github.com/llvm/llvm-project/commit/cf8e142e5eac089cc786c671a40fef022d08b0ef
>>>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-09 10:49             ` Jose E. Marchesi
  2024-01-09 12:09               ` Jose E. Marchesi
@ 2024-01-11  2:46               ` Alexei Starovoitov
  2024-01-11 10:34                 ` Jose E. Marchesi
  1 sibling, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2024-01-11  2:46 UTC (permalink / raw)
  To: Jose E. Marchesi
  Cc: Eduard Zingerman, Jose E. Marchesi, Kumar Kartikeya Dwivedi,
	Yonghong Song, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Daniel Xu, John Fastabend, bpf, Kernel Team

On Tue, Jan 9, 2024 at 3:00 AM Jose E. Marchesi
<jose.marchesi@oracle.com> wrote:
>
> >
> > Also need to align with GCC. (Jose cc-ed)
>
> GCC doesn't have an integrated assembler, so using -masm=pseudoc it just
> compiles the program above to:
>
>   foo:
>         call bar
>         r0 += 1
>         exit
>
> Also, at the moment we don't support a "w" constraint, because the
> assembly-like assembly syntax we started with implies different
> instructions that interpret the values stored in the BPF 64-bit
> registers as 32-bit or 64-bit values, i.e.
>
>   mov %r1, 1
>   mov32 %r1, 1

Heh. gcc tried to invent a traditional looking asm for bpf and instead
invented the above :)
x86 and arm64 use single 'mov' and encode sub-registers as rax/eax or x0/w0.

imo support of gcc-only asm style is an obstacle in gcc-bpf adoption.
It's not too late to reconsider supporting this. You can easily
remove the support and it will reduce your maintenance/support work.
It's a bit of a distraction in this thread too.

> But then the pseudo-c assembly syntax (that we also support) translates
> some of the semantics of the instructions to the register names,
> creating the notion that BPF actually has both 32-bit registers and
> 64-bit registers, i.e.
>
>   r1 += 1
>   w1 += 1
>
> In GCC we support both assembly syntaxes and currently we lack the
> ability to emit 32-bit variants in templates like "%[reg] += 1", so I
> suppose we can introduce a "w" constraint to:
>
> 2. When pseudo-c assembly syntax is used, expect a 32-bit mode to match
>    the operand and warn about operand size overflow whenever necessary,
>    and then emit "w" instead of "r" as the register name.

clang supports "w" constraint with -mcpu=v3,v4 and emits 'w'
as register name.

> > And, the most importantly, we need a way to go back to old behavior,
> > since u32 var; asm("...":: "r"(var)); will now
> > allocate "w" register or warn.
>
> Is it really necessary to change the meaning of "r"?  You can write
> templates like the one triggering this problem like:
>
>   asm volatile ("%[reg] += 1"::[reg]"w"((unsigned)bar()));
>
> Then the checks above will be performed, driven by the particular
> constraint explicitly specified by the user, not driven by the type of
> the value passed as the operand.

That's a good question.
For x86 "r" constraint means 8, 16, 32, or 64 bit integer.
For arm64 "r" constraint means 32 or 64 bit integer.

and this is traditional behavior of "r" in other asms too:
AMDGPU - 32 or 64
Hexagon - 32 or 64
powerpc - 32 or 64
risc-v - 32 or 64

imo it makes sense for bpf asm to align with the rest so that:

asm volatile ("%[reg] += 1"::[reg]"r"((unsigned)bar())); would generate
w0 += 1, NO warn (with -mcpu=v3,v4; and a warn with -mcpu=v1,v2)

asm volatile ("%[reg] += 1"::[reg]"r"((unsigned long)bar()));
r0 += 1, NO warn

asm volatile ("%[reg] += 1"::[reg]"w"((unsigned)bar()));
w0 += 1, NO warn

asm volatile ("%[reg] += 1"::[reg]"w"((unsigned long)bar()));
w0 += 1 and a warn (currently there is none in clang)

I think we can add "R" constraint to mean 64-bit register only:

asm volatile ("%[reg] += 1"::[reg]"R"((unsigned)bar()));
r0 += 1 and a warn

asm volatile ("%[reg] += 1"::[reg]"R"((unsigned long)bar()));
r0 += 1, NO warn

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-11  2:46               ` Alexei Starovoitov
@ 2024-01-11 10:34                 ` Jose E. Marchesi
  0 siblings, 0 replies; 32+ messages in thread
From: Jose E. Marchesi @ 2024-01-11 10:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eduard Zingerman, Jose E. Marchesi, Kumar Kartikeya Dwivedi,
	Yonghong Song, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Daniel Xu, John Fastabend, bpf, Kernel Team


> On Tue, Jan 9, 2024 at 3:00 AM Jose E. Marchesi
> <jose.marchesi@oracle.com> wrote:
>>
>> >
>> > Also need to align with GCC. (Jose cc-ed)
>>
>> GCC doesn't have an integrated assembler, so using -masm=pseudoc it just
>> compiles the program above to:
>>
>>   foo:
>>         call bar
>>         r0 += 1
>>         exit
>>
>> Also, at the moment we don't support a "w" constraint, because the
>> assembly-like assembly syntax we started with implies different
>> instructions that interpret the values stored in the BPF 64-bit
>> registers as 32-bit or 64-bit values, i.e.
>>
>>   mov %r1, 1
>>   mov32 %r1, 1
>
> Heh. gcc tried to invent a traditional looking asm for bpf and instead
> invented the above :)

Very funny, but we didn't invent it.  We took it from ubpf.

> x86 and arm64 use single 'mov' and encode sub-registers as rax/eax or
> x0/w0.

Yes, both targets support specifying portions of the 64-bit registers
using pseudo-register names, which is a better approach than using
explicit mnemonics for the 32-bit operations (mov32, add32, etc.)
because it makes it possible to choose the instruction on a
per-operand basis, e.g. letting the mode of the arguments actually
passed to inline assembly influence the operation to be performed.

It is nice to have it also in BPF.
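
As a small illustration of that per-operand selection, a sketch for
x86_64 (which concrete register gets allocated is up to the compiler;
the point is that the printed name's width follows the operand's mode):

    void demo(void)
    {
            unsigned char  c = 0;
            unsigned short s = 0;
            unsigned int   i = 0;
            unsigned long  l = 0;

            asm volatile("incb %0" : "+r"(c)); /* 8-bit name, e.g. %al   */
            asm volatile("incw %0" : "+r"(s)); /* 16-bit name, e.g. %ax  */
            asm volatile("incl %0" : "+r"(i)); /* 32-bit name, e.g. %eax */
            asm volatile("incq %0" : "+r"(l)); /* 64-bit name, e.g. %rax */
    }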

> imo support of gcc-only asm style is an obstacle in gcc-bpf adoption.
> It's not too far to reconsider supporting this. You can easily
> remove the support and it will reduce your maintenance/support work.
> It's a bit of a distraction in this thread too.
>
>> But then the pseudo-c assembly syntax (that we also support) translates
>> some of the semantics of the instructions to the register names,
>> creating the notion that BPF actually has both 32-bit registers and
>> 64-bit registers, i.e.
>>
>>   r1 += 1
>>   w1 += 1
>>
>> In GCC we support both assembly syntaxes and currently we lack the
>> ability to emit 32-bit variants in templates like "%[reg] += 1", so I
>> suppose we can introduce a "w" constraint to:
>>
>> 2. When pseudo-c assembly syntax is used, expect a 32-bit mode to match
>>    the operand and warn about operand size overflow whenever necessary,
>>    and then emit "w" instead of "r" as the register name.
>
> clang supports "w" constraint with -mcpu=v3,v4 and emits 'w'
> as register name.
>
>> > And, the most importantly, we need a way to go back to old behavior,
>> > since u32 var; asm("...":: "r"(var)); will now
>> > allocate "w" register or warn.
>>
>> Is it really necessary to change the meaning of "r"?  You can write
>> templates like the one triggering this problem like:
>>
>>   asm volatile ("%[reg] += 1"::[reg]"w"((unsigned)bar()));
>>
>> Then the checks above will be performed, driven by the particular
>> constraint explicitly specified by the user, not driven by the type of
>> the value passed as the operand.
>
> That's a good question.
> For x86 "r" constraint means 8, 16, 32, or 64 bit integer.
> For arm64 "r" constraint means 32 or 64 bit integer.
>
> and this is traditional behavior of "r" in other asms too:
> AMDGPU - 32 or 64
> Hexagon - 32 or 64
> powerpc - 32 or 64
> risc-v - 32 or 64
> imo it makes sense for bpf asm to align with the rest so that:

Yes you are right and I agree.  It makes sense to follow the established
practice where "r" can lead to any pseudo-register name depending on the
mode of the operand, like in x86_64:

   char     -> %al
   short    -> %ax
   int      -> %eax
   long int -> %rax

And then add diagnostics conditioned on the availability of 32-bit
instructions (alu32).

>
> asm volatile ("%[reg] += 1"::[reg]"r"((unsigned)bar())); would generate
> w0 += 1, NO warn (with -mcpu=v3,v4; and a warn with -mcpu=v1,v2)
>
> asm volatile ("%[reg] += 1"::[reg]"r"((unsigned long)bar()));
> r0 += 1, NO warn
>
> asm volatile ("%[reg] += 1"::[reg]"w"((unsigned)bar()));
> w0 += 1, NO warn
>
> asm volatile ("%[reg] += 1"::[reg]"w"((unsigned long)bar()));
> w0 += 1 and a warn (currently there is none in clang)

Makes sense to me.

> I think we can add "R" constraint to mean 64-bit register only:
>
> asm volatile ("%[reg] += 1"::[reg]"R"((unsigned)bar()));
> r0 += 1 and a warn
>
> asm volatile ("%[reg] += 1"::[reg]"R"((unsigned long)bar()));
> r0 += 1, NO warn

The x86 target has similar constraints "q" (for %Rl registers) and "Q"
(for %Rh registers) but not for 32 and 64 pseudo-registers that I can
see.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-08 21:33           ` asm register constraint. Was: " Alexei Starovoitov
  2024-01-08 23:22             ` Yonghong Song
  2024-01-09 10:49             ` Jose E. Marchesi
@ 2024-01-11 16:46             ` Eduard Zingerman
  2 siblings, 0 replies; 32+ messages in thread
From: Eduard Zingerman @ 2024-01-11 16:46 UTC (permalink / raw)
  To: Alexei Starovoitov, Jose E. Marchesi, Jose E. Marchesi
  Cc: Kumar Kartikeya Dwivedi, Yonghong Song, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Daniel Xu, John Fastabend,
	bpf, Kernel Team

On Mon, 2024-01-08 at 13:33 -0800, Alexei Starovoitov wrote:
[...]

> Agree that llvm fix [6] is a necessary step, then
> using 'w' in v3/v4 and warn on v1/v2 makes sense too,
> but we have this macro:
> #define barrier_var(var) asm volatile("" : "+r"(var))
> that is used in various bpf production programs.
> tetragon is also a heavy user of inline asm.

I have an llvm version [1] that implements 32-bit/64-bit register
selection and it indeed breaks some code when tested on BPF selftests.
The main pain point is that the code base is compiled both with and
without ALU32, so the 'r' constraint cannot be used with 'int'
(and constants are 'int' by default), hence fixes like:

        size_t off = (size_t)buf->head;
-       asm("%0 -= %1" : "+r"(off) : "r"(buf->skb->data));
+       asm("%0 -= %1" : "+r"(off) : "r"((__u64)buf->skb->data));
        return off;

or

 #ifndef bpf_nop_mov
 #define bpf_nop_mov(var) \
-       asm volatile("%[reg]=%[reg]"::[reg]"r"((short)var))
+       asm volatile("%[reg]=%[reg]"::[reg]"r"((unsigned long)var))
 #endif

or

        int save_syn = 1;
        int rv = -1;
        int v = 0;
-       int op;
+       long op;
        ...
        asm volatile (
            "%[op] = *(u32 *)(%[skops] +96)"
            : [op] "+r"(op)
            : [skops] "r"(skops)
            :);

[1] https://github.com/eddyz87/llvm-project/tree/bpf-inline-asm-polymorphic-r

Need a bit more time to share the branch with updates for selftests.

> And, the most importantly, we need a way to go back to old behavior,
> since u32 var; asm("...":: "r"(var)); will now
> allocate "w" register or warn.
>
> What should be the workaround?

The u64 cast is the workaround.

> I've tried:
> u32 var; asm("...":: "r"((u64)var));
>
> https://godbolt.org/z/n4ejvWj7v
>
> and x86/arm64 generate 32-bit truncation.
> Sounds like the bpf backend has to do it as well.

Here is godbolt output:

    callq   bar()@PLT
    movl    %eax, %eax ; <-- (u64)var is translated as zero extension
    movq    %rax, %rax ; <-- inline asm block uses 64-bit register

Here is LLVM IR for this example before code generation:

    %call = tail call i64 @bar() #2
    %conv1 = and i64 %call, 4294967295 ; <------- `(u64)var` translation
    tail call void asm sideeffect "mov $0, $0",
              "r,~{dirflag},~{fpsr},~{flags}"(i64 %conv1) #2
    ret void

> We should be doing 'wX=wX' in such case (just like x86)
> instead of <<=32 >>=32.

Opened pull request for this:
https://github.com/llvm/llvm-project/pull/77501

> We probably need some macro (like we did for __BPF_CPU_VERSION__)
> to identify such fixed llvm, so existing users with '(short)'
> workaround and other tricks can detect new vs old compiler.
> 
> Looks like we opened a can of worms.
> Aligning with x86/arm64 makes sense, but the cost of doing
> the right thing is hard to estimate.

Indeed.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-09 12:09               ` Jose E. Marchesi
@ 2024-01-11 18:33                 ` Jose E. Marchesi
  2024-01-15 16:33                   ` Eduard Zingerman
  0 siblings, 1 reply; 32+ messages in thread
From: Jose E. Marchesi @ 2024-01-11 18:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eduard Zingerman, Jose E. Marchesi, Kumar Kartikeya Dwivedi,
	Yonghong Song, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Daniel Xu, John Fastabend, bpf, Kernel Team



> [I have just added a proposal for an agenda item to this week's BPF
>  Office Hours so we can discuss BPF sub-registers and compiler
>  constraints, to complement this thread.]

Notes from the office hours:

Availability of 32-bit arithmetic instructions:

  (cpu >= v3 AND not disabled with -mno-alu32) OR -malu32

Compiler constraints:

  "r"

     64-bit register (rN) or 32-bit sub-register (wN), based on the mode of
     the operand.

     If 32-bit arithmetic available
        char, short -> wN and warning
        int -> wN
        long int -> rN
     Else
        char, short, int -> rN and warning
        long int -> rN

  "w"

     32-bit sub-register (wN) regardless of the mode of the operand.

     If 32-bit arithmetic available
       char, short -> wN and warning
       int -> wN
       long int -> wN and warning
     Else
       char, short, int, long int -> wN and warning

  "R"

     64-bit register (rN) regardless of the mode of the operand.

     char, short, int -> rN and warn
     long int -> rN

Additional constraints for instruction immediates:

  "I" imm32
  "i" imm64  (already exists as "i" is standard.)
  "O" off16

  warning if not in range.
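
Purely as an illustration of how the proposed immediate constraints
might be used (nothing here is implemented; the templates and
constraint letters below are just the proposal from these notes, not
existing compiler behavior):

    void imm_demo(int *p)
    {
            long dst;

            /* "I": 32-bit immediate, warn if out of range */
            asm volatile("%[d] = %[imm]"
                         : [d]"=r"(dst) : [imm]"I"(0x7fffffff));

            /* "O": 16-bit offset immediate in a load template */
            asm volatile("%[d] = *(u32 *)(%[base] + %[off])"
                         : [d]"=r"(dst)
                         : [base]"r"(p), [off]"O"(16));
    }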

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-11 18:33                 ` Jose E. Marchesi
@ 2024-01-15 16:33                   ` Eduard Zingerman
  2024-01-16 17:47                     ` Alexei Starovoitov
  2024-01-16 18:40                     ` Alexei Starovoitov
  0 siblings, 2 replies; 32+ messages in thread
From: Eduard Zingerman @ 2024-01-15 16:33 UTC (permalink / raw)
  To: Jose E. Marchesi, Alexei Starovoitov
  Cc: Jose E. Marchesi, Kumar Kartikeya Dwivedi, Yonghong Song,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Daniel Xu,
	John Fastabend, bpf, Kernel Team

On Thu, 2024-01-11 at 19:33 +0100, Jose E. Marchesi wrote:
> 
> > [I have just added a proposal for an agenda item to this week's BPF
> >  Office Hours so we can discuss about BPF sub-registers and compiler
> >  constraints, to complement this thread.]
> 
> Notes from the office hours:
> 
> Availability of 32-bit arithmetic instructions:
> 
>   (cpu >= v3 AND not disabled with -mno-alu32) OR -malu32
> 
> Compiler constraints:
> 
>   "r"
> 
>      64-bit register (rN) or 32-bit sub-register (wN), based on the mode of
>      the operand.
> 
>      If 32-bit arithmetic available
>         char, short -> wN and warning
>         int -> wN
>         long int -> rN
>      Else
>         char, short, int -> rN and warning
>         long int -> rN
> 
>   "w"
> 
>      32-bit sub-register (wN) regardless of the mode of the operand.
> 
>      If 32-bit arithmetic available
>        char, short -> wN and warning
>        int -> wN
>        long int -> wN and warning
>      Else
>        char, short, int, long int -> wN and warning

Hi All,

I have a preliminary implementation of the agreed logic for "r" and "w"
inline asm constraints in the LLVM branch [0]:

- Constraint "r":
| Operand type | Has ALU32          | No ALU32           |
|--------------+--------------------+--------------------|
| char         | warning            | warning            |
| short        | warning            | warning            |
| int          | use 32-bit "w" reg | warning            |
| long         | use 64-bit "r" reg | use 64-bit "r" reg |

- Constraint "w":
| Operand type | Has ALU32          | No ALU32 |
|--------------+--------------------+----------|
| char         | warning            | warning  |
| short        | warning            | warning  |
| int          | use 32-bit "w" reg | warning  |
| long         | warning            | warning  |
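
To make the tables concrete, a small sketch assuming a build where
ALU32 is available (e.g. -mcpu=v3) and reusing the extern bar() from
earlier examples; the comments reflect the table rows above:

    extern unsigned long bar(void);

    void constraint_demo(void)
    {
      asm volatile("%0 += 1" :: "r"((char)bar()));  /* warning             */
      asm volatile("%0 += 1" :: "r"((int)bar()));   /* uses 32-bit "w" reg */
      asm volatile("%0 += 1" :: "r"((long)bar()));  /* uses 64-bit "r" reg */
      asm volatile("%0 += 1" :: "w"((int)bar()));   /* uses 32-bit "w" reg */
      asm volatile("%0 += 1" :: "w"((long)bar()));  /* warning             */
    }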

The following constraints are not implemented for now:
"R", "I", "i", "O.

I also have patches adapting BPF selftests [1], Cilium [2] and
Tetragon [3] to compile using both current LLVM main and [0].
After porting, selftests [1] are passing, and there are no verification
pass/fail differences when loading [2,3] object files using veristat.

The main take-away from the porting exercise is that all three
projects still use CPUv2, either fully or in parts of the code base.

The issue with CPUv2 is that an 'int' value cannot be passed to an input
"r" constraint w/o an explicit cast, and cannot be passed to an output
constraint at all. Note that by default integer literals in C code
have type 'int', hence passing a constant to an "ri" constraint might
require a cast as well.

E.g. changes for input constraints, necessary for CPUv2:

-   asm("%0 -= %1" : "+r"(off) : "r"(buf->skb->data));
+   asm("%0 -= %1" : "+r"(off) : "r"((__u64)buf->skb->data));

E.g. changes for output (input/output) constraints, necessary for CPUv2:

-   int am;
+   long am;
    ...
    asm volatile("%[am] &= 0xffff;\n" ::[am] "+r"(am)

Also, selftests use a constraint named "g" in one place:

  #define __sink(expr) asm volatile("" : "+g"(expr))

This constraint is documented in [4] as follows:

  "g" Any register, memory or immediate integer operand is allowed,
      except for registers that are not general registers.

In [0] I apply to "g" same checks as to "r".
This is not fully correct, as it warns about e.g. operand 10 (but not 10L)
when built with CPUv2. It looks like some internal Clang interfaces
(shared between targets) would need adjustment to allow different
semantic action on value vs. constant passed to "g" constraint.

The only instance where I had to analyze verifier behavior while
porting the selftests concerns usage of the barrier_var() macro.
Test case verif_scale_pyperf600_bpf_loop contains a code sequence
similar to following:

    #define barrier_var(var) asm volatile("" : "+r"(var))
    ...
    void foo(unsigned int i, unsigned char *ptr) {
      ...
      barrier_var(i);
      if (i >= 100)
        return;
      ... ptr[i] ...;
    }

Previously such code was translated as follows:

    w1 = w1                ; zero extension for inline asm operand
    if w1 > 0x63 goto ...
    ...
    r2 += r1

But now the zero extension is gone.
In the test case the range for 'i' was unknown to the verifier prior to
the comparison, and a 32-bit comparison only produces a range for the
lower half of the register. Thus the verifier reported an error:

  math between map_value pointer and register with
  unbounded min value is not allowed

Unfortunately, I'm not aware of a simple way in C to force this
comparison to be done in 64 bits w/o changing 'i' to a 64-bit type:
InstCombinePass demotes it to a 32-bit comparison.
Hence, I changed the type of 'i' from __u32 to __u64.
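
For reference, a condensed sketch of that change (the function body is
the illustrative pattern from above, not the exact selftest code):

    #define barrier_var(var) asm volatile("" : "+r"(var))

    /* was: unsigned int i */
    void foo(unsigned long i, unsigned char *ptr)
    {
      barrier_var(i);
      if (i >= 100)          /* now a 64-bit comparison */
        return;
      ptr[i] = 0;            /* index is bounded in the full register */
    }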

There are 18 selftest object files with slight differences in code
generation when the new constraint handling logic of [0] is applied.
I'll report on these differences a bit later.

Please let me know what you think about impact of the compiler changes
on the code base.

Thanks,
Eduard

[0] Updated LLVM
    https://github.com/eddyz87/llvm-project/tree/bpf-inline-asm-polymorphic-r
[1] selftests
    https://gist.github.com/eddyz87/276f1ecc51930017dcddbb56e37f57ad
[2] Cilium
    https://gist.github.com/eddyz87/4a485573556012ec730c2de0256a79db
    Note: this is based upon branch 'libbpf-friendliness'
          from https://github.com/anakryiko/cilium
[3] Tetragon
    https://gist.github.com/eddyz87/ca9a4b68007c72469307f2cce3f83bb1
[4] https://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#index-g-in-constraint

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-15 16:33                   ` Eduard Zingerman
@ 2024-01-16 17:47                     ` Alexei Starovoitov
  2024-01-16 19:07                       ` Yonghong Song
  2024-01-16 23:55                       ` Eduard Zingerman
  2024-01-16 18:40                     ` Alexei Starovoitov
  1 sibling, 2 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2024-01-16 17:47 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Jose E. Marchesi, Jose E. Marchesi, Kumar Kartikeya Dwivedi,
	Yonghong Song, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Daniel Xu, John Fastabend, bpf, Kernel Team

On Mon, Jan 15, 2024 at 8:33 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
>
> [0] Updated LLVM
>     https://github.com/eddyz87/llvm-project/tree/bpf-inline-asm-polymorphic-r

1.
// Use sequence 'wX = wX' if 32-bits ops are available.
let Predicates = [BPFHasALU32] in {

This is unnecessary conservative.
wX = wX instructions existed from day one.
The very first commit of the interpreter and the verifier recognized it.
No need to gate it by BPFHasALU32.

2.
case 'w':
if (Size == 32 && HasAlu32)

This is probably unnecessary as well.
When bpf programmer specifies 'w' constraint, llvm should probably use it
regardless of alu32 flag.

aarch64 has this comment:
    case 'x':
    case 'w':
      // For now assume that the person knows what they're
      // doing with the modifier.
      return true;

I'm reading it as the constraint is a final decision.
Inline assembly shouldn't change with -mcpu flags.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-15 16:33                   ` Eduard Zingerman
  2024-01-16 17:47                     ` Alexei Starovoitov
@ 2024-01-16 18:40                     ` Alexei Starovoitov
  2024-01-16 23:15                       ` Eduard Zingerman
  1 sibling, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2024-01-16 18:40 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Jose E. Marchesi, Jose E. Marchesi, Kumar Kartikeya Dwivedi,
	Yonghong Song, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Daniel Xu, John Fastabend, bpf, Kernel Team

On Mon, Jan 15, 2024 at 8:33 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
> [1] selftests
>     https://gist.github.com/eddyz87/276f1ecc51930017dcddbb56e37f57ad
> [2] Cilium
>     https://gist.github.com/eddyz87/4a485573556012ec730c2de0256a79db
>     Note: this is based upon branch 'libbpf-friendliness'
>           from https://github.com/anakryiko/cilium
> [3] Tetragon
>     https://gist.github.com/eddyz87/ca9a4b68007c72469307f2cce3f83bb1


The changes to all three make sense, but they might cause regressions
if they are not synchronized with new llvm.
cilium/tetragon can control the llvm version to some degree, but not selftests.
Should we add a clang macro like __BPF_CPU_VERSION__ and ifdef
different asm styles depending on that?
I suspect this "(short)" workaround will still be needed for quite
some time while people upgrade to the latest llvm.
Something like __BPF_STRICT_ASM_CONSTRAINT__ ?
Maybe also a flag that can revert to the old behavior without warnings?
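
A rough sketch of what such gating could look like on the program side,
reusing the bpf_nop_mov example from earlier in the thread
(__BPF_STRICT_ASM_CONSTRAINT__ is only the name floated above, not an
existing compiler define):

    #ifndef bpf_nop_mov
    #ifdef __BPF_STRICT_ASM_CONSTRAINT__   /* hypothetical feature-test macro */
    #define bpf_nop_mov(var) \
            asm volatile("%[reg]=%[reg]"::[reg]"r"((unsigned long)(var)))
    #else
    #define bpf_nop_mov(var) \
            asm volatile("%[reg]=%[reg]"::[reg]"r"((short)(var)))
    #endif
    #endif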

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-16 17:47                     ` Alexei Starovoitov
@ 2024-01-16 19:07                       ` Yonghong Song
  2024-01-16 19:34                         ` Alexei Starovoitov
  2024-01-16 23:55                       ` Eduard Zingerman
  1 sibling, 1 reply; 32+ messages in thread
From: Yonghong Song @ 2024-01-16 19:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Eduard Zingerman
  Cc: Jose E. Marchesi, Jose E. Marchesi, Kumar Kartikeya Dwivedi,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Daniel Xu,
	John Fastabend, bpf, Kernel Team


On 1/16/24 9:47 AM, Alexei Starovoitov wrote:
> On Mon, Jan 15, 2024 at 8:33 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>>
>> [0] Updated LLVM
>>      https://github.com/eddyz87/llvm-project/tree/bpf-inline-asm-polymorphic-r
> 1.
> // Use sequence 'wX = wX' if 32-bits ops are available.
> let Predicates = [BPFHasALU32] in {
>
> This is unnecessary conservative.
> wX = wX instructions existed from day one.
> The very first commit of the interpreter and the verifier recognized it.
> No need to gate it by BPFHasALU32.

Actually this is not true from llvm perspective.
wX = wX is available in bpf ISA from day one, but
wX register is only introduced in llvm in 2017
and at the same time alu32 is added to facilitate
its usage.

>
> 2.
> case 'w':
> if (Size == 32 && HasAlu32)
>
> This is probably unnecessary as well.
> When bpf programmer specifies 'w' constraint, llvm should probably use it
> regardless of alu32 flag.
>
> aarch64 has this comment:
>      case 'x':
>      case 'w':
>        // For now assume that the person knows what they're
>        // doing with the modifier.
>        return true;
>
> I'm reading it as the constraint is a final decision.
> Inline assembly shouldn't change with -mcpu flags.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-16 19:07                       ` Yonghong Song
@ 2024-01-16 19:34                         ` Alexei Starovoitov
  2024-01-16 23:14                           ` Yonghong Song
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2024-01-16 19:34 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Eduard Zingerman, Jose E. Marchesi, Jose E. Marchesi,
	Kumar Kartikeya Dwivedi, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Daniel Xu, John Fastabend, bpf, Kernel Team

On Tue, Jan 16, 2024 at 11:07 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> On 1/16/24 9:47 AM, Alexei Starovoitov wrote:
> > On Mon, Jan 15, 2024 at 8:33 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
> >>
> >> [0] Updated LLVM
> >>      https://github.com/eddyz87/llvm-project/tree/bpf-inline-asm-polymorphic-r
> > 1.
> > // Use sequence 'wX = wX' if 32-bits ops are available.
> > let Predicates = [BPFHasALU32] in {
> >
> > This is unnecessary conservative.
> > wX = wX instructions existed from day one.
> > The very first commit of the interpreter and the verifier recognized it.
> > No need to gate it by BPFHasALU32.
>
> Actually this is not true from llvm perspective.
> wX = wX is available in bpf ISA from day one, but
> wX register is only introduced in llvm in 2017
> and at the same time alu32 is added to facilitate
> its usage.

Not quite. At that time we added general support in the verifier
for the majority of alu32 insns. The insns worked in the interpreter
earlier, but the verifier didn't handle them,
while wX=wX was supported by the verifier from the start.
So this particular single insn shouldn't be part of the alu32 flag.
It didn't need to be back in 2017 and doesn't need to be now.
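
For context, the single insn in question is just the canonical
zero-extension idiom. Purely as an illustration (my summary of the BPF
ISA, not meant to be assembled):

  r1 = 0x1234567800000001 ll   # load a 64-bit value
  w1 = w1                      # 32-bit mov: upper 32 bits of r1 are zeroed
                               # r1 is now 0x0000000000000001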

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-16 19:34                         ` Alexei Starovoitov
@ 2024-01-16 23:14                           ` Yonghong Song
  2024-01-17 22:43                             ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Yonghong Song @ 2024-01-16 23:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eduard Zingerman, Jose E. Marchesi, Jose E. Marchesi,
	Kumar Kartikeya Dwivedi, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Daniel Xu, John Fastabend, bpf, Kernel Team

On 1/16/24 11:34 AM, Alexei Starovoitov wrote:

> On Tue, Jan 16, 2024 at 11:07 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 1/16/24 9:47 AM, Alexei Starovoitov wrote:
>>> On Mon, Jan 15, 2024 at 8:33 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>>>> [0] Updated LLVM
>>>>       https://github.com/eddyz87/llvm-project/tree/bpf-inline-asm-polymorphic-r
>>> 1.
>>> // Use sequence 'wX = wX' if 32-bits ops are available.
>>> let Predicates = [BPFHasALU32] in {
>>>
>>> This is unnecessary conservative.
>>> wX = wX instructions existed from day one.
>>> The very first commit of the interpreter and the verifier recognized it.
>>> No need to gate it by BPFHasALU32.
>> Actually this is not true from llvm perspective.
>> wX = wX is available in bpf ISA from day one, but
>> wX register is only introduced in llvm in 2017
>> and at the same time alu32 is added to facilitate
>> its usage.
> Not quite. At that time we added general support in the verifier
> for the majority of alu32 insns. The insns worked in the interpreter
> earlier, but the verifier didn't handle them.
> While wX=wX was supported by the verifier from the start.
> So this particular single insn shouldn't be part of alu32 flag
> It didn't need to be back in 2017 and doesn't need to be now.

Okay, IIUC, currently 32-bit subreg is enabled
only if alu32 is enabled.
   if (STI.getHasAlu32())
     addRegisterClass(MVT::i32, &BPF::GPR32RegClass);

We should unconditionally enable 32-bit subreg with:
   addRegisterClass(MVT::i32, &BPF::GPR32RegClass);

We may need to add Alu32 control in some other
places that try to access the 32-bit subreg.
But for the wX = wX thing, we do not need Alu32 control
and the following is okay:
  def : Pat<(i64 (and (i64 GPR:$src), 0xffffFFFF)),
           (INSERT_SUBREG
             (i64 (IMPLICIT_DEF)),
             (MOV_rr_32 (i32 (EXTRACT_SUBREG GPR:$src, sub_32))),
             sub_32)>;

I tried the above change, unconditionally doing
addRegisterClass(MVT::i32, &BPF::GPR32RegClass).

$ cat t1.c
unsigned long test1(unsigned long x) {
         return (unsigned)x;
}
unsigned long test2(unsigned x) {
         return x;
}
#if 0
unsigned test3(unsigned x, unsigned y) {
         return x + y;
}
#endif
$ clang --target=bpf -mcpu=v1 -O2 -c t1.c && llvm-objdump -d t1.o

t1.o:   file format elf64-bpf

Disassembly of section .text:
         
0000000000000000 <test1>:
        0:       bc 10 00 00 00 00 00 00 w0 = w1
        1:       95 00 00 00 00 00 00 00 exit
         
0000000000000010 <test2>:
        2:       bc 10 00 00 00 00 00 00 w0 = w1
        3:       95 00 00 00 00 00 00 00 exit
$

More changes in BPFInstrInfo.td and possibly other
places are needed to make the above test3() function work
at -mcpu=v1.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-16 18:40                     ` Alexei Starovoitov
@ 2024-01-16 23:15                       ` Eduard Zingerman
  0 siblings, 0 replies; 32+ messages in thread
From: Eduard Zingerman @ 2024-01-16 23:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jose E. Marchesi, Jose E. Marchesi, Kumar Kartikeya Dwivedi,
	Yonghong Song, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Daniel Xu, John Fastabend, bpf, Kernel Team

On Tue, 2024-01-16 at 10:40 -0800, Alexei Starovoitov wrote:
[...]
> The changes to all three make sense, but they might cause regressions
> if they are not synchronized with new llvm.
> cilium/tetragon can control the llvm version to some degree, but not selftests.
> Should we add clang macro like __BPF_CPU_VERSION__ and ifdef
> different asm style depending on that?
> I suspect this "(short)" workaround will still be needed for quite
> some time while people upgrade to the latest llvm.
> something like __BPF_STRICT_ASM_CONSTRAINT__ ?
> Maybe a flag too that can revert to old behavior without warnings?

After my changes selftests are passing with both the old and the new
constraint semantics, so such macro definitions / compiler flags
are not necessary for selftests.
(Although, I have not yet checked the codegen difference,
 so the absence of the "(short)" thing might be visible there).
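
One way to check that, reusing the clang/llvm-objdump flow used elsewhere
in this thread (prog.c standing in for any bpf_cmp-using selftest):

  clang --target=bpf -mcpu=v3 -O2 -c prog.c -o prog.o && llvm-objdump -d prog.o

and then diff the disassembly produced by the old and the new llvm.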

As for Cilium / Tetragon: I checked verification of the object files
with both LLVM versions, but adding a compiler flag might make sense.
Maybe compiler users should comment?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-16 17:47                     ` Alexei Starovoitov
  2024-01-16 19:07                       ` Yonghong Song
@ 2024-01-16 23:55                       ` Eduard Zingerman
  1 sibling, 0 replies; 32+ messages in thread
From: Eduard Zingerman @ 2024-01-16 23:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jose E. Marchesi, Jose E. Marchesi, Kumar Kartikeya Dwivedi,
	Yonghong Song, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Daniel Xu, John Fastabend, bpf, Kernel Team

On Tue, 2024-01-16 at 09:47 -0800, Alexei Starovoitov wrote:

[...]

> 2.
> case 'w':
> if (Size == 32 && HasAlu32)
> 
> This is probably unnecessary as well.
> When bpf programmer specifies 'w' constraint, llvm should probably use it
> regardless of alu32 flag.
> 
> aarch64 has this comment:
>     case 'x':
>     case 'w':
>       // For now assume that the person knows what they're
>       // doing with the modifier.
>       return true;
> 
> I'm reading it as the constraint is a final decision.
> Inline assembly shouldn't change with -mcpu flags.

llvm-mc -arch=bpf -mcpu=v2 compiles the asm below w/o warnings:

	.text
	.file	"test-asm-shifts.c"
	.globl	foo
	.p2align	3
	.type	foo,@function
foo:
	call bar
	w0 += 1
	exit
.Lfunc_end0:
	.size	foo, .Lfunc_end0-foo

However, clang would complain about the "w" constraint when it is used
in inline assembly with CPUv2:

  error: couldn't allocate input reg for constraint 'w'

As Yonghong notes in the sibling thread, this would probably require
enabling 32-bit sub-registers for CPUv2.
Will need some time to investigate.
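
For reference, a minimal reproducer sketch of that error (hypothetical
function; assumes the 'w' constraint behavior described above):

/* clang --target=bpf -mcpu=v2 -O2 -c repro.c fails with an error like:
 *   error: couldn't allocate input reg for constraint 'w'
 * while the same file builds with -mcpu=v3, where alu32 is enabled */
int inc32(int x)
{
	asm volatile("%[v] += 1" : [v] "+w"(x));
	return x;
}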

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
  2024-01-16 23:14                           ` Yonghong Song
@ 2024-01-17 22:43                             ` Alexei Starovoitov
  0 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2024-01-17 22:43 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Eduard Zingerman, Jose E. Marchesi, Jose E. Marchesi,
	Kumar Kartikeya Dwivedi, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Daniel Xu, John Fastabend, bpf, Kernel Team

On Tue, Jan 16, 2024 at 3:14 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> On 1/16/24 11:34 AM, Alexei Starovoitov wrote:
>
> > On Tue, Jan 16, 2024 at 11:07 AM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>
> >> On 1/16/24 9:47 AM, Alexei Starovoitov wrote:
> >>> On Mon, Jan 15, 2024 at 8:33 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
> >>>> [0] Updated LLVM
> >>>>       https://github.com/eddyz87/llvm-project/tree/bpf-inline-asm-polymorphic-r
> >>> 1.
> >>> // Use sequence 'wX = wX' if 32-bits ops are available.
> >>> let Predicates = [BPFHasALU32] in {
> >>>
> >>> This is unnecessary conservative.
> >>> wX = wX instructions existed from day one.
> >>> The very first commit of the interpreter and the verifier recognized it.
> >>> No need to gate it by BPFHasALU32.
> >> Actually this is not true from llvm perspective.
> >> wX = wX is available in bpf ISA from day one, but
> >> wX register is only introduced in llvm in 2017
> >> and at the same time alu32 is added to facilitate
> >> its usage.
> > Not quite. At that time we added general support in the verifier
> > for the majority of alu32 insns. The insns worked in the interpreter
> > earlier, but the verifier didn't handle them.
> > While wX=wX was supported by the verifier from the start.
> > So this particular single insn shouldn't be part of alu32 flag
> > It didn't need to be back in 2017 and doesn't need to be now.
>
> Okay, IIUC, currently 32-bit subreg is enabled
> only if alu32 is enabled.
>    if (STI.getHasAlu32())
>      addRegisterClass(MVT::i32, &BPF::GPR32RegClass);
>
> We should unconditionally enable 32-bit subreg with.
>    addRegisterClass(MVT::i32, &BPF::GPR32RegClass);

Makes sense to me.
It should be fine with -mcpu=v1,v2.
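
In diff form, the change under discussion would presumably sit in
BPFISelLowering.cpp, roughly:

  -  if (STI.getHasAlu32())
  -    addRegisterClass(MVT::i32, &BPF::GPR32RegClass);
  +  addRegisterClass(MVT::i32, &BPF::GPR32RegClass);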

> We may need to add Alu32 control in some other
> places which trying to access 32-bit subreg.
> But for wX = wX thing, we do not need Alu32 control
> and the following is okay:
>   def : Pat<(i64 (and (i64 GPR:$src), 0xffffFFFF)),
>            (INSERT_SUBREG
>              (i64 (IMPLICIT_DEF)),
>              (MOV_rr_32 (i32 (EXTRACT_SUBREG GPR:$src, sub_32))),
>              sub_32)>;
>
> I tried with the above change with unconditionally
> doing addRegisterClass(MVT::i32, &BPF::GPR32RegClass).
>
> $ cat t1.c
> unsigned long test1(unsigned long x) {
>          return (unsigned)x;
> }
> unsigned long test2(unsigned x) {
>          return x;
> }
> #if 0
> unsigned test3(unsigned x, unsigned y) {
>          return x + y;
> }
> #endif
> $ clang --target=bpf -mcpu=v1 -O2 -c t1.c && llvm-objdump -d t1.o
>
> t1.o:   file format elf64-bpf
>
> Disassembly of section .text:
>
> 0000000000000000 <test1>:
>         0:       bc 10 00 00 00 00 00 00 w0 = w1
>         1:       95 00 00 00 00 00 00 00 exit
>
> 0000000000000010 <test2>:
>         2:       bc 10 00 00 00 00 00 00 w0 = w1
>         3:       95 00 00 00 00 00 00 00 exit
> $
>
> More changes in BPFInstrInfo.td and possible other
> places need to make the above test3() function work
> at -mcpu=v1.

All makes sense to me.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2024-01-17 22:44 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-21  3:38 [PATCH v2 bpf-next 0/5] bpf: volatile compare Alexei Starovoitov
2023-12-21  3:38 ` [PATCH v2 bpf-next 1/5] selftests/bpf: Attempt to build BPF programs with -Wsign-compare Alexei Starovoitov
2023-12-21  3:38 ` [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro Alexei Starovoitov
2023-12-21  4:27   ` Kumar Kartikeya Dwivedi
2023-12-22 22:59     ` Alexei Starovoitov
2023-12-25 20:33       ` Alexei Starovoitov
2024-01-04 20:06         ` Eduard Zingerman
2024-01-04 21:02           ` Alexei Starovoitov
2024-01-05 21:47         ` Eduard Zingerman
2024-01-08 21:21           ` Yonghong Song
2024-01-08 23:16             ` Eduard Zingerman
2024-01-08 21:33           ` asm register constraint. Was: " Alexei Starovoitov
2024-01-08 23:22             ` Yonghong Song
2024-01-09 10:49             ` Jose E. Marchesi
2024-01-09 12:09               ` Jose E. Marchesi
2024-01-11 18:33                 ` Jose E. Marchesi
2024-01-15 16:33                   ` Eduard Zingerman
2024-01-16 17:47                     ` Alexei Starovoitov
2024-01-16 19:07                       ` Yonghong Song
2024-01-16 19:34                         ` Alexei Starovoitov
2024-01-16 23:14                           ` Yonghong Song
2024-01-17 22:43                             ` Alexei Starovoitov
2024-01-16 23:55                       ` Eduard Zingerman
2024-01-16 18:40                     ` Alexei Starovoitov
2024-01-16 23:15                       ` Eduard Zingerman
2024-01-11  2:46               ` Alexei Starovoitov
2024-01-11 10:34                 ` Jose E. Marchesi
2024-01-11 16:46             ` Eduard Zingerman
2023-12-21  3:38 ` [PATCH v2 bpf-next 3/5] selftests/bpf: Convert exceptions_assert.c to bpf_cmp Alexei Starovoitov
2023-12-21  3:38 ` [PATCH v2 bpf-next 4/5] selftests/bpf: Remove bpf_assert_eq-like macros Alexei Starovoitov
2023-12-21  3:38 ` [RFC PATCH v2 bpf-next 5/5] selftests/bpf: Attempt to convert profiler.c to bpf_cmp Alexei Starovoitov
2023-12-21 18:01 ` [PATCH v2 bpf-next 0/5] bpf: volatile compare Jiri Olsa
