bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* selftest sock_fields failed on s390x with latest llvm17
@ 2023-05-03 19:35 Yonghong Song
  2023-05-03 19:46 ` Ilya Leoshkevich
  0 siblings, 1 reply; 10+ messages in thread
From: Yonghong Song @ 2023-05-03 19:35 UTC (permalink / raw)
  To: Ilya Leoshkevich; +Cc: bpf, Kui-Feng Lee, Manu Bretelle

Hi, Ilya,

BPF CI ([1]) detected a s390x failure when bpf program is compiled with 
latest llvm17 on bpf-next branch. To reproduce the issue, just run 
normal './test_progs -j'. The failure log looks like below:

Notice: Success: 341/3015, Skipped: 29, Failed: 1
Error: #191 sock_fields
   Error: #191 sock_fields
   create_netns:PASS:create netns 0 nsec
   create_netns:PASS:bring up lo 0 nsec
   serial_test_sock_fields:PASS:test_sock_fields__open_and_load 0 nsec
   serial_test_sock_fields:PASS:attach_cgroup(egress_read_sock_fields) 0 
nsec
   serial_test_sock_fields:PASS:attach_cgroup(ingress_read_sock_fields) 
0 nsec
   serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_port 0 nsec
   test:PASS:getsockname(listen_fd) 0 nsec
   test:PASS:getsockname(cli_fd) 0 nsec
   test:PASS:accept(listen_fd) 0 nsec
   init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt_fd) 0 nsec
   init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt10_fd) 0 nsec
   test:PASS:send(accept_fd) 0 nsec
   test:PASS:recv(cli_fd) 0 nsec
   test:PASS:send(accept_fd) 0 nsec
   test:PASS:recv(cli_fd) 0 nsec
   test:PASS:recv(accept_fd) for fin 0 nsec
   test:PASS:recv(cli_fd) for fin 0 nsec
   check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt, 
&accept_fd) 0 nsec
   check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt, 
&cli_fd) 0 nsec
   check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
   check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
   check_result:PASS:bpf_map_lookup_elem(linum_map_fd, 
READ_SK_DST_PORT_IDX) 0 nsec
   check_result:FAIL:failure in read_sk_dst_port on line unexpected 
failure in read_sk_dst_port on line: actual 297 != expected 0
   listen_sk: state:10 bound_dev_if:0 family:10 type:1 protocol:6 mark:0 
priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1) 
src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::) dst_port:0
   srv_sk: state:9 bound_dev_if:0 family:10 type:1 protocol:6 mark:0 
priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1) 
src_port:51966 dst_ip4:7f000006(127.0.0.6) dst_ip6:0:0:0:1(::1) 
dst_port:38030
   cli_sk: state:5 bound_dev_if:0 family:10 type:1 protocol:6 mark:0 
priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1) 
src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1) dst_port:51966
   listen_tp: snd_cwnd:10 srtt_us:0 rtt_min:4294967295 
snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0 mss_cache:536 
ecn_flags:0 rate_delivered:0 rate_interval_us:0 packets_out:0 
retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0 segs_out:0 
data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:0 bytes_acked:0
   srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272 snd_ssthresh:2147483647 
rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810 mss_cache:32768 
ecn_flags:0 rate_delivered:1 rate_interval_us:272 packets_out:0 
retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0 segs_out:3 
data_segs_out:2 lost_out:0 sacked_out:0 bytes_received:1 bytes_acked:22
   cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730 snd_ssthresh:2147483647 
rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715 mss_cache:32768 
ecn_flags:0 rate_delivered:1 rate_interval_us:925 packets_out:0 
retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2 segs_out:6 
data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:23 bytes_acked:2
   check_result:PASS:listen_sk 0 nsec
   check_result:PASS:srv_sk 0 nsec
   check_result:PASS:srv_tp 0 nsec

If bpf program is compiled with llvm16, the test passed according to
a CI run.

I don't have s390x environment to debug this. Could you help debug it?

Thanks!

   [1] 
https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: selftest sock_fields failed on s390x with latest llvm17
  2023-05-03 19:35 selftest sock_fields failed on s390x with latest llvm17 Yonghong Song
@ 2023-05-03 19:46 ` Ilya Leoshkevich
  2023-05-12 14:40   ` Ilya Leoshkevich
  0 siblings, 1 reply; 10+ messages in thread
From: Ilya Leoshkevich @ 2023-05-03 19:46 UTC (permalink / raw)
  To: Yonghong Song; +Cc: bpf, Kui-Feng Lee, Manu Bretelle

On Wed, 2023-05-03 at 12:35 -0700, Yonghong Song wrote:
> Hi, Ilya,
> 
> BPF CI ([1]) detected a s390x failure when bpf program is compiled
> with 
> latest llvm17 on bpf-next branch. To reproduce the issue, just run 
> normal './test_progs -j'. The failure log looks like below:
> 
> Notice: Success: 341/3015, Skipped: 29, Failed: 1
> Error: #191 sock_fields
>    Error: #191 sock_fields
>    create_netns:PASS:create netns 0 nsec
>    create_netns:PASS:bring up lo 0 nsec
>    serial_test_sock_fields:PASS:test_sock_fields__open_and_load 0
> nsec
>   
> serial_test_sock_fields:PASS:attach_cgroup(egress_read_sock_fields) 0
> nsec
>   
> serial_test_sock_fields:PASS:attach_cgroup(ingress_read_sock_fields) 
> 0 nsec
>    serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_port 0 nsec
>    test:PASS:getsockname(listen_fd) 0 nsec
>    test:PASS:getsockname(cli_fd) 0 nsec
>    test:PASS:accept(listen_fd) 0 nsec
>    init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt_fd) 0 nsec
>    init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt10_fd) 0
> nsec
>    test:PASS:send(accept_fd) 0 nsec
>    test:PASS:recv(cli_fd) 0 nsec
>    test:PASS:send(accept_fd) 0 nsec
>    test:PASS:recv(cli_fd) 0 nsec
>    test:PASS:recv(accept_fd) for fin 0 nsec
>    test:PASS:recv(cli_fd) for fin 0 nsec
>    check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt, 
> &accept_fd) 0 nsec
>    check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt, 
> &cli_fd) 0 nsec
>    check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
>    check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
>    check_result:PASS:bpf_map_lookup_elem(linum_map_fd, 
> READ_SK_DST_PORT_IDX) 0 nsec
>    check_result:FAIL:failure in read_sk_dst_port on line unexpected 
> failure in read_sk_dst_port on line: actual 297 != expected 0
>    listen_sk: state:10 bound_dev_if:0 family:10 type:1 protocol:6
> mark:0 
> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1) 
> src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::) dst_port:0
>    srv_sk: state:9 bound_dev_if:0 family:10 type:1 protocol:6 mark:0 
> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1) 
> src_port:51966 dst_ip4:7f000006(127.0.0.6) dst_ip6:0:0:0:1(::1) 
> dst_port:38030
>    cli_sk: state:5 bound_dev_if:0 family:10 type:1 protocol:6 mark:0 
> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1) 
> src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1) dst_port:51966
>    listen_tp: snd_cwnd:10 srtt_us:0 rtt_min:4294967295 
> snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0 mss_cache:536 
> ecn_flags:0 rate_delivered:0 rate_interval_us:0 packets_out:0 
> retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0 segs_out:0 
> data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:0
> bytes_acked:0
>    srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272
> snd_ssthresh:2147483647 
> rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810
> mss_cache:32768 
> ecn_flags:0 rate_delivered:1 rate_interval_us:272 packets_out:0 
> retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0 segs_out:3 
> data_segs_out:2 lost_out:0 sacked_out:0 bytes_received:1
> bytes_acked:22
>    cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730
> snd_ssthresh:2147483647 
> rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715
> mss_cache:32768 
> ecn_flags:0 rate_delivered:1 rate_interval_us:925 packets_out:0 
> retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2 segs_out:6 
> data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:23
> bytes_acked:2
>    check_result:PASS:listen_sk 0 nsec
>    check_result:PASS:srv_sk 0 nsec
>    check_result:PASS:srv_tp 0 nsec
> 
> If bpf program is compiled with llvm16, the test passed according to
> a CI run.
> 
> I don't have s390x environment to debug this. Could you help debug
> it?
> 
> Thanks!
> 
>    [1] 
> https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645


Hi,

thank for letting me know.
I will look into this.

Best regards,
Ilya

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: selftest sock_fields failed on s390x with latest llvm17
  2023-05-03 19:46 ` Ilya Leoshkevich
@ 2023-05-12 14:40   ` Ilya Leoshkevich
  2023-05-13  4:13     ` Yonghong Song
  0 siblings, 1 reply; 10+ messages in thread
From: Ilya Leoshkevich @ 2023-05-12 14:40 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Kui-Feng Lee, Manu Bretelle, Alexei Starovoitov, Jakub Sitnicki

On Wed, 2023-05-03 at 21:46 +0200, Ilya Leoshkevich wrote:
> On Wed, 2023-05-03 at 12:35 -0700, Yonghong Song wrote:
> > Hi, Ilya,
> > 
> > BPF CI ([1]) detected a s390x failure when bpf program is compiled
> > with 
> > latest llvm17 on bpf-next branch. To reproduce the issue, just run 
> > normal './test_progs -j'. The failure log looks like below:
> > 
> > Notice: Success: 341/3015, Skipped: 29, Failed: 1
> > Error: #191 sock_fields
> >    Error: #191 sock_fields
> >    create_netns:PASS:create netns 0 nsec
> >    create_netns:PASS:bring up lo 0 nsec
> >    serial_test_sock_fields:PASS:test_sock_fields__open_and_load 0
> > nsec
> >   
> > serial_test_sock_fields:PASS:attach_cgroup(egress_read_sock_fields)
> > 0
> > nsec
> >   
> > serial_test_sock_fields:PASS:attach_cgroup(ingress_read_sock_fields
> > ) 
> > 0 nsec
> >    serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_port 0
> > nsec
> >    test:PASS:getsockname(listen_fd) 0 nsec
> >    test:PASS:getsockname(cli_fd) 0 nsec
> >    test:PASS:accept(listen_fd) 0 nsec
> >    init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt_fd) 0
> > nsec
> >    init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt10_fd) 0
> > nsec
> >    test:PASS:send(accept_fd) 0 nsec
> >    test:PASS:recv(cli_fd) 0 nsec
> >    test:PASS:send(accept_fd) 0 nsec
> >    test:PASS:recv(cli_fd) 0 nsec
> >    test:PASS:recv(accept_fd) for fin 0 nsec
> >    test:PASS:recv(cli_fd) for fin 0 nsec
> >    check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt, 
> > &accept_fd) 0 nsec
> >    check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt, 
> > &cli_fd) 0 nsec
> >    check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
> >    check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
> >    check_result:PASS:bpf_map_lookup_elem(linum_map_fd, 
> > READ_SK_DST_PORT_IDX) 0 nsec
> >    check_result:FAIL:failure in read_sk_dst_port on line unexpected
> > failure in read_sk_dst_port on line: actual 297 != expected 0
> >    listen_sk: state:10 bound_dev_if:0 family:10 type:1 protocol:6
> > mark:0 
> > priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1) 
> > src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::) dst_port:0
> >    srv_sk: state:9 bound_dev_if:0 family:10 type:1 protocol:6
> > mark:0 
> > priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1) 
> > src_port:51966 dst_ip4:7f000006(127.0.0.6) dst_ip6:0:0:0:1(::1) 
> > dst_port:38030
> >    cli_sk: state:5 bound_dev_if:0 family:10 type:1 protocol:6
> > mark:0 
> > priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1) 
> > src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1)
> > dst_port:51966
> >    listen_tp: snd_cwnd:10 srtt_us:0 rtt_min:4294967295 
> > snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0 mss_cache:536
> > ecn_flags:0 rate_delivered:0 rate_interval_us:0 packets_out:0 
> > retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0 segs_out:0 
> > data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:0
> > bytes_acked:0
> >    srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272
> > snd_ssthresh:2147483647 
> > rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810
> > mss_cache:32768 
> > ecn_flags:0 rate_delivered:1 rate_interval_us:272 packets_out:0 
> > retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0 segs_out:3 
> > data_segs_out:2 lost_out:0 sacked_out:0 bytes_received:1
> > bytes_acked:22
> >    cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730
> > snd_ssthresh:2147483647 
> > rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715
> > mss_cache:32768 
> > ecn_flags:0 rate_delivered:1 rate_interval_us:925 packets_out:0 
> > retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2 segs_out:6 
> > data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:23
> > bytes_acked:2
> >    check_result:PASS:listen_sk 0 nsec
> >    check_result:PASS:srv_sk 0 nsec
> >    check_result:PASS:srv_tp 0 nsec
> > 
> > If bpf program is compiled with llvm16, the test passed according
> > to
> > a CI run.
> > 
> > I don't have s390x environment to debug this. Could you help debug
> > it?
> > 
> > Thanks!
> > 
> >    [1] 
> > https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645
> 
> 
> Hi,
> 
> thank for letting me know.
> I will look into this.
> 
> Best regards,
> Ilya

In the meantime the issue was fixed by:

commit 141be5c062ecf22bd287afffd310e8ac4711444a
Author: Shoaib Meenai <smeenai@fb.com>
Date:   Fri May 5 14:18:12 2023 -0700

    Revert "Reland [Pipeline] Don't limit ArgumentPromotion to -O3"
    
    This reverts commit 6f29d1adf29820daae9ea7a01ae2588b67735b9e.
    
    https://reviews.llvm.org/D149768 is causing size regressions for -
Oz
    with FullLTO, and I'm reverting that one while investigating. This
    commit depends on that one, so it needs to be reverted as well.

But looking at the codegen differences:

$ diff -u <(sed -e s/[0-9]*://g pass.s) <(sed -e s/[0-9]*://g fail.s)

-pass.o:        file format elf64-bpf
+fail.o:        file format elf64-bpf

-00000000000002c8 <sk_dst_port__load_half>
-       69 11 00 30 00 00 00 00 r1 = *(u16 *)(r1 + 48)
+00000000000002c0 <sk_dst_port__load_half>
+       54 10 00 00 00 00 ff ff w1 &= 65535
        b4 00 00 00 00 00 00 01 w0 = 1
        16 10 00 01 00 00 ca fe if w1 == 51966 goto +1 <LBB6_2>
        b4 00 00 00 00 00 00 00 w0 = 0

This is what ArgumentPromotion is supposed to do, so that's okay so
far. However, further down below we have:

 Disassembly of section cgroup_skb/egress:

-       bf 16 00 00 00 00 00 00 r1 = r6
+       61 76 00 30 00 00 00 00 r7 = *(u32 *)(r6 + 48)
+       bc 17 00 00 00 00 00 00 w1 = w7
        85 01 00 00 00 00 00 53 call sk_dst_port__load_word

...

-       bf 16 00 00 00 00 00 00 r1 = r6
+       74 70 00 00 00 00 00 10 w7 >>= 16
+       bc 17 00 00 00 00 00 00 w1 = w7
        85 01 00 00 00 00 00 57 call sk_dst_port__load_half

so there is no 16-bit load anymore, instead, the result from the
earlier 32-bit load is reused. However, on BE these kinds of loads
for this particular field are not consistent at the moment - see [1]
and the previous discussions.

De-facto we have the following results:

- int load: 0x0000cafe
- short load: 0xcafe

On a consistent BE we should have rather had:

- int load: 0x0000cafe
- short load: 0

Clang, of course, expects a consistent BE and optimizes around that.

This was a conscious tradeoff Jakub and I have agreed on in order to
keep the quirky behavior from the past. Given what's happening with
Clang now, I wonder if it would be worth revisiting it in the name of
consistency?

[1]
https://lore.kernel.org/bpf/20220317113920.1068535-5-jakub@cloudflare.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: selftest sock_fields failed on s390x with latest llvm17
  2023-05-12 14:40   ` Ilya Leoshkevich
@ 2023-05-13  4:13     ` Yonghong Song
  2023-05-13  8:24       ` Ilya Leoshkevich
  0 siblings, 1 reply; 10+ messages in thread
From: Yonghong Song @ 2023-05-13  4:13 UTC (permalink / raw)
  To: Ilya Leoshkevich
  Cc: bpf, Kui-Feng Lee, Manu Bretelle, Alexei Starovoitov, Jakub Sitnicki



On 5/12/23 7:40 AM, Ilya Leoshkevich wrote:
> On Wed, 2023-05-03 at 21:46 +0200, Ilya Leoshkevich wrote:
>> On Wed, 2023-05-03 at 12:35 -0700, Yonghong Song wrote:
>>> Hi, Ilya,
>>>
>>> BPF CI ([1]) detected a s390x failure when bpf program is compiled
>>> with
>>> latest llvm17 on bpf-next branch. To reproduce the issue, just run
>>> normal './test_progs -j'. The failure log looks like below:
>>>
>>> Notice: Success: 341/3015, Skipped: 29, Failed: 1
>>> Error: #191 sock_fields
>>>     Error: #191 sock_fields
>>>     create_netns:PASS:create netns 0 nsec
>>>     create_netns:PASS:bring up lo 0 nsec
>>>     serial_test_sock_fields:PASS:test_sock_fields__open_and_load 0
>>> nsec
>>>    
>>> serial_test_sock_fields:PASS:attach_cgroup(egress_read_sock_fields)
>>> 0
>>> nsec
>>>    
>>> serial_test_sock_fields:PASS:attach_cgroup(ingress_read_sock_fields
>>> )
>>> 0 nsec
>>>     serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_port 0
>>> nsec
>>>     test:PASS:getsockname(listen_fd) 0 nsec
>>>     test:PASS:getsockname(cli_fd) 0 nsec
>>>     test:PASS:accept(listen_fd) 0 nsec
>>>     init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt_fd) 0
>>> nsec
>>>     init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt10_fd) 0
>>> nsec
>>>     test:PASS:send(accept_fd) 0 nsec
>>>     test:PASS:recv(cli_fd) 0 nsec
>>>     test:PASS:send(accept_fd) 0 nsec
>>>     test:PASS:recv(cli_fd) 0 nsec
>>>     test:PASS:recv(accept_fd) for fin 0 nsec
>>>     test:PASS:recv(cli_fd) for fin 0 nsec
>>>     check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt,
>>> &accept_fd) 0 nsec
>>>     check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt,
>>> &cli_fd) 0 nsec
>>>     check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
>>>     check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
>>>     check_result:PASS:bpf_map_lookup_elem(linum_map_fd,
>>> READ_SK_DST_PORT_IDX) 0 nsec
>>>     check_result:FAIL:failure in read_sk_dst_port on line unexpected
>>> failure in read_sk_dst_port on line: actual 297 != expected 0
>>>     listen_sk: state:10 bound_dev_if:0 family:10 type:1 protocol:6
>>> mark:0
>>> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
>>> src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::) dst_port:0
>>>     srv_sk: state:9 bound_dev_if:0 family:10 type:1 protocol:6
>>> mark:0
>>> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
>>> src_port:51966 dst_ip4:7f000006(127.0.0.6) dst_ip6:0:0:0:1(::1)
>>> dst_port:38030
>>>     cli_sk: state:5 bound_dev_if:0 family:10 type:1 protocol:6
>>> mark:0
>>> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
>>> src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1)
>>> dst_port:51966
>>>     listen_tp: snd_cwnd:10 srtt_us:0 rtt_min:4294967295
>>> snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0 mss_cache:536
>>> ecn_flags:0 rate_delivered:0 rate_interval_us:0 packets_out:0
>>> retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0 segs_out:0
>>> data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:0
>>> bytes_acked:0
>>>     srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272
>>> snd_ssthresh:2147483647
>>> rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810
>>> mss_cache:32768
>>> ecn_flags:0 rate_delivered:1 rate_interval_us:272 packets_out:0
>>> retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0 segs_out:3
>>> data_segs_out:2 lost_out:0 sacked_out:0 bytes_received:1
>>> bytes_acked:22
>>>     cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730
>>> snd_ssthresh:2147483647
>>> rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715
>>> mss_cache:32768
>>> ecn_flags:0 rate_delivered:1 rate_interval_us:925 packets_out:0
>>> retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2 segs_out:6
>>> data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:23
>>> bytes_acked:2
>>>     check_result:PASS:listen_sk 0 nsec
>>>     check_result:PASS:srv_sk 0 nsec
>>>     check_result:PASS:srv_tp 0 nsec
>>>
>>> If bpf program is compiled with llvm16, the test passed according
>>> to
>>> a CI run.
>>>
>>> I don't have s390x environment to debug this. Could you help debug
>>> it?
>>>
>>> Thanks!
>>>
>>>     [1]
>>> https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645
>>
>>
>> Hi,
>>
>> thank for letting me know.
>> I will look into this.
>>
>> Best regards,
>> Ilya
> 
> In the meantime the issue was fixed by:
> 
> commit 141be5c062ecf22bd287afffd310e8ac4711444a
> Author: Shoaib Meenai <smeenai@fb.com>
> Date:   Fri May 5 14:18:12 2023 -0700
> 
>      Revert "Reland [Pipeline] Don't limit ArgumentPromotion to -O3"
>      
>      This reverts commit 6f29d1adf29820daae9ea7a01ae2588b67735b9e.
>      
>      https://reviews.llvm.org/D149768  is causing size regressions for -
> Oz
>      with FullLTO, and I'm reverting that one while investigating. This
>      commit depends on that one, so it needs to be reverted as well.

The transformtion "Don't limit ArgumentPromotion to -O3" is temporarily 
reverted. But it could be reverted again once the issue is resolved.
So it is a good idea to resolve the issue in the kernel.

> 
> But looking at the codegen differences:
> 
> $ diff -u <(sed -e s/[0-9]*://g pass.s) <(sed -e s/[0-9]*://g fail.s)
> 
> -pass.o:        file format elf64-bpf
> +fail.o:        file format elf64-bpf
> 
> -00000000000002c8 <sk_dst_port__load_half>
> -       69 11 00 30 00 00 00 00 r1 = *(u16 *)(r1 + 48)
> +00000000000002c0 <sk_dst_port__load_half>
> +       54 10 00 00 00 00 ff ff w1 &= 65535
>          b4 00 00 00 00 00 00 01 w0 = 1
>          16 10 00 01 00 00 ca fe if w1 == 51966 goto +1 <LBB6_2>
>          b4 00 00 00 00 00 00 00 w0 = 0
> 
> This is what ArgumentPromotion is supposed to do, so that's okay so
> far. However, further down below we have:
> 
>   Disassembly of section cgroup_skb/egress:
> 
> -       bf 16 00 00 00 00 00 00 r1 = r6
> +       61 76 00 30 00 00 00 00 r7 = *(u32 *)(r6 + 48)
> +       bc 17 00 00 00 00 00 00 w1 = w7
>          85 01 00 00 00 00 00 53 call sk_dst_port__load_word
> 
> ...
> 
> -       bf 16 00 00 00 00 00 00 r1 = r6
> +       74 70 00 00 00 00 00 10 w7 >>= 16
> +       bc 17 00 00 00 00 00 00 w1 = w7
>          85 01 00 00 00 00 00 57 call sk_dst_port__load_half
> 
> so there is no 16-bit load anymore, instead, the result from the
> earlier 32-bit load is reused. However, on BE these kinds of loads
> for this particular field are not consistent at the moment - see [1]
> and the previous discussions.
> 
> De-facto we have the following results:
> 
> - int load: 0x0000cafe
> - short load: 0xcafe

So 'De-facto' means the above is the expected result.

> 
> On a consistent BE we should have rather had:
> 
> - int load: 0x0000cafe
> - short load: 0

What does 'consistent BE' mean here? Does it mean the expected
result from the source code itself?

> 
> Clang, of course, expects a consistent BE and optimizes around that.
> 
> This was a conscious tradeoff Jakub and I have agreed on in order to
> keep the quirky behavior from the past. Given what's happening with
> Clang now, I wonder if it would be worth revisiting it in the name of
> consistency?

If I understand correctly, I think the issue is
     r7 = *(u32 *)(r6 + 48)
     w7 >= 16
     w1 = w7

after verifier, it is changed to
    r7 = *(u16 *)(r6 + <kernel offset>)
    w7 >= 16
    w1 = w7

and the result after verifier rewrite is completely wrong.
Is it right?

If this is the case, during verifier rewrite, if it is
big endian, say the user intends to load 4 bytes (from uapi header)
while the kernel field is 2 bytes, in such cases, kernel
can still pretend to generate 4-byte load. For example,
for the above example, the code after verification could be:
    r7 = *(u16 *)(r6 + <kernel offset>)
    r7 <= 16
    w7 >= 16
    w1 = w7

Hopefully, we won't have many such cases. Does this work?

> 
> [1]
> https://lore.kernel.org/bpf/20220317113920.1068535-5-jakub@cloudflare.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: selftest sock_fields failed on s390x with latest llvm17
  2023-05-13  4:13     ` Yonghong Song
@ 2023-05-13  8:24       ` Ilya Leoshkevich
  2023-05-14 16:58         ` Yonghong Song
  0 siblings, 1 reply; 10+ messages in thread
From: Ilya Leoshkevich @ 2023-05-13  8:24 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Kui-Feng Lee, Manu Bretelle, Alexei Starovoitov, Jakub Sitnicki

On Fri, 2023-05-12 at 21:13 -0700, Yonghong Song wrote:
> 
> 
> On 5/12/23 7:40 AM, Ilya Leoshkevich wrote:
> > On Wed, 2023-05-03 at 21:46 +0200, Ilya Leoshkevich wrote:
> > > On Wed, 2023-05-03 at 12:35 -0700, Yonghong Song wrote:
> > > > Hi, Ilya,
> > > > 
> > > > BPF CI ([1]) detected a s390x failure when bpf program is
> > > > compiled
> > > > with
> > > > latest llvm17 on bpf-next branch. To reproduce the issue, just
> > > > run
> > > > normal './test_progs -j'. The failure log looks like below:
> > > > 
> > > > Notice: Success: 341/3015, Skipped: 29, Failed: 1
> > > > Error: #191 sock_fields
> > > >     Error: #191 sock_fields
> > > >     create_netns:PASS:create netns 0 nsec
> > > >     create_netns:PASS:bring up lo 0 nsec
> > > >    
> > > > serial_test_sock_fields:PASS:test_sock_fields__open_and_load 0
> > > > nsec
> > > >    
> > > > serial_test_sock_fields:PASS:attach_cgroup(egress_read_sock_fie
> > > > lds)
> > > > 0
> > > > nsec
> > > >    
> > > > serial_test_sock_fields:PASS:attach_cgroup(ingress_read_sock_fi
> > > > elds
> > > > )
> > > > 0 nsec
> > > >     serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_port
> > > > 0
> > > > nsec
> > > >     test:PASS:getsockname(listen_fd) 0 nsec
> > > >     test:PASS:getsockname(cli_fd) 0 nsec
> > > >     test:PASS:accept(listen_fd) 0 nsec
> > > >     init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt_fd)
> > > > 0
> > > > nsec
> > > >    
> > > > init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt10_fd) 0
> > > > nsec
> > > >     test:PASS:send(accept_fd) 0 nsec
> > > >     test:PASS:recv(cli_fd) 0 nsec
> > > >     test:PASS:send(accept_fd) 0 nsec
> > > >     test:PASS:recv(cli_fd) 0 nsec
> > > >     test:PASS:recv(accept_fd) for fin 0 nsec
> > > >     test:PASS:recv(cli_fd) for fin 0 nsec
> > > >    
> > > > check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt,
> > > > &accept_fd) 0 nsec
> > > >    
> > > > check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt,
> > > > &cli_fd) 0 nsec
> > > >     check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
> > > >     check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
> > > >     check_result:PASS:bpf_map_lookup_elem(linum_map_fd,
> > > > READ_SK_DST_PORT_IDX) 0 nsec
> > > >     check_result:FAIL:failure in read_sk_dst_port on line
> > > > unexpected
> > > > failure in read_sk_dst_port on line: actual 297 != expected 0
> > > >     listen_sk: state:10 bound_dev_if:0 family:10 type:1
> > > > protocol:6
> > > > mark:0
> > > > priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
> > > > src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::)
> > > > dst_port:0
> > > >     srv_sk: state:9 bound_dev_if:0 family:10 type:1 protocol:6
> > > > mark:0
> > > > priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
> > > > src_port:51966 dst_ip4:7f000006(127.0.0.6) dst_ip6:0:0:0:1(::1)
> > > > dst_port:38030
> > > >     cli_sk: state:5 bound_dev_if:0 family:10 type:1 protocol:6
> > > > mark:0
> > > > priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
> > > > src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1)
> > > > dst_port:51966
> > > >     listen_tp: snd_cwnd:10 srtt_us:0 rtt_min:4294967295
> > > > snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0
> > > > mss_cache:536
> > > > ecn_flags:0 rate_delivered:0 rate_interval_us:0 packets_out:0
> > > > retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0
> > > > segs_out:0
> > > > data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:0
> > > > bytes_acked:0
> > > >     srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272
> > > > snd_ssthresh:2147483647
> > > > rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810
> > > > mss_cache:32768
> > > > ecn_flags:0 rate_delivered:1 rate_interval_us:272 packets_out:0
> > > > retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0
> > > > segs_out:3
> > > > data_segs_out:2 lost_out:0 sacked_out:0 bytes_received:1
> > > > bytes_acked:22
> > > >     cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730
> > > > snd_ssthresh:2147483647
> > > > rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715
> > > > mss_cache:32768
> > > > ecn_flags:0 rate_delivered:1 rate_interval_us:925 packets_out:0
> > > > retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2
> > > > segs_out:6
> > > > data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:23
> > > > bytes_acked:2
> > > >     check_result:PASS:listen_sk 0 nsec
> > > >     check_result:PASS:srv_sk 0 nsec
> > > >     check_result:PASS:srv_tp 0 nsec
> > > > 
> > > > If bpf program is compiled with llvm16, the test passed
> > > > according
> > > > to
> > > > a CI run.
> > > > 
> > > > I don't have s390x environment to debug this. Could you help
> > > > debug
> > > > it?
> > > > 
> > > > Thanks!
> > > > 
> > > >     [1]
> > > > https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645
> > > 
> > > 
> > > Hi,
> > > 
> > > thank for letting me know.
> > > I will look into this.
> > > 
> > > Best regards,
> > > Ilya
> > 
> > In the meantime the issue was fixed by:
> > 
> > commit 141be5c062ecf22bd287afffd310e8ac4711444a
> > Author: Shoaib Meenai <smeenai@fb.com>
> > Date:   Fri May 5 14:18:12 2023 -0700
> > 
> >      Revert "Reland [Pipeline] Don't limit ArgumentPromotion to -
> > O3"
> >      
> >      This reverts commit 6f29d1adf29820daae9ea7a01ae2588b67735b9e.
> >      
> >      https://reviews.llvm.org/D149768  is causing size regressions
> > for -
> > Oz
> >      with FullLTO, and I'm reverting that one while investigating.
> > This
> >      commit depends on that one, so it needs to be reverted as
> > well.
> 
> The transformtion "Don't limit ArgumentPromotion to -O3" is
> temporarily 
> reverted. But it could be reverted again once the issue is resolved.
> So it is a good idea to resolve the issue in the kernel.
> 
> > 
> > But looking at the codegen differences:
> > 
> > $ diff -u <(sed -e s/[0-9]*://g pass.s) <(sed -e s/[0-9]*://g
> > fail.s)
> > 
> > -pass.o:        file format elf64-bpf
> > +fail.o:        file format elf64-bpf
> > 
> > -00000000000002c8 <sk_dst_port__load_half>
> > -       69 11 00 30 00 00 00 00 r1 = *(u16 *)(r1 + 48)
> > +00000000000002c0 <sk_dst_port__load_half>
> > +       54 10 00 00 00 00 ff ff w1 &= 65535
> >          b4 00 00 00 00 00 00 01 w0 = 1
> >          16 10 00 01 00 00 ca fe if w1 == 51966 goto +1 <LBB6_2>
> >          b4 00 00 00 00 00 00 00 w0 = 0
> > 
> > This is what ArgumentPromotion is supposed to do, so that's okay so
> > far. However, further down below we have:
> > 
> >   Disassembly of section cgroup_skb/egress:
> > 
> > -       bf 16 00 00 00 00 00 00 r1 = r6
> > +       61 76 00 30 00 00 00 00 r7 = *(u32 *)(r6 + 48)
> > +       bc 17 00 00 00 00 00 00 w1 = w7
> >          85 01 00 00 00 00 00 53 call sk_dst_port__load_word
> > 
> > ...
> > 
> > -       bf 16 00 00 00 00 00 00 r1 = r6
> > +       74 70 00 00 00 00 00 10 w7 >>= 16
> > +       bc 17 00 00 00 00 00 00 w1 = w7
> >          85 01 00 00 00 00 00 57 call sk_dst_port__load_half
> > 
> > so there is no 16-bit load anymore, instead, the result from the
> > earlier 32-bit load is reused. However, on BE these kinds of loads
> > for this particular field are not consistent at the moment - see
> > [1]
> > and the previous discussions.
> > 
> > De-facto we have the following results:
> > 
> > - int load: 0x0000cafe
> > - short load: 0xcafe
> 
> So 'De-facto' means the above is the expected result.
> 
> > 
> > On a consistent BE we should have rather had:
> > 
> > - int load: 0x0000cafe
> > - short load: 0
> 
> What does 'consistent BE' mean here? Does it mean the expected
> result from the source code itself?

I should not have called the de-facto example "BE" at all: it's rather
"mixed endianness" or "weird endianness" or something along these
lines.

On "consistent BE" or simply "BE" properties like

*(uint32_t *)p = (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2);

hold. This is currently not the case for bpf_sock.dst_port.

We compile with -mbig-endian, so we promise to the compiler that the
machine is big-endian, and the compiler expects the above to hold for
any p. Unfortunately when p points to bpf_sock.dst_port, this is not
the case.

The property above is important for the correctness of the load/store
tearing transformations. What we have here is not exactly tearing, but
is quite close.

> > Clang, of course, expects a consistent BE and optimizes around
> > that.
> > 
> > This was a conscious tradeoff Jakub and I have agreed on in order
> > to
> > keep the quirky behavior from the past. Given what's happening with
> > Clang now, I wonder if it would be worth revisiting it in the name
> > of
> > consistency?
> 
> If I understand correctly, I think the issue is
>      r7 = *(u32 *)(r6 + 48)
>      w7 >= 16
>      w1 = w7
> 
> after verifier, it is changed to
>     r7 = *(u16 *)(r6 + <kernel offset>)
>     w7 >= 16
>     w1 = w7
> 
> and the result after verifier rewrite is completely wrong.
> Is it right?

No, the verifier rewrite is correct.
The sk_dst_port__load_word() part of the test succeeds.

All these rewrites already work fine, they are correct and consistent.
It's really just bpf_sock.dst_port that is special.

> If this is the case, during verifier rewrite, if it is
> big endian, say the user intends to load 4 bytes (from uapi header)
> while the kernel field is 2 bytes, in such cases, kernel
> can still pretend to generate 4-byte load. For example,
> for the above example, the code after verification could be:
>     r7 = *(u16 *)(r6 + <kernel offset>)
>     r7 <= 16
>     w7 >= 16
>     w1 = w7
> 
> Hopefully, we won't have many such cases. Does this work?

This would break the sk_dst_port__load_word() part of the test.



Above I asked whether we can resolve the inconsistency, but I thought
about it and I don't see a way of doing it without breaking the ABI,
which is at worst unacceptable, and at best a last resort measure.

What do you think about marking bpf_sock.dst_port volatile? volatile
should prevent tearing and similar optimizations, with which we have a
problem here.

We could also add a comment warning users not to cast away this
volatile due to the quirk we have. And then we should adjust the test
(making all casts volatile) to comply with this new warning.

> > [1]
> > https://lore.kernel.org/bpf/20220317113920.1068535-5-jakub@cloudflare.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: selftest sock_fields failed on s390x with latest llvm17
  2023-05-13  8:24       ` Ilya Leoshkevich
@ 2023-05-14 16:58         ` Yonghong Song
  2023-05-15  7:55           ` Ilya Leoshkevich
  0 siblings, 1 reply; 10+ messages in thread
From: Yonghong Song @ 2023-05-14 16:58 UTC (permalink / raw)
  To: Ilya Leoshkevich
  Cc: bpf, Kui-Feng Lee, Manu Bretelle, Alexei Starovoitov, Jakub Sitnicki



On 5/13/23 1:24 AM, Ilya Leoshkevich wrote:
> On Fri, 2023-05-12 at 21:13 -0700, Yonghong Song wrote:
>>
>>
>> On 5/12/23 7:40 AM, Ilya Leoshkevich wrote:
>>> On Wed, 2023-05-03 at 21:46 +0200, Ilya Leoshkevich wrote:
>>>> On Wed, 2023-05-03 at 12:35 -0700, Yonghong Song wrote:
>>>>> Hi, Ilya,
>>>>>
>>>>> BPF CI ([1]) detected a s390x failure when bpf program is
>>>>> compiled
>>>>> with
>>>>> latest llvm17 on bpf-next branch. To reproduce the issue, just
>>>>> run
>>>>> normal './test_progs -j'. The failure log looks like below:
>>>>>
>>>>> Notice: Success: 341/3015, Skipped: 29, Failed: 1
>>>>> Error: #191 sock_fields
>>>>>      Error: #191 sock_fields
>>>>>      create_netns:PASS:create netns 0 nsec
>>>>>      create_netns:PASS:bring up lo 0 nsec
>>>>>     
>>>>> serial_test_sock_fields:PASS:test_sock_fields__open_and_load 0
>>>>> nsec
>>>>>     
>>>>> serial_test_sock_fields:PASS:attach_cgroup(egress_read_sock_fie
>>>>> lds)
>>>>> 0
>>>>> nsec
>>>>>     
>>>>> serial_test_sock_fields:PASS:attach_cgroup(ingress_read_sock_fi
>>>>> elds
>>>>> )
>>>>> 0 nsec
>>>>>      serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_port
>>>>> 0
>>>>> nsec
>>>>>      test:PASS:getsockname(listen_fd) 0 nsec
>>>>>      test:PASS:getsockname(cli_fd) 0 nsec
>>>>>      test:PASS:accept(listen_fd) 0 nsec
>>>>>      init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt_fd)
>>>>> 0
>>>>> nsec
>>>>>     
>>>>> init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt10_fd) 0
>>>>> nsec
>>>>>      test:PASS:send(accept_fd) 0 nsec
>>>>>      test:PASS:recv(cli_fd) 0 nsec
>>>>>      test:PASS:send(accept_fd) 0 nsec
>>>>>      test:PASS:recv(cli_fd) 0 nsec
>>>>>      test:PASS:recv(accept_fd) for fin 0 nsec
>>>>>      test:PASS:recv(cli_fd) for fin 0 nsec
>>>>>     
>>>>> check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt,
>>>>> &accept_fd) 0 nsec
>>>>>     
>>>>> check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt,
>>>>> &cli_fd) 0 nsec
>>>>>      check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
>>>>>      check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
>>>>>      check_result:PASS:bpf_map_lookup_elem(linum_map_fd,
>>>>> READ_SK_DST_PORT_IDX) 0 nsec
>>>>>      check_result:FAIL:failure in read_sk_dst_port on line
>>>>> unexpected
>>>>> failure in read_sk_dst_port on line: actual 297 != expected 0
>>>>>      listen_sk: state:10 bound_dev_if:0 family:10 type:1
>>>>> protocol:6
>>>>> mark:0
>>>>> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
>>>>> src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::)
>>>>> dst_port:0
>>>>>      srv_sk: state:9 bound_dev_if:0 family:10 type:1 protocol:6
>>>>> mark:0
>>>>> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
>>>>> src_port:51966 dst_ip4:7f000006(127.0.0.6) dst_ip6:0:0:0:1(::1)
>>>>> dst_port:38030
>>>>>      cli_sk: state:5 bound_dev_if:0 family:10 type:1 protocol:6
>>>>> mark:0
>>>>> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
>>>>> src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1)
>>>>> dst_port:51966
>>>>>      listen_tp: snd_cwnd:10 srtt_us:0 rtt_min:4294967295
>>>>> snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0
>>>>> mss_cache:536
>>>>> ecn_flags:0 rate_delivered:0 rate_interval_us:0 packets_out:0
>>>>> retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0
>>>>> segs_out:0
>>>>> data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:0
>>>>> bytes_acked:0
>>>>>      srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272
>>>>> snd_ssthresh:2147483647
>>>>> rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810
>>>>> mss_cache:32768
>>>>> ecn_flags:0 rate_delivered:1 rate_interval_us:272 packets_out:0
>>>>> retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0
>>>>> segs_out:3
>>>>> data_segs_out:2 lost_out:0 sacked_out:0 bytes_received:1
>>>>> bytes_acked:22
>>>>>      cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730
>>>>> snd_ssthresh:2147483647
>>>>> rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715
>>>>> mss_cache:32768
>>>>> ecn_flags:0 rate_delivered:1 rate_interval_us:925 packets_out:0
>>>>> retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2
>>>>> segs_out:6
>>>>> data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:23
>>>>> bytes_acked:2
>>>>>      check_result:PASS:listen_sk 0 nsec
>>>>>      check_result:PASS:srv_sk 0 nsec
>>>>>      check_result:PASS:srv_tp 0 nsec
>>>>>
>>>>> If bpf program is compiled with llvm16, the test passed
>>>>> according
>>>>> to
>>>>> a CI run.
>>>>>
>>>>> I don't have s390x environment to debug this. Could you help
>>>>> debug
>>>>> it?
>>>>>
>>>>> Thanks!
>>>>>
>>>>>      [1]
>>>>> https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645
>>>>
>>>>
>>>> Hi,
>>>>
>>>> thank for letting me know.
>>>> I will look into this.
>>>>
>>>> Best regards,
>>>> Ilya
>>>
>>> In the meantime the issue was fixed by:
>>>
>>> commit 141be5c062ecf22bd287afffd310e8ac4711444a
>>> Author: Shoaib Meenai <smeenai@fb.com>
>>> Date:   Fri May 5 14:18:12 2023 -0700
>>>
>>>       Revert "Reland [Pipeline] Don't limit ArgumentPromotion to -
>>> O3"
>>>       
>>>       This reverts commit 6f29d1adf29820daae9ea7a01ae2588b67735b9e.
>>>       
>>>       https://reviews.llvm.org/D149768   is causing size regressions
>>> for -
>>> Oz
>>>       with FullLTO, and I'm reverting that one while investigating.
>>> This
>>>       commit depends on that one, so it needs to be reverted as
>>> well.
>>
>> The transformtion "Don't limit ArgumentPromotion to -O3" is
>> temporarily
>> reverted. But it could be reverted again once the issue is resolved.
>> So it is a good idea to resolve the issue in the kernel.
>>
>>>
>>> But looking at the codegen differences:
>>>
>>> $ diff -u <(sed -e s/[0-9]*://g pass.s) <(sed -e s/[0-9]*://g
>>> fail.s)
>>>
>>> -pass.o:        file format elf64-bpf
>>> +fail.o:        file format elf64-bpf
>>>
>>> -00000000000002c8 <sk_dst_port__load_half>
>>> -       69 11 00 30 00 00 00 00 r1 = *(u16 *)(r1 + 48)
>>> +00000000000002c0 <sk_dst_port__load_half>
>>> +       54 10 00 00 00 00 ff ff w1 &= 65535
>>>           b4 00 00 00 00 00 00 01 w0 = 1
>>>           16 10 00 01 00 00 ca fe if w1 == 51966 goto +1 <LBB6_2>
>>>           b4 00 00 00 00 00 00 00 w0 = 0
>>>
>>> This is what ArgumentPromotion is supposed to do, so that's okay so
>>> far. However, further down below we have:
>>>
>>>    Disassembly of section cgroup_skb/egress:
>>>
>>> -       bf 16 00 00 00 00 00 00 r1 = r6
>>> +       61 76 00 30 00 00 00 00 r7 = *(u32 *)(r6 + 48)
>>> +       bc 17 00 00 00 00 00 00 w1 = w7
>>>           85 01 00 00 00 00 00 53 call sk_dst_port__load_word
>>>
>>> ...
>>>
>>> -       bf 16 00 00 00 00 00 00 r1 = r6
>>> +       74 70 00 00 00 00 00 10 w7 >>= 16
>>> +       bc 17 00 00 00 00 00 00 w1 = w7
>>>           85 01 00 00 00 00 00 57 call sk_dst_port__load_half
>>>
>>> so there is no 16-bit load anymore, instead, the result from the
>>> earlier 32-bit load is reused. However, on BE these kinds of loads
>>> for this particular field are not consistent at the moment - see
>>> [1]
>>> and the previous discussions.
>>>
>>> De-facto we have the following results:
>>>
>>> - int load: 0x0000cafe
>>> - short load: 0xcafe
>>
>> So 'De-facto' means the above is the expected result.
>>
>>>
>>> On a consistent BE we should have rather had:
>>>
>>> - int load: 0x0000cafe
>>> - short load: 0
>>
>> What does 'consistent BE' mean here? Does it mean the expected
>> result from the source code itself?
> 
> I should not have called the de-facto example "BE" at all: it's rather
> "mixed endianness" or "weird endianness" or something along these
> lines.
> 
> On "consistent BE" or simply "BE" properties like
> 
> *(uint32_t *)p = (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2);
> 
> hold. This is currently not the case for bpf_sock.dst_port.
> 
> We compile with -mbig-endian, so we promise to the compiler that the
> machine is big-endian, and the compiler expects the above to hold for
> any p. Unfortunately when p points to bpf_sock.dst_port, this is not
> the case.

If I understand correctly, *(uint32_t *)p to get the bpf_sock.dst_port
is for backward compatibility. But the real u32 read by compiler will
do (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2) which is not the
same as expected *(uint32_t *)p so we have problem here.

> 
> The property above is important for the correctness of the load/store
> tearing transformations. What we have here is not exactly tearing, but
> is quite close.
> 
>>> Clang, of course, expects a consistent BE and optimizes around
>>> that.
>>>
>>> This was a conscious tradeoff Jakub and I have agreed on in order
>>> to
>>> keep the quirky behavior from the past. Given what's happening with
>>> Clang now, I wonder if it would be worth revisiting it in the name
>>> of
>>> consistency?
>>
>> If I understand correctly, I think the issue is
>>       r7 = *(u32 *)(r6 + 48)
>>       w7 >= 16
>>       w1 = w7
>>
>> after verifier, it is changed to
>>      r7 = *(u16 *)(r6 + <kernel offset>)
>>      w7 >= 16
>>      w1 = w7
>>
>> and the result after verifier rewrite is completely wrong.
>> Is it right?
> 
> No, the verifier rewrite is correct.
> The sk_dst_port__load_word() part of the test succeeds.
> 
> All these rewrites already work fine, they are correct and consistent.
> It's really just bpf_sock.dst_port that is special.
> 
>> If this is the case, during verifier rewrite, if it is
>> big endian, say the user intends to load 4 bytes (from uapi header)
>> while the kernel field is 2 bytes, in such cases, kernel
>> can still pretend to generate 4-byte load. For example,
>> for the above example, the code after verification could be:
>>      r7 = *(u16 *)(r6 + <kernel offset>)
>>      r7 <= 16
>>      w7 >= 16
>>      w1 = w7
>>
>> Hopefully, we won't have many such cases. Does this work?
> 
> This would break the sk_dst_port__load_word() part of the test.

This is a hack. This may work for this specific u16 case, but
yes, it won't work for u32 load case.

> 
> 
> 
> Above I asked whether we can resolve the inconsistency, but I thought
> about it and I don't see a way of doing it without breaking the ABI,
> which is at worst unacceptable, and at best a last resort measure.
> 
> What do you think about marking bpf_sock.dst_port volatile? volatile
> should prevent tearing and similar optimizations, with which we have a
> problem here.
> 
> We could also add a comment warning users not to cast away this
> volatile due to the quirk we have. And then we should adjust the test
> (making all casts volatile) to comply with this new warning.

I did a little study on this. The main problem here for
static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk)
{
         __u16 *half = (__u16 *)&sk->dst_port;
         return half[0] == bpf_htons(0xcafe);
}

Through some cross-function optimization by ArgumentPromotion
optimization, the compiler does:
    /* the below shared by sk_dst_port__load_word
     * and sk_dst_port__load_half
     */
    __u32 *word = (__u32 *)&sk->dst_port;
    __u32 word_val = word[0];

    /* the below is for sk_dst_port__load_half only */
    __u16 half_val = word_val >> 16;

    ... half_val passed into sk_dst_port__load_half ...
    return half_val == bpf_htons(0xcafe);

Here, 'word_val = word[0]' is replaced by 2-byte load
by the verifier and this caused the trouble for later
sk_dst_port__load_half().

I don't have a good solution here. The issue is exposed
as we have both u16 and u32 load for &sk->dst_port and
the compiler did some optimization with this.

I would say this is an extreme corner case and we can just
fix in the source code like below:

diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c 
b/tools/testing/selftests/bpf/progs/test_sock_fields.c
index bbad3c2d9aa5..39c975786720 100644
--- a/tools/testing/selftests/bpf/progs/test_sock_fields.c
+++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c
@@ -265,7 +265,10 @@ static __noinline bool 
sk_dst_port__load_word(struct bpf_sock *sk)

  static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk)
  {
-       __u16 *half = (__u16 *)&sk->dst_port;
+       __u16 *half;
+
+       asm volatile ("");
+       half  = (__u16 *)&sk->dst_port;
         return half[0] == bpf_htons(0xcafe);
  }

Could you try whether the above workaround works or not?
If we want the code to be future proof for potential
cross-func optimization for these noinline functions, we
can add similar asm codes to all of
bool sk_dst_port__load_{word, half, byte}.


> 
>>> [1]
>>> https://lore.kernel.org/bpf/20220317113920.1068535-5-jakub@cloudflare.com

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: selftest sock_fields failed on s390x with latest llvm17
  2023-05-14 16:58         ` Yonghong Song
@ 2023-05-15  7:55           ` Ilya Leoshkevich
  2023-05-15 15:27             ` Yonghong Song
  0 siblings, 1 reply; 10+ messages in thread
From: Ilya Leoshkevich @ 2023-05-15  7:55 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Kui-Feng Lee, Manu Bretelle, Alexei Starovoitov, Jakub Sitnicki

On Sun, 2023-05-14 at 09:58 -0700, Yonghong Song wrote:
> 
> 
> On 5/13/23 1:24 AM, Ilya Leoshkevich wrote:
> > On Fri, 2023-05-12 at 21:13 -0700, Yonghong Song wrote:
> > > 
> > > 
> > > On 5/12/23 7:40 AM, Ilya Leoshkevich wrote:
> > > > On Wed, 2023-05-03 at 21:46 +0200, Ilya Leoshkevich wrote:
> > > > > On Wed, 2023-05-03 at 12:35 -0700, Yonghong Song wrote:
> > > > > > Hi, Ilya,
> > > > > > 
> > > > > > BPF CI ([1]) detected a s390x failure when bpf program is
> > > > > > compiled
> > > > > > with
> > > > > > latest llvm17 on bpf-next branch. To reproduce the issue,
> > > > > > just
> > > > > > run
> > > > > > normal './test_progs -j'. The failure log looks like below:
> > > > > > 
> > > > > > Notice: Success: 341/3015, Skipped: 29, Failed: 1
> > > > > > Error: #191 sock_fields
> > > > > >      Error: #191 sock_fields
> > > > > >      create_netns:PASS:create netns 0 nsec
> > > > > >      create_netns:PASS:bring up lo 0 nsec
> > > > > >     
> > > > > > serial_test_sock_fields:PASS:test_sock_fields__open_and_loa
> > > > > > d 0
> > > > > > nsec
> > > > > >     
> > > > > > serial_test_sock_fields:PASS:attach_cgroup(egress_read_sock
> > > > > > _fie
> > > > > > lds)
> > > > > > 0
> > > > > > nsec
> > > > > >     
> > > > > > serial_test_sock_fields:PASS:attach_cgroup(ingress_read_soc
> > > > > > k_fi
> > > > > > elds
> > > > > > )
> > > > > > 0 nsec
> > > > > >     
> > > > > > serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_port
> > > > > > 0
> > > > > > nsec
> > > > > >      test:PASS:getsockname(listen_fd) 0 nsec
> > > > > >      test:PASS:getsockname(cli_fd) 0 nsec
> > > > > >      test:PASS:accept(listen_fd) 0 nsec
> > > > > >     
> > > > > > init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt_fd)
> > > > > > 0
> > > > > > nsec
> > > > > >     
> > > > > > init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt10_f
> > > > > > d) 0
> > > > > > nsec
> > > > > >      test:PASS:send(accept_fd) 0 nsec
> > > > > >      test:PASS:recv(cli_fd) 0 nsec
> > > > > >      test:PASS:send(accept_fd) 0 nsec
> > > > > >      test:PASS:recv(cli_fd) 0 nsec
> > > > > >      test:PASS:recv(accept_fd) for fin 0 nsec
> > > > > >      test:PASS:recv(cli_fd) for fin 0 nsec
> > > > > >     
> > > > > > check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cn
> > > > > > t,
> > > > > > &accept_fd) 0 nsec
> > > > > >     
> > > > > > check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cn
> > > > > > t,
> > > > > > &cli_fd) 0 nsec
> > > > > >      check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0
> > > > > > nsec
> > > > > >      check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0
> > > > > > nsec
> > > > > >      check_result:PASS:bpf_map_lookup_elem(linum_map_fd,
> > > > > > READ_SK_DST_PORT_IDX) 0 nsec
> > > > > >      check_result:FAIL:failure in read_sk_dst_port on line
> > > > > > unexpected
> > > > > > failure in read_sk_dst_port on line: actual 297 != expected
> > > > > > 0
> > > > > >      listen_sk: state:10 bound_dev_if:0 family:10 type:1
> > > > > > protocol:6
> > > > > > mark:0
> > > > > > priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
> > > > > > src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::)
> > > > > > dst_port:0
> > > > > >      srv_sk: state:9 bound_dev_if:0 family:10 type:1
> > > > > > protocol:6
> > > > > > mark:0
> > > > > > priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
> > > > > > src_port:51966 dst_ip4:7f000006(127.0.0.6)
> > > > > > dst_ip6:0:0:0:1(::1)
> > > > > > dst_port:38030
> > > > > >      cli_sk: state:5 bound_dev_if:0 family:10 type:1
> > > > > > protocol:6
> > > > > > mark:0
> > > > > > priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
> > > > > > src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1)
> > > > > > dst_port:51966
> > > > > >      listen_tp: snd_cwnd:10 srtt_us:0 rtt_min:4294967295
> > > > > > snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0
> > > > > > mss_cache:536
> > > > > > ecn_flags:0 rate_delivered:0 rate_interval_us:0
> > > > > > packets_out:0
> > > > > > retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0
> > > > > > segs_out:0
> > > > > > data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:0
> > > > > > bytes_acked:0
> > > > > >      srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272
> > > > > > snd_ssthresh:2147483647
> > > > > > rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810
> > > > > > mss_cache:32768
> > > > > > ecn_flags:0 rate_delivered:1 rate_interval_us:272
> > > > > > packets_out:0
> > > > > > retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0
> > > > > > segs_out:3
> > > > > > data_segs_out:2 lost_out:0 sacked_out:0 bytes_received:1
> > > > > > bytes_acked:22
> > > > > >      cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730
> > > > > > snd_ssthresh:2147483647
> > > > > > rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715
> > > > > > mss_cache:32768
> > > > > > ecn_flags:0 rate_delivered:1 rate_interval_us:925
> > > > > > packets_out:0
> > > > > > retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2
> > > > > > segs_out:6
> > > > > > data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:23
> > > > > > bytes_acked:2
> > > > > >      check_result:PASS:listen_sk 0 nsec
> > > > > >      check_result:PASS:srv_sk 0 nsec
> > > > > >      check_result:PASS:srv_tp 0 nsec
> > > > > > 
> > > > > > If bpf program is compiled with llvm16, the test passed
> > > > > > according
> > > > > > to
> > > > > > a CI run.
> > > > > > 
> > > > > > I don't have s390x environment to debug this. Could you
> > > > > > help
> > > > > > debug
> > > > > > it?
> > > > > > 
> > > > > > Thanks!
> > > > > > 
> > > > > >      [1]
> > > > > > https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645
> > > > > 
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > thank for letting me know.
> > > > > I will look into this.
> > > > > 
> > > > > Best regards,
> > > > > Ilya
> > > > 
> > > > In the meantime the issue was fixed by:
> > > > 
> > > > commit 141be5c062ecf22bd287afffd310e8ac4711444a
> > > > Author: Shoaib Meenai <smeenai@fb.com>
> > > > Date:   Fri May 5 14:18:12 2023 -0700
> > > > 
> > > >       Revert "Reland [Pipeline] Don't limit ArgumentPromotion
> > > > to -
> > > > O3"
> > > >       
> > > >       This reverts commit
> > > > 6f29d1adf29820daae9ea7a01ae2588b67735b9e.
> > > >       
> > > >       https://reviews.llvm.org/D149768   is causing size
> > > > regressions
> > > > for -
> > > > Oz
> > > >       with FullLTO, and I'm reverting that one while
> > > > investigating.
> > > > This
> > > >       commit depends on that one, so it needs to be reverted as
> > > > well.
> > > 
> > > The transformtion "Don't limit ArgumentPromotion to -O3" is
> > > temporarily
> > > reverted. But it could be reverted again once the issue is
> > > resolved.
> > > So it is a good idea to resolve the issue in the kernel.
> > > 
> > > > 
> > > > But looking at the codegen differences:
> > > > 
> > > > $ diff -u <(sed -e s/[0-9]*://g pass.s) <(sed -e s/[0-9]*://g
> > > > fail.s)
> > > > 
> > > > -pass.o:        file format elf64-bpf
> > > > +fail.o:        file format elf64-bpf
> > > > 
> > > > -00000000000002c8 <sk_dst_port__load_half>
> > > > -       69 11 00 30 00 00 00 00 r1 = *(u16 *)(r1 + 48)
> > > > +00000000000002c0 <sk_dst_port__load_half>
> > > > +       54 10 00 00 00 00 ff ff w1 &= 65535
> > > >           b4 00 00 00 00 00 00 01 w0 = 1
> > > >           16 10 00 01 00 00 ca fe if w1 == 51966 goto +1
> > > > <LBB6_2>
> > > >           b4 00 00 00 00 00 00 00 w0 = 0
> > > > 
> > > > This is what ArgumentPromotion is supposed to do, so that's
> > > > okay so
> > > > far. However, further down below we have:
> > > > 
> > > >    Disassembly of section cgroup_skb/egress:
> > > > 
> > > > -       bf 16 00 00 00 00 00 00 r1 = r6
> > > > +       61 76 00 30 00 00 00 00 r7 = *(u32 *)(r6 + 48)
> > > > +       bc 17 00 00 00 00 00 00 w1 = w7
> > > >           85 01 00 00 00 00 00 53 call sk_dst_port__load_word
> > > > 
> > > > ...
> > > > 
> > > > -       bf 16 00 00 00 00 00 00 r1 = r6
> > > > +       74 70 00 00 00 00 00 10 w7 >>= 16
> > > > +       bc 17 00 00 00 00 00 00 w1 = w7
> > > >           85 01 00 00 00 00 00 57 call sk_dst_port__load_half
> > > > 
> > > > so there is no 16-bit load anymore, instead, the result from
> > > > the
> > > > earlier 32-bit load is reused. However, on BE these kinds of
> > > > loads
> > > > for this particular field are not consistent at the moment -
> > > > see
> > > > [1]
> > > > and the previous discussions.
> > > > 
> > > > De-facto we have the following results:
> > > > 
> > > > - int load: 0x0000cafe
> > > > - short load: 0xcafe
> > > 
> > > So 'De-facto' means the above is the expected result.
> > > 
> > > > 
> > > > On a consistent BE we should have rather had:
> > > > 
> > > > - int load: 0x0000cafe
> > > > - short load: 0
> > > 
> > > What does 'consistent BE' mean here? Does it mean the expected
> > > result from the source code itself?
> > 
> > I should not have called the de-facto example "BE" at all: it's
> > rather
> > "mixed endianness" or "weird endianness" or something along these
> > lines.
> > 
> > On "consistent BE" or simply "BE" properties like
> > 
> > *(uint32_t *)p = (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2);
> > 
> > hold. This is currently not the case for bpf_sock.dst_port.
> > 
> > We compile with -mbig-endian, so we promise to the compiler that
> > the
> > machine is big-endian, and the compiler expects the above to hold
> > for
> > any p. Unfortunately when p points to bpf_sock.dst_port, this is
> > not
> > the case.
> 
> If I understand correctly, *(uint32_t *)p to get the
> bpf_sock.dst_port
> is for backward compatibility. But the real u32 read by compiler will
> do (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2) which is not the
> same as expected *(uint32_t *)p so we have problem here.
> 
> > 
> > The property above is important for the correctness of the
> > load/store
> > tearing transformations. What we have here is not exactly tearing,
> > but
> > is quite close.
> > 
> > > > Clang, of course, expects a consistent BE and optimizes around
> > > > that.
> > > > 
> > > > This was a conscious tradeoff Jakub and I have agreed on in
> > > > order
> > > > to
> > > > keep the quirky behavior from the past. Given what's happening
> > > > with
> > > > Clang now, I wonder if it would be worth revisiting it in the
> > > > name
> > > > of
> > > > consistency?
> > > 
> > > If I understand correctly, I think the issue is
> > >       r7 = *(u32 *)(r6 + 48)
> > >       w7 >= 16
> > >       w1 = w7
> > > 
> > > after verifier, it is changed to
> > >      r7 = *(u16 *)(r6 + <kernel offset>)
> > >      w7 >= 16
> > >      w1 = w7
> > > 
> > > and the result after verifier rewrite is completely wrong.
> > > Is it right?
> > 
> > No, the verifier rewrite is correct.
> > The sk_dst_port__load_word() part of the test succeeds.
> > 
> > All these rewrites already work fine, they are correct and
> > consistent.
> > It's really just bpf_sock.dst_port that is special.
> > 
> > > If this is the case, during verifier rewrite, if it is
> > > big endian, say the user intends to load 4 bytes (from uapi
> > > header)
> > > while the kernel field is 2 bytes, in such cases, kernel
> > > can still pretend to generate 4-byte load. For example,
> > > for the above example, the code after verification could be:
> > >      r7 = *(u16 *)(r6 + <kernel offset>)
> > >      r7 <= 16
> > >      w7 >= 16
> > >      w1 = w7
> > > 
> > > Hopefully, we won't have many such cases. Does this work?
> > 
> > This would break the sk_dst_port__load_word() part of the test.
> 
> This is a hack. This may work for this specific u16 case, but
> yes, it won't work for u32 load case.
> 
> > 
> > 
> > 
> > Above I asked whether we can resolve the inconsistency, but I
> > thought
> > about it and I don't see a way of doing it without breaking the
> > ABI,
> > which is at worst unacceptable, and at best a last resort measure.
> > 
> > What do you think about marking bpf_sock.dst_port volatile?
> > volatile
> > should prevent tearing and similar optimizations, with which we
> > have a
> > problem here.
> > 
> > We could also add a comment warning users not to cast away this
> > volatile due to the quirk we have. And then we should adjust the
> > test
> > (making all casts volatile) to comply with this new warning.
> 
> I did a little study on this. The main problem here for
> static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk)
> {
>          __u16 *half = (__u16 *)&sk->dst_port;
>          return half[0] == bpf_htons(0xcafe);
> }
> 
> Through some cross-function optimization by ArgumentPromotion
> optimization, the compiler does:
>     /* the below shared by sk_dst_port__load_word
>      * and sk_dst_port__load_half
>      */
>     __u32 *word = (__u32 *)&sk->dst_port;
>     __u32 word_val = word[0];
> 
>     /* the below is for sk_dst_port__load_half only */
>     __u16 half_val = word_val >> 16;
> 
>     ... half_val passed into sk_dst_port__load_half ...
>     return half_val == bpf_htons(0xcafe);
> 
> Here, 'word_val = word[0]' is replaced by 2-byte load
> by the verifier and this caused the trouble for later
> sk_dst_port__load_half().
> 
> I don't have a good solution here. The issue is exposed
> as we have both u16 and u32 load for &sk->dst_port and
> the compiler did some optimization with this.
> 
> I would say this is an extreme corner case and we can just
> fix in the source code like below:
> 
> diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c 
> b/tools/testing/selftests/bpf/progs/test_sock_fields.c
> index bbad3c2d9aa5..39c975786720 100644
> --- a/tools/testing/selftests/bpf/progs/test_sock_fields.c
> +++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c
> @@ -265,7 +265,10 @@ static __noinline bool 
> sk_dst_port__load_word(struct bpf_sock *sk)
> 
>   static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk)
>   {
> -       __u16 *half = (__u16 *)&sk->dst_port;
> +       __u16 *half;
> +
> +       asm volatile ("");
> +       half  = (__u16 *)&sk->dst_port;
>          return half[0] == bpf_htons(0xcafe);
>   }
> 
> Could you try whether the above workaround works or not?
> If we want the code to be future proof for potential
> cross-func optimization for these noinline functions, we
> can add similar asm codes to all of
> bool sk_dst_port__load_{word, half, byte}.

Hi,

this makes the issue go away, thanks.

However, I'm still concerned, because this only inhibits a certain
optimization and does not address the underlying fundamental problem:
we promise to clang that the in-kernel implementation of the eBPF
virtual machine is big-endian, while in reality it's not. As compiler
optimizations get more aggressive, we will surely see more of this.

Why not do this instead?

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1bb11a6ee667..3c9b535532ae 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6102,7 +6102,7 @@ struct bpf_sock {
 	__u32 src_ip4;
 	__u32 src_ip6[4];
 	__u32 src_port;		/* host byte order */
-	__be16 dst_port;	/* network byte order */
+	volatile __be16 dst_port;	/* network byte order */
 	__u16 :16;		/* zero padding */
 	__u32 dst_ip4;
 	__u32 dst_ip6[4];
diff --git a/tools/include/uapi/linux/bpf.h
b/tools/include/uapi/linux/bpf.h
index 1bb11a6ee667..3c9b535532ae 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6102,7 +6102,7 @@ struct bpf_sock {
 	__u32 src_ip4;
 	__u32 src_ip6[4];
 	__u32 src_port;		/* host byte order */
-	__be16 dst_port;	/* network byte order */
+	volatile __be16 dst_port;	/* network byte order */
 	__u16 :16;		/* zero padding */
 	__u32 dst_ip4;
 	__u32 dst_ip6[4];
diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c
b/tools/testing/selftests/bpf/progs/test_sock_fields.c
index bbad3c2d9aa5..773ded84ac12 100644
--- a/tools/testing/selftests/bpf/progs/test_sock_fields.c
+++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c
@@ -259,19 +259,19 @@ int ingress_read_sock_fields(struct __sk_buff
*skb)
  */
 static __noinline bool sk_dst_port__load_word(struct bpf_sock *sk)
 {
-	__u32 *word = (__u32 *)&sk->dst_port;
+	volatile __u32 *word = (volatile __u32 *)&sk->dst_port;
 	return word[0] == bpf_htons(0xcafe);
 }
 
 static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk)
 {
-	__u16 *half = (__u16 *)&sk->dst_port;
+	volatile __u16 *half = (volatile __u16 *)&sk->dst_port;
 	return half[0] == bpf_htons(0xcafe);
 }
 
 static __noinline bool sk_dst_port__load_byte(struct bpf_sock *sk)
 {
-	__u8 *byte = (__u8 *)&sk->dst_port;
+	volatile __u8 *byte = (volatile __u8 *)&sk->dst_port;
 	return byte[0] == 0xca && byte[1] == 0xfe;
 }
 
This also works, and as far as I'm concerned, this would be a proper
fix for the underlying issue: we tell the compiler that it should never
ever (with any of today's or future optimizations) try to be clever
when accessing dst_port.

Best regards,
Ilya


> > > > [1]
> > > > https://lore.kernel.org/bpf/20220317113920.1068535-5-jakub@cloudflare.com


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: selftest sock_fields failed on s390x with latest llvm17
  2023-05-15  7:55           ` Ilya Leoshkevich
@ 2023-05-15 15:27             ` Yonghong Song
  2023-05-16  9:16               ` Ilya Leoshkevich
  0 siblings, 1 reply; 10+ messages in thread
From: Yonghong Song @ 2023-05-15 15:27 UTC (permalink / raw)
  To: Ilya Leoshkevich
  Cc: bpf, Kui-Feng Lee, Manu Bretelle, Alexei Starovoitov, Jakub Sitnicki



On 5/15/23 12:55 AM, Ilya Leoshkevich wrote:
> On Sun, 2023-05-14 at 09:58 -0700, Yonghong Song wrote:
>>
>>
>> On 5/13/23 1:24 AM, Ilya Leoshkevich wrote:
>>> On Fri, 2023-05-12 at 21:13 -0700, Yonghong Song wrote:
>>>>
>>>>
>>>> On 5/12/23 7:40 AM, Ilya Leoshkevich wrote:
>>>>> On Wed, 2023-05-03 at 21:46 +0200, Ilya Leoshkevich wrote:
>>>>>> On Wed, 2023-05-03 at 12:35 -0700, Yonghong Song wrote:
>>>>>>> Hi, Ilya,
>>>>>>>
>>>>>>> BPF CI ([1]) detected a s390x failure when bpf program is
>>>>>>> compiled
>>>>>>> with
>>>>>>> latest llvm17 on bpf-next branch. To reproduce the issue,
>>>>>>> just
>>>>>>> run
>>>>>>> normal './test_progs -j'. The failure log looks like below:
>>>>>>>
>>>>>>> Notice: Success: 341/3015, Skipped: 29, Failed: 1
>>>>>>> Error: #191 sock_fields
>>>>>>>       Error: #191 sock_fields
>>>>>>>       create_netns:PASS:create netns 0 nsec
>>>>>>>       create_netns:PASS:bring up lo 0 nsec
>>>>>>>      
>>>>>>> serial_test_sock_fields:PASS:test_sock_fields__open_and_loa
>>>>>>> d 0
>>>>>>> nsec
>>>>>>>      
>>>>>>> serial_test_sock_fields:PASS:attach_cgroup(egress_read_sock
>>>>>>> _fie
>>>>>>> lds)
>>>>>>> 0
>>>>>>> nsec
>>>>>>>      
>>>>>>> serial_test_sock_fields:PASS:attach_cgroup(ingress_read_soc
>>>>>>> k_fi
>>>>>>> elds
>>>>>>> )
>>>>>>> 0 nsec
>>>>>>>      
>>>>>>> serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_port
>>>>>>> 0
>>>>>>> nsec
>>>>>>>       test:PASS:getsockname(listen_fd) 0 nsec
>>>>>>>       test:PASS:getsockname(cli_fd) 0 nsec
>>>>>>>       test:PASS:accept(listen_fd) 0 nsec
>>>>>>>      
>>>>>>> init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt_fd)
>>>>>>> 0
>>>>>>> nsec
>>>>>>>      
>>>>>>> init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt10_f
>>>>>>> d) 0
>>>>>>> nsec
>>>>>>>       test:PASS:send(accept_fd) 0 nsec
>>>>>>>       test:PASS:recv(cli_fd) 0 nsec
>>>>>>>       test:PASS:send(accept_fd) 0 nsec
>>>>>>>       test:PASS:recv(cli_fd) 0 nsec
>>>>>>>       test:PASS:recv(accept_fd) for fin 0 nsec
>>>>>>>       test:PASS:recv(cli_fd) for fin 0 nsec
>>>>>>>      
>>>>>>> check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cn
>>>>>>> t,
>>>>>>> &accept_fd) 0 nsec
>>>>>>>      
>>>>>>> check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cn
>>>>>>> t,
>>>>>>> &cli_fd) 0 nsec
>>>>>>>       check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0
>>>>>>> nsec
>>>>>>>       check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0
>>>>>>> nsec
>>>>>>>       check_result:PASS:bpf_map_lookup_elem(linum_map_fd,
>>>>>>> READ_SK_DST_PORT_IDX) 0 nsec
>>>>>>>       check_result:FAIL:failure in read_sk_dst_port on line
>>>>>>> unexpected
>>>>>>> failure in read_sk_dst_port on line: actual 297 != expected
>>>>>>> 0
>>>>>>>       listen_sk: state:10 bound_dev_if:0 family:10 type:1
>>>>>>> protocol:6
>>>>>>> mark:0
>>>>>>> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
>>>>>>> src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::)
>>>>>>> dst_port:0
>>>>>>>       srv_sk: state:9 bound_dev_if:0 family:10 type:1
>>>>>>> protocol:6
>>>>>>> mark:0
>>>>>>> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
>>>>>>> src_port:51966 dst_ip4:7f000006(127.0.0.6)
>>>>>>> dst_ip6:0:0:0:1(::1)
>>>>>>> dst_port:38030
>>>>>>>       cli_sk: state:5 bound_dev_if:0 family:10 type:1
>>>>>>> protocol:6
>>>>>>> mark:0
>>>>>>> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
>>>>>>> src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1)
>>>>>>> dst_port:51966
>>>>>>>       listen_tp: snd_cwnd:10 srtt_us:0 rtt_min:4294967295
>>>>>>> snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0
>>>>>>> mss_cache:536
>>>>>>> ecn_flags:0 rate_delivered:0 rate_interval_us:0
>>>>>>> packets_out:0
>>>>>>> retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0
>>>>>>> segs_out:0
>>>>>>> data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:0
>>>>>>> bytes_acked:0
>>>>>>>       srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272
>>>>>>> snd_ssthresh:2147483647
>>>>>>> rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810
>>>>>>> mss_cache:32768
>>>>>>> ecn_flags:0 rate_delivered:1 rate_interval_us:272
>>>>>>> packets_out:0
>>>>>>> retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0
>>>>>>> segs_out:3
>>>>>>> data_segs_out:2 lost_out:0 sacked_out:0 bytes_received:1
>>>>>>> bytes_acked:22
>>>>>>>       cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730
>>>>>>> snd_ssthresh:2147483647
>>>>>>> rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715
>>>>>>> mss_cache:32768
>>>>>>> ecn_flags:0 rate_delivered:1 rate_interval_us:925
>>>>>>> packets_out:0
>>>>>>> retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2
>>>>>>> segs_out:6
>>>>>>> data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:23
>>>>>>> bytes_acked:2
>>>>>>>       check_result:PASS:listen_sk 0 nsec
>>>>>>>       check_result:PASS:srv_sk 0 nsec
>>>>>>>       check_result:PASS:srv_tp 0 nsec
>>>>>>>
>>>>>>> If bpf program is compiled with llvm16, the test passed
>>>>>>> according
>>>>>>> to
>>>>>>> a CI run.
>>>>>>>
>>>>>>> I don't have s390x environment to debug this. Could you
>>>>>>> help
>>>>>>> debug
>>>>>>> it?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>       [1]
>>>>>>> https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> thank for letting me know.
>>>>>> I will look into this.
>>>>>>
>>>>>> Best regards,
>>>>>> Ilya
>>>>>
>>>>> In the meantime the issue was fixed by:
>>>>>
>>>>> commit 141be5c062ecf22bd287afffd310e8ac4711444a
>>>>> Author: Shoaib Meenai <smeenai@fb.com>
>>>>> Date:   Fri May 5 14:18:12 2023 -0700
>>>>>
>>>>>        Revert "Reland [Pipeline] Don't limit ArgumentPromotion
>>>>> to -
>>>>> O3"
>>>>>        
>>>>>        This reverts commit
>>>>> 6f29d1adf29820daae9ea7a01ae2588b67735b9e.
>>>>>        
>>>>>        https://reviews.llvm.org/D149768    is causing size
>>>>> regressions
>>>>> for -
>>>>> Oz
>>>>>        with FullLTO, and I'm reverting that one while
>>>>> investigating.
>>>>> This
>>>>>        commit depends on that one, so it needs to be reverted as
>>>>> well.
>>>>
>>>> The transformtion "Don't limit ArgumentPromotion to -O3" is
>>>> temporarily
>>>> reverted. But it could be reverted again once the issue is
>>>> resolved.
>>>> So it is a good idea to resolve the issue in the kernel.
>>>>
>>>>>
>>>>> But looking at the codegen differences:
>>>>>
>>>>> $ diff -u <(sed -e s/[0-9]*://g pass.s) <(sed -e s/[0-9]*://g
>>>>> fail.s)
>>>>>
>>>>> -pass.o:        file format elf64-bpf
>>>>> +fail.o:        file format elf64-bpf
>>>>>
>>>>> -00000000000002c8 <sk_dst_port__load_half>
>>>>> -       69 11 00 30 00 00 00 00 r1 = *(u16 *)(r1 + 48)
>>>>> +00000000000002c0 <sk_dst_port__load_half>
>>>>> +       54 10 00 00 00 00 ff ff w1 &= 65535
>>>>>            b4 00 00 00 00 00 00 01 w0 = 1
>>>>>            16 10 00 01 00 00 ca fe if w1 == 51966 goto +1
>>>>> <LBB6_2>
>>>>>            b4 00 00 00 00 00 00 00 w0 = 0
>>>>>
>>>>> This is what ArgumentPromotion is supposed to do, so that's
>>>>> okay so
>>>>> far. However, further down below we have:
>>>>>
>>>>>     Disassembly of section cgroup_skb/egress:
>>>>>
>>>>> -       bf 16 00 00 00 00 00 00 r1 = r6
>>>>> +       61 76 00 30 00 00 00 00 r7 = *(u32 *)(r6 + 48)
>>>>> +       bc 17 00 00 00 00 00 00 w1 = w7
>>>>>            85 01 00 00 00 00 00 53 call sk_dst_port__load_word
>>>>>
>>>>> ...
>>>>>
>>>>> -       bf 16 00 00 00 00 00 00 r1 = r6
>>>>> +       74 70 00 00 00 00 00 10 w7 >>= 16
>>>>> +       bc 17 00 00 00 00 00 00 w1 = w7
>>>>>            85 01 00 00 00 00 00 57 call sk_dst_port__load_half
>>>>>
>>>>> so there is no 16-bit load anymore, instead, the result from
>>>>> the
>>>>> earlier 32-bit load is reused. However, on BE these kinds of
>>>>> loads
>>>>> for this particular field are not consistent at the moment -
>>>>> see
>>>>> [1]
>>>>> and the previous discussions.
>>>>>
>>>>> De-facto we have the following results:
>>>>>
>>>>> - int load: 0x0000cafe
>>>>> - short load: 0xcafe
>>>>
>>>> So 'De-facto' means the above is the expected result.
>>>>
>>>>>
>>>>> On a consistent BE we should have rather had:
>>>>>
>>>>> - int load: 0x0000cafe
>>>>> - short load: 0
>>>>
>>>> What does 'consistent BE' mean here? Does it mean the expected
>>>> result from the source code itself?
>>>
>>> I should not have called the de-facto example "BE" at all: it's
>>> rather
>>> "mixed endianness" or "weird endianness" or something along these
>>> lines.
>>>
>>> On "consistent BE" or simply "BE" properties like
>>>
>>> *(uint32_t *)p = (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2);
>>>
>>> hold. This is currently not the case for bpf_sock.dst_port.
>>>
>>> We compile with -mbig-endian, so we promise to the compiler that
>>> the
>>> machine is big-endian, and the compiler expects the above to hold
>>> for
>>> any p. Unfortunately when p points to bpf_sock.dst_port, this is
>>> not
>>> the case.
>>
>> If I understand correctly, *(uint32_t *)p to get the
>> bpf_sock.dst_port
>> is for backward compatibility. But the real u32 read by compiler will
>> do (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2) which is not the
>> same as expected *(uint32_t *)p so we have problem here.
>>
>>>
>>> The property above is important for the correctness of the
>>> load/store
>>> tearing transformations. What we have here is not exactly tearing,
>>> but
>>> is quite close.
>>>
>>>>> Clang, of course, expects a consistent BE and optimizes around
>>>>> that.
>>>>>
>>>>> This was a conscious tradeoff Jakub and I have agreed on in
>>>>> order
>>>>> to
>>>>> keep the quirky behavior from the past. Given what's happening
>>>>> with
>>>>> Clang now, I wonder if it would be worth revisiting it in the
>>>>> name
>>>>> of
>>>>> consistency?
>>>>
>>>> If I understand correctly, I think the issue is
>>>>        r7 = *(u32 *)(r6 + 48)
>>>>        w7 >= 16
>>>>        w1 = w7
>>>>
>>>> after verifier, it is changed to
>>>>       r7 = *(u16 *)(r6 + <kernel offset>)
>>>>       w7 >= 16
>>>>       w1 = w7
>>>>
>>>> and the result after verifier rewrite is completely wrong.
>>>> Is it right?
>>>
>>> No, the verifier rewrite is correct.
>>> The sk_dst_port__load_word() part of the test succeeds.
>>>
>>> All these rewrites already work fine, they are correct and
>>> consistent.
>>> It's really just bpf_sock.dst_port that is special.
>>>
>>>> If this is the case, during verifier rewrite, if it is
>>>> big endian, say the user intends to load 4 bytes (from uapi
>>>> header)
>>>> while the kernel field is 2 bytes, in such cases, kernel
>>>> can still pretend to generate 4-byte load. For example,
>>>> for the above example, the code after verification could be:
>>>>       r7 = *(u16 *)(r6 + <kernel offset>)
>>>>       r7 <= 16
>>>>       w7 >= 16
>>>>       w1 = w7
>>>>
>>>> Hopefully, we won't have many such cases. Does this work?
>>>
>>> This would break the sk_dst_port__load_word() part of the test.
>>
>> This is a hack. This may work for this specific u16 case, but
>> yes, it won't work for u32 load case.
>>
>>>
>>>
>>>
>>> Above I asked whether we can resolve the inconsistency, but I
>>> thought
>>> about it and I don't see a way of doing it without breaking the
>>> ABI,
>>> which is at worst unacceptable, and at best a last resort measure.
>>>
>>> What do you think about marking bpf_sock.dst_port volatile?
>>> volatile
>>> should prevent tearing and similar optimizations, with which we
>>> have a
>>> problem here.
>>>
>>> We could also add a comment warning users not to cast away this
>>> volatile due to the quirk we have. And then we should adjust the
>>> test
>>> (making all casts volatile) to comply with this new warning.
>>
>> I did a little study on this. The main problem here for
>> static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk)
>> {
>>           __u16 *half = (__u16 *)&sk->dst_port;
>>           return half[0] == bpf_htons(0xcafe);
>> }
>>
>> Through some cross-function optimization by ArgumentPromotion
>> optimization, the compiler does:
>>      /* the below shared by sk_dst_port__load_word
>>       * and sk_dst_port__load_half
>>       */
>>      __u32 *word = (__u32 *)&sk->dst_port;
>>      __u32 word_val = word[0];
>>
>>      /* the below is for sk_dst_port__load_half only */
>>      __u16 half_val = word_val >> 16;
>>
>>      ... half_val passed into sk_dst_port__load_half ...
>>      return half_val == bpf_htons(0xcafe);
>>
>> Here, 'word_val = word[0]' is replaced by 2-byte load
>> by the verifier and this caused the trouble for later
>> sk_dst_port__load_half().
>>
>> I don't have a good solution here. The issue is exposed
>> as we have both u16 and u32 load for &sk->dst_port and
>> the compiler did some optimization with this.
>>
>> I would say this is an extreme corner case and we can just
>> fix in the source code like below:
>>
>> diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c
>> b/tools/testing/selftests/bpf/progs/test_sock_fields.c
>> index bbad3c2d9aa5..39c975786720 100644
>> --- a/tools/testing/selftests/bpf/progs/test_sock_fields.c
>> +++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c
>> @@ -265,7 +265,10 @@ static __noinline bool
>> sk_dst_port__load_word(struct bpf_sock *sk)
>>
>>    static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk)
>>    {
>> -       __u16 *half = (__u16 *)&sk->dst_port;
>> +       __u16 *half;
>> +
>> +       asm volatile ("");
>> +       half  = (__u16 *)&sk->dst_port;
>>           return half[0] == bpf_htons(0xcafe);
>>    }
>>
>> Could you try whether the above workaround works or not?
>> If we want the code to be future proof for potential
>> cross-func optimization for these noinline functions, we
>> can add similar asm codes to all of
>> bool sk_dst_port__load_{word, half, byte}.
> 
> Hi,
> 
> this makes the issue go away, thanks.
> 
> However, I'm still concerned, because this only inhibits a certain
> optimization and does not address the underlying fundamental problem:
> we promise to clang that the in-kernel implementation of the eBPF
> virtual machine is big-endian, while in reality it's not. As compiler
> optimizations get more aggressive, we will surely see more of this.
> 
> Why not do this instead?
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 1bb11a6ee667..3c9b535532ae 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -6102,7 +6102,7 @@ struct bpf_sock {
>   	__u32 src_ip4;
>   	__u32 src_ip6[4];
>   	__u32 src_port;		/* host byte order */
> -	__be16 dst_port;	/* network byte order */
> +	volatile __be16 dst_port;	/* network byte order */
>   	__u16 :16;		/* zero padding */
>   	__u32 dst_ip4;
>   	__u32 dst_ip6[4];
> diff --git a/tools/include/uapi/linux/bpf.h
> b/tools/include/uapi/linux/bpf.h
> index 1bb11a6ee667..3c9b535532ae 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -6102,7 +6102,7 @@ struct bpf_sock {
>   	__u32 src_ip4;
>   	__u32 src_ip6[4];
>   	__u32 src_port;		/* host byte order */
> -	__be16 dst_port;	/* network byte order */
> +	volatile __be16 dst_port;	/* network byte order */
>   	__u16 :16;		/* zero padding */
>   	__u32 dst_ip4;
>   	__u32 dst_ip6[4];
> diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c
> b/tools/testing/selftests/bpf/progs/test_sock_fields.c
> index bbad3c2d9aa5..773ded84ac12 100644
> --- a/tools/testing/selftests/bpf/progs/test_sock_fields.c
> +++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c
> @@ -259,19 +259,19 @@ int ingress_read_sock_fields(struct __sk_buff
> *skb)
>    */
>   static __noinline bool sk_dst_port__load_word(struct bpf_sock *sk)
>   {
> -	__u32 *word = (__u32 *)&sk->dst_port;
> +	volatile __u32 *word = (volatile __u32 *)&sk->dst_port;
>   	return word[0] == bpf_htons(0xcafe);
>   }
>   
>   static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk)
>   {
> -	__u16 *half = (__u16 *)&sk->dst_port;
> +	volatile __u16 *half = (volatile __u16 *)&sk->dst_port;
>   	return half[0] == bpf_htons(0xcafe);
>   }
>   
>   static __noinline bool sk_dst_port__load_byte(struct bpf_sock *sk)
>   {
> -	__u8 *byte = (__u8 *)&sk->dst_port;
> +	volatile __u8 *byte = (volatile __u8 *)&sk->dst_port;
>   	return byte[0] == 0xca && byte[1] == 0xfe;
>   }
>   
> This also works, and as far as I'm concerned, this would be a proper
> fix for the underlying issue: we tell the compiler that it should never
> ever (with any of today's or future optimizations) try to be clever
> when accessing dst_port.

The above test_sock_fields.c change should work too.
I think the uapi change is not necessary. The key word 'volatile'
intends to avoid merging two or more identical loads together.
In this particular case, with only uapi changes, the
harmful transformation can still happen since the sk->dst_port
is indeed loaded only once.

The test_sock_fields.c change itself forces proper load
which is verifier friendly. So I suggest to have
test_sock_fields.c change only. My previously suggested
changes have the same effect, to preserve the verifier friendly
load.

> 
> Best regards,
> Ilya
> 
> 
>>>>> [1]
>>>>> https://lore.kernel.org/bpf/20220317113920.1068535-5-jakub@cloudflare.com
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: selftest sock_fields failed on s390x with latest llvm17
  2023-05-15 15:27             ` Yonghong Song
@ 2023-05-16  9:16               ` Ilya Leoshkevich
  2023-05-16 15:20                 ` Yonghong Song
  0 siblings, 1 reply; 10+ messages in thread
From: Ilya Leoshkevich @ 2023-05-16  9:16 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Kui-Feng Lee, Manu Bretelle, Alexei Starovoitov, Jakub Sitnicki

On Mon, 2023-05-15 at 08:27 -0700, Yonghong Song wrote:
> 
> 
> On 5/15/23 12:55 AM, Ilya Leoshkevich wrote:
> > On Sun, 2023-05-14 at 09:58 -0700, Yonghong Song wrote:
> > > 
> > > 
> > > On 5/13/23 1:24 AM, Ilya Leoshkevich wrote:
> > > > On Fri, 2023-05-12 at 21:13 -0700, Yonghong Song wrote:
> > > > > 
> > > > > 
> > > > > On 5/12/23 7:40 AM, Ilya Leoshkevich wrote:
> > > > > > On Wed, 2023-05-03 at 21:46 +0200, Ilya Leoshkevich wrote:
> > > > > > > On Wed, 2023-05-03 at 12:35 -0700, Yonghong Song wrote:
> > > > > > > > Hi, Ilya,
> > > > > > > > 
> > > > > > > > BPF CI ([1]) detected a s390x failure when bpf program
> > > > > > > > is
> > > > > > > > compiled
> > > > > > > > with
> > > > > > > > latest llvm17 on bpf-next branch. To reproduce the
> > > > > > > > issue,
> > > > > > > > just
> > > > > > > > run
> > > > > > > > normal './test_progs -j'. The failure log looks like
> > > > > > > > below:
> > > > > > > > 
> > > > > > > > Notice: Success: 341/3015, Skipped: 29, Failed: 1
> > > > > > > > Error: #191 sock_fields
> > > > > > > >       Error: #191 sock_fields
> > > > > > > >       create_netns:PASS:create netns 0 nsec
> > > > > > > >       create_netns:PASS:bring up lo 0 nsec
> > > > > > > >      
> > > > > > > > serial_test_sock_fields:PASS:test_sock_fields__open_and
> > > > > > > > _loa
> > > > > > > > d 0
> > > > > > > > nsec
> > > > > > > >      
> > > > > > > > serial_test_sock_fields:PASS:attach_cgroup(egress_read_
> > > > > > > > sock
> > > > > > > > _fie
> > > > > > > > lds)
> > > > > > > > 0
> > > > > > > > nsec
> > > > > > > >      
> > > > > > > > serial_test_sock_fields:PASS:attach_cgroup(ingress_read
> > > > > > > > _soc
> > > > > > > > k_fi
> > > > > > > > elds
> > > > > > > > )
> > > > > > > > 0 nsec
> > > > > > > >      
> > > > > > > > serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_
> > > > > > > > port
> > > > > > > > 0
> > > > > > > > nsec
> > > > > > > >       test:PASS:getsockname(listen_fd) 0 nsec
> > > > > > > >       test:PASS:getsockname(cli_fd) 0 nsec
> > > > > > > >       test:PASS:accept(listen_fd) 0 nsec
> > > > > > > >      
> > > > > > > > init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt
> > > > > > > > _fd)
> > > > > > > > 0
> > > > > > > > nsec
> > > > > > > >      
> > > > > > > > init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt
> > > > > > > > 10_f
> > > > > > > > d) 0
> > > > > > > > nsec
> > > > > > > >       test:PASS:send(accept_fd) 0 nsec
> > > > > > > >       test:PASS:recv(cli_fd) 0 nsec
> > > > > > > >       test:PASS:send(accept_fd) 0 nsec
> > > > > > > >       test:PASS:recv(cli_fd) 0 nsec
> > > > > > > >       test:PASS:recv(accept_fd) for fin 0 nsec
> > > > > > > >       test:PASS:recv(cli_fd) for fin 0 nsec
> > > > > > > >      
> > > > > > > > check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_ou
> > > > > > > > t_cn
> > > > > > > > t,
> > > > > > > > &accept_fd) 0 nsec
> > > > > > > >      
> > > > > > > > check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_ou
> > > > > > > > t_cn
> > > > > > > > t,
> > > > > > > > &cli_fd) 0 nsec
> > > > > > > >      
> > > > > > > > check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0
> > > > > > > > nsec
> > > > > > > >      
> > > > > > > > check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0
> > > > > > > > nsec
> > > > > > > >      
> > > > > > > > check_result:PASS:bpf_map_lookup_elem(linum_map_fd,
> > > > > > > > READ_SK_DST_PORT_IDX) 0 nsec
> > > > > > > >       check_result:FAIL:failure in read_sk_dst_port on
> > > > > > > > line
> > > > > > > > unexpected
> > > > > > > > failure in read_sk_dst_port on line: actual 297 !=
> > > > > > > > expected
> > > > > > > > 0
> > > > > > > >       listen_sk: state:10 bound_dev_if:0 family:10
> > > > > > > > type:1
> > > > > > > > protocol:6
> > > > > > > > mark:0
> > > > > > > > priority:0 src_ip4:7f000006(127.0.0.6)
> > > > > > > > src_ip6:0:0:0:1(::1)
> > > > > > > > src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::)
> > > > > > > > dst_port:0
> > > > > > > >       srv_sk: state:9 bound_dev_if:0 family:10 type:1
> > > > > > > > protocol:6
> > > > > > > > mark:0
> > > > > > > > priority:0 src_ip4:7f000006(127.0.0.6)
> > > > > > > > src_ip6:0:0:0:1(::1)
> > > > > > > > src_port:51966 dst_ip4:7f000006(127.0.0.6)
> > > > > > > > dst_ip6:0:0:0:1(::1)
> > > > > > > > dst_port:38030
> > > > > > > >       cli_sk: state:5 bound_dev_if:0 family:10 type:1
> > > > > > > > protocol:6
> > > > > > > > mark:0
> > > > > > > > priority:0 src_ip4:7f000006(127.0.0.6)
> > > > > > > > src_ip6:0:0:0:1(::1)
> > > > > > > > src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1)
> > > > > > > > dst_port:51966
> > > > > > > >       listen_tp: snd_cwnd:10 srtt_us:0
> > > > > > > > rtt_min:4294967295
> > > > > > > > snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0
> > > > > > > > mss_cache:536
> > > > > > > > ecn_flags:0 rate_delivered:0 rate_interval_us:0
> > > > > > > > packets_out:0
> > > > > > > > retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0
> > > > > > > > segs_out:0
> > > > > > > > data_segs_out:0 lost_out:0 sacked_out:0
> > > > > > > > bytes_received:0
> > > > > > > > bytes_acked:0
> > > > > > > >       srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272
> > > > > > > > snd_ssthresh:2147483647
> > > > > > > > rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810
> > > > > > > > mss_cache:32768
> > > > > > > > ecn_flags:0 rate_delivered:1 rate_interval_us:272
> > > > > > > > packets_out:0
> > > > > > > > retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0
> > > > > > > > segs_out:3
> > > > > > > > data_segs_out:2 lost_out:0 sacked_out:0
> > > > > > > > bytes_received:1
> > > > > > > > bytes_acked:22
> > > > > > > >       cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730
> > > > > > > > snd_ssthresh:2147483647
> > > > > > > > rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715
> > > > > > > > mss_cache:32768
> > > > > > > > ecn_flags:0 rate_delivered:1 rate_interval_us:925
> > > > > > > > packets_out:0
> > > > > > > > retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2
> > > > > > > > segs_out:6
> > > > > > > > data_segs_out:0 lost_out:0 sacked_out:0
> > > > > > > > bytes_received:23
> > > > > > > > bytes_acked:2
> > > > > > > >       check_result:PASS:listen_sk 0 nsec
> > > > > > > >       check_result:PASS:srv_sk 0 nsec
> > > > > > > >       check_result:PASS:srv_tp 0 nsec
> > > > > > > > 
> > > > > > > > If bpf program is compiled with llvm16, the test passed
> > > > > > > > according
> > > > > > > > to
> > > > > > > > a CI run.
> > > > > > > > 
> > > > > > > > I don't have s390x environment to debug this. Could you
> > > > > > > > help
> > > > > > > > debug
> > > > > > > > it?
> > > > > > > > 
> > > > > > > > Thanks!
> > > > > > > > 
> > > > > > > >       [1]
> > > > > > > > https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645
> > > > > > > 
> > > > > > > 
> > > > > > > Hi,
> > > > > > > 
> > > > > > > thank for letting me know.
> > > > > > > I will look into this.
> > > > > > > 
> > > > > > > Best regards,
> > > > > > > Ilya
> > > > > > 
> > > > > > In the meantime the issue was fixed by:
> > > > > > 
> > > > > > commit 141be5c062ecf22bd287afffd310e8ac4711444a
> > > > > > Author: Shoaib Meenai <smeenai@fb.com>
> > > > > > Date:   Fri May 5 14:18:12 2023 -0700
> > > > > > 
> > > > > >        Revert "Reland [Pipeline] Don't limit
> > > > > > ArgumentPromotion
> > > > > > to -
> > > > > > O3"
> > > > > >        
> > > > > >        This reverts commit
> > > > > > 6f29d1adf29820daae9ea7a01ae2588b67735b9e.
> > > > > >        
> > > > > >        https://reviews.llvm.org/D149768    is causing size
> > > > > > regressions
> > > > > > for -
> > > > > > Oz
> > > > > >        with FullLTO, and I'm reverting that one while
> > > > > > investigating.
> > > > > > This
> > > > > >        commit depends on that one, so it needs to be
> > > > > > reverted as
> > > > > > well.
> > > > > 
> > > > > The transformtion "Don't limit ArgumentPromotion to -O3" is
> > > > > temporarily
> > > > > reverted. But it could be reverted again once the issue is
> > > > > resolved.
> > > > > So it is a good idea to resolve the issue in the kernel.
> > > > > 
> > > > > > 
> > > > > > But looking at the codegen differences:
> > > > > > 
> > > > > > $ diff -u <(sed -e s/[0-9]*://g pass.s) <(sed -e s/[0-
> > > > > > 9]*://g
> > > > > > fail.s)
> > > > > > 
> > > > > > -pass.o:        file format elf64-bpf
> > > > > > +fail.o:        file format elf64-bpf
> > > > > > 
> > > > > > -00000000000002c8 <sk_dst_port__load_half>
> > > > > > -       69 11 00 30 00 00 00 00 r1 = *(u16 *)(r1 + 48)
> > > > > > +00000000000002c0 <sk_dst_port__load_half>
> > > > > > +       54 10 00 00 00 00 ff ff w1 &= 65535
> > > > > >            b4 00 00 00 00 00 00 01 w0 = 1
> > > > > >            16 10 00 01 00 00 ca fe if w1 == 51966 goto +1
> > > > > > <LBB6_2>
> > > > > >            b4 00 00 00 00 00 00 00 w0 = 0
> > > > > > 
> > > > > > This is what ArgumentPromotion is supposed to do, so that's
> > > > > > okay so
> > > > > > far. However, further down below we have:
> > > > > > 
> > > > > >     Disassembly of section cgroup_skb/egress:
> > > > > > 
> > > > > > -       bf 16 00 00 00 00 00 00 r1 = r6
> > > > > > +       61 76 00 30 00 00 00 00 r7 = *(u32 *)(r6 + 48)
> > > > > > +       bc 17 00 00 00 00 00 00 w1 = w7
> > > > > >            85 01 00 00 00 00 00 53 call
> > > > > > sk_dst_port__load_word
> > > > > > 
> > > > > > ...
> > > > > > 
> > > > > > -       bf 16 00 00 00 00 00 00 r1 = r6
> > > > > > +       74 70 00 00 00 00 00 10 w7 >>= 16
> > > > > > +       bc 17 00 00 00 00 00 00 w1 = w7
> > > > > >            85 01 00 00 00 00 00 57 call
> > > > > > sk_dst_port__load_half
> > > > > > 
> > > > > > so there is no 16-bit load anymore, instead, the result
> > > > > > from
> > > > > > the
> > > > > > earlier 32-bit load is reused. However, on BE these kinds
> > > > > > of
> > > > > > loads
> > > > > > for this particular field are not consistent at the moment
> > > > > > -
> > > > > > see
> > > > > > [1]
> > > > > > and the previous discussions.
> > > > > > 
> > > > > > De-facto we have the following results:
> > > > > > 
> > > > > > - int load: 0x0000cafe
> > > > > > - short load: 0xcafe
> > > > > 
> > > > > So 'De-facto' means the above is the expected result.
> > > > > 
> > > > > > 
> > > > > > On a consistent BE we should have rather had:
> > > > > > 
> > > > > > - int load: 0x0000cafe
> > > > > > - short load: 0
> > > > > 
> > > > > What does 'consistent BE' mean here? Does it mean the
> > > > > expected
> > > > > result from the source code itself?
> > > > 
> > > > I should not have called the de-facto example "BE" at all: it's
> > > > rather
> > > > "mixed endianness" or "weird endianness" or something along
> > > > these
> > > > lines.
> > > > 
> > > > On "consistent BE" or simply "BE" properties like
> > > > 
> > > > *(uint32_t *)p = (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2);
> > > > 
> > > > hold. This is currently not the case for bpf_sock.dst_port.
> > > > 
> > > > We compile with -mbig-endian, so we promise to the compiler
> > > > that
> > > > the
> > > > machine is big-endian, and the compiler expects the above to
> > > > hold
> > > > for
> > > > any p. Unfortunately when p points to bpf_sock.dst_port, this
> > > > is
> > > > not
> > > > the case.
> > > 
> > > If I understand correctly, *(uint32_t *)p to get the
> > > bpf_sock.dst_port
> > > is for backward compatibility. But the real u32 read by compiler
> > > will
> > > do (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2) which is not the
> > > same as expected *(uint32_t *)p so we have problem here.
> > > 
> > > > 
> > > > The property above is important for the correctness of the
> > > > load/store
> > > > tearing transformations. What we have here is not exactly
> > > > tearing,
> > > > but
> > > > is quite close.
> > > > 
> > > > > > Clang, of course, expects a consistent BE and optimizes
> > > > > > around
> > > > > > that.
> > > > > > 
> > > > > > This was a conscious tradeoff Jakub and I have agreed on in
> > > > > > order
> > > > > > to
> > > > > > keep the quirky behavior from the past. Given what's
> > > > > > happening
> > > > > > with
> > > > > > Clang now, I wonder if it would be worth revisiting it in
> > > > > > the
> > > > > > name
> > > > > > of
> > > > > > consistency?
> > > > > 
> > > > > If I understand correctly, I think the issue is
> > > > >        r7 = *(u32 *)(r6 + 48)
> > > > >        w7 >= 16
> > > > >        w1 = w7
> > > > > 
> > > > > after verifier, it is changed to
> > > > >       r7 = *(u16 *)(r6 + <kernel offset>)
> > > > >       w7 >= 16
> > > > >       w1 = w7
> > > > > 
> > > > > and the result after verifier rewrite is completely wrong.
> > > > > Is it right?
> > > > 
> > > > No, the verifier rewrite is correct.
> > > > The sk_dst_port__load_word() part of the test succeeds.
> > > > 
> > > > All these rewrites already work fine, they are correct and
> > > > consistent.
> > > > It's really just bpf_sock.dst_port that is special.
> > > > 
> > > > > If this is the case, during verifier rewrite, if it is
> > > > > big endian, say the user intends to load 4 bytes (from uapi
> > > > > header)
> > > > > while the kernel field is 2 bytes, in such cases, kernel
> > > > > can still pretend to generate 4-byte load. For example,
> > > > > for the above example, the code after verification could be:
> > > > >       r7 = *(u16 *)(r6 + <kernel offset>)
> > > > >       r7 <= 16
> > > > >       w7 >= 16
> > > > >       w1 = w7
> > > > > 
> > > > > Hopefully, we won't have many such cases. Does this work?
> > > > 
> > > > This would break the sk_dst_port__load_word() part of the test.
> > > 
> > > This is a hack. This may work for this specific u16 case, but
> > > yes, it won't work for u32 load case.
> > > 
> > > > 
> > > > 
> > > > 
> > > > Above I asked whether we can resolve the inconsistency, but I
> > > > thought
> > > > about it and I don't see a way of doing it without breaking the
> > > > ABI,
> > > > which is at worst unacceptable, and at best a last resort
> > > > measure.
> > > > 
> > > > What do you think about marking bpf_sock.dst_port volatile?
> > > > volatile
> > > > should prevent tearing and similar optimizations, with which we
> > > > have a
> > > > problem here.
> > > > 
> > > > We could also add a comment warning users not to cast away this
> > > > volatile due to the quirk we have. And then we should adjust
> > > > the
> > > > test
> > > > (making all casts volatile) to comply with this new warning.
> > > 
> > > I did a little study on this. The main problem here for
> > > static __noinline bool sk_dst_port__load_half(struct bpf_sock
> > > *sk)
> > > {
> > >           __u16 *half = (__u16 *)&sk->dst_port;
> > >           return half[0] == bpf_htons(0xcafe);
> > > }
> > > 
> > > Through some cross-function optimization by ArgumentPromotion
> > > optimization, the compiler does:
> > >      /* the below shared by sk_dst_port__load_word
> > >       * and sk_dst_port__load_half
> > >       */
> > >      __u32 *word = (__u32 *)&sk->dst_port;
> > >      __u32 word_val = word[0];
> > > 
> > >      /* the below is for sk_dst_port__load_half only */
> > >      __u16 half_val = word_val >> 16;
> > > 
> > >      ... half_val passed into sk_dst_port__load_half ...
> > >      return half_val == bpf_htons(0xcafe);
> > > 
> > > Here, 'word_val = word[0]' is replaced by 2-byte load
> > > by the verifier and this caused the trouble for later
> > > sk_dst_port__load_half().
> > > 
> > > I don't have a good solution here. The issue is exposed
> > > as we have both u16 and u32 load for &sk->dst_port and
> > > the compiler did some optimization with this.
> > > 
> > > I would say this is an extreme corner case and we can just
> > > fix in the source code like below:
> > > 
> > > diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c
> > > b/tools/testing/selftests/bpf/progs/test_sock_fields.c
> > > index bbad3c2d9aa5..39c975786720 100644
> > > --- a/tools/testing/selftests/bpf/progs/test_sock_fields.c
> > > +++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c
> > > @@ -265,7 +265,10 @@ static __noinline bool
> > > sk_dst_port__load_word(struct bpf_sock *sk)
> > > 
> > >    static __noinline bool sk_dst_port__load_half(struct bpf_sock
> > > *sk)
> > >    {
> > > -       __u16 *half = (__u16 *)&sk->dst_port;
> > > +       __u16 *half;
> > > +
> > > +       asm volatile ("");
> > > +       half  = (__u16 *)&sk->dst_port;
> > >           return half[0] == bpf_htons(0xcafe);
> > >    }
> > > 
> > > Could you try whether the above workaround works or not?
> > > If we want the code to be future proof for potential
> > > cross-func optimization for these noinline functions, we
> > > can add similar asm codes to all of
> > > bool sk_dst_port__load_{word, half, byte}.
> > 
> > Hi,
> > 
> > this makes the issue go away, thanks.
> > 
> > However, I'm still concerned, because this only inhibits a certain
> > optimization and does not address the underlying fundamental
> > problem:
> > we promise to clang that the in-kernel implementation of the eBPF
> > virtual machine is big-endian, while in reality it's not. As
> > compiler
> > optimizations get more aggressive, we will surely see more of this.
> > 
> > Why not do this instead?
> > 
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 1bb11a6ee667..3c9b535532ae 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -6102,7 +6102,7 @@ struct bpf_sock {
> >         __u32 src_ip4;
> >         __u32 src_ip6[4];
> >         __u32 src_port;         /* host byte order */
> > -       __be16 dst_port;        /* network byte order */
> > +       volatile __be16 dst_port;       /* network byte order */
> >         __u16 :16;              /* zero padding */
> >         __u32 dst_ip4;
> >         __u32 dst_ip6[4];
> > diff --git a/tools/include/uapi/linux/bpf.h
> > b/tools/include/uapi/linux/bpf.h
> > index 1bb11a6ee667..3c9b535532ae 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -6102,7 +6102,7 @@ struct bpf_sock {
> >         __u32 src_ip4;
> >         __u32 src_ip6[4];
> >         __u32 src_port;         /* host byte order */
> > -       __be16 dst_port;        /* network byte order */
> > +       volatile __be16 dst_port;       /* network byte order */
> >         __u16 :16;              /* zero padding */
> >         __u32 dst_ip4;
> >         __u32 dst_ip6[4];
> > diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c
> > b/tools/testing/selftests/bpf/progs/test_sock_fields.c
> > index bbad3c2d9aa5..773ded84ac12 100644
> > --- a/tools/testing/selftests/bpf/progs/test_sock_fields.c
> > +++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c
> > @@ -259,19 +259,19 @@ int ingress_read_sock_fields(struct __sk_buff
> > *skb)
> >    */
> >   static __noinline bool sk_dst_port__load_word(struct bpf_sock
> > *sk)
> >   {
> > -       __u32 *word = (__u32 *)&sk->dst_port;
> > +       volatile __u32 *word = (volatile __u32 *)&sk->dst_port;
> >         return word[0] == bpf_htons(0xcafe);
> >   }
> >   
> >   static __noinline bool sk_dst_port__load_half(struct bpf_sock
> > *sk)
> >   {
> > -       __u16 *half = (__u16 *)&sk->dst_port;
> > +       volatile __u16 *half = (volatile __u16 *)&sk->dst_port;
> >         return half[0] == bpf_htons(0xcafe);
> >   }
> >   
> >   static __noinline bool sk_dst_port__load_byte(struct bpf_sock
> > *sk)
> >   {
> > -       __u8 *byte = (__u8 *)&sk->dst_port;
> > +       volatile __u8 *byte = (volatile __u8 *)&sk->dst_port;
> >         return byte[0] == 0xca && byte[1] == 0xfe;
> >   }
> >   
> > This also works, and as far as I'm concerned, this would be a
> > proper
> > fix for the underlying issue: we tell the compiler that it should
> > never
> > ever (with any of today's or future optimizations) try to be clever
> > when accessing dst_port.
> 
> The above test_sock_fields.c change should work too.
> I think the uapi change is not necessary. The key word 'volatile'
> intends to avoid merging two or more identical loads together.
> In this particular case, with only uapi changes, the
> harmful transformation can still happen since the sk->dst_port
> is indeed loaded only once.
> 
> The test_sock_fields.c change itself forces proper load
> which is verifier friendly. So I suggest to have
> test_sock_fields.c change only. My previously suggested
> changes have the same effect, to preserve the verifier friendly
> load.

Ok, let's do it this way then.

Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>

for the diff you posted above.

> 
> > 
> > Best regards,
> > Ilya
> > 
> > 
> > > > > > [1]
> > > > > > https://lore.kernel.org/bpf/20220317113920.1068535-5-jakub@cloudflare.com
> > 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: selftest sock_fields failed on s390x with latest llvm17
  2023-05-16  9:16               ` Ilya Leoshkevich
@ 2023-05-16 15:20                 ` Yonghong Song
  0 siblings, 0 replies; 10+ messages in thread
From: Yonghong Song @ 2023-05-16 15:20 UTC (permalink / raw)
  To: Ilya Leoshkevich
  Cc: bpf, Kui-Feng Lee, Manu Bretelle, Alexei Starovoitov, Jakub Sitnicki



On 5/16/23 2:16 AM, Ilya Leoshkevich wrote:
> On Mon, 2023-05-15 at 08:27 -0700, Yonghong Song wrote:
>>
>>
>> On 5/15/23 12:55 AM, Ilya Leoshkevich wrote:
>>> On Sun, 2023-05-14 at 09:58 -0700, Yonghong Song wrote:
>>>>
>>>>
>>>> On 5/13/23 1:24 AM, Ilya Leoshkevich wrote:
>>>>> On Fri, 2023-05-12 at 21:13 -0700, Yonghong Song wrote:
>>>>>>
>>>>>>
>>>>>> On 5/12/23 7:40 AM, Ilya Leoshkevich wrote:
>>>>>>> On Wed, 2023-05-03 at 21:46 +0200, Ilya Leoshkevich wrote:
>>>>>>>> On Wed, 2023-05-03 at 12:35 -0700, Yonghong Song wrote:
>>>>>>>>> Hi, Ilya,
>>>>>>>>>
>>>>>>>>> BPF CI ([1]) detected a s390x failure when bpf program
>>>>>>>>> is
>>>>>>>>> compiled
>>>>>>>>> with
>>>>>>>>> latest llvm17 on bpf-next branch. To reproduce the
>>>>>>>>> issue,
>>>>>>>>> just
>>>>>>>>> run
>>>>>>>>> normal './test_progs -j'. The failure log looks like
>>>>>>>>> below:
>>>>>>>>>
>>>>>>>>> Notice: Success: 341/3015, Skipped: 29, Failed: 1
>>>>>>>>> Error: #191 sock_fields
>>>>>>>>>        Error: #191 sock_fields
>>>>>>>>>        create_netns:PASS:create netns 0 nsec
>>>>>>>>>        create_netns:PASS:bring up lo 0 nsec
>>>>>>>>>       
>>>>>>>>> serial_test_sock_fields:PASS:test_sock_fields__open_and
>>>>>>>>> _loa
>>>>>>>>> d 0
>>>>>>>>> nsec
>>>>>>>>>       
>>>>>>>>> serial_test_sock_fields:PASS:attach_cgroup(egress_read_
>>>>>>>>> sock
>>>>>>>>> _fie
>>>>>>>>> lds)
>>>>>>>>> 0
>>>>>>>>> nsec
>>>>>>>>>       
>>>>>>>>> serial_test_sock_fields:PASS:attach_cgroup(ingress_read
>>>>>>>>> _soc
>>>>>>>>> k_fi
>>>>>>>>> elds
>>>>>>>>> )
>>>>>>>>> 0 nsec
>>>>>>>>>       
>>>>>>>>> serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_
>>>>>>>>> port
>>>>>>>>> 0
>>>>>>>>> nsec
>>>>>>>>>        test:PASS:getsockname(listen_fd) 0 nsec
>>>>>>>>>        test:PASS:getsockname(cli_fd) 0 nsec
>>>>>>>>>        test:PASS:accept(listen_fd) 0 nsec
>>>>>>>>>       
>>>>>>>>> init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt
>>>>>>>>> _fd)
>>>>>>>>> 0
>>>>>>>>> nsec
>>>>>>>>>       
>>>>>>>>> init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt
>>>>>>>>> 10_f
>>>>>>>>> d) 0
>>>>>>>>> nsec
>>>>>>>>>        test:PASS:send(accept_fd) 0 nsec
>>>>>>>>>        test:PASS:recv(cli_fd) 0 nsec
>>>>>>>>>        test:PASS:send(accept_fd) 0 nsec
>>>>>>>>>        test:PASS:recv(cli_fd) 0 nsec
>>>>>>>>>        test:PASS:recv(accept_fd) for fin 0 nsec
>>>>>>>>>        test:PASS:recv(cli_fd) for fin 0 nsec
>>>>>>>>>       
>>>>>>>>> check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_ou
>>>>>>>>> t_cn
>>>>>>>>> t,
>>>>>>>>> &accept_fd) 0 nsec
>>>>>>>>>       
>>>>>>>>> check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_ou
>>>>>>>>> t_cn
>>>>>>>>> t,
>>>>>>>>> &cli_fd) 0 nsec
>>>>>>>>>       
>>>>>>>>> check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0
>>>>>>>>> nsec
>>>>>>>>>       
>>>>>>>>> check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0
>>>>>>>>> nsec
>>>>>>>>>       
>>>>>>>>> check_result:PASS:bpf_map_lookup_elem(linum_map_fd,
>>>>>>>>> READ_SK_DST_PORT_IDX) 0 nsec
>>>>>>>>>        check_result:FAIL:failure in read_sk_dst_port on
>>>>>>>>> line
>>>>>>>>> unexpected
>>>>>>>>> failure in read_sk_dst_port on line: actual 297 !=
>>>>>>>>> expected
>>>>>>>>> 0
>>>>>>>>>        listen_sk: state:10 bound_dev_if:0 family:10
>>>>>>>>> type:1
>>>>>>>>> protocol:6
>>>>>>>>> mark:0
>>>>>>>>> priority:0 src_ip4:7f000006(127.0.0.6)
>>>>>>>>> src_ip6:0:0:0:1(::1)
>>>>>>>>> src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::)
>>>>>>>>> dst_port:0
>>>>>>>>>        srv_sk: state:9 bound_dev_if:0 family:10 type:1
>>>>>>>>> protocol:6
>>>>>>>>> mark:0
>>>>>>>>> priority:0 src_ip4:7f000006(127.0.0.6)
>>>>>>>>> src_ip6:0:0:0:1(::1)
>>>>>>>>> src_port:51966 dst_ip4:7f000006(127.0.0.6)
>>>>>>>>> dst_ip6:0:0:0:1(::1)
>>>>>>>>> dst_port:38030
>>>>>>>>>        cli_sk: state:5 bound_dev_if:0 family:10 type:1
>>>>>>>>> protocol:6
>>>>>>>>> mark:0
>>>>>>>>> priority:0 src_ip4:7f000006(127.0.0.6)
>>>>>>>>> src_ip6:0:0:0:1(::1)
>>>>>>>>> src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1)
>>>>>>>>> dst_port:51966
>>>>>>>>>        listen_tp: snd_cwnd:10 srtt_us:0
>>>>>>>>> rtt_min:4294967295
>>>>>>>>> snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0
>>>>>>>>> mss_cache:536
>>>>>>>>> ecn_flags:0 rate_delivered:0 rate_interval_us:0
>>>>>>>>> packets_out:0
>>>>>>>>> retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0
>>>>>>>>> segs_out:0
>>>>>>>>> data_segs_out:0 lost_out:0 sacked_out:0
>>>>>>>>> bytes_received:0
>>>>>>>>> bytes_acked:0
>>>>>>>>>        srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272
>>>>>>>>> snd_ssthresh:2147483647
>>>>>>>>> rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810
>>>>>>>>> mss_cache:32768
>>>>>>>>> ecn_flags:0 rate_delivered:1 rate_interval_us:272
>>>>>>>>> packets_out:0
>>>>>>>>> retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0
>>>>>>>>> segs_out:3
>>>>>>>>> data_segs_out:2 lost_out:0 sacked_out:0
>>>>>>>>> bytes_received:1
>>>>>>>>> bytes_acked:22
>>>>>>>>>        cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730
>>>>>>>>> snd_ssthresh:2147483647
>>>>>>>>> rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715
>>>>>>>>> mss_cache:32768
>>>>>>>>> ecn_flags:0 rate_delivered:1 rate_interval_us:925
>>>>>>>>> packets_out:0
>>>>>>>>> retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2
>>>>>>>>> segs_out:6
>>>>>>>>> data_segs_out:0 lost_out:0 sacked_out:0
>>>>>>>>> bytes_received:23
>>>>>>>>> bytes_acked:2
>>>>>>>>>        check_result:PASS:listen_sk 0 nsec
>>>>>>>>>        check_result:PASS:srv_sk 0 nsec
>>>>>>>>>        check_result:PASS:srv_tp 0 nsec
>>>>>>>>>
>>>>>>>>> If bpf program is compiled with llvm16, the test passed
>>>>>>>>> according
>>>>>>>>> to
>>>>>>>>> a CI run.
>>>>>>>>>
>>>>>>>>> I don't have s390x environment to debug this. Could you
>>>>>>>>> help
>>>>>>>>> debug
>>>>>>>>> it?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>>        [1]
>>>>>>>>> https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> thank for letting me know.
>>>>>>>> I will look into this.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Ilya
>>>>>>>
>>>>>>> In the meantime the issue was fixed by:
>>>>>>>
>>>>>>> commit 141be5c062ecf22bd287afffd310e8ac4711444a
>>>>>>> Author: Shoaib Meenai <smeenai@fb.com>
>>>>>>> Date:   Fri May 5 14:18:12 2023 -0700
>>>>>>>
>>>>>>>         Revert "Reland [Pipeline] Don't limit
>>>>>>> ArgumentPromotion
>>>>>>> to -
>>>>>>> O3"
>>>>>>>         
>>>>>>>         This reverts commit
>>>>>>> 6f29d1adf29820daae9ea7a01ae2588b67735b9e.
>>>>>>>         
>>>>>>>         https://reviews.llvm.org/D149768     is causing size
>>>>>>> regressions
>>>>>>> for -
>>>>>>> Oz
>>>>>>>         with FullLTO, and I'm reverting that one while
>>>>>>> investigating.
>>>>>>> This
>>>>>>>         commit depends on that one, so it needs to be
>>>>>>> reverted as
>>>>>>> well.
>>>>>>
>>>>>> The transformtion "Don't limit ArgumentPromotion to -O3" is
>>>>>> temporarily
>>>>>> reverted. But it could be reverted again once the issue is
>>>>>> resolved.
>>>>>> So it is a good idea to resolve the issue in the kernel.
>>>>>>
>>>>>>>
>>>>>>> But looking at the codegen differences:
>>>>>>>
>>>>>>> $ diff -u <(sed -e s/[0-9]*://g pass.s) <(sed -e s/[0-
>>>>>>> 9]*://g
>>>>>>> fail.s)
>>>>>>>
>>>>>>> -pass.o:        file format elf64-bpf
>>>>>>> +fail.o:        file format elf64-bpf
>>>>>>>
>>>>>>> -00000000000002c8 <sk_dst_port__load_half>
>>>>>>> -       69 11 00 30 00 00 00 00 r1 = *(u16 *)(r1 + 48)
>>>>>>> +00000000000002c0 <sk_dst_port__load_half>
>>>>>>> +       54 10 00 00 00 00 ff ff w1 &= 65535
>>>>>>>             b4 00 00 00 00 00 00 01 w0 = 1
>>>>>>>             16 10 00 01 00 00 ca fe if w1 == 51966 goto +1
>>>>>>> <LBB6_2>
>>>>>>>             b4 00 00 00 00 00 00 00 w0 = 0
>>>>>>>
>>>>>>> This is what ArgumentPromotion is supposed to do, so that's
>>>>>>> okay so
>>>>>>> far. However, further down below we have:
>>>>>>>
>>>>>>>      Disassembly of section cgroup_skb/egress:
>>>>>>>
>>>>>>> -       bf 16 00 00 00 00 00 00 r1 = r6
>>>>>>> +       61 76 00 30 00 00 00 00 r7 = *(u32 *)(r6 + 48)
>>>>>>> +       bc 17 00 00 00 00 00 00 w1 = w7
>>>>>>>             85 01 00 00 00 00 00 53 call
>>>>>>> sk_dst_port__load_word
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> -       bf 16 00 00 00 00 00 00 r1 = r6
>>>>>>> +       74 70 00 00 00 00 00 10 w7 >>= 16
>>>>>>> +       bc 17 00 00 00 00 00 00 w1 = w7
>>>>>>>             85 01 00 00 00 00 00 57 call
>>>>>>> sk_dst_port__load_half
>>>>>>>
>>>>>>> so there is no 16-bit load anymore, instead, the result
>>>>>>> from
>>>>>>> the
>>>>>>> earlier 32-bit load is reused. However, on BE these kinds
>>>>>>> of
>>>>>>> loads
>>>>>>> for this particular field are not consistent at the moment
>>>>>>> -
>>>>>>> see
>>>>>>> [1]
>>>>>>> and the previous discussions.
>>>>>>>
>>>>>>> De-facto we have the following results:
>>>>>>>
>>>>>>> - int load: 0x0000cafe
>>>>>>> - short load: 0xcafe
>>>>>>
>>>>>> So 'De-facto' means the above is the expected result.
>>>>>>
>>>>>>>
>>>>>>> On a consistent BE we should have rather had:
>>>>>>>
>>>>>>> - int load: 0x0000cafe
>>>>>>> - short load: 0
>>>>>>
>>>>>> What does 'consistent BE' mean here? Does it mean the
>>>>>> expected
>>>>>> result from the source code itself?
>>>>>
>>>>> I should not have called the de-facto example "BE" at all: it's
>>>>> rather
>>>>> "mixed endianness" or "weird endianness" or something along
>>>>> these
>>>>> lines.
>>>>>
>>>>> On "consistent BE" or simply "BE" properties like
>>>>>
>>>>> *(uint32_t *)p = (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2);
>>>>>
>>>>> hold. This is currently not the case for bpf_sock.dst_port.
>>>>>
>>>>> We compile with -mbig-endian, so we promise to the compiler
>>>>> that
>>>>> the
>>>>> machine is big-endian, and the compiler expects the above to
>>>>> hold
>>>>> for
>>>>> any p. Unfortunately when p points to bpf_sock.dst_port, this
>>>>> is
>>>>> not
>>>>> the case.
>>>>
>>>> If I understand correctly, *(uint32_t *)p to get the
>>>> bpf_sock.dst_port
>>>> is for backward compatibility. But the real u32 read by compiler
>>>> will
>>>> do (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2) which is not the
>>>> same as expected *(uint32_t *)p so we have problem here.
>>>>
>>>>>
>>>>> The property above is important for the correctness of the
>>>>> load/store
>>>>> tearing transformations. What we have here is not exactly
>>>>> tearing,
>>>>> but
>>>>> is quite close.
>>>>>
>>>>>>> Clang, of course, expects a consistent BE and optimizes
>>>>>>> around
>>>>>>> that.
>>>>>>>
>>>>>>> This was a conscious tradeoff Jakub and I have agreed on in
>>>>>>> order
>>>>>>> to
>>>>>>> keep the quirky behavior from the past. Given what's
>>>>>>> happening
>>>>>>> with
>>>>>>> Clang now, I wonder if it would be worth revisiting it in
>>>>>>> the
>>>>>>> name
>>>>>>> of
>>>>>>> consistency?
>>>>>>
>>>>>> If I understand correctly, I think the issue is
>>>>>>         r7 = *(u32 *)(r6 + 48)
>>>>>>         w7 >= 16
>>>>>>         w1 = w7
>>>>>>
>>>>>> after verifier, it is changed to
>>>>>>        r7 = *(u16 *)(r6 + <kernel offset>)
>>>>>>        w7 >= 16
>>>>>>        w1 = w7
>>>>>>
>>>>>> and the result after verifier rewrite is completely wrong.
>>>>>> Is it right?
>>>>>
>>>>> No, the verifier rewrite is correct.
>>>>> The sk_dst_port__load_word() part of the test succeeds.
>>>>>
>>>>> All these rewrites already work fine, they are correct and
>>>>> consistent.
>>>>> It's really just bpf_sock.dst_port that is special.
>>>>>
>>>>>> If this is the case, during verifier rewrite, if it is
>>>>>> big endian, say the user intends to load 4 bytes (from uapi
>>>>>> header)
>>>>>> while the kernel field is 2 bytes, in such cases, kernel
>>>>>> can still pretend to generate 4-byte load. For example,
>>>>>> for the above example, the code after verification could be:
>>>>>>        r7 = *(u16 *)(r6 + <kernel offset>)
>>>>>>        r7 <= 16
>>>>>>        w7 >= 16
>>>>>>        w1 = w7
>>>>>>
>>>>>> Hopefully, we won't have many such cases. Does this work?
>>>>>
>>>>> This would break the sk_dst_port__load_word() part of the test.
>>>>
>>>> This is a hack. This may work for this specific u16 case, but
>>>> yes, it won't work for u32 load case.
>>>>
>>>>>
>>>>>
>>>>>
>>>>> Above I asked whether we can resolve the inconsistency, but I
>>>>> thought
>>>>> about it and I don't see a way of doing it without breaking the
>>>>> ABI,
>>>>> which is at worst unacceptable, and at best a last resort
>>>>> measure.
>>>>>
>>>>> What do you think about marking bpf_sock.dst_port volatile?
>>>>> volatile
>>>>> should prevent tearing and similar optimizations, with which we
>>>>> have a
>>>>> problem here.
>>>>>
>>>>> We could also add a comment warning users not to cast away this
>>>>> volatile due to the quirk we have. And then we should adjust
>>>>> the
>>>>> test
>>>>> (making all casts volatile) to comply with this new warning.
>>>>
>>>> I did a little study on this. The main problem here for
>>>> static __noinline bool sk_dst_port__load_half(struct bpf_sock
>>>> *sk)
>>>> {
>>>>            __u16 *half = (__u16 *)&sk->dst_port;
>>>>            return half[0] == bpf_htons(0xcafe);
>>>> }
>>>>
>>>> Through some cross-function optimization by ArgumentPromotion
>>>> optimization, the compiler does:
>>>>       /* the below shared by sk_dst_port__load_word
>>>>        * and sk_dst_port__load_half
>>>>        */
>>>>       __u32 *word = (__u32 *)&sk->dst_port;
>>>>       __u32 word_val = word[0];
>>>>
>>>>       /* the below is for sk_dst_port__load_half only */
>>>>       __u16 half_val = word_val >> 16;
>>>>
>>>>       ... half_val passed into sk_dst_port__load_half ...
>>>>       return half_val == bpf_htons(0xcafe);
>>>>
>>>> Here, 'word_val = word[0]' is replaced by 2-byte load
>>>> by the verifier and this caused the trouble for later
>>>> sk_dst_port__load_half().
>>>>
>>>> I don't have a good solution here. The issue is exposed
>>>> as we have both u16 and u32 load for &sk->dst_port and
>>>> the compiler did some optimization with this.
>>>>
>>>> I would say this is an extreme corner case and we can just
>>>> fix in the source code like below:
>>>>
>>>> diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c
>>>> b/tools/testing/selftests/bpf/progs/test_sock_fields.c
>>>> index bbad3c2d9aa5..39c975786720 100644
>>>> --- a/tools/testing/selftests/bpf/progs/test_sock_fields.c
>>>> +++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c
>>>> @@ -265,7 +265,10 @@ static __noinline bool
>>>> sk_dst_port__load_word(struct bpf_sock *sk)
>>>>
>>>>     static __noinline bool sk_dst_port__load_half(struct bpf_sock
>>>> *sk)
>>>>     {
>>>> -       __u16 *half = (__u16 *)&sk->dst_port;
>>>> +       __u16 *half;
>>>> +
>>>> +       asm volatile ("");
>>>> +       half  = (__u16 *)&sk->dst_port;
>>>>            return half[0] == bpf_htons(0xcafe);
>>>>     }
>>>>
>>>> Could you try whether the above workaround works or not?
>>>> If we want the code to be future proof for potential
>>>> cross-func optimization for these noinline functions, we
>>>> can add similar asm codes to all of
>>>> bool sk_dst_port__load_{word, half, byte}.
>>>
>>> Hi,
>>>
>>> this makes the issue go away, thanks.
>>>
>>> However, I'm still concerned, because this only inhibits a certain
>>> optimization and does not address the underlying fundamental
>>> problem:
>>> we promise to clang that the in-kernel implementation of the eBPF
>>> virtual machine is big-endian, while in reality it's not. As
>>> compiler
>>> optimizations get more aggressive, we will surely see more of this.
>>>
>>> Why not do this instead?
>>>
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 1bb11a6ee667..3c9b535532ae 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -6102,7 +6102,7 @@ struct bpf_sock {
>>>          __u32 src_ip4;
>>>          __u32 src_ip6[4];
>>>          __u32 src_port;         /* host byte order */
>>> -       __be16 dst_port;        /* network byte order */
>>> +       volatile __be16 dst_port;       /* network byte order */
>>>          __u16 :16;              /* zero padding */
>>>          __u32 dst_ip4;
>>>          __u32 dst_ip6[4];
>>> diff --git a/tools/include/uapi/linux/bpf.h
>>> b/tools/include/uapi/linux/bpf.h
>>> index 1bb11a6ee667..3c9b535532ae 100644
>>> --- a/tools/include/uapi/linux/bpf.h
>>> +++ b/tools/include/uapi/linux/bpf.h
>>> @@ -6102,7 +6102,7 @@ struct bpf_sock {
>>>          __u32 src_ip4;
>>>          __u32 src_ip6[4];
>>>          __u32 src_port;         /* host byte order */
>>> -       __be16 dst_port;        /* network byte order */
>>> +       volatile __be16 dst_port;       /* network byte order */
>>>          __u16 :16;              /* zero padding */
>>>          __u32 dst_ip4;
>>>          __u32 dst_ip6[4];
>>> diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c
>>> b/tools/testing/selftests/bpf/progs/test_sock_fields.c
>>> index bbad3c2d9aa5..773ded84ac12 100644
>>> --- a/tools/testing/selftests/bpf/progs/test_sock_fields.c
>>> +++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c
>>> @@ -259,19 +259,19 @@ int ingress_read_sock_fields(struct __sk_buff
>>> *skb)
>>>     */
>>>    static __noinline bool sk_dst_port__load_word(struct bpf_sock
>>> *sk)
>>>    {
>>> -       __u32 *word = (__u32 *)&sk->dst_port;
>>> +       volatile __u32 *word = (volatile __u32 *)&sk->dst_port;
>>>          return word[0] == bpf_htons(0xcafe);
>>>    }
>>>    
>>>    static __noinline bool sk_dst_port__load_half(struct bpf_sock
>>> *sk)
>>>    {
>>> -       __u16 *half = (__u16 *)&sk->dst_port;
>>> +       volatile __u16 *half = (volatile __u16 *)&sk->dst_port;
>>>          return half[0] == bpf_htons(0xcafe);
>>>    }
>>>    
>>>    static __noinline bool sk_dst_port__load_byte(struct bpf_sock
>>> *sk)
>>>    {
>>> -       __u8 *byte = (__u8 *)&sk->dst_port;
>>> +       volatile __u8 *byte = (volatile __u8 *)&sk->dst_port;
>>>          return byte[0] == 0xca && byte[1] == 0xfe;
>>>    }
>>>    
>>> This also works, and as far as I'm concerned, this would be a
>>> proper
>>> fix for the underlying issue: we tell the compiler that it should
>>> never
>>> ever (with any of today's or future optimizations) try to be clever
>>> when accessing dst_port.
>>
>> The above test_sock_fields.c change should work too.
>> I think the uapi change is not necessary. The key word 'volatile'
>> intends to avoid merging two or more identical loads together.
>> In this particular case, with only uapi changes, the
>> harmful transformation can still happen since the sk->dst_port
>> is indeed loaded only once.
>>
>> The test_sock_fields.c change itself forces proper load
>> which is verifier friendly. So I suggest to have
>> test_sock_fields.c change only. My previously suggested
>> changes have the same effect, to preserve the verifier friendly
>> load.
> 
> Ok, let's do it this way then.
> 
> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
> 
> for the diff you posted above.

Sounds good. I will submit a patch with my previous suggested
'asm volatile' approach. This is purely to workaround the test
case which tries to exercise different access patterns for dst_port
and such usage pattern should not appear in typical user bpf
programs.

> 
>>
>>>
>>> Best regards,
>>> Ilya
>>>
>>>
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/bpf/20220317113920.1068535-5-jakub@cloudflare.com
>>>
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-05-16 15:20 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-03 19:35 selftest sock_fields failed on s390x with latest llvm17 Yonghong Song
2023-05-03 19:46 ` Ilya Leoshkevich
2023-05-12 14:40   ` Ilya Leoshkevich
2023-05-13  4:13     ` Yonghong Song
2023-05-13  8:24       ` Ilya Leoshkevich
2023-05-14 16:58         ` Yonghong Song
2023-05-15  7:55           ` Ilya Leoshkevich
2023-05-15 15:27             ` Yonghong Song
2023-05-16  9:16               ` Ilya Leoshkevich
2023-05-16 15:20                 ` Yonghong Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).