All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-23 11:26 ` Nikolay Aleksandrov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 11:26 UTC (permalink / raw)
  To: netdev
  Cc: kuba, davem, Nikolay Aleksandrov, stable, Jason Wang, Xuan Zhuo,
	Daniel Borkmann, Michael S. Tsirkin, virtualization

We received a report[1] of kernel crashes when Cilium is used in XDP
mode with virtio_net after updating to newer kernels. After
investigating the reason it turned out that when using mergeable bufs
with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
calculates the build_skb address wrong because the offset can become less
than the headroom so it gets the address of the previous page (-X bytes
depending on how lower offset is):
 page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256

This is a pr_err() I added in the beginning of page_to_skb which clearly
shows offset that is less than headroom by adding 4 bytes of metadata
via an xdp prog. The calculations done are:
 receive_mergeable():
 headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
 offset = xdp.data - page_address(xdp_page) -
          vi->hdr_len - metasize;

 page_to_skb():
 p = page_address(page) + offset;
 ...
 buf = p - headroom;

Now buf goes -4 bytes from the page's starting address as can be seen
above which is set as skb->head and skb->data by build_skb later. Depending
on what's done with the skb (when it's freed most often) we get all kinds
of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
commit is interesting because the patch was sent and applied twice (it
seems the first one got lost during merge back in 5.13 window). The
first version of the patch that was applied as:
 commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
was actually correct because it calculated the page starting address
without relying on offset or headroom, but then the second version that
was applied as:
 commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
was wrong and added the above calculation.
An example xdp prog[3] is below.

[1] https://github.com/cilium/cilium/issues/19453

[2] Two of the many traces:
 [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
 [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
 [   41.300891] kernel BUG at include/linux/mm.h:720!
 [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
 [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
 [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
 [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
 [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
 [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
 [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
 [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
 [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
 [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
 [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
 [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
 [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
 [   41.321387] Call Trace:
 [   41.321819]  <TASK>
 [   41.322193]  skb_release_data+0x13f/0x1c0
 [   41.322902]  __kfree_skb+0x20/0x30
 [   41.343870]  tcp_recvmsg_locked+0x671/0x880
 [   41.363764]  tcp_recvmsg+0x5e/0x1c0
 [   41.384102]  inet_recvmsg+0x42/0x100
 [   41.406783]  ? sock_recvmsg+0x1d/0x70
 [   41.428201]  sock_read_iter+0x84/0xd0
 [   41.445592]  ? 0xffffffffa3000000
 [   41.462442]  new_sync_read+0x148/0x160
 [   41.479314]  ? 0xffffffffa3000000
 [   41.496937]  vfs_read+0x138/0x190
 [   41.517198]  ksys_read+0x87/0xc0
 [   41.535336]  do_syscall_64+0x3b/0x90
 [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
 [   41.568050] RIP: 0033:0x48765b
 [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
 [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
 [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
 [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
 [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
 [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
 [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
 [   41.744254]  </TASK>
 [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net

 and

 [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
 [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
 [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
 [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
 [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
 [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
 [   33.532471] page dumped because: nonzero mapcount
 [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
 [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
 [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
 [   33.532484] Call Trace:
 [   33.532496]  <TASK>
 [   33.532500]  dump_stack_lvl+0x45/0x5a
 [   33.532506]  bad_page.cold+0x63/0x94
 [   33.532510]  free_pcp_prepare+0x290/0x420
 [   33.532515]  free_unref_page+0x1b/0x100
 [   33.532518]  skb_release_data+0x13f/0x1c0
 [   33.532524]  kfree_skb_reason+0x3e/0xc0
 [   33.532527]  ip6_mc_input+0x23c/0x2b0
 [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
 [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0

[3] XDP program to reproduce(xdp_pass.c):
 #include <linux/bpf.h>
 #include <bpf/bpf_helpers.h>

 SEC("xdp_pass")
 int xdp_pkt_pass(struct xdp_md *ctx)
 {
          bpf_xdp_adjust_head(ctx, -(int)32);
          return XDP_PASS;
 }

 char _license[] SEC("license") = "GPL";

 compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
 load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass

CC: stable@vger.kernel.org
CC: Jason Wang <jasowang@redhat.com>
CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: virtualization@lists.linux-foundation.org
Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
---
 drivers/net/virtio_net.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 87838cbe38cf..0687dd88e97f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 	 * Buffers with headroom use PAGE_SIZE as alloc size, see
 	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
 	 */
-	truesize = headroom ? PAGE_SIZE : truesize;
+	if (headroom) {
+		truesize = PAGE_SIZE;
+		buf = (char *)((unsigned long)p & PAGE_MASK);
+	} else {
+		buf = p;
+	}
 	tailroom = truesize - headroom;
-	buf = p - headroom;
 
 	len -= hdr_len;
 	offset += hdr_padded_len;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-23 11:26 ` Nikolay Aleksandrov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 11:26 UTC (permalink / raw)
  To: netdev
  Cc: Daniel Borkmann, Michael S. Tsirkin, stable, davem, kuba,
	virtualization, Nikolay Aleksandrov

We received a report[1] of kernel crashes when Cilium is used in XDP
mode with virtio_net after updating to newer kernels. After
investigating the reason it turned out that when using mergeable bufs
with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
calculates the build_skb address wrong because the offset can become less
than the headroom so it gets the address of the previous page (-X bytes
depending on how lower offset is):
 page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256

This is a pr_err() I added in the beginning of page_to_skb which clearly
shows offset that is less than headroom by adding 4 bytes of metadata
via an xdp prog. The calculations done are:
 receive_mergeable():
 headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
 offset = xdp.data - page_address(xdp_page) -
          vi->hdr_len - metasize;

 page_to_skb():
 p = page_address(page) + offset;
 ...
 buf = p - headroom;

Now buf goes -4 bytes from the page's starting address as can be seen
above which is set as skb->head and skb->data by build_skb later. Depending
on what's done with the skb (when it's freed most often) we get all kinds
of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
commit is interesting because the patch was sent and applied twice (it
seems the first one got lost during merge back in 5.13 window). The
first version of the patch that was applied as:
 commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
was actually correct because it calculated the page starting address
without relying on offset or headroom, but then the second version that
was applied as:
 commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
was wrong and added the above calculation.
An example xdp prog[3] is below.

[1] https://github.com/cilium/cilium/issues/19453

[2] Two of the many traces:
 [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
 [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
 [   41.300891] kernel BUG at include/linux/mm.h:720!
 [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
 [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
 [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
 [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
 [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
 [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
 [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
 [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
 [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
 [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
 [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
 [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
 [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
 [   41.321387] Call Trace:
 [   41.321819]  <TASK>
 [   41.322193]  skb_release_data+0x13f/0x1c0
 [   41.322902]  __kfree_skb+0x20/0x30
 [   41.343870]  tcp_recvmsg_locked+0x671/0x880
 [   41.363764]  tcp_recvmsg+0x5e/0x1c0
 [   41.384102]  inet_recvmsg+0x42/0x100
 [   41.406783]  ? sock_recvmsg+0x1d/0x70
 [   41.428201]  sock_read_iter+0x84/0xd0
 [   41.445592]  ? 0xffffffffa3000000
 [   41.462442]  new_sync_read+0x148/0x160
 [   41.479314]  ? 0xffffffffa3000000
 [   41.496937]  vfs_read+0x138/0x190
 [   41.517198]  ksys_read+0x87/0xc0
 [   41.535336]  do_syscall_64+0x3b/0x90
 [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
 [   41.568050] RIP: 0033:0x48765b
 [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
 [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
 [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
 [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
 [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
 [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
 [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
 [   41.744254]  </TASK>
 [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net

 and

 [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
 [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
 [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
 [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
 [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
 [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
 [   33.532471] page dumped because: nonzero mapcount
 [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
 [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
 [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
 [   33.532484] Call Trace:
 [   33.532496]  <TASK>
 [   33.532500]  dump_stack_lvl+0x45/0x5a
 [   33.532506]  bad_page.cold+0x63/0x94
 [   33.532510]  free_pcp_prepare+0x290/0x420
 [   33.532515]  free_unref_page+0x1b/0x100
 [   33.532518]  skb_release_data+0x13f/0x1c0
 [   33.532524]  kfree_skb_reason+0x3e/0xc0
 [   33.532527]  ip6_mc_input+0x23c/0x2b0
 [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
 [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0

[3] XDP program to reproduce(xdp_pass.c):
 #include <linux/bpf.h>
 #include <bpf/bpf_helpers.h>

 SEC("xdp_pass")
 int xdp_pkt_pass(struct xdp_md *ctx)
 {
          bpf_xdp_adjust_head(ctx, -(int)32);
          return XDP_PASS;
 }

 char _license[] SEC("license") = "GPL";

 compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
 load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass

CC: stable@vger.kernel.org
CC: Jason Wang <jasowang@redhat.com>
CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: virtualization@lists.linux-foundation.org
Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
---
 drivers/net/virtio_net.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 87838cbe38cf..0687dd88e97f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 	 * Buffers with headroom use PAGE_SIZE as alloc size, see
 	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
 	 */
-	truesize = headroom ? PAGE_SIZE : truesize;
+	if (headroom) {
+		truesize = PAGE_SIZE;
+		buf = (char *)((unsigned long)p & PAGE_MASK);
+	} else {
+		buf = p;
+	}
 	tailroom = truesize - headroom;
-	buf = p - headroom;
 
 	len -= hdr_len;
 	offset += hdr_padded_len;
-- 
2.35.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-23 11:26 ` Nikolay Aleksandrov
@ 2022-04-23 13:31   ` Xuan Zhuo
  -1 siblings, 0 replies; 36+ messages in thread
From: Xuan Zhuo @ 2022-04-23 13:31 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: kuba, davem, Nikolay Aleksandrov, stable, Jason Wang,
	Daniel Borkmann, Michael S. Tsirkin, virtualization, netdev

On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> We received a report[1] of kernel crashes when Cilium is used in XDP
> mode with virtio_net after updating to newer kernels. After
> investigating the reason it turned out that when using mergeable bufs
> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> calculates the build_skb address wrong because the offset can become less
> than the headroom so it gets the address of the previous page (-X bytes
> depending on how lower offset is):
>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>
> This is a pr_err() I added in the beginning of page_to_skb which clearly
> shows offset that is less than headroom by adding 4 bytes of metadata
> via an xdp prog. The calculations done are:
>  receive_mergeable():
>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>  offset = xdp.data - page_address(xdp_page) -
>           vi->hdr_len - metasize;
>
>  page_to_skb():
>  p = page_address(page) + offset;
>  ...
>  buf = p - headroom;
>
> Now buf goes -4 bytes from the page's starting address as can be seen
> above which is set as skb->head and skb->data by build_skb later. Depending
> on what's done with the skb (when it's freed most often) we get all kinds
> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
> commit is interesting because the patch was sent and applied twice (it
> seems the first one got lost during merge back in 5.13 window). The
> first version of the patch that was applied as:
>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
> was actually correct because it calculated the page starting address
> without relying on offset or headroom, but then the second version that
> was applied as:
>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> was wrong and added the above calculation.
> An example xdp prog[3] is below.
>
> [1] https://github.com/cilium/cilium/issues/19453
>
> [2] Two of the many traces:
>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>  [   41.321387] Call Trace:
>  [   41.321819]  <TASK>
>  [   41.322193]  skb_release_data+0x13f/0x1c0
>  [   41.322902]  __kfree_skb+0x20/0x30
>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>  [   41.384102]  inet_recvmsg+0x42/0x100
>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>  [   41.428201]  sock_read_iter+0x84/0xd0
>  [   41.445592]  ? 0xffffffffa3000000
>  [   41.462442]  new_sync_read+0x148/0x160
>  [   41.479314]  ? 0xffffffffa3000000
>  [   41.496937]  vfs_read+0x138/0x190
>  [   41.517198]  ksys_read+0x87/0xc0
>  [   41.535336]  do_syscall_64+0x3b/0x90
>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>  [   41.568050] RIP: 0033:0x48765b
>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>  [   41.744254]  </TASK>
>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>
>  and
>
>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>  [   33.532471] page dumped because: nonzero mapcount
>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>  [   33.532484] Call Trace:
>  [   33.532496]  <TASK>
>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>  [   33.532506]  bad_page.cold+0x63/0x94
>  [   33.532510]  free_pcp_prepare+0x290/0x420
>  [   33.532515]  free_unref_page+0x1b/0x100
>  [   33.532518]  skb_release_data+0x13f/0x1c0
>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>
> [3] XDP program to reproduce(xdp_pass.c):
>  #include <linux/bpf.h>
>  #include <bpf/bpf_helpers.h>
>
>  SEC("xdp_pass")
>  int xdp_pkt_pass(struct xdp_md *ctx)
>  {
>           bpf_xdp_adjust_head(ctx, -(int)32);
>           return XDP_PASS;
>  }
>
>  char _license[] SEC("license") = "GPL";
>
>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>
> CC: stable@vger.kernel.org
> CC: Jason Wang <jasowang@redhat.com>
> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> CC: Daniel Borkmann <daniel@iogearbox.net>
> CC: "Michael S. Tsirkin" <mst@redhat.com>
> CC: virtualization@lists.linux-foundation.org
> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> ---
>  drivers/net/virtio_net.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 87838cbe38cf..0687dd88e97f 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>  	 */
> -	truesize = headroom ? PAGE_SIZE : truesize;
> +	if (headroom) {
> +		truesize = PAGE_SIZE;
> +		buf = (char *)((unsigned long)p & PAGE_MASK);

The reason for not doing this is that buf and p may not be on the same page, and
buf is probably not page-aligned.

The implementation of virtio-net merge is add_recvbuf_mergeable(), which
allocates a large block of memory at one time, and allocates from it each time.
Although in xdp mode, each allocation is page_size, it does not guarantee that
each allocation is page-aligned .

The problem here is that the value of headroom is wrong, the package is
structured like this:

from device    | headroom          | virtio-net hdr | data |
after xdp      | headroom  |  virtio-net hdr | meta | data |

The page_address(page) + offset we pass to page_to_skb() points to the
virtio-net hdr.

So I think it might be better to change it this way.

Thanks.


diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 87838cbe38cf..086ae835ec86 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
                                head_skb = page_to_skb(vi, rq, xdp_page, offset,
                                                       len, PAGE_SIZE, false,
                                                       metasize,
-                                                      VIRTIO_XDP_HEADROOM);
+                                                      VIRTIO_XDP_HEADROOM - metazie);
                                return head_skb;
                        }
                        break;

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-23 13:31   ` Xuan Zhuo
  0 siblings, 0 replies; 36+ messages in thread
From: Xuan Zhuo @ 2022-04-23 13:31 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable, davem, kuba,
	virtualization, Nikolay Aleksandrov

On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> We received a report[1] of kernel crashes when Cilium is used in XDP
> mode with virtio_net after updating to newer kernels. After
> investigating the reason it turned out that when using mergeable bufs
> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> calculates the build_skb address wrong because the offset can become less
> than the headroom so it gets the address of the previous page (-X bytes
> depending on how lower offset is):
>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>
> This is a pr_err() I added in the beginning of page_to_skb which clearly
> shows offset that is less than headroom by adding 4 bytes of metadata
> via an xdp prog. The calculations done are:
>  receive_mergeable():
>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>  offset = xdp.data - page_address(xdp_page) -
>           vi->hdr_len - metasize;
>
>  page_to_skb():
>  p = page_address(page) + offset;
>  ...
>  buf = p - headroom;
>
> Now buf goes -4 bytes from the page's starting address as can be seen
> above which is set as skb->head and skb->data by build_skb later. Depending
> on what's done with the skb (when it's freed most often) we get all kinds
> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
> commit is interesting because the patch was sent and applied twice (it
> seems the first one got lost during merge back in 5.13 window). The
> first version of the patch that was applied as:
>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
> was actually correct because it calculated the page starting address
> without relying on offset or headroom, but then the second version that
> was applied as:
>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> was wrong and added the above calculation.
> An example xdp prog[3] is below.
>
> [1] https://github.com/cilium/cilium/issues/19453
>
> [2] Two of the many traces:
>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>  [   41.321387] Call Trace:
>  [   41.321819]  <TASK>
>  [   41.322193]  skb_release_data+0x13f/0x1c0
>  [   41.322902]  __kfree_skb+0x20/0x30
>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>  [   41.384102]  inet_recvmsg+0x42/0x100
>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>  [   41.428201]  sock_read_iter+0x84/0xd0
>  [   41.445592]  ? 0xffffffffa3000000
>  [   41.462442]  new_sync_read+0x148/0x160
>  [   41.479314]  ? 0xffffffffa3000000
>  [   41.496937]  vfs_read+0x138/0x190
>  [   41.517198]  ksys_read+0x87/0xc0
>  [   41.535336]  do_syscall_64+0x3b/0x90
>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>  [   41.568050] RIP: 0033:0x48765b
>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>  [   41.744254]  </TASK>
>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>
>  and
>
>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>  [   33.532471] page dumped because: nonzero mapcount
>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>  [   33.532484] Call Trace:
>  [   33.532496]  <TASK>
>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>  [   33.532506]  bad_page.cold+0x63/0x94
>  [   33.532510]  free_pcp_prepare+0x290/0x420
>  [   33.532515]  free_unref_page+0x1b/0x100
>  [   33.532518]  skb_release_data+0x13f/0x1c0
>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>
> [3] XDP program to reproduce(xdp_pass.c):
>  #include <linux/bpf.h>
>  #include <bpf/bpf_helpers.h>
>
>  SEC("xdp_pass")
>  int xdp_pkt_pass(struct xdp_md *ctx)
>  {
>           bpf_xdp_adjust_head(ctx, -(int)32);
>           return XDP_PASS;
>  }
>
>  char _license[] SEC("license") = "GPL";
>
>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>
> CC: stable@vger.kernel.org
> CC: Jason Wang <jasowang@redhat.com>
> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> CC: Daniel Borkmann <daniel@iogearbox.net>
> CC: "Michael S. Tsirkin" <mst@redhat.com>
> CC: virtualization@lists.linux-foundation.org
> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> ---
>  drivers/net/virtio_net.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 87838cbe38cf..0687dd88e97f 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>  	 */
> -	truesize = headroom ? PAGE_SIZE : truesize;
> +	if (headroom) {
> +		truesize = PAGE_SIZE;
> +		buf = (char *)((unsigned long)p & PAGE_MASK);

The reason for not doing this is that buf and p may not be on the same page, and
buf is probably not page-aligned.

The implementation of virtio-net merge is add_recvbuf_mergeable(), which
allocates a large block of memory at one time, and allocates from it each time.
Although in xdp mode, each allocation is page_size, it does not guarantee that
each allocation is page-aligned .

The problem here is that the value of headroom is wrong, the package is
structured like this:

from device    | headroom          | virtio-net hdr | data |
after xdp      | headroom  |  virtio-net hdr | meta | data |

The page_address(page) + offset we pass to page_to_skb() points to the
virtio-net hdr.

So I think it might be better to change it this way.

Thanks.


diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 87838cbe38cf..086ae835ec86 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
                                head_skb = page_to_skb(vi, rq, xdp_page, offset,
                                                       len, PAGE_SIZE, false,
                                                       metasize,
-                                                      VIRTIO_XDP_HEADROOM);
+                                                      VIRTIO_XDP_HEADROOM - metazie);
                                return head_skb;
                        }
                        break;
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-23 13:31   ` Xuan Zhuo
@ 2022-04-23 14:16     ` Nikolay Aleksandrov
  -1 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 14:16 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev

On 23/04/2022 16:31, Xuan Zhuo wrote:
> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> We received a report[1] of kernel crashes when Cilium is used in XDP
>> mode with virtio_net after updating to newer kernels. After
>> investigating the reason it turned out that when using mergeable bufs
>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>> calculates the build_skb address wrong because the offset can become less
>> than the headroom so it gets the address of the previous page (-X bytes
>> depending on how lower offset is):
>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>
>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>> shows offset that is less than headroom by adding 4 bytes of metadata
>> via an xdp prog. The calculations done are:
>>  receive_mergeable():
>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>  offset = xdp.data - page_address(xdp_page) -
>>           vi->hdr_len - metasize;
>>
>>  page_to_skb():
>>  p = page_address(page) + offset;
>>  ...
>>  buf = p - headroom;
>>
>> Now buf goes -4 bytes from the page's starting address as can be seen
>> above which is set as skb->head and skb->data by build_skb later. Depending
>> on what's done with the skb (when it's freed most often) we get all kinds
>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>> commit is interesting because the patch was sent and applied twice (it
>> seems the first one got lost during merge back in 5.13 window). The
>> first version of the patch that was applied as:
>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>> was actually correct because it calculated the page starting address
>> without relying on offset or headroom, but then the second version that
>> was applied as:
>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>> was wrong and added the above calculation.
>> An example xdp prog[3] is below.
>>
>> [1] https://github.com/cilium/cilium/issues/19453
>>
>> [2] Two of the many traces:
>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>>  [   41.321387] Call Trace:
>>  [   41.321819]  <TASK>
>>  [   41.322193]  skb_release_data+0x13f/0x1c0
>>  [   41.322902]  __kfree_skb+0x20/0x30
>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>>  [   41.384102]  inet_recvmsg+0x42/0x100
>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>>  [   41.428201]  sock_read_iter+0x84/0xd0
>>  [   41.445592]  ? 0xffffffffa3000000
>>  [   41.462442]  new_sync_read+0x148/0x160
>>  [   41.479314]  ? 0xffffffffa3000000
>>  [   41.496937]  vfs_read+0x138/0x190
>>  [   41.517198]  ksys_read+0x87/0xc0
>>  [   41.535336]  do_syscall_64+0x3b/0x90
>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>  [   41.568050] RIP: 0033:0x48765b
>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>>  [   41.744254]  </TASK>
>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>
>>  and
>>
>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>>  [   33.532471] page dumped because: nonzero mapcount
>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>  [   33.532484] Call Trace:
>>  [   33.532496]  <TASK>
>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>>  [   33.532506]  bad_page.cold+0x63/0x94
>>  [   33.532510]  free_pcp_prepare+0x290/0x420
>>  [   33.532515]  free_unref_page+0x1b/0x100
>>  [   33.532518]  skb_release_data+0x13f/0x1c0
>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>>
>> [3] XDP program to reproduce(xdp_pass.c):
>>  #include <linux/bpf.h>
>>  #include <bpf/bpf_helpers.h>
>>
>>  SEC("xdp_pass")
>>  int xdp_pkt_pass(struct xdp_md *ctx)
>>  {
>>           bpf_xdp_adjust_head(ctx, -(int)32);
>>           return XDP_PASS;
>>  }
>>
>>  char _license[] SEC("license") = "GPL";
>>
>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>>
>> CC: stable@vger.kernel.org
>> CC: Jason Wang <jasowang@redhat.com>
>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>> CC: Daniel Borkmann <daniel@iogearbox.net>
>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>> CC: virtualization@lists.linux-foundation.org
>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>> ---
>>  drivers/net/virtio_net.c | 8 ++++++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 87838cbe38cf..0687dd88e97f 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>  	 */
>> -	truesize = headroom ? PAGE_SIZE : truesize;
>> +	if (headroom) {
>> +		truesize = PAGE_SIZE;
>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
> 
> The reason for not doing this is that buf and p may not be on the same page, and
> buf is probably not page-aligned.
> 
> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
> allocates a large block of memory at one time, and allocates from it each time.
> Although in xdp mode, each allocation is page_size, it does not guarantee that
> each allocation is page-aligned .
> 
> The problem here is that the value of headroom is wrong, the package is
> structured like this:
> 
> from device    | headroom          | virtio-net hdr | data |
> after xdp      | headroom  |  virtio-net hdr | meta | data |

You're free to push data back (not necessarily through meta).
You don't have virtio-net hdr for the xdp case (hdr_valid is false there).

> 
> The page_address(page) + offset we pass to page_to_skb() points to the
> virtio-net hdr.
> 
> So I think it might be better to change it this way.
> 
> Thanks.
> 
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 87838cbe38cf..086ae835ec86 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>                                                        len, PAGE_SIZE, false,
>                                                        metasize,
> -                                                      VIRTIO_XDP_HEADROOM);
> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>                                 return head_skb;
>                         }
>                         break;

That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
So just doing that would take care of the meta, but won't take care of moving data.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-23 14:16     ` Nikolay Aleksandrov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 14:16 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable,
	virtualization, kuba, davem

On 23/04/2022 16:31, Xuan Zhuo wrote:
> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> We received a report[1] of kernel crashes when Cilium is used in XDP
>> mode with virtio_net after updating to newer kernels. After
>> investigating the reason it turned out that when using mergeable bufs
>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>> calculates the build_skb address wrong because the offset can become less
>> than the headroom so it gets the address of the previous page (-X bytes
>> depending on how lower offset is):
>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>
>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>> shows offset that is less than headroom by adding 4 bytes of metadata
>> via an xdp prog. The calculations done are:
>>  receive_mergeable():
>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>  offset = xdp.data - page_address(xdp_page) -
>>           vi->hdr_len - metasize;
>>
>>  page_to_skb():
>>  p = page_address(page) + offset;
>>  ...
>>  buf = p - headroom;
>>
>> Now buf goes -4 bytes from the page's starting address as can be seen
>> above which is set as skb->head and skb->data by build_skb later. Depending
>> on what's done with the skb (when it's freed most often) we get all kinds
>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>> commit is interesting because the patch was sent and applied twice (it
>> seems the first one got lost during merge back in 5.13 window). The
>> first version of the patch that was applied as:
>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>> was actually correct because it calculated the page starting address
>> without relying on offset or headroom, but then the second version that
>> was applied as:
>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>> was wrong and added the above calculation.
>> An example xdp prog[3] is below.
>>
>> [1] https://github.com/cilium/cilium/issues/19453
>>
>> [2] Two of the many traces:
>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>>  [   41.321387] Call Trace:
>>  [   41.321819]  <TASK>
>>  [   41.322193]  skb_release_data+0x13f/0x1c0
>>  [   41.322902]  __kfree_skb+0x20/0x30
>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>>  [   41.384102]  inet_recvmsg+0x42/0x100
>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>>  [   41.428201]  sock_read_iter+0x84/0xd0
>>  [   41.445592]  ? 0xffffffffa3000000
>>  [   41.462442]  new_sync_read+0x148/0x160
>>  [   41.479314]  ? 0xffffffffa3000000
>>  [   41.496937]  vfs_read+0x138/0x190
>>  [   41.517198]  ksys_read+0x87/0xc0
>>  [   41.535336]  do_syscall_64+0x3b/0x90
>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>  [   41.568050] RIP: 0033:0x48765b
>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>>  [   41.744254]  </TASK>
>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>
>>  and
>>
>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>>  [   33.532471] page dumped because: nonzero mapcount
>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>  [   33.532484] Call Trace:
>>  [   33.532496]  <TASK>
>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>>  [   33.532506]  bad_page.cold+0x63/0x94
>>  [   33.532510]  free_pcp_prepare+0x290/0x420
>>  [   33.532515]  free_unref_page+0x1b/0x100
>>  [   33.532518]  skb_release_data+0x13f/0x1c0
>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>>
>> [3] XDP program to reproduce(xdp_pass.c):
>>  #include <linux/bpf.h>
>>  #include <bpf/bpf_helpers.h>
>>
>>  SEC("xdp_pass")
>>  int xdp_pkt_pass(struct xdp_md *ctx)
>>  {
>>           bpf_xdp_adjust_head(ctx, -(int)32);
>>           return XDP_PASS;
>>  }
>>
>>  char _license[] SEC("license") = "GPL";
>>
>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>>
>> CC: stable@vger.kernel.org
>> CC: Jason Wang <jasowang@redhat.com>
>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>> CC: Daniel Borkmann <daniel@iogearbox.net>
>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>> CC: virtualization@lists.linux-foundation.org
>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>> ---
>>  drivers/net/virtio_net.c | 8 ++++++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 87838cbe38cf..0687dd88e97f 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>  	 */
>> -	truesize = headroom ? PAGE_SIZE : truesize;
>> +	if (headroom) {
>> +		truesize = PAGE_SIZE;
>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
> 
> The reason for not doing this is that buf and p may not be on the same page, and
> buf is probably not page-aligned.
> 
> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
> allocates a large block of memory at one time, and allocates from it each time.
> Although in xdp mode, each allocation is page_size, it does not guarantee that
> each allocation is page-aligned .
> 
> The problem here is that the value of headroom is wrong, the package is
> structured like this:
> 
> from device    | headroom          | virtio-net hdr | data |
> after xdp      | headroom  |  virtio-net hdr | meta | data |

You're free to push data back (not necessarily through meta).
You don't have virtio-net hdr for the xdp case (hdr_valid is false there).

> 
> The page_address(page) + offset we pass to page_to_skb() points to the
> virtio-net hdr.
> 
> So I think it might be better to change it this way.
> 
> Thanks.
> 
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 87838cbe38cf..086ae835ec86 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>                                                        len, PAGE_SIZE, false,
>                                                        metasize,
> -                                                      VIRTIO_XDP_HEADROOM);
> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>                                 return head_skb;
>                         }
>                         break;

That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
So just doing that would take care of the meta, but won't take care of moving data.

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-23 14:16     ` Nikolay Aleksandrov
@ 2022-04-23 14:20       ` Xuan Zhuo
  -1 siblings, 0 replies; 36+ messages in thread
From: Xuan Zhuo @ 2022-04-23 14:20 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev

On Sat, 23 Apr 2022 17:16:53 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> On 23/04/2022 16:31, Xuan Zhuo wrote:
> > On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> >> We received a report[1] of kernel crashes when Cilium is used in XDP
> >> mode with virtio_net after updating to newer kernels. After
> >> investigating the reason it turned out that when using mergeable bufs
> >> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> >> calculates the build_skb address wrong because the offset can become less
> >> than the headroom so it gets the address of the previous page (-X bytes
> >> depending on how lower offset is):
> >>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
> >>
> >> This is a pr_err() I added in the beginning of page_to_skb which clearly
> >> shows offset that is less than headroom by adding 4 bytes of metadata
> >> via an xdp prog. The calculations done are:
> >>  receive_mergeable():
> >>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
> >>  offset = xdp.data - page_address(xdp_page) -
> >>           vi->hdr_len - metasize;
> >>
> >>  page_to_skb():
> >>  p = page_address(page) + offset;
> >>  ...
> >>  buf = p - headroom;
> >>
> >> Now buf goes -4 bytes from the page's starting address as can be seen
> >> above which is set as skb->head and skb->data by build_skb later. Depending
> >> on what's done with the skb (when it's freed most often) we get all kinds
> >> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
> >> commit is interesting because the patch was sent and applied twice (it
> >> seems the first one got lost during merge back in 5.13 window). The
> >> first version of the patch that was applied as:
> >>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
> >> was actually correct because it calculated the page starting address
> >> without relying on offset or headroom, but then the second version that
> >> was applied as:
> >>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >> was wrong and added the above calculation.
> >> An example xdp prog[3] is below.
> >>
> >> [1] https://github.com/cilium/cilium/issues/19453
> >>
> >> [2] Two of the many traces:
> >>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
> >>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
> >>  [   41.300891] kernel BUG at include/linux/mm.h:720!
> >>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> >>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
> >>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
> >>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
> >>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
> >>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
> >>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
> >>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
> >>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
> >>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
> >>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
> >>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
> >>  [   41.321387] Call Trace:
> >>  [   41.321819]  <TASK>
> >>  [   41.322193]  skb_release_data+0x13f/0x1c0
> >>  [   41.322902]  __kfree_skb+0x20/0x30
> >>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
> >>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
> >>  [   41.384102]  inet_recvmsg+0x42/0x100
> >>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
> >>  [   41.428201]  sock_read_iter+0x84/0xd0
> >>  [   41.445592]  ? 0xffffffffa3000000
> >>  [   41.462442]  new_sync_read+0x148/0x160
> >>  [   41.479314]  ? 0xffffffffa3000000
> >>  [   41.496937]  vfs_read+0x138/0x190
> >>  [   41.517198]  ksys_read+0x87/0xc0
> >>  [   41.535336]  do_syscall_64+0x3b/0x90
> >>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>  [   41.568050] RIP: 0033:0x48765b
> >>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
> >>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
> >>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
> >>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
> >>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
> >>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
> >>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
> >>  [   41.744254]  </TASK>
> >>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>
> >>  and
> >>
> >>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
> >>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
> >>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
> >>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> >>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
> >>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
> >>  [   33.532471] page dumped because: nonzero mapcount
> >>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
> >>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>  [   33.532484] Call Trace:
> >>  [   33.532496]  <TASK>
> >>  [   33.532500]  dump_stack_lvl+0x45/0x5a
> >>  [   33.532506]  bad_page.cold+0x63/0x94
> >>  [   33.532510]  free_pcp_prepare+0x290/0x420
> >>  [   33.532515]  free_unref_page+0x1b/0x100
> >>  [   33.532518]  skb_release_data+0x13f/0x1c0
> >>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
> >>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
> >>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
> >>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
> >>
> >> [3] XDP program to reproduce(xdp_pass.c):
> >>  #include <linux/bpf.h>
> >>  #include <bpf/bpf_helpers.h>
> >>
> >>  SEC("xdp_pass")
> >>  int xdp_pkt_pass(struct xdp_md *ctx)
> >>  {
> >>           bpf_xdp_adjust_head(ctx, -(int)32);
> >>           return XDP_PASS;
> >>  }
> >>
> >>  char _license[] SEC("license") = "GPL";
> >>
> >>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
> >>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
> >>
> >> CC: stable@vger.kernel.org
> >> CC: Jason Wang <jasowang@redhat.com>
> >> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >> CC: Daniel Borkmann <daniel@iogearbox.net>
> >> CC: "Michael S. Tsirkin" <mst@redhat.com>
> >> CC: virtualization@lists.linux-foundation.org
> >> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> >> ---
> >>  drivers/net/virtio_net.c | 8 ++++++--
> >>  1 file changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index 87838cbe38cf..0687dd88e97f 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> >>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
> >>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
> >>  	 */
> >> -	truesize = headroom ? PAGE_SIZE : truesize;
> >> +	if (headroom) {
> >> +		truesize = PAGE_SIZE;
> >> +		buf = (char *)((unsigned long)p & PAGE_MASK);
> >
> > The reason for not doing this is that buf and p may not be on the same page, and
> > buf is probably not page-aligned.
> >
> > The implementation of virtio-net merge is add_recvbuf_mergeable(), which
> > allocates a large block of memory at one time, and allocates from it each time.
> > Although in xdp mode, each allocation is page_size, it does not guarantee that
> > each allocation is page-aligned .
> >
> > The problem here is that the value of headroom is wrong, the package is
> > structured like this:
> >
> > from device    | headroom          | virtio-net hdr | data |
> > after xdp      | headroom  |  virtio-net hdr | meta | data |
>
> You're free to push data back (not necessarily through meta).

Oh, is that possible? Thanks, that should take this into account.

> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).

Yes, indeed hdr_vaild is false. But the virtio-net hdr space is included,
although it is not legal.

			offset = xdp.data - page_address(xdp_page) -
				 vi->hdr_len - metasize;

Thanks.

>
> >
> > The page_address(page) + offset we pass to page_to_skb() points to the
> > virtio-net hdr.
> >
> > So I think it might be better to change it this way.
> >
> > Thanks.
> >
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 87838cbe38cf..086ae835ec86 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
> >                                                        len, PAGE_SIZE, false,
> >                                                        metasize,
> > -                                                      VIRTIO_XDP_HEADROOM);
> > +                                                      VIRTIO_XDP_HEADROOM - metazie);
> >                                 return head_skb;
> >                         }
> >                         break;
>
> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
> So just doing that would take care of the meta, but won't take care of moving data.
>




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-23 14:20       ` Xuan Zhuo
  0 siblings, 0 replies; 36+ messages in thread
From: Xuan Zhuo @ 2022-04-23 14:20 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable,
	virtualization, kuba, davem

On Sat, 23 Apr 2022 17:16:53 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> On 23/04/2022 16:31, Xuan Zhuo wrote:
> > On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> >> We received a report[1] of kernel crashes when Cilium is used in XDP
> >> mode with virtio_net after updating to newer kernels. After
> >> investigating the reason it turned out that when using mergeable bufs
> >> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> >> calculates the build_skb address wrong because the offset can become less
> >> than the headroom so it gets the address of the previous page (-X bytes
> >> depending on how lower offset is):
> >>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
> >>
> >> This is a pr_err() I added in the beginning of page_to_skb which clearly
> >> shows offset that is less than headroom by adding 4 bytes of metadata
> >> via an xdp prog. The calculations done are:
> >>  receive_mergeable():
> >>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
> >>  offset = xdp.data - page_address(xdp_page) -
> >>           vi->hdr_len - metasize;
> >>
> >>  page_to_skb():
> >>  p = page_address(page) + offset;
> >>  ...
> >>  buf = p - headroom;
> >>
> >> Now buf goes -4 bytes from the page's starting address as can be seen
> >> above which is set as skb->head and skb->data by build_skb later. Depending
> >> on what's done with the skb (when it's freed most often) we get all kinds
> >> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
> >> commit is interesting because the patch was sent and applied twice (it
> >> seems the first one got lost during merge back in 5.13 window). The
> >> first version of the patch that was applied as:
> >>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
> >> was actually correct because it calculated the page starting address
> >> without relying on offset or headroom, but then the second version that
> >> was applied as:
> >>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >> was wrong and added the above calculation.
> >> An example xdp prog[3] is below.
> >>
> >> [1] https://github.com/cilium/cilium/issues/19453
> >>
> >> [2] Two of the many traces:
> >>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
> >>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
> >>  [   41.300891] kernel BUG at include/linux/mm.h:720!
> >>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> >>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
> >>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
> >>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
> >>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
> >>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
> >>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
> >>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
> >>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
> >>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
> >>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
> >>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
> >>  [   41.321387] Call Trace:
> >>  [   41.321819]  <TASK>
> >>  [   41.322193]  skb_release_data+0x13f/0x1c0
> >>  [   41.322902]  __kfree_skb+0x20/0x30
> >>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
> >>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
> >>  [   41.384102]  inet_recvmsg+0x42/0x100
> >>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
> >>  [   41.428201]  sock_read_iter+0x84/0xd0
> >>  [   41.445592]  ? 0xffffffffa3000000
> >>  [   41.462442]  new_sync_read+0x148/0x160
> >>  [   41.479314]  ? 0xffffffffa3000000
> >>  [   41.496937]  vfs_read+0x138/0x190
> >>  [   41.517198]  ksys_read+0x87/0xc0
> >>  [   41.535336]  do_syscall_64+0x3b/0x90
> >>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>  [   41.568050] RIP: 0033:0x48765b
> >>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
> >>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
> >>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
> >>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
> >>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
> >>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
> >>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
> >>  [   41.744254]  </TASK>
> >>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>
> >>  and
> >>
> >>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
> >>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
> >>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
> >>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> >>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
> >>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
> >>  [   33.532471] page dumped because: nonzero mapcount
> >>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
> >>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>  [   33.532484] Call Trace:
> >>  [   33.532496]  <TASK>
> >>  [   33.532500]  dump_stack_lvl+0x45/0x5a
> >>  [   33.532506]  bad_page.cold+0x63/0x94
> >>  [   33.532510]  free_pcp_prepare+0x290/0x420
> >>  [   33.532515]  free_unref_page+0x1b/0x100
> >>  [   33.532518]  skb_release_data+0x13f/0x1c0
> >>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
> >>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
> >>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
> >>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
> >>
> >> [3] XDP program to reproduce(xdp_pass.c):
> >>  #include <linux/bpf.h>
> >>  #include <bpf/bpf_helpers.h>
> >>
> >>  SEC("xdp_pass")
> >>  int xdp_pkt_pass(struct xdp_md *ctx)
> >>  {
> >>           bpf_xdp_adjust_head(ctx, -(int)32);
> >>           return XDP_PASS;
> >>  }
> >>
> >>  char _license[] SEC("license") = "GPL";
> >>
> >>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
> >>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
> >>
> >> CC: stable@vger.kernel.org
> >> CC: Jason Wang <jasowang@redhat.com>
> >> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >> CC: Daniel Borkmann <daniel@iogearbox.net>
> >> CC: "Michael S. Tsirkin" <mst@redhat.com>
> >> CC: virtualization@lists.linux-foundation.org
> >> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> >> ---
> >>  drivers/net/virtio_net.c | 8 ++++++--
> >>  1 file changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index 87838cbe38cf..0687dd88e97f 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> >>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
> >>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
> >>  	 */
> >> -	truesize = headroom ? PAGE_SIZE : truesize;
> >> +	if (headroom) {
> >> +		truesize = PAGE_SIZE;
> >> +		buf = (char *)((unsigned long)p & PAGE_MASK);
> >
> > The reason for not doing this is that buf and p may not be on the same page, and
> > buf is probably not page-aligned.
> >
> > The implementation of virtio-net merge is add_recvbuf_mergeable(), which
> > allocates a large block of memory at one time, and allocates from it each time.
> > Although in xdp mode, each allocation is page_size, it does not guarantee that
> > each allocation is page-aligned .
> >
> > The problem here is that the value of headroom is wrong, the package is
> > structured like this:
> >
> > from device    | headroom          | virtio-net hdr | data |
> > after xdp      | headroom  |  virtio-net hdr | meta | data |
>
> You're free to push data back (not necessarily through meta).

Oh, is that possible? Thanks, that should take this into account.

> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).

Yes, indeed hdr_vaild is false. But the virtio-net hdr space is included,
although it is not legal.

			offset = xdp.data - page_address(xdp_page) -
				 vi->hdr_len - metasize;

Thanks.

>
> >
> > The page_address(page) + offset we pass to page_to_skb() points to the
> > virtio-net hdr.
> >
> > So I think it might be better to change it this way.
> >
> > Thanks.
> >
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 87838cbe38cf..086ae835ec86 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
> >                                                        len, PAGE_SIZE, false,
> >                                                        metasize,
> > -                                                      VIRTIO_XDP_HEADROOM);
> > +                                                      VIRTIO_XDP_HEADROOM - metazie);
> >                                 return head_skb;
> >                         }
> >                         break;
>
> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
> So just doing that would take care of the meta, but won't take care of moving data.
>



_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-23 14:16     ` Nikolay Aleksandrov
@ 2022-04-23 14:30       ` Nikolay Aleksandrov
  -1 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 14:30 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev

On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
> On 23/04/2022 16:31, Xuan Zhuo wrote:
>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>> We received a report[1] of kernel crashes when Cilium is used in XDP
>>> mode with virtio_net after updating to newer kernels. After
>>> investigating the reason it turned out that when using mergeable bufs
>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>>> calculates the build_skb address wrong because the offset can become less
>>> than the headroom so it gets the address of the previous page (-X bytes
>>> depending on how lower offset is):
>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>>
>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>>> shows offset that is less than headroom by adding 4 bytes of metadata
>>> via an xdp prog. The calculations done are:
>>>  receive_mergeable():
>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>>  offset = xdp.data - page_address(xdp_page) -
>>>           vi->hdr_len - metasize;
>>>
>>>  page_to_skb():
>>>  p = page_address(page) + offset;
>>>  ...
>>>  buf = p - headroom;
>>>
>>> Now buf goes -4 bytes from the page's starting address as can be seen
>>> above which is set as skb->head and skb->data by build_skb later. Depending
>>> on what's done with the skb (when it's freed most often) we get all kinds
>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>>> commit is interesting because the patch was sent and applied twice (it
>>> seems the first one got lost during merge back in 5.13 window). The
>>> first version of the patch that was applied as:
>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>>> was actually correct because it calculated the page starting address
>>> without relying on offset or headroom, but then the second version that
>>> was applied as:
>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>> was wrong and added the above calculation.
>>> An example xdp prog[3] is below.
>>>
>>> [1] https://github.com/cilium/cilium/issues/19453
>>>
>>> [2] Two of the many traces:
>>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>>>  [   41.321387] Call Trace:
>>>  [   41.321819]  <TASK>
>>>  [   41.322193]  skb_release_data+0x13f/0x1c0
>>>  [   41.322902]  __kfree_skb+0x20/0x30
>>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>>>  [   41.384102]  inet_recvmsg+0x42/0x100
>>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>>>  [   41.428201]  sock_read_iter+0x84/0xd0
>>>  [   41.445592]  ? 0xffffffffa3000000
>>>  [   41.462442]  new_sync_read+0x148/0x160
>>>  [   41.479314]  ? 0xffffffffa3000000
>>>  [   41.496937]  vfs_read+0x138/0x190
>>>  [   41.517198]  ksys_read+0x87/0xc0
>>>  [   41.535336]  do_syscall_64+0x3b/0x90
>>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>  [   41.568050] RIP: 0033:0x48765b
>>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>>>  [   41.744254]  </TASK>
>>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>
>>>  and
>>>
>>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>>>  [   33.532471] page dumped because: nonzero mapcount
>>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>  [   33.532484] Call Trace:
>>>  [   33.532496]  <TASK>
>>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>>>  [   33.532506]  bad_page.cold+0x63/0x94
>>>  [   33.532510]  free_pcp_prepare+0x290/0x420
>>>  [   33.532515]  free_unref_page+0x1b/0x100
>>>  [   33.532518]  skb_release_data+0x13f/0x1c0
>>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>>>
>>> [3] XDP program to reproduce(xdp_pass.c):
>>>  #include <linux/bpf.h>
>>>  #include <bpf/bpf_helpers.h>
>>>
>>>  SEC("xdp_pass")
>>>  int xdp_pkt_pass(struct xdp_md *ctx)
>>>  {
>>>           bpf_xdp_adjust_head(ctx, -(int)32);
>>>           return XDP_PASS;
>>>  }
>>>
>>>  char _license[] SEC("license") = "GPL";
>>>
>>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>>>
>>> CC: stable@vger.kernel.org
>>> CC: Jason Wang <jasowang@redhat.com>
>>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>> CC: Daniel Borkmann <daniel@iogearbox.net>
>>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>>> CC: virtualization@lists.linux-foundation.org
>>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>>> ---
>>>  drivers/net/virtio_net.c | 8 ++++++--
>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 87838cbe38cf..0687dd88e97f 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>>  	 */
>>> -	truesize = headroom ? PAGE_SIZE : truesize;
>>> +	if (headroom) {
>>> +		truesize = PAGE_SIZE;
>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
>>
>> The reason for not doing this is that buf and p may not be on the same page, and
>> buf is probably not page-aligned.
>>
>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
>> allocates a large block of memory at one time, and allocates from it each time.
>> Although in xdp mode, each allocation is page_size, it does not guarantee that
>> each allocation is page-aligned .
>>
>> The problem here is that the value of headroom is wrong, the package is
>> structured like this:
>>
>> from device    | headroom          | virtio-net hdr | data |
>> after xdp      | headroom  |  virtio-net hdr | meta | data |
> 
> You're free to push data back (not necessarily through meta).
> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
> 
>>
>> The page_address(page) + offset we pass to page_to_skb() points to the
>> virtio-net hdr.
>>
>> So I think it might be better to change it this way.
>>
>> Thanks.
>>
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 87838cbe38cf..086ae835ec86 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>                                                        len, PAGE_SIZE, false,
>>                                                        metasize,
>> -                                                      VIRTIO_XDP_HEADROOM);
>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>                                 return head_skb;
>>                         }
>>                         break;
> 
> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
> So just doing that would take care of the meta, but won't take care of moving data.
> 

Also it doesn't take care of the case where page_to_skb() is called with the original page
i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).

The above change guarantees that buf and p will be in the same page and the skb_reserve() call will
make skb->data point to p - buf, i.e. to the beginning of the valid data in that page.
Unfortunately the new headroom will not be correct if it is a frag, it will be longer.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-23 14:30       ` Nikolay Aleksandrov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 14:30 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable,
	virtualization, kuba, davem

On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
> On 23/04/2022 16:31, Xuan Zhuo wrote:
>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>> We received a report[1] of kernel crashes when Cilium is used in XDP
>>> mode with virtio_net after updating to newer kernels. After
>>> investigating the reason it turned out that when using mergeable bufs
>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>>> calculates the build_skb address wrong because the offset can become less
>>> than the headroom so it gets the address of the previous page (-X bytes
>>> depending on how lower offset is):
>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>>
>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>>> shows offset that is less than headroom by adding 4 bytes of metadata
>>> via an xdp prog. The calculations done are:
>>>  receive_mergeable():
>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>>  offset = xdp.data - page_address(xdp_page) -
>>>           vi->hdr_len - metasize;
>>>
>>>  page_to_skb():
>>>  p = page_address(page) + offset;
>>>  ...
>>>  buf = p - headroom;
>>>
>>> Now buf goes -4 bytes from the page's starting address as can be seen
>>> above which is set as skb->head and skb->data by build_skb later. Depending
>>> on what's done with the skb (when it's freed most often) we get all kinds
>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>>> commit is interesting because the patch was sent and applied twice (it
>>> seems the first one got lost during merge back in 5.13 window). The
>>> first version of the patch that was applied as:
>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>>> was actually correct because it calculated the page starting address
>>> without relying on offset or headroom, but then the second version that
>>> was applied as:
>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>> was wrong and added the above calculation.
>>> An example xdp prog[3] is below.
>>>
>>> [1] https://github.com/cilium/cilium/issues/19453
>>>
>>> [2] Two of the many traces:
>>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>>>  [   41.321387] Call Trace:
>>>  [   41.321819]  <TASK>
>>>  [   41.322193]  skb_release_data+0x13f/0x1c0
>>>  [   41.322902]  __kfree_skb+0x20/0x30
>>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>>>  [   41.384102]  inet_recvmsg+0x42/0x100
>>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>>>  [   41.428201]  sock_read_iter+0x84/0xd0
>>>  [   41.445592]  ? 0xffffffffa3000000
>>>  [   41.462442]  new_sync_read+0x148/0x160
>>>  [   41.479314]  ? 0xffffffffa3000000
>>>  [   41.496937]  vfs_read+0x138/0x190
>>>  [   41.517198]  ksys_read+0x87/0xc0
>>>  [   41.535336]  do_syscall_64+0x3b/0x90
>>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>  [   41.568050] RIP: 0033:0x48765b
>>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>>>  [   41.744254]  </TASK>
>>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>
>>>  and
>>>
>>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>>>  [   33.532471] page dumped because: nonzero mapcount
>>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>  [   33.532484] Call Trace:
>>>  [   33.532496]  <TASK>
>>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>>>  [   33.532506]  bad_page.cold+0x63/0x94
>>>  [   33.532510]  free_pcp_prepare+0x290/0x420
>>>  [   33.532515]  free_unref_page+0x1b/0x100
>>>  [   33.532518]  skb_release_data+0x13f/0x1c0
>>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>>>
>>> [3] XDP program to reproduce(xdp_pass.c):
>>>  #include <linux/bpf.h>
>>>  #include <bpf/bpf_helpers.h>
>>>
>>>  SEC("xdp_pass")
>>>  int xdp_pkt_pass(struct xdp_md *ctx)
>>>  {
>>>           bpf_xdp_adjust_head(ctx, -(int)32);
>>>           return XDP_PASS;
>>>  }
>>>
>>>  char _license[] SEC("license") = "GPL";
>>>
>>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>>>
>>> CC: stable@vger.kernel.org
>>> CC: Jason Wang <jasowang@redhat.com>
>>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>> CC: Daniel Borkmann <daniel@iogearbox.net>
>>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>>> CC: virtualization@lists.linux-foundation.org
>>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>>> ---
>>>  drivers/net/virtio_net.c | 8 ++++++--
>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 87838cbe38cf..0687dd88e97f 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>>  	 */
>>> -	truesize = headroom ? PAGE_SIZE : truesize;
>>> +	if (headroom) {
>>> +		truesize = PAGE_SIZE;
>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
>>
>> The reason for not doing this is that buf and p may not be on the same page, and
>> buf is probably not page-aligned.
>>
>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
>> allocates a large block of memory at one time, and allocates from it each time.
>> Although in xdp mode, each allocation is page_size, it does not guarantee that
>> each allocation is page-aligned .
>>
>> The problem here is that the value of headroom is wrong, the package is
>> structured like this:
>>
>> from device    | headroom          | virtio-net hdr | data |
>> after xdp      | headroom  |  virtio-net hdr | meta | data |
> 
> You're free to push data back (not necessarily through meta).
> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
> 
>>
>> The page_address(page) + offset we pass to page_to_skb() points to the
>> virtio-net hdr.
>>
>> So I think it might be better to change it this way.
>>
>> Thanks.
>>
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 87838cbe38cf..086ae835ec86 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>                                                        len, PAGE_SIZE, false,
>>                                                        metasize,
>> -                                                      VIRTIO_XDP_HEADROOM);
>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>                                 return head_skb;
>>                         }
>>                         break;
> 
> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
> So just doing that would take care of the meta, but won't take care of moving data.
> 

Also it doesn't take care of the case where page_to_skb() is called with the original page
i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).

The above change guarantees that buf and p will be in the same page and the skb_reserve() call will
make skb->data point to p - buf, i.e. to the beginning of the valid data in that page.
Unfortunately the new headroom will not be correct if it is a frag, it will be longer.


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-23 14:30       ` Nikolay Aleksandrov
@ 2022-04-23 14:36         ` Xuan Zhuo
  -1 siblings, 0 replies; 36+ messages in thread
From: Xuan Zhuo @ 2022-04-23 14:36 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev

On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
> > On 23/04/2022 16:31, Xuan Zhuo wrote:
> >> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> >>> We received a report[1] of kernel crashes when Cilium is used in XDP
> >>> mode with virtio_net after updating to newer kernels. After
> >>> investigating the reason it turned out that when using mergeable bufs
> >>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> >>> calculates the build_skb address wrong because the offset can become less
> >>> than the headroom so it gets the address of the previous page (-X bytes
> >>> depending on how lower offset is):
> >>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
> >>>
> >>> This is a pr_err() I added in the beginning of page_to_skb which clearly
> >>> shows offset that is less than headroom by adding 4 bytes of metadata
> >>> via an xdp prog. The calculations done are:
> >>>  receive_mergeable():
> >>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
> >>>  offset = xdp.data - page_address(xdp_page) -
> >>>           vi->hdr_len - metasize;
> >>>
> >>>  page_to_skb():
> >>>  p = page_address(page) + offset;
> >>>  ...
> >>>  buf = p - headroom;
> >>>
> >>> Now buf goes -4 bytes from the page's starting address as can be seen
> >>> above which is set as skb->head and skb->data by build_skb later. Depending
> >>> on what's done with the skb (when it's freed most often) we get all kinds
> >>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
> >>> commit is interesting because the patch was sent and applied twice (it
> >>> seems the first one got lost during merge back in 5.13 window). The
> >>> first version of the patch that was applied as:
> >>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
> >>> was actually correct because it calculated the page starting address
> >>> without relying on offset or headroom, but then the second version that
> >>> was applied as:
> >>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >>> was wrong and added the above calculation.
> >>> An example xdp prog[3] is below.
> >>>
> >>> [1] https://github.com/cilium/cilium/issues/19453
> >>>
> >>> [2] Two of the many traces:
> >>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
> >>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
> >>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
> >>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> >>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
> >>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
> >>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
> >>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
> >>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
> >>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
> >>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
> >>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
> >>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
> >>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
> >>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
> >>>  [   41.321387] Call Trace:
> >>>  [   41.321819]  <TASK>
> >>>  [   41.322193]  skb_release_data+0x13f/0x1c0
> >>>  [   41.322902]  __kfree_skb+0x20/0x30
> >>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
> >>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
> >>>  [   41.384102]  inet_recvmsg+0x42/0x100
> >>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
> >>>  [   41.428201]  sock_read_iter+0x84/0xd0
> >>>  [   41.445592]  ? 0xffffffffa3000000
> >>>  [   41.462442]  new_sync_read+0x148/0x160
> >>>  [   41.479314]  ? 0xffffffffa3000000
> >>>  [   41.496937]  vfs_read+0x138/0x190
> >>>  [   41.517198]  ksys_read+0x87/0xc0
> >>>  [   41.535336]  do_syscall_64+0x3b/0x90
> >>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>>  [   41.568050] RIP: 0033:0x48765b
> >>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
> >>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
> >>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
> >>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
> >>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
> >>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
> >>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
> >>>  [   41.744254]  </TASK>
> >>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>>
> >>>  and
> >>>
> >>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
> >>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
> >>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
> >>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> >>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
> >>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
> >>>  [   33.532471] page dumped because: nonzero mapcount
> >>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
> >>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>>  [   33.532484] Call Trace:
> >>>  [   33.532496]  <TASK>
> >>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
> >>>  [   33.532506]  bad_page.cold+0x63/0x94
> >>>  [   33.532510]  free_pcp_prepare+0x290/0x420
> >>>  [   33.532515]  free_unref_page+0x1b/0x100
> >>>  [   33.532518]  skb_release_data+0x13f/0x1c0
> >>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
> >>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
> >>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
> >>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
> >>>
> >>> [3] XDP program to reproduce(xdp_pass.c):
> >>>  #include <linux/bpf.h>
> >>>  #include <bpf/bpf_helpers.h>
> >>>
> >>>  SEC("xdp_pass")
> >>>  int xdp_pkt_pass(struct xdp_md *ctx)
> >>>  {
> >>>           bpf_xdp_adjust_head(ctx, -(int)32);
> >>>           return XDP_PASS;
> >>>  }
> >>>
> >>>  char _license[] SEC("license") = "GPL";
> >>>
> >>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
> >>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
> >>>
> >>> CC: stable@vger.kernel.org
> >>> CC: Jason Wang <jasowang@redhat.com>
> >>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >>> CC: Daniel Borkmann <daniel@iogearbox.net>
> >>> CC: "Michael S. Tsirkin" <mst@redhat.com>
> >>> CC: virtualization@lists.linux-foundation.org
> >>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> >>> ---
> >>>  drivers/net/virtio_net.c | 8 ++++++--
> >>>  1 file changed, 6 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >>> index 87838cbe38cf..0687dd88e97f 100644
> >>> --- a/drivers/net/virtio_net.c
> >>> +++ b/drivers/net/virtio_net.c
> >>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> >>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
> >>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
> >>>  	 */
> >>> -	truesize = headroom ? PAGE_SIZE : truesize;
> >>> +	if (headroom) {
> >>> +		truesize = PAGE_SIZE;
> >>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
> >>
> >> The reason for not doing this is that buf and p may not be on the same page, and
> >> buf is probably not page-aligned.
> >>
> >> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
> >> allocates a large block of memory at one time, and allocates from it each time.
> >> Although in xdp mode, each allocation is page_size, it does not guarantee that
> >> each allocation is page-aligned .
> >>
> >> The problem here is that the value of headroom is wrong, the package is
> >> structured like this:
> >>
> >> from device    | headroom          | virtio-net hdr | data |
> >> after xdp      | headroom  |  virtio-net hdr | meta | data |
> >
> > You're free to push data back (not necessarily through meta).
> > You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
> >
> >>
> >> The page_address(page) + offset we pass to page_to_skb() points to the
> >> virtio-net hdr.
> >>
> >> So I think it might be better to change it this way.
> >>
> >> Thanks.
> >>
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index 87838cbe38cf..086ae835ec86 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
> >>                                                        len, PAGE_SIZE, false,
> >>                                                        metasize,
> >> -                                                      VIRTIO_XDP_HEADROOM);
> >> +                                                      VIRTIO_XDP_HEADROOM - metazie);
> >>                                 return head_skb;
> >>                         }
> >>                         break;
> >
> > That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
> > So just doing that would take care of the meta, but won't take care of moving data.
> >
>
> Also it doesn't take care of the case where page_to_skb() is called with the original page
> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).

Yes, you are right.

>
> The above change guarantees that buf and p will be in the same page


How can this be guaranteed?

1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
   from the allocation.
2. set xdp
3. alloc for new buffer

The buffer allocated in the third step must be unaligned, and it is entirely
possible that p and buf are not on the same page.

I fixed my previous patch.

Thanks.

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 87838cbe38cf..d95e82255b94 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1005,6 +1005,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
                         * xdp.data_meta were adjusted
                         */
                        len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
+
+                       headroom = xdp.data - vi->hdr_len - metasize - (buf - headroom);
                        /* We can only create skb based on xdp_page. */
                        if (unlikely(xdp_page != page)) {
                                rcu_read_unlock();
@@ -1012,7 +1014,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
                                head_skb = page_to_skb(vi, rq, xdp_page, offset,
                                                       len, PAGE_SIZE, false,
                                                       metasize,
-                                                      VIRTIO_XDP_HEADROOM);
+                                                      headroom);
                                return head_skb;
                        }
                        break;

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-23 14:36         ` Xuan Zhuo
  0 siblings, 0 replies; 36+ messages in thread
From: Xuan Zhuo @ 2022-04-23 14:36 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable,
	virtualization, kuba, davem

On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
> > On 23/04/2022 16:31, Xuan Zhuo wrote:
> >> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> >>> We received a report[1] of kernel crashes when Cilium is used in XDP
> >>> mode with virtio_net after updating to newer kernels. After
> >>> investigating the reason it turned out that when using mergeable bufs
> >>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> >>> calculates the build_skb address wrong because the offset can become less
> >>> than the headroom so it gets the address of the previous page (-X bytes
> >>> depending on how lower offset is):
> >>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
> >>>
> >>> This is a pr_err() I added in the beginning of page_to_skb which clearly
> >>> shows offset that is less than headroom by adding 4 bytes of metadata
> >>> via an xdp prog. The calculations done are:
> >>>  receive_mergeable():
> >>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
> >>>  offset = xdp.data - page_address(xdp_page) -
> >>>           vi->hdr_len - metasize;
> >>>
> >>>  page_to_skb():
> >>>  p = page_address(page) + offset;
> >>>  ...
> >>>  buf = p - headroom;
> >>>
> >>> Now buf goes -4 bytes from the page's starting address as can be seen
> >>> above which is set as skb->head and skb->data by build_skb later. Depending
> >>> on what's done with the skb (when it's freed most often) we get all kinds
> >>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
> >>> commit is interesting because the patch was sent and applied twice (it
> >>> seems the first one got lost during merge back in 5.13 window). The
> >>> first version of the patch that was applied as:
> >>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
> >>> was actually correct because it calculated the page starting address
> >>> without relying on offset or headroom, but then the second version that
> >>> was applied as:
> >>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >>> was wrong and added the above calculation.
> >>> An example xdp prog[3] is below.
> >>>
> >>> [1] https://github.com/cilium/cilium/issues/19453
> >>>
> >>> [2] Two of the many traces:
> >>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
> >>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
> >>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
> >>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> >>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
> >>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
> >>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
> >>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
> >>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
> >>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
> >>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
> >>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
> >>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
> >>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
> >>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
> >>>  [   41.321387] Call Trace:
> >>>  [   41.321819]  <TASK>
> >>>  [   41.322193]  skb_release_data+0x13f/0x1c0
> >>>  [   41.322902]  __kfree_skb+0x20/0x30
> >>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
> >>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
> >>>  [   41.384102]  inet_recvmsg+0x42/0x100
> >>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
> >>>  [   41.428201]  sock_read_iter+0x84/0xd0
> >>>  [   41.445592]  ? 0xffffffffa3000000
> >>>  [   41.462442]  new_sync_read+0x148/0x160
> >>>  [   41.479314]  ? 0xffffffffa3000000
> >>>  [   41.496937]  vfs_read+0x138/0x190
> >>>  [   41.517198]  ksys_read+0x87/0xc0
> >>>  [   41.535336]  do_syscall_64+0x3b/0x90
> >>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>>  [   41.568050] RIP: 0033:0x48765b
> >>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
> >>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
> >>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
> >>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
> >>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
> >>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
> >>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
> >>>  [   41.744254]  </TASK>
> >>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>>
> >>>  and
> >>>
> >>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
> >>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
> >>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
> >>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> >>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
> >>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
> >>>  [   33.532471] page dumped because: nonzero mapcount
> >>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
> >>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>>  [   33.532484] Call Trace:
> >>>  [   33.532496]  <TASK>
> >>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
> >>>  [   33.532506]  bad_page.cold+0x63/0x94
> >>>  [   33.532510]  free_pcp_prepare+0x290/0x420
> >>>  [   33.532515]  free_unref_page+0x1b/0x100
> >>>  [   33.532518]  skb_release_data+0x13f/0x1c0
> >>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
> >>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
> >>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
> >>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
> >>>
> >>> [3] XDP program to reproduce(xdp_pass.c):
> >>>  #include <linux/bpf.h>
> >>>  #include <bpf/bpf_helpers.h>
> >>>
> >>>  SEC("xdp_pass")
> >>>  int xdp_pkt_pass(struct xdp_md *ctx)
> >>>  {
> >>>           bpf_xdp_adjust_head(ctx, -(int)32);
> >>>           return XDP_PASS;
> >>>  }
> >>>
> >>>  char _license[] SEC("license") = "GPL";
> >>>
> >>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
> >>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
> >>>
> >>> CC: stable@vger.kernel.org
> >>> CC: Jason Wang <jasowang@redhat.com>
> >>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >>> CC: Daniel Borkmann <daniel@iogearbox.net>
> >>> CC: "Michael S. Tsirkin" <mst@redhat.com>
> >>> CC: virtualization@lists.linux-foundation.org
> >>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> >>> ---
> >>>  drivers/net/virtio_net.c | 8 ++++++--
> >>>  1 file changed, 6 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >>> index 87838cbe38cf..0687dd88e97f 100644
> >>> --- a/drivers/net/virtio_net.c
> >>> +++ b/drivers/net/virtio_net.c
> >>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> >>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
> >>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
> >>>  	 */
> >>> -	truesize = headroom ? PAGE_SIZE : truesize;
> >>> +	if (headroom) {
> >>> +		truesize = PAGE_SIZE;
> >>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
> >>
> >> The reason for not doing this is that buf and p may not be on the same page, and
> >> buf is probably not page-aligned.
> >>
> >> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
> >> allocates a large block of memory at one time, and allocates from it each time.
> >> Although in xdp mode, each allocation is page_size, it does not guarantee that
> >> each allocation is page-aligned .
> >>
> >> The problem here is that the value of headroom is wrong, the package is
> >> structured like this:
> >>
> >> from device    | headroom          | virtio-net hdr | data |
> >> after xdp      | headroom  |  virtio-net hdr | meta | data |
> >
> > You're free to push data back (not necessarily through meta).
> > You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
> >
> >>
> >> The page_address(page) + offset we pass to page_to_skb() points to the
> >> virtio-net hdr.
> >>
> >> So I think it might be better to change it this way.
> >>
> >> Thanks.
> >>
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index 87838cbe38cf..086ae835ec86 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
> >>                                                        len, PAGE_SIZE, false,
> >>                                                        metasize,
> >> -                                                      VIRTIO_XDP_HEADROOM);
> >> +                                                      VIRTIO_XDP_HEADROOM - metazie);
> >>                                 return head_skb;
> >>                         }
> >>                         break;
> >
> > That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
> > So just doing that would take care of the meta, but won't take care of moving data.
> >
>
> Also it doesn't take care of the case where page_to_skb() is called with the original page
> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).

Yes, you are right.

>
> The above change guarantees that buf and p will be in the same page


How can this be guaranteed?

1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
   from the allocation.
2. set xdp
3. alloc for new buffer

The buffer allocated in the third step must be unaligned, and it is entirely
possible that p and buf are not on the same page.

I fixed my previous patch.

Thanks.

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 87838cbe38cf..d95e82255b94 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1005,6 +1005,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
                         * xdp.data_meta were adjusted
                         */
                        len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
+
+                       headroom = xdp.data - vi->hdr_len - metasize - (buf - headroom);
                        /* We can only create skb based on xdp_page. */
                        if (unlikely(xdp_page != page)) {
                                rcu_read_unlock();
@@ -1012,7 +1014,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
                                head_skb = page_to_skb(vi, rq, xdp_page, offset,
                                                       len, PAGE_SIZE, false,
                                                       metasize,
-                                                      VIRTIO_XDP_HEADROOM);
+                                                      headroom);
                                return head_skb;
                        }
                        break;
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-23 14:30       ` Nikolay Aleksandrov
@ 2022-04-23 14:46         ` Nikolay Aleksandrov
  -1 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 14:46 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev

On 23/04/2022 17:30, Nikolay Aleksandrov wrote:
> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>> We received a report[1] of kernel crashes when Cilium is used in XDP
>>>> mode with virtio_net after updating to newer kernels. After
>>>> investigating the reason it turned out that when using mergeable bufs
>>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>>>> calculates the build_skb address wrong because the offset can become less
>>>> than the headroom so it gets the address of the previous page (-X bytes
>>>> depending on how lower offset is):
>>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>>>
>>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>>>> shows offset that is less than headroom by adding 4 bytes of metadata
>>>> via an xdp prog. The calculations done are:
>>>>  receive_mergeable():
>>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>>>  offset = xdp.data - page_address(xdp_page) -
>>>>           vi->hdr_len - metasize;
>>>>
>>>>  page_to_skb():
>>>>  p = page_address(page) + offset;
>>>>  ...
>>>>  buf = p - headroom;
>>>>
>>>> Now buf goes -4 bytes from the page's starting address as can be seen
>>>> above which is set as skb->head and skb->data by build_skb later. Depending
>>>> on what's done with the skb (when it's freed most often) we get all kinds
>>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>>>> commit is interesting because the patch was sent and applied twice (it
>>>> seems the first one got lost during merge back in 5.13 window). The
>>>> first version of the patch that was applied as:
>>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>>>> was actually correct because it calculated the page starting address
>>>> without relying on offset or headroom, but then the second version that
>>>> was applied as:
>>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>> was wrong and added the above calculation.
>>>> An example xdp prog[3] is below.
>>>>
>>>> [1] https://github.com/cilium/cilium/issues/19453
>>>>
>>>> [2] Two of the many traces:
[snip]
>>>>  drivers/net/virtio_net.c | 8 ++++++--
>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 87838cbe38cf..0687dd88e97f 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>>>  	 */
>>>> -	truesize = headroom ? PAGE_SIZE : truesize;
>>>> +	if (headroom) {
>>>> +		truesize = PAGE_SIZE;
>>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
>>>
>>> The reason for not doing this is that buf and p may not be on the same page, and
>>> buf is probably not page-aligned.
>>>
>>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
>>> allocates a large block of memory at one time, and allocates from it each time.
>>> Although in xdp mode, each allocation is page_size, it does not guarantee that
>>> each allocation is page-aligned .
>>>
>>> The problem here is that the value of headroom is wrong, the package is
>>> structured like this:
>>>
>>> from device    | headroom          | virtio-net hdr | data |
>>> after xdp      | headroom  |  virtio-net hdr | meta | data |
>>
>> You're free to push data back (not necessarily through meta).
>> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
>>
>>>
>>> The page_address(page) + offset we pass to page_to_skb() points to the
>>> virtio-net hdr.
>>>
>>> So I think it might be better to change it this way.
>>>
>>> Thanks.
>>>
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 87838cbe38cf..086ae835ec86 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>                                                        len, PAGE_SIZE, false,
>>>                                                        metasize,
>>> -                                                      VIRTIO_XDP_HEADROOM);
>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>                                 return head_skb;
>>>                         }
>>>                         break;
>>
>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>> So just doing that would take care of the meta, but won't take care of moving data.
>>
> 
> Also it doesn't take care of the case where page_to_skb() is called with the original page
> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
> 
> The above change guarantees that buf and p will be in the same page and the skb_reserve() call will
> make skb->data point to p - buf, i.e. to the beginning of the valid data in that page.
> Unfortunately the new headroom will not be correct if it is a frag, it will be longer.
> 
> 

Completely untested alternative could be based on the offset size, that is if it has
eaten into the headroom and is smaller then we swap them (that means we start at page
boundary since we have headroom guaranteed space):
 buf = page_address(page) + (offset > headroom ? offset - headroom : 0);

or perhaps in current code terms:
 buf = p - (offset > headroom ? headroom : offset);

That means offset is somewhere inside the headroom of the buf and, the buf itself
starts at page boundary (when offset < headroom). I think this preserves the correct
headroom for the new skb. WDYT?

Cheers,
 Nik




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-23 14:46         ` Nikolay Aleksandrov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 14:46 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable,
	virtualization, kuba, davem

On 23/04/2022 17:30, Nikolay Aleksandrov wrote:
> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>> We received a report[1] of kernel crashes when Cilium is used in XDP
>>>> mode with virtio_net after updating to newer kernels. After
>>>> investigating the reason it turned out that when using mergeable bufs
>>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>>>> calculates the build_skb address wrong because the offset can become less
>>>> than the headroom so it gets the address of the previous page (-X bytes
>>>> depending on how lower offset is):
>>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>>>
>>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>>>> shows offset that is less than headroom by adding 4 bytes of metadata
>>>> via an xdp prog. The calculations done are:
>>>>  receive_mergeable():
>>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>>>  offset = xdp.data - page_address(xdp_page) -
>>>>           vi->hdr_len - metasize;
>>>>
>>>>  page_to_skb():
>>>>  p = page_address(page) + offset;
>>>>  ...
>>>>  buf = p - headroom;
>>>>
>>>> Now buf goes -4 bytes from the page's starting address as can be seen
>>>> above which is set as skb->head and skb->data by build_skb later. Depending
>>>> on what's done with the skb (when it's freed most often) we get all kinds
>>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>>>> commit is interesting because the patch was sent and applied twice (it
>>>> seems the first one got lost during merge back in 5.13 window). The
>>>> first version of the patch that was applied as:
>>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>>>> was actually correct because it calculated the page starting address
>>>> without relying on offset or headroom, but then the second version that
>>>> was applied as:
>>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>> was wrong and added the above calculation.
>>>> An example xdp prog[3] is below.
>>>>
>>>> [1] https://github.com/cilium/cilium/issues/19453
>>>>
>>>> [2] Two of the many traces:
[snip]
>>>>  drivers/net/virtio_net.c | 8 ++++++--
>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 87838cbe38cf..0687dd88e97f 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>>>  	 */
>>>> -	truesize = headroom ? PAGE_SIZE : truesize;
>>>> +	if (headroom) {
>>>> +		truesize = PAGE_SIZE;
>>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
>>>
>>> The reason for not doing this is that buf and p may not be on the same page, and
>>> buf is probably not page-aligned.
>>>
>>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
>>> allocates a large block of memory at one time, and allocates from it each time.
>>> Although in xdp mode, each allocation is page_size, it does not guarantee that
>>> each allocation is page-aligned .
>>>
>>> The problem here is that the value of headroom is wrong, the package is
>>> structured like this:
>>>
>>> from device    | headroom          | virtio-net hdr | data |
>>> after xdp      | headroom  |  virtio-net hdr | meta | data |
>>
>> You're free to push data back (not necessarily through meta).
>> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
>>
>>>
>>> The page_address(page) + offset we pass to page_to_skb() points to the
>>> virtio-net hdr.
>>>
>>> So I think it might be better to change it this way.
>>>
>>> Thanks.
>>>
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 87838cbe38cf..086ae835ec86 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>                                                        len, PAGE_SIZE, false,
>>>                                                        metasize,
>>> -                                                      VIRTIO_XDP_HEADROOM);
>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>                                 return head_skb;
>>>                         }
>>>                         break;
>>
>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>> So just doing that would take care of the meta, but won't take care of moving data.
>>
> 
> Also it doesn't take care of the case where page_to_skb() is called with the original page
> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
> 
> The above change guarantees that buf and p will be in the same page and the skb_reserve() call will
> make skb->data point to p - buf, i.e. to the beginning of the valid data in that page.
> Unfortunately the new headroom will not be correct if it is a frag, it will be longer.
> 
> 

Completely untested alternative could be based on the offset size, that is if it has
eaten into the headroom and is smaller then we swap them (that means we start at page
boundary since we have headroom guaranteed space):
 buf = page_address(page) + (offset > headroom ? offset - headroom : 0);

or perhaps in current code terms:
 buf = p - (offset > headroom ? headroom : offset);

That means offset is somewhere inside the headroom of the buf and, the buf itself
starts at page boundary (when offset < headroom). I think this preserves the correct
headroom for the new skb. WDYT?

Cheers,
 Nik



_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-23 14:36         ` Xuan Zhuo
@ 2022-04-23 14:58           ` Nikolay Aleksandrov
  -1 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 14:58 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev

On 23/04/2022 17:36, Xuan Zhuo wrote:
> On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>>> We received a report[1] of kernel crashes when Cilium is used in XDP
>>>>> mode with virtio_net after updating to newer kernels. After
>>>>> investigating the reason it turned out that when using mergeable bufs
>>>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>>>>> calculates the build_skb address wrong because the offset can become less
>>>>> than the headroom so it gets the address of the previous page (-X bytes
>>>>> depending on how lower offset is):
>>>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>>>>
>>>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>>>>> shows offset that is less than headroom by adding 4 bytes of metadata
>>>>> via an xdp prog. The calculations done are:
>>>>>  receive_mergeable():
>>>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>>>>  offset = xdp.data - page_address(xdp_page) -
>>>>>           vi->hdr_len - metasize;
>>>>>
>>>>>  page_to_skb():
>>>>>  p = page_address(page) + offset;
>>>>>  ...
>>>>>  buf = p - headroom;
>>>>>
>>>>> Now buf goes -4 bytes from the page's starting address as can be seen
>>>>> above which is set as skb->head and skb->data by build_skb later. Depending
>>>>> on what's done with the skb (when it's freed most often) we get all kinds
>>>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>>>>> commit is interesting because the patch was sent and applied twice (it
>>>>> seems the first one got lost during merge back in 5.13 window). The
>>>>> first version of the patch that was applied as:
>>>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>>>>> was actually correct because it calculated the page starting address
>>>>> without relying on offset or headroom, but then the second version that
>>>>> was applied as:
>>>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>>> was wrong and added the above calculation.
>>>>> An example xdp prog[3] is below.
>>>>>
>>>>> [1] https://github.com/cilium/cilium/issues/19453
>>>>>
>>>>> [2] Two of the many traces:
>>>>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>>>>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>>>>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>>>>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>>>>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>>>>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>>>>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>>>>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>>>>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>>>>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>>>>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>>>>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>>>>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>>>>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>>>>>  [   41.321387] Call Trace:
>>>>>  [   41.321819]  <TASK>
>>>>>  [   41.322193]  skb_release_data+0x13f/0x1c0
>>>>>  [   41.322902]  __kfree_skb+0x20/0x30
>>>>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>>>>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>>>>>  [   41.384102]  inet_recvmsg+0x42/0x100
>>>>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>>>>>  [   41.428201]  sock_read_iter+0x84/0xd0
>>>>>  [   41.445592]  ? 0xffffffffa3000000
>>>>>  [   41.462442]  new_sync_read+0x148/0x160
>>>>>  [   41.479314]  ? 0xffffffffa3000000
>>>>>  [   41.496937]  vfs_read+0x138/0x190
>>>>>  [   41.517198]  ksys_read+0x87/0xc0
>>>>>  [   41.535336]  do_syscall_64+0x3b/0x90
>>>>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>  [   41.568050] RIP: 0033:0x48765b
>>>>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>>>>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>>>>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>>>>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>>>>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>>>>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>>>>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>>>>>  [   41.744254]  </TASK>
>>>>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>>>
>>>>>  and
>>>>>
>>>>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>>>>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>>>>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>>>>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>>>>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>>>>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>>>>>  [   33.532471] page dumped because: nonzero mapcount
>>>>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>>>>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>>>  [   33.532484] Call Trace:
>>>>>  [   33.532496]  <TASK>
>>>>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>>>>>  [   33.532506]  bad_page.cold+0x63/0x94
>>>>>  [   33.532510]  free_pcp_prepare+0x290/0x420
>>>>>  [   33.532515]  free_unref_page+0x1b/0x100
>>>>>  [   33.532518]  skb_release_data+0x13f/0x1c0
>>>>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>>>>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>>>>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>>>>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>>>>>
>>>>> [3] XDP program to reproduce(xdp_pass.c):
>>>>>  #include <linux/bpf.h>
>>>>>  #include <bpf/bpf_helpers.h>
>>>>>
>>>>>  SEC("xdp_pass")
>>>>>  int xdp_pkt_pass(struct xdp_md *ctx)
>>>>>  {
>>>>>           bpf_xdp_adjust_head(ctx, -(int)32);
>>>>>           return XDP_PASS;
>>>>>  }
>>>>>
>>>>>  char _license[] SEC("license") = "GPL";
>>>>>
>>>>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>>>>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>>>>>
>>>>> CC: stable@vger.kernel.org
>>>>> CC: Jason Wang <jasowang@redhat.com>
>>>>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>>> CC: Daniel Borkmann <daniel@iogearbox.net>
>>>>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>>>>> CC: virtualization@lists.linux-foundation.org
>>>>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>>>>> ---
>>>>>  drivers/net/virtio_net.c | 8 ++++++--
>>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>> index 87838cbe38cf..0687dd88e97f 100644
>>>>> --- a/drivers/net/virtio_net.c
>>>>> +++ b/drivers/net/virtio_net.c
>>>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>>>>  	 */
>>>>> -	truesize = headroom ? PAGE_SIZE : truesize;
>>>>> +	if (headroom) {
>>>>> +		truesize = PAGE_SIZE;
>>>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
>>>>
>>>> The reason for not doing this is that buf and p may not be on the same page, and
>>>> buf is probably not page-aligned.
>>>>
>>>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
>>>> allocates a large block of memory at one time, and allocates from it each time.
>>>> Although in xdp mode, each allocation is page_size, it does not guarantee that
>>>> each allocation is page-aligned .
>>>>
>>>> The problem here is that the value of headroom is wrong, the package is
>>>> structured like this:
>>>>
>>>> from device    | headroom          | virtio-net hdr | data |
>>>> after xdp      | headroom  |  virtio-net hdr | meta | data |
>>>
>>> You're free to push data back (not necessarily through meta).
>>> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
>>>
>>>>
>>>> The page_address(page) + offset we pass to page_to_skb() points to the
>>>> virtio-net hdr.
>>>>
>>>> So I think it might be better to change it this way.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 87838cbe38cf..086ae835ec86 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>>                                                        len, PAGE_SIZE, false,
>>>>                                                        metasize,
>>>> -                                                      VIRTIO_XDP_HEADROOM);
>>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>>                                 return head_skb;
>>>>                         }
>>>>                         break;
>>>
>>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>>> So just doing that would take care of the meta, but won't take care of moving data.
>>>
>>
>> Also it doesn't take care of the case where page_to_skb() is called with the original page
>> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
> 
> Yes, you are right.
> 
>>
>> The above change guarantees that buf and p will be in the same page
> 
> 
> How can this be guaranteed?
> 
> 1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
>    from the allocation.
> 2. set xdp
> 3. alloc for new buffer
> 

p = page_address(page) + offset;
buf = p & PAGE_MASK; // whatever page p lands in is where buf is set

=> p and buf are always in the same page, no?

> The buffer allocated in the third step must be unaligned, and it is entirely
> possible that p and buf are not on the same page.
> 
> I fixed my previous patch.
> 
> Thanks.
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 87838cbe38cf..d95e82255b94 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1005,6 +1005,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>                          * xdp.data_meta were adjusted
>                          */
>                         len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
> +
> +                       headroom = xdp.data - vi->hdr_len - metasize - (buf - headroom);

This is wrong, xdp.data isn't related to buf in the xdp_linearize_page() case.

>                         /* We can only create skb based on xdp_page. */
>                         if (unlikely(xdp_page != page)) {
>                                 rcu_read_unlock();
> @@ -1012,7 +1014,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>                                                        len, PAGE_SIZE, false,
>                                                        metasize,
> -                                                      VIRTIO_XDP_HEADROOM);
> +                                                      headroom);
>                                 return head_skb;
>                         }
>                         break;


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-23 14:58           ` Nikolay Aleksandrov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 14:58 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable,
	virtualization, kuba, davem

On 23/04/2022 17:36, Xuan Zhuo wrote:
> On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>>> We received a report[1] of kernel crashes when Cilium is used in XDP
>>>>> mode with virtio_net after updating to newer kernels. After
>>>>> investigating the reason it turned out that when using mergeable bufs
>>>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>>>>> calculates the build_skb address wrong because the offset can become less
>>>>> than the headroom so it gets the address of the previous page (-X bytes
>>>>> depending on how lower offset is):
>>>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>>>>
>>>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>>>>> shows offset that is less than headroom by adding 4 bytes of metadata
>>>>> via an xdp prog. The calculations done are:
>>>>>  receive_mergeable():
>>>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>>>>  offset = xdp.data - page_address(xdp_page) -
>>>>>           vi->hdr_len - metasize;
>>>>>
>>>>>  page_to_skb():
>>>>>  p = page_address(page) + offset;
>>>>>  ...
>>>>>  buf = p - headroom;
>>>>>
>>>>> Now buf goes -4 bytes from the page's starting address as can be seen
>>>>> above which is set as skb->head and skb->data by build_skb later. Depending
>>>>> on what's done with the skb (when it's freed most often) we get all kinds
>>>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>>>>> commit is interesting because the patch was sent and applied twice (it
>>>>> seems the first one got lost during merge back in 5.13 window). The
>>>>> first version of the patch that was applied as:
>>>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>>>>> was actually correct because it calculated the page starting address
>>>>> without relying on offset or headroom, but then the second version that
>>>>> was applied as:
>>>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>>> was wrong and added the above calculation.
>>>>> An example xdp prog[3] is below.
>>>>>
>>>>> [1] https://github.com/cilium/cilium/issues/19453
>>>>>
>>>>> [2] Two of the many traces:
>>>>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>>>>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>>>>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>>>>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>>>>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>>>>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>>>>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>>>>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>>>>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>>>>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>>>>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>>>>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>>>>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>>>>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>>>>>  [   41.321387] Call Trace:
>>>>>  [   41.321819]  <TASK>
>>>>>  [   41.322193]  skb_release_data+0x13f/0x1c0
>>>>>  [   41.322902]  __kfree_skb+0x20/0x30
>>>>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>>>>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>>>>>  [   41.384102]  inet_recvmsg+0x42/0x100
>>>>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>>>>>  [   41.428201]  sock_read_iter+0x84/0xd0
>>>>>  [   41.445592]  ? 0xffffffffa3000000
>>>>>  [   41.462442]  new_sync_read+0x148/0x160
>>>>>  [   41.479314]  ? 0xffffffffa3000000
>>>>>  [   41.496937]  vfs_read+0x138/0x190
>>>>>  [   41.517198]  ksys_read+0x87/0xc0
>>>>>  [   41.535336]  do_syscall_64+0x3b/0x90
>>>>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>  [   41.568050] RIP: 0033:0x48765b
>>>>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>>>>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>>>>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>>>>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>>>>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>>>>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>>>>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>>>>>  [   41.744254]  </TASK>
>>>>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>>>
>>>>>  and
>>>>>
>>>>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>>>>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>>>>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>>>>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>>>>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>>>>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>>>>>  [   33.532471] page dumped because: nonzero mapcount
>>>>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>>>>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>>>  [   33.532484] Call Trace:
>>>>>  [   33.532496]  <TASK>
>>>>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>>>>>  [   33.532506]  bad_page.cold+0x63/0x94
>>>>>  [   33.532510]  free_pcp_prepare+0x290/0x420
>>>>>  [   33.532515]  free_unref_page+0x1b/0x100
>>>>>  [   33.532518]  skb_release_data+0x13f/0x1c0
>>>>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>>>>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>>>>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>>>>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>>>>>
>>>>> [3] XDP program to reproduce(xdp_pass.c):
>>>>>  #include <linux/bpf.h>
>>>>>  #include <bpf/bpf_helpers.h>
>>>>>
>>>>>  SEC("xdp_pass")
>>>>>  int xdp_pkt_pass(struct xdp_md *ctx)
>>>>>  {
>>>>>           bpf_xdp_adjust_head(ctx, -(int)32);
>>>>>           return XDP_PASS;
>>>>>  }
>>>>>
>>>>>  char _license[] SEC("license") = "GPL";
>>>>>
>>>>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>>>>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>>>>>
>>>>> CC: stable@vger.kernel.org
>>>>> CC: Jason Wang <jasowang@redhat.com>
>>>>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>>> CC: Daniel Borkmann <daniel@iogearbox.net>
>>>>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>>>>> CC: virtualization@lists.linux-foundation.org
>>>>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>>>>> ---
>>>>>  drivers/net/virtio_net.c | 8 ++++++--
>>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>> index 87838cbe38cf..0687dd88e97f 100644
>>>>> --- a/drivers/net/virtio_net.c
>>>>> +++ b/drivers/net/virtio_net.c
>>>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>>>>  	 */
>>>>> -	truesize = headroom ? PAGE_SIZE : truesize;
>>>>> +	if (headroom) {
>>>>> +		truesize = PAGE_SIZE;
>>>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
>>>>
>>>> The reason for not doing this is that buf and p may not be on the same page, and
>>>> buf is probably not page-aligned.
>>>>
>>>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
>>>> allocates a large block of memory at one time, and allocates from it each time.
>>>> Although in xdp mode, each allocation is page_size, it does not guarantee that
>>>> each allocation is page-aligned .
>>>>
>>>> The problem here is that the value of headroom is wrong, the package is
>>>> structured like this:
>>>>
>>>> from device    | headroom          | virtio-net hdr | data |
>>>> after xdp      | headroom  |  virtio-net hdr | meta | data |
>>>
>>> You're free to push data back (not necessarily through meta).
>>> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
>>>
>>>>
>>>> The page_address(page) + offset we pass to page_to_skb() points to the
>>>> virtio-net hdr.
>>>>
>>>> So I think it might be better to change it this way.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 87838cbe38cf..086ae835ec86 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>>                                                        len, PAGE_SIZE, false,
>>>>                                                        metasize,
>>>> -                                                      VIRTIO_XDP_HEADROOM);
>>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>>                                 return head_skb;
>>>>                         }
>>>>                         break;
>>>
>>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>>> So just doing that would take care of the meta, but won't take care of moving data.
>>>
>>
>> Also it doesn't take care of the case where page_to_skb() is called with the original page
>> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
> 
> Yes, you are right.
> 
>>
>> The above change guarantees that buf and p will be in the same page
> 
> 
> How can this be guaranteed?
> 
> 1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
>    from the allocation.
> 2. set xdp
> 3. alloc for new buffer
> 

p = page_address(page) + offset;
buf = p & PAGE_MASK; // whatever page p lands in is where buf is set

=> p and buf are always in the same page, no?

> The buffer allocated in the third step must be unaligned, and it is entirely
> possible that p and buf are not on the same page.
> 
> I fixed my previous patch.
> 
> Thanks.
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 87838cbe38cf..d95e82255b94 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1005,6 +1005,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>                          * xdp.data_meta were adjusted
>                          */
>                         len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
> +
> +                       headroom = xdp.data - vi->hdr_len - metasize - (buf - headroom);

This is wrong, xdp.data isn't related to buf in the xdp_linearize_page() case.

>                         /* We can only create skb based on xdp_page. */
>                         if (unlikely(xdp_page != page)) {
>                                 rcu_read_unlock();
> @@ -1012,7 +1014,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>                                                        len, PAGE_SIZE, false,
>                                                        metasize,
> -                                                      VIRTIO_XDP_HEADROOM);
> +                                                      headroom);
>                                 return head_skb;
>                         }
>                         break;

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-23 14:58           ` Nikolay Aleksandrov
@ 2022-04-23 15:01             ` Xuan Zhuo
  -1 siblings, 0 replies; 36+ messages in thread
From: Xuan Zhuo @ 2022-04-23 15:01 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev

On Sat, 23 Apr 2022 17:58:05 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> On 23/04/2022 17:36, Xuan Zhuo wrote:
> > On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> >> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
> >>> On 23/04/2022 16:31, Xuan Zhuo wrote:
> >>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> >>>>> We received a report[1] of kernel crashes when Cilium is used in XDP
> >>>>> mode with virtio_net after updating to newer kernels. After
> >>>>> investigating the reason it turned out that when using mergeable bufs
> >>>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> >>>>> calculates the build_skb address wrong because the offset can become less
> >>>>> than the headroom so it gets the address of the previous page (-X bytes
> >>>>> depending on how lower offset is):
> >>>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
> >>>>>
> >>>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
> >>>>> shows offset that is less than headroom by adding 4 bytes of metadata
> >>>>> via an xdp prog. The calculations done are:
> >>>>>  receive_mergeable():
> >>>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
> >>>>>  offset = xdp.data - page_address(xdp_page) -
> >>>>>           vi->hdr_len - metasize;
> >>>>>
> >>>>>  page_to_skb():
> >>>>>  p = page_address(page) + offset;
> >>>>>  ...
> >>>>>  buf = p - headroom;
> >>>>>
> >>>>> Now buf goes -4 bytes from the page's starting address as can be seen
> >>>>> above which is set as skb->head and skb->data by build_skb later. Depending
> >>>>> on what's done with the skb (when it's freed most often) we get all kinds
> >>>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
> >>>>> commit is interesting because the patch was sent and applied twice (it
> >>>>> seems the first one got lost during merge back in 5.13 window). The
> >>>>> first version of the patch that was applied as:
> >>>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
> >>>>> was actually correct because it calculated the page starting address
> >>>>> without relying on offset or headroom, but then the second version that
> >>>>> was applied as:
> >>>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >>>>> was wrong and added the above calculation.
> >>>>> An example xdp prog[3] is below.
> >>>>>
> >>>>> [1] https://github.com/cilium/cilium/issues/19453
> >>>>>
> >>>>> [2] Two of the many traces:
> >>>>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
> >>>>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
> >>>>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
> >>>>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> >>>>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
> >>>>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>>>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
> >>>>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
> >>>>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
> >>>>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
> >>>>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
> >>>>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
> >>>>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
> >>>>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
> >>>>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
> >>>>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
> >>>>>  [   41.321387] Call Trace:
> >>>>>  [   41.321819]  <TASK>
> >>>>>  [   41.322193]  skb_release_data+0x13f/0x1c0
> >>>>>  [   41.322902]  __kfree_skb+0x20/0x30
> >>>>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
> >>>>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
> >>>>>  [   41.384102]  inet_recvmsg+0x42/0x100
> >>>>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
> >>>>>  [   41.428201]  sock_read_iter+0x84/0xd0
> >>>>>  [   41.445592]  ? 0xffffffffa3000000
> >>>>>  [   41.462442]  new_sync_read+0x148/0x160
> >>>>>  [   41.479314]  ? 0xffffffffa3000000
> >>>>>  [   41.496937]  vfs_read+0x138/0x190
> >>>>>  [   41.517198]  ksys_read+0x87/0xc0
> >>>>>  [   41.535336]  do_syscall_64+0x3b/0x90
> >>>>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>>>>  [   41.568050] RIP: 0033:0x48765b
> >>>>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
> >>>>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
> >>>>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
> >>>>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
> >>>>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
> >>>>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
> >>>>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
> >>>>>  [   41.744254]  </TASK>
> >>>>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>>>>
> >>>>>  and
> >>>>>
> >>>>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
> >>>>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
> >>>>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
> >>>>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> >>>>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
> >>>>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
> >>>>>  [   33.532471] page dumped because: nonzero mapcount
> >>>>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>>>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
> >>>>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>>>>  [   33.532484] Call Trace:
> >>>>>  [   33.532496]  <TASK>
> >>>>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
> >>>>>  [   33.532506]  bad_page.cold+0x63/0x94
> >>>>>  [   33.532510]  free_pcp_prepare+0x290/0x420
> >>>>>  [   33.532515]  free_unref_page+0x1b/0x100
> >>>>>  [   33.532518]  skb_release_data+0x13f/0x1c0
> >>>>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
> >>>>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
> >>>>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
> >>>>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
> >>>>>
> >>>>> [3] XDP program to reproduce(xdp_pass.c):
> >>>>>  #include <linux/bpf.h>
> >>>>>  #include <bpf/bpf_helpers.h>
> >>>>>
> >>>>>  SEC("xdp_pass")
> >>>>>  int xdp_pkt_pass(struct xdp_md *ctx)
> >>>>>  {
> >>>>>           bpf_xdp_adjust_head(ctx, -(int)32);
> >>>>>           return XDP_PASS;
> >>>>>  }
> >>>>>
> >>>>>  char _license[] SEC("license") = "GPL";
> >>>>>
> >>>>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
> >>>>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
> >>>>>
> >>>>> CC: stable@vger.kernel.org
> >>>>> CC: Jason Wang <jasowang@redhat.com>
> >>>>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >>>>> CC: Daniel Borkmann <daniel@iogearbox.net>
> >>>>> CC: "Michael S. Tsirkin" <mst@redhat.com>
> >>>>> CC: virtualization@lists.linux-foundation.org
> >>>>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >>>>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> >>>>> ---
> >>>>>  drivers/net/virtio_net.c | 8 ++++++--
> >>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >>>>> index 87838cbe38cf..0687dd88e97f 100644
> >>>>> --- a/drivers/net/virtio_net.c
> >>>>> +++ b/drivers/net/virtio_net.c
> >>>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> >>>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
> >>>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
> >>>>>  	 */
> >>>>> -	truesize = headroom ? PAGE_SIZE : truesize;
> >>>>> +	if (headroom) {
> >>>>> +		truesize = PAGE_SIZE;
> >>>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
> >>>>
> >>>> The reason for not doing this is that buf and p may not be on the same page, and
> >>>> buf is probably not page-aligned.
> >>>>
> >>>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
> >>>> allocates a large block of memory at one time, and allocates from it each time.
> >>>> Although in xdp mode, each allocation is page_size, it does not guarantee that
> >>>> each allocation is page-aligned .
> >>>>
> >>>> The problem here is that the value of headroom is wrong, the package is
> >>>> structured like this:
> >>>>
> >>>> from device    | headroom          | virtio-net hdr | data |
> >>>> after xdp      | headroom  |  virtio-net hdr | meta | data |
> >>>
> >>> You're free to push data back (not necessarily through meta).
> >>> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
> >>>
> >>>>
> >>>> The page_address(page) + offset we pass to page_to_skb() points to the
> >>>> virtio-net hdr.
> >>>>
> >>>> So I think it might be better to change it this way.
> >>>>
> >>>> Thanks.
> >>>>
> >>>>
> >>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >>>> index 87838cbe38cf..086ae835ec86 100644
> >>>> --- a/drivers/net/virtio_net.c
> >>>> +++ b/drivers/net/virtio_net.c
> >>>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
> >>>>                                                        len, PAGE_SIZE, false,
> >>>>                                                        metasize,
> >>>> -                                                      VIRTIO_XDP_HEADROOM);
> >>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
> >>>>                                 return head_skb;
> >>>>                         }
> >>>>                         break;
> >>>
> >>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
> >>> So just doing that would take care of the meta, but won't take care of moving data.
> >>>
> >>
> >> Also it doesn't take care of the case where page_to_skb() is called with the original page
> >> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
> >
> > Yes, you are right.
> >
> >>
> >> The above change guarantees that buf and p will be in the same page
> >
> >
> > How can this be guaranteed?
> >
> > 1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
> >    from the allocation.
> > 2. set xdp
> > 3. alloc for new buffer
> >
>
> p = page_address(page) + offset;
> buf = p & PAGE_MASK; // whatever page p lands in is where buf is set
>
> => p and buf are always in the same page, no?

I don't think it is, it's entirely possible to split on two pages.

>
> > The buffer allocated in the third step must be unaligned, and it is entirely
> > possible that p and buf are not on the same page.
> >
> > I fixed my previous patch.
> >
> > Thanks.
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 87838cbe38cf..d95e82255b94 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1005,6 +1005,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >                          * xdp.data_meta were adjusted
> >                          */
> >                         len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
> > +
> > +                       headroom = xdp.data - vi->hdr_len - metasize - (buf - headroom);
>
> This is wrong, xdp.data isn't related to buf in the xdp_linearize_page() case.

Yes, you are right. For the case of xdp_linearize_page() , we should change it.

   headroom = xdp.data - vi->hdr_len - metasize - page_address(xdp_page);

Thanks.


>
> >                         /* We can only create skb based on xdp_page. */
> >                         if (unlikely(xdp_page != page)) {
> >                                 rcu_read_unlock();
> > @@ -1012,7 +1014,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
> >                                                        len, PAGE_SIZE, false,
> >                                                        metasize,
> > -                                                      VIRTIO_XDP_HEADROOM);
> > +                                                      headroom);
> >                                 return head_skb;
> >                         }
> >                         break;
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-23 15:01             ` Xuan Zhuo
  0 siblings, 0 replies; 36+ messages in thread
From: Xuan Zhuo @ 2022-04-23 15:01 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable,
	virtualization, kuba, davem

On Sat, 23 Apr 2022 17:58:05 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> On 23/04/2022 17:36, Xuan Zhuo wrote:
> > On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> >> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
> >>> On 23/04/2022 16:31, Xuan Zhuo wrote:
> >>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> >>>>> We received a report[1] of kernel crashes when Cilium is used in XDP
> >>>>> mode with virtio_net after updating to newer kernels. After
> >>>>> investigating the reason it turned out that when using mergeable bufs
> >>>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> >>>>> calculates the build_skb address wrong because the offset can become less
> >>>>> than the headroom so it gets the address of the previous page (-X bytes
> >>>>> depending on how lower offset is):
> >>>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
> >>>>>
> >>>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
> >>>>> shows offset that is less than headroom by adding 4 bytes of metadata
> >>>>> via an xdp prog. The calculations done are:
> >>>>>  receive_mergeable():
> >>>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
> >>>>>  offset = xdp.data - page_address(xdp_page) -
> >>>>>           vi->hdr_len - metasize;
> >>>>>
> >>>>>  page_to_skb():
> >>>>>  p = page_address(page) + offset;
> >>>>>  ...
> >>>>>  buf = p - headroom;
> >>>>>
> >>>>> Now buf goes -4 bytes from the page's starting address as can be seen
> >>>>> above which is set as skb->head and skb->data by build_skb later. Depending
> >>>>> on what's done with the skb (when it's freed most often) we get all kinds
> >>>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
> >>>>> commit is interesting because the patch was sent and applied twice (it
> >>>>> seems the first one got lost during merge back in 5.13 window). The
> >>>>> first version of the patch that was applied as:
> >>>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
> >>>>> was actually correct because it calculated the page starting address
> >>>>> without relying on offset or headroom, but then the second version that
> >>>>> was applied as:
> >>>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >>>>> was wrong and added the above calculation.
> >>>>> An example xdp prog[3] is below.
> >>>>>
> >>>>> [1] https://github.com/cilium/cilium/issues/19453
> >>>>>
> >>>>> [2] Two of the many traces:
> >>>>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
> >>>>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
> >>>>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
> >>>>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> >>>>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
> >>>>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>>>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
> >>>>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
> >>>>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
> >>>>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
> >>>>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
> >>>>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
> >>>>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
> >>>>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
> >>>>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
> >>>>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
> >>>>>  [   41.321387] Call Trace:
> >>>>>  [   41.321819]  <TASK>
> >>>>>  [   41.322193]  skb_release_data+0x13f/0x1c0
> >>>>>  [   41.322902]  __kfree_skb+0x20/0x30
> >>>>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
> >>>>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
> >>>>>  [   41.384102]  inet_recvmsg+0x42/0x100
> >>>>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
> >>>>>  [   41.428201]  sock_read_iter+0x84/0xd0
> >>>>>  [   41.445592]  ? 0xffffffffa3000000
> >>>>>  [   41.462442]  new_sync_read+0x148/0x160
> >>>>>  [   41.479314]  ? 0xffffffffa3000000
> >>>>>  [   41.496937]  vfs_read+0x138/0x190
> >>>>>  [   41.517198]  ksys_read+0x87/0xc0
> >>>>>  [   41.535336]  do_syscall_64+0x3b/0x90
> >>>>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>>>>  [   41.568050] RIP: 0033:0x48765b
> >>>>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
> >>>>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
> >>>>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
> >>>>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
> >>>>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
> >>>>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
> >>>>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
> >>>>>  [   41.744254]  </TASK>
> >>>>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>>>>
> >>>>>  and
> >>>>>
> >>>>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
> >>>>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
> >>>>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
> >>>>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> >>>>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
> >>>>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
> >>>>>  [   33.532471] page dumped because: nonzero mapcount
> >>>>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>>>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
> >>>>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>>>>  [   33.532484] Call Trace:
> >>>>>  [   33.532496]  <TASK>
> >>>>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
> >>>>>  [   33.532506]  bad_page.cold+0x63/0x94
> >>>>>  [   33.532510]  free_pcp_prepare+0x290/0x420
> >>>>>  [   33.532515]  free_unref_page+0x1b/0x100
> >>>>>  [   33.532518]  skb_release_data+0x13f/0x1c0
> >>>>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
> >>>>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
> >>>>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
> >>>>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
> >>>>>
> >>>>> [3] XDP program to reproduce(xdp_pass.c):
> >>>>>  #include <linux/bpf.h>
> >>>>>  #include <bpf/bpf_helpers.h>
> >>>>>
> >>>>>  SEC("xdp_pass")
> >>>>>  int xdp_pkt_pass(struct xdp_md *ctx)
> >>>>>  {
> >>>>>           bpf_xdp_adjust_head(ctx, -(int)32);
> >>>>>           return XDP_PASS;
> >>>>>  }
> >>>>>
> >>>>>  char _license[] SEC("license") = "GPL";
> >>>>>
> >>>>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
> >>>>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
> >>>>>
> >>>>> CC: stable@vger.kernel.org
> >>>>> CC: Jason Wang <jasowang@redhat.com>
> >>>>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >>>>> CC: Daniel Borkmann <daniel@iogearbox.net>
> >>>>> CC: "Michael S. Tsirkin" <mst@redhat.com>
> >>>>> CC: virtualization@lists.linux-foundation.org
> >>>>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >>>>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> >>>>> ---
> >>>>>  drivers/net/virtio_net.c | 8 ++++++--
> >>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >>>>> index 87838cbe38cf..0687dd88e97f 100644
> >>>>> --- a/drivers/net/virtio_net.c
> >>>>> +++ b/drivers/net/virtio_net.c
> >>>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> >>>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
> >>>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
> >>>>>  	 */
> >>>>> -	truesize = headroom ? PAGE_SIZE : truesize;
> >>>>> +	if (headroom) {
> >>>>> +		truesize = PAGE_SIZE;
> >>>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
> >>>>
> >>>> The reason for not doing this is that buf and p may not be on the same page, and
> >>>> buf is probably not page-aligned.
> >>>>
> >>>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
> >>>> allocates a large block of memory at one time, and allocates from it each time.
> >>>> Although in xdp mode, each allocation is page_size, it does not guarantee that
> >>>> each allocation is page-aligned .
> >>>>
> >>>> The problem here is that the value of headroom is wrong, the package is
> >>>> structured like this:
> >>>>
> >>>> from device    | headroom          | virtio-net hdr | data |
> >>>> after xdp      | headroom  |  virtio-net hdr | meta | data |
> >>>
> >>> You're free to push data back (not necessarily through meta).
> >>> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
> >>>
> >>>>
> >>>> The page_address(page) + offset we pass to page_to_skb() points to the
> >>>> virtio-net hdr.
> >>>>
> >>>> So I think it might be better to change it this way.
> >>>>
> >>>> Thanks.
> >>>>
> >>>>
> >>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >>>> index 87838cbe38cf..086ae835ec86 100644
> >>>> --- a/drivers/net/virtio_net.c
> >>>> +++ b/drivers/net/virtio_net.c
> >>>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
> >>>>                                                        len, PAGE_SIZE, false,
> >>>>                                                        metasize,
> >>>> -                                                      VIRTIO_XDP_HEADROOM);
> >>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
> >>>>                                 return head_skb;
> >>>>                         }
> >>>>                         break;
> >>>
> >>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
> >>> So just doing that would take care of the meta, but won't take care of moving data.
> >>>
> >>
> >> Also it doesn't take care of the case where page_to_skb() is called with the original page
> >> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
> >
> > Yes, you are right.
> >
> >>
> >> The above change guarantees that buf and p will be in the same page
> >
> >
> > How can this be guaranteed?
> >
> > 1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
> >    from the allocation.
> > 2. set xdp
> > 3. alloc for new buffer
> >
>
> p = page_address(page) + offset;
> buf = p & PAGE_MASK; // whatever page p lands in is where buf is set
>
> => p and buf are always in the same page, no?

I don't think it is, it's entirely possible to split on two pages.

>
> > The buffer allocated in the third step must be unaligned, and it is entirely
> > possible that p and buf are not on the same page.
> >
> > I fixed my previous patch.
> >
> > Thanks.
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 87838cbe38cf..d95e82255b94 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1005,6 +1005,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >                          * xdp.data_meta were adjusted
> >                          */
> >                         len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
> > +
> > +                       headroom = xdp.data - vi->hdr_len - metasize - (buf - headroom);
>
> This is wrong, xdp.data isn't related to buf in the xdp_linearize_page() case.

Yes, you are right. For the case of xdp_linearize_page() , we should change it.

   headroom = xdp.data - vi->hdr_len - metasize - page_address(xdp_page);

Thanks.


>
> >                         /* We can only create skb based on xdp_page. */
> >                         if (unlikely(xdp_page != page)) {
> >                                 rcu_read_unlock();
> > @@ -1012,7 +1014,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
> >                                                        len, PAGE_SIZE, false,
> >                                                        metasize,
> > -                                                      VIRTIO_XDP_HEADROOM);
> > +                                                      headroom);
> >                                 return head_skb;
> >                         }
> >                         break;
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-23 14:46         ` Nikolay Aleksandrov
@ 2022-04-23 15:10           ` Nikolay Aleksandrov
  -1 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 15:10 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev

On 23/04/2022 17:46, Nikolay Aleksandrov wrote:
> On 23/04/2022 17:30, Nikolay Aleksandrov wrote:
>> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>>> We received a report[1] of kernel crashes when Cilium is used in XDP
>>>>> mode with virtio_net after updating to newer kernels. After
>>>>> investigating the reason it turned out that when using mergeable bufs
>>>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>>>>> calculates the build_skb address wrong because the offset can become less
>>>>> than the headroom so it gets the address of the previous page (-X bytes
>>>>> depending on how lower offset is):
>>>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>>>>
>>>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>>>>> shows offset that is less than headroom by adding 4 bytes of metadata
>>>>> via an xdp prog. The calculations done are:
>>>>>  receive_mergeable():
>>>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>>>>  offset = xdp.data - page_address(xdp_page) -
>>>>>           vi->hdr_len - metasize;
>>>>>
>>>>>  page_to_skb():
>>>>>  p = page_address(page) + offset;
>>>>>  ...
>>>>>  buf = p - headroom;
>>>>>
>>>>> Now buf goes -4 bytes from the page's starting address as can be seen
>>>>> above which is set as skb->head and skb->data by build_skb later. Depending
>>>>> on what's done with the skb (when it's freed most often) we get all kinds
>>>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>>>>> commit is interesting because the patch was sent and applied twice (it
>>>>> seems the first one got lost during merge back in 5.13 window). The
>>>>> first version of the patch that was applied as:
>>>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>>>>> was actually correct because it calculated the page starting address
>>>>> without relying on offset or headroom, but then the second version that
>>>>> was applied as:
>>>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>>> was wrong and added the above calculation.
>>>>> An example xdp prog[3] is below.
>>>>>
>>>>> [1] https://github.com/cilium/cilium/issues/19453
>>>>>
>>>>> [2] Two of the many traces:
> [snip]
>>>>>  drivers/net/virtio_net.c | 8 ++++++--
>>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>> index 87838cbe38cf..0687dd88e97f 100644
>>>>> --- a/drivers/net/virtio_net.c
>>>>> +++ b/drivers/net/virtio_net.c
>>>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>>>>  	 */
>>>>> -	truesize = headroom ? PAGE_SIZE : truesize;
>>>>> +	if (headroom) {
>>>>> +		truesize = PAGE_SIZE;
>>>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
>>>>
>>>> The reason for not doing this is that buf and p may not be on the same page, and
>>>> buf is probably not page-aligned.
>>>>
>>>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
>>>> allocates a large block of memory at one time, and allocates from it each time.
>>>> Although in xdp mode, each allocation is page_size, it does not guarantee that
>>>> each allocation is page-aligned .
>>>>
>>>> The problem here is that the value of headroom is wrong, the package is
>>>> structured like this:
>>>>
>>>> from device    | headroom          | virtio-net hdr | data |
>>>> after xdp      | headroom  |  virtio-net hdr | meta | data |
>>>
>>> You're free to push data back (not necessarily through meta).
>>> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
>>>
>>>>
>>>> The page_address(page) + offset we pass to page_to_skb() points to the
>>>> virtio-net hdr.
>>>>
>>>> So I think it might be better to change it this way.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 87838cbe38cf..086ae835ec86 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>>                                                        len, PAGE_SIZE, false,
>>>>                                                        metasize,
>>>> -                                                      VIRTIO_XDP_HEADROOM);
>>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>>                                 return head_skb;
>>>>                         }
>>>>                         break;
>>>
>>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>>> So just doing that would take care of the meta, but won't take care of moving data.
>>>
>>
>> Also it doesn't take care of the case where page_to_skb() is called with the original page
>> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
>>
>> The above change guarantees that buf and p will be in the same page and the skb_reserve() call will
>> make skb->data point to p - buf, i.e. to the beginning of the valid data in that page.
>> Unfortunately the new headroom will not be correct if it is a frag, it will be longer.
>>
>>
> 
> Completely untested alternative could be based on the offset size, that is if it has
> eaten into the headroom and is smaller then we swap them (that means we start at page
> boundary since we have headroom guaranteed space):
>  buf = page_address(page) + (offset > headroom ? offset - headroom : 0);
> 
> or perhaps in current code terms:
>  buf = p - (offset > headroom ? headroom : offset);
> 

Actually looking at add_recvbuf_mergeable() I take that back. We should look into a
different solution. That seems wrong as well.

If headroom can reside in 2 pages it is more difficult to get the correct address.

> That means offset is somewhere inside the headroom of the buf and, the buf itself
> starts at page boundary (when offset < headroom). I think this preserves the correct
> headroom for the new skb. WDYT?
> 
> Cheers,
>  Nik
> 
> 
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-23 15:10           ` Nikolay Aleksandrov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 15:10 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable,
	virtualization, kuba, davem

On 23/04/2022 17:46, Nikolay Aleksandrov wrote:
> On 23/04/2022 17:30, Nikolay Aleksandrov wrote:
>> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>>> We received a report[1] of kernel crashes when Cilium is used in XDP
>>>>> mode with virtio_net after updating to newer kernels. After
>>>>> investigating the reason it turned out that when using mergeable bufs
>>>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>>>>> calculates the build_skb address wrong because the offset can become less
>>>>> than the headroom so it gets the address of the previous page (-X bytes
>>>>> depending on how lower offset is):
>>>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>>>>
>>>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>>>>> shows offset that is less than headroom by adding 4 bytes of metadata
>>>>> via an xdp prog. The calculations done are:
>>>>>  receive_mergeable():
>>>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>>>>  offset = xdp.data - page_address(xdp_page) -
>>>>>           vi->hdr_len - metasize;
>>>>>
>>>>>  page_to_skb():
>>>>>  p = page_address(page) + offset;
>>>>>  ...
>>>>>  buf = p - headroom;
>>>>>
>>>>> Now buf goes -4 bytes from the page's starting address as can be seen
>>>>> above which is set as skb->head and skb->data by build_skb later. Depending
>>>>> on what's done with the skb (when it's freed most often) we get all kinds
>>>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>>>>> commit is interesting because the patch was sent and applied twice (it
>>>>> seems the first one got lost during merge back in 5.13 window). The
>>>>> first version of the patch that was applied as:
>>>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>>>>> was actually correct because it calculated the page starting address
>>>>> without relying on offset or headroom, but then the second version that
>>>>> was applied as:
>>>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>>> was wrong and added the above calculation.
>>>>> An example xdp prog[3] is below.
>>>>>
>>>>> [1] https://github.com/cilium/cilium/issues/19453
>>>>>
>>>>> [2] Two of the many traces:
> [snip]
>>>>>  drivers/net/virtio_net.c | 8 ++++++--
>>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>> index 87838cbe38cf..0687dd88e97f 100644
>>>>> --- a/drivers/net/virtio_net.c
>>>>> +++ b/drivers/net/virtio_net.c
>>>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>>>>  	 */
>>>>> -	truesize = headroom ? PAGE_SIZE : truesize;
>>>>> +	if (headroom) {
>>>>> +		truesize = PAGE_SIZE;
>>>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
>>>>
>>>> The reason for not doing this is that buf and p may not be on the same page, and
>>>> buf is probably not page-aligned.
>>>>
>>>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
>>>> allocates a large block of memory at one time, and allocates from it each time.
>>>> Although in xdp mode, each allocation is page_size, it does not guarantee that
>>>> each allocation is page-aligned .
>>>>
>>>> The problem here is that the value of headroom is wrong, the package is
>>>> structured like this:
>>>>
>>>> from device    | headroom          | virtio-net hdr | data |
>>>> after xdp      | headroom  |  virtio-net hdr | meta | data |
>>>
>>> You're free to push data back (not necessarily through meta).
>>> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
>>>
>>>>
>>>> The page_address(page) + offset we pass to page_to_skb() points to the
>>>> virtio-net hdr.
>>>>
>>>> So I think it might be better to change it this way.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 87838cbe38cf..086ae835ec86 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>>                                                        len, PAGE_SIZE, false,
>>>>                                                        metasize,
>>>> -                                                      VIRTIO_XDP_HEADROOM);
>>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>>                                 return head_skb;
>>>>                         }
>>>>                         break;
>>>
>>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>>> So just doing that would take care of the meta, but won't take care of moving data.
>>>
>>
>> Also it doesn't take care of the case where page_to_skb() is called with the original page
>> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
>>
>> The above change guarantees that buf and p will be in the same page and the skb_reserve() call will
>> make skb->data point to p - buf, i.e. to the beginning of the valid data in that page.
>> Unfortunately the new headroom will not be correct if it is a frag, it will be longer.
>>
>>
> 
> Completely untested alternative could be based on the offset size, that is if it has
> eaten into the headroom and is smaller then we swap them (that means we start at page
> boundary since we have headroom guaranteed space):
>  buf = page_address(page) + (offset > headroom ? offset - headroom : 0);
> 
> or perhaps in current code terms:
>  buf = p - (offset > headroom ? headroom : offset);
> 

Actually looking at add_recvbuf_mergeable() I take that back. We should look into a
different solution. That seems wrong as well.

If headroom can reside in 2 pages it is more difficult to get the correct address.

> That means offset is somewhere inside the headroom of the buf and, the buf itself
> starts at page boundary (when offset < headroom). I think this preserves the correct
> headroom for the new skb. WDYT?
> 
> Cheers,
>  Nik
> 
> 
> 

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-23 15:01             ` Xuan Zhuo
@ 2022-04-23 15:23               ` Nikolay Aleksandrov
  -1 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 15:23 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev

On 23/04/2022 18:01, Xuan Zhuo wrote:
> On Sat, 23 Apr 2022 17:58:05 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> On 23/04/2022 17:36, Xuan Zhuo wrote:
>>> On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>>>>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>>>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>>>>> We received a report[1] of kernel crashes when Cilium is used in XDP
>>>>>>> mode with virtio_net after updating to newer kernels. After
>>>>>>> investigating the reason it turned out that when using mergeable bufs
>>>>>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>>>>>>> calculates the build_skb address wrong because the offset can become less
>>>>>>> than the headroom so it gets the address of the previous page (-X bytes
>>>>>>> depending on how lower offset is):
>>>>>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>>>>>>
>>>>>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>>>>>>> shows offset that is less than headroom by adding 4 bytes of metadata
>>>>>>> via an xdp prog. The calculations done are:
>>>>>>>  receive_mergeable():
>>>>>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>>>>>>  offset = xdp.data - page_address(xdp_page) -
>>>>>>>           vi->hdr_len - metasize;
>>>>>>>
>>>>>>>  page_to_skb():
>>>>>>>  p = page_address(page) + offset;
>>>>>>>  ...
>>>>>>>  buf = p - headroom;
>>>>>>>
>>>>>>> Now buf goes -4 bytes from the page's starting address as can be seen
>>>>>>> above which is set as skb->head and skb->data by build_skb later. Depending
>>>>>>> on what's done with the skb (when it's freed most often) we get all kinds
>>>>>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>>>>>>> commit is interesting because the patch was sent and applied twice (it
>>>>>>> seems the first one got lost during merge back in 5.13 window). The
>>>>>>> first version of the patch that was applied as:
>>>>>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>>>>>>> was actually correct because it calculated the page starting address
>>>>>>> without relying on offset or headroom, but then the second version that
>>>>>>> was applied as:
>>>>>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>>>>> was wrong and added the above calculation.
>>>>>>> An example xdp prog[3] is below.
>>>>>>>
>>>>>>> [1] https://github.com/cilium/cilium/issues/19453
>>>>>>>
>>>>>>> [2] Two of the many traces:
>>>>>>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>>>>>>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>>>>>>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>>>>>>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>>>>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>>>>>>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>>>>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>>>>>>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>>>>>>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>>>>>>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>>>>>>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>>>>>>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>>>>>>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>>>>>>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>>>>>>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>>>>>>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>>>>>>>  [   41.321387] Call Trace:
>>>>>>>  [   41.321819]  <TASK>
>>>>>>>  [   41.322193]  skb_release_data+0x13f/0x1c0
>>>>>>>  [   41.322902]  __kfree_skb+0x20/0x30
>>>>>>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>>>>>>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>>>>>>>  [   41.384102]  inet_recvmsg+0x42/0x100
>>>>>>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>>>>>>>  [   41.428201]  sock_read_iter+0x84/0xd0
>>>>>>>  [   41.445592]  ? 0xffffffffa3000000
>>>>>>>  [   41.462442]  new_sync_read+0x148/0x160
>>>>>>>  [   41.479314]  ? 0xffffffffa3000000
>>>>>>>  [   41.496937]  vfs_read+0x138/0x190
>>>>>>>  [   41.517198]  ksys_read+0x87/0xc0
>>>>>>>  [   41.535336]  do_syscall_64+0x3b/0x90
>>>>>>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>>>  [   41.568050] RIP: 0033:0x48765b
>>>>>>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>>>>>>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>>>>>>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>>>>>>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>>>>>>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>>>>>>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>>>>>>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>>>>>>>  [   41.744254]  </TASK>
>>>>>>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>>>>>
>>>>>>>  and
>>>>>>>
>>>>>>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>>>>>>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>>>>>>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>>>>>>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>>>>>>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>>>>>>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>>>>>>>  [   33.532471] page dumped because: nonzero mapcount
>>>>>>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>>>>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>>>>>>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>>>>>  [   33.532484] Call Trace:
>>>>>>>  [   33.532496]  <TASK>
>>>>>>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>>>>>>>  [   33.532506]  bad_page.cold+0x63/0x94
>>>>>>>  [   33.532510]  free_pcp_prepare+0x290/0x420
>>>>>>>  [   33.532515]  free_unref_page+0x1b/0x100
>>>>>>>  [   33.532518]  skb_release_data+0x13f/0x1c0
>>>>>>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>>>>>>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>>>>>>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>>>>>>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>>>>>>>
>>>>>>> [3] XDP program to reproduce(xdp_pass.c):
>>>>>>>  #include <linux/bpf.h>
>>>>>>>  #include <bpf/bpf_helpers.h>
>>>>>>>
>>>>>>>  SEC("xdp_pass")
>>>>>>>  int xdp_pkt_pass(struct xdp_md *ctx)
>>>>>>>  {
>>>>>>>           bpf_xdp_adjust_head(ctx, -(int)32);
>>>>>>>           return XDP_PASS;
>>>>>>>  }
>>>>>>>
>>>>>>>  char _license[] SEC("license") = "GPL";
>>>>>>>
>>>>>>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>>>>>>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>>>>>>>
>>>>>>> CC: stable@vger.kernel.org
>>>>>>> CC: Jason Wang <jasowang@redhat.com>
>>>>>>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>>>>> CC: Daniel Borkmann <daniel@iogearbox.net>
>>>>>>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>>>>>>> CC: virtualization@lists.linux-foundation.org
>>>>>>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>>>>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>>>>>>> ---
>>>>>>>  drivers/net/virtio_net.c | 8 ++++++--
>>>>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>>>> index 87838cbe38cf..0687dd88e97f 100644
>>>>>>> --- a/drivers/net/virtio_net.c
>>>>>>> +++ b/drivers/net/virtio_net.c
>>>>>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>>>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>>>>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>>>>>>  	 */
>>>>>>> -	truesize = headroom ? PAGE_SIZE : truesize;
>>>>>>> +	if (headroom) {
>>>>>>> +		truesize = PAGE_SIZE;
>>>>>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
>>>>>>
>>>>>> The reason for not doing this is that buf and p may not be on the same page, and
>>>>>> buf is probably not page-aligned.
>>>>>>
>>>>>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
>>>>>> allocates a large block of memory at one time, and allocates from it each time.
>>>>>> Although in xdp mode, each allocation is page_size, it does not guarantee that
>>>>>> each allocation is page-aligned .
>>>>>>
>>>>>> The problem here is that the value of headroom is wrong, the package is
>>>>>> structured like this:
>>>>>>
>>>>>> from device    | headroom          | virtio-net hdr | data |
>>>>>> after xdp      | headroom  |  virtio-net hdr | meta | data |
>>>>>
>>>>> You're free to push data back (not necessarily through meta).
>>>>> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
>>>>>
>>>>>>
>>>>>> The page_address(page) + offset we pass to page_to_skb() points to the
>>>>>> virtio-net hdr.
>>>>>>
>>>>>> So I think it might be better to change it this way.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>>> index 87838cbe38cf..086ae835ec86 100644
>>>>>> --- a/drivers/net/virtio_net.c
>>>>>> +++ b/drivers/net/virtio_net.c
>>>>>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>>>>                                                        len, PAGE_SIZE, false,
>>>>>>                                                        metasize,
>>>>>> -                                                      VIRTIO_XDP_HEADROOM);
>>>>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>>>>                                 return head_skb;
>>>>>>                         }
>>>>>>                         break;
>>>>>
>>>>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>>>>> So just doing that would take care of the meta, but won't take care of moving data.
>>>>>
>>>>
>>>> Also it doesn't take care of the case where page_to_skb() is called with the original page
>>>> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
>>>
>>> Yes, you are right.
>>>
>>>>
>>>> The above change guarantees that buf and p will be in the same page
>>>
>>>
>>> How can this be guaranteed?
>>>
>>> 1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
>>>    from the allocation.
>>> 2. set xdp
>>> 3. alloc for new buffer
>>>
>>
>> p = page_address(page) + offset;
>> buf = p & PAGE_MASK; // whatever page p lands in is where buf is set
>>
>> => p and buf are always in the same page, no?
> 
> I don't think it is, it's entirely possible to split on two pages.
> 
>>
>>> The buffer allocated in the third step must be unaligned, and it is entirely
>>> possible that p and buf are not on the same page.
>>>
>>> I fixed my previous patch.
>>>
>>> Thanks.
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 87838cbe38cf..d95e82255b94 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -1005,6 +1005,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>                          * xdp.data_meta were adjusted
>>>                          */
>>>                         len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
>>> +
>>> +                       headroom = xdp.data - vi->hdr_len - metasize - (buf - headroom);
>>
>> This is wrong, xdp.data isn't related to buf in the xdp_linearize_page() case.
> 
> Yes, you are right. For the case of xdp_linearize_page() , we should change it.
> 
>    headroom = xdp.data - vi->hdr_len - metasize - page_address(xdp_page);
> 
> Thanks.
> 

That is equal to offset:
                       offset = xdp.data - page_address(xdp_page) -
                                 vi->hdr_len - metasize;

So I do agree that it will work, it is effectively what I also suggested in the
other email and it will be equal to just doing buf = page_address() in the xdp_linearize
case because p = page_address + offset, and now we do buf = p - headroom where headroom also
equals offset, so we get buf = page_address().

> 
>>
>>>                         /* We can only create skb based on xdp_page. */
>>>                         if (unlikely(xdp_page != page)) {
>>>                                 rcu_read_unlock();
>>> @@ -1012,7 +1014,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>                                                        len, PAGE_SIZE, false,
>>>                                                        metasize,
>>> -                                                      VIRTIO_XDP_HEADROOM);
>>> +                                                      headroom);
>>>                                 return head_skb;
>>>                         }
>>>                         break;
>>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-23 15:23               ` Nikolay Aleksandrov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 15:23 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable,
	virtualization, kuba, davem

On 23/04/2022 18:01, Xuan Zhuo wrote:
> On Sat, 23 Apr 2022 17:58:05 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> On 23/04/2022 17:36, Xuan Zhuo wrote:
>>> On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>>>>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>>>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>>>>> We received a report[1] of kernel crashes when Cilium is used in XDP
>>>>>>> mode with virtio_net after updating to newer kernels. After
>>>>>>> investigating the reason it turned out that when using mergeable bufs
>>>>>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>>>>>>> calculates the build_skb address wrong because the offset can become less
>>>>>>> than the headroom so it gets the address of the previous page (-X bytes
>>>>>>> depending on how lower offset is):
>>>>>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>>>>>>
>>>>>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>>>>>>> shows offset that is less than headroom by adding 4 bytes of metadata
>>>>>>> via an xdp prog. The calculations done are:
>>>>>>>  receive_mergeable():
>>>>>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>>>>>>  offset = xdp.data - page_address(xdp_page) -
>>>>>>>           vi->hdr_len - metasize;
>>>>>>>
>>>>>>>  page_to_skb():
>>>>>>>  p = page_address(page) + offset;
>>>>>>>  ...
>>>>>>>  buf = p - headroom;
>>>>>>>
>>>>>>> Now buf goes -4 bytes from the page's starting address as can be seen
>>>>>>> above which is set as skb->head and skb->data by build_skb later. Depending
>>>>>>> on what's done with the skb (when it's freed most often) we get all kinds
>>>>>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>>>>>>> commit is interesting because the patch was sent and applied twice (it
>>>>>>> seems the first one got lost during merge back in 5.13 window). The
>>>>>>> first version of the patch that was applied as:
>>>>>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>>>>>>> was actually correct because it calculated the page starting address
>>>>>>> without relying on offset or headroom, but then the second version that
>>>>>>> was applied as:
>>>>>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>>>>> was wrong and added the above calculation.
>>>>>>> An example xdp prog[3] is below.
>>>>>>>
>>>>>>> [1] https://github.com/cilium/cilium/issues/19453
>>>>>>>
>>>>>>> [2] Two of the many traces:
>>>>>>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>>>>>>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>>>>>>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>>>>>>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>>>>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>>>>>>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>>>>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>>>>>>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>>>>>>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>>>>>>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>>>>>>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>>>>>>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>>>>>>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>>>>>>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>>>>>>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>>>>>>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>>>>>>>  [   41.321387] Call Trace:
>>>>>>>  [   41.321819]  <TASK>
>>>>>>>  [   41.322193]  skb_release_data+0x13f/0x1c0
>>>>>>>  [   41.322902]  __kfree_skb+0x20/0x30
>>>>>>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>>>>>>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>>>>>>>  [   41.384102]  inet_recvmsg+0x42/0x100
>>>>>>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>>>>>>>  [   41.428201]  sock_read_iter+0x84/0xd0
>>>>>>>  [   41.445592]  ? 0xffffffffa3000000
>>>>>>>  [   41.462442]  new_sync_read+0x148/0x160
>>>>>>>  [   41.479314]  ? 0xffffffffa3000000
>>>>>>>  [   41.496937]  vfs_read+0x138/0x190
>>>>>>>  [   41.517198]  ksys_read+0x87/0xc0
>>>>>>>  [   41.535336]  do_syscall_64+0x3b/0x90
>>>>>>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>>>  [   41.568050] RIP: 0033:0x48765b
>>>>>>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>>>>>>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>>>>>>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>>>>>>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>>>>>>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>>>>>>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>>>>>>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>>>>>>>  [   41.744254]  </TASK>
>>>>>>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>>>>>
>>>>>>>  and
>>>>>>>
>>>>>>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>>>>>>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>>>>>>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>>>>>>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>>>>>>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>>>>>>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>>>>>>>  [   33.532471] page dumped because: nonzero mapcount
>>>>>>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>>>>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>>>>>>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>>>>>  [   33.532484] Call Trace:
>>>>>>>  [   33.532496]  <TASK>
>>>>>>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>>>>>>>  [   33.532506]  bad_page.cold+0x63/0x94
>>>>>>>  [   33.532510]  free_pcp_prepare+0x290/0x420
>>>>>>>  [   33.532515]  free_unref_page+0x1b/0x100
>>>>>>>  [   33.532518]  skb_release_data+0x13f/0x1c0
>>>>>>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>>>>>>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>>>>>>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>>>>>>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>>>>>>>
>>>>>>> [3] XDP program to reproduce(xdp_pass.c):
>>>>>>>  #include <linux/bpf.h>
>>>>>>>  #include <bpf/bpf_helpers.h>
>>>>>>>
>>>>>>>  SEC("xdp_pass")
>>>>>>>  int xdp_pkt_pass(struct xdp_md *ctx)
>>>>>>>  {
>>>>>>>           bpf_xdp_adjust_head(ctx, -(int)32);
>>>>>>>           return XDP_PASS;
>>>>>>>  }
>>>>>>>
>>>>>>>  char _license[] SEC("license") = "GPL";
>>>>>>>
>>>>>>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>>>>>>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>>>>>>>
>>>>>>> CC: stable@vger.kernel.org
>>>>>>> CC: Jason Wang <jasowang@redhat.com>
>>>>>>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>>>>> CC: Daniel Borkmann <daniel@iogearbox.net>
>>>>>>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>>>>>>> CC: virtualization@lists.linux-foundation.org
>>>>>>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>>>>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>>>>>>> ---
>>>>>>>  drivers/net/virtio_net.c | 8 ++++++--
>>>>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>>>> index 87838cbe38cf..0687dd88e97f 100644
>>>>>>> --- a/drivers/net/virtio_net.c
>>>>>>> +++ b/drivers/net/virtio_net.c
>>>>>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>>>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>>>>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>>>>>>  	 */
>>>>>>> -	truesize = headroom ? PAGE_SIZE : truesize;
>>>>>>> +	if (headroom) {
>>>>>>> +		truesize = PAGE_SIZE;
>>>>>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
>>>>>>
>>>>>> The reason for not doing this is that buf and p may not be on the same page, and
>>>>>> buf is probably not page-aligned.
>>>>>>
>>>>>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
>>>>>> allocates a large block of memory at one time, and allocates from it each time.
>>>>>> Although in xdp mode, each allocation is page_size, it does not guarantee that
>>>>>> each allocation is page-aligned .
>>>>>>
>>>>>> The problem here is that the value of headroom is wrong, the package is
>>>>>> structured like this:
>>>>>>
>>>>>> from device    | headroom          | virtio-net hdr | data |
>>>>>> after xdp      | headroom  |  virtio-net hdr | meta | data |
>>>>>
>>>>> You're free to push data back (not necessarily through meta).
>>>>> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
>>>>>
>>>>>>
>>>>>> The page_address(page) + offset we pass to page_to_skb() points to the
>>>>>> virtio-net hdr.
>>>>>>
>>>>>> So I think it might be better to change it this way.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>>> index 87838cbe38cf..086ae835ec86 100644
>>>>>> --- a/drivers/net/virtio_net.c
>>>>>> +++ b/drivers/net/virtio_net.c
>>>>>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>>>>                                                        len, PAGE_SIZE, false,
>>>>>>                                                        metasize,
>>>>>> -                                                      VIRTIO_XDP_HEADROOM);
>>>>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>>>>                                 return head_skb;
>>>>>>                         }
>>>>>>                         break;
>>>>>
>>>>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>>>>> So just doing that would take care of the meta, but won't take care of moving data.
>>>>>
>>>>
>>>> Also it doesn't take care of the case where page_to_skb() is called with the original page
>>>> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
>>>
>>> Yes, you are right.
>>>
>>>>
>>>> The above change guarantees that buf and p will be in the same page
>>>
>>>
>>> How can this be guaranteed?
>>>
>>> 1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
>>>    from the allocation.
>>> 2. set xdp
>>> 3. alloc for new buffer
>>>
>>
>> p = page_address(page) + offset;
>> buf = p & PAGE_MASK; // whatever page p lands in is where buf is set
>>
>> => p and buf are always in the same page, no?
> 
> I don't think it is, it's entirely possible to split on two pages.
> 
>>
>>> The buffer allocated in the third step must be unaligned, and it is entirely
>>> possible that p and buf are not on the same page.
>>>
>>> I fixed my previous patch.
>>>
>>> Thanks.
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 87838cbe38cf..d95e82255b94 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -1005,6 +1005,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>                          * xdp.data_meta were adjusted
>>>                          */
>>>                         len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
>>> +
>>> +                       headroom = xdp.data - vi->hdr_len - metasize - (buf - headroom);
>>
>> This is wrong, xdp.data isn't related to buf in the xdp_linearize_page() case.
> 
> Yes, you are right. For the case of xdp_linearize_page() , we should change it.
> 
>    headroom = xdp.data - vi->hdr_len - metasize - page_address(xdp_page);
> 
> Thanks.
> 

That is equal to offset:
                       offset = xdp.data - page_address(xdp_page) -
                                 vi->hdr_len - metasize;

So I do agree that it will work, it is effectively what I also suggested in the
other email and it will be equal to just doing buf = page_address() in the xdp_linearize
case because p = page_address + offset, and now we do buf = p - headroom where headroom also
equals offset, so we get buf = page_address().

> 
>>
>>>                         /* We can only create skb based on xdp_page. */
>>>                         if (unlikely(xdp_page != page)) {
>>>                                 rcu_read_unlock();
>>> @@ -1012,7 +1014,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>                                                        len, PAGE_SIZE, false,
>>>                                                        metasize,
>>> -                                                      VIRTIO_XDP_HEADROOM);
>>> +                                                      headroom);
>>>                                 return head_skb;
>>>                         }
>>>                         break;
>>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-23 15:23               ` Nikolay Aleksandrov
@ 2022-04-23 15:35                 ` Nikolay Aleksandrov
  -1 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 15:35 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev

On 23/04/2022 18:23, Nikolay Aleksandrov wrote:
> On 23/04/2022 18:01, Xuan Zhuo wrote:
>> On Sat, 23 Apr 2022 17:58:05 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>> On 23/04/2022 17:36, Xuan Zhuo wrote:
>>>> On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>>> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>>>>>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>>>>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
[snip]
>>>>>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>>>>>                                 return head_skb;
>>>>>>>                         }
>>>>>>>                         break;
>>>>>>
>>>>>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>>>>>> So just doing that would take care of the meta, but won't take care of moving data.
>>>>>>
>>>>>
>>>>> Also it doesn't take care of the case where page_to_skb() is called with the original page
>>>>> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
>>>>
>>>> Yes, you are right.
>>>>
>>>>>
>>>>> The above change guarantees that buf and p will be in the same page
>>>>
>>>>
>>>> How can this be guaranteed?
>>>>
>>>> 1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
>>>>    from the allocation.
>>>> 2. set xdp
>>>> 3. alloc for new buffer
>>>>
>>>
>>> p = page_address(page) + offset;
>>> buf = p & PAGE_MASK; // whatever page p lands in is where buf is set
>>>
>>> => p and buf are always in the same page, no?
>>
>> I don't think it is, it's entirely possible to split on two pages.
>>
>>>
>>>> The buffer allocated in the third step must be unaligned, and it is entirely
>>>> possible that p and buf are not on the same page.
>>>>
>>>> I fixed my previous patch.
>>>>
>>>> Thanks.
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 87838cbe38cf..d95e82255b94 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -1005,6 +1005,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>                          * xdp.data_meta were adjusted
>>>>                          */
>>>>                         len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
>>>> +
>>>> +                       headroom = xdp.data - vi->hdr_len - metasize - (buf - headroom);
>>>
>>> This is wrong, xdp.data isn't related to buf in the xdp_linearize_page() case.
>>
>> Yes, you are right. For the case of xdp_linearize_page() , we should change it.
>>
>>    headroom = xdp.data - vi->hdr_len - metasize - page_address(xdp_page);
>>
>> Thanks.
>>
> 
> That is equal to offset:
>                        offset = xdp.data - page_address(xdp_page) -
>                                  vi->hdr_len - metasize;
> 
> So I do agree that it will work, it is effectively what I also suggested in the
> other email and it will be equal to just doing buf = page_address() in the xdp_linearize
> case because p = page_address + offset, and now we do buf = p - headroom where headroom also
> equals offset, so we get buf = page_address().
> 

All of the above is equivalent to:
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 87838cbe38cf..12e88980e4b3 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1012,8 +1012,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
                                head_skb = page_to_skb(vi, rq, xdp_page, offset,
                                                       len, PAGE_SIZE, false,
                                                       metasize,
-                                                      VIRTIO_XDP_HEADROOM);
+                                                      offset);
                                return head_skb;
+                       } else {
+                               headroom = xdp.data - (buf - headroom) - vi->hdr_len - metasize;
                        }
                        break;
                case XDP_TX:

I agree with that, it is also equivalent to my proposal in the other email. It adjusts the new
headroom after the xdp prog which is ok. I'll wait (and test it in the meantime) for other
feedback and if all agree I'll post v2.

>>
>>>
>>>>                         /* We can only create skb based on xdp_page. */
>>>>                         if (unlikely(xdp_page != page)) {
>>>>                                 rcu_read_unlock();
>>>> @@ -1012,7 +1014,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>>                                                        len, PAGE_SIZE, false,
>>>>                                                        metasize,
>>>> -                                                      VIRTIO_XDP_HEADROOM);
>>>> +                                                      headroom);
>>>>                                 return head_skb;
>>>>                         }
>>>>                         break;
>>>
> 


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-23 15:35                 ` Nikolay Aleksandrov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 15:35 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable,
	virtualization, kuba, davem

On 23/04/2022 18:23, Nikolay Aleksandrov wrote:
> On 23/04/2022 18:01, Xuan Zhuo wrote:
>> On Sat, 23 Apr 2022 17:58:05 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>> On 23/04/2022 17:36, Xuan Zhuo wrote:
>>>> On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>>> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>>>>>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>>>>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
[snip]
>>>>>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>>>>>                                 return head_skb;
>>>>>>>                         }
>>>>>>>                         break;
>>>>>>
>>>>>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>>>>>> So just doing that would take care of the meta, but won't take care of moving data.
>>>>>>
>>>>>
>>>>> Also it doesn't take care of the case where page_to_skb() is called with the original page
>>>>> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
>>>>
>>>> Yes, you are right.
>>>>
>>>>>
>>>>> The above change guarantees that buf and p will be in the same page
>>>>
>>>>
>>>> How can this be guaranteed?
>>>>
>>>> 1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
>>>>    from the allocation.
>>>> 2. set xdp
>>>> 3. alloc for new buffer
>>>>
>>>
>>> p = page_address(page) + offset;
>>> buf = p & PAGE_MASK; // whatever page p lands in is where buf is set
>>>
>>> => p and buf are always in the same page, no?
>>
>> I don't think it is, it's entirely possible to split on two pages.
>>
>>>
>>>> The buffer allocated in the third step must be unaligned, and it is entirely
>>>> possible that p and buf are not on the same page.
>>>>
>>>> I fixed my previous patch.
>>>>
>>>> Thanks.
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 87838cbe38cf..d95e82255b94 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -1005,6 +1005,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>                          * xdp.data_meta were adjusted
>>>>                          */
>>>>                         len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
>>>> +
>>>> +                       headroom = xdp.data - vi->hdr_len - metasize - (buf - headroom);
>>>
>>> This is wrong, xdp.data isn't related to buf in the xdp_linearize_page() case.
>>
>> Yes, you are right. For the case of xdp_linearize_page() , we should change it.
>>
>>    headroom = xdp.data - vi->hdr_len - metasize - page_address(xdp_page);
>>
>> Thanks.
>>
> 
> That is equal to offset:
>                        offset = xdp.data - page_address(xdp_page) -
>                                  vi->hdr_len - metasize;
> 
> So I do agree that it will work, it is effectively what I also suggested in the
> other email and it will be equal to just doing buf = page_address() in the xdp_linearize
> case because p = page_address + offset, and now we do buf = p - headroom where headroom also
> equals offset, so we get buf = page_address().
> 

All of the above is equivalent to:
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 87838cbe38cf..12e88980e4b3 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1012,8 +1012,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
                                head_skb = page_to_skb(vi, rq, xdp_page, offset,
                                                       len, PAGE_SIZE, false,
                                                       metasize,
-                                                      VIRTIO_XDP_HEADROOM);
+                                                      offset);
                                return head_skb;
+                       } else {
+                               headroom = xdp.data - (buf - headroom) - vi->hdr_len - metasize;
                        }
                        break;
                case XDP_TX:

I agree with that, it is also equivalent to my proposal in the other email. It adjusts the new
headroom after the xdp prog which is ok. I'll wait (and test it in the meantime) for other
feedback and if all agree I'll post v2.

>>
>>>
>>>>                         /* We can only create skb based on xdp_page. */
>>>>                         if (unlikely(xdp_page != page)) {
>>>>                                 rcu_read_unlock();
>>>> @@ -1012,7 +1014,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>>                                                        len, PAGE_SIZE, false,
>>>>                                                        metasize,
>>>> -                                                      VIRTIO_XDP_HEADROOM);
>>>> +                                                      headroom);
>>>>                                 return head_skb;
>>>>                         }
>>>>                         break;
>>>
> 

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-23 15:01             ` Xuan Zhuo
@ 2022-04-23 15:55               ` Nikolay Aleksandrov
  -1 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 15:55 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev

On 23/04/2022 18:01, Xuan Zhuo wrote:
> On Sat, 23 Apr 2022 17:58:05 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> On 23/04/2022 17:36, Xuan Zhuo wrote:
>>> On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>>>>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>>>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
[snip]                                   metasize,
>>>>>> -                                                      VIRTIO_XDP_HEADROOM);
>>>>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>>>>                                 return head_skb;
>>>>>>                         }
>>>>>>                         break;
>>>>>
>>>>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>>>>> So just doing that would take care of the meta, but won't take care of moving data.
>>>>>
>>>>
>>>> Also it doesn't take care of the case where page_to_skb() is called with the original page
>>>> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
>>>
>>> Yes, you are right.
>>>
>>>>
>>>> The above change guarantees that buf and p will be in the same page
>>>
>>>
>>> How can this be guaranteed?
>>>
>>> 1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
>>>    from the allocation.
>>> 2. set xdp
>>> 3. alloc for new buffer
>>>
>>
>> p = page_address(page) + offset;
>> buf = p & PAGE_MASK; // whatever page p lands in is where buf is set
>>
>> => p and buf are always in the same page, no?
> 
> I don't think it is, it's entirely possible to split on two pages.
> 

Ahhh, I completely misinterpreted page_address(). You're right.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-23 15:55               ` Nikolay Aleksandrov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-23 15:55 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable,
	virtualization, kuba, davem

On 23/04/2022 18:01, Xuan Zhuo wrote:
> On Sat, 23 Apr 2022 17:58:05 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> On 23/04/2022 17:36, Xuan Zhuo wrote:
>>> On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>>>>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>>>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
[snip]                                   metasize,
>>>>>> -                                                      VIRTIO_XDP_HEADROOM);
>>>>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>>>>                                 return head_skb;
>>>>>>                         }
>>>>>>                         break;
>>>>>
>>>>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>>>>> So just doing that would take care of the meta, but won't take care of moving data.
>>>>>
>>>>
>>>> Also it doesn't take care of the case where page_to_skb() is called with the original page
>>>> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
>>>
>>> Yes, you are right.
>>>
>>>>
>>>> The above change guarantees that buf and p will be in the same page
>>>
>>>
>>> How can this be guaranteed?
>>>
>>> 1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
>>>    from the allocation.
>>> 2. set xdp
>>> 3. alloc for new buffer
>>>
>>
>> p = page_address(page) + offset;
>> buf = p & PAGE_MASK; // whatever page p lands in is where buf is set
>>
>> => p and buf are always in the same page, no?
> 
> I don't think it is, it's entirely possible to split on two pages.
> 

Ahhh, I completely misinterpreted page_address(). You're right.


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH net v2] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-23 15:35                 ` Nikolay Aleksandrov
@ 2022-04-24 10:21                   ` Nikolay Aleksandrov
  -1 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-24 10:21 UTC (permalink / raw)
  To: netdev
  Cc: kuba, davem, Nikolay Aleksandrov, stable, Jason Wang, Xuan Zhuo,
	Daniel Borkmann, Michael S. Tsirkin, virtualization

We received a report[1] of kernel crashes when Cilium is used in XDP
mode with virtio_net after updating to newer kernels. After
investigating the reason it turned out that when using mergeable bufs
with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
calculates the build_skb address wrong because the offset can become less
than the headroom so it gets the address of the previous page (-X bytes
depending on how lower offset is):
 page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256

This is a pr_err() I added in the beginning of page_to_skb which clearly
shows offset that is less than headroom by adding 4 bytes of metadata
via an xdp prog. The calculations done are:
 receive_mergeable():
 headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
 offset = xdp.data - page_address(xdp_page) -
          vi->hdr_len - metasize;

 page_to_skb():
 p = page_address(page) + offset;
 ...
 buf = p - headroom;

Now buf goes -4 bytes from the page's starting address as can be seen
above which is set as skb->head and skb->data by build_skb later. Depending
on what's done with the skb (when it's freed most often) we get all kinds
of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
the new headroom after the xdp program has run, similar to how offset
and len are recalculated. Headroom is directly related to
data_hard_start, data and data_meta, so we use them to get the new size.
The result is correct (similar pr_err() in page_to_skb, one case of
xdp_page and one case of virtnet buf):
 a) Case with 4 bytes of metadata
 [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
 [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
 b) Case of pushing data +32 bytes
 [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
 [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
 c) Case of pushing data -33 bytes
 [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
 [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223

An example reproducer xdp prog[3] is below.

[1] https://github.com/cilium/cilium/issues/19453

[2] Two of the many traces:
 [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
 [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
 [   41.300891] kernel BUG at include/linux/mm.h:720!
 [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
 [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
 [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
 [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
 [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
 [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
 [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
 [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
 [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
 [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
 [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
 [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
 [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
 [   41.321387] Call Trace:
 [   41.321819]  <TASK>
 [   41.322193]  skb_release_data+0x13f/0x1c0
 [   41.322902]  __kfree_skb+0x20/0x30
 [   41.343870]  tcp_recvmsg_locked+0x671/0x880
 [   41.363764]  tcp_recvmsg+0x5e/0x1c0
 [   41.384102]  inet_recvmsg+0x42/0x100
 [   41.406783]  ? sock_recvmsg+0x1d/0x70
 [   41.428201]  sock_read_iter+0x84/0xd0
 [   41.445592]  ? 0xffffffffa3000000
 [   41.462442]  new_sync_read+0x148/0x160
 [   41.479314]  ? 0xffffffffa3000000
 [   41.496937]  vfs_read+0x138/0x190
 [   41.517198]  ksys_read+0x87/0xc0
 [   41.535336]  do_syscall_64+0x3b/0x90
 [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
 [   41.568050] RIP: 0033:0x48765b
 [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
 [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
 [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
 [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
 [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
 [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
 [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
 [   41.744254]  </TASK>
 [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net

 and

 [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
 [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
 [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
 [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
 [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
 [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
 [   33.532471] page dumped because: nonzero mapcount
 [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
 [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
 [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
 [   33.532484] Call Trace:
 [   33.532496]  <TASK>
 [   33.532500]  dump_stack_lvl+0x45/0x5a
 [   33.532506]  bad_page.cold+0x63/0x94
 [   33.532510]  free_pcp_prepare+0x290/0x420
 [   33.532515]  free_unref_page+0x1b/0x100
 [   33.532518]  skb_release_data+0x13f/0x1c0
 [   33.532524]  kfree_skb_reason+0x3e/0xc0
 [   33.532527]  ip6_mc_input+0x23c/0x2b0
 [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
 [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0

[3] XDP program to reproduce(xdp_pass.c):
 #include <linux/bpf.h>
 #include <bpf/bpf_helpers.h>

 SEC("xdp_pass")
 int xdp_pkt_pass(struct xdp_md *ctx)
 {
          bpf_xdp_adjust_head(ctx, -(int)32);
          return XDP_PASS;
 }

 char _license[] SEC("license") = "GPL";

 compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
 load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass

CC: stable@vger.kernel.org
CC: Jason Wang <jasowang@redhat.com>
CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: virtualization@lists.linux-foundation.org
Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
---
v2: Recalculate headroom based on data, data_hard_start and data_meta

 drivers/net/virtio_net.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 87838cbe38cf..a12338de7ef1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1005,6 +1005,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			 * xdp.data_meta were adjusted
 			 */
 			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
+
+			/* recalculate headroom if xdp.data or xdp.data_meta
+			 * were adjusted
+			 */
+			headroom = xdp.data - xdp.data_hard_start - metasize;
+
 			/* We can only create skb based on xdp_page. */
 			if (unlikely(xdp_page != page)) {
 				rcu_read_unlock();
@@ -1012,7 +1018,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 				head_skb = page_to_skb(vi, rq, xdp_page, offset,
 						       len, PAGE_SIZE, false,
 						       metasize,
-						       VIRTIO_XDP_HEADROOM);
+						       headroom);
 				return head_skb;
 			}
 			break;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH net v2] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-24 10:21                   ` Nikolay Aleksandrov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-24 10:21 UTC (permalink / raw)
  To: netdev
  Cc: Daniel Borkmann, Michael S. Tsirkin, stable, davem, kuba,
	virtualization, Nikolay Aleksandrov

We received a report[1] of kernel crashes when Cilium is used in XDP
mode with virtio_net after updating to newer kernels. After
investigating the reason it turned out that when using mergeable bufs
with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
calculates the build_skb address wrong because the offset can become less
than the headroom so it gets the address of the previous page (-X bytes
depending on how lower offset is):
 page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256

This is a pr_err() I added in the beginning of page_to_skb which clearly
shows offset that is less than headroom by adding 4 bytes of metadata
via an xdp prog. The calculations done are:
 receive_mergeable():
 headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
 offset = xdp.data - page_address(xdp_page) -
          vi->hdr_len - metasize;

 page_to_skb():
 p = page_address(page) + offset;
 ...
 buf = p - headroom;

Now buf goes -4 bytes from the page's starting address as can be seen
above which is set as skb->head and skb->data by build_skb later. Depending
on what's done with the skb (when it's freed most often) we get all kinds
of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
the new headroom after the xdp program has run, similar to how offset
and len are recalculated. Headroom is directly related to
data_hard_start, data and data_meta, so we use them to get the new size.
The result is correct (similar pr_err() in page_to_skb, one case of
xdp_page and one case of virtnet buf):
 a) Case with 4 bytes of metadata
 [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
 [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
 b) Case of pushing data +32 bytes
 [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
 [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
 c) Case of pushing data -33 bytes
 [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
 [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223

An example reproducer xdp prog[3] is below.

[1] https://github.com/cilium/cilium/issues/19453

[2] Two of the many traces:
 [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
 [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
 [   41.300891] kernel BUG at include/linux/mm.h:720!
 [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
 [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
 [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
 [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
 [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
 [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
 [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
 [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
 [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
 [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
 [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
 [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
 [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
 [   41.321387] Call Trace:
 [   41.321819]  <TASK>
 [   41.322193]  skb_release_data+0x13f/0x1c0
 [   41.322902]  __kfree_skb+0x20/0x30
 [   41.343870]  tcp_recvmsg_locked+0x671/0x880
 [   41.363764]  tcp_recvmsg+0x5e/0x1c0
 [   41.384102]  inet_recvmsg+0x42/0x100
 [   41.406783]  ? sock_recvmsg+0x1d/0x70
 [   41.428201]  sock_read_iter+0x84/0xd0
 [   41.445592]  ? 0xffffffffa3000000
 [   41.462442]  new_sync_read+0x148/0x160
 [   41.479314]  ? 0xffffffffa3000000
 [   41.496937]  vfs_read+0x138/0x190
 [   41.517198]  ksys_read+0x87/0xc0
 [   41.535336]  do_syscall_64+0x3b/0x90
 [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
 [   41.568050] RIP: 0033:0x48765b
 [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
 [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
 [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
 [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
 [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
 [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
 [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
 [   41.744254]  </TASK>
 [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net

 and

 [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
 [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
 [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
 [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
 [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
 [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
 [   33.532471] page dumped because: nonzero mapcount
 [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
 [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
 [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
 [   33.532484] Call Trace:
 [   33.532496]  <TASK>
 [   33.532500]  dump_stack_lvl+0x45/0x5a
 [   33.532506]  bad_page.cold+0x63/0x94
 [   33.532510]  free_pcp_prepare+0x290/0x420
 [   33.532515]  free_unref_page+0x1b/0x100
 [   33.532518]  skb_release_data+0x13f/0x1c0
 [   33.532524]  kfree_skb_reason+0x3e/0xc0
 [   33.532527]  ip6_mc_input+0x23c/0x2b0
 [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
 [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0

[3] XDP program to reproduce(xdp_pass.c):
 #include <linux/bpf.h>
 #include <bpf/bpf_helpers.h>

 SEC("xdp_pass")
 int xdp_pkt_pass(struct xdp_md *ctx)
 {
          bpf_xdp_adjust_head(ctx, -(int)32);
          return XDP_PASS;
 }

 char _license[] SEC("license") = "GPL";

 compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
 load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass

CC: stable@vger.kernel.org
CC: Jason Wang <jasowang@redhat.com>
CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: virtualization@lists.linux-foundation.org
Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
---
v2: Recalculate headroom based on data, data_hard_start and data_meta

 drivers/net/virtio_net.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 87838cbe38cf..a12338de7ef1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1005,6 +1005,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			 * xdp.data_meta were adjusted
 			 */
 			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
+
+			/* recalculate headroom if xdp.data or xdp.data_meta
+			 * were adjusted
+			 */
+			headroom = xdp.data - xdp.data_hard_start - metasize;
+
 			/* We can only create skb based on xdp_page. */
 			if (unlikely(xdp_page != page)) {
 				rcu_read_unlock();
@@ -1012,7 +1018,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 				head_skb = page_to_skb(vi, rq, xdp_page, offset,
 						       len, PAGE_SIZE, false,
 						       metasize,
-						       VIRTIO_XDP_HEADROOM);
+						       headroom);
 				return head_skb;
 			}
 			break;
-- 
2.35.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH net v2] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-24 10:21                   ` Nikolay Aleksandrov
@ 2022-04-24 10:42                     ` Xuan Zhuo
  -1 siblings, 0 replies; 36+ messages in thread
From: Xuan Zhuo @ 2022-04-24 10:42 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable, davem, kuba,
	virtualization, Nikolay Aleksandrov

On Sun, 24 Apr 2022 13:21:21 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> We received a report[1] of kernel crashes when Cilium is used in XDP
> mode with virtio_net after updating to newer kernels. After
> investigating the reason it turned out that when using mergeable bufs
> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> calculates the build_skb address wrong because the offset can become less
> than the headroom so it gets the address of the previous page (-X bytes
> depending on how lower offset is):
>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>
> This is a pr_err() I added in the beginning of page_to_skb which clearly
> shows offset that is less than headroom by adding 4 bytes of metadata
> via an xdp prog. The calculations done are:
>  receive_mergeable():
>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>  offset = xdp.data - page_address(xdp_page) -
>           vi->hdr_len - metasize;
>
>  page_to_skb():
>  p = page_address(page) + offset;
>  ...
>  buf = p - headroom;
>
> Now buf goes -4 bytes from the page's starting address as can be seen
> above which is set as skb->head and skb->data by build_skb later. Depending
> on what's done with the skb (when it's freed most often) we get all kinds
> of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
> the new headroom after the xdp program has run, similar to how offset
> and len are recalculated. Headroom is directly related to
> data_hard_start, data and data_meta, so we use them to get the new size.
> The result is correct (similar pr_err() in page_to_skb, one case of
> xdp_page and one case of virtnet buf):
>  a) Case with 4 bytes of metadata
>  [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
>  [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
>  b) Case of pushing data +32 bytes
>  [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
>  [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
>  c) Case of pushing data -33 bytes
>  [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
>  [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223
>
> An example reproducer xdp prog[3] is below.
>
> [1] https://github.com/cilium/cilium/issues/19453
>
> [2] Two of the many traces:
>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>  [   41.321387] Call Trace:
>  [   41.321819]  <TASK>
>  [   41.322193]  skb_release_data+0x13f/0x1c0
>  [   41.322902]  __kfree_skb+0x20/0x30
>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>  [   41.384102]  inet_recvmsg+0x42/0x100
>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>  [   41.428201]  sock_read_iter+0x84/0xd0
>  [   41.445592]  ? 0xffffffffa3000000
>  [   41.462442]  new_sync_read+0x148/0x160
>  [   41.479314]  ? 0xffffffffa3000000
>  [   41.496937]  vfs_read+0x138/0x190
>  [   41.517198]  ksys_read+0x87/0xc0
>  [   41.535336]  do_syscall_64+0x3b/0x90
>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>  [   41.568050] RIP: 0033:0x48765b
>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>  [   41.744254]  </TASK>
>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>
>  and
>
>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>  [   33.532471] page dumped because: nonzero mapcount
>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>  [   33.532484] Call Trace:
>  [   33.532496]  <TASK>
>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>  [   33.532506]  bad_page.cold+0x63/0x94
>  [   33.532510]  free_pcp_prepare+0x290/0x420
>  [   33.532515]  free_unref_page+0x1b/0x100
>  [   33.532518]  skb_release_data+0x13f/0x1c0
>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>
> [3] XDP program to reproduce(xdp_pass.c):
>  #include <linux/bpf.h>
>  #include <bpf/bpf_helpers.h>
>
>  SEC("xdp_pass")
>  int xdp_pkt_pass(struct xdp_md *ctx)
>  {
>           bpf_xdp_adjust_head(ctx, -(int)32);
>           return XDP_PASS;
>  }
>
>  char _license[] SEC("license") = "GPL";
>
>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>
> CC: stable@vger.kernel.org
> CC: Jason Wang <jasowang@redhat.com>
> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> CC: Daniel Borkmann <daniel@iogearbox.net>
> CC: "Michael S. Tsirkin" <mst@redhat.com>
> CC: virtualization@lists.linux-foundation.org
> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> ---
> v2: Recalculate headroom based on data, data_hard_start and data_meta
>
>  drivers/net/virtio_net.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 87838cbe38cf..a12338de7ef1 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1005,6 +1005,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  			 * xdp.data_meta were adjusted
>  			 */
>  			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
> +
> +			/* recalculate headroom if xdp.data or xdp.data_meta
> +			 * were adjusted
> +			 */
> +			headroom = xdp.data - xdp.data_hard_start - metasize;


This is incorrect.


		data = page_address(xdp_page) + offset;
		xdp_init_buff(&xdp, frame_sz - vi->hdr_len, &rq->xdp_rxq);
		xdp_prepare_buff(&xdp, data - VIRTIO_XDP_HEADROOM + vi->hdr_len,
				 VIRTIO_XDP_HEADROOM, len - vi->hdr_len, true);

so: xdp.data_hard_start = page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM + vi->hdr_len

(page_address(xdp_page) + offset) points to virtio-net hdr.
(page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM) points to the allocated buf.

xdp.data_hard_start points to buf + vi->hdr_len

Thanks.


> +
>  			/* We can only create skb based on xdp_page. */
>  			if (unlikely(xdp_page != page)) {
>  				rcu_read_unlock();
> @@ -1012,7 +1018,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  				head_skb = page_to_skb(vi, rq, xdp_page, offset,
>  						       len, PAGE_SIZE, false,
>  						       metasize,
> -						       VIRTIO_XDP_HEADROOM);
> +						       headroom);
>  				return head_skb;
>  			}
>  			break;
> --
> 2.35.1
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net v2] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-24 10:42                     ` Xuan Zhuo
  0 siblings, 0 replies; 36+ messages in thread
From: Xuan Zhuo @ 2022-04-24 10:42 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: kuba, davem, Nikolay Aleksandrov, stable, Jason Wang,
	Daniel Borkmann, Michael S. Tsirkin, virtualization, netdev

On Sun, 24 Apr 2022 13:21:21 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> We received a report[1] of kernel crashes when Cilium is used in XDP
> mode with virtio_net after updating to newer kernels. After
> investigating the reason it turned out that when using mergeable bufs
> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> calculates the build_skb address wrong because the offset can become less
> than the headroom so it gets the address of the previous page (-X bytes
> depending on how lower offset is):
>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>
> This is a pr_err() I added in the beginning of page_to_skb which clearly
> shows offset that is less than headroom by adding 4 bytes of metadata
> via an xdp prog. The calculations done are:
>  receive_mergeable():
>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>  offset = xdp.data - page_address(xdp_page) -
>           vi->hdr_len - metasize;
>
>  page_to_skb():
>  p = page_address(page) + offset;
>  ...
>  buf = p - headroom;
>
> Now buf goes -4 bytes from the page's starting address as can be seen
> above which is set as skb->head and skb->data by build_skb later. Depending
> on what's done with the skb (when it's freed most often) we get all kinds
> of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
> the new headroom after the xdp program has run, similar to how offset
> and len are recalculated. Headroom is directly related to
> data_hard_start, data and data_meta, so we use them to get the new size.
> The result is correct (similar pr_err() in page_to_skb, one case of
> xdp_page and one case of virtnet buf):
>  a) Case with 4 bytes of metadata
>  [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
>  [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
>  b) Case of pushing data +32 bytes
>  [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
>  [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
>  c) Case of pushing data -33 bytes
>  [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
>  [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223
>
> An example reproducer xdp prog[3] is below.
>
> [1] https://github.com/cilium/cilium/issues/19453
>
> [2] Two of the many traces:
>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>  [   41.321387] Call Trace:
>  [   41.321819]  <TASK>
>  [   41.322193]  skb_release_data+0x13f/0x1c0
>  [   41.322902]  __kfree_skb+0x20/0x30
>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>  [   41.384102]  inet_recvmsg+0x42/0x100
>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>  [   41.428201]  sock_read_iter+0x84/0xd0
>  [   41.445592]  ? 0xffffffffa3000000
>  [   41.462442]  new_sync_read+0x148/0x160
>  [   41.479314]  ? 0xffffffffa3000000
>  [   41.496937]  vfs_read+0x138/0x190
>  [   41.517198]  ksys_read+0x87/0xc0
>  [   41.535336]  do_syscall_64+0x3b/0x90
>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>  [   41.568050] RIP: 0033:0x48765b
>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>  [   41.744254]  </TASK>
>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>
>  and
>
>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>  [   33.532471] page dumped because: nonzero mapcount
>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>  [   33.532484] Call Trace:
>  [   33.532496]  <TASK>
>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>  [   33.532506]  bad_page.cold+0x63/0x94
>  [   33.532510]  free_pcp_prepare+0x290/0x420
>  [   33.532515]  free_unref_page+0x1b/0x100
>  [   33.532518]  skb_release_data+0x13f/0x1c0
>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>
> [3] XDP program to reproduce(xdp_pass.c):
>  #include <linux/bpf.h>
>  #include <bpf/bpf_helpers.h>
>
>  SEC("xdp_pass")
>  int xdp_pkt_pass(struct xdp_md *ctx)
>  {
>           bpf_xdp_adjust_head(ctx, -(int)32);
>           return XDP_PASS;
>  }
>
>  char _license[] SEC("license") = "GPL";
>
>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>
> CC: stable@vger.kernel.org
> CC: Jason Wang <jasowang@redhat.com>
> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> CC: Daniel Borkmann <daniel@iogearbox.net>
> CC: "Michael S. Tsirkin" <mst@redhat.com>
> CC: virtualization@lists.linux-foundation.org
> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> ---
> v2: Recalculate headroom based on data, data_hard_start and data_meta
>
>  drivers/net/virtio_net.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 87838cbe38cf..a12338de7ef1 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1005,6 +1005,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  			 * xdp.data_meta were adjusted
>  			 */
>  			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
> +
> +			/* recalculate headroom if xdp.data or xdp.data_meta
> +			 * were adjusted
> +			 */
> +			headroom = xdp.data - xdp.data_hard_start - metasize;


This is incorrect.


		data = page_address(xdp_page) + offset;
		xdp_init_buff(&xdp, frame_sz - vi->hdr_len, &rq->xdp_rxq);
		xdp_prepare_buff(&xdp, data - VIRTIO_XDP_HEADROOM + vi->hdr_len,
				 VIRTIO_XDP_HEADROOM, len - vi->hdr_len, true);

so: xdp.data_hard_start = page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM + vi->hdr_len

(page_address(xdp_page) + offset) points to virtio-net hdr.
(page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM) points to the allocated buf.

xdp.data_hard_start points to buf + vi->hdr_len

Thanks.


> +
>  			/* We can only create skb based on xdp_page. */
>  			if (unlikely(xdp_page != page)) {
>  				rcu_read_unlock();
> @@ -1012,7 +1018,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  				head_skb = page_to_skb(vi, rq, xdp_page, offset,
>  						       len, PAGE_SIZE, false,
>  						       metasize,
> -						       VIRTIO_XDP_HEADROOM);
> +						       headroom);
>  				return head_skb;
>  			}
>  			break;
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net v2] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-24 10:42                     ` Xuan Zhuo
@ 2022-04-24 10:56                       ` Nikolay Aleksandrov
  -1 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-24 10:56 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev

On 24/04/2022 13:42, Xuan Zhuo wrote:
> On Sun, 24 Apr 2022 13:21:21 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> We received a report[1] of kernel crashes when Cilium is used in XDP
>> mode with virtio_net after updating to newer kernels. After
>> investigating the reason it turned out that when using mergeable bufs
>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>> calculates the build_skb address wrong because the offset can become less
>> than the headroom so it gets the address of the previous page (-X bytes
>> depending on how lower offset is):
>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>
>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>> shows offset that is less than headroom by adding 4 bytes of metadata
>> via an xdp prog. The calculations done are:
>>  receive_mergeable():
>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>  offset = xdp.data - page_address(xdp_page) -
>>           vi->hdr_len - metasize;
>>
>>  page_to_skb():
>>  p = page_address(page) + offset;
>>  ...
>>  buf = p - headroom;
>>
>> Now buf goes -4 bytes from the page's starting address as can be seen
>> above which is set as skb->head and skb->data by build_skb later. Depending
>> on what's done with the skb (when it's freed most often) we get all kinds
>> of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
>> the new headroom after the xdp program has run, similar to how offset
>> and len are recalculated. Headroom is directly related to
>> data_hard_start, data and data_meta, so we use them to get the new size.
>> The result is correct (similar pr_err() in page_to_skb, one case of
>> xdp_page and one case of virtnet buf):
>>  a) Case with 4 bytes of metadata
>>  [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
>>  [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
>>  b) Case of pushing data +32 bytes
>>  [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
>>  [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
>>  c) Case of pushing data -33 bytes
>>  [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
>>  [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223
>>
>> An example reproducer xdp prog[3] is below.
>>
>> [1] https://github.com/cilium/cilium/issues/19453
>>
>> [2] Two of the many traces:
>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>>  [   41.321387] Call Trace:
>>  [   41.321819]  <TASK>
>>  [   41.322193]  skb_release_data+0x13f/0x1c0
>>  [   41.322902]  __kfree_skb+0x20/0x30
>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>>  [   41.384102]  inet_recvmsg+0x42/0x100
>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>>  [   41.428201]  sock_read_iter+0x84/0xd0
>>  [   41.445592]  ? 0xffffffffa3000000
>>  [   41.462442]  new_sync_read+0x148/0x160
>>  [   41.479314]  ? 0xffffffffa3000000
>>  [   41.496937]  vfs_read+0x138/0x190
>>  [   41.517198]  ksys_read+0x87/0xc0
>>  [   41.535336]  do_syscall_64+0x3b/0x90
>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>  [   41.568050] RIP: 0033:0x48765b
>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>>  [   41.744254]  </TASK>
>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>
>>  and
>>
>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>>  [   33.532471] page dumped because: nonzero mapcount
>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>  [   33.532484] Call Trace:
>>  [   33.532496]  <TASK>
>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>>  [   33.532506]  bad_page.cold+0x63/0x94
>>  [   33.532510]  free_pcp_prepare+0x290/0x420
>>  [   33.532515]  free_unref_page+0x1b/0x100
>>  [   33.532518]  skb_release_data+0x13f/0x1c0
>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>>
>> [3] XDP program to reproduce(xdp_pass.c):
>>  #include <linux/bpf.h>
>>  #include <bpf/bpf_helpers.h>
>>
>>  SEC("xdp_pass")
>>  int xdp_pkt_pass(struct xdp_md *ctx)
>>  {
>>           bpf_xdp_adjust_head(ctx, -(int)32);
>>           return XDP_PASS;
>>  }
>>
>>  char _license[] SEC("license") = "GPL";
>>
>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>>
>> CC: stable@vger.kernel.org
>> CC: Jason Wang <jasowang@redhat.com>
>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>> CC: Daniel Borkmann <daniel@iogearbox.net>
>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>> CC: virtualization@lists.linux-foundation.org
>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>> ---
>> v2: Recalculate headroom based on data, data_hard_start and data_meta
>>
>>  drivers/net/virtio_net.c | 8 +++++++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 87838cbe38cf..a12338de7ef1 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1005,6 +1005,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>  			 * xdp.data_meta were adjusted
>>  			 */
>>  			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
>> +
>> +			/* recalculate headroom if xdp.data or xdp.data_meta
>> +			 * were adjusted
>> +			 */
>> +			headroom = xdp.data - xdp.data_hard_start - metasize;
> 
> 
> This is incorrect.
> 
> 
> 		data = page_address(xdp_page) + offset;
> 		xdp_init_buff(&xdp, frame_sz - vi->hdr_len, &rq->xdp_rxq);
> 		xdp_prepare_buff(&xdp, data - VIRTIO_XDP_HEADROOM + vi->hdr_len,
> 				 VIRTIO_XDP_HEADROOM, len - vi->hdr_len, true);
> 
> so: xdp.data_hard_start = page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM + vi->hdr_len
> 
> (page_address(xdp_page) + offset) points to virtio-net hdr.
> (page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM) points to the allocated buf.
> 
> xdp.data_hard_start points to buf + vi->hdr_len
> 
> Thanks.
> 

xdp.data points to buf + vi->hdr_len + VIRTIO_XDP_HEADROOM, so we calculate
xdp.data - xdp.data_hard_start, i.e. buf + vi->hdr_len + VIRTIO_XDP_HEADROOM - (buf + vi->hdr_len)

You can see the headrooms from my tests above, they are correct and they match exactly
the values from the headroom calculations that you suggested earlier.

> 
>> +
>>  			/* We can only create skb based on xdp_page. */
>>  			if (unlikely(xdp_page != page)) {
>>  				rcu_read_unlock();
>> @@ -1012,7 +1018,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>  				head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>  						       len, PAGE_SIZE, false,
>>  						       metasize,
>> -						       VIRTIO_XDP_HEADROOM);
>> +						       headroom);
>>  				return head_skb;
>>  			}
>>  			break;
>> --
>> 2.35.1
>>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net v2] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-24 10:56                       ` Nikolay Aleksandrov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-24 10:56 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable,
	virtualization, kuba, davem

On 24/04/2022 13:42, Xuan Zhuo wrote:
> On Sun, 24 Apr 2022 13:21:21 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> We received a report[1] of kernel crashes when Cilium is used in XDP
>> mode with virtio_net after updating to newer kernels. After
>> investigating the reason it turned out that when using mergeable bufs
>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>> calculates the build_skb address wrong because the offset can become less
>> than the headroom so it gets the address of the previous page (-X bytes
>> depending on how lower offset is):
>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>
>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>> shows offset that is less than headroom by adding 4 bytes of metadata
>> via an xdp prog. The calculations done are:
>>  receive_mergeable():
>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>  offset = xdp.data - page_address(xdp_page) -
>>           vi->hdr_len - metasize;
>>
>>  page_to_skb():
>>  p = page_address(page) + offset;
>>  ...
>>  buf = p - headroom;
>>
>> Now buf goes -4 bytes from the page's starting address as can be seen
>> above which is set as skb->head and skb->data by build_skb later. Depending
>> on what's done with the skb (when it's freed most often) we get all kinds
>> of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
>> the new headroom after the xdp program has run, similar to how offset
>> and len are recalculated. Headroom is directly related to
>> data_hard_start, data and data_meta, so we use them to get the new size.
>> The result is correct (similar pr_err() in page_to_skb, one case of
>> xdp_page and one case of virtnet buf):
>>  a) Case with 4 bytes of metadata
>>  [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
>>  [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
>>  b) Case of pushing data +32 bytes
>>  [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
>>  [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
>>  c) Case of pushing data -33 bytes
>>  [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
>>  [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223
>>
>> An example reproducer xdp prog[3] is below.
>>
>> [1] https://github.com/cilium/cilium/issues/19453
>>
>> [2] Two of the many traces:
>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>>  [   41.321387] Call Trace:
>>  [   41.321819]  <TASK>
>>  [   41.322193]  skb_release_data+0x13f/0x1c0
>>  [   41.322902]  __kfree_skb+0x20/0x30
>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>>  [   41.384102]  inet_recvmsg+0x42/0x100
>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>>  [   41.428201]  sock_read_iter+0x84/0xd0
>>  [   41.445592]  ? 0xffffffffa3000000
>>  [   41.462442]  new_sync_read+0x148/0x160
>>  [   41.479314]  ? 0xffffffffa3000000
>>  [   41.496937]  vfs_read+0x138/0x190
>>  [   41.517198]  ksys_read+0x87/0xc0
>>  [   41.535336]  do_syscall_64+0x3b/0x90
>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>  [   41.568050] RIP: 0033:0x48765b
>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>>  [   41.744254]  </TASK>
>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>
>>  and
>>
>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>>  [   33.532471] page dumped because: nonzero mapcount
>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>  [   33.532484] Call Trace:
>>  [   33.532496]  <TASK>
>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>>  [   33.532506]  bad_page.cold+0x63/0x94
>>  [   33.532510]  free_pcp_prepare+0x290/0x420
>>  [   33.532515]  free_unref_page+0x1b/0x100
>>  [   33.532518]  skb_release_data+0x13f/0x1c0
>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>>
>> [3] XDP program to reproduce(xdp_pass.c):
>>  #include <linux/bpf.h>
>>  #include <bpf/bpf_helpers.h>
>>
>>  SEC("xdp_pass")
>>  int xdp_pkt_pass(struct xdp_md *ctx)
>>  {
>>           bpf_xdp_adjust_head(ctx, -(int)32);
>>           return XDP_PASS;
>>  }
>>
>>  char _license[] SEC("license") = "GPL";
>>
>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>>
>> CC: stable@vger.kernel.org
>> CC: Jason Wang <jasowang@redhat.com>
>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>> CC: Daniel Borkmann <daniel@iogearbox.net>
>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>> CC: virtualization@lists.linux-foundation.org
>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>> ---
>> v2: Recalculate headroom based on data, data_hard_start and data_meta
>>
>>  drivers/net/virtio_net.c | 8 +++++++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 87838cbe38cf..a12338de7ef1 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1005,6 +1005,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>  			 * xdp.data_meta were adjusted
>>  			 */
>>  			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
>> +
>> +			/* recalculate headroom if xdp.data or xdp.data_meta
>> +			 * were adjusted
>> +			 */
>> +			headroom = xdp.data - xdp.data_hard_start - metasize;
> 
> 
> This is incorrect.
> 
> 
> 		data = page_address(xdp_page) + offset;
> 		xdp_init_buff(&xdp, frame_sz - vi->hdr_len, &rq->xdp_rxq);
> 		xdp_prepare_buff(&xdp, data - VIRTIO_XDP_HEADROOM + vi->hdr_len,
> 				 VIRTIO_XDP_HEADROOM, len - vi->hdr_len, true);
> 
> so: xdp.data_hard_start = page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM + vi->hdr_len
> 
> (page_address(xdp_page) + offset) points to virtio-net hdr.
> (page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM) points to the allocated buf.
> 
> xdp.data_hard_start points to buf + vi->hdr_len
> 
> Thanks.
> 

xdp.data points to buf + vi->hdr_len + VIRTIO_XDP_HEADROOM, so we calculate
xdp.data - xdp.data_hard_start, i.e. buf + vi->hdr_len + VIRTIO_XDP_HEADROOM - (buf + vi->hdr_len)

You can see the headrooms from my tests above, they are correct and they match exactly
the values from the headroom calculations that you suggested earlier.

> 
>> +
>>  			/* We can only create skb based on xdp_page. */
>>  			if (unlikely(xdp_page != page)) {
>>  				rcu_read_unlock();
>> @@ -1012,7 +1018,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>  				head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>  						       len, PAGE_SIZE, false,
>>  						       metasize,
>> -						       VIRTIO_XDP_HEADROOM);
>> +						       headroom);
>>  				return head_skb;
>>  			}
>>  			break;
>> --
>> 2.35.1
>>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net v2] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-24 10:56                       ` Nikolay Aleksandrov
@ 2022-04-24 11:18                         ` Xuan Zhuo
  -1 siblings, 0 replies; 36+ messages in thread
From: Xuan Zhuo @ 2022-04-24 11:18 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev

On Sun, 24 Apr 2022 13:56:17 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> On 24/04/2022 13:42, Xuan Zhuo wrote:
> > On Sun, 24 Apr 2022 13:21:21 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> >> We received a report[1] of kernel crashes when Cilium is used in XDP
> >> mode with virtio_net after updating to newer kernels. After
> >> investigating the reason it turned out that when using mergeable bufs
> >> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> >> calculates the build_skb address wrong because the offset can become less
> >> than the headroom so it gets the address of the previous page (-X bytes
> >> depending on how lower offset is):
> >>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
> >>
> >> This is a pr_err() I added in the beginning of page_to_skb which clearly
> >> shows offset that is less than headroom by adding 4 bytes of metadata
> >> via an xdp prog. The calculations done are:
> >>  receive_mergeable():
> >>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
> >>  offset = xdp.data - page_address(xdp_page) -
> >>           vi->hdr_len - metasize;
> >>
> >>  page_to_skb():
> >>  p = page_address(page) + offset;
> >>  ...
> >>  buf = p - headroom;
> >>
> >> Now buf goes -4 bytes from the page's starting address as can be seen
> >> above which is set as skb->head and skb->data by build_skb later. Depending
> >> on what's done with the skb (when it's freed most often) we get all kinds
> >> of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
> >> the new headroom after the xdp program has run, similar to how offset
> >> and len are recalculated. Headroom is directly related to
> >> data_hard_start, data and data_meta, so we use them to get the new size.
> >> The result is correct (similar pr_err() in page_to_skb, one case of
> >> xdp_page and one case of virtnet buf):
> >>  a) Case with 4 bytes of metadata
> >>  [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
> >>  [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
> >>  b) Case of pushing data +32 bytes
> >>  [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
> >>  [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
> >>  c) Case of pushing data -33 bytes
> >>  [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
> >>  [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223
> >>
> >> An example reproducer xdp prog[3] is below.
> >>
> >> [1] https://github.com/cilium/cilium/issues/19453
> >>
> >> [2] Two of the many traces:
> >>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
> >>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
> >>  [   41.300891] kernel BUG at include/linux/mm.h:720!
> >>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> >>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
> >>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
> >>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
> >>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
> >>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
> >>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
> >>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
> >>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
> >>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
> >>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
> >>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
> >>  [   41.321387] Call Trace:
> >>  [   41.321819]  <TASK>
> >>  [   41.322193]  skb_release_data+0x13f/0x1c0
> >>  [   41.322902]  __kfree_skb+0x20/0x30
> >>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
> >>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
> >>  [   41.384102]  inet_recvmsg+0x42/0x100
> >>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
> >>  [   41.428201]  sock_read_iter+0x84/0xd0
> >>  [   41.445592]  ? 0xffffffffa3000000
> >>  [   41.462442]  new_sync_read+0x148/0x160
> >>  [   41.479314]  ? 0xffffffffa3000000
> >>  [   41.496937]  vfs_read+0x138/0x190
> >>  [   41.517198]  ksys_read+0x87/0xc0
> >>  [   41.535336]  do_syscall_64+0x3b/0x90
> >>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>  [   41.568050] RIP: 0033:0x48765b
> >>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
> >>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
> >>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
> >>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
> >>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
> >>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
> >>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
> >>  [   41.744254]  </TASK>
> >>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>
> >>  and
> >>
> >>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
> >>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
> >>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
> >>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> >>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
> >>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
> >>  [   33.532471] page dumped because: nonzero mapcount
> >>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
> >>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>  [   33.532484] Call Trace:
> >>  [   33.532496]  <TASK>
> >>  [   33.532500]  dump_stack_lvl+0x45/0x5a
> >>  [   33.532506]  bad_page.cold+0x63/0x94
> >>  [   33.532510]  free_pcp_prepare+0x290/0x420
> >>  [   33.532515]  free_unref_page+0x1b/0x100
> >>  [   33.532518]  skb_release_data+0x13f/0x1c0
> >>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
> >>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
> >>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
> >>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
> >>
> >> [3] XDP program to reproduce(xdp_pass.c):
> >>  #include <linux/bpf.h>
> >>  #include <bpf/bpf_helpers.h>
> >>
> >>  SEC("xdp_pass")
> >>  int xdp_pkt_pass(struct xdp_md *ctx)
> >>  {
> >>           bpf_xdp_adjust_head(ctx, -(int)32);
> >>           return XDP_PASS;
> >>  }
> >>
> >>  char _license[] SEC("license") = "GPL";
> >>
> >>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
> >>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
> >>
> >> CC: stable@vger.kernel.org
> >> CC: Jason Wang <jasowang@redhat.com>
> >> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >> CC: Daniel Borkmann <daniel@iogearbox.net>
> >> CC: "Michael S. Tsirkin" <mst@redhat.com>
> >> CC: virtualization@lists.linux-foundation.org
> >> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> >> ---
> >> v2: Recalculate headroom based on data, data_hard_start and data_meta
> >>
> >>  drivers/net/virtio_net.c | 8 +++++++-
> >>  1 file changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index 87838cbe38cf..a12338de7ef1 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -1005,6 +1005,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >>  			 * xdp.data_meta were adjusted
> >>  			 */
> >>  			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
> >> +
> >> +			/* recalculate headroom if xdp.data or xdp.data_meta
> >> +			 * were adjusted
> >> +			 */
> >> +			headroom = xdp.data - xdp.data_hard_start - metasize;
> >
> >
> > This is incorrect.
> >
> >
> > 		data = page_address(xdp_page) + offset;
> > 		xdp_init_buff(&xdp, frame_sz - vi->hdr_len, &rq->xdp_rxq);
> > 		xdp_prepare_buff(&xdp, data - VIRTIO_XDP_HEADROOM + vi->hdr_len,
> > 				 VIRTIO_XDP_HEADROOM, len - vi->hdr_len, true);
> >
> > so: xdp.data_hard_start = page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM + vi->hdr_len
> >
> > (page_address(xdp_page) + offset) points to virtio-net hdr.
> > (page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM) points to the allocated buf.
> >
> > xdp.data_hard_start points to buf + vi->hdr_len
> >
> > Thanks.
> >
>
> xdp.data points to buf + vi->hdr_len + VIRTIO_XDP_HEADROOM, so we calculate
> xdp.data - xdp.data_hard_start, i.e. buf + vi->hdr_len + VIRTIO_XDP_HEADROOM - (buf + vi->hdr_len)
>
> You can see the headrooms from my tests above, they are correct and they match exactly
> the values from the headroom calculations that you suggested earlier.


OK. You are right, xdp.data, xdp.data_hard_start have an offset of hdr_len. I
hope this can be explained in the comments, because the headroom we want to get
is virtio_hdr - buf. Although the value here are equal.

In addition, if you are going to post v2, I think you should post a new thread
separately instead of replying in the previous thread.

Thanks.


>
> >
> >> +
> >>  			/* We can only create skb based on xdp_page. */
> >>  			if (unlikely(xdp_page != page)) {
> >>  				rcu_read_unlock();
> >> @@ -1012,7 +1018,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >>  				head_skb = page_to_skb(vi, rq, xdp_page, offset,
> >>  						       len, PAGE_SIZE, false,
> >>  						       metasize,
> >> -						       VIRTIO_XDP_HEADROOM);
> >> +						       headroom);
> >>  				return head_skb;
> >>  			}
> >>  			break;
> >> --
> >> 2.35.1
> >>
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net v2] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-24 11:18                         ` Xuan Zhuo
  0 siblings, 0 replies; 36+ messages in thread
From: Xuan Zhuo @ 2022-04-24 11:18 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable,
	virtualization, kuba, davem

On Sun, 24 Apr 2022 13:56:17 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> On 24/04/2022 13:42, Xuan Zhuo wrote:
> > On Sun, 24 Apr 2022 13:21:21 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> >> We received a report[1] of kernel crashes when Cilium is used in XDP
> >> mode with virtio_net after updating to newer kernels. After
> >> investigating the reason it turned out that when using mergeable bufs
> >> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> >> calculates the build_skb address wrong because the offset can become less
> >> than the headroom so it gets the address of the previous page (-X bytes
> >> depending on how lower offset is):
> >>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
> >>
> >> This is a pr_err() I added in the beginning of page_to_skb which clearly
> >> shows offset that is less than headroom by adding 4 bytes of metadata
> >> via an xdp prog. The calculations done are:
> >>  receive_mergeable():
> >>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
> >>  offset = xdp.data - page_address(xdp_page) -
> >>           vi->hdr_len - metasize;
> >>
> >>  page_to_skb():
> >>  p = page_address(page) + offset;
> >>  ...
> >>  buf = p - headroom;
> >>
> >> Now buf goes -4 bytes from the page's starting address as can be seen
> >> above which is set as skb->head and skb->data by build_skb later. Depending
> >> on what's done with the skb (when it's freed most often) we get all kinds
> >> of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
> >> the new headroom after the xdp program has run, similar to how offset
> >> and len are recalculated. Headroom is directly related to
> >> data_hard_start, data and data_meta, so we use them to get the new size.
> >> The result is correct (similar pr_err() in page_to_skb, one case of
> >> xdp_page and one case of virtnet buf):
> >>  a) Case with 4 bytes of metadata
> >>  [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
> >>  [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
> >>  b) Case of pushing data +32 bytes
> >>  [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
> >>  [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
> >>  c) Case of pushing data -33 bytes
> >>  [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
> >>  [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223
> >>
> >> An example reproducer xdp prog[3] is below.
> >>
> >> [1] https://github.com/cilium/cilium/issues/19453
> >>
> >> [2] Two of the many traces:
> >>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
> >>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
> >>  [   41.300891] kernel BUG at include/linux/mm.h:720!
> >>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> >>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
> >>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
> >>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
> >>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
> >>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
> >>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
> >>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
> >>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
> >>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
> >>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
> >>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
> >>  [   41.321387] Call Trace:
> >>  [   41.321819]  <TASK>
> >>  [   41.322193]  skb_release_data+0x13f/0x1c0
> >>  [   41.322902]  __kfree_skb+0x20/0x30
> >>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
> >>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
> >>  [   41.384102]  inet_recvmsg+0x42/0x100
> >>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
> >>  [   41.428201]  sock_read_iter+0x84/0xd0
> >>  [   41.445592]  ? 0xffffffffa3000000
> >>  [   41.462442]  new_sync_read+0x148/0x160
> >>  [   41.479314]  ? 0xffffffffa3000000
> >>  [   41.496937]  vfs_read+0x138/0x190
> >>  [   41.517198]  ksys_read+0x87/0xc0
> >>  [   41.535336]  do_syscall_64+0x3b/0x90
> >>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>  [   41.568050] RIP: 0033:0x48765b
> >>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
> >>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
> >>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
> >>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
> >>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
> >>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
> >>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
> >>  [   41.744254]  </TASK>
> >>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>
> >>  and
> >>
> >>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
> >>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
> >>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
> >>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> >>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
> >>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
> >>  [   33.532471] page dumped because: nonzero mapcount
> >>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
> >>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>  [   33.532484] Call Trace:
> >>  [   33.532496]  <TASK>
> >>  [   33.532500]  dump_stack_lvl+0x45/0x5a
> >>  [   33.532506]  bad_page.cold+0x63/0x94
> >>  [   33.532510]  free_pcp_prepare+0x290/0x420
> >>  [   33.532515]  free_unref_page+0x1b/0x100
> >>  [   33.532518]  skb_release_data+0x13f/0x1c0
> >>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
> >>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
> >>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
> >>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
> >>
> >> [3] XDP program to reproduce(xdp_pass.c):
> >>  #include <linux/bpf.h>
> >>  #include <bpf/bpf_helpers.h>
> >>
> >>  SEC("xdp_pass")
> >>  int xdp_pkt_pass(struct xdp_md *ctx)
> >>  {
> >>           bpf_xdp_adjust_head(ctx, -(int)32);
> >>           return XDP_PASS;
> >>  }
> >>
> >>  char _license[] SEC("license") = "GPL";
> >>
> >>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
> >>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
> >>
> >> CC: stable@vger.kernel.org
> >> CC: Jason Wang <jasowang@redhat.com>
> >> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >> CC: Daniel Borkmann <daniel@iogearbox.net>
> >> CC: "Michael S. Tsirkin" <mst@redhat.com>
> >> CC: virtualization@lists.linux-foundation.org
> >> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> >> ---
> >> v2: Recalculate headroom based on data, data_hard_start and data_meta
> >>
> >>  drivers/net/virtio_net.c | 8 +++++++-
> >>  1 file changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index 87838cbe38cf..a12338de7ef1 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -1005,6 +1005,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >>  			 * xdp.data_meta were adjusted
> >>  			 */
> >>  			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
> >> +
> >> +			/* recalculate headroom if xdp.data or xdp.data_meta
> >> +			 * were adjusted
> >> +			 */
> >> +			headroom = xdp.data - xdp.data_hard_start - metasize;
> >
> >
> > This is incorrect.
> >
> >
> > 		data = page_address(xdp_page) + offset;
> > 		xdp_init_buff(&xdp, frame_sz - vi->hdr_len, &rq->xdp_rxq);
> > 		xdp_prepare_buff(&xdp, data - VIRTIO_XDP_HEADROOM + vi->hdr_len,
> > 				 VIRTIO_XDP_HEADROOM, len - vi->hdr_len, true);
> >
> > so: xdp.data_hard_start = page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM + vi->hdr_len
> >
> > (page_address(xdp_page) + offset) points to virtio-net hdr.
> > (page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM) points to the allocated buf.
> >
> > xdp.data_hard_start points to buf + vi->hdr_len
> >
> > Thanks.
> >
>
> xdp.data points to buf + vi->hdr_len + VIRTIO_XDP_HEADROOM, so we calculate
> xdp.data - xdp.data_hard_start, i.e. buf + vi->hdr_len + VIRTIO_XDP_HEADROOM - (buf + vi->hdr_len)
>
> You can see the headrooms from my tests above, they are correct and they match exactly
> the values from the headroom calculations that you suggested earlier.


OK. You are right, xdp.data, xdp.data_hard_start have an offset of hdr_len. I
hope this can be explained in the comments, because the headroom we want to get
is virtio_hdr - buf. Although the value here are equal.

In addition, if you are going to post v2, I think you should post a new thread
separately instead of replying in the previous thread.

Thanks.


>
> >
> >> +
> >>  			/* We can only create skb based on xdp_page. */
> >>  			if (unlikely(xdp_page != page)) {
> >>  				rcu_read_unlock();
> >> @@ -1012,7 +1018,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >>  				head_skb = page_to_skb(vi, rq, xdp_page, offset,
> >>  						       len, PAGE_SIZE, false,
> >>  						       metasize,
> >> -						       VIRTIO_XDP_HEADROOM);
> >> +						       headroom);
> >>  				return head_skb;
> >>  			}
> >>  			break;
> >> --
> >> 2.35.1
> >>
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net v2] virtio_net: fix wrong buf address calculation when using xdp
  2022-04-24 11:18                         ` Xuan Zhuo
@ 2022-04-24 14:53                           ` Nikolay Aleksandrov
  -1 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-24 14:53 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev

On 24/04/2022 14:18, Xuan Zhuo wrote:
> On Sun, 24 Apr 2022 13:56:17 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> On 24/04/2022 13:42, Xuan Zhuo wrote:
>>> On Sun, 24 Apr 2022 13:21:21 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
[snip]
>>>>
>>>> CC: stable@vger.kernel.org
>>>> CC: Jason Wang <jasowang@redhat.com>
>>>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>> CC: Daniel Borkmann <daniel@iogearbox.net>
>>>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>>>> CC: virtualization@lists.linux-foundation.org
>>>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>>>> ---
>>>> v2: Recalculate headroom based on data, data_hard_start and data_meta
>>>>
>>>>  drivers/net/virtio_net.c | 8 +++++++-
>>>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 87838cbe38cf..a12338de7ef1 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -1005,6 +1005,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>  			 * xdp.data_meta were adjusted
>>>>  			 */
>>>>  			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
>>>> +
>>>> +			/* recalculate headroom if xdp.data or xdp.data_meta
>>>> +			 * were adjusted
>>>> +			 */
>>>> +			headroom = xdp.data - xdp.data_hard_start - metasize;
>>>
>>>
>>> This is incorrect.
>>>
>>>
>>> 		data = page_address(xdp_page) + offset;
>>> 		xdp_init_buff(&xdp, frame_sz - vi->hdr_len, &rq->xdp_rxq);
>>> 		xdp_prepare_buff(&xdp, data - VIRTIO_XDP_HEADROOM + vi->hdr_len,
>>> 				 VIRTIO_XDP_HEADROOM, len - vi->hdr_len, true);
>>>
>>> so: xdp.data_hard_start = page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM + vi->hdr_len
>>>
>>> (page_address(xdp_page) + offset) points to virtio-net hdr.
>>> (page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM) points to the allocated buf.
>>>
>>> xdp.data_hard_start points to buf + vi->hdr_len
>>>
>>> Thanks.
>>>
>>
>> xdp.data points to buf + vi->hdr_len + VIRTIO_XDP_HEADROOM, so we calculate
>> xdp.data - xdp.data_hard_start, i.e. buf + vi->hdr_len + VIRTIO_XDP_HEADROOM - (buf + vi->hdr_len)
>>
>> You can see the headrooms from my tests above, they are correct and they match exactly
>> the values from the headroom calculations that you suggested earlier.
> 
> 
> OK. You are right, xdp.data, xdp.data_hard_start have an offset of hdr_len. I
> hope this can be explained in the comments, because the headroom we want to get
> is virtio_hdr - buf. Although the value here are equal.
> 

I think it's normal for them to be equal because buf + offset points to vnet_hdr start,
that is it doesn't include the vnet_hdr size, so all that is left to subtract to get
to buf is offset - headroom after the prepare (given correct headroom, of course). 
The linearized xdp page buf has the following initial geometry (at prepare):
 offset = VIRTIO_XDP_HEADROOM
 headroom = VIRTIO_XDP_HEADROOM
 data_hard_start = page_address + vnet_hdr
 data = page_address + vnet_hdr + headroom
 data_meta = data

after xdp prog has run:
offset is recalculated as: (page_address + vnet_hdr + adjusted headroom) - page_address -
                            vnet_hdr - metasize = adjusted headroom - metasize

I wrote adjusted headroom because it depends on data and data_meta adjustments done by
the xdp prog, so based on the above we can get back to page_address (or buf if it's a virtnet
buf) by subtracting the following headroom recalculation:
 headroom = data - data_hard_start - metasize =
 (page_address + vnet_hdr + adjusted headroom) - page_address - vnet_hdr - metasize =
 adjusted headroom - metasize

That is because in page_to_skb() p points to page_address + recalculated offset, i.e.
p = page_address + (adjusted headroom - metasize) for the linearized case.
For the other case it's the same just the initial offset is a larger value.

I'll add a comment that page_address + offset always points to vnet_hdr start so
it will be equal to headroom which is data - data_hard_start because we have
data = page_address + vnet_hdr + adjusted headroom and data_hard_start at page_address
+ vnet_hdr. Note that we have adjusted headroom + vnet_hdr bytes available behind data
always, so for offset to point to vnet_hdr it has to be equal to headroom, it just
starts at page_address while the actual headroom starts at page_address + vnet_hdr to
reserve that many bytes.

I'll see how I can make that more concise. :)

> In addition, if you are going to post v2, I think you should post a new thread
> separately instead of replying in the previous thread.
> 

sure, I don't mind either way

> Thanks.
> 
> 

Cheers,
 Nik

>>
>>>
>>>> +
>>>>  			/* We can only create skb based on xdp_page. */
>>>>  			if (unlikely(xdp_page != page)) {
>>>>  				rcu_read_unlock();
>>>> @@ -1012,7 +1018,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>  				head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>>  						       len, PAGE_SIZE, false,
>>>>  						       metasize,
>>>> -						       VIRTIO_XDP_HEADROOM);
>>>> +						       headroom);
>>>>  				return head_skb;
>>>>  			}
>>>>  			break;
>>>> --
>>>> 2.35.1
>>>>
>>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net v2] virtio_net: fix wrong buf address calculation when using xdp
@ 2022-04-24 14:53                           ` Nikolay Aleksandrov
  0 siblings, 0 replies; 36+ messages in thread
From: Nikolay Aleksandrov @ 2022-04-24 14:53 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Daniel Borkmann, Michael S. Tsirkin, netdev, stable,
	virtualization, kuba, davem

On 24/04/2022 14:18, Xuan Zhuo wrote:
> On Sun, 24 Apr 2022 13:56:17 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> On 24/04/2022 13:42, Xuan Zhuo wrote:
>>> On Sun, 24 Apr 2022 13:21:21 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
[snip]
>>>>
>>>> CC: stable@vger.kernel.org
>>>> CC: Jason Wang <jasowang@redhat.com>
>>>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>> CC: Daniel Borkmann <daniel@iogearbox.net>
>>>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>>>> CC: virtualization@lists.linux-foundation.org
>>>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>>>> ---
>>>> v2: Recalculate headroom based on data, data_hard_start and data_meta
>>>>
>>>>  drivers/net/virtio_net.c | 8 +++++++-
>>>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 87838cbe38cf..a12338de7ef1 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -1005,6 +1005,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>  			 * xdp.data_meta were adjusted
>>>>  			 */
>>>>  			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
>>>> +
>>>> +			/* recalculate headroom if xdp.data or xdp.data_meta
>>>> +			 * were adjusted
>>>> +			 */
>>>> +			headroom = xdp.data - xdp.data_hard_start - metasize;
>>>
>>>
>>> This is incorrect.
>>>
>>>
>>> 		data = page_address(xdp_page) + offset;
>>> 		xdp_init_buff(&xdp, frame_sz - vi->hdr_len, &rq->xdp_rxq);
>>> 		xdp_prepare_buff(&xdp, data - VIRTIO_XDP_HEADROOM + vi->hdr_len,
>>> 				 VIRTIO_XDP_HEADROOM, len - vi->hdr_len, true);
>>>
>>> so: xdp.data_hard_start = page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM + vi->hdr_len
>>>
>>> (page_address(xdp_page) + offset) points to virtio-net hdr.
>>> (page_address(xdp_page) + offset - VIRTIO_XDP_HEADROOM) points to the allocated buf.
>>>
>>> xdp.data_hard_start points to buf + vi->hdr_len
>>>
>>> Thanks.
>>>
>>
>> xdp.data points to buf + vi->hdr_len + VIRTIO_XDP_HEADROOM, so we calculate
>> xdp.data - xdp.data_hard_start, i.e. buf + vi->hdr_len + VIRTIO_XDP_HEADROOM - (buf + vi->hdr_len)
>>
>> You can see the headrooms from my tests above, they are correct and they match exactly
>> the values from the headroom calculations that you suggested earlier.
> 
> 
> OK. You are right, xdp.data, xdp.data_hard_start have an offset of hdr_len. I
> hope this can be explained in the comments, because the headroom we want to get
> is virtio_hdr - buf. Although the value here are equal.
> 

I think it's normal for them to be equal because buf + offset points to vnet_hdr start,
that is it doesn't include the vnet_hdr size, so all that is left to subtract to get
to buf is offset - headroom after the prepare (given correct headroom, of course). 
The linearized xdp page buf has the following initial geometry (at prepare):
 offset = VIRTIO_XDP_HEADROOM
 headroom = VIRTIO_XDP_HEADROOM
 data_hard_start = page_address + vnet_hdr
 data = page_address + vnet_hdr + headroom
 data_meta = data

after xdp prog has run:
offset is recalculated as: (page_address + vnet_hdr + adjusted headroom) - page_address -
                            vnet_hdr - metasize = adjusted headroom - metasize

I wrote adjusted headroom because it depends on data and data_meta adjustments done by
the xdp prog, so based on the above we can get back to page_address (or buf if it's a virtnet
buf) by subtracting the following headroom recalculation:
 headroom = data - data_hard_start - metasize =
 (page_address + vnet_hdr + adjusted headroom) - page_address - vnet_hdr - metasize =
 adjusted headroom - metasize

That is because in page_to_skb() p points to page_address + recalculated offset, i.e.
p = page_address + (adjusted headroom - metasize) for the linearized case.
For the other case it's the same just the initial offset is a larger value.

I'll add a comment that page_address + offset always points to vnet_hdr start so
it will be equal to headroom which is data - data_hard_start because we have
data = page_address + vnet_hdr + adjusted headroom and data_hard_start at page_address
+ vnet_hdr. Note that we have adjusted headroom + vnet_hdr bytes available behind data
always, so for offset to point to vnet_hdr it has to be equal to headroom, it just
starts at page_address while the actual headroom starts at page_address + vnet_hdr to
reserve that many bytes.

I'll see how I can make that more concise. :)

> In addition, if you are going to post v2, I think you should post a new thread
> separately instead of replying in the previous thread.
> 

sure, I don't mind either way

> Thanks.
> 
> 

Cheers,
 Nik

>>
>>>
>>>> +
>>>>  			/* We can only create skb based on xdp_page. */
>>>>  			if (unlikely(xdp_page != page)) {
>>>>  				rcu_read_unlock();
>>>> @@ -1012,7 +1018,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>  				head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>>  						       len, PAGE_SIZE, false,
>>>>  						       metasize,
>>>> -						       VIRTIO_XDP_HEADROOM);
>>>> +						       headroom);
>>>>  				return head_skb;
>>>>  			}
>>>>  			break;
>>>> --
>>>> 2.35.1
>>>>
>>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2022-04-24 14:53 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-23 11:26 [PATCH net] virtio_net: fix wrong buf address calculation when using xdp Nikolay Aleksandrov
2022-04-23 11:26 ` Nikolay Aleksandrov
2022-04-23 13:31 ` Xuan Zhuo
2022-04-23 13:31   ` Xuan Zhuo
2022-04-23 14:16   ` Nikolay Aleksandrov
2022-04-23 14:16     ` Nikolay Aleksandrov
2022-04-23 14:20     ` Xuan Zhuo
2022-04-23 14:20       ` Xuan Zhuo
2022-04-23 14:30     ` Nikolay Aleksandrov
2022-04-23 14:30       ` Nikolay Aleksandrov
2022-04-23 14:36       ` Xuan Zhuo
2022-04-23 14:36         ` Xuan Zhuo
2022-04-23 14:58         ` Nikolay Aleksandrov
2022-04-23 14:58           ` Nikolay Aleksandrov
2022-04-23 15:01           ` Xuan Zhuo
2022-04-23 15:01             ` Xuan Zhuo
2022-04-23 15:23             ` Nikolay Aleksandrov
2022-04-23 15:23               ` Nikolay Aleksandrov
2022-04-23 15:35               ` Nikolay Aleksandrov
2022-04-23 15:35                 ` Nikolay Aleksandrov
2022-04-24 10:21                 ` [PATCH net v2] " Nikolay Aleksandrov
2022-04-24 10:21                   ` Nikolay Aleksandrov
2022-04-24 10:42                   ` Xuan Zhuo
2022-04-24 10:42                     ` Xuan Zhuo
2022-04-24 10:56                     ` Nikolay Aleksandrov
2022-04-24 10:56                       ` Nikolay Aleksandrov
2022-04-24 11:18                       ` Xuan Zhuo
2022-04-24 11:18                         ` Xuan Zhuo
2022-04-24 14:53                         ` Nikolay Aleksandrov
2022-04-24 14:53                           ` Nikolay Aleksandrov
2022-04-23 15:55             ` [PATCH net] " Nikolay Aleksandrov
2022-04-23 15:55               ` Nikolay Aleksandrov
2022-04-23 14:46       ` Nikolay Aleksandrov
2022-04-23 14:46         ` Nikolay Aleksandrov
2022-04-23 15:10         ` Nikolay Aleksandrov
2022-04-23 15:10           ` Nikolay Aleksandrov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.