* [PATCH bpf-next v3 0/3] xsk: build skb by page
@ 2021-01-21 13:47 Xuan Zhuo
  2021-01-21 13:47 ` [PATCH bpf-next v3 1/3] net: add priv_flags to allow tx skb without linear Xuan Zhuo
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Xuan Zhuo @ 2021-01-21 13:47 UTC (permalink / raw)
  To: bpf
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, netdev

v3:
    Optimized code

v2:
    1. add priv_flags IFF_TX_SKB_NO_LINEAR instead of netdev_feature
    2. split the patch to three:
        a. add priv_flags IFF_TX_SKB_NO_LINEAR
        b. virtio net add priv_flags IFF_TX_SKB_NO_LINEAR
        c. construct the skb without linear space when the device supports this flag
    3. use ERR_PTR() and PTR_ERR() to handle errors

v1 message log:
---------------

This patch constructs the skb directly from the umem pages to avoid
the overhead of copying the data.

This has one problem:

We construct the skb by filling the data pages in as frags. As a
result, the linear space is empty and the packet headers live in a
frag rather than in the linear space, which some network cards do not
allow. For example, a Mellanox Technologies MT27710 Family
[ConnectX-4 Lx] will get the following error message:

    mlx5_core 0000:3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68
    00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2
    WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64
    00000000: 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00
    00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00
    00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    mlx5_core 0000:3b:00.1 eth1: ERR CQE on SQ: 0x1dbb

I also tried to use build_skb() to construct the skb, but that does
not work here: build_skb() places skb_shinfo right behind the linear
space, and we cannot let skb_shinfo live inside desc->addr, since that
memory is mapped by user space and exposing it is not safe.
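
For reference, a minimal sketch of that constraint (hypothetical
helper, not part of this series):

```
#include <linux/skbuff.h>

/* build_skb() carves struct skb_shared_info out of the tail of the
 * buffer it is handed, so that buffer must be kernel-private.
 */
static struct sk_buff *sketch_build(void *umem_chunk, unsigned int chunk_size)
{
	struct sk_buff *skb = build_skb(umem_chunk, chunk_size);

	/* skb_shinfo(skb) now points near the end of umem_chunk, i.e.
	 * into memory that user space has mapped -- hence "not safe".
	 */
	return skb;
}
```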

Finally, I added a feature flag, NETIF_F_SKB_NO_LINEAR, to identify
whether the network card supports packets whose header information is
in a frag rather than in the linear space. (In this v3 series the flag
has become the priv_flag IFF_TX_SKB_NO_LINEAR; see the v2 changelog
above.)
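
A minimal sketch of the opt-in, and of what the resulting skb looks
like, for a hypothetical driver `foo` (using the v3 naming):

```
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Opt in: the TX path can handle skbs whose linear area is empty. */
static void foo_setup(struct net_device *dev)
{
	dev->priv_flags |= IFF_TX_SKB_NO_LINEAR;
}

/* Such an skb then satisfies, in the driver's ndo_start_xmit: */
static bool foo_skb_has_no_linear(const struct sk_buff *skb)
{
	return skb_headlen(skb) == 0 && skb_shinfo(skb)->nr_frags > 0;
}
```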

---------------- Performance Testing ------------

The test environment is an Aliyun ECS server.
Test cmd (xdpsock in tx-only, generic/SKB mode, with the given packet size):
```
xdpsock -i eth0 -t -S -s <msg size>
```

Test result data (packets per second; "percent" is the relative gain
of page over copy):

size    64      512     1024    1500
copy    1916747 1775988 1600203 1440054
page    1974058 1953655 1945463 1904478
percent 3.0%    10.0%   21.6%   32.3%

Xuan Zhuo (3):
  net: add priv_flags to allow tx skb without linear
  virtio-net: support IFF_TX_SKB_NO_LINEAR
  xsk: build skb by page

 drivers/net/virtio_net.c  |   3 +-
 include/linux/netdevice.h |   3 ++
 net/xdp/xsk.c             | 104 ++++++++++++++++++++++++++++++++++++++--------
 3 files changed, 91 insertions(+), 19 deletions(-)

--
1.8.3.1



* [PATCH bpf-next v3 1/3] net: add priv_flags to allow tx skb without linear
  2021-01-21 13:47 [PATCH bpf-next v3 0/3] xsk: build skb by page Xuan Zhuo
@ 2021-01-21 13:47 ` Xuan Zhuo
  2021-01-21 15:13   ` Alexander Lobakin
  2021-01-21 13:47 ` [PATCH bpf-next v3 2/3] virtio-net: support IFF_TX_SKB_NO_LINEAR Xuan Zhuo
  2021-01-21 13:47 ` [PATCH bpf-next v3 3/3] xsk: build skb by page Xuan Zhuo
  2 siblings, 1 reply; 22+ messages in thread
From: Xuan Zhuo @ 2021-01-21 13:47 UTC (permalink / raw)
  To: bpf
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, netdev

In some cases we want to construct the skb directly from existing
memory, without copying the data. The pages are then placed directly
into the skb, and the skb's linear space is left empty. Unfortunately,
many network cards do not support this operation. For example, a
Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the
following error message:

    mlx5_core 0000:3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68
    00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2
    WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64
    00000000: 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00
    00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00
    00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    mlx5_core 0000:3b:00.1 eth1: ERR CQE on SQ: 0x1dbb

So a priv_flag is added here to indicate whether the network card
supports this feature.
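
A minimal sketch of the consumer side (this is essentially the check
that patch 3 adds in net/xdp/xsk.c; the helper name is hypothetical):

```
#include <linux/netdevice.h>

/* Only build header-in-frag skbs for devices that opted in. */
static bool xsk_tx_can_skip_linear(const struct net_device *dev)
{
	return dev->priv_flags & IFF_TX_SKB_NO_LINEAR;
}
```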

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Suggested-by: Alexander Lobakin <alobakin@pm.me>
---
 include/linux/netdevice.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ef51725..135db8f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1525,6 +1525,7 @@ struct net_device_ops {
  * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
  * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device
  * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running
+ * @IFF_TX_SKB_NO_LINEAR: allow tx skbs with an empty linear area
  */
 enum netdev_priv_flags {
 	IFF_802_1Q_VLAN			= 1<<0,
@@ -1558,6 +1559,7 @@ enum netdev_priv_flags {
 	IFF_FAILOVER_SLAVE		= 1<<28,
 	IFF_L3MDEV_RX_HANDLER		= 1<<29,
 	IFF_LIVE_RENAME_OK		= 1<<30,
+	IFF_TX_SKB_NO_LINEAR		= 1<<31,
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1590,6 +1592,7 @@ enum netdev_priv_flags {
 #define IFF_FAILOVER_SLAVE		IFF_FAILOVER_SLAVE
 #define IFF_L3MDEV_RX_HANDLER		IFF_L3MDEV_RX_HANDLER
 #define IFF_LIVE_RENAME_OK		IFF_LIVE_RENAME_OK
+#define IFF_TX_SKB_NO_LINEAR		IFF_TX_SKB_NO_LINEAR
 
 /**
  *	struct net_device - The DEVICE structure.
-- 
1.8.3.1



* [PATCH bpf-next v3 2/3] virtio-net: support IFF_TX_SKB_NO_LINEAR
  2021-01-21 13:47 [PATCH bpf-next v3 0/3] xsk: build skb by page Xuan Zhuo
  2021-01-21 13:47 ` [PATCH bpf-next v3 1/3] net: add priv_flags to allow tx skb without linear Xuan Zhuo
@ 2021-01-21 13:47 ` Xuan Zhuo
  2021-01-21 13:47 ` [PATCH bpf-next v3 3/3] xsk: build skb by page Xuan Zhuo
  2 siblings, 0 replies; 22+ messages in thread
From: Xuan Zhuo @ 2021-01-21 13:47 UTC (permalink / raw)
  To: bpf
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, netdev

virtio-net supports transmitting skbs whose linear space is empty
(its xmit path maps the whole skb, frags included, into the
virtqueue), so set the IFF_TX_SKB_NO_LINEAR priv_flag.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio_net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ba8e637..f2ff6c3 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2972,7 +2972,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 		return -ENOMEM;
 
 	/* Set up network device as normal. */
-	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE;
+	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE |
+			   IFF_TX_SKB_NO_LINEAR;
 	dev->netdev_ops = &virtnet_netdev;
 	dev->features = NETIF_F_HIGHDMA;
 
-- 
1.8.3.1



* [PATCH bpf-next v3 3/3] xsk: build skb by page
  2021-01-21 13:47 [PATCH bpf-next v3 0/3] xsk: build skb by page Xuan Zhuo
  2021-01-21 13:47 ` [PATCH bpf-next v3 1/3] net: add priv_flags to allow tx skb without linear Xuan Zhuo
  2021-01-21 13:47 ` [PATCH bpf-next v3 2/3] virtio-net: support IFF_TX_SKB_NO_LINEAR Xuan Zhuo
@ 2021-01-21 13:47 ` Xuan Zhuo
  2021-01-21 15:17   ` Magnus Karlsson
  2021-01-21 15:41   ` Eric Dumazet
  2 siblings, 2 replies; 22+ messages in thread
From: Xuan Zhuo @ 2021-01-21 13:47 UTC (permalink / raw)
  To: bpf
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, netdev

This patch constructs the skb directly from the umem pages to avoid
the overhead of copying the data.

This is implemented on top of IFF_TX_SKB_NO_LINEAR: only when the
network device sets IFF_TX_SKB_NO_LINEAR in its priv_flags do we build
the skb directly from pages. If the device does not support this
feature, it is still necessary to copy the data to construct the skb.
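
To illustrate the page walk in xsk_build_skb_zerocopy() below: a
descriptor whose data starts `offset` bytes into a page and is `len`
bytes long becomes one frag per page touched. A stand-alone user-space
sketch of the same arithmetic (4K pages assumed, values picked for the
example):

```
#include <stdio.h>

#define PAGE_SIZE 4096u

int main(void)
{
	unsigned int offset = 3000, len = 3000, copied = 0, i = 0;

	while (copied < len) {
		unsigned int space = PAGE_SIZE - offset;
		unsigned int copy = space < len - copied ? space : len - copied;

		printf("frag %u: offset=%u len=%u\n", i++, offset, copy);
		copied += copy;
		offset = 0;	/* later pages are used from their start */
	}
	/* prints: frag 0: offset=3000 len=1096
	 *         frag 1: offset=0 len=1904
	 */
	return 0;
}
```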

---------------- Performance Testing ------------

The test environment is an Aliyun ECS server.
Test cmd (xdpsock in tx-only, generic/SKB mode, with the given packet size):
```
xdpsock -i eth0 -t -S -s <msg size>
```

Test result data (packets per second; "percent" is the relative gain
of page over copy):

size    64      512     1024    1500
copy    1916747 1775988 1600203 1440054
page    1974058 1953655 1945463 1904478
percent 3.0%    10.0%   21.6%   32.3%

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
 net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 86 insertions(+), 18 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4a83117..38af7f1 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
 	sock_wfree(skb);
 }
 
+static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
+					      struct xdp_desc *desc)
+{
+	u32 len, offset, copy, copied;
+	struct sk_buff *skb;
+	struct page *page;
+	void *buffer;
+	int err, i;
+	u64 addr;
+
+	skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
+	if (unlikely(!skb))
+		return ERR_PTR(err);
+
+	addr = desc->addr;
+	len = desc->len;
+
+	buffer = xsk_buff_raw_get_data(xs->pool, addr);
+	offset = offset_in_page(buffer);
+	addr = buffer - xs->pool->addrs;
+
+	for (copied = 0, i = 0; copied < len; i++) {
+		page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
+
+		get_page(page);
+
+		copy = min_t(u32, PAGE_SIZE - offset, len - copied);
+
+		skb_fill_page_desc(skb, i, page, offset, copy);
+
+		copied += copy;
+		addr += copy;
+		offset = 0;
+	}
+
+	skb->len += len;
+	skb->data_len += len;
+	skb->truesize += len;
+
+	refcount_add(len, &xs->sk.sk_wmem_alloc);
+
+	return skb;
+}
+
+static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
+				     struct xdp_desc *desc)
+{
+	struct sk_buff *skb;
+
+	if (xs->dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
+		skb = xsk_build_skb_zerocopy(xs, desc);
+		if (IS_ERR(skb))
+			return skb;
+	} else {
+		void *buffer;
+		u32 len;
+		int err;
+
+		len = desc->len;
+		skb = sock_alloc_send_skb(&xs->sk, len, 1, &err);
+		if (unlikely(!skb))
+			return ERR_PTR(err);
+
+		skb_put(skb, len);
+		buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
+		err = skb_store_bits(skb, 0, buffer, len);
+		if (unlikely(err)) {
+			kfree_skb(skb);
+			return ERR_PTR(err);
+		}
+	}
+
+	skb->dev = xs->dev;
+	skb->priority = xs->sk.sk_priority;
+	skb->mark = xs->sk.sk_mark;
+	skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr;
+	skb->destructor = xsk_destruct_skb;
+
+	return skb;
+}
+
 static int xsk_generic_xmit(struct sock *sk)
 {
 	struct xdp_sock *xs = xdp_sk(sk);
@@ -446,43 +527,30 @@ static int xsk_generic_xmit(struct sock *sk)
 		goto out;
 
 	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
-		char *buffer;
-		u64 addr;
-		u32 len;
-
 		if (max_batch-- == 0) {
 			err = -EAGAIN;
 			goto out;
 		}
 
-		len = desc.len;
-		skb = sock_alloc_send_skb(sk, len, 1, &err);
-		if (unlikely(!skb))
+		skb = xsk_build_skb(xs, &desc);
+		if (IS_ERR(skb)) {
+			err = PTR_ERR(skb);
 			goto out;
+		}
 
-		skb_put(skb, len);
-		addr = desc.addr;
-		buffer = xsk_buff_raw_get_data(xs->pool, addr);
-		err = skb_store_bits(skb, 0, buffer, len);
 		/* This is the backpressure mechanism for the Tx path.
 		 * Reserve space in the completion queue and only proceed
 		 * if there is space in it. This avoids having to implement
 		 * any buffering in the Tx path.
 		 */
 		spin_lock_irqsave(&xs->pool->cq_lock, flags);
-		if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
+		if (xskq_prod_reserve(xs->pool->cq)) {
 			spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
 			kfree_skb(skb);
 			goto out;
 		}
 		spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
 
-		skb->dev = xs->dev;
-		skb->priority = sk->sk_priority;
-		skb->mark = sk->sk_mark;
-		skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
-		skb->destructor = xsk_destruct_skb;
-
 		err = __dev_direct_xmit(skb, xs->queue_id);
 		if  (err == NETDEV_TX_BUSY) {
 			/* Tell user-space to retry the send */
-- 
1.8.3.1



* Re: [PATCH bpf-next v3 1/3] net: add priv_flags to allow tx skb without linear
  2021-01-21 13:47 ` [PATCH bpf-next v3 1/3] net: add priv_flags to allow tx skb without linear Xuan Zhuo
@ 2021-01-21 15:13   ` Alexander Lobakin
  0 siblings, 0 replies; 22+ messages in thread
From: Alexander Lobakin @ 2021-01-21 15:13 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Alexander Lobakin, Michael S. Tsirkin, Jason Wang,
	David S. Miller, Jakub Kicinski, Björn Töpel,
	Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, virtualization, bpf, netdev, linux-kernel

From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Date: Thu, 21 Jan 2021 21:47:07 +0800

> In some cases, we hope to construct skb directly based on the existing
> memory without copying data. In this case, the page will be placed
> directly in the skb, and the linear space of skb is empty. But
> unfortunately, many the network card does not support this operation.
> For example Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will
> get the following error message:
> 
>     mlx5_core 0000:3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68
>     00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>     00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>     00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>     00000030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2
>     WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64
>     00000000: 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00
>     00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>     00000020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00
>     00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>     mlx5_core 0000:3b:00.1 eth1: ERR CQE on SQ: 0x1dbb
> 
> So a priv_flag is added here to indicate whether the network card
> supports this feature.
> 
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Suggested-by: Alexander Lobakin <alobakin@pm.me>

Acked-by: Alexander Lobakin <alobakin@pm.me>

> ---
>  include/linux/netdevice.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index ef51725..135db8f 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1525,6 +1525,7 @@ struct net_device_ops {
>   * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
>   * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device
>   * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running
> + * @IFF_TX_SKB_NO_LINEAR: allow tx skbs with an empty linear area
>   */
>  enum netdev_priv_flags {
>  	IFF_802_1Q_VLAN			= 1<<0,
> @@ -1558,6 +1559,7 @@ enum netdev_priv_flags {
>  	IFF_FAILOVER_SLAVE		= 1<<28,
>  	IFF_L3MDEV_RX_HANDLER		= 1<<29,
>  	IFF_LIVE_RENAME_OK		= 1<<30,
> +	IFF_TX_SKB_NO_LINEAR		= 1<<31,
>  };
>  
>  #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
> @@ -1590,6 +1592,7 @@ enum netdev_priv_flags {
>  #define IFF_FAILOVER_SLAVE		IFF_FAILOVER_SLAVE
>  #define IFF_L3MDEV_RX_HANDLER		IFF_L3MDEV_RX_HANDLER
>  #define IFF_LIVE_RENAME_OK		IFF_LIVE_RENAME_OK
> +#define IFF_TX_SKB_NO_LINEAR		IFF_TX_SKB_NO_LINEAR
>  
>  /**
>   *	struct net_device - The DEVICE structure.
> -- 
> 1.8.3.1

Thanks,
Al



* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
  2021-01-21 13:47 ` [PATCH bpf-next v3 3/3] xsk: build skb by page Xuan Zhuo
@ 2021-01-21 15:17   ` Magnus Karlsson
  2021-01-21 15:41   ` Eric Dumazet
  1 sibling, 0 replies; 22+ messages in thread
From: Magnus Karlsson @ 2021-01-21 15:17 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: bpf, Michael S. Tsirkin, Jason Wang, David S. Miller,
	Jakub Kicinski, Björn Töpel, Magnus Karlsson,
	Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	virtualization, Network Development

On Thu, Jan 21, 2021 at 2:51 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> This patch is used to construct skb based on page to save memory copy
> overhead.
>
> This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
> network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to
> directly construct skb. If this feature is not supported, it is still
> necessary to copy data to construct skb.
>
> ---------------- Performance Testing ------------
>
> The test environment is Aliyun ECS server.
> Test cmd:
> ```
> xdpsock -i eth0 -t  -S -s <msg size>
> ```
>
> Test result data:
>
> size    64      512     1024    1500
> copy    1916747 1775988 1600203 1440054
> page    1974058 1953655 1945463 1904478
> percent 3.0%    10.0%   21.58%  32.3%
>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> ---
>  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 86 insertions(+), 18 deletions(-)

Applied, compiled and tried it out on my NIC that does not support
IFF_TX_SKB_NO_LINEAR and it works fine. Thank you Xuan for all your
efforts. Appreciated.

Now it would be nice if we could get some physical NIC drivers to
support this too. Some probably already do and can just set the bit,
while others need some modifications to support this.

Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 4a83117..38af7f1 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
>         sock_wfree(skb);
>  }
>
> +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> +                                             struct xdp_desc *desc)
> +{
> +       u32 len, offset, copy, copied;
> +       struct sk_buff *skb;
> +       struct page *page;
> +       void *buffer;
> +       int err, i;
> +       u64 addr;
> +
> +       skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> +       if (unlikely(!skb))
> +               return ERR_PTR(err);
> +
> +       addr = desc->addr;
> +       len = desc->len;
> +
> +       buffer = xsk_buff_raw_get_data(xs->pool, addr);
> +       offset = offset_in_page(buffer);
> +       addr = buffer - xs->pool->addrs;
> +
> +       for (copied = 0, i = 0; copied < len; i++) {
> +               page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> +
> +               get_page(page);
> +
> +               copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> +
> +               skb_fill_page_desc(skb, i, page, offset, copy);
> +
> +               copied += copy;
> +               addr += copy;
> +               offset = 0;
> +       }
> +
> +       skb->len += len;
> +       skb->data_len += len;
> +       skb->truesize += len;
> +
> +       refcount_add(len, &xs->sk.sk_wmem_alloc);
> +
> +       return skb;
> +}
> +
> +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> +                                    struct xdp_desc *desc)
> +{
> +       struct sk_buff *skb;
> +
> +       if (xs->dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
> +               skb = xsk_build_skb_zerocopy(xs, desc);
> +               if (IS_ERR(skb))
> +                       return skb;
> +       } else {
> +               void *buffer;
> +               u32 len;
> +               int err;
> +
> +               len = desc->len;
> +               skb = sock_alloc_send_skb(&xs->sk, len, 1, &err);
> +               if (unlikely(!skb))
> +                       return ERR_PTR(err);
> +
> +               skb_put(skb, len);
> +               buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
> +               err = skb_store_bits(skb, 0, buffer, len);
> +               if (unlikely(err)) {
> +                       kfree_skb(skb);
> +                       return ERR_PTR(err);
> +               }
> +       }
> +
> +       skb->dev = xs->dev;
> +       skb->priority = xs->sk.sk_priority;
> +       skb->mark = xs->sk.sk_mark;
> +       skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr;
> +       skb->destructor = xsk_destruct_skb;
> +
> +       return skb;
> +}
> +
>  static int xsk_generic_xmit(struct sock *sk)
>  {
>         struct xdp_sock *xs = xdp_sk(sk);
> @@ -446,43 +527,30 @@ static int xsk_generic_xmit(struct sock *sk)
>                 goto out;
>
>         while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
> -               char *buffer;
> -               u64 addr;
> -               u32 len;
> -
>                 if (max_batch-- == 0) {
>                         err = -EAGAIN;
>                         goto out;
>                 }
>
> -               len = desc.len;
> -               skb = sock_alloc_send_skb(sk, len, 1, &err);
> -               if (unlikely(!skb))
> +               skb = xsk_build_skb(xs, &desc);
> +               if (IS_ERR(skb)) {
> +                       err = PTR_ERR(skb);
>                         goto out;
> +               }
>
> -               skb_put(skb, len);
> -               addr = desc.addr;
> -               buffer = xsk_buff_raw_get_data(xs->pool, addr);
> -               err = skb_store_bits(skb, 0, buffer, len);
>                 /* This is the backpressure mechanism for the Tx path.
>                  * Reserve space in the completion queue and only proceed
>                  * if there is space in it. This avoids having to implement
>                  * any buffering in the Tx path.
>                  */
>                 spin_lock_irqsave(&xs->pool->cq_lock, flags);
> -               if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
> +               if (xskq_prod_reserve(xs->pool->cq)) {
>                         spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
>                         kfree_skb(skb);
>                         goto out;
>                 }
>                 spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
>
> -               skb->dev = xs->dev;
> -               skb->priority = sk->sk_priority;
> -               skb->mark = sk->sk_mark;
> -               skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
> -               skb->destructor = xsk_destruct_skb;
> -
>                 err = __dev_direct_xmit(skb, xs->queue_id);
>                 if  (err == NETDEV_TX_BUSY) {
>                         /* Tell user-space to retry the send */
> --
> 1.8.3.1
>


* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
  2021-01-21 13:47 ` [PATCH bpf-next v3 3/3] xsk: build skb by page Xuan Zhuo
  2021-01-21 15:17   ` Magnus Karlsson
@ 2021-01-21 15:41   ` Eric Dumazet
  2021-01-22 11:47     ` Alexander Lobakin
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2021-01-21 15:41 UTC (permalink / raw)
  To: Xuan Zhuo, bpf
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, netdev



On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> This patch is used to construct skb based on page to save memory copy
> overhead.
> 
> This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
> network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to
> directly construct skb. If this feature is not supported, it is still
> necessary to copy data to construct skb.
> 
> ---------------- Performance Testing ------------
> 
> The test environment is Aliyun ECS server.
> Test cmd:
> ```
> xdpsock -i eth0 -t  -S -s <msg size>
> ```
> 
> Test result data:
> 
> size    64      512     1024    1500
> copy    1916747 1775988 1600203 1440054
> page    1974058 1953655 1945463 1904478
> percent 3.0%    10.0%   21.58%  32.3%
> 
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> ---
>  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 86 insertions(+), 18 deletions(-)
> 
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 4a83117..38af7f1 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
>  	sock_wfree(skb);
>  }
>  
> +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> +					      struct xdp_desc *desc)
> +{
> +	u32 len, offset, copy, copied;
> +	struct sk_buff *skb;
> +	struct page *page;
> +	void *buffer;
> +	int err, i;
> +	u64 addr;
> +
> +	skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> +	if (unlikely(!skb))
> +		return ERR_PTR(err);
> +
> +	addr = desc->addr;
> +	len = desc->len;
> +
> +	buffer = xsk_buff_raw_get_data(xs->pool, addr);
> +	offset = offset_in_page(buffer);
> +	addr = buffer - xs->pool->addrs;
> +
> +	for (copied = 0, i = 0; copied < len; i++) {
> +		page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> +
> +		get_page(page);
> +
> +		copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> +
> +		skb_fill_page_desc(skb, i, page, offset, copy);
> +
> +		copied += copy;
> +		addr += copy;
> +		offset = 0;
> +	}
> +
> +	skb->len += len;
> +	skb->data_len += len;

> +	skb->truesize += len;

This is not the truesize, unfortunately.

We need to account for the number of pages, not number of bytes.

> +
> +	refcount_add(len, &xs->sk.sk_wmem_alloc);
> +
> +	return skb;
> +}
> +




* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
  2021-01-21 15:41   ` Eric Dumazet
@ 2021-01-22 11:47     ` Alexander Lobakin
  2021-01-22 11:55       ` Alexander Lobakin
  0 siblings, 1 reply; 22+ messages in thread
From: Alexander Lobakin @ 2021-01-22 11:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Lobakin, Xuan Zhuo, Michael S. Tsirkin, Jason Wang,
	David S. Miller, Jakub Kicinski, Björn Töpel,
	Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, virtualization, bpf, netdev, linux-kernel

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 21 Jan 2021 16:41:33 +0100

> On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > This patch is used to construct skb based on page to save memory copy
> > overhead.
> > 
> > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
> > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to
> > directly construct skb. If this feature is not supported, it is still
> > necessary to copy data to construct skb.
> > 
> > ---------------- Performance Testing ------------
> > 
> > The test environment is Aliyun ECS server.
> > Test cmd:
> > ```
> > xdpsock -i eth0 -t  -S -s <msg size>
> > ```
> > 
> > Test result data:
> > 
> > size    64      512     1024    1500
> > copy    1916747 1775988 1600203 1440054
> > page    1974058 1953655 1945463 1904478
> > percent 3.0%    10.0%   21.58%  32.3%
> > 
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > ---
> >  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 86 insertions(+), 18 deletions(-)
> > 
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 4a83117..38af7f1 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> >  	sock_wfree(skb);
> >  }
> >  
> > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > +					      struct xdp_desc *desc)
> > +{
> > +	u32 len, offset, copy, copied;
> > +	struct sk_buff *skb;
> > +	struct page *page;
> > +	void *buffer;
> > +	int err, i;
> > +	u64 addr;
> > +
> > +	skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > +	if (unlikely(!skb))
> > +		return ERR_PTR(err);
> > +
> > +	addr = desc->addr;
> > +	len = desc->len;
> > +
> > +	buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > +	offset = offset_in_page(buffer);
> > +	addr = buffer - xs->pool->addrs;
> > +
> > +	for (copied = 0, i = 0; copied < len; i++) {
> > +		page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > +
> > +		get_page(page);
> > +
> > +		copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > +
> > +		skb_fill_page_desc(skb, i, page, offset, copy);
> > +
> > +		copied += copy;
> > +		addr += copy;
> > +		offset = 0;
> > +	}
> > +
> > +	skb->len += len;
> > +	skb->data_len += len;
> 
> > +	skb->truesize += len;
> 
> This is not the truesize, unfortunately.
> 
> We need to account for the number of pages, not number of bytes.

The easiest solution is:

	skb->truesize += PAGE_SIZE * i;

i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.

> > +
> > +	refcount_add(len, &xs->sk.sk_wmem_alloc);
> > +
> > +	return skb;
> > +}
> > +

Al



* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
  2021-01-22 11:47     ` Alexander Lobakin
@ 2021-01-22 11:55       ` Alexander Lobakin
  2021-01-22 12:08         ` Alexander Lobakin
  2021-01-22 12:18         ` Magnus Karlsson
  0 siblings, 2 replies; 22+ messages in thread
From: Alexander Lobakin @ 2021-01-22 11:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Lobakin, Xuan Zhuo, Michael S. Tsirkin, Jason Wang,
	David S. Miller, Jakub Kicinski, Björn Töpel,
	Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, virtualization, bpf, netdev, linux-kernel

From: Alexander Lobakin <alobakin@pm.me>
Date: Fri, 22 Jan 2021 11:47:45 +0000

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 21 Jan 2021 16:41:33 +0100
> 
> > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > This patch is used to construct skb based on page to save memory copy
> > > overhead.
> > > 
> > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
> > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to
> > > directly construct skb. If this feature is not supported, it is still
> > > necessary to copy data to construct skb.
> > > 
> > > ---------------- Performance Testing ------------
> > > 
> > > The test environment is Aliyun ECS server.
> > > Test cmd:
> > > ```
> > > xdpsock -i eth0 -t  -S -s <msg size>
> > > ```
> > > 
> > > Test result data:
> > > 
> > > size    64      512     1024    1500
> > > copy    1916747 1775988 1600203 1440054
> > > page    1974058 1953655 1945463 1904478
> > > percent 3.0%    10.0%   21.58%  32.3%
> > > 
> > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > > ---
> > >  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index 4a83117..38af7f1 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > >  	sock_wfree(skb);
> > >  }
> > >  
> > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > +					      struct xdp_desc *desc)
> > > +{
> > > +	u32 len, offset, copy, copied;
> > > +	struct sk_buff *skb;
> > > +	struct page *page;
> > > +	void *buffer;
> > > +	int err, i;
> > > +	u64 addr;
> > > +
> > > +	skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > > +	if (unlikely(!skb))
> > > +		return ERR_PTR(err);
> > > +
> > > +	addr = desc->addr;
> > > +	len = desc->len;
> > > +
> > > +	buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > +	offset = offset_in_page(buffer);
> > > +	addr = buffer - xs->pool->addrs;
> > > +
> > > +	for (copied = 0, i = 0; copied < len; i++) {
> > > +		page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > +
> > > +		get_page(page);
> > > +
> > > +		copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > +
> > > +		skb_fill_page_desc(skb, i, page, offset, copy);
> > > +
> > > +		copied += copy;
> > > +		addr += copy;
> > > +		offset = 0;
> > > +	}
> > > +
> > > +	skb->len += len;
> > > +	skb->data_len += len;
> > 
> > > +	skb->truesize += len;
> > 
> > This is not the truesize, unfortunately.
> > 
> > We need to account for the number of pages, not number of bytes.
> 
> The easiest solution is:
> 
> 	skb->truesize += PAGE_SIZE * i;
> 
> i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.

Oops, pls ignore this. I forgot that XSK buffers are not
"one per page".
We need to count the number of pages manually and then do

	skb->truesize += PAGE_SIZE * npages;

Right.

> > > +
> > > +	refcount_add(len, &xs->sk.sk_wmem_alloc);
> > > +
> > > +	return skb;
> > > +}
> > > +
> 
> Al

Thanks,
Al



* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
  2021-01-22 11:55       ` Alexander Lobakin
@ 2021-01-22 12:08         ` Alexander Lobakin
  2021-01-22 12:18         ` Magnus Karlsson
  1 sibling, 0 replies; 22+ messages in thread
From: Alexander Lobakin @ 2021-01-22 12:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Lobakin, Xuan Zhuo, Michael S. Tsirkin, Jason Wang,
	David S. Miller, Jakub Kicinski, bjorn, Magnus Karlsson,
	Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	virtualization, bpf, netdev, linux-kernel

From: Alexander Lobakin <alobakin@pm.me>
Date: Fri, 22 Jan 2021 11:55:35 +0000

> From: Alexander Lobakin <alobakin@pm.me>
> Date: Fri, 22 Jan 2021 11:47:45 +0000
> 
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > 
> > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > This patch is used to construct skb based on page to save memory copy
> > > > overhead.
> > > > 
> > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
> > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to
> > > > directly construct skb. If this feature is not supported, it is still
> > > > necessary to copy data to construct skb.
> > > > 
> > > > ---------------- Performance Testing ------------
> > > > 
> > > > The test environment is Aliyun ECS server.
> > > > Test cmd:
> > > > ```
> > > > xdpsock -i eth0 -t  -S -s <msg size>
> > > > ```
> > > > 
> > > > Test result data:
> > > > 
> > > > size    64      512     1024    1500
> > > > copy    1916747 1775988 1600203 1440054
> > > > page    1974058 1953655 1945463 1904478
> > > > percent 3.0%    10.0%   21.58%  32.3%
> > > > 
> > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > > > ---
> > > >  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > 
> > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > index 4a83117..38af7f1 100644
> > > > --- a/net/xdp/xsk.c
> > > > +++ b/net/xdp/xsk.c
> > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > >  	sock_wfree(skb);
> > > >  }
> > > >  
> > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > +					      struct xdp_desc *desc)
> > > > +{
> > > > +	u32 len, offset, copy, copied;
> > > > +	struct sk_buff *skb;
> > > > +	struct page *page;
> > > > +	void *buffer;
> > > > +	int err, i;
> > > > +	u64 addr;
> > > > +
> > > > +	skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);

Also,
maybe we should allocate it with NET_SKB_PAD so NIC drivers could
use some reserved space?

		skb = sock_alloc_send_skb(&xs->sk, NET_SKB_PAD, 1, &err);
		...
		skb_reserve(skb, NET_SKB_PAD);

Eric, what do you think?

> > > > +	if (unlikely(!skb))
> > > > +		return ERR_PTR(err);
> > > > +
> > > > +	addr = desc->addr;
> > > > +	len = desc->len;
> > > > +
> > > > +	buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > > +	offset = offset_in_page(buffer);
> > > > +	addr = buffer - xs->pool->addrs;
> > > > +
> > > > +	for (copied = 0, i = 0; copied < len; i++) {
> > > > +		page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > > +
> > > > +		get_page(page);
> > > > +
> > > > +		copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > > +
> > > > +		skb_fill_page_desc(skb, i, page, offset, copy);
> > > > +
> > > > +		copied += copy;
> > > > +		addr += copy;
> > > > +		offset = 0;
> > > > +	}
> > > > +
> > > > +	skb->len += len;
> > > > +	skb->data_len += len;
> > > 
> > > > +	skb->truesize += len;
> > > 
> > > This is not the truesize, unfortunately.
> > > 
> > > We need to account for the number of pages, not number of bytes.
> > 
> > The easiest solution is:
> > 
> > 	skb->truesize += PAGE_SIZE * i;
> > 
> > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.
> 
> Oops, pls ignore this. I forgot that XSK buffers are not
> "one per page".
> We need to count the number of pages manually and then do
> 
> 	skb->truesize += PAGE_SIZE * npages;
> 
> Right.
> 
> > > > +
> > > > +	refcount_add(len, &xs->sk.sk_wmem_alloc);
> > > > +
> > > > +	return skb;
> > > > +}
> > > > +
> > 
> > Al
> 
> Thanks,
> Al

Al



* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
  2021-01-22 11:55       ` Alexander Lobakin
  2021-01-22 12:08         ` Alexander Lobakin
@ 2021-01-22 12:18         ` Magnus Karlsson
  2021-01-22 12:39           ` Alexander Lobakin
  1 sibling, 1 reply; 22+ messages in thread
From: Magnus Karlsson @ 2021-01-22 12:18 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Eric Dumazet, Xuan Zhuo, Michael S. Tsirkin, Jason Wang,
	David S. Miller, Jakub Kicinski, Björn Töpel,
	Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, virtualization, bpf, Network Development, open list

On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin <alobakin@pm.me> wrote:
>
> From: Alexander Lobakin <alobakin@pm.me>
> Date: Fri, 22 Jan 2021 11:47:45 +0000
>
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Thu, 21 Jan 2021 16:41:33 +0100
> >
> > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > This patch is used to construct skb based on page to save memory copy
> > > > overhead.
> > > >
> > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
> > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to
> > > > directly construct skb. If this feature is not supported, it is still
> > > > necessary to copy data to construct skb.
> > > >
> > > > ---------------- Performance Testing ------------
> > > >
> > > > The test environment is Aliyun ECS server.
> > > > Test cmd:
> > > > ```
> > > > xdpsock -i eth0 -t  -S -s <msg size>
> > > > ```
> > > >
> > > > Test result data:
> > > >
> > > > size    64      512     1024    1500
> > > > copy    1916747 1775988 1600203 1440054
> > > > page    1974058 1953655 1945463 1904478
> > > > percent 3.0%    10.0%   21.58%  32.3%
> > > >
> > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > > > ---
> > > >  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > index 4a83117..38af7f1 100644
> > > > --- a/net/xdp/xsk.c
> > > > +++ b/net/xdp/xsk.c
> > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > >   sock_wfree(skb);
> > > >  }
> > > >
> > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > +                                       struct xdp_desc *desc)
> > > > +{
> > > > + u32 len, offset, copy, copied;
> > > > + struct sk_buff *skb;
> > > > + struct page *page;
> > > > + void *buffer;
> > > > + int err, i;
> > > > + u64 addr;
> > > > +
> > > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > > > + if (unlikely(!skb))
> > > > +         return ERR_PTR(err);
> > > > +
> > > > + addr = desc->addr;
> > > > + len = desc->len;
> > > > +
> > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > > + offset = offset_in_page(buffer);
> > > > + addr = buffer - xs->pool->addrs;
> > > > +
> > > > + for (copied = 0, i = 0; copied < len; i++) {
> > > > +         page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > > +
> > > > +         get_page(page);
> > > > +
> > > > +         copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > > +
> > > > +         skb_fill_page_desc(skb, i, page, offset, copy);
> > > > +
> > > > +         copied += copy;
> > > > +         addr += copy;
> > > > +         offset = 0;
> > > > + }
> > > > +
> > > > + skb->len += len;
> > > > + skb->data_len += len;
> > >
> > > > + skb->truesize += len;
> > >
> > > This is not the truesize, unfortunately.
> > >
> > > We need to account for the number of pages, not number of bytes.
> >
> > The easiest solution is:
> >
> >       skb->truesize += PAGE_SIZE * i;
> >
> > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.
>
> Oops, pls ignore this. I forgot that XSK buffers are not
> "one per page".
> We need to count the number of pages manually and then do
>
>         skb->truesize += PAGE_SIZE * npages;
>
> Right.

There are two possible packet buffer (chunks) sizes in a umem, 2K and
4K on a system with a PAGE_SIZE of 4K. If I remember correctly, and
please correct me if wrong, truesize is used for memory accounting.
But in this code, no kernel memory has been allocated (apart from the
skb). The page is just a part of the umem that has been already
allocated beforehand and by user-space in this case. So what should
truesize be in this case? Do we add 0, chunk_size * i, or the
complicated case of counting exactly how many 4K pages that are used
when the chunk_size is 2K, as two chunks could occupy the same page,
or just the upper bound of PAGE_SIZE * i that is likely a good
approximation in most cases? Just note that there might be other uses
of truesize that I am unaware of that could impact this choice.

> > > > +
> > > > + refcount_add(len, &xs->sk.sk_wmem_alloc);
> > > > +
> > > > + return skb;
> > > > +}
> > > > +
> >
> > Al
>
> Thanks,
> Al
>


* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
  2021-01-22 12:18         ` Magnus Karlsson
@ 2021-01-22 12:39           ` Alexander Lobakin
  2021-01-22 12:55             ` Magnus Karlsson
  0 siblings, 1 reply; 22+ messages in thread
From: Alexander Lobakin @ 2021-01-22 12:39 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Alexander Lobakin, Eric Dumazet, Xuan Zhuo, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Bjorn Topel,
	Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, virtualization, bpf, Network Development, open list

From: Magnus Karlsson <magnus.karlsson@gmail.com>
Date: Fri, 22 Jan 2021 13:18:47 +0100

> On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin <alobakin@pm.me> wrote:
> >
> > From: Alexander Lobakin <alobakin@pm.me>
> > Date: Fri, 22 Jan 2021 11:47:45 +0000
> >
> > > From: Eric Dumazet <eric.dumazet@gmail.com>
> > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > >
> > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > This patch is used to construct skb based on page to save memory copy
> > > > > overhead.
> > > > >
> > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
> > > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to
> > > > > directly construct skb. If this feature is not supported, it is still
> > > > > necessary to copy data to construct skb.
> > > > >
> > > > > ---------------- Performance Testing ------------
> > > > >
> > > > > The test environment is Aliyun ECS server.
> > > > > Test cmd:
> > > > > ```
> > > > > xdpsock -i eth0 -t  -S -s <msg size>
> > > > > ```
> > > > >
> > > > > Test result data:
> > > > >
> > > > > size    64      512     1024    1500
> > > > > copy    1916747 1775988 1600203 1440054
> > > > > page    1974058 1953655 1945463 1904478
> > > > > percent 3.0%    10.0%   21.58%  32.3%
> > > > >
> > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > > > > ---
> > > > >  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> > > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > >
> > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > index 4a83117..38af7f1 100644
> > > > > --- a/net/xdp/xsk.c
> > > > > +++ b/net/xdp/xsk.c
> > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > > >   sock_wfree(skb);
> > > > >  }
> > > > >
> > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > > +                                       struct xdp_desc *desc)
> > > > > +{
> > > > > + u32 len, offset, copy, copied;
> > > > > + struct sk_buff *skb;
> > > > > + struct page *page;
> > > > > + void *buffer;
> > > > > + int err, i;
> > > > > + u64 addr;
> > > > > +
> > > > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > > > > + if (unlikely(!skb))
> > > > > +         return ERR_PTR(err);
> > > > > +
> > > > > + addr = desc->addr;
> > > > > + len = desc->len;
> > > > > +
> > > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > > > + offset = offset_in_page(buffer);
> > > > > + addr = buffer - xs->pool->addrs;
> > > > > +
> > > > > + for (copied = 0, i = 0; copied < len; i++) {
> > > > > +         page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > > > +
> > > > > +         get_page(page);
> > > > > +
> > > > > +         copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > > > +
> > > > > +         skb_fill_page_desc(skb, i, page, offset, copy);
> > > > > +
> > > > > +         copied += copy;
> > > > > +         addr += copy;
> > > > > +         offset = 0;
> > > > > + }
> > > > > +
> > > > > + skb->len += len;
> > > > > + skb->data_len += len;
> > > >
> > > > > + skb->truesize += len;
> > > >
> > > > This is not the truesize, unfortunately.
> > > >
> > > > We need to account for the number of pages, not number of bytes.
> > >
> > > The easiest solution is:
> > >
> > >       skb->truesize += PAGE_SIZE * i;
> > >
> > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.
> >
> > Oops, pls ignore this. I forgot that XSK buffers are not
> > "one per page".
> > We need to count the number of pages manually and then do
> >
> >         skb->truesize += PAGE_SIZE * npages;
> >
> > Right.
> 
> There are two possible packet buffer (chunks) sizes in a umem, 2K and
> 4K on a system with a PAGE_SIZE of 4K. If I remember correctly, and
> please correct me if wrong, truesize is used for memory accounting.
> But in this code, no kernel memory has been allocated (apart from the
> skb). The page is just a part of the umem that has been already
> allocated beforehand and by user-space in this case. So what should
> truesize be in this case? Do we add 0, chunk_size * i, or the
> complicated case of counting exactly how many 4K pages that are used
> when the chunk_size is 2K, as two chunks could occupy the same page,
> or just the upper bound of PAGE_SIZE * i that is likely a good
> approximation in most cases? Just note that there might be other uses
> of truesize that I am unaware of that could impact this choice.

Truesize is "what amount of memory does this skb occupy with all its
fragments, linear space and struct sk_buff itself". The closer it is
to the actual value, the better.
In this case, I think adding chunk_size * i would be enough.

(PAGE_SIZE * i can be overwhelming when chunk_size is 2K, especially
for setups with PAGE_SIZE > SZ_4K)
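
Something like this, right where the per-byte version sits in the
patch (a sketch, assuming xs->pool->chunk_size is usable here):

```
	/* i frags were filled above; each frag is backed by (at most)
	 * one umem chunk, so charge one chunk per frag instead of one
	 * byte per payload byte.
	 */
	skb->len += len;
	skb->data_len += len;
	skb->truesize += xs->pool->chunk_size * i;
```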

> > > > > +
> > > > > + refcount_add(len, &xs->sk.sk_wmem_alloc);
> > > > > +
> > > > > + return skb;
> > > > > +}
> > > > > +
> > >
> > > Al
> >
> > Thanks,
> > Al

Al



* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
  2021-01-22 12:39           ` Alexander Lobakin
@ 2021-01-22 12:55             ` Magnus Karlsson
  0 siblings, 0 replies; 22+ messages in thread
From: Magnus Karlsson @ 2021-01-22 12:55 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Eric Dumazet, Xuan Zhuo, Michael S. Tsirkin, Jason Wang,
	David S. Miller, Jakub Kicinski, Bjorn Topel, Magnus Karlsson,
	Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	virtualization, bpf, Network Development, open list

On Fri, Jan 22, 2021 at 1:39 PM Alexander Lobakin <alobakin@pm.me> wrote:
>
> From: Magnus Karlsson <magnus.karlsson@gmail.com>
> Date: Fri, 22 Jan 2021 13:18:47 +0100
>
> > On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > >
> > > From: Alexander Lobakin <alobakin@pm.me>
> > > Date: Fri, 22 Jan 2021 11:47:45 +0000
> > >
> > > > From: Eric Dumazet <eric.dumazet@gmail.com>
> > > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > > >
> > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > > This patch is used to construct skb based on page to save memory copy
> > > > > > overhead.
> > > > > >
> > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
> > > > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to
> > > > > > directly construct skb. If this feature is not supported, it is still
> > > > > > necessary to copy data to construct skb.
> > > > > >
> > > > > > ---------------- Performance Testing ------------
> > > > > >
> > > > > > The test environment is Aliyun ECS server.
> > > > > > Test cmd:
> > > > > > ```
> > > > > > xdpsock -i eth0 -t  -S -s <msg size>
> > > > > > ```
> > > > > >
> > > > > > Test result data:
> > > > > >
> > > > > > size    64      512     1024    1500
> > > > > > copy    1916747 1775988 1600203 1440054
> > > > > > page    1974058 1953655 1945463 1904478
> > > > > > percent 3.0%    10.0%   21.58%  32.3%
> > > > > >
> > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > > > > > ---
> > > > > >  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> > > > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > > >
> > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > > index 4a83117..38af7f1 100644
> > > > > > --- a/net/xdp/xsk.c
> > > > > > +++ b/net/xdp/xsk.c
> > > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > > > >   sock_wfree(skb);
> > > > > >  }
> > > > > >
> > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > > > +                                       struct xdp_desc *desc)
> > > > > > +{
> > > > > > + u32 len, offset, copy, copied;
> > > > > > + struct sk_buff *skb;
> > > > > > + struct page *page;
> > > > > > + void *buffer;
> > > > > > + int err, i;
> > > > > > + u64 addr;
> > > > > > +
> > > > > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > > > > > + if (unlikely(!skb))
> > > > > > +         return ERR_PTR(err);
> > > > > > +
> > > > > > + addr = desc->addr;
> > > > > > + len = desc->len;
> > > > > > +
> > > > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > > > > + offset = offset_in_page(buffer);
> > > > > > + addr = buffer - xs->pool->addrs;
> > > > > > +
> > > > > > + for (copied = 0, i = 0; copied < len; i++) {
> > > > > > +         page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > > > > +
> > > > > > +         get_page(page);
> > > > > > +
> > > > > > +         copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > > > > +
> > > > > > +         skb_fill_page_desc(skb, i, page, offset, copy);
> > > > > > +
> > > > > > +         copied += copy;
> > > > > > +         addr += copy;
> > > > > > +         offset = 0;
> > > > > > + }
> > > > > > +
> > > > > > + skb->len += len;
> > > > > > + skb->data_len += len;
> > > > >
> > > > > > + skb->truesize += len;
> > > > >
> > > > > This is not the truesize, unfortunately.
> > > > >
> > > > > We need to account for the number of pages, not number of bytes.
> > > >
> > > > The easiest solution is:
> > > >
> > > >       skb->truesize += PAGE_SIZE * i;
> > > >
> > > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.
> > >
> > > Oops, pls ignore this. I forgot that XSK buffers are not
> > > "one per page".
> > > We need to count the number of pages manually and then do
> > >
> > >         skb->truesize += PAGE_SIZE * npages;
> > >
> > > Right.
> >
> > There are two possible packet buffer (chunks) sizes in a umem, 2K and
> > 4K on a system with a PAGE_SIZE of 4K. If I remember correctly, and
> > please correct me if wrong, truesize is used for memory accounting.
> > But in this code, no kernel memory has been allocated (apart from the
> > skb). The page is just a part of the umem that has been already
> > allocated beforehand and by user-space in this case. So what should
> > truesize be in this case? Do we add 0, chunk_size * i, or the
> > complicated case of counting exactly how many 4K pages that are used
> > when the chunk_size is 2K, as two chunks could occupy the same page,
> > or just the upper bound of PAGE_SIZE * i that is likely a good
> > approximation in most cases? Just note that there might be other uses
> > of truesize that I am unaware of that could impact this choice.
>
> Truesize is "what amount of memory does this skb occupy with all its
> fragments, linear space and struct sk_buff itself". The closest it
> will be to the actual value, the better.
> In this case, I think adding of chunk_size * i would be enough.

Sounds like a good approximation to me.

> (PAGE_SIZE * i can be overwhelming when chunk_size is 2K, especially
> for setups with PAGE_SIZE > SZ_4K)

You are right. That would be quite horrible on a system with a page size of 64K.
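
(Concretely: with 64K pages and 2K chunks, a single-frag skb would be
charged 64 KiB of truesize for at most 2 KiB of umem data, a 32x
overcount, while chunk_size * i charges 2 KiB.)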

> > > > > > +
> > > > > > + refcount_add(len, &xs->sk.sk_wmem_alloc);
> > > > > > +
> > > > > > + return skb;
> > > > > > +}
> > > > > > +
> > > >
> > > > Al
> > >
> > > Thanks,
> > > Al
>
> Al
>


* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
       [not found]             ` <1611587242.9653594-2-xuanzhuo@linux.alibaba.com>
@ 2021-01-26  7:34               ` Magnus Karlsson
  0 siblings, 0 replies; 22+ messages in thread
From: Magnus Karlsson @ 2021-01-26  7:34 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Eric Dumazet, Michael S . Tsirkin, Jason Wang, David S . Miller,
	Jakub Kicinski, Bjorn Topel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf,
	Network Development, open list, Alexander Lobakin

On Mon, Jan 25, 2021 at 4:22 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Mon, 25 Jan 2021 14:16:16 +0100, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > On Mon, Jan 25, 2021 at 1:43 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > >
> > > On Mon, 25 Jan 2021 08:44:38 +0100, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > > > On Mon, Jan 25, 2021 at 3:27 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > > >
> > > > > On Fri, 22 Jan 2021 19:37:06 +0100, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > > > > > On Fri, Jan 22, 2021 at 6:26 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > > > > >
> > > > > > > From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > > Date: Fri, 22 Jan 2021 23:39:15 +0800
> > > > > > >
> > > > > > > > On Fri, 22 Jan 2021 13:55:14 +0100, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > > > > > > > > On Fri, Jan 22, 2021 at 1:39 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > > > > > > > >
> > > > > > > > > > From: Magnus Karlsson <magnus.karlsson@gmail.com>
> > > > > > > > > > Date: Fri, 22 Jan 2021 13:18:47 +0100
> > > > > > > > > >
> > > > > > > > > > > On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > From: Alexander Lobakin <alobakin@pm.me>
> > > > > > > > > > > > Date: Fri, 22 Jan 2021 11:47:45 +0000
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Eric Dumazet <eric.dumazet@gmail.com>
> > > > > > > > > > > > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > > > > > > > > > > > This patch is used to construct skb based on page to save memory copy
> > > > > > > > > > > > > > > overhead.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
> > > > > > > > > > > > > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to
> > > > > > > > > > > > > > > directly construct skb. If this feature is not supported, it is still
> > > > > > > > > > > > > > > necessary to copy data to construct skb.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ---------------- Performance Testing ------------
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The test environment is Aliyun ECS server.
> > > > > > > > > > > > > > > Test cmd:
> > > > > > > > > > > > > > > ```
> > > > > > > > > > > > > > > xdpsock -i eth0 -t  -S -s <msg size>
> > > > > > > > > > > > > > > ```
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Test result data:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > size    64      512     1024    1500
> > > > > > > > > > > > > > > copy    1916747 1775988 1600203 1440054
> > > > > > > > > > > > > > > page    1974058 1953655 1945463 1904478
> > > > > > > > > > > > > > > percent 3.0%    10.0%   21.58%  32.3%
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > > > > > > > > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > >  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> > > > > > > > > > > > > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > > > > > > > > > > > index 4a83117..38af7f1 100644
> > > > > > > > > > > > > > > --- a/net/xdp/xsk.c
> > > > > > > > > > > > > > > +++ b/net/xdp/xsk.c
> > > > > > > > > > > > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > > > > > > > > > > > > >   sock_wfree(skb);
> > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > > > > > > > > > > > > +                                       struct xdp_desc *desc)
> > > > > > > > > > > > > > > +{
> > > > > > > > > > > > > > > + u32 len, offset, copy, copied;
> > > > > > > > > > > > > > > + struct sk_buff *skb;
> > > > > > > > > > > > > > > + struct page *page;
> > > > > > > > > > > > > > > + void *buffer;
> > > > > > > > > > > > > > > + int err, i;
> > > > > > > > > > > > > > > + u64 addr;
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > > > > > > > > > > > > > > + if (unlikely(!skb))
> > > > > > > > > > > > > > > +         return ERR_PTR(err);
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > + addr = desc->addr;
> > > > > > > > > > > > > > > + len = desc->len;
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > > > > > > > > > > > > > + offset = offset_in_page(buffer);
> > > > > > > > > > > > > > > + addr = buffer - xs->pool->addrs;
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > + for (copied = 0, i = 0; copied < len; i++) {
> > > > > > > > > > > > > > > +         page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +         get_page(page);
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +         copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +         skb_fill_page_desc(skb, i, page, offset, copy);
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +         copied += copy;
> > > > > > > > > > > > > > > +         addr += copy;
> > > > > > > > > > > > > > > +         offset = 0;
> > > > > > > > > > > > > > > + }
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > + skb->len += len;
> > > > > > > > > > > > > > > + skb->data_len += len;
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > + skb->truesize += len;
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This is not the truesize, unfortunately.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We need to account for the number of pages, not number of bytes.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The easiest solution is:
> > > > > > > > > > > > >
> > > > > > > > > > > > >       skb->truesize += PAGE_SIZE * i;
> > > > > > > > > > > > >
> > > > > > > > > > > > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.
> > > > > > > > > > > >
> > > > > > > > > > > > Oops, pls ignore this. I forgot that XSK buffers are not
> > > > > > > > > > > > "one per page".
> > > > > > > > > > > > We need to count the number of pages manually and then do
> > > > > > > > > > > >
> > > > > > > > > > > >         skb->truesize += PAGE_SIZE * npages;
> > > > > > > > > > > >
> > > > > > > > > > > > Right.
> > > > > > > > > > >
> > > > > > > > > > > There are two possible packet buffer (chunks) sizes in a umem, 2K and
> > > > > > > > > > > 4K on a system with a PAGE_SIZE of 4K. If I remember correctly, and
> > > > > > > > > > > please correct me if wrong, truesize is used for memory accounting.
> > > > > > > > > > > But in this code, no kernel memory has been allocated (apart from the
> > > > > > > > > > > skb). The page is just a part of the umem that has been already
> > > > > > > > > > > allocated beforehand and by user-space in this case. So what should
> > > > > > > > > > > truesize be in this case? Do we add 0, chunk_size * i, or the
> > > > > > > > > > > complicated case of counting exactly how many 4K pages that are used
> > > > > > > > > > > when the chunk_size is 2K, as two chunks could occupy the same page,
> > > > > > > > > > > or just the upper bound of PAGE_SIZE * i that is likely a good
> > > > > > > > > > > approximation in most cases? Just note that there might be other uses
> > > > > > > > > > > of truesize that I am unaware of that could impact this choice.
> > > > > > > > > >
> > > > > > > > > > Truesize is "what amount of memory does this skb occupy with all its
> > > > > > > > > > fragments, linear space and struct sk_buff itself". The closer it
> > > > > > > > > > is to the actual value, the better.
> > > > > > > > > > In this case, I think adding chunk_size * i would be enough.
> > > > > > > > >
> > > > > > > > > Sounds like a good approximation to me.
> > > > > > > > >
> > > > > > > > > > (PAGE_SIZE * i can be overwhelming when chunk_size is 2K, especially
> > > > > > > > > > for setups with PAGE_SIZE > SZ_4K)
> > > > > > > > >
> > > > > > > > > You are right. That would be quite horrible on a system with a page size of 64K.
> > > > > > > >
> > > > > > > > Thank you everyone, I learned a lot.
> > > > > > > >
> > > > > > > > I also think it is appropriate to add the chunk size here, and there is
> > > > > > > > actually only one chunk here, so it's very simple:
> > > > > > > >
> > > > > > > >       skb->truesize += xs->pool->chunk_size;
> > > > > > >
> > > > > > > umem chunks can't cross page boundaries. So if you're sure that
> > > > > > > there could be only one chunk, you don't need the loop at all,
> > > > > > > if I'm not missing anything.
> > > > > >
> > > > > > In the default mode, this is true. But in the unaligned_chunk mode
> > > > > > that can be set on the umem, the chunk may cross one page boundary, so
> > > > > > we need the loop and the chunk_size * i in the assignment of truesize.
> > > > > > So "i" can be 1 or 2, but nothing else.
> > > > >
> > > > > According to my understanding, in the unaligned mode a desc still refers
> > > > > to a chunk, although the chunk here may occupy multiple pages. And since
> > > > > only one desc is turned into an skb here, the most an skb can take from
> > > > > the umem is one chunk, so how can i = 2 here? Did I miss something?
> > > >
> > > > A desc refers to a single packet, not a single chunk. One packet can
> > > > occupy either one or two chunks in the unaligned mode and when this
> > > > happens to be two, these two chunks might be on different pages. In
> > > > this case, you will go through the loop twice and produce two
> > > > fragments.
> > > >
> > > > In the aligned mode, every packet starts at the beginning of each
> > > > chunk (if there is no headroom) and can only occupy a single chunk
> > > > since no packet can be larger than the chunk size. In the unaligned
> > > > mode, the packet can start anywhere within the chunk and might even
> > > > continue into the next chunk, but never more than that since the max
> > > > packet size is still the chunk size. Why unaligned mode? This provides
> > > > the user-space with complete freedom on how to optimize the placement
> > > > of the packets in the umem and this might lead to better overall
> > > > performance. Note that if we just look at the kernel part, unaligned
> > > > mode is slower than aligned, but that might not be true on the overall
> > > > level.
> > >
> > > Thank you very much for the very detailed explanation.
> > >
> > > I have a question: You said that in the case of unaligned mode, the packet
> > > pointed to by a desc may use two consecutive chunks at the same time. How do we
> > > ensure that we can apply for two consecutive chunks?
> >
> > I am not sure I understand "apply for", sorry. But in unaligned mode,
> > we reject any desc that spans two chunks that are not physically
> > consecutive in memory. So users of unaligned mode usually use huge
> > pages to make sure that this happens rarely and that these "breaks"
> > are well known to the entity in user-space that places the packets in
> > the umem (i.e., not creating packets that cross huge page boundaries
> > would suffice).
> >
> > > Back to this patch, is it okay to deal with it this way?
> > >
> > >         buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > >
> > >         .....
> > >
> > >         if (pool->unaligned) {
> > >                 u64 l;
> > >
> > >                 l = (buffer - xs->pool->addrs) % xs->pool->chunk_size;
> > >                 if (xs->pool->chunk_size - l < desc->len)
> > >                         skb->truesize += xs->pool->chunk_size * 2;
> > >                 else
> > >                         skb->truesize += xs->pool->chunk_size;
> > >         } else {
> > >                 skb->truesize += xs->pool->chunk_size;
> > >         }
> >
> > Why would not skb->truesize += xs->pool->chunk_size * i; work in both
> > cases? The copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > statement together with the loop itself should take care of the case
> > when the packet spans two pages: since PAGE_SIZE - offset < len, it will
> > loop again and copy the rest. In that case, i == 2 when you exit the
> > loop.
>
> Sorry, my understanding of the unaligned mode is not very deep yet, so
> there may be some misunderstanding on my side.
>
> Assuming that the page is 4k, the chunk size is 2k, and a packet is placed at
> the 1k position, it may occupy two chunks as long as it is larger than 1k,
> but it only uses one page, like this:
>
>      0         2k        4k
>      |        page       |
>      |   chunk |  chunk  |
>            | packet |
>
> In my code, I want to find the pages occupied by the desc and insert those
> pages into the skb, so here i = 1.

Got it, thanks! I actually think this approximation for truesize would
be the best one, and we do not have to care about packets straddling
chunk or page boundaries for the truesize calculation:

skb->truesize += pool->unaligned ? desc->len : pool->chunk_size;

For the aligned case, truesize is always one chunk as there can only
be one packet per chunk. For the unaligned case, truesize is just the
length of the packet since you can put many of these into the same
chunk or page. If you have a 2K chunk and a packet length of 256
bytes, you can put 8 packets into one single chunk. So adding
chunk_size for every single packet in the unaligned case would yield a
very bad approximation. We argued in the same way when we chose to use
chunk_size instead of PAGE_SIZE for the aligned case. The strength of
the unaligned case is that you can put many packets in a single chunk
(when packet size is low enough), so users tend to do this. If you do
not do this, then the aligned mode should be used as it will provide
better performance.
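
A minimal sketch of how the tail of xsk_build_skb_zerocopy() could then
look, assuming xs->pool->unaligned is accessible at this point (variable
names as in the v3 patch):

	/* In aligned mode a packet always pins exactly one chunk; in
	 * unaligned mode several packets can share a chunk, so the
	 * packet length itself is the better estimate.
	 */
	skb->len += len;
	skb->data_len += len;
	skb->truesize += xs->pool->unaligned ? len : xs->pool->chunk_size;

	refcount_add(len, &xs->sk.sk_wmem_alloc);

	return skb;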

> Thanks.
>
> >
> > But it would be great if you could test that your code can deal with
> > this. How about extending the xsk selftests with a case for unaligned
> > mode? Should be quite straightforward. Just make sure you put the
> > packets next to each other so that some of them will cross page
> > boundaries.
> >
> > > Thanks.
> > >
> > > >
> > > > > Thanks.
> > > > >
> > > > > >
> > > > > > > > In addition, I actually borrowed from the tcp code:
> > > > > > > >
> > > > > > > >    tcp_build_frag:
> > > > > > > >    --------------
> > > > > > > >
> > > > > > > >       if (can_coalesce) {
> > > > > > > >               skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
> > > > > > > >       } else {
> > > > > > > >               get_page(page);
> > > > > > > >               skb_fill_page_desc(skb, i, page, offset, copy);
> > > > > > > >       }
> > > > > > > >
> > > > > > > >       if (!(flags & MSG_NO_SHARED_FRAGS))
> > > > > > > >               skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
> > > > > > > >
> > > > > > > >       skb->len += copy;
> > > > > > > >       skb->data_len += copy;
> > > > > > > >       skb->truesize += copy;
> > > > > > > >
> > > > > > > > So, is there a bug here?
> > > > > > >
> > > > > > > skb_frag_t is an alias to struct bio_vec. It doesn't contain info about
> > > > > > > real memory consumption, so there's no other option but to add "copy"
> > > > > > > to truesize.
> > > > > > > XSK is different in this regard, as it operates with chunks of a known
> > > > > > > size.
> > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > + refcount_add(len, &xs->sk.sk_wmem_alloc);
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > + return skb;
> > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > >
> > > > > > > > > > > > > Al
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Al
> > > > > > > > > >
> > > > > > > > > > Al
> > > > > > >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
       [not found] <1611586627.1035807-1-xuanzhuo@linux.alibaba.com>
@ 2021-01-25 15:07 ` Alexander Lobakin
  0 siblings, 0 replies; 22+ messages in thread
From: Alexander Lobakin @ 2021-01-25 15:07 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Alexander Lobakin, Eric Dumazet, Michael S. Tsirkin, Jason Wang,
	David S. Miller, Jakub Kicinski, bjorn, Magnus Karlsson,
	Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	virtualization, bpf, netdev, linux-kernel

From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Date: Mon, 25 Jan 2021 22:57:07 +0800

> On Mon, 25 Jan 2021 13:25:45 +0000, Alexander Lobakin <alobakin@pm.me> wrote:
> > From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > Date: Mon, 25 Jan 2021 11:10:43 +0800
> >
> > > On Fri, 22 Jan 2021 16:24:17 +0000, Alexander Lobakin <alobakin@pm.me> wrote:
> > > > From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > Date: Fri, 22 Jan 2021 23:36:29 +0800
> > > >
> > > > > On Fri, 22 Jan 2021 12:08:00 +0000, Alexander Lobakin <alobakin@pm.me> wrote:
> > > > > > From: Alexander Lobakin <alobakin@pm.me>
> > > > > > Date: Fri, 22 Jan 2021 11:55:35 +0000
> > > > > >
> > > > > > > From: Alexander Lobakin <alobakin@pm.me>
> > > > > > > Date: Fri, 22 Jan 2021 11:47:45 +0000
> > > > > > >
> > > > > > > > From: Eric Dumazet <eric.dumazet@gmail.com>
> > > > > > > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > > > > > > >
> > > > > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > > > > > > This patch is used to construct skb based on page to save memory copy
> > > > > > > > > > overhead.
> > > > > > > > > >
> > > > > > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
> > > > > > > > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to
> > > > > > > > > > directly construct skb. If this feature is not supported, it is still
> > > > > > > > > > necessary to copy data to construct skb.
> > > > > > > > > >
> > > > > > > > > > ---------------- Performance Testing ------------
> > > > > > > > > >
> > > > > > > > > > The test environment is Aliyun ECS server.
> > > > > > > > > > Test cmd:
> > > > > > > > > > ```
> > > > > > > > > > xdpsock -i eth0 -t  -S -s <msg size>
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > Test result data:
> > > > > > > > > >
> > > > > > > > > > size    64      512     1024    1500
> > > > > > > > > > copy    1916747 1775988 1600203 1440054
> > > > > > > > > > page    1974058 1953655 1945463 1904478
> > > > > > > > > > percent 3.0%    10.0%   21.58%  32.3%
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > > > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > > > > > > > > > ---
> > > > > > > > > >  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> > > > > > > > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > > > > > > index 4a83117..38af7f1 100644
> > > > > > > > > > --- a/net/xdp/xsk.c
> > > > > > > > > > +++ b/net/xdp/xsk.c
> > > > > > > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > > > > > > > >  	sock_wfree(skb);
> > > > > > > > > >  }
> > > > > > > > > >
> > > > > > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > > > > > > > +					      struct xdp_desc *desc)
> > > > > > > > > > +{
> > > > > > > > > > +	u32 len, offset, copy, copied;
> > > > > > > > > > +	struct sk_buff *skb;
> > > > > > > > > > +	struct page *page;
> > > > > > > > > > +	void *buffer;
> > > > > > > > > > +	int err, i;
> > > > > > > > > > +	u64 addr;
> > > > > > > > > > +
> > > > > > > > > > +	skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > > > > >
> > > > > > Also,
> > > > > > maybe we should allocate it with NET_SKB_PAD so NIC drivers could
> > > > > > use some reserved space?
> > > > > >
> > > > > > 		skb = sock_alloc_send_skb(&xs->sk, NET_SKB_PAD, 1, &err);
> > > > > > 		...
> > > > > > 		skb_reserve(skb, NET_SKB_PAD);
> > > > > >
> > > > > > Eric, what do you think?
> > > > >
> > > > > I think you are right. Some space should be reserved for such devices. This
> > > > > space should also be added in the copy mode below. Is LL_RESERVED_SPACE more
> > > > > appropriate?
> > > >
> > > > No. If you look at __netdev_alloc_skb() and __napi_alloc_skb(), they
> > > > reserve NET_SKB_PAD at the beginning of linear area. Documentation of
> > > > __build_skb() also says that driver should reserve NET_SKB_PAD before
> > > > the actual frame, so it is a standartized hardware-independent
> > > > headroom.
> > >
> > > I understand that these scenarios are for the receive path, where the
> > > extra space is used by the protocol stack, especially RPS. I don't know
> > > whether this also applies to the send path?
> > >
> > > > Leaving that space in skb->head will allow developers to implement
> > > > IFF_TX_SKB_NO_LINEAR in a wider variety of drivers, especially when
> > > > a driver has to prepend some sort of data before the actual frame.
> > > > Since it's usually one cacheline in size, it shouldn't be a big
> > > > deal.
> > > >
> > >
> > > I agree with this. Some network cards require some space. For example,
> > > virtio-net needs to prepend a virtio_net_hdr_mrg_rxbuf before skb->data,
> > > which is where my original understanding comes from. When we hand the skb
> > > to the driver, the driver may need some memory space, so, referring to
> > > the implementation of __ip_append_data, I felt that adding
> > > LL_RESERVED_SPACE was a suitable solution.
> > >
> > > I may still not understand the use case you mentioned. Can you elaborate
> > > on what this space will be used for?
> >
> > LL_RESERVED_SPACE() consists of the L2 header size (Ethernet in most
> > cases) and dev->needed_headroom. That is not a value to count on, as:
> >  - L2 header is already here in XSK buffer;
> >  - not all drivers set dev->needed_headroom;
> >  - it's aligned by 16, not L1_CACHE_SIZE.
> >
> > As this path is XSK generic path, i.e. when driver-side XSK is not
> > present or not requested, it can be applied to every driver. Many
> > of them call skb_cow_head() + skb_push() on their xmit path:
> >  - nearly all virtual drivers (to insert their specific headers);
> >  - nearly all switch drivers (to insert switch CPU port tags);
> >  - some enterprise NIC drivers (Chelsio for LSO, Netronome
> >    for TLS etc.).
> >
> > skb_cow_head() + skb_push() rely on the NET_SKB_PAD headroom being
> > present. In cases where there is not enough space (and you allocate an
> > skb with no headroom at all), the skb will be COWed, which is a huge
> > overhead and will cause slowdowns.
> > So, adding NET_SKB_PAD would save us from almost all, if not all, such
> > reallocations.
>
> I have learnt so much, thanks to you.

Glad to hear!

> > > Thanks.
> > >
> > > >
> > > > [ I also had an idea of allocating an skb with a headroom of
> > > > NET_SKB_PAD + 256 bytes, so nearly all drivers could just call
> > > > pskb_pull_tail() to support such type of skbuffs without much
> > > > effort, but I think that it's better to teach drivers to support
> > > > xmitting of really headless ones. If virtio_net can do it, why
> > > > shouldn't the others ]
> > > >
> > > > > > > > > > +	if (unlikely(!skb))
> > > > > > > > > > +		return ERR_PTR(err);
> > > > > > > > > > +
> > > > > > > > > > +	addr = desc->addr;
> > > > > > > > > > +	len = desc->len;
> > > > > > > > > > +
> > > > > > > > > > +	buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > > > > > > > > +	offset = offset_in_page(buffer);
> > > > > > > > > > +	addr = buffer - xs->pool->addrs;
> > > > > > > > > > +
> > > > > > > > > > +	for (copied = 0, i = 0; copied < len; i++) {
> > > > > > > > > > +		page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > > > > > > > > +
> > > > > > > > > > +		get_page(page);
> > > > > > > > > > +
> > > > > > > > > > +		copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > > > > > > > > +
> > > > > > > > > > +		skb_fill_page_desc(skb, i, page, offset, copy);
> > > > > > > > > > +
> > > > > > > > > > +		copied += copy;
> > > > > > > > > > +		addr += copy;
> > > > > > > > > > +		offset = 0;
> > > > > > > > > > +	}
> > > > > > > > > > +
> > > > > > > > > > +	skb->len += len;
> > > > > > > > > > +	skb->data_len += len;
> > > > > > > > >
> > > > > > > > > > +	skb->truesize += len;
> > > > > > > > >
> > > > > > > > > This is not the truesize, unfortunately.
> > > > > > > > >
> > > > > > > > > We need to account for the number of pages, not number of bytes.
> > > > > > > >
> > > > > > > > The easiest solution is:
> > > > > > > >
> > > > > > > > 	skb->truesize += PAGE_SIZE * i;
> > > > > > > >
> > > > > > > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.
> > > > > > >
> > > > > > > Oops, pls ignore this. I forgot that XSK buffers are not
> > > > > > > "one per page".
> > > > > > > We need to count the number of pages manually and then do
> > > > > > >
> > > > > > > 	skb->truesize += PAGE_SIZE * npages;
> > > > > > >
> > > > > > > Right.
> > > > > > >
> > > > > > > > > > +
> > > > > > > > > > +	refcount_add(len, &xs->sk.sk_wmem_alloc);
> > > > > > > > > > +
> > > > > > > > > > +	return skb;
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > >
> > > > > > > > Al
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Al
> > > > > >
> > > > > > Al
> >
> > Thanks,
> > Al

Al


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
       [not found] <1611544243.8363645-1-xuanzhuo@linux.alibaba.com>
@ 2021-01-25 13:25 ` Alexander Lobakin
  0 siblings, 0 replies; 22+ messages in thread
From: Alexander Lobakin @ 2021-01-25 13:25 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Alexander Lobakin, Eric Dumazet, Michael S. Tsirkin, Jason Wang,
	David S. Miller, Jakub Kicinski, bjorn, Magnus Karlsson,
	Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	virtualization, bpf, netdev, linux-kernel

From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Date: Mon, 25 Jan 2021 11:10:43 +0800

> On Fri, 22 Jan 2021 16:24:17 +0000, Alexander Lobakin <alobakin@pm.me> wrote:
> > From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > Date: Fri, 22 Jan 2021 23:36:29 +0800
> >
> > > On Fri, 22 Jan 2021 12:08:00 +0000, Alexander Lobakin <alobakin@pm.me> wrote:
> > > > From: Alexander Lobakin <alobakin@pm.me>
> > > > Date: Fri, 22 Jan 2021 11:55:35 +0000
> > > >
> > > > > From: Alexander Lobakin <alobakin@pm.me>
> > > > > Date: Fri, 22 Jan 2021 11:47:45 +0000
> > > > >
> > > > > > From: Eric Dumazet <eric.dumazet@gmail.com>
> > > > > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > > > > >
> > > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > > > > This patch is used to construct skb based on page to save memory copy
> > > > > > > > overhead.
> > > > > > > >
> > > > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
> > > > > > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to
> > > > > > > > directly construct skb. If this feature is not supported, it is still
> > > > > > > > necessary to copy data to construct skb.
> > > > > > > >
> > > > > > > > ---------------- Performance Testing ------------
> > > > > > > >
> > > > > > > > The test environment is Aliyun ECS server.
> > > > > > > > Test cmd:
> > > > > > > > ```
> > > > > > > > xdpsock -i eth0 -t  -S -s <msg size>
> > > > > > > > ```
> > > > > > > >
> > > > > > > > Test result data:
> > > > > > > >
> > > > > > > > size    64      512     1024    1500
> > > > > > > > copy    1916747 1775988 1600203 1440054
> > > > > > > > page    1974058 1953655 1945463 1904478
> > > > > > > > percent 3.0%    10.0%   21.58%  32.3%
> > > > > > > >
> > > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > > > > > > > ---
> > > > > > > >  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> > > > > > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > > > > index 4a83117..38af7f1 100644
> > > > > > > > --- a/net/xdp/xsk.c
> > > > > > > > +++ b/net/xdp/xsk.c
> > > > > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > > > > > >  	sock_wfree(skb);
> > > > > > > >  }
> > > > > > > >
> > > > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > > > > > +					      struct xdp_desc *desc)
> > > > > > > > +{
> > > > > > > > +	u32 len, offset, copy, copied;
> > > > > > > > +	struct sk_buff *skb;
> > > > > > > > +	struct page *page;
> > > > > > > > +	void *buffer;
> > > > > > > > +	int err, i;
> > > > > > > > +	u64 addr;
> > > > > > > > +
> > > > > > > > +	skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > > >
> > > > Also,
> > > > maybe we should allocate it with NET_SKB_PAD so NIC drivers could
> > > > use some reserved space?
> > > >
> > > > 		skb = sock_alloc_send_skb(&xs->sk, NET_SKB_PAD, 1, &err);
> > > > 		...
> > > > 		skb_reserve(skb, NET_SKB_PAD);
> > > >
> > > > Eric, what do you think?
> > >
> > > I think you are right. Some space should be reserved for such devices. This
> > > space should also be added in the copy mode below. Is LL_RESERVED_SPACE more
> > > appropriate?
> >
> > No. If you look at __netdev_alloc_skb() and __napi_alloc_skb(), they
> > reserve NET_SKB_PAD at the beginning of the linear area. Documentation of
> > __build_skb() also says that the driver should reserve NET_SKB_PAD before
> > the actual frame, so it is a standardized hardware-independent
> > headroom.
> 
> I understand that these scenarios are for the receive path, where the
> extra space is used by the protocol stack, especially RPS. I don't know
> whether this also applies to the send path?
> 
> > Leaving that space in skb->head will allow developers to implement
> > IFF_TX_SKB_NO_LINEAR in a wider variety of drivers, especially when
> > a driver has to prepend some sort of data before the actual frame.
> > Since it's usually one cacheline in size, it shouldn't be a big
> > deal.
> >
> 
> I agree with this. Some network cards require some space. For example,
> virtio-net needs to prepend a virtio_net_hdr_mrg_rxbuf before skb->data,
> which is where my original understanding comes from. When we hand the skb
> to the driver, the driver may need some memory space, so, referring to the
> implementation of __ip_append_data, I felt that adding LL_RESERVED_SPACE
> was a suitable solution.
>
> I may still not understand the use case you mentioned. Can you elaborate
> on what this space will be used for?

LL_RESERVED_SPACE() consists of the L2 header size (Ethernet in most
cases) and dev->needed_headroom. That is not a value to count on, as:
 - L2 header is already here in XSK buffer;
 - not all drivers set dev->needed_headroom;
 - it's aligned by 16, not L1_CACHE_SIZE.

As this path is XSK generic path, i.e. when driver-side XSK is not
present or not requested, it can be applied to every driver. Many
of them call skb_cow_head() + skb_push() on their xmit path:
 - nearly all virtual drivers (to insert their specific headers);
 - nearly all switch drivers (to insert switch CPU port tags);
 - some enterprise NIC drivers (Chelsio for LSO, Netronome
   for TLS etc.).

skb_cow_head() + skb_push() rely on the NET_SKB_PAD headroom being
present. In cases where there is not enough space (and you allocate an
skb with no headroom at all), the skb will be COWed, which is a huge
overhead and will cause slowdowns.
So, adding NET_SKB_PAD would save us from almost all, if not all, such
reallocations.
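
For reference, the pattern I mean looks roughly like this (a generic
sketch, not any particular driver; hdr_len here is a placeholder for
whatever the driver needs to prepend):

	/* typical ndo_start_xmit() prologue in such drivers */
	if (skb_cow_head(skb, hdr_len))	/* reallocates the head only when
					 * headroom < hdr_len */
		goto drop;
	hdr = skb_push(skb, hdr_len);	/* a cheap pointer bump when the
					 * NET_SKB_PAD headroom is present */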

> Thanks.
> 
> >
> > [ I also had an idea of allocating an skb with a headroom of
> > NET_SKB_PAD + 256 bytes, so nearly all drivers could just call
> > pskb_pull_tail() to support such type of skbuffs without much
> > effort, but I think that it's better to teach drivers to support
> > xmitting of really headless ones. If virtio_net can do it, why
> > shouldn't the others ]
> >
> > > > > > > > +	if (unlikely(!skb))
> > > > > > > > +		return ERR_PTR(err);
> > > > > > > > +
> > > > > > > > +	addr = desc->addr;
> > > > > > > > +	len = desc->len;
> > > > > > > > +
> > > > > > > > +	buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > > > > > > +	offset = offset_in_page(buffer);
> > > > > > > > +	addr = buffer - xs->pool->addrs;
> > > > > > > > +
> > > > > > > > +	for (copied = 0, i = 0; copied < len; i++) {
> > > > > > > > +		page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > > > > > > +
> > > > > > > > +		get_page(page);
> > > > > > > > +
> > > > > > > > +		copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > > > > > > +
> > > > > > > > +		skb_fill_page_desc(skb, i, page, offset, copy);
> > > > > > > > +
> > > > > > > > +		copied += copy;
> > > > > > > > +		addr += copy;
> > > > > > > > +		offset = 0;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	skb->len += len;
> > > > > > > > +	skb->data_len += len;
> > > > > > >
> > > > > > > > +	skb->truesize += len;
> > > > > > >
> > > > > > > This is not the truesize, unfortunately.
> > > > > > >
> > > > > > > We need to account for the number of pages, not number of bytes.
> > > > > >
> > > > > > The easiest solution is:
> > > > > >
> > > > > > 	skb->truesize += PAGE_SIZE * i;
> > > > > >
> > > > > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.
> > > > >
> > > > > Oops, pls ignore this. I forgot that XSK buffers are not
> > > > > "one per page".
> > > > > We need to count the number of pages manually and then do
> > > > >
> > > > > 	skb->truesize += PAGE_SIZE * npages;
> > > > >
> > > > > Right.
> > > > >
> > > > > > > > +
> > > > > > > > +	refcount_add(len, &xs->sk.sk_wmem_alloc);
> > > > > > > > +
> > > > > > > > +	return skb;
> > > > > > > > +}
> > > > > > > > +
> > > > > >
> > > > > > Al
> > > > >
> > > > > Thanks,
> > > > > Al
> > > >
> > > > Al

Thanks,
Al


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
       [not found]         ` <1611578136.5043845-1-xuanzhuo@linux.alibaba.com>
@ 2021-01-25 13:16           ` Magnus Karlsson
       [not found]             ` <1611587242.9653594-2-xuanzhuo@linux.alibaba.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Magnus Karlsson @ 2021-01-25 13:16 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Eric Dumazet, Michael S . Tsirkin, Jason Wang, David S . Miller,
	Jakub Kicinski, Bjorn Topel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf,
	Network Development, open list, Alexander Lobakin

On Mon, Jan 25, 2021 at 1:43 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Mon, 25 Jan 2021 08:44:38 +0100, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > On Mon, Jan 25, 2021 at 3:27 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > >
> > > On Fri, 22 Jan 2021 19:37:06 +0100, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > > > On Fri, Jan 22, 2021 at 6:26 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > > >
> > > > > From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > Date: Fri, 22 Jan 2021 23:39:15 +0800
> > > > >
> > > > > > On Fri, 22 Jan 2021 13:55:14 +0100, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > > > > > > On Fri, Jan 22, 2021 at 1:39 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > > > > > >
> > > > > > > > From: Magnus Karlsson <magnus.karlsson@gmail.com>
> > > > > > > > Date: Fri, 22 Jan 2021 13:18:47 +0100
> > > > > > > >
> > > > > > > > > On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > > > > > > > >
> > > > > > > > > > From: Alexander Lobakin <alobakin@pm.me>
> > > > > > > > > > Date: Fri, 22 Jan 2021 11:47:45 +0000
> > > > > > > > > >
> > > > > > > > > > > From: Eric Dumazet <eric.dumazet@gmail.com>
> > > > > > > > > > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > > > > > > > > > >
> > > > > > > > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > > > > > > > > > This patch is used to construct skb based on page to save memory copy
> > > > > > > > > > > > > overhead.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
> > > > > > > > > > > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to
> > > > > > > > > > > > > directly construct skb. If this feature is not supported, it is still
> > > > > > > > > > > > > necessary to copy data to construct skb.
> > > > > > > > > > > > >
> > > > > > > > > > > > > ---------------- Performance Testing ------------
> > > > > > > > > > > > >
> > > > > > > > > > > > > The test environment is Aliyun ECS server.
> > > > > > > > > > > > > Test cmd:
> > > > > > > > > > > > > ```
> > > > > > > > > > > > > xdpsock -i eth0 -t  -S -s <msg size>
> > > > > > > > > > > > > ```
> > > > > > > > > > > > >
> > > > > > > > > > > > > Test result data:
> > > > > > > > > > > > >
> > > > > > > > > > > > > size    64      512     1024    1500
> > > > > > > > > > > > > copy    1916747 1775988 1600203 1440054
> > > > > > > > > > > > > page    1974058 1953655 1945463 1904478
> > > > > > > > > > > > > percent 3.0%    10.0%   21.58%  32.3%
> > > > > > > > > > > > >
> > > > > > > > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > > > > > > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> > > > > > > > > > > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > > > > > > > > > >
> > > > > > > > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > > > > > > > > > index 4a83117..38af7f1 100644
> > > > > > > > > > > > > --- a/net/xdp/xsk.c
> > > > > > > > > > > > > +++ b/net/xdp/xsk.c
> > > > > > > > > > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > > > > > > > > > > >   sock_wfree(skb);
> > > > > > > > > > > > >  }
> > > > > > > > > > > > >
> > > > > > > > > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > > > > > > > > > > +                                       struct xdp_desc *desc)
> > > > > > > > > > > > > +{
> > > > > > > > > > > > > + u32 len, offset, copy, copied;
> > > > > > > > > > > > > + struct sk_buff *skb;
> > > > > > > > > > > > > + struct page *page;
> > > > > > > > > > > > > + void *buffer;
> > > > > > > > > > > > > + int err, i;
> > > > > > > > > > > > > + u64 addr;
> > > > > > > > > > > > > +
> > > > > > > > > > > > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > > > > > > > > > > > > + if (unlikely(!skb))
> > > > > > > > > > > > > +         return ERR_PTR(err);
> > > > > > > > > > > > > +
> > > > > > > > > > > > > + addr = desc->addr;
> > > > > > > > > > > > > + len = desc->len;
> > > > > > > > > > > > > +
> > > > > > > > > > > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > > > > > > > > > > > + offset = offset_in_page(buffer);
> > > > > > > > > > > > > + addr = buffer - xs->pool->addrs;
> > > > > > > > > > > > > +
> > > > > > > > > > > > > + for (copied = 0, i = 0; copied < len; i++) {
> > > > > > > > > > > > > +         page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +         get_page(page);
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +         copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +         skb_fill_page_desc(skb, i, page, offset, copy);
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +         copied += copy;
> > > > > > > > > > > > > +         addr += copy;
> > > > > > > > > > > > > +         offset = 0;
> > > > > > > > > > > > > + }
> > > > > > > > > > > > > +
> > > > > > > > > > > > > + skb->len += len;
> > > > > > > > > > > > > + skb->data_len += len;
> > > > > > > > > > > >
> > > > > > > > > > > > > + skb->truesize += len;
> > > > > > > > > > > >
> > > > > > > > > > > > This is not the truesize, unfortunately.
> > > > > > > > > > > >
> > > > > > > > > > > > We need to account for the number of pages, not number of bytes.
> > > > > > > > > > >
> > > > > > > > > > > The easiest solution is:
> > > > > > > > > > >
> > > > > > > > > > >       skb->truesize += PAGE_SIZE * i;
> > > > > > > > > > >
> > > > > > > > > > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.
> > > > > > > > > >
> > > > > > > > > > Oops, pls ignore this. I forgot that XSK buffers are not
> > > > > > > > > > "one per page".
> > > > > > > > > > We need to count the number of pages manually and then do
> > > > > > > > > >
> > > > > > > > > >         skb->truesize += PAGE_SIZE * npages;
> > > > > > > > > >
> > > > > > > > > > Right.
> > > > > > > > >
> > > > > > > > > There are two possible packet buffer (chunks) sizes in a umem, 2K and
> > > > > > > > > 4K on a system with a PAGE_SIZE of 4K. If I remember correctly, and
> > > > > > > > > please correct me if wrong, truesize is used for memory accounting.
> > > > > > > > > But in this code, no kernel memory has been allocated (apart from the
> > > > > > > > > skb). The page is just a part of the umem that has been already
> > > > > > > > > allocated beforehand and by user-space in this case. So what should
> > > > > > > > > truesize be in this case? Do we add 0, chunk_size * i, or the
> > > > > > > > > complicated case of counting exactly how many 4K pages that are used
> > > > > > > > > when the chunk_size is 2K, as two chunks could occupy the same page,
> > > > > > > > > or just the upper bound of PAGE_SIZE * i that is likely a good
> > > > > > > > > approximation in most cases? Just note that there might be other uses
> > > > > > > > > of truesize that I am unaware of that could impact this choice.
> > > > > > > >
> > > > > > > > Truesize is "what amount of memory does this skb occupy with all its
> > > > > > > > fragments, linear space and struct sk_buff itself". The closer it
> > > > > > > > is to the actual value, the better.
> > > > > > > > In this case, I think adding chunk_size * i would be enough.
> > > > > > >
> > > > > > > Sounds like a good approximation to me.
> > > > > > >
> > > > > > > > (PAGE_SIZE * i can be overwhelming when chunk_size is 2K, especially
> > > > > > > > for setups with PAGE_SIZE > SZ_4K)
> > > > > > >
> > > > > > > You are right. That would be quite horrible on a system with a page size of 64K.
> > > > > >
> > > > > > Thank you everyone, I learned a lot.
> > > > > >
> > > > > > I also think it is appropriate to add the chunk size here, and there is
> > > > > > actually only one chunk here, so it's very simple:
> > > > > >
> > > > > >       skb->truesize += xs->pool->chunk_size;
> > > > >
> > > > > umem chunks can't cross page boundaries. So if you're sure that
> > > > > there could be only one chunk, you don't need the loop at all,
> > > > > if I'm not missing anything.
> > > >
> > > > In the default mode, this is true. But in the unaligned_chunk mode
> > > > that can be set on the umem, the chunk may cross one page boundary, so
> > > > we need the loop and the chunk_size * i in the assignment of truesize.
> > > > So "i" can be 1 or 2, but nothing else.
> > >
> > > According to my understanding, in the unaligned mode a desc still refers
> > > to a chunk, although the chunk here may occupy multiple pages. And since
> > > only one desc is turned into an skb here, the most an skb can take from
> > > the umem is one chunk, so how can i = 2 here? Did I miss something?
> >
> > A desc refers to a single packet, not a single chunk. One packet can
> > occupy either one or two chunks in the unaligned mode and when this
> > happens to be two, these two chunks might be on different pages. In
> > this case, you will go through the loop twice and produce two
> > fragments.
> >
> > In the aligned mode, every packet starts at the beginning of each
> > chunk (if there is no headroom) and can only occupy a single chunk
> > since no packet can be larger than the chunk size. In the unaligned
> > mode, the packet can start anywhere within the chunk and might even
> > continue into the next chunk, but never more than that since the max
> > packet size is still the chunk size. Why unaligned mode? This provides
> > the user-space with complete freedom on how to optimize the placement
> > of the packets in the umem and this might lead to better overall
> > performance. Note that if we just look at the kernel part, unaligned
> > mode is slower than aligned, but that might not be true on the overall
> > level.
>
> Thank you very much for the very detailed explanation.
>
> I have a question: You said that in the case of unaligned mode, the packet
> pointed to by a desc may use two consecutive chunks at the same time. How do we
> ensure that we can apply for two consecutive chunks?

I am not sure I understand "apply for", sorry. But in unaligned mode,
we reject any desc that spans two chunks that are not physically
consecutive in memory. So users of unaligned mode usually use huge
pages to make sure that this happens rarely and that these "breaks"
are well known to the entity in user-space that places the packets in
the umem (i.e., not creating packets that cross huge page boundaries
would suffice).
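
Conceptually, the check amounts to something like this (a simplified
sketch, not the exact in-kernel code):

	/* Accept a desc that runs into the next chunk only if the two
	 * pages backing it are physically contiguous.
	 */
	static bool desc_pages_contiguous(struct xsk_buff_pool *pool,
					  u64 addr, u32 len)
	{
		u64 first = addr >> PAGE_SHIFT;
		u64 last = (addr + len - 1) >> PAGE_SHIFT;

		if (first == last)
			return true;

		return page_to_phys(pool->umem->pgs[first]) + PAGE_SIZE ==
		       page_to_phys(pool->umem->pgs[last]);
	}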

> Back to this patch, is it okay to deal with it this way?
>
>         buffer = xsk_buff_raw_get_data(xs->pool, addr);
>
>         .....
>
>         if (pool->unaligned) {
>                 u64 l;
>
>                 l = (buffer - xs->pool->addrs) % xs->pool->chunk_size;
>                 if (xs->pool->chunk_size - l < desc->len)
>                         skb->truesize += xs->pool->chunk_size * 2;
>                 else
>                         skb->truesize += xs->pool->chunk_size;
>         } else {
>                 skb->truesize += xs->pool->chunk_size;
>         }

Why would not skb->truesize += xs->pool->chunk_size * i; work in both
cases? The copy = min_t(u32, PAGE_SIZE - offset, len - copied);
statement together with the loop itself should take care of the case
when the packet spans two pages: since PAGE_SIZE - offset < len, it will
loop again and copy the rest. In that case, i == 2 when you exit the
loop.
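
For example, with PAGE_SIZE = 4K, a len = 2K packet starting at offset =
3K within its page fills 1K (PAGE_SIZE - offset) on the first iteration
and the remaining 1K on the second, so i == 2 on loop exit and truesize
gets chunk_size * 2.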

But it would be great if you could test that your code can deal with
this. How about extending the xsk selftests with a case for unaligned
mode? Should be quite straightforward. Just make sure you put the
packets next to each other so that some of them will cross page
boundaries.

> Thanks.
>
> >
> > > Thanks.
> > >
> > > >
> > > > > > In addition, I actually borrowed from the tcp code:
> > > > > >
> > > > > >    tcp_build_frag:
> > > > > >    --------------
> > > > > >
> > > > > >       if (can_coalesce) {
> > > > > >               skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
> > > > > >       } else {
> > > > > >               get_page(page);
> > > > > >               skb_fill_page_desc(skb, i, page, offset, copy);
> > > > > >       }
> > > > > >
> > > > > >       if (!(flags & MSG_NO_SHARED_FRAGS))
> > > > > >               skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
> > > > > >
> > > > > >       skb->len += copy;
> > > > > >       skb->data_len += copy;
> > > > > >       skb->truesize += copy;
> > > > > >
> > > > > > So, is there a bug here?
> > > > >
> > > > > skb_frag_t is an alias to struct bio_vec. It doesn't contain info about
> > > > > real memory consumption, so there's no other option but to add "copy"
> > > > > to truesize.
> > > > > XSK is different in this regard, as it operates with chunks of a known
> > > > > size.
> > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > >
> > > > > > > > > > > > > +
> > > > > > > > > > > > > + refcount_add(len, &xs->sk.sk_wmem_alloc);
> > > > > > > > > > > > > +
> > > > > > > > > > > > > + return skb;
> > > > > > > > > > > > > +}
> > > > > > > > > > > > > +
> > > > > > > > > > >
> > > > > > > > > > > Al
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Al
> > > > > > > >
> > > > > > > > Al
> > > > >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
       [not found]     ` <1611541335.3012564-1-xuanzhuo@linux.alibaba.com>
@ 2021-01-25  7:44       ` Magnus Karlsson
       [not found]         ` <1611578136.5043845-1-xuanzhuo@linux.alibaba.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Magnus Karlsson @ 2021-01-25  7:44 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Eric Dumazet, Michael S . Tsirkin, Jason Wang, David S . Miller,
	Jakub Kicinski, Bjorn Topel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf,
	Network Development, open list, Alexander Lobakin

On Mon, Jan 25, 2021 at 3:27 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Fri, 22 Jan 2021 19:37:06 +0100, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > On Fri, Jan 22, 2021 at 6:26 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > >
> > > From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > Date: Fri, 22 Jan 2021 23:39:15 +0800
> > >
> > > > On Fri, 22 Jan 2021 13:55:14 +0100, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > > > > On Fri, Jan 22, 2021 at 1:39 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > > > >
> > > > > > From: Magnus Karlsson <magnus.karlsson@gmail.com>
> > > > > > Date: Fri, 22 Jan 2021 13:18:47 +0100
> > > > > >
> > > > > > > On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > > > > > >
> > > > > > > > From: Alexander Lobakin <alobakin@pm.me>
> > > > > > > > Date: Fri, 22 Jan 2021 11:47:45 +0000
> > > > > > > >
> > > > > > > > > From: Eric Dumazet <eric.dumazet@gmail.com>
> > > > > > > > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > > > > > > > >
> > > > > > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > > > > > > > This patch is used to construct skb based on page to save memory copy
> > > > > > > > > > > overhead.
> > > > > > > > > > >
> > > > > > > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
> > > > > > > > > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to
> > > > > > > > > > > directly construct skb. If this feature is not supported, it is still
> > > > > > > > > > > necessary to copy data to construct skb.
> > > > > > > > > > >
> > > > > > > > > > > ---------------- Performance Testing ------------
> > > > > > > > > > >
> > > > > > > > > > > The test environment is Aliyun ECS server.
> > > > > > > > > > > Test cmd:
> > > > > > > > > > > ```
> > > > > > > > > > > xdpsock -i eth0 -t  -S -s <msg size>
> > > > > > > > > > > ```
> > > > > > > > > > >
> > > > > > > > > > > Test result data:
> > > > > > > > > > >
> > > > > > > > > > > size    64      512     1024    1500
> > > > > > > > > > > copy    1916747 1775988 1600203 1440054
> > > > > > > > > > > page    1974058 1953655 1945463 1904478
> > > > > > > > > > > percent 3.0%    10.0%   21.58%  32.3%
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > > > > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> > > > > > > > > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > > > > > > > index 4a83117..38af7f1 100644
> > > > > > > > > > > --- a/net/xdp/xsk.c
> > > > > > > > > > > +++ b/net/xdp/xsk.c
> > > > > > > > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > > > > > > > > >   sock_wfree(skb);
> > > > > > > > > > >  }
> > > > > > > > > > >
> > > > > > > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > > > > > > > > +                                       struct xdp_desc *desc)
> > > > > > > > > > > +{
> > > > > > > > > > > + u32 len, offset, copy, copied;
> > > > > > > > > > > + struct sk_buff *skb;
> > > > > > > > > > > + struct page *page;
> > > > > > > > > > > + void *buffer;
> > > > > > > > > > > + int err, i;
> > > > > > > > > > > + u64 addr;
> > > > > > > > > > > +
> > > > > > > > > > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > > > > > > > > > > + if (unlikely(!skb))
> > > > > > > > > > > +         return ERR_PTR(err);
> > > > > > > > > > > +
> > > > > > > > > > > + addr = desc->addr;
> > > > > > > > > > > + len = desc->len;
> > > > > > > > > > > +
> > > > > > > > > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > > > > > > > > > + offset = offset_in_page(buffer);
> > > > > > > > > > > + addr = buffer - xs->pool->addrs;
> > > > > > > > > > > +
> > > > > > > > > > > + for (copied = 0, i = 0; copied < len; i++) {
> > > > > > > > > > > +         page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > > > > > > > > > +
> > > > > > > > > > > +         get_page(page);
> > > > > > > > > > > +
> > > > > > > > > > > +         copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > > > > > > > > > +
> > > > > > > > > > > +         skb_fill_page_desc(skb, i, page, offset, copy);
> > > > > > > > > > > +
> > > > > > > > > > > +         copied += copy;
> > > > > > > > > > > +         addr += copy;
> > > > > > > > > > > +         offset = 0;
> > > > > > > > > > > + }
> > > > > > > > > > > +
> > > > > > > > > > > + skb->len += len;
> > > > > > > > > > > + skb->data_len += len;
> > > > > > > > > >
> > > > > > > > > > > + skb->truesize += len;
> > > > > > > > > >
> > > > > > > > > > This is not the truesize, unfortunately.
> > > > > > > > > >
> > > > > > > > > > We need to account for the number of pages, not number of bytes.
> > > > > > > > >
> > > > > > > > > The easiest solution is:
> > > > > > > > >
> > > > > > > > >       skb->truesize += PAGE_SIZE * i;
> > > > > > > > >
> > > > > > > > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.
> > > > > > > >
> > > > > > > > Oops, pls ignore this. I forgot that XSK buffers are not
> > > > > > > > "one per page".
> > > > > > > > We need to count the number of pages manually and then do
> > > > > > > >
> > > > > > > >         skb->truesize += PAGE_SIZE * npages;
> > > > > > > >
> > > > > > > > Right.
> > > > > > >
> > > > > > > There are two possible packet buffer (chunks) sizes in a umem, 2K and
> > > > > > > 4K on a system with a PAGE_SIZE of 4K. If I remember correctly, and
> > > > > > > please correct me if wrong, truesize is used for memory accounting.
> > > > > > > But in this code, no kernel memory has been allocated (apart from the
> > > > > > > skb). The page is just a part of the umem that has been already
> > > > > > > allocated beforehand and by user-space in this case. So what should
> > > > > > > truesize be in this case? Do we add 0, chunk_size * i, or the
> > > > > > > complicated case of counting exactly how many 4K pages that are used
> > > > > > > when the chunk_size is 2K, as two chunks could occupy the same page,
> > > > > > > or just the upper bound of PAGE_SIZE * i that is likely a good
> > > > > > > approximation in most cases? Just note that there might be other uses
> > > > > > > of truesize that I am unaware of that could impact this choice.
> > > > > >
> > > > > > Truesize is "what amount of memory does this skb occupy with all its
> > > > > > fragments, linear space and struct sk_buff itself". The closer it
> > > > > > is to the actual value, the better.
> > > > > > In this case, I think adding chunk_size * i would be enough.
> > > > >
> > > > > Sounds like a good approximation to me.
> > > > >
> > > > > > (PAGE_SIZE * i can be overwhelming when chunk_size is 2K, especially
> > > > > > for setups with PAGE_SIZE > SZ_4K)
> > > > >
> > > > > You are right. That would be quite horrible on a system with a page size of 64K.
> > > >
> > > > Thank you everyone, I have learned from this.
> > > >
> > > > I also think it is appropriate to add the chunk size here, and there is
> > > > actually only one chunk here, so it's very simple:
> > > >
> > > >       skb->truesize += xs->pool->chunk_size;
> > >
> > > umem chunks can't cross page boundaries. So if you're sure that
> > > there could be only one chunk, you don't need the loop at all,
> > > if I'm not missing anything.
> >
> > In the default mode, this is true. But in the unaligned_chunk mode
> > that can be set on the umem, the chunk may cross one page boundary, so
> > we need the loop and the chunk_size * i in the assignment of truesize.
> > So "i" can be 1 or 2, but nothing else.
>
> As I understand it, in the unaligned mode a desc still refers to a chunk,
> although the chunk here may occupy multiple pages. And here just one desc is
> constructed into one skb, and the most an skb can take from the umem is one
> chunk, so in what situation would i = 2 here? Did I miss something?

A desc refers to a single packet, not a single chunk. One packet can
occupy either one or two chunks in the unaligned mode and when this
happens to be two, these two chunks might be on different pages. In
this case, you will go through the loop twice and produce two
fragments.

In the aligned mode, every packet starts at the beginning of each
chunk (if there is no headroom) and can only occupy a single chunk
since no packet can be larger than the chunk size. In the unaligned
mode, the packet can start anywhere within the chunk and might even
continue into the next chunk, but never more than that since the max
packet size is still the chunk size. Why unaligned mode? It gives
user-space complete freedom in how to optimize the placement of
packets in the umem, and this might lead to better overall
performance. Note that if we just look at the kernel part, unaligned
mode is slower than aligned, but that might not be true at the
system level.
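
To make that concrete: a minimal sketch (a hypothetical helper, not part
of the patch) of how many pages one descriptor's payload touches, reusing
the pool accessors from the patch above. In aligned mode it always
evaluates to 1; in unaligned mode it can be 1 or 2.

	/* Sketch only: pages spanned by one descriptor's payload. */
	static u32 xsk_desc_npages(struct xdp_sock *xs, struct xdp_desc *desc)
	{
		void *buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);

		/* bytes before the first page boundary plus the payload,
		 * rounded up to whole pages
		 */
		return DIV_ROUND_UP(offset_in_page(buffer) + desc->len,
				    PAGE_SIZE);
	}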

> Thanks.
>
> >
> > > > In addition, I actually borrowed from the tcp code:
> > > >
> > > >    tcp_build_frag:
> > > >    --------------
> > > >
> > > >       if (can_coalesce) {
> > > >               skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
> > > >       } else {
> > > >               get_page(page);
> > > >               skb_fill_page_desc(skb, i, page, offset, copy);
> > > >       }
> > > >
> > > >       if (!(flags & MSG_NO_SHARED_FRAGS))
> > > >               skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
> > > >
> > > >       skb->len += copy;
> > > >       skb->data_len += copy;
> > > >       skb->truesize += copy;
> > > >
> > > > So, is there a bug here?
> > >
> > > skb_frag_t is an alias for struct bio_vec. It doesn't contain info about
> > > real memory consumption, so there's no other option but to add
> > > "copy" to truesize.
> > > XSK is different in this regard, as it operates on chunks of a known
> > > size.
> > >
> > > > Thanks.
> > > >
> > > > >
> > > > > > > > > > > +
> > > > > > > > > > > + refcount_add(len, &xs->sk.sk_wmem_alloc);
> > > > > > > > > > > +
> > > > > > > > > > > + return skb;
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > >
> > > > > > > > > Al
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Al
> > > > > >
> > > > > > Al
> > >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
  2021-01-22 18:37   ` Magnus Karlsson
@ 2021-01-22 18:45     ` Alexander Lobakin
       [not found]     ` <1611541335.3012564-1-xuanzhuo@linux.alibaba.com>
  1 sibling, 0 replies; 22+ messages in thread
From: Alexander Lobakin @ 2021-01-22 18:45 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Alexander Lobakin, Xuan Zhuo, Eric Dumazet, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Jakub Kicinski, Bjorn Topel,
	Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, virtualization, bpf, Network Development, open list

From: Magnus Karlsson <magnus.karlsson@gmail.com>
Date: Fri, 22 Jan 2021 19:37:06 +0100

> On Fri, Jan 22, 2021 at 6:26 PM Alexander Lobakin <alobakin@pm.me> wrote:
> >
> > From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > Date: Fri, 22 Jan 2021 23:39:15 +0800
> >
> > > On Fri, 22 Jan 2021 13:55:14 +0100, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > > > On Fri, Jan 22, 2021 at 1:39 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > > >
> > > > > From: Magnus Karlsson <magnus.karlsson@gmail.com>
> > > > > Date: Fri, 22 Jan 2021 13:18:47 +0100
> > > > >
> > > > > > On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > > > > >
> > > > > > > From: Alexander Lobakin <alobakin@pm.me>
> > > > > > > Date: Fri, 22 Jan 2021 11:47:45 +0000
> > > > > > >
> > > > > > > > From: Eric Dumazet <eric.dumazet@gmail.com>
> > > > > > > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > > > > > > >
> > > > > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > > > > > > This patch is used to construct skb based on page to save memory copy
> > > > > > > > > > overhead.
> > > > > > > > > >
> > > > > > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only when
> > > > > > > > > > the network card's priv_flags include IFF_TX_SKB_NO_LINEAR will pages
> > > > > > > > > > be used to construct the skb directly. If this flag is not supported,
> > > > > > > > > > it is still necessary to copy the data to construct the skb.
> > > > > > > > > >
> > > > > > > > > > ---------------- Performance Testing ------------
> > > > > > > > > >
> > > > > > > > > > The test environment is Aliyun ECS server.
> > > > > > > > > > Test cmd:
> > > > > > > > > > ```
> > > > > > > > > > xdpsock -i eth0 -t  -S -s <msg size>
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > Test result data:
> > > > > > > > > >
> > > > > > > > > > size    64      512     1024    1500
> > > > > > > > > > copy    1916747 1775988 1600203 1440054
> > > > > > > > > > page    1974058 1953655 1945463 1904478
> > > > > > > > > > percent 3.0%    10.0%   21.58%  32.3%
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > > > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > > > > > > > > > ---
> > > > > > > > > >  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> > > > > > > > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > > > > > > index 4a83117..38af7f1 100644
> > > > > > > > > > --- a/net/xdp/xsk.c
> > > > > > > > > > +++ b/net/xdp/xsk.c
> > > > > > > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > > > > > > > >   sock_wfree(skb);
> > > > > > > > > >  }
> > > > > > > > > >
> > > > > > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > > > > > > > +                                       struct xdp_desc *desc)
> > > > > > > > > > +{
> > > > > > > > > > + u32 len, offset, copy, copied;
> > > > > > > > > > + struct sk_buff *skb;
> > > > > > > > > > + struct page *page;
> > > > > > > > > > + void *buffer;
> > > > > > > > > > + int err, i;
> > > > > > > > > > + u64 addr;
> > > > > > > > > > +
> > > > > > > > > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > > > > > > > > > + if (unlikely(!skb))
> > > > > > > > > > +         return ERR_PTR(err);
> > > > > > > > > > +
> > > > > > > > > > + addr = desc->addr;
> > > > > > > > > > + len = desc->len;
> > > > > > > > > > +
> > > > > > > > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > > > > > > > > + offset = offset_in_page(buffer);
> > > > > > > > > > + addr = buffer - xs->pool->addrs;
> > > > > > > > > > +
> > > > > > > > > > + for (copied = 0, i = 0; copied < len; i++) {
> > > > > > > > > > +         page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > > > > > > > > +
> > > > > > > > > > +         get_page(page);
> > > > > > > > > > +
> > > > > > > > > > +         copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > > > > > > > > +
> > > > > > > > > > +         skb_fill_page_desc(skb, i, page, offset, copy);
> > > > > > > > > > +
> > > > > > > > > > +         copied += copy;
> > > > > > > > > > +         addr += copy;
> > > > > > > > > > +         offset = 0;
> > > > > > > > > > + }
> > > > > > > > > > +
> > > > > > > > > > + skb->len += len;
> > > > > > > > > > + skb->data_len += len;
> > > > > > > > >
> > > > > > > > > > + skb->truesize += len;
> > > > > > > > >
> > > > > > > > > This is not the truesize, unfortunately.
> > > > > > > > >
> > > > > > > > > We need to account for the number of pages, not number of bytes.
> > > > > > > >
> > > > > > > > The easiest solution is:
> > > > > > > >
> > > > > > > >       skb->truesize += PAGE_SIZE * i;
> > > > > > > >
> > > > > > > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.
> > > > > > >
> > > > > > > Oops, pls ignore this. I forgot that XSK buffers are not
> > > > > > > "one per page".
> > > > > > > We need to count the number of pages manually and then do
> > > > > > >
> > > > > > >         skb->truesize += PAGE_SIZE * npages;
> > > > > > >
> > > > > > > Right.
> > > > > >
> > > > > > There are two possible packet buffer (chunk) sizes in a umem, 2K and
> > > > > > 4K on a system with a PAGE_SIZE of 4K. If I remember correctly, and
> > > > > > please correct me if I am wrong, truesize is used for memory accounting.
> > > > > > But in this code, no kernel memory has been allocated (apart from the
> > > > > > skb). The page is just a part of the umem that has already been
> > > > > > allocated beforehand, by user-space in this case. So what should
> > > > > > truesize be in this case? Do we add 0, chunk_size * i, the
> > > > > > complicated case of counting exactly how many 4K pages are used
> > > > > > when the chunk_size is 2K (as two chunks could occupy the same page),
> > > > > > or just the upper bound of PAGE_SIZE * i, which is likely a good
> > > > > > approximation in most cases? Just note that there might be other uses
> > > > > > of truesize that I am unaware of that could impact this choice.
> > > > >
> > > > > Truesize is "what amount of memory does this skb occupy with all its
> > > > > fragments, linear space and struct sk_buff itself". The closer it
> > > > > is to the actual value, the better.
> > > > > In this case, I think adding chunk_size * i would be enough.
> > > >
> > > > Sounds like a good approximation to me.
> > > >
> > > > > (PAGE_SIZE * i can be overwhelming when chunk_size is 2K, especially
> > > > > for setups with PAGE_SIZE > SZ_4K)
> > > >
> > > > You are right. That would be quite horrible on a system with a page size of 64K.
> > >
> > > Thank you everyone, I have learned from this.
> > >
> > > I also think it is appropriate to add the chunk size here, and there is
> > > actually only one chunk here, so it's very simple:
> > >
> > >       skb->truesize += xs->pool->chunk_size;
> >
> > umem chunks can't cross page boundaries. So if you're sure that
> > there could be only one chunk, you don't need the loop at all,
> > if I'm not missing anything.
> 
> In the default mode, this is true. But in the unaligned_chunk mode
> that can be set on the umem, the chunk may cross one page boundary, so
> we need the loop and the chunk_size * i in the assignment of truesize.
> So "i" can be 1 or 2, but nothing else.

Ah, unaligned mode, sure. Thanks for the explanation!
Right, we need the loop and skb->truesize += xs->pool->chunk_size * i
then.
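
For concreteness, a sketch of how the tail of the loop could then look
(illustrative only, reusing the variables from the patch):

	for (copied = 0, i = 0; copied < len; i++) {
		page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
		get_page(page);
		copy = min_t(u32, PAGE_SIZE - offset, len - copied);
		skb_fill_page_desc(skb, i, page, offset, copy);
		copied += copy;
		addr += copy;
		offset = 0;
	}

	skb->len += len;
	skb->data_len += len;
	/* i is 1 in aligned mode, 1 or 2 in unaligned mode: account one
	 * whole chunk per fragment rather than per byte
	 */
	skb->truesize += xs->pool->chunk_size * i;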

> > > In addition, I actually borrowed from the tcp code:
> > >
> > >    tcp_build_frag:
> > >    --------------
> > >
> > >       if (can_coalesce) {
> > >               skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
> > >       } else {
> > >               get_page(page);
> > >               skb_fill_page_desc(skb, i, page, offset, copy);
> > >       }
> > >
> > >       if (!(flags & MSG_NO_SHARED_FRAGS))
> > >               skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
> > >
> > >       skb->len += copy;
> > >       skb->data_len += copy;
> > >       skb->truesize += copy;
> > >
> > > So, is there a bug here?
> >
> > skb_frag_t is an alias for struct bio_vec. It doesn't contain info about
> > real memory consumption, so there's no other option but to add
> > "copy" to truesize.
> > XSK is different in this regard, as it operates on chunks of a known
> > size.
> >
> > > Thanks.
> > >
> > > >
> > > > > > > > > > +
> > > > > > > > > > + refcount_add(len, &xs->sk.sk_wmem_alloc);
> > > > > > > > > > +
> > > > > > > > > > + return skb;
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > >
> > > > > > > > Al
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Al
> > > > >
> > > > > Al


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
  2021-01-22 17:26 ` Alexander Lobakin
@ 2021-01-22 18:37   ` Magnus Karlsson
  2021-01-22 18:45     ` Alexander Lobakin
       [not found]     ` <1611541335.3012564-1-xuanzhuo@linux.alibaba.com>
  0 siblings, 2 replies; 22+ messages in thread
From: Magnus Karlsson @ 2021-01-22 18:37 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Xuan Zhuo, Eric Dumazet, Michael S . Tsirkin, Jason Wang,
	David S . Miller, Jakub Kicinski, Bjorn Topel, Magnus Karlsson,
	Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	virtualization, bpf, Network Development, open list

On Fri, Jan 22, 2021 at 6:26 PM Alexander Lobakin <alobakin@pm.me> wrote:
>
> From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Date: Fri, 22 Jan 2021 23:39:15 +0800
>
> > On Fri, 22 Jan 2021 13:55:14 +0100, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > > On Fri, Jan 22, 2021 at 1:39 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > >
> > > > From: Magnus Karlsson <magnus.karlsson@gmail.com>
> > > > Date: Fri, 22 Jan 2021 13:18:47 +0100
> > > >
> > > > > On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > > > >
> > > > > > From: Alexander Lobakin <alobakin@pm.me>
> > > > > > Date: Fri, 22 Jan 2021 11:47:45 +0000
> > > > > >
> > > > > > > From: Eric Dumazet <eric.dumazet@gmail.com>
> > > > > > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > > > > > >
> > > > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > > > > > This patch is used to construct skb based on page to save memory copy
> > > > > > > > > overhead.
> > > > > > > > >
> > > > > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only when
> > > > > > > > > the network card's priv_flags include IFF_TX_SKB_NO_LINEAR will pages
> > > > > > > > > be used to construct the skb directly. If this flag is not supported,
> > > > > > > > > it is still necessary to copy the data to construct the skb.
> > > > > > > > >
> > > > > > > > > ---------------- Performance Testing ------------
> > > > > > > > >
> > > > > > > > > The test environment is Aliyun ECS server.
> > > > > > > > > Test cmd:
> > > > > > > > > ```
> > > > > > > > > xdpsock -i eth0 -t  -S -s <msg size>
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > Test result data:
> > > > > > > > >
> > > > > > > > > size    64      512     1024    1500
> > > > > > > > > copy    1916747 1775988 1600203 1440054
> > > > > > > > > page    1974058 1953655 1945463 1904478
> > > > > > > > > percent 3.0%    10.0%   21.58%  32.3%
> > > > > > > > >
> > > > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > > > > > > > > ---
> > > > > > > > >  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> > > > > > > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > > > > > index 4a83117..38af7f1 100644
> > > > > > > > > --- a/net/xdp/xsk.c
> > > > > > > > > +++ b/net/xdp/xsk.c
> > > > > > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > > > > > > >   sock_wfree(skb);
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > > > > > > +                                       struct xdp_desc *desc)
> > > > > > > > > +{
> > > > > > > > > + u32 len, offset, copy, copied;
> > > > > > > > > + struct sk_buff *skb;
> > > > > > > > > + struct page *page;
> > > > > > > > > + void *buffer;
> > > > > > > > > + int err, i;
> > > > > > > > > + u64 addr;
> > > > > > > > > +
> > > > > > > > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > > > > > > > > + if (unlikely(!skb))
> > > > > > > > > +         return ERR_PTR(err);
> > > > > > > > > +
> > > > > > > > > + addr = desc->addr;
> > > > > > > > > + len = desc->len;
> > > > > > > > > +
> > > > > > > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > > > > > > > + offset = offset_in_page(buffer);
> > > > > > > > > + addr = buffer - xs->pool->addrs;
> > > > > > > > > +
> > > > > > > > > + for (copied = 0, i = 0; copied < len; i++) {
> > > > > > > > > +         page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > > > > > > > +
> > > > > > > > > +         get_page(page);
> > > > > > > > > +
> > > > > > > > > +         copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > > > > > > > +
> > > > > > > > > +         skb_fill_page_desc(skb, i, page, offset, copy);
> > > > > > > > > +
> > > > > > > > > +         copied += copy;
> > > > > > > > > +         addr += copy;
> > > > > > > > > +         offset = 0;
> > > > > > > > > + }
> > > > > > > > > +
> > > > > > > > > + skb->len += len;
> > > > > > > > > + skb->data_len += len;
> > > > > > > >
> > > > > > > > > + skb->truesize += len;
> > > > > > > >
> > > > > > > > This is not the truesize, unfortunately.
> > > > > > > >
> > > > > > > > We need to account for the number of pages, not number of bytes.
> > > > > > >
> > > > > > > The easiest solution is:
> > > > > > >
> > > > > > >       skb->truesize += PAGE_SIZE * i;
> > > > > > >
> > > > > > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.
> > > > > >
> > > > > > Oops, pls ignore this. I forgot that XSK buffers are not
> > > > > > "one per page".
> > > > > > We need to count the number of pages manually and then do
> > > > > >
> > > > > >         skb->truesize += PAGE_SIZE * npages;
> > > > > >
> > > > > > Right.
> > > > >
> > > > > There are two possible packet buffer (chunk) sizes in a umem, 2K and
> > > > > 4K on a system with a PAGE_SIZE of 4K. If I remember correctly, and
> > > > > please correct me if I am wrong, truesize is used for memory accounting.
> > > > > But in this code, no kernel memory has been allocated (apart from the
> > > > > skb). The page is just a part of the umem that has already been
> > > > > allocated beforehand, by user-space in this case. So what should
> > > > > truesize be in this case? Do we add 0, chunk_size * i, the
> > > > > complicated case of counting exactly how many 4K pages are used
> > > > > when the chunk_size is 2K (as two chunks could occupy the same page),
> > > > > or just the upper bound of PAGE_SIZE * i, which is likely a good
> > > > > approximation in most cases? Just note that there might be other uses
> > > > > of truesize that I am unaware of that could impact this choice.
> > > >
> > > > Truesize is "what amount of memory does this skb occupy with all its
> > > > fragments, linear space and struct sk_buff itself". The closer it
> > > > is to the actual value, the better.
> > > > In this case, I think adding chunk_size * i would be enough.
> > >
> > > Sounds like a good approximation to me.
> > >
> > > > (PAGE_SIZE * i can be overwhelming when chunk_size is 2K, especially
> > > > for setups with PAGE_SIZE > SZ_4K)
> > >
> > > You are right. That would be quite horrible on a system with a page size of 64K.
> >
> > Thank you everyone, I have learned from this.
> >
> > I also think it is appropriate to add the chunk size here, and there is
> > actually only one chunk here, so it's very simple:
> >
> >       skb->truesize += xs->pool->chunk_size;
>
> umem chunks can't cross page boundaries. So if you're sure that
> there could be only one chunk, you don't need the loop at all,
> if I'm not missing anything.

In the default mode, this is true. But in the unaligned_chunk mode
that can be set on the umem, the chunk may cross one page boundary, so
we need the loop and the chunk_size * i in the assignment of truesize.
So "i" can be 1 or 2, but nothing else.

> > In addition, I actually borrowed from the tcp code:
> >
> >    tcp_build_frag:
> >    --------------
> >
> >       if (can_coalesce) {
> >               skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
> >       } else {
> >               get_page(page);
> >               skb_fill_page_desc(skb, i, page, offset, copy);
> >       }
> >
> >       if (!(flags & MSG_NO_SHARED_FRAGS))
> >               skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
> >
> >       skb->len += copy;
> >       skb->data_len += copy;
> >       skb->truesize += copy;
> >
> > So, is there a bug here?
>
> skb_frag_t is an alias for struct bio_vec. It doesn't contain info about
> real memory consumption, so there's no other option but to add
> "copy" to truesize.
> XSK is different in this regard, as it operates on chunks of a known
> size.
>
> > Thanks.
> >
> > >
> > > > > > > > > +
> > > > > > > > > + refcount_add(len, &xs->sk.sk_wmem_alloc);
> > > > > > > > > +
> > > > > > > > > + return skb;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > >
> > > > > > > Al
> > > > > >
> > > > > > Thanks,
> > > > > > Al
> > > >
> > > > Al
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
       [not found] <1611329955.4913929-2-xuanzhuo@linux.alibaba.com>
@ 2021-01-22 17:26 ` Alexander Lobakin
  2021-01-22 18:37   ` Magnus Karlsson
  0 siblings, 1 reply; 22+ messages in thread
From: Alexander Lobakin @ 2021-01-22 17:26 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Alexander Lobakin, Magnus Karlsson, Eric Dumazet,
	Michael S . Tsirkin, Jason Wang, David S . Miller,
	Jakub Kicinski, Bjorn Topel, Magnus Karlsson, Jonathan Lemon,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, virtualization, bpf,
	Network Development, open list

From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Date: Fri, 22 Jan 2021 23:39:15 +0800

> On Fri, 22 Jan 2021 13:55:14 +0100, Magnus Karlsson <magnus.karlsson@gmail.com> wrote:
> > On Fri, Jan 22, 2021 at 1:39 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > >
> > > From: Magnus Karlsson <magnus.karlsson@gmail.com>
> > > Date: Fri, 22 Jan 2021 13:18:47 +0100
> > >
> > > > On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin <alobakin@pm.me> wrote:
> > > > >
> > > > > From: Alexander Lobakin <alobakin@pm.me>
> > > > > Date: Fri, 22 Jan 2021 11:47:45 +0000
> > > > >
> > > > > > From: Eric Dumazet <eric.dumazet@gmail.com>
> > > > > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > > > > >
> > > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > > > > This patch is used to construct skb based on page to save memory copy
> > > > > > > > overhead.
> > > > > > > >
> > > > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only when
> > > > > > > > the network card's priv_flags include IFF_TX_SKB_NO_LINEAR will pages
> > > > > > > > be used to construct the skb directly. If this flag is not supported,
> > > > > > > > it is still necessary to copy the data to construct the skb.
> > > > > > > >
> > > > > > > > ---------------- Performance Testing ------------
> > > > > > > >
> > > > > > > > The test environment is Aliyun ECS server.
> > > > > > > > Test cmd:
> > > > > > > > ```
> > > > > > > > xdpsock -i eth0 -t  -S -s <msg size>
> > > > > > > > ```
> > > > > > > >
> > > > > > > > Test result data:
> > > > > > > >
> > > > > > > > size    64      512     1024    1500
> > > > > > > > copy    1916747 1775988 1600203 1440054
> > > > > > > > page    1974058 1953655 1945463 1904478
> > > > > > > > percent 3.0%    10.0%   21.58%  32.3%
> > > > > > > >
> > > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > > > > > > > ---
> > > > > > > >  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> > > > > > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > > > > index 4a83117..38af7f1 100644
> > > > > > > > --- a/net/xdp/xsk.c
> > > > > > > > +++ b/net/xdp/xsk.c
> > > > > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > > > > > >   sock_wfree(skb);
> > > > > > > >  }
> > > > > > > >
> > > > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > > > > > +                                       struct xdp_desc *desc)
> > > > > > > > +{
> > > > > > > > + u32 len, offset, copy, copied;
> > > > > > > > + struct sk_buff *skb;
> > > > > > > > + struct page *page;
> > > > > > > > + void *buffer;
> > > > > > > > + int err, i;
> > > > > > > > + u64 addr;
> > > > > > > > +
> > > > > > > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > > > > > > > + if (unlikely(!skb))
> > > > > > > > +         return ERR_PTR(err);
> > > > > > > > +
> > > > > > > > + addr = desc->addr;
> > > > > > > > + len = desc->len;
> > > > > > > > +
> > > > > > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > > > > > > + offset = offset_in_page(buffer);
> > > > > > > > + addr = buffer - xs->pool->addrs;
> > > > > > > > +
> > > > > > > > + for (copied = 0, i = 0; copied < len; i++) {
> > > > > > > > +         page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > > > > > > +
> > > > > > > > +         get_page(page);
> > > > > > > > +
> > > > > > > > +         copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > > > > > > +
> > > > > > > > +         skb_fill_page_desc(skb, i, page, offset, copy);
> > > > > > > > +
> > > > > > > > +         copied += copy;
> > > > > > > > +         addr += copy;
> > > > > > > > +         offset = 0;
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > + skb->len += len;
> > > > > > > > + skb->data_len += len;
> > > > > > >
> > > > > > > > + skb->truesize += len;
> > > > > > >
> > > > > > > This is not the truesize, unfortunately.
> > > > > > >
> > > > > > > We need to account for the number of pages, not number of bytes.
> > > > > >
> > > > > > The easiest solution is:
> > > > > >
> > > > > >       skb->truesize += PAGE_SIZE * i;
> > > > > >
> > > > > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.
> > > > >
> > > > > Oops, pls ignore this. I forgot that XSK buffers are not
> > > > > "one per page".
> > > > > We need to count the number of pages manually and then do
> > > > >
> > > > >         skb->truesize += PAGE_SIZE * npages;
> > > > >
> > > > > Right.
> > > >
> > > > There are two possible packet buffer (chunk) sizes in a umem, 2K and
> > > > 4K on a system with a PAGE_SIZE of 4K. If I remember correctly, and
> > > > please correct me if I am wrong, truesize is used for memory accounting.
> > > > But in this code, no kernel memory has been allocated (apart from the
> > > > skb). The page is just a part of the umem that has already been
> > > > allocated beforehand, by user-space in this case. So what should
> > > > truesize be in this case? Do we add 0, chunk_size * i, the
> > > > complicated case of counting exactly how many 4K pages are used
> > > > when the chunk_size is 2K (as two chunks could occupy the same page),
> > > > or just the upper bound of PAGE_SIZE * i, which is likely a good
> > > > approximation in most cases? Just note that there might be other uses
> > > > of truesize that I am unaware of that could impact this choice.
> > >
> > > Truesize is "what amount of memory does this skb occupy with all its
> > > fragments, linear space and struct sk_buff itself". The closer it
> > > is to the actual value, the better.
> > > In this case, I think adding chunk_size * i would be enough.
> >
> > Sounds like a good approximation to me.
> >
> > > (PAGE_SIZE * i can be overwhelming when chunk_size is 2K, especially
> > > for setups with PAGE_SIZE > SZ_4K)
> >
> > You are right. That would be quite horrible on a system with a page size of 64K.
> 
> Thank you everyone, I have learned from this.
>
> I also think it is appropriate to add the chunk size here, and there is
> actually only one chunk here, so it's very simple:
>
> 	skb->truesize += xs->pool->chunk_size;

umem chunks can't cross page boundaries. So if you're sure that
there could be only one chunk, you don't need the loop at all,
if I'm not missing anything.

> In addition, I actually borrowed from the tcp code:
> 
>    tcp_build_frag:
>    --------------
> 
> 	if (can_coalesce) {
> 		skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
> 	} else {
> 		get_page(page);
> 		skb_fill_page_desc(skb, i, page, offset, copy);
> 	}
> 
> 	if (!(flags & MSG_NO_SHARED_FRAGS))
> 		skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
> 
> 	skb->len += copy;
> 	skb->data_len += copy;
> 	skb->truesize += copy;
> 
> So, is there a bug here?

skb_frag_t is an alias for struct bio_vec. It doesn't contain info about
real memory consumption, so there's no other option but to add
"copy" to truesize.
XSK is different in this regard, as it operates on chunks of a known
size.
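
For reference, the alias in question (from include/linux/skbuff.h in
kernels of this vintage):

	typedef struct bio_vec skb_frag_t;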

> Thanks.
> 
> >
> > > > > > > > +
> > > > > > > > + refcount_add(len, &xs->sk.sk_wmem_alloc);
> > > > > > > > +
> > > > > > > > + return skb;
> > > > > > > > +}
> > > > > > > > +
> > > > > >
> > > > > > Al
> > > > >
> > > > > Thanks,
> > > > > Al
> > >
> > > Al


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page
       [not found] <1611329789.3222687-1-xuanzhuo@linux.alibaba.com>
@ 2021-01-22 16:24 ` Alexander Lobakin
  0 siblings, 0 replies; 22+ messages in thread
From: Alexander Lobakin @ 2021-01-22 16:24 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Alexander Lobakin, Eric Dumazet, Michael S . Tsirkin, Jason Wang,
	David S . Miller, Jakub Kicinski, bjorn, Magnus Karlsson,
	Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	virtualization, bpf, netdev, linux-kernel

From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Date: Fri, 22 Jan 2021 23:36:29 +0800

> On Fri, 22 Jan 2021 12:08:00 +0000, Alexander Lobakin <alobakin@pm.me> wrote:
> > From: Alexander Lobakin <alobakin@pm.me>
> > Date: Fri, 22 Jan 2021 11:55:35 +0000
> >
> > > From: Alexander Lobakin <alobakin@pm.me>
> > > Date: Fri, 22 Jan 2021 11:47:45 +0000
> > >
> > > > From: Eric Dumazet <eric.dumazet@gmail.com>
> > > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > > >
> > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > > This patch is used to construct skb based on page to save memory copy
> > > > > > overhead.
> > > > > >
> > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only when
> > > > > > the network card's priv_flags include IFF_TX_SKB_NO_LINEAR will pages
> > > > > > be used to construct the skb directly. If this flag is not supported,
> > > > > > it is still necessary to copy the data to construct the skb.
> > > > > >
> > > > > > ---------------- Performance Testing ------------
> > > > > >
> > > > > > The test environment is Aliyun ECS server.
> > > > > > Test cmd:
> > > > > > ```
> > > > > > xdpsock -i eth0 -t  -S -s <msg size>
> > > > > > ```
> > > > > >
> > > > > > Test result data:
> > > > > >
> > > > > > size    64      512     1024    1500
> > > > > > copy    1916747 1775988 1600203 1440054
> > > > > > page    1974058 1953655 1945463 1904478
> > > > > > percent 3.0%    10.0%   21.58%  32.3%
> > > > > >
> > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > > > > > ---
> > > > > >  net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++----------
> > > > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > > >
> > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > > index 4a83117..38af7f1 100644
> > > > > > --- a/net/xdp/xsk.c
> > > > > > +++ b/net/xdp/xsk.c
> > > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > > > >  	sock_wfree(skb);
> > > > > >  }
> > > > > >
> > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > > > +					      struct xdp_desc *desc)
> > > > > > +{
> > > > > > +	u32 len, offset, copy, copied;
> > > > > > +	struct sk_buff *skb;
> > > > > > +	struct page *page;
> > > > > > +	void *buffer;
> > > > > > +	int err, i;
> > > > > > +	u64 addr;
> > > > > > +
> > > > > > +	skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> >
> > Also,
> > maybe we should allocate it with NET_SKB_PAD so NIC drivers could
> > use some reserved space?
> >
> > 		skb = sock_alloc_send_skb(&xs->sk, NET_SKB_PAD, 1, &err);
> > 		...
> > 		skb_reserve(skb, NET_SKB_PAD);
> >
> > Eric, what do you think?
> 
> I think you are right. Some headroom should be reserved for such devices. This
> space should also be added in the copy mode below. Would LL_RESERVED_SPACE be
> more appropriate?

No. If you look at __netdev_alloc_skb() and __napi_alloc_skb(), they
reserve NET_SKB_PAD at the beginning of the linear area. The
documentation of __build_skb() also says that a driver should reserve
NET_SKB_PAD before the actual frame, so it is a standardized
hardware-independent headroom.
Leaving that space in skb->head will allow developers to implement
IFF_TX_SKB_NO_LINEAR in a wider variety of drivers, especially when
a driver has to prepend some sort of data before the actual frame.
Since it's usually the size of one cacheline, it shouldn't be a big
deal.


[ I also had an idea of allocating an skb with a headroom of
NET_SKB_PAD + 256 bytes, so nearly all drivers could just call
pskb_pull_tail() to support this type of skbuff without much
effort, but I think it's better to teach drivers to support
xmitting of really headless ones. If virtio_net can do it, why
shouldn't the others? ]
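
Putting the two lines from above together with the patch's existing
error handling, the allocation could look like this (sketch only, not
the final form):

	/* reserve the standardized hardware-independent headroom so
	 * drivers can prepend data without reallocating
	 */
	skb = sock_alloc_send_skb(&xs->sk, NET_SKB_PAD, 1, &err);
	if (unlikely(!skb))
		return ERR_PTR(err);

	skb_reserve(skb, NET_SKB_PAD);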

> > > > > > +	if (unlikely(!skb))
> > > > > > +		return ERR_PTR(err);
> > > > > > +
> > > > > > +	addr = desc->addr;
> > > > > > +	len = desc->len;
> > > > > > +
> > > > > > +	buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > > > > +	offset = offset_in_page(buffer);
> > > > > > +	addr = buffer - xs->pool->addrs;
> > > > > > +
> > > > > > +	for (copied = 0, i = 0; copied < len; i++) {
> > > > > > +		page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > > > > +
> > > > > > +		get_page(page);
> > > > > > +
> > > > > > +		copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > > > > +
> > > > > > +		skb_fill_page_desc(skb, i, page, offset, copy);
> > > > > > +
> > > > > > +		copied += copy;
> > > > > > +		addr += copy;
> > > > > > +		offset = 0;
> > > > > > +	}
> > > > > > +
> > > > > > +	skb->len += len;
> > > > > > +	skb->data_len += len;
> > > > >
> > > > > > +	skb->truesize += len;
> > > > >
> > > > > This is not the truesize, unfortunately.
> > > > >
> > > > > We need to account for the number of pages, not number of bytes.
> > > >
> > > > The easiest solution is:
> > > >
> > > > 	skb->truesize += PAGE_SIZE * i;
> > > >
> > > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.
> > >
> > > Oops, pls ignore this. I forgot that XSK buffers are not
> > > "one per page".
> > > We need to count the number of pages manually and then do
> > >
> > > 	skb->truesize += PAGE_SIZE * npages;
> > >
> > > Right.
> > >
> > > > > > +
> > > > > > +	refcount_add(len, &xs->sk.sk_wmem_alloc);
> > > > > > +
> > > > > > +	return skb;
> > > > > > +}
> > > > > > +
> > > >
> > > > Al
> > >
> > > Thanks,
> > > Al
> >
> > Al


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-01-26 19:42 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-21 13:47 [PATCH bpf-next v3 0/3] xsk: build skb by page Xuan Zhuo
2021-01-21 13:47 ` [PATCH bpf-next v3 1/3] net: add priv_flags for allow tx skb without linear Xuan Zhuo
2021-01-21 15:13   ` Alexander Lobakin
2021-01-21 13:47 ` [PATCH bpf-next v3 2/3] virtio-net: support IFF_TX_SKB_NO_LINEAR Xuan Zhuo
2021-01-21 13:47 ` [PATCH bpf-next v3 3/3] xsk: build skb by page Xuan Zhuo
2021-01-21 15:17   ` Magnus Karlsson
2021-01-21 15:41   ` Eric Dumazet
2021-01-22 11:47     ` Alexander Lobakin
2021-01-22 11:55       ` Alexander Lobakin
2021-01-22 12:08         ` Alexander Lobakin
2021-01-22 12:18         ` Magnus Karlsson
2021-01-22 12:39           ` Alexander Lobakin
2021-01-22 12:55             ` Magnus Karlsson
     [not found] <1611329789.3222687-1-xuanzhuo@linux.alibaba.com>
2021-01-22 16:24 ` Alexander Lobakin
     [not found] <1611329955.4913929-2-xuanzhuo@linux.alibaba.com>
2021-01-22 17:26 ` Alexander Lobakin
2021-01-22 18:37   ` Magnus Karlsson
2021-01-22 18:45     ` Alexander Lobakin
     [not found]     ` <1611541335.3012564-1-xuanzhuo@linux.alibaba.com>
2021-01-25  7:44       ` Magnus Karlsson
     [not found]         ` <1611578136.5043845-1-xuanzhuo@linux.alibaba.com>
2021-01-25 13:16           ` Magnus Karlsson
     [not found]             ` <1611587242.9653594-2-xuanzhuo@linux.alibaba.com>
2021-01-26  7:34               ` Magnus Karlsson
     [not found] <1611544243.8363645-1-xuanzhuo@linux.alibaba.com>
2021-01-25 13:25 ` Alexander Lobakin
     [not found] <1611586627.1035807-1-xuanzhuo@linux.alibaba.com>
2021-01-25 15:07 ` Alexander Lobakin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).