* [PATCH bpf-next v3 0/3] xsk: build skb by page
@ 2021-01-21 13:47 Xuan Zhuo
  2021-01-21 13:47 ` [PATCH bpf-next v3 1/3] net: add priv_flags for allow tx skb without linear Xuan Zhuo
  ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Xuan Zhuo @ 2021-01-21 13:47 UTC (permalink / raw)
  To: bpf
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, virtualization, netdev

v3: optimized the code

v2:
 1. add priv_flags IFF_TX_SKB_NO_LINEAR instead of a netdev_feature
 2. split the patch into three:
     a. add priv_flags IFF_TX_SKB_NO_LINEAR
     b. virtio net: add priv_flags IFF_TX_SKB_NO_LINEAR
     c. when a driver supports this flag, construct the skb without
        linear space
 3. use ERR_PTR() and PTR_ERR() to handle errors

v1 message log:
---------------

This patchset constructs the skb directly from pages to save memory
copy overhead.

This has one problem: we construct the skb by filling the data page
into the skb as a frag. In this way the linear space is empty, and the
header information is also in a frag, not in the linear space, which
some network cards do not allow.
For example, Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will
report the following error message:

mlx5_core 0000:3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2
WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64
00000000: 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
mlx5_core 0000:3b:00.1 eth1: ERR CQE on SQ: 0x1dbb

I also tried to use build_skb to construct the skb, but because
skb_shinfo must sit behind the linear space, that approach does not
work. We cannot put skb_shinfo at desc->addr either: it would be
exposed to userspace, which is not safe.

Finally, I added a feature, NETIF_F_SKB_NO_LINEAR, to indicate whether
the network card supports packets whose header information is in a frag
rather than in the linear space.

---------------- Performance Testing ------------

The test environment is an Aliyun ECS server.

Test cmd:
```
xdpsock -i eth0 -t -S -s <msg size>
```

Test result data:

size    64      512     1024    1500
copy    1916747 1775988 1600203 1440054
page    1974058 1953655 1945463 1904478
percent 3.0%    10.0%   21.58%  32.3%

Xuan Zhuo (3):
  net: add priv_flags for allow tx skb without linear
  virtio-net: support IFF_TX_SKB_NO_LINEAR
  xsk: build skb by page

 drivers/net/virtio_net.c  |   3 +-
 include/linux/netdevice.h |   3 ++
 net/xdp/xsk.c             | 104 ++++++++++++++++++++++++++++++++++++++--------
 3 files changed, 91 insertions(+), 19 deletions(-)

--
1.8.3.1

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [PATCH bpf-next v3 1/3] net: add priv_flags for allow tx skb without linear
  2021-01-21 13:47 [PATCH bpf-next v3 0/3] xsk: build skb by page Xuan Zhuo
@ 2021-01-21 13:47 ` Xuan Zhuo
  2021-01-21 15:13   ` Alexander Lobakin
  2021-01-21 13:47 ` [PATCH bpf-next v3 2/3] virtio-net: support IFF_TX_SKB_NO_LINEAR Xuan Zhuo
  2021-01-21 13:47 ` [PATCH bpf-next v3 3/3] xsk: build skb by page Xuan Zhuo
  2 siblings, 1 reply; 13+ messages in thread
From: Xuan Zhuo @ 2021-01-21 13:47 UTC (permalink / raw)
  To: bpf
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, virtualization, netdev

In some cases, we want to construct an skb directly on top of existing
memory without copying data. In this case the pages are placed directly
into the skb and the skb's linear space is left empty. Unfortunately,
many network cards do not support this operation. For example, Mellanox
Technologies MT27710 Family [ConnectX-4 Lx] will report the following
error message:

mlx5_core 0000:3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2
WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64
00000000: 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
mlx5_core 0000:3b:00.1 eth1: ERR CQE on SQ: 0x1dbb

So a priv_flags bit is added here to indicate whether the network card
supports this feature.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Suggested-by: Alexander Lobakin <alobakin@pm.me> --- include/linux/netdevice.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index ef51725..135db8f 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1525,6 +1525,7 @@ struct net_device_ops { * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running + * @IFF_TX_SKB_NO_LINEAR: allow tx skb linear is empty */ enum netdev_priv_flags { IFF_802_1Q_VLAN = 1<<0, @@ -1558,6 +1559,7 @@ enum netdev_priv_flags { IFF_FAILOVER_SLAVE = 1<<28, IFF_L3MDEV_RX_HANDLER = 1<<29, IFF_LIVE_RENAME_OK = 1<<30, + IFF_TX_SKB_NO_LINEAR = 1<<31, }; #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN @@ -1590,6 +1592,7 @@ enum netdev_priv_flags { #define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE #define IFF_L3MDEV_RX_HANDLER IFF_L3MDEV_RX_HANDLER #define IFF_LIVE_RENAME_OK IFF_LIVE_RENAME_OK +#define IFF_TX_SKB_NO_LINEAR IFF_TX_SKB_NO_LINEAR /** * struct net_device - The DEVICE structure. -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v3 1/3] net: add priv_flags for allow tx skb without linear 2021-01-21 13:47 ` [PATCH bpf-next v3 1/3] net: add priv_flags for allow tx skb without linear Xuan Zhuo @ 2021-01-21 15:13 ` Alexander Lobakin 0 siblings, 0 replies; 13+ messages in thread From: Alexander Lobakin @ 2021-01-21 15:13 UTC (permalink / raw) To: Xuan Zhuo Cc: Alexander Lobakin, Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski, Björn Töpel, Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, virtualization, bpf, netdev, linux-kernel From: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Date: Thu, 21 Jan 2021 21:47:07 +0800 > In some cases, we hope to construct skb directly based on the existing > memory without copying data. In this case, the page will be placed > directly in the skb, and the linear space of skb is empty. But > unfortunately, many the network card does not support this operation. > For example Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will > get the following error message: > > mlx5_core 0000:3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 > 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00000030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 > WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 > 00000000: 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00000020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 > 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > mlx5_core 0000:3b:00.1 eth1: ERR CQE on SQ: 0x1dbb > > So a priv_flag is added here to indicate whether the network card > supports this feature. 
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > Suggested-by: Alexander Lobakin <alobakin@pm.me> Acked-by: Alexander Lobakin <alobakin@pm.me> > --- > include/linux/netdevice.h | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index ef51725..135db8f 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -1525,6 +1525,7 @@ struct net_device_ops { > * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device > * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device > * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running > + * @IFF_TX_SKB_NO_LINEAR: allow tx skb linear is empty > */ > enum netdev_priv_flags { > IFF_802_1Q_VLAN = 1<<0, > @@ -1558,6 +1559,7 @@ enum netdev_priv_flags { > IFF_FAILOVER_SLAVE = 1<<28, > IFF_L3MDEV_RX_HANDLER = 1<<29, > IFF_LIVE_RENAME_OK = 1<<30, > + IFF_TX_SKB_NO_LINEAR = 1<<31, > }; > > #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN > @@ -1590,6 +1592,7 @@ enum netdev_priv_flags { > #define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE > #define IFF_L3MDEV_RX_HANDLER IFF_L3MDEV_RX_HANDLER > #define IFF_LIVE_RENAME_OK IFF_LIVE_RENAME_OK > +#define IFF_TX_SKB_NO_LINEAR IFF_TX_SKB_NO_LINEAR > > /** > * struct net_device - The DEVICE structure. > -- > 1.8.3.1 Thanks, Al ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH bpf-next v3 2/3] virtio-net: support IFF_TX_SKB_NO_LINEAR
  2021-01-21 13:47 [PATCH bpf-next v3 0/3] xsk: build skb by page Xuan Zhuo
  2021-01-21 13:47 ` [PATCH bpf-next v3 1/3] net: add priv_flags for allow tx skb without linear Xuan Zhuo
@ 2021-01-21 13:47 ` Xuan Zhuo
  2021-01-21 13:47 ` [PATCH bpf-next v3 3/3] xsk: build skb by page Xuan Zhuo
  2 siblings, 0 replies; 13+ messages in thread
From: Xuan Zhuo @ 2021-01-21 13:47 UTC (permalink / raw)
  To: bpf
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, virtualization, netdev

Virtio-net supports transmitting skbs whose linear space is empty, so
set the IFF_TX_SKB_NO_LINEAR priv_flags bit.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio_net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ba8e637..f2ff6c3 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2972,7 +2972,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 		return -ENOMEM;
 
 	/* Set up network device as normal. */
-	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE;
+	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE |
+			   IFF_TX_SKB_NO_LINEAR;
 	dev->netdev_ops = &virtnet_netdev;
 	dev->features = NETIF_F_HIGHDMA;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* [PATCH bpf-next v3 3/3] xsk: build skb by page
  2021-01-21 13:47 [PATCH bpf-next v3 0/3] xsk: build skb by page Xuan Zhuo
  2021-01-21 13:47 ` [PATCH bpf-next v3 1/3] net: add priv_flags for allow tx skb without linear Xuan Zhuo
  2021-01-21 13:47 ` [PATCH bpf-next v3 2/3] virtio-net: support IFF_TX_SKB_NO_LINEAR Xuan Zhuo
@ 2021-01-21 13:47 ` Xuan Zhuo
  2021-01-21 15:17   ` Magnus Karlsson
  2021-01-21 15:41   ` Eric Dumazet
  2 siblings, 2 replies; 13+ messages in thread
From: Xuan Zhuo @ 2021-01-21 13:47 UTC (permalink / raw)
  To: bpf
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Björn Töpel, Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, virtualization, netdev

This patch constructs the skb from pages to save memory copy overhead.

The implementation is based on IFF_TX_SKB_NO_LINEAR: only when the
network card's priv_flags include IFF_TX_SKB_NO_LINEAR are pages used
to construct the skb directly. If this feature is not supported, data
is still copied to construct the skb.

---------------- Performance Testing ------------

The test environment is an Aliyun ECS server.
Test cmd: ``` xdpsock -i eth0 -t -S -s <msg size> ``` Test result data: size 64 512 1024 1500 copy 1916747 1775988 1600203 1440054 page 1974058 1953655 1945463 1904478 percent 3.0% 10.0% 21.58% 32.3% Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> --- net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 86 insertions(+), 18 deletions(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 4a83117..38af7f1 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb) sock_wfree(skb); } +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, + struct xdp_desc *desc) +{ + u32 len, offset, copy, copied; + struct sk_buff *skb; + struct page *page; + void *buffer; + int err, i; + u64 addr; + + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + addr = desc->addr; + len = desc->len; + + buffer = xsk_buff_raw_get_data(xs->pool, addr); + offset = offset_in_page(buffer); + addr = buffer - xs->pool->addrs; + + for (copied = 0, i = 0; copied < len; i++) { + page = xs->pool->umem->pgs[addr >> PAGE_SHIFT]; + + get_page(page); + + copy = min_t(u32, PAGE_SIZE - offset, len - copied); + + skb_fill_page_desc(skb, i, page, offset, copy); + + copied += copy; + addr += copy; + offset = 0; + } + + skb->len += len; + skb->data_len += len; + skb->truesize += len; + + refcount_add(len, &xs->sk.sk_wmem_alloc); + + return skb; +} + +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, + struct xdp_desc *desc) +{ + struct sk_buff *skb; + + if (xs->dev->priv_flags & IFF_TX_SKB_NO_LINEAR) { + skb = xsk_build_skb_zerocopy(xs, desc); + if (IS_ERR(skb)) + return skb; + } else { + void *buffer; + u32 len; + int err; + + len = desc->len; + skb = sock_alloc_send_skb(&xs->sk, len, 1, &err); + if (unlikely(!skb)) + return ERR_PTR(err); + + skb_put(skb, len); + buffer = 
xsk_buff_raw_get_data(xs->pool, desc->addr); + err = skb_store_bits(skb, 0, buffer, len); + if (unlikely(err)) { + kfree_skb(skb); + return ERR_PTR(err); + } + } + + skb->dev = xs->dev; + skb->priority = xs->sk.sk_priority; + skb->mark = xs->sk.sk_mark; + skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr; + skb->destructor = xsk_destruct_skb; + + return skb; +} + static int xsk_generic_xmit(struct sock *sk) { struct xdp_sock *xs = xdp_sk(sk); @@ -446,43 +527,30 @@ static int xsk_generic_xmit(struct sock *sk) goto out; while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { - char *buffer; - u64 addr; - u32 len; - if (max_batch-- == 0) { err = -EAGAIN; goto out; } - len = desc.len; - skb = sock_alloc_send_skb(sk, len, 1, &err); - if (unlikely(!skb)) + skb = xsk_build_skb(xs, &desc); + if (IS_ERR(skb)) { + err = PTR_ERR(skb); goto out; + } - skb_put(skb, len); - addr = desc.addr; - buffer = xsk_buff_raw_get_data(xs->pool, addr); - err = skb_store_bits(skb, 0, buffer, len); /* This is the backpressure mechanism for the Tx path. * Reserve space in the completion queue and only proceed * if there is space in it. This avoids having to implement * any buffering in the Tx path. */ spin_lock_irqsave(&xs->pool->cq_lock, flags); - if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) { + if (xskq_prod_reserve(xs->pool->cq)) { spin_unlock_irqrestore(&xs->pool->cq_lock, flags); kfree_skb(skb); goto out; } spin_unlock_irqrestore(&xs->pool->cq_lock, flags); - skb->dev = xs->dev; - skb->priority = sk->sk_priority; - skb->mark = sk->sk_mark; - skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr; - skb->destructor = xsk_destruct_skb; - err = __dev_direct_xmit(skb, xs->queue_id); if (err == NETDEV_TX_BUSY) { /* Tell user-space to retry the send */ -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page 2021-01-21 13:47 ` [PATCH bpf-next v3 3/3] xsk: build skb by page Xuan Zhuo @ 2021-01-21 15:17 ` Magnus Karlsson 2021-01-21 15:41 ` Eric Dumazet 1 sibling, 0 replies; 13+ messages in thread From: Magnus Karlsson @ 2021-01-21 15:17 UTC (permalink / raw) To: Xuan Zhuo Cc: bpf, Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski, Björn Töpel, Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, virtualization, Network Development On Thu, Jan 21, 2021 at 2:51 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote: > > This patch is used to construct skb based on page to save memory copy > overhead. > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to > directly construct skb. If this feature is not supported, it is still > necessary to copy data to construct skb. > > ---------------- Performance Testing ------------ > > The test environment is Aliyun ECS server. > Test cmd: > ``` > xdpsock -i eth0 -t -S -s <msg size> > ``` > > Test result data: > > size 64 512 1024 1500 > copy 1916747 1775988 1600203 1440054 > page 1974058 1953655 1945463 1904478 > percent 3.0% 10.0% 21.58% 32.3% > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > Reviewed-by: Dust Li <dust.li@linux.alibaba.com> > --- > net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++---------- > 1 file changed, 86 insertions(+), 18 deletions(-) Applied, compiled and tried it out on my NIC that does not support IFF_TX_SKB_NO_LINEAR and it works fine. Thank you Xuan for all your efforts. Appreciated. Now it would be nice if we could get some physical NIC drivers to support this too. Some probably already do and can just set the bit, while others need some modifications to support this. 
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > index 4a83117..38af7f1 100644 > --- a/net/xdp/xsk.c > +++ b/net/xdp/xsk.c > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb) > sock_wfree(skb); > } > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, > + struct xdp_desc *desc) > +{ > + u32 len, offset, copy, copied; > + struct sk_buff *skb; > + struct page *page; > + void *buffer; > + int err, i; > + u64 addr; > + > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err); > + if (unlikely(!skb)) > + return ERR_PTR(err); > + > + addr = desc->addr; > + len = desc->len; > + > + buffer = xsk_buff_raw_get_data(xs->pool, addr); > + offset = offset_in_page(buffer); > + addr = buffer - xs->pool->addrs; > + > + for (copied = 0, i = 0; copied < len; i++) { > + page = xs->pool->umem->pgs[addr >> PAGE_SHIFT]; > + > + get_page(page); > + > + copy = min_t(u32, PAGE_SIZE - offset, len - copied); > + > + skb_fill_page_desc(skb, i, page, offset, copy); > + > + copied += copy; > + addr += copy; > + offset = 0; > + } > + > + skb->len += len; > + skb->data_len += len; > + skb->truesize += len; > + > + refcount_add(len, &xs->sk.sk_wmem_alloc); > + > + return skb; > +} > + > +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, > + struct xdp_desc *desc) > +{ > + struct sk_buff *skb; > + > + if (xs->dev->priv_flags & IFF_TX_SKB_NO_LINEAR) { > + skb = xsk_build_skb_zerocopy(xs, desc); > + if (IS_ERR(skb)) > + return skb; > + } else { > + void *buffer; > + u32 len; > + int err; > + > + len = desc->len; > + skb = sock_alloc_send_skb(&xs->sk, len, 1, &err); > + if (unlikely(!skb)) > + return ERR_PTR(err); > + > + skb_put(skb, len); > + buffer = xsk_buff_raw_get_data(xs->pool, desc->addr); > + err = skb_store_bits(skb, 0, buffer, len); > + if (unlikely(err)) { > + kfree_skb(skb); > + return ERR_PTR(err); > + } > + } > + > + skb->dev = xs->dev; > + skb->priority = xs->sk.sk_priority; > + skb->mark = 
xs->sk.sk_mark; > + skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr; > + skb->destructor = xsk_destruct_skb; > + > + return skb; > +} > + > static int xsk_generic_xmit(struct sock *sk) > { > struct xdp_sock *xs = xdp_sk(sk); > @@ -446,43 +527,30 @@ static int xsk_generic_xmit(struct sock *sk) > goto out; > > while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { > - char *buffer; > - u64 addr; > - u32 len; > - > if (max_batch-- == 0) { > err = -EAGAIN; > goto out; > } > > - len = desc.len; > - skb = sock_alloc_send_skb(sk, len, 1, &err); > - if (unlikely(!skb)) > + skb = xsk_build_skb(xs, &desc); > + if (IS_ERR(skb)) { > + err = PTR_ERR(skb); > goto out; > + } > > - skb_put(skb, len); > - addr = desc.addr; > - buffer = xsk_buff_raw_get_data(xs->pool, addr); > - err = skb_store_bits(skb, 0, buffer, len); > /* This is the backpressure mechanism for the Tx path. > * Reserve space in the completion queue and only proceed > * if there is space in it. This avoids having to implement > * any buffering in the Tx path. > */ > spin_lock_irqsave(&xs->pool->cq_lock, flags); > - if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) { > + if (xskq_prod_reserve(xs->pool->cq)) { > spin_unlock_irqrestore(&xs->pool->cq_lock, flags); > kfree_skb(skb); > goto out; > } > spin_unlock_irqrestore(&xs->pool->cq_lock, flags); > > - skb->dev = xs->dev; > - skb->priority = sk->sk_priority; > - skb->mark = sk->sk_mark; > - skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr; > - skb->destructor = xsk_destruct_skb; > - > err = __dev_direct_xmit(skb, xs->queue_id); > if (err == NETDEV_TX_BUSY) { > /* Tell user-space to retry the send */ > -- > 1.8.3.1 > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page 2021-01-21 13:47 ` [PATCH bpf-next v3 3/3] xsk: build skb by page Xuan Zhuo 2021-01-21 15:17 ` Magnus Karlsson @ 2021-01-21 15:41 ` Eric Dumazet 2021-01-22 11:47 ` Alexander Lobakin 1 sibling, 1 reply; 13+ messages in thread From: Eric Dumazet @ 2021-01-21 15:41 UTC (permalink / raw) To: Xuan Zhuo, bpf Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski, Björn Töpel, Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, virtualization, netdev On 1/21/21 2:47 PM, Xuan Zhuo wrote: > This patch is used to construct skb based on page to save memory copy > overhead. > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to > directly construct skb. If this feature is not supported, it is still > necessary to copy data to construct skb. > > ---------------- Performance Testing ------------ > > The test environment is Aliyun ECS server. 
> Test cmd: > ``` > xdpsock -i eth0 -t -S -s <msg size> > ``` > > Test result data: > > size 64 512 1024 1500 > copy 1916747 1775988 1600203 1440054 > page 1974058 1953655 1945463 1904478 > percent 3.0% 10.0% 21.58% 32.3% > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > Reviewed-by: Dust Li <dust.li@linux.alibaba.com> > --- > net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++---------- > 1 file changed, 86 insertions(+), 18 deletions(-) > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > index 4a83117..38af7f1 100644 > --- a/net/xdp/xsk.c > +++ b/net/xdp/xsk.c > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb) > sock_wfree(skb); > } > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, > + struct xdp_desc *desc) > +{ > + u32 len, offset, copy, copied; > + struct sk_buff *skb; > + struct page *page; > + void *buffer; > + int err, i; > + u64 addr; > + > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err); > + if (unlikely(!skb)) > + return ERR_PTR(err); > + > + addr = desc->addr; > + len = desc->len; > + > + buffer = xsk_buff_raw_get_data(xs->pool, addr); > + offset = offset_in_page(buffer); > + addr = buffer - xs->pool->addrs; > + > + for (copied = 0, i = 0; copied < len; i++) { > + page = xs->pool->umem->pgs[addr >> PAGE_SHIFT]; > + > + get_page(page); > + > + copy = min_t(u32, PAGE_SIZE - offset, len - copied); > + > + skb_fill_page_desc(skb, i, page, offset, copy); > + > + copied += copy; > + addr += copy; > + offset = 0; > + } > + > + skb->len += len; > + skb->data_len += len; > + skb->truesize += len; This is not the truesize, unfortunately. We need to account for the number of pages, not number of bytes. > + > + refcount_add(len, &xs->sk.sk_wmem_alloc); > + > + return skb; > +} > + ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page 2021-01-21 15:41 ` Eric Dumazet @ 2021-01-22 11:47 ` Alexander Lobakin 2021-01-22 11:55 ` Alexander Lobakin 0 siblings, 1 reply; 13+ messages in thread From: Alexander Lobakin @ 2021-01-22 11:47 UTC (permalink / raw) To: Eric Dumazet Cc: Alexander Lobakin, Xuan Zhuo, Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski, Björn Töpel, Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, virtualization, bpf, netdev, linux-kernel From: Eric Dumazet <eric.dumazet@gmail.com> Date: Thu, 21 Jan 2021 16:41:33 +0100 > On 1/21/21 2:47 PM, Xuan Zhuo wrote: > > This patch is used to construct skb based on page to save memory copy > > overhead. > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to > > directly construct skb. If this feature is not supported, it is still > > necessary to copy data to construct skb. > > > > ---------------- Performance Testing ------------ > > > > The test environment is Aliyun ECS server. 
> > Test cmd: > > ``` > > xdpsock -i eth0 -t -S -s <msg size> > > ``` > > > > Test result data: > > > > size 64 512 1024 1500 > > copy 1916747 1775988 1600203 1440054 > > page 1974058 1953655 1945463 1904478 > > percent 3.0% 10.0% 21.58% 32.3% > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com> > > --- > > net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++---------- > > 1 file changed, 86 insertions(+), 18 deletions(-) > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > > index 4a83117..38af7f1 100644 > > --- a/net/xdp/xsk.c > > +++ b/net/xdp/xsk.c > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb) > > sock_wfree(skb); > > } > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, > > + struct xdp_desc *desc) > > +{ > > + u32 len, offset, copy, copied; > > + struct sk_buff *skb; > > + struct page *page; > > + void *buffer; > > + int err, i; > > + u64 addr; > > + > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err); > > + if (unlikely(!skb)) > > + return ERR_PTR(err); > > + > > + addr = desc->addr; > > + len = desc->len; > > + > > + buffer = xsk_buff_raw_get_data(xs->pool, addr); > > + offset = offset_in_page(buffer); > > + addr = buffer - xs->pool->addrs; > > + > > + for (copied = 0, i = 0; copied < len; i++) { > > + page = xs->pool->umem->pgs[addr >> PAGE_SHIFT]; > > + > > + get_page(page); > > + > > + copy = min_t(u32, PAGE_SIZE - offset, len - copied); > > + > > + skb_fill_page_desc(skb, i, page, offset, copy); > > + > > + copied += copy; > > + addr += copy; > > + offset = 0; > > + } > > + > > + skb->len += len; > > + skb->data_len += len; > > > + skb->truesize += len; > > This is not the truesize, unfortunately. > > We need to account for the number of pages, not number of bytes. The easiest solution is: skb->truesize += PAGE_SIZE * i; i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop. 
> > + > > + refcount_add(len, &xs->sk.sk_wmem_alloc); > > + > > + return skb; > > +} > > + Al ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page 2021-01-22 11:47 ` Alexander Lobakin @ 2021-01-22 11:55 ` Alexander Lobakin 2021-01-22 12:08 ` Alexander Lobakin 2021-01-22 12:18 ` Magnus Karlsson 0 siblings, 2 replies; 13+ messages in thread From: Alexander Lobakin @ 2021-01-22 11:55 UTC (permalink / raw) To: Eric Dumazet Cc: Alexander Lobakin, Xuan Zhuo, Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski, Björn Töpel, Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, virtualization, bpf, netdev, linux-kernel From: Alexander Lobakin <alobakin@pm.me> Date: Fri, 22 Jan 2021 11:47:45 +0000 > From: Eric Dumazet <eric.dumazet@gmail.com> > Date: Thu, 21 Jan 2021 16:41:33 +0100 > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote: > > > This patch is used to construct skb based on page to save memory copy > > > overhead. > > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to > > > directly construct skb. If this feature is not supported, it is still > > > necessary to copy data to construct skb. > > > > > > ---------------- Performance Testing ------------ > > > > > > The test environment is Aliyun ECS server. 
> > > Test cmd: > > > ``` > > > xdpsock -i eth0 -t -S -s <msg size> > > > ``` > > > > > > Test result data: > > > > > > size 64 512 1024 1500 > > > copy 1916747 1775988 1600203 1440054 > > > page 1974058 1953655 1945463 1904478 > > > percent 3.0% 10.0% 21.58% 32.3% > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com> > > > --- > > > net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++---------- > > > 1 file changed, 86 insertions(+), 18 deletions(-) > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > > > index 4a83117..38af7f1 100644 > > > --- a/net/xdp/xsk.c > > > +++ b/net/xdp/xsk.c > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb) > > > sock_wfree(skb); > > > } > > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, > > > + struct xdp_desc *desc) > > > +{ > > > + u32 len, offset, copy, copied; > > > + struct sk_buff *skb; > > > + struct page *page; > > > + void *buffer; > > > + int err, i; > > > + u64 addr; > > > + > > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err); > > > + if (unlikely(!skb)) > > > + return ERR_PTR(err); > > > + > > > + addr = desc->addr; > > > + len = desc->len; > > > + > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr); > > > + offset = offset_in_page(buffer); > > > + addr = buffer - xs->pool->addrs; > > > + > > > + for (copied = 0, i = 0; copied < len; i++) { > > > + page = xs->pool->umem->pgs[addr >> PAGE_SHIFT]; > > > + > > > + get_page(page); > > > + > > > + copy = min_t(u32, PAGE_SIZE - offset, len - copied); > > > + > > > + skb_fill_page_desc(skb, i, page, offset, copy); > > > + > > > + copied += copy; > > > + addr += copy; > > > + offset = 0; > > > + } > > > + > > > + skb->len += len; > > > + skb->data_len += len; > > > > > + skb->truesize += len; > > > > This is not the truesize, unfortunately. > > > > We need to account for the number of pages, not number of bytes. 
> > The easiest solution is: > > skb->truesize += PAGE_SIZE * i; > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop. Oops, pls ignore this. I forgot that XSK buffers are not "one per page". We need to count the number of pages manually and then do skb->truesize += PAGE_SIZE * npages; Right. > > > + > > > + refcount_add(len, &xs->sk.sk_wmem_alloc); > > > + > > > + return skb; > > > +} > > > + > > Al Thanks, Al ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page 2021-01-22 11:55 ` Alexander Lobakin @ 2021-01-22 12:08 ` Alexander Lobakin 2021-01-22 12:18 ` Magnus Karlsson 1 sibling, 0 replies; 13+ messages in thread From: Alexander Lobakin @ 2021-01-22 12:08 UTC (permalink / raw) To: Eric Dumazet Cc: Alexander Lobakin, Xuan Zhuo, Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski, bjorn, Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, virtualization, bpf, netdev, linux-kernel From: Alexander Lobakin <alobakin@pm.me> Date: Fri, 22 Jan 2021 11:55:35 +0000 > From: Alexander Lobakin <alobakin@pm.me> > Date: Fri, 22 Jan 2021 11:47:45 +0000 > > > From: Eric Dumazet <eric.dumazet@gmail.com> > > Date: Thu, 21 Jan 2021 16:41:33 +0100 > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote: > > > > This patch is used to construct skb based on page to save memory copy > > > > overhead. > > > > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the > > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to > > > > directly construct skb. If this feature is not supported, it is still > > > > necessary to copy data to construct skb. > > > > > > > > ---------------- Performance Testing ------------ > > > > > > > > The test environment is Aliyun ECS server. 
> > > > Test cmd: > > > > ``` > > > > xdpsock -i eth0 -t -S -s <msg size> > > > > ``` > > > > > > > > Test result data: > > > > > > > > size 64 512 1024 1500 > > > > copy 1916747 1775988 1600203 1440054 > > > > page 1974058 1953655 1945463 1904478 > > > > percent 3.0% 10.0% 21.58% 32.3% > > > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com> > > > > --- > > > > net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++---------- > > > > 1 file changed, 86 insertions(+), 18 deletions(-) > > > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > > > > index 4a83117..38af7f1 100644 > > > > --- a/net/xdp/xsk.c > > > > +++ b/net/xdp/xsk.c > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb) > > > > sock_wfree(skb); > > > > } > > > > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, > > > > + struct xdp_desc *desc) > > > > +{ > > > > + u32 len, offset, copy, copied; > > > > + struct sk_buff *skb; > > > > + struct page *page; > > > > + void *buffer; > > > > + int err, i; > > > > + u64 addr; > > > > + > > > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err); Also, maybe we should allocate it with NET_SKB_PAD so NIC drivers could use some reserved space? skb = sock_alloc_send_skb(&xs->sk, NET_SKB_PAD, 1, &err); ... skb_reserve(skb, NET_SKB_PAD); Eric, what do you think? 
> > > > + if (unlikely(!skb)) > > > > + return ERR_PTR(err); > > > > + > > > > + addr = desc->addr; > > > > + len = desc->len; > > > > + > > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr); > > > > + offset = offset_in_page(buffer); > > > > + addr = buffer - xs->pool->addrs; > > > > + > > > > + for (copied = 0, i = 0; copied < len; i++) { > > > > + page = xs->pool->umem->pgs[addr >> PAGE_SHIFT]; > > > > + > > > > + get_page(page); > > > > + > > > > + copy = min_t(u32, PAGE_SIZE - offset, len - copied); > > > > + > > > > + skb_fill_page_desc(skb, i, page, offset, copy); > > > > + > > > > + copied += copy; > > > > + addr += copy; > > > > + offset = 0; > > > > + } > > > > + > > > > + skb->len += len; > > > > + skb->data_len += len; > > > > > > > + skb->truesize += len; > > > > > > This is not the truesize, unfortunately. > > > > > > We need to account for the number of pages, not number of bytes. > > > > The easiest solution is: > > > > skb->truesize += PAGE_SIZE * i; > > > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop. > > Oops, pls ignore this. I forgot that XSK buffers are not > "one per page". > We need to count the number of pages manually and then do > > skb->truesize += PAGE_SIZE * npages; > > Right. > > > > > + > > > > + refcount_add(len, &xs->sk.sk_wmem_alloc); > > > > + > > > > + return skb; > > > > +} > > > > + > > > > Al > > Thanks, > Al Al ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page 2021-01-22 11:55 ` Alexander Lobakin 2021-01-22 12:08 ` Alexander Lobakin @ 2021-01-22 12:18 ` Magnus Karlsson 2021-01-22 12:39 ` Alexander Lobakin 1 sibling, 1 reply; 13+ messages in thread From: Magnus Karlsson @ 2021-01-22 12:18 UTC (permalink / raw) To: Alexander Lobakin Cc: Eric Dumazet, Xuan Zhuo, Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski, Björn Töpel, Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, virtualization, bpf, Network Development, open list On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin <alobakin@pm.me> wrote: > > From: Alexander Lobakin <alobakin@pm.me> > Date: Fri, 22 Jan 2021 11:47:45 +0000 > > > From: Eric Dumazet <eric.dumazet@gmail.com> > > Date: Thu, 21 Jan 2021 16:41:33 +0100 > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote: > > > > This patch is used to construct skb based on page to save memory copy > > > > overhead. > > > > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the > > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to > > > > directly construct skb. If this feature is not supported, it is still > > > > necessary to copy data to construct skb. > > > > > > > > ---------------- Performance Testing ------------ > > > > > > > > The test environment is Aliyun ECS server. 
> > > > Test cmd: > > > > ``` > > > > xdpsock -i eth0 -t -S -s <msg size> > > > > ``` > > > > > > > > Test result data: > > > > > > > > size 64 512 1024 1500 > > > > copy 1916747 1775988 1600203 1440054 > > > > page 1974058 1953655 1945463 1904478 > > > > percent 3.0% 10.0% 21.58% 32.3% > > > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com> > > > > --- > > > > net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++---------- > > > > 1 file changed, 86 insertions(+), 18 deletions(-) > > > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > > > > index 4a83117..38af7f1 100644 > > > > --- a/net/xdp/xsk.c > > > > +++ b/net/xdp/xsk.c > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb) > > > > sock_wfree(skb); > > > > } > > > > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, > > > > + struct xdp_desc *desc) > > > > +{ > > > > + u32 len, offset, copy, copied; > > > > + struct sk_buff *skb; > > > > + struct page *page; > > > > + void *buffer; > > > > + int err, i; > > > > + u64 addr; > > > > + > > > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err); > > > > + if (unlikely(!skb)) > > > > + return ERR_PTR(err); > > > > + > > > > + addr = desc->addr; > > > > + len = desc->len; > > > > + > > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr); > > > > + offset = offset_in_page(buffer); > > > > + addr = buffer - xs->pool->addrs; > > > > + > > > > + for (copied = 0, i = 0; copied < len; i++) { > > > > + page = xs->pool->umem->pgs[addr >> PAGE_SHIFT]; > > > > + > > > > + get_page(page); > > > > + > > > > + copy = min_t(u32, PAGE_SIZE - offset, len - copied); > > > > + > > > > + skb_fill_page_desc(skb, i, page, offset, copy); > > > > + > > > > + copied += copy; > > > > + addr += copy; > > > > + offset = 0; > > > > + } > > > > + > > > > + skb->len += len; > > > > + skb->data_len += len; > > > > > > > + skb->truesize += len; > > > > > > 
This is not the truesize, unfortunately. > > > > > > We need to account for the number of pages, not number of bytes. > > > > The easiest solution is: > > > > skb->truesize += PAGE_SIZE * i; > > > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop. > > Oops, pls ignore this. I forgot that XSK buffers are not > "one per page". > We need to count the number of pages manually and then do > > skb->truesize += PAGE_SIZE * npages; > > Right. There are two possible packet buffer (chunks) sizes in a umem, 2K and 4K on a system with a PAGE_SIZE of 4K. If I remember correctly, and please correct me if wrong, truesize is used for memory accounting. But in this code, no kernel memory has been allocated (apart from the skb). The page is just a part of the umem that has been already allocated beforehand and by user-space in this case. So what should truesize be in this case? Do we add 0, chunk_size * i, or the complicated case of counting exactly how many 4K pages that are used when the chunk_size is 2K, as two chunks could occupy the same page, or just the upper bound of PAGE_SIZE * i that is likely a good approximation in most cases? Just note that there might be other uses of truesize that I am unaware of that could impact this choice. > > > > + > > > > + refcount_add(len, &xs->sk.sk_wmem_alloc); > > > > + > > > > + return skb; > > > > +} > > > > + > > > > Al > > Thanks, > Al > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page 2021-01-22 12:18 ` Magnus Karlsson @ 2021-01-22 12:39 ` Alexander Lobakin 2021-01-22 12:55 ` Magnus Karlsson 0 siblings, 1 reply; 13+ messages in thread From: Alexander Lobakin @ 2021-01-22 12:39 UTC (permalink / raw) To: Magnus Karlsson Cc: Alexander Lobakin, Eric Dumazet, Xuan Zhuo, Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski, Bjorn Topel, Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, virtualization, bpf, Network Development, open list From: Magnus Karlsson <magnus.karlsson@gmail.com> Date: Fri, 22 Jan 2021 13:18:47 +0100 > On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin <alobakin@pm.me> wrote: > > > > From: Alexander Lobakin <alobakin@pm.me> > > Date: Fri, 22 Jan 2021 11:47:45 +0000 > > > > > From: Eric Dumazet <eric.dumazet@gmail.com> > > > Date: Thu, 21 Jan 2021 16:41:33 +0100 > > > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote: > > > > > This patch is used to construct skb based on page to save memory copy > > > > > overhead. > > > > > > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the > > > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to > > > > > directly construct skb. If this feature is not supported, it is still > > > > > necessary to copy data to construct skb. > > > > > > > > > > ---------------- Performance Testing ------------ > > > > > > > > > > The test environment is Aliyun ECS server. 
> > > > > Test cmd: > > > > > ``` > > > > > xdpsock -i eth0 -t -S -s <msg size> > > > > > ``` > > > > > > > > > > Test result data: > > > > > > > > > > size 64 512 1024 1500 > > > > > copy 1916747 1775988 1600203 1440054 > > > > > page 1974058 1953655 1945463 1904478 > > > > > percent 3.0% 10.0% 21.58% 32.3% > > > > > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > > > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com> > > > > > --- > > > > > net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++---------- > > > > > 1 file changed, 86 insertions(+), 18 deletions(-) > > > > > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > > > > > index 4a83117..38af7f1 100644 > > > > > --- a/net/xdp/xsk.c > > > > > +++ b/net/xdp/xsk.c > > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb) > > > > > sock_wfree(skb); > > > > > } > > > > > > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, > > > > > + struct xdp_desc *desc) > > > > > +{ > > > > > + u32 len, offset, copy, copied; > > > > > + struct sk_buff *skb; > > > > > + struct page *page; > > > > > + void *buffer; > > > > > + int err, i; > > > > > + u64 addr; > > > > > + > > > > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err); > > > > > + if (unlikely(!skb)) > > > > > + return ERR_PTR(err); > > > > > + > > > > > + addr = desc->addr; > > > > > + len = desc->len; > > > > > + > > > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr); > > > > > + offset = offset_in_page(buffer); > > > > > + addr = buffer - xs->pool->addrs; > > > > > + > > > > > + for (copied = 0, i = 0; copied < len; i++) { > > > > > + page = xs->pool->umem->pgs[addr >> PAGE_SHIFT]; > > > > > + > > > > > + get_page(page); > > > > > + > > > > > + copy = min_t(u32, PAGE_SIZE - offset, len - copied); > > > > > + > > > > > + skb_fill_page_desc(skb, i, page, offset, copy); > > > > > + > > > > > + copied += copy; > > > > > + addr += copy; > > > > > + offset = 0; > > > > > + } > 
> > > > + > > > > > + skb->len += len; > > > > > + skb->data_len += len; > > > > > > > > > + skb->truesize += len; > > > > > > > > This is not the truesize, unfortunately. > > > > > > > > We need to account for the number of pages, not number of bytes. > > > > > > The easiest solution is: > > > > > > skb->truesize += PAGE_SIZE * i; > > > > > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop. > > > > Oops, pls ignore this. I forgot that XSK buffers are not > > "one per page". > > We need to count the number of pages manually and then do > > > > skb->truesize += PAGE_SIZE * npages; > > > > Right. > > There are two possible packet buffer (chunks) sizes in a umem, 2K and > 4K on a system with a PAGE_SIZE of 4K. If I remember correctly, and > please correct me if wrong, truesize is used for memory accounting. > But in this code, no kernel memory has been allocated (apart from the > skb). The page is just a part of the umem that has been already > allocated beforehand and by user-space in this case. So what should > truesize be in this case? Do we add 0, chunk_size * i, or the > complicated case of counting exactly how many 4K pages that are used > when the chunk_size is 2K, as two chunks could occupy the same page, > or just the upper bound of PAGE_SIZE * i that is likely a good > approximation in most cases? Just note that there might be other uses > of truesize that I am unaware of that could impact this choice. Truesize is "what amount of memory does this skb occupy with all its fragments, linear space and struct sk_buff itself". The closest it will be to the actual value, the better. In this case, I think adding of chunk_size * i would be enough. 
(PAGE_SIZE * i can be overwhelming when chunk_size is 2K, especially for setups with PAGE_SIZE > SZ_4K) > > > > > + > > > > > + refcount_add(len, &xs->sk.sk_wmem_alloc); > > > > > + > > > > > + return skb; > > > > > +} > > > > > + > > > > > > Al > > > > Thanks, > > Al Al ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v3 3/3] xsk: build skb by page 2021-01-22 12:39 ` Alexander Lobakin @ 2021-01-22 12:55 ` Magnus Karlsson 0 siblings, 0 replies; 13+ messages in thread From: Magnus Karlsson @ 2021-01-22 12:55 UTC (permalink / raw) To: Alexander Lobakin Cc: Eric Dumazet, Xuan Zhuo, Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski, Bjorn Topel, Magnus Karlsson, Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, virtualization, bpf, Network Development, open list On Fri, Jan 22, 2021 at 1:39 PM Alexander Lobakin <alobakin@pm.me> wrote: > > From: Magnus Karlsson <magnus.karlsson@gmail.com> > Date: Fri, 22 Jan 2021 13:18:47 +0100 > > > On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin <alobakin@pm.me> wrote: > > > > > > From: Alexander Lobakin <alobakin@pm.me> > > > Date: Fri, 22 Jan 2021 11:47:45 +0000 > > > > > > > From: Eric Dumazet <eric.dumazet@gmail.com> > > > > Date: Thu, 21 Jan 2021 16:41:33 +0100 > > > > > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote: > > > > > > This patch is used to construct skb based on page to save memory copy > > > > > > overhead. > > > > > > > > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the > > > > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to > > > > > > directly construct skb. If this feature is not supported, it is still > > > > > > necessary to copy data to construct skb. > > > > > > > > > > > > ---------------- Performance Testing ------------ > > > > > > > > > > > > The test environment is Aliyun ECS server. 
> > > > > > Test cmd: > > > > > > ``` > > > > > > xdpsock -i eth0 -t -S -s <msg size> > > > > > > ``` > > > > > > > > > > > > Test result data: > > > > > > > > > > > > size 64 512 1024 1500 > > > > > > copy 1916747 1775988 1600203 1440054 > > > > > > page 1974058 1953655 1945463 1904478 > > > > > > percent 3.0% 10.0% 21.58% 32.3% > > > > > > > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > > > > > > Reviewed-by: Dust Li <dust.li@linux.alibaba.com> > > > > > > --- > > > > > > net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++---------- > > > > > > 1 file changed, 86 insertions(+), 18 deletions(-) > > > > > > > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > > > > > > index 4a83117..38af7f1 100644 > > > > > > --- a/net/xdp/xsk.c > > > > > > +++ b/net/xdp/xsk.c > > > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb) > > > > > > sock_wfree(skb); > > > > > > } > > > > > > > > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs, > > > > > > + struct xdp_desc *desc) > > > > > > +{ > > > > > > + u32 len, offset, copy, copied; > > > > > > + struct sk_buff *skb; > > > > > > + struct page *page; > > > > > > + void *buffer; > > > > > > + int err, i; > > > > > > + u64 addr; > > > > > > + > > > > > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err); > > > > > > + if (unlikely(!skb)) > > > > > > + return ERR_PTR(err); > > > > > > + > > > > > > + addr = desc->addr; > > > > > > + len = desc->len; > > > > > > + > > > > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr); > > > > > > + offset = offset_in_page(buffer); > > > > > > + addr = buffer - xs->pool->addrs; > > > > > > + > > > > > > + for (copied = 0, i = 0; copied < len; i++) { > > > > > > + page = xs->pool->umem->pgs[addr >> PAGE_SHIFT]; > > > > > > + > > > > > > + get_page(page); > > > > > > + > > > > > > + copy = min_t(u32, PAGE_SIZE - offset, len - copied); > > > > > > + > > > > > > + skb_fill_page_desc(skb, i, page, offset, 
copy); > > > > > > + > > > > > > + copied += copy; > > > > > > + addr += copy; > > > > > > + offset = 0; > > > > > > + } > > > > > > + > > > > > > + skb->len += len; > > > > > > + skb->data_len += len; > > > > > > > > > > > + skb->truesize += len; > > > > > > > > > > This is not the truesize, unfortunately. > > > > > > > > > > We need to account for the number of pages, not number of bytes. > > > > > > > > The easiest solution is: > > > > > > > > skb->truesize += PAGE_SIZE * i; > > > > > > > > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop. > > > > > > Oops, pls ignore this. I forgot that XSK buffers are not > > > "one per page". > > > We need to count the number of pages manually and then do > > > > > > skb->truesize += PAGE_SIZE * npages; > > > > > > Right. > > > > There are two possible packet buffer (chunks) sizes in a umem, 2K and > > 4K on a system with a PAGE_SIZE of 4K. If I remember correctly, and > > please correct me if wrong, truesize is used for memory accounting. > > But in this code, no kernel memory has been allocated (apart from the > > skb). The page is just a part of the umem that has been already > > allocated beforehand and by user-space in this case. So what should > > truesize be in this case? Do we add 0, chunk_size * i, or the > > complicated case of counting exactly how many 4K pages that are used > > when the chunk_size is 2K, as two chunks could occupy the same page, > > or just the upper bound of PAGE_SIZE * i that is likely a good > > approximation in most cases? Just note that there might be other uses > > of truesize that I am unaware of that could impact this choice. > > Truesize is "what amount of memory does this skb occupy with all its > fragments, linear space and struct sk_buff itself". The closest it > will be to the actual value, the better. > In this case, I think adding of chunk_size * i would be enough. Sounds like a good approximation to me. 
> (PAGE_SIZE * i can be overwhelming when chunk_size is 2K, especially > for setups with PAGE_SIZE > SZ_4K) You are right. That would be quite horrible on a system with a page size of 64K. > > > > > > + > > > > > > + refcount_add(len, &xs->sk.sk_wmem_alloc); > > > > > > + > > > > > > + return skb; > > > > > > +} > > > > > > + > > > > > > > > Al > > > > > > Thanks, > > > Al > > Al > ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2021-01-22 12:56 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-01-21 13:47 [PATCH bpf-next v3 0/3] xsk: build skb by page Xuan Zhuo 2021-01-21 13:47 ` [PATCH bpf-next v3 1/3] net: add priv_flags for allow tx skb without linear Xuan Zhuo 2021-01-21 15:13 ` Alexander Lobakin 2021-01-21 13:47 ` [PATCH bpf-next v3 2/3] virtio-net: support IFF_TX_SKB_NO_LINEAR Xuan Zhuo 2021-01-21 13:47 ` [PATCH bpf-next v3 3/3] xsk: build skb by page Xuan Zhuo 2021-01-21 15:17 ` Magnus Karlsson 2021-01-21 15:41 ` Eric Dumazet 2021-01-22 11:47 ` Alexander Lobakin 2021-01-22 11:55 ` Alexander Lobakin 2021-01-22 12:08 ` Alexander Lobakin 2021-01-22 12:18 ` Magnus Karlsson 2021-01-22 12:39 ` Alexander Lobakin 2021-01-22 12:55 ` Magnus Karlsson