bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v8 bpf-next 0/5] xsk: build skb by page (aka generic zerocopy xmit)
@ 2021-02-18 20:49 Alexander Lobakin
  2021-02-18 20:49 ` [PATCH v8 bpf-next 1/5] netdevice: add missing IFF_PHONY_HEADROOM self-definition Alexander Lobakin
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Alexander Lobakin @ 2021-02-18 20:49 UTC (permalink / raw)
  To: Daniel Borkmann, Magnus Karlsson
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Jonathan Lemon, Alexei Starovoitov, Björn Töpel,
	Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Paolo Abeni,
	Eric Dumazet, Xuan Zhuo, Dust Li, Alexander Lobakin,
	virtualization, netdev, linux-kernel, bpf

This series introduces XSK generic zerocopy xmit by adding XSK umem
pages as skb frags instead of copying data to linear space.
The only requirement for this for drivers is to be able to xmit skbs
with skb_headlen(skb) == 0, i.e. all data including hard headers
starts from frag 0.
To indicate whether a particular driver supports this, a new netdev
priv flag, IFF_TX_SKB_NO_LINEAR, is added (and declared in virtio_net
as it's already capable of doing it). So consider implementing this
in your drivers to greatly speed-up generic XSK xmit.

The first bit adds missing IFF self-definition. It's a bit out, but
"while we are here".
The fourth patch adds headroom and tailroom reservations for the
allocated skbs on XSK generic xmit path. This ensures there won't
be any unwanted skb reallocations on fast-path due to headroom and/or
tailroom driver/device requirements (own headers/descriptors etc.).
The other three add a new private flag, declare it in virtio_net
driver and introduce generic XSK zerocopy xmit itself.

The main body of work is created and done by Xuan Zhuo. His original
cover letter:

v3:
    Optimized code

v2:
    1. add priv_flags IFF_TX_SKB_NO_LINEAR instead of netdev_feature
    2. split the patch to three:
        a. add priv_flags IFF_TX_SKB_NO_LINEAR
        b. virtio net add priv_flags IFF_TX_SKB_NO_LINEAR
        c. When there is support this flag, construct skb without linear
           space
    3. use ERR_PTR() and PTR_ERR() to handle the err

v1 message log:
---------------

This patch is used to construct skb based on page to save memory copy
overhead.

This has one problem:

We construct the skb by fill the data page as a frag into the skb. In
this way, the linear space is empty, and the header information is also
in the frag, not in the linear space, which is not allowed for some
network cards. For example, Mellanox Technologies MT27710 Family
[ConnectX-4 Lx] will get the following error message:

    mlx5_core 0000:3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8,
    qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68
    00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2
    WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64
    00000000: 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00
    00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00
    00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    mlx5_core 0000:3b:00.1 eth1: ERR CQE on SQ: 0x1dbb

I also tried to use build_skb to construct skb, but because of the
existence of skb_shinfo, it must be behind the linear space, so this
method is not working. We can't put skb_shinfo on desc->addr, it will be
exposed to users, this is not safe.

Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether the
network card supports the header information of the packet in the frag
and not in the linear space.

---------------- Performance Testing ------------

The test environment is Aliyun ECS server.
Test cmd:
```
xdpsock -i eth0 -t  -S -s <msg size>
```

Test result data:

size    64      512     1024    1500
copy    1916747 1775988 1600203 1440054
page    1974058 1953655 1945463 1904478
percent 3.0%    10.0%   21.58%  32.3%

From v7 [4]:
 - drop netdev priv flags rework (will be issued separately);
 - pick up Acks from John.

From v6 [3]:
 - rebase ontop of bpf-next after merge with net-next;
 - address kdoc warnings.

From v5 [2]:
 - fix a refcount leak in 0006 introduced in v4.

From v4 [1]:
 - fix 0002 build error due to inverted static_assert() condition
   (0day bot);
 - collect two Acked-bys (Magnus).

From v3 [0]:
 - refactor netdev_priv_flags to make it easier to add new ones and
   prevent bitwidth overflow;
 - add headroom (both standard and zerocopy) and tailroom (standard)
   reservation in skb for drivers to avoid potential reallocations;
 - fix skb->truesize accounting;
 - misc comment rewords.

[0] https://lore.kernel.org/netdev/cover.1611236588.git.xuanzhuo@linux.alibaba.com
[1] https://lore.kernel.org/netdev/20210216113740.62041-1-alobakin@pm.me
[2] https://lore.kernel.org/netdev/20210216143333.5861-1-alobakin@pm.me
[3] https://lore.kernel.org/netdev/20210216172640.374487-1-alobakin@pm.me
[4] https://lore.kernel.org/netdev/20210217120003.7938-1-alobakin@pm.me

Alexander Lobakin (2):
  netdevice: add missing IFF_PHONY_HEADROOM self-definition
  xsk: respect device's headroom and tailroom on generic xmit path

Xuan Zhuo (3):
  net: add priv_flags for allow tx skb without linear
  virtio-net: support IFF_TX_SKB_NO_LINEAR
  xsk: build skb by page (aka generic zerocopy xmit)

 drivers/net/virtio_net.c  |   3 +-
 include/linux/netdevice.h |   5 ++
 net/xdp/xsk.c             | 114 ++++++++++++++++++++++++++++++++------
 3 files changed, 103 insertions(+), 19 deletions(-)

--
2.30.1



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v8 bpf-next 1/5] netdevice: add missing IFF_PHONY_HEADROOM self-definition
  2021-02-18 20:49 [PATCH v8 bpf-next 0/5] xsk: build skb by page (aka generic zerocopy xmit) Alexander Lobakin
@ 2021-02-18 20:49 ` Alexander Lobakin
  2021-02-18 20:50 ` [PATCH v8 bpf-next 2/5] net: add priv_flags for allow tx skb without linear Alexander Lobakin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Alexander Lobakin @ 2021-02-18 20:49 UTC (permalink / raw)
  To: Daniel Borkmann, Magnus Karlsson
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Jonathan Lemon, Alexei Starovoitov, Björn Töpel,
	Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Paolo Abeni,
	Eric Dumazet, Xuan Zhuo, Dust Li, Alexander Lobakin,
	virtualization, netdev, linux-kernel, bpf

This is harmless for now, but can be fatal for future refactors.

Fixes: 871b642adebe3 ("netdev: introduce ndo_set_rx_headroom")
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/netdevice.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ddf4cfc12615..3b6f82c2c271 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1577,6 +1577,7 @@ enum netdev_priv_flags {
 #define IFF_L3MDEV_SLAVE		IFF_L3MDEV_SLAVE
 #define IFF_TEAM			IFF_TEAM
 #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
+#define IFF_PHONY_HEADROOM		IFF_PHONY_HEADROOM
 #define IFF_MACSEC			IFF_MACSEC
 #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
 #define IFF_FAILOVER			IFF_FAILOVER
--
2.30.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v8 bpf-next 2/5] net: add priv_flags for allow tx skb without linear
  2021-02-18 20:49 [PATCH v8 bpf-next 0/5] xsk: build skb by page (aka generic zerocopy xmit) Alexander Lobakin
  2021-02-18 20:49 ` [PATCH v8 bpf-next 1/5] netdevice: add missing IFF_PHONY_HEADROOM self-definition Alexander Lobakin
@ 2021-02-18 20:50 ` Alexander Lobakin
  2021-02-18 20:50 ` [PATCH v8 bpf-next 3/5] virtio-net: support IFF_TX_SKB_NO_LINEAR Alexander Lobakin
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Alexander Lobakin @ 2021-02-18 20:50 UTC (permalink / raw)
  To: Daniel Borkmann, Magnus Karlsson
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Jonathan Lemon, Alexei Starovoitov, Björn Töpel,
	Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Paolo Abeni,
	Eric Dumazet, Xuan Zhuo, Dust Li, Alexander Lobakin,
	virtualization, netdev, linux-kernel, bpf

From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>

In some cases, we hope to construct skb directly based on the existing
memory without copying data. In this case, the page will be placed
directly in the skb, and the linear space of skb is empty. But
unfortunately, many the network card does not support this operation.
For example Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will
get the following error message:

    mlx5_core 0000:3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8,
    qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68
    00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2
    WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64
    00000000: 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00
    00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00
    00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    mlx5_core 0000:3b:00.1 eth1: ERR CQE on SQ: 0x1dbb

So a priv_flag is added here to indicate whether the network card
supports this feature.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Suggested-by: Alexander Lobakin <alobakin@pm.me>
[ alobakin: give a new flag more detailed description ]
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/netdevice.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3b6f82c2c271..6cef47b76cc6 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1518,6 +1518,8 @@ struct net_device_ops {
  * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
  * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device
  * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running
+ * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with
+ *	skb_headlen(skb) == 0 (data starts from frag0)
  */
 enum netdev_priv_flags {
 	IFF_802_1Q_VLAN			= 1<<0,
@@ -1551,6 +1553,7 @@ enum netdev_priv_flags {
 	IFF_FAILOVER_SLAVE		= 1<<28,
 	IFF_L3MDEV_RX_HANDLER		= 1<<29,
 	IFF_LIVE_RENAME_OK		= 1<<30,
+	IFF_TX_SKB_NO_LINEAR		= 1<<31,
 };

 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1584,6 +1587,7 @@ enum netdev_priv_flags {
 #define IFF_FAILOVER_SLAVE		IFF_FAILOVER_SLAVE
 #define IFF_L3MDEV_RX_HANDLER		IFF_L3MDEV_RX_HANDLER
 #define IFF_LIVE_RENAME_OK		IFF_LIVE_RENAME_OK
+#define IFF_TX_SKB_NO_LINEAR		IFF_TX_SKB_NO_LINEAR

 /**
  *	struct net_device - The DEVICE structure.
--
2.30.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v8 bpf-next 3/5] virtio-net: support IFF_TX_SKB_NO_LINEAR
  2021-02-18 20:49 [PATCH v8 bpf-next 0/5] xsk: build skb by page (aka generic zerocopy xmit) Alexander Lobakin
  2021-02-18 20:49 ` [PATCH v8 bpf-next 1/5] netdevice: add missing IFF_PHONY_HEADROOM self-definition Alexander Lobakin
  2021-02-18 20:50 ` [PATCH v8 bpf-next 2/5] net: add priv_flags for allow tx skb without linear Alexander Lobakin
@ 2021-02-18 20:50 ` Alexander Lobakin
  2021-02-18 20:50 ` [PATCH v8 bpf-next 4/5] xsk: respect device's headroom and tailroom on generic xmit path Alexander Lobakin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Alexander Lobakin @ 2021-02-18 20:50 UTC (permalink / raw)
  To: Daniel Borkmann, Magnus Karlsson
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Jonathan Lemon, Alexei Starovoitov, Björn Töpel,
	Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Paolo Abeni,
	Eric Dumazet, Xuan Zhuo, Dust Li, Alexander Lobakin,
	virtualization, netdev, linux-kernel, bpf

From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>

Virtio net supports the case where the skb linear space is empty, so add
priv_flags.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 drivers/net/virtio_net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ba8e63792549..f2ff6c3906c1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2972,7 +2972,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 		return -ENOMEM;

 	/* Set up network device as normal. */
-	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE;
+	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE |
+			   IFF_TX_SKB_NO_LINEAR;
 	dev->netdev_ops = &virtnet_netdev;
 	dev->features = NETIF_F_HIGHDMA;

--
2.30.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v8 bpf-next 4/5] xsk: respect device's headroom and tailroom on generic xmit path
  2021-02-18 20:49 [PATCH v8 bpf-next 0/5] xsk: build skb by page (aka generic zerocopy xmit) Alexander Lobakin
                   ` (2 preceding siblings ...)
  2021-02-18 20:50 ` [PATCH v8 bpf-next 3/5] virtio-net: support IFF_TX_SKB_NO_LINEAR Alexander Lobakin
@ 2021-02-18 20:50 ` Alexander Lobakin
  2021-02-18 20:50 ` [PATCH v8 bpf-next 5/5] xsk: build skb by page (aka generic zerocopy xmit) Alexander Lobakin
  2021-02-25  0:46 ` [PATCH v8 bpf-next 0/5] " Daniel Borkmann
  5 siblings, 0 replies; 7+ messages in thread
From: Alexander Lobakin @ 2021-02-18 20:50 UTC (permalink / raw)
  To: Daniel Borkmann, Magnus Karlsson
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Jonathan Lemon, Alexei Starovoitov, Björn Töpel,
	Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Paolo Abeni,
	Eric Dumazet, Xuan Zhuo, Dust Li, Alexander Lobakin,
	virtualization, netdev, linux-kernel, bpf

xsk_generic_xmit() allocates a new skb and then queues it for
xmitting. The size of new skb's headroom is desc->len, so it comes
to the driver/device with no reserved headroom and/or tailroom.
Lots of drivers need some headroom (and sometimes tailroom) to
prepend (and/or append) some headers or data, e.g. CPU tags,
device-specific headers/descriptors (LSO, TLS etc.), and if case
of no available space skb_cow_head() will reallocate the skb.
Reallocations are unwanted on fast-path, especially when it comes
to XDP, so generic XSK xmit should reserve the spaces declared in
dev->needed_headroom and dev->needed tailroom to avoid them.

Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)):

Usually, output functions reserve LL_RESERVED_SPACE(dev), which
consists of dev->hard_header_len + dev->needed_headroom, aligned
by 16.
However, on XSK xmit hard header is already here in the chunk, so
hard_header_len is not needed. But it'd still be better to align
data up to cacheline, while reserving no less than driver requests
for headroom. NET_SKB_PAD here is to double-insure there will be
no reallocations even when the driver advertises no needed_headroom,
but in fact need it (not so rare case).

Fixes: 35fcde7f8deb ("xsk: support for Tx")
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 net/xdp/xsk.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4faabd1ecfd1..143979ea4165 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -454,12 +454,16 @@ static int xsk_generic_xmit(struct sock *sk)
 	struct sk_buff *skb;
 	unsigned long flags;
 	int err = 0;
+	u32 hr, tr;

 	mutex_lock(&xs->mutex);

 	if (xs->queue_id >= xs->dev->real_num_tx_queues)
 		goto out;

+	hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
+	tr = xs->dev->needed_tailroom;
+
 	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
 		char *buffer;
 		u64 addr;
@@ -471,11 +475,13 @@ static int xsk_generic_xmit(struct sock *sk)
 		}

 		len = desc.len;
-		skb = sock_alloc_send_skb(sk, len, 1, &err);
+		skb = sock_alloc_send_skb(sk, hr + len + tr, 1, &err);
 		if (unlikely(!skb))
 			goto out;

+		skb_reserve(skb, hr);
 		skb_put(skb, len);
+
 		addr = desc.addr;
 		buffer = xsk_buff_raw_get_data(xs->pool, addr);
 		err = skb_store_bits(skb, 0, buffer, len);
--
2.30.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v8 bpf-next 5/5] xsk: build skb by page (aka generic zerocopy xmit)
  2021-02-18 20:49 [PATCH v8 bpf-next 0/5] xsk: build skb by page (aka generic zerocopy xmit) Alexander Lobakin
                   ` (3 preceding siblings ...)
  2021-02-18 20:50 ` [PATCH v8 bpf-next 4/5] xsk: respect device's headroom and tailroom on generic xmit path Alexander Lobakin
@ 2021-02-18 20:50 ` Alexander Lobakin
  2021-02-25  0:46 ` [PATCH v8 bpf-next 0/5] " Daniel Borkmann
  5 siblings, 0 replies; 7+ messages in thread
From: Alexander Lobakin @ 2021-02-18 20:50 UTC (permalink / raw)
  To: Daniel Borkmann, Magnus Karlsson
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Jonathan Lemon, Alexei Starovoitov, Björn Töpel,
	Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Paolo Abeni,
	Eric Dumazet, Xuan Zhuo, Dust Li, Alexander Lobakin,
	virtualization, netdev, linux-kernel, bpf

From: Xuan Zhuo <xuanzhuo@linux.alibaba.com>

This patch is used to construct skb based on page to save memory copy
overhead.

This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to
directly construct skb. If this feature is not supported, it is still
necessary to copy data to construct skb.

---------------- Performance Testing ------------

The test environment is Aliyun ECS server.
Test cmd:
```
xdpsock -i eth0 -t  -S -s <msg size>
```

Test result data:

size    64      512     1024    1500
copy    1916747 1775988 1600203 1440054
page    1974058 1953655 1945463 1904478
percent 3.0%    10.0%   21.58%  32.3%

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
[ alobakin:
 - expand subject to make it clearer;
 - improve skb->truesize calculation;
 - reserve some headroom in skb for drivers;
 - tailroom is not needed as skb is non-linear ]
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 net/xdp/xsk.c | 120 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 96 insertions(+), 24 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 143979ea4165..a71ed664da0a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -445,6 +445,97 @@ static void xsk_destruct_skb(struct sk_buff *skb)
 	sock_wfree(skb);
 }

+static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
+					      struct xdp_desc *desc)
+{
+	struct xsk_buff_pool *pool = xs->pool;
+	u32 hr, len, ts, offset, copy, copied;
+	struct sk_buff *skb;
+	struct page *page;
+	void *buffer;
+	int err, i;
+	u64 addr;
+
+	hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
+
+	skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err);
+	if (unlikely(!skb))
+		return ERR_PTR(err);
+
+	skb_reserve(skb, hr);
+
+	addr = desc->addr;
+	len = desc->len;
+	ts = pool->unaligned ? len : pool->chunk_size;
+
+	buffer = xsk_buff_raw_get_data(pool, addr);
+	offset = offset_in_page(buffer);
+	addr = buffer - pool->addrs;
+
+	for (copied = 0, i = 0; copied < len; i++) {
+		page = pool->umem->pgs[addr >> PAGE_SHIFT];
+		get_page(page);
+
+		copy = min_t(u32, PAGE_SIZE - offset, len - copied);
+		skb_fill_page_desc(skb, i, page, offset, copy);
+
+		copied += copy;
+		addr += copy;
+		offset = 0;
+	}
+
+	skb->len += len;
+	skb->data_len += len;
+	skb->truesize += ts;
+
+	refcount_add(ts, &xs->sk.sk_wmem_alloc);
+
+	return skb;
+}
+
+static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
+				     struct xdp_desc *desc)
+{
+	struct net_device *dev = xs->dev;
+	struct sk_buff *skb;
+
+	if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
+		skb = xsk_build_skb_zerocopy(xs, desc);
+		if (IS_ERR(skb))
+			return skb;
+	} else {
+		u32 hr, tr, len;
+		void *buffer;
+		int err;
+
+		hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
+		tr = dev->needed_tailroom;
+		len = desc->len;
+
+		skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
+		if (unlikely(!skb))
+			return ERR_PTR(err);
+
+		skb_reserve(skb, hr);
+		skb_put(skb, len);
+
+		buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
+		err = skb_store_bits(skb, 0, buffer, len);
+		if (unlikely(err)) {
+			kfree_skb(skb);
+			return ERR_PTR(err);
+		}
+	}
+
+	skb->dev = dev;
+	skb->priority = xs->sk.sk_priority;
+	skb->mark = xs->sk.sk_mark;
+	skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr;
+	skb->destructor = xsk_destruct_skb;
+
+	return skb;
+}
+
 static int xsk_generic_xmit(struct sock *sk)
 {
 	struct xdp_sock *xs = xdp_sk(sk);
@@ -454,56 +545,37 @@ static int xsk_generic_xmit(struct sock *sk)
 	struct sk_buff *skb;
 	unsigned long flags;
 	int err = 0;
-	u32 hr, tr;

 	mutex_lock(&xs->mutex);

 	if (xs->queue_id >= xs->dev->real_num_tx_queues)
 		goto out;

-	hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
-	tr = xs->dev->needed_tailroom;
-
 	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
-		char *buffer;
-		u64 addr;
-		u32 len;
-
 		if (max_batch-- == 0) {
 			err = -EAGAIN;
 			goto out;
 		}

-		len = desc.len;
-		skb = sock_alloc_send_skb(sk, hr + len + tr, 1, &err);
-		if (unlikely(!skb))
+		skb = xsk_build_skb(xs, &desc);
+		if (IS_ERR(skb)) {
+			err = PTR_ERR(skb);
 			goto out;
+		}

-		skb_reserve(skb, hr);
-		skb_put(skb, len);
-
-		addr = desc.addr;
-		buffer = xsk_buff_raw_get_data(xs->pool, addr);
-		err = skb_store_bits(skb, 0, buffer, len);
 		/* This is the backpressure mechanism for the Tx path.
 		 * Reserve space in the completion queue and only proceed
 		 * if there is space in it. This avoids having to implement
 		 * any buffering in the Tx path.
 		 */
 		spin_lock_irqsave(&xs->pool->cq_lock, flags);
-		if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
+		if (xskq_prod_reserve(xs->pool->cq)) {
 			spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
 			kfree_skb(skb);
 			goto out;
 		}
 		spin_unlock_irqrestore(&xs->pool->cq_lock, flags);

-		skb->dev = xs->dev;
-		skb->priority = sk->sk_priority;
-		skb->mark = sk->sk_mark;
-		skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
-		skb->destructor = xsk_destruct_skb;
-
 		err = __dev_direct_xmit(skb, xs->queue_id);
 		if  (err == NETDEV_TX_BUSY) {
 			/* Tell user-space to retry the send */
--
2.30.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v8 bpf-next 0/5] xsk: build skb by page (aka generic zerocopy xmit)
  2021-02-18 20:49 [PATCH v8 bpf-next 0/5] xsk: build skb by page (aka generic zerocopy xmit) Alexander Lobakin
                   ` (4 preceding siblings ...)
  2021-02-18 20:50 ` [PATCH v8 bpf-next 5/5] xsk: build skb by page (aka generic zerocopy xmit) Alexander Lobakin
@ 2021-02-25  0:46 ` Daniel Borkmann
  5 siblings, 0 replies; 7+ messages in thread
From: Daniel Borkmann @ 2021-02-25  0:46 UTC (permalink / raw)
  To: Alexander Lobakin, Magnus Karlsson
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski,
	Jonathan Lemon, Alexei Starovoitov, Björn Töpel,
	Jesper Dangaard Brouer, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Paolo Abeni,
	Eric Dumazet, Xuan Zhuo, Dust Li, virtualization, netdev,
	linux-kernel, bpf

On 2/18/21 9:49 PM, Alexander Lobakin wrote:
> This series introduces XSK generic zerocopy xmit by adding XSK umem
> pages as skb frags instead of copying data to linear space.
> The only requirement for this for drivers is to be able to xmit skbs
> with skb_headlen(skb) == 0, i.e. all data including hard headers
> starts from frag 0.
> To indicate whether a particular driver supports this, a new netdev
> priv flag, IFF_TX_SKB_NO_LINEAR, is added (and declared in virtio_net
> as it's already capable of doing it). So consider implementing this
> in your drivers to greatly speed-up generic XSK xmit.
[...]

Applied, thanks!

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-02-25  0:48 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-18 20:49 [PATCH v8 bpf-next 0/5] xsk: build skb by page (aka generic zerocopy xmit) Alexander Lobakin
2021-02-18 20:49 ` [PATCH v8 bpf-next 1/5] netdevice: add missing IFF_PHONY_HEADROOM self-definition Alexander Lobakin
2021-02-18 20:50 ` [PATCH v8 bpf-next 2/5] net: add priv_flags for allow tx skb without linear Alexander Lobakin
2021-02-18 20:50 ` [PATCH v8 bpf-next 3/5] virtio-net: support IFF_TX_SKB_NO_LINEAR Alexander Lobakin
2021-02-18 20:50 ` [PATCH v8 bpf-next 4/5] xsk: respect device's headroom and tailroom on generic xmit path Alexander Lobakin
2021-02-18 20:50 ` [PATCH v8 bpf-next 5/5] xsk: build skb by page (aka generic zerocopy xmit) Alexander Lobakin
2021-02-25  0:46 ` [PATCH v8 bpf-next 0/5] " Daniel Borkmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).