* [PATCH bpf-next v2 00/10] xsk: Support UMEM chunk_size > PAGE_SIZE
@ 2023-03-29 18:04 Kal Conley
  2023-03-29 18:04 ` [PATCH bpf-next v2 01/10] selftests: xsk: Add xskxceiver.h dependency to Makefile Kal Conley
                   ` (9 more replies)
  0 siblings, 10 replies; 27+ messages in thread
From: Kal Conley @ 2023-03-29 18:04 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend
  Cc: Kal Conley, netdev, bpf

The main purpose of this patchset is to add AF_XDP support for UMEM
chunk sizes > PAGE_SIZE. This is enabled for UMEMs backed by HugeTLB
pages.

The patchset starts off with many fixes and improvements to the XSK
selftests. Next, an existing issue is fixed whereby unaligned
descriptors may overrun the end of the UMEM in certain cases. Finally,
AF_XDP support for UMEM chunk_size > PAGE_SIZE is added.
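
As a rough sketch of the user-facing result (illustrative only: the
8 KiB chunk size, the UMEM size, and the SOL_XDP fallback are example
choices rather than part of this patchset, and error handling is
omitted; xsk_fd is assumed to be an AF_XDP socket and hugepages must be
reserved, e.g. via vm.nr_hugepages):

    #include <linux/if_xdp.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    #ifndef SOL_XDP
    #define SOL_XDP 283
    #endif

    /* Register a hugepage-backed UMEM with 8 KiB chunks (> PAGE_SIZE). */
    static int register_big_chunk_umem(int xsk_fd)
    {
            size_t len = 512 * 8192; /* 4 MiB, hugepage aligned via mmap() */
            struct xdp_umem_reg mr = {};
            void *area;

            area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (area == MAP_FAILED)
                    return -1;

            mr.addr = (unsigned long)area;
            mr.len = len;
            mr.chunk_size = 8192; /* > PAGE_SIZE: needs the HugeTLB backing */
            return setsockopt(xsk_fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
    }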

v2 contains major improvements over v1. It should be both a joy to read
and review. :-)

Changes since v1:
  * Add many fixes/improvements to the XSK selftests.
  * Add check for unaligned descriptors that overrun UMEM.
  * Fix compile errors when CONFIG_HUGETLB_PAGE is not set.
  * Fix incorrect use of _Static_assert.
  * Update AF_XDP documentation.
  * Rename unaligned 9K frame size test.
  * Make xp_check_dma_contiguity less conservative.
  * Add more information to benchmark table.

Thanks to Magnus Karlsson for his initial feedback!

Kal Conley (10):
  selftests: xsk: Add xskxceiver.h dependency to Makefile
  selftests: xsk: Use correct UMEM size in testapp_invalid_desc
  selftests: xsk: Add test case for packets at end of UMEM
  selftests: xsk: Deflakify STATS_RX_DROPPED test
  selftests: xsk: Disable IPv6 on VETH1
  xsk: Add check for unaligned descriptors that overrun UMEM
  selftests: xsk: Add test UNALIGNED_INV_DESC_4K1_FRAME_SIZE
  xsk: Support UMEM chunk_size > PAGE_SIZE
  selftests: xsk: Use hugepages when umem->frame_size > PAGE_SIZE
  selftests: xsk: Add tests for 8K and 9K frame sizes

 Documentation/networking/af_xdp.rst      | 19 ++++--
 include/net/xdp_sock.h                   |  3 +
 include/net/xdp_sock_drv.h               | 12 ++++
 include/net/xsk_buff_pool.h              | 24 +++++---
 net/xdp/xdp_umem.c                       | 50 ++++++++++++---
 net/xdp/xsk_buff_pool.c                  | 34 +++++++----
 net/xdp/xsk_queue.h                      |  1 +
 tools/testing/selftests/bpf/Makefile     |  2 +-
 tools/testing/selftests/bpf/test_xsk.sh  |  1 +
 tools/testing/selftests/bpf/xskxceiver.c | 77 +++++++++++++++++++++---
 tools/testing/selftests/bpf/xskxceiver.h |  4 +-
 11 files changed, 183 insertions(+), 44 deletions(-)

-- 
2.39.2



* [PATCH bpf-next v2 01/10] selftests: xsk: Add xskxceiver.h dependency to Makefile
  2023-03-29 18:04 [PATCH bpf-next v2 00/10] xsk: Support UMEM chunk_size > PAGE_SIZE Kal Conley
@ 2023-03-29 18:04 ` Kal Conley
  2023-03-29 18:04 ` [PATCH bpf-next v2 02/10] selftests: xsk: Use correct UMEM size in testapp_invalid_desc Kal Conley
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Kal Conley @ 2023-03-29 18:04 UTC (permalink / raw)
  To: Andrii Nakryiko, Mykola Lysenko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer
  Cc: Kal Conley, bpf, linux-kselftest, linux-kernel, netdev

xskxceiver depends on xskxceiver.h, so tell make about it.

Signed-off-by: Kal Conley <kal.conley@dectris.com>
---
 tools/testing/selftests/bpf/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 4a8ef118fd9d..223be997f15d 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -612,7 +612,7 @@ $(OUTPUT)/test_verifier: test_verifier.c verifier/tests.h $(BPFOBJ) | $(OUTPUT)
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(filter %.a %.o %.c,$^) $(LDLIBS) -o $@
 
-$(OUTPUT)/xskxceiver: xskxceiver.c $(OUTPUT)/xsk.o $(OUTPUT)/xsk_xdp_progs.skel.h $(BPFOBJ) | $(OUTPUT)
+$(OUTPUT)/xskxceiver: xskxceiver.c xskxceiver.h $(OUTPUT)/xsk.o $(OUTPUT)/xsk_xdp_progs.skel.h $(BPFOBJ) | $(OUTPUT)
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(filter %.a %.o %.c,$^) $(LDLIBS) -o $@
 
-- 
2.39.2



* [PATCH bpf-next v2 02/10] selftests: xsk: Use correct UMEM size in testapp_invalid_desc
  2023-03-29 18:04 [PATCH bpf-next v2 00/10] xsk: Support UMEM chunk_size > PAGE_SIZE Kal Conley
  2023-03-29 18:04 ` [PATCH bpf-next v2 01/10] selftests: xsk: Add xskxceiver.h dependency to Makefile Kal Conley
@ 2023-03-29 18:04 ` Kal Conley
  2023-03-29 18:04 ` [PATCH bpf-next v2 03/10] selftests: xsk: Add test case for packets at end of UMEM Kal Conley
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Kal Conley @ 2023-03-29 18:04 UTC (permalink / raw)
  To: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Andrii Nakryiko, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan
  Cc: Kal Conley, netdev, bpf, linux-kselftest, linux-kernel

Avoid the UMEM_SIZE macro in testapp_invalid_desc since it is incorrect
when the frame size is not XSK_UMEM__DEFAULT_FRAME_SIZE. Also remove
the macro since it is no longer used.

Fixes: 909f0e28207c ("selftests: xsk: Add tests for 2K frame size")
Signed-off-by: Kal Conley <kal.conley@dectris.com>
---
 tools/testing/selftests/bpf/xskxceiver.c | 9 +++++----
 tools/testing/selftests/bpf/xskxceiver.h | 1 -
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/bpf/xskxceiver.c b/tools/testing/selftests/bpf/xskxceiver.c
index b65e0645b0cd..3956f5db84f3 100644
--- a/tools/testing/selftests/bpf/xskxceiver.c
+++ b/tools/testing/selftests/bpf/xskxceiver.c
@@ -1652,6 +1652,7 @@ static void testapp_single_pkt(struct test_spec *test)
 
 static void testapp_invalid_desc(struct test_spec *test)
 {
+	u64 umem_size = test->ifobj_tx->umem->num_frames * test->ifobj_tx->umem->frame_size;
 	struct pkt pkts[] = {
 		/* Zero packet address allowed */
 		{0, PKT_SIZE, 0, true},
@@ -1662,9 +1663,9 @@ static void testapp_invalid_desc(struct test_spec *test)
 		/* Packet too large */
 		{0x2000, XSK_UMEM__INVALID_FRAME_SIZE, 0, false},
 		/* After umem ends */
-		{UMEM_SIZE, PKT_SIZE, 0, false},
+		{umem_size, PKT_SIZE, 0, false},
 		/* Straddle the end of umem */
-		{UMEM_SIZE - PKT_SIZE / 2, PKT_SIZE, 0, false},
+		{umem_size - PKT_SIZE / 2, PKT_SIZE, 0, false},
 		/* Straddle a page boundrary */
 		{0x3000 - PKT_SIZE / 2, PKT_SIZE, 0, false},
 		/* Straddle a 2K boundrary */
@@ -1682,8 +1683,8 @@ static void testapp_invalid_desc(struct test_spec *test)
 	}
 
 	if (test->ifobj_tx->shared_umem) {
-		pkts[4].addr += UMEM_SIZE;
-		pkts[5].addr += UMEM_SIZE;
+		pkts[4].addr += umem_size;
+		pkts[5].addr += umem_size;
 	}
 
 	pkt_stream_generate_custom(test, pkts, ARRAY_SIZE(pkts));
diff --git a/tools/testing/selftests/bpf/xskxceiver.h b/tools/testing/selftests/bpf/xskxceiver.h
index bdb4efedf3a9..cc24ab72f3ff 100644
--- a/tools/testing/selftests/bpf/xskxceiver.h
+++ b/tools/testing/selftests/bpf/xskxceiver.h
@@ -53,7 +53,6 @@
 #define THREAD_TMOUT 3
 #define DEFAULT_PKT_CNT (4 * 1024)
 #define DEFAULT_UMEM_BUFFERS (DEFAULT_PKT_CNT / 4)
-#define UMEM_SIZE (DEFAULT_UMEM_BUFFERS * XSK_UMEM__DEFAULT_FRAME_SIZE)
 #define RX_FULL_RXQSIZE 32
 #define UMEM_HEADROOM_TEST_SIZE 128
 #define XSK_UMEM__INVALID_FRAME_SIZE (XSK_UMEM__DEFAULT_FRAME_SIZE + 1)
-- 
2.39.2



* [PATCH bpf-next v2 03/10] selftests: xsk: Add test case for packets at end of UMEM
  2023-03-29 18:04 [PATCH bpf-next v2 00/10] xsk: Support UMEM chunk_size > PAGE_SIZE Kal Conley
  2023-03-29 18:04 ` [PATCH bpf-next v2 01/10] selftests: xsk: Add xskxceiver.h dependency to Makefile Kal Conley
  2023-03-29 18:04 ` [PATCH bpf-next v2 02/10] selftests: xsk: Use correct UMEM size in testapp_invalid_desc Kal Conley
@ 2023-03-29 18:04 ` Kal Conley
  2023-03-29 18:04 ` [PATCH bpf-next v2 04/10] selftests: xsk: Deflakify STATS_RX_DROPPED test Kal Conley
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Kal Conley @ 2023-03-29 18:04 UTC (permalink / raw)
  To: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan
  Cc: Kal Conley, netdev, bpf, linux-kselftest, linux-kernel

Add test case to testapp_invalid_desc for valid packets at the end of
the UMEM.

Signed-off-by: Kal Conley <kal.conley@dectris.com>
---
 tools/testing/selftests/bpf/xskxceiver.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/xskxceiver.c b/tools/testing/selftests/bpf/xskxceiver.c
index 3956f5db84f3..34a1f32fe752 100644
--- a/tools/testing/selftests/bpf/xskxceiver.c
+++ b/tools/testing/selftests/bpf/xskxceiver.c
@@ -1662,6 +1662,8 @@ static void testapp_invalid_desc(struct test_spec *test)
 		{-2, PKT_SIZE, 0, false},
 		/* Packet too large */
 		{0x2000, XSK_UMEM__INVALID_FRAME_SIZE, 0, false},
+		/* Up to end of umem allowed */
+		{umem_size - PKT_SIZE, PKT_SIZE, 0, true},
 		/* After umem ends */
 		{umem_size, PKT_SIZE, 0, false},
 		/* Straddle the end of umem */
@@ -1675,16 +1677,17 @@ static void testapp_invalid_desc(struct test_spec *test)
 
 	if (test->ifobj_tx->umem->unaligned_mode) {
 		/* Crossing a page boundrary allowed */
-		pkts[6].valid = true;
+		pkts[7].valid = true;
 	}
 	if (test->ifobj_tx->umem->frame_size == XSK_UMEM__DEFAULT_FRAME_SIZE / 2) {
 		/* Crossing a 2K frame size boundrary not allowed */
-		pkts[7].valid = false;
+		pkts[8].valid = false;
 	}
 
 	if (test->ifobj_tx->shared_umem) {
 		pkts[4].addr += umem_size;
 		pkts[5].addr += umem_size;
+		pkts[6].addr += umem_size;
 	}
 
 	pkt_stream_generate_custom(test, pkts, ARRAY_SIZE(pkts));
-- 
2.39.2



* [PATCH bpf-next v2 04/10] selftests: xsk: Deflakify STATS_RX_DROPPED test
  2023-03-29 18:04 [PATCH bpf-next v2 00/10] xsk: Support UMEM chunk_size > PAGE_SIZE Kal Conley
                   ` (2 preceding siblings ...)
  2023-03-29 18:04 ` [PATCH bpf-next v2 03/10] selftests: xsk: Add test case for packets at end of UMEM Kal Conley
@ 2023-03-29 18:04 ` Kal Conley
  2023-04-03 10:54   ` Magnus Karlsson
  2023-03-29 18:04 ` [PATCH bpf-next v2 05/10] selftests: xsk: Disable IPv6 on VETH1 Kal Conley
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Kal Conley @ 2023-03-29 18:04 UTC (permalink / raw)
  To: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Andrii Nakryiko, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan
  Cc: Kal Conley, netdev, bpf, linux-kselftest, linux-kernel

Fix flaky STATS_RX_DROPPED test. The receiver calls getsockopt after
receiving the last (valid) packet which is not the final packet sent in
the test (valid and invalid packets are sent in alternating fashion with
the final packet being invalid). Since the last packet may or may not
have been dropped already, both outcomes must be allowed.
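
Concretely, with the default 4096-packet stream, 2048 of the packets
are invalid: getsockopt runs once the 2048th valid packet has been
received, at which point the final invalid packet may not have been
counted yet, so rx_dropped can legitimately be either 2048 or 2047.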

This issue could also be fixed by making sure the last packet sent is
valid. This alternative is left as an exercise to the reader (or the
benevolent maintainers of this file).

This problem was quite visible on certain setups. On one machine this
failure was observed 50% of the time.

Also, remove a redundant assignment of pkt_stream->nb_pkts. This field
is already initialized by __pkt_stream_alloc.

Fixes: 27e934bec35b ("selftests: xsk: make stat tests not spin on getsockopt")
Signed-off-by: Kal Conley <kal.conley@dectris.com>
---
 tools/testing/selftests/bpf/xskxceiver.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/xskxceiver.c b/tools/testing/selftests/bpf/xskxceiver.c
index 34a1f32fe752..1a4bdd5aa78c 100644
--- a/tools/testing/selftests/bpf/xskxceiver.c
+++ b/tools/testing/selftests/bpf/xskxceiver.c
@@ -633,7 +633,6 @@ static struct pkt_stream *pkt_stream_generate(struct xsk_umem_info *umem, u32 nb
 	if (!pkt_stream)
 		exit_with_error(ENOMEM);
 
-	pkt_stream->nb_pkts = nb_pkts;
 	for (i = 0; i < nb_pkts; i++) {
 		pkt_set(umem, &pkt_stream->pkts[i], (i % umem->num_frames) * umem->frame_size,
 			pkt_len);
@@ -1141,7 +1140,14 @@ static int validate_rx_dropped(struct ifobject *ifobject)
 	if (err)
 		return TEST_FAILURE;
 
-	if (stats.rx_dropped == ifobject->pkt_stream->nb_pkts / 2)
+	/* The receiver calls getsockopt after receiving the last (valid)
+	 * packet which is not the final packet sent in this test (valid and
+	 * invalid packets are sent in alternating fashion with the final
+	 * packet being invalid). Since the last packet may or may not have
+	 * been dropped already, both outcomes must be allowed.
+	 */
+	if (stats.rx_dropped == ifobject->pkt_stream->nb_pkts / 2 ||
+	    stats.rx_dropped == ifobject->pkt_stream->nb_pkts / 2 - 1)
 		return TEST_PASS;
 
 	return TEST_FAILURE;
-- 
2.39.2



* [PATCH bpf-next v2 05/10] selftests: xsk: Disable IPv6 on VETH1
  2023-03-29 18:04 [PATCH bpf-next v2 00/10] xsk: Support UMEM chunk_size > PAGE_SIZE Kal Conley
                   ` (3 preceding siblings ...)
  2023-03-29 18:04 ` [PATCH bpf-next v2 04/10] selftests: xsk: Deflakify STATS_RX_DROPPED test Kal Conley
@ 2023-03-29 18:04 ` Kal Conley
  2023-03-29 18:04 ` [PATCH bpf-next v2 06/10] xsk: Add check for unaligned descriptors that overrun UMEM Kal Conley
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Kal Conley @ 2023-03-29 18:04 UTC (permalink / raw)
  To: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Andrii Nakryiko, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan
  Cc: Kal Conley, netdev, bpf, linux-kselftest, linux-kernel

This change fixes flakiness in the BIDIRECTIONAL test:

    # [is_pkt_valid] expected length [60], got length [90]
    not ok 1 FAIL: SKB BUSY-POLL BIDIRECTIONAL

When IPv6 is enabled, the interface will periodically send MLDv1 and
MLDv2 packets. These packets can cause the BIDIRECTIONAL test to fail
since it uses VETH0 for RX.

For other tests, this was not a problem since they only receive on VETH1
and IPv6 was already disabled on VETH0.

Signed-off-by: Kal Conley <kal.conley@dectris.com>
---
 tools/testing/selftests/bpf/test_xsk.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/bpf/test_xsk.sh b/tools/testing/selftests/bpf/test_xsk.sh
index b077cf58f825..377fb157a57c 100755
--- a/tools/testing/selftests/bpf/test_xsk.sh
+++ b/tools/testing/selftests/bpf/test_xsk.sh
@@ -116,6 +116,7 @@ setup_vethPairs() {
 	ip link add ${VETH0} numtxqueues 4 numrxqueues 4 type veth peer name ${VETH1} numtxqueues 4 numrxqueues 4
 	if [ -f /proc/net/if_inet6 ]; then
 		echo 1 > /proc/sys/net/ipv6/conf/${VETH0}/disable_ipv6
+		echo 1 > /proc/sys/net/ipv6/conf/${VETH1}/disable_ipv6
 	fi
 	if [[ $verbose -eq 1 ]]; then
 	        echo "setting up ${VETH1}"
-- 
2.39.2



* [PATCH bpf-next v2 06/10] xsk: Add check for unaligned descriptors that overrun UMEM
  2023-03-29 18:04 [PATCH bpf-next v2 00/10] xsk: Support UMEM chunk_size > PAGE_SIZE Kal Conley
                   ` (4 preceding siblings ...)
  2023-03-29 18:04 ` [PATCH bpf-next v2 05/10] selftests: xsk: Disable IPv6 on VETH1 Kal Conley
@ 2023-03-29 18:04 ` Kal Conley
  2023-04-03 12:23   ` Magnus Karlsson
  2023-03-29 18:04 ` [PATCH bpf-next v2 07/10] selftests: xsk: Add test UNALIGNED_INV_DESC_4K1_FRAME_SIZE Kal Conley
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Kal Conley @ 2023-03-29 18:04 UTC (permalink / raw)
  To: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Maxim Mikityanskiy
  Cc: Kal Conley, netdev, bpf, linux-kernel

Make sure unaligned descriptors that straddle the end of the UMEM are
considered invalid. This check needs to happen before the page boundary
and contiguity checks in xp_desc_crosses_non_contig_pg(). Check this in
xp_unaligned_validate_desc() instead, like xp_check_unaligned() already
does.

Fixes: 2b43470add8c ("xsk: Introduce AF_XDP buffer allocation API")
Signed-off-by: Kal Conley <kal.conley@dectris.com>
---
 include/net/xsk_buff_pool.h | 9 ++-------
 net/xdp/xsk_queue.h         | 1 +
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index 3e952e569418..d318c769b445 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -180,13 +180,8 @@ static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
 	if (likely(!cross_pg))
 		return false;
 
-	if (pool->dma_pages_cnt) {
-		return !(pool->dma_pages[addr >> PAGE_SHIFT] &
-			 XSK_NEXT_PG_CONTIG_MASK);
-	}
-
-	/* skb path */
-	return addr + len > pool->addrs_cnt;
+	return pool->dma_pages_cnt &&
+	       !(pool->dma_pages[addr >> PAGE_SHIFT] & XSK_NEXT_PG_CONTIG_MASK);
 }
 
 static inline u64 xp_aligned_extract_addr(struct xsk_buff_pool *pool, u64 addr)
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index bfb2a7e50c26..66c6f57c9c44 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -162,6 +162,7 @@ static inline bool xp_unaligned_validate_desc(struct xsk_buff_pool *pool,
 		return false;
 
 	if (base_addr >= pool->addrs_cnt || addr >= pool->addrs_cnt ||
+	    addr + desc->len > pool->addrs_cnt ||
 	    xp_desc_crosses_non_contig_pg(pool, addr, desc->len))
 		return false;
 
-- 
2.39.2



* [PATCH bpf-next v2 07/10] selftests: xsk: Add test UNALIGNED_INV_DESC_4K1_FRAME_SIZE
  2023-03-29 18:04 [PATCH bpf-next v2 00/10] xsk: Support UMEM chunk_size > PAGE_SIZE Kal Conley
                   ` (5 preceding siblings ...)
  2023-03-29 18:04 ` [PATCH bpf-next v2 06/10] xsk: Add check for unaligned descriptors that overrun UMEM Kal Conley
@ 2023-03-29 18:04 ` Kal Conley
  2023-04-03 12:25   ` Magnus Karlsson
  2023-03-29 18:05 ` [PATCH bpf-next v2 08/10] xsk: Support UMEM chunk_size > PAGE_SIZE Kal Conley
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Kal Conley @ 2023-03-29 18:04 UTC (permalink / raw)
  To: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Andrii Nakryiko, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan
  Cc: Kal Conley, netdev, bpf, linux-kselftest, linux-kernel

Add unaligned descriptor test for frame size of 4001. Using an odd frame
size ensures that the end of the UMEM is not near a page boundary. This
allows testing descriptors that staddle the end of the UMEM but not a
page.
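
(With the default of 1024 frames, umem_size = 1024 * 4001 = 4,097,024
bytes, which ends 1024 bytes past a 4 KiB page boundary: far enough
from the boundary on both sides that a descriptor can overrun the UMEM
without also crossing a page.)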

This test used to fail without the previous commit ("xsk: Add check for
unaligned descriptors that overrun UMEM").

Signed-off-by: Kal Conley <kal.conley@dectris.com>
---
 tools/testing/selftests/bpf/xskxceiver.c | 25 ++++++++++++++++++++++++
 tools/testing/selftests/bpf/xskxceiver.h |  1 +
 2 files changed, 26 insertions(+)

diff --git a/tools/testing/selftests/bpf/xskxceiver.c b/tools/testing/selftests/bpf/xskxceiver.c
index 1a4bdd5aa78c..9b9efd0e0a4c 100644
--- a/tools/testing/selftests/bpf/xskxceiver.c
+++ b/tools/testing/selftests/bpf/xskxceiver.c
@@ -69,6 +69,7 @@
  */
 
 #define _GNU_SOURCE
+#include <assert.h>
 #include <fcntl.h>
 #include <errno.h>
 #include <getopt.h>
@@ -1876,6 +1877,30 @@ static void run_pkt_test(struct test_spec *test, enum test_mode mode, enum test_
 		test->ifobj_rx->umem->unaligned_mode = true;
 		testapp_invalid_desc(test);
 		break;
+	case TEST_TYPE_UNALIGNED_INV_DESC_4K1_FRAME:
+		if (!hugepages_present(test->ifobj_tx)) {
+			ksft_test_result_skip("No 2M huge pages present.\n");
+			return;
+		}
+		test_spec_set_name(test, "UNALIGNED_INV_DESC_4K1_FRAME_SIZE");
+		/* Odd frame size so the UMEM doesn't end near a page boundary. */
+		test->ifobj_tx->umem->frame_size = 4001;
+		test->ifobj_rx->umem->frame_size = 4001;
+		test->ifobj_tx->umem->unaligned_mode = true;
+		test->ifobj_rx->umem->unaligned_mode = true;
+		/* This test exists to test descriptors that staddle the end of
+		 * the UMEM but not a page.
+		 */
+		{
+			u64 umem_size = test->ifobj_tx->umem->num_frames *
+					test->ifobj_tx->umem->frame_size;
+			u64 page_size = sysconf(_SC_PAGESIZE);
+
+			assert(umem_size % page_size > PKT_SIZE);
+			assert(umem_size % page_size < page_size - PKT_SIZE);
+		}
+		testapp_invalid_desc(test);
+		break;
 	case TEST_TYPE_UNALIGNED:
 		if (!testapp_unaligned(test))
 			return;
diff --git a/tools/testing/selftests/bpf/xskxceiver.h b/tools/testing/selftests/bpf/xskxceiver.h
index cc24ab72f3ff..919327807a4e 100644
--- a/tools/testing/selftests/bpf/xskxceiver.h
+++ b/tools/testing/selftests/bpf/xskxceiver.h
@@ -78,6 +78,7 @@ enum test_type {
 	TEST_TYPE_ALIGNED_INV_DESC,
 	TEST_TYPE_ALIGNED_INV_DESC_2K_FRAME,
 	TEST_TYPE_UNALIGNED_INV_DESC,
+	TEST_TYPE_UNALIGNED_INV_DESC_4K1_FRAME,
 	TEST_TYPE_HEADROOM,
 	TEST_TYPE_TEARDOWN,
 	TEST_TYPE_BIDI,
-- 
2.39.2



* [PATCH bpf-next v2 08/10] xsk: Support UMEM chunk_size > PAGE_SIZE
  2023-03-29 18:04 [PATCH bpf-next v2 00/10] xsk: Support UMEM chunk_size > PAGE_SIZE Kal Conley
                   ` (6 preceding siblings ...)
  2023-03-29 18:04 ` [PATCH bpf-next v2 07/10] selftests: xsk: Add test UNALIGNED_INV_DESC_4K1_FRAME_SIZE Kal Conley
@ 2023-03-29 18:05 ` Kal Conley
  2023-04-04  7:39   ` Magnus Karlsson
  2023-03-29 18:05 ` [PATCH bpf-next v2 09/10] selftests: xsk: Use hugepages when umem->frame_size " Kal Conley
  2023-03-29 18:05 ` [PATCH bpf-next v2 10/10] selftests: xsk: Add tests for 8K and 9K frame sizes Kal Conley
  9 siblings, 1 reply; 27+ messages in thread
From: Kal Conley @ 2023-03-29 18:05 UTC (permalink / raw)
  To: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jonathan Corbet, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend
  Cc: Kal Conley, netdev, bpf, linux-doc, linux-kernel

Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
enables sending/receiving jumbo Ethernet frames up to the theoretical
maximum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
SKB mode is usable pending future driver work.

For consistency, check for HugeTLB pages during UMEM registration. This
implies that hugepages are required for XDP_COPY mode despite DMA not
being used. This restriction is desirable since it ensures user software
can take advantage of future driver support.

Even in HugeTLB mode, continue to do page accounting using order-0
(4 KiB) pages. This minimizes the size of this change and reduces the
risk of impacting driver code. Taking full advantage of hugepages for
accounting should improve XDP performance in the general case.
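
As a worked example of the accounting (assuming 2 MiB hugepages and a
4 KiB base PAGE_SIZE): xp_check_dma_contiguity() walks the order-0 page
array with stride = page_size >> PAGE_SHIFT = 512, so DMA contiguity is
tested once per hugepage and the resulting XSK_NEXT_PG_CONTIG_MASK bit
is applied to all 512 order-0 entries that the hugepage covers.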

No significant change in RX/TX performance was observed with this patch.
A few data points are reproduced below:

Machine : Dell PowerEdge R940
CPU     : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
NIC     : MT27700 Family [ConnectX-4]

+-----+------+------+-------+--------+--------+--------+
|     |      |      | chunk | packet | rxdrop | rxdrop |
|     | mode |  mtu |  size |   size | (Mpps) | (Gbps) |
+-----+------+------+-------+--------+--------+--------+
| old |   -z | 3498 |  4000 |    320 |   15.7 |   40.2 |
| new |   -z | 3498 |  4000 |    320 |   15.8 |   40.4 |
+-----+------+------+-------+--------+--------+--------+
| old |   -z | 3498 |  4096 |    320 |   16.4 |   42.0 |
| new |   -z | 3498 |  4096 |    320 |   16.3 |   41.7 |
+-----+------+------+-------+--------+--------+--------+
| new |   -c | 3498 | 10240 |    320 |    6.3 |   16.1 |
+-----+------+------+-------+--------+--------+--------+
| new |   -S | 9000 | 10240 |   9000 |   0.35 |   25.2 |
+-----+------+------+-------+--------+--------+--------+

Signed-off-by: Kal Conley <kal.conley@dectris.com>
---
 Documentation/networking/af_xdp.rst | 19 +++++++----
 include/net/xdp_sock.h              |  3 ++
 include/net/xdp_sock_drv.h          | 12 +++++++
 include/net/xsk_buff_pool.h         | 15 ++++++++-
 net/xdp/xdp_umem.c                  | 50 ++++++++++++++++++++++++-----
 net/xdp/xsk_buff_pool.c             | 30 +++++++++++------
 6 files changed, 105 insertions(+), 24 deletions(-)

diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 247c6c4127e9..0017f83c8fb8 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -419,13 +419,20 @@ XDP_UMEM_REG setsockopt
 -----------------------
 
 This setsockopt registers a UMEM to a socket. This is the area that
-contain all the buffers that packet can reside in. The call takes a
+contains all the buffers that packets can reside in. The call takes a
 pointer to the beginning of this area and the size of it. Moreover, it
-also has parameter called chunk_size that is the size that the UMEM is
-divided into. It can only be 2K or 4K at the moment. If you have an
-UMEM area that is 128K and a chunk size of 2K, this means that you
-will be able to hold a maximum of 128K / 2K = 64 packets in your UMEM
-area and that your largest packet size can be 2K.
+also has a parameter called chunk_size that is the size that the UMEM is
+divided into. For example, if you have an UMEM area that is 128K and a
+chunk size of 2K, this means that you will be able to hold a maximum of
+128K / 2K = 64 packets in your UMEM and that your largest packet size
+can be 2K.
+
+Valid chunk sizes range from 2K to 64K. However, the chunk size must not
+exceed the size of a page (often 4K). This limitation is relaxed for
+UMEM areas allocated with HugeTLB pages. In this case, chunk sizes up
+to the system default hugepage size are supported. Note, this only works
+with hugepages allocated from the kernel's persistent pool. Using
+Transparent Huge Pages (THP) has no effect on the maximum chunk size.
 
 There is also an option to set the headroom of each single buffer in
 the UMEM. If you set this to N bytes, it means that the packet will
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index e96a1151ec75..ed88880d4b68 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -28,6 +28,9 @@ struct xdp_umem {
 	struct user_struct *user;
 	refcount_t users;
 	u8 flags;
+#ifdef CONFIG_HUGETLB_PAGE
+	bool hugetlb;
+#endif
 	bool zc;
 	struct page **pgs;
 	int id;
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 9c0d860609ba..83fba3060c9a 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -12,6 +12,18 @@
 #define XDP_UMEM_MIN_CHUNK_SHIFT 11
 #define XDP_UMEM_MIN_CHUNK_SIZE (1 << XDP_UMEM_MIN_CHUNK_SHIFT)
 
+static_assert(XDP_UMEM_MIN_CHUNK_SIZE <= PAGE_SIZE);
+
+/* Allow chunk sizes up to the maximum size of an ethernet frame (64 KiB).
+ * Larger chunks are not guaranteed to fit in a single SKB.
+ */
+#ifdef CONFIG_HUGETLB_PAGE
+#define XDP_UMEM_MAX_CHUNK_SHIFT min(16, HPAGE_SHIFT)
+#else
+#define XDP_UMEM_MAX_CHUNK_SHIFT min(16, PAGE_SHIFT)
+#endif
+#define XDP_UMEM_MAX_CHUNK_SIZE (1 << XDP_UMEM_MAX_CHUNK_SHIFT)
+
 #ifdef CONFIG_XDP_SOCKETS
 
 void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index d318c769b445..bb32112aefea 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -75,6 +75,9 @@ struct xsk_buff_pool {
 	u32 chunk_size;
 	u32 chunk_shift;
 	u32 frame_len;
+#ifdef CONFIG_HUGETLB_PAGE
+	u32 page_size;
+#endif
 	u8 cached_need_wakeup;
 	bool uses_need_wakeup;
 	bool dma_need_sync;
@@ -165,6 +168,15 @@ static inline void xp_dma_sync_for_device(struct xsk_buff_pool *pool,
 	xp_dma_sync_for_device_slow(pool, dma, size);
 }
 
+static inline u32 xp_get_page_size(struct xsk_buff_pool *pool)
+{
+#ifdef CONFIG_HUGETLB_PAGE
+	return pool->page_size;
+#else
+	return PAGE_SIZE;
+#endif
+}
+
 /* Masks for xdp_umem_page flags.
  * The low 12-bits of the addr will be 0 since this is the page address, so we
  * can use them for flags.
@@ -175,7 +187,8 @@ static inline void xp_dma_sync_for_device(struct xsk_buff_pool *pool,
 static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
 						 u64 addr, u32 len)
 {
-	bool cross_pg = (addr & (PAGE_SIZE - 1)) + len > PAGE_SIZE;
+	const u32 page_size = xp_get_page_size(pool);
+	bool cross_pg = (addr & (page_size - 1)) + len > page_size;
 
 	if (likely(!cross_pg))
 		return false;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 4681e8e8ad94..8ff687d7e735 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -10,6 +10,8 @@
 #include <linux/uaccess.h>
 #include <linux/slab.h>
 #include <linux/bpf.h>
+#include <linux/hugetlb.h>
+#include <linux/hugetlb_inline.h>
 #include <linux/mm.h>
 #include <linux/netdevice.h>
 #include <linux/rtnetlink.h>
@@ -91,8 +93,37 @@ void xdp_put_umem(struct xdp_umem *umem, bool defer_cleanup)
 	}
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+
+/* Returns true if the UMEM contains HugeTLB pages exclusively, false otherwise.
+ *
+ * The mmap_lock must be held by the caller.
+ */
+static bool xdp_umem_is_hugetlb(struct xdp_umem *umem, unsigned long address)
+{
+	unsigned long end = address + umem->size;
+	struct vm_area_struct *vma;
+	struct vma_iterator vmi;
+
+	vma_iter_init(&vmi, current->mm, address);
+	for_each_vma_range(vmi, vma, end) {
+		if (!is_vm_hugetlb_page(vma))
+			return false;
+		/* Hugepage sizes smaller than the default are not supported. */
+		if (huge_page_size(hstate_vma(vma)) < HPAGE_SIZE)
+			return false;
+	}
+
+	return true;
+}
+
+#endif /* CONFIG_HUGETLB_PAGE */
+
 static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
 {
+#ifdef CONFIG_HUGETLB_PAGE
+	bool need_hugetlb = umem->chunk_size > PAGE_SIZE;
+#endif
 	unsigned int gup_flags = FOLL_WRITE;
 	long npgs;
 	int err;
@@ -102,8 +133,18 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
 		return -ENOMEM;
 
 	mmap_read_lock(current->mm);
+
+#ifdef CONFIG_HUGETLB_PAGE
+	umem->hugetlb = IS_ALIGNED(address, HPAGE_SIZE) && xdp_umem_is_hugetlb(umem, address);
+	if (need_hugetlb && !umem->hugetlb) {
+		mmap_read_unlock(current->mm);
+		err = -EINVAL;
+		goto out_pgs;
+	}
+#endif
 	npgs = pin_user_pages(address, umem->npgs,
 			      gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
+
 	mmap_read_unlock(current->mm);
 
 	if (npgs != umem->npgs) {
@@ -156,15 +197,8 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	unsigned int chunks, chunks_rem;
 	int err;
 
-	if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > PAGE_SIZE) {
-		/* Strictly speaking we could support this, if:
-		 * - huge pages, or*
-		 * - using an IOMMU, or
-		 * - making sure the memory area is consecutive
-		 * but for now, we simply say "computer says no".
-		 */
+	if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > XDP_UMEM_MAX_CHUNK_SIZE)
 		return -EINVAL;
-	}
 
 	if (mr->flags & ~XDP_UMEM_UNALIGNED_CHUNK_FLAG)
 		return -EINVAL;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index b2df1e0f8153..10933f78a5a2 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -80,9 +80,12 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
 	pool->headroom = umem->headroom;
 	pool->chunk_size = umem->chunk_size;
 	pool->chunk_shift = ffs(umem->chunk_size) - 1;
-	pool->unaligned = unaligned;
 	pool->frame_len = umem->chunk_size - umem->headroom -
 		XDP_PACKET_HEADROOM;
+#ifdef CONFIG_HUGETLB_PAGE
+	pool->page_size = umem->hugetlb ? HPAGE_SIZE : PAGE_SIZE;
+#endif
+	pool->unaligned = unaligned;
 	pool->umem = umem;
 	pool->addrs = umem->addrs;
 	INIT_LIST_HEAD(&pool->free_list);
@@ -369,16 +372,25 @@ void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs)
 }
 EXPORT_SYMBOL(xp_dma_unmap);
 
-static void xp_check_dma_contiguity(struct xsk_dma_map *dma_map)
+/* HugeTLB pools consider contiguity at hugepage granularity only. Hence, all
+ * order-0 pages within a hugepage have the same contiguity value.
+ */
+static void xp_check_dma_contiguity(struct xsk_dma_map *dma_map, u32 page_size)
 {
-	u32 i;
+	u32 stride = page_size >> PAGE_SHIFT; /* in order-0 pages */
+	u32 i, j;
 
-	for (i = 0; i < dma_map->dma_pages_cnt - 1; i++) {
-		if (dma_map->dma_pages[i] + PAGE_SIZE == dma_map->dma_pages[i + 1])
-			dma_map->dma_pages[i] |= XSK_NEXT_PG_CONTIG_MASK;
-		else
-			dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
+	for (i = 0; i + stride < dma_map->dma_pages_cnt;) {
+		if (dma_map->dma_pages[i] + page_size == dma_map->dma_pages[i + stride]) {
+			for (j = 0; j < stride; i++, j++)
+				dma_map->dma_pages[i] |= XSK_NEXT_PG_CONTIG_MASK;
+		} else {
+			for (j = 0; j < stride; i++, j++)
+				dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
+		}
 	}
+	for (; i < dma_map->dma_pages_cnt; i++)
+		dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
 }
 
 static int xp_init_dma_info(struct xsk_buff_pool *pool, struct xsk_dma_map *dma_map)
@@ -441,7 +453,7 @@ int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
 	}
 
 	if (pool->unaligned)
-		xp_check_dma_contiguity(dma_map);
+		xp_check_dma_contiguity(dma_map, xp_get_page_size(pool));
 
 	err = xp_init_dma_info(pool, dma_map);
 	if (err) {
-- 
2.39.2



* [PATCH bpf-next v2 09/10] selftests: xsk: Use hugepages when umem->frame_size > PAGE_SIZE
  2023-03-29 18:04 [PATCH bpf-next v2 00/10] xsk: Support UMEM chunk_size > PAGE_SIZE Kal Conley
                   ` (7 preceding siblings ...)
  2023-03-29 18:05 ` [PATCH bpf-next v2 08/10] xsk: Support UMEM chunk_size > PAGE_SIZE Kal Conley
@ 2023-03-29 18:05 ` Kal Conley
  2023-03-29 18:05 ` [PATCH bpf-next v2 10/10] selftests: xsk: Add tests for 8K and 9K frame sizes Kal Conley
  9 siblings, 0 replies; 27+ messages in thread
From: Kal Conley @ 2023-03-29 18:05 UTC (permalink / raw)
  To: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan
  Cc: Kal Conley, netdev, bpf, linux-kselftest, linux-kernel

HugeTLB UMEMs now support chunk_size > PAGE_SIZE. Set MAP_HUGETLB when
frame_size > PAGE_SIZE so that upcoming tests in this series can use
larger frame sizes.

Signed-off-by: Kal Conley <kal.conley@dectris.com>
---
 tools/testing/selftests/bpf/xskxceiver.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/xskxceiver.c b/tools/testing/selftests/bpf/xskxceiver.c
index 9b9efd0e0a4c..f73bc6ff5c3d 100644
--- a/tools/testing/selftests/bpf/xskxceiver.c
+++ b/tools/testing/selftests/bpf/xskxceiver.c
@@ -1289,7 +1289,7 @@ static void thread_common_ops(struct test_spec *test, struct ifobject *ifobject)
 	void *bufs;
 	int ret;
 
-	if (ifobject->umem->unaligned_mode)
+	if (ifobject->umem->frame_size > sysconf(_SC_PAGESIZE) || ifobject->umem->unaligned_mode)
 		mmap_flags |= MAP_HUGETLB;
 
 	if (ifobject->shared_umem)
-- 
2.39.2



* [PATCH bpf-next v2 10/10] selftests: xsk: Add tests for 8K and 9K frame sizes
  2023-03-29 18:04 [PATCH bpf-next v2 00/10] xsk: Support UMEM chunk_size > PAGE_SIZE Kal Conley
                   ` (8 preceding siblings ...)
  2023-03-29 18:05 ` [PATCH bpf-next v2 09/10] selftests: xsk: Use hugepages when umem->frame_size " Kal Conley
@ 2023-03-29 18:05 ` Kal Conley
  9 siblings, 0 replies; 27+ messages in thread
From: Kal Conley @ 2023-03-29 18:05 UTC (permalink / raw)
  To: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan
  Cc: Kal Conley, netdev, bpf, linux-kselftest, linux-kernel

Add tests:
- RUN_TO_COMPLETION_8K_FRAME_SIZE: frame_size=8192 (aligned)
- UNALIGNED_9K_FRAME_SIZE: frame_size=9000 (unaligned)

Signed-off-by: Kal Conley <kal.conley@dectris.com>
---
 tools/testing/selftests/bpf/xskxceiver.c | 25 ++++++++++++++++++++++++
 tools/testing/selftests/bpf/xskxceiver.h |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/tools/testing/selftests/bpf/xskxceiver.c b/tools/testing/selftests/bpf/xskxceiver.c
index f73bc6ff5c3d..fb5ba225b652 100644
--- a/tools/testing/selftests/bpf/xskxceiver.c
+++ b/tools/testing/selftests/bpf/xskxceiver.c
@@ -1841,6 +1841,17 @@ static void run_pkt_test(struct test_spec *test, enum test_mode mode, enum test_
 		pkt_stream_replace(test, DEFAULT_PKT_CNT, PKT_SIZE);
 		testapp_validate_traffic(test);
 		break;
+	case TEST_TYPE_RUN_TO_COMPLETION_8K_FRAME:
+		if (!hugepages_present(test->ifobj_tx)) {
+			ksft_test_result_skip("No 2M huge pages present.\n");
+			return;
+		}
+		test_spec_set_name(test, "RUN_TO_COMPLETION_8K_FRAME_SIZE");
+		test->ifobj_tx->umem->frame_size = 8192;
+		test->ifobj_rx->umem->frame_size = 8192;
+		pkt_stream_replace(test, DEFAULT_PKT_CNT, PKT_SIZE);
+		testapp_validate_traffic(test);
+		break;
 	case TEST_TYPE_RX_POLL:
 		test->ifobj_rx->use_poll = true;
 		test_spec_set_name(test, "POLL_RX");
@@ -1905,6 +1916,20 @@ static void run_pkt_test(struct test_spec *test, enum test_mode mode, enum test_
 		if (!testapp_unaligned(test))
 			return;
 		break;
+	case TEST_TYPE_UNALIGNED_9K_FRAME:
+		if (!hugepages_present(test->ifobj_tx)) {
+			ksft_test_result_skip("No 2M huge pages present.\n");
+			return;
+		}
+		test_spec_set_name(test, "UNALIGNED_9K_FRAME_SIZE");
+		test->ifobj_tx->umem->frame_size = 9000;
+		test->ifobj_rx->umem->frame_size = 9000;
+		test->ifobj_tx->umem->unaligned_mode = true;
+		test->ifobj_rx->umem->unaligned_mode = true;
+		pkt_stream_replace(test, DEFAULT_PKT_CNT, PKT_SIZE);
+		test->ifobj_rx->pkt_stream->use_addr_for_fill = true;
+		testapp_validate_traffic(test);
+		break;
 	case TEST_TYPE_HEADROOM:
 		testapp_headroom(test);
 		break;
diff --git a/tools/testing/selftests/bpf/xskxceiver.h b/tools/testing/selftests/bpf/xskxceiver.h
index 919327807a4e..7f52f737f5e9 100644
--- a/tools/testing/selftests/bpf/xskxceiver.h
+++ b/tools/testing/selftests/bpf/xskxceiver.h
@@ -69,12 +69,14 @@ enum test_mode {
 enum test_type {
 	TEST_TYPE_RUN_TO_COMPLETION,
 	TEST_TYPE_RUN_TO_COMPLETION_2K_FRAME,
+	TEST_TYPE_RUN_TO_COMPLETION_8K_FRAME,
 	TEST_TYPE_RUN_TO_COMPLETION_SINGLE_PKT,
 	TEST_TYPE_RX_POLL,
 	TEST_TYPE_TX_POLL,
 	TEST_TYPE_POLL_RXQ_TMOUT,
 	TEST_TYPE_POLL_TXQ_TMOUT,
 	TEST_TYPE_UNALIGNED,
+	TEST_TYPE_UNALIGNED_9K_FRAME,
 	TEST_TYPE_ALIGNED_INV_DESC,
 	TEST_TYPE_ALIGNED_INV_DESC_2K_FRAME,
 	TEST_TYPE_UNALIGNED_INV_DESC,
-- 
2.39.2



* Re: [PATCH bpf-next v2 04/10] selftests: xsk: Deflakify STATS_RX_DROPPED test
  2023-03-29 18:04 ` [PATCH bpf-next v2 04/10] selftests: xsk: Deflakify STATS_RX_DROPPED test Kal Conley
@ 2023-04-03 10:54   ` Magnus Karlsson
  2023-04-03 11:06     ` Kal Cutter Conley
  0 siblings, 1 reply; 27+ messages in thread
From: Magnus Karlsson @ 2023-04-03 10:54 UTC (permalink / raw)
  To: Kal Conley
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Andrii Nakryiko, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, netdev, bpf, linux-kselftest,
	linux-kernel

On Wed, 29 Mar 2023 at 20:11, Kal Conley <kal.conley@dectris.com> wrote:
>
> Fix flaky STATS_RX_DROPPED test. The receiver calls getsockopt after
> receiving the last (valid) packet which is not the final packet sent in
> the test (valid and invalid packets are sent in alternating fashion with
> the final packet being invalid). Since the last packet may or may not
> have been dropped already, both outcomes must be allowed.
>
> This issue could also be fixed by making sure the last packet sent is
> valid. This alternative is left as an exercise to the reader (or the
> benevolent maintainers of this file).
>
> This problem was quite visible on certain setups. On one machine this
> failure was observed 50% of the time.
>
> Also, remove a redundant assignment of pkt_stream->nb_pkts. This field
> is already initialized by __pkt_stream_alloc.

This has been bugging me for a while so thanks for fixing this. Please
break this commit out of this patch set and send it as a separate bug
fix.

Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>

> Fixes: 27e934bec35b ("selftests: xsk: make stat tests not spin on getsockopt")
> Signed-off-by: Kal Conley <kal.conley@dectris.com>
> ---
>  tools/testing/selftests/bpf/xskxceiver.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/xskxceiver.c b/tools/testing/selftests/bpf/xskxceiver.c
> index 34a1f32fe752..1a4bdd5aa78c 100644
> --- a/tools/testing/selftests/bpf/xskxceiver.c
> +++ b/tools/testing/selftests/bpf/xskxceiver.c
> @@ -633,7 +633,6 @@ static struct pkt_stream *pkt_stream_generate(struct xsk_umem_info *umem, u32 nb
>         if (!pkt_stream)
>                 exit_with_error(ENOMEM);
>
> -       pkt_stream->nb_pkts = nb_pkts;
>         for (i = 0; i < nb_pkts; i++) {
>                 pkt_set(umem, &pkt_stream->pkts[i], (i % umem->num_frames) * umem->frame_size,
>                         pkt_len);
> @@ -1141,7 +1140,14 @@ static int validate_rx_dropped(struct ifobject *ifobject)
>         if (err)
>                 return TEST_FAILURE;
>
> -       if (stats.rx_dropped == ifobject->pkt_stream->nb_pkts / 2)
> +       /* The receiver calls getsockopt after receiving the last (valid)
> +        * packet which is not the final packet sent in this test (valid and
> +        * invalid packets are sent in alternating fashion with the final
> +        * packet being invalid). Since the last packet may or may not have
> +        * been dropped already, both outcomes must be allowed.
> +        */
> +       if (stats.rx_dropped == ifobject->pkt_stream->nb_pkts / 2 ||
> +           stats.rx_dropped == ifobject->pkt_stream->nb_pkts / 2 - 1)
>                 return TEST_PASS;
>
>         return TEST_FAILURE;
> --
> 2.39.2
>


* Re: [PATCH bpf-next v2 04/10] selftests: xsk: Deflakify STATS_RX_DROPPED test
  2023-04-03 10:54   ` Magnus Karlsson
@ 2023-04-03 11:06     ` Kal Cutter Conley
  2023-04-03 11:40       ` Kal Cutter Conley
  0 siblings, 1 reply; 27+ messages in thread
From: Kal Cutter Conley @ 2023-04-03 11:06 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Andrii Nakryiko, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, netdev, bpf, linux-kselftest,
	linux-kernel

> This has been bugging me for a while so thanks for fixing this. Please
> break this commit out of this patch set and send it as a separate bug
> fix.
>
> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
>

Can I send patches 01-05 all together as one patchset?

Kal


* Re: [PATCH bpf-next v2 04/10] selftests: xsk: Deflakify STATS_RX_DROPPED test
  2023-04-03 11:40       ` Kal Cutter Conley
@ 2023-04-03 11:38         ` Magnus Karlsson
  2023-04-03 11:47           ` Kal Cutter Conley
  0 siblings, 1 reply; 27+ messages in thread
From: Magnus Karlsson @ 2023-04-03 11:38 UTC (permalink / raw)
  To: Kal Cutter Conley
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Andrii Nakryiko, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, netdev, bpf, linux-kselftest,
	linux-kernel

On Mon, 3 Apr 2023 at 13:35, Kal Cutter Conley <kal.conley@dectris.com> wrote:
>
> >
> > Can I send patches 01-05 all together as one patchset?
> >
> On second thought, I will just send out 01, 04, and 05 separately
> right away since those are all completely independent.

#2 is also a bug fix in my mind, but not #3 as it just adds a test.

For some reason I did not receive patch #1 and #8. Might be something
flaky on my side, do not know.


* Re: [PATCH bpf-next v2 04/10] selftests: xsk: Deflakify STATS_RX_DROPPED test
  2023-04-03 11:06     ` Kal Cutter Conley
@ 2023-04-03 11:40       ` Kal Cutter Conley
  2023-04-03 11:38         ` Magnus Karlsson
  0 siblings, 1 reply; 27+ messages in thread
From: Kal Cutter Conley @ 2023-04-03 11:40 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Andrii Nakryiko, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, netdev, bpf, linux-kselftest,
	linux-kernel

>
> Can I send patches 01-05 all together as one patchset?
>
On second thought, I will just send out 01, 04, and 05 separately
right away since those are all completely independent.


* Re: [PATCH bpf-next v2 04/10] selftests: xsk: Deflakify STATS_RX_DROPPED test
  2023-04-03 11:38         ` Magnus Karlsson
@ 2023-04-03 11:47           ` Kal Cutter Conley
  0 siblings, 0 replies; 27+ messages in thread
From: Kal Cutter Conley @ 2023-04-03 11:47 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Andrii Nakryiko, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, netdev, bpf, linux-kselftest,
	linux-kernel

> #2 is also a bug fix in my mind, but not #3 as it just adds a test.
>

#2 is a bugfix, but if we put it on the "bpf" tree already, then it
will create a problem for later commits that depend on it, which would
go to bpf-next.


* Re: [PATCH bpf-next v2 06/10] xsk: Add check for unaligned descriptors that overrun UMEM
  2023-03-29 18:04 ` [PATCH bpf-next v2 06/10] xsk: Add check for unaligned descriptors that overrun UMEM Kal Conley
@ 2023-04-03 12:23   ` Magnus Karlsson
  2023-04-03 12:50     ` Kal Cutter Conley
  0 siblings, 1 reply; 27+ messages in thread
From: Magnus Karlsson @ 2023-04-03 12:23 UTC (permalink / raw)
  To: Kal Conley
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Maxim Mikityanskiy,
	netdev, bpf, linux-kernel

On Wed, 29 Mar 2023 at 20:11, Kal Conley <kal.conley@dectris.com> wrote:
>
> Make sure unaligned descriptors that straddle the end of the UMEM are
> considered invalid. This check needs to happen before the page boundary
> and contiguity checks in xp_desc_crosses_non_contig_pg(). Check this in
> xp_unaligned_validate_desc() instead like xp_check_unaligned() already
> does.
>
> Fixes: 2b43470add8c ("xsk: Introduce AF_XDP buffer allocation API")
> Signed-off-by: Kal Conley <kal.conley@dectris.com>
> ---
>  include/net/xsk_buff_pool.h | 9 ++-------
>  net/xdp/xsk_queue.h         | 1 +
>  2 files changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> index 3e952e569418..d318c769b445 100644
> --- a/include/net/xsk_buff_pool.h
> +++ b/include/net/xsk_buff_pool.h
> @@ -180,13 +180,8 @@ static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
>         if (likely(!cross_pg))
>                 return false;
>
> -       if (pool->dma_pages_cnt) {
> -               return !(pool->dma_pages[addr >> PAGE_SHIFT] &
> -                        XSK_NEXT_PG_CONTIG_MASK);
> -       }
> -
> -       /* skb path */
> -       return addr + len > pool->addrs_cnt;
> +       return pool->dma_pages_cnt &&
> +              !(pool->dma_pages[addr >> PAGE_SHIFT] & XSK_NEXT_PG_CONTIG_MASK);
>  }
>
>  static inline u64 xp_aligned_extract_addr(struct xsk_buff_pool *pool, u64 addr)
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index bfb2a7e50c26..66c6f57c9c44 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -162,6 +162,7 @@ static inline bool xp_unaligned_validate_desc(struct xsk_buff_pool *pool,
>                 return false;
>
>         if (base_addr >= pool->addrs_cnt || addr >= pool->addrs_cnt ||
> +           addr + desc->len > pool->addrs_cnt ||
>             xp_desc_crosses_non_contig_pg(pool, addr, desc->len))
>                 return false;
>

Let me just check that I understand the conditions under which this
occurs. When selecting unaligned mode, there is no check that the size
is divisible by the chunk_size as is the case in aligned mode. So we
can register a umem that is for example 15 4K pages plus 100 bytes and
in this case the second to last page will be marked as contiguous
(with the CONTIG_MASK) and a packet of length 300 starting at 15*4K -
100 will be marked as valid even though it extends 100 bytes outside
the umem which ends at 15*4K + 100. Did I get this correctly? If so,
some more color in the commit message would be highly appreciated.

The best way around this would have been if we made sure that the umem
size was always divisible by PAGE_SIZE, but as there are users out
there that might have an unaligned umem of a slightly odd size, we
cannot risk breaking their program. PAGE_SIZE is also architecture
dependent and even configurable on some architectures. So I think
your solution here is the right one.

This one should be considered a bug fix too and should go to bpf. Good
catch, if I understood the problem correctly above.



> --
> 2.39.2
>


* Re: [PATCH bpf-next v2 07/10] selftests: xsk: Add test UNALIGNED_INV_DESC_4K1_FRAME_SIZE
  2023-03-29 18:04 ` [PATCH bpf-next v2 07/10] selftests: xsk: Add test UNALIGNED_INV_DESC_4K1_FRAME_SIZE Kal Conley
@ 2023-04-03 12:25   ` Magnus Karlsson
  0 siblings, 0 replies; 27+ messages in thread
From: Magnus Karlsson @ 2023-04-03 12:25 UTC (permalink / raw)
  To: Kal Conley
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Andrii Nakryiko, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, netdev, bpf, linux-kselftest,
	linux-kernel

On Wed, 29 Mar 2023 at 20:12, Kal Conley <kal.conley@dectris.com> wrote:
>
> Add unaligned descriptor test for frame size of 4001. Using an odd frame
> size ensures that the end of the UMEM is not near a page boundary. This
> allows testing descriptors that staddle the end of the UMEM but not a

nit: straddle

> page.
>
> This test used to fail without the previous commit ("xsk: Add check for
> unaligned descriptors that overrun UMEM").
>
> Signed-off-by: Kal Conley <kal.conley@dectris.com>
> ---
>  tools/testing/selftests/bpf/xskxceiver.c | 25 ++++++++++++++++++++++++
>  tools/testing/selftests/bpf/xskxceiver.h |  1 +
>  2 files changed, 26 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/xskxceiver.c b/tools/testing/selftests/bpf/xskxceiver.c
> index 1a4bdd5aa78c..9b9efd0e0a4c 100644
> --- a/tools/testing/selftests/bpf/xskxceiver.c
> +++ b/tools/testing/selftests/bpf/xskxceiver.c
> @@ -69,6 +69,7 @@
>   */
>
>  #define _GNU_SOURCE
> +#include <assert.h>
>  #include <fcntl.h>
>  #include <errno.h>
>  #include <getopt.h>
> @@ -1876,6 +1877,30 @@ static void run_pkt_test(struct test_spec *test, enum test_mode mode, enum test_
>                 test->ifobj_rx->umem->unaligned_mode = true;
>                 testapp_invalid_desc(test);
>                 break;
> +       case TEST_TYPE_UNALIGNED_INV_DESC_4K1_FRAME:
> +               if (!hugepages_present(test->ifobj_tx)) {
> +                       ksft_test_result_skip("No 2M huge pages present.\n");
> +                       return;
> +               }
> +               test_spec_set_name(test, "UNALIGNED_INV_DESC_4K1_FRAME_SIZE");
> +               /* Odd frame size so the UMEM doesn't end near a page boundary. */
> +               test->ifobj_tx->umem->frame_size = 4001;
> +               test->ifobj_rx->umem->frame_size = 4001;
> +               test->ifobj_tx->umem->unaligned_mode = true;
> +               test->ifobj_rx->umem->unaligned_mode = true;
> +               /* This test exists to test descriptors that staddle the end of

nit: straddle

> +                * the UMEM but not a page.
> +                */
> +               {
> +                       u64 umem_size = test->ifobj_tx->umem->num_frames *
> +                                       test->ifobj_tx->umem->frame_size;
> +                       u64 page_size = sysconf(_SC_PAGESIZE);
> +
> +                       assert(umem_size % page_size > PKT_SIZE);
> +                       assert(umem_size % page_size < page_size - PKT_SIZE);
> +               }
> +               testapp_invalid_desc(test);

Please put this code in a function that you call. Declare your local
variables at the beginning of that function.

> +               break;
>         case TEST_TYPE_UNALIGNED:
>                 if (!testapp_unaligned(test))
>                         return;
> diff --git a/tools/testing/selftests/bpf/xskxceiver.h b/tools/testing/selftests/bpf/xskxceiver.h
> index cc24ab72f3ff..919327807a4e 100644
> --- a/tools/testing/selftests/bpf/xskxceiver.h
> +++ b/tools/testing/selftests/bpf/xskxceiver.h
> @@ -78,6 +78,7 @@ enum test_type {
>         TEST_TYPE_ALIGNED_INV_DESC,
>         TEST_TYPE_ALIGNED_INV_DESC_2K_FRAME,
>         TEST_TYPE_UNALIGNED_INV_DESC,
> +       TEST_TYPE_UNALIGNED_INV_DESC_4K1_FRAME,
>         TEST_TYPE_HEADROOM,
>         TEST_TYPE_TEARDOWN,
>         TEST_TYPE_BIDI,
> --
> 2.39.2
>


* Re: [PATCH bpf-next v2 06/10] xsk: Add check for unaligned descriptors that overrun UMEM
  2023-04-03 12:23   ` Magnus Karlsson
@ 2023-04-03 12:50     ` Kal Cutter Conley
  0 siblings, 0 replies; 27+ messages in thread
From: Kal Cutter Conley @ 2023-04-03 12:50 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Maxim Mikityanskiy,
	netdev, bpf, linux-kernel

>
> Let me just check that I understand the conditions under which this
> occurs. When selecting unaligned mode, there is no check that the size
> is divisible by the chunk_size as is the case in aligned mode. So we
> can register a umem that is for example 15 4K pages plus 100 bytes and
> in this case the second to last page will be marked as contiguous
> (with the CONTIG_MASK) and a packet of length 300 starting at 15*4K -
> 100 will be marked as valid even though it extends 100 bytes outside
> the umem which ends at 15*4K + 100. Did I get this correctly? If so,
> some more color in the commit message would be highly appreciated.

Yes. You don't even need to cross a page boundary. For example, a
packet of length 300 that lies entirely _within_ the final page can
still go past the end of the umem. In that case, the CONTIG_MASK is
not even looked at. The explanation is in the next commit message with
the test. I will improve the commit message here though.
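
For reference, the missing check boils down to something like this
(sketch, not the literal patch; it assumes pool->addrs_cnt holds the
UMEM size in bytes, as in struct xsk_buff_pool today):

/* Reject any unaligned descriptor whose payload would extend past the
 * end of the UMEM, independent of any page-contiguity bookkeeping.
 */
static inline bool xp_desc_overruns_umem(struct xsk_buff_pool *pool,
					 u64 addr, u32 len)
{
	return addr + len > pool->addrs_cnt;
}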

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v2 08/10] xsk: Support UMEM chunk_size > PAGE_SIZE
  2023-03-29 18:05 ` [PATCH bpf-next v2 08/10] xsk: Support UMEM chunk_size > PAGE_SIZE Kal Conley
@ 2023-04-04  7:39   ` Magnus Karlsson
  2023-04-04  8:20     ` Kal Cutter Conley
  0 siblings, 1 reply; 27+ messages in thread
From: Magnus Karlsson @ 2023-04-04  7:39 UTC (permalink / raw)
  To: Kal Conley
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jonathan Corbet, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, netdev,
	bpf, linux-doc, linux-kernel

On Wed, 29 Mar 2023 at 20:12, Kal Conley <kal.conley@dectris.com> wrote:
>
> Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
> enables sending/receiving jumbo Ethernet frames up to the theoretical
> maximum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
> to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
> SKB mode is usable pending future driver work.
>
> For consistency, check for HugeTLB pages during UMEM registration. This
> implies that hugepages are required for XDP_COPY mode despite DMA not
> being used. This restriction is desirable since it ensures user software
> can take advantage of future driver support.
>
> Even in HugeTLB mode, continue to do page accounting using order-0
> (4 KiB) pages. This minimizes the size of this change and reduces the
> risk of impacting driver code. Taking full advantage of hugepages for
> accounting should improve XDP performance in the general case.
>
> No significant change in RX/TX performance was observed with this patch.
> A few data points are reproduced below:
>
> Machine : Dell PowerEdge R940
> CPU     : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
> NIC     : MT27700 Family [ConnectX-4]
>
> +-----+------+------+-------+--------+--------+--------+
> |     |      |      | chunk | packet | rxdrop | rxdrop |
> |     | mode |  mtu |  size |   size | (Mpps) | (Gbps) |
> +-----+------+------+-------+--------+--------+--------+
> | old |   -z | 3498 |  4000 |    320 |   15.7 |   40.2 |
> | new |   -z | 3498 |  4000 |    320 |   15.8 |   40.4 |
> +-----+------+------+-------+--------+--------+--------+
> | old |   -z | 3498 |  4096 |    320 |   16.4 |   42.0 |
> | new |   -z | 3498 |  4096 |    320 |   16.3 |   41.7 |
> +-----+------+------+-------+--------+--------+--------+
> | new |   -c | 3498 | 10240 |    320 |    6.3 |   16.1 |
> +-----+------+------+-------+--------+--------+--------+
> | new |   -S | 9000 | 10240 |   9000 |   0.35 |   25.2 |
> +-----+------+------+-------+--------+--------+--------+
>
> Signed-off-by: Kal Conley <kal.conley@dectris.com>
> ---
>  Documentation/networking/af_xdp.rst | 19 +++++++----
>  include/net/xdp_sock.h              |  3 ++
>  include/net/xdp_sock_drv.h          | 12 +++++++
>  include/net/xsk_buff_pool.h         | 15 ++++++++-
>  net/xdp/xdp_umem.c                  | 50 ++++++++++++++++++++++++-----
>  net/xdp/xsk_buff_pool.c             | 30 +++++++++++------
>  6 files changed, 105 insertions(+), 24 deletions(-)
>
> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> index 247c6c4127e9..0017f83c8fb8 100644
> --- a/Documentation/networking/af_xdp.rst
> +++ b/Documentation/networking/af_xdp.rst
> @@ -419,13 +419,20 @@ XDP_UMEM_REG setsockopt
>  -----------------------
>
>  This setsockopt registers a UMEM to a socket. This is the area that
> -contain all the buffers that packet can reside in. The call takes a
> +contains all the buffers that packets can reside in. The call takes a
>  pointer to the beginning of this area and the size of it. Moreover, it
> -also has parameter called chunk_size that is the size that the UMEM is
> -divided into. It can only be 2K or 4K at the moment. If you have an
> -UMEM area that is 128K and a chunk size of 2K, this means that you
> -will be able to hold a maximum of 128K / 2K = 64 packets in your UMEM
> -area and that your largest packet size can be 2K.
> +also has a parameter called chunk_size that is the size that the UMEM is
> +divided into. For example, if you have an UMEM area that is 128K and a
> +chunk size of 2K, this means that you will be able to hold a maximum of
> +128K / 2K = 64 packets in your UMEM and that your largest packet size
> +can be 2K.
> +
> +Valid chunk sizes range from 2K to 64K. However, the chunk size must not

It has to be a power of 2, so 2K, 4K, 8K, 16K, 32K, and 64K are valid
in aligned mode. In unaligned mode, any value from 2K to 64K is fine.
Please also explain that in unaligned mode the chunk size only
signifies the maximum allowed packet size.
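
In other words, the rule is roughly this (illustrative userspace
snippet, not the kernel code):

#include <stdbool.h>
#include <stdint.h>

/* Aligned mode needs a power-of-2 chunk size; unaligned mode accepts
 * any size in range, where chunk_size only caps the packet size.
 */
static bool chunk_size_valid(uint32_t chunk_size, bool unaligned)
{
	if (chunk_size < 2048 || chunk_size > 65536)
		return false;
	return unaligned || (chunk_size & (chunk_size - 1)) == 0;
}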

> +exceed the size of a page (often 4K). This limitation is relaxed for
> +UMEM areas allocated with HugeTLB pages. In this case, chunk sizes up
> +to the system default hugepage size are supported.

Is not the max 64K as you test against XDP_UMEM_MAX_CHUNK_SIZE in
xdp_umem_reg()?

> Note, this only works
> +with hugepages allocated from the kernel's persistent pool. Using
> +Transparent Huge Pages (THP) has no effect on the maximum chunk size.
>
>  There is also an option to set the headroom of each single buffer in
>  the UMEM. If you set this to N bytes, it means that the packet will
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index e96a1151ec75..ed88880d4b68 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -28,6 +28,9 @@ struct xdp_umem {
>         struct user_struct *user;
>         refcount_t users;
>         u8 flags;
> +#ifdef CONFIG_HUGETLB_PAGE

Sanity check: have you tried compiling your code without this config set?

> +       bool hugetlb;
> +#endif
>         bool zc;
>         struct page **pgs;
>         int id;
> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> index 9c0d860609ba..83fba3060c9a 100644
> --- a/include/net/xdp_sock_drv.h
> +++ b/include/net/xdp_sock_drv.h
> @@ -12,6 +12,18 @@
>  #define XDP_UMEM_MIN_CHUNK_SHIFT 11
>  #define XDP_UMEM_MIN_CHUNK_SIZE (1 << XDP_UMEM_MIN_CHUNK_SHIFT)
>
> +static_assert(XDP_UMEM_MIN_CHUNK_SIZE <= PAGE_SIZE);
> +
> +/* Allow chunk sizes up to the maximum size of an ethernet frame (64 KiB).
> + * Larger chunks are not guaranteed to fit in a single SKB.
> + */
> +#ifdef CONFIG_HUGETLB_PAGE
> +#define XDP_UMEM_MAX_CHUNK_SHIFT min(16, HPAGE_SHIFT)
> +#else
> +#define XDP_UMEM_MAX_CHUNK_SHIFT min(16, PAGE_SHIFT)
> +#endif
> +#define XDP_UMEM_MAX_CHUNK_SIZE (1 << XDP_UMEM_MAX_CHUNK_SHIFT)
> +
>  #ifdef CONFIG_XDP_SOCKETS
>
>  void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
> diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> index d318c769b445..bb32112aefea 100644
> --- a/include/net/xsk_buff_pool.h
> +++ b/include/net/xsk_buff_pool.h
> @@ -75,6 +75,9 @@ struct xsk_buff_pool {
>         u32 chunk_size;
>         u32 chunk_shift;
>         u32 frame_len;
> +#ifdef CONFIG_HUGETLB_PAGE
> +       u32 page_size;
> +#endif
>         u8 cached_need_wakeup;
>         bool uses_need_wakeup;
>         bool dma_need_sync;
> @@ -165,6 +168,15 @@ static inline void xp_dma_sync_for_device(struct xsk_buff_pool *pool,
>         xp_dma_sync_for_device_slow(pool, dma, size);
>  }
>
> +static inline u32 xp_get_page_size(struct xsk_buff_pool *pool)
> +{
> +#ifdef CONFIG_HUGETLB_PAGE
> +       return pool->page_size;
> +#else
> +       return PAGE_SIZE;
> +#endif
> +}
> +
>  /* Masks for xdp_umem_page flags.
>   * The low 12-bits of the addr will be 0 since this is the page address, so we
>   * can use them for flags.
> @@ -175,7 +187,8 @@ static inline void xp_dma_sync_for_device(struct xsk_buff_pool *pool,
>  static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
>                                                  u64 addr, u32 len)
>  {
> -       bool cross_pg = (addr & (PAGE_SIZE - 1)) + len > PAGE_SIZE;
> +       const u32 page_size = xp_get_page_size(pool);
> +       bool cross_pg = (addr & (page_size - 1)) + len > page_size;
>
>         if (likely(!cross_pg))
>                 return false;
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index 4681e8e8ad94..8ff687d7e735 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -10,6 +10,8 @@
>  #include <linux/uaccess.h>
>  #include <linux/slab.h>
>  #include <linux/bpf.h>
> +#include <linux/hugetlb.h>
> +#include <linux/hugetlb_inline.h>
>  #include <linux/mm.h>
>  #include <linux/netdevice.h>
>  #include <linux/rtnetlink.h>
> @@ -91,8 +93,37 @@ void xdp_put_umem(struct xdp_umem *umem, bool defer_cleanup)
>         }
>  }
>
> +#ifdef CONFIG_HUGETLB_PAGE
> +
> +/* Returns true if the UMEM contains HugeTLB pages exclusively, false otherwise.
> + *
> + * The mmap_lock must be held by the caller.
> + */
> +static bool xdp_umem_is_hugetlb(struct xdp_umem *umem, unsigned long address)
> +{
> +       unsigned long end = address + umem->size;
> +       struct vm_area_struct *vma;
> +       struct vma_iterator vmi;
> +
> +       vma_iter_init(&vmi, current->mm, address);
> +       for_each_vma_range(vmi, vma, end) {
> +               if (!is_vm_hugetlb_page(vma))
> +                       return false;
> +               /* Hugepage sizes smaller than the default are not supported. */
> +               if (huge_page_size(hstate_vma(vma)) < HPAGE_SIZE)
> +                       return false;
> +       }
> +
> +       return true;
> +}
> +
> +#endif /* CONFIG_HUGETLB_PAGE */
> +
>  static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
>  {
> +#ifdef CONFIG_HUGETLB_PAGE

Let us try to get rid of most of these #ifdefs sprinkled around the
code. How about hiding this inside xdp_umem_is_hugetlb() and get rid
of these #ifdefs below? Since I believe it is quite uncommon not to
have this config enabled, we could simplify things by always using the
page_size in the pool, for example. And dito for the one in struct
xdp_umem. What do you think?
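
E.g. something like this (untested sketch) would confine the config
dependency to a single helper and keep the call sites #ifdef-free:

static bool xdp_umem_is_hugetlb(struct xdp_umem *umem, unsigned long address)
{
#ifdef CONFIG_HUGETLB_PAGE
	unsigned long end = address + umem->size;
	struct vm_area_struct *vma;
	struct vma_iterator vmi;

	if (!IS_ALIGNED(address, HPAGE_SIZE))
		return false;

	vma_iter_init(&vmi, current->mm, address);
	for_each_vma_range(vmi, vma, end) {
		if (!is_vm_hugetlb_page(vma))
			return false;
		/* Hugepage sizes smaller than the default are not supported. */
		if (huge_page_size(hstate_vma(vma)) < HPAGE_SIZE)
			return false;
	}
	return true;
#else
	return false;
#endif
}

Then xdp_umem_pin_pages() can unconditionally do
umem->hugetlb = xdp_umem_is_hugetlb(umem, address); and fail with
-EINVAL when chunk_size > PAGE_SIZE but !umem->hugetlb.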

> +       bool need_hugetlb = umem->chunk_size > PAGE_SIZE;
> +#endif
>         unsigned int gup_flags = FOLL_WRITE;
>         long npgs;
>         int err;
> @@ -102,8 +133,18 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
>                 return -ENOMEM;
>
>         mmap_read_lock(current->mm);
> +
> +#ifdef CONFIG_HUGETLB_PAGE
> +       umem->hugetlb = IS_ALIGNED(address, HPAGE_SIZE) && xdp_umem_is_hugetlb(umem, address);
> +       if (need_hugetlb && !umem->hugetlb) {
> +               mmap_read_unlock(current->mm);
> +               err = -EINVAL;
> +               goto out_pgs;
> +       }
> +#endif
>         npgs = pin_user_pages(address, umem->npgs,
>                               gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
> +
>         mmap_read_unlock(current->mm);
>
>         if (npgs != umem->npgs) {
> @@ -156,15 +197,8 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>         unsigned int chunks, chunks_rem;
>         int err;
>
> -       if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > PAGE_SIZE) {
> -               /* Strictly speaking we could support this, if:
> -                * - huge pages, or*
> -                * - using an IOMMU, or
> -                * - making sure the memory area is consecutive
> -                * but for now, we simply say "computer says no".
> -                */
> +       if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > XDP_UMEM_MAX_CHUNK_SIZE)
>                 return -EINVAL;
> -       }
>
>         if (mr->flags & ~XDP_UMEM_UNALIGNED_CHUNK_FLAG)
>                 return -EINVAL;
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index b2df1e0f8153..10933f78a5a2 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -80,9 +80,12 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
>         pool->headroom = umem->headroom;
>         pool->chunk_size = umem->chunk_size;
>         pool->chunk_shift = ffs(umem->chunk_size) - 1;
> -       pool->unaligned = unaligned;
>         pool->frame_len = umem->chunk_size - umem->headroom -
>                 XDP_PACKET_HEADROOM;
> +#ifdef CONFIG_HUGETLB_PAGE
> +       pool->page_size = umem->hugetlb ? HPAGE_SIZE : PAGE_SIZE;
> +#endif
> +       pool->unaligned = unaligned;
>         pool->umem = umem;
>         pool->addrs = umem->addrs;
>         INIT_LIST_HEAD(&pool->free_list);
> @@ -369,16 +372,25 @@ void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs)
>  }
>  EXPORT_SYMBOL(xp_dma_unmap);
>
> -static void xp_check_dma_contiguity(struct xsk_dma_map *dma_map)
> +/* HugeTLB pools consider contiguity at hugepage granularity only. Hence, all
> + * order-0 pages within a hugepage have the same contiguity value.
> + */
> +static void xp_check_dma_contiguity(struct xsk_dma_map *dma_map, u32 page_size)
>  {
> -       u32 i;
> +       u32 stride = page_size >> PAGE_SHIFT; /* in order-0 pages */
> +       u32 i, j;
>
> -       for (i = 0; i < dma_map->dma_pages_cnt - 1; i++) {
> -               if (dma_map->dma_pages[i] + PAGE_SIZE == dma_map->dma_pages[i + 1])
> -                       dma_map->dma_pages[i] |= XSK_NEXT_PG_CONTIG_MASK;
> -               else
> -                       dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
> +       for (i = 0; i + stride < dma_map->dma_pages_cnt;) {
> +               if (dma_map->dma_pages[i] + page_size == dma_map->dma_pages[i + stride]) {
> +                       for (j = 0; j < stride; i++, j++)
> +                               dma_map->dma_pages[i] |= XSK_NEXT_PG_CONTIG_MASK;
> +               } else {
> +                       for (j = 0; j < stride; i++, j++)
> +                               dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
> +               }

Still somewhat too conservative :-). If your page size is large you
will waste a lot of the umem.  For the last page mark all the 4K
"pages" that cannot cross the end of the umem due to the max size of a
packet with the XSK_NEXT_PG_CONTIG_MASK bit. So you only need to add
one more for-loop here to mark this, and then adjust the last for-loop
below so it only marks the last bunch of 4K pages at the end of the
umem as not contiguous.

>         }
> +       for (; i < dma_map->dma_pages_cnt; i++)
> +               dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
>  }
>
>  static int xp_init_dma_info(struct xsk_buff_pool *pool, struct xsk_dma_map *dma_map)
> @@ -441,7 +453,7 @@ int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
>         }
>
>         if (pool->unaligned)
> -               xp_check_dma_contiguity(dma_map);
> +               xp_check_dma_contiguity(dma_map, xp_get_page_size(pool));
>
>         err = xp_init_dma_info(pool, dma_map);
>         if (err) {
> --
> 2.39.2
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v2 08/10] xsk: Support UMEM chunk_size > PAGE_SIZE
  2023-04-04  7:39   ` Magnus Karlsson
@ 2023-04-04  8:20     ` Kal Cutter Conley
  2023-04-04  9:29       ` Magnus Karlsson
  0 siblings, 1 reply; 27+ messages in thread
From: Kal Cutter Conley @ 2023-04-04  8:20 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jonathan Corbet, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, netdev,
	bpf, linux-doc, linux-kernel

> Is not the max 64K as you test against XDP_UMEM_MAX_CHUNK_SIZE in
> xdp_umem_reg()?

The absolute max is 64K. In the case of HPAGE_SIZE < 64K, then it
would be HPAGE_SIZE.

> > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > index e96a1151ec75..ed88880d4b68 100644
> > --- a/include/net/xdp_sock.h
> > +++ b/include/net/xdp_sock.h
> > @@ -28,6 +28,9 @@ struct xdp_umem {
> >         struct user_struct *user;
> >         refcount_t users;
> >         u8 flags;
> > +#ifdef CONFIG_HUGETLB_PAGE
>
> Sanity check: have you tried compiling your code without this config set?

Yes. The CI does this also on one of the platforms (hence some of the
bot errors in v1).

> >  static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
> >  {
> > +#ifdef CONFIG_HUGETLB_PAGE
>
> Let us try to get rid of most of these #ifdefs sprinkled around the
> code. How about hiding this inside xdp_umem_is_hugetlb() and get rid
> of these #ifdefs below? Since I believe it is quite uncommon not to
> have this config enabled, we could simplify things by always using the
> page_size in the pool, for example. And dito for the one in struct
> xdp_umem. What do you think?

I used #ifdef for `page_size` in the pool for maximum performance when
huge pages are disabled. We could also not worry about optimizing this
uncommon case though since the performance impact is very small.
However, I don't find the #ifdefs excessive either.

> > +static void xp_check_dma_contiguity(struct xsk_dma_map *dma_map, u32 page_size)
> >  {
> > -       u32 i;
> > +       u32 stride = page_size >> PAGE_SHIFT; /* in order-0 pages */
> > +       u32 i, j;
> >
> > -       for (i = 0; i < dma_map->dma_pages_cnt - 1; i++) {
> > -               if (dma_map->dma_pages[i] + PAGE_SIZE == dma_map->dma_pages[i + 1])
> > -                       dma_map->dma_pages[i] |= XSK_NEXT_PG_CONTIG_MASK;
> > -               else
> > -                       dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
> > +       for (i = 0; i + stride < dma_map->dma_pages_cnt;) {
> > +               if (dma_map->dma_pages[i] + page_size == dma_map->dma_pages[i + stride]) {
> > +                       for (j = 0; j < stride; i++, j++)
> > +                               dma_map->dma_pages[i] |= XSK_NEXT_PG_CONTIG_MASK;
> > +               } else {
> > +                       for (j = 0; j < stride; i++, j++)
> > +                               dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
> > +               }
>
> Still somewhat too conservative :-). If your page size is large you
> will waste a lot of the umem.  For the last page mark all the 4K
> "pages" that cannot cross the end of the umem due to the max size of a
> packet with the XSK_NEXT_PG_CONTIG_MASK bit. So you only need to add
> one more for-loop here to mark this, and then adjust the last for-loop
> below so it only marks the last bunch of 4K pages at the end of the
> umem as not contiguous.

I don't understand the issue. The XSK_NEXT_PG_CONTIG_MASK bit is only
looked at if the descriptor actually crosses a page boundary. I don't
think the current implementation wastes any UMEM.
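
The short-circuit is visible in xp_desc_crosses_non_contig_pg()
(simplified; the final lookup is the existing code just below the
hunk):

	bool cross_pg = (addr & (page_size - 1)) + len > page_size;

	if (likely(!cross_pg))
		return false;	/* contiguity bits never consulted */

	return pool->dma_pages_cnt &&
	       !(pool->dma_pages[addr >> PAGE_SHIFT] & XSK_NEXT_PG_CONTIG_MASK);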

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v2 08/10] xsk: Support UMEM chunk_size > PAGE_SIZE
  2023-04-04  8:20     ` Kal Cutter Conley
@ 2023-04-04  9:29       ` Magnus Karlsson
  2023-04-04 10:33         ` Kal Cutter Conley
  0 siblings, 1 reply; 27+ messages in thread
From: Magnus Karlsson @ 2023-04-04  9:29 UTC (permalink / raw)
  To: Kal Cutter Conley
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jonathan Corbet, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, netdev,
	bpf, linux-doc, linux-kernel

On Tue, 4 Apr 2023 at 10:15, Kal Cutter Conley <kal.conley@dectris.com> wrote:
>
> > Is not the max 64K as you test against XDP_UMEM_MAX_CHUNK_SIZE in
> > xdp_umem_reg()?
>
> The absolute max is 64K. In the case of HPAGE_SIZE < 64K, then it
> would be HPAGE_SIZE.

Is there such a case when HPAGE_SIZE would be less than 64K? If not,
then just write 64K.

> > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > index e96a1151ec75..ed88880d4b68 100644
> > > --- a/include/net/xdp_sock.h
> > > +++ b/include/net/xdp_sock.h
> > > @@ -28,6 +28,9 @@ struct xdp_umem {
> > >         struct user_struct *user;
> > >         refcount_t users;
> > >         u8 flags;
> > > +#ifdef CONFIG_HUGETLB_PAGE
> >
> > Sanity check: have you tried compiling your code without this config set?
>
> Yes. The CI does this also on one of the platforms (hence some of the
> bot errors in v1).

Perfect!

> > >  static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
> > >  {
> > > +#ifdef CONFIG_HUGETLB_PAGE
> >
> > Let us try to get rid of most of these #ifdefs sprinkled around the
> > code. How about hiding this inside xdp_umem_is_hugetlb() and get rid
> > of these #ifdefs below? Since I believe it is quite uncommon not to
> > have this config enabled, we could simplify things by always using the
> > page_size in the pool, for example. And dito for the one in struct
> > xdp_umem. What do you think?
>
> I used #ifdef for `page_size` in the pool for maximum performance when
> huge pages are disabled. We could also not worry about optimizing this
> uncommon case though since the performance impact is very small.
> However, I don't find the #ifdefs excessive either.

Keep them to a minimum please since there are few of them in the
current code outside of some header files. And let us assume that
CONFIG_HUGETLB_PAGE is the common case.

> > > +static void xp_check_dma_contiguity(struct xsk_dma_map *dma_map, u32 page_size)
> > >  {
> > > -       u32 i;
> > > +       u32 stride = page_size >> PAGE_SHIFT; /* in order-0 pages */
> > > +       u32 i, j;
> > >
> > > -       for (i = 0; i < dma_map->dma_pages_cnt - 1; i++) {
> > > -               if (dma_map->dma_pages[i] + PAGE_SIZE == dma_map->dma_pages[i + 1])
> > > -                       dma_map->dma_pages[i] |= XSK_NEXT_PG_CONTIG_MASK;
> > > -               else
> > > -                       dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
> > > +       for (i = 0; i + stride < dma_map->dma_pages_cnt;) {
> > > +               if (dma_map->dma_pages[i] + page_size == dma_map->dma_pages[i + stride]) {
> > > +                       for (j = 0; j < stride; i++, j++)
> > > +                               dma_map->dma_pages[i] |= XSK_NEXT_PG_CONTIG_MASK;
> > > +               } else {
> > > +                       for (j = 0; j < stride; i++, j++)
> > > +                               dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
> > > +               }
> >
> > Still somewhat too conservative :-). If your page size is large you
> > will waste a lot of the umem.  For the last page mark all the 4K
> > "pages" that cannot cross the end of the umem due to the max size of a
> > packet with the XSK_NEXT_PG_CONTIG_MASK bit. So you only need to add
> > one more for-loop here to mark this, and then adjust the last for-loop
> > below so it only marks the last bunch of 4K pages at the end of the
> > umem as not contiguous.
>
> I don't understand the issue. The XSK_NEXT_PG_CONTIG_MASK bit is only
> looked at if the descriptor actually crosses a page boundary. I don't
> think the current implementation wastes any UMEM.

I stand corrected. You do not waste any space, so please ignore.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v2 08/10] xsk: Support UMEM chunk_size > PAGE_SIZE
  2023-04-04  9:29       ` Magnus Karlsson
@ 2023-04-04 10:33         ` Kal Cutter Conley
  2023-04-04 11:14           ` Magnus Karlsson
  0 siblings, 1 reply; 27+ messages in thread
From: Kal Cutter Conley @ 2023-04-04 10:33 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jonathan Corbet, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, netdev,
	bpf, linux-doc, linux-kernel

> > > Is not the max 64K as you test against XDP_UMEM_MAX_CHUNK_SIZE in
> > > xdp_umem_reg()?
> >
> > The absolute max is 64K. In the case of HPAGE_SIZE < 64K, then it
> > would be HPAGE_SIZE.
>
> Is there such a case when HPAGE_SIZE would be less than 64K? If not,
> then just write 64K.

Yes. While most platforms have HPAGE_SIZE defined to a compile-time
constant >= 64K (very often 2M) there are platforms (at least ia64 and
powerpc) where the hugepage size is configured at boot. Specifically,
in the case of Itanium (ia64), the hugepage size may be configured at
boot to any valid page size > PAGE_SIZE (e.g. 8K). See:
https://elixir.bootlin.com/linux/latest/source/arch/ia64/mm/hugetlbpage.c#L159

>
> > > >  static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
> > > >  {
> > > > +#ifdef CONFIG_HUGETLB_PAGE
> > >
> > > Let us try to get rid of most of these #ifdefs sprinkled around the
> > > code. How about hiding this inside xdp_umem_is_hugetlb() and get rid
> > > of these #ifdefs below? Since I believe it is quite uncommon not to
> > > have this config enabled, we could simplify things by always using the
> > > page_size in the pool, for example. And dito for the one in struct
> > > xdp_umem. What do you think?
> >
> > I used #ifdef for `page_size` in the pool for maximum performance when
> > huge pages are disabled. We could also not worry about optimizing this
> > uncommon case though since the performance impact is very small.
> > However, I don't find the #ifdefs excessive either.
>
> Keep them to a minimum please since there are few of them in the
> current code outside of some header files. And let us assume that
> CONFIG_HUGETLB_PAGE is the common case.
>

Would you be OK if I just remove the ones from xsk_buff_pool? I think
the code in xdp_umem.c is quite readable and the #ifdefs are really
only used in xdp_umem_pin_pages.
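
Concretely, the xsk_buff_pool side would then reduce to something like
this (sketch; note HPAGE_SIZE may still need a fallback definition when
CONFIG_HUGETLB_PAGE is not set, which is what tripped the bots in v1):

	/* at pool creation, with umem->hugetlb always present: */
	pool->page_size = umem->hugetlb ? HPAGE_SIZE : PAGE_SIZE;

static inline u32 xp_get_page_size(struct xsk_buff_pool *pool)
{
	return pool->page_size;
}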

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v2 08/10] xsk: Support UMEM chunk_size > PAGE_SIZE
  2023-04-04 10:33         ` Kal Cutter Conley
@ 2023-04-04 11:14           ` Magnus Karlsson
  2023-04-04 12:18             ` Kal Cutter Conley
  0 siblings, 1 reply; 27+ messages in thread
From: Magnus Karlsson @ 2023-04-04 11:14 UTC (permalink / raw)
  To: Kal Cutter Conley
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jonathan Corbet, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, netdev,
	bpf, linux-doc, linux-kernel

On Tue, 4 Apr 2023 at 12:29, Kal Cutter Conley <kal.conley@dectris.com> wrote:
>
> > > > Is not the max 64K as you test against XDP_UMEM_MAX_CHUNK_SIZE in
> > > > xdp_umem_reg()?
> > >
> > > The absolute max is 64K. In the case of HPAGE_SIZE < 64K, then it
> > > would be HPAGE_SIZE.
> >
> > Is there such a case when HPAGE_SIZE would be less than 64K? If not,
> > then just write 64K.
>
> Yes. While most platforms have HPAGE_SIZE defined to a compile-time
> constant >= 64K (very often 2M) there are platforms (at least ia64 and
> powerpc) where the hugepage size is configured at boot. Specifically,
> in the case of Itanium (ia64), the hugepage size may be configured at
> boot to any valid page size > PAGE_SIZE (e.g. 8K). See:
> https://elixir.bootlin.com/linux/latest/source/arch/ia64/mm/hugetlbpage.c#L159

So for all practical purposes it is max 64K. Let us just write that then.

> >
> > > > >  static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
> > > > >  {
> > > > > +#ifdef CONFIG_HUGETLB_PAGE
> > > >
> > > > Let us try to get rid of most of these #ifdefs sprinkled around the
> > > > code. How about hiding this inside xdp_umem_is_hugetlb() and get rid
> > > > of these #ifdefs below? Since I believe it is quite uncommon not to
> > > > have this config enabled, we could simplify things by always using the
> > > > page_size in the pool, for example. And dito for the one in struct
> > > > xdp_umem. What do you think?
> > >
> > > I used #ifdef for `page_size` in the pool for maximum performance when
> > > huge pages are disabled. We could also not worry about optimizing this
> > > uncommon case though since the performance impact is very small.
> > > However, I don't find the #ifdefs excessive either.
> >
> > Keep them to a minimum please since there are few of them in the
> > current code outside of some header files. And let us assume that
> > CONFIG_HUGETLB_PAGE is the common case.
> >
>
> Would you be OK if I just remove the ones from xsk_buff_pool? I think
> the code in xdp_umem.c is quite readable and the #ifdefs are really
> only used in xdp_umem_pin_pages.

Please make an effort to remove the ones in xdp_umem.c too. The more
ifdefs you add, the harder it will be to read.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v2 08/10] xsk: Support UMEM chunk_size > PAGE_SIZE
  2023-04-04 11:14           ` Magnus Karlsson
@ 2023-04-04 12:18             ` Kal Cutter Conley
  2023-04-04 12:37               ` Kal Cutter Conley
  0 siblings, 1 reply; 27+ messages in thread
From: Kal Cutter Conley @ 2023-04-04 12:18 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jonathan Corbet, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, netdev,
	bpf, linux-doc, linux-kernel

> > > > > Is not the max 64K as you test against XDP_UMEM_MAX_CHUNK_SIZE in
> > > > > xdp_umem_reg()?
> > > >
> > > > The absolute max is 64K. In the case of HPAGE_SIZE < 64K, then it
> > > > would be HPAGE_SIZE.
> > >
> > > Is there such a case when HPAGE_SIZE would be less than 64K? If not,
> > > then just write 64K.
> >
> > Yes. While most platforms have HPAGE_SIZE defined to a compile-time
> > constant >= 64K (very often 2M) there are platforms (at least ia64 and
> > powerpc) where the hugepage size is configured at boot. Specifically,
> > in the case of Itanium (ia64), the hugepage size may be configured at
> > boot to any valid page size > PAGE_SIZE (e.g. 8K). See:
> > https://elixir.bootlin.com/linux/latest/source/arch/ia64/mm/hugetlbpage.c#L159
>
> So for all practical purposes it is max 64K. Let us just write that then.

What about when CONFIG_HUGETLB_PAGE is not defined? Should we keep it
set to PAGE_SIZE in that case, or would you like it to be a fixed
constant == 64K always?
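
I.e. should the fallback stay tied to PAGE_SIZE as in this patch, or
become a plain fixed maximum, e.g. (hypothetical):

#define XDP_UMEM_MAX_CHUNK_SHIFT 16
#define XDP_UMEM_MAX_CHUNK_SIZE (1 << XDP_UMEM_MAX_CHUNK_SHIFT)

with any HPAGE_SIZE/PAGE_SIZE cap enforced separately in
xdp_umem_reg()?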

>
> > >
> > > > > >  static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
> > > > > >  {
> > > > > > +#ifdef CONFIG_HUGETLB_PAGE
> > > > >
> > > > > Let us try to get rid of most of these #ifdefs sprinkled around the
> > > > > code. How about hiding this inside xdp_umem_is_hugetlb() and get rid
> > > > > of these #ifdefs below? Since I believe it is quite uncommon not to
> > > > > have this config enabled, we could simplify things by always using the
> > > > > page_size in the pool, for example. And dito for the one in struct
> > > > > xdp_umem. What do you think?
> > > >
> > > > I used #ifdef for `page_size` in the pool for maximum performance when
> > > > huge pages are disabled. We could also not worry about optimizing this
> > > > uncommon case though since the performance impact is very small.
> > > > However, I don't find the #ifdefs excessive either.
> > >
> > > Keep them to a minimum please since there are few of them in the
> > > current code outside of some header files. And let us assume that
> > > CONFIG_HUGETLB_PAGE is the common case.
> > >
> >
> > Would you be OK if I just remove the ones from xsk_buff_pool? I think
> > the code in xdp_umem.c is quite readable and the #ifdefs are really
> > only used in xdp_umem_pin_pages.
>
> Please make an effort to remove the ones in xdp_umem.c too. The more
> ifdefs you add, the harder it will be to read.

OK

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v2 08/10] xsk: Support UMEM chunk_size > PAGE_SIZE
  2023-04-04 12:18             ` Kal Cutter Conley
@ 2023-04-04 12:37               ` Kal Cutter Conley
  2023-04-04 12:54                 ` Magnus Karlsson
  0 siblings, 1 reply; 27+ messages in thread
From: Kal Cutter Conley @ 2023-04-04 12:37 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jonathan Corbet, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, netdev,
	bpf, linux-doc, linux-kernel

> > > > > > Is not the max 64K as you test against XDP_UMEM_MAX_CHUNK_SIZE in
> > > > > > xdp_umem_reg()?
> > > > >
> > > > > The absolute max is 64K. In the case of HPAGE_SIZE < 64K, then it
> > > > > would be HPAGE_SIZE.
> > > >
> > > > Is there such a case when HPAGE_SIZE would be less than 64K? If not,
> > > > then just write 64K.
> > >
> > > Yes. While most platforms have HPAGE_SIZE defined to a compile-time
> > > constant >= 64K (very often 2M) there are platforms (at least ia64 and
> > > powerpc) where the hugepage size is configured at boot. Specifically,
> > > in the case of Itanium (ia64), the hugepage size may be configured at
> > > boot to any valid page size > PAGE_SIZE (e.g. 8K). See:
> > > https://elixir.bootlin.com/linux/latest/source/arch/ia64/mm/hugetlbpage.c#L159
> >
> > So for all practical purposes it is max 64K. Let us just write that then.
>
> What about when CONFIG_HUGETLB_PAGE is not defined? Should we keep it
> set to PAGE_SIZE in that case, or would you like it to be a fixed
> constant == 64K always?

Sorry, now it's not clear to me whether you are suggesting that the
documentation be changed, the code, or both.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v2 08/10] xsk: Support UMEM chunk_size > PAGE_SIZE
  2023-04-04 12:37               ` Kal Cutter Conley
@ 2023-04-04 12:54                 ` Magnus Karlsson
  0 siblings, 0 replies; 27+ messages in thread
From: Magnus Karlsson @ 2023-04-04 12:54 UTC (permalink / raw)
  To: Kal Cutter Conley
  Cc: Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jonathan Corbet, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, netdev,
	bpf, linux-doc, linux-kernel

On Tue, 4 Apr 2023 at 14:32, Kal Cutter Conley <kal.conley@dectris.com> wrote:
>
> > > > > > > Is not the max 64K as you test against XDP_UMEM_MAX_CHUNK_SIZE in
> > > > > > > xdp_umem_reg()?
> > > > > >
> > > > > > The absolute max is 64K. In the case of HPAGE_SIZE < 64K, then it
> > > > > > would be HPAGE_SIZE.
> > > > >
> > > > > Is there such a case when HPAGE_SIZE would be less than 64K? If not,
> > > > > then just write 64K.
> > > >
> > > > Yes. While most platforms have HPAGE_SIZE defined to a compile-time
> > > > constant >= 64K (very often 2M) there are platforms (at least ia64 and
> > > > powerpc) where the hugepage size is configured at boot. Specifically,
> > > > in the case of Itanium (ia64), the hugepage size may be configured at
> > > > boot to any valid page size > PAGE_SIZE (e.g. 8K). See:
> > > > https://elixir.bootlin.com/linux/latest/source/arch/ia64/mm/hugetlbpage.c#L159
> > >
> > > So for all practical purposes it is max 64K. Let us just write that then.
> >
> > What about when CONFIG_HUGETLB_PAGE is not defined? Should we keep it
> > set to PAGE_SIZE in that case, or would you like it to be a fixed
> > constant == 64K always?
>
> Sorry, now it's not clear to me whether you are suggesting that the
> documentation be changed, the code, or both.

The documentation.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread

Thread overview: 27+ messages
2023-03-29 18:04 [PATCH bpf-next v2 00/10] xsk: Support UMEM chunk_size > PAGE_SIZE Kal Conley
2023-03-29 18:04 ` [PATCH bpf-next v2 01/10] selftests: xsk: Add xskxceiver.h dependency to Makefile Kal Conley
2023-03-29 18:04 ` [PATCH bpf-next v2 02/10] selftests: xsk: Use correct UMEM size in testapp_invalid_desc Kal Conley
2023-03-29 18:04 ` [PATCH bpf-next v2 03/10] selftests: xsk: Add test case for packets at end of UMEM Kal Conley
2023-03-29 18:04 ` [PATCH bpf-next v2 04/10] selftests: xsk: Deflakify STATS_RX_DROPPED test Kal Conley
2023-04-03 10:54   ` Magnus Karlsson
2023-04-03 11:06     ` Kal Cutter Conley
2023-04-03 11:40       ` Kal Cutter Conley
2023-04-03 11:38         ` Magnus Karlsson
2023-04-03 11:47           ` Kal Cutter Conley
2023-03-29 18:04 ` [PATCH bpf-next v2 05/10] selftests: xsk: Disable IPv6 on VETH1 Kal Conley
2023-03-29 18:04 ` [PATCH bpf-next v2 06/10] xsk: Add check for unaligned descriptors that overrun UMEM Kal Conley
2023-04-03 12:23   ` Magnus Karlsson
2023-04-03 12:50     ` Kal Cutter Conley
2023-03-29 18:04 ` [PATCH bpf-next v2 07/10] selftests: xsk: Add test UNALIGNED_INV_DESC_4K1_FRAME_SIZE Kal Conley
2023-04-03 12:25   ` Magnus Karlsson
2023-03-29 18:05 ` [PATCH bpf-next v2 08/10] xsk: Support UMEM chunk_size > PAGE_SIZE Kal Conley
2023-04-04  7:39   ` Magnus Karlsson
2023-04-04  8:20     ` Kal Cutter Conley
2023-04-04  9:29       ` Magnus Karlsson
2023-04-04 10:33         ` Kal Cutter Conley
2023-04-04 11:14           ` Magnus Karlsson
2023-04-04 12:18             ` Kal Cutter Conley
2023-04-04 12:37               ` Kal Cutter Conley
2023-04-04 12:54                 ` Magnus Karlsson
2023-03-29 18:05 ` [PATCH bpf-next v2 09/10] selftests: xsk: Use hugepages when umem->frame_size " Kal Conley
2023-03-29 18:05 ` [PATCH bpf-next v2 10/10] selftests: xsk: Add tests for 8K and 9K frame sizes Kal Conley
