* [PATCH net-next 0/6] remove page frag implementation in vhost_net
@ 2024-01-03  9:56 ` Yunsheng Lin
  0 siblings, 0 replies; 31+ messages in thread
From: Yunsheng Lin @ 2024-01-03  9:56 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Matthias Brugger,
	AngeloGioacchino Del Regno, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, linux-arm-kernel,
	linux-mediatek, bpf

Currently there are three implementations for page frag:

1. mm/page_alloc.c: net stack seems to be using it in the
   rx part with 'struct page_frag_cache' and the main API
   being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
   tx part with 'struct page_frag' and the main API being
   skb_page_frag_refill().
3. drivers/vhost/net.c: vhost seems to be using it to build
   an xdp frame, and its implementation seems to be a mix of
   the above two (both structs are sketched below).
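
For orientation, the two structs differ mainly in that 'struct
page_frag_cache' maintains a refcount bias so the page refcount is
not touched on every allocation; roughly (simplified from
include/linux/types.h and include/linux/mm_types.h, config-dependent
field widths and ifdefs dropped):

    struct page_frag {              /* tx path, net/core/sock.c */
            struct page *page;
            __u32 offset;
            __u32 size;
    };

    struct page_frag_cache {        /* rx path, mm/page_alloc.c */
            void *va;
            __u16 offset;
            __u16 size;
            unsigned int pagecnt_bias;  /* refcount bias */
            bool pfmemalloc;
    };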

This patchset tries to unify the page frag implementation a
little bit by unifying the gfp bits for order 3 page allocation
and replacing the page frag implementation in drivers/vhost/net.c
with the one in page_alloc.c.

After this patchset, we not only unify the page frag
implementation a little, but also get about a 0.5% performance
boost when testing with the vhost_net_test introduced in the
last patch.
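
A minimal usage sketch of the page frag API as it looks after this
series (illustrative only, not taken from the patches):

    struct page_frag_cache nc = {};
    void *buf;

    /* carve a 256-byte, cache-line aligned frag out of the
     * cached (ideally order 3) page
     */
    buf = page_frag_alloc_align(&nc, 256, GFP_KERNEL,
                                SMP_CACHE_BYTES);
    if (buf) {
            /* ... fill the frag and hand it off ... */
            page_frag_free(buf);
    }

    /* on teardown, release whatever the cache still holds */
    page_frag_cache_drain(&nc);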

Before this patchset:
Performance counter stats for './vhost_net_test' (10 runs):

     305325.78 msec task-clock                       #    1.738 CPUs utilized               ( +-  0.12% )
       1048668      context-switches                 #    3.435 K/sec                       ( +-  0.00% )
            11      cpu-migrations                   #    0.036 /sec                        ( +- 17.64% )
            33      page-faults                      #    0.108 /sec                        ( +-  0.49% )
  244651819491      cycles                           #    0.801 GHz                         ( +-  0.43% )  (64)
   64714638024      stalled-cycles-frontend          #   26.45% frontend cycles idle        ( +-  2.19% )  (67)
   30774313491      stalled-cycles-backend           #   12.58% backend cycles idle         ( +-  7.68% )  (70)
  201749748680      instructions                     #    0.82  insn per cycle
                                              #    0.32  stalled cycles per insn     ( +-  0.41% )  (66.76%)
   65494787909      branches                         #  214.508 M/sec                       ( +-  0.35% )  (64)
    4284111313      branch-misses                    #    6.54% of all branches             ( +-  0.45% )  (66)

       175.699 +- 0.189 seconds time elapsed  ( +-  0.11% )


After this patchset:
Performance counter stats for './vhost_net_test' (10 runs):

     303974.38 msec task-clock                       #    1.739 CPUs utilized               ( +-  0.14% )
       1048807      context-switches                 #    3.450 K/sec                       ( +-  0.00% )
            14      cpu-migrations                   #    0.046 /sec                        ( +- 12.86% )
            33      page-faults                      #    0.109 /sec                        ( +-  0.46% )
  251289376347      cycles                           #    0.827 GHz                         ( +-  0.32% )  (60)
   67885175415      stalled-cycles-frontend          #   27.01% frontend cycles idle        ( +-  0.48% )  (63)
   27809282600      stalled-cycles-backend           #   11.07% backend cycles idle         ( +-  0.36% )  (71)
  195543234672      instructions                     #    0.78  insn per cycle
                                              #    0.35  stalled cycles per insn     ( +-  0.29% )  (69.04%)
   62423183552      branches                         #  205.357 M/sec                       ( +-  0.48% )  (67)
    4135666632      branch-misses                    #    6.63% of all branches             ( +-  0.63% )  (67)

       174.764 +- 0.214 seconds time elapsed  ( +-  0.12% )

Changelog:
V2: Change 'xor'd' to 'masked off', add vhost tx testing for
    vhost_net_test.

V1: Fix some typos, drop the RFC tag and rebase on latest net-next.

Yunsheng Lin (6):
  mm/page_alloc: modify page_frag_alloc_align() to accept align as an
    argument
  page_frag: unify gfp bits for order 3 page allocation
  mm/page_alloc: use initial zero offset for page_frag_alloc_align()
  vhost/net: remove vhost_net_page_frag_refill()
  net: introduce page_frag_cache_drain()
  tools: virtio: introduce vhost_net_test

 drivers/net/ethernet/google/gve/gve_main.c |  11 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.c |  17 +-
 drivers/nvme/host/tcp.c                    |   7 +-
 drivers/nvme/target/tcp.c                  |   4 +-
 drivers/vhost/net.c                        |  91 +---
 include/linux/gfp.h                        |   6 +-
 include/linux/skbuff.h                     |  22 +-
 mm/page_alloc.c                            |  48 +-
 net/core/skbuff.c                          |  14 +-
 net/core/sock.c                            |   2 +-
 tools/virtio/Makefile                      |   8 +-
 tools/virtio/vhost_net_test.c              | 574 +++++++++++++++++++++
 12 files changed, 657 insertions(+), 147 deletions(-)
 create mode 100644 tools/virtio/vhost_net_test.c

-- 
2.33.0


* [PATCH net-next 1/6] mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument
  2024-01-03  9:56 ` Yunsheng Lin
@ 2024-01-03  9:56 ` Yunsheng Lin
  2024-01-05 15:28   ` Alexander H Duyck
  -1 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2024-01-03  9:56 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
	Andrew Morton, Eric Dumazet, linux-mm

napi_alloc_frag_align() and netdev_alloc_frag_align() accept
align as an argument, and they are thin wrappers around the
__napi_alloc_frag_align() and __netdev_alloc_frag_align() APIs
doing the align and align_mask conversion, in order to call
page_frag_alloc_align() directly.

As __napi_alloc_frag_align() and __netdev_alloc_frag_align()
APIs are only used by the above thin wrappers, it seems that
it makes more sense to remove align and align_mask conversion
and call page_frag_alloc_align() directly. By doing that, we
can also avoid the confusion between napi_alloc_frag_align()
accepting align as an argument and page_frag_alloc_align()
accepting align_mask as an argument when they both have the
'align' suffix.
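
For reference, the conversion the wrappers were doing relies on the
two's complement identity -align == ~(align - 1) for a power-of-2
align, so masking an offset with -align rounds it down to an align
boundary; a small sketch (not part of the patch):

    unsigned int align = 64;                /* must be a power of 2 */
    unsigned int align_mask = -align;       /* 0xffffffc0 */
    unsigned int offset = 100;

    offset &= align_mask;                   /* offset is now 64 */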

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
---
 include/linux/gfp.h    |  4 ++--
 include/linux/skbuff.h | 22 ++++------------------
 mm/page_alloc.c        |  6 ++++--
 net/core/skbuff.c      | 14 +++++++-------
 4 files changed, 17 insertions(+), 29 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index de292a007138..bbd75976541e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -314,12 +314,12 @@ struct page_frag_cache;
 extern void __page_frag_cache_drain(struct page *page, unsigned int count);
 extern void *page_frag_alloc_align(struct page_frag_cache *nc,
 				   unsigned int fragsz, gfp_t gfp_mask,
-				   unsigned int align_mask);
+				   unsigned int align);
 
 static inline void *page_frag_alloc(struct page_frag_cache *nc,
 			     unsigned int fragsz, gfp_t gfp_mask)
 {
-	return page_frag_alloc_align(nc, fragsz, gfp_mask, ~0u);
+	return page_frag_alloc_align(nc, fragsz, gfp_mask, 1);
 }
 
 extern void page_frag_free(void *addr);
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a5ae952454c8..c0a3b44ef5da 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3192,7 +3192,7 @@ static inline void skb_queue_purge(struct sk_buff_head *list)
 unsigned int skb_rbtree_purge(struct rb_root *root);
 void skb_errqueue_purge(struct sk_buff_head *list);
 
-void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask);
+void *netdev_alloc_frag_align(unsigned int fragsz, unsigned int align);
 
 /**
  * netdev_alloc_frag - allocate a page fragment
@@ -3203,14 +3203,7 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask);
  */
 static inline void *netdev_alloc_frag(unsigned int fragsz)
 {
-	return __netdev_alloc_frag_align(fragsz, ~0u);
-}
-
-static inline void *netdev_alloc_frag_align(unsigned int fragsz,
-					    unsigned int align)
-{
-	WARN_ON_ONCE(!is_power_of_2(align));
-	return __netdev_alloc_frag_align(fragsz, -align);
+	return netdev_alloc_frag_align(fragsz, 1);
 }
 
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int length,
@@ -3270,18 +3263,11 @@ static inline void skb_free_frag(void *addr)
 	page_frag_free(addr);
 }
 
-void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask);
+void *napi_alloc_frag_align(unsigned int fragsz, unsigned int align);
 
 static inline void *napi_alloc_frag(unsigned int fragsz)
 {
-	return __napi_alloc_frag_align(fragsz, ~0u);
-}
-
-static inline void *napi_alloc_frag_align(unsigned int fragsz,
-					  unsigned int align)
-{
-	WARN_ON_ONCE(!is_power_of_2(align));
-	return __napi_alloc_frag_align(fragsz, -align);
+	return napi_alloc_frag_align(fragsz, 1);
 }
 
 struct sk_buff *__napi_alloc_skb(struct napi_struct *napi,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 37ca4f4b62bf..9a16305cf985 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4718,12 +4718,14 @@ EXPORT_SYMBOL(__page_frag_cache_drain);
 
 void *page_frag_alloc_align(struct page_frag_cache *nc,
 		      unsigned int fragsz, gfp_t gfp_mask,
-		      unsigned int align_mask)
+		      unsigned int align)
 {
 	unsigned int size = PAGE_SIZE;
 	struct page *page;
 	int offset;
 
+	WARN_ON_ONCE(!is_power_of_2(align));
+
 	if (unlikely(!nc->va)) {
 refill:
 		page = __page_frag_cache_refill(nc, gfp_mask);
@@ -4782,7 +4784,7 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 	}
 
 	nc->pagecnt_bias--;
-	offset &= align_mask;
+	offset &= -align;
 	nc->offset = offset;
 
 	return nc->va + offset;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 12d22c0b8551..84c29a48f1a8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -291,17 +291,17 @@ void napi_get_frags_check(struct napi_struct *napi)
 	local_bh_enable();
 }
 
-void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
+void *napi_alloc_frag_align(unsigned int fragsz, unsigned int align)
 {
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 
 	fragsz = SKB_DATA_ALIGN(fragsz);
 
-	return page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
+	return page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align);
 }
-EXPORT_SYMBOL(__napi_alloc_frag_align);
+EXPORT_SYMBOL(napi_alloc_frag_align);
 
-void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
+void *netdev_alloc_frag_align(unsigned int fragsz, unsigned int align)
 {
 	void *data;
 
@@ -309,18 +309,18 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
 	if (in_hardirq() || irqs_disabled()) {
 		struct page_frag_cache *nc = this_cpu_ptr(&netdev_alloc_cache);
 
-		data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask);
+		data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align);
 	} else {
 		struct napi_alloc_cache *nc;
 
 		local_bh_disable();
 		nc = this_cpu_ptr(&napi_alloc_cache);
-		data = page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
+		data = page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align);
 		local_bh_enable();
 	}
 	return data;
 }
-EXPORT_SYMBOL(__netdev_alloc_frag_align);
+EXPORT_SYMBOL(netdev_alloc_frag_align);
 
 static struct sk_buff *napi_skb_cache_get(void)
 {
-- 
2.33.0


* [PATCH net-next 2/6] page_frag: unify gfp bits for order 3 page allocation
  2024-01-03  9:56 ` Yunsheng Lin
@ 2024-01-03  9:56 ` Yunsheng Lin
  2024-01-05 15:35   ` Alexander H Duyck
  -1 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2024-01-03  9:56 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
	Michael S. Tsirkin, Jason Wang, Andrew Morton, Eric Dumazet, kvm,
	virtualization, linux-mm

Currently there seem to be three page frag implementations
which all try to allocate an order 3 page; if that fails,
they fall back to allocating an order 0 page, and each of
them allows the order 3 page allocation to fail under
certain conditions by using specific gfp bits.

The gfp bits for the order 3 page allocation differ between
the implementations: __GFP_NOMEMALLOC is or'd in
__page_frag_cache_refill() to forbid access to the emergency
memory reserves, but it is not or'd in the other
implementations, and __GFP_DIRECT_RECLAIM is masked off in
skb_page_frag_refill() to avoid direct reclaim, but it is not
masked off in __page_frag_cache_refill().

This patch unifies the gfp bits used by the different
implementations by or'ing in __GFP_NOMEMALLOC and masking off
__GFP_DIRECT_RECLAIM for the order 3 page allocation, to
avoid possible pressure on mm.
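
With that, all three callers attempt the order 3 allocation with
the same bits; the unified mask as used in this patch:

    /* order 3 attempt: no direct reclaim (kswapd can still be
     * woken), no emergency reserves, no retry, no warning
     */
    gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
               __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;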

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
---
 drivers/vhost/net.c | 2 +-
 mm/page_alloc.c     | 4 ++--
 net/core/sock.c     | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f2ed7167c848..e574e21cc0ca 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
 		/* Avoid direct reclaim but allow kswapd to wake */
 		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
 					  __GFP_COMP | __GFP_NOWARN |
-					  __GFP_NORETRY,
+					  __GFP_NORETRY | __GFP_NOMEMALLOC,
 					  SKB_FRAG_PAGE_ORDER);
 		if (likely(pfrag->page)) {
 			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9a16305cf985..1f0b36dd81b5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4693,8 +4693,8 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 	gfp_t gfp = gfp_mask;
 
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
-		    __GFP_NOMEMALLOC;
+	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
+		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
 	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
 				PAGE_FRAG_CACHE_MAX_ORDER);
 	nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
diff --git a/net/core/sock.c b/net/core/sock.c
index 446e945f736b..d643332c3ee5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2900,7 +2900,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
 		/* Avoid direct reclaim but allow kswapd to wake */
 		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
 					  __GFP_COMP | __GFP_NOWARN |
-					  __GFP_NORETRY,
+					  __GFP_NORETRY | __GFP_NOMEMALLOC,
 					  SKB_FRAG_PAGE_ORDER);
 		if (likely(pfrag->page)) {
 			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
-- 
2.33.0


* [PATCH net-next 3/6] mm/page_alloc: use initial zero offset for page_frag_alloc_align()
  2024-01-03  9:56 ` Yunsheng Lin
@ 2024-01-03  9:56 ` Yunsheng Lin
  2024-01-05 15:42   ` Alexander H Duyck
  -1 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2024-01-03  9:56 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
	Andrew Morton, linux-mm

The next patch is about to use page_frag_alloc_align() to
replace vhost_net_page_frag_refill(); the main difference
between those two page frag implementations is whether an
initial zero offset is used or not.

It seems more natural to use an initial zero offset, as it
may enable more effective cache prefetching and skb frag
coalescing in the networking stack, so change the code to use
an initial zero offset.
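
With a zero initial offset the cache hands out frags in ascending
address order; the core of the new logic, condensed from the diff
below:

    offset = ALIGN(nc->offset, align);
    if (offset + fragsz > size)     /* current page exhausted */
            goto refill;

    nc->pagecnt_bias--;
    nc->offset = offset + fragsz;   /* next frag follows this one */
    return nc->va + offset;

The old code instead started at 'offset = size' and walked
downwards via 'offset = nc->offset - fragsz'.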

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
---
 mm/page_alloc.c | 30 ++++++++++++++----------------
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1f0b36dd81b5..083e0c38fb62 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4720,7 +4720,7 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 		      unsigned int fragsz, gfp_t gfp_mask,
 		      unsigned int align)
 {
-	unsigned int size = PAGE_SIZE;
+	unsigned int size;
 	struct page *page;
 	int offset;
 
@@ -4732,10 +4732,6 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 		if (!page)
 			return NULL;
 
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-		/* if size can vary use size else just use PAGE_SIZE */
-		size = nc->size;
-#endif
 		/* Even if we own the page, we do not use atomic_set().
 		 * This would break get_page_unless_zero() users.
 		 */
@@ -4744,11 +4740,18 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 		/* reset page count bias and offset to start of new frag */
 		nc->pfmemalloc = page_is_pfmemalloc(page);
 		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		nc->offset = size;
+		nc->offset = 0;
 	}
 
-	offset = nc->offset - fragsz;
-	if (unlikely(offset < 0)) {
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+	/* if size can vary use size else just use PAGE_SIZE */
+	size = nc->size;
+#else
+	size = PAGE_SIZE;
+#endif
+
+	offset = ALIGN(nc->offset, align);
+	if (unlikely(offset + fragsz > size)) {
 		page = virt_to_page(nc->va);
 
 		if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
@@ -4759,17 +4762,13 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 			goto refill;
 		}
 
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-		/* if size can vary use size else just use PAGE_SIZE */
-		size = nc->size;
-#endif
 		/* OK, page count is 0, we can safely set it */
 		set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
 
 		/* reset page count bias and offset to start of new frag */
 		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		offset = size - fragsz;
-		if (unlikely(offset < 0)) {
+		offset = 0;
+		if (unlikely(fragsz > size)) {
 			/*
 			 * The caller is trying to allocate a fragment
 			 * with fragsz > PAGE_SIZE but the cache isn't big
@@ -4784,8 +4783,7 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 	}
 
 	nc->pagecnt_bias--;
-	offset &= -align;
-	nc->offset = offset;
+	nc->offset = offset + fragsz;
 
 	return nc->va + offset;
 }
-- 
2.33.0


* [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill()
  2024-01-03  9:56 ` Yunsheng Lin
@ 2024-01-03  9:56 ` Yunsheng Lin
  2024-01-05 16:06   ` Alexander H Duyck
  -1 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2024-01-03  9:56 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Jason Wang,
	Michael S. Tsirkin, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, kvm, virtualization, bpf

The page frag in vhost_net_page_frag_refill() uses the
'struct page_frag' from skb_page_frag_refill(), but its
implementation is now similar to page_frag_alloc_align().

This patch removes vhost_net_page_frag_refill() by using
'struct page_frag_cache' instead of 'struct page_frag',
and allocating frags using page_frag_alloc_align().

The added benefit is that we not only unify the page frag
implementation a little, but also get about a 0.5%
performance boost when testing with the vhost_net_test
introduced in the last patch.
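
The new flow in vhost_net_build_xdp() then boils down to a plain
allocate/copy/free-on-error pattern; condensed from the diff below:

    buf = page_frag_alloc_align(&net->pf_cache, buflen, GFP_KERNEL,
                                SMP_CACHE_BYTES);
    if (unlikely(!buf))
            return -ENOMEM;

    copied = copy_from_iter(buf + offsetof(struct tun_xdp_hdr, gso),
                            sock_hlen, from);
    if (copied != sock_hlen) {
            page_frag_free(buf);    /* return the frag on error */
            return -EFAULT;
    }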

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 93 ++++++++++++++-------------------------------
 1 file changed, 29 insertions(+), 64 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e574e21cc0ca..805e11d598e4 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -141,10 +141,8 @@ struct vhost_net {
 	unsigned tx_zcopy_err;
 	/* Flush in progress. Protected by tx vq lock. */
 	bool tx_flush;
-	/* Private page frag */
-	struct page_frag page_frag;
-	/* Refcount bias of page frag */
-	int refcnt_bias;
+	/* Private page frag cache */
+	struct page_frag_cache pf_cache;
 };
 
 static unsigned vhost_net_zcopy_mask __read_mostly;
@@ -655,41 +653,6 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
 	       !vhost_vq_avail_empty(vq->dev, vq);
 }
 
-static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
-				       struct page_frag *pfrag, gfp_t gfp)
-{
-	if (pfrag->page) {
-		if (pfrag->offset + sz <= pfrag->size)
-			return true;
-		__page_frag_cache_drain(pfrag->page, net->refcnt_bias);
-	}
-
-	pfrag->offset = 0;
-	net->refcnt_bias = 0;
-	if (SKB_FRAG_PAGE_ORDER) {
-		/* Avoid direct reclaim but allow kswapd to wake */
-		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
-					  __GFP_COMP | __GFP_NOWARN |
-					  __GFP_NORETRY | __GFP_NOMEMALLOC,
-					  SKB_FRAG_PAGE_ORDER);
-		if (likely(pfrag->page)) {
-			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
-			goto done;
-		}
-	}
-	pfrag->page = alloc_page(gfp);
-	if (likely(pfrag->page)) {
-		pfrag->size = PAGE_SIZE;
-		goto done;
-	}
-	return false;
-
-done:
-	net->refcnt_bias = USHRT_MAX;
-	page_ref_add(pfrag->page, USHRT_MAX - 1);
-	return true;
-}
-
 #define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
 
 static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
@@ -699,7 +662,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 	struct vhost_net *net = container_of(vq->dev, struct vhost_net,
 					     dev);
 	struct socket *sock = vhost_vq_get_backend(vq);
-	struct page_frag *alloc_frag = &net->page_frag;
 	struct virtio_net_hdr *gso;
 	struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
 	struct tun_xdp_hdr *hdr;
@@ -710,6 +672,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 	int sock_hlen = nvq->sock_hlen;
 	void *buf;
 	int copied;
+	int ret;
 
 	if (unlikely(len < nvq->sock_hlen))
 		return -EFAULT;
@@ -719,18 +682,17 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 		return -ENOSPC;
 
 	buflen += SKB_DATA_ALIGN(len + pad);
-	alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
-	if (unlikely(!vhost_net_page_frag_refill(net, buflen,
-						 alloc_frag, GFP_KERNEL)))
+	buf = page_frag_alloc_align(&net->pf_cache, buflen, GFP_KERNEL,
+				    SMP_CACHE_BYTES);
+	if (unlikely(!buf))
 		return -ENOMEM;
 
-	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
-	copied = copy_page_from_iter(alloc_frag->page,
-				     alloc_frag->offset +
-				     offsetof(struct tun_xdp_hdr, gso),
-				     sock_hlen, from);
-	if (copied != sock_hlen)
-		return -EFAULT;
+	copied = copy_from_iter(buf + offsetof(struct tun_xdp_hdr, gso),
+				sock_hlen, from);
+	if (copied != sock_hlen) {
+		ret = -EFAULT;
+		goto err;
+	}
 
 	hdr = buf;
 	gso = &hdr->gso;
@@ -743,27 +705,30 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 			       vhost16_to_cpu(vq, gso->csum_start) +
 			       vhost16_to_cpu(vq, gso->csum_offset) + 2);
 
-		if (vhost16_to_cpu(vq, gso->hdr_len) > len)
-			return -EINVAL;
+		if (vhost16_to_cpu(vq, gso->hdr_len) > len) {
+			ret = -EINVAL;
+			goto err;
+		}
 	}
 
 	len -= sock_hlen;
-	copied = copy_page_from_iter(alloc_frag->page,
-				     alloc_frag->offset + pad,
-				     len, from);
-	if (copied != len)
-		return -EFAULT;
+	copied = copy_from_iter(buf + pad, len, from);
+	if (copied != len) {
+		ret = -EFAULT;
+		goto err;
+	}
 
 	xdp_init_buff(xdp, buflen, NULL);
 	xdp_prepare_buff(xdp, buf, pad, len, true);
 	hdr->buflen = buflen;
 
-	--net->refcnt_bias;
-	alloc_frag->offset += buflen;
-
 	++nvq->batched_xdp;
 
 	return 0;
+
+err:
+	page_frag_free(buf);
+	return ret;
 }
 
 static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
@@ -1353,8 +1318,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 			vqs[VHOST_NET_VQ_RX]);
 
 	f->private_data = n;
-	n->page_frag.page = NULL;
-	n->refcnt_bias = 0;
+	n->pf_cache.va = NULL;
 
 	return 0;
 }
@@ -1422,8 +1386,9 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
 	kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
 	kfree(n->dev.vqs);
-	if (n->page_frag.page)
-		__page_frag_cache_drain(n->page_frag.page, n->refcnt_bias);
+	if (n->pf_cache.va)
+		__page_frag_cache_drain(virt_to_head_page(n->pf_cache.va),
+					n->pf_cache.pagecnt_bias);
 	kvfree(n);
 	return 0;
 }
-- 
2.33.0


* [PATCH net-next 5/6] net: introduce page_frag_cache_drain()
  2024-01-03  9:56 ` Yunsheng Lin
@ 2024-01-03  9:56   ` Yunsheng Lin
  -1 siblings, 0 replies; 31+ messages in thread
From: Yunsheng Lin @ 2024-01-03  9:56 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Jason Wang, Jeroen de Borst,
	Praveen Kaligineedi, Shailend Chand, Eric Dumazet, Felix Fietkau,
	John Crispin, Sean Wang, Mark Lee, Lorenzo Bianconi,
	Matthias Brugger, AngeloGioacchino Del Regno, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
	Michael S. Tsirkin, Andrew Morton, linux-arm-kernel,
	linux-mediatek, linux-nvme, kvm, virtualization, linux-mm

When draining a page_frag_cache, most users are doing
similar steps, so introduce an API to avoid code
duplication.
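
With the helper, a teardown path shrinks to a single call; a
before/after sketch mirroring the converted callers below:

    /* before */
    if (nc->va) {
            __page_frag_cache_drain(virt_to_head_page(nc->va),
                                    nc->pagecnt_bias);
            nc->va = NULL;
    }

    /* after */
    page_frag_cache_drain(nc);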

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/ethernet/google/gve/gve_main.c | 11 ++---------
 drivers/net/ethernet/mediatek/mtk_wed_wo.c | 17 ++---------------
 drivers/nvme/host/tcp.c                    |  7 +------
 drivers/nvme/target/tcp.c                  |  4 +---
 drivers/vhost/net.c                        |  4 +---
 include/linux/gfp.h                        |  2 ++
 mm/page_alloc.c                            | 10 ++++++++++
 7 files changed, 19 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 619bf63ec935..d976190b0f4d 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -1278,17 +1278,10 @@ static void gve_unreg_xdp_info(struct gve_priv *priv)
 
 static void gve_drain_page_cache(struct gve_priv *priv)
 {
-	struct page_frag_cache *nc;
 	int i;
 
-	for (i = 0; i < priv->rx_cfg.num_queues; i++) {
-		nc = &priv->rx[i].page_cache;
-		if (nc->va) {
-			__page_frag_cache_drain(virt_to_page(nc->va),
-						nc->pagecnt_bias);
-			nc->va = NULL;
-		}
-	}
+	for (i = 0; i < priv->rx_cfg.num_queues; i++)
+		page_frag_cache_drain(&priv->rx[i].page_cache);
 }
 
 static int gve_open(struct net_device *dev)
diff --git a/drivers/net/ethernet/mediatek/mtk_wed_wo.c b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
index d58b07e7e123..7063c78bd35f 100644
--- a/drivers/net/ethernet/mediatek/mtk_wed_wo.c
+++ b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
@@ -286,7 +286,6 @@ mtk_wed_wo_queue_free(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 static void
 mtk_wed_wo_queue_tx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 {
-	struct page *page;
 	int i;
 
 	for (i = 0; i < q->n_desc; i++) {
@@ -301,19 +300,12 @@ mtk_wed_wo_queue_tx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 		entry->buf = NULL;
 	}
 
-	if (!q->cache.va)
-		return;
-
-	page = virt_to_page(q->cache.va);
-	__page_frag_cache_drain(page, q->cache.pagecnt_bias);
-	memset(&q->cache, 0, sizeof(q->cache));
+	page_frag_cache_drain(&q->cache);
 }
 
 static void
 mtk_wed_wo_queue_rx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 {
-	struct page *page;
-
 	for (;;) {
 		void *buf = mtk_wed_wo_dequeue(wo, q, NULL, true);
 
@@ -323,12 +315,7 @@ mtk_wed_wo_queue_rx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 		skb_free_frag(buf);
 	}
 
-	if (!q->cache.va)
-		return;
-
-	page = virt_to_page(q->cache.va);
-	__page_frag_cache_drain(page, q->cache.pagecnt_bias);
-	memset(&q->cache, 0, sizeof(q->cache));
+	page_frag_cache_drain(&q->cache);
 }
 
 static void
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 08805f027810..c80037a78066 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1344,7 +1344,6 @@ static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
 
 static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
 {
-	struct page *page;
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
 	unsigned int noreclaim_flag;
@@ -1355,11 +1354,7 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
 	if (queue->hdr_digest || queue->data_digest)
 		nvme_tcp_free_crypto(queue);
 
-	if (queue->pf_cache.va) {
-		page = virt_to_head_page(queue->pf_cache.va);
-		__page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
-		queue->pf_cache.va = NULL;
-	}
+	page_frag_cache_drain(&queue->pf_cache);
 
 	noreclaim_flag = memalloc_noreclaim_save();
 	/* ->sock will be released by fput() */
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 4cc27856aa8f..11237557cfc5 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1576,7 +1576,6 @@ static void nvmet_tcp_free_cmd_data_in_buffers(struct nvmet_tcp_queue *queue)
 
 static void nvmet_tcp_release_queue_work(struct work_struct *w)
 {
-	struct page *page;
 	struct nvmet_tcp_queue *queue =
 		container_of(w, struct nvmet_tcp_queue, release_work);
 
@@ -1600,8 +1599,7 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
 	if (queue->hdr_digest || queue->data_digest)
 		nvmet_tcp_free_crypto(queue);
 	ida_free(&nvmet_tcp_queue_ida, queue->idx);
-	page = virt_to_head_page(queue->pf_cache.va);
-	__page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
+	page_frag_cache_drain(&queue->pf_cache);
 	kfree(queue);
 }
 
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 805e11d598e4..4b2fcb228a0a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1386,9 +1386,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
 	kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
 	kfree(n->dev.vqs);
-	if (n->pf_cache.va)
-		__page_frag_cache_drain(virt_to_head_page(n->pf_cache.va),
-					n->pf_cache.pagecnt_bias);
+	page_frag_cache_drain(&n->pf_cache);
 	kvfree(n);
 	return 0;
 }
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index bbd75976541e..03ba079655d3 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -316,6 +316,8 @@ extern void *page_frag_alloc_align(struct page_frag_cache *nc,
 				   unsigned int fragsz, gfp_t gfp_mask,
 				   unsigned int align);
 
+void page_frag_cache_drain(struct page_frag_cache *nc);
+
 static inline void *page_frag_alloc(struct page_frag_cache *nc,
 			     unsigned int fragsz, gfp_t gfp_mask)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 083e0c38fb62..5a0e68edcb05 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4716,6 +4716,16 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
 }
 EXPORT_SYMBOL(__page_frag_cache_drain);
 
+void page_frag_cache_drain(struct page_frag_cache *nc)
+{
+	if (!nc->va)
+		return;
+
+	__page_frag_cache_drain(virt_to_head_page(nc->va), nc->pagecnt_bias);
+	nc->va = NULL;
+}
+EXPORT_SYMBOL(page_frag_cache_drain);
+
 void *page_frag_alloc_align(struct page_frag_cache *nc,
 		      unsigned int fragsz, gfp_t gfp_mask,
 		      unsigned int align)
-- 
2.33.0



* [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test
  2024-01-03  9:56 ` Yunsheng Lin
@ 2024-01-03  9:56 ` Yunsheng Lin
  2024-01-04 16:17   ` Eugenio Perez Martin
  -1 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2024-01-03  9:56 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, virtualization

Introduce vhost_net_test, based on virtio_test, to test
vhost_net changes in the kernel.
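
A build/run sketch (assuming /dev/net/tun and /dev/vhost-net are
available and the test is run as root; defaults are batch=1 and
0x100000 buffers):

    make -C tools/virtio vhost_net_test
    tools/virtio/vhost_net_test
    tools/virtio/vhost_net_test --batch=random
    tools/virtio/vhost_net_test --buf-num=1000 --delayed-interrupt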

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 tools/virtio/Makefile         |   8 +-
 tools/virtio/vhost_net_test.c | 574 ++++++++++++++++++++++++++++++++++
 2 files changed, 579 insertions(+), 3 deletions(-)
 create mode 100644 tools/virtio/vhost_net_test.c

diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
index d128925980e0..e25e99c1c3b7 100644
--- a/tools/virtio/Makefile
+++ b/tools/virtio/Makefile
@@ -1,8 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0
 all: test mod
-test: virtio_test vringh_test
+test: virtio_test vringh_test vhost_net_test
 virtio_test: virtio_ring.o virtio_test.o
 vringh_test: vringh_test.o vringh.o virtio_ring.o
+vhost_net_test: virtio_ring.o vhost_net_test.o
 
 try-run = $(shell set -e;		\
 	if ($(1)) >/dev/null 2>&1;	\
@@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean
 
 .PHONY: all test mod clean vhost oot oot-clean oot-build
 clean:
-	${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \
-              vhost_test/Module.symvers vhost_test/modules.order *.d
+	${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \
+              vhost_test/.*.cmd vhost_test/Module.symvers \
+              vhost_test/modules.order *.d
 -include *.d
diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c
new file mode 100644
index 000000000000..cfffcef53d94
--- /dev/null
+++ b/tools/virtio/vhost_net_test.c
@@ -0,0 +1,574 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <getopt.h>
+#include <limits.h>
+#include <string.h>
+#include <poll.h>
+#include <sys/eventfd.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <linux/virtio_types.h>
+#include <linux/vhost.h>
+#include <linux/virtio.h>
+#include <linux/virtio_ring.h>
+#include <linux/if.h>
+#include <linux/if_tun.h>
+#include <linux/in.h>
+#include <linux/if_packet.h>
+#include <netinet/ether.h>
+
+#define RANDOM_BATCH	-1
+#define HDR_LEN		12
+#define TEST_BUF_LEN	256
+#define TEST_PTYPE	ETH_P_LOOPBACK
+
+/* Used by implementation of kmalloc() in tools/virtio/linux/kernel.h */
+void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
+
+struct vq_info {
+	int kick;
+	int call;
+	int idx;
+	long started;
+	long completed;
+	struct pollfd fds;
+	void *ring;
+	/* copy used for control */
+	struct vring vring;
+	struct virtqueue *vq;
+};
+
+struct vdev_info {
+	struct virtio_device vdev;
+	int control;
+	struct vq_info vqs[2];
+	int nvqs;
+	void *buf;
+	size_t buf_size;
+	char *test_buf;
+	char *res_buf;
+	struct vhost_memory *mem;
+	int sock;
+	int ifindex;
+	unsigned char mac[ETHER_ADDR_LEN];
+};
+
+static int tun_alloc(struct vdev_info *dev)
+{
+	struct ifreq ifr;
+	int len = HDR_LEN;
+	int fd, e;
+
+	fd = open("/dev/net/tun", O_RDWR);
+	if (fd < 0) {
+		perror("Cannot open /dev/net/tun");
+		return fd;
+	}
+
+	memset(&ifr, 0, sizeof(ifr));
+
+	ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
+	snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
+
+	e = ioctl(fd, TUNSETIFF, &ifr);
+	if (e < 0) {
+		perror("ioctl[TUNSETIFF]");
+		close(fd);
+		return e;
+	}
+
+	e = ioctl(fd, TUNSETVNETHDRSZ, &len);
+	if (e < 0) {
+		perror("ioctl[TUNSETVNETHDRSZ]");
+		close(fd);
+		return e;
+	}
+
+	e = ioctl(fd, SIOCGIFHWADDR, &ifr);
+	if (e < 0) {
+		perror("ioctl[SIOCGIFHWADDR]");
+		close(fd);
+		return e;
+	}
+
+	memcpy(dev->mac, &ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
+	return fd;
+}
+
+static void vdev_create_socket(struct vdev_info *dev)
+{
+	struct ifreq ifr;
+
+	dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE));
+	assert(dev->sock != -1);
+
+	snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
+	assert(ioctl(dev->sock, SIOCGIFINDEX, &ifr) >= 0);
+
+	dev->ifindex = ifr.ifr_ifindex;
+
+	/* Set the flags that bring the device up */
+	assert(ioctl(dev->sock, SIOCGIFFLAGS, &ifr) >= 0);
+	ifr.ifr_flags |= (IFF_UP | IFF_RUNNING);
+	assert(ioctl(dev->sock, SIOCSIFFLAGS, &ifr) >= 0);
+}
+
+static void vdev_send_packet(struct vdev_info *dev)
+{
+	char *sendbuf = dev->test_buf + HDR_LEN;
+	struct sockaddr_ll saddrll = {0};
+	int sockfd = dev->sock;
+	int ret;
+
+	memset(&saddrll, 0, sizeof(saddrll));
+	saddrll.sll_family = PF_PACKET;
+	saddrll.sll_ifindex = dev->ifindex;
+	saddrll.sll_halen = ETH_ALEN;
+	saddrll.sll_protocol = htons(TEST_PTYPE);
+
+	ret = sendto(sockfd, sendbuf, TEST_BUF_LEN, 0,
+		     (struct sockaddr *)&saddrll,
+		     sizeof(struct sockaddr_ll));
+	assert(ret >= 0);
+}
+
+static bool vq_notify(struct virtqueue *vq)
+{
+	struct vq_info *info = vq->priv;
+	unsigned long long v = 1;
+	int r;
+
+	r = write(info->kick, &v, sizeof(v));
+	assert(r == sizeof(v));
+
+	return true;
+}
+
+static void vq_callback(struct virtqueue *vq)
+{
+}
+
+static void vhost_vq_setup(struct vdev_info *dev, struct vq_info *info)
+{
+	struct vhost_vring_addr addr = {
+		.index = info->idx,
+		.desc_user_addr = (uint64_t)(unsigned long)info->vring.desc,
+		.avail_user_addr = (uint64_t)(unsigned long)info->vring.avail,
+		.used_user_addr = (uint64_t)(unsigned long)info->vring.used,
+	};
+	struct vhost_vring_state state = { .index = info->idx };
+	struct vhost_vring_file file = { .index = info->idx };
+	int r;
+
+	state.num = info->vring.num;
+	r = ioctl(dev->control, VHOST_SET_VRING_NUM, &state);
+	assert(r >= 0);
+
+	state.num = 0;
+	r = ioctl(dev->control, VHOST_SET_VRING_BASE, &state);
+	assert(r >= 0);
+
+	r = ioctl(dev->control, VHOST_SET_VRING_ADDR, &addr);
+	assert(r >= 0);
+
+	file.fd = info->kick;
+	r = ioctl(dev->control, VHOST_SET_VRING_KICK, &file);
+	assert(r >= 0);
+
+	file.fd = info->call;
+	r = ioctl(dev->control, VHOST_SET_VRING_CALL, &file);
+	assert(r >= 0);
+}
+
+static void vq_reset(struct vq_info *info, int num, struct virtio_device *vdev)
+{
+	if (info->vq)
+		vring_del_virtqueue(info->vq);
+
+	memset(info->ring, 0, vring_size(num, 4096));
+	vring_init(&info->vring, num, info->ring, 4096);
+	info->vq = vring_new_virtqueue(info->idx, num, 4096, vdev, true, false,
+				       info->ring, vq_notify, vq_callback, "test");
+	assert(info->vq);
+	info->vq->priv = info;
+}
+
+static void vq_info_add(struct vdev_info *dev, int idx, int num, int fd)
+{
+	struct vhost_vring_file backend = { .index = idx, .fd = fd };
+	struct vq_info *info = &dev->vqs[idx];
+	int r;
+
+	info->idx = idx;
+	info->kick = eventfd(0, EFD_NONBLOCK);
+	info->call = eventfd(0, EFD_NONBLOCK);
+	r = posix_memalign(&info->ring, 4096, vring_size(num, 4096));
+	assert(r >= 0);
+	vq_reset(info, num, &dev->vdev);
+	vhost_vq_setup(dev, info);
+	info->fds.fd = info->call;
+	info->fds.events = POLLIN;
+
+	r = ioctl(dev->control, VHOST_NET_SET_BACKEND, &backend);
+	assert(!r);
+}
+
+static void vdev_info_init(struct vdev_info *dev, unsigned long long features)
+{
+	struct ether_header *eh;
+	int i, r;
+
+	dev->vdev.features = features;
+	INIT_LIST_HEAD(&dev->vdev.vqs);
+	spin_lock_init(&dev->vdev.vqs_list_lock);
+
+	dev->buf_size = (HDR_LEN + TEST_BUF_LEN) * 2;
+	dev->buf = malloc(dev->buf_size);
+	assert(dev->buf);
+	dev->test_buf = dev->buf;
+	dev->res_buf = dev->test_buf + HDR_LEN + TEST_BUF_LEN;
+
+	memset(dev->test_buf, 0, HDR_LEN + TEST_BUF_LEN);
+	eh = (struct ether_header *)(dev->test_buf + HDR_LEN);
+	eh->ether_type = htons(TEST_PTYPE);
+	memcpy(eh->ether_dhost, dev->mac, ETHER_ADDR_LEN);
+	memcpy(eh->ether_shost, dev->mac, ETHER_ADDR_LEN);
+
+	for (i = sizeof(*eh); i < TEST_BUF_LEN; i++)
+		dev->test_buf[i + HDR_LEN] = (char)i;
+
+	dev->control = open("/dev/vhost-net", O_RDWR);
+	assert(dev->control >= 0);
+
+	r = ioctl(dev->control, VHOST_SET_OWNER, NULL);
+	assert(r >= 0);
+
+	dev->mem = malloc(offsetof(struct vhost_memory, regions) +
+			  sizeof(dev->mem->regions[0]));
+	assert(dev->mem);
+	memset(dev->mem, 0, offsetof(struct vhost_memory, regions) +
+	       sizeof(dev->mem->regions[0]));
+	dev->mem->nregions = 1;
+	dev->mem->regions[0].guest_phys_addr = (long)dev->buf;
+	dev->mem->regions[0].userspace_addr = (long)dev->buf;
+	dev->mem->regions[0].memory_size = dev->buf_size;
+
+	r = ioctl(dev->control, VHOST_SET_MEM_TABLE, dev->mem);
+	assert(r >= 0);
+
+	r = ioctl(dev->control, VHOST_SET_FEATURES, &features);
+	assert(r >= 0);
+
+	dev->nvqs = 2;
+}
+
+static void wait_for_interrupt(struct vq_info *vq)
+{
+	unsigned long long val;
+
+	poll(&vq->fds, 1, -1);
+
+	if (vq->fds.revents & POLLIN)
+		read(vq->fds.fd, &val, sizeof(val));
+}
+
+static void run_tx_test(struct vdev_info *dev, struct vq_info *vq,
+			bool delayed, int batch, int bufs)
+{
+	const bool random_batch = batch == RANDOM_BATCH;
+	long long spurious = 0;
+	struct scatterlist sl;
+	unsigned int len;
+	int r;
+
+	for (;;) {
+		long started_before = vq->started;
+		long completed_before = vq->completed;
+
+		virtqueue_disable_cb(vq->vq);
+		do {
+			if (random_batch)
+				batch = (random() % vq->vring.num) + 1;
+
+			while (vq->started < bufs &&
+			       (vq->started - vq->completed) < batch) {
+				sg_init_one(&sl, dev->test_buf, HDR_LEN + TEST_BUF_LEN);
+				r = virtqueue_add_outbuf(vq->vq, &sl, 1,
+							 dev->test_buf + vq->started,
+							 GFP_ATOMIC);
+				if (unlikely(r != 0)) {
+					if (r == -ENOSPC &&
+					    vq->started > started_before)
+						r = 0;
+					else
+						r = -1;
+					break;
+				}
+
+				++vq->started;
+
+				if (unlikely(!virtqueue_kick(vq->vq))) {
+					r = -1;
+					break;
+				}
+			}
+
+			if (vq->started >= bufs)
+				r = -1;
+
+			/* Flush out completed bufs if any */
+			while (virtqueue_get_buf(vq->vq, &len)) {
+				int n, i;
+
+				n = recvfrom(dev->sock, dev->res_buf, TEST_BUF_LEN, 0, NULL, NULL);
+				assert(n == TEST_BUF_LEN);
+
+				for (i = ETHER_HDR_LEN; i < n; i++)
+					assert(dev->res_buf[i] == (char)i);
+
+				++vq->completed;
+				r = 0;
+			}
+		} while (r == 0);
+
+		if (vq->completed == completed_before && vq->started == started_before)
+			++spurious;
+
+		assert(vq->completed <= bufs);
+		assert(vq->started <= bufs);
+		if (vq->completed == bufs)
+			break;
+
+		if (delayed) {
+			if (virtqueue_enable_cb_delayed(vq->vq))
+				wait_for_interrupt(vq);
+		} else {
+			if (virtqueue_enable_cb(vq->vq))
+				wait_for_interrupt(vq);
+		}
+	}
+	printf("TX spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
+	       spurious, vq->started, vq->completed);
+}
+
+static void run_rx_test(struct vdev_info *dev, struct vq_info *vq,
+			bool delayed, int batch, int bufs)
+{
+	const bool random_batch = batch == RANDOM_BATCH;
+	long long spurious = 0;
+	struct scatterlist sl;
+	unsigned int len;
+	int r;
+
+	for (;;) {
+		long started_before = vq->started;
+		long completed_before = vq->completed;
+
+		do {
+			if (random_batch)
+				batch = (random() % vq->vring.num) + 1;
+
+			while (vq->started < bufs &&
+			       (vq->started - vq->completed) < batch) {
+				sg_init_one(&sl, dev->res_buf, HDR_LEN + TEST_BUF_LEN);
+
+				r = virtqueue_add_inbuf(vq->vq, &sl, 1,
+							dev->res_buf + vq->started,
+							GFP_ATOMIC);
+				if (unlikely(r != 0)) {
+					if (r == -ENOSPC &&
+					    vq->started > started_before)
+						r = 0;
+					else
+						r = -1;
+					break;
+				}
+
+				++vq->started;
+
+				vdev_send_packet(dev);
+
+				if (unlikely(!virtqueue_kick(vq->vq))) {
+					r = -1;
+					break;
+				}
+			}
+
+			if (vq->started >= bufs)
+				r = -1;
+
+			/* Flush out completed bufs if any */
+			while (virtqueue_get_buf(vq->vq, &len)) {
+				struct ether_header *eh;
+				int i;
+
+				eh = (struct ether_header *)(dev->res_buf + HDR_LEN);
+
+				/* tun netdev is up and running, ignore the
+				 * non-TEST_PTYPE packet.
+				 */
+				if (eh->ether_type != htons(TEST_PTYPE)) {
+					++vq->completed;
+					r = 0;
+					continue;
+				}
+
+				assert(len == TEST_BUF_LEN + HDR_LEN);
+
+				for (i = ETHER_HDR_LEN; i < TEST_BUF_LEN; i++)
+					assert(dev->res_buf[i + HDR_LEN] == (char)i);
+
+				++vq->completed;
+				r = 0;
+			}
+		} while (r == 0);
+		if (vq->completed == completed_before && vq->started == started_before)
+			++spurious;
+
+		assert(vq->completed <= bufs);
+		assert(vq->started <= bufs);
+		if (vq->completed == bufs)
+			break;
+	}
+
+	printf("RX spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
+	       spurious, vq->started, vq->completed);
+}
+
+static const char optstring[] = "h";
+static const struct option longopts[] = {
+	{
+		.name = "help",
+		.val = 'h',
+	},
+	{
+		.name = "event-idx",
+		.val = 'E',
+	},
+	{
+		.name = "no-event-idx",
+		.val = 'e',
+	},
+	{
+		.name = "indirect",
+		.val = 'I',
+	},
+	{
+		.name = "no-indirect",
+		.val = 'i',
+	},
+	{
+		.name = "virtio-1",
+		.val = '1',
+	},
+	{
+		.name = "no-virtio-1",
+		.val = '0',
+	},
+	{
+		.name = "delayed-interrupt",
+		.val = 'D',
+	},
+	{
+		.name = "no-delayed-interrupt",
+		.val = 'd',
+	},
+	{
+		.name = "buf-num",
+		.val = 'n',
+		.has_arg = required_argument,
+	},
+	{
+		.name = "batch",
+		.val = 'b',
+		.has_arg = required_argument,
+	},
+	{
+	}
+};
+
+static void help(int status)
+{
+	fprintf(stderr, "Usage: vhost_net_test [--help]"
+		" [--no-indirect]"
+		" [--no-event-idx]"
+		" [--no-virtio-1]"
+		" [--delayed-interrupt]"
+		" [--buf-num]"
+		" [--batch=random/N]"
+		"\n");
+
+	exit(status);
+}
+
+int main(int argc, char **argv)
+{
+	unsigned long long features = (1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+		(1ULL << VIRTIO_RING_F_EVENT_IDX) | (1ULL << VIRTIO_F_VERSION_1);
+	long batch = 1, nbufs = 0x100000;
+	struct vdev_info dev;
+	bool delayed = false;
+	int o, fd;
+
+	for (;;) {
+		o = getopt_long(argc, argv, optstring, longopts, NULL);
+		switch (o) {
+		case -1:
+			goto done;
+		case '?':
+			help(2);
+		case 'e':
+			features &= ~(1ULL << VIRTIO_RING_F_EVENT_IDX);
+			break;
+		case 'h':
+			help(0);
+		case 'i':
+			features &= ~(1ULL << VIRTIO_RING_F_INDIRECT_DESC);
+			break;
+		case '0':
+			features &= ~(1ULL << VIRTIO_F_VERSION_1);
+			break;
+		case 'D':
+			delayed = true;
+			break;
+		case 'b':
+			if (!strcmp(optarg, "random")) {
+				batch = RANDOM_BATCH;
+			} else {
+				batch = strtol(optarg, NULL, 10);
+				assert(batch > 0);
+				assert(batch < (long)INT_MAX + 1);
+			}
+			break;
+		case 'n':
+			nbufs = strtol(optarg, NULL, 10);
+			assert(nbufs > 0);
+			break;
+		default:
+			assert(0);
+			break;
+		}
+	}
+
+done:
+	memset(&dev, 0, sizeof(dev));
+
+	fd = tun_alloc(&dev);
+	assert(fd >= 0);
+
+	vdev_info_init(&dev, features);
+	vq_info_add(&dev, 0, 256, fd);
+	vq_info_add(&dev, 1, 256, fd);
+	vdev_create_socket(&dev);
+
+	run_rx_test(&dev, &dev.vqs[0], delayed, batch, nbufs);
+	run_tx_test(&dev, &dev.vqs[1], delayed, batch, nbufs);
+
+	return 0;
+}
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test
  2024-01-03  9:56 ` [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test Yunsheng Lin
@ 2024-01-04 16:17   ` Eugenio Perez Martin
  2024-01-05  2:09     ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Eugenio Perez Martin @ 2024-01-04 16:17 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, virtualization

On Wed, Jan 3, 2024 at 11:00 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> introduce vhost_net_test, based on virtio_test, to test
> the vhost_net changes in the kernel.
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>  tools/virtio/Makefile         |   8 +-
>  tools/virtio/vhost_net_test.c | 574 ++++++++++++++++++++++++++++++++++
>  2 files changed, 579 insertions(+), 3 deletions(-)
>  create mode 100644 tools/virtio/vhost_net_test.c
>
> diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
> index d128925980e0..e25e99c1c3b7 100644
> --- a/tools/virtio/Makefile
> +++ b/tools/virtio/Makefile
> @@ -1,8 +1,9 @@
>  # SPDX-License-Identifier: GPL-2.0
>  all: test mod
> -test: virtio_test vringh_test
> +test: virtio_test vringh_test vhost_net_test
>  virtio_test: virtio_ring.o virtio_test.o
>  vringh_test: vringh_test.o vringh.o virtio_ring.o
> +vhost_net_test: virtio_ring.o vhost_net_test.o
>
>  try-run = $(shell set -e;              \
>         if ($(1)) >/dev/null 2>&1;      \
> @@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean
>
>  .PHONY: all test mod clean vhost oot oot-clean oot-build
>  clean:
> -       ${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \
> -              vhost_test/Module.symvers vhost_test/modules.order *.d
> +       ${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \
> +              vhost_test/.*.cmd vhost_test/Module.symvers \
> +              vhost_test/modules.order *.d
>  -include *.d
> diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c
> new file mode 100644
> index 000000000000..cfffcef53d94
> --- /dev/null
> +++ b/tools/virtio/vhost_net_test.c
> @@ -0,0 +1,574 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#include <getopt.h>
> +#include <limits.h>
> +#include <string.h>
> +#include <poll.h>
> +#include <sys/eventfd.h>
> +#include <stdlib.h>
> +#include <assert.h>
> +#include <unistd.h>
> +#include <sys/ioctl.h>
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <fcntl.h>
> +#include <stdbool.h>
> +#include <linux/virtio_types.h>
> +#include <linux/vhost.h>
> +#include <linux/virtio.h>
> +#include <linux/virtio_ring.h>
> +#include <linux/if.h>
> +#include <linux/if_tun.h>
> +#include <linux/in.h>
> +#include <linux/if_packet.h>
> +#include <netinet/ether.h>
> +
> +#define RANDOM_BATCH   -1
> +#define HDR_LEN                12
> +#define TEST_BUF_LEN   256
> +#define TEST_PTYPE     ETH_P_LOOPBACK
> +
> +/* Used by implementation of kmalloc() in tools/virtio/linux/kernel.h */
> +void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
> +
> +struct vq_info {
> +       int kick;
> +       int call;
> +       int idx;
> +       long started;
> +       long completed;
> +       struct pollfd fds;
> +       void *ring;
> +       /* copy used for control */
> +       struct vring vring;
> +       struct virtqueue *vq;
> +};
> +
> +struct vdev_info {
> +       struct virtio_device vdev;
> +       int control;
> +       struct vq_info vqs[2];
> +       int nvqs;
> +       void *buf;
> +       size_t buf_size;
> +       char *test_buf;
> +       char *res_buf;
> +       struct vhost_memory *mem;
> +       int sock;
> +       int ifindex;
> +       unsigned char mac[ETHER_ADDR_LEN];
> +};
> +
> +static int tun_alloc(struct vdev_info *dev)
> +{
> +       struct ifreq ifr;
> +       int len = HDR_LEN;
> +       int fd, e;
> +
> +       fd = open("/dev/net/tun", O_RDWR);
> +       if (fd < 0) {
> +               perror("Cannot open /dev/net/tun");
> +               return fd;
> +       }
> +
> +       memset(&ifr, 0, sizeof(ifr));
> +
> +       ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
> +       snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
> +
> +       e = ioctl(fd, TUNSETIFF, &ifr);
> +       if (e < 0) {
> +               perror("ioctl[TUNSETIFF]");
> +               close(fd);
> +               return e;
> +       }
> +
> +       e = ioctl(fd, TUNSETVNETHDRSZ, &len);
> +       if (e < 0) {
> +               perror("ioctl[TUNSETVNETHDRSZ]");
> +               close(fd);
> +               return e;
> +       }
> +
> +       e = ioctl(fd, SIOCGIFHWADDR, &ifr);
> +       if (e < 0) {
> +               perror("ioctl[SIOCGIFHWADDR]");
> +               close(fd);
> +               return e;
> +       }
> +
> +       memcpy(dev->mac, &ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
> +       return fd;
> +}
> +
> +static void vdev_create_socket(struct vdev_info *dev)
> +{
> +       struct ifreq ifr;
> +
> +       dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE));
> +       assert(dev->sock != -1);
> +
> +       snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
> +       assert(ioctl(dev->sock, SIOCGIFINDEX, &ifr) >= 0);
> +
> +       dev->ifindex = ifr.ifr_ifindex;
> +
> +       /* Set the flags that bring the device up */
> +       assert(ioctl(dev->sock, SIOCGIFFLAGS, &ifr) >= 0);
> +       ifr.ifr_flags |= (IFF_UP | IFF_RUNNING);
> +       assert(ioctl(dev->sock, SIOCSIFFLAGS, &ifr) >= 0);
> +}
> +
> +static void vdev_send_packet(struct vdev_info *dev)
> +{
> +       char *sendbuf = dev->test_buf + HDR_LEN;
> +       struct sockaddr_ll saddrll = {0};
> +       int sockfd = dev->sock;
> +       int ret;
> +
> +       memset(&saddrll, 0, sizeof(saddrll));
> +       saddrll.sll_family = PF_PACKET;
> +       saddrll.sll_ifindex = dev->ifindex;
> +       saddrll.sll_halen = ETH_ALEN;
> +       saddrll.sll_protocol = htons(TEST_PTYPE);
> +
> +       ret = sendto(sockfd, sendbuf, TEST_BUF_LEN, 0,
> +                    (struct sockaddr *)&saddrll,
> +                    sizeof(struct sockaddr_ll));
> +       assert(ret >= 0);
> +}
> +
> +static bool vq_notify(struct virtqueue *vq)
> +{
> +       struct vq_info *info = vq->priv;
> +       unsigned long long v = 1;
> +       int r;
> +
> +       r = write(info->kick, &v, sizeof(v));
> +       assert(r == sizeof(v));
> +
> +       return true;
> +}
> +
> +static void vq_callback(struct virtqueue *vq)
> +{
> +}
> +
> +static void vhost_vq_setup(struct vdev_info *dev, struct vq_info *info)
> +{
> +       struct vhost_vring_addr addr = {
> +               .index = info->idx,
> +               .desc_user_addr = (uint64_t)(unsigned long)info->vring.desc,
> +               .avail_user_addr = (uint64_t)(unsigned long)info->vring.avail,
> +               .used_user_addr = (uint64_t)(unsigned long)info->vring.used,
> +       };
> +       struct vhost_vring_state state = { .index = info->idx };
> +       struct vhost_vring_file file = { .index = info->idx };
> +       int r;
> +
> +       state.num = info->vring.num;
> +       r = ioctl(dev->control, VHOST_SET_VRING_NUM, &state);
> +       assert(r >= 0);
> +
> +       state.num = 0;
> +       r = ioctl(dev->control, VHOST_SET_VRING_BASE, &state);
> +       assert(r >= 0);
> +
> +       r = ioctl(dev->control, VHOST_SET_VRING_ADDR, &addr);
> +       assert(r >= 0);
> +
> +       file.fd = info->kick;
> +       r = ioctl(dev->control, VHOST_SET_VRING_KICK, &file);
> +       assert(r >= 0);
> +
> +       file.fd = info->call;
> +       r = ioctl(dev->control, VHOST_SET_VRING_CALL, &file);
> +       assert(r >= 0);
> +}
> +
> +static void vq_reset(struct vq_info *info, int num, struct virtio_device *vdev)
> +{
> +       if (info->vq)
> +               vring_del_virtqueue(info->vq);
> +
> +       memset(info->ring, 0, vring_size(num, 4096));
> +       vring_init(&info->vring, num, info->ring, 4096);
> +       info->vq = vring_new_virtqueue(info->idx, num, 4096, vdev, true, false,
> +                                      info->ring, vq_notify, vq_callback, "test");
> +       assert(info->vq);
> +       info->vq->priv = info;
> +}
> +
> +static void vq_info_add(struct vdev_info *dev, int idx, int num, int fd)
> +{
> +       struct vhost_vring_file backend = { .index = idx, .fd = fd };
> +       struct vq_info *info = &dev->vqs[idx];
> +       int r;
> +
> +       info->idx = idx;
> +       info->kick = eventfd(0, EFD_NONBLOCK);
> +       info->call = eventfd(0, EFD_NONBLOCK);
> +       r = posix_memalign(&info->ring, 4096, vring_size(num, 4096));
> +       assert(r >= 0);
> +       vq_reset(info, num, &dev->vdev);
> +       vhost_vq_setup(dev, info);
> +       info->fds.fd = info->call;
> +       info->fds.events = POLLIN;
> +
> +       r = ioctl(dev->control, VHOST_NET_SET_BACKEND, &backend);
> +       assert(!r);
> +}
> +
> +static void vdev_info_init(struct vdev_info *dev, unsigned long long features)
> +{
> +       struct ether_header *eh;
> +       int i, r;
> +
> +       dev->vdev.features = features;
> +       INIT_LIST_HEAD(&dev->vdev.vqs);
> +       spin_lock_init(&dev->vdev.vqs_list_lock);
> +
> +       dev->buf_size = (HDR_LEN + TEST_BUF_LEN) * 2;
> +       dev->buf = malloc(dev->buf_size);
> +       assert(dev->buf);
> +       dev->test_buf = dev->buf;
> +       dev->res_buf = dev->test_buf + HDR_LEN + TEST_BUF_LEN;
> +
> +       memset(dev->test_buf, 0, HDR_LEN + TEST_BUF_LEN);
> +       eh = (struct ether_header *)(dev->test_buf + HDR_LEN);
> +       eh->ether_type = htons(TEST_PTYPE);
> +       memcpy(eh->ether_dhost, dev->mac, ETHER_ADDR_LEN);
> +       memcpy(eh->ether_shost, dev->mac, ETHER_ADDR_LEN);
> +
> +       for (i = sizeof(*eh); i < TEST_BUF_LEN; i++)
> +               dev->test_buf[i + HDR_LEN] = (char)i;
> +
> +       dev->control = open("/dev/vhost-net", O_RDWR);
> +       assert(dev->control >= 0);
> +
> +       r = ioctl(dev->control, VHOST_SET_OWNER, NULL);
> +       assert(r >= 0);
> +
> +       dev->mem = malloc(offsetof(struct vhost_memory, regions) +
> +                         sizeof(dev->mem->regions[0]));
> +       assert(dev->mem);
> +       memset(dev->mem, 0, offsetof(struct vhost_memory, regions) +
> +              sizeof(dev->mem->regions[0]));
> +       dev->mem->nregions = 1;
> +       dev->mem->regions[0].guest_phys_addr = (long)dev->buf;
> +       dev->mem->regions[0].userspace_addr = (long)dev->buf;
> +       dev->mem->regions[0].memory_size = dev->buf_size;
> +
> +       r = ioctl(dev->control, VHOST_SET_MEM_TABLE, dev->mem);
> +       assert(r >= 0);
> +
> +       r = ioctl(dev->control, VHOST_SET_FEATURES, &features);
> +       assert(r >= 0);
> +
> +       dev->nvqs = 2;
> +}
> +
> +static void wait_for_interrupt(struct vq_info *vq)
> +{
> +       unsigned long long val;
> +
> +       poll(&vq->fds, 1, -1);
> +
> +       if (vq->fds.revents & POLLIN)
> +               read(vq->fds.fd, &val, sizeof(val));
> +}
> +
> +static void run_tx_test(struct vdev_info *dev, struct vq_info *vq,
> +                       bool delayed, int batch, int bufs)
> +{
> +       const bool random_batch = batch == RANDOM_BATCH;
> +       long long spurious = 0;
> +       struct scatterlist sl;
> +       unsigned int len;
> +       int r;
> +
> +       for (;;) {
> +               long started_before = vq->started;
> +               long completed_before = vq->completed;
> +
> +               virtqueue_disable_cb(vq->vq);
> +               do {
> +                       if (random_batch)
> +                               batch = (random() % vq->vring.num) + 1;
> +
> +                       while (vq->started < bufs &&
> +                              (vq->started - vq->completed) < batch) {
> +                               sg_init_one(&sl, dev->test_buf, HDR_LEN + TEST_BUF_LEN);
> +                               r = virtqueue_add_outbuf(vq->vq, &sl, 1,
> +                                                        dev->test_buf + vq->started,
> +                                                        GFP_ATOMIC);
> +                               if (unlikely(r != 0)) {
> +                                       if (r == -ENOSPC &&
> +                                           vq->started > started_before)
> +                                               r = 0;
> +                                       else
> +                                               r = -1;
> +                                       break;
> +                               }
> +
> +                               ++vq->started;
> +
> +                               if (unlikely(!virtqueue_kick(vq->vq))) {
> +                                       r = -1;
> +                                       break;
> +                               }
> +                       }
> +
> +                       if (vq->started >= bufs)
> +                               r = -1;
> +
> +                       /* Flush out completed bufs if any */
> +                       while (virtqueue_get_buf(vq->vq, &len)) {
> +                               int n, i;
> +
> +                               n = recvfrom(dev->sock, dev->res_buf, TEST_BUF_LEN, 0, NULL, NULL);
> +                               assert(n == TEST_BUF_LEN);
> +
> +                               for (i = ETHER_HDR_LEN; i < n; i++)
> +                                       assert(dev->res_buf[i] == (char)i);
> +
> +                               ++vq->completed;
> +                               r = 0;
> +                       }
> +               } while (r == 0);
> +
> +               if (vq->completed == completed_before && vq->started == started_before)
> +                       ++spurious;
> +
> +               assert(vq->completed <= bufs);
> +               assert(vq->started <= bufs);
> +               if (vq->completed == bufs)
> +                       break;
> +
> +               if (delayed) {
> +                       if (virtqueue_enable_cb_delayed(vq->vq))
> +                               wait_for_interrupt(vq);
> +               } else {
> +                       if (virtqueue_enable_cb(vq->vq))
> +                               wait_for_interrupt(vq);
> +               }
> +       }
> +       printf("TX spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
> +              spurious, vq->started, vq->completed);
> +}
> +
> +static void run_rx_test(struct vdev_info *dev, struct vq_info *vq,
> +                       bool delayed, int batch, int bufs)
> +{
> +       const bool random_batch = batch == RANDOM_BATCH;
> +       long long spurious = 0;
> +       struct scatterlist sl;
> +       unsigned int len;
> +       int r;
> +
> +       for (;;) {
> +               long started_before = vq->started;
> +               long completed_before = vq->completed;
> +
> +               do {
> +                       if (random_batch)
> +                               batch = (random() % vq->vring.num) + 1;
> +
> +                       while (vq->started < bufs &&
> +                              (vq->started - vq->completed) < batch) {
> +                               sg_init_one(&sl, dev->res_buf, HDR_LEN + TEST_BUF_LEN);
> +
> +                               r = virtqueue_add_inbuf(vq->vq, &sl, 1,
> +                                                       dev->res_buf + vq->started,
> +                                                       GFP_ATOMIC);
> +                               if (unlikely(r != 0)) {
> +                                       if (r == -ENOSPC &&
> +                                           vq->started > started_before)
> +                                               r = 0;
> +                                       else
> +                                               r = -1;
> +                                       break;
> +                               }
> +
> +                               ++vq->started;
> +
> +                               vdev_send_packet(dev);
> +
> +                               if (unlikely(!virtqueue_kick(vq->vq))) {
> +                                       r = -1;
> +                                       break;
> +                               }
> +                       }
> +
> +                       if (vq->started >= bufs)
> +                               r = -1;
> +
> +                       /* Flush out completed bufs if any */
> +                       while (virtqueue_get_buf(vq->vq, &len)) {
> +                               struct ether_header *eh;
> +                               int i;
> +
> +                               eh = (struct ether_header *)(dev->res_buf + HDR_LEN);
> +
> +                               /* tun netdev is up and running, ignore the
> +                                * non-TEST_PTYPE packet.
> +                                */
> +                               if (eh->ether_type != htons(TEST_PTYPE)) {
> +                                       ++vq->completed;
> +                                       r = 0;
> +                                       continue;
> +                               }
> +
> +                               assert(len == TEST_BUF_LEN + HDR_LEN);
> +
> +                               for (i = ETHER_HDR_LEN; i < TEST_BUF_LEN; i++)
> +                                       assert(dev->res_buf[i + HDR_LEN] == (char)i);
> +
> +                               ++vq->completed;
> +                               r = 0;
> +                       }
> +               } while (r == 0);
> +               if (vq->completed == completed_before && vq->started == started_before)
> +                       ++spurious;
> +
> +               assert(vq->completed <= bufs);
> +               assert(vq->started <= bufs);
> +               if (vq->completed == bufs)
> +                       break;
> +       }
> +
> +       printf("RX spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
> +              spurious, vq->started, vq->completed);
> +}

Hi!

I think the tests are great! Maybe both run_rx_test and run_tx_test
(and virtio_test.c:run_test) can be merged into a common function with
some callbacks? Not sure if it will complicate the code more.
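
Something like this, just to sketch the idea (completely untested, the
name and signature are made up):

static void run_vq_test(struct vdev_info *dev, struct vq_info *vq,
			bool delayed, int batch, int bufs,
			int (*add_buf)(struct vdev_info *dev,
				       struct vq_info *vq),
			void (*verify_buf)(struct vdev_info *dev,
					   unsigned int len))
{
	/* the shared started/completed bookkeeping and the
	 * virtqueue_add/kick/virtqueue_get_buf loop would live
	 * here, with add_buf() preparing an out or in buf and
	 * verify_buf() checking a completed one
	 */
}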

Thanks!

> +
> +static const char optstring[] = "h";
> +static const struct option longopts[] = {
> +       {
> +               .name = "help",
> +               .val = 'h',
> +       },
> +       {
> +               .name = "event-idx",
> +               .val = 'E',
> +       },
> +       {
> +               .name = "no-event-idx",
> +               .val = 'e',
> +       },
> +       {
> +               .name = "indirect",
> +               .val = 'I',
> +       },
> +       {
> +               .name = "no-indirect",
> +               .val = 'i',
> +       },
> +       {
> +               .name = "virtio-1",
> +               .val = '1',
> +       },
> +       {
> +               .name = "no-virtio-1",
> +               .val = '0',
> +       },
> +       {
> +               .name = "delayed-interrupt",
> +               .val = 'D',
> +       },
> +       {
> +               .name = "no-delayed-interrupt",
> +               .val = 'd',
> +       },
> +       {
> +               .name = "buf-num",
> +               .val = 'n',
> +               .has_arg = required_argument,
> +       },
> +       {
> +               .name = "batch",
> +               .val = 'b',
> +               .has_arg = required_argument,
> +       },
> +       {
> +       }
> +};
> +
> +static void help(int status)
> +{
> +       fprintf(stderr, "Usage: vhost_net_test [--help]"
> +               " [--no-indirect]"
> +               " [--no-event-idx]"
> +               " [--no-virtio-1]"
> +               " [--delayed-interrupt]"
> +               " [--buf-num]"
> +               " [--batch=random/N]"
> +               "\n");
> +
> +       exit(status);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +       unsigned long long features = (1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
> +               (1ULL << VIRTIO_RING_F_EVENT_IDX) | (1ULL << VIRTIO_F_VERSION_1);
> +       long batch = 1, nbufs = 0x100000;
> +       struct vdev_info dev;
> +       bool delayed = false;
> +       int o, fd;
> +
> +       for (;;) {
> +               o = getopt_long(argc, argv, optstring, longopts, NULL);
> +               switch (o) {
> +               case -1:
> +                       goto done;
> +               case '?':
> +                       help(2);
> +               case 'e':
> +                       features &= ~(1ULL << VIRTIO_RING_F_EVENT_IDX);
> +                       break;
> +               case 'h':
> +                       help(0);
> +               case 'i':
> +                       features &= ~(1ULL << VIRTIO_RING_F_INDIRECT_DESC);
> +                       break;
> +               case '0':
> +                       features &= ~(1ULL << VIRTIO_F_VERSION_1);
> +                       break;
> +               case 'D':
> +                       delayed = true;
> +                       break;
> +               case 'b':
> +                       if (!strcmp(optarg, "random")) {
> +                               batch = RANDOM_BATCH;
> +                       } else {
> +                               batch = strtol(optarg, NULL, 10);
> +                               assert(batch > 0);
> +                               assert(batch < (long)INT_MAX + 1);
> +                       }
> +                       break;
> +               case 'n':
> +                       nbufs = strtol(optarg, NULL, 10);
> +                       assert(nbufs > 0);
> +                       break;
> +               default:
> +                       assert(0);
> +                       break;
> +               }
> +       }
> +
> +done:
> +       memset(&dev, 0, sizeof(dev));
> +
> +       fd = tun_alloc(&dev);
> +       assert(fd >= 0);
> +
> +       vdev_info_init(&dev, features);
> +       vq_info_add(&dev, 0, 256, fd);
> +       vq_info_add(&dev, 1, 256, fd);
> +       vdev_create_socket(&dev);
> +
> +       run_rx_test(&dev, &dev.vqs[0], delayed, batch, nbufs);
> +       run_tx_test(&dev, &dev.vqs[1], delayed, batch, nbufs);
> +
> +       return 0;
> +}
> --
> 2.33.0
>
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test
  2024-01-04 16:17   ` Eugenio Perez Martin
@ 2024-01-05  2:09     ` Yunsheng Lin
  0 siblings, 0 replies; 31+ messages in thread
From: Yunsheng Lin @ 2024-01-05  2:09 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, virtualization

On 2024/1/5 0:17, Eugenio Perez Martin wrote:
> On Wed, Jan 3, 2024 at 11:00 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:

...

>> +
>> +static void run_tx_test(struct vdev_info *dev, struct vq_info *vq,
>> +                       bool delayed, int batch, int bufs)
>> +{
>> +       const bool random_batch = batch == RANDOM_BATCH;
>> +       long long spurious = 0;
>> +       struct scatterlist sl;
>> +       unsigned int len;
>> +       int r;
>> +
>> +       for (;;) {
>> +               long started_before = vq->started;
>> +               long completed_before = vq->completed;
>> +
>> +               virtqueue_disable_cb(vq->vq);
>> +               do {
>> +                       if (random_batch)
>> +                               batch = (random() % vq->vring.num) + 1;
>> +
>> +                       while (vq->started < bufs &&
>> +                              (vq->started - vq->completed) < batch) {
>> +                               sg_init_one(&sl, dev->test_buf, HDR_LEN + TEST_BUF_LEN);
>> +                               r = virtqueue_add_outbuf(vq->vq, &sl, 1,
>> +                                                        dev->test_buf + vq->started,
>> +                                                        GFP_ATOMIC);
>> +                               if (unlikely(r != 0)) {
>> +                                       if (r == -ENOSPC &&
>> +                                           vq->started > started_before)
>> +                                               r = 0;
>> +                                       else
>> +                                               r = -1;
>> +                                       break;
>> +                               }
>> +
>> +                               ++vq->started;
>> +
>> +                               if (unlikely(!virtqueue_kick(vq->vq))) {
>> +                                       r = -1;
>> +                                       break;
>> +                               }
>> +                       }
>> +
>> +                       if (vq->started >= bufs)
>> +                               r = -1;
>> +
>> +                       /* Flush out completed bufs if any */
>> +                       while (virtqueue_get_buf(vq->vq, &len)) {
>> +                               int n, i;
>> +
>> +                               n = recvfrom(dev->sock, dev->res_buf, TEST_BUF_LEN, 0, NULL, NULL);
>> +                               assert(n == TEST_BUF_LEN);
>> +
>> +                               for (i = ETHER_HDR_LEN; i < n; i++)
>> +                                       assert(dev->res_buf[i] == (char)i);
>> +
>> +                               ++vq->completed;
>> +                               r = 0;
>> +                       }
>> +               } while (r == 0);
>> +
>> +               if (vq->completed == completed_before && vq->started == started_before)
>> +                       ++spurious;
>> +
>> +               assert(vq->completed <= bufs);
>> +               assert(vq->started <= bufs);
>> +               if (vq->completed == bufs)
>> +                       break;
>> +
>> +               if (delayed) {
>> +                       if (virtqueue_enable_cb_delayed(vq->vq))
>> +                               wait_for_interrupt(vq);
>> +               } else {
>> +                       if (virtqueue_enable_cb(vq->vq))
>> +                               wait_for_interrupt(vq);
>> +               }
>> +       }
>> +       printf("TX spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
>> +              spurious, vq->started, vq->completed);
>> +}
>> +
>> +static void run_rx_test(struct vdev_info *dev, struct vq_info *vq,
>> +                       bool delayed, int batch, int bufs)
>> +{
>> +       const bool random_batch = batch == RANDOM_BATCH;
>> +       long long spurious = 0;
>> +       struct scatterlist sl;
>> +       unsigned int len;
>> +       int r;
>> +
>> +       for (;;) {
>> +               long started_before = vq->started;
>> +               long completed_before = vq->completed;
>> +
>> +               do {
>> +                       if (random_batch)
>> +                               batch = (random() % vq->vring.num) + 1;
>> +
>> +                       while (vq->started < bufs &&
>> +                              (vq->started - vq->completed) < batch) {
>> +                               sg_init_one(&sl, dev->res_buf, HDR_LEN + TEST_BUF_LEN);
>> +
>> +                               r = virtqueue_add_inbuf(vq->vq, &sl, 1,
>> +                                                       dev->res_buf + vq->started,
>> +                                                       GFP_ATOMIC);
>> +                               if (unlikely(r != 0)) {
>> +                                       if (r == -ENOSPC &&
>> +                                           vq->started > started_before)
>> +                                               r = 0;
>> +                                       else
>> +                                               r = -1;
>> +                                       break;
>> +                               }
>> +
>> +                               ++vq->started;
>> +
>> +                               vdev_send_packet(dev);
>> +
>> +                               if (unlikely(!virtqueue_kick(vq->vq))) {
>> +                                       r = -1;
>> +                                       break;
>> +                               }
>> +                       }
>> +
>> +                       if (vq->started >= bufs)
>> +                               r = -1;
>> +
>> +                       /* Flush out completed bufs if any */
>> +                       while (virtqueue_get_buf(vq->vq, &len)) {
>> +                               struct ether_header *eh;
>> +                               int i;
>> +
>> +                               eh = (struct ether_header *)(dev->res_buf + HDR_LEN);
>> +
>> +                               /* tun netdev is up and running, ignore the
>> +                                * non-TEST_PTYPE packet.
>> +                                */
>> +                               if (eh->ether_type != htons(TEST_PTYPE)) {
>> +                                       ++vq->completed;
>> +                                       r = 0;
>> +                                       continue;
>> +                               }
>> +
>> +                               assert(len == TEST_BUF_LEN + HDR_LEN);
>> +
>> +                               for (i = ETHER_HDR_LEN; i < TEST_BUF_LEN; i++)
>> +                                       assert(dev->res_buf[i + HDR_LEN] == (char)i);
>> +
>> +                               ++vq->completed;
>> +                               r = 0;
>> +                       }
>> +               } while (r == 0);
>> +               if (vq->completed == completed_before && vq->started == started_before)
>> +                       ++spurious;
>> +
>> +               assert(vq->completed <= bufs);
>> +               assert(vq->started <= bufs);
>> +               if (vq->completed == bufs)
>> +                       break;
>> +       }
>> +
>> +       printf("RX spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
>> +              spurious, vq->started, vq->completed);
>> +}
> 
> Hi!
> 
> I think the tests are great! Maybe both run_rx_test and run_tx_test
> (and virtio_test.c:run_test) can be merged into a common function with
> some callbacks? Not sure if it will complicate the code more.

I looked at it more closely; it seems using callbacks just makes the
code more complicated without obvious benefit, as the main steps are
different for tx and rx.

For tx:
1. Prepare a out buf
2. Kick the vhost_net to do tx processing
3. Do the receiving in the tun side
4. verify the data received by tun is correct

For rx:
1. Prepare a in buf
2. Do the sending in the tun side
3. Kick the vhost_net to do rx processing
4. verify the data received by vhost_net is correct

I could refactor out a common function to do the data verification in
step 4 of both tx and rx.
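
A minimal sketch of what I have in mind (untested; the helper name is
just illustrative):

static void verify_res_buf(char *res_buf, int skip, int len)
{
	int i;

	/* both tests fill/expect the (char)i pattern starting at
	 * ETHER_HDR_LEN; the rx side has an extra HDR_LEN of vnet
	 * header in front, which the caller passes in as 'skip'
	 */
	for (i = ETHER_HDR_LEN; i < len; i++)
		assert(res_buf[i + skip] == (char)i);
}

The tx side would then call verify_res_buf(dev->res_buf, 0, n) and the
rx side verify_res_buf(dev->res_buf, HDR_LEN, TEST_BUF_LEN).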

For virtio_test.c:run_test, it does not seem to use vhost_net directly;
it uses the shim vhost_test.ko module, and it seems to cover only steps
1 & 3 of vhost_net_test's rx testing. The new vhost_net_test should be
able to replace virtio_test. I thought about deleting virtio_test after
adding vhost_net_test, but I am not familiar enough with virtio to make
that judgement yet.

> 
> Thanks!
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 1/6] mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument
  2024-01-03  9:56 ` [PATCH net-next 1/6] mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument Yunsheng Lin
@ 2024-01-05 15:28   ` Alexander H Duyck
  2024-01-08  8:20     ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander H Duyck @ 2024-01-05 15:28 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: netdev, linux-kernel, Andrew Morton, Eric Dumazet, linux-mm

On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
> napi_alloc_frag_align() and netdev_alloc_frag_align() accept
> align as an argument, and they are thin wrappers around the
> __napi_alloc_frag_align() and __netdev_alloc_frag_align() APIs
> doing the align and align_mask conversion, in order to call
> page_frag_alloc_align() directly.
> 
> As __napi_alloc_frag_align() and __netdev_alloc_frag_align()
> APIs are only used by the above thin wrappers, it seems that
> it makes more sense to remove align and align_mask conversion
> and call page_frag_alloc_align() directly. By doing that, we
> can also avoid the confusion between napi_alloc_frag_align()
> accepting align as an argument and page_frag_alloc_align()
> accepting align_mask as an argument when they both have the
> 'align' suffix.
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> CC: Alexander Duyck <alexander.duyck@gmail.com>

This patch overlooks much of the main reason for having the wrappers.
By having the in-line wrapper and passing the argument as a mask we can
avoid having to verify the alignment value during execution time since
it can usually be handled during compile time.

By moving it into the function itself we are adding additional CPU
overhead per page as we will have to go through and validate the
alignment value for every single page instead of when the driver using
the function is compiled.

The overhead may not seem like much, but when you are having to deal
with it per page and you are processing pages at millions per second it
can quickly start to add up.

This is essentially a code cleanup at the cost of some amount of
performance.
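
For reference, the existing wrapper is roughly the following (from
memory, see include/linux/skbuff.h):

static inline void *napi_alloc_frag_align(unsigned int fragsz,
					  unsigned int align)
{
	WARN_ON_ONCE(!is_power_of_2(align));
	return __napi_alloc_frag_align(fragsz, -align);
}

With 'align' being a compile-time constant, the is_power_of_2() check
and the conversion to a mask are folded away by the compiler.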

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 2/6] page_frag: unify gfp bits for order 3 page allocation
  2024-01-03  9:56 ` [PATCH net-next 2/6] page_frag: unify gfp bits for order 3 page allocation Yunsheng Lin
@ 2024-01-05 15:35   ` Alexander H Duyck
  2024-01-08  8:25     ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander H Duyck @ 2024-01-05 15:35 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: netdev, linux-kernel, Michael S. Tsirkin, Jason Wang,
	Andrew Morton, Eric Dumazet, kvm, virtualization, linux-mm

On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
> Currently there seem to be three page frag implementations
> which all try to allocate an order 3 page; if that fails,
> they then fall back to allocating an order 0 page, and each
> of them allows the order 3 page allocation to fail under
> certain conditions by using specific gfp bits.
> 
> The gfp bits for order 3 page allocation differ between the
> implementations: __GFP_NOMEMALLOC is or'd to forbid access
> to the emergency reserves in __page_frag_cache_refill(), but
> it is not or'd in the other implementations, and
> __GFP_DIRECT_RECLAIM is masked off to avoid direct reclaim
> in skb_page_frag_refill(), but it is not masked off in
> __page_frag_cache_refill().
> 
> This patch unifies the gfp bits used between the different
> implementations by or'ing __GFP_NOMEMALLOC and masking off
> __GFP_DIRECT_RECLAIM for order 3 page allocation, to avoid
> possible pressure on mm.
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> ---
>  drivers/vhost/net.c | 2 +-
>  mm/page_alloc.c     | 4 ++--
>  net/core/sock.c     | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index f2ed7167c848..e574e21cc0ca 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
>  		/* Avoid direct reclaim but allow kswapd to wake */
>  		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
>  					  __GFP_COMP | __GFP_NOWARN |
> -					  __GFP_NORETRY,
> +					  __GFP_NORETRY | __GFP_NOMEMALLOC,
>  					  SKB_FRAG_PAGE_ORDER);
>  		if (likely(pfrag->page)) {
>  			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9a16305cf985..1f0b36dd81b5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4693,8 +4693,8 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>  	gfp_t gfp = gfp_mask;
>  
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> -	gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
> -		    __GFP_NOMEMALLOC;
> +	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> +		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>  	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
>  				PAGE_FRAG_CACHE_MAX_ORDER);
>  	nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 446e945f736b..d643332c3ee5 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2900,7 +2900,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
>  		/* Avoid direct reclaim but allow kswapd to wake */
>  		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
>  					  __GFP_COMP | __GFP_NOWARN |
> -					  __GFP_NORETRY,
> +					  __GFP_NORETRY | __GFP_NOMEMALLOC,
>  					  SKB_FRAG_PAGE_ORDER);
>  		if (likely(pfrag->page)) {
>  			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;

Looks fine to me.

One thing you may want to consider would be to place this all in an
inline function that could just consolidate all the code.
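
e.g. something along these lines (untested, the helper name is made
up):

static inline struct page *page_frag_alloc_order3(gfp_t gfp)
{
	/* avoid direct reclaim but allow kswapd to wake, forbid
	 * access to the emergency reserves and don't retry
	 */
	return alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
			   __GFP_COMP | __GFP_NOWARN |
			   __GFP_NORETRY | __GFP_NOMEMALLOC,
			   SKB_FRAG_PAGE_ORDER);
}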

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 3/6] mm/page_alloc: use initial zero offset for page_frag_alloc_align()
  2024-01-03  9:56 ` [PATCH net-next 3/6] mm/page_alloc: use initial zero offset for page_frag_alloc_align() Yunsheng Lin
@ 2024-01-05 15:42   ` Alexander H Duyck
  2024-01-08  8:59     ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander H Duyck @ 2024-01-05 15:42 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: netdev, linux-kernel, Andrew Morton, linux-mm

On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
> The next patch is about to use page_frag_alloc_align() to
> replace vhost_net_page_frag_refill(); the main difference
> between those two page frag implementations is whether we
> use an initial zero offset or not.
> 
> It seems more natural to use an initial zero offset, as it
> may enable better cache prefetching and skb frag coalescing
> in the networking stack, so change it to use an initial
> zero offset.
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> CC: Alexander Duyck <alexander.duyck@gmail.com>

There are several advantages to running the offset as a countdown
rather than a count-up value.

1. Specifically for the size of the chunks we are allocating, doing it
from the bottom up doesn't add any value: we are jumping in large
enough amounts, and the buffers are used for DMA, so being sequential
doesn't buy us anything.

2. By starting at the end and working toward zero we can use the
built-in functionality of the CPU: we only have to check whether our
result would be signed, rather than having to load two registers with
the values and then compare them, which saves us a few cycles. In
addition it saves us from having to read both the size and the offset
for every page.

Again this is another code cleanup at the cost of performance. I
realize many of the items you are removing would be considered micro-
optimizations but when we are dealing with millions of packets per
second those optimizations add up.
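
To illustrate point 2, the count-down hot path is basically the
following (simplified, with refill() standing in for the slow path):

	offset = nc->offset - fragsz;
	if (unlikely(offset < 0))	/* the sign test is free */
		refill();
	nc->offset = offset;

whereas a count-up offset has to read nc->size as well and do an
explicit compare:

	offset = nc->offset;
	if (unlikely(offset + fragsz > nc->size))
		refill();
	nc->offset = offset + fragsz;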

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 5/6] net: introduce page_frag_cache_drain()
  2024-01-03  9:56   ` Yunsheng Lin
@ 2024-01-05 15:48     ` Alexander H Duyck
  0 siblings, 0 replies; 31+ messages in thread
From: Alexander H Duyck @ 2024-01-05 15:48 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: netdev, linux-kernel, Jason Wang, Jeroen de Borst,
	Praveen Kaligineedi, Shailend Chand, Eric Dumazet, Felix Fietkau,
	John Crispin, Sean Wang, Mark Lee, Lorenzo Bianconi,
	Matthias Brugger, AngeloGioacchino Del Regno, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni,
	Michael S. Tsirkin, Andrew Morton, linux-arm-kernel,
	linux-mediatek, linux-nvme, kvm, virtualization, linux-mm

On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
> When draining a page_frag_cache, most users are doing
> similar steps, so introduce an API to avoid code
> duplication.
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> Acked-by: Jason Wang <jasowang@redhat.com>

Looks good to me.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill()
  2024-01-03  9:56 ` [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill() Yunsheng Lin
@ 2024-01-05 16:06   ` Alexander H Duyck
  2024-01-08  9:06     ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander H Duyck @ 2024-01-05 16:06 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: netdev, linux-kernel, Jason Wang, Michael S. Tsirkin,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, kvm, virtualization, bpf

On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
> The page frag in vhost_net_page_frag_refill() uses the
> 'struct page_frag' from skb_page_frag_refill(), but its
> implementation is similar to page_frag_alloc_align() now.
> 
> This patch removes vhost_net_page_frag_refill() by using
> 'struct page_frag_cache' instead of 'struct page_frag',
> and allocating frag using page_frag_alloc_align().
> 
> The added benefit is that it not only unifies the page frag
> implementation a little, but also gives about a 0.5% performance
> boost in testing with the vhost_net_test introduced in the
> last patch.
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> Acked-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/net.c | 93 ++++++++++++++-------------------------------
>  1 file changed, 29 insertions(+), 64 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index e574e21cc0ca..805e11d598e4 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -141,10 +141,8 @@ struct vhost_net {
>  	unsigned tx_zcopy_err;
>  	/* Flush in progress. Protected by tx vq lock. */
>  	bool tx_flush;
> -	/* Private page frag */
> -	struct page_frag page_frag;
> -	/* Refcount bias of page frag */
> -	int refcnt_bias;
> +	/* Private page frag cache */
> +	struct page_frag_cache pf_cache;
>  };
>  
>  static unsigned vhost_net_zcopy_mask __read_mostly;
> @@ -655,41 +653,6 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
>  	       !vhost_vq_avail_empty(vq->dev, vq);
>  }
>  
> -static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
> -				       struct page_frag *pfrag, gfp_t gfp)
> -{
> -	if (pfrag->page) {
> -		if (pfrag->offset + sz <= pfrag->size)
> -			return true;
> -		__page_frag_cache_drain(pfrag->page, net->refcnt_bias);
> -	}
> -
> -	pfrag->offset = 0;
> -	net->refcnt_bias = 0;
> -	if (SKB_FRAG_PAGE_ORDER) {
> -		/* Avoid direct reclaim but allow kswapd to wake */
> -		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> -					  __GFP_COMP | __GFP_NOWARN |
> -					  __GFP_NORETRY | __GFP_NOMEMALLOC,
> -					  SKB_FRAG_PAGE_ORDER);
> -		if (likely(pfrag->page)) {
> -			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> -			goto done;
> -		}
> -	}
> -	pfrag->page = alloc_page(gfp);
> -	if (likely(pfrag->page)) {
> -		pfrag->size = PAGE_SIZE;
> -		goto done;
> -	}
> -	return false;
> -
> -done:
> -	net->refcnt_bias = USHRT_MAX;
> -	page_ref_add(pfrag->page, USHRT_MAX - 1);
> -	return true;
> -}
> -
>  #define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
>  
>  static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
> @@ -699,7 +662,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>  	struct vhost_net *net = container_of(vq->dev, struct vhost_net,
>  					     dev);
>  	struct socket *sock = vhost_vq_get_backend(vq);
> -	struct page_frag *alloc_frag = &net->page_frag;
>  	struct virtio_net_hdr *gso;
>  	struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
>  	struct tun_xdp_hdr *hdr;
> @@ -710,6 +672,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>  	int sock_hlen = nvq->sock_hlen;
>  	void *buf;
>  	int copied;
> +	int ret;
>  
>  	if (unlikely(len < nvq->sock_hlen))
>  		return -EFAULT;
> @@ -719,18 +682,17 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>  		return -ENOSPC;
>  
>  	buflen += SKB_DATA_ALIGN(len + pad);
> -	alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
> -	if (unlikely(!vhost_net_page_frag_refill(net, buflen,
> -						 alloc_frag, GFP_KERNEL)))
> +	buf = page_frag_alloc_align(&net->pf_cache, buflen, GFP_KERNEL,
> +				    SMP_CACHE_BYTES);

If your changes from patch 1 are just to make it fit into this layout,
might I suggest just splitting up page_frag_alloc_align into an inline
that accepts the arguments you have here, and adding
__page_frag_alloc_align which is passed the mask the original function
expected. By doing that you should be able to maintain the same level
of performance and still get most of the code cleanup.
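
i.e. something like this (untested):

void *__page_frag_alloc_align(struct page_frag_cache *nc,
			      unsigned int fragsz, gfp_t gfp_mask,
			      unsigned int align_mask);

static inline void *page_frag_alloc_align(struct page_frag_cache *nc,
					  unsigned int fragsz,
					  gfp_t gfp_mask,
					  unsigned int align)
{
	WARN_ON_ONCE(!is_power_of_2(align));
	return __page_frag_alloc_align(nc, fragsz, gfp_mask, -align);
}

That way call sites like this one can pass the alignment directly
while the existing wrappers keep the compile-time mask conversion.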

> +	if (unlikely(!buf))
>  		return -ENOMEM;
>  
> -	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> -	copied = copy_page_from_iter(alloc_frag->page,
> -				     alloc_frag->offset +
> -				     offsetof(struct tun_xdp_hdr, gso),
> -				     sock_hlen, from);
> -	if (copied != sock_hlen)
> -		return -EFAULT;
> +	copied = copy_from_iter(buf + offsetof(struct tun_xdp_hdr, gso),
> +				sock_hlen, from);
> +	if (copied != sock_hlen) {
> +		ret = -EFAULT;
> +		goto err;
> +	}
>  
>  	hdr = buf;
>  	gso = &hdr->gso;
> @@ -743,27 +705,30 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>  			       vhost16_to_cpu(vq, gso->csum_start) +
>  			       vhost16_to_cpu(vq, gso->csum_offset) + 2);
>  
> -		if (vhost16_to_cpu(vq, gso->hdr_len) > len)
> -			return -EINVAL;
> +		if (vhost16_to_cpu(vq, gso->hdr_len) > len) {
> +			ret = -EINVAL;
> +			goto err;
> +		}
>  	}
>  
>  	len -= sock_hlen;
> -	copied = copy_page_from_iter(alloc_frag->page,
> -				     alloc_frag->offset + pad,
> -				     len, from);
> -	if (copied != len)
> -		return -EFAULT;
> +	copied = copy_from_iter(buf + pad, len, from);
> +	if (copied != len) {
> +		ret = -EFAULT;
> +		goto err;
> +	}
>  
>  	xdp_init_buff(xdp, buflen, NULL);
>  	xdp_prepare_buff(xdp, buf, pad, len, true);
>  	hdr->buflen = buflen;
>  
> -	--net->refcnt_bias;
> -	alloc_frag->offset += buflen;
> -
>  	++nvq->batched_xdp;
>  
>  	return 0;
> +
> +err:
> +	page_frag_free(buf);
> +	return ret;
>  }
>  
>  static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
> @@ -1353,8 +1318,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>  			vqs[VHOST_NET_VQ_RX]);
>  
>  	f->private_data = n;
> -	n->page_frag.page = NULL;
> -	n->refcnt_bias = 0;
> +	n->pf_cache.va = NULL;
>  
>  	return 0;
>  }
> @@ -1422,8 +1386,9 @@ static int vhost_net_release(struct inode *inode, struct file *f)
>  	kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
>  	kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
>  	kfree(n->dev.vqs);
> -	if (n->page_frag.page)
> -		__page_frag_cache_drain(n->page_frag.page, n->refcnt_bias);
> +	if (n->pf_cache.va)
> +		__page_frag_cache_drain(virt_to_head_page(n->pf_cache.va),
> +					n->pf_cache.pagecnt_bias);
>  	kvfree(n);
>  	return 0;
>  }

I would recommend reordering this patch with patch 5. Then you could
remove the block that is setting "n->pf_cache.va = NULL" above and just
make use of page_frag_cache_drain in the lower block which would also
return the va to NULL.
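
i.e. assuming the page_frag_cache_drain() helper from patch 5, the
release path would collapse to something like (untested):

	kfree(n->dev.vqs);
	page_frag_cache_drain(&n->pf_cache);
	kvfree(n);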

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 1/6] mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument
  2024-01-05 15:28   ` Alexander H Duyck
@ 2024-01-08  8:20     ` Yunsheng Lin
  0 siblings, 0 replies; 31+ messages in thread
From: Yunsheng Lin @ 2024-01-08  8:20 UTC (permalink / raw)
  To: Alexander H Duyck, davem, kuba, pabeni
  Cc: netdev, linux-kernel, Andrew Morton, Eric Dumazet, linux-mm

On 2024/1/5 23:28, Alexander H Duyck wrote:
> On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
>> napi_alloc_frag_align() and netdev_alloc_frag_align() accept
>> align as an argument, and they are thin wrappers around the
>> __napi_alloc_frag_align() and __netdev_alloc_frag_align() APIs
>> doing the align and align_mask conversion, in order to call
>> page_frag_alloc_align() directly.
>>
>> As __napi_alloc_frag_align() and __netdev_alloc_frag_align()
>> APIs are only used by the above thin wrappers, it seems that
>> it makes more sense to remove align and align_mask conversion
>> and call page_frag_alloc_align() directly. By doing that, we
>> can also avoid the confusion between napi_alloc_frag_align()
>> accepting align as an argument and page_frag_alloc_align()
>> accepting align_mask as an argument when they both have the
>> 'align' suffix.
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> CC: Alexander Duyck <alexander.duyck@gmail.com>
> 
> This patch overlooks much of the main reason for having the wrappers.
> By having the in-line wrapper and passing the argument as a mask we can
> avoid having to verify the alignment value during execution time since
> it can usually be handled during compile time.

When it cannot be handled at compile time, we might end up with a bigger
kernel executable, right? Doesn't that defeat some of the purpose of
inlining?

But judging from the callers of those APIs, handling it at compile time
does seem to be the usual case here. Will add a __page_frag_alloc_align()
which is passed the mask the original function expected, as you suggested.

> 
> By moving it into the function itself we are adding additional CPU
> overhead per page as we will have to go through and validate the
> alignment value for every single page instead of when the driver using
> the function is compiled.
> 
> The overhead may not seem like much, but when you are having to deal
> with it per page and you are processing pages at millions per second it
> can quickly start to add up.
> 
> This is essentially a code cleanup at the cost of some amount of
> performance.
> 
> .
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 2/6] page_frag: unify gfp bits for order 3 page allocation
  2024-01-05 15:35   ` Alexander H Duyck
@ 2024-01-08  8:25     ` Yunsheng Lin
  2024-01-08 16:13       ` Alexander Duyck
  0 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2024-01-08  8:25 UTC (permalink / raw)
  To: Alexander H Duyck, davem, kuba, pabeni
  Cc: netdev, linux-kernel, Michael S. Tsirkin, Jason Wang,
	Andrew Morton, Eric Dumazet, kvm, virtualization, linux-mm

On 2024/1/5 23:35, Alexander H Duyck wrote:
> On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
>> Currently there seem to be three page frag implementations
>> which all try to allocate an order 3 page; if that fails,
>> they then fall back to allocating an order 0 page, and each
>> of them allows the order 3 page allocation to fail under
>> certain conditions by using specific gfp bits.
>>
>> The gfp bits for order 3 page allocation differ between the
>> implementations: __GFP_NOMEMALLOC is or'd to forbid access
>> to the emergency reserves in __page_frag_cache_refill(), but
>> it is not or'd in the other implementations, and
>> __GFP_DIRECT_RECLAIM is masked off to avoid direct reclaim
>> in skb_page_frag_refill(), but it is not masked off in
>> __page_frag_cache_refill().
>>
>> This patch unifies the gfp bits used between the different
>> implementations by or'ing __GFP_NOMEMALLOC and masking off
>> __GFP_DIRECT_RECLAIM for order 3 page allocation, to avoid
>> possible pressure on mm.
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> CC: Alexander Duyck <alexander.duyck@gmail.com>
>> ---
>>  drivers/vhost/net.c | 2 +-
>>  mm/page_alloc.c     | 4 ++--
>>  net/core/sock.c     | 2 +-
>>  3 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index f2ed7167c848..e574e21cc0ca 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
>>  		/* Avoid direct reclaim but allow kswapd to wake */
>>  		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
>>  					  __GFP_COMP | __GFP_NOWARN |
>> -					  __GFP_NORETRY,
>> +					  __GFP_NORETRY | __GFP_NOMEMALLOC,
>>  					  SKB_FRAG_PAGE_ORDER);
>>  		if (likely(pfrag->page)) {
>>  			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 9a16305cf985..1f0b36dd81b5 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4693,8 +4693,8 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>>  	gfp_t gfp = gfp_mask;
>>  
>>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>> -	gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
>> -		    __GFP_NOMEMALLOC;
>> +	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
>> +		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>>  	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
>>  				PAGE_FRAG_CACHE_MAX_ORDER);
>>  	nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 446e945f736b..d643332c3ee5 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -2900,7 +2900,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
>>  		/* Avoid direct reclaim but allow kswapd to wake */
>>  		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
>>  					  __GFP_COMP | __GFP_NOWARN |
>> -					  __GFP_NORETRY,
>> +					  __GFP_NORETRY | __GFP_NOMEMALLOC,
>>  					  SKB_FRAG_PAGE_ORDER);
>>  		if (likely(pfrag->page)) {
>>  			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> 
> Looks fine to me.
> 
> One thing you may want to consider would be to place this all in an
> inline function that could just consolidate all the code.

Do you think it is possible to further unify the implementations of
'struct page_frag_cache' and 'struct page_frag', so that adding an inline
function for the above becomes unnecessary?
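
(For reference, the consolidation Alexander suggests could look roughly like
the helper below; the name and placement are assumptions, not part of the
posted series:)

/* Hypothetical helper unifying the gfp bits all three implementations
 * would use for the order 3 allocation attempt: avoid direct reclaim,
 * retries and warnings, and stay out of the emergency reserves.
 */
static inline gfp_t page_frag_order3_gfp(gfp_t gfp)
{
	return (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
	       __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
}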

> 
> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
> 
> .
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 3/6] mm/page_alloc: use initial zero offset for page_frag_alloc_align()
  2024-01-05 15:42   ` Alexander H Duyck
@ 2024-01-08  8:59     ` Yunsheng Lin
  2024-01-08 16:25       ` Alexander Duyck
  0 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2024-01-08  8:59 UTC (permalink / raw)
  To: Alexander H Duyck, davem, kuba, pabeni
  Cc: netdev, linux-kernel, Andrew Morton, linux-mm

On 2024/1/5 23:42, Alexander H Duyck wrote:
> On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
>> The next patch is about to use page_frag_alloc_align() to
>> replace vhost_net_page_frag_refill(); the main difference
>> between those two frag page implementations is whether we
>> use an initial zero offset or not.
>>
>> It seems more natural to use an initial zero offset, as it
>> may enable more correct cache prefetching and skb frag
>> coalescing in the networking stack, so change it to use an
>> initial zero offset.
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> CC: Alexander Duyck <alexander.duyck@gmail.com>
> 
> There are several advantages to running the offset as a countdown
> rather than count-up value.
> 
> 1. Specifically for the size of the chunks we are allocating doing it
> from the bottom up doesn't add any value as we are jumping in large
> enough amounts and are being used for DMA so being sequential doesn't
> add any value.

What is the expected size of the above chunks in your mind? I suppose
that is something like NAPI_HAS_SMALL_PAGE_FRAG, to avoid excessive
truesize underestimation?

It seems there is no lower limit on the allocation size for
page_frag_alloc_align() now, and as the page_frag API seems to be used
only in networking, should we enforce a minimum allocation size at the
API level?

> 
> 2. By starting at the end and working toward zero we can use built in
> functionality of the CPU to only have to check and see if our result
> would be signed rather than having to load two registers with the
> values and then compare them which saves us a few cycles. In addition
> it saves us from having to read both the size and the offset for every
> page.

I suppose the above is ok if we only use the page_frag_alloc*() API to
allocate memory for skb->data, not for the frags in skb_shinfo(), as by
starting at the end and working toward zero we cannot do skb
coalescing.

As page_frag_alloc*() returns a va now, I am assuming most users are
using the API for skb->data, so I guess it is ok to drop this patch for
now.

If we allow page_frag_alloc*() to return a struct page, we might need
this patch to enable coalescing.
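
(For reference, the count-down scheme described in point 2 corresponds roughly
to the fast path of page_frag_alloc_align() in mm/page_alloc.c as of v6.7,
simplified here; refill_size stands in for the refilled cache size:)

	/* offset counts down, so a single signed comparison detects
	 * exhaustion without also having to load and compare against a
	 * separately stored size
	 */
	int offset = nc->offset - fragsz;

	if (unlikely(offset < 0)) {
		/* page exhausted: refill the cache, restart at the top */
		offset = refill_size - fragsz;
	}

	nc->pagecnt_bias--;
	nc->offset = offset;
	return nc->va + offset;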

> 
> Again this is another code cleanup at the cost of performance. I
> realize many of the items you are removing would be considered micro-
> optimizations but when we are dealing with millions of packets per
> second those optimizations add up.
> .
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill()
  2024-01-05 16:06   ` Alexander H Duyck
@ 2024-01-08  9:06     ` Yunsheng Lin
  2024-01-08 15:45       ` Alexander Duyck
  0 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2024-01-08  9:06 UTC (permalink / raw)
  To: Alexander H Duyck, davem, kuba, pabeni
  Cc: netdev, linux-kernel, Jason Wang, Michael S. Tsirkin,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, kvm, virtualization, bpf

On 2024/1/6 0:06, Alexander H Duyck wrote:
>>  
>>  static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>> @@ -1353,8 +1318,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>>  			vqs[VHOST_NET_VQ_RX]);
>>  
>>  	f->private_data = n;
>> -	n->page_frag.page = NULL;
>> -	n->refcnt_bias = 0;
>> +	n->pf_cache.va = NULL;
>>  
>>  	return 0;
>>  }
>> @@ -1422,8 +1386,9 @@ static int vhost_net_release(struct inode *inode, struct file *f)
>>  	kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
>>  	kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
>>  	kfree(n->dev.vqs);
>> -	if (n->page_frag.page)
>> -		__page_frag_cache_drain(n->page_frag.page, n->refcnt_bias);
>> +	if (n->pf_cache.va)
>> +		__page_frag_cache_drain(virt_to_head_page(n->pf_cache.va),
>> +					n->pf_cache.pagecnt_bias);
>>  	kvfree(n);
>>  	return 0;
>>  }
> 
> I would recommend reordering this patch with patch 5. Then you could
> remove the block that is setting "n->pf_cache.va = NULL" above and just
> make use of page_frag_cache_drain in the lower block which would also
> return the va to NULL.

I am not sure we can, as there is no zeroing of 'struct vhost_net' in
vhost_net_open().

If we don't have "n->pf_cache.va = NULL", wouldn't we be using
uninitialized data when calling page_frag_alloc_align() for the first
time?

> .
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill()
  2024-01-08  9:06     ` Yunsheng Lin
@ 2024-01-08 15:45       ` Alexander Duyck
  0 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2024-01-08 15:45 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Jason Wang,
	Michael S. Tsirkin, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, kvm, virtualization, bpf

On Mon, Jan 8, 2024 at 1:06 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/1/6 0:06, Alexander H Duyck wrote:
> >>
> >>  static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
> >> @@ -1353,8 +1318,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> >>                      vqs[VHOST_NET_VQ_RX]);
> >>
> >>      f->private_data = n;
> >> -    n->page_frag.page = NULL;
> >> -    n->refcnt_bias = 0;
> >> +    n->pf_cache.va = NULL;
> >>
> >>      return 0;
> >>  }
> >> @@ -1422,8 +1386,9 @@ static int vhost_net_release(struct inode *inode, struct file *f)
> >>      kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
> >>      kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
> >>      kfree(n->dev.vqs);
> >> -    if (n->page_frag.page)
> >> -            __page_frag_cache_drain(n->page_frag.page, n->refcnt_bias);
> >> +    if (n->pf_cache.va)
> >> +            __page_frag_cache_drain(virt_to_head_page(n->pf_cache.va),
> >> +                                    n->pf_cache.pagecnt_bias);
> >>      kvfree(n);
> >>      return 0;
> >>  }
> >
> > I would recommend reordering this patch with patch 5. Then you could
> > remove the block that is setting "n->pf_cache.va = NULL" above and just
> > make use of page_frag_cache_drain in the lower block which would also
> > return the va to NULL.
>
> I am not sure we can, as there is no zeroing of 'struct vhost_net' in
> vhost_net_open().
>
> If we don't have "n->pf_cache.va = NULL", wouldn't we be using
> uninitialized data when calling page_frag_alloc_align() for the first
> time?

I see. So kvmalloc is used instead of kvzalloc when allocating the
structure. That might be an opportunity to clean things up a bit by
making that change to reduce the risk of some piece of memory
initialization being missed.

That said, I still think reordering the two patches might be useful, as
it would help to make it so that the change you make to vhost_net is
encapsulated in one patch that fully enables the use of the new page
frag API.
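
(page_frag_cache_drain() here refers to the helper introduced in patch 5 of
the series; a minimal sketch of the behavior being relied on, assuming the
helper both drains the remaining bias and resets the va:)

void page_frag_cache_drain(struct page_frag_cache *nc)
{
	if (!nc->va)
		return;

	__page_frag_cache_drain(virt_to_head_page(nc->va), nc->pagecnt_bias);
	nc->va = NULL;	/* leaves the cache ready for the next refill */
}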

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 2/6] page_frag: unify gfp bits for order 3 page allocation
  2024-01-08  8:25     ` Yunsheng Lin
@ 2024-01-08 16:13       ` Alexander Duyck
  0 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2024-01-08 16:13 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
	Jason Wang, Andrew Morton, Eric Dumazet, kvm, virtualization,
	linux-mm

On Mon, Jan 8, 2024 at 12:25 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/1/5 23:35, Alexander H Duyck wrote:
> > On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
> >> Currently there seem to be three page frag implementations
> >> which all try to allocate an order 3 page; if that fails,
> >> they then fall back to allocating an order 0 page, and each
> >> of them allows the order 3 allocation to fail under certain
> >> conditions by using specific gfp bits.
> >>
> >> The gfp bits for order 3 page allocation differ between the
> >> implementations: __GFP_NOMEMALLOC is or'd to forbid access
> >> to emergency reserve memory for __page_frag_cache_refill(),
> >> but it is not or'd in the other implementations, and
> >> __GFP_DIRECT_RECLAIM is masked off to avoid direct reclaim
> >> in skb_page_frag_refill(), but it is not masked off in
> >> __page_frag_cache_refill().
> >>
> >> This patch unifies the gfp bits used between the different
> >> implementations by or'ing __GFP_NOMEMALLOC and masking off
> >> __GFP_DIRECT_RECLAIM for order 3 page allocation to avoid
> >> possible pressure on mm.
> >>
> >> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >> CC: Alexander Duyck <alexander.duyck@gmail.com>
> >> ---
> >>  drivers/vhost/net.c | 2 +-
> >>  mm/page_alloc.c     | 4 ++--
> >>  net/core/sock.c     | 2 +-
> >>  3 files changed, 4 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >> index f2ed7167c848..e574e21cc0ca 100644
> >> --- a/drivers/vhost/net.c
> >> +++ b/drivers/vhost/net.c
> >> @@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
> >>              /* Avoid direct reclaim but allow kswapd to wake */
> >>              pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> >>                                        __GFP_COMP | __GFP_NOWARN |
> >> -                                      __GFP_NORETRY,
> >> +                                      __GFP_NORETRY | __GFP_NOMEMALLOC,
> >>                                        SKB_FRAG_PAGE_ORDER);
> >>              if (likely(pfrag->page)) {
> >>                      pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 9a16305cf985..1f0b36dd81b5 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -4693,8 +4693,8 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> >>      gfp_t gfp = gfp_mask;
> >>
> >>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> >> -    gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
> >> -                __GFP_NOMEMALLOC;
> >> +    gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> >> +               __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> >>      page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
> >>                              PAGE_FRAG_CACHE_MAX_ORDER);
> >>      nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
> >> diff --git a/net/core/sock.c b/net/core/sock.c
> >> index 446e945f736b..d643332c3ee5 100644
> >> --- a/net/core/sock.c
> >> +++ b/net/core/sock.c
> >> @@ -2900,7 +2900,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
> >>              /* Avoid direct reclaim but allow kswapd to wake */
> >>              pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> >>                                        __GFP_COMP | __GFP_NOWARN |
> >> -                                      __GFP_NORETRY,
> >> +                                      __GFP_NORETRY | __GFP_NOMEMALLOC,
> >>                                        SKB_FRAG_PAGE_ORDER);
> >>              if (likely(pfrag->page)) {
> >>                      pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> >
> > Looks fine to me.
> >
> > One thing you may want to consider would be to place this all in an
> > inline function that could just consolidate all the code.
>
> Do you think it is possible to further unify the implementations of
> 'struct page_frag_cache' and 'struct page_frag', so that adding an inline
> function for the above becomes unnecessary?

Actually skb_page_frag_refill() seems to function more similarly to how
the Intel drivers handle fragments. It basically slices off pieces until
it either runs out of them and allocates a new page, or finds the page
reference count has dropped to one and can reuse the page, without
pre-allocating the references.

However, with that said, many of the core bits are the same, so it might
be possible to look at unifying at least pieces of this. For example,
page_frag has the same first 3 members as page_frag_cache, so it might
be possible to look at refactoring things further to unify more of the
frag refill logic.
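
(For reference, the two layouts at the time of this thread, both defined in
include/linux/mm_types_task.h; comments trimmed:)

struct page_frag {
	struct page *page;
#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
	__u32 offset;
	__u32 size;
#else
	__u16 offset;
	__u16 size;
#endif
};

struct page_frag_cache {
	void *va;
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
	__u16 offset;
	__u16 size;
#else
	__u32 offset;
#endif
	/* avoids dirtying page->_refcount on every fragment allocation */
	unsigned int pagecnt_bias;
	bool pfmemalloc;
};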

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 3/6] mm/page_alloc: use initial zero offset for page_frag_alloc_align()
  2024-01-08  8:59     ` Yunsheng Lin
@ 2024-01-08 16:25       ` Alexander Duyck
  2024-01-09 11:22         ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander Duyck @ 2024-01-08 16:25 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Andrew Morton, linux-mm

On Mon, Jan 8, 2024 at 12:59 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/1/5 23:42, Alexander H Duyck wrote:
> > On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
> >> The next patch is about to use page_frag_alloc_align() to
> >> replace vhost_net_page_frag_refill(); the main difference
> >> between those two frag page implementations is whether we
> >> use an initial zero offset or not.
> >>
> >> It seems more natural to use an initial zero offset, as it
> >> may enable more correct cache prefetching and skb frag
> >> coalescing in the networking stack, so change it to use an
> >> initial zero offset.
> >>
> >> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >> CC: Alexander Duyck <alexander.duyck@gmail.com>
> >
> > There are several advantages to running the offset as a countdown
> > rather than count-up value.
> >
> > 1. Specifically for the size of the chunks we are allocating doing it
> > from the bottom up doesn't add any value as we are jumping in large
> > enough amounts and are being used for DMA so being sequential doesn't
> > add any value.
>
> What is the expected size of the above chunks in your mind? I suppose
> that is something like NAPI_HAS_SMALL_PAGE_FRAG, to avoid excessive
> truesize underestimation?
>
> It seems there is no lower limit on the allocation size for
> page_frag_alloc_align() now, and as the page_frag API seems to be used
> only in networking, should we enforce a minimum allocation size at the
> API level?

The primary use case for this is to allocate fragments to be used for
storing network data. We usually end up allocating a minimum of 1K in
most cases, as you end up having to reserve something like 512B for
headroom and the skb_shared_info in an skb. In addition, the slice
lengths become very hard to predict as these are usually used for
network traffic, so the size can be as small as 60B for a packet or as
large as 9KB.
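
(The ~1K minimum falls out of the usual skb buffer sizing arithmetic,
roughly as below; len stands for the largest payload the slice must hold:)

	/* headroom + payload, padded, plus the skb_shared_info that
	 * lives at the tail of the buffer
	 */
	unsigned int truesize = SKB_DATA_ALIGN(NET_SKB_PAD + len) +
				SKB_DATA_ALIGN(sizeof(struct skb_shared_info));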

> >
> > 2. By starting at the end and working toward zero we can use built in
> > functionality of the CPU to only have to check and see if our result
> > would be signed rather than having to load two registers with the
> > values and then compare them which saves us a few cycles. In addition
> > it saves us from having to read both the size and the offset for every
> > page.
>
> >> I suppose the above is ok if we only use the page_frag_alloc*() API to
> >> allocate memory for skb->data, not for the frags in skb_shinfo(), as by
> >> starting at the end and working toward zero we cannot do skb
> >> coalescing.
> >>
> >> As page_frag_alloc*() returns a va now, I am assuming most users are
> >> using the API for skb->data, so I guess it is ok to drop this patch for
> >> now.
> >>
> >> If we allow page_frag_alloc*() to return a struct page, we might need
> >> this patch to enable coalescing.

I would argue this is not the interface for enabling coalescing. This
is one of the reasons why this is implemented the way it is. When you
are aligning fragments you aren't going to be able to coalesce the
frames anyway as the alignment would push the fragments apart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 3/6] mm/page_alloc: use initial zero offset for page_frag_alloc_align()
  2024-01-08 16:25       ` Alexander Duyck
@ 2024-01-09 11:22         ` Yunsheng Lin
  2024-01-09 15:37           ` Alexander Duyck
  0 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2024-01-09 11:22 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Andrew Morton, linux-mm

On 2024/1/9 0:25, Alexander Duyck wrote:
> On Mon, Jan 8, 2024 at 12:59 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:

...

> 
>>>
>>> 2. By starting at the end and working toward zero we can use built in
>>> functionality of the CPU to only have to check and see if our result
>>> would be signed rather than having to load two registers with the
>>> values and then compare them which saves us a few cycles. In addition
>>> it saves us from having to read both the size and the offset for every
>>> page.
>>
>> I suppose the above is ok if we only use the page_frag_alloc*() API to
>> allocate memory for skb->data, not for the frags in skb_shinfo(), as by
>> starting at the end and working toward zero we cannot do skb
>> coalescing.
>>
>> As page_frag_alloc*() returns a va now, I am assuming most users are
>> using the API for skb->data, so I guess it is ok to drop this patch for
>> now.
>>
>> If we allow page_frag_alloc*() to return a struct page, we might need
>> this patch to enable coalescing.
> 
> I would argue this is not the interface for enabling coalescing. This
> is one of the reasons why this is implemented the way it is. When you
> are aligning fragments you aren't going to be able to coalesce the
> frames anyway as the alignment would push the fragments apart.

It seems the alignment requirement is the same for the same user of a page_frag
instance, so the aligning does not seem to be a problem for coalescing?

> .
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 3/6] mm/page_alloc: use initial zero offset for page_frag_alloc_align()
  2024-01-09 11:22         ` Yunsheng Lin
@ 2024-01-09 15:37           ` Alexander Duyck
  2024-01-10  9:45             ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander Duyck @ 2024-01-09 15:37 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Andrew Morton, linux-mm

On Tue, Jan 9, 2024 at 3:22 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/1/9 0:25, Alexander Duyck wrote:
> > On Mon, Jan 8, 2024 at 12:59 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> ...
>
> >
> >>>
> >>> 2. By starting at the end and working toward zero we can use built in
> >>> functionality of the CPU to only have to check and see if our result
> >>> would be signed rather than having to load two registers with the
> >>> values and then compare them which saves us a few cycles. In addition
> >>> it saves us from having to read both the size and the offset for every
> >>> page.
> >>
> >> I suppose the above is ok if we only use the page_frag_alloc*() API to
> >> allocate memory for skb->data, not for the frags in skb_shinfo(), as by
> >> starting at the end and working toward zero we cannot do skb
> >> coalescing.
> >>
> >> As page_frag_alloc*() returns a va now, I am assuming most users are
> >> using the API for skb->data, so I guess it is ok to drop this patch for
> >> now.
> >>
> >> If we allow page_frag_alloc*() to return a struct page, we might need
> >> this patch to enable coalescing.
> >
> > I would argue this is not the interface for enabling coalescing. This
> > is one of the reasons why this is implemented the way it is. When you
> > are aligning fragments you aren't going to be able to coalesce the
> > frames anyway as the alignment would push the fragments apart.
>
> It seems the alignment requirement is the same for the same user of a page_frag
> instance, so the aligning does not seem to be a problem for coalescing?

I'm a bit confused as to what coalescing you are referring to. If you
can provide a link it would be useful.

The problem is page_frag is a very generic item and can be generated
from a regular page on NICs that can internally reuse the same page
instance for multiple buffers. So it is possible to coalesce page
frags, however it is very unlikely to be coalescing them in the case
of them being used for skb buffers since it would require aligned
payloads on the network in order to really make it work without
hardware intervention of some sort and on such devices they are likely
allocating entire pages instead of page frags for the buffers.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 3/6] mm/page_alloc: use initial zero offset for page_frag_alloc_align()
  2024-01-09 15:37           ` Alexander Duyck
@ 2024-01-10  9:45             ` Yunsheng Lin
  2024-01-10 16:21               ` Alexander Duyck
  0 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2024-01-10  9:45 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Andrew Morton, linux-mm

On 2024/1/9 23:37, Alexander Duyck wrote:
> On Tue, Jan 9, 2024 at 3:22 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2024/1/9 0:25, Alexander Duyck wrote:
>>> On Mon, Jan 8, 2024 at 12:59 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> ...
>>
>>>
>>>>>
>>>>> 2. By starting at the end and working toward zero we can use built in
>>>>> functionality of the CPU to only have to check and see if our result
>>>>> would be signed rather than having to load two registers with the
>>>>> values and then compare them which saves us a few cycles. In addition
>>>>> it saves us from having to read both the size and the offset for every
>>>>> page.
>>>>
>>>> I suppose the above is ok if we only use the page_frag_alloc*() API to
>>>> allocate memory for skb->data, not for the frags in skb_shinfo(), as by
>>>> starting at the end and working toward zero we cannot do skb
>>>> coalescing.
>>>>
>>>> As page_frag_alloc*() returns a va now, I am assuming most users are
>>>> using the API for skb->data, so I guess it is ok to drop this patch for
>>>> now.
>>>>
>>>> If we allow page_frag_alloc*() to return a struct page, we might need
>>>> this patch to enable coalescing.
>>>
>>> I would argue this is not the interface for enabling coalescing. This
>>> is one of the reasons why this is implemented the way it is. When you
>>> are aligning fragments you aren't going to be able to coalesce the
>>> frames anyway as the alignment would push the fragments apart.
>>
>> It seems the alignment requirement is the same for the same user of a page_frag
>> instance, so the aligning does not seem to be a problem for coalescing?
> 
> I'm a bit confused as to what coalescing you are referring to. If you
> can provide a link it would be useful.
> 
> The problem is page_frag is a very generic item and can be generated
> from a regular page on NICs that can internally reuse the same page
> instance for multiple buffers. So it is possible to coalesce page
> frags, however it is very unlikely to be coalescing them in the case
> of them being used for skb buffers since it would require aligned
> payloads on the network in order to really make it work without
> hardware intervention of some sort and on such devices they are likely
> allocating entire pages instead of page frags for the buffers.

The main use case in my mind is the page_frag used in the tx part of
the networking stack, if we are able to unify page_frag and
page_frag_cache in the future:
https://elixir.bootlin.com/linux/v6.7-rc8/source/net/ipv4/tcp.c#L1183

Do you think it makes sense to unify them using the unified struct below,
and provide APIs for returning 'page' and 'va' as page_pool does now?
It may mean we need to add one pointer to the new struct and are not
able to do some tricks for performance; I suppose that is ok, as there
are always some trade-offs between maintainability and evolvability?

struct page_frag {
	struct page *page;
	void *va;
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
	__u16 offset;
	__u16 size;
#else
	__u32 offset;
#endif
	/* we maintain a pagecount bias, so that we dont dirty cache line
	 * containing page->_refcount every time we allocate a fragment.
	 */
	unsigned int		pagecnt_bias;
	bool pfmemalloc;
};


Another use case that is not really related: hw may be configured with
a small BD buf size, e.g. 2K, while configured with a big mtu size or
having hw GRO enabled; for a 4K page size, that means we may be able to
cut the number of frags in half, as it is usually the case that two
consecutive BDs point to the same page. I implemented a POC in hns3 a
long time ago using the frag implementation in page_pool, and it did
show some obvious performance gain, but as priorities shifted, I have
not been able to continue that POC yet.

> 
> .
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 3/6] mm/page_alloc: use initial zero offset for page_frag_alloc_align()
  2024-01-10  9:45             ` Yunsheng Lin
@ 2024-01-10 16:21               ` Alexander Duyck
  2024-01-11 12:37                 ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander Duyck @ 2024-01-10 16:21 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Andrew Morton, linux-mm

On Wed, Jan 10, 2024 at 1:45 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/1/9 23:37, Alexander Duyck wrote:
> > On Tue, Jan 9, 2024 at 3:22 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> On 2024/1/9 0:25, Alexander Duyck wrote:
> >>> On Mon, Jan 8, 2024 at 12:59 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> ...
> >>
> >>>
> >>>>>
> >>>>> 2. By starting at the end and working toward zero we can use built in
> >>>>> functionality of the CPU to only have to check and see if our result
> >>>>> would be signed rather than having to load two registers with the
> >>>>> values and then compare them which saves us a few cycles. In addition
> >>>>> it saves us from having to read both the size and the offset for every
> >>>>> page.
> >>>>
> >>>> I suppose the above is ok if we only use the page_frag_alloc*() API to
> >>>> allocate memory for skb->data, not for the frags in skb_shinfo(), as by
> >>>> starting at the end and working toward zero we cannot do skb
> >>>> coalescing.
> >>>>
> >>>> As page_frag_alloc*() returns a va now, I am assuming most users are
> >>>> using the API for skb->data, so I guess it is ok to drop this patch for
> >>>> now.
> >>>>
> >>>> If we allow page_frag_alloc*() to return a struct page, we might need
> >>>> this patch to enable coalescing.
> >>>
> >>> I would argue this is not the interface for enabling coalescing. This
> >>> is one of the reasons why this is implemented the way it is. When you
> >>> are aligning fragments you aren't going to be able to coalesce the
> >>> frames anyway as the alignment would push the fragments apart.
> >>
> >> It seems the alignment requirement is the same for the same user of a page_frag
> >> instance, so the aligning does not seem to be a problem for coalescing?
> >
> > I'm a bit confused as to what coalescing you are referring to. If you
> > can provide a link it would be useful.
> >
> > The problem is page_frag is a very generic item and can be generated
> > from a regular page on NICs that can internally reuse the same page
> > instance for multiple buffers. So it is possible to coalesce page
> > frags, however it is very unlikely to be coalescing them in the case
> > of them being used for skb buffers since it would require aligned
> > payloads on the network in order to really make it work without
> > hardware intervention of some sort and on such devices they are likely
> > allocating entire pages instead of page frags for the buffers.
>
> The main use case in my mind is the page_frag used in the tx part of
> the networking stack, if we are able to unify page_frag and
> page_frag_cache in the future:
> https://elixir.bootlin.com/linux/v6.7-rc8/source/net/ipv4/tcp.c#L1183
>
> Do you think it makes sense to unify them using the unified struct below,
> and provide APIs for returning 'page' and 'va' as page_pool does now?

Short answer is no. The difference between the two is the use case,
and combining page and va in the same struct just ends up generating
indirect data duplication. So one step would be to look at what we
could do to either convert page to va or va to page without taking a
significant penalty in either the page_frag or page_frag_cache use case.
I had opted for using the virtual address as the Rx path has a strong
need for accessing the memory as soon as it is written to begin basic
parsing tasks and the like. In addition it is usually cheaper to go
from a virtual to a page rather than the other way around.
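
(Concretely, the duplication being described: either representation can be
derived from the other, using the two structs' respective fields, so storing
both would be redundant:)

	/* va -> page: the direction he notes is usually cheaper, which
	 * is why the cache stores va and derives the page when needed
	 */
	struct page *page = virt_to_page(nc->va);

	/* page + offset -> va, via the kernel's linear mapping */
	void *va = page_address(pfrag->page) + pfrag->offset;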

As for the rest of the fields we have essentially 2 main use cases.
The first is the Rx path which usually implies DMA and not knowing
what size of the incoming frame is and the need to have allocation
succeed to avoid jamming up a device. So basically it is always doing
something like allocating 2K although it may only receive somewhere
between 60B to 1514B, and it is always allocating from reserves. For
something like Tx keeping the pagecnt_bias and pfmemalloc values
doesn't make much sense as neither is really needed for the Tx use
case. Instead they can just make calls to get_page() as they slice off
pieces of the page, and they have the option of failing if they cannot
get enough memory to put the skb together.

> It may mean we need to add one pointer to the new struct and are not
> able to do some tricks for performance; I suppose that is ok, as there
> are always some trade-offs between maintainability and evolvability?
>
> struct page_frag {
>         struct page *page;
>         void *va;
> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>         __u16 offset;
>         __u16 size;
> #else
>         __u32 offset;
> #endif
>         /* we maintain a pagecount bias, so that we dont dirty cache line
>          * containing page->_refcount every time we allocate a fragment.
>          */
>         unsigned int            pagecnt_bias;
>         bool pfmemalloc;
> };

My general thought was to instead just make the page_frag a member of
the page_frag_cache, since those two blocks would be the same. Then we
could see how we evolve things from there. By doing that, there should
essentially be no code change.
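
(A rough rendering of that idea -- illustrative only, not part of the posted
series:)

/* Let the cache embed a page_frag so the common { pointer, offset, size }
 * block is shared between the two, leaving only the cache-specific state.
 */
struct page_frag_cache {
	struct page_frag pf;
	unsigned int pagecnt_bias;
	bool pfmemalloc;
};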

> Another use case that is not really related: hw may be configured with
> a small BD buf size, e.g. 2K, while configured with a big mtu size or
> having hw GRO enabled; for a 4K page size, that means we may be able to
> cut the number of frags in half, as it is usually the case that two
> consecutive BDs point to the same page. I implemented a POC in hns3 a
> long time ago using the frag implementation in page_pool, and it did
> show some obvious performance gain, but as priorities shifted, I have
> not been able to continue that POC yet.

The main issue for that use case is that in order to make it work you
are having to usually copybreak the headers out of the page frags as
you won't be able to save the space for the skb tailroom. Either that
or you are using header/data split and in that case the data portion
should really be written to full pages instead of page fragments
anyway and be making use of page pool instead.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next 3/6] mm/page_alloc: use initial zero offset for page_frag_alloc_align()
  2024-01-10 16:21               ` Alexander Duyck
@ 2024-01-11 12:37                 ` Yunsheng Lin
  0 siblings, 0 replies; 31+ messages in thread
From: Yunsheng Lin @ 2024-01-11 12:37 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Andrew Morton, linux-mm

On 2024/1/11 0:21, Alexander Duyck wrote:

...

>>
>> The main use case in my mind is the page_frag used in the tx part of
>> the networking stack, if we are able to unify page_frag and
>> page_frag_cache in the future:
>> https://elixir.bootlin.com/linux/v6.7-rc8/source/net/ipv4/tcp.c#L1183
>>
>> Do you think it makes sense to unify them using the unified struct below,
>> and provide APIs for returning 'page' and 'va' as page_pool does now?
> 
> Short answer is no. The difference between the two is the use case,
> and combining page and va in the same struct just ends up generating
> indirect data duplication. So one step would be to look at what we
> could do to either convert page to va or va to page without taking a
> significant penalty in either the page_frag or page_frag_cache use case.

I think we might do something like page_pool does, using some unused
fields in 'struct page' as the metadata of page_frag/page_frag_cache,
and reduce page_frag or page_frag_cache to a single pointer to
'struct page'?

I looked at the fields used by page_pool in 'struct page'; it seems they
are enough for the page_frag case too.

> I had opted for using the virtual address as the Rx path has a strong
> need for accessing the memory as soon as it is written to begin basic
> parsing tasks and the like. In addition it is usually cheaper to go
> from a virtual to a page rather than the other way around.

Is there a reason why it is usually cheaper to go from a virtual
address to a page rather than the other way around? I looked at the
implementations of both, but have not figured out why yet.

> 
> As for the rest of the fields we have essentially 2 main use cases.
> The first is the Rx path which usually implies DMA and not knowing
> what size of the incoming frame is and the need to have allocation
> succeed to avoid jamming up a device. So basically it is always doing
> something like allocating 2K although it may only receive somewhere
> between 60B to 1514B, and it is always allocating from reserves. For
> something like Tx keeping the pagecnt_bias and pfmemalloc values

I am not so sure I understand why it is ok to keep reusing a pfmemalloc
page for Tx yet?

I am assuming pfmemalloc is not about a specific page, but about the
state of the mm system when a page is allocated under memory pressure?
pfmemalloc is used to drop rx packets that do not help reduce the
memory pressure? And tx does not need to handle the pfmemalloc case?

> doesn't make much sense as neither is really needed for the Tx use
> case. Instead they can just make calls to page_get as they slice off

I think for small packets, the bias may help avoid some atomic
operations and some cache bouncing?
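
(The bias trick being referenced, simplified from what mm/page_alloc.c does;
BATCH is a stand-in for the actual constant used there:)

	/* at refill time, take a batch of references up front... */
	page_ref_add(page, BATCH);
	nc->pagecnt_bias = BATCH + 1;

	/* ...then each fragment handed out costs only a local decrement,
	 * not an atomic update of page->_refcount
	 */
	nc->pagecnt_bias--;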

> pieces of the page, and they have the option of failing if they cannot
> get enough memory to put the skb together.
> 
>> It may mean we need to add one pointer to the new struct and are not
>> able to do some tricks for performance; I suppose that is ok, as there
>> are always some trade-offs between maintainability and evolvability?
>>
>> struct page_frag {
>>         struct page *page;
>>         void *va;
>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>         __u16 offset;
>>         __u16 size;
>> #else
>>         __u32 offset;
>> #endif
>>         /* we maintain a pagecount bias, so that we dont dirty cache line
>>          * containing page->_refcount every time we allocate a fragment.
>>          */
>>         unsigned int            pagecnt_bias;
>>         bool pfmemalloc;
>> };
> 
> My general thought was to instead just make the page_frag a member of
> the page_frag_cache since those two blocks would be the same. Then we
> could see how we evolve things from there. By doing that there should
> essentially be no code change.

Yes, that is a possible way to evolve things. But I would prefer to use
the unused fields in 'struct page' for now, WDYT?

> 
>> Another use case that is not really related: hw may be configured with
>> a small BD buf size, e.g. 2K, while configured with a big mtu size or
>> having hw GRO enabled; for a 4K page size, that means we may be able to
>> cut the number of frags in half, as it is usually the case that two
>> consecutive BDs point to the same page. I implemented a POC in hns3 a
>> long time ago using the frag implementation in page_pool, and it did
>> show some obvious performance gain, but as priorities shifted, I have
>> not been able to continue that POC yet.
> 
> The main issue for that use case is that in order to make it work you
> are having to usually copybreak the headers out of the page frags as
> you won't be able to save the space for the skb tailroom. Either that
> or you are using header/data split and in that case the data portion
> should really be written to full pages instead of page fragments

I am not sure how hw implements header/data split yet. Is the hw able
to disable header/data split for small packets and enable it for
medium/big packets on the same queue, maybe spanning the data part over
multiple BDs for big packets, so that we have the least memory usage
and performance penalty for small/medium/big packets received on the
same queue?

> anyway and be making use of page pool instead.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH net-next 0/6] remove page frag implementation in vhost_net
@ 2023-12-05 11:34 ` Yunsheng Lin
  0 siblings, 0 replies; 31+ messages in thread
From: Yunsheng Lin @ 2023-12-05 11:34 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Matthias Brugger, AngeloGioacchino Del Regno, bpf,
	linux-arm-kernel, linux-mediatek

Currently there are three implementations for page frag:

1. mm/page_alloc.c: net stack seems to be using it in the
   rx part with 'struct page_frag_cache' and the main API
   being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
   tx part with 'struct page_frag' and the main API being
   skb_page_frag_refill().
3. drivers/vhost/net.c: vhost seems to be using it to build
   xdp frames, and its implementation seems to be a mix of
   the above two.

This patchset tries to unify the page frag implementation a
little bit by unifying the gfp bits for order 3 page allocation
and replacing the page frag implementation in vhost.c with the
one in page_alloc.c.

After this patchset, we are not only able to unify the page
frag implementation a little, but also to get about a 0.5%
performance boost when testing with the vhost_net_test
introduced in the last patch.

Before this patchset:
Performance counter stats for './vhost_net_test' (10 runs):

     305325.78 msec task-clock                       #    1.738 CPUs utilized               ( +-  0.12% )
       1048668      context-switches                 #    3.435 K/sec                       ( +-  0.00% )
            11      cpu-migrations                   #    0.036 /sec                        ( +- 17.64% )
            33      page-faults                      #    0.108 /sec                        ( +-  0.49% )
  244651819491      cycles                           #    0.801 GHz                         ( +-  0.43% )  (64)
   64714638024      stalled-cycles-frontend          #   26.45% frontend cycles idle        ( +-  2.19% )  (67)
   30774313491      stalled-cycles-backend           #   12.58% backend cycles idle         ( +-  7.68% )  (70)
  201749748680      instructions                     #    0.82  insn per cycle
                                              #    0.32  stalled cycles per insn     ( +-  0.41% )  (66.76%)
   65494787909      branches                         #  214.508 M/sec                       ( +-  0.35% )  (64)
    4284111313      branch-misses                    #    6.54% of all branches             ( +-  0.45% )  (66)

       175.699 +- 0.189 seconds time elapsed  ( +-  0.11% )


After this patchset:
Performance counter stats for './vhost_net_test' (10 runs):

     303974.38 msec task-clock                       #    1.739 CPUs utilized               ( +-  0.14% )
       1048807      context-switches                 #    3.450 K/sec                       ( +-  0.00% )
            14      cpu-migrations                   #    0.046 /sec                        ( +- 12.86% )
            33      page-faults                      #    0.109 /sec                        ( +-  0.46% )
  251289376347      cycles                           #    0.827 GHz                         ( +-  0.32% )  (60)
   67885175415      stalled-cycles-frontend          #   27.01% frontend cycles idle        ( +-  0.48% )  (63)
   27809282600      stalled-cycles-backend           #   11.07% backend cycles idle         ( +-  0.36% )  (71)
  195543234672      instructions                     #    0.78  insn per cycle
                                              #    0.35  stalled cycles per insn     ( +-  0.29% )  (69.04%)
   62423183552      branches                         #  205.357 M/sec                       ( +-  0.48% )  (67)
    4135666632      branch-misses                    #    6.63% of all branches             ( +-  0.63% )  (67)

       174.764 +- 0.214 seconds time elapsed  ( +-  0.12% )

Changelog:
V1: Fix some typo, drop RFC tag and rebase on latest net-next.

Yunsheng Lin (6):
  mm/page_alloc: modify page_frag_alloc_align() to accept align as an
    argument
  page_frag: unify gfp bit for order 3 page allocation
  mm/page_alloc: use initial zero offset for page_frag_alloc_align()
  vhost/net: remove vhost_net_page_frag_refill()
  net: introduce page_frag_cache_drain()
  tools: virtio: introduce vhost_net_test

 drivers/net/ethernet/google/gve/gve_main.c |  11 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.c |  17 +-
 drivers/nvme/host/tcp.c                    |   7 +-
 drivers/nvme/target/tcp.c                  |   4 +-
 drivers/vhost/net.c                        |  91 ++---
 include/linux/gfp.h                        |   6 +-
 include/linux/skbuff.h                     |  22 +-
 mm/page_alloc.c                            |  48 ++-
 net/core/skbuff.c                          |  14 +-
 net/core/sock.c                            |   2 +-
 tools/virtio/Makefile                      |   8 +-
 tools/virtio/vhost_net_test.c              | 441 +++++++++++++++++++++
 12 files changed, 524 insertions(+), 147 deletions(-)
 create mode 100644 tools/virtio/vhost_net_test.c

-- 
2.33.0


^ permalink raw reply	[flat|nested] 31+ messages in thread


end of thread, other threads:[~2024-01-11 12:37 UTC | newest]

Thread overview: 31+ messages
2024-01-03  9:56 [PATCH net-next 0/6] remove page frag implementation in vhost_net Yunsheng Lin
2024-01-03  9:56 ` Yunsheng Lin
2024-01-03  9:56 ` [PATCH net-next 1/6] mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument Yunsheng Lin
2024-01-05 15:28   ` Alexander H Duyck
2024-01-08  8:20     ` Yunsheng Lin
2024-01-03  9:56 ` [PATCH net-next 2/6] page_frag: unify gfp bits for order 3 page allocation Yunsheng Lin
2024-01-05 15:35   ` Alexander H Duyck
2024-01-08  8:25     ` Yunsheng Lin
2024-01-08 16:13       ` Alexander Duyck
2024-01-03  9:56 ` [PATCH net-next 3/6] mm/page_alloc: use initial zero offset for page_frag_alloc_align() Yunsheng Lin
2024-01-05 15:42   ` Alexander H Duyck
2024-01-08  8:59     ` Yunsheng Lin
2024-01-08 16:25       ` Alexander Duyck
2024-01-09 11:22         ` Yunsheng Lin
2024-01-09 15:37           ` Alexander Duyck
2024-01-10  9:45             ` Yunsheng Lin
2024-01-10 16:21               ` Alexander Duyck
2024-01-11 12:37                 ` Yunsheng Lin
2024-01-03  9:56 ` [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill() Yunsheng Lin
2024-01-05 16:06   ` Alexander H Duyck
2024-01-08  9:06     ` Yunsheng Lin
2024-01-08 15:45       ` Alexander Duyck
2024-01-03  9:56 ` [PATCH net-next 5/6] net: introduce page_frag_cache_drain() Yunsheng Lin
2024-01-03  9:56   ` Yunsheng Lin
2024-01-05 15:48   ` Alexander H Duyck
2024-01-05 15:48     ` Alexander H Duyck
2024-01-03  9:56 ` [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test Yunsheng Lin
2024-01-04 16:17   ` Eugenio Perez Martin
2024-01-05  2:09     ` Yunsheng Lin
  -- strict thread matches above, loose matches on Subject: below --
2023-12-05 11:34 [PATCH net-next 0/6] remove page frag implementation in vhost_net Yunsheng Lin
2023-12-05 11:34 ` Yunsheng Lin
