linux-kernel.vger.kernel.org archive mirror
* [PATCH net-next v4 0/5] remove page frag implementation in vhost_net
@ 2024-01-30 11:37 Yunsheng Lin
  2024-01-30 11:37 ` [PATCH net-next v4 1/5] mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument Yunsheng Lin
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Yunsheng Lin @ 2024-01-30 11:37 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Matthias Brugger,
	AngeloGioacchino Del Regno, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, linux-arm-kernel,
	linux-mediatek, bpf

Currently there are three implementations for page frag:

1. mm/page_alloc.c: net stack seems to be using it in the
   rx part with 'struct page_frag_cache' and the main API
   being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
   tx part with 'struct page_frag' and the main API being
   skb_page_frag_refill().
3. drivers/vhost/net.c: vhost seems to be using it to build
   xdp frames, and its implementation seems to be a mix of
   the above two.

This patchset tries to unify the page frag implementation a
little bit by unifying the gfp bits for order 3 page allocation
and replacing the page frag implementation in vhost.c with the
one in page_alloc.c.

After this patchset, we not only unify the page frag
implementation a little, but also gain about a 0.5% performance
boost when testing with the vhost_net_test introduced in the
last patch.

Before this patchset:
Performance counter stats for './vhost_net_test' (10 runs):

     305325.78 msec task-clock                       #    1.738 CPUs utilized               ( +-  0.12% )
       1048668      context-switches                 #    3.435 K/sec                       ( +-  0.00% )
            11      cpu-migrations                   #    0.036 /sec                        ( +- 17.64% )
            33      page-faults                      #    0.108 /sec                        ( +-  0.49% )
  244651819491      cycles                           #    0.801 GHz                         ( +-  0.43% )  (64)
   64714638024      stalled-cycles-frontend          #   26.45% frontend cycles idle        ( +-  2.19% )  (67)
   30774313491      stalled-cycles-backend           #   12.58% backend cycles idle         ( +-  7.68% )  (70)
  201749748680      instructions                     #    0.82  insn per cycle
                                              #    0.32  stalled cycles per insn     ( +-  0.41% )  (66.76%)
   65494787909      branches                         #  214.508 M/sec                       ( +-  0.35% )  (64)
    4284111313      branch-misses                    #    6.54% of all branches             ( +-  0.45% )  (66)

       175.699 +- 0.189 seconds time elapsed  ( +-  0.11% )


After this patchset:
Performance counter stats for './vhost_net_test' (10 runs):

     303974.38 msec task-clock                       #    1.739 CPUs utilized               ( +-  0.14% )
       1048807      context-switches                 #    3.450 K/sec                       ( +-  0.00% )
            14      cpu-migrations                   #    0.046 /sec                        ( +- 12.86% )
            33      page-faults                      #    0.109 /sec                        ( +-  0.46% )
  251289376347      cycles                           #    0.827 GHz                         ( +-  0.32% )  (60)
   67885175415      stalled-cycles-frontend          #   27.01% frontend cycles idle        ( +-  0.48% )  (63)
   27809282600      stalled-cycles-backend           #   11.07% backend cycles idle         ( +-  0.36% )  (71)
  195543234672      instructions                     #    0.78  insn per cycle
                                              #    0.35  stalled cycles per insn     ( +-  0.29% )  (69.04%)
   62423183552      branches                         #  205.357 M/sec                       ( +-  0.48% )  (67)
    4135666632      branch-misses                    #    6.63% of all branches             ( +-  0.63% )  (67)

       174.764 +- 0.214 seconds time elapsed  ( +-  0.12% )

Changelog:
V4: resend based on latest net-next branch.

V3:
1. Add __page_frag_alloc_align() which is passed with the align mask
   the original function expected as suggested by Alexander.
2. Drop patch 3 in v2 suggested by Alexander.
3. Reorder patch 4 & 5 in v2 suggested by Alexander.

Note that placing the gfp flags handling for the order 3 page in an
inline function is not considered, as we may be able to unify the
page_frag and page_frag_cache handling.

V2: Change 'xor'd' to 'masked off', add vhost tx testing for
    vhost_net_test.

V1: Fix some typos, drop RFC tag and rebase on latest net-next.

Yunsheng Lin (5):
  mm/page_alloc: modify page_frag_alloc_align() to accept align as an
    argument
  page_frag: unify gfp bits for order 3 page allocation
  net: introduce page_frag_cache_drain()
  vhost/net: remove vhost_net_page_frag_refill()
  tools: virtio: introduce vhost_net_test

 drivers/net/ethernet/google/gve/gve_main.c |  11 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.c |  17 +-
 drivers/nvme/host/tcp.c                    |   7 +-
 drivers/nvme/target/tcp.c                  |   4 +-
 drivers/vhost/net.c                        |  91 +---
 include/linux/gfp.h                        |  16 +-
 mm/page_alloc.c                            |  22 +-
 net/core/skbuff.c                          |   6 +-
 net/core/sock.c                            |   2 +-
 tools/virtio/.gitignore                    |   1 +
 tools/virtio/Makefile                      |   8 +-
 tools/virtio/vhost_net_test.c              | 576 +++++++++++++++++++++
 12 files changed, 647 insertions(+), 114 deletions(-)
 create mode 100644 tools/virtio/vhost_net_test.c

-- 
2.33.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH net-next v4 1/5] mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument
  2024-01-30 11:37 [PATCH net-next v4 0/5] remove page frag implementation in vhost_net Yunsheng Lin
@ 2024-01-30 11:37 ` Yunsheng Lin
  2024-01-30 11:37 ` [PATCH net-next v4 2/5] page_frag: unify gfp bits for order 3 page allocation Yunsheng Lin
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Yunsheng Lin @ 2024-01-30 11:37 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
	Andrew Morton, Eric Dumazet, linux-mm

napi_alloc_frag_align() and netdev_alloc_frag_align() accept
align as an argument, and they are thin wrappers around the
__napi_alloc_frag_align() and __netdev_alloc_frag_align() APIs,
doing the alignment checking and align mask conversion before
calling page_frag_alloc_align() directly. The intention here is
to keep the alignment checking and the align mask conversion in
an inline wrapper, so that those operations are handled at
compile time rather than at execution time where possible.

We are going to use page_frag_alloc_align() in vhost_net.c; it
needs the same kind of alignment checking and align mask
conversion. So split page_frag_alloc_align() into an inline
wrapper doing the above operations, and add
__page_frag_alloc_align() which is passed the align mask the
original function expected, as suggested by Alexander.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
---
 include/linux/gfp.h | 15 +++++++++++----
 mm/page_alloc.c     |  8 ++++----
 net/core/skbuff.c   |  6 +++---
 3 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index de292a007138..28aea17fa59b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -312,14 +312,21 @@ extern void free_pages(unsigned long addr, unsigned int order);
 
 struct page_frag_cache;
 extern void __page_frag_cache_drain(struct page *page, unsigned int count);
-extern void *page_frag_alloc_align(struct page_frag_cache *nc,
-				   unsigned int fragsz, gfp_t gfp_mask,
-				   unsigned int align_mask);
+void *__page_frag_alloc_align(struct page_frag_cache *nc, unsigned int fragsz,
+			      gfp_t gfp_mask, unsigned int align_mask);
+
+static inline void *page_frag_alloc_align(struct page_frag_cache *nc,
+					  unsigned int fragsz, gfp_t gfp_mask,
+					  unsigned int align)
+{
+	WARN_ON_ONCE(!is_power_of_2(align));
+	return __page_frag_alloc_align(nc, fragsz, gfp_mask, -align);
+}
 
 static inline void *page_frag_alloc(struct page_frag_cache *nc,
 			     unsigned int fragsz, gfp_t gfp_mask)
 {
-	return page_frag_alloc_align(nc, fragsz, gfp_mask, ~0u);
+	return __page_frag_alloc_align(nc, fragsz, gfp_mask, ~0u);
 }
 
 extern void page_frag_free(void *addr);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 150d4f23b010..c0f7e67c4250 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4708,9 +4708,9 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
 }
 EXPORT_SYMBOL(__page_frag_cache_drain);
 
-void *page_frag_alloc_align(struct page_frag_cache *nc,
-		      unsigned int fragsz, gfp_t gfp_mask,
-		      unsigned int align_mask)
+void *__page_frag_alloc_align(struct page_frag_cache *nc,
+			      unsigned int fragsz, gfp_t gfp_mask,
+			      unsigned int align_mask)
 {
 	unsigned int size = PAGE_SIZE;
 	struct page *page;
@@ -4779,7 +4779,7 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 
 	return nc->va + offset;
 }
-EXPORT_SYMBOL(page_frag_alloc_align);
+EXPORT_SYMBOL(__page_frag_alloc_align);
 
 /*
  * Frees a page fragment allocated out of either a compound or order 0 page.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index edbbef563d4d..bc8f3858bc1c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -297,7 +297,7 @@ void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
 
 	fragsz = SKB_DATA_ALIGN(fragsz);
 
-	return page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
+	return __page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
 }
 EXPORT_SYMBOL(__napi_alloc_frag_align);
 
@@ -309,13 +309,13 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
 	if (in_hardirq() || irqs_disabled()) {
 		struct page_frag_cache *nc = this_cpu_ptr(&netdev_alloc_cache);
 
-		data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask);
+		data = __page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask);
 	} else {
 		struct napi_alloc_cache *nc;
 
 		local_bh_disable();
 		nc = this_cpu_ptr(&napi_alloc_cache);
-		data = page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
+		data = __page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
 		local_bh_enable();
 	}
 	return data;
-- 
2.33.0



* [PATCH net-next v4 2/5] page_frag: unify gfp bits for order 3 page allocation
  2024-01-30 11:37 [PATCH net-next v4 0/5] remove page frag implementation in vhost_net Yunsheng Lin
  2024-01-30 11:37 ` [PATCH net-next v4 1/5] mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument Yunsheng Lin
@ 2024-01-30 11:37 ` Yunsheng Lin
  2024-02-01 13:16   ` Paolo Abeni
  2024-01-30 11:37 ` [PATCH net-next v4 3/5] net: introduce page_frag_cache_drain() Yunsheng Lin
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Yunsheng Lin @ 2024-01-30 11:37 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
	Alexander Duyck, Michael S. Tsirkin, Jason Wang, Andrew Morton,
	Eric Dumazet, kvm, virtualization, linux-mm

Currently there seem to be three page frag implementations,
which all try to allocate an order 3 page; if that fails, they
fall back to allocating an order 0 page. Each of them allows
the order 3 page allocation to fail under certain conditions
by using specific gfp bits.

The gfp bits for order 3 page allocation differ between the
implementations: __GFP_NOMEMALLOC is or'd to forbid access to
the emergency reserve memory in __page_frag_cache_refill(),
but it is not or'd in the other implementations;
__GFP_DIRECT_RECLAIM is masked off to avoid direct reclaim in
skb_page_frag_refill(), but it is not masked off in
__page_frag_cache_refill().

This patch unifies the gfp bits used between the different
implementations by or'ing __GFP_NOMEMALLOC and masking off
__GFP_DIRECT_RECLAIM for order 3 page allocation, to avoid
putting possible pressure on mm.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
---
 drivers/vhost/net.c | 2 +-
 mm/page_alloc.c     | 4 ++--
 net/core/sock.c     | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f2ed7167c848..e574e21cc0ca 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
 		/* Avoid direct reclaim but allow kswapd to wake */
 		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
 					  __GFP_COMP | __GFP_NOWARN |
-					  __GFP_NORETRY,
+					  __GFP_NORETRY | __GFP_NOMEMALLOC,
 					  SKB_FRAG_PAGE_ORDER);
 		if (likely(pfrag->page)) {
 			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0f7e67c4250..636145c29f70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4685,8 +4685,8 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 	gfp_t gfp = gfp_mask;
 
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
-		    __GFP_NOMEMALLOC;
+	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
+		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
 	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
 				PAGE_FRAG_CACHE_MAX_ORDER);
 	nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
diff --git a/net/core/sock.c b/net/core/sock.c
index 88bf810394a5..8289a3d8c375 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2919,7 +2919,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
 		/* Avoid direct reclaim but allow kswapd to wake */
 		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
 					  __GFP_COMP | __GFP_NOWARN |
-					  __GFP_NORETRY,
+					  __GFP_NORETRY | __GFP_NOMEMALLOC,
 					  SKB_FRAG_PAGE_ORDER);
 		if (likely(pfrag->page)) {
 			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
-- 
2.33.0



* [PATCH net-next v4 3/5] net: introduce page_frag_cache_drain()
  2024-01-30 11:37 [PATCH net-next v4 0/5] remove page frag implementation in vhost_net Yunsheng Lin
  2024-01-30 11:37 ` [PATCH net-next v4 1/5] mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument Yunsheng Lin
  2024-01-30 11:37 ` [PATCH net-next v4 2/5] page_frag: unify gfp bits for order 3 page allocation Yunsheng Lin
@ 2024-01-30 11:37 ` Yunsheng Lin
  2024-01-30 11:37 ` [PATCH net-next v4 4/5] vhost/net: remove vhost_net_page_frag_refill() Yunsheng Lin
  2024-01-30 11:37 ` [PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test Yunsheng Lin
  4 siblings, 0 replies; 15+ messages in thread
From: Yunsheng Lin @ 2024-01-30 11:37 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Jason Wang, Alexander Duyck,
	Jeroen de Borst, Praveen Kaligineedi, Shailend Chand,
	Eric Dumazet, Felix Fietkau, Sean Wang, Mark Lee,
	Lorenzo Bianconi, Matthias Brugger, AngeloGioacchino Del Regno,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Chaitanya Kulkarni, Andrew Morton, linux-arm-kernel,
	linux-mediatek, linux-nvme, linux-mm

When draining a page_frag_cache, most users are doing
similar steps, so introduce an API to avoid code
duplication.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
---
 drivers/net/ethernet/google/gve/gve_main.c | 11 ++---------
 drivers/net/ethernet/mediatek/mtk_wed_wo.c | 17 ++---------------
 drivers/nvme/host/tcp.c                    |  7 +------
 drivers/nvme/target/tcp.c                  |  4 +---
 include/linux/gfp.h                        |  1 +
 mm/page_alloc.c                            | 10 ++++++++++
 6 files changed, 17 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index db6d9ae7cd78..dec6458bb8d7 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -1276,17 +1276,10 @@ static void gve_unreg_xdp_info(struct gve_priv *priv)
 
 static void gve_drain_page_cache(struct gve_priv *priv)
 {
-	struct page_frag_cache *nc;
 	int i;
 
-	for (i = 0; i < priv->rx_cfg.num_queues; i++) {
-		nc = &priv->rx[i].page_cache;
-		if (nc->va) {
-			__page_frag_cache_drain(virt_to_page(nc->va),
-						nc->pagecnt_bias);
-			nc->va = NULL;
-		}
-	}
+	for (i = 0; i < priv->rx_cfg.num_queues; i++)
+		page_frag_cache_drain(&priv->rx[i].page_cache);
 }
 
 static void gve_qpls_get_curr_alloc_cfg(struct gve_priv *priv,
diff --git a/drivers/net/ethernet/mediatek/mtk_wed_wo.c b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
index d58b07e7e123..7063c78bd35f 100644
--- a/drivers/net/ethernet/mediatek/mtk_wed_wo.c
+++ b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
@@ -286,7 +286,6 @@ mtk_wed_wo_queue_free(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 static void
 mtk_wed_wo_queue_tx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 {
-	struct page *page;
 	int i;
 
 	for (i = 0; i < q->n_desc; i++) {
@@ -301,19 +300,12 @@ mtk_wed_wo_queue_tx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 		entry->buf = NULL;
 	}
 
-	if (!q->cache.va)
-		return;
-
-	page = virt_to_page(q->cache.va);
-	__page_frag_cache_drain(page, q->cache.pagecnt_bias);
-	memset(&q->cache, 0, sizeof(q->cache));
+	page_frag_cache_drain(&q->cache);
 }
 
 static void
 mtk_wed_wo_queue_rx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 {
-	struct page *page;
-
 	for (;;) {
 		void *buf = mtk_wed_wo_dequeue(wo, q, NULL, true);
 
@@ -323,12 +315,7 @@ mtk_wed_wo_queue_rx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 		skb_free_frag(buf);
 	}
 
-	if (!q->cache.va)
-		return;
-
-	page = virt_to_page(q->cache.va);
-	__page_frag_cache_drain(page, q->cache.pagecnt_bias);
-	memset(&q->cache, 0, sizeof(q->cache));
+	page_frag_cache_drain(&q->cache);
 }
 
 static void
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index d058d990532b..22e1fb9c9c0f 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1344,7 +1344,6 @@ static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
 
 static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
 {
-	struct page *page;
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
 	unsigned int noreclaim_flag;
@@ -1355,11 +1354,7 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
 	if (queue->hdr_digest || queue->data_digest)
 		nvme_tcp_free_crypto(queue);
 
-	if (queue->pf_cache.va) {
-		page = virt_to_head_page(queue->pf_cache.va);
-		__page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
-		queue->pf_cache.va = NULL;
-	}
+	page_frag_cache_drain(&queue->pf_cache);
 
 	noreclaim_flag = memalloc_noreclaim_save();
 	/* ->sock will be released by fput() */
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 6a1e6bb80062..56224dc59f17 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1591,7 +1591,6 @@ static void nvmet_tcp_free_cmd_data_in_buffers(struct nvmet_tcp_queue *queue)
 
 static void nvmet_tcp_release_queue_work(struct work_struct *w)
 {
-	struct page *page;
 	struct nvmet_tcp_queue *queue =
 		container_of(w, struct nvmet_tcp_queue, release_work);
 
@@ -1615,8 +1614,7 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
 	if (queue->hdr_digest || queue->data_digest)
 		nvmet_tcp_free_crypto(queue);
 	ida_free(&nvmet_tcp_queue_ida, queue->idx);
-	page = virt_to_head_page(queue->pf_cache.va);
-	__page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
+	page_frag_cache_drain(&queue->pf_cache);
 	kfree(queue);
 }
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 28aea17fa59b..6cef1c241180 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -311,6 +311,7 @@ extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
 
 struct page_frag_cache;
+void page_frag_cache_drain(struct page_frag_cache *nc);
 extern void __page_frag_cache_drain(struct page *page, unsigned int count);
 void *__page_frag_alloc_align(struct page_frag_cache *nc, unsigned int fragsz,
 			      gfp_t gfp_mask, unsigned int align_mask);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 636145c29f70..06aa1ebbd21c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4699,6 +4699,16 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 	return page;
 }
 
+void page_frag_cache_drain(struct page_frag_cache *nc)
+{
+	if (!nc->va)
+		return;
+
+	__page_frag_cache_drain(virt_to_head_page(nc->va), nc->pagecnt_bias);
+	nc->va = NULL;
+}
+EXPORT_SYMBOL(page_frag_cache_drain);
+
 void __page_frag_cache_drain(struct page *page, unsigned int count)
 {
 	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
-- 
2.33.0



* [PATCH net-next v4 4/5] vhost/net: remove vhost_net_page_frag_refill()
  2024-01-30 11:37 [PATCH net-next v4 0/5] remove page frag implementation in vhost_net Yunsheng Lin
                   ` (2 preceding siblings ...)
  2024-01-30 11:37 ` [PATCH net-next v4 3/5] net: introduce page_frag_cache_drain() Yunsheng Lin
@ 2024-01-30 11:37 ` Yunsheng Lin
  2024-01-30 11:37 ` [PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test Yunsheng Lin
  4 siblings, 0 replies; 15+ messages in thread
From: Yunsheng Lin @ 2024-01-30 11:37 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Jason Wang,
	Michael S. Tsirkin, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, kvm, virtualization, bpf

The page frag in vhost_net_page_frag_refill() uses the
'struct page_frag' from skb_page_frag_refill(), but its
implementation is now similar to page_frag_alloc_align().

This patch removes vhost_net_page_frag_refill() by using
'struct page_frag_cache' instead of 'struct page_frag',
and allocating frag using page_frag_alloc_align().

The added benefit is that we not only unify the page frag
implementation a little, but also gain about a 0.5% performance
boost when testing with the vhost_net_test introduced in the
last patch.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 91 ++++++++++++++-------------------------------
 1 file changed, 27 insertions(+), 64 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e574e21cc0ca..4b2fcb228a0a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -141,10 +141,8 @@ struct vhost_net {
 	unsigned tx_zcopy_err;
 	/* Flush in progress. Protected by tx vq lock. */
 	bool tx_flush;
-	/* Private page frag */
-	struct page_frag page_frag;
-	/* Refcount bias of page frag */
-	int refcnt_bias;
+	/* Private page frag cache */
+	struct page_frag_cache pf_cache;
 };
 
 static unsigned vhost_net_zcopy_mask __read_mostly;
@@ -655,41 +653,6 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
 	       !vhost_vq_avail_empty(vq->dev, vq);
 }
 
-static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
-				       struct page_frag *pfrag, gfp_t gfp)
-{
-	if (pfrag->page) {
-		if (pfrag->offset + sz <= pfrag->size)
-			return true;
-		__page_frag_cache_drain(pfrag->page, net->refcnt_bias);
-	}
-
-	pfrag->offset = 0;
-	net->refcnt_bias = 0;
-	if (SKB_FRAG_PAGE_ORDER) {
-		/* Avoid direct reclaim but allow kswapd to wake */
-		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
-					  __GFP_COMP | __GFP_NOWARN |
-					  __GFP_NORETRY | __GFP_NOMEMALLOC,
-					  SKB_FRAG_PAGE_ORDER);
-		if (likely(pfrag->page)) {
-			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
-			goto done;
-		}
-	}
-	pfrag->page = alloc_page(gfp);
-	if (likely(pfrag->page)) {
-		pfrag->size = PAGE_SIZE;
-		goto done;
-	}
-	return false;
-
-done:
-	net->refcnt_bias = USHRT_MAX;
-	page_ref_add(pfrag->page, USHRT_MAX - 1);
-	return true;
-}
-
 #define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
 
 static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
@@ -699,7 +662,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 	struct vhost_net *net = container_of(vq->dev, struct vhost_net,
 					     dev);
 	struct socket *sock = vhost_vq_get_backend(vq);
-	struct page_frag *alloc_frag = &net->page_frag;
 	struct virtio_net_hdr *gso;
 	struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
 	struct tun_xdp_hdr *hdr;
@@ -710,6 +672,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 	int sock_hlen = nvq->sock_hlen;
 	void *buf;
 	int copied;
+	int ret;
 
 	if (unlikely(len < nvq->sock_hlen))
 		return -EFAULT;
@@ -719,18 +682,17 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 		return -ENOSPC;
 
 	buflen += SKB_DATA_ALIGN(len + pad);
-	alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
-	if (unlikely(!vhost_net_page_frag_refill(net, buflen,
-						 alloc_frag, GFP_KERNEL)))
+	buf = page_frag_alloc_align(&net->pf_cache, buflen, GFP_KERNEL,
+				    SMP_CACHE_BYTES);
+	if (unlikely(!buf))
 		return -ENOMEM;
 
-	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
-	copied = copy_page_from_iter(alloc_frag->page,
-				     alloc_frag->offset +
-				     offsetof(struct tun_xdp_hdr, gso),
-				     sock_hlen, from);
-	if (copied != sock_hlen)
-		return -EFAULT;
+	copied = copy_from_iter(buf + offsetof(struct tun_xdp_hdr, gso),
+				sock_hlen, from);
+	if (copied != sock_hlen) {
+		ret = -EFAULT;
+		goto err;
+	}
 
 	hdr = buf;
 	gso = &hdr->gso;
@@ -743,27 +705,30 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 			       vhost16_to_cpu(vq, gso->csum_start) +
 			       vhost16_to_cpu(vq, gso->csum_offset) + 2);
 
-		if (vhost16_to_cpu(vq, gso->hdr_len) > len)
-			return -EINVAL;
+		if (vhost16_to_cpu(vq, gso->hdr_len) > len) {
+			ret = -EINVAL;
+			goto err;
+		}
 	}
 
 	len -= sock_hlen;
-	copied = copy_page_from_iter(alloc_frag->page,
-				     alloc_frag->offset + pad,
-				     len, from);
-	if (copied != len)
-		return -EFAULT;
+	copied = copy_from_iter(buf + pad, len, from);
+	if (copied != len) {
+		ret = -EFAULT;
+		goto err;
+	}
 
 	xdp_init_buff(xdp, buflen, NULL);
 	xdp_prepare_buff(xdp, buf, pad, len, true);
 	hdr->buflen = buflen;
 
-	--net->refcnt_bias;
-	alloc_frag->offset += buflen;
-
 	++nvq->batched_xdp;
 
 	return 0;
+
+err:
+	page_frag_free(buf);
+	return ret;
 }
 
 static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
@@ -1353,8 +1318,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 			vqs[VHOST_NET_VQ_RX]);
 
 	f->private_data = n;
-	n->page_frag.page = NULL;
-	n->refcnt_bias = 0;
+	n->pf_cache.va = NULL;
 
 	return 0;
 }
@@ -1422,8 +1386,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
 	kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
 	kfree(n->dev.vqs);
-	if (n->page_frag.page)
-		__page_frag_cache_drain(n->page_frag.page, n->refcnt_bias);
+	page_frag_cache_drain(&n->pf_cache);
 	kvfree(n);
 	return 0;
 }
-- 
2.33.0



* [PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test
  2024-01-30 11:37 [PATCH net-next v4 0/5] remove page frag implementation in vhost_net Yunsheng Lin
                   ` (3 preceding siblings ...)
  2024-01-30 11:37 ` [PATCH net-next v4 4/5] vhost/net: remove vhost_net_page_frag_refill() Yunsheng Lin
@ 2024-01-30 11:37 ` Yunsheng Lin
  2024-02-02  4:05   ` Jason Wang
  4 siblings, 1 reply; 15+ messages in thread
From: Yunsheng Lin @ 2024-01-30 11:37 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, virtualization

Introduce vhost_net_test, based on virtio_test, to test the
vhost_net changes in the kernel.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 tools/virtio/.gitignore       |   1 +
 tools/virtio/Makefile         |   8 +-
 tools/virtio/vhost_net_test.c | 576 ++++++++++++++++++++++++++++++++++
 3 files changed, 582 insertions(+), 3 deletions(-)
 create mode 100644 tools/virtio/vhost_net_test.c

diff --git a/tools/virtio/.gitignore b/tools/virtio/.gitignore
index 9934d48d9a55..7e47b281c442 100644
--- a/tools/virtio/.gitignore
+++ b/tools/virtio/.gitignore
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 *.d
 virtio_test
+vhost_net_test
 vringh_test
 virtio-trace/trace-agent
diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
index d128925980e0..e25e99c1c3b7 100644
--- a/tools/virtio/Makefile
+++ b/tools/virtio/Makefile
@@ -1,8 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0
 all: test mod
-test: virtio_test vringh_test
+test: virtio_test vringh_test vhost_net_test
 virtio_test: virtio_ring.o virtio_test.o
 vringh_test: vringh_test.o vringh.o virtio_ring.o
+vhost_net_test: virtio_ring.o vhost_net_test.o
 
 try-run = $(shell set -e;		\
 	if ($(1)) >/dev/null 2>&1;	\
@@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean
 
 .PHONY: all test mod clean vhost oot oot-clean oot-build
 clean:
-	${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \
-              vhost_test/Module.symvers vhost_test/modules.order *.d
+	${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \
+              vhost_test/.*.cmd vhost_test/Module.symvers \
+              vhost_test/modules.order *.d
 -include *.d
diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c
new file mode 100644
index 000000000000..e336792a0d77
--- /dev/null
+++ b/tools/virtio/vhost_net_test.c
@@ -0,0 +1,576 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <getopt.h>
+#include <limits.h>
+#include <string.h>
+#include <poll.h>
+#include <sys/eventfd.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <linux/virtio_types.h>
+#include <linux/vhost.h>
+#include <linux/virtio.h>
+#include <linux/virtio_ring.h>
+#include <linux/if.h>
+#include <linux/if_tun.h>
+#include <linux/in.h>
+#include <linux/if_packet.h>
+#include <netinet/ether.h>
+
+#define RANDOM_BATCH	-1
+#define HDR_LEN		12
+#define TEST_BUF_LEN	256
+#define TEST_PTYPE	ETH_P_LOOPBACK
+
+/* Used by implementation of kmalloc() in tools/virtio/linux/kernel.h */
+void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
+
+struct vq_info {
+	int kick;
+	int call;
+	int idx;
+	long started;
+	long completed;
+	struct pollfd fds;
+	void *ring;
+	/* copy used for control */
+	struct vring vring;
+	struct virtqueue *vq;
+};
+
+struct vdev_info {
+	struct virtio_device vdev;
+	int control;
+	struct vq_info vqs[2];
+	int nvqs;
+	void *buf;
+	size_t buf_size;
+	char *test_buf;
+	char *res_buf;
+	struct vhost_memory *mem;
+	int sock;
+	int ifindex;
+	unsigned char mac[ETHER_ADDR_LEN];
+};
+
+static int tun_alloc(struct vdev_info *dev)
+{
+	struct ifreq ifr;
+	int len = HDR_LEN;
+	int fd, e;
+
+	fd = open("/dev/net/tun", O_RDWR);
+	if (fd < 0) {
+		perror("Cannot open /dev/net/tun");
+		return fd;
+	}
+
+	memset(&ifr, 0, sizeof(ifr));
+
+	ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
+	snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
+
+	e = ioctl(fd, TUNSETIFF, &ifr);
+	if (e < 0) {
+		perror("ioctl[TUNSETIFF]");
+		close(fd);
+		return e;
+	}
+
+	e = ioctl(fd, TUNSETVNETHDRSZ, &len);
+	if (e < 0) {
+		perror("ioctl[TUNSETVNETHDRSZ]");
+		close(fd);
+		return e;
+	}
+
+	e = ioctl(fd, SIOCGIFHWADDR, &ifr);
+	if (e < 0) {
+		perror("ioctl[SIOCGIFHWADDR]");
+		close(fd);
+		return e;
+	}
+
+	memcpy(dev->mac, &ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
+	return fd;
+}
+
+static void vdev_create_socket(struct vdev_info *dev)
+{
+	struct ifreq ifr;
+
+	dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE));
+	assert(dev->sock != -1);
+
+	snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
+	assert(ioctl(dev->sock, SIOCGIFINDEX, &ifr) >= 0);
+
+	dev->ifindex = ifr.ifr_ifindex;
+
+	/* Set the flags that bring the device up */
+	assert(ioctl(dev->sock, SIOCGIFFLAGS, &ifr) >= 0);
+	ifr.ifr_flags |= (IFF_UP | IFF_RUNNING);
+	assert(ioctl(dev->sock, SIOCSIFFLAGS, &ifr) >= 0);
+}
+
+static void vdev_send_packet(struct vdev_info *dev)
+{
+	char *sendbuf = dev->test_buf + HDR_LEN;
+	struct sockaddr_ll saddrll = {0};
+	int sockfd = dev->sock;
+	int ret;
+
+	saddrll.sll_family = PF_PACKET;
+	saddrll.sll_ifindex = dev->ifindex;
+	saddrll.sll_halen = ETH_ALEN;
+	saddrll.sll_protocol = htons(TEST_PTYPE);
+
+	ret = sendto(sockfd, sendbuf, TEST_BUF_LEN, 0,
+		     (struct sockaddr *)&saddrll,
+		     sizeof(struct sockaddr_ll));
+	assert(ret >= 0);
+}
+
+static bool vq_notify(struct virtqueue *vq)
+{
+	struct vq_info *info = vq->priv;
+	unsigned long long v = 1;
+	int r;
+
+	r = write(info->kick, &v, sizeof(v));
+	assert(r == sizeof(v));
+
+	return true;
+}
+
+static void vq_callback(struct virtqueue *vq)
+{
+}
+
+static void vhost_vq_setup(struct vdev_info *dev, struct vq_info *info)
+{
+	struct vhost_vring_addr addr = {
+		.index = info->idx,
+		.desc_user_addr = (uint64_t)(unsigned long)info->vring.desc,
+		.avail_user_addr = (uint64_t)(unsigned long)info->vring.avail,
+		.used_user_addr = (uint64_t)(unsigned long)info->vring.used,
+	};
+	struct vhost_vring_state state = { .index = info->idx };
+	struct vhost_vring_file file = { .index = info->idx };
+	int r;
+
+	state.num = info->vring.num;
+	r = ioctl(dev->control, VHOST_SET_VRING_NUM, &state);
+	assert(r >= 0);
+
+	state.num = 0;
+	r = ioctl(dev->control, VHOST_SET_VRING_BASE, &state);
+	assert(r >= 0);
+
+	r = ioctl(dev->control, VHOST_SET_VRING_ADDR, &addr);
+	assert(r >= 0);
+
+	file.fd = info->kick;
+	r = ioctl(dev->control, VHOST_SET_VRING_KICK, &file);
+	assert(r >= 0);
+
+	file.fd = info->call;
+	r = ioctl(dev->control, VHOST_SET_VRING_CALL, &file);
+	assert(r >= 0);
+}
+
+static void vq_reset(struct vq_info *info, int num, struct virtio_device *vdev)
+{
+	if (info->vq)
+		vring_del_virtqueue(info->vq);
+
+	memset(info->ring, 0, vring_size(num, 4096));
+	vring_init(&info->vring, num, info->ring, 4096);
+	info->vq = vring_new_virtqueue(info->idx, num, 4096, vdev, true, false,
+				       info->ring, vq_notify, vq_callback, "test");
+	assert(info->vq);
+	info->vq->priv = info;
+}
+
+static void vq_info_add(struct vdev_info *dev, int idx, int num, int fd)
+{
+	struct vhost_vring_file backend = { .index = idx, .fd = fd };
+	struct vq_info *info = &dev->vqs[idx];
+	int r;
+
+	info->idx = idx;
+	info->kick = eventfd(0, EFD_NONBLOCK);
+	info->call = eventfd(0, EFD_NONBLOCK);
+	r = posix_memalign(&info->ring, 4096, vring_size(num, 4096));
+	assert(r >= 0);
+	vq_reset(info, num, &dev->vdev);
+	vhost_vq_setup(dev, info);
+	info->fds.fd = info->call;
+	info->fds.events = POLLIN;
+
+	r = ioctl(dev->control, VHOST_NET_SET_BACKEND, &backend);
+	assert(!r);
+}
+
+static void vdev_info_init(struct vdev_info *dev, unsigned long long features)
+{
+	struct ether_header *eh;
+	int i, r;
+
+	dev->vdev.features = features;
+	INIT_LIST_HEAD(&dev->vdev.vqs);
+	spin_lock_init(&dev->vdev.vqs_list_lock);
+
+	dev->buf_size = (HDR_LEN + TEST_BUF_LEN) * 2;
+	dev->buf = malloc(dev->buf_size);
+	assert(dev->buf);
+	dev->test_buf = dev->buf;
+	dev->res_buf = dev->test_buf + HDR_LEN + TEST_BUF_LEN;
+
+	memset(dev->test_buf, 0, HDR_LEN + TEST_BUF_LEN);
+	eh = (struct ether_header *)(dev->test_buf + HDR_LEN);
+	eh->ether_type = htons(TEST_PTYPE);
+	memcpy(eh->ether_dhost, dev->mac, ETHER_ADDR_LEN);
+	memcpy(eh->ether_shost, dev->mac, ETHER_ADDR_LEN);
+
+	for (i = sizeof(*eh); i < TEST_BUF_LEN; i++)
+		dev->test_buf[i + HDR_LEN] = (char)i;
+
+	dev->control = open("/dev/vhost-net", O_RDWR);
+	assert(dev->control >= 0);
+
+	r = ioctl(dev->control, VHOST_SET_OWNER, NULL);
+	assert(r >= 0);
+
+	dev->mem = malloc(offsetof(struct vhost_memory, regions) +
+			  sizeof(dev->mem->regions[0]));
+	assert(dev->mem);
+	memset(dev->mem, 0, offsetof(struct vhost_memory, regions) +
+	       sizeof(dev->mem->regions[0]));
+	dev->mem->nregions = 1;
+	dev->mem->regions[0].guest_phys_addr = (long)dev->buf;
+	dev->mem->regions[0].userspace_addr = (long)dev->buf;
+	dev->mem->regions[0].memory_size = dev->buf_size;
+
+	r = ioctl(dev->control, VHOST_SET_MEM_TABLE, dev->mem);
+	assert(r >= 0);
+
+	r = ioctl(dev->control, VHOST_SET_FEATURES, &features);
+	assert(r >= 0);
+
+	dev->nvqs = 2;
+}
+
+static void wait_for_interrupt(struct vq_info *vq)
+{
+	unsigned long long val;
+
+	poll(&vq->fds, 1, -1);
+
+	if (vq->fds.revents & POLLIN)
+		read(vq->fds.fd, &val, sizeof(val));
+}
+
+static void verify_res_buf(char *res_buf)
+{
+	int i;
+
+	for (i = ETHER_HDR_LEN; i < TEST_BUF_LEN; i++)
+		assert(res_buf[i] == (char)i);
+}
+
+static void run_tx_test(struct vdev_info *dev, struct vq_info *vq,
+			bool delayed, int batch, int bufs)
+{
+	const bool random_batch = batch == RANDOM_BATCH;
+	long long spurious = 0;
+	struct scatterlist sl;
+	unsigned int len;
+	int r;
+
+	for (;;) {
+		long started_before = vq->started;
+		long completed_before = vq->completed;
+
+		virtqueue_disable_cb(vq->vq);
+		do {
+			if (random_batch)
+				batch = (random() % vq->vring.num) + 1;
+
+			while (vq->started < bufs &&
+			       (vq->started - vq->completed) < batch) {
+				sg_init_one(&sl, dev->test_buf, HDR_LEN + TEST_BUF_LEN);
+				r = virtqueue_add_outbuf(vq->vq, &sl, 1,
+							 dev->test_buf + vq->started,
+							 GFP_ATOMIC);
+				if (unlikely(r != 0)) {
+					if (r == -ENOSPC &&
+					    vq->started > started_before)
+						r = 0;
+					else
+						r = -1;
+					break;
+				}
+
+				++vq->started;
+
+				if (unlikely(!virtqueue_kick(vq->vq))) {
+					r = -1;
+					break;
+				}
+			}
+
+			if (vq->started >= bufs)
+				r = -1;
+
+			/* Flush out completed bufs if any */
+			while (virtqueue_get_buf(vq->vq, &len)) {
+				int n;
+
+				n = recvfrom(dev->sock, dev->res_buf, TEST_BUF_LEN, 0, NULL, NULL);
+				assert(n == TEST_BUF_LEN);
+				verify_res_buf(dev->res_buf);
+
+				++vq->completed;
+				r = 0;
+			}
+		} while (r == 0);
+
+		if (vq->completed == completed_before && vq->started == started_before)
+			++spurious;
+
+		assert(vq->completed <= bufs);
+		assert(vq->started <= bufs);
+		if (vq->completed == bufs)
+			break;
+
+		if (delayed) {
+			if (virtqueue_enable_cb_delayed(vq->vq))
+				wait_for_interrupt(vq);
+		} else {
+			if (virtqueue_enable_cb(vq->vq))
+				wait_for_interrupt(vq);
+		}
+	}
+	printf("TX spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
+	       spurious, vq->started, vq->completed);
+}
+
+static void run_rx_test(struct vdev_info *dev, struct vq_info *vq,
+			bool delayed, int batch, int bufs)
+{
+	const bool random_batch = batch == RANDOM_BATCH;
+	long long spurious = 0;
+	struct scatterlist sl;
+	unsigned int len;
+	int r;
+
+	for (;;) {
+		long started_before = vq->started;
+		long completed_before = vq->completed;
+
+		do {
+			if (random_batch)
+				batch = (random() % vq->vring.num) + 1;
+
+			while (vq->started < bufs &&
+			       (vq->started - vq->completed) < batch) {
+				sg_init_one(&sl, dev->res_buf, HDR_LEN + TEST_BUF_LEN);
+
+				r = virtqueue_add_inbuf(vq->vq, &sl, 1,
+							dev->res_buf + vq->started,
+							GFP_ATOMIC);
+				if (unlikely(r != 0)) {
+					if (r == -ENOSPC &&
+					    vq->started > started_before)
+						r = 0;
+					else
+						r = -1;
+					break;
+				}
+
+				++vq->started;
+
+				vdev_send_packet(dev);
+
+				if (unlikely(!virtqueue_kick(vq->vq))) {
+					r = -1;
+					break;
+				}
+			}
+
+			if (vq->started >= bufs)
+				r = -1;
+
+			/* Flush out completed bufs if any */
+			while (virtqueue_get_buf(vq->vq, &len)) {
+				struct ether_header *eh;
+
+				eh = (struct ether_header *)(dev->res_buf + HDR_LEN);
+
+				/* tun netdev is up and running, ignore the
+				 * non-TEST_PTYPE packet.
+				 */
+				if (eh->ether_type != htons(TEST_PTYPE)) {
+					++vq->completed;
+					r = 0;
+					continue;
+				}
+
+				assert(len == TEST_BUF_LEN + HDR_LEN);
+				verify_res_buf(dev->res_buf + HDR_LEN);
+
+				++vq->completed;
+				r = 0;
+			}
+		} while (r == 0);
+		if (vq->completed == completed_before && vq->started == started_before)
+			++spurious;
+
+		assert(vq->completed <= bufs);
+		assert(vq->started <= bufs);
+		if (vq->completed == bufs)
+			break;
+	}
+
+	printf("RX spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
+	       spurious, vq->started, vq->completed);
+}
+
+static const char optstring[] = "h";
+static const struct option longopts[] = {
+	{
+		.name = "help",
+		.val = 'h',
+	},
+	{
+		.name = "event-idx",
+		.val = 'E',
+	},
+	{
+		.name = "no-event-idx",
+		.val = 'e',
+	},
+	{
+		.name = "indirect",
+		.val = 'I',
+	},
+	{
+		.name = "no-indirect",
+		.val = 'i',
+	},
+	{
+		.name = "virtio-1",
+		.val = '1',
+	},
+	{
+		.name = "no-virtio-1",
+		.val = '0',
+	},
+	{
+		.name = "delayed-interrupt",
+		.val = 'D',
+	},
+	{
+		.name = "no-delayed-interrupt",
+		.val = 'd',
+	},
+	{
+		.name = "buf-num",
+		.val = 'n',
+		.has_arg = required_argument,
+	},
+	{
+		.name = "batch",
+		.val = 'b',
+		.has_arg = required_argument,
+	},
+	{
+	}
+};
+
+static void help(int status)
+{
+	fprintf(stderr, "Usage: vhost_net_test [--help]"
+		" [--no-indirect]"
+		" [--no-event-idx]"
+		" [--no-virtio-1]"
+		" [--delayed-interrupt]"
+		" [--buf-num]"
+		" [--batch=random/N]"
+		"\n");
+
+	exit(status);
+}
+
+int main(int argc, char **argv)
+{
+	unsigned long long features = (1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+		(1ULL << VIRTIO_RING_F_EVENT_IDX) | (1ULL << VIRTIO_F_VERSION_1);
+	long batch = 1, nbufs = 0x100000;
+	struct vdev_info dev;
+	bool delayed = false;
+	int o, fd;
+
+	for (;;) {
+		o = getopt_long(argc, argv, optstring, longopts, NULL);
+		switch (o) {
+		case -1:
+			goto done;
+		case '?':
+			help(2);
+		case 'e':
+			features &= ~(1ULL << VIRTIO_RING_F_EVENT_IDX);
+			break;
+		case 'h':
+			help(0);
+		case 'i':
+			features &= ~(1ULL << VIRTIO_RING_F_INDIRECT_DESC);
+			break;
+		case '0':
+			features &= ~(1ULL << VIRTIO_F_VERSION_1);
+			break;
+		case 'D':
+			delayed = true;
+			break;
+		case 'b':
+			if (!strcmp(optarg, "random")) {
+				batch = RANDOM_BATCH;
+			} else {
+				batch = strtol(optarg, NULL, 10);
+				assert(batch > 0);
+				assert(batch < (long)INT_MAX + 1);
+			}
+			break;
+		case 'n':
+			nbufs = strtol(optarg, NULL, 10);
+			assert(nbufs > 0);
+			break;
+		default:
+			assert(0);
+			break;
+		}
+	}
+
+done:
+	memset(&dev, 0, sizeof(dev));
+
+	fd = tun_alloc(&dev);
+	assert(fd >= 0);
+
+	vdev_info_init(&dev, features);
+	vq_info_add(&dev, 0, 256, fd);
+	vq_info_add(&dev, 1, 256, fd);
+	vdev_create_socket(&dev);
+
+	run_rx_test(&dev, &dev.vqs[0], delayed, batch, nbufs);
+	run_tx_test(&dev, &dev.vqs[1], delayed, batch, nbufs);
+
+	return 0;
+}
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next v4 2/5] page_frag: unify gfp bits for order 3 page allocation
  2024-01-30 11:37 ` [PATCH net-next v4 2/5] page_frag: unify gfp bits for order 3 page allocation Yunsheng Lin
@ 2024-02-01 13:16   ` Paolo Abeni
  2024-02-02  2:10     ` Yunsheng Lin
  0 siblings, 1 reply; 15+ messages in thread
From: Paolo Abeni @ 2024-02-01 13:16 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba
  Cc: netdev, linux-kernel, Alexander Duyck, Alexander Duyck,
	Michael S. Tsirkin, Jason Wang, Andrew Morton, Eric Dumazet, kvm,
	virtualization, linux-mm

On Tue, 2024-01-30 at 19:37 +0800, Yunsheng Lin wrote:
> Currently there seem to be three page frag implementations
> which all try to allocate an order 3 page; if that fails,
> they fall back to allocating an order 0 page, and each of
> them allows the order 3 page allocation to fail under
> certain conditions by using specific gfp bits.
> 
> The gfp bits for order 3 page allocation differ between
> implementations: __GFP_NOMEMALLOC is or'd to forbid access
> to the emergency memory reserves in __page_frag_cache_refill(),
> but it is not or'd in the other implementations;
> __GFP_DIRECT_RECLAIM is masked off to avoid direct reclaim
> in skb_page_frag_refill(), but it is not masked off in
> __page_frag_cache_refill().
> 
> This patch unifies the gfp bits used between the different
> implementations by or'ing __GFP_NOMEMALLOC and masking off
> __GFP_DIRECT_RECLAIM for order 3 page allocation to avoid
> possible pressure on mm.
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> ---
>  drivers/vhost/net.c | 2 +-
>  mm/page_alloc.c     | 4 ++--
>  net/core/sock.c     | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index f2ed7167c848..e574e21cc0ca 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
>  		/* Avoid direct reclaim but allow kswapd to wake */
>  		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
>  					  __GFP_COMP | __GFP_NOWARN |
> -					  __GFP_NORETRY,
> +					  __GFP_NORETRY | __GFP_NOMEMALLOC,
>  					  SKB_FRAG_PAGE_ORDER);

>  		if (likely(pfrag->page)) {
>  			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0f7e67c4250..636145c29f70 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4685,8 +4685,8 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>  	gfp_t gfp = gfp_mask;
>  
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> -	gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
> -		    __GFP_NOMEMALLOC;
> +	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> +		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>  	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
>  				PAGE_FRAG_CACHE_MAX_ORDER);
>  	nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 88bf810394a5..8289a3d8c375 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2919,7 +2919,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
>  		/* Avoid direct reclaim but allow kswapd to wake */
>  		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
>  					  __GFP_COMP | __GFP_NOWARN |
> -					  __GFP_NORETRY,
> +					  __GFP_NORETRY | __GFP_NOMEMALLOC,
>  					  SKB_FRAG_PAGE_ORDER);

This will prevent memory reserve usage when allocating order 3 pages,
but not when allocating a single page as a fallback. Still different
from the __page_frag_cache_refill() allocator - which never accesses
the memory reserves.

I'm unsure we want to propagate the __page_frag_cache_refill behavior
here; the current behavior could be required by some systems.

It looks like this series still leaves the skb_page_frag_refill()
allocator alone; what about dropping this chunk, too?

Thanks!

Paolo



* Re: [PATCH net-next v4 2/5] page_frag: unify gfp bits for order 3 page allocation
  2024-02-01 13:16   ` Paolo Abeni
@ 2024-02-02  2:10     ` Yunsheng Lin
  2024-02-02  8:36       ` Paolo Abeni
  0 siblings, 1 reply; 15+ messages in thread
From: Yunsheng Lin @ 2024-02-02  2:10 UTC (permalink / raw)
  To: Paolo Abeni, davem, kuba
  Cc: netdev, linux-kernel, Alexander Duyck, Alexander Duyck,
	Michael S. Tsirkin, Jason Wang, Andrew Morton, Eric Dumazet, kvm,
	virtualization, linux-mm

On 2024/2/1 21:16, Paolo Abeni wrote:
> On Tue, 2024-01-30 at 19:37 +0800, Yunsheng Lin wrote:
>> Currently there seem to be three page frag implementations
>> which all try to allocate an order 3 page; if that fails,
>> they fall back to allocating an order 0 page, and each of
>> them allows the order 3 page allocation to fail under
>> certain conditions by using specific gfp bits.
>>
>> The gfp bits for order 3 page allocation differ between
>> implementations: __GFP_NOMEMALLOC is or'd to forbid access
>> to the emergency memory reserves in __page_frag_cache_refill(),
>> but it is not or'd in the other implementations;
>> __GFP_DIRECT_RECLAIM is masked off to avoid direct reclaim
>> in skb_page_frag_refill(), but it is not masked off in
>> __page_frag_cache_refill().
>>
>> This patch unifies the gfp bits used between the different
>> implementations by or'ing __GFP_NOMEMALLOC and masking off
>> __GFP_DIRECT_RECLAIM for order 3 page allocation to avoid
>> possible pressure on mm.
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
>> CC: Alexander Duyck <alexander.duyck@gmail.com>
>> ---
>>  drivers/vhost/net.c | 2 +-
>>  mm/page_alloc.c     | 4 ++--
>>  net/core/sock.c     | 2 +-
>>  3 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index f2ed7167c848..e574e21cc0ca 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
>>  		/* Avoid direct reclaim but allow kswapd to wake */
>>  		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
>>  					  __GFP_COMP | __GFP_NOWARN |
>> -					  __GFP_NORETRY,
>> +					  __GFP_NORETRY | __GFP_NOMEMALLOC,
>>  					  SKB_FRAG_PAGE_ORDER);
> 
>>  		if (likely(pfrag->page)) {
>>  			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index c0f7e67c4250..636145c29f70 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4685,8 +4685,8 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>>  	gfp_t gfp = gfp_mask;
>>  
>>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>> -	gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
>> -		    __GFP_NOMEMALLOC;
>> +	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
>> +		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>>  	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
>>  				PAGE_FRAG_CACHE_MAX_ORDER);
>>  	nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 88bf810394a5..8289a3d8c375 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -2919,7 +2919,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
>>  		/* Avoid direct reclaim but allow kswapd to wake */
>>  		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
>>  					  __GFP_COMP | __GFP_NOWARN |
>> -					  __GFP_NORETRY,
>> +					  __GFP_NORETRY | __GFP_NOMEMALLOC,
>>  					  SKB_FRAG_PAGE_ORDER);
> 
> This will prevent memory reserve usage when allocating order 3 pages,
> but not when allocating a single page as a fallback. Still different

More accurately, the above ensures the memory reserves are never used
for order 3 pages. Whether the reserves are used for order 0 pages
depends on the original 'gfp' flags: if 'gfp' does not have the
__GFP_NOMEMALLOC bit set, the reserves may still be used for order 0
pages.

> from the __page_frag_cache_refill() allocator - which never accesses
> the memory reserves.

I am not really sure I understand the above comment.
As I understand it, the semantics are the same as those of
skb_page_frag_refill(), as explained above. Note that
__page_frag_cache_refill() uses 'gfp_mask' for allocating order 3
pages and the original 'gfp' for allocating order 0 pages.

> 
> I'm unsure we want to propagate the __page_frag_cache_refill behavior
> here; the current behavior could be required by some systems.
> 
> It looks like this series still leaves the skb_page_frag_refill()
> allocator alone; what about dropping this chunk, too?

As explained above, I would prefer to keep it as it is: it seems quite
clear that we can avoid possible pressure on mm by not using the
memory reserves for order 3 pages, since we have the fallback to
order 0 pages.

Please let me know if there is anything obvious I missed.

> 
> Thanks!
> 
> Paolo
> 


* Re: [PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test
  2024-01-30 11:37 ` [PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test Yunsheng Lin
@ 2024-02-02  4:05   ` Jason Wang
  2024-02-02 12:24     ` Yunsheng Lin
  0 siblings, 1 reply; 15+ messages in thread
From: Jason Wang @ 2024-02-02  4:05 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
	Xuan Zhuo, virtualization

On Tue, Jan 30, 2024 at 7:38 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> Introduce vhost_net_test, based on virtio_test, to test
> vhost_net changes in the kernel.

Let's describe what kind of test is being done and how it is done here.

>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>  tools/virtio/.gitignore       |   1 +
>  tools/virtio/Makefile         |   8 +-
>  tools/virtio/vhost_net_test.c | 576 ++++++++++++++++++++++++++++++++++
>  3 files changed, 582 insertions(+), 3 deletions(-)
>  create mode 100644 tools/virtio/vhost_net_test.c
>
> diff --git a/tools/virtio/.gitignore b/tools/virtio/.gitignore
> index 9934d48d9a55..7e47b281c442 100644
> --- a/tools/virtio/.gitignore
> +++ b/tools/virtio/.gitignore
> @@ -1,5 +1,6 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  *.d
>  virtio_test
> +vhost_net_test
>  vringh_test
>  virtio-trace/trace-agent
> diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
> index d128925980e0..e25e99c1c3b7 100644
> --- a/tools/virtio/Makefile
> +++ b/tools/virtio/Makefile
> @@ -1,8 +1,9 @@
>  # SPDX-License-Identifier: GPL-2.0
>  all: test mod
> -test: virtio_test vringh_test
> +test: virtio_test vringh_test vhost_net_test
>  virtio_test: virtio_ring.o virtio_test.o
>  vringh_test: vringh_test.o vringh.o virtio_ring.o
> +vhost_net_test: virtio_ring.o vhost_net_test.o
>
>  try-run = $(shell set -e;              \
>         if ($(1)) >/dev/null 2>&1;      \
> @@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean
>
>  .PHONY: all test mod clean vhost oot oot-clean oot-build
>  clean:
> -       ${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \
> -              vhost_test/Module.symvers vhost_test/modules.order *.d
> +       ${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \
> +              vhost_test/.*.cmd vhost_test/Module.symvers \
> +              vhost_test/modules.order *.d
>  -include *.d
> diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c
> new file mode 100644
> index 000000000000..e336792a0d77
> --- /dev/null
> +++ b/tools/virtio/vhost_net_test.c
> @@ -0,0 +1,576 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#include <getopt.h>
> +#include <limits.h>
> +#include <string.h>
> +#include <poll.h>
> +#include <sys/eventfd.h>
> +#include <stdlib.h>
> +#include <assert.h>
> +#include <unistd.h>
> +#include <sys/ioctl.h>
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <fcntl.h>
> +#include <stdbool.h>
> +#include <linux/virtio_types.h>
> +#include <linux/vhost.h>
> +#include <linux/virtio.h>
> +#include <linux/virtio_ring.h>
> +#include <linux/if.h>
> +#include <linux/if_tun.h>
> +#include <linux/in.h>
> +#include <linux/if_packet.h>
> +#include <netinet/ether.h>
> +
> +#define RANDOM_BATCH   -1
> +#define HDR_LEN                12
> +#define TEST_BUF_LEN   256
> +#define TEST_PTYPE     ETH_P_LOOPBACK
> +
> +/* Used by implementation of kmalloc() in tools/virtio/linux/kernel.h */
> +void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
> +
> +struct vq_info {
> +       int kick;
> +       int call;
> +       int idx;
> +       long started;
> +       long completed;
> +       struct pollfd fds;
> +       void *ring;
> +       /* copy used for control */
> +       struct vring vring;
> +       struct virtqueue *vq;
> +};
> +
> +struct vdev_info {
> +       struct virtio_device vdev;
> +       int control;
> +       struct vq_info vqs[2];
> +       int nvqs;
> +       void *buf;
> +       size_t buf_size;
> +       char *test_buf;
> +       char *res_buf;
> +       struct vhost_memory *mem;
> +       int sock;
> +       int ifindex;
> +       unsigned char mac[ETHER_ADDR_LEN];
> +};
> +
> +static int tun_alloc(struct vdev_info *dev)
> +{
> +       struct ifreq ifr;
> +       int len = HDR_LEN;

Any reason you can't just use the virtio_net uapi?

> +       int fd, e;
> +
> +       fd = open("/dev/net/tun", O_RDWR);
> +       if (fd < 0) {
> +               perror("Cannot open /dev/net/tun");
> +               return fd;
> +       }
> +
> +       memset(&ifr, 0, sizeof(ifr));
> +
> +       ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
> +       snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
> +
> +       e = ioctl(fd, TUNSETIFF, &ifr);
> +       if (e < 0) {
> +               perror("ioctl[TUNSETIFF]");
> +               close(fd);
> +               return e;
> +       }
> +
> +       e = ioctl(fd, TUNSETVNETHDRSZ, &len);
> +       if (e < 0) {
> +               perror("ioctl[TUNSETVNETHDRSZ]");
> +               close(fd);
> +               return e;
> +       }
> +
> +       e = ioctl(fd, SIOCGIFHWADDR, &ifr);
> +       if (e < 0) {
> +               perror("ioctl[SIOCGIFHWADDR]");
> +               close(fd);
> +               return e;
> +       }
> +
> +       memcpy(dev->mac, &ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
> +       return fd;
> +}
> +
> +static void vdev_create_socket(struct vdev_info *dev)
> +{
> +       struct ifreq ifr;
> +
> +       dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE));
> +       assert(dev->sock != -1);
> +
> +       snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());

Nit: it might be better to accept the device name instead of repeating
the snprintf trick here; this would facilitate future changes.

> +       assert(ioctl(dev->sock, SIOCGIFINDEX, &ifr) >= 0);
> +
> +       dev->ifindex = ifr.ifr_ifindex;
> +
> +       /* Set the flags that bring the device up */
> +       assert(ioctl(dev->sock, SIOCGIFFLAGS, &ifr) >= 0);
> +       ifr.ifr_flags |= (IFF_UP | IFF_RUNNING);
> +       assert(ioctl(dev->sock, SIOCSIFFLAGS, &ifr) >= 0);
> +}
> +
> +static void vdev_send_packet(struct vdev_info *dev)
> +{
> +       char *sendbuf = dev->test_buf + HDR_LEN;
> +       struct sockaddr_ll saddrll = {0};
> +       int sockfd = dev->sock;
> +       int ret;
> +
> +       saddrll.sll_family = PF_PACKET;
> +       saddrll.sll_ifindex = dev->ifindex;
> +       saddrll.sll_halen = ETH_ALEN;
> +       saddrll.sll_protocol = htons(TEST_PTYPE);
> +
> +       ret = sendto(sockfd, sendbuf, TEST_BUF_LEN, 0,
> +                    (struct sockaddr *)&saddrll,
> +                    sizeof(struct sockaddr_ll));
> +       assert(ret >= 0);
> +}
> +
> +static bool vq_notify(struct virtqueue *vq)
> +{
> +       struct vq_info *info = vq->priv;
> +       unsigned long long v = 1;
> +       int r;
> +
> +       r = write(info->kick, &v, sizeof(v));
> +       assert(r == sizeof(v));
> +
> +       return true;
> +}
> +
> +static void vq_callback(struct virtqueue *vq)
> +{
> +}
> +
> +static void vhost_vq_setup(struct vdev_info *dev, struct vq_info *info)
> +{
> +       struct vhost_vring_addr addr = {
> +               .index = info->idx,
> +               .desc_user_addr = (uint64_t)(unsigned long)info->vring.desc,
> +               .avail_user_addr = (uint64_t)(unsigned long)info->vring.avail,
> +               .used_user_addr = (uint64_t)(unsigned long)info->vring.used,
> +       };
> +       struct vhost_vring_state state = { .index = info->idx };
> +       struct vhost_vring_file file = { .index = info->idx };
> +       int r;
> +
> +       state.num = info->vring.num;
> +       r = ioctl(dev->control, VHOST_SET_VRING_NUM, &state);
> +       assert(r >= 0);
> +
> +       state.num = 0;
> +       r = ioctl(dev->control, VHOST_SET_VRING_BASE, &state);
> +       assert(r >= 0);
> +
> +       r = ioctl(dev->control, VHOST_SET_VRING_ADDR, &addr);
> +       assert(r >= 0);
> +
> +       file.fd = info->kick;
> +       r = ioctl(dev->control, VHOST_SET_VRING_KICK, &file);
> +       assert(r >= 0);
> +
> +       file.fd = info->call;
> +       r = ioctl(dev->control, VHOST_SET_VRING_CALL, &file);
> +       assert(r >= 0);
> +}
> +
> +static void vq_reset(struct vq_info *info, int num, struct virtio_device *vdev)
> +{
> +       if (info->vq)
> +               vring_del_virtqueue(info->vq);
> +
> +       memset(info->ring, 0, vring_size(num, 4096));
> +       vring_init(&info->vring, num, info->ring, 4096);
> +       info->vq = vring_new_virtqueue(info->idx, num, 4096, vdev, true, false,
> +                                      info->ring, vq_notify, vq_callback, "test");
> +       assert(info->vq);
> +       info->vq->priv = info;
> +}
> +
> +static void vq_info_add(struct vdev_info *dev, int idx, int num, int fd)
> +{
> +       struct vhost_vring_file backend = { .index = idx, .fd = fd };
> +       struct vq_info *info = &dev->vqs[idx];
> +       int r;
> +
> +       info->idx = idx;
> +       info->kick = eventfd(0, EFD_NONBLOCK);
> +       info->call = eventfd(0, EFD_NONBLOCK);

If we don't care about the callback, let's just avoid setting the call
here?

(As I see it, vq_callback is a no-op.)

> +       r = posix_memalign(&info->ring, 4096, vring_size(num, 4096));
> +       assert(r >= 0);
> +       vq_reset(info, num, &dev->vdev);
> +       vhost_vq_setup(dev, info);
> +       info->fds.fd = info->call;
> +       info->fds.events = POLLIN;
> +
> +       r = ioctl(dev->control, VHOST_NET_SET_BACKEND, &backend);
> +       assert(!r);
> +}
> +
> +static void vdev_info_init(struct vdev_info *dev, unsigned long long features)
> +{
> +       struct ether_header *eh;
> +       int i, r;
> +
> +       dev->vdev.features = features;
> +       INIT_LIST_HEAD(&dev->vdev.vqs);
> +       spin_lock_init(&dev->vdev.vqs_list_lock);
> +
> +       dev->buf_size = (HDR_LEN + TEST_BUF_LEN) * 2;
> +       dev->buf = malloc(dev->buf_size);
> +       assert(dev->buf);
> +       dev->test_buf = dev->buf;
> +       dev->res_buf = dev->test_buf + HDR_LEN + TEST_BUF_LEN;
> +
> +       memset(dev->test_buf, 0, HDR_LEN + TEST_BUF_LEN);
> +       eh = (struct ether_header *)(dev->test_buf + HDR_LEN);
> +       eh->ether_type = htons(TEST_PTYPE);
> +       memcpy(eh->ether_dhost, dev->mac, ETHER_ADDR_LEN);
> +       memcpy(eh->ether_shost, dev->mac, ETHER_ADDR_LEN);
> +
> +       for (i = sizeof(*eh); i < TEST_BUF_LEN; i++)
> +               dev->test_buf[i + HDR_LEN] = (char)i;
> +
> +       dev->control = open("/dev/vhost-net", O_RDWR);
> +       assert(dev->control >= 0);
> +
> +       r = ioctl(dev->control, VHOST_SET_OWNER, NULL);
> +       assert(r >= 0);
> +
> +       dev->mem = malloc(offsetof(struct vhost_memory, regions) +
> +                         sizeof(dev->mem->regions[0]));
> +       assert(dev->mem);
> +       memset(dev->mem, 0, offsetof(struct vhost_memory, regions) +
> +              sizeof(dev->mem->regions[0]));
> +       dev->mem->nregions = 1;
> +       dev->mem->regions[0].guest_phys_addr = (long)dev->buf;
> +       dev->mem->regions[0].userspace_addr = (long)dev->buf;
> +       dev->mem->regions[0].memory_size = dev->buf_size;
> +
> +       r = ioctl(dev->control, VHOST_SET_MEM_TABLE, dev->mem);
> +       assert(r >= 0);
> +
> +       r = ioctl(dev->control, VHOST_SET_FEATURES, &features);
> +       assert(r >= 0);
> +
> +       dev->nvqs = 2;
> +}
> +
> +static void wait_for_interrupt(struct vq_info *vq)
> +{
> +       unsigned long long val;
> +
> +       poll(&vq->fds, 1, -1);
> +
> +       if (vq->fds.revents & POLLIN)
> +               read(vq->fds.fd, &val, sizeof(val));
> +}
> +
> +static void verify_res_buf(char *res_buf)
> +{
> +       int i;
> +
> +       for (i = ETHER_HDR_LEN; i < TEST_BUF_LEN; i++)
> +               assert(res_buf[i] == (char)i);
> +}
> +
> +static void run_tx_test(struct vdev_info *dev, struct vq_info *vq,
> +                       bool delayed, int batch, int bufs)

It might be better to describe the test design briefly above as a
comment. Or we can start from simple test logic and add sophisticated
ones on top.

> +{
> +       const bool random_batch = batch == RANDOM_BATCH;
> +       long long spurious = 0;
> +       struct scatterlist sl;
> +       unsigned int len;
> +       int r;
> +
> +       for (;;) {
> +               long started_before = vq->started;
> +               long completed_before = vq->completed;
> +
> +               virtqueue_disable_cb(vq->vq);
> +               do {
> +                       if (random_batch)
> +                               batch = (random() % vq->vring.num) + 1;
> +
> +                       while (vq->started < bufs &&
> +                              (vq->started - vq->completed) < batch) {
> +                               sg_init_one(&sl, dev->test_buf, HDR_LEN + TEST_BUF_LEN);
> +                               r = virtqueue_add_outbuf(vq->vq, &sl, 1,
> +                                                        dev->test_buf + vq->started,
> +                                                        GFP_ATOMIC);
> +                               if (unlikely(r != 0)) {
> +                                       if (r == -ENOSPC &&
> +                                           vq->started > started_before)
> +                                               r = 0;
> +                                       else
> +                                               r = -1;
> +                                       break;
> +                               }
> +
> +                               ++vq->started;
> +
> +                               if (unlikely(!virtqueue_kick(vq->vq))) {
> +                                       r = -1;
> +                                       break;
> +                               }
> +                       }
> +
> +                       if (vq->started >= bufs)
> +                               r = -1;
> +
> +                       /* Flush out completed bufs if any */
> +                       while (virtqueue_get_buf(vq->vq, &len)) {
> +                               int n;
> +
> +                               n = recvfrom(dev->sock, dev->res_buf, TEST_BUF_LEN, 0, NULL, NULL);
> +                               assert(n == TEST_BUF_LEN);
> +                               verify_res_buf(dev->res_buf);
> +
> +                               ++vq->completed;
> +                               r = 0;
> +                       }
> +               } while (r == 0);
> +
> +               if (vq->completed == completed_before && vq->started == started_before)
> +                       ++spurious;
> +
> +               assert(vq->completed <= bufs);
> +               assert(vq->started <= bufs);
> +               if (vq->completed == bufs)
> +                       break;
> +
> +               if (delayed) {
> +                       if (virtqueue_enable_cb_delayed(vq->vq))
> +                               wait_for_interrupt(vq);
> +               } else {
> +                       if (virtqueue_enable_cb(vq->vq))
> +                               wait_for_interrupt(vq);
> +               }
> +       }
> +       printf("TX spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
> +              spurious, vq->started, vq->completed);
> +}
> +
> +static void run_rx_test(struct vdev_info *dev, struct vq_info *vq,
> +                       bool delayed, int batch, int bufs)
> +{
> +       const bool random_batch = batch == RANDOM_BATCH;
> +       long long spurious = 0;
> +       struct scatterlist sl;
> +       unsigned int len;
> +       int r;
> +
> +       for (;;) {
> +               long started_before = vq->started;
> +               long completed_before = vq->completed;
> +
> +               do {
> +                       if (random_batch)
> +                               batch = (random() % vq->vring.num) + 1;
> +
> +                       while (vq->started < bufs &&
> +                              (vq->started - vq->completed) < batch) {
> +                               sg_init_one(&sl, dev->res_buf, HDR_LEN + TEST_BUF_LEN);
> +
> +                               r = virtqueue_add_inbuf(vq->vq, &sl, 1,
> +                                                       dev->res_buf + vq->started,
> +                                                       GFP_ATOMIC);
> +                               if (unlikely(r != 0)) {
> +                                       if (r == -ENOSPC &&

Drivers usually maintain a #free_slots counter; could this help avoid
the trick of checking for ENOSPC?

> +                                           vq->started > started_before)
> +                                               r = 0;
> +                                       else
> +                                               r = -1;
> +                                       break;
> +                               }
> +
> +                               ++vq->started;
> +
> +                               vdev_send_packet(dev);
> +
> +                               if (unlikely(!virtqueue_kick(vq->vq))) {
> +                                       r = -1;
> +                                       break;
> +                               }
> +                       }
> +
> +                       if (vq->started >= bufs)
> +                               r = -1;
> +
> +                       /* Flush out completed bufs if any */
> +                       while (virtqueue_get_buf(vq->vq, &len)) {
> +                               struct ether_header *eh;
> +
> +                               eh = (struct ether_header *)(dev->res_buf + HDR_LEN);
> +
> +                               /* tun netdev is up and running, ignore the
> +                                * non-TEST_PTYPE packet.
> +                                */

I wonder if it's better to set up some kind of qdisc to avoid the
unexpected packet here, or is it too complicated?

Thanks


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next v4 2/5] page_frag: unify gfp bits for order 3 page allocation
  2024-02-02  2:10     ` Yunsheng Lin
@ 2024-02-02  8:36       ` Paolo Abeni
  2024-02-02 12:26         ` Yunsheng Lin
  0 siblings, 1 reply; 15+ messages in thread
From: Paolo Abeni @ 2024-02-02  8:36 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba
  Cc: netdev, linux-kernel, Alexander Duyck, Alexander Duyck,
	Michael S. Tsirkin, Jason Wang, Andrew Morton, Eric Dumazet, kvm,
	virtualization, linux-mm

On Fri, 2024-02-02 at 10:10 +0800, Yunsheng Lin wrote:
> On 2024/2/1 21:16, Paolo Abeni wrote:
> 
> > from the __page_frag_cache_refill() allocator - which never accesses
> > the memory reserves.
> 
> I am not really sure I understand the above comment.
> As I understand it, the semantics are the same as skb_page_frag_refill(),
> as explained above. Note that __page_frag_cache_refill() uses 'gfp_mask'
> for allocating order 3 pages and the original 'gfp' for allocating
> order 0 pages.

You are right! I got fooled misreading 'gfp' as 'gfp_mask' in there.

> > I'm unsure we want to propagate the __page_frag_cache_refill behavior
> > here, the current behavior could be required by some systems.
> > 
> > It looks like this series still leave the skb_page_frag_refill()
> > allocator alone, what about dropping this chunk, too? 
> 
> As explained above, I would prefer to keep it as it is as it seems
> to be quite obvious that we can avoid possible pressure for mm by
> not using memory reserve for order 3 pages as we have the fallback
> for order 0 pages.
> 
> Please let me know if there is anything obvious I missed.
> 

I still think/fear that behaviours changes here could have
subtle/negative side effects - even if I agree the change looks safe.

I think the series without this patch would still achieve its goals and
would be much less controversial. What about moving this patch to a
standalone follow-up?

Thanks!

Paolo



* Re: [PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test
  2024-02-02  4:05   ` Jason Wang
@ 2024-02-02 12:24     ` Yunsheng Lin
  2024-02-04  1:30       ` Jason Wang
  0 siblings, 1 reply; 15+ messages in thread
From: Yunsheng Lin @ 2024-02-02 12:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
	Xuan Zhuo, virtualization

On 2024/2/2 12:05, Jason Wang wrote:
> On Tue, Jan 30, 2024 at 7:38 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> introduce vhost_net_test based on virtio_test to test
>> vhost_net changes in the kernel.
> 
> Let's describe what kind of test is being done and how it is done here.

How about something like below:

This patch introduces testing for both vhost_net tx and rx.
Steps for vhost_net tx testing:
1. Prepare an out buf
2. Kick the vhost_net to do tx processing
3. Do the receiving in the tun side
4. Verify that the data received by tun is correct

Steps for vhost_net rx testing:
1. Prepare an in buf
2. Do the sending in the tun side
3. Kick the vhost_net to do rx processing
4. Verify that the data received by vhost_net is correct


>> +
>> +static int tun_alloc(struct vdev_info *dev)
>> +{
>> +       struct ifreq ifr;
>> +       int len = HDR_LEN;
> 
> Any reason you can't just use the virtio_net uapi?

I didn't find a macro for that in include/uapi/linux/virtio_net.h.

Did you mean using something like below?
sizeof(struct virtio_net_hdr_mrg_rxbuf)

> 
>> +       int fd, e;
>> +
>> +       fd = open("/dev/net/tun", O_RDWR);
>> +       if (fd < 0) {
>> +               perror("Cannot open /dev/net/tun");
>> +               return fd;
>> +       }
>> +
>> +       memset(&ifr, 0, sizeof(ifr));
>> +
>> +       ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
>> +       snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
>> +
>> +       e = ioctl(fd, TUNSETIFF, &ifr);
>> +       if (e < 0) {
>> +               perror("ioctl[TUNSETIFF]");
>> +               close(fd);
>> +               return e;
>> +       }
>> +
>> +       e = ioctl(fd, TUNSETVNETHDRSZ, &len);
>> +       if (e < 0) {
>> +               perror("ioctl[TUNSETVNETHDRSZ]");
>> +               close(fd);
>> +               return e;
>> +       }
>> +
>> +       e = ioctl(fd, SIOCGIFHWADDR, &ifr);
>> +       if (e < 0) {
>> +               perror("ioctl[SIOCGIFHWADDR]");
>> +               close(fd);
>> +               return e;
>> +       }
>> +
>> +       memcpy(dev->mac, &ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
>> +       return fd;
>> +}
>> +
>> +static void vdev_create_socket(struct vdev_info *dev)
>> +{
>> +       struct ifreq ifr;
>> +
>> +       dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE));
>> +       assert(dev->sock != -1);
>> +
>> +       snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
> 
> Nit: it might be better to accept the device name instead of repeating
> the snprintf trick here, this would facilitate the future changes.

I am not sure I understand what you meant by "accept the device name"
here.

The above is used to get the ifindex of the tun netdevice created in
tun_alloc(), so that vdev_send_packet() can send a packet through that
netdevice. Is there anything obvious I missed here?

> 
>> +       assert(ioctl(dev->sock, SIOCGIFINDEX, &ifr) >= 0);
>> +
>> +       dev->ifindex = ifr.ifr_ifindex;
>> +
>> +       /* Set the flags that bring the device up */
>> +       assert(ioctl(dev->sock, SIOCGIFFLAGS, &ifr) >= 0);
>> +       ifr.ifr_flags |= (IFF_UP | IFF_RUNNING);
>> +       assert(ioctl(dev->sock, SIOCSIFFLAGS, &ifr) >= 0);
>> +}
>> +
>> +static void vdev_send_packet(struct vdev_info *dev)
>> +{
>> +       char *sendbuf = dev->test_buf + HDR_LEN;
>> +       struct sockaddr_ll saddrll = {0};
>> +       int sockfd = dev->sock;
>> +       int ret;
>> +
>> +       saddrll.sll_family = PF_PACKET;
>> +       saddrll.sll_ifindex = dev->ifindex;
>> +       saddrll.sll_halen = ETH_ALEN;
>> +       saddrll.sll_protocol = htons(TEST_PTYPE);
>> +
>> +       ret = sendto(sockfd, sendbuf, TEST_BUF_LEN, 0,
>> +                    (struct sockaddr *)&saddrll,
>> +                    sizeof(struct sockaddr_ll));
>> +       assert(ret >= 0);
>> +}
>> +

...

>> +
>> +static void vq_info_add(struct vdev_info *dev, int idx, int num, int fd)
>> +{
>> +       struct vhost_vring_file backend = { .index = idx, .fd = fd };
>> +       struct vq_info *info = &dev->vqs[idx];
>> +       int r;
>> +
>> +       info->idx = idx;
>> +       info->kick = eventfd(0, EFD_NONBLOCK);
>> +       info->call = eventfd(0, EFD_NONBLOCK);
> 
> If we don't care about the callback, let's just avoid setting the call here?
> 
> (As I see vq_callback is a NULL)

Sure, will remove the vq_callback related code.

> 
>> +       r = posix_memalign(&info->ring, 4096, vring_size(num, 4096));
>> +       assert(r >= 0);
>> +       vq_reset(info, num, &dev->vdev);
>> +       vhost_vq_setup(dev, info);
>> +       info->fds.fd = info->call;
>> +       info->fds.events = POLLIN;
>> +
>> +       r = ioctl(dev->control, VHOST_NET_SET_BACKEND, &backend);
>> +       assert(!r);
>> +}
>> +
>> +static void vdev_info_init(struct vdev_info *dev, unsigned long long features)
>> +{
>> +       struct ether_header *eh;
>> +       int i, r;
>> +
>> +       dev->vdev.features = features;
>> +       INIT_LIST_HEAD(&dev->vdev.vqs);
>> +       spin_lock_init(&dev->vdev.vqs_list_lock);
>> +
>> +       dev->buf_size = (HDR_LEN + TEST_BUF_LEN) * 2;
>> +       dev->buf = malloc(dev->buf_size);
>> +       assert(dev->buf);
>> +       dev->test_buf = dev->buf;
>> +       dev->res_buf = dev->test_buf + HDR_LEN + TEST_BUF_LEN;
>> +
>> +       memset(dev->test_buf, 0, HDR_LEN + TEST_BUF_LEN);
>> +       eh = (struct ether_header *)(dev->test_buf + HDR_LEN);
>> +       eh->ether_type = htons(TEST_PTYPE);
>> +       memcpy(eh->ether_dhost, dev->mac, ETHER_ADDR_LEN);
>> +       memcpy(eh->ether_shost, dev->mac, ETHER_ADDR_LEN);
>> +
>> +       for (i = sizeof(*eh); i < TEST_BUF_LEN; i++)
>> +               dev->test_buf[i + HDR_LEN] = (char)i;
>> +
>> +       dev->control = open("/dev/vhost-net", O_RDWR);
>> +       assert(dev->control >= 0);
>> +
>> +       r = ioctl(dev->control, VHOST_SET_OWNER, NULL);
>> +       assert(r >= 0);
>> +
>> +       dev->mem = malloc(offsetof(struct vhost_memory, regions) +
>> +                         sizeof(dev->mem->regions[0]));
>> +       assert(dev->mem);
>> +       memset(dev->mem, 0, offsetof(struct vhost_memory, regions) +
>> +              sizeof(dev->mem->regions[0]));
>> +       dev->mem->nregions = 1;
>> +       dev->mem->regions[0].guest_phys_addr = (long)dev->buf;
>> +       dev->mem->regions[0].userspace_addr = (long)dev->buf;
>> +       dev->mem->regions[0].memory_size = dev->buf_size;
>> +
>> +       r = ioctl(dev->control, VHOST_SET_MEM_TABLE, dev->mem);
>> +       assert(r >= 0);
>> +
>> +       r = ioctl(dev->control, VHOST_SET_FEATURES, &features);
>> +       assert(r >= 0);
>> +
>> +       dev->nvqs = 2;
>> +}
>> +
>> +static void wait_for_interrupt(struct vq_info *vq)
>> +{
>> +       unsigned long long val;
>> +
>> +       poll(&vq->fds, 1, -1);
>> +
>> +       if (vq->fds.revents & POLLIN)
>> +               read(vq->fds.fd, &val, sizeof(val));
>> +}
>> +
>> +static void verify_res_buf(char *res_buf)
>> +{
>> +       int i;
>> +
>> +       for (i = ETHER_HDR_LEN; i < TEST_BUF_LEN; i++)
>> +               assert(res_buf[i] == (char)i);
>> +}
>> +
>> +static void run_tx_test(struct vdev_info *dev, struct vq_info *vq,
>> +                       bool delayed, int batch, int bufs)
> 
> It might be better to describe the test design briefly above as a
> comment. Or we can start from simple test logic and add sophisticated
> ones on top.

Does the description suggested above for the commit log make sense to
you?
Steps for vhost_net tx testing:
1. Prepare an out buf
2. Kick the vhost_net to do tx processing
3. Do the receiving in the tun side
4. Verify that the data received by tun is correct

> 
>> +{
>> +       const bool random_batch = batch == RANDOM_BATCH;
>> +       long long spurious = 0;
>> +       struct scatterlist sl;
>> +       unsigned int len;
>> +       int r;
>> +
>> +       for (;;) {
>> +               long started_before = vq->started;
>> +               long completed_before = vq->completed;
>> +
>> +               virtqueue_disable_cb(vq->vq);
>> +               do {
>> +                       if (random_batch)
>> +                               batch = (random() % vq->vring.num) + 1;
>> +
>> +                       while (vq->started < bufs &&
>> +                              (vq->started - vq->completed) < batch) {
>> +                               sg_init_one(&sl, dev->test_buf, HDR_LEN + TEST_BUF_LEN);
>> +                               r = virtqueue_add_outbuf(vq->vq, &sl, 1,
>> +                                                        dev->test_buf + vq->started,
>> +                                                        GFP_ATOMIC);
>> +                               if (unlikely(r != 0)) {
>> +                                       if (r == -ENOSPC &&
>> +                                           vq->started > started_before)
>> +                                               r = 0;
>> +                                       else
>> +                                               r = -1;
>> +                                       break;
>> +                               }
>> +
>> +                               ++vq->started;
>> +
>> +                               if (unlikely(!virtqueue_kick(vq->vq))) {
>> +                                       r = -1;
>> +                                       break;
>> +                               }
>> +                       }
>> +
>> +                       if (vq->started >= bufs)
>> +                               r = -1;
>> +
>> +                       /* Flush out completed bufs if any */
>> +                       while (virtqueue_get_buf(vq->vq, &len)) {
>> +                               int n;
>> +
>> +                               n = recvfrom(dev->sock, dev->res_buf, TEST_BUF_LEN, 0, NULL, NULL);
>> +                               assert(n == TEST_BUF_LEN);
>> +                               verify_res_buf(dev->res_buf);
>> +
>> +                               ++vq->completed;
>> +                               r = 0;
>> +                       }
>> +               } while (r == 0);
>> +
>> +               if (vq->completed == completed_before && vq->started == started_before)
>> +                       ++spurious;
>> +
>> +               assert(vq->completed <= bufs);
>> +               assert(vq->started <= bufs);
>> +               if (vq->completed == bufs)
>> +                       break;
>> +
>> +               if (delayed) {
>> +                       if (virtqueue_enable_cb_delayed(vq->vq))
>> +                               wait_for_interrupt(vq);
>> +               } else {
>> +                       if (virtqueue_enable_cb(vq->vq))
>> +                               wait_for_interrupt(vq);
>> +               }
>> +       }
>> +       printf("TX spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
>> +              spurious, vq->started, vq->completed);
>> +}
>> +
>> +static void run_rx_test(struct vdev_info *dev, struct vq_info *vq,
>> +                       bool delayed, int batch, int bufs)
>> +{
>> +       const bool random_batch = batch == RANDOM_BATCH;
>> +       long long spurious = 0;
>> +       struct scatterlist sl;
>> +       unsigned int len;
>> +       int r;
>> +
>> +       for (;;) {
>> +               long started_before = vq->started;
>> +               long completed_before = vq->completed;
>> +
>> +               do {
>> +                       if (random_batch)
>> +                               batch = (random() % vq->vring.num) + 1;
>> +
>> +                       while (vq->started < bufs &&
>> +                              (vq->started - vq->completed) < batch) {
>> +                               sg_init_one(&sl, dev->res_buf, HDR_LEN + TEST_BUF_LEN);
>> +
>> +                               r = virtqueue_add_inbuf(vq->vq, &sl, 1,
>> +                                                       dev->res_buf + vq->started,
>> +                                                       GFP_ATOMIC);
>> +                               if (unlikely(r != 0)) {
>> +                                       if (r == -ENOSPC &&
> 
> Drivers usually maintain a #free_slots counter; could this help avoid
> the trick of checking for ENOSPC?

The above "(vq->started - vq->completed) < batch" check seems to ensure
that 'r' can't be -ENOSPC. We just need to ensure that batch <= desc_num,
so the 'r == -ENOSPC' check seems to be unnecessary.

> 
>> +                                           vq->started > started_before)
>> +                                               r = 0;
>> +                                       else
>> +                                               r = -1;
>> +                                       break;
>> +                               }
>> +
>> +                               ++vq->started;
>> +
>> +                               vdev_send_packet(dev);
>> +
>> +                               if (unlikely(!virtqueue_kick(vq->vq))) {
>> +                                       r = -1;
>> +                                       break;
>> +                               }
>> +                       }
>> +
>> +                       if (vq->started >= bufs)
>> +                               r = -1;
>> +
>> +                       /* Flush out completed bufs if any */
>> +                       while (virtqueue_get_buf(vq->vq, &len)) {
>> +                               struct ether_header *eh;
>> +
>> +                               eh = (struct ether_header *)(dev->res_buf + HDR_LEN);
>> +
>> +                               /* tun netdev is up and running, ignore the
>> +                                * non-TEST_PTYPE packet.
>> +                                */
> 
> I wonder if it's better to set up some kind of qdisc to avoid the
> unexpected packet here, or is it too complicated?

Yes, at least I don't know how to do that yet.

> 
> Thanks
> 
> .
> 


* Re: [PATCH net-next v4 2/5] page_frag: unify gfp bits for order 3 page allocation
  2024-02-02  8:36       ` Paolo Abeni
@ 2024-02-02 12:26         ` Yunsheng Lin
  0 siblings, 0 replies; 15+ messages in thread
From: Yunsheng Lin @ 2024-02-02 12:26 UTC (permalink / raw)
  To: Paolo Abeni, davem, kuba
  Cc: netdev, linux-kernel, Alexander Duyck, Alexander Duyck,
	Michael S. Tsirkin, Jason Wang, Andrew Morton, Eric Dumazet, kvm,
	virtualization, linux-mm

On 2024/2/2 16:36, Paolo Abeni wrote:
> On Fri, 2024-02-02 at 10:10 +0800, Yunsheng Lin wrote:
>> On 2024/2/1 21:16, Paolo Abeni wrote:
>>
>>> from the __page_frag_cache_refill() allocator - which never accesses
>>> the memory reserves.
>>
>> I am not really sure I understand the above comment.
>> As I understand it, the semantics are the same as skb_page_frag_refill(),
>> as explained above. Note that __page_frag_cache_refill() uses 'gfp_mask'
>> for allocating order 3 pages and the original 'gfp' for allocating
>> order 0 pages.
> 
> You are right! I got fooled misreading 'gfp' as 'gfp_mask' in there.
> 
>>> I'm unsure we want to propagate the __page_frag_cache_refill behavior
>>> here, the current behavior could be required by some systems.
>>>
>>> It looks like this series still leave the skb_page_frag_refill()
>>> allocator alone, what about dropping this chunk, too? 
>>
>> As explained above, I would prefer to keep it as it is as it seems
>> to be quite obvious that we can avoid possible pressure for mm by
>> not using memory reserve for order 3 pages as we have the fallback
>> for order 0 pages.
>>
>> Please let me know if there is anything obvious I missed.
>>
> 
> I still think/fear that behaviours changes here could have
> subtle/negative side effects - even if I agree the change looks safe.
> 
> I think the series without this patch would still achieve its goals and
> would be much less controversial. What about moving this patch to a
> standalone follow-up?

Fair enough, will remove that for now.

> 
> Thanks!
> 
> Paolo
> 
> .
> 


* Re: [PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test
  2024-02-02 12:24     ` Yunsheng Lin
@ 2024-02-04  1:30       ` Jason Wang
  2024-02-04  3:50         ` Yunsheng Lin
  0 siblings, 1 reply; 15+ messages in thread
From: Jason Wang @ 2024-02-04  1:30 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
	Xuan Zhuo, virtualization

On Fri, Feb 2, 2024 at 8:24 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/2/2 12:05, Jason Wang wrote:
> > On Tue, Jan 30, 2024 at 7:38 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> introduce vhost_net_test based on virtio_test to test
> >> vhost_net changes in the kernel.
> >
> > Let's describe what kind of test is being done and how it is done here.
>
> How about something like below:
>
> This patch introduces testing for both vhost_net tx and rx.
> Steps for vhost_net tx testing:
> 1. Prepare an out buf
> 2. Kick the vhost_net to do tx processing
> 3. Do the receiving in the tun side
> 4. Verify that the data received by tun is correct
>
> Steps for vhost_net rx testing:
> 1. Prepare an in buf
> 2. Do the sending in the tun side
> 3. Kick the vhost_net to do rx processing
> 4. Verify that the data received by vhost_net is correct

It looks like some important details were lost, e.g. the logic for batching.

>
>
> >> +
> >> +static int tun_alloc(struct vdev_info *dev)
> >> +{
> >> +       struct ifreq ifr;
> >> +       int len = HDR_LEN;
> >
> > Any reason you can't just use the virtio_net uapi?
>
> I didn't find a macro for that in include/uapi/linux/virtio_net.h.
>
> Did you mean using something like below?
> sizeof(struct virtio_net_hdr_mrg_rxbuf)

Yes.

>
> >
> >> +       int fd, e;
> >> +
> >> +       fd = open("/dev/net/tun", O_RDWR);
> >> +       if (fd < 0) {
> >> +               perror("Cannot open /dev/net/tun");
> >> +               return fd;
> >> +       }
> >> +
> >> +       memset(&ifr, 0, sizeof(ifr));
> >> +
> >> +       ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
> >> +       snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
> >> +
> >> +       e = ioctl(fd, TUNSETIFF, &ifr);
> >> +       if (e < 0) {
> >> +               perror("ioctl[TUNSETIFF]");
> >> +               close(fd);
> >> +               return e;
> >> +       }
> >> +
> >> +       e = ioctl(fd, TUNSETVNETHDRSZ, &len);
> >> +       if (e < 0) {
> >> +               perror("ioctl[TUNSETVNETHDRSZ]");
> >> +               close(fd);
> >> +               return e;
> >> +       }
> >> +
> >> +       e = ioctl(fd, SIOCGIFHWADDR, &ifr);
> >> +       if (e < 0) {
> >> +               perror("ioctl[SIOCGIFHWADDR]");
> >> +               close(fd);
> >> +               return e;
> >> +       }
> >> +
> >> +       memcpy(dev->mac, &ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
> >> +       return fd;
> >> +}
> >> +
> >> +static void vdev_create_socket(struct vdev_info *dev)
> >> +{
> >> +       struct ifreq ifr;
> >> +
> >> +       dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE));
> >> +       assert(dev->sock != -1);
> >> +
> >> +       snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
> >
> > Nit: it might be better to accept the device name instead of repeating
> > the snprintf trick here, this would facilitate the future changes.
>
> I am not sure I understand what you meant by "accept the device name"
> here.
>
> The above is used to get ifindex of the tun netdevice created in
> tun_alloc(), so that we can use it in vdev_send_packet() to send
> a packet using the tun netdevice created in tun_alloc(). Is there
> anything obvious I missed here?

I meant adding a const char *ifname parameter to this function and
letting the caller pass the name.

>
> >
> >> +       assert(ioctl(dev->sock, SIOCGIFINDEX, &ifr) >= 0);
> >> +
> >> +       dev->ifindex = ifr.ifr_ifindex;
> >> +
> >> +       /* Set the flags that bring the device up */
> >> +       assert(ioctl(dev->sock, SIOCGIFFLAGS, &ifr) >= 0);
> >> +       ifr.ifr_flags |= (IFF_UP | IFF_RUNNING);
> >> +       assert(ioctl(dev->sock, SIOCSIFFLAGS, &ifr) >= 0);
> >> +}
> >> +
> >> +static void vdev_send_packet(struct vdev_info *dev)
> >> +{
> >> +       char *sendbuf = dev->test_buf + HDR_LEN;
> >> +       struct sockaddr_ll saddrll = {0};
> >> +       int sockfd = dev->sock;
> >> +       int ret;
> >> +
> >> +       saddrll.sll_family = PF_PACKET;
> >> +       saddrll.sll_ifindex = dev->ifindex;
> >> +       saddrll.sll_halen = ETH_ALEN;
> >> +       saddrll.sll_protocol = htons(TEST_PTYPE);
> >> +
> >> +       ret = sendto(sockfd, sendbuf, TEST_BUF_LEN, 0,
> >> +                    (struct sockaddr *)&saddrll,
> >> +                    sizeof(struct sockaddr_ll));
> >> +       assert(ret >= 0);
> >> +}
> >> +
>
> ...
>
> >> +
> >> +static void vq_info_add(struct vdev_info *dev, int idx, int num, int fd)
> >> +{
> >> +       struct vhost_vring_file backend = { .index = idx, .fd = fd };
> >> +       struct vq_info *info = &dev->vqs[idx];
> >> +       int r;
> >> +
> >> +       info->idx = idx;
> >> +       info->kick = eventfd(0, EFD_NONBLOCK);
> >> +       info->call = eventfd(0, EFD_NONBLOCK);
> >
> > If we don't care about the callback, let's just avoid setting the call here?
> >
> > (As I see vq_callback is a NULL)
>
> Sure, will remove the vq_callback related code.
>
> >
> >> +       r = posix_memalign(&info->ring, 4096, vring_size(num, 4096));
> >> +       assert(r >= 0);
> >> +       vq_reset(info, num, &dev->vdev);
> >> +       vhost_vq_setup(dev, info);
> >> +       info->fds.fd = info->call;
> >> +       info->fds.events = POLLIN;
> >> +
> >> +       r = ioctl(dev->control, VHOST_NET_SET_BACKEND, &backend);
> >> +       assert(!r);
> >> +}
> >> +
> >> +static void vdev_info_init(struct vdev_info *dev, unsigned long long features)
> >> +{
> >> +       struct ether_header *eh;
> >> +       int i, r;
> >> +
> >> +       dev->vdev.features = features;
> >> +       INIT_LIST_HEAD(&dev->vdev.vqs);
> >> +       spin_lock_init(&dev->vdev.vqs_list_lock);
> >> +
> >> +       dev->buf_size = (HDR_LEN + TEST_BUF_LEN) * 2;
> >> +       dev->buf = malloc(dev->buf_size);
> >> +       assert(dev->buf);
> >> +       dev->test_buf = dev->buf;
> >> +       dev->res_buf = dev->test_buf + HDR_LEN + TEST_BUF_LEN;
> >> +
> >> +       memset(dev->test_buf, 0, HDR_LEN + TEST_BUF_LEN);
> >> +       eh = (struct ether_header *)(dev->test_buf + HDR_LEN);
> >> +       eh->ether_type = htons(TEST_PTYPE);
> >> +       memcpy(eh->ether_dhost, dev->mac, ETHER_ADDR_LEN);
> >> +       memcpy(eh->ether_shost, dev->mac, ETHER_ADDR_LEN);
> >> +
> >> +       for (i = sizeof(*eh); i < TEST_BUF_LEN; i++)
> >> +               dev->test_buf[i + HDR_LEN] = (char)i;
> >> +
> >> +       dev->control = open("/dev/vhost-net", O_RDWR);
> >> +       assert(dev->control >= 0);
> >> +
> >> +       r = ioctl(dev->control, VHOST_SET_OWNER, NULL);
> >> +       assert(r >= 0);
> >> +
> >> +       dev->mem = malloc(offsetof(struct vhost_memory, regions) +
> >> +                         sizeof(dev->mem->regions[0]));
> >> +       assert(dev->mem);
> >> +       memset(dev->mem, 0, offsetof(struct vhost_memory, regions) +
> >> +              sizeof(dev->mem->regions[0]));
> >> +       dev->mem->nregions = 1;
> >> +       dev->mem->regions[0].guest_phys_addr = (long)dev->buf;
> >> +       dev->mem->regions[0].userspace_addr = (long)dev->buf;
> >> +       dev->mem->regions[0].memory_size = dev->buf_size;
> >> +
> >> +       r = ioctl(dev->control, VHOST_SET_MEM_TABLE, dev->mem);
> >> +       assert(r >= 0);
> >> +
> >> +       r = ioctl(dev->control, VHOST_SET_FEATURES, &features);
> >> +       assert(r >= 0);
> >> +
> >> +       dev->nvqs = 2;
> >> +}
> >> +
> >> +static void wait_for_interrupt(struct vq_info *vq)
> >> +{
> >> +       unsigned long long val;
> >> +
> >> +       poll(&vq->fds, 1, -1);
> >> +
> >> +       if (vq->fds.revents & POLLIN)
> >> +               read(vq->fds.fd, &val, sizeof(val));
> >> +}
> >> +
> >> +static void verify_res_buf(char *res_buf)
> >> +{
> >> +       int i;
> >> +
> >> +       for (i = ETHER_HDR_LEN; i < TEST_BUF_LEN; i++)
> >> +               assert(res_buf[i] == (char)i);
> >> +}
> >> +
> >> +static void run_tx_test(struct vdev_info *dev, struct vq_info *vq,
> >> +                       bool delayed, int batch, int bufs)
> >
> > It might be better to describe the test design briefly above as a
> > comment. Or we can start from simple test logic and add sophisticated
> > ones on top.
>
> Does a description like the one below, added to the commit log as you
> suggested, make sense to you?
> Steps for vhost_net tx testing:
> 1. Prepare an out buf
> 2. Kick the vhost_net to do tx processing
> 3. Do the receiving on the tun side
> 4. Verify the data received by tun is correct

See my reply above.

>
> >
> >> +{
> >> +       const bool random_batch = batch == RANDOM_BATCH;
> >> +       long long spurious = 0;
> >> +       struct scatterlist sl;
> >> +       unsigned int len;
> >> +       int r;
> >> +
> >> +       for (;;) {
> >> +               long started_before = vq->started;
> >> +               long completed_before = vq->completed;
> >> +
> >> +               virtqueue_disable_cb(vq->vq);
> >> +               do {
> >> +                       if (random_batch)
> >> +                               batch = (random() % vq->vring.num) + 1;
> >> +
> >> +                       while (vq->started < bufs &&
> >> +                              (vq->started - vq->completed) < batch) {
> >> +                               sg_init_one(&sl, dev->test_buf, HDR_LEN + TEST_BUF_LEN);
> >> +                               r = virtqueue_add_outbuf(vq->vq, &sl, 1,
> >> +                                                        dev->test_buf + vq->started,
> >> +                                                        GFP_ATOMIC);
> >> +                               if (unlikely(r != 0)) {
> >> +                                       if (r == -ENOSPC &&
> >> +                                           vq->started > started_before)
> >> +                                               r = 0;
> >> +                                       else
> >> +                                               r = -1;
> >> +                                       break;
> >> +                               }
> >> +
> >> +                               ++vq->started;
> >> +
> >> +                               if (unlikely(!virtqueue_kick(vq->vq))) {
> >> +                                       r = -1;
> >> +                                       break;
> >> +                               }
> >> +                       }
> >> +
> >> +                       if (vq->started >= bufs)
> >> +                               r = -1;
> >> +
> >> +                       /* Flush out completed bufs if any */
> >> +                       while (virtqueue_get_buf(vq->vq, &len)) {
> >> +                               int n;
> >> +
> >> +                               n = recvfrom(dev->sock, dev->res_buf, TEST_BUF_LEN, 0, NULL, NULL);
> >> +                               assert(n == TEST_BUF_LEN);
> >> +                               verify_res_buf(dev->res_buf);
> >> +
> >> +                               ++vq->completed;
> >> +                               r = 0;
> >> +                       }
> >> +               } while (r == 0);
> >> +
> >> +               if (vq->completed == completed_before && vq->started == started_before)
> >> +                       ++spurious;
> >> +
> >> +               assert(vq->completed <= bufs);
> >> +               assert(vq->started <= bufs);
> >> +               if (vq->completed == bufs)
> >> +                       break;
> >> +
> >> +               if (delayed) {
> >> +                       if (virtqueue_enable_cb_delayed(vq->vq))
> >> +                               wait_for_interrupt(vq);
> >> +               } else {
> >> +                       if (virtqueue_enable_cb(vq->vq))
> >> +                               wait_for_interrupt(vq);
> >> +               }
> >> +       }
> >> +       printf("TX spurious wakeups: 0x%llx started=0x%lx completed=0x%lx\n",
> >> +              spurious, vq->started, vq->completed);
> >> +}
> >> +
> >> +static void run_rx_test(struct vdev_info *dev, struct vq_info *vq,
> >> +                       bool delayed, int batch, int bufs)
> >> +{
> >> +       const bool random_batch = batch == RANDOM_BATCH;
> >> +       long long spurious = 0;
> >> +       struct scatterlist sl;
> >> +       unsigned int len;
> >> +       int r;
> >> +
> >> +       for (;;) {
> >> +               long started_before = vq->started;
> >> +               long completed_before = vq->completed;
> >> +
> >> +               do {
> >> +                       if (random_batch)
> >> +                               batch = (random() % vq->vring.num) + 1;
> >> +
> >> +                       while (vq->started < bufs &&
> >> +                              (vq->started - vq->completed) < batch) {
> >> +                               sg_init_one(&sl, dev->res_buf, HDR_LEN + TEST_BUF_LEN);
> >> +
> >> +                               r = virtqueue_add_inbuf(vq->vq, &sl, 1,
> >> +                                                       dev->res_buf + vq->started,
> >> +                                                       GFP_ATOMIC);
> >> +                               if (unlikely(r != 0)) {
> >> +                                       if (r == -ENOSPC &&
> >
> > Drivers usually maintain a #free_slots; this can help to avoid the
> > trick of checking for ENOSPC?
>
> The above "(vq->started - vq->completed) < batch" seems to ensure that
> the 'r' can't be '-ENOSPC'?

Well, if this is true, is there any reason we still check for ENOSPC here?

> We just need to ensure the batch <= desc_num,
> and the 'r == -ENOSPC' checking seems to be unnecessary.
>
> >
> >> +                                           vq->started > started_before)
> >> +                                               r = 0;
> >> +                                       else
> >> +                                               r = -1;
> >> +                                       break;
> >> +                               }
> >> +
> >> +                               ++vq->started;
> >> +
> >> +                               vdev_send_packet(dev);
> >> +
> >> +                               if (unlikely(!virtqueue_kick(vq->vq))) {
> >> +                                       r = -1;
> >> +                                       break;
> >> +                               }
> >> +                       }
> >> +
> >> +                       if (vq->started >= bufs)
> >> +                               r = -1;
> >> +
> >> +                       /* Flush out completed bufs if any */
> >> +                       while (virtqueue_get_buf(vq->vq, &len)) {
> >> +                               struct ether_header *eh;
> >> +
> >> +                               eh = (struct ether_header *)(dev->res_buf + HDR_LEN);
> >> +
> >> +                               /* tun netdev is up and running, ignore the
> >> +                                * non-TEST_PTYPE packet.
> >> +                                */
> >
> > I wonder if it's better to set up some kind of qdisc to avoid the
> > unexpected packet here, or is it too complicated?
>
> Yes, at least I don't know how to do that yet.

For example, the blackhole qdisc?

Thanks


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test
  2024-02-04  1:30       ` Jason Wang
@ 2024-02-04  3:50         ` Yunsheng Lin
  2024-02-05  1:43           ` Jason Wang
  0 siblings, 1 reply; 15+ messages in thread
From: Yunsheng Lin @ 2024-02-04  3:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
	Xuan Zhuo, virtualization

On 2024/2/4 9:30, Jason Wang wrote:
> On Fri, Feb 2, 2024 at 8:24 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2024/2/2 12:05, Jason Wang wrote:
>>> On Tue, Jan 30, 2024 at 7:38 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>
>>>> introduce vhost_net_test based on virtio_test to test
>>>> vhost_net changes in the kernel.
>>>
>>> Let's describe what kind of test is being done and how it is done here.
>>
>> How about something like below:
>>
>> This patch introduces testing for both vhost_net tx and rx.
>> Steps for vhost_net tx testing:
>> 1. Prepare an out buf
>> 2. Kick the vhost_net to do tx processing
>> 3. Do the receiving on the tun side
>> 4. Verify the data received by tun is correct
>>
>> Steps for vhost_net rx testing:
>> 1. Prepare an in buf
>> 2. Do the sending on the tun side
>> 3. Kick the vhost_net to do rx processing
>> 4. Verify the data received by vhost_net is correct
> 
> It looks like some important details were lost, e.g. the logic for batching, etc.

I am supposing you are referring to the virtio desc batch handling,
right?

That code was copied from virtio_test.c. I was thinking about removing
the virtio desc batch handling for now, as this patchset does not need
it for the testing; the testing mainly depends on "sock->sk->sk_sndbuf"
being INT_MAX so that vhost_net_build_xdp() is called, which seems to
be the default case for vhost_net.

> 
>>

...

>>>> +static void vdev_create_socket(struct vdev_info *dev)
>>>> +{
>>>> +       struct ifreq ifr;
>>>> +
>>>> +       dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE));
>>>> +       assert(dev->sock != -1);
>>>> +
>>>> +       snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
>>>
>>> Nit: it might be better to accept the device name instead of repeating
>>> the snprintf trick here; this would facilitate future changes.
>>
>> I am not sure I understand what you meant by "accept the device name"
>> here.
>>
>> The above is used to get the ifindex of the tun netdevice created in
>> tun_alloc(), so that we can use it in vdev_send_packet() to send
>> a packet through that netdevice. Is there anything obvious I missed
>> here?
> 
> I meant a const char *ifname for this function, letting the caller
> pass the name.

Sure.

> 
>>

>>>> +
>>>> +static void run_rx_test(struct vdev_info *dev, struct vq_info *vq,
>>>> +                       bool delayed, int batch, int bufs)
>>>> +{
>>>> +       const bool random_batch = batch == RANDOM_BATCH;
>>>> +       long long spurious = 0;
>>>> +       struct scatterlist sl;
>>>> +       unsigned int len;
>>>> +       int r;
>>>> +
>>>> +       for (;;) {
>>>> +               long started_before = vq->started;
>>>> +               long completed_before = vq->completed;
>>>> +
>>>> +               do {
>>>> +                       if (random_batch)
>>>> +                               batch = (random() % vq->vring.num) + 1;
>>>> +
>>>> +                       while (vq->started < bufs &&
>>>> +                              (vq->started - vq->completed) < batch) {
>>>> +                               sg_init_one(&sl, dev->res_buf, HDR_LEN + TEST_BUF_LEN);
>>>> +
>>>> +                               r = virtqueue_add_inbuf(vq->vq, &sl, 1,
>>>> +                                                       dev->res_buf + vq->started,
>>>> +                                                       GFP_ATOMIC);
>>>> +                               if (unlikely(r != 0)) {
>>>> +                                       if (r == -ENOSPC &&
>>>
>>> Drivers usually maintain a #free_slots; this can help to avoid the
>>> trick of checking for ENOSPC?
>>
>> The above "(vq->started - vq->completed) < batch" seems to ensure that
>> the 'r' can't be '-ENOSPC'?
> 
> Well, if this is true, is there any reason we still check for ENOSPC here?

As mentioned above, it was copied from virtio_test.c.
Will remove the 'r == -ENOSPC' checking.

> 
>> We just need to ensure the batch <= desc_num,
>> and the 'r == -ENOSPC' checking seems to be unnecessary.
>>
>>>
>>>> +                                           vq->started > started_before)
>>>> +                                               r = 0;
>>>> +                                       else
>>>> +                                               r = -1;
>>>> +                                       break;
>>>> +                               }
>>>> +
>>>> +                               ++vq->started;
>>>> +
>>>> +                               vdev_send_packet(dev);
>>>> +
>>>> +                               if (unlikely(!virtqueue_kick(vq->vq))) {
>>>> +                                       r = -1;
>>>> +                                       break;
>>>> +                               }
>>>> +                       }
>>>> +
>>>> +                       if (vq->started >= bufs)
>>>> +                               r = -1;
>>>> +
>>>> +                       /* Flush out completed bufs if any */
>>>> +                       while (virtqueue_get_buf(vq->vq, &len)) {
>>>> +                               struct ether_header *eh;
>>>> +
>>>> +                               eh = (struct ether_header *)(dev->res_buf + HDR_LEN);
>>>> +
>>>> +                               /* tun netdev is up and running, ignore the
>>>> +                                * non-TEST_PTYPE packet.
>>>> +                                */
>>>
>>> I wonder if it's better to set up some kind of qdisc to avoid the
>>> unexpected packet here, or is it too complicated?
>>
>> Yes, at least I don't know how to do that yet.
> 
> For example, the blackhole qdisc?

It seems blackhole_enqueue() just drops everything, which includes
the packet sent using the raw socket in vdev_send_packet()?

We may bypass the qdisc for the raw socket in vdev_send_packet(), but
that means other senders may bypass the qdisc too, and even cook up a
packet with ethertype ETH_P_LOOPBACK; that is part of the reason I
added a simple payload verification in verify_res_buf().

> 
> Thanks
> 
> .
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test
  2024-02-04  3:50         ` Yunsheng Lin
@ 2024-02-05  1:43           ` Jason Wang
  0 siblings, 0 replies; 15+ messages in thread
From: Jason Wang @ 2024-02-05  1:43 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Michael S. Tsirkin,
	Xuan Zhuo, virtualization

On Sun, Feb 4, 2024 at 11:50 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/2/4 9:30, Jason Wang wrote:
> > On Fri, Feb 2, 2024 at 8:24 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> On 2024/2/2 12:05, Jason Wang wrote:
> >>> On Tue, Jan 30, 2024 at 7:38 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>>>
> >>>> introduce vhost_net_test based on virtio_test to test
> >>>> vhost_net changes in the kernel.
> >>>
> >>> Let's describe what kind of test is being done and how it is done here.
> >>
> >> How about something like below:
> >>
> >> This patch introduces testing for both vhost_net tx and rx.
> >> Steps for vhost_net tx testing:
> >> 1. Prepare an out buf
> >> 2. Kick the vhost_net to do tx processing
> >> 3. Do the receiving on the tun side
> >> 4. Verify the data received by tun is correct
> >>
> >> Steps for vhost_net rx testing:
> >> 1. Prepare an in buf
> >> 2. Do the sending on the tun side
> >> 3. Kick the vhost_net to do rx processing
> >> 4. Verify the data received by vhost_net is correct
> >
> > It looks like some important details were lost, e.g. the logic for batching, etc.
>
> I am supposing you are referring to the virtio desc batch handling,
> right?

Yes.

>
> That code was copied from virtio_test.c. I was thinking about removing
> the virtio desc batch handling for now, as this patchset does not need
> it for the testing; the testing mainly depends on "sock->sk->sk_sndbuf"
> being INT_MAX so that vhost_net_build_xdp() is called, which seems to
> be the default case for vhost_net.

Ok.

>
> >
> >>
>
> ...
>
> >>>> +static void vdev_create_socket(struct vdev_info *dev)
> >>>> +{
> >>>> +       struct ifreq ifr;
> >>>> +
> >>>> +       dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE));
> >>>> +       assert(dev->sock != -1);
> >>>> +
> >>>> +       snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
> >>>
> >>> Nit: it might be better to accept the device name instead of repeating
> >>> the snprintf trick here; this would facilitate future changes.
> >>
> >> I am not sure I understand what you meant by "accept the device name"
> >> here.
> >>
> >> The above is used to get the ifindex of the tun netdevice created in
> >> tun_alloc(), so that we can use it in vdev_send_packet() to send
> >> a packet through that netdevice. Is there anything obvious I missed
> >> here?
> >
> > I meant a const char *ifname for this function, letting the caller
> > pass the name.
>
> Sure.
>
> >
> >>
>
> >>>> +
> >>>> +static void run_rx_test(struct vdev_info *dev, struct vq_info *vq,
> >>>> +                       bool delayed, int batch, int bufs)
> >>>> +{
> >>>> +       const bool random_batch = batch == RANDOM_BATCH;
> >>>> +       long long spurious = 0;
> >>>> +       struct scatterlist sl;
> >>>> +       unsigned int len;
> >>>> +       int r;
> >>>> +
> >>>> +       for (;;) {
> >>>> +               long started_before = vq->started;
> >>>> +               long completed_before = vq->completed;
> >>>> +
> >>>> +               do {
> >>>> +                       if (random_batch)
> >>>> +                               batch = (random() % vq->vring.num) + 1;
> >>>> +
> >>>> +                       while (vq->started < bufs &&
> >>>> +                              (vq->started - vq->completed) < batch) {
> >>>> +                               sg_init_one(&sl, dev->res_buf, HDR_LEN + TEST_BUF_LEN);
> >>>> +
> >>>> +                               r = virtqueue_add_inbuf(vq->vq, &sl, 1,
> >>>> +                                                       dev->res_buf + vq->started,
> >>>> +                                                       GFP_ATOMIC);
> >>>> +                               if (unlikely(r != 0)) {
> >>>> +                                       if (r == -ENOSPC &&
> >>>
> >>> Drivers usually maintain a #free_slots; this can help to avoid the
> >>> trick of checking for ENOSPC?
> >>
> >> The above "(vq->started - vq->completed) < batch" seems to ensure that
> >> the 'r' can't be '-ENOSPC'?
> >
> > Well, if this is true, is there any reason we still check for ENOSPC here?
>
> As mentioned above, it was copied from virtio_test.c.
> Will remove the 'r == -ENOSPC' checking.

I see.

>
> >
> >> We just need to ensure the batch <= desc_num,
> >> and the 'r == -ENOSPC' checking seems to be unnecessary.
> >>
> >>>
> >>>> +                                           vq->started > started_before)
> >>>> +                                               r = 0;
> >>>> +                                       else
> >>>> +                                               r = -1;
> >>>> +                                       break;
> >>>> +                               }
> >>>> +
> >>>> +                               ++vq->started;
> >>>> +
> >>>> +                               vdev_send_packet(dev);
> >>>> +
> >>>> +                               if (unlikely(!virtqueue_kick(vq->vq))) {
> >>>> +                                       r = -1;
> >>>> +                                       break;
> >>>> +                               }
> >>>> +                       }
> >>>> +
> >>>> +                       if (vq->started >= bufs)
> >>>> +                               r = -1;
> >>>> +
> >>>> +                       /* Flush out completed bufs if any */
> >>>> +                       while (virtqueue_get_buf(vq->vq, &len)) {
> >>>> +                               struct ether_header *eh;
> >>>> +
> >>>> +                               eh = (struct ether_header *)(dev->res_buf + HDR_LEN);
> >>>> +
> >>>> +                               /* tun netdev is up and running, ignore the
> >>>> +                                * non-TEST_PTYPE packet.
> >>>> +                                */
> >>>
> >>> I wonder if it's better to set up some kind of qdisc to avoid the
> >>> unexpected packet here, or is it too complicated?
> >>
> >> Yes, at least I don't know how to do that yet.
> >
> > For example, the blackhole qdisc?
>
> It seems blackhole_enqueue() just drops everything, which includes
> the packet sent using the raw socket in vdev_send_packet()?

I vaguely remember there's a mode where af_packet will bypass the qdisc,
but I might be wrong.

But rethinking this: though it would simplify the code, it introduces
unnecessary dependencies like the blackhole qdisc, which seems
suboptimal.

>
> We may bypass the qdisc for the raw socket in vdev_send_packet(), but
> that means other senders may bypass the qdisc too, and even cook up a
> packet with ethertype ETH_P_LOOPBACK; that is part of the reason I
> added a simple payload verification in verify_res_buf().

Fine.

Thanks

>
> >
> > Thanks
> >
> > .
> >
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2024-02-05  1:43 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-30 11:37 [PATCH net-next v4 0/5] remove page frag implementation in vhost_net Yunsheng Lin
2024-01-30 11:37 ` [PATCH net-next v4 1/5] mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument Yunsheng Lin
2024-01-30 11:37 ` [PATCH net-next v4 2/5] page_frag: unify gfp bits for order 3 page allocation Yunsheng Lin
2024-02-01 13:16   ` Paolo Abeni
2024-02-02  2:10     ` Yunsheng Lin
2024-02-02  8:36       ` Paolo Abeni
2024-02-02 12:26         ` Yunsheng Lin
2024-01-30 11:37 ` [PATCH net-next v4 3/5] net: introduce page_frag_cache_drain() Yunsheng Lin
2024-01-30 11:37 ` [PATCH net-next v4 4/5] vhost/net: remove vhost_net_page_frag_refill() Yunsheng Lin
2024-01-30 11:37 ` [PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test Yunsheng Lin
2024-02-02  4:05   ` Jason Wang
2024-02-02 12:24     ` Yunsheng Lin
2024-02-04  1:30       ` Jason Wang
2024-02-04  3:50         ` Yunsheng Lin
2024-02-05  1:43           ` Jason Wang
