* [PATCH v3 0/3] virtio support cache indirect desc
@ 2021-10-29  6:28 ` Xuan Zhuo
  0 siblings, 0 replies; 14+ messages in thread
From: Xuan Zhuo @ 2021-10-29  6:28 UTC (permalink / raw)
  To: virtualization, netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski

If VIRTIO_RING_F_INDIRECT_DESC negotiation succeeds and the number of
sgs used for sending packets is greater than 1, we must constantly
call __kmalloc/kfree to allocate/release indirect descriptors.

In the case of extremely fast packet transmission, the overhead cannot
be ignored:

  27.46%  [kernel]  [k] virtqueue_add
  16.66%  [kernel]  [k] detach_buf_split
  16.51%  [kernel]  [k] virtnet_xsk_xmit
  14.04%  [kernel]  [k] virtqueue_add_outbuf
   5.18%  [kernel]  [k] __kmalloc
   4.08%  [kernel]  [k] kfree
   2.80%  [kernel]  [k] virtqueue_get_buf_ctx
   2.22%  [kernel]  [k] xsk_tx_peek_desc
   2.08%  [kernel]  [k] memset_erms
   0.83%  [kernel]  [k] virtqueue_kick_prepare
   0.76%  [kernel]  [k] virtnet_xsk_run
   0.62%  [kernel]  [k] __free_old_xmit_ptr
   0.60%  [kernel]  [k] vring_map_one_sg
   0.53%  [kernel]  [k] native_apic_mem_write
   0.46%  [kernel]  [k] sg_next
   0.43%  [kernel]  [k] sg_init_table
   0.41%  [kernel]  [k] kmalloc_slab

This patch set adds a descriptor cache to virtio so that these indirect
descriptor arrays are reused instead of being constantly allocated and
released.
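
The mechanism, in rough outline (a simplified sketch of the approach; the
real code is in patch 1, and the names here are only illustrative):

#include <linux/list.h>
#include <linux/slab.h>

/* Each virtqueue pre-allocates one fixed-size desc array per ring entry
 * and keeps the free arrays on a list, reusing the array memory itself
 * as the list node.
 */
struct desc_cache {
	struct list_head free;	/* free fixed-size desc arrays */
	void *array;		/* backing allocation */
	u32 thr;		/* arrays hold up to thr descs; 0 = disabled */
};

static void *desc_cache_get(struct desc_cache *c, int size, int n, gfp_t gfp)
{
	struct list_head *node;

	if (n > c->thr || list_empty(&c->free))
		return kmalloc_array(n, size, gfp);	/* fall back to kmalloc */

	node = c->free.next;
	list_del(node);
	return node;		/* hand out a cached array */
}

static void desc_cache_put(struct desc_cache *c, void *desc, int n)
{
	if (n <= c->thr)
		list_add((struct list_head *)desc, &c->free);	/* recycle */
	else
		kfree(desc);
}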

v3:
  pre-allocate per buffer indirect descriptors array

v2:
  use struct list_head to cache the desc

*** BLURB HERE ***

Xuan Zhuo (3):
  virtio: cache indirect desc for split
  virtio: cache indirect desc for packed
  virtio-net: enable virtio desc cache

 drivers/net/virtio_net.c     |  11 +++
 drivers/virtio/virtio.c      |   6 ++
 drivers/virtio/virtio_ring.c | 131 ++++++++++++++++++++++++++++++-----
 include/linux/virtio.h       |  14 ++++
 4 files changed, 145 insertions(+), 17 deletions(-)

--
2.31.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v3 1/3] virtio: cache indirect desc for split
  2021-10-29  6:28 ` Xuan Zhuo
@ 2021-10-29  6:28   ` Xuan Zhuo
  -1 siblings, 0 replies; 14+ messages in thread
From: Xuan Zhuo @ 2021-10-29  6:28 UTC (permalink / raw)
  To: virtualization, netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski

When indirect descriptors are used, an indirect desc array must be
allocated and released for every buffer, which adds a lot of CPU overhead.

Here, a cache is added for indirect descriptors. If the number of indirect
descs requested for a buffer is no more than desc_cache_thr, a fixed-size
desc array of desc_cache_thr entries is taken from a per-virtqueue cache
and later returned to it for reuse.
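
For reference, a driver opts in before its virtqueues are created, roughly
like this (an illustrative sketch only; foo_probe and the callbacks are
made-up placeholders):

static void foo_rx_done(struct virtqueue *vq);	/* placeholder callbacks */
static void foo_tx_done(struct virtqueue *vq);

static int foo_probe(struct virtio_device *vdev)
{
	static const char * const names[] = { "rx", "tx" };
	vq_callback_t *callbacks[] = { foo_rx_done, foo_tx_done };
	struct virtqueue *vqs[2];

	/* Cache indirect arrays of up to 4 descs per buffer; 0 disables
	 * the cache. Must be set before find_vqs() so every vq can
	 * pre-allocate its arrays.
	 */
	virtio_set_desc_cache(vdev, 4);

	return virtio_find_vqs(vdev, 2, vqs, callbacks, names, NULL);
}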

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/virtio/virtio.c      |   6 ++
 drivers/virtio/virtio_ring.c | 105 ++++++++++++++++++++++++++++++++---
 include/linux/virtio.h       |  14 +++++
 3 files changed, 117 insertions(+), 8 deletions(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 0a5b54034d4b..1047149ac2a4 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -431,6 +431,12 @@ bool is_virtio_device(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(is_virtio_device);
 
+void virtio_set_desc_cache(struct virtio_device *dev, u32 thr)
+{
+	dev->desc_cache_thr = thr;
+}
+EXPORT_SYMBOL_GPL(virtio_set_desc_cache);
+
 void unregister_virtio_device(struct virtio_device *dev)
 {
 	int index = dev->index; /* save for after device release */
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index dd95dfd85e98..010c47baa37f 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -85,6 +85,19 @@ struct vring_desc_extra {
 	u16 next;			/* The next desc state in a list. */
 };
 
+struct vring_desc_cache {
+	/* desc cache chain */
+	struct list_head list;
+
+	void *array;
+
+	/* desc cache threshold
+	 *    0   - disable desc cache
+	 *    > 0 - enable desc cache. As the threshold of the desc cache.
+	 */
+	u32 thr;
+};
+
 struct vring_virtqueue {
 	struct virtqueue vq;
 
@@ -117,6 +130,8 @@ struct vring_virtqueue {
 	/* Hint for event idx: already triggered no need to disable. */
 	bool event_triggered;
 
+	struct vring_desc_cache desc_cache;
+
 	union {
 		/* Available for split ring */
 		struct {
@@ -423,7 +438,76 @@ static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
 	return extra[i].next;
 }
 
-static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
+static void desc_cache_free(struct vring_virtqueue *vq)
+{
+	kfree(vq->desc_cache.array);
+}
+
+static void desc_cache_create(struct vring_virtqueue *vq,
+			      struct virtio_device *vdev, int size, int num)
+{
+	struct list_head *node;
+	int i;
+
+	BUILD_BUG_ON(sizeof(struct list_head) > sizeof(struct vring_desc));
+	BUILD_BUG_ON(sizeof(struct list_head) > sizeof(struct vring_packed_desc));
+
+	vq->desc_cache.array = NULL;
+	vq->desc_cache.thr = vdev->desc_cache_thr;
+
+	INIT_LIST_HEAD(&vq->desc_cache.list);
+
+	if (!vq->desc_cache.thr)
+		return;
+
+	size = size * vq->desc_cache.thr;
+
+	vq->desc_cache.array = kmalloc_array(num, size, GFP_KERNEL);
+	if (!vq->desc_cache.array) {
+		vq->desc_cache.thr = 0;
+		dev_warn(&vdev->dev, "queue[%d]: failed to allocate desc cache, disabling it\n",
+			 vq->vq.index);
+		return;
+	}
+
+	for (i = 0; i < num; ++i) {
+		node = vq->desc_cache.array + (i * size);
+		list_add(node, &vq->desc_cache.list);
+	}
+}
+
+static void __desc_cache_put(struct vring_virtqueue *vq,
+			     struct list_head *node, int n)
+{
+	if (n <= vq->desc_cache.thr)
+		list_add(node, &vq->desc_cache.list);
+	else
+		kfree(node);
+}
+
+#define desc_cache_put(vq, desc, n) \
+	__desc_cache_put(vq, (struct list_head *)desc, n)
+
+static void *desc_cache_get(struct vring_virtqueue *vq,
+			    int size, int n, gfp_t gfp)
+{
+	struct list_head *node;
+
+	if (n > vq->desc_cache.thr)
+		return kmalloc_array(n, size, gfp);
+
+	node = vq->desc_cache.list.next;
+	list_del(node);
+	return node;
+}
+
+#define _desc_cache_get(vq, n, gfp, tp) \
+	((tp *)desc_cache_get(vq, (sizeof(tp)), n, gfp))
+
+#define desc_cache_get_split(vq, n, gfp) \
+	_desc_cache_get(vq, n, gfp, struct vring_desc)
+
+static struct vring_desc *alloc_indirect_split(struct vring_virtqueue *vq,
 					       unsigned int total_sg,
 					       gfp_t gfp)
 {
@@ -437,12 +521,12 @@ static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
 	 */
 	gfp &= ~__GFP_HIGHMEM;
 
-	desc = kmalloc_array(total_sg, sizeof(struct vring_desc), gfp);
+	desc = desc_cache_get_split(vq, total_sg, gfp);
 	if (!desc)
 		return NULL;
 
 	for (i = 0; i < total_sg; i++)
-		desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
+		desc[i].next = cpu_to_virtio16(vq->vq.vdev, i + 1);
 	return desc;
 }
 
@@ -508,7 +592,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 	head = vq->free_head;
 
 	if (virtqueue_use_indirect(_vq, total_sg))
-		desc = alloc_indirect_split(_vq, total_sg, gfp);
+		desc = alloc_indirect_split(vq, total_sg, gfp);
 	else {
 		desc = NULL;
 		WARN_ON_ONCE(total_sg > vq->split.vring.num && !vq->indirect);
@@ -652,7 +736,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 	}
 
 	if (indirect)
-		kfree(desc);
+		desc_cache_put(vq, desc, total_sg);
 
 	END_USE(vq);
 	return -ENOMEM;
@@ -717,7 +801,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
 	if (vq->indirect) {
 		struct vring_desc *indir_desc =
 				vq->split.desc_state[head].indir_desc;
-		u32 len;
+		u32 len, n;
 
 		/* Free the indirect table, if any, now that it's unmapped. */
 		if (!indir_desc)
@@ -729,10 +813,12 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
 				VRING_DESC_F_INDIRECT));
 		BUG_ON(len == 0 || len % sizeof(struct vring_desc));
 
-		for (j = 0; j < len / sizeof(struct vring_desc); j++)
+		n = len / sizeof(struct vring_desc);
+
+		for (j = 0; j < n; j++)
 			vring_unmap_one_split_indirect(vq, &indir_desc[j]);
 
-		kfree(indir_desc);
+		desc_cache_put(vq, indir_desc, n);
 		vq->split.desc_state[head].indir_desc = NULL;
 	} else if (ctx) {
 		*ctx = vq->split.desc_state[head].indir_desc;
@@ -2200,6 +2286,8 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
 		!context;
 	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
 
+	desc_cache_create(vq, vdev, sizeof(struct vring_desc), vring.num);
+
 	if (virtio_has_feature(vdev, VIRTIO_F_ORDER_PLATFORM))
 		vq->weak_barriers = false;
 
@@ -2329,6 +2417,7 @@ void vring_del_virtqueue(struct virtqueue *_vq)
 	if (!vq->packed_ring) {
 		kfree(vq->split.desc_state);
 		kfree(vq->split.desc_extra);
+		desc_cache_free(vq);
 	}
 	kfree(vq);
 }
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 41edbc01ffa4..bda6f9853e97 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -118,6 +118,7 @@ struct virtio_device {
 	struct list_head vqs;
 	u64 features;
 	void *priv;
+	u32 desc_cache_thr;
 };
 
 static inline struct virtio_device *dev_to_virtio(struct device *_dev)
@@ -130,6 +131,19 @@ int register_virtio_device(struct virtio_device *dev);
 void unregister_virtio_device(struct virtio_device *dev);
 bool is_virtio_device(struct device *dev);
 
+/**
+ * virtio_set_desc_cache - set virtio ring desc cache threshold
+ *
+ * virtio will cache the allocated indirect desc.
+ *
+ * This function must be called before find_vqs.
+ *
+ * @thr:
+ *    0   - disable desc cache
+ *    > 0 - enable desc cache. As the threshold of the desc cache.
+ */
+void virtio_set_desc_cache(struct virtio_device *dev, u32 thr);
+
 void virtio_break_device(struct virtio_device *dev);
 
 void virtio_config_changed(struct virtio_device *dev);
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 2/3] virtio: cache indirect desc for packed
  2021-10-29  6:28 ` Xuan Zhuo
@ 2021-10-29  6:28   ` Xuan Zhuo
  -1 siblings, 0 replies; 14+ messages in thread
From: Xuan Zhuo @ 2021-10-29  6:28 UTC (permalink / raw)
  To: virtualization, netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski

When indirect descriptors are used, an indirect desc array must be
allocated and released for every buffer, which adds a lot of CPU overhead.

Here, a cache is added for indirect descriptors. If the number of indirect
descs requested for a buffer is no more than desc_cache_thr, a fixed-size
desc array of desc_cache_thr entries is taken from a per-virtqueue cache
and later returned to it for reuse.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/virtio/virtio_ring.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 010c47baa37f..d853a281fb43 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1118,7 +1118,11 @@ static void vring_unmap_desc_packed(const struct vring_virtqueue *vq,
 	}
 }
 
-static struct vring_packed_desc *alloc_indirect_packed(unsigned int total_sg,
+#define desc_cache_get_packed(vq, n, gfp) \
+	_desc_cache_get(vq, n, gfp, struct vring_packed_desc)
+
+static struct vring_packed_desc *alloc_indirect_packed(struct vring_virtqueue *vq,
+						       unsigned int total_sg,
 						       gfp_t gfp)
 {
 	struct vring_packed_desc *desc;
@@ -1130,7 +1134,7 @@ static struct vring_packed_desc *alloc_indirect_packed(unsigned int total_sg,
 	 */
 	gfp &= ~__GFP_HIGHMEM;
 
-	desc = kmalloc_array(total_sg, sizeof(struct vring_packed_desc), gfp);
+	desc = desc_cache_get_packed(vq, total_sg, gfp);
 
 	return desc;
 }
@@ -1150,7 +1154,7 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
 	dma_addr_t addr;
 
 	head = vq->packed.next_avail_idx;
-	desc = alloc_indirect_packed(total_sg, gfp);
+	desc = alloc_indirect_packed(vq, total_sg, gfp);
 
 	if (unlikely(vq->vq.num_free < 1)) {
 		pr_debug("Can't add buf len 1 - avail = 0\n");
@@ -1241,7 +1245,7 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
 	for (i = 0; i < err_idx; i++)
 		vring_unmap_desc_packed(vq, &desc[i]);
 
-	kfree(desc);
+	desc_cache_put(vq, desc, total_sg);
 
 	END_USE(vq);
 	return -ENOMEM;
@@ -1466,20 +1470,22 @@ static void detach_buf_packed(struct vring_virtqueue *vq,
 	}
 
 	if (vq->indirect) {
-		u32 len;
+		u32 len, n;
 
 		/* Free the indirect table, if any, now that it's unmapped. */
 		desc = state->indir_desc;
 		if (!desc)
 			return;
 
+		len = vq->packed.desc_extra[id].len;
+		n = len / sizeof(struct vring_packed_desc);
+
 		if (vq->use_dma_api) {
-			len = vq->packed.desc_extra[id].len;
-			for (i = 0; i < len / sizeof(struct vring_packed_desc);
-					i++)
+			for (i = 0; i < n; i++)
 				vring_unmap_desc_packed(vq, &desc[i]);
 		}
-		kfree(desc);
+
+		desc_cache_put(vq, desc, n);
 		state->indir_desc = NULL;
 	} else if (ctx) {
 		*ctx = state->indir_desc;
@@ -1798,6 +1804,8 @@ static struct virtqueue *vring_create_virtqueue_packed(
 		!context;
 	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
 
+	desc_cache_create(vq, vdev, sizeof(struct vring_packed_desc), num);
+
 	if (virtio_has_feature(vdev, VIRTIO_F_ORDER_PLATFORM))
 		vq->weak_barriers = false;
 
@@ -2417,8 +2425,8 @@ void vring_del_virtqueue(struct virtqueue *_vq)
 	if (!vq->packed_ring) {
 		kfree(vq->split.desc_state);
 		kfree(vq->split.desc_extra);
-		desc_cache_free(vq);
 	}
+	desc_cache_free(vq);
 	kfree(vq);
 }
 EXPORT_SYMBOL_GPL(vring_del_virtqueue);
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 3/3] virtio-net: enable virtio desc cache
  2021-10-29  6:28 ` Xuan Zhuo
@ 2021-10-29  6:28   ` Xuan Zhuo
  -1 siblings, 0 replies; 14+ messages in thread
From: Xuan Zhuo @ 2021-10-29  6:28 UTC (permalink / raw)
  To: virtualization, netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski

If VIRTIO_RING_F_INDIRECT_DESC negotiation succeeds and the number of
sgs used for sending packets is greater than 1, we must constantly
call __kmalloc/kfree to allocate/release indirect descriptors.

In the case of extremely fast packet transmission, the overhead cannot
be ignored:

  27.46%  [kernel]  [k] virtqueue_add
  16.66%  [kernel]  [k] detach_buf_split
  16.51%  [kernel]  [k] virtnet_xsk_xmit
  14.04%  [kernel]  [k] virtqueue_add_outbuf
>  5.18%  [kernel]  [k] __kmalloc
>  4.08%  [kernel]  [k] kfree
   2.80%  [kernel]  [k] virtqueue_get_buf_ctx
   2.22%  [kernel]  [k] xsk_tx_peek_desc
   2.08%  [kernel]  [k] memset_erms
   0.83%  [kernel]  [k] virtqueue_kick_prepare
   0.76%  [kernel]  [k] virtnet_xsk_run
   0.62%  [kernel]  [k] __free_old_xmit_ptr
   0.60%  [kernel]  [k] vring_map_one_sg
   0.53%  [kernel]  [k] native_apic_mem_write
   0.46%  [kernel]  [k] sg_next
   0.43%  [kernel]  [k] sg_init_table
>  0.41%  [kernel]  [k] kmalloc_slab

Compared with not using the virtio indirect desc cache, virtio-net gets
about a 16% performance improvement when the cache is enabled.

In the test case, the CPU that sends packets runs at 100%.
The following are the PPS numbers for the two cases:

    indirect desc cache  | no cache
    3074658              | 2685132
    3111866              | 2666118
    3152527              | 2653632
    3125867              | 2669820
    3027147              | 2644464
    3069211              | 2669777
    3038522              | 2675645
    3034507              | 2671302
    3102257              | 2685504
    3083712              | 2692800
    3051771              | 2676928
    3080684              | 2695040
    3147816              | 2720876
    3123887              | 2705492
    3180963              | 2699520
    3191579              | 2676480
    3161670              | 2686272
    3189768              | 2692588
    3174272              | 2686692
    3143434              | 2682416

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio_net.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 4ad25a8b0870..df2514281847 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -31,6 +31,12 @@ module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 module_param(napi_tx, bool, 0644);
 
+/* Because virtio desc cache will increase memory overhead, users can turn it
+ * off or select an acceptable value. The maximum value is 2 + MAX_SKB_FRAGS.
+ */
+static u32 virtio_desc_cache_thr = 4;
+module_param(virtio_desc_cache_thr, uint, 0644);
+
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
 #define GOOD_COPY_LEN	128
@@ -3214,6 +3220,11 @@ static int virtnet_probe(struct virtio_device *vdev)
 		vi->curr_queue_pairs = num_online_cpus();
 	vi->max_queue_pairs = max_queue_pairs;
 
+	if (virtio_desc_cache_thr > 2 + MAX_SKB_FRAGS)
+		virtio_set_desc_cache(vdev, 2 + MAX_SKB_FRAGS);
+	else
+		virtio_set_desc_cache(vdev, virtio_desc_cache_thr);
+
 	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
 	err = init_vqs(vi);
 	if (err)
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 0/3] virtio support cache indirect desc
  2021-10-29  6:28 ` Xuan Zhuo
@ 2022-01-06 12:28   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2022-01-06 12:28 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: virtualization, netdev, Jason Wang, David S. Miller, Jakub Kicinski

On Fri, Oct 29, 2021 at 02:28:11PM +0800, Xuan Zhuo wrote:
> If the VIRTIO_RING_F_INDIRECT_DESC negotiation succeeds, and the number
> of sgs used for sending packets is greater than 1. We must constantly
> call __kmalloc/kfree to allocate/release desc.


So where is this going? I really like the performance boost. My concern
is that if the guest spans NUMA nodes, then when the handler switches
from one node to another it will keep reusing the cache from the old
node. A bunch of ways were suggested to address this, but even just
making the cache per NUMA node would help.
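
Roughly, the per-node variant would be something like the below (just a
sketch to show the shape of it, not code from this series; names are made
up, and nodes[] would hold nr_node_ids entries filled from node-local
allocations):

/* One free list per NUMA node; get from the node we are running on,
 * put back to the node the array actually lives on.
 */
struct desc_cache_node {
	struct list_head free;
};

static void *desc_cache_get_node(struct desc_cache_node *nodes, u32 thr,
				 int size, int n, gfp_t gfp)
{
	struct list_head *head = &nodes[numa_mem_id()].free;
	struct list_head *entry;

	if (n > thr || list_empty(head))
		return kmalloc_array(n, size, gfp);

	entry = head->next;
	list_del(entry);
	return entry;
}

static void desc_cache_put_node(struct desc_cache_node *nodes, u32 thr,
				void *desc, int n)
{
	int nid = page_to_nid(virt_to_page(desc));

	if (n <= thr)
		list_add((struct list_head *)desc, &nodes[nid].free);
	else
		kfree(desc);
}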


> In the case of extremely fast package delivery, the overhead cannot be
> ignored:
> 
>   27.46%  [kernel]  [k] virtqueue_add
>   16.66%  [kernel]  [k] detach_buf_split
>   16.51%  [kernel]  [k] virtnet_xsk_xmit
>   14.04%  [kernel]  [k] virtqueue_add_outbuf
>    5.18%  [kernel]  [k] __kmalloc
>    4.08%  [kernel]  [k] kfree
>    2.80%  [kernel]  [k] virtqueue_get_buf_ctx
>    2.22%  [kernel]  [k] xsk_tx_peek_desc
>    2.08%  [kernel]  [k] memset_erms
>    0.83%  [kernel]  [k] virtqueue_kick_prepare
>    0.76%  [kernel]  [k] virtnet_xsk_run
>    0.62%  [kernel]  [k] __free_old_xmit_ptr
>    0.60%  [kernel]  [k] vring_map_one_sg
>    0.53%  [kernel]  [k] native_apic_mem_write
>    0.46%  [kernel]  [k] sg_next
>    0.43%  [kernel]  [k] sg_init_table
>    0.41%  [kernel]  [k] kmalloc_slab
> 
> This patch adds a cache function to virtio to cache these allocated indirect
> desc instead of constantly allocating and releasing desc.
> 
> v3:
>   pre-allocate per buffer indirect descriptors array
> 
> v2:
>   use struct list_head to cache the desc
> 
> *** BLURB HERE ***
> 
> Xuan Zhuo (3):
>   virtio: cache indirect desc for split
>   virtio: cache indirect desc for packed
>   virtio-net: enable virtio desc cache
> 
>  drivers/net/virtio_net.c     |  11 +++
>  drivers/virtio/virtio.c      |   6 ++
>  drivers/virtio/virtio_ring.c | 131 ++++++++++++++++++++++++++++++-----
>  include/linux/virtio.h       |  14 ++++
>  4 files changed, 145 insertions(+), 17 deletions(-)
> 
> --
> 2.31.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 0/3] virtio support cache indirect desc
  2022-01-06 12:28   ` Michael S. Tsirkin
  (?)
@ 2022-01-06 12:48   ` Xuan Zhuo
  2022-01-10 13:50       ` Michael S. Tsirkin
  -1 siblings, 1 reply; 14+ messages in thread
From: Xuan Zhuo @ 2022-01-06 12:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, David S. Miller, Jakub Kicinski, virtualization

On Thu, 6 Jan 2022 07:28:31 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Oct 29, 2021 at 02:28:11PM +0800, Xuan Zhuo wrote:
> > If the VIRTIO_RING_F_INDIRECT_DESC negotiation succeeds, and the number
> > of sgs used for sending packets is greater than 1. We must constantly
> > call __kmalloc/kfree to allocate/release desc.
>
>
> So where is this going? I really like the performance boost. My concern
> is that if guest spans NUMA nodes and when handler switches from
> node to another this will keep reusing the cache from
> the old node. A bunch of ways were suggested to address this, but
> even just making the cache per numa node would help.
>

In fact, this is the problem I encountered while implementing AF_XDP (xdp
socket) support in virtio-net. Now that virtqueue reset[0] has been merged
into the virtio spec, I am completing this series of work. My plan is:

1. virtio support advance dma
2. linux kernel/qemu support virtqueue reset
3. virtio-net support AF_XDP
4. virtio support cache indirect desc

[0]: https://github.com/oasis-tcs/virtio-spec/issues/124

Thanks.

>
> > In the case of extremely fast package delivery, the overhead cannot be
> > ignored:
> >
> >   27.46%  [kernel]  [k] virtqueue_add
> >   16.66%  [kernel]  [k] detach_buf_split
> >   16.51%  [kernel]  [k] virtnet_xsk_xmit
> >   14.04%  [kernel]  [k] virtqueue_add_outbuf
> >    5.18%  [kernel]  [k] __kmalloc
> >    4.08%  [kernel]  [k] kfree
> >    2.80%  [kernel]  [k] virtqueue_get_buf_ctx
> >    2.22%  [kernel]  [k] xsk_tx_peek_desc
> >    2.08%  [kernel]  [k] memset_erms
> >    0.83%  [kernel]  [k] virtqueue_kick_prepare
> >    0.76%  [kernel]  [k] virtnet_xsk_run
> >    0.62%  [kernel]  [k] __free_old_xmit_ptr
> >    0.60%  [kernel]  [k] vring_map_one_sg
> >    0.53%  [kernel]  [k] native_apic_mem_write
> >    0.46%  [kernel]  [k] sg_next
> >    0.43%  [kernel]  [k] sg_init_table
> >    0.41%  [kernel]  [k] kmalloc_slab
> >
> > This patch adds a cache function to virtio to cache these allocated indirect
> > desc instead of constantly allocating and releasing desc.
> >
> > v3:
> >   pre-allocate per buffer indirect descriptors array
> >
> > v2:
> >   use struct list_head to cache the desc
> >
> > *** BLURB HERE ***
> >
> > Xuan Zhuo (3):
> >   virtio: cache indirect desc for split
> >   virtio: cache indirect desc for packed
> >   virtio-net: enable virtio desc cache
> >
> >  drivers/net/virtio_net.c     |  11 +++
> >  drivers/virtio/virtio.c      |   6 ++
> >  drivers/virtio/virtio_ring.c | 131 ++++++++++++++++++++++++++++++-----
> >  include/linux/virtio.h       |  14 ++++
> >  4 files changed, 145 insertions(+), 17 deletions(-)
> >
> > --
> > 2.31.0
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 0/3] virtio support cache indirect desc
  2022-01-06 12:48   ` Xuan Zhuo
@ 2022-01-10 13:50       ` Michael S. Tsirkin
  0 siblings, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2022-01-10 13:50 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: virtualization, netdev, Jason Wang, David S. Miller, Jakub Kicinski

On Thu, Jan 06, 2022 at 08:48:59PM +0800, Xuan Zhuo wrote:
> On Thu, 6 Jan 2022 07:28:31 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Oct 29, 2021 at 02:28:11PM +0800, Xuan Zhuo wrote:
> > > If the VIRTIO_RING_F_INDIRECT_DESC negotiation succeeds, and the number
> > > of sgs used for sending packets is greater than 1. We must constantly
> > > call __kmalloc/kfree to allocate/release desc.
> >
> >
> > So where is this going? I really like the performance boost. My concern
> > is that if guest spans NUMA nodes and when handler switches from
> > node to another this will keep reusing the cache from
> > the old node. A bunch of ways were suggested to address this, but
> > even just making the cache per numa node would help.
> >
> 
> In fact, this is the problem I encountered in implementing virtio-net to support
> xdp socket. With virtqueue reset[0] has been merged into virtio spec. I
> am completing this series of work. My plan is:
> 
> 1. virtio support advance dma
> 2. linux kernel/qemu support virtqueue reset
> 3. virtio-net support AF_XDP
> 4. virtio support cache indirect desc
> 
> [0]: https://github.com/oasis-tcs/virtio-spec/issues/124
> 
> Thanks.

OK it's up to you how to prioritize your work.
An idea though: isn't there a way to reduce the use of indirect?
Even with all the caching, it is surely not free.
We made it work better in the past with:

commit e7428e95a06fb516fac1308bd0e176e27c0b9287
    ("virtio-net: put virtio-net header inline with data"). 
and
commit 6ebbc1a6383fe78be3c0961d1475043ac6cc2542
    virtio-net: Set needed_headroom for virtio-net when VIRTIO_F_ANY_LAYOUT is true

can't something similar be done for XDP?
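
If the vnet header could be placed right in front of the frame data, the
way the inline-header trick works for skbs, the whole frame would be a
single sg entry and virtqueue_add_outbuf() would never need an indirect
table. Roughly (an illustrative sketch; the helper is made up):

static int xmit_one_sg(struct virtqueue *vq, void *hdr_plus_data,
		       unsigned int len, void *data)
{
	struct scatterlist sg;

	/* header and payload are contiguous: one sg entry, no indirect */
	sg_init_one(&sg, hdr_plus_data, len);
	return virtqueue_add_outbuf(vq, &sg, 1, data, GFP_ATOMIC);
}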



Another idea is to skip indirect even with s/g, as long as the number of
outstanding entries is small. The difficulty with this approach is that it
has to be tested across a large number of configurations, including
storage, to make sure we don't cause regressions, unless we are very
conservative and only make a small % of entries direct. Will doing that
still help? It looks attractive on paper: if the guest starts outpacing
the host and the ring fills up to more than, say, 10%, then we switch to
allocating indirect entries, which slows the guest down.
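
On paper the check is tiny, something like (an illustrative sketch only;
the 10% threshold and the helper name are made up):

static bool vring_want_indirect(const struct vring_virtqueue *vq,
				unsigned int total_sg,
				unsigned int ring_size)
{
	unsigned int in_flight = ring_size - vq->vq.num_free;

	if (!vq->indirect || total_sg <= 1)
		return false;

	/* ring still mostly empty: use direct descriptors and skip the
	 * indirect allocation entirely
	 */
	if (in_flight * 10 < ring_size && total_sg <= vq->vq.num_free)
		return false;

	return true;
}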



> >
> > > In the case of extremely fast package delivery, the overhead cannot be
> > > ignored:
> > >
> > >   27.46%  [kernel]  [k] virtqueue_add
> > >   16.66%  [kernel]  [k] detach_buf_split
> > >   16.51%  [kernel]  [k] virtnet_xsk_xmit
> > >   14.04%  [kernel]  [k] virtqueue_add_outbuf
> > >    5.18%  [kernel]  [k] __kmalloc
> > >    4.08%  [kernel]  [k] kfree
> > >    2.80%  [kernel]  [k] virtqueue_get_buf_ctx
> > >    2.22%  [kernel]  [k] xsk_tx_peek_desc
> > >    2.08%  [kernel]  [k] memset_erms
> > >    0.83%  [kernel]  [k] virtqueue_kick_prepare
> > >    0.76%  [kernel]  [k] virtnet_xsk_run
> > >    0.62%  [kernel]  [k] __free_old_xmit_ptr
> > >    0.60%  [kernel]  [k] vring_map_one_sg
> > >    0.53%  [kernel]  [k] native_apic_mem_write
> > >    0.46%  [kernel]  [k] sg_next
> > >    0.43%  [kernel]  [k] sg_init_table
> > >    0.41%  [kernel]  [k] kmalloc_slab
> > >
> > > This patch adds a cache function to virtio to cache these allocated indirect
> > > desc instead of constantly allocating and releasing desc.
> > >
> > > v3:
> > >   pre-allocate per buffer indirect descriptors array
> > >
> > > v2:
> > >   use struct list_head to cache the desc
> > >
> > > *** BLURB HERE ***
> > >
> > > Xuan Zhuo (3):
> > >   virtio: cache indirect desc for split
> > >   virtio: cache indirect desc for packed
> > >   virtio-net: enable virtio desc cache
> > >
> > >  drivers/net/virtio_net.c     |  11 +++
> > >  drivers/virtio/virtio.c      |   6 ++
> > >  drivers/virtio/virtio_ring.c | 131 ++++++++++++++++++++++++++++++-----
> > >  include/linux/virtio.h       |  14 ++++
> > >  4 files changed, 145 insertions(+), 17 deletions(-)
> > >
> > > --
> > > 2.31.0
> >


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 0/3] virtio support cache indirect desc
  2022-01-10 13:50       ` Michael S. Tsirkin
  (?)
@ 2022-01-11  2:00       ` Xuan Zhuo
  -1 siblings, 0 replies; 14+ messages in thread
From: Xuan Zhuo @ 2022-01-11  2:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, David S. Miller, Jakub Kicinski, virtualization

On Mon, 10 Jan 2022 08:50:08 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Jan 06, 2022 at 08:48:59PM +0800, Xuan Zhuo wrote:
> > On Thu, 6 Jan 2022 07:28:31 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > On Fri, Oct 29, 2021 at 02:28:11PM +0800, Xuan Zhuo wrote:
> > > > If the VIRTIO_RING_F_INDIRECT_DESC negotiation succeeds, and the number
> > > > of sgs used for sending packets is greater than 1. We must constantly
> > > > call __kmalloc/kfree to allocate/release desc.
> > >
> > >
> > > So where is this going? I really like the performance boost. My concern
> > > is that if guest spans NUMA nodes and when handler switches from
> > > node to another this will keep reusing the cache from
> > > the old node. A bunch of ways were suggested to address this, but
> > > even just making the cache per numa node would help.
> > >
> >
> > In fact, this is the problem I encountered in implementing virtio-net to support
> > xdp socket. With virtqueue reset[0] has been merged into virtio spec. I
> > am completing this series of work. My plan is:
> >
> > 1. virtio support advance dma
> > 2. linux kernel/qemu support virtqueue reset
> > 3. virtio-net support AF_XDP
> > 4. virtio support cache indirect desc
> >
> > [0]: https://github.com/oasis-tcs/virtio-spec/issues/124
> >
> > Thanks.
>
> OK it's up to you how to prioritize your work.
> An idea though: isn't there a way to reduce the use of indirect?
> Even with all the caching, it is surely not free.
> We made it work better in the past with:
>
> commit e7428e95a06fb516fac1308bd0e176e27c0b9287
>     ("virtio-net: put virtio-net header inline with data").
> and
> commit 6ebbc1a6383fe78be3c0961d1475043ac6cc2542
>     virtio-net: Set needed_headroom for virtio-net when VIRTIO_F_ANY_LAYOUT is true
>
> can't something similar be done for XDP?
>
>
>
> Another idea is to skip indirect even with s/g as number of outstanding
> entries is small. The difficulty with this approach is that it has
> to be tested across a large number of configurations, including
> storage to make sure we don't cause regressions, unless we
> are very conservative and only make a small % of entries direct.
> Will doing that still help? It looks attractive on paper:
> if guest starts outpacing host and ring begins to fill
> up to more than say 10% then we switch to allocating indirect
> entries which slows guest down.
>

I see what you mean. I will try to reduce the use of indirect and bring
performance test data.

Thanks for your opinion.

>
>
> > >
> > > > In the case of extremely fast package delivery, the overhead cannot be
> > > > ignored:
> > > >
> > > >   27.46%  [kernel]  [k] virtqueue_add
> > > >   16.66%  [kernel]  [k] detach_buf_split
> > > >   16.51%  [kernel]  [k] virtnet_xsk_xmit
> > > >   14.04%  [kernel]  [k] virtqueue_add_outbuf
> > > >    5.18%  [kernel]  [k] __kmalloc
> > > >    4.08%  [kernel]  [k] kfree
> > > >    2.80%  [kernel]  [k] virtqueue_get_buf_ctx
> > > >    2.22%  [kernel]  [k] xsk_tx_peek_desc
> > > >    2.08%  [kernel]  [k] memset_erms
> > > >    0.83%  [kernel]  [k] virtqueue_kick_prepare
> > > >    0.76%  [kernel]  [k] virtnet_xsk_run
> > > >    0.62%  [kernel]  [k] __free_old_xmit_ptr
> > > >    0.60%  [kernel]  [k] vring_map_one_sg
> > > >    0.53%  [kernel]  [k] native_apic_mem_write
> > > >    0.46%  [kernel]  [k] sg_next
> > > >    0.43%  [kernel]  [k] sg_init_table
> > > >    0.41%  [kernel]  [k] kmalloc_slab
> > > >
> > > > This patch adds a cache function to virtio to cache these allocated indirect
> > > > desc instead of constantly allocating and releasing desc.
> > > >
> > > > v3:
> > > >   pre-allocate per buffer indirect descriptors array
> > > >
> > > > v2:
> > > >   use struct list_head to cache the desc
> > > >
> > > > *** BLURB HERE ***
> > > >
> > > > Xuan Zhuo (3):
> > > >   virtio: cache indirect desc for split
> > > >   virtio: cache indirect desc for packed
> > > >   virtio-net: enable virtio desc cache
> > > >
> > > >  drivers/net/virtio_net.c     |  11 +++
> > > >  drivers/virtio/virtio.c      |   6 ++
> > > >  drivers/virtio/virtio_ring.c | 131 ++++++++++++++++++++++++++++++-----
> > > >  include/linux/virtio.h       |  14 ++++
> > > >  4 files changed, 145 insertions(+), 17 deletions(-)
> > > >
> > > > --
> > > > 2.31.0
> > >
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-01-11  2:03 UTC | newest]

Thread overview: 14+ messages
2021-10-29  6:28 [PATCH v3 0/3] virtio support cache indirect desc Xuan Zhuo
2021-10-29  6:28 ` Xuan Zhuo
2021-10-29  6:28 ` [PATCH v3 1/3] virtio: cache indirect desc for split Xuan Zhuo
2021-10-29  6:28   ` Xuan Zhuo
2021-10-29  6:28 ` [PATCH v3 2/3] virtio: cache indirect desc for packed Xuan Zhuo
2021-10-29  6:28   ` Xuan Zhuo
2021-10-29  6:28 ` [PATCH v3 3/3] virtio-net: enable virtio desc cache Xuan Zhuo
2021-10-29  6:28   ` Xuan Zhuo
2022-01-06 12:28 ` [PATCH v3 0/3] virtio support cache indirect desc Michael S. Tsirkin
2022-01-06 12:28   ` Michael S. Tsirkin
2022-01-06 12:48   ` Xuan Zhuo
2022-01-10 13:50     ` Michael S. Tsirkin
2022-01-10 13:50       ` Michael S. Tsirkin
2022-01-11  2:00       ` Xuan Zhuo
