* [PATCH v4 0/3] virtio support cache indirect desc
@ 2021-11-08 11:49 ` Xuan Zhuo
  0 siblings, 0 replies; 25+ messages in thread
From: Xuan Zhuo @ 2021-11-08 11:49 UTC (permalink / raw)
  To: virtualization, netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski

If VIRTIO_RING_F_INDIRECT_DESC negotiation succeeds and the number of
sgs used for sending packets is greater than 1, we must constantly
call __kmalloc/kfree to allocate/release the indirect descs.

In the case of extremely fast packet delivery, this overhead cannot be
ignored:

  27.46%  [kernel]  [k] virtqueue_add
  16.66%  [kernel]  [k] detach_buf_split
  16.51%  [kernel]  [k] virtnet_xsk_xmit
  14.04%  [kernel]  [k] virtqueue_add_outbuf
   5.18%  [kernel]  [k] __kmalloc
   4.08%  [kernel]  [k] kfree
   2.80%  [kernel]  [k] virtqueue_get_buf_ctx
   2.22%  [kernel]  [k] xsk_tx_peek_desc
   2.08%  [kernel]  [k] memset_erms
   0.83%  [kernel]  [k] virtqueue_kick_prepare
   0.76%  [kernel]  [k] virtnet_xsk_run
   0.62%  [kernel]  [k] __free_old_xmit_ptr
   0.60%  [kernel]  [k] vring_map_one_sg
   0.53%  [kernel]  [k] native_apic_mem_write
   0.46%  [kernel]  [k] sg_next
   0.43%  [kernel]  [k] sg_init_table
   0.41%  [kernel]  [k] kmalloc_slab

This series adds a desc cache to virtio so that these allocated indirect
desc arrays are reused instead of being constantly allocated and released.
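
For reference, a rough usage sketch (based on the virtqueue_set_desc_cache()
API added in patch 1 and the virtio-net change in patch 3; the loop and the
threshold value shown here are only illustrative):

  /* call right after find_vqs() and before the device is marked ready */
  for (i = 0; i < total_vqs; i++)
          virtqueue_set_desc_cache(vqs[i], 2 + MAX_SKB_FRAGS);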

v4:
    1. Only allow desc cache when VIRTIO_RING_F_INDIRECT_DESC negotiation is successful
    2. The desc cache threshold can be set for each virtqueue

v3:
  pre-allocate per buffer indirect descriptors array

v2:
  use struct list_head to cache the desc

Xuan Zhuo (3):
  virtio: cache indirect desc for split
  virtio: cache indirect desc for packed
  virtio-net: enable virtio desc cache

 drivers/net/virtio_net.c     |  12 ++-
 drivers/virtio/virtio_ring.c | 152 +++++++++++++++++++++++++++++++----
 include/linux/virtio.h       |  17 ++++
 3 files changed, 163 insertions(+), 18 deletions(-)

--
2.31.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 1/3] virtio: cache indirect desc for split
  2021-11-08 11:49 ` Xuan Zhuo
@ 2021-11-08 11:49   ` Xuan Zhuo
  -1 siblings, 0 replies; 25+ messages in thread
From: Xuan Zhuo @ 2021-11-08 11:49 UTC (permalink / raw)
  To: virtualization, netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski

When using indirect descriptors, an indirect desc array must be allocated
and released for every request, which adds a lot of cpu overhead.

Here, a cache is added for indirect descriptors. If the number of indirect
descs requested is less than or equal to desc_cache_thr, a fixed-size desc
array of desc_cache_thr entries is taken from the cache and returned to it
for reuse.
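
As a rough illustration of the memory cost (not taken from the patch; this
assumes a 256-entry ring, sizeof(struct vring_desc) == 16 and
MAX_SKB_FRAGS == 17, i.e. a threshold of 19), the per-virtqueue
preallocation done by virtqueue_set_desc_cache() is:

  256 * 19 * 16 = 77824 bytes (~76 KiB per virtqueue)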

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/virtio/virtio_ring.c | 126 ++++++++++++++++++++++++++++++++---
 include/linux/virtio.h       |  17 +++++
 2 files changed, 135 insertions(+), 8 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index dd95dfd85e98..a4a91c497a83 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -85,6 +85,19 @@ struct vring_desc_extra {
 	u16 next;			/* The next desc state in a list. */
 };
 
+struct vring_desc_cache {
+	/* desc cache chain */
+	struct list_head list;
+
+	void *array;
+
+	/* desc cache threshold
+	 *    0   - disable desc cache
+	 *    > 0 - enable desc cache. As the threshold of the desc cache.
+	 */
+	u32 threshold;
+};
+
 struct vring_virtqueue {
 	struct virtqueue vq;
 
@@ -117,6 +130,8 @@ struct vring_virtqueue {
 	/* Hint for event idx: already triggered no need to disable. */
 	bool event_triggered;
 
+	struct vring_desc_cache desc_cache;
+
 	union {
 		/* Available for split ring */
 		struct {
@@ -423,7 +438,50 @@ static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
 	return extra[i].next;
 }
 
-static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
+static void desc_cache_init(struct vring_virtqueue *vq)
+{
+	vq->desc_cache.array = NULL;
+	vq->desc_cache.threshold = 0;
+	INIT_LIST_HEAD(&vq->desc_cache.list);
+}
+
+static void desc_cache_free(struct vring_virtqueue *vq)
+{
+	kfree(vq->desc_cache.array);
+}
+
+static void __desc_cache_put(struct vring_virtqueue *vq,
+			     struct list_head *node, int n)
+{
+	if (n <= vq->desc_cache.threshold)
+		list_add(node, &vq->desc_cache.list);
+	else
+		kfree(node);
+}
+
+#define desc_cache_put(vq, desc, n) \
+	__desc_cache_put(vq, (struct list_head *)desc, n)
+
+static void *desc_cache_get(struct vring_virtqueue *vq,
+			    int size, int n, gfp_t gfp)
+{
+	struct list_head *node;
+
+	if (n > vq->desc_cache.threshold)
+		return kmalloc_array(n, size, gfp);
+
+	node = vq->desc_cache.list.next;
+	list_del(node);
+	return node;
+}
+
+#define _desc_cache_get(vq, n, gfp, tp) \
+	((tp *)desc_cache_get(vq, (sizeof(tp)), n, gfp))
+
+#define desc_cache_get_split(vq, n, gfp) \
+	_desc_cache_get(vq, n, gfp, struct vring_desc)
+
+static struct vring_desc *alloc_indirect_split(struct vring_virtqueue *vq,
 					       unsigned int total_sg,
 					       gfp_t gfp)
 {
@@ -437,12 +495,12 @@ static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
 	 */
 	gfp &= ~__GFP_HIGHMEM;
 
-	desc = kmalloc_array(total_sg, sizeof(struct vring_desc), gfp);
+	desc = desc_cache_get_split(vq, total_sg, gfp);
 	if (!desc)
 		return NULL;
 
 	for (i = 0; i < total_sg; i++)
-		desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
+		desc[i].next = cpu_to_virtio16(vq->vq.vdev, i + 1);
 	return desc;
 }
 
@@ -508,7 +566,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 	head = vq->free_head;
 
 	if (virtqueue_use_indirect(_vq, total_sg))
-		desc = alloc_indirect_split(_vq, total_sg, gfp);
+		desc = alloc_indirect_split(vq, total_sg, gfp);
 	else {
 		desc = NULL;
 		WARN_ON_ONCE(total_sg > vq->split.vring.num && !vq->indirect);
@@ -652,7 +710,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 	}
 
 	if (indirect)
-		kfree(desc);
+		desc_cache_put(vq, desc, total_sg);
 
 	END_USE(vq);
 	return -ENOMEM;
@@ -717,7 +775,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
 	if (vq->indirect) {
 		struct vring_desc *indir_desc =
 				vq->split.desc_state[head].indir_desc;
-		u32 len;
+		u32 len, n;
 
 		/* Free the indirect table, if any, now that it's unmapped. */
 		if (!indir_desc)
@@ -729,10 +787,12 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
 				VRING_DESC_F_INDIRECT));
 		BUG_ON(len == 0 || len % sizeof(struct vring_desc));
 
-		for (j = 0; j < len / sizeof(struct vring_desc); j++)
+		n = len / sizeof(struct vring_desc);
+
+		for (j = 0; j < n; j++)
 			vring_unmap_one_split_indirect(vq, &indir_desc[j]);
 
-		kfree(indir_desc);
+		desc_cache_put(vq, indir_desc, n);
 		vq->split.desc_state[head].indir_desc = NULL;
 	} else if (ctx) {
 		*ctx = vq->split.desc_state[head].indir_desc;
@@ -2200,6 +2260,8 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
 		!context;
 	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
 
+	desc_cache_init(vq);
+
 	if (virtio_has_feature(vdev, VIRTIO_F_ORDER_PLATFORM))
 		vq->weak_barriers = false;
 
@@ -2329,6 +2391,7 @@ void vring_del_virtqueue(struct virtqueue *_vq)
 	if (!vq->packed_ring) {
 		kfree(vq->split.desc_state);
 		kfree(vq->split.desc_extra);
+		desc_cache_free(vq);
 	}
 	kfree(vq);
 }
@@ -2445,6 +2508,53 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *_vq)
 }
 EXPORT_SYMBOL_GPL(virtqueue_get_used_addr);
 
+int virtqueue_set_desc_cache(struct virtqueue *_vq, u32 threshold)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	struct list_head *node;
+	int size, num, i;
+
+	BUILD_BUG_ON(sizeof(struct list_head) > sizeof(struct vring_desc));
+	BUILD_BUG_ON(sizeof(struct list_head) > sizeof(struct vring_packed_desc));
+
+	BUG_ON(!vq->we_own_ring);
+
+	if (!vq->indirect)
+		return 0;
+
+	vq->desc_cache.threshold = threshold;
+
+	if (!threshold)
+		return 0;
+
+	if (vq->packed_ring) {
+		size = sizeof(struct vring_packed_desc);
+		num = vq->packed.vring.num;
+	} else {
+		size = sizeof(struct vring_desc);
+		num = vq->split.vring.num;
+	}
+
+	size = size * vq->desc_cache.threshold;
+
+	vq->desc_cache.array = kmalloc_array(num, size, GFP_KERNEL);
+	if (!vq->desc_cache.array) {
+		vq->desc_cache.threshold = 0;
+		dev_warn(&vq->vq.vdev->dev,
+			 "queue[%d] alloc desc cache fail. turn off it.\n",
+			 vq->vq.index);
+		return -1;
+	}
+
+	for (i = 0; i < num; ++i) {
+		node = vq->desc_cache.array + (i * size);
+		list_add(node, &vq->desc_cache.list);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtqueue_set_desc_cache);
+
 /* Only available for split ring */
 const struct vring *virtqueue_get_vring(struct virtqueue *vq)
 {
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 41edbc01ffa4..e24b2e90dd42 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -89,6 +89,23 @@ dma_addr_t virtqueue_get_desc_addr(struct virtqueue *vq);
 dma_addr_t virtqueue_get_avail_addr(struct virtqueue *vq);
 dma_addr_t virtqueue_get_used_addr(struct virtqueue *vq);
 
+/**
+ * virtqueue_set_desc_cache - set virtio ring desc cache threshold
+ *
+ * virtio will allocate ring.num desc arrays of size threshold in advance. If
+ * total_sg exceeds the threshold, use kmalloc/kfree allocation indirect desc,
+ * if total_sg is less than or equal to the threshold, use these pre-allocated
+ * desc arrays.
+ *
+ * This function must be called immediately after find_vqs and before device
+ * ready.
+ *
+ * @threshold:
+ *    0   - disable desc cache
+ *    > 0 - enable desc cache. As the threshold of the desc cache.
+ */
+int virtqueue_set_desc_cache(struct virtqueue *_vq, u32 threshold);
+
 /**
  * virtio_device - representation of a device using virtio
  * @index: unique position on the virtio bus
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 2/3] virtio: cache indirect desc for packed
  2021-11-08 11:49 ` Xuan Zhuo
@ 2021-11-08 11:49   ` Xuan Zhuo
  -1 siblings, 0 replies; 25+ messages in thread
From: Xuan Zhuo @ 2021-11-08 11:49 UTC (permalink / raw)
  To: virtualization, netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski

When using indirect descriptors, an indirect desc array must be allocated
and released for every request, which adds a lot of cpu overhead.

Here, a cache is added for indirect descriptors. If the number of indirect
descs requested is less than or equal to desc_cache_thr, a fixed-size desc
array of desc_cache_thr entries is taken from the cache and returned to it
for reuse.

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/virtio/virtio_ring.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index a4a91c497a83..76a974219ffd 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1092,7 +1092,11 @@ static void vring_unmap_desc_packed(const struct vring_virtqueue *vq,
 	}
 }
 
-static struct vring_packed_desc *alloc_indirect_packed(unsigned int total_sg,
+#define desc_cache_get_packed(vq, n, gfp) \
+	_desc_cache_get(vq, n, gfp, struct vring_packed_desc)
+
+static struct vring_packed_desc *alloc_indirect_packed(struct vring_virtqueue *vq,
+						       unsigned int total_sg,
 						       gfp_t gfp)
 {
 	struct vring_packed_desc *desc;
@@ -1104,7 +1108,7 @@ static struct vring_packed_desc *alloc_indirect_packed(unsigned int total_sg,
 	 */
 	gfp &= ~__GFP_HIGHMEM;
 
-	desc = kmalloc_array(total_sg, sizeof(struct vring_packed_desc), gfp);
+	desc = desc_cache_get_packed(vq, total_sg, gfp);
 
 	return desc;
 }
@@ -1124,7 +1128,7 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
 	dma_addr_t addr;
 
 	head = vq->packed.next_avail_idx;
-	desc = alloc_indirect_packed(total_sg, gfp);
+	desc = alloc_indirect_packed(vq, total_sg, gfp);
 
 	if (unlikely(vq->vq.num_free < 1)) {
 		pr_debug("Can't add buf len 1 - avail = 0\n");
@@ -1215,7 +1219,7 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
 	for (i = 0; i < err_idx; i++)
 		vring_unmap_desc_packed(vq, &desc[i]);
 
-	kfree(desc);
+	desc_cache_put(vq, desc, total_sg);
 
 	END_USE(vq);
 	return -ENOMEM;
@@ -1440,20 +1444,22 @@ static void detach_buf_packed(struct vring_virtqueue *vq,
 	}
 
 	if (vq->indirect) {
-		u32 len;
+		u32 len, n;
 
 		/* Free the indirect table, if any, now that it's unmapped. */
 		desc = state->indir_desc;
 		if (!desc)
 			return;
 
+		len = vq->packed.desc_extra[id].len;
+		n = len / sizeof(struct vring_packed_desc);
+
 		if (vq->use_dma_api) {
-			len = vq->packed.desc_extra[id].len;
-			for (i = 0; i < len / sizeof(struct vring_packed_desc);
-					i++)
+			for (i = 0; i < n; i++)
 				vring_unmap_desc_packed(vq, &desc[i]);
 		}
-		kfree(desc);
+
+		desc_cache_put(vq, desc, n);
 		state->indir_desc = NULL;
 	} else if (ctx) {
 		*ctx = state->indir_desc;
@@ -1772,6 +1778,8 @@ static struct virtqueue *vring_create_virtqueue_packed(
 		!context;
 	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
 
+	desc_cache_init(vq);
+
 	if (virtio_has_feature(vdev, VIRTIO_F_ORDER_PLATFORM))
 		vq->weak_barriers = false;
 
@@ -2391,8 +2399,8 @@ void vring_del_virtqueue(struct virtqueue *_vq)
 	if (!vq->packed_ring) {
 		kfree(vq->split.desc_state);
 		kfree(vq->split.desc_extra);
-		desc_cache_free(vq);
 	}
+	desc_cache_free(vq);
 	kfree(vq);
 }
 EXPORT_SYMBOL_GPL(vring_del_virtqueue);
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 3/3] virtio-net: enable virtio desc cache
  2021-11-08 11:49 ` Xuan Zhuo
@ 2021-11-08 11:49   ` Xuan Zhuo
  -1 siblings, 0 replies; 25+ messages in thread
From: Xuan Zhuo @ 2021-11-08 11:49 UTC (permalink / raw)
  To: virtualization, netdev
  Cc: Michael S. Tsirkin, Jason Wang, David S. Miller, Jakub Kicinski

If VIRTIO_RING_F_INDIRECT_DESC negotiation succeeds and the number of
sgs used for sending packets is greater than 1, we must constantly
call __kmalloc/kfree to allocate/release the indirect descs.

In the case of extremely fast packet delivery, this overhead cannot be
ignored:

  27.46%  [kernel]  [k] virtqueue_add
  16.66%  [kernel]  [k] detach_buf_split
  16.51%  [kernel]  [k] virtnet_xsk_xmit
  14.04%  [kernel]  [k] virtqueue_add_outbuf
>  5.18%  [kernel]  [k] __kmalloc
>  4.08%  [kernel]  [k] kfree
   2.80%  [kernel]  [k] virtqueue_get_buf_ctx
   2.22%  [kernel]  [k] xsk_tx_peek_desc
   2.08%  [kernel]  [k] memset_erms
   0.83%  [kernel]  [k] virtqueue_kick_prepare
   0.76%  [kernel]  [k] virtnet_xsk_run
   0.62%  [kernel]  [k] __free_old_xmit_ptr
   0.60%  [kernel]  [k] vring_map_one_sg
   0.53%  [kernel]  [k] native_apic_mem_write
   0.46%  [kernel]  [k] sg_next
   0.43%  [kernel]  [k] sg_init_table
>  0.41%  [kernel]  [k] kmalloc_slab

Compared to not using the virtio indirect desc cache, virtio-net gets about
a 16% performance improvement when the desc cache is used.

In the test case, the CPU sending the packets is at 100% utilization.
The following are the PPS numbers for the two cases:

    indirect desc cache  | no cache
    3074658              | 2685132
    3111866              | 2666118
    3152527              | 2653632
    3125867              | 2669820
    3027147              | 2644464
    3069211              | 2669777
    3038522              | 2675645
    3034507              | 2671302
    3102257              | 2685504
    3083712              | 2692800
    3051771              | 2676928
    3080684              | 2695040
    3147816              | 2720876
    3123887              | 2705492
    3180963              | 2699520
    3191579              | 2676480
    3161670              | 2686272
    3189768              | 2692588
    3174272              | 2686692
    3143434              | 2682416
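
(Averaging the 20 runs above gives roughly 3.11 Mpps with the cache and
roughly 2.68 Mpps without it, which is where the ~16% figure comes from.)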

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/virtio_net.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 9ff2ef9dceca..193c8b38433e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -42,6 +42,9 @@ module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 module_param(napi_tx, bool, 0644);
 
+static u32 virtio_desc_cache_threshold = MAX_SKB_FRAGS + 2;
+module_param(virtio_desc_cache_threshold, uint, 0644);
+
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
 #define GOOD_COPY_LEN	128
@@ -3350,10 +3353,10 @@ static unsigned int mergeable_min_buf_len(struct virtnet_info *vi, struct virtqu
 
 static int virtnet_find_vqs(struct virtnet_info *vi)
 {
+	int i, total_vqs, threshold;
 	vq_callback_t **callbacks;
 	struct virtqueue **vqs;
 	int ret = -ENOMEM;
-	int i, total_vqs;
 	const char **names;
 	bool *ctx;
 
@@ -3411,10 +3414,17 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
 			vi->dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
 	}
 
+	threshold = min_t(u32, virtio_desc_cache_threshold, 2 + MAX_SKB_FRAGS);
+
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		vi->rq[i].vq = vqs[rxq2vq(i)];
 		vi->rq[i].min_buf_len = mergeable_min_buf_len(vi, vi->rq[i].vq);
 		vi->sq[i].vq = vqs[txq2vq(i)];
+
+		if (!vi->mergeable_rx_bufs && vi->big_packets)
+			virtqueue_set_desc_cache(vi->rq[i].vq, MAX_SKB_FRAGS + 2);
+
+		virtqueue_set_desc_cache(vi->sq[i].vq, threshold);
 	}
 
 	/* run here: ret == 0. */
-- 
2.31.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/3] virtio support cache indirect desc
  2021-11-08 11:49 ` Xuan Zhuo
@ 2021-11-08 13:49   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 25+ messages in thread
From: Michael S. Tsirkin @ 2021-11-08 13:49 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: virtualization, netdev, Jason Wang, David S. Miller, Jakub Kicinski

On Mon, Nov 08, 2021 at 07:49:48PM +0800, Xuan Zhuo wrote:
> If the VIRTIO_RING_F_INDIRECT_DESC negotiation succeeds, and the number
> of sgs used for sending packets is greater than 1. We must constantly
> call __kmalloc/kfree to allocate/release desc.
> 
> In the case of extremely fast package delivery, the overhead cannot be
> ignored:
> 
>   27.46%  [kernel]  [k] virtqueue_add
>   16.66%  [kernel]  [k] detach_buf_split
>   16.51%  [kernel]  [k] virtnet_xsk_xmit
>   14.04%  [kernel]  [k] virtqueue_add_outbuf
>    5.18%  [kernel]  [k] __kmalloc
>    4.08%  [kernel]  [k] kfree
>    2.80%  [kernel]  [k] virtqueue_get_buf_ctx
>    2.22%  [kernel]  [k] xsk_tx_peek_desc
>    2.08%  [kernel]  [k] memset_erms
>    0.83%  [kernel]  [k] virtqueue_kick_prepare
>    0.76%  [kernel]  [k] virtnet_xsk_run
>    0.62%  [kernel]  [k] __free_old_xmit_ptr
>    0.60%  [kernel]  [k] vring_map_one_sg
>    0.53%  [kernel]  [k] native_apic_mem_write
>    0.46%  [kernel]  [k] sg_next
>    0.43%  [kernel]  [k] sg_init_table
>    0.41%  [kernel]  [k] kmalloc_slab
> 
> This patch adds a cache function to virtio to cache these allocated indirect
> desc instead of constantly allocating and releasing desc.

Hmm a bunch of comments got ignored. See e.g.
https://lore.kernel.org/r/20211027043851-mutt-send-email-mst%40kernel.org
if they aren't relevant add code comments or commit log text explaining the
design choice please.


> v4:
>     1. Only allow desc cache when VIRTIO_RING_F_INDIRECT_DESC negotiation is successful
>     2. The desc cache threshold can be set for each virtqueue
> 
> v3:
>   pre-allocate per buffer indirect descriptors array
> 
> v2:
>   use struct list_head to cache the desc
> 
> Xuan Zhuo (3):
>   virtio: cache indirect desc for split
>   virtio: cache indirect desc for packed
>   virtio-net: enable virtio desc cache
> 
>  drivers/net/virtio_net.c     |  12 ++-
>  drivers/virtio/virtio_ring.c | 152 +++++++++++++++++++++++++++++++----
>  include/linux/virtio.h       |  17 ++++
>  3 files changed, 163 insertions(+), 18 deletions(-)
> 
> --
> 2.31.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/3] virtio support cache indirect desc
  2021-11-08 13:49   ` Michael S. Tsirkin
  (?)
@ 2021-11-08 14:47   ` Xuan Zhuo
  2021-11-10 12:53       ` Michael S. Tsirkin
  -1 siblings, 1 reply; 25+ messages in thread
From: Xuan Zhuo @ 2021-11-08 14:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, David S. Miller, Jakub Kicinski, virtualization

On Mon, 8 Nov 2021 08:49:27 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> Hmm a bunch of comments got ignored. See e.g.
> https://lore.kernel.org/r/20211027043851-mutt-send-email-mst%40kernel.org
> if they aren't relevant add code comments or commit log text explaining the
> design choice please.

I thought I had responded to the related questions; I am guessing some emails
may have been lost.

I have sorted out the following 6 questions; if any are missing, please let
me know.

1. use list_head
  In the earliest version, I used pointers directly. You suggested that I use
  llist_head, but llist_head relies on atomic operations and there is no
  concurrency problem here, so I used list_head instead.

  In fact, list_head does not increase the allocated space.

  use as desc array: | vring_desc | vring_desc | vring_desc | vring_desc |
  use as queue item: | list_head ........................................|
  (see the sketch after this list)

2.
> > +	if (vq->use_desc_cache && total_sg <= VIRT_QUEUE_CACHE_DESC_NUM) {
> > +		if (vq->desc_cache_chain) {
> > +			desc = vq->desc_cache_chain;
> > +			vq->desc_cache_chain = (void *)desc->addr;
> > +			goto got;
> > +		}
> > +		n = VIRT_QUEUE_CACHE_DESC_NUM;
>
> Hmm. This will allocate more entries than actually used. Why do it?


This is because the size of each cache item is fixed, and the logic has been
modified in the latest code. I think this problem no longer exists.


3.
> What bothers me here is what happens if cache gets
> filled on one numa node, then used on another?

I'm thinking about another question: how does the cross-NUMA case appear
here? The virtio desc queue has the same cross-NUMA problem, so do we really
need to handle the cross-NUMA scenario?

Indirect descs are used together with the virtio desc queue, so they only
need to be on the same NUMA node as the virtio desc queue. We could
therefore allocate the indirect desc cache at the same time as the virtio
desc queue.

4.
> So e.g. for rx, we are wasting memory since indirect isn't used.

In the current version, the desc cache is set up per queue.

So if a queue does not use the desc cache, we don't need to set it up for
that queue.

For example, in virtio-net only the tx queues, and the rx queues in big
packet mode, enable the desc cache.

5.
> Would a better API be a cache size in bytes? This controls how much
> memory is spent after all.

My design is to set a threshold. When total_sg is greater than this threshold,
it will fall back to kmalloc/kfree. When total_sg is less than or equal to
this threshold, use the allocated cache.


6. kmem_cache_*

I have tested these; the performance is not as good as the method used in
this patch.
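
As a rough sketch of point 1 (simplified from the patch code, for
illustration only): a free cache entry is treated as a list node while it
sits on the free list, and the same memory is reused as a descriptor array
once it is handed out; the BUILD_BUG_ONs in virtqueue_set_desc_cache()
ensure a struct list_head fits inside a single descriptor:

  /* take an entry off the per-virtqueue free list ... */
  struct list_head *node = vq->desc_cache.list.next;

  list_del(node);

  /* ... and reuse the same memory as an indirect desc array */
  struct vring_desc *desc = (struct vring_desc *)node;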


Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/3] virtio support cache indirect desc
  2021-11-08 11:49 ` Xuan Zhuo
@ 2021-11-09 13:03   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 25+ messages in thread
From: Michael S. Tsirkin @ 2021-11-09 13:03 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: virtualization, netdev, Jason Wang, David S. Miller, Jakub Kicinski

On Mon, Nov 08, 2021 at 07:49:48PM +0800, Xuan Zhuo wrote:
> If the VIRTIO_RING_F_INDIRECT_DESC negotiation succeeds, and the number
> of sgs used for sending packets is greater than 1. We must constantly
> call __kmalloc/kfree to allocate/release desc.
> 
> In the case of extremely fast package delivery, the overhead cannot be
> ignored:
> 
>   27.46%  [kernel]  [k] virtqueue_add
>   16.66%  [kernel]  [k] detach_buf_split
>   16.51%  [kernel]  [k] virtnet_xsk_xmit
>   14.04%  [kernel]  [k] virtqueue_add_outbuf
>    5.18%  [kernel]  [k] __kmalloc
>    4.08%  [kernel]  [k] kfree
>    2.80%  [kernel]  [k] virtqueue_get_buf_ctx
>    2.22%  [kernel]  [k] xsk_tx_peek_desc
>    2.08%  [kernel]  [k] memset_erms
>    0.83%  [kernel]  [k] virtqueue_kick_prepare
>    0.76%  [kernel]  [k] virtnet_xsk_run
>    0.62%  [kernel]  [k] __free_old_xmit_ptr
>    0.60%  [kernel]  [k] vring_map_one_sg
>    0.53%  [kernel]  [k] native_apic_mem_write
>    0.46%  [kernel]  [k] sg_next
>    0.43%  [kernel]  [k] sg_init_table
>    0.41%  [kernel]  [k] kmalloc_slab
> 
> This patch adds a cache function to virtio to cache these allocated indirect
> desc instead of constantly allocating and releasing desc.
> 
> v4:
>     1. Only allow desc cache when VIRTIO_RING_F_INDIRECT_DESC negotiation is successful
>     2. The desc cache threshold can be set for each virtqueue
> 
> v3:
>   pre-allocate per buffer indirect descriptors array

So I'm not sure why we are doing that. Did it improve anything?


> v2:
>   use struct list_head to cache the desc
> 
> Xuan Zhuo (3):
>   virtio: cache indirect desc for split
>   virtio: cache indirect desc for packed
>   virtio-net: enable virtio desc cache
> 
>  drivers/net/virtio_net.c     |  12 ++-
>  drivers/virtio/virtio_ring.c | 152 +++++++++++++++++++++++++++++++----
>  include/linux/virtio.h       |  17 ++++
>  3 files changed, 163 insertions(+), 18 deletions(-)
> 
> --
> 2.31.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/3] virtio: cache indirect desc for split
  2021-11-08 11:49   ` Xuan Zhuo
@ 2021-11-09 13:09     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 25+ messages in thread
From: Michael S. Tsirkin @ 2021-11-09 13:09 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: virtualization, netdev, Jason Wang, David S. Miller, Jakub Kicinski

On Mon, Nov 08, 2021 at 07:49:49PM +0800, Xuan Zhuo wrote:
> In the case of using indirect, indirect desc must be allocated and
> released each time, which increases a lot of cpu overhead.
> 
> Here, a cache is added for indirect. If the number of indirect desc to be
> applied for is less than desc_cache_thr, the desc array with
> the size of desc_cache_thr is fixed and cached for reuse.
> 
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>  drivers/virtio/virtio_ring.c | 126 ++++++++++++++++++++++++++++++++---
>  include/linux/virtio.h       |  17 +++++
>  2 files changed, 135 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index dd95dfd85e98..a4a91c497a83 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -85,6 +85,19 @@ struct vring_desc_extra {
>  	u16 next;			/* The next desc state in a list. */
>  };
>  
> +struct vring_desc_cache {
> +	/* desc cache chain */
> +	struct list_head list;
> +
> +	void *array;
> +
> +	/* desc cache threshold
> +	 *    0   - disable desc cache
> +	 *    > 0 - enable desc cache. As the threshold of the desc cache.
> +	 */
> +	u32 threshold;
> +};
> +
>  struct vring_virtqueue {
>  	struct virtqueue vq;
>  
> @@ -117,6 +130,8 @@ struct vring_virtqueue {
>  	/* Hint for event idx: already triggered no need to disable. */
>  	bool event_triggered;
>  
> +	struct vring_desc_cache desc_cache;
> +
>  	union {
>  		/* Available for split ring */
>  		struct {
> @@ -423,7 +438,50 @@ static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
>  	return extra[i].next;
>  }
>  
> -static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
> +static void desc_cache_init(struct vring_virtqueue *vq)
> +{
> +	vq->desc_cache.array = NULL;
> +	vq->desc_cache.threshold = 0;
> +	INIT_LIST_HEAD(&vq->desc_cache.list);
> +}
> +
> +static void desc_cache_free(struct vring_virtqueue *vq)
> +{
> +	kfree(vq->desc_cache.array);
> +}
> +
> +static void __desc_cache_put(struct vring_virtqueue *vq,
> +			     struct list_head *node, int n)
> +{
> +	if (n <= vq->desc_cache.threshold)
> +		list_add(node, &vq->desc_cache.list);
> +	else
> +		kfree(node);
> +}
> +
> +#define desc_cache_put(vq, desc, n) \
> +	__desc_cache_put(vq, (struct list_head *)desc, n)
> +
> +static void *desc_cache_get(struct vring_virtqueue *vq,
> +			    int size, int n, gfp_t gfp)
> +{
> +	struct list_head *node;
> +
> +	if (n > vq->desc_cache.threshold)
> +		return kmalloc_array(n, size, gfp);
> +
> +	node = vq->desc_cache.list.next;
> +	list_del(node);
> +	return node;
> +}
> +
> +#define _desc_cache_get(vq, n, gfp, tp) \
> +	((tp *)desc_cache_get(vq, (sizeof(tp)), n, gfp))
> +
> +#define desc_cache_get_split(vq, n, gfp) \
> +	_desc_cache_get(vq, n, gfp, struct vring_desc)
> +
> +static struct vring_desc *alloc_indirect_split(struct vring_virtqueue *vq,
>  					       unsigned int total_sg,
>  					       gfp_t gfp)
>  {
> @@ -437,12 +495,12 @@ static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
>  	 */
>  	gfp &= ~__GFP_HIGHMEM;
>  
> -	desc = kmalloc_array(total_sg, sizeof(struct vring_desc), gfp);
> +	desc = desc_cache_get_split(vq, total_sg, gfp);
>  	if (!desc)
>  		return NULL;
>  
>  	for (i = 0; i < total_sg; i++)
> -		desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
> +		desc[i].next = cpu_to_virtio16(vq->vq.vdev, i + 1);
>  	return desc;
>  }
>  
> @@ -508,7 +566,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>  	head = vq->free_head;
>  
>  	if (virtqueue_use_indirect(_vq, total_sg))
> -		desc = alloc_indirect_split(_vq, total_sg, gfp);
> +		desc = alloc_indirect_split(vq, total_sg, gfp);
>  	else {
>  		desc = NULL;
>  		WARN_ON_ONCE(total_sg > vq->split.vring.num && !vq->indirect);
> @@ -652,7 +710,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>  	}
>  
>  	if (indirect)
> -		kfree(desc);
> +		desc_cache_put(vq, desc, total_sg);
>  
>  	END_USE(vq);
>  	return -ENOMEM;
> @@ -717,7 +775,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
>  	if (vq->indirect) {
>  		struct vring_desc *indir_desc =
>  				vq->split.desc_state[head].indir_desc;
> -		u32 len;
> +		u32 len, n;
>  
>  		/* Free the indirect table, if any, now that it's unmapped. */
>  		if (!indir_desc)
> @@ -729,10 +787,12 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
>  				VRING_DESC_F_INDIRECT));
>  		BUG_ON(len == 0 || len % sizeof(struct vring_desc));
>  
> -		for (j = 0; j < len / sizeof(struct vring_desc); j++)
> +		n = len / sizeof(struct vring_desc);
> +
> +		for (j = 0; j < n; j++)
>  			vring_unmap_one_split_indirect(vq, &indir_desc[j]);
>  
> -		kfree(indir_desc);
> +		desc_cache_put(vq, indir_desc, n);
>  		vq->split.desc_state[head].indir_desc = NULL;
>  	} else if (ctx) {
>  		*ctx = vq->split.desc_state[head].indir_desc;
> @@ -2200,6 +2260,8 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
>  		!context;
>  	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
>  
> +	desc_cache_init(vq);
> +
>  	if (virtio_has_feature(vdev, VIRTIO_F_ORDER_PLATFORM))
>  		vq->weak_barriers = false;
>  
> @@ -2329,6 +2391,7 @@ void vring_del_virtqueue(struct virtqueue *_vq)
>  	if (!vq->packed_ring) {
>  		kfree(vq->split.desc_state);
>  		kfree(vq->split.desc_extra);
> +		desc_cache_free(vq);
>  	}
>  	kfree(vq);
>  }
> @@ -2445,6 +2508,53 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *_vq)
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_get_used_addr);
>  
> +int virtqueue_set_desc_cache(struct virtqueue *_vq, u32 threshold)
> +{
> +	struct vring_virtqueue *vq = to_vvq(_vq);
> +	struct list_head *node;
> +	int size, num, i;
> +
> +	BUILD_BUG_ON(sizeof(struct list_head) > sizeof(struct vring_desc));
> +	BUILD_BUG_ON(sizeof(struct list_head) > sizeof(struct vring_packed_desc));
> +
> +	BUG_ON(!vq->we_own_ring);
> +
> +	if (!vq->indirect)
> +		return 0;
> +
> +	vq->desc_cache.threshold = threshold;
> +
> +	if (!threshold)
> +		return 0;
> +
> +	if (vq->packed_ring) {
> +		size = sizeof(struct vring_packed_desc);
> +		num = vq->packed.vring.num;
> +	} else {
> +		size = sizeof(struct vring_desc);
> +		num = vq->split.vring.num;
> +	}
> +
> +	size = size * vq->desc_cache.threshold;

just use two variables pls so it's clear which size is where.

> +
> +	vq->desc_cache.array = kmalloc_array(num, size, GFP_KERNEL);

might be quite big. the point of indirect is so it can be
allocated in chunks...

also this allocates from numa node on which driver is loaded,
likely not the correct one to use for the VQ.
how about addressing this e.g. by dropping the cache
if we cross numa nodes?


> +	if (!vq->desc_cache.array) {
> +		vq->desc_cache.threshold = 0;
> +		dev_warn(&vq->vq.vdev->dev,
> +			 "queue[%d] alloc desc cache fail. turn off it.\n",
> +			 vq->vq.index);
> +		return -1;

should be some errno, don't come up with your own return codes.

> +	}
> +
> +	for (i = 0; i < num; ++i) {
> +		node = vq->desc_cache.array + (i * size);
> +		list_add(node, &vq->desc_cache.list);
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_set_desc_cache);
> +
>  /* Only available for split ring */
>  const struct vring *virtqueue_get_vring(struct virtqueue *vq)
>  {
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 41edbc01ffa4..e24b2e90dd42 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -89,6 +89,23 @@ dma_addr_t virtqueue_get_desc_addr(struct virtqueue *vq);
>  dma_addr_t virtqueue_get_avail_addr(struct virtqueue *vq);
>  dma_addr_t virtqueue_get_used_addr(struct virtqueue *vq);
>  
> +/**
> + * virtqueue_set_desc_cache - set virtio ring desc cache threshold
> + *
> + * virtio will allocate ring.num desc arrays of size threshold in advance. If
> + * total_sg exceeds the threshold, use kmalloc/kfree allocation indirect desc,
> + * if total_sg is less than or equal to the threshold, use these pre-allocated
> + * desc arrays.
> + *
> + * This function must be called immediately after find_vqs and before device
> + * ready.
> + *
> + * @threshold:
> + *    0   - disable desc cache
> + *    > 0 - enable desc cache. As the threshold of the desc cache.

still not descriptive at all.

why do devices even care? is not the issue with a large
threshold its memory consumption?


> + */
> +int virtqueue_set_desc_cache(struct virtqueue *_vq, u32 threshold);

document return code too.

> +
>  /**
>   * virtio_device - representation of a device using virtio
>   * @index: unique position on the virtio bus
> -- 
> 2.31.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/3] virtio: cache indirect desc for split
  2021-11-09 13:09     ` Michael S. Tsirkin
  (?)
@ 2021-11-09 15:00     ` Xuan Zhuo
  -1 siblings, 0 replies; 25+ messages in thread
From: Xuan Zhuo @ 2021-11-09 15:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, David S. Miller, Jakub Kicinski, virtualization

On Tue, 9 Nov 2021 08:09:42 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Nov 08, 2021 at 07:49:49PM +0800, Xuan Zhuo wrote:
> > In the case of using indirect, indirect desc must be allocated and
> > released each time, which increases a lot of cpu overhead.
> >
> > Here, a cache is added for indirect. If the number of indirect desc to be
> > applied for is less than desc_cache_thr, the desc array with
> > the size of desc_cache_thr is fixed and cached for reuse.
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> >  drivers/virtio/virtio_ring.c | 126 ++++++++++++++++++++++++++++++++---
> >  include/linux/virtio.h       |  17 +++++
> >  2 files changed, 135 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index dd95dfd85e98..a4a91c497a83 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -85,6 +85,19 @@ struct vring_desc_extra {
> >  	u16 next;			/* The next desc state in a list. */
> >  };
> >
> > +struct vring_desc_cache {
> > +	/* desc cache chain */
> > +	struct list_head list;
> > +
> > +	void *array;
> > +
> > +	/* desc cache threshold
> > +	 *    0   - disable desc cache
> > +	 *    > 0 - enable desc cache. As the threshold of the desc cache.
> > +	 */
> > +	u32 threshold;
> > +};
> > +
> >  struct vring_virtqueue {
> >  	struct virtqueue vq;
> >
> > @@ -117,6 +130,8 @@ struct vring_virtqueue {
> >  	/* Hint for event idx: already triggered no need to disable. */
> >  	bool event_triggered;
> >
> > +	struct vring_desc_cache desc_cache;
> > +
> >  	union {
> >  		/* Available for split ring */
> >  		struct {
> > @@ -423,7 +438,50 @@ static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
> >  	return extra[i].next;
> >  }
> >
> > -static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
> > +static void desc_cache_init(struct vring_virtqueue *vq)
> > +{
> > +	vq->desc_cache.array = NULL;
> > +	vq->desc_cache.threshold = 0;
> > +	INIT_LIST_HEAD(&vq->desc_cache.list);
> > +}
> > +
> > +static void desc_cache_free(struct vring_virtqueue *vq)
> > +{
> > +	kfree(vq->desc_cache.array);
> > +}
> > +
> > +static void __desc_cache_put(struct vring_virtqueue *vq,
> > +			     struct list_head *node, int n)
> > +{
> > +	if (n <= vq->desc_cache.threshold)
> > +		list_add(node, &vq->desc_cache.list);
> > +	else
> > +		kfree(node);
> > +}
> > +
> > +#define desc_cache_put(vq, desc, n) \
> > +	__desc_cache_put(vq, (struct list_head *)desc, n)
> > +
> > +static void *desc_cache_get(struct vring_virtqueue *vq,
> > +			    int size, int n, gfp_t gfp)
> > +{
> > +	struct list_head *node;
> > +
> > +	if (n > vq->desc_cache.threshold)
> > +		return kmalloc_array(n, size, gfp);
> > +
> > +	node = vq->desc_cache.list.next;
> > +	list_del(node);
> > +	return node;
> > +}
> > +
> > +#define _desc_cache_get(vq, n, gfp, tp) \
> > +	((tp *)desc_cache_get(vq, (sizeof(tp)), n, gfp))
> > +
> > +#define desc_cache_get_split(vq, n, gfp) \
> > +	_desc_cache_get(vq, n, gfp, struct vring_desc)
> > +
> > +static struct vring_desc *alloc_indirect_split(struct vring_virtqueue *vq,
> >  					       unsigned int total_sg,
> >  					       gfp_t gfp)
> >  {
> > @@ -437,12 +495,12 @@ static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
> >  	 */
> >  	gfp &= ~__GFP_HIGHMEM;
> >
> > -	desc = kmalloc_array(total_sg, sizeof(struct vring_desc), gfp);
> > +	desc = desc_cache_get_split(vq, total_sg, gfp);
> >  	if (!desc)
> >  		return NULL;
> >
> >  	for (i = 0; i < total_sg; i++)
> > -		desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
> > +		desc[i].next = cpu_to_virtio16(vq->vq.vdev, i + 1);
> >  	return desc;
> >  }
> >
> > @@ -508,7 +566,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> >  	head = vq->free_head;
> >
> >  	if (virtqueue_use_indirect(_vq, total_sg))
> > -		desc = alloc_indirect_split(_vq, total_sg, gfp);
> > +		desc = alloc_indirect_split(vq, total_sg, gfp);
> >  	else {
> >  		desc = NULL;
> >  		WARN_ON_ONCE(total_sg > vq->split.vring.num && !vq->indirect);
> > @@ -652,7 +710,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> >  	}
> >
> >  	if (indirect)
> > -		kfree(desc);
> > +		desc_cache_put(vq, desc, total_sg);
> >
> >  	END_USE(vq);
> >  	return -ENOMEM;
> > @@ -717,7 +775,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
> >  	if (vq->indirect) {
> >  		struct vring_desc *indir_desc =
> >  				vq->split.desc_state[head].indir_desc;
> > -		u32 len;
> > +		u32 len, n;
> >
> >  		/* Free the indirect table, if any, now that it's unmapped. */
> >  		if (!indir_desc)
> > @@ -729,10 +787,12 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
> >  				VRING_DESC_F_INDIRECT));
> >  		BUG_ON(len == 0 || len % sizeof(struct vring_desc));
> >
> > -		for (j = 0; j < len / sizeof(struct vring_desc); j++)
> > +		n = len / sizeof(struct vring_desc);
> > +
> > +		for (j = 0; j < n; j++)
> >  			vring_unmap_one_split_indirect(vq, &indir_desc[j]);
> >
> > -		kfree(indir_desc);
> > +		desc_cache_put(vq, indir_desc, n);
> >  		vq->split.desc_state[head].indir_desc = NULL;
> >  	} else if (ctx) {
> >  		*ctx = vq->split.desc_state[head].indir_desc;
> > @@ -2200,6 +2260,8 @@ struct virtqueue *__vring_new_virtqueue(unsigned int index,
> >  		!context;
> >  	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX);
> >
> > +	desc_cache_init(vq);
> > +
> >  	if (virtio_has_feature(vdev, VIRTIO_F_ORDER_PLATFORM))
> >  		vq->weak_barriers = false;
> >
> > @@ -2329,6 +2391,7 @@ void vring_del_virtqueue(struct virtqueue *_vq)
> >  	if (!vq->packed_ring) {
> >  		kfree(vq->split.desc_state);
> >  		kfree(vq->split.desc_extra);
> > +		desc_cache_free(vq);
> >  	}
> >  	kfree(vq);
> >  }
> > @@ -2445,6 +2508,53 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *_vq)
> >  }
> >  EXPORT_SYMBOL_GPL(virtqueue_get_used_addr);
> >
> > +int virtqueue_set_desc_cache(struct virtqueue *_vq, u32 threshold)
> > +{
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +	struct list_head *node;
> > +	int size, num, i;
> > +
> > +	BUILD_BUG_ON(sizeof(struct list_head) > sizeof(struct vring_desc));
> > +	BUILD_BUG_ON(sizeof(struct list_head) > sizeof(struct vring_packed_desc));
> > +
> > +	BUG_ON(!vq->we_own_ring);
> > +
> > +	if (!vq->indirect)
> > +		return 0;
> > +
> > +	vq->desc_cache.threshold = threshold;
> > +
> > +	if (!threshold)
> > +		return 0;
> > +
> > +	if (vq->packed_ring) {
> > +		size = sizeof(struct vring_packed_desc);
> > +		num = vq->packed.vring.num;
> > +	} else {
> > +		size = sizeof(struct vring_desc);
> > +		num = vq->split.vring.num;
> > +	}
> > +
> > +	size = size * vq->desc_cache.threshold;
>
> just use two variables pls so it's clear which size is where.

OK.

>
> > +
> > +	vq->desc_cache.array = kmalloc_array(num, size, GFP_KERNEL);
>
> might be quite big. the point of indirect is so it can be
> allocated in chunks...
>
> also this allocates from numa node on which driver is loaded,
> likely not the correct one to use for the VQ.
> how about addressing this e.g. by dropping the cache
> if we cross numa nodes?
>

I think it is enough as long as the indirect desc cache, the vq desc ring, the
used ring and the avail ring are all guaranteed to be allocated on the same
NUMA node.

If using the indirect desc cache across NUMA nodes is a problem, then using the
vq desc ring across NUMA nodes is a problem as well.

So the cross-NUMA issue is a separate problem to be solved.

Of course, checking the NUMA node id every time an indirect desc cache item is
used is also a feasible way to handle this.


>
> > +	if (!vq->desc_cache.array) {
> > +		vq->desc_cache.threshold = 0;
> > +		dev_warn(&vq->vq.vdev->dev,
> > +			 "queue[%d] alloc desc cache fail. turn off it.\n",
> > +			 vq->vq.index);
> > +		return -1;
>
> should be some errno, don't come up with your own return codes.
>


Yes

> > +	}
> > +
> > +	for (i = 0; i < num; ++i) {
> > +		node = vq->desc_cache.array + (i * size);
> > +		list_add(node, &vq->desc_cache.list);
> > +	}
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(virtqueue_set_desc_cache);
> > +
> >  /* Only available for split ring */
> >  const struct vring *virtqueue_get_vring(struct virtqueue *vq)
> >  {
> > diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> > index 41edbc01ffa4..e24b2e90dd42 100644
> > --- a/include/linux/virtio.h
> > +++ b/include/linux/virtio.h
> > @@ -89,6 +89,23 @@ dma_addr_t virtqueue_get_desc_addr(struct virtqueue *vq);
> >  dma_addr_t virtqueue_get_avail_addr(struct virtqueue *vq);
> >  dma_addr_t virtqueue_get_used_addr(struct virtqueue *vq);
> >
> > +/**
> > + * virtqueue_set_desc_cache - set virtio ring desc cache threshold
> > + *
> > + * virtio will allocate ring.num desc arrays of size threshold in advance. If
> > + * total_sg exceeds the threshold, use kmalloc/kfree allocation indirect desc,
> > + * if total_sg is less than or equal to the threshold, use these pre-allocated
> > + * desc arrays.
> > + *
> > + * This function must be called immediately after find_vqs and before device
> > + * ready.
> > + *
> > + * @threshold:
> > + *    0   - disable desc cache
> > + *    > 0 - enable desc cache. As the threshold of the desc cache.
>
> still not descriptive at all.
>
> why do devices even care? is not the issue with a large
> threshold its memory consumption?


On the one hand, the threshold limits which requests will use the cache. On the
other hand, it also specifies the size of the cache item.

The typical total_sg differs between devices. For example, for the network
device, total_sg will not exceed MAX_SKB_FRAGS + 2.

Different devices can therefore set different cache item sizes based on their
own characteristics, which helps to save memory.
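
As a rough illustration of the memory cost, assuming a 256-entry split ring and
a threshold of MAX_SKB_FRAGS + 2 = 19 (with the usual MAX_SKB_FRAGS = 17):

  cache item size = threshold * sizeof(struct vring_desc) = 19 * 16 B = 304 B
  per-queue cache = ring.num * item size = 256 * 304 B = 76 KiB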


>
>
> > + */
> > +int virtqueue_set_desc_cache(struct virtqueue *_vq, u32 threshold);
>
> document return code too.

OK.

Thanks

>
> > +
> >  /**
> >   * virtio_device - representation of a device using virtio
> >   * @index: unique position on the virtio bus
> > --
> > 2.31.0
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/3] virtio support cache indirect desc
  2021-11-09 13:03   ` Michael S. Tsirkin
  (?)
@ 2021-11-09 15:14   ` Xuan Zhuo
  -1 siblings, 0 replies; 25+ messages in thread
From: Xuan Zhuo @ 2021-11-09 15:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, David S. Miller, Jakub Kicinski, virtualization

On Tue, 9 Nov 2021 08:03:18 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Nov 08, 2021 at 07:49:48PM +0800, Xuan Zhuo wrote:
> > If the VIRTIO_RING_F_INDIRECT_DESC negotiation succeeds, and the number
> > of sgs used for sending packets is greater than 1. We must constantly
> > call __kmalloc/kfree to allocate/release desc.
> >
> > In the case of extremely fast package delivery, the overhead cannot be
> > ignored:
> >
> >   27.46%  [kernel]  [k] virtqueue_add
> >   16.66%  [kernel]  [k] detach_buf_split
> >   16.51%  [kernel]  [k] virtnet_xsk_xmit
> >   14.04%  [kernel]  [k] virtqueue_add_outbuf
> >    5.18%  [kernel]  [k] __kmalloc
> >    4.08%  [kernel]  [k] kfree
> >    2.80%  [kernel]  [k] virtqueue_get_buf_ctx
> >    2.22%  [kernel]  [k] xsk_tx_peek_desc
> >    2.08%  [kernel]  [k] memset_erms
> >    0.83%  [kernel]  [k] virtqueue_kick_prepare
> >    0.76%  [kernel]  [k] virtnet_xsk_run
> >    0.62%  [kernel]  [k] __free_old_xmit_ptr
> >    0.60%  [kernel]  [k] vring_map_one_sg
> >    0.53%  [kernel]  [k] native_apic_mem_write
> >    0.46%  [kernel]  [k] sg_next
> >    0.43%  [kernel]  [k] sg_init_table
> >    0.41%  [kernel]  [k] kmalloc_slab
> >
> > This patch adds a cache function to virtio to cache these allocated indirect
> > desc instead of constantly allocating and releasing desc.
> >
> > v4:
> >     1. Only allow desc cache when VIRTIO_RING_F_INDIRECT_DESC negotiation is successful
> >     2. The desc cache threshold can be set for each virtqueue
> >
> > v3:
> >   pre-allocate per buffer indirect descriptors array
>
> So I'm not sure why we are doing that. Did it improve anything?

1. It avoids calling kmalloc in the data path, though that is not the main point.
2. The memory is allocated on the CPU that initializes the device, which ensures
   that the indirect desc cache and the vq desc ring are on the same NUMA node
   (see the sketch below).
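
A minimal sketch of one explicit way to express point 2 in
virtqueue_set_desc_cache(), assuming dev_to_node() reports the right node for
the ring (num and size as computed in that function):

	int node = dev_to_node(&vq->vq.vdev->dev);

	vq->desc_cache.array = kmalloc_array_node(num, size, GFP_KERNEL, node);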

Thanks.

>
>
> > v2:
> >   use struct list_head to cache the desc
> >
> > Xuan Zhuo (3):
> >   virtio: cache indirect desc for split
> >   virtio: cache indirect desc for packed
> >   virtio-net: enable virtio desc cache
> >
> >  drivers/net/virtio_net.c     |  12 ++-
> >  drivers/virtio/virtio_ring.c | 152 +++++++++++++++++++++++++++++++----
> >  include/linux/virtio.h       |  17 ++++
> >  3 files changed, 163 insertions(+), 18 deletions(-)
> >
> > --
> > 2.31.0
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/3] virtio support cache indirect desc
  2021-11-08 14:47   ` Xuan Zhuo
@ 2021-11-10 12:53       ` Michael S. Tsirkin
  0 siblings, 0 replies; 25+ messages in thread
From: Michael S. Tsirkin @ 2021-11-10 12:53 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: virtualization, netdev, Jason Wang, David S. Miller, Jakub Kicinski

On Mon, Nov 08, 2021 at 10:47:40PM +0800, Xuan Zhuo wrote:
> On Mon, 8 Nov 2021 08:49:27 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > Hmm a bunch of comments got ignored. See e.g.
> > https://lore.kernel.org/r/20211027043851-mutt-send-email-mst%40kernel.org
> > if they aren't relevant add code comments or commit log text explaining the
> > design choice please.
> 
> I should have responded to related questions, I am guessing whether some emails
> have been lost.
> 
> I have sorted out the following 6 questions, if there are any missing questions,
> please let me know.
> 
> 1. use list_head
>   In the earliest version, I used pointers directly. You suggest that I use
>   llist_head, but considering that llist_head has atomic operations. There is no
>   competition problem here, so I used list_head.
> 
>   In fact, I did not increase the allocated space for list_head.
> 
>   use as desc array: | vring_desc | vring_desc | vring_desc | vring_desc |
>   use as queue item: | list_head ........................................|

the concern is that you touch many cache lines when removing an entry.

I suggest something like:

llist: add a non-atomic list_del_first

One has to know what one's doing, but if one has locked the list
preventing all accesses, then it's ok to just pop off an entry without
atomics.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---

diff --git a/include/linux/llist.h b/include/linux/llist.h
index 24f207b0190b..13a47dddb12b 100644
--- a/include/linux/llist.h
+++ b/include/linux/llist.h
@@ -247,6 +247,17 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
 
 extern struct llist_node *llist_del_first(struct llist_head *head);
 
+static inline struct llist_node *__llist_del_first(struct llist_head *head)
+{
+	struct llist_node *first = head->first;
+
+	if (!first)
+		return NULL;
+
+	head->first = first->next;
+	return first;
+}
+
 struct llist_node *llist_reverse_order(struct llist_node *head);
 
 #endif /* LLIST_H */


-----
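
For reference, a sketch of how the desc cache could sit on top of this helper,
assuming desc_cache.list becomes a struct llist_head (this is not what the
posted patch does):

static void *desc_cache_get(struct vring_virtqueue *vq,
			    int size, int n, gfp_t gfp)
{
	struct llist_node *node;

	if (n > vq->desc_cache.threshold)
		return kmalloc_array(n, size, gfp);

	/* the caller already serializes access to the vq, so non-atomic is fine */
	node = __llist_del_first(&vq->desc_cache.list);
	return node;
}

static void __desc_cache_put(struct vring_virtqueue *vq,
			     struct llist_node *node, int n)
{
	if (n <= vq->desc_cache.threshold) {
		/* non-atomic push, mirroring __llist_del_first() above */
		node->next = vq->desc_cache.list.first;
		vq->desc_cache.list.first = node;
	} else {
		kfree(node);
	}
}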


> 2.
> > > +	if (vq->use_desc_cache && total_sg <= VIRT_QUEUE_CACHE_DESC_NUM) {
> > > +		if (vq->desc_cache_chain) {
> > > +			desc = vq->desc_cache_chain;
> > > +			vq->desc_cache_chain = (void *)desc->addr;
> > > +			goto got;
> > > +		}
> > > +		n = VIRT_QUEUE_CACHE_DESC_NUM;
> >
> > Hmm. This will allocate more entries than actually used. Why do it?
> 
> 
> This is because the size of each cache item is fixed, and the logic has been
> modified in the latest code. I think this problem no longer exists.
> 
> 
> 3.
> > What bothers me here is what happens if cache gets
> > filled on one numa node, then used on another?
> 
> I'm thinking about another question, how did the cross-numa appear here, and
> virtio desc queue also has the problem of cross-numa. So is it necessary for us
> to deal with the cross-numa scene?

It's true that desc queue might be cross numa, and people are looking
for ways to improve that. Not a reason to make things worse ...


> Indirect desc is used as virtio desc, so as long as it is in the same numa as
> virito desc. So we can allocate indirect desc cache at the same time when
> allocating virtio desc queue.

Allocating it from the current node, like we do now, seems better.

> 4.
> > So e.g. for rx, we are wasting memory since indirect isn't used.
> 
> In the current version, desc cache is set up based on pre-queue.
> 
> So if the desc cache is not used, we don't need to set the desc cache.
> 
> For example, virtio-net, as long as the tx queue and the rx queue in big packet
> mode enable desc cache.


I liked how in older versions adding indirect enabled it implicitly,
without the need to hack drivers.

> 5.
> > Would a better API be a cache size in bytes? This controls how much
> > memory is spent after all.
> 
> My design is to set a threshold. When total_sg is greater than this threshold,
> it will fall back to kmalloc/kfree. When total_sg is less than or equal to
> this threshold, use the allocated cache.
> 

I know. My question is this: do devices know what a good threshold is?
If yes, how do they know?

> 6. kmem_cache_*
> 
> I have tested these, the performance is not as good as the method used in this
> patch.

Do you mean kmem_cache_alloc_bulk/kmem_cache_free_bulk?
You mentioned just kmem_cache_alloc previously.

> 
> Thanks.


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/3] virtio support cache indirect desc
  2021-11-10 12:53       ` Michael S. Tsirkin
@ 2021-11-10 12:54         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 25+ messages in thread
From: Michael S. Tsirkin @ 2021-11-10 12:54 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: virtualization, netdev, Jason Wang, David S. Miller, Jakub Kicinski

On Wed, Nov 10, 2021 at 07:53:49AM -0500, Michael S. Tsirkin wrote:
> On Mon, Nov 08, 2021 at 10:47:40PM +0800, Xuan Zhuo wrote:
> > On Mon, 8 Nov 2021 08:49:27 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > Hmm a bunch of comments got ignored. See e.g.
> > > https://lore.kernel.org/r/20211027043851-mutt-send-email-mst%40kernel.org
> > > if they aren't relevant add code comments or commit log text explaining the
> > > design choice please.
> > 
> > I should have responded to related questions, I am guessing whether some emails
> > have been lost.
> > 
> > I have sorted out the following 6 questions, if there are any missing questions,
> > please let me know.
> > 
> > 1. use list_head
> >   In the earliest version, I used pointers directly. You suggest that I use
> >   llist_head, but considering that llist_head has atomic operations. There is no
> >   competition problem here, so I used list_head.
> > 
> >   In fact, I did not increase the allocated space for list_head.
> > 
> >   use as desc array: | vring_desc | vring_desc | vring_desc | vring_desc |
> >   use as queue item: | list_head ........................................|
> 
> the concern is that you touch many cache lines when removing an entry.
> 
> I suggest something like:
> 
> llist: add a non-atomic list_del_first
> 
> One has to know what one's doing, but if one has locked the list
> preventing all accesses, then it's ok to just pop off an entry without
> atomics.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ---
> 
> diff --git a/include/linux/llist.h b/include/linux/llist.h
> index 24f207b0190b..13a47dddb12b 100644
> --- a/include/linux/llist.h
> +++ b/include/linux/llist.h
> @@ -247,6 +247,17 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
>  
>  extern struct llist_node *llist_del_first(struct llist_head *head);
>  
> +static inline struct llist_node *__llist_del_first(struct llist_head *head)
> +{
> +	struct llist_node *first = head->first;
> +
> +	if (!first)
> +		return NULL;
> +
> +	head->first = first->next;
> +	return first;
> +}
> +
>  struct llist_node *llist_reverse_order(struct llist_node *head);
>  
>  #endif /* LLIST_H */
> 
> 
> -----
> 
> 
> > 2.
> > > > +	if (vq->use_desc_cache && total_sg <= VIRT_QUEUE_CACHE_DESC_NUM) {
> > > > +		if (vq->desc_cache_chain) {
> > > > +			desc = vq->desc_cache_chain;
> > > > +			vq->desc_cache_chain = (void *)desc->addr;
> > > > +			goto got;
> > > > +		}
> > > > +		n = VIRT_QUEUE_CACHE_DESC_NUM;
> > >
> > > Hmm. This will allocate more entries than actually used. Why do it?
> > 
> > 
> > This is because the size of each cache item is fixed, and the logic has been
> > modified in the latest code. I think this problem no longer exists.
> > 
> > 
> > 3.
> > > What bothers me here is what happens if cache gets
> > > filled on one numa node, then used on another?
> > 
> > I'm thinking about another question, how did the cross-numa appear here, and
> > virtio desc queue also has the problem of cross-numa. So is it necessary for us
> > to deal with the cross-numa scene?
> 
> It's true that desc queue might be cross numa, and people are looking
> for ways to improve that. Not a reason to make things worse ...
> 

To add to that, given it's a concern, how about actually
testing performance for this config?

> > Indirect desc is used as virtio desc, so as long as it is in the same numa as
> > virito desc. So we can allocate indirect desc cache at the same time when
> > allocating virtio desc queue.
> 
> Using it from current node like we do now seems better.
> 
> > 4.
> > > So e.g. for rx, we are wasting memory since indirect isn't used.
> > 
> > In the current version, desc cache is set up based on pre-queue.
> > 
> > So if the desc cache is not used, we don't need to set the desc cache.
> > 
> > For example, virtio-net, as long as the tx queue and the rx queue in big packet
> > mode enable desc cache.
> 
> 
> I liked how in older versions adding indrect enabled it implicitly
> though without need to hack drivers.
> 
> > 5.
> > > Would a better API be a cache size in bytes? This controls how much
> > > memory is spent after all.
> > 
> > My design is to set a threshold. When total_sg is greater than this threshold,
> > it will fall back to kmalloc/kfree. When total_sg is less than or equal to
> > this threshold, use the allocated cache.
> > 
> 
> I know. My question is this, do devices know what a good threshold is?
> If yes how do they know?
> 
> > 6. kmem_cache_*
> > 
> > I have tested these, the performance is not as good as the method used in this
> > patch.
> 
> Do you mean kmem_cache_alloc_bulk/kmem_cache_free_bulk?
> You mentioned just kmem_cache_alloc previously.
> 
> > 
> > Thanks.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/3] virtio support cache indirect desc
  2021-11-10 12:53       ` Michael S. Tsirkin
  (?)
  (?)
@ 2021-11-11  6:52       ` Xuan Zhuo
  2021-11-11 15:02           ` Michael S. Tsirkin
  -1 siblings, 1 reply; 25+ messages in thread
From: Xuan Zhuo @ 2021-11-11  6:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, David S. Miller, Jakub Kicinski, virtualization

On Wed, 10 Nov 2021 07:53:44 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Nov 08, 2021 at 10:47:40PM +0800, Xuan Zhuo wrote:
> > On Mon, 8 Nov 2021 08:49:27 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > Hmm a bunch of comments got ignored. See e.g.
> > > https://lore.kernel.org/r/20211027043851-mutt-send-email-mst%40kernel.org
> > > if they aren't relevant add code comments or commit log text explaining the
> > > design choice please.
> >
> > I should have responded to related questions, I am guessing whether some emails
> > have been lost.
> >
> > I have sorted out the following 6 questions, if there are any missing questions,
> > please let me know.
> >
> > 1. use list_head
> >   In the earliest version, I used pointers directly. You suggest that I use
> >   llist_head, but considering that llist_head has atomic operations. There is no
> >   competition problem here, so I used list_head.
> >
> >   In fact, I did not increase the allocated space for list_head.
> >
> >   use as desc array: | vring_desc | vring_desc | vring_desc | vring_desc |
> >   use as queue item: | list_head ........................................|
>
> the concern is that you touch many cache lines when removing an entry.
>
> I suggest something like:
>
> llist: add a non-atomic list_del_first
>
> One has to know what one's doing, but if one has locked the list
> preventing all accesses, then it's ok to just pop off an entry without
> atomics.
>

Oh, great. My own way of solving the problem was too conservative by comparison.

> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> ---
>
> diff --git a/include/linux/llist.h b/include/linux/llist.h
> index 24f207b0190b..13a47dddb12b 100644
> --- a/include/linux/llist.h
> +++ b/include/linux/llist.h
> @@ -247,6 +247,17 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
>
>  extern struct llist_node *llist_del_first(struct llist_head *head);
>
> +static inline struct llist_node *__llist_del_first(struct llist_head *head)
> +{
> +	struct llist_node *first = head->first;
> +
> +	if (!first)
> +		return NULL;
> +
> +	head->first = first->next;
> +	return first;
> +}
> +
>  struct llist_node *llist_reverse_order(struct llist_node *head);
>
>  #endif /* LLIST_H */
>
>
> -----
>
>
> > 2.
> > > > +	if (vq->use_desc_cache && total_sg <= VIRT_QUEUE_CACHE_DESC_NUM) {
> > > > +		if (vq->desc_cache_chain) {
> > > > +			desc = vq->desc_cache_chain;
> > > > +			vq->desc_cache_chain = (void *)desc->addr;
> > > > +			goto got;
> > > > +		}
> > > > +		n = VIRT_QUEUE_CACHE_DESC_NUM;
> > >
> > > Hmm. This will allocate more entries than actually used. Why do it?
> >
> >
> > This is because the size of each cache item is fixed, and the logic has been
> > modified in the latest code. I think this problem no longer exists.
> >
> >
> > 3.
> > > What bothers me here is what happens if cache gets
> > > filled on one numa node, then used on another?
> >
> > I'm thinking about another question, how did the cross-numa appear here, and
> > virtio desc queue also has the problem of cross-numa. So is it necessary for us
> > to deal with the cross-numa scene?
>
> It's true that desc queue might be cross numa, and people are looking
> for ways to improve that. Not a reason to make things worse ...
>

I will test this scenario.

>
> > Indirect desc is used as virtio desc, so as long as it is in the same numa as
> > virito desc. So we can allocate indirect desc cache at the same time when
> > allocating virtio desc queue.
>
> Using it from current node like we do now seems better.
>
> > 4.
> > > So e.g. for rx, we are wasting memory since indirect isn't used.
> >
> > In the current version, desc cache is set up based on pre-queue.
> >
> > So if the desc cache is not used, we don't need to set the desc cache.
> >
> > For example, virtio-net, as long as the tx queue and the rx queue in big packet
> > mode enable desc cache.
>
>
> I liked how in older versions adding indrect enabled it implicitly
> though without need to hack drivers.

I see.

>
> > 5.
> > > Would a better API be a cache size in bytes? This controls how much
> > > memory is spent after all.
> >
> > My design is to set a threshold. When total_sg is greater than this threshold,
> > it will fall back to kmalloc/kfree. When total_sg is less than or equal to
> > this threshold, use the allocated cache.
> >
>
> I know. My question is this, do devices know what a good threshold is?
> If yes how do they know?

I think the driver knows the threshold. For example, MAX_SKB_FRAGS + 2 is a
suitable threshold for virtio-net.
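
For instance, a hedged sketch of the virtio-net side (the exact hook point and
loop are assumptions, not the posted patch 3/3):

	int i;

	/* TX worst case: all frags plus linear data plus the virtio-net header */
	for (i = 0; i < vi->max_queue_pairs; i++)
		virtqueue_set_desc_cache(vi->sq[i].vq, MAX_SKB_FRAGS + 2);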


>
> > 6. kmem_cache_*
> >
> > I have tested these, the performance is not as good as the method used in this
> > patch.
>
> Do you mean kmem_cache_alloc_bulk/kmem_cache_free_bulk?
> You mentioned just kmem_cache_alloc previously.


I will test kmem_cache_alloc_bulk.
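
For reference, a hedged sketch of the kmem_cache based alternative that would be
compared against (the names here are illustrative only):

	void *batch[16];
	struct kmem_cache *cache;
	int n;

	cache = kmem_cache_create("virtio_indir_desc",
				  threshold * sizeof(struct vring_desc),
				  0, 0, NULL);

	/* grab a batch of indirect tables in a single call */
	n = kmem_cache_alloc_bulk(cache, GFP_ATOMIC, ARRAY_SIZE(batch), batch);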

Thanks.

>
> >
> > Thanks.
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/3] virtio support cache indirect desc
  2021-11-11  6:52       ` Xuan Zhuo
@ 2021-11-11 15:02           ` Michael S. Tsirkin
  0 siblings, 0 replies; 25+ messages in thread
From: Michael S. Tsirkin @ 2021-11-11 15:02 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: virtualization, netdev, Jason Wang, David S. Miller, Jakub Kicinski

On Thu, Nov 11, 2021 at 02:52:07PM +0800, Xuan Zhuo wrote:
> On Wed, 10 Nov 2021 07:53:44 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Nov 08, 2021 at 10:47:40PM +0800, Xuan Zhuo wrote:
> > > On Mon, 8 Nov 2021 08:49:27 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > Hmm a bunch of comments got ignored. See e.g.
> > > > https://lore.kernel.org/r/20211027043851-mutt-send-email-mst%40kernel.org
> > > > if they aren't relevant add code comments or commit log text explaining the
> > > > design choice please.
> > >
> > > I should have responded to related questions, I am guessing whether some emails
> > > have been lost.
> > >
> > > I have sorted out the following 6 questions, if there are any missing questions,
> > > please let me know.
> > >
> > > 1. use list_head
> > >   In the earliest version, I used pointers directly. You suggest that I use
> > >   llist_head, but considering that llist_head has atomic operations. There is no
> > >   competition problem here, so I used list_head.
> > >
> > >   In fact, I did not increase the allocated space for list_head.
> > >
> > >   use as desc array: | vring_desc | vring_desc | vring_desc | vring_desc |
> > >   use as queue item: | list_head ........................................|
> >
> > the concern is that you touch many cache lines when removing an entry.
> >
> > I suggest something like:
> >
> > llist: add a non-atomic list_del_first
> >
> > One has to know what one's doing, but if one has locked the list
> > preventing all accesses, then it's ok to just pop off an entry without
> > atomics.
> >
> 
> Oh, great, but my way of solving the problem is too conservative.
> 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >
> > ---
> >
> > diff --git a/include/linux/llist.h b/include/linux/llist.h
> > index 24f207b0190b..13a47dddb12b 100644
> > --- a/include/linux/llist.h
> > +++ b/include/linux/llist.h
> > @@ -247,6 +247,17 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
> >
> >  extern struct llist_node *llist_del_first(struct llist_head *head);
> >
> > +static inline struct llist_node *__llist_del_first(struct llist_head *head)
> > +{
> > +	struct llist_node *first = head->first;
> > +
> > +	if (!first)
> > +		return NULL;
> > +
> > +	head->first = first->next;
> > +	return first;
> > +}
> > +
> >  struct llist_node *llist_reverse_order(struct llist_node *head);
> >
> >  #endif /* LLIST_H */
> >
> >
> > -----
> >
> >
> > > 2.
> > > > > +	if (vq->use_desc_cache && total_sg <= VIRT_QUEUE_CACHE_DESC_NUM) {
> > > > > +		if (vq->desc_cache_chain) {
> > > > > +			desc = vq->desc_cache_chain;
> > > > > +			vq->desc_cache_chain = (void *)desc->addr;
> > > > > +			goto got;
> > > > > +		}
> > > > > +		n = VIRT_QUEUE_CACHE_DESC_NUM;
> > > >
> > > > Hmm. This will allocate more entries than actually used. Why do it?
> > >
> > >
> > > This is because the size of each cache item is fixed, and the logic has been
> > > modified in the latest code. I think this problem no longer exists.
> > >
> > >
> > > 3.
> > > > What bothers me here is what happens if cache gets
> > > > filled on one numa node, then used on another?
> > >
> > > I'm thinking about another question, how did the cross-numa appear here, and
> > > virtio desc queue also has the problem of cross-numa. So is it necessary for us
> > > to deal with the cross-numa scene?
> >
> > It's true that desc queue might be cross numa, and people are looking
> > for ways to improve that. Not a reason to make things worse ...
> >
> 
> I will test for it.
> 
> >
> > > Indirect desc is used as virtio desc, so as long as it is in the same numa as
> > > virito desc. So we can allocate indirect desc cache at the same time when
> > > allocating virtio desc queue.
> >
> > Using it from current node like we do now seems better.
> >
> > > 4.
> > > > So e.g. for rx, we are wasting memory since indirect isn't used.
> > >
> > > In the current version, desc cache is set up based on pre-queue.
> > >
> > > So if the desc cache is not used, we don't need to set the desc cache.
> > >
> > > For example, virtio-net, as long as the tx queue and the rx queue in big packet
> > > mode enable desc cache.
> >
> >
> > I liked how in older versions adding indrect enabled it implicitly
> > though without need to hack drivers.
> 
> I see.
> 
> >
> > > 5.
> > > > Would a better API be a cache size in bytes? This controls how much
> > > > memory is spent after all.
> > >
> > > My design is to set a threshold. When total_sg is greater than this threshold,
> > > it will fall back to kmalloc/kfree. When total_sg is less than or equal to
> > > this threshold, use the allocated cache.
> > >
> >
> > I know. My question is this, do devices know what a good threshold is?
> > If yes how do they know?
> 
> I think the driver knows the threshold, for example, MAX_SKB_FRAG + 2 is a
> suitable threshold for virtio-net.
> 

I guess... in that case I assume it's a good idea to have
virtio core round the size up to whole cache lines, right?
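i.e. something along the lines of (just a sketch of the rounding, using the
variables from the posted virtqueue_set_desc_cache()):

	/* round each cache item up to a whole number of cache lines */
	size = L1_CACHE_ALIGN(size * vq->desc_cache.threshold);

	vq->desc_cache.array = kmalloc_array(num, size, GFP_KERNEL);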

> >
> > > 6. kmem_cache_*
> > >
> > > I have tested these, the performance is not as good as the method used in this
> > > patch.
> >
> > Do you mean kmem_cache_alloc_bulk/kmem_cache_free_bulk?
> > You mentioned just kmem_cache_alloc previously.
> 
> 
> I will test for kmem_cache_alloc_bulk.
> 
> Thanks.
> 
> >
> > >
> > > Thanks.
> >


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/3] virtio support cache indirect desc
@ 2021-11-11 15:02           ` Michael S. Tsirkin
  0 siblings, 0 replies; 25+ messages in thread
From: Michael S. Tsirkin @ 2021-11-11 15:02 UTC (permalink / raw)
  To: Xuan Zhuo; +Cc: netdev, David S. Miller, Jakub Kicinski, virtualization

On Thu, Nov 11, 2021 at 02:52:07PM +0800, Xuan Zhuo wrote:
> On Wed, 10 Nov 2021 07:53:44 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Nov 08, 2021 at 10:47:40PM +0800, Xuan Zhuo wrote:
> > > On Mon, 8 Nov 2021 08:49:27 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > Hmm a bunch of comments got ignored. See e.g.
> > > > https://lore.kernel.org/r/20211027043851-mutt-send-email-mst%40kernel.org
> > > > if they aren't relevant add code comments or commit log text explaining the
> > > > design choice please.
> > >
> > > I should have responded to related questions, I am guessing whether some emails
> > > have been lost.
> > >
> > > I have sorted out the following 6 questions, if there are any missing questions,
> > > please let me know.
> > >
> > > 1. use list_head
> > >   In the earliest version, I used pointers directly. You suggest that I use
> > >   llist_head, but considering that llist_head has atomic operations. There is no
> > >   competition problem here, so I used list_head.
> > >
> > >   In fact, I did not increase the allocated space for list_head.
> > >
> > >   use as desc array: | vring_desc | vring_desc | vring_desc | vring_desc |
> > >   use as queue item: | list_head ........................................|
> >
> > the concern is that you touch many cache lines when removing an entry.
> >
> > I suggest something like:
> >
> > llist: add a non-atomic list_del_first
> >
> > One has to know what one's doing, but if one has locked the list
> > preventing all accesses, then it's ok to just pop off an entry without
> > atomics.
> >
> 
> Oh, great, but my way of solving the problem is too conservative.
> 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >
> > ---
> >
> > diff --git a/include/linux/llist.h b/include/linux/llist.h
> > index 24f207b0190b..13a47dddb12b 100644
> > --- a/include/linux/llist.h
> > +++ b/include/linux/llist.h
> > @@ -247,6 +247,17 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
> >
> >  extern struct llist_node *llist_del_first(struct llist_head *head);
> >
> > +static inline struct llist_node *__llist_del_first(struct llist_head *head)
> > +{
> > +	struct llist_node *first = head->first;
> > +
> > +	if (!first)
> > +		return NULL;
> > +
> > +	head->first = first->next;
> > +	return first;
> > +}
> > +
> >  struct llist_node *llist_reverse_order(struct llist_node *head);
> >
> >  #endif /* LLIST_H */
> >
> >
> > -----
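
To make the suggestion above concrete, here is a minimal sketch of a per-vq
cache built on llist that pops entries with the non-atomic helper from the
patch above while the virtqueue lock is held. The struct and function names
are assumptions for illustration, not the code of this series.

#include <linux/build_bug.h>
#include <linux/llist.h>
#include <linux/slab.h>
#include <linux/virtio_ring.h>

struct desc_cache {
        struct llist_head free;         /* cached indirect tables, vq lock protects it */
        unsigned int entry_sgs;         /* descs per cached table */
};

/* Caller holds the vq lock, so the non-atomic pop is safe. */
static struct vring_desc *desc_cache_get(struct desc_cache *c)
{
        struct llist_node *n = __llist_del_first(&c->free);

        if (n)
                return (struct vring_desc *)n;
        return kmalloc_array(c->entry_sgs, sizeof(struct vring_desc),
                             GFP_ATOMIC);
}

/*
 * A cached table reuses its own first bytes as the llist_node, so no extra
 * memory is needed; this matches the desc array / queue item layout described
 * in point 1 above.
 */
static void desc_cache_put(struct desc_cache *c, struct vring_desc *desc)
{
        BUILD_BUG_ON(sizeof(struct llist_node) > sizeof(struct vring_desc));
        llist_add((struct llist_node *)desc, &c->free);
}
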
> >
> >
> > > 2.
> > > > > +	if (vq->use_desc_cache && total_sg <= VIRT_QUEUE_CACHE_DESC_NUM) {
> > > > > +		if (vq->desc_cache_chain) {
> > > > > +			desc = vq->desc_cache_chain;
> > > > > +			vq->desc_cache_chain = (void *)desc->addr;
> > > > > +			goto got;
> > > > > +		}
> > > > > +		n = VIRT_QUEUE_CACHE_DESC_NUM;
> > > >
> > > > Hmm. This will allocate more entries than actually used. Why do it?
> > >
> > >
> > > This is because the size of each cache item is fixed, and the logic has been
> > > modified in the latest code. I think this problem no longer exists.
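
For context, the snippet quoted in point 2 pops a cached entry by treating the
first desc's addr field as a next pointer; the matching push on the free path
would be roughly the sketch below. The container struct and helper name are
assumptions; only the linking scheme mirrors the quoted code.

#include <linux/virtio_ring.h>

struct desc_cache_head {
        struct vring_desc *chain;       /* stands in for vq->desc_cache_chain */
};

/* Return an indirect table to the chain instead of kfree(). */
static void desc_cache_put_chain(struct desc_cache_head *cache,
                                 struct vring_desc *desc)
{
        /* Link through desc->addr, exactly as the quoted get path follows it. */
        desc->addr = (__virtio64)(unsigned long)cache->chain;
        cache->chain = desc;
}
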
> > >
> > >
> > > 3.
> > > > What bothers me here is what happens if cache gets
> > > > filled on one numa node, then used on another?
> > >
> > > I'm thinking about another question: how does the cross-NUMA case appear here?
> > > The virtio desc queue also has the cross-NUMA problem, so is it necessary for
> > > us to deal with the cross-NUMA case?
> >
> > It's true that desc queue might be cross numa, and people are looking
> > for ways to improve that. Not a reason to make things worse ...
> >
> 
> I will test for it.
> 
> >
> > > Indirect descs are used in the same way as the virtio descs, so it is enough
> > > for them to be on the same NUMA node as the virtio descs. So we can allocate
> > > the indirect desc cache at the same time as the virtio desc queue.
> >
> > Using it from current node like we do now seems better.
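
For reference, the alternative floated above (prefilling the cache next to the
vring, on the device's NUMA node, rather than from the node of the CPU doing
the I/O) would look roughly like this; the helper is purely illustrative and
not part of this series.

#include <linux/device.h>
#include <linux/slab.h>
#include <linux/virtio.h>

/* Allocate one cache entry on the same NUMA node as the virtio device. */
static void *desc_cache_alloc_dev_node(struct virtio_device *vdev, size_t size)
{
        return kmalloc_node(size, GFP_KERNEL, dev_to_node(vdev->dev.parent));
}
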
> >
> > > 4.
> > > > So e.g. for rx, we are wasting memory since indirect isn't used.
> > >
> > > In the current version, the desc cache is set up per-queue.
> > >
> > > So if a queue does not use the desc cache, we don't need to set it up for
> > > that queue.
> > >
> > > For example, in virtio-net, only the tx queues, and the rx queues in big
> > > packet mode, enable the desc cache.
> >
> >
> > I liked how, in older versions, adding indirect enabled it implicitly,
> > without the need to hack drivers.
> 
> I see.
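
As a concrete picture of the per-queue opt-in being discussed, a driver-side
enable in drivers/net/virtio_net.c could look like the sketch below.
virtqueue_set_desc_cache() is a hypothetical name standing in for whatever
hook the series exposes; the virtnet_info fields are the existing ones.

/* Enable the cache only where indirect tables are actually allocated:
 * every tx queue, and the rx queues only in big-packet mode.
 */
static void virtnet_enable_desc_cache(struct virtnet_info *vi)
{
        int i;

        for (i = 0; i < vi->max_queue_pairs; i++) {
                virtqueue_set_desc_cache(vi->sq[i].vq, MAX_SKB_FRAGS + 2);
                if (vi->big_packets)
                        virtqueue_set_desc_cache(vi->rq[i].vq,
                                                 MAX_SKB_FRAGS + 2);
        }
}
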
> 
> >
> > > 5.
> > > > Would a better API be a cache size in bytes? This controls how much
> > > > memory is spent after all.
> > >
> > > My design is to set a threshold. When total_sg is greater than the threshold,
> > > the code falls back to kmalloc/kfree. When total_sg is less than or equal to
> > > the threshold, the pre-allocated cache is used.
> > >
> >
> > I know. My question is this: do devices know what a good threshold is?
> > If yes, how do they know?
> 
> I think the driver knows the threshold; for example, MAX_SKB_FRAGS + 2 is a
> suitable threshold for virtio-net.
> 

I guess... in that case I assume it's a good idea to have
virtio core round the size up to whole cache lines, right?

> >
> > > 6. kmem_cache_*
> > >
> > > I have tested these, the performance is not as good as the method used in this
> > > patch.
> >
> > Do you mean kmem_cache_alloc_bulk/kmem_cache_free_bulk?
> > You mentioned just kmem_cache_alloc previously.
> 
> 
> I will test for kmem_cache_alloc_bulk.
> 
> Thanks.
> 
> >
> > >
> > > Thanks.
> >


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/3] virtio support cache indirect desc
  2021-11-11 15:02           ` Michael S. Tsirkin
  (?)
@ 2021-11-12  1:53           ` Xuan Zhuo
  -1 siblings, 0 replies; 25+ messages in thread
From: Xuan Zhuo @ 2021-11-12  1:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, David S. Miller, Jakub Kicinski, virtualization

On Thu, 11 Nov 2021 10:02:01 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Nov 11, 2021 at 02:52:07PM +0800, Xuan Zhuo wrote:
> > On Wed, 10 Nov 2021 07:53:44 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > On Mon, Nov 08, 2021 at 10:47:40PM +0800, Xuan Zhuo wrote:
> > > > On Mon, 8 Nov 2021 08:49:27 -0500, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >
> > > > > Hmm a bunch of comments got ignored. See e.g.
> > > > > https://lore.kernel.org/r/20211027043851-mutt-send-email-mst%40kernel.org
> > > > > if they aren't relevant add code comments or commit log text explaining the
> > > > > design choice please.
> > > >
> > > > I did respond to the related questions; I am guessing some emails may have
> > > > been lost.
> > > >
> > > > I have sorted out the following 6 questions; if any questions are missing,
> > > > please let me know.
> > > >
> > > > 1. use list_head
> > > >   In the earliest version, I used pointers directly. You suggested that I use
> > > >   llist_head, but llist_head relies on atomic operations. There is no
> > > >   contention here, so I used list_head.
> > > >
> > > >   In fact, I did not allocate any extra space for the list_head.
> > > >
> > > >   use as desc array: | vring_desc | vring_desc | vring_desc | vring_desc |
> > > >   use as queue item: | list_head ........................................|
> > >
> > > the concern is that you touch many cache lines when removing an entry.
> > >
> > > I suggest something like:
> > >
> > > llist: add a non-atomic list_del_first
> > >
> > > One has to know what one's doing, but if one has locked the list
> > > preventing all accesses, then it's ok to just pop off an entry without
> > > atomics.
> > >
> >
> > Oh, great. My way of solving the problem was too conservative.
> >
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > >
> > > ---
> > >
> > > diff --git a/include/linux/llist.h b/include/linux/llist.h
> > > index 24f207b0190b..13a47dddb12b 100644
> > > --- a/include/linux/llist.h
> > > +++ b/include/linux/llist.h
> > > @@ -247,6 +247,17 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
> > >
> > >  extern struct llist_node *llist_del_first(struct llist_head *head);
> > >
> > > +static inline struct llist_node *__llist_del_first(struct llist_head *head)
> > > +{
> > > +	struct llist_node *first = head->first;
> > > +
> > > +	if (!first)
> > > +		return NULL;
> > > +
> > > +	head->first = first->next;
> > > +	return first;
> > > +}
> > > +
> > >  struct llist_node *llist_reverse_order(struct llist_node *head);
> > >
> > >  #endif /* LLIST_H */
> > >
> > >
> > > -----
> > >
> > >
> > > > 2.
> > > > > > +	if (vq->use_desc_cache && total_sg <= VIRT_QUEUE_CACHE_DESC_NUM) {
> > > > > > +		if (vq->desc_cache_chain) {
> > > > > > +			desc = vq->desc_cache_chain;
> > > > > > +			vq->desc_cache_chain = (void *)desc->addr;
> > > > > > +			goto got;
> > > > > > +		}
> > > > > > +		n = VIRT_QUEUE_CACHE_DESC_NUM;
> > > > >
> > > > > Hmm. This will allocate more entries than actually used. Why do it?
> > > >
> > > >
> > > > This is because the size of each cache item is fixed, and the logic has been
> > > > modified in the latest code. I think this problem no longer exists.
> > > >
> > > >
> > > > 3.
> > > > > What bothers me here is what happens if cache gets
> > > > > filled on one numa node, then used on another?
> > > >
> > > > I'm thinking about another question: how does the cross-NUMA case appear here?
> > > > The virtio desc queue also has the cross-NUMA problem, so is it necessary for
> > > > us to deal with the cross-NUMA case?
> > >
> > > It's true that desc queue might be cross numa, and people are looking
> > > for ways to improve that. Not a reason to make things worse ...
> > >
> >
> > I will test for it.
> >
> > >
> > > > Indirect descs are used in the same way as the virtio descs, so it is enough
> > > > for them to be on the same NUMA node as the virtio descs. So we can allocate
> > > > the indirect desc cache at the same time as the virtio desc queue.
> > >
> > > Using it from current node like we do now seems better.
> > >
> > > > 4.
> > > > > So e.g. for rx, we are wasting memory since indirect isn't used.
> > > >
> > > > In the current version, the desc cache is set up per-queue.
> > > >
> > > > So if a queue does not use the desc cache, we don't need to set it up for
> > > > that queue.
> > > >
> > > > For example, in virtio-net, only the tx queues, and the rx queues in big
> > > > packet mode, enable the desc cache.
> > >
> > >
> > > I liked how, in older versions, adding indirect enabled it implicitly,
> > > without the need to hack drivers.
> >
> > I see.
> >
> > >
> > > > 5.
> > > > > Would a better API be a cache size in bytes? This controls how much
> > > > > memory is spent after all.
> > > >
> > > > My design is to set a threshold. When total_sg is greater than the threshold,
> > > > the code falls back to kmalloc/kfree. When total_sg is less than or equal to
> > > > the threshold, the pre-allocated cache is used.
> > > >
> > >
> > > I know. My question is this: do devices know what a good threshold is?
> > > If yes, how do they know?
> >
> > I think the driver knows the threshold; for example, MAX_SKB_FRAGS + 2 is a
> > suitable threshold for virtio-net.
> >
>
> I guess... in that case I assume it's a good idea to have
> virtio core round the size up to whole cache lines, right?


Yes.

Thanks.

>
> > >
> > > > 6. kmem_cache_*
> > > >
> > > > I have tested these, the performance is not as good as the method used in this
> > > > patch.
> > >
> > > Do you mean kmem_cache_alloc_bulk/kmem_cache_free_bulk?
> > > You mentioned just kmem_cache_alloc previously.
> >
> >
> > I will test for kmem_cache_alloc_bulk.
> >
> > Thanks.
> >
> > >
> > > >
> > > > Thanks.
> > >
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2021-11-12  1:53 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-08 11:49 [PATCH v4 0/3] virtio support cache indirect desc Xuan Zhuo
2021-11-08 11:49 ` Xuan Zhuo
2021-11-08 11:49 ` [PATCH v4 1/3] virtio: cache indirect desc for split Xuan Zhuo
2021-11-08 11:49   ` Xuan Zhuo
2021-11-09 13:09   ` Michael S. Tsirkin
2021-11-09 13:09     ` Michael S. Tsirkin
2021-11-09 15:00     ` Xuan Zhuo
2021-11-08 11:49 ` [PATCH v4 2/3] virtio: cache indirect desc for packed Xuan Zhuo
2021-11-08 11:49   ` Xuan Zhuo
2021-11-08 11:49 ` [PATCH v4 3/3] virtio-net: enable virtio desc cache Xuan Zhuo
2021-11-08 11:49   ` Xuan Zhuo
2021-11-08 13:49 ` [PATCH v4 0/3] virtio support cache indirect desc Michael S. Tsirkin
2021-11-08 13:49   ` Michael S. Tsirkin
2021-11-08 14:47   ` Xuan Zhuo
2021-11-10 12:53     ` Michael S. Tsirkin
2021-11-10 12:53       ` Michael S. Tsirkin
2021-11-10 12:54       ` Michael S. Tsirkin
2021-11-10 12:54         ` Michael S. Tsirkin
2021-11-11  6:52       ` Xuan Zhuo
2021-11-11 15:02         ` Michael S. Tsirkin
2021-11-11 15:02           ` Michael S. Tsirkin
2021-11-12  1:53           ` Xuan Zhuo
2021-11-09 13:03 ` Michael S. Tsirkin
2021-11-09 13:03   ` Michael S. Tsirkin
2021-11-09 15:14   ` Xuan Zhuo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.