* [PATCH v3 0/7] net/9p: remove msize limit in virtio transport
@ 2021-09-22 16:00 Christian Schoenebeck
  2021-09-22 16:00 ` [PATCH v3 1/7] net/9p: show error message if user 'msize' cannot be satisfied Christian Schoenebeck
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Christian Schoenebeck @ 2021-09-22 16:00 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

This series aims to get rid of the current 500k 'msize' limitation in
the 9p virtio transport, which is currently a performance bottleneck
for 9p mounts.

This is a follow-up of the following series and discussion:
https://lore.kernel.org/all/28bb651ae0349a7d57e8ddc92c1bd5e62924a912.1630770829.git.linux_oss@crudebyte.com/T/#eb647d0c013616cee3eb8ba9d87da7d8b1f476f37

Known limitation: With this series applied I can run

  QEMU host <-> 9P virtio <-> Linux guest

with an msize of up to 3 MB. If I try to run it with 4 MB it seems to hit
some limitation on the QEMU side:

  qemu-system-x86_64: virtio: too many write descriptors in indirect table

I haven't looked into this issue yet.

Prerequisites: If you are testing with QEMU then please apply the following
patch on the QEMU side:
https://lore.kernel.org/qemu-devel/E1mT2Js-0000DW-OH@lizzy.crudebyte.com/
That QEMU patch is required if you are using a user space app that
automatically retrieves an optimum I/O block size by obeying stat's
st_blksize, which 'cat' for instance does, e.g.:

	time cat test_rnd.dat > /dev/null

Otherwise please use a user space app for performance testing that allows
you to force a large block size, like 'dd' for instance, to avoid that QEMU
issue; in that case you don't need to patch QEMU.
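
For example, to mount with a large msize and measure sequential read
throughput with 'dd' (a rough sketch; the mount tag 'hostshare', the paths
and the block size are placeholders, adjust them to your setup):

	mount -t 9p -o trans=virtio,msize=3145728 hostshare /mnt/9p
	time dd if=/mnt/9p/test_rnd.dat of=/dev/null bs=1M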

Testing and feedback appreciated!

v2 -> v3:

  * Make vq_sg_free() safe for NULL argument [patch 4].

  * Show info message to user if the 'msize' option had to be limited
    because it would exceed p9_max_pages [patch 6].

  * Fix memory leak in vq_sg_resize() [patch 7].

  * Show info message to user if the 'msize' option had to be limited
    because not all required SG lists could be allocated [patch 7].

Christian Schoenebeck (7):
  net/9p: show error message if user 'msize' cannot be satisfied
  9p/trans_virtio: separate allocation of scatter gather list
  9p/trans_virtio: turn amount of sg lists into runtime info
  9p/trans_virtio: introduce struct virtqueue_sg
  net/9p: add trans_maxsize to struct p9_client
  9p/trans_virtio: support larger msize values
  9p/trans_virtio: resize sg lists to whatever is possible

 include/net/9p/client.h |   2 +
 net/9p/client.c         |  17 ++-
 net/9p/trans_virtio.c   | 304 +++++++++++++++++++++++++++++++++++-----
 3 files changed, 283 insertions(+), 40 deletions(-)

-- 
2.20.1



* [PATCH v3 1/7] net/9p: show error message if user 'msize' cannot be satisfied
  2021-09-22 16:00 [PATCH v3 0/7] net/9p: remove msize limit in virtio transport Christian Schoenebeck
@ 2021-09-22 16:00 ` Christian Schoenebeck
  2021-09-22 16:00 ` [PATCH v3 2/7] 9p/trans_virtio: separate allocation of scatter gather list Christian Schoenebeck
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Christian Schoenebeck @ 2021-09-22 16:00 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

If the user supplied a large value with the 'msize' option, the client
would silently limit that 'msize' value to the maximum value supported
by the transport. That's confusing for users, as they get no indication
why their preferred 'msize' value could not be satisfied.
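
With this patch applied, the user instead gets a kernel log message along
the lines of (the exact values depend on the requested 'msize' and the
transport in use):

	9pnet: Limiting 'msize' to 507904 as this is the maximum supported by transport virtio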

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/client.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/9p/client.c b/net/9p/client.c
index 213f12ed76cd..4f4fd2098a30 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1044,8 +1044,13 @@ struct p9_client *p9_client_create(const char *dev_name, char *options)
 	if (err)
 		goto put_trans;
 
-	if (clnt->msize > clnt->trans_mod->maxsize)
+	if (clnt->msize > clnt->trans_mod->maxsize) {
 		clnt->msize = clnt->trans_mod->maxsize;
+		pr_info("Limiting 'msize' to %d as this is the maximum "
+			"supported by transport %s\n",
+			clnt->msize, clnt->trans_mod->name
+		);
+	}
 
 	if (clnt->msize < 4096) {
 		p9_debug(P9_DEBUG_ERROR,
-- 
2.20.1



* [PATCH v3 2/7] 9p/trans_virtio: separate allocation of scatter gather list
  2021-09-22 16:00 [PATCH v3 0/7] net/9p: remove msize limit in virtio transport Christian Schoenebeck
  2021-09-22 16:00 ` [PATCH v3 1/7] net/9p: show error message if user 'msize' cannot be satisfied Christian Schoenebeck
@ 2021-09-22 16:00 ` Christian Schoenebeck
  2021-09-22 16:00 ` [PATCH v3 3/7] 9p/trans_virtio: turn amount of sg lists into runtime info Christian Schoenebeck
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Christian Schoenebeck @ 2021-09-22 16:00 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

The scatter gather list in struct virtio_chan currently resides as a
compile-time, constant-size array directly within struct virtio_chan's
contiguous memory space.

Separate the memory space and allocation of the scatter gather list
from the memory space and allocation of struct virtio_chan.

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/trans_virtio.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 490a4c900339..1dbe2e921bb8 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -77,7 +77,7 @@ struct virtio_chan {
 	 */
 	unsigned long p9_max_pages;
 	/* Scatterlist: can be too big for stack. */
-	struct scatterlist sg[VIRTQUEUE_NUM];
+	struct scatterlist *sg;
 	/**
 	 * @tag: name to identify a mount null terminated
 	 */
@@ -574,6 +574,14 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 		goto fail;
 	}
 
+	chan->sg = kmalloc_array(VIRTQUEUE_NUM,
+				 sizeof(struct scatterlist), GFP_KERNEL);
+	if (!chan->sg) {
+		pr_err("Failed to allocate virtio 9P channel\n");
+		err = -ENOMEM;
+		goto out_free_chan_shallow;
+	}
+
 	chan->vdev = vdev;
 
 	/* We expect one virtqueue, for requests. */
@@ -635,6 +643,8 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 out_free_vq:
 	vdev->config->del_vqs(vdev);
 out_free_chan:
+	kfree(chan->sg);
+out_free_chan_shallow:
 	kfree(chan);
 fail:
 	return err;
@@ -728,6 +738,7 @@ static void p9_virtio_remove(struct virtio_device *vdev)
 	kobject_uevent(&(vdev->dev.kobj), KOBJ_CHANGE);
 	kfree(chan->tag);
 	kfree(chan->vc_wq);
+	kfree(chan->sg);
 	kfree(chan);
 
 }
-- 
2.20.1



* [PATCH v3 3/7] 9p/trans_virtio: turn amount of sg lists into runtime info
  2021-09-22 16:00 [PATCH v3 0/7] net/9p: remove msize limit in virtio transport Christian Schoenebeck
  2021-09-22 16:00 ` [PATCH v3 1/7] net/9p: show error message if user 'msize' cannot be satisfied Christian Schoenebeck
  2021-09-22 16:00 ` [PATCH v3 2/7] 9p/trans_virtio: separate allocation of scatter gather list Christian Schoenebeck
@ 2021-09-22 16:00 ` Christian Schoenebeck
  2021-09-22 16:00 ` [PATCH v3 4/7] 9p/trans_virtio: introduce struct virtqueue_sg Christian Schoenebeck
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Christian Schoenebeck @ 2021-09-22 16:00 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

The size of the scatter/gather lists used by the virtio transport is
currently hard coded. Refactor this to use a runtime variable.

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/trans_virtio.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 1dbe2e921bb8..3347d35a5e6e 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -36,7 +36,7 @@
 #include <linux/virtio_9p.h>
 #include "trans_common.h"
 
-#define VIRTQUEUE_NUM	128
+#define VIRTQUEUE_DEFAULT_NUM	128
 
 /* a single mutex to manage channel initialization and attachment */
 static DEFINE_MUTEX(virtio_9p_lock);
@@ -54,6 +54,7 @@ static atomic_t vp_pinned = ATOMIC_INIT(0);
  * @vc_wq: wait queue for waiting for thing to be added to ring buf
  * @p9_max_pages: maximum number of pinned pages
  * @sg: scatter gather list which is used to pack a request (protected?)
+ * @sg_n: amount of elements in sg array
  * @chan_list: linked list of channels
  *
  * We keep all per-channel information in a structure.
@@ -78,6 +79,7 @@ struct virtio_chan {
 	unsigned long p9_max_pages;
 	/* Scatterlist: can be too big for stack. */
 	struct scatterlist *sg;
+	size_t sg_n;
 	/**
 	 * @tag: name to identify a mount null terminated
 	 */
@@ -270,12 +272,12 @@ p9_virtio_request(struct p9_client *client, struct p9_req_t *req)
 	out_sgs = in_sgs = 0;
 	/* Handle out VirtIO ring buffers */
 	out = pack_sg_list(chan->sg, 0,
-			   VIRTQUEUE_NUM, req->tc.sdata, req->tc.size);
+			   chan->sg_n, req->tc.sdata, req->tc.size);
 	if (out)
 		sgs[out_sgs++] = chan->sg;
 
 	in = pack_sg_list(chan->sg, out,
-			  VIRTQUEUE_NUM, req->rc.sdata, req->rc.capacity);
+			  chan->sg_n, req->rc.sdata, req->rc.capacity);
 	if (in)
 		sgs[out_sgs + in_sgs++] = chan->sg + out;
 
@@ -447,14 +449,14 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 
 	/* out data */
 	out = pack_sg_list(chan->sg, 0,
-			   VIRTQUEUE_NUM, req->tc.sdata, req->tc.size);
+			   chan->sg_n, req->tc.sdata, req->tc.size);
 
 	if (out)
 		sgs[out_sgs++] = chan->sg;
 
 	if (out_pages) {
 		sgs[out_sgs++] = chan->sg + out;
-		out += pack_sg_list_p(chan->sg, out, VIRTQUEUE_NUM,
+		out += pack_sg_list_p(chan->sg, out, chan->sg_n,
 				      out_pages, out_nr_pages, offs, outlen);
 	}
 
@@ -466,13 +468,13 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 	 * allocated memory and payload onto the user buffer.
 	 */
 	in = pack_sg_list(chan->sg, out,
-			  VIRTQUEUE_NUM, req->rc.sdata, in_hdr_len);
+			  chan->sg_n, req->rc.sdata, in_hdr_len);
 	if (in)
 		sgs[out_sgs + in_sgs++] = chan->sg + out;
 
 	if (in_pages) {
 		sgs[out_sgs + in_sgs++] = chan->sg + out + in;
-		in += pack_sg_list_p(chan->sg, out + in, VIRTQUEUE_NUM,
+		in += pack_sg_list_p(chan->sg, out + in, chan->sg_n,
 				     in_pages, in_nr_pages, offs, inlen);
 	}
 
@@ -574,13 +576,14 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 		goto fail;
 	}
 
-	chan->sg = kmalloc_array(VIRTQUEUE_NUM,
+	chan->sg = kmalloc_array(VIRTQUEUE_DEFAULT_NUM,
 				 sizeof(struct scatterlist), GFP_KERNEL);
 	if (!chan->sg) {
 		pr_err("Failed to allocate virtio 9P channel\n");
 		err = -ENOMEM;
 		goto out_free_chan_shallow;
 	}
+	chan->sg_n = VIRTQUEUE_DEFAULT_NUM;
 
 	chan->vdev = vdev;
 
@@ -593,7 +596,7 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 	chan->vq->vdev->priv = chan;
 	spin_lock_init(&chan->lock);
 
-	sg_init_table(chan->sg, VIRTQUEUE_NUM);
+	sg_init_table(chan->sg, chan->sg_n);
 
 	chan->inuse = false;
 	if (virtio_has_feature(vdev, VIRTIO_9P_MOUNT_TAG)) {
@@ -777,7 +780,7 @@ static struct p9_trans_module p9_virtio_trans = {
 	 * that are not at page boundary, that can result in an extra
 	 * page in zero copy.
 	 */
-	.maxsize = PAGE_SIZE * (VIRTQUEUE_NUM - 3),
+	.maxsize = PAGE_SIZE * (VIRTQUEUE_DEFAULT_NUM - 3),
 	.def = 1,
 	.owner = THIS_MODULE,
 };
-- 
2.20.1



* [PATCH v3 4/7] 9p/trans_virtio: introduce struct virtqueue_sg
  2021-09-22 16:00 [PATCH v3 0/7] net/9p: remove msize limit in virtio transport Christian Schoenebeck
                   ` (2 preceding siblings ...)
  2021-09-22 16:00 ` [PATCH v3 3/7] 9p/trans_virtio: turn amount of sg lists into runtime info Christian Schoenebeck
@ 2021-09-22 16:00 ` Christian Schoenebeck
  2021-09-22 16:00 ` [PATCH v3 5/7] net/9p: add trans_maxsize to struct p9_client Christian Schoenebeck
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Christian Schoenebeck @ 2021-09-22 16:00 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

The number of elements in a scatter/gather list is limited to
approximately 128. To allow going beyond that limit with subsequent
patches, pave the way by turning the one-dimensional sg list array
into a two-dimensional array, i.e.:

  sg[128]

becomes

  sgl[nsgl][SG_MAX_SINGLE_ALLOC]

As the value of 'nsgl' is still exactly 1 in this commit, and the
compile-time (compiler and architecture dependent) value of
'SG_MAX_SINGLE_ALLOC' approximately equals the previous hard coded
128 elements, this commit is more of a preparatory refactoring than
an actual behaviour change.

A custom struct virtqueue_sg is defined instead of using the shared
API struct sg_table, because the latter would not allow resizing the
table after allocation. The sg_append_table API would not fit either,
because it requires a list of pages beforehand upon allocation. And
both APIs only support all-or-nothing allocation.
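
As an illustration of the resulting index mapping (assuming, as an
example, SG_MAX_SINGLE_ALLOC = 128, i.e. 127 user pages per list since
the last entry of each list is reserved for chaining), a flat sg
element index of 200 would map to:

  node = 200 / 127 = 1
  leaf = 200 % 127 = 73

i.e. sgl[1][73], which is what the new vq_sg_page() helper below
computes.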

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/trans_virtio.c | 193 ++++++++++++++++++++++++++++++++----------
 1 file changed, 147 insertions(+), 46 deletions(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 3347d35a5e6e..e478a34326f1 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -36,7 +36,31 @@
 #include <linux/virtio_9p.h>
 #include "trans_common.h"
 
-#define VIRTQUEUE_DEFAULT_NUM	128
+/**
+ * struct virtqueue_sg - (chained) scatter gather lists for virtqueue data
+ * transmission
+ * @nsgl: amount of elements (in first dimension) of array field @sgl
+ * @sgl: two-dimensional array, i.e. sgl[nsgl][SG_MAX_SINGLE_ALLOC]
+ */
+struct virtqueue_sg {
+	unsigned int nsgl;
+	struct scatterlist *sgl[];
+};
+
+/*
+ * Default value for field nsgl in struct virtqueue_sg, which defines the
+ * initial virtio data transmission capacity when this virtio transport is
+ * probed.
+ */
+#define VIRTQUEUE_SG_NSGL_DEFAULT 1
+
+/* maximum value for field nsgl in struct virtqueue_sg */
+#define VIRTQUEUE_SG_NSGL_MAX						\
+	((PAGE_SIZE - sizeof(struct virtqueue_sg)) /			\
+	sizeof(struct scatterlist *))					\
+
+/* last entry per sg list is used for chaining (pointer to next list) */
+#define SG_USER_PAGES_PER_LIST	(SG_MAX_SINGLE_ALLOC - 1)
 
 /* a single mutex to manage channel initialization and attachment */
 static DEFINE_MUTEX(virtio_9p_lock);
@@ -53,8 +77,7 @@ static atomic_t vp_pinned = ATOMIC_INIT(0);
  * @ring_bufs_avail: flag to indicate there is some available in the ring buf
  * @vc_wq: wait queue for waiting for thing to be added to ring buf
  * @p9_max_pages: maximum number of pinned pages
- * @sg: scatter gather list which is used to pack a request (protected?)
- * @sg_n: amount of elements in sg array
+ * @vq_sg: table of scatter gather lists, which are used to pack a request
  * @chan_list: linked list of channels
  *
  * We keep all per-channel information in a structure.
@@ -77,9 +100,7 @@ struct virtio_chan {
 	 * will be placing it in each channel.
 	 */
 	unsigned long p9_max_pages;
-	/* Scatterlist: can be too big for stack. */
-	struct scatterlist *sg;
-	size_t sg_n;
+	struct virtqueue_sg *vq_sg;
 	/**
 	 * @tag: name to identify a mount null terminated
 	 */
@@ -96,6 +117,92 @@ static unsigned int rest_of_page(void *data)
 	return PAGE_SIZE - offset_in_page(data);
 }
 
+/**
+ * vq_sg_page - returns user page for given page index
+ * @vq_sg: scatter gather lists used by this transport
+ * @page: user page index across all scatter gather lists
+ */
+static struct scatterlist *vq_sg_page(struct virtqueue_sg *vq_sg, size_t page)
+{
+	unsigned int node = page / SG_USER_PAGES_PER_LIST;
+	unsigned int leaf = page % SG_USER_PAGES_PER_LIST;
+	BUG_ON(node >= VIRTQUEUE_SG_NSGL_MAX);
+	return &vq_sg->sgl[node][leaf];
+}
+
+/**
+ * vq_sg_npages - returns total number of individual user pages in passed
+ * scatter gather lists
+ * @vq_sg: scatter gather lists to be counted
+ */
+static size_t vq_sg_npages(struct virtqueue_sg *vq_sg)
+{
+	return vq_sg->nsgl * SG_USER_PAGES_PER_LIST;
+}
+
+/**
+ * vq_sg_free - free all memory previously allocated for @vq_sg
+ * @vq_sg: scatter gather lists to be freed
+ */
+static void vq_sg_free(struct virtqueue_sg *vq_sg)
+{
+	unsigned int i;
+
+	if (!vq_sg)
+		return;
+
+	for (i = 0; i < vq_sg->nsgl; ++i) {
+		kfree(vq_sg->sgl[i]);
+	}
+	kfree(vq_sg);
+}
+
+/**
+ * vq_sg_alloc - allocates and returns @nsgl scatter gather lists
+ * @nsgl: amount of scatter gather lists to be allocated
+ * If @nsgl is larger than one then chained lists are used if supported by
+ * architecture.
+ */
+static struct virtqueue_sg *vq_sg_alloc(unsigned int nsgl)
+{
+	struct virtqueue_sg *vq_sg;
+	unsigned int i;
+
+	BUG_ON(!nsgl || nsgl > VIRTQUEUE_SG_NSGL_MAX);
+#ifdef CONFIG_ARCH_NO_SG_CHAIN
+	if (WARN_ON_ONCE(nsgl > 1))
+		return NULL;
+#endif
+
+	vq_sg = kzalloc(sizeof(struct virtqueue_sg) +
+			nsgl * sizeof(struct scatterlist *),
+			GFP_KERNEL);
+
+	if (!vq_sg)
+		return NULL;
+
+	vq_sg->nsgl = nsgl;
+
+	for (i = 0; i < nsgl; ++i) {
+		vq_sg->sgl[i] = kmalloc_array(
+			SG_MAX_SINGLE_ALLOC, sizeof(struct scatterlist),
+			GFP_KERNEL
+		);
+		if (!vq_sg->sgl[i]) {
+			vq_sg_free(vq_sg);
+			return NULL;
+		}
+		sg_init_table(vq_sg->sgl[i], SG_MAX_SINGLE_ALLOC);
+		if (i) {
+			/* chain the lists */
+			sg_chain(vq_sg->sgl[i - 1], SG_MAX_SINGLE_ALLOC,
+				 vq_sg->sgl[i]);
+		}
+	}
+	sg_mark_end(&vq_sg->sgl[nsgl - 1][SG_MAX_SINGLE_ALLOC - 1]);
+	return vq_sg;
+}
+
 /**
  * p9_virtio_close - reclaim resources of a channel
  * @client: client instance
@@ -158,9 +265,8 @@ static void req_done(struct virtqueue *vq)
 
 /**
  * pack_sg_list - pack a scatter gather list from a linear buffer
- * @sg: scatter/gather list to pack into
+ * @vq_sg: scatter/gather lists to pack into
  * @start: which segment of the sg_list to start at
- * @limit: maximum segment to pack data to
  * @data: data to pack into scatter/gather list
  * @count: amount of data to pack into the scatter/gather list
  *
@@ -170,11 +276,12 @@ static void req_done(struct virtqueue *vq)
  *
  */
 
-static int pack_sg_list(struct scatterlist *sg, int start,
-			int limit, char *data, int count)
+static int pack_sg_list(struct virtqueue_sg *vq_sg, int start,
+			char *data, int count)
 {
 	int s;
 	int index = start;
+	size_t limit = vq_sg_npages(vq_sg);
 
 	while (count) {
 		s = rest_of_page(data);
@@ -182,13 +289,13 @@ static int pack_sg_list(struct scatterlist *sg, int start,
 			s = count;
 		BUG_ON(index >= limit);
 		/* Make sure we don't terminate early. */
-		sg_unmark_end(&sg[index]);
-		sg_set_buf(&sg[index++], data, s);
+		sg_unmark_end(vq_sg_page(vq_sg, index));
+		sg_set_buf(vq_sg_page(vq_sg, index++), data, s);
 		count -= s;
 		data += s;
 	}
 	if (index-start)
-		sg_mark_end(&sg[index - 1]);
+		sg_mark_end(vq_sg_page(vq_sg, index - 1));
 	return index-start;
 }
 
@@ -208,21 +315,21 @@ static int p9_virtio_cancelled(struct p9_client *client, struct p9_req_t *req)
 /**
  * pack_sg_list_p - Just like pack_sg_list. Instead of taking a buffer,
  * this takes a list of pages.
- * @sg: scatter/gather list to pack into
+ * @vq_sg: scatter/gather lists to pack into
  * @start: which segment of the sg_list to start at
- * @limit: maximum number of pages in sg list.
  * @pdata: a list of pages to add into sg.
  * @nr_pages: number of pages to pack into the scatter/gather list
  * @offs: amount of data in the beginning of first page _not_ to pack
  * @count: amount of data to pack into the scatter/gather list
  */
 static int
-pack_sg_list_p(struct scatterlist *sg, int start, int limit,
+pack_sg_list_p(struct virtqueue_sg *vq_sg, int start,
 	       struct page **pdata, int nr_pages, size_t offs, int count)
 {
 	int i = 0, s;
 	int data_off = offs;
 	int index = start;
+	size_t limit = vq_sg_npages(vq_sg);
 
 	BUG_ON(nr_pages > (limit - start));
 	/*
@@ -235,15 +342,16 @@ pack_sg_list_p(struct scatterlist *sg, int start, int limit,
 			s = count;
 		BUG_ON(index >= limit);
 		/* Make sure we don't terminate early. */
-		sg_unmark_end(&sg[index]);
-		sg_set_page(&sg[index++], pdata[i++], s, data_off);
+		sg_unmark_end(vq_sg_page(vq_sg, index));
+		sg_set_page(vq_sg_page(vq_sg, index++), pdata[i++], s,
+			    data_off);
 		data_off = 0;
 		count -= s;
 		nr_pages--;
 	}
 
 	if (index-start)
-		sg_mark_end(&sg[index - 1]);
+		sg_mark_end(vq_sg_page(vq_sg, index - 1));
 	return index - start;
 }
 
@@ -271,15 +379,13 @@ p9_virtio_request(struct p9_client *client, struct p9_req_t *req)
 
 	out_sgs = in_sgs = 0;
 	/* Handle out VirtIO ring buffers */
-	out = pack_sg_list(chan->sg, 0,
-			   chan->sg_n, req->tc.sdata, req->tc.size);
+	out = pack_sg_list(chan->vq_sg, 0, req->tc.sdata, req->tc.size);
 	if (out)
-		sgs[out_sgs++] = chan->sg;
+		sgs[out_sgs++] = vq_sg_page(chan->vq_sg, 0);
 
-	in = pack_sg_list(chan->sg, out,
-			  chan->sg_n, req->rc.sdata, req->rc.capacity);
+	in = pack_sg_list(chan->vq_sg, out, req->rc.sdata, req->rc.capacity);
 	if (in)
-		sgs[out_sgs + in_sgs++] = chan->sg + out;
+		sgs[out_sgs + in_sgs++] = vq_sg_page(chan->vq_sg, out);
 
 	err = virtqueue_add_sgs(chan->vq, sgs, out_sgs, in_sgs, req,
 				GFP_ATOMIC);
@@ -448,16 +554,15 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 	out_sgs = in_sgs = 0;
 
 	/* out data */
-	out = pack_sg_list(chan->sg, 0,
-			   chan->sg_n, req->tc.sdata, req->tc.size);
+	out = pack_sg_list(chan->vq_sg, 0, req->tc.sdata, req->tc.size);
 
 	if (out)
-		sgs[out_sgs++] = chan->sg;
+		sgs[out_sgs++] = vq_sg_page(chan->vq_sg, 0);
 
 	if (out_pages) {
-		sgs[out_sgs++] = chan->sg + out;
-		out += pack_sg_list_p(chan->sg, out, chan->sg_n,
-				      out_pages, out_nr_pages, offs, outlen);
+		sgs[out_sgs++] = vq_sg_page(chan->vq_sg, out);
+		out += pack_sg_list_p(chan->vq_sg, out, out_pages,
+				      out_nr_pages, offs, outlen);
 	}
 
 	/*
@@ -467,15 +572,14 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 	 * Arrange in such a way that server places header in the
 	 * allocated memory and payload onto the user buffer.
 	 */
-	in = pack_sg_list(chan->sg, out,
-			  chan->sg_n, req->rc.sdata, in_hdr_len);
+	in = pack_sg_list(chan->vq_sg, out, req->rc.sdata, in_hdr_len);
 	if (in)
-		sgs[out_sgs + in_sgs++] = chan->sg + out;
+		sgs[out_sgs + in_sgs++] = vq_sg_page(chan->vq_sg, out);
 
 	if (in_pages) {
-		sgs[out_sgs + in_sgs++] = chan->sg + out + in;
-		in += pack_sg_list_p(chan->sg, out + in, chan->sg_n,
-				     in_pages, in_nr_pages, offs, inlen);
+		sgs[out_sgs + in_sgs++] = vq_sg_page(chan->vq_sg, out + in);
+		in += pack_sg_list_p(chan->vq_sg, out + in, in_pages,
+				     in_nr_pages, offs, inlen);
 	}
 
 	BUG_ON(out_sgs + in_sgs > ARRAY_SIZE(sgs));
@@ -576,14 +680,12 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 		goto fail;
 	}
 
-	chan->sg = kmalloc_array(VIRTQUEUE_DEFAULT_NUM,
-				 sizeof(struct scatterlist), GFP_KERNEL);
-	if (!chan->sg) {
+	chan->vq_sg = vq_sg_alloc(VIRTQUEUE_SG_NSGL_DEFAULT);
+	if (!chan->vq_sg) {
 		pr_err("Failed to allocate virtio 9P channel\n");
 		err = -ENOMEM;
 		goto out_free_chan_shallow;
 	}
-	chan->sg_n = VIRTQUEUE_DEFAULT_NUM;
 
 	chan->vdev = vdev;
 
@@ -596,8 +698,6 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 	chan->vq->vdev->priv = chan;
 	spin_lock_init(&chan->lock);
 
-	sg_init_table(chan->sg, chan->sg_n);
-
 	chan->inuse = false;
 	if (virtio_has_feature(vdev, VIRTIO_9P_MOUNT_TAG)) {
 		virtio_cread(vdev, struct virtio_9p_config, tag_len, &tag_len);
@@ -646,7 +746,7 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 out_free_vq:
 	vdev->config->del_vqs(vdev);
 out_free_chan:
-	kfree(chan->sg);
+	vq_sg_free(chan->vq_sg);
 out_free_chan_shallow:
 	kfree(chan);
 fail:
@@ -741,7 +841,7 @@ static void p9_virtio_remove(struct virtio_device *vdev)
 	kobject_uevent(&(vdev->dev.kobj), KOBJ_CHANGE);
 	kfree(chan->tag);
 	kfree(chan->vc_wq);
-	kfree(chan->sg);
+	vq_sg_free(chan->vq_sg);
 	kfree(chan);
 
 }
@@ -780,7 +880,8 @@ static struct p9_trans_module p9_virtio_trans = {
 	 * that are not at page boundary, that can result in an extra
 	 * page in zero copy.
 	 */
-	.maxsize = PAGE_SIZE * (VIRTQUEUE_DEFAULT_NUM - 3),
+	.maxsize = PAGE_SIZE *
+		((VIRTQUEUE_SG_NSGL_DEFAULT * SG_USER_PAGES_PER_LIST) - 3),
 	.def = 1,
 	.owner = THIS_MODULE,
 };
-- 
2.20.1



* [PATCH v3 5/7] net/9p: add trans_maxsize to struct p9_client
  2021-09-22 16:00 [PATCH v3 0/7] net/9p: remove msize limit in virtio transport Christian Schoenebeck
                   ` (3 preceding siblings ...)
  2021-09-22 16:00 ` [PATCH v3 4/7] 9p/trans_virtio: introduce struct virtqueue_sg Christian Schoenebeck
@ 2021-09-22 16:00 ` Christian Schoenebeck
  2021-09-22 16:00 ` [PATCH v3 6/7] 9p/trans_virtio: support larger msize values Christian Schoenebeck
  2021-09-22 16:00 ` [PATCH v3 7/7] 9p/trans_virtio: resize sg lists to whatever is possible Christian Schoenebeck
  6 siblings, 0 replies; 18+ messages in thread
From: Christian Schoenebeck @ 2021-09-22 16:00 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

This new field 'trans_maxsize' optionally allows the transport to
update it in order to reflect the actual maximum msize supported by
the allocated transport channel.

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 include/net/9p/client.h |  2 ++
 net/9p/client.c         | 12 ++++++++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index e1c308d8d288..e48c4cdf9be0 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -89,6 +89,7 @@ struct p9_req_t {
  * struct p9_client - per client instance state
  * @lock: protect @fids and @reqs
  * @msize: maximum data size negotiated by protocol
+ * @trans_maxsize: actual maximum msize supported by transport channel
  * @proto_version: 9P protocol version to use
  * @trans_mod: module API instantiated with this client
  * @status: connection state
@@ -103,6 +104,7 @@ struct p9_req_t {
 struct p9_client {
 	spinlock_t lock;
 	unsigned int msize;
+	unsigned int trans_maxsize;
 	unsigned char proto_version;
 	struct p9_trans_module *trans_mod;
 	enum p9_trans_status status;
diff --git a/net/9p/client.c b/net/9p/client.c
index 4f4fd2098a30..a75034fa249b 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1037,6 +1037,14 @@ struct p9_client *p9_client_create(const char *dev_name, char *options)
 		goto free_client;
 	}
 
+	/*
+	 * transport will get a chance to increase trans_maxsize (if
+	 * necessary) and it may update trans_maxsize in create() function
+	 * below accordingly to reflect the actual maximum size supported by
+	 * the allocated transport channel
+	 */
+	clnt->trans_maxsize = clnt->trans_mod->maxsize;
+
 	p9_debug(P9_DEBUG_MUX, "clnt %p trans %p msize %d protocol %d\n",
 		 clnt, clnt->trans_mod, clnt->msize, clnt->proto_version);
 
@@ -1044,8 +1052,8 @@ struct p9_client *p9_client_create(const char *dev_name, char *options)
 	if (err)
 		goto put_trans;
 
-	if (clnt->msize > clnt->trans_mod->maxsize) {
-		clnt->msize = clnt->trans_mod->maxsize;
+	if (clnt->msize > clnt->trans_maxsize) {
+		clnt->msize = clnt->trans_maxsize;
 		pr_info("Limiting 'msize' to %d as this is the maximum "
 			"supported by transport %s\n",
 			clnt->msize, clnt->trans_mod->name
-- 
2.20.1



* [PATCH v3 6/7] 9p/trans_virtio: support larger msize values
  2021-09-22 16:00 [PATCH v3 0/7] net/9p: remove msize limit in virtio transport Christian Schoenebeck
                   ` (4 preceding siblings ...)
  2021-09-22 16:00 ` [PATCH v3 5/7] net/9p: add trans_maxsize to struct p9_client Christian Schoenebeck
@ 2021-09-22 16:00 ` Christian Schoenebeck
  2021-11-20 11:20   ` Nikolay Kichukov
  2021-09-22 16:00 ` [PATCH v3 7/7] 9p/trans_virtio: resize sg lists to whatever is possible Christian Schoenebeck
  6 siblings, 1 reply; 18+ messages in thread
From: Christian Schoenebeck @ 2021-09-22 16:00 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

By default the virtio transport supports a 9p 'msize' of up to
approximately 500 kB. This patch adds support for larger 'msize'
values by resizing the number of scatter/gather lists if required.

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/trans_virtio.c | 61 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index e478a34326f1..147ebf647a95 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -203,6 +203,31 @@ static struct virtqueue_sg *vq_sg_alloc(unsigned int nsgl)
 	return vq_sg;
 }
 
+/**
+ * vq_sg_resize - resize passed virtqueue scatter/gather lists to the passed
+ * amount of lists
+ * @_vq_sg: scatter/gather lists to be resized
+ * @nsgl: new amount of scatter/gather lists
+ */
+static int vq_sg_resize(struct virtqueue_sg **_vq_sg, unsigned int nsgl)
+{
+	struct virtqueue_sg *vq_sg;
+
+	BUG_ON(!_vq_sg || !nsgl);
+	vq_sg = *_vq_sg;
+	if (vq_sg->nsgl == nsgl)
+		return 0;
+
+	/* lazy resize implementation for now */
+	vq_sg = vq_sg_alloc(nsgl);
+	if (!vq_sg)
+		return -ENOMEM;
+
+	kfree(*_vq_sg);
+	*_vq_sg = vq_sg;
+	return 0;
+}
+
 /**
  * p9_virtio_close - reclaim resources of a channel
  * @client: client instance
@@ -774,6 +799,10 @@ p9_virtio_create(struct p9_client *client, const char *devname, char *args)
 	struct virtio_chan *chan;
 	int ret = -ENOENT;
 	int found = 0;
+#if !defined(CONFIG_ARCH_NO_SG_CHAIN)
+	size_t npages;
+	size_t nsgl;
+#endif
 
 	if (devname == NULL)
 		return -EINVAL;
@@ -796,6 +825,38 @@ p9_virtio_create(struct p9_client *client, const char *devname, char *args)
 		return ret;
 	}
 
+	/*
+	 * if user supplied an 'msize' option that's larger than what this
+	 * transport supports by default, then try to allocate more sg lists
+	 */
+	if (client->msize > client->trans_maxsize) {
+#ifdef CONFIG_ARCH_NO_SG_CHAIN
+		pr_info("limiting 'msize' to %d because architecture does not "
+			"support chained scatter gather lists\n",
+			client->trans_maxsize);
+#else
+		npages = DIV_ROUND_UP(client->msize, PAGE_SIZE);
+		if (npages > chan->p9_max_pages) {
+			npages = chan->p9_max_pages;
+			pr_info("limiting 'msize' as it would exceed the max. "
+				"of %lu pages allowed on this system\n",
+				chan->p9_max_pages);
+		}
+		nsgl = DIV_ROUND_UP(npages, SG_USER_PAGES_PER_LIST);
+		if (nsgl > chan->vq_sg->nsgl) {
+			/*
+			 * if resize fails, no big deal, then just
+			 * continue with default msize instead
+			 */
+			if (!vq_sg_resize(&chan->vq_sg, nsgl)) {
+				client->trans_maxsize =
+					PAGE_SIZE *
+					((nsgl * SG_USER_PAGES_PER_LIST) - 3);
+			}
+		}
+#endif /* CONFIG_ARCH_NO_SG_CHAIN */
+	}
+
 	client->trans = (void *)chan;
 	client->status = Connected;
 	chan->client = client;
-- 
2.20.1



* [PATCH v3 7/7] 9p/trans_virtio: resize sg lists to whatever is possible
  2021-09-22 16:00 [PATCH v3 0/7] net/9p: remove msize limit in virtio transport Christian Schoenebeck
                   ` (5 preceding siblings ...)
  2021-09-22 16:00 ` [PATCH v3 6/7] 9p/trans_virtio: support larger msize values Christian Schoenebeck
@ 2021-09-22 16:00 ` Christian Schoenebeck
  6 siblings, 0 replies; 18+ messages in thread
From: Christian Schoenebeck @ 2021-09-22 16:00 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

Right now vq_sg_resize() uses a lazy implementation following the
all-or-nothing principle: it either resizes exactly to the requested
new amount of sg lists, or it does not resize at all.

The problem with this is that if a user supplies a very large msize
value, the resize would simply fail and the user would be stuck with
the default maximum msize supported by the virtio transport.

To resolve this potential issue, change vq_sg_resize() to resize the
passed sg lists to whatever is possible on the machine.

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/trans_virtio.c | 68 ++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 60 insertions(+), 8 deletions(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 147ebf647a95..4c61b4857bea 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -208,24 +208,67 @@ static struct virtqueue_sg *vq_sg_alloc(unsigned int nsgl)
  * amount of lists
  * @_vq_sg: scatter/gather lists to be resized
  * @nsgl: new amount of scatter/gather lists
+ *
+ * Old scatter/gather lists are retained. Only growing the size is supported.
+ * If the requested amount cannot be satisfied, then lists are increased to
+ * whatever is possible.
  */
 static int vq_sg_resize(struct virtqueue_sg **_vq_sg, unsigned int nsgl)
 {
 	struct virtqueue_sg *vq_sg;
+	unsigned int i;
+	size_t sz;
+	int ret = 0;
 
 	BUG_ON(!_vq_sg || !nsgl);
 	vq_sg = *_vq_sg;
+	if (nsgl > VIRTQUEUE_SG_NSGL_MAX)
+		nsgl = VIRTQUEUE_SG_NSGL_MAX;
 	if (vq_sg->nsgl == nsgl)
 		return 0;
+	if (vq_sg->nsgl > nsgl)
+		return -ENOTSUPP;
+
+	vq_sg = kzalloc(sizeof(struct virtqueue_sg) +
+			nsgl * sizeof(struct scatterlist *),
+			GFP_KERNEL);
 
-	/* lazy resize implementation for now */
-	vq_sg = vq_sg_alloc(nsgl);
 	if (!vq_sg)
 		return -ENOMEM;
 
+	/* copy over old scatter gather lists */
+	sz = sizeof(struct virtqueue_sg) +
+		(*_vq_sg)->nsgl * sizeof(struct scatterlist *);
+	memcpy(vq_sg, *_vq_sg, sz);
+
+	vq_sg->nsgl = nsgl;
+
+	for (i = (*_vq_sg)->nsgl; i < nsgl; ++i) {
+		vq_sg->sgl[i] = kmalloc_array(
+			SG_MAX_SINGLE_ALLOC, sizeof(struct scatterlist),
+			GFP_KERNEL
+		);
+		/*
+		 * handle failed allocation as soft error, we take whatever
+		 * we get
+		 */
+		if (!vq_sg->sgl[i]) {
+			ret = -ENOMEM;
+			vq_sg->nsgl = nsgl = i;
+			break;
+		}
+		sg_init_table(vq_sg->sgl[i], SG_MAX_SINGLE_ALLOC);
+		if (i) {
+			/* chain the lists */
+			sg_chain(vq_sg->sgl[i - 1], SG_MAX_SINGLE_ALLOC,
+				 vq_sg->sgl[i]);
+		}
+	}
+	sg_mark_end(&vq_sg->sgl[nsgl - 1][SG_MAX_SINGLE_ALLOC - 1]);
+
 	kfree(*_vq_sg);
 	*_vq_sg = vq_sg;
-	return 0;
+	return ret;
 }
 
 /**
@@ -846,12 +889,21 @@ p9_virtio_create(struct p9_client *client, const char *devname, char *args)
 		if (nsgl > chan->vq_sg->nsgl) {
 			/*
 			 * if resize fails, no big deal, then just
-			 * continue with default msize instead
+			 * continue with whatever we got
+			 */
+			vq_sg_resize(&chan->vq_sg, nsgl);
+			/*
+			 * actual allocation size might be less than
+			 * requested, so use vq_sg->nsgl instead of nsgl
 			 */
-			if (!vq_sg_resize(&chan->vq_sg, nsgl)) {
-				client->trans_maxsize =
-					PAGE_SIZE *
-					((nsgl * SG_USER_PAGES_PER_LIST) - 3);
+			client->trans_maxsize =
+				PAGE_SIZE * ((chan->vq_sg->nsgl *
+				SG_USER_PAGES_PER_LIST) - 3);
+			if (nsgl > chan->vq_sg->nsgl) {
+				pr_info("limiting 'msize' to %d as only %d "
+					"of %zu SG lists could be allocated",
+					client->trans_maxsize,
+					chan->vq_sg->nsgl, nsgl);
 			}
 		}
 #endif /* CONFIG_ARCH_NO_SG_CHAIN */
-- 
2.20.1



* Re: [PATCH v3 6/7] 9p/trans_virtio: support larger msize values
  2021-09-22 16:00 ` [PATCH v3 6/7] 9p/trans_virtio: support larger msize values Christian Schoenebeck
@ 2021-11-20 11:20   ` Nikolay Kichukov
  2021-11-20 11:45     ` Dominique Martinet
  0 siblings, 1 reply; 18+ messages in thread
From: Nikolay Kichukov @ 2021-11-20 11:20 UTC (permalink / raw)
  To: Christian Schoenebeck, v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

[-- Attachment #1: Type: text/plain, Size: 5787 bytes --]

Thanks for the patches and sorry for top-posting.

I've tested them on GNU/Gentoo Linux, kernel 5.15.3 on amd64
architecture on both guest and KVM host.

The patches from this series, v3 have been applied to the host kernel
and also to the guest kernel. Guest kernel is clang compiled and host
kernel is compiled with gcc-11.

The host also received the qemu patches:
https://github.com/cschoenebeck/qemu/commit/04a7f9e55e0930b87805f7c97851eea4610e78fc
https://github.com/cschoenebeck/qemu/commit/b565bccb00afe8b73d529bbc3a38682996dac5c7
https://github.com/cschoenebeck/qemu/commit/669ced09b3b6070d478acce51810591b78ab0ccd

Qemu version on the host is 6.0.0-r54.

When the client mounts the share via virtio, requested msize is:
10485760 or 104857600

however the mount succeeds with:
msize=507904 in the end as per the /proc filesystem. This is less than
the previous maximum value.

However, performance-wise, I do see an improvement in throughput,
perhaps related to the qemu patches or some combination.

In addition to the above, when the kernel on the guest boots and loads
9pfs support, the attached memory allocation failure trace is generated.

Is anyone else seeing similar and was anybody able to get msize set to
10MB via virtio protocol with these patches?

Thank you,
-Nikolay

On Wed, 2021-09-22 at 18:00 +0200, Christian Schoenebeck wrote:
> The virtio transport supports by default a 9p 'msize' of up to
> approximately 500 kB. This patch adds support for larger 'msize'
> values by resizing the amount of scatter/gather lists if required.
> 
> Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
> ---
>  net/9p/trans_virtio.c | 61 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 61 insertions(+)
> 
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index e478a34326f1..147ebf647a95 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -203,6 +203,31 @@ static struct virtqueue_sg *vq_sg_alloc(unsigned
> int nsgl)
>         return vq_sg;
>  }
>  
> +/**
> + * vq_sg_resize - resize passed virtqueue scatter/gather lists to the
> passed
> + * amount of lists
> + * @_vq_sg: scatter/gather lists to be resized
> + * @nsgl: new amount of scatter/gather lists
> + */
> +static int vq_sg_resize(struct virtqueue_sg **_vq_sg, unsigned int
> nsgl)
> +{
> +       struct virtqueue_sg *vq_sg;
> +
> +       BUG_ON(!_vq_sg || !nsgl);
> +       vq_sg = *_vq_sg;
> +       if (vq_sg->nsgl == nsgl)
> +               return 0;
> +
> +       /* lazy resize implementation for now */
> +       vq_sg = vq_sg_alloc(nsgl);
> +       if (!vq_sg)
> +               return -ENOMEM;
> +
> +       kfree(*_vq_sg);
> +       *_vq_sg = vq_sg;
> +       return 0;
> +}
> +
>  /**
>   * p9_virtio_close - reclaim resources of a channel
>   * @client: client instance
> @@ -774,6 +799,10 @@ p9_virtio_create(struct p9_client *client, const
> char *devname, char *args)
>         struct virtio_chan *chan;
>         int ret = -ENOENT;
>         int found = 0;
> +#if !defined(CONFIG_ARCH_NO_SG_CHAIN)
> +       size_t npages;
> +       size_t nsgl;
> +#endif
>  
>         if (devname == NULL)
>                 return -EINVAL;
> @@ -796,6 +825,38 @@ p9_virtio_create(struct p9_client *client, const
> char *devname, char *args)
>                 return ret;
>         }
>  
> +       /*
> +        * if user supplied an 'msize' option that's larger than what
> this
> +        * transport supports by default, then try to allocate more sg
> lists
> +        */
> +       if (client->msize > client->trans_maxsize) {
> +#ifdef CONFIG_ARCH_NO_SG_CHAIN
> +               pr_info("limiting 'msize' to %d because architecture
> does not "
> +                       "support chained scatter gather lists\n",
> +                       client->trans_maxsize);
> +#else
> +               npages = DIV_ROUND_UP(client->msize, PAGE_SIZE);
> +               if (npages > chan->p9_max_pages) {
> +                       npages = chan->p9_max_pages;
> +                       pr_info("limiting 'msize' as it would exceed the
> max. "
> +                               "of %lu pages allowed on this system\n",
> +                               chan->p9_max_pages);
> +               }
> +               nsgl = DIV_ROUND_UP(npages, SG_USER_PAGES_PER_LIST);
> +               if (nsgl > chan->vq_sg->nsgl) {
> +                       /*
> +                        * if resize fails, no big deal, then just
> +                        * continue with default msize instead
> +                        */
> +                       if (!vq_sg_resize(&chan->vq_sg, nsgl)) {
> +                               client->trans_maxsize =
> +                                       PAGE_SIZE *
> +                                       ((nsgl * SG_USER_PAGES_PER_LIST)
> - 3);
> +                       }
> +               }
> +#endif /* CONFIG_ARCH_NO_SG_CHAIN */
> +       }
> +
>         client->trans = (void *)chan;
>         client->status = Connected;
>         chan->client = client;


[-- Attachment #2: 9p-msize.txt --]
[-- Type: text/plain, Size: 3403 bytes --]

[    1.527981] 9p: Installing v9fs 9p2000 file system support
[    1.528173] ------------[ cut here ]------------
[    1.528174] WARNING: CPU: 1 PID: 791 at mm/page_alloc.c:5356 __alloc_pages+0x1ed/0x290
[    1.528179] Modules linked in: 9p 9pnet_virtio virtio_net net_failover failover virtio_console 9pnet virtio_balloon efivarfs
[    1.528182] CPU: 1 PID: 791 Comm: mount Not tainted 5.15.3-gentoo-x86_64 #1
[    1.528184] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[    1.528185] RIP: 0010:__alloc_pages+0x1ed/0x290
[    1.528187] Code: ef 44 89 f6 e8 34 13 00 00 31 ed e9 5b ff ff ff 81 e3 3f ff ff ff 89 d9 89 cb 83 e3 f7 a9 00 00 00 10 0f 44 d9 e9 64 fe ff ff <0f> 0b 31 ed e9 42 ff ff ff 41 89 df 65 8b 05 08 65 9f 69 41 81 cf
[    1.528188] RSP: 0018:ffffb666405039e8 EFLAGS: 00010246
[    1.528189] RAX: c491259349f3ef00 RBX: 0000000000040c40 RCX: 0000000000000000
[    1.528190] RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000040c40
[    1.528191] RBP: 000000000000000c R08: 0000000000000090 R09: fffffa3cc0111640
[    1.528192] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000c40
[    1.528192] R13: ffff9fc944459c88 R14: 000000000000000c R15: 0000000000000c40
[    1.528194] FS:  00007ff620e57740(0000) GS:ffff9fc9bbc40000(0000) knlGS:0000000000000000
[    1.528195] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.528196] CR2: 00007fcfc14e8000 CR3: 0000000003358000 CR4: 0000000000350ea0
[    1.528198] Call Trace:
[    1.528201]  <TASK>
[    1.528202]  kmalloc_order+0x39/0xf0
[    1.528204]  kmalloc_order_trace+0x13/0x70
[    1.528205]  __kmalloc+0x1fc/0x2b0
[    1.528208]  p9_fcall_init+0x3d/0x60 [9pnet]
[    1.528210]  p9_client_prepare_req+0x82/0x2b0 [9pnet]
[    1.528212]  p9_client_rpc+0x80/0x350 [9pnet]
[    1.528214]  ? p9_virtio_create+0x253/0x2b0 [9pnet_virtio]
[    1.528216]  ? kfree+0x260/0x350
[    1.528217]  p9_client_version+0x60/0x1d0 [9pnet]
[    1.528219]  p9_client_create+0x3b4/0x460 [9pnet]
[    1.528221]  v9fs_session_init+0xab/0x760 [9p]
[    1.528222]  ? user_path_at_empty+0x7b/0x90
[    1.528224]  ? kmem_cache_alloc_trace+0x188/0x260
[    1.528226]  v9fs_mount+0x43/0x2d0 [9p]
[    1.528227]  legacy_get_tree.llvm.612377983374665620+0x22/0x40
[    1.528230]  vfs_get_tree+0x21/0xb0
[    1.528232]  path_mount+0x70d/0xcd0
[    1.528234]  __x64_sys_mount+0x148/0x1c0
[    1.528236]  do_syscall_64+0x4a/0xb0
[    1.528238]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[    1.528240] RIP: 0033:0x7ff620f92d5a
[    1.528241] Code: 48 8b 0d 11 c1 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d de c0 0b 00 f7 d8 64 89 01 48
[    1.528242] RSP: 002b:00007ffe4675f9f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[    1.528244] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff620f92d5a
[    1.528244] RDX: 0000000000541f90 RSI: 0000000000541f40 RDI: 0000000000541f20
[    1.528245] RBP: 0000000000000c00 R08: 00000000005421f0 R09: 00007ffe4675e790
[    1.528246] R10: 0000000000000c00 R11: 0000000000000246 R12: 000000000053d4f0
[    1.528246] R13: 0000000000541f20 R14: 0000000000541f90 R15: 00007ff62109dcf4
[    1.528247]  </TASK>
[    1.528248] ---[ end trace 84f05b2aa35f19b3 ]---

[   90.894853] 9pnet: Limiting 'msize' to 507904 as this is the maximum supported by transport virtio


* Re: [PATCH v3 6/7] 9p/trans_virtio: support larger msize values
  2021-11-20 11:20   ` Nikolay Kichukov
@ 2021-11-20 11:45     ` Dominique Martinet
  2021-11-20 14:46       ` Christian Schoenebeck
  0 siblings, 1 reply; 18+ messages in thread
From: Dominique Martinet @ 2021-11-20 11:45 UTC (permalink / raw)
  To: Nikolay Kichukov
  Cc: Christian Schoenebeck, v9fs-developer, netdev,
	Eric Van Hensbergen, Latchesar Ionkov, Greg Kurz, Vivek Goyal

(Thanks for restarting this thread, this had totally slipped out of my
mind...)

Nikolay Kichukov wrote on Sat, Nov 20, 2021 at 12:20:35PM +0100:
> When the client mounts the share via virtio, requested msize is:
> 10485760 or 104857600
> 
> however the mount succeeds with:
> msize=507904 in the end as per the /proc filesystem. This is less than
> the previous maximum value.

(Not sure about this, I'll test these patches tomorrow, but since
something failed I'm not surprised you have less than what you could
have here: what do you get with a more reasonable value like 1M first?)

> In addition to the above, when the kernel on the guest boots and loads
> 9pfs support, the attached memory allocation failure trace is generated.
> 
> Is anyone else seeing similar and was anybody able to get msize set to
> 10MB via virtio protocol with these patches?

I don't think the kernel would ever allow this with the given code, as
the "common" part of 9p is not smart enough to use scatter-gather and
tries to do a big allocation (almost) the size of the msize:

---
        clnt->fcall_cache =
                kmem_cache_create_usercopy("9p-fcall-cache", clnt->msize,
                                           0, 0, P9_HDRSZ + 4,
                                           clnt->msize - (P9_HDRSZ + 4),
                                           NULL);

...
		fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
---
So in practice, you will be capped at 2MB as that is the biggest the
slab will be able to hand over in a single chunk.

It'll also make allocation failures quite likely as soon as the system
has had some uptime (depending on your workload, look at /proc/buddyinfo
if your machines normally have 2MB chunks available), so I would really
not recommend running with buffers bigger than e.g. 512k on real
workloads -- it looks great on benchmarks, especially as it's on its own
slab so as long as you're doing a lot of requests it will dish out
buffers fast, but it'll likely be unreliable over time.
(oh, and we allocate two of these per request...)
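
(For reference, a quick sanity check of what the page allocator can still
hand out is e.g.:

	cat /proc/buddyinfo

where, assuming 4k pages, the 10th count column per zone is the number of
free order-9, i.e. 2MB, blocks.)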


The next step to support large buffers really would be splitting that
buffer to:
 1/ not allocate huge buffers for small metadata ops, e.g. an open call
certainly doesn't need to allocate 10MB of memory
 2/ support splitting the buffer in some scatter-gather array

Ideally we'd only allocate on an as-needed basis: most of the protocol
calls bound how much data is supposed to come back and we know how much
we want to send (it's a format string actually, but we can put an upper
bound on it quite easily), so one would need to adjust all protocol
calls to pass this info to p9_client_rpc/p9_client_zc_rpc so it only
allocates buffers as required, if necessary in multiple reasonably-sized
segments (I'd love 2MB hugepages-backed folios...), and have all
transports use these buffers.

I have a rough idea on how to do all this but honestly less than 0 time
for that, so I'm happy to give advice or review any patch, but it's
going to be a lot of work that stands in the way of really big IOs.



> [    1.527981] 9p: Installing v9fs 9p2000 file system support
> [    1.528173] ------------[ cut here ]------------
> [    1.528174] WARNING: CPU: 1 PID: 791 at mm/page_alloc.c:5356 __alloc_pages+0x1ed/0x290


This warning is exactly what I was saying about the allocation cap:
you've requested an allocation that was bigger than the max __alloc_page
is willing to provide (MAX_ORDER, 11, so 2MB as I was saying)

-- 
Dominique


* Re: [PATCH v3 6/7] 9p/trans_virtio: support larger msize values
  2021-11-20 11:45     ` Dominique Martinet
@ 2021-11-20 14:46       ` Christian Schoenebeck
  2021-11-20 21:28         ` Nikolay Kichukov
  2021-11-20 23:02         ` Dominique Martinet
  0 siblings, 2 replies; 18+ messages in thread
From: Christian Schoenebeck @ 2021-11-20 14:46 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Nikolay Kichukov, v9fs-developer, netdev, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

On Samstag, 20. November 2021 12:45:09 CET Dominique Martinet wrote:
> (Thanks for restarting this thread, this had totally slipped out of my
> mind...)

Hi guys,

I have (more or less) silently been working on these patches on all ends in
the meantime. If desired, I'll try to find some time next week to summarize
the current status of this overall effort and what still needs to be done.

> Nikolay Kichukov wrote on Sat, Nov 20, 2021 at 12:20:35PM +0100:
> > When the client mounts the share via virtio, requested msize is:
> > 10485760 or 104857600
> > 
> > however the mount succeeds with:
> > msize=507904 in the end as per the /proc filesystem. This is less than
> > the previous maximum value.
> 
> (Not sure about this, I'll test these patches tomorrow, but since
> something failed I'm not surprised you have less than what you could
> have here: what do you get with a more reasonable value like 1M first?)

The highest 'msize' value possible for me with this particular version of the 
kernel patches is 4186212 (so slightly below 4MB).

Some benchmarks, linear reading a 12 GB file:

msize    average      notes

8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
1 MB     2551 MB/s    this msize would violate current virtio specs
2 MB     2521 MB/s    this msize would violate current virtio specs
4 MB     2628 MB/s    planned milestone

That does not mean it already makes sense to use these patches in this
version as is to increase overall performance yet, but the numbers already
suggest that increasing msize can improve performance on large sequential
I/O. There are still some things to do though to prevent other use patterns
from slowing down with rising msize, which I will describe in a separate
email.

I also have an experimental version of kernel patches and QEMU that goes as
high as msize=128MB. But that's a highly hacked version that copies buffers
between 9p client level and virtio level back and forth, and that uses
intentionally large buffer sizes on every 9p message type. This was solely
meant as a stress test, i.e. whether it was possible to go as high as
virtio's theoretical protocol limit in the first place, and whether it was
stable. This stress test was not about performance at all. And yes, I had it
running with 128MB for weeks without issues (except of course being very
slow due to the hacks used).

> > In addition to the above, when the kernel on the guest boots and loads
> > 9pfs support, the attached memory allocation failure trace is generated.
> > 
> > Is anyone else seeing similar and was anybody able to get msize set to
> > 10MB via virtio protocol with these patches?
> 
> I don't think the kernel would ever allow this with the given code, as
> the "common" part of 9p is not smart enough to use scatter-gather and
> tries to do a big allocation (almost) the size of the msize:
> 
> ---
>         clnt->fcall_cache =
>                 kmem_cache_create_usercopy("9p-fcall-cache", clnt->msize,
>                                            0, 0, P9_HDRSZ + 4,
>                                            clnt->msize - (P9_HDRSZ + 4),
>                                            NULL);
> 
> ...
> 		fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
> ---
> So in practice, you will be capped at 2MB as that is the biggest the
> slab will be able to hand over in a single chunk.

I did not encounter a 2MB limit here. But kmalloc() clearly has a 4MB limit,
so trying an msize larger than 4MB inevitably causes a memory allocation
error. In my tests this allocation error would always happen immediately at
mount time, causing an instant kernel oops.

> It'll also make allocation failures quite likely as soon as the system
> has had some uptime (depending on your workload, look at /proc/buddyinfo
> if your machines normally have 2MB chunks available), so I would really
> not recommend running with buffers bigger than e.g. 512k on real
> workloads -- it looks great on benchmarks, especially as it's on its own
> slab so as long as you're doing a lot of requests it will dish out
> buffers fast, but it'll likely be unreliable over time.
> (oh, and we allocate two of these per request...)
> 
> 
> The next step to support large buffers really would be splitting htat
> buffer to:
>  1/ not allocate huge buffers for small metadata ops, e.g. an open call
> certainly doesn't need to allocate 10MB of memory
>  2/ support splitting the buffer in some scattergather array
> 
> Ideally we'd only allocate on an as-need basis, most of the protocol
> calls bound how much data is supposed to come back and we know how much
> we want to send (it's a format string actually, but we can majorate it
> quite easily), so one would need to adjust all protocol calls to pass
> this info to p9_client_rpc/p9_client_zc_rpc so it only allocates buffers
> as required, if necessary in multiple reasonably-sized segments (I'd
> love 2MB hugepages-backed folios...), and have all transports use these
> buffers.

It is not that bad in terms of pending work. One major thing that needs to
be done is to cap the majority of 9p message types to allocate only as much
as they need, which for most message types is <8k. Right now they always
simply kmalloc(msize), which hurts with increasing msize values. That task
does not need many changes though.

> I have a rough idea on how to do all this but honestly less than 0 time
> for that, so happy to give advices or review any patch, but it's going
> to be a lot of work that stand in the way of really big IOs.

Reviews of the patches would actually help a lot to bring this overall
effort forward, though probably rather of the upcoming next version of the
kernel patches than of this old v3.

> > [    1.527981] 9p: Installing v9fs 9p2000 file system support
> > [    1.528173] ------------[ cut here ]------------
> > [    1.528174] WARNING: CPU: 1 PID: 791 at mm/page_alloc.c:5356
> > __alloc_pages+0x1ed/0x290
> This warning is exactly what I was saying about the allocation cap:
> you've requested an allocation that was bigger than the max __alloc_page
> is willing to provide (MAX_ORDER, 11, so 2MB as I was saying)

Best regards,
Christian Schoenebeck




* Re: [PATCH v3 6/7] 9p/trans_virtio: support larger msize values
  2021-11-20 14:46       ` Christian Schoenebeck
@ 2021-11-20 21:28         ` Nikolay Kichukov
  2021-11-20 23:02         ` Dominique Martinet
  1 sibling, 0 replies; 18+ messages in thread
From: Nikolay Kichukov @ 2021-11-20 21:28 UTC (permalink / raw)
  To: Christian Schoenebeck, Dominique Martinet
  Cc: v9fs-developer, netdev, Eric Van Hensbergen, Latchesar Ionkov,
	Greg Kurz, Vivek Goyal

On Sat, 2021-11-20 at 15:46 +0100, Christian Schoenebeck wrote:
> On Samstag, 20. November 2021 12:45:09 CET Dominique Martinet wrote:
> > (Thanks for restarting this thread, this had totally slipped out of
> > my
> > mind...)
> 
> Hi guys,
> 
> I have (more or less) silently been working on these patches on all ends in
> the meantime. If desired I'll try to find some time next week to summarize
> the current status of this overall effort and what still needs to be done.

Great, I would be more than happy to test the next version of these patches.

> 
> > Nikolay Kichukov wrote on Sat, Nov 20, 2021 at 12:20:35PM +0100:
> > > When the client mounts the share via virtio, requested msize is:
> > > 10485760 or 104857600
> > > 
> > > however the mount succeeds with:
> > > msize=507904 in the end as per the /proc filesystem. This is less than
> > > the previous maximum value.
> > 
> > (Not sure about this, I'll test these patches tomorrow, but since
> > something failed I'm not surprised you have less than what you could
> > have here: what do you get with a more reasonable value like 1M
> > first?)

It worked with 1MB, I can stick to this for the time being.

Are the kernel patches supposed to be included in the KVM host kernel or
would the guest kernel suffice?

> 
> The highest 'msize' value possible for me with this particular version
> of the 
> kernel patches is 4186212 (so slightly below 4MB).
> 
> Some benchmarks, linear reading a 12 GB file:
> 
> msize    average      notes
> 
> 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> 1 MB     2551 MB/s    this msize would violate current virtio specs
> 2 MB     2521 MB/s    this msize would violate current virtio specs
> 4 MB     2628 MB/s    planned milestone
> 
> That does not mean it makes sense yet to use this version of the patches
> as is to increase overall performance, but the numbers already suggest
> that increasing msize can improve performance on large sequential I/O. There
> are still some things to do though to prevent other usage patterns from
> slowing down with rising msize, which I will describe in a separate email.
> 
> I also have an experimental version of kernel patches and QEMU that goes as
> high as msize=128MB. But that's a highly hacked version that copies buffers
> between 9p client level and virtio level back and forth, with intentionally
> large buffer sizes on every 9p message type. This was solely meant as a stress
> test, i.e. whether it was possible to go as high as virtio's theoretical
> protocol limit in the first place, and whether it was stable. This stress test
> was not about performance at all. And yes, I had it running with 128MB for
> weeks without issues (except being very slow of course due to the hacks used).
> 
> > > In addition to the above, when the kernel on the guest boots and
> > > loads
> > > 9pfs support, the attached memory allocation failure trace is
> > > generated.
> > > 
> > > Is anyone else seeing similar and was anybody able to get msize
> > > set to
> > > 10MB via virtio protocol with these patches?
> > 
> > I don't think the kernel would ever allow this with the given code, as
> > the "common" part of 9p is not smart enough to use scatter-gather and
> > tries to do a big allocation (almost) the size of the msize:
> > 
> > ---
> >         clnt->fcall_cache =
> >                 kmem_cache_create_usercopy("9p-fcall-cache", clnt->msize,
> >                                            0, 0, P9_HDRSZ + 4,
> >                                            clnt->msize - (P9_HDRSZ + 4),
> >                                            NULL);
> > 
> > ...
> >                 fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
> > ---
> > So in practice, you will be capped at 2MB as that is the biggest the
> > slab will be able to hand over in a single chunk.
> 
> I did not encounter a 2MB limit here. But kmalloc() clearly has a 4MB limit,
> so when trying an msize larger than 4MB it inevitably causes a memory
> allocation error. In my tests this allocation error would always happen
> immediately at mount time causing an instant kernel oops.
> 
> > It'll also make allocation failures quite likely as soon as the system
> > has had some uptime (depending on your workload, look at /proc/buddyinfo
> > if your machines normally have 2MB chunks available), so I would really
> > not recommend running with buffers bigger than e.g. 512k on real
> > workloads -- it looks great on benchmarks, especially as it's on its own
> > slab so as long as you're doing a lot of requests it will dish out
> > buffers fast, but it'll likely be unreliable over time.
> > (oh, and we allocate two of these per request...)
> > 
> > 
> > The next step to support large buffers really would be splitting that
> > buffer to:
> >  1/ not allocate huge buffers for small metadata ops, e.g. an open call
> > certainly doesn't need to allocate 10MB of memory
> >  2/ support splitting the buffer in some scattergather array
> > 
> > Ideally we'd only allocate on an as-need basis, most of the protocol
> > calls bound how much data is supposed to come back and we know how much
> > we want to send (it's a format string actually, but we can majorate it
> > quite easily), so one would need to adjust all protocol calls to pass
> > this info to p9_client_rpc/p9_client_zc_rpc so it only allocates buffers
> > as required, if necessary in multiple reasonably-sized segments (I'd
> > love 2MB hugepages-backed folios...), and have all transports use these
> > buffers.
> 
> It is not that bad in sense of pending work. One major thing that needs to be
> done is to cap the majority of 9p message types to allocate only as much as
> they need, which is for most message types <8k. Right now they always simply
> kmalloc(msize), which hurts with increasing msize values. That task does not
> need many changes though.
> 
> > I have a rough idea on how to do all this but honestly less than 0 time
> > for that, so happy to give advice or review any patch, but it's going
> > to be a lot of work that stands in the way of really big IOs.
> 
> Reviews of the patches would actually help a lot to bring this overall effort
> forward, but probably rather starting with the upcoming next version of the
> kernel patches, not this old v3.
> 
> > > [    1.527981] 9p: Installing v9fs 9p2000 file system support
> > > [    1.528173] ------------[ cut here ]------------
> > > [    1.528174] WARNING: CPU: 1 PID: 791 at mm/page_alloc.c:5356
> > > __alloc_pages+0x1ed/0x290
> > This warning is exactly what I was saying about the allocation cap:
> > you've requested an allocation that was bigger than the max __alloc_page
> > is willing to provide (MAX_ORDER, 11, so 2MB as I was saying)

Yes, this is no longer happening when I lowered the value of the msize
parameter.

> 
> Best regards,
> Christian Schoenebeck
> 
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 6/7] 9p/trans_virtio: support larger msize values
  2021-11-20 14:46       ` Christian Schoenebeck
  2021-11-20 21:28         ` Nikolay Kichukov
@ 2021-11-20 23:02         ` Dominique Martinet
  2021-11-21 16:57           ` Christian Schoenebeck
  1 sibling, 1 reply; 18+ messages in thread
From: Dominique Martinet @ 2021-11-20 23:02 UTC (permalink / raw)
  To: Christian Schoenebeck, Nikolay Kichukov
  Cc: v9fs-developer, netdev, Eric Van Hensbergen, Latchesar Ionkov,
	Greg Kurz, Vivek Goyal


Christian,

Christian Schoenebeck wrote on Sat, Nov 20, 2021 at 03:46:19PM +0100:
> > So in practice, you will be capped at 2MB as that is the biggest the
> > slab will be able to hand over in a single chunk.
> 
> I did not encounter a 2MB limit here. But kmalloc() clearly has a 4MB limit, 
> so when trying an msize larger than 4MB it inevitably causes a memory 
> allocation error. In my tests this allocation error would always happen 
> immediately at mount time causing an instant kernel oops.

Interesting, I was assuming it'd have the same limit.
There must be some fallback path I didn't know about... I wonder if it
handles non-contiguous memory ranges too then, in which case it's not as
bad as I'd have expected depending on how finely it's willing to sew
things back together: I'll check

> > Ideally we'd only allocate on an as-need basis, most of the protocol
> > calls bound how much data is supposed to come back and we know how much
> > we want to send (it's a format string actually, but we can majorate it
> > quite easily), so one would need to adjust all protocol calls to pass
> > this info to p9_client_rpc/p9_client_zc_rpc so it only allocates buffers
> > as required, if necessary in multiple reasonably-sized segments (I'd
> > love 2MB hugepages-backed folios...), and have all transports use these
> > buffers.
> 
> It is not that bad in sense of pending work. One major thing that needs to be 
> done is to cap the majority of 9p message types to allocate only as much as 
> they need, which is for most message types <8k. Right now they always simply 
> kmalloc(msize), which hurts with increasing msize values. That task does not 
> need many changes though.

Yes, that could be a first step.
Although frankly as I said if we're going to do this, we actually can
majorate the actual max for all operations pretty easily thanks to the
count parameter -- I guess it's a bit more work but we can put arbitrary
values (e.g. 8k for all the small stuff) instead of trying to figure it
out more precisely; I'd just like the code path to be able to do it so
we only do that rechurn once.

Note I've been rather aggressive with checkpatch warning fixes in my
last update for 5.16, hopefully it won't conflict too much with your
work... Let me deal with conflicts if it's a problem.

> > I have a rough idea on how to do all this but honestly less than 0 time
> > for that, so happy to give advice or review any patch, but it's going
> > to be a lot of work that stands in the way of really big IOs.
> 
> Reviews of the patches would actually help a lot to bring this overall effort 
> forward, but probably rather starting with the upcoming next version of the 
> kernel patches, not this old v3.

Happy to review anything you send over, yes :)



Nikolay,

> > > (Not sure about this, I'll test these patches tomorrow, but since
> > > something failed I'm not surprised you have less than what you could
> > > have here: what do you get with a more reasonable value like 1M
> > > first?)
> 
> It worked with 1MB, I can stick to this for the time being.
> 
> Are the kernel patches supposed to be included in the KVM host kernel or
> would the guest kernel suffice?

The patches are only required in the guest.

-- 
Dominique

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 6/7] 9p/trans_virtio: support larger msize values
  2021-11-20 23:02         ` Dominique Martinet
@ 2021-11-21 16:57           ` Christian Schoenebeck
  2021-11-21 22:12             ` Dominique Martinet
  0 siblings, 1 reply; 18+ messages in thread
From: Christian Schoenebeck @ 2021-11-21 16:57 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Nikolay Kichukov, v9fs-developer, netdev, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

On Sunday, 21 November 2021 00:02:16 CET Dominique Martinet wrote:
> Christian,
> 
> Christian Schoenebeck wrote on Sat, Nov 20, 2021 at 03:46:19PM +0100:
> > > So in practice, you will be capped at 2MB as that is the biggest the
> > > slab will be able to hand over in a single chunk.
> > 
> > I did not encounter a 2MB limit here. But kmalloc() clearly has a 4MB
> > limit, so when trying an msize larger than 4MB it inevitably causes a
> > memory allocation error. In my tests this allocation error would always
> > happen immediately at mount time causing an instant kernel oops.
> 
> Interesting, I was assuming it'd have the same limit.

That's one of the things in the Linux kernel's memory allocation APIs that 
could be improved. IMO it should be clear to developers what the max. size for 
kmalloc() is, either via a dedicated function or a macro, *and* that should be 
officially and clearly documented. Right now it is more of an "I think it is ..." 
and then trying it out. You might use the value of MAX_ORDER, but that does not 
feel like an official way to me.
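
For what it's worth, a hedged aside: assuming SLUB and a typical config,
include/linux/slab.h does expose an upper bound that matches the 4 MB
observation, even if it is not prominently documented. Exact definitions vary
by allocator and kernel version, so treat this as a sketch:

#include <linux/slab.h>

/* KMALLOC_MAX_SIZE is roughly 1UL << (MAX_ORDER + PAGE_SHIFT - 1) with
 * SLUB, i.e. 4 MiB with MAX_ORDER = 11 and 4 KiB pages.
 */
static inline size_t p9_kmalloc_upper_bound(void)
{
        return KMALLOC_MAX_SIZE;
}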

> There must be some fallback path I didn't know about... I wonder if it
> handles non-contiguous memory ranges too then, in which case it's not as
> bad as I'd have expected depending on how finely it's willing to sew
> things back together: I'll check

AFAICS that was a misconception in the past in 9p code (I see 
is_vmalloc_addr() being used); there are basically two types of kernel buffers 
that you can allocate today:

1. Buffer with guaranteed contiguous physical pages, e.g. kmalloc():

   These are limited to exactly 4 MB.

or

2. Buffer with potential non-contiguous physical pages, e.g. vmalloc().

   For 32-bit systems the limit here is either 120 MB or 128 MB. On 64-bit
   systems it is usually only limited by the physical memory being available.

In both cases you get exactly one memory address that can very simply be 
treated by driver code as if that address was pointing to a single linear 
buffer, just like you would in regular user space apps.

However (2.) is not an option for the 9p driver, because that memory type 
yields non-logical addresses which cannot be accessed by a device through DMA. 
I actually tried replacing the kmalloc() call by kvmalloc() as a quick hack to 
bypass the 4 MB limit without significant code changes, however it fails on 
the QEMU side with the following error, as the host cannot access that kind of 
memory address in general:

  virtio: bogus descriptor or out of resources

So in my very slow msize=128MB virtio stress test I ended up using kvmalloc() 
for 9p client side, plus alloc_pages() for virtio/DMA side and copying between 
the two sides back and forth as a hack. But copying should not be an option 
for production code, as it is very slow.
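
To illustrate why (2.) cannot simply be handed to the virtio ring, here is a
minimal sketch (not the actual transport code; a real implementation would
have to walk the buffer page by page):

#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/scatterlist.h>

/* sg_set_buf()/virt_to_page() assume a linearly mapped (kmalloc-style)
 * address; a vmalloc'ed buffer must be translated via vmalloc_to_page()
 * instead, one page at a time, because its physical pages are not
 * contiguous.
 */
static void sg_set_kernel_buf(struct scatterlist *sg, void *buf,
                              unsigned int len)
{
        if (is_vmalloc_addr(buf))
                /* only valid while buf..buf+len stays within one page */
                sg_set_page(sg, vmalloc_to_page(buf), len,
                            offset_in_page(buf));
        else
                sg_set_buf(sg, buf, len);
}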

> > > Ideally we'd only allocate on an as-need basis, most of the protocol
> > > calls bound how much data is supposed to come back and we know how much
> > > we want to send (it's a format string actually, but we can majorate it
> > > quite easily), so one would need to adjust all protocol calls to pass
> > > this info to p9_client_rpc/p9_client_zc_rpc so it only allocates buffers
> > > as required, if necessary in multiple reasonably-sized segments (I'd
> > > love 2MB hugepages-backed folios...), and have all transports use these
> > > buffers.
> > 
> > It is not that bad in sense of pending work. One major thing that needs to
> > be done is to cap the majority of 9p message types to allocate only as
> > much as they need, which is for most message types <8k. Right now they
> > always simply kmalloc(msize), which hurts with increasing msize values.
> > That task does not need many changes though.
> 
> Yes, that could be a first step.
> Although frankly as I said if we're going to do this, we actually can
> majorate the actual max for all operations pretty easily thanks to the
> count parameter -- I guess it's a bit more work but we can put arbitrary
> values (e.g. 8k for all the small stuff) instead of trying to figure it
> out more precisely; I'd just like the code path to be able to do it so
> we only do that rechurn once.

Looks like we had a similar idea on this. My plan was something like this:

static int max_msg_size(enum msg_type) {
    switch (msg_type) {
        /* large zero copy messages */
        case Twrite:
        case Tread:
        case Treaddir:
            BUG_ON(true);

        /* small messages */
        case Tversion:
        ....
            return 8k; /* to be replaced with appropriate max value */
    }
}

That would be a quick start and allow fine-grained tuning in the future. It 
would also provide a safety net, e.g. the compiler would bark if a new message 
type is added in the future.
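
For reference, a version of that sketch which would actually compile could
look like this (the P9_T* constants should be the ones from
include/net/9p/9p.h; the 8 KiB cap is still just a placeholder):

#include <linux/bug.h>
#include <net/9p/9p.h>

static size_t p9_max_msg_size(enum p9_msg_t type)
{
        switch (type) {
        /* large zero-copy messages: sized from their 'count' field */
        case P9_TREAD:
        case P9_TWRITE:
        case P9_TREADDIR:
                BUG();
                return 0;
        /* all other (small metadata) messages */
        default:
                return 8 * 1024; /* placeholder, to be refined per type */
        }
}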

> Note I've been rather aggressive with checkpatch warning fixes in my
> last update for 5.16, hopefully it won't conflict too much with your
> work... Let me deal with conflicts if it's a problem.

Good to know! I just tried a quick rebase, no conflicts fortunately.

> > > I have a rough idea on how to do all this but honestly less than 0 time
> > > for that, so happy to give advice or review any patch, but it's going
> > > to be a lot of work that stands in the way of really big IOs.
> > 
> > Reviews of the patches would actually help a lot to bring this overall
> > effort forward, but probably rather starting with the upcoming next
> > version of the kernel patches, not this old v3.
> 
> Happy to review anything you send over, yes :)

Perfect!

> Nikolay,
> 
> > > > (Not sure about this, I'll test these patches tomorrow, but since
> > > > something failed I'm not surprised you have less than what you could
> > > > have here: what do you get with a more reasonable value like 1M
> > > > first?)
> > 
> > It worked with 1MB, I can stick to this for the time being.
> > 
> > Are the kernel patches supposed to be included in the KVM host kernel or
> > would the guest kernel suffice?
> 
> The patches are only required in the guest.

Correct, plus preferably the latest git version of QEMU, a.k.a. the upcoming 
QEMU 6.2 release (already in hard freeze).

https://wiki.qemu.org/Planning/6.2
https://wiki.qemu.org/ChangeLog/6.2#9pfs

That QEMU 6.2 version is still limited to 4 MB virtio transfer size though.

I proposed patches for QEMU that allow up to 128 MB virtio transfer size. That 
ended in a very long discussion on qemu-devel and the conclusion that a 
refinement of the virtio specs should be done:
https://lore.kernel.org/all/cover.1633376313.git.qemu_oss@crudebyte.com/

Virtio spec changes proposed (v1):
https://github.com/oasis-tcs/virtio-spec/issues/122

You can probably subscribe on that virtio-spec issue if you want to receive 
updates on its progress.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 6/7] 9p/trans_virtio: support larger msize values
  2021-11-21 16:57           ` Christian Schoenebeck
@ 2021-11-21 22:12             ` Dominique Martinet
  2021-11-22 13:32               ` Christian Schoenebeck
  0 siblings, 1 reply; 18+ messages in thread
From: Dominique Martinet @ 2021-11-21 22:12 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Nikolay Kichukov, v9fs-developer, netdev, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

Christian Schoenebeck wrote on Sun, Nov 21, 2021 at 05:57:30PM +0100:
> > Although frankly as I said if we're going to do this, we actually can
> > majorate the actual max for all operations pretty easily thanks to the
> > count parameter -- I guess it's a bit more work but we can put arbitrary
> > values (e.g. 8k for all the small stuff) instead of trying to figure it
> > out more precisely; I'd just like the code path to be able to do it so
> > we only do that rechurn once.
> 
> Looks like we had a similar idea on this. My plan was something like this:
> 
> static int max_msg_size(enum msg_type) {
>     switch (msg_type) {
>         /* large zero copy messages */
>         case Twrite:
>         case Tread:
>         case Treaddir:
>             BUG_ON(true);
> 
>         /* small messages */
>         case Tversion:
>         ....
>             return 8k; /* to be replaced with appropriate max value */
>     }
> }
> 
> That would be a quick start and allow fine-grained tuning in the future. It 
> would also provide a safety net, e.g. the compiler would bark if a new message 
> type is added in the future.

I assume that'd only be used if the caller does not set an explicit
limit, at which point we're basically using a constant and the function
could be replaced by a P9_SMALL_MSG_SIZE constant... But yes, I agree
with the idea, it's these three calls that deal with big buffers in
either emission or reception (might as well not allocate a 128MB send
buffer for Tread ;))
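
Spelled out, that constant-based fallback could be as simple as this sketch
(the value and the helper are made up for illustration):

#include <linux/kernel.h>
#include <net/9p/client.h>

#define P9_SMALL_MSG_SIZE (8 * 1024) /* placeholder value */

/* hypothetical: buffer size when the caller passes no explicit limit */
static inline size_t p9_req_alloc_size(const struct p9_client *c,
                                       size_t limit)
{
        if (!limit)
                limit = P9_SMALL_MSG_SIZE;
        return min_t(size_t, c->msize, limit);
}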

If you have a Plan for it I'll let you continue and review as things
come. Thanks a lot for the work!

-- 
Dominique

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 6/7] 9p/trans_virtio: support larger msize values
  2021-11-21 22:12             ` Dominique Martinet
@ 2021-11-22 13:32               ` Christian Schoenebeck
  2021-11-22 22:35                 ` Dominique Martinet
  0 siblings, 1 reply; 18+ messages in thread
From: Christian Schoenebeck @ 2021-11-22 13:32 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Nikolay Kichukov, v9fs-developer, netdev, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

On Sunday, 21 November 2021 23:12:14 CET Dominique Martinet wrote:
> Christian Schoenebeck wrote on Sun, Nov 21, 2021 at 05:57:30PM +0100:
> > > Although frankly as I said if we're going to do this, we actually can
> > > majorate the actual max for all operations pretty easily thanks to the
> > > count parameter -- I guess it's a bit more work but we can put arbitrary
> > > values (e.g. 8k for all the small stuff) instead of trying to figure it
> > > out more precisely; I'd just like the code path to be able to do it so
> > > we only do that rechurn once.
> > 
> > Looks like we had a similar idea on this. My plan was something like this:
> > 
> > static int max_msg_size(enum msg_type) {
> >     switch (msg_type) {
> >         /* large zero copy messages */
> >         case Twrite:
> >         case Tread:
> >         case Treaddir:
> >             BUG_ON(true);
> > 
> >         /* small messages */
> >         case Tversion:
> >         ....
> >             return 8k; /* to be replaced with appropriate max value */
> >     }
> > }
> > 
> > That would be a quick start and allow fine-grained tuning in the future. It
> > would also provide a safety net, e.g. the compiler would bark if a new
> > message type is added in the future.
> 
> I assume that'd only be used if the caller does not set an explicit
> limit, at which point we're basically using a constant and the function
> could be replaced by a P9_SMALL_MSG_SIZE constant... But yes, I agree
> with the idea, it's these three calls that deal with big buffers in
> either emission or reception (might as well not allocate a 128MB send
> buffer for Tread ;))

I "think" this could be used for all 9p message types except for the listed 
former three, but I'll review the 9p specs more carefully before v4. For Tread 
and Twrite we already received the requested size, which just leaves Treaddir, 
which is however indeed tricky, because I don't think we have any info how 
many directory entries we could expect.

A simple compile time constant (e.g. one macro) could be used instead of this 
function. If you prefer a constant instead, I could go for it in v4 of course. 
For this 9p client I would recommend a function though, simply because this 
code has already seen some authors come and go over the years, so it might be 
worth the redundant code for future safety. But I'll adapt to what others 
think.

Greg, opinion?

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 6/7] 9p/trans_virtio: support larger msize values
  2021-11-22 13:32               ` Christian Schoenebeck
@ 2021-11-22 22:35                 ` Dominique Martinet
  2021-11-24 12:22                   ` Christian Schoenebeck
  0 siblings, 1 reply; 18+ messages in thread
From: Dominique Martinet @ 2021-11-22 22:35 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Nikolay Kichukov, v9fs-developer, netdev, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

Christian Schoenebeck wrote on Mon, Nov 22, 2021 at 02:32:23PM +0100:
> I "think" this could be used for all 9p message types except for the listed 
> former three, but I'll review the 9p specs more carefully before v4. For Tread 
> and Twrite we already received the requested size, which just leaves Treaddir, 
> which is however indeed tricky, because I don't think we have any info how 
> many directory entries we could expect.

count in Treaddir is a number of bytes, not a number of entries -- so
it's perfect for this as well :)

> A simple compile time constant (e.g. one macro) could be used instead of this 
> function. If you prefer a constant instead, I could go for it in v4 of course. 
> For this 9p client I would recommend a function though, simply because this 
> code has already seen some authors come and go over the years, so it might be 
> worth the redundant code for future safety. But I'll adapt to what others 
> think.

In this case a fallback constant seems simpler than a big switch like
you've done, but honestly I'm not fussy at this point -- if you work on
this you have the right to decide this kind of things in my opinion.

My worry with the snippet you listed is that you need to enumerate all
calls again, so if someday the protocol is extended it'll be a place to
forget adding new calls (although compiler warnings help with that),
whereas a fallback constant will always work as long as it's small
messages.

But really, as long as it's not horrible I'll take it :)

-- 
Dominique

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 6/7] 9p/trans_virtio: support larger msize values
  2021-11-22 22:35                 ` Dominique Martinet
@ 2021-11-24 12:22                   ` Christian Schoenebeck
  0 siblings, 0 replies; 18+ messages in thread
From: Christian Schoenebeck @ 2021-11-24 12:22 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Nikolay Kichukov, v9fs-developer, netdev, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

On Monday, 22 November 2021 23:35:18 CET Dominique Martinet wrote:
> Christian Schoenebeck wrote on Mon, Nov 22, 2021 at 02:32:23PM +0100:
> > I "think" this could be used for all 9p message types except for the
> > listed
> > former three, but I'll review the 9p specs more carefully before v4. For
> > Tread and Twrite we already received the requested size, which just
> > leaves Treaddir, which is however indeed tricky, because I don't think we
> > have any info how many directory entries we could expect.
> 
> count in Treaddir is a number of bytes, not a number of entries -- so
> it's perfect for this as well :)

Yes it is in bytes, but it's currently always simply msize - P9_READDIRHDRSZ:
https://github.com/torvalds/linux/blob/5d9f4cf36721aba199975a9be7863a3ff5cd4b59/fs/9p/vfs_dir.c#L159

As my planned milestone for this series is max. 4 MB msize, it might be okay
for now, but it is something to keep in mind, and it should be checked whether
it slows things down.

In the long term, when msize >4MB is supported, this definitely must be
addressed.
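
A sketch of the kind of cap that would eventually be needed there
(P9_READDIR_MAXLEN is a made-up constant; today the code effectively uses
msize - P9_READDIRHDRSZ with no upper bound):

#include <linux/kernel.h>
#include <net/9p/9p.h>
#include <net/9p/client.h>

#define P9_READDIR_MAXLEN (64 * 1024) /* assumed value */

static inline size_t p9_readdir_buflen(const struct p9_client *clnt)
{
        return min_t(size_t, clnt->msize - P9_READDIRHDRSZ,
                     P9_READDIR_MAXLEN);
}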

> > A simple compile time constant (e.g. one macro) could be used instead of
> > this function. If you prefer a constant instead, I could go for it in v4
> > of course. For this 9p client I would recommend a function though, simply
> > because this code has already seen some authors come and go over the
> > years, so it might be worth the redundant code for future safety. But
> > I'll adapt to what others think.
> 
> In this case a fallback constant seems simpler than a big switch like
> you've done, but honestly I'm not fussy at this point -- if you work on
> this you have the right to decide this kind of things in my opinion.
> 
> My worry with the snippet you listed is that you need to enumerate all
> calls again, so if someday the protocol is extended it'll be a place to
> forget adding new calls (although compiler warnings help with that),
> whereas a fallback constant will always work as long as it's small
> messages.
> 
> But really, as long as it's not horrible I'll take it :)

Maybe I can find a compromise. :)

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-11-24 12:28 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-22 16:00 [PATCH v3 0/7] net/9p: remove msize limit in virtio transport Christian Schoenebeck
2021-09-22 16:00 ` [PATCH v3 1/7] net/9p: show error message if user 'msize' cannot be satisfied Christian Schoenebeck
2021-09-22 16:00 ` [PATCH v3 2/7] 9p/trans_virtio: separate allocation of scatter gather list Christian Schoenebeck
2021-09-22 16:00 ` [PATCH v3 3/7] 9p/trans_virtio: turn amount of sg lists into runtime info Christian Schoenebeck
2021-09-22 16:00 ` [PATCH v3 4/7] 9p/trans_virtio: introduce struct virtqueue_sg Christian Schoenebeck
2021-09-22 16:00 ` [PATCH v3 5/7] net/9p: add trans_maxsize to struct p9_client Christian Schoenebeck
2021-09-22 16:00 ` [PATCH v3 6/7] 9p/trans_virtio: support larger msize values Christian Schoenebeck
2021-11-20 11:20   ` Nikolay Kichukov
2021-11-20 11:45     ` Dominique Martinet
2021-11-20 14:46       ` Christian Schoenebeck
2021-11-20 21:28         ` Nikolay Kichukov
2021-11-20 23:02         ` Dominique Martinet
2021-11-21 16:57           ` Christian Schoenebeck
2021-11-21 22:12             ` Dominique Martinet
2021-11-22 13:32               ` Christian Schoenebeck
2021-11-22 22:35                 ` Dominique Martinet
2021-11-24 12:22                   ` Christian Schoenebeck
2021-09-22 16:00 ` [PATCH v3 7/7] 9p/trans_virtio: resize sg lists to whatever is possible Christian Schoenebeck
