* [PATCH v4 12/12] net/9p: allocate appropriate reduced message buffers
  2021-12-30 13:23 [PATCH v4 00/12] remove msize limit in virtio transport Christian Schoenebeck
@ 2021-12-30 13:23 ` Christian Schoenebeck
  2022-04-02 14:05   ` Dominique Martinet
  2021-12-30 13:23 ` [PATCH v4 08/12] net/9p: limit 'msize' to KMALLOC_MAX_SIZE for all transports Christian Schoenebeck
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Christian Schoenebeck @ 2021-12-30 13:23 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal, Nikolay Kichukov

So far 'msize' was simply used for all 9p message types, which is far
too much for most messages and slowed down performance tremendously
with large values of the user-configurable 'msize' option.

Let's stop this waste by using the new p9_msg_buf_size() function for
allocating more appropriate, smaller buffers according to what is
actually sent over the wire.

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/client.c | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/net/9p/client.c b/net/9p/client.c
index 56be1658870d..773915c95219 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -255,19 +255,35 @@ static struct kmem_cache *p9_req_cache;
  * p9_tag_alloc - Allocate a new request.
  * @c: Client session.
  * @type: Transaction type.
- * @t_size: Buffer size for holding this request.
- * @r_size: Buffer size for holding server's reply on this request.
+ * @t_size: Buffer size for holding this request
+ * (automatic calculation by format template if 0).
+ * @r_size: Buffer size for holding server's reply on this request
+ * (automatic calculation by format template if 0).
+ * @fmt: Format template for assembling 9p request message
+ * (see p9pdu_vwritef).
+ * @ap: Variable arguments to be fed to passed format template
+ * (see p9pdu_vwritef).
  *
  * Context: Process context.
  * Return: Pointer to new request.
  */
 static struct p9_req_t *
-p9_tag_alloc(struct p9_client *c, int8_t type, uint t_size, uint r_size)
+p9_tag_alloc(struct p9_client *c, int8_t type, uint t_size, uint r_size,
+	      const char *fmt, va_list ap)
 {
 	struct p9_req_t *req = kmem_cache_alloc(p9_req_cache, GFP_NOFS);
-	int alloc_tsize = min(c->msize, t_size);
-	int alloc_rsize = min(c->msize, r_size);
+	int alloc_tsize;
+	int alloc_rsize;
 	int tag;
+	va_list apc;
+
+	va_copy(apc, ap);
+	alloc_tsize = min_t(size_t, c->msize,
+			    t_size ?: p9_msg_buf_size(c, type, fmt, apc));
+	va_end(apc);
+
+	alloc_rsize = min_t(size_t, c->msize,
+			    r_size ?: p9_msg_buf_size(c, type + 1, fmt, ap));
 
 	if (!req)
 		return ERR_PTR(-ENOMEM);
@@ -685,6 +701,7 @@ static struct p9_req_t *p9_client_prepare_req(struct p9_client *c,
 {
 	int err;
 	struct p9_req_t *req;
+	va_list apc;
 
 	p9_debug(P9_DEBUG_MUX, "client %p op %d\n", c, type);
 
@@ -696,7 +713,9 @@ static struct p9_req_t *p9_client_prepare_req(struct p9_client *c,
 	if (c->status == BeginDisconnect && type != P9_TCLUNK)
 		return ERR_PTR(-EIO);
 
-	req = p9_tag_alloc(c, type, t_size, r_size);
+	va_copy(apc, ap);
+	req = p9_tag_alloc(c, type, t_size, r_size, fmt, apc);
+	va_end(apc);
 	if (IS_ERR(req))
 		return req;
 
@@ -733,7 +752,7 @@ p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...)
 	struct p9_req_t *req;
 
 	va_start(ap, fmt);
-	req = p9_client_prepare_req(c, type, c->msize, c->msize, fmt, ap);
+	req = p9_client_prepare_req(c, type, 0, 0, fmt, ap);
 	va_end(ap);
 	if (IS_ERR(req))
 		return req;
-- 
2.30.2



* [PATCH v4 00/12] remove msize limit in virtio transport
@ 2021-12-30 13:23 Christian Schoenebeck
  2021-12-30 13:23 ` [PATCH v4 12/12] net/9p: allocate appropriate reduced message buffers Christian Schoenebeck
                   ` (13 more replies)
  0 siblings, 14 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2021-12-30 13:23 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal, Nikolay Kichukov

This series aims to get rid of the current 500k 'msize' limitation in
the 9p virtio transport, which is currently a bottleneck for the performance
of 9p mounts.

To avoid confusion: it does remove the msize limit for the virtio transport;
on the 9p client level, however, the anticipated milestone for this series is
a max. 'msize' of 4 MB for now. See patch 8 for the reason why.

This is a follow-up of the following series and discussion:
https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/

Latest version of this series:
https://github.com/cschoenebeck/linux/commits/9p-virtio-drop-msize-cap


OVERVIEW OF PATCHES:

* Patch 1 just adds a trivial info message so the user knows why their
  'msize' option got ignored by the 9p client in case the value was larger
  than what is supported by this 9p driver.

* Patches 2..7 remove the msize limitation from the 'virtio' transport
  (i.e. the 9p 'virtio' transport itself actually supports >4MB now, tested
  successfully with an experimental QEMU version and some dirty 9p Linux
  client hacks up to msize=128MB).

* Patch 8 limits msize for all transports to 4 MB for now as >4MB would need
  more work on 9p client level (see commit log of patch 8 for details).

* Patches 9..12 tremendously reduce unnecessarily huge 9p message sizes and
  therefore provide a performance gain as well. So far, almost all 9p messages
  simply allocated message buffers exactly msize large, even for messages
  that actually just needed a few bytes. So these patches make sense by
  themselves, independent of this overall series, but even more so for this
  series, because the larger msize gets, the more this issue would have hurt
  otherwise.


PREREQUISITES:

If you are testing with QEMU then please either use latest QEMU 6.2 release
or higher, or at least apply the following patch on QEMU side:

  https://lore.kernel.org/qemu-devel/E1mT2Js-0000DW-OH@lizzy.crudebyte.com/

That QEMU patch is required if you are using a user space app that
automatically retrieves an optimum I/O block size by obeying stat's
st_blksize, which 'cat' for instance does, e.g.:

	time cat test_rnd.dat > /dev/null

Otherwise please use a user space app for performance testing that allows
you to force a large block size and thereby avoid that QEMU issue, like 'dd'
for instance; in that case you don't need to patch QEMU.
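
For example, a rough benchmark invocation that forces a 4 MB block size could
look like this (file name and block size here are just placeholders, adjust
them to your own setup):

	time dd if=test_rnd.dat of=/dev/null bs=4194304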


KNOWN LIMITATION:

With this series applied I can run

  QEMU host <-> 9P virtio <-> Linux guest

with up to slightly below 4 MB msize [4186112 = (1024-2) * 4096]. If I try
to run it with exactly 4 MB (4194304) it currently hits a limitation on the
QEMU side:

  qemu-system-x86_64: virtio: too many write descriptors in indirect table

That's because QEMU currently has a hard coded limit of max. 1024 virtio
descriptors per vring slot (i.e. per virtio message); see to-do (1) below.
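
For example, a test mount staying just below that limit could look like this
(the mount tag 'hostshare' and the mount point are just placeholders for your
own setup):

	mount -t 9p -o trans=virtio,version=9p2000.L,msize=4186112 hostshare /mnt/host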


STILL TO DO:

  1. Negotiating virtio "Queue Indirect Size" (MANDATORY):

    The QEMU issue described above must be addressed by negotiating the
    maximum length of virtio indirect descriptor tables on virtio device
    initialization. This would not only avoid the QEMU error above, but would
    also allow an msize of >4MB in the future. Before that change can be made
    on the Linux and QEMU sides though, it first requires a change to the
    virtio specs. Work on the virtio specs is in progress:

    https://github.com/oasis-tcs/virtio-spec/issues/122

    This is not really an issue for testing this series. Just stick to a max.
    msize=4186112 as described above and you will be fine. However, for the
    final PR this should obviously be addressed in a clean way.

  2. Reduce readdir buffer sizes (optional - maybe later):

    This series already reduced the message buffers for most 9p message
    types. It does not include Treaddir yet though, which still simply
    uses msize. It would make sense to benchmark first whether this is
    actually an issue that hurts. If it does, then one might use already
    existing vfs knowledge to estimate the Treaddir size, or start with
    some reasonable hard coded small Treaddir size first and then increase
    it just on the 2nd Treaddir request if there are more directory entries
    to fetch.
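
    Just to illustrate that growth-policy idea, a rough userspace sketch
    might look like the following (this is not actual client code; the
    function name, the initial size and the doubling step are all made up
    for illustration):

	/* Illustrative sketch only: a possible Treaddir buffer growth
	 * policy. Start small, grow on each follow-up Treaddir request
	 * for the same directory, and cap at the negotiated msize.
	 */
	#include <stddef.h>

	#define TREADDIR_INITIAL_SIZE (8 * 1024)

	static size_t next_treaddir_bufsize(size_t prev_size, size_t msize)
	{
		if (prev_size == 0) /* first Treaddir for this directory */
			return TREADDIR_INITIAL_SIZE < msize ?
			       TREADDIR_INITIAL_SIZE : msize;
		if (prev_size >= msize / 2) /* cap at negotiated msize */
			return msize;
		return prev_size * 2; /* more entries to fetch: grow */
	}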

  3. Add more buffer caches (optional - maybe later):

    p9_fcall_init() uses kmem_cache_alloc() instead of kmalloc() for very
    large buffers to reduce the latency of waiting for memory allocation to
    complete. Currently it does that only if the requested buffer size is
    exactly msize large. As patch 11 already divided the 9p message types
    into a few message size categories, maybe it would make sense to use e.g.
    4 separate caches for those size categories (e.g. 4k, 8k, msize/2,
    msize). Might be worth a benchmark test.
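
    To illustrate that idea, a rough sketch of a size-category based
    allocation helper might look like this (this is not what the patches
    implement; the cache variables and the helper name are made up, the
    caches would have to be created once at client setup time, and a
    matching free helper would need to remember which cache an allocation
    came from):

	#include <linux/slab.h>

	/* Illustrative sketch only: caches for the small size categories,
	 * e.g. created via kmem_cache_create("p9-buf-4k", 4096, 0, 0, NULL).
	 */
	static struct kmem_cache *p9_buf_cache_4k;
	static struct kmem_cache *p9_buf_cache_8k;

	static void *p9_buf_alloc(size_t size, gfp_t gfp)
	{
		if (size <= 4 * 1024 && p9_buf_cache_4k)
			return kmem_cache_alloc(p9_buf_cache_4k, gfp);
		if (size <= 8 * 1024 && p9_buf_cache_8k)
			return kmem_cache_alloc(p9_buf_cache_8k, gfp);
		/* larger categories (msize/2, msize) would get their own
		 * caches as well; fall back to kmalloc() in this sketch
		 */
		return kmalloc(size, gfp);
	}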

Testing and feedback appreciated!

v3 -> v4:

  * Limit msize to 4 MB for all transports [NEW patch 8].

  * Avoid unnecessarily huge 9p message buffers
    [NEW patch 9] .. [NEW patch 12].

Christian Schoenebeck (12):
  net/9p: show error message if user 'msize' cannot be satisfied
  9p/trans_virtio: separate allocation of scatter gather list
  9p/trans_virtio: turn amount of sg lists into runtime info
  9p/trans_virtio: introduce struct virtqueue_sg
  net/9p: add trans_maxsize to struct p9_client
  9p/trans_virtio: support larger msize values
  9p/trans_virtio: resize sg lists to whatever is possible
  net/9p: limit 'msize' to KMALLOC_MAX_SIZE for all transports
  net/9p: split message size argument into 't_size' and 'r_size' pair
  9p: add P9_ERRMAX for 9p2000 and 9p2000.u
  net/9p: add p9_msg_buf_size()
  net/9p: allocate appropriate reduced message buffers

 include/net/9p/9p.h     |   3 +
 include/net/9p/client.h |   2 +
 net/9p/client.c         |  67 +++++++--
 net/9p/protocol.c       | 154 ++++++++++++++++++++
 net/9p/protocol.h       |   2 +
 net/9p/trans_virtio.c   | 304 +++++++++++++++++++++++++++++++++++-----
 6 files changed, 483 insertions(+), 49 deletions(-)

-- 
2.30.2



* [PATCH v4 04/12] 9p/trans_virtio: introduce struct virtqueue_sg
  2021-12-30 13:23 [PATCH v4 00/12] remove msize limit in virtio transport Christian Schoenebeck
  2021-12-30 13:23 ` [PATCH v4 12/12] net/9p: allocate appropriate reduced message buffers Christian Schoenebeck
  2021-12-30 13:23 ` [PATCH v4 08/12] net/9p: limit 'msize' to KMALLOC_MAX_SIZE for all transports Christian Schoenebeck
@ 2021-12-30 13:23 ` Christian Schoenebeck
  2021-12-30 13:23 ` [PATCH v4 01/12] net/9p: show error message if user 'msize' cannot be satisfied Christian Schoenebeck
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2021-12-30 13:23 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal, Nikolay Kichukov

The number of elements in a scatter/gather list is limited to
approximately 128. To allow going beyond that limit
with subsequent patches, pave the way by turning the one-
dimensional sg list array into a two-dimensional array, i.e.:

  sg[128]

becomes

  sgl[nsgl][SG_MAX_SINGLE_ALLOC]

As the value of 'nsgl' is exactly (still) 1 in this commit
and the compile-time (compiler and architecture dependent)
value of 'SG_MAX_SINGLE_ALLOC' approximately equals the
previous hard coded 128 elements, this commit is therefore
more of a preparatory refactoring than an actual behaviour
change.

A custom struct virtqueue_sg is defined instead of using the
shared API struct sg_table, because the latter would not
allow resizing the table after allocation. The sg_append_table
API OTOH would not fit either, because it requires a list
of pages beforehand upon allocation. And both APIs only
support all-or-nothing allocation.

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/trans_virtio.c | 193 ++++++++++++++++++++++++++++++++----------
 1 file changed, 147 insertions(+), 46 deletions(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index d063c69b85b7..656562a66f06 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -36,7 +36,31 @@
 #include <linux/virtio_9p.h>
 #include "trans_common.h"
 
-#define VIRTQUEUE_DEFAULT_NUM	128
+/**
+ * struct virtqueue_sg - (chained) scatter gather lists for virtqueue data
+ * transmission
+ * @nsgl: amount of elements (in first dimension) of array field @sgl
+ * @sgl: two-dimensional array, i.e. sgl[nsgl][SG_MAX_SINGLE_ALLOC]
+ */
+struct virtqueue_sg {
+	unsigned int nsgl;
+	struct scatterlist *sgl[];
+};
+
+/*
+ * Default value for field nsgl in struct virtqueue_sg, which defines the
+ * initial virtio data transmission capacity when this virtio transport is
+ * probed.
+ */
+#define VIRTQUEUE_SG_NSGL_DEFAULT 1
+
+/* maximum value for field nsgl in struct virtqueue_sg */
+#define VIRTQUEUE_SG_NSGL_MAX						\
+	((PAGE_SIZE - sizeof(struct virtqueue_sg)) /			\
+	sizeof(struct scatterlist *))					\
+
+/* last entry per sg list is used for chaining (pointer to next list) */
+#define SG_USER_PAGES_PER_LIST	(SG_MAX_SINGLE_ALLOC - 1)
 
 /* a single mutex to manage channel initialization and attachment */
 static DEFINE_MUTEX(virtio_9p_lock);
@@ -53,8 +77,7 @@ static atomic_t vp_pinned = ATOMIC_INIT(0);
  * @ring_bufs_avail: flag to indicate there is some available in the ring buf
  * @vc_wq: wait queue for waiting for thing to be added to ring buf
  * @p9_max_pages: maximum number of pinned pages
- * @sg: scatter gather list which is used to pack a request (protected?)
- * @sg_n: amount of elements in sg array
+ * @vq_sg: table of scatter gather lists, which are used to pack a request
  * @chan_list: linked list of channels
  *
  * We keep all per-channel information in a structure.
@@ -77,9 +100,7 @@ struct virtio_chan {
 	 * will be placing it in each channel.
 	 */
 	unsigned long p9_max_pages;
-	/* Scatterlist: can be too big for stack. */
-	struct scatterlist *sg;
-	size_t sg_n;
+	struct virtqueue_sg *vq_sg;
 	/**
 	 * @tag: name to identify a mount null terminated
 	 */
@@ -96,6 +117,92 @@ static unsigned int rest_of_page(void *data)
 	return PAGE_SIZE - offset_in_page(data);
 }
 
+/**
+ * vq_sg_page - returns user page for given page index
+ * @vq_sg: scatter gather lists used by this transport
+ * @page: user page index across all scatter gather lists
+ */
+static struct scatterlist *vq_sg_page(struct virtqueue_sg *vq_sg, size_t page)
+{
+	unsigned int node = page / SG_USER_PAGES_PER_LIST;
+	unsigned int leaf = page % SG_USER_PAGES_PER_LIST;
+	BUG_ON(node >= VIRTQUEUE_SG_NSGL_MAX);
+	return &vq_sg->sgl[node][leaf];
+}
+
+/**
+ * vq_sg_npages - returns total number of individual user pages in passed
+ * scatter gather lists
+ * @vq_sg: scatter gather lists to be counted
+ */
+static size_t vq_sg_npages(struct virtqueue_sg *vq_sg)
+{
+	return vq_sg->nsgl * SG_USER_PAGES_PER_LIST;
+}
+
+/**
+ * vq_sg_free - free all memory previously allocated for @vq_sg
+ * @vq_sg: scatter gather lists to be freed
+ */
+static void vq_sg_free(struct virtqueue_sg *vq_sg)
+{
+	unsigned int i;
+
+	if (!vq_sg)
+		return;
+
+	for (i = 0; i < vq_sg->nsgl; ++i) {
+		kfree(vq_sg->sgl[i]);
+	}
+	kfree(vq_sg);
+}
+
+/**
+ * vq_sg_alloc - allocates and returns @nsgl scatter gather lists
+ * @nsgl: amount of scatter gather lists to be allocated
+ * If @nsgl is larger than one then chained lists are used if supported by
+ * architecture.
+ */
+static struct virtqueue_sg *vq_sg_alloc(unsigned int nsgl)
+{
+	struct virtqueue_sg *vq_sg;
+	unsigned int i;
+
+	BUG_ON(!nsgl || nsgl > VIRTQUEUE_SG_NSGL_MAX);
+#ifdef CONFIG_ARCH_NO_SG_CHAIN
+	if (WARN_ON_ONCE(nsgl > 1))
+		return NULL;
+#endif
+
+	vq_sg = kzalloc(sizeof(struct virtqueue_sg) +
+			nsgl * sizeof(struct scatterlist *),
+			GFP_KERNEL);
+
+	if (!vq_sg)
+		return NULL;
+
+	vq_sg->nsgl = nsgl;
+
+	for (i = 0; i < nsgl; ++i) {
+		vq_sg->sgl[i] = kmalloc_array(
+			SG_MAX_SINGLE_ALLOC, sizeof(struct scatterlist),
+			GFP_KERNEL
+		);
+		if (!vq_sg->sgl[i]) {
+			vq_sg_free(vq_sg);
+			return NULL;
+		}
+		sg_init_table(vq_sg->sgl[i], SG_MAX_SINGLE_ALLOC);
+		if (i) {
+			/* chain the lists */
+			sg_chain(vq_sg->sgl[i - 1], SG_MAX_SINGLE_ALLOC,
+				 vq_sg->sgl[i]);
+		}
+	}
+	sg_mark_end(&vq_sg->sgl[nsgl - 1][SG_MAX_SINGLE_ALLOC - 1]);
+	return vq_sg;
+}
+
 /**
  * p9_virtio_close - reclaim resources of a channel
  * @client: client instance
@@ -158,9 +265,8 @@ static void req_done(struct virtqueue *vq)
 
 /**
  * pack_sg_list - pack a scatter gather list from a linear buffer
- * @sg: scatter/gather list to pack into
+ * @vq_sg: scatter/gather lists to pack into
  * @start: which segment of the sg_list to start at
- * @limit: maximum segment to pack data to
  * @data: data to pack into scatter/gather list
  * @count: amount of data to pack into the scatter/gather list
  *
@@ -170,11 +276,12 @@ static void req_done(struct virtqueue *vq)
  *
  */
 
-static int pack_sg_list(struct scatterlist *sg, int start,
-			int limit, char *data, int count)
+static int pack_sg_list(struct virtqueue_sg *vq_sg, int start,
+			char *data, int count)
 {
 	int s;
 	int index = start;
+	size_t limit = vq_sg_npages(vq_sg);
 
 	while (count) {
 		s = rest_of_page(data);
@@ -182,13 +289,13 @@ static int pack_sg_list(struct scatterlist *sg, int start,
 			s = count;
 		BUG_ON(index >= limit);
 		/* Make sure we don't terminate early. */
-		sg_unmark_end(&sg[index]);
-		sg_set_buf(&sg[index++], data, s);
+		sg_unmark_end(vq_sg_page(vq_sg, index));
+		sg_set_buf(vq_sg_page(vq_sg, index++), data, s);
 		count -= s;
 		data += s;
 	}
 	if (index-start)
-		sg_mark_end(&sg[index - 1]);
+		sg_mark_end(vq_sg_page(vq_sg, index - 1));
 	return index-start;
 }
 
@@ -208,21 +315,21 @@ static int p9_virtio_cancelled(struct p9_client *client, struct p9_req_t *req)
 /**
  * pack_sg_list_p - Just like pack_sg_list. Instead of taking a buffer,
  * this takes a list of pages.
- * @sg: scatter/gather list to pack into
+ * @vq_sg: scatter/gather lists to pack into
  * @start: which segment of the sg_list to start at
- * @limit: maximum number of pages in sg list.
  * @pdata: a list of pages to add into sg.
  * @nr_pages: number of pages to pack into the scatter/gather list
  * @offs: amount of data in the beginning of first page _not_ to pack
  * @count: amount of data to pack into the scatter/gather list
  */
 static int
-pack_sg_list_p(struct scatterlist *sg, int start, int limit,
+pack_sg_list_p(struct virtqueue_sg *vq_sg, int start,
 	       struct page **pdata, int nr_pages, size_t offs, int count)
 {
 	int i = 0, s;
 	int data_off = offs;
 	int index = start;
+	size_t limit = vq_sg_npages(vq_sg);
 
 	BUG_ON(nr_pages > (limit - start));
 	/*
@@ -235,15 +342,16 @@ pack_sg_list_p(struct scatterlist *sg, int start, int limit,
 			s = count;
 		BUG_ON(index >= limit);
 		/* Make sure we don't terminate early. */
-		sg_unmark_end(&sg[index]);
-		sg_set_page(&sg[index++], pdata[i++], s, data_off);
+		sg_unmark_end(vq_sg_page(vq_sg, index));
+		sg_set_page(vq_sg_page(vq_sg, index++), pdata[i++], s,
+			    data_off);
 		data_off = 0;
 		count -= s;
 		nr_pages--;
 	}
 
 	if (index-start)
-		sg_mark_end(&sg[index - 1]);
+		sg_mark_end(vq_sg_page(vq_sg, index - 1));
 	return index - start;
 }
 
@@ -271,15 +379,13 @@ p9_virtio_request(struct p9_client *client, struct p9_req_t *req)
 
 	out_sgs = in_sgs = 0;
 	/* Handle out VirtIO ring buffers */
-	out = pack_sg_list(chan->sg, 0,
-			   chan->sg_n, req->tc.sdata, req->tc.size);
+	out = pack_sg_list(chan->vq_sg, 0, req->tc.sdata, req->tc.size);
 	if (out)
-		sgs[out_sgs++] = chan->sg;
+		sgs[out_sgs++] = vq_sg_page(chan->vq_sg, 0);
 
-	in = pack_sg_list(chan->sg, out,
-			  chan->sg_n, req->rc.sdata, req->rc.capacity);
+	in = pack_sg_list(chan->vq_sg, out, req->rc.sdata, req->rc.capacity);
 	if (in)
-		sgs[out_sgs + in_sgs++] = chan->sg + out;
+		sgs[out_sgs + in_sgs++] = vq_sg_page(chan->vq_sg, out);
 
 	err = virtqueue_add_sgs(chan->vq, sgs, out_sgs, in_sgs, req,
 				GFP_ATOMIC);
@@ -448,16 +554,15 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 	out_sgs = in_sgs = 0;
 
 	/* out data */
-	out = pack_sg_list(chan->sg, 0,
-			   chan->sg_n, req->tc.sdata, req->tc.size);
+	out = pack_sg_list(chan->vq_sg, 0, req->tc.sdata, req->tc.size);
 
 	if (out)
-		sgs[out_sgs++] = chan->sg;
+		sgs[out_sgs++] = vq_sg_page(chan->vq_sg, 0);
 
 	if (out_pages) {
-		sgs[out_sgs++] = chan->sg + out;
-		out += pack_sg_list_p(chan->sg, out, chan->sg_n,
-				      out_pages, out_nr_pages, offs, outlen);
+		sgs[out_sgs++] = vq_sg_page(chan->vq_sg, out);
+		out += pack_sg_list_p(chan->vq_sg, out, out_pages,
+				      out_nr_pages, offs, outlen);
 	}
 
 	/*
@@ -467,15 +572,14 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 	 * Arrange in such a way that server places header in the
 	 * allocated memory and payload onto the user buffer.
 	 */
-	in = pack_sg_list(chan->sg, out,
-			  chan->sg_n, req->rc.sdata, in_hdr_len);
+	in = pack_sg_list(chan->vq_sg, out, req->rc.sdata, in_hdr_len);
 	if (in)
-		sgs[out_sgs + in_sgs++] = chan->sg + out;
+		sgs[out_sgs + in_sgs++] = vq_sg_page(chan->vq_sg, out);
 
 	if (in_pages) {
-		sgs[out_sgs + in_sgs++] = chan->sg + out + in;
-		in += pack_sg_list_p(chan->sg, out + in, chan->sg_n,
-				     in_pages, in_nr_pages, offs, inlen);
+		sgs[out_sgs + in_sgs++] = vq_sg_page(chan->vq_sg, out + in);
+		in += pack_sg_list_p(chan->vq_sg, out + in, in_pages,
+				     in_nr_pages, offs, inlen);
 	}
 
 	BUG_ON(out_sgs + in_sgs > ARRAY_SIZE(sgs));
@@ -576,14 +680,12 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 		goto fail;
 	}
 
-	chan->sg = kmalloc_array(VIRTQUEUE_DEFAULT_NUM,
-				 sizeof(struct scatterlist), GFP_KERNEL);
-	if (!chan->sg) {
+	chan->vq_sg = vq_sg_alloc(VIRTQUEUE_SG_NSGL_DEFAULT);
+	if (!chan->vq_sg) {
 		pr_err("Failed to allocate virtio 9P channel\n");
 		err = -ENOMEM;
 		goto out_free_chan_shallow;
 	}
-	chan->sg_n = VIRTQUEUE_DEFAULT_NUM;
 
 	chan->vdev = vdev;
 
@@ -596,8 +698,6 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 	chan->vq->vdev->priv = chan;
 	spin_lock_init(&chan->lock);
 
-	sg_init_table(chan->sg, chan->sg_n);
-
 	chan->inuse = false;
 	if (virtio_has_feature(vdev, VIRTIO_9P_MOUNT_TAG)) {
 		virtio_cread(vdev, struct virtio_9p_config, tag_len, &tag_len);
@@ -646,7 +746,7 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 out_free_vq:
 	vdev->config->del_vqs(vdev);
 out_free_chan:
-	kfree(chan->sg);
+	vq_sg_free(chan->vq_sg);
 out_free_chan_shallow:
 	kfree(chan);
 fail:
@@ -741,7 +841,7 @@ static void p9_virtio_remove(struct virtio_device *vdev)
 	kobject_uevent(&(vdev->dev.kobj), KOBJ_CHANGE);
 	kfree(chan->tag);
 	kfree(chan->vc_wq);
-	kfree(chan->sg);
+	vq_sg_free(chan->vq_sg);
 	kfree(chan);
 
 }
@@ -780,7 +880,8 @@ static struct p9_trans_module p9_virtio_trans = {
 	 * that are not at page boundary, that can result in an extra
 	 * page in zero copy.
 	 */
-	.maxsize = PAGE_SIZE * (VIRTQUEUE_DEFAULT_NUM - 3),
+	.maxsize = PAGE_SIZE *
+		((VIRTQUEUE_SG_NSGL_DEFAULT * SG_USER_PAGES_PER_LIST) - 3),
 	.def = 1,
 	.owner = THIS_MODULE,
 };
-- 
2.30.2



* [PATCH v4 05/12] net/9p: add trans_maxsize to struct p9_client
  2021-12-30 13:23 [PATCH v4 00/12] remove msize limit in virtio transport Christian Schoenebeck
                   ` (9 preceding siblings ...)
  2021-12-30 13:23 ` [PATCH v4 07/12] 9p/trans_virtio: resize sg lists to whatever is possible Christian Schoenebeck
@ 2021-12-30 13:23 ` Christian Schoenebeck
  2021-12-30 13:23 ` [PATCH v4 03/12] 9p/trans_virtio: turn amount of sg lists into runtime info Christian Schoenebeck
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2021-12-30 13:23 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal, Nikolay Kichukov

This new field 'trans_maxsize' optionally allows the transport to
update it to reflect the actual maximum msize supported by the
allocated transport channel.

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 include/net/9p/client.h |  2 ++
 net/9p/client.c         | 12 ++++++++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index ec1d1706f43c..f5718057fca4 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -87,6 +87,7 @@ struct p9_req_t {
  * struct p9_client - per client instance state
  * @lock: protect @fids and @reqs
  * @msize: maximum data size negotiated by protocol
+ * @trans_maxsize: actual maximum msize supported by transport channel
  * @proto_version: 9P protocol version to use
  * @trans_mod: module API instantiated with this client
  * @status: connection state
@@ -101,6 +102,7 @@ struct p9_req_t {
 struct p9_client {
 	spinlock_t lock;
 	unsigned int msize;
+	unsigned int trans_maxsize;
 	unsigned char proto_version;
 	struct p9_trans_module *trans_mod;
 	enum p9_trans_status status;
diff --git a/net/9p/client.c b/net/9p/client.c
index 8bba0d9cf975..20054addd81b 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1031,6 +1031,14 @@ struct p9_client *p9_client_create(const char *dev_name, char *options)
 		goto free_client;
 	}
 
+	/*
+	 * transport will get a chance to increase trans_maxsize (if
+	 * necessary) and it may update trans_maxsize in create() function
+	 * below accordingly to reflect the actual maximum size supported by
+	 * the allocated transport channel
+	 */
+	clnt->trans_maxsize = clnt->trans_mod->maxsize;
+
 	p9_debug(P9_DEBUG_MUX, "clnt %p trans %p msize %d protocol %d\n",
 		 clnt, clnt->trans_mod, clnt->msize, clnt->proto_version);
 
@@ -1038,8 +1046,8 @@ struct p9_client *p9_client_create(const char *dev_name, char *options)
 	if (err)
 		goto put_trans;
 
-	if (clnt->msize > clnt->trans_mod->maxsize) {
-		clnt->msize = clnt->trans_mod->maxsize;
+	if (clnt->msize > clnt->trans_maxsize) {
+		clnt->msize = clnt->trans_maxsize;
 		pr_info("Limiting 'msize' to %d as this is the maximum "
 			"supported by transport %s\n",
 			clnt->msize, clnt->trans_mod->name
-- 
2.30.2



* [PATCH v4 02/12] 9p/trans_virtio: separate allocation of scatter gather list
  2021-12-30 13:23 [PATCH v4 00/12] remove msize limit in virtio transport Christian Schoenebeck
                   ` (6 preceding siblings ...)
  2021-12-30 13:23 ` [PATCH v4 06/12] 9p/trans_virtio: support larger msize values Christian Schoenebeck
@ 2021-12-30 13:23 ` Christian Schoenebeck
  2021-12-30 13:23 ` [PATCH v4 09/12] net/9p: split message size argument into 't_size' and 'r_size' pair Christian Schoenebeck
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2021-12-30 13:23 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal, Nikolay Kichukov

The scatter gather list in struct virtio_chan currently
resides as compile-time constant size array directly within the
contiguous struct virtio_chan's memory space.

Separate memory space and allocation of the scatter gather list
from memory space and allocation of struct virtio_chan.

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/trans_virtio.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index bd5a89c4960d..7f0c992c0f68 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -77,7 +77,7 @@ struct virtio_chan {
 	 */
 	unsigned long p9_max_pages;
 	/* Scatterlist: can be too big for stack. */
-	struct scatterlist sg[VIRTQUEUE_NUM];
+	struct scatterlist *sg;
 	/**
 	 * @tag: name to identify a mount null terminated
 	 */
@@ -574,6 +574,14 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 		goto fail;
 	}
 
+	chan->sg = kmalloc_array(VIRTQUEUE_NUM,
+				 sizeof(struct scatterlist), GFP_KERNEL);
+	if (!chan->sg) {
+		pr_err("Failed to allocate virtio 9P channel\n");
+		err = -ENOMEM;
+		goto out_free_chan_shallow;
+	}
+
 	chan->vdev = vdev;
 
 	/* We expect one virtqueue, for requests. */
@@ -635,6 +643,8 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 out_free_vq:
 	vdev->config->del_vqs(vdev);
 out_free_chan:
+	kfree(chan->sg);
+out_free_chan_shallow:
 	kfree(chan);
 fail:
 	return err;
@@ -728,6 +738,7 @@ static void p9_virtio_remove(struct virtio_device *vdev)
 	kobject_uevent(&(vdev->dev.kobj), KOBJ_CHANGE);
 	kfree(chan->tag);
 	kfree(chan->vc_wq);
+	kfree(chan->sg);
 	kfree(chan);
 
 }
-- 
2.30.2



* [PATCH v4 01/12] net/9p: show error message if user 'msize' cannot be satisfied
  2021-12-30 13:23 [PATCH v4 00/12] remove msize limit in virtio transport Christian Schoenebeck
                   ` (2 preceding siblings ...)
  2021-12-30 13:23 ` [PATCH v4 04/12] 9p/trans_virtio: introduce struct virtqueue_sg Christian Schoenebeck
@ 2021-12-30 13:23 ` Christian Schoenebeck
  2021-12-30 13:23 ` [PATCH v4 10/12] 9p: add P9_ERRMAX for 9p2000 and 9p2000.u Christian Schoenebeck
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2021-12-30 13:23 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal, Nikolay Kichukov

If the user supplied a large value with the 'msize' option, then the
client would silently limit that 'msize' value to the maximum
value supported by the transport. That's a bit confusing for users,
as they get no indication why their preferred 'msize' value
could not be satisfied.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/client.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/9p/client.c b/net/9p/client.c
index d062f1e5bfb0..8bba0d9cf975 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1038,8 +1038,13 @@ struct p9_client *p9_client_create(const char *dev_name, char *options)
 	if (err)
 		goto put_trans;
 
-	if (clnt->msize > clnt->trans_mod->maxsize)
+	if (clnt->msize > clnt->trans_mod->maxsize) {
 		clnt->msize = clnt->trans_mod->maxsize;
+		pr_info("Limiting 'msize' to %d as this is the maximum "
+			"supported by transport %s\n",
+			clnt->msize, clnt->trans_mod->name
+		);
+	}
 
 	if (clnt->msize < 4096) {
 		p9_debug(P9_DEBUG_ERROR,
-- 
2.30.2



* [PATCH v4 03/12] 9p/trans_virtio: turn amount of sg lists into runtime info
  2021-12-30 13:23 [PATCH v4 00/12] remove msize limit in virtio transport Christian Schoenebeck
                   ` (10 preceding siblings ...)
  2021-12-30 13:23 ` [PATCH v4 05/12] net/9p: add trans_maxsize to struct p9_client Christian Schoenebeck
@ 2021-12-30 13:23 ` Christian Schoenebeck
  2022-01-20 22:43 ` [PATCH v4 00/12] remove msize limit in virtio transport Nikolay Kichukov
  2022-07-07 14:30 ` Christian Schoenebeck
  13 siblings, 0 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2021-12-30 13:23 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal, Nikolay Kichukov

The size of the scatter/gather lists used by the virtio transport is
currently hard coded. Refactor this to use a runtime variable.

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/trans_virtio.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 7f0c992c0f68..d063c69b85b7 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -36,7 +36,7 @@
 #include <linux/virtio_9p.h>
 #include "trans_common.h"
 
-#define VIRTQUEUE_NUM	128
+#define VIRTQUEUE_DEFAULT_NUM	128
 
 /* a single mutex to manage channel initialization and attachment */
 static DEFINE_MUTEX(virtio_9p_lock);
@@ -54,6 +54,7 @@ static atomic_t vp_pinned = ATOMIC_INIT(0);
  * @vc_wq: wait queue for waiting for thing to be added to ring buf
  * @p9_max_pages: maximum number of pinned pages
  * @sg: scatter gather list which is used to pack a request (protected?)
+ * @sg_n: amount of elements in sg array
  * @chan_list: linked list of channels
  *
  * We keep all per-channel information in a structure.
@@ -78,6 +79,7 @@ struct virtio_chan {
 	unsigned long p9_max_pages;
 	/* Scatterlist: can be too big for stack. */
 	struct scatterlist *sg;
+	size_t sg_n;
 	/**
 	 * @tag: name to identify a mount null terminated
 	 */
@@ -270,12 +272,12 @@ p9_virtio_request(struct p9_client *client, struct p9_req_t *req)
 	out_sgs = in_sgs = 0;
 	/* Handle out VirtIO ring buffers */
 	out = pack_sg_list(chan->sg, 0,
-			   VIRTQUEUE_NUM, req->tc.sdata, req->tc.size);
+			   chan->sg_n, req->tc.sdata, req->tc.size);
 	if (out)
 		sgs[out_sgs++] = chan->sg;
 
 	in = pack_sg_list(chan->sg, out,
-			  VIRTQUEUE_NUM, req->rc.sdata, req->rc.capacity);
+			  chan->sg_n, req->rc.sdata, req->rc.capacity);
 	if (in)
 		sgs[out_sgs + in_sgs++] = chan->sg + out;
 
@@ -447,14 +449,14 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 
 	/* out data */
 	out = pack_sg_list(chan->sg, 0,
-			   VIRTQUEUE_NUM, req->tc.sdata, req->tc.size);
+			   chan->sg_n, req->tc.sdata, req->tc.size);
 
 	if (out)
 		sgs[out_sgs++] = chan->sg;
 
 	if (out_pages) {
 		sgs[out_sgs++] = chan->sg + out;
-		out += pack_sg_list_p(chan->sg, out, VIRTQUEUE_NUM,
+		out += pack_sg_list_p(chan->sg, out, chan->sg_n,
 				      out_pages, out_nr_pages, offs, outlen);
 	}
 
@@ -466,13 +468,13 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 	 * allocated memory and payload onto the user buffer.
 	 */
 	in = pack_sg_list(chan->sg, out,
-			  VIRTQUEUE_NUM, req->rc.sdata, in_hdr_len);
+			  chan->sg_n, req->rc.sdata, in_hdr_len);
 	if (in)
 		sgs[out_sgs + in_sgs++] = chan->sg + out;
 
 	if (in_pages) {
 		sgs[out_sgs + in_sgs++] = chan->sg + out + in;
-		in += pack_sg_list_p(chan->sg, out + in, VIRTQUEUE_NUM,
+		in += pack_sg_list_p(chan->sg, out + in, chan->sg_n,
 				     in_pages, in_nr_pages, offs, inlen);
 	}
 
@@ -574,13 +576,14 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 		goto fail;
 	}
 
-	chan->sg = kmalloc_array(VIRTQUEUE_NUM,
+	chan->sg = kmalloc_array(VIRTQUEUE_DEFAULT_NUM,
 				 sizeof(struct scatterlist), GFP_KERNEL);
 	if (!chan->sg) {
 		pr_err("Failed to allocate virtio 9P channel\n");
 		err = -ENOMEM;
 		goto out_free_chan_shallow;
 	}
+	chan->sg_n = VIRTQUEUE_DEFAULT_NUM;
 
 	chan->vdev = vdev;
 
@@ -593,7 +596,7 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 	chan->vq->vdev->priv = chan;
 	spin_lock_init(&chan->lock);
 
-	sg_init_table(chan->sg, VIRTQUEUE_NUM);
+	sg_init_table(chan->sg, chan->sg_n);
 
 	chan->inuse = false;
 	if (virtio_has_feature(vdev, VIRTIO_9P_MOUNT_TAG)) {
@@ -777,7 +780,7 @@ static struct p9_trans_module p9_virtio_trans = {
 	 * that are not at page boundary, that can result in an extra
 	 * page in zero copy.
 	 */
-	.maxsize = PAGE_SIZE * (VIRTQUEUE_NUM - 3),
+	.maxsize = PAGE_SIZE * (VIRTQUEUE_DEFAULT_NUM - 3),
 	.def = 1,
 	.owner = THIS_MODULE,
 };
-- 
2.30.2



* [PATCH v4 07/12] 9p/trans_virtio: resize sg lists to whatever is possible
  2021-12-30 13:23 [PATCH v4 00/12] remove msize limit in virtio transport Christian Schoenebeck
                   ` (8 preceding siblings ...)
  2021-12-30 13:23 ` [PATCH v4 09/12] net/9p: split message size argument into 't_size' and 'r_size' pair Christian Schoenebeck
@ 2021-12-30 13:23 ` Christian Schoenebeck
  2021-12-30 13:23 ` [PATCH v4 05/12] net/9p: add trans_maxsize to struct p9_client Christian Schoenebeck
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2021-12-30 13:23 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal, Nikolay Kichukov

Right now vq_sg_resize() uses a lazy implementation following
the all-or-nothing principle: it either resizes exactly to
the requested new amount of sg lists, or it does not resize at
all.

The problem with this is that if a user supplies a very large msize
value, the resize would simply fail and the user would be stuck with
the default maximum msize supported by the virtio transport.

To resolve this potential issue, change vq_sg_resize() to resize
the passed sg list to whatever is possible on the machine.

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/trans_virtio.c | 68 ++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 60 insertions(+), 8 deletions(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index a02050c9742a..580efa95eabd 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -208,24 +208,67 @@ static struct virtqueue_sg *vq_sg_alloc(unsigned int nsgl)
  * amount of lists
  * @_vq_sg: scatter/gather lists to be resized
  * @nsgl: new amount of scatter/gather lists
+ *
+ * Old scatter/gather lists are retained. Only growing the size is supported.
+ * If the requested amount cannot be satisfied, then lists are increased to
+ * whatever is possible.
  */
 static int vq_sg_resize(struct virtqueue_sg **_vq_sg, unsigned int nsgl)
 {
 	struct virtqueue_sg *vq_sg;
+	unsigned int i;
+	size_t sz;
+	int ret = 0;
 
 	BUG_ON(!_vq_sg || !nsgl);
 	vq_sg = *_vq_sg;
+	if (nsgl > VIRTQUEUE_SG_NSGL_MAX)
+		nsgl = VIRTQUEUE_SG_NSGL_MAX;
 	if (vq_sg->nsgl == nsgl)
 		return 0;
+	if (vq_sg->nsgl > nsgl)
+		return -ENOTSUPP;
+
+	vq_sg = kzalloc(sizeof(struct virtqueue_sg) +
+			nsgl * sizeof(struct scatterlist *),
+			GFP_KERNEL);
 
-	/* lazy resize implementation for now */
-	vq_sg = vq_sg_alloc(nsgl);
 	if (!vq_sg)
 		return -ENOMEM;
 
+	/* copy over old scatter gather lists */
+	sz = sizeof(struct virtqueue_sg) +
+		(*_vq_sg)->nsgl * sizeof(struct scatterlist *);
+	memcpy(vq_sg, *_vq_sg, sz);
+
+	vq_sg->nsgl = nsgl;
+
+	for (i = (*_vq_sg)->nsgl; i < nsgl; ++i) {
+		vq_sg->sgl[i] = kmalloc_array(
+			SG_MAX_SINGLE_ALLOC, sizeof(struct scatterlist),
+			GFP_KERNEL
+		);
+		/*
+		 * handle failed allocation as soft error, we take whatever
+		 * we get
+		 */
+		if (!vq_sg->sgl[i]) {
+			ret = -ENOMEM;
+			vq_sg->nsgl = nsgl = i;
+			break;
+		}
+		sg_init_table(vq_sg->sgl[i], SG_MAX_SINGLE_ALLOC);
+		if (i) {
+			/* chain the lists */
+			sg_chain(vq_sg->sgl[i - 1], SG_MAX_SINGLE_ALLOC,
+				 vq_sg->sgl[i]);
+		}
+	}
+	sg_mark_end(&vq_sg->sgl[nsgl - 1][SG_MAX_SINGLE_ALLOC - 1]);
+
 	kfree(*_vq_sg);
 	*_vq_sg = vq_sg;
-	return 0;
+	return ret;
 }
 
 /**
@@ -846,12 +889,21 @@ p9_virtio_create(struct p9_client *client, const char *devname, char *args)
 		if (nsgl > chan->vq_sg->nsgl) {
 			/*
 			 * if resize fails, no big deal, then just
-			 * continue with default msize instead
+			 * continue with whatever we got
+			 */
+			vq_sg_resize(&chan->vq_sg, nsgl);
+			/*
+			 * actual allocation size might be less than
+			 * requested, so use vq_sg->nsgl instead of nsgl
 			 */
-			if (!vq_sg_resize(&chan->vq_sg, nsgl)) {
-				client->trans_maxsize =
-					PAGE_SIZE *
-					((nsgl * SG_USER_PAGES_PER_LIST) - 3);
+			client->trans_maxsize =
+				PAGE_SIZE * ((chan->vq_sg->nsgl *
+				SG_USER_PAGES_PER_LIST) - 3);
+			if (nsgl > chan->vq_sg->nsgl) {
+				pr_info("limiting 'msize' to %d as only %d "
+					"of %zu SG lists could be allocated",
+					client->trans_maxsize,
+					chan->vq_sg->nsgl, nsgl);
 			}
 		}
 #endif /* CONFIG_ARCH_NO_SG_CHAIN */
-- 
2.30.2



* [PATCH v4 06/12] 9p/trans_virtio: support larger msize values
  2021-12-30 13:23 [PATCH v4 00/12] remove msize limit in virtio transport Christian Schoenebeck
                   ` (5 preceding siblings ...)
  2021-12-30 13:23 ` [PATCH v4 11/12] net/9p: add p9_msg_buf_size() Christian Schoenebeck
@ 2021-12-30 13:23 ` Christian Schoenebeck
  2021-12-30 13:23 ` [PATCH v4 02/12] 9p/trans_virtio: separate allocation of scatter gather list Christian Schoenebeck
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2021-12-30 13:23 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal, Nikolay Kichukov

The virtio transport by default supports a 9p 'msize' of up to
approximately 500 kB. This patch adds support for larger 'msize'
values by resizing the number of scatter/gather lists if required.

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/trans_virtio.c | 61 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 656562a66f06..a02050c9742a 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -203,6 +203,31 @@ static struct virtqueue_sg *vq_sg_alloc(unsigned int nsgl)
 	return vq_sg;
 }
 
+/**
+ * vq_sg_resize - resize passed virtqueue scatter/gather lists to the passed
+ * amount of lists
+ * @_vq_sg: scatter/gather lists to be resized
+ * @nsgl: new amount of scatter/gather lists
+ */
+static int vq_sg_resize(struct virtqueue_sg **_vq_sg, unsigned int nsgl)
+{
+	struct virtqueue_sg *vq_sg;
+
+	BUG_ON(!_vq_sg || !nsgl);
+	vq_sg = *_vq_sg;
+	if (vq_sg->nsgl == nsgl)
+		return 0;
+
+	/* lazy resize implementation for now */
+	vq_sg = vq_sg_alloc(nsgl);
+	if (!vq_sg)
+		return -ENOMEM;
+
+	kfree(*_vq_sg);
+	*_vq_sg = vq_sg;
+	return 0;
+}
+
 /**
  * p9_virtio_close - reclaim resources of a channel
  * @client: client instance
@@ -774,6 +799,10 @@ p9_virtio_create(struct p9_client *client, const char *devname, char *args)
 	struct virtio_chan *chan;
 	int ret = -ENOENT;
 	int found = 0;
+#if !defined(CONFIG_ARCH_NO_SG_CHAIN)
+	size_t npages;
+	size_t nsgl;
+#endif
 
 	if (devname == NULL)
 		return -EINVAL;
@@ -796,6 +825,38 @@ p9_virtio_create(struct p9_client *client, const char *devname, char *args)
 		return ret;
 	}
 
+	/*
+	 * if user supplied an 'msize' option that's larger than what this
+	 * transport supports by default, then try to allocate more sg lists
+	 */
+	if (client->msize > client->trans_maxsize) {
+#ifdef CONFIG_ARCH_NO_SG_CHAIN
+		pr_info("limiting 'msize' to %d because architecture does not "
+			"support chained scatter gather lists\n",
+			client->trans_maxsize);
+#else
+		npages = DIV_ROUND_UP(client->msize, PAGE_SIZE);
+		if (npages > chan->p9_max_pages) {
+			npages = chan->p9_max_pages;
+			pr_info("limiting 'msize' as it would exceed the max. "
+				"of %lu pages allowed on this system\n",
+				chan->p9_max_pages);
+		}
+		nsgl = DIV_ROUND_UP(npages, SG_USER_PAGES_PER_LIST);
+		if (nsgl > chan->vq_sg->nsgl) {
+			/*
+			 * if resize fails, no big deal, then just
+			 * continue with default msize instead
+			 */
+			if (!vq_sg_resize(&chan->vq_sg, nsgl)) {
+				client->trans_maxsize =
+					PAGE_SIZE *
+					((nsgl * SG_USER_PAGES_PER_LIST) - 3);
+			}
+		}
+#endif /* CONFIG_ARCH_NO_SG_CHAIN */
+	}
+
 	client->trans = (void *)chan;
 	client->status = Connected;
 	chan->client = client;
-- 
2.30.2



* [PATCH v4 11/12] net/9p: add p9_msg_buf_size()
  2021-12-30 13:23 [PATCH v4 00/12] remove msize limit in virtio transport Christian Schoenebeck
                   ` (4 preceding siblings ...)
  2021-12-30 13:23 ` [PATCH v4 10/12] 9p: add P9_ERRMAX for 9p2000 and 9p2000.u Christian Schoenebeck
@ 2021-12-30 13:23 ` Christian Schoenebeck
  2021-12-30 13:23 ` [PATCH v4 06/12] 9p/trans_virtio: support larger msize values Christian Schoenebeck
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2021-12-30 13:23 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal, Nikolay Kichukov

This new function calculates a buffer size suitable for holding the
intended 9p request or response. For rather small message types (which
actually applies to almost all 9p message types) simply use hard coded
values. For some variable-length and potentially large message types,
calculate a more precise value according to what data is actually
transmitted, to avoid unnecessarily huge buffers.

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/protocol.c | 154 ++++++++++++++++++++++++++++++++++++++++++++++
 net/9p/protocol.h |   2 +
 2 files changed, 156 insertions(+)

diff --git a/net/9p/protocol.c b/net/9p/protocol.c
index 3754c33e2974..49939e8cde2a 100644
--- a/net/9p/protocol.c
+++ b/net/9p/protocol.c
@@ -23,6 +23,160 @@
 
 #include <trace/events/9p.h>
 
+/* len[2] text[len] */
+#define P9_STRLEN(s) \
+	(2 + min_t(size_t, s ? strlen(s) : 0, USHRT_MAX))
+
+/**
+ * p9_msg_buf_size - Returns a buffer size sufficiently large to hold the
+ * intended 9p message.
+ * @c: client
+ * @type: message type
+ * @fmt: format template for assembling request message
+ * (see p9pdu_vwritef)
+ * @ap: variable arguments to be fed to passed format template
+ * (see p9pdu_vwritef)
+ *
+ * Note: Even for response types (P9_R*) the format template and variable
+ * arguments must always be for the originating request type (P9_T*).
+ */
+size_t p9_msg_buf_size(struct p9_client *c, enum p9_msg_t type,
+			const char *fmt, va_list ap)
+{
+	/* size[4] type[1] tag[2] */
+	const int hdr = 4 + 1 + 2;
+	/* ename[s] errno[4] */
+	const int rerror_size = hdr + P9_ERRMAX + 4;
+	/* ecode[4] */
+	const int rlerror_size = hdr + 4;
+	const int err_size =
+		c->proto_version == p9_proto_2000L ? rlerror_size : rerror_size;
+
+	switch (type) {
+
+	/* message types not used at all */
+	case P9_TERROR:
+	case P9_TLERROR:
+	case P9_TAUTH:
+	case P9_RAUTH:
+		BUG();
+
+	/* variable length & potentially large message types */
+	case P9_TATTACH:
+		BUG_ON(strcmp("ddss?u", fmt));
+		va_arg(ap, int32_t);
+		va_arg(ap, int32_t);
+		{
+			const char *uname = va_arg(ap, const char *);
+			const char *aname = va_arg(ap, const char *);
+			/* fid[4] afid[4] uname[s] aname[s] n_uname[4] */
+			return hdr + 4 + 4 + P9_STRLEN(uname) + P9_STRLEN(aname) + 4;
+		}
+	case P9_TWALK:
+		BUG_ON(strcmp("ddT", fmt));
+		va_arg(ap, int32_t);
+		va_arg(ap, int32_t);
+		{
+			uint i, nwname = max(va_arg(ap, int), 0);
+			size_t wname_all;
+			const char **wnames = va_arg(ap, const char **);
+			for (i = 0, wname_all = 0; i < nwname; ++i) {
+				wname_all += P9_STRLEN(wnames[i]);
+			}
+			/* fid[4] newfid[4] nwname[2] nwname*(wname[s]) */
+			return hdr + 4 + 4 + 2 + wname_all;
+		}
+	case P9_RWALK:
+		BUG_ON(strcmp("ddT", fmt));
+		va_arg(ap, int32_t);
+		va_arg(ap, int32_t);
+		{
+			uint nwname = va_arg(ap, int);
+			/* nwqid[2] nwqid*(wqid[13]) */
+			return max_t(size_t, hdr + 2 + nwname * 13, err_size);
+		}
+	case P9_TCREATE:
+		BUG_ON(strcmp("dsdb?s", fmt));
+		va_arg(ap, int32_t);
+		{
+			const char *name = va_arg(ap, const char *);
+			if ((c->proto_version != p9_proto_2000u) &&
+			    (c->proto_version != p9_proto_2000L))
+				/* fid[4] name[s] perm[4] mode[1] */
+				return hdr + 4 + P9_STRLEN(name) + 4 + 1;
+			{
+				va_arg(ap, int32_t);
+				va_arg(ap, int);
+				{
+					const char *ext = va_arg(ap, const char *);
+					/* fid[4] name[s] perm[4] mode[1] extension[s] */
+					return hdr + 4 + P9_STRLEN(name) + 4 + 1 + P9_STRLEN(ext);
+				}
+			}
+		}
+	case P9_TLCREATE:
+		BUG_ON(strcmp("dsddg", fmt));
+		va_arg(ap, int32_t);
+		{
+			const char *name = va_arg(ap, const char *);
+			/* fid[4] name[s] flags[4] mode[4] gid[4] */
+			return hdr + 4 + P9_STRLEN(name) + 4 + 4 + 4;
+		}
+	case P9_RREAD:
+	case P9_RREADDIR:
+		BUG_ON(strcmp("dqd", fmt));
+		va_arg(ap, int32_t);
+		va_arg(ap, int64_t);
+		{
+			const int32_t count = va_arg(ap, int32_t);
+			/* count[4] data[count] */
+			return max_t(size_t, hdr + 4 + count, err_size);
+		}
+	case P9_TWRITE:
+		BUG_ON(strcmp("dqV", fmt));
+		va_arg(ap, int32_t);
+		va_arg(ap, int64_t);
+		{
+			const int32_t count = va_arg(ap, int32_t);
+			/* fid[4] offset[8] count[4] data[count] */
+			return hdr + 4 + 8 + 4 + count;
+		}
+	case P9_TRENAMEAT:
+		BUG_ON(strcmp("dsds", fmt));
+		va_arg(ap, int32_t);
+		{
+			const char *oldname = va_arg(ap, const char *);
+			va_arg(ap, int32_t);
+			{
+				const char *newname = va_arg(ap, const char *);
+				/* olddirfid[4] oldname[s] newdirfid[4] newname[s] */
+				return hdr + 4 + P9_STRLEN(oldname) + 4 + P9_STRLEN(newname);
+			}
+		}
+	case P9_RERROR:
+		return rerror_size;
+	case P9_RLERROR:
+		return rlerror_size;
+
+	/* small message types */
+	case P9_TSTAT:
+	case P9_RSTAT:
+	case P9_TSYMLINK:
+	case P9_RREADLINK:
+	case P9_TXATTRWALK:
+	case P9_TXATTRCREATE:
+	case P9_TLINK:
+	case P9_TMKDIR:
+	case P9_TUNLINKAT:
+		return 8 * 1024;
+
+	/* tiny message types */
+	default:
+		return 4 * 1024;
+
+	}
+}
+
 static int
 p9pdu_writef(struct p9_fcall *pdu, int proto_version, const char *fmt, ...);
 
diff --git a/net/9p/protocol.h b/net/9p/protocol.h
index 6d719c30331a..ad2283d1f96b 100644
--- a/net/9p/protocol.h
+++ b/net/9p/protocol.h
@@ -8,6 +8,8 @@
  *  Copyright (C) 2008 by IBM, Corp.
  */
 
+size_t p9_msg_buf_size(struct p9_client *c, enum p9_msg_t type,
+			const char *fmt, va_list ap);
 int p9pdu_vwritef(struct p9_fcall *pdu, int proto_version, const char *fmt,
 		  va_list ap);
 int p9pdu_readf(struct p9_fcall *pdu, int proto_version, const char *fmt, ...);
-- 
2.30.2



* [PATCH v4 08/12] net/9p: limit 'msize' to KMALLOC_MAX_SIZE for all transports
  2021-12-30 13:23 [PATCH v4 00/12] remove msize limit in virtio transport Christian Schoenebeck
  2021-12-30 13:23 ` [PATCH v4 12/12] net/9p: allocate appropriate reduced message buffers Christian Schoenebeck
@ 2021-12-30 13:23 ` Christian Schoenebeck
  2021-12-30 13:23 ` [PATCH v4 04/12] 9p/trans_virtio: introduce struct virtqueue_sg Christian Schoenebeck
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2021-12-30 13:23 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal, Nikolay Kichukov

This 9p client implementation is still using linear message buffers for
most message types, i.e. they use kmalloc() et al. for allocating
physically contiguous memory pages, which is usually limited to 4 MB
buffers. Use KMALLOC_MAX_SIZE instead of a hard coded 4 MB though, to
constrain this more safely.

Unfortunately we cannot simply replace the existing kmalloc() calls by
vmalloc() ones, because that would yield non-logical kernel addresses
(for any vmalloc(>4MB) that is) which are in general not accessible by
hosts like QEMU.

In the future we would replace those linear buffers by scatter/gather
lists to eventually get rid of this limit (struct p9_fcall's sdata member
allocated by p9_fcall_init() and struct p9_fid's rdir member allocated by
v9fs_alloc_rdir_buf()).

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/client.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/net/9p/client.c b/net/9p/client.c
index 20054addd81b..fab939541c81 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1042,6 +1042,17 @@ struct p9_client *p9_client_create(const char *dev_name, char *options)
 	p9_debug(P9_DEBUG_MUX, "clnt %p trans %p msize %d protocol %d\n",
 		 clnt, clnt->trans_mod, clnt->msize, clnt->proto_version);
 
+	/*
+	 * due to linear message buffers being used by client ATM
+	 */
+	if (clnt->msize > KMALLOC_MAX_SIZE) {
+		clnt->msize = KMALLOC_MAX_SIZE;
+		pr_info("Limiting 'msize' to %zu as this is the maximum "
+			"supported by this client version.\n",
+			(size_t) KMALLOC_MAX_SIZE
+		);
+	}
+
 	err = clnt->trans_mod->create(clnt, dev_name, options);
 	if (err)
 		goto put_trans;
-- 
2.30.2



* [PATCH v4 09/12] net/9p: split message size argument into 't_size' and 'r_size' pair
  2021-12-30 13:23 [PATCH v4 00/12] remove msize limit in virtio transport Christian Schoenebeck
                   ` (7 preceding siblings ...)
  2021-12-30 13:23 ` [PATCH v4 02/12] 9p/trans_virtio: separate allocation of scatter gather list Christian Schoenebeck
@ 2021-12-30 13:23 ` Christian Schoenebeck
  2021-12-30 13:23 ` [PATCH v4 07/12] 9p/trans_virtio: resize sg lists to whatever is possible Christian Schoenebeck
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2021-12-30 13:23 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal, Nikolay Kichukov

Refactor 'max_size' argument of p9_tag_alloc() and 'req_size' argument
of p9_client_prepare_req() both into a pair of arguments 't_size' and
'r_size' respectively to allow handling the buffer size for request and
reply separately from each other.

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/client.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/net/9p/client.c b/net/9p/client.c
index fab939541c81..56be1658870d 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -255,24 +255,26 @@ static struct kmem_cache *p9_req_cache;
  * p9_tag_alloc - Allocate a new request.
  * @c: Client session.
  * @type: Transaction type.
- * @max_size: Maximum packet size for this request.
+ * @t_size: Buffer size for holding this request.
+ * @r_size: Buffer size for holding server's reply on this request.
  *
  * Context: Process context.
  * Return: Pointer to new request.
  */
 static struct p9_req_t *
-p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size)
+p9_tag_alloc(struct p9_client *c, int8_t type, uint t_size, uint r_size)
 {
 	struct p9_req_t *req = kmem_cache_alloc(p9_req_cache, GFP_NOFS);
-	int alloc_msize = min(c->msize, max_size);
+	int alloc_tsize = min(c->msize, t_size);
+	int alloc_rsize = min(c->msize, r_size);
 	int tag;
 
 	if (!req)
 		return ERR_PTR(-ENOMEM);
 
-	if (p9_fcall_init(c, &req->tc, alloc_msize))
+	if (p9_fcall_init(c, &req->tc, alloc_tsize))
 		goto free_req;
-	if (p9_fcall_init(c, &req->rc, alloc_msize))
+	if (p9_fcall_init(c, &req->rc, alloc_rsize))
 		goto free;
 
 	p9pdu_reset(&req->tc);
@@ -678,7 +680,7 @@ static int p9_client_flush(struct p9_client *c, struct p9_req_t *oldreq)
 }
 
 static struct p9_req_t *p9_client_prepare_req(struct p9_client *c,
-					      int8_t type, int req_size,
+					      int8_t type, uint t_size, uint r_size,
 					      const char *fmt, va_list ap)
 {
 	int err;
@@ -694,7 +696,7 @@ static struct p9_req_t *p9_client_prepare_req(struct p9_client *c,
 	if (c->status == BeginDisconnect && type != P9_TCLUNK)
 		return ERR_PTR(-EIO);
 
-	req = p9_tag_alloc(c, type, req_size);
+	req = p9_tag_alloc(c, type, t_size, r_size);
 	if (IS_ERR(req))
 		return req;
 
@@ -731,7 +733,7 @@ p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...)
 	struct p9_req_t *req;
 
 	va_start(ap, fmt);
-	req = p9_client_prepare_req(c, type, c->msize, fmt, ap);
+	req = p9_client_prepare_req(c, type, c->msize, c->msize, fmt, ap);
 	va_end(ap);
 	if (IS_ERR(req))
 		return req;
@@ -829,7 +831,7 @@ static struct p9_req_t *p9_client_zc_rpc(struct p9_client *c, int8_t type,
 	/* We allocate a inline protocol data of only 4k bytes.
 	 * The actual content is passed in zero-copy fashion.
 	 */
-	req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, fmt, ap);
+	req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, P9_ZC_HDR_SZ, fmt, ap);
 	va_end(ap);
 	if (IS_ERR(req))
 		return req;
-- 
2.30.2



* [PATCH v4 10/12] 9p: add P9_ERRMAX for 9p2000 and 9p2000.u
  2021-12-30 13:23 [PATCH v4 00/12] remove msize limit in virtio transport Christian Schoenebeck
                   ` (3 preceding siblings ...)
  2021-12-30 13:23 ` [PATCH v4 01/12] net/9p: show error message if user 'msize' cannot be satisfied Christian Schoenebeck
@ 2021-12-30 13:23 ` Christian Schoenebeck
  2021-12-30 13:23 ` [PATCH v4 11/12] net/9p: add p9_msg_buf_size() Christian Schoenebeck
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2021-12-30 13:23 UTC (permalink / raw)
  To: v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal, Nikolay Kichukov

Add a P9_ERRMAX macro to the 9P protocol header reflecting the maximum
error string length of Rerror replies for the 9p2000 and 9p2000.u protocol
versions. Unfortunately the 9p2000 spec does not define a maximum error
string length, so pick 128 for now, as this seems to be a common maximum
size for POSIX error strings in practice.

The 9p2000.L protocol version uses Rlerror replies instead, which do not
contain an error string.
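For illustration, a worst-case Rerror reply size derived from this macro
might look roughly like this (just a sketch of the intended use, not the
actual p9_msg_buf_size() code added later in this series):

	/* sketch: worst-case Rerror for 9p2000(.u):
	 * size[4] type[1] tag[2] ename[s] = 2 + P9_ERRMAX (+ errno[4] for .u) */
	return 4 + 1 + 2 + 2 + P9_ERRMAX + 4;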

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 include/net/9p/9p.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/net/9p/9p.h b/include/net/9p/9p.h
index 9c6ec78e47a5..a447acc55b02 100644
--- a/include/net/9p/9p.h
+++ b/include/net/9p/9p.h
@@ -331,6 +331,9 @@ enum p9_qid_t {
 /* size of header for zero copy read/write */
 #define P9_ZC_HDR_SZ 4096
 
+/* maximum length of an error string */
+#define P9_ERRMAX 128
+
 /**
  * struct p9_qid - file system entity information
  * @type: 8-bit type &p9_qid_t
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/12] remove msize limit in virtio transport
  2021-12-30 13:23 [PATCH v4 00/12] remove msize limit in virtio transport Christian Schoenebeck
                   ` (11 preceding siblings ...)
  2021-12-30 13:23 ` [PATCH v4 03/12] 9p/trans_virtio: turn amount of sg lists into runtime info Christian Schoenebeck
@ 2022-01-20 22:43 ` Nikolay Kichukov
  2022-01-22 13:34   ` Christian Schoenebeck
  2022-07-07 14:30 ` Christian Schoenebeck
  13 siblings, 1 reply; 32+ messages in thread
From: Nikolay Kichukov @ 2022-01-20 22:43 UTC (permalink / raw)
  To: Christian Schoenebeck, v9fs-developer
  Cc: netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

Thanks for the patches. I've applied them on top of 5.16.2 kernel and it
works for msize=1048576. Performance-wise, same throughput as the
previous patches, basically limiting factor is the backend block
storage.

However, when I mount with msize=4194304, the system locks up upon first
try to traverse the directory structure, ie 'ls'. Only solution is to
'poweroff' the guest. Nothing in the logs.

Qemu 6.0.0 on the host has the following patches:

01-fix-wrong-io-block-size-Rgetattr.patch
02-dedupe-iounit-code.patch
03-9pfs-simplify-blksize_to_iounit.patch

The kernel patches were applied on the guest kernel only.

I've generated them with the following command:
git diff 783ba37c1566dd715b9a67d437efa3b77e3cd1a7^..8c305df4646b65218978fc6474aa0f5f29b216a0 > /tmp/kernel-5.16-9p-virtio-drop-msize-cap.patch

The host runs 5.15.4 kernel.

Cheers,
-N

On Thu, 2021-12-30 at 14:23 +0100, Christian Schoenebeck wrote:
> This series aims to get rid of the current 500k 'msize' limitation in
> the 9p virtio transport, which is currently a bottleneck for performance
> of 9p mounts.
> 
> To avoid confusion: it does remove the msize limit for the virtio
> transport,
> on 9p client level though the anticipated milestone for this series is
> now
> a max. 'msize' of 4 MB. See patch 8 for reason why.
> 
> This is a follow-up of the following series and discussion:
> https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/
> 
> Latest version of this series:
> https://github.com/cschoenebeck/linux/commits/9p-virtio-drop-msize-cap
> 
> 
> OVERVIEW OF PATCHES:
> 
> * Patch 1 is just a trivial info message for the user to know why his
> msize
>   option got ignored by 9p client in case the value was larger than
> what is
>   supported by this 9p driver.
> 
> * Patches 2..7 remove the msize limitation from the 'virtio' transport
>   (i.e. the 9p 'virtio' transport itself actually supports >4MB now,
> tested
>   successfully with an experimental QEMU version and some dirty 9p
> Linux
>   client hacks up to msize=128MB).
> 
> * Patch 8 limits msize for all transports to 4 MB for now as >4MB
> would need
>   more work on 9p client level (see commit log of patch 8 for
> details).
> 
> * Patches 9..12 tremendously reduce unnecessarily huge 9p message
> sizes and
>   therefore provide performance gain as well. So far, almost all 9p
> messages
>   simply allocated message buffers exactly msize large, even for
> messages
>   that actually just needed few bytes. So these patches make sense by
>   themselves, independent of this overall series, however for this
> series
>   even more, because the larger msize, the more this issue would have
> hurt
>   otherwise.
> 
> 
> PREREQUISITES:
> 
> If you are testing with QEMU then please either use latest QEMU 6.2
> release
> or higher, or at least apply the following patch on QEMU side:
> 
>  
> https://lore.kernel.org/qemu-devel/E1mT2Js-0000DW-OH@lizzy.crudebyte.com/
> 
> That QEMU patch is required if you are using a user space app that
> automatically retrieves an optimum I/O block size by obeying stat's
> st_blksize, which 'cat' for instance is doing, e.g.:
> 
>         time cat test_rnd.dat > /dev/null
> 
> Otherwise please use a user space app for performance testing that
> allows
> you to force a large block size and to avoid that QEMU issue, like
> 'dd'
> for instance, in that case you don't need to patch QEMU.
> 
> 
> KNOWN LIMITATION:
> 
> With this series applied I can run
> 
>   QEMU host <-> 9P virtio <-> Linux guest
> 
> with up to slightly below 4 MB msize [4186112 = (1024-2) * 4096]. If I
> try
> to run it with exactly 4 MB (4194304) it currently hits a limitation
> on
> QEMU side:
> 
>   qemu-system-x86_64: virtio: too many write descriptors in indirect
> table
> 
> That's because QEMU currently has a hard coded limit of max. 1024
> virtio
> descriptors per vring slot (i.e. per virtio message), see to do (1.)
> below.
> 
> 
> STILL TO DO:
> 
>   1. Negotiating virtio "Queue Indirect Size" (MANDATORY):
> 
>     The QEMU issue described above must be addressed by negotiating
> the
>     maximum length of virtio indirect descriptor tables on virtio
> device
>     initialization. This would not only avoid the QEMU error above,
> but would
>     also allow msize of >4MB in future. Before that change can be done
> on
>     Linux and QEMU sides though, it first requires a change to the
> virtio
>     specs. Work on that on the virtio specs is in progress:
> 
>     https://github.com/oasis-tcs/virtio-spec/issues/122
> 
>     This is not really an issue for testing this series. Just stick to
> max.
>     msize=4186112 as described above and you will be fine. However for
> the
>     final PR this should obviously be addressed in a clean way.
> 
>   2. Reduce readdir buffer sizes (optional - maybe later):
> 
>     This series already reduced the message buffers for most 9p
> message
>     types. This does not include Treaddir though yet, which is still
> simply
>     using msize. It would make sense to benchmark first whether this
> is
>     actually an issue that hurts. If it does, then one might use
> already
>     existing vfs knowledge to estimate the Treaddir size, or starting
> with
>     some reasonable hard coded small Treaddir size first and then
> increasing
>     it just on the 2nd Treaddir request if there are more directory
> entries
>     to fetch.
> 
>   3. Add more buffer caches (optional - maybe later):
> 
>     p9_fcall_init() uses kmem_cache_alloc() instead of kmalloc() for
> very
>     large buffers to reduce latency waiting for memory allocation to
>     complete. Currently it does that only if the requested buffer size
> is
>     exactly msize large. As patch 11 already divided the 9p message
> types
>     into few message size categories, maybe it would make sense to use
> e.g.
>     4 separate caches for those memory category (e.g. 4k, 8k, msize/2,
>     msize). Might be worth a benchmark test.
> 
> Testing and feedback appreciated!
> 
> v3 -> v4:
> 
>   * Limit msize to 4 MB for all transports [NEW patch 8].
> 
>   * Avoid unnecessarily huge 9p message buffers
>     [NEW patch 9] .. [NEW patch 12].
> 
> Christian Schoenebeck (12):
>   net/9p: show error message if user 'msize' cannot be satisfied
>   9p/trans_virtio: separate allocation of scatter gather list
>   9p/trans_virtio: turn amount of sg lists into runtime info
>   9p/trans_virtio: introduce struct virtqueue_sg
>   net/9p: add trans_maxsize to struct p9_client
>   9p/trans_virtio: support larger msize values
>   9p/trans_virtio: resize sg lists to whatever is possible
>   net/9p: limit 'msize' to KMALLOC_MAX_SIZE for all transports
>   net/9p: split message size argument into 't_size' and 'r_size' pair
>   9p: add P9_ERRMAX for 9p2000 and 9p2000.u
>   net/9p: add p9_msg_buf_size()
>   net/9p: allocate appropriate reduced message buffers
> 
>  include/net/9p/9p.h     |   3 +
>  include/net/9p/client.h |   2 +
>  net/9p/client.c         |  67 +++++++--
>  net/9p/protocol.c       | 154 ++++++++++++++++++++
>  net/9p/protocol.h       |   2 +
>  net/9p/trans_virtio.c   | 304 +++++++++++++++++++++++++++++++++++----
> -
>  6 files changed, 483 insertions(+), 49 deletions(-)
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/12] remove msize limit in virtio transport
  2022-01-20 22:43 ` [PATCH v4 00/12] remove msize limit in virtio transport Nikolay Kichukov
@ 2022-01-22 13:34   ` Christian Schoenebeck
  2022-01-24 10:21     ` Nikolay Kichukov
  0 siblings, 1 reply; 32+ messages in thread
From: Christian Schoenebeck @ 2022-01-22 13:34 UTC (permalink / raw)
  To: Nikolay Kichukov
  Cc: v9fs-developer, netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

On Donnerstag, 20. Januar 2022 23:43:46 CET Nikolay Kichukov wrote:
> Thanks for the patches. I've applied them on top of 5.16.2 kernel and it
> works for msize=1048576. Performance-wise, same throughput as the
> previous patches, basically limiting factor is the backend block
> storage.

Depends on how you were testing exactly. I assume you just booted a guest and 
then mounted a humble 9p directory in guest to perform some isolated I/O 
throughput tests on a single file. In this test scenario yes, you would not 
see much of a difference between v3 vs. v4 of this series.

However in my tests I went much further than that by running an entire guest 
on top of 9p as its root filesystem:
https://wiki.qemu.org/Documentation/9p_root_fs
With this 9p rootfs setup you get a completely different picture. For instance 
you'll notice with v3 that guest boot time *increases* with rising msize, 
whereas with v4 it shrinks. And also when you benchmark throughput on a file 
in this 9p rootfs setup with v3 you get worse results than with v4, sometimes 
with v3 even worse than without patches at all. With v4 applied though it 
clearly outperforms any other kernel version in all aspects.

I highly recommend this 9p rootfs setup as a heterogeneous 9p test environment, 
as it is a very good real world test scenario for all kinds of aspects.

> However, when I mount with msize=4194304, the system locks up upon first
> try to traverse the directory structure, ie 'ls'. Only solution is to
> 'poweroff' the guest. Nothing in the logs.

I've described this in detail in the cover letter under "KNOWN LIMITATIONS" 
already. Use max. msize 4186112.

> Qemu 6.0.0 on the host has the following patches:
> 
> 01-fix-wrong-io-block-size-Rgetattr.patch
> 02-dedupe-iounit-code.patch
> 03-9pfs-simplify-blksize_to_iounit.patch

I recommend just using QEMU 6.2. It is not worth patching that old QEMU 
version. E.g. you would have lousy readdir performance with that QEMU 
version and what not.

You don't need to install QEMU. You can directly run it from the build 
directory.

> The kernel patches were applied on the guest kernel only.
> 
> I've generated them with the following command:
> git diff
> 783ba37c1566dd715b9a67d437efa3b77e3cd1a7^..8c305df4646b65218978fc6474aa0f5f
> 29b216a0 > /tmp/kernel-5.16-9p-virtio-drop-msize-cap.patch
> 
> The host runs 5.15.4 kernel.

Host kernel version currently does not matter for this series.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/12] remove msize limit in virtio transport
  2022-01-22 13:34   ` Christian Schoenebeck
@ 2022-01-24 10:21     ` Nikolay Kichukov
  2022-01-24 11:07       ` Dominique Martinet
  0 siblings, 1 reply; 32+ messages in thread
From: Nikolay Kichukov @ 2022-01-24 10:21 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: v9fs-developer, netdev, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

Thanks Christian,
It works, sorry for overlooking the 'known limitations' in the first
place. When do we expect these patches to be merged upstream?

Cheers,
-N


On Sat, 2022-01-22 at 14:34 +0100, Christian Schoenebeck wrote:
> On Donnerstag, 20. Januar 2022 23:43:46 CET Nikolay Kichukov wrote:
> > Thanks for the patches. I've applied them on top of 5.16.2 kernel
> > and it
> > works for msize=1048576. Performance-wise, same throughput as the
> > previous patches, basically limiting factor is the backend block
> > storage.
> 
> Depends on how you were testing exactly. I assume you just booted a
> guest and 
> then mounted a humble 9p directory in guest to perform some isolated
> I/O 
> throughput tests on a single file. In this test scenario yes, you
> would not 
> see much of a difference between v3 vs. v4 of this series.
> 
> However in my tests I went much further than that by running an entire
> guest 
> on top of 9p as its root filesystem:
> https://wiki.qemu.org/Documentation/9p_root_fs
> With this 9p rootfs setup you get a completely different picture. For
> instance 
> you'll notice with v3 that guest boot time *increases* with rising
> msize, 
> whereas with v4 it shrinks. And also when you benchmark throughput on
> a file 
> in this 9p rootfs setup with v3 you get worse results than with v4,
> sometimes 
> with v3 even worse than without patches at all. With v4 applied though
> it 
> clearly outperforms any other kernel version in all aspects.
> 
> I highly recommend this 9p rootfs setup as a heterogenous 9p test
> environment, 
> as it is a very good real world test scenario for all kinds of
> aspects.
> 
> > However, when I mount with msize=4194304, the system locks up upon
> > first
> > try to traverse the directory structure, ie 'ls'. Only solution is
> > to
> > 'poweroff' the guest. Nothing in the logs.
> 
> I've described this in detail in the cover letter under "KNOWN
> LIMITATIONS" 
> already. Use max. msize 4186112.
> 
> > Qemu 6.0.0 on the host has the following patches:
> > 
> > 01-fix-wrong-io-block-size-Rgetattr.patch
> > 02-dedupe-iounit-code.patch
> > 03-9pfs-simplify-blksize_to_iounit.patch
> 
> I recommend just using QEMU 6.2. It is not worth to patch that old
> QEMU 
> version. E.g. you would have a lousy readdir performance with that
> QEMU 
> version and what not.
> 
> You don't need to install QEMU. You can directly run it from the build
> directory.
> 
> > The kernel patches were applied on the guest kernel only.
> > 
> > I've generated them with the following command:
> > git diff
> > 783ba37c1566dd715b9a67d437efa3b77e3cd1a7^..8c305df4646b65218978fc647
> > 4aa0f5f
> > 29b216a0 > /tmp/kernel-5.16-9p-virtio-drop-msize-cap.patch
> > 
> > The host runs 5.15.4 kernel.
> 
> Host kernel version currently does not matter for this series.
> 
> Best regards,
> Christian Schoenebeck
> 
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/12] remove msize limit in virtio transport
  2022-01-24 10:21     ` Nikolay Kichukov
@ 2022-01-24 11:07       ` Dominique Martinet
  2022-01-24 11:57         ` Christian Schoenebeck
  2022-05-24  8:10         ` Nikolay Kichukov
  0 siblings, 2 replies; 32+ messages in thread
From: Dominique Martinet @ 2022-01-24 11:07 UTC (permalink / raw)
  To: Nikolay Kichukov
  Cc: Christian Schoenebeck, v9fs-developer, netdev,
	Eric Van Hensbergen, Latchesar Ionkov, Greg Kurz, Vivek Goyal

Nikolay Kichukov wrote on Mon, Jan 24, 2022 at 11:21:08AM +0100:
> It works, sorry for overlooking the 'known limitations' in the first
> place. When do we expect these patches to be merged upstream?

We're just starting a new development cycle for 5.18 while 5.17 is
stabilizing, so this mostly depends on the ability to check if a msize
given in parameter is valid as described in the first "STILL TO DO"
point listed in the cover letter.

I personally would be happy considering this series for this cycle with
just a max msize of 4MB-8k and leave that further bump for later if
we're sure qemu will handle it.
We're still seeing a boost for that and the smaller buffers for small
messages will benefit all transport types, so that would get in in
roughly two months for 5.18-rc1, then another two months for 5.18 to
actually be released and start hitting production code.


I'm not sure when exactly but I'll run some tests with it as well and
redo a proper code review within the next few weeks, so we can get this
in -next for a little while before the merge window.

-- 
Dominique

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/12] remove msize limit in virtio transport
  2022-01-24 11:07       ` Dominique Martinet
@ 2022-01-24 11:57         ` Christian Schoenebeck
  2022-01-24 12:56           ` Dominique Martinet
  2022-01-25  8:45           ` Nikolay Kichukov
  2022-05-24  8:10         ` Nikolay Kichukov
  1 sibling, 2 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2022-01-24 11:57 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Nikolay Kichukov, v9fs-developer, netdev, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

On Montag, 24. Januar 2022 12:07:20 CET Dominique Martinet wrote:
> Nikolay Kichukov wrote on Mon, Jan 24, 2022 at 11:21:08AM +0100:
> > It works, sorry for overlooking the 'known limitations' in the first
> > place. When do we expect these patches to be merged upstream?
> 
> We're just starting a new development cycle for 5.18 while 5.17 is
> stabilizing, so this mostly depends on the ability to check if a msize
> given in parameter is valid as described in the first "STILL TO DO"
> point listed in the cover letter.

I will ping the Redhat guys on the open virtio spec issue this week. If you 
want I can CC you Dominique on the discussion regarding the virtio spec 
changes. It's a somewhat dry topic though.

> I personally would be happy considering this series for this cycle with
> just a max msize of 4MB-8k and leave that further bump for later if
> we're sure qemu will handle it.

I haven't actually checked whether there was any old QEMU version that did not 
support exceeding the virtio queue size. So it might be possible that a very 
ancient QEMU version might error out if msize > (128 * 4096 = 512k).

Besides QEMU, what other 9p server implementations are actually out there, and 
how would they behave on this? A test on their side would definitely be a good 
idea.

> We're still seeing a boost for that and the smaller buffers for small
> messages will benefit all transport types, so that would get in in
> roughly two months for 5.18-rc1, then another two months for 5.18 to
> actually be released and start hitting production code.
> 
> 
> I'm not sure when exactly but I'll run some tests with it as well and
> redo a proper code review within the next few weeks, so we can get this
> in -next for a little while before the merge window.

Especially the buffer size reduction patches need a proper review. Those 
changes can be tricky. So far I have not encountered any issues in tests at 
least. OTOH these patches could be pushed through separately already, no 
matter what the decision regarding the virtio issue turns out to be.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/12] remove msize limit in virtio transport
  2022-01-24 11:57         ` Christian Schoenebeck
@ 2022-01-24 12:56           ` Dominique Martinet
  2022-01-24 13:55             ` Christian Schoenebeck
  2022-01-25  8:45           ` Nikolay Kichukov
  1 sibling, 1 reply; 32+ messages in thread
From: Dominique Martinet @ 2022-01-24 12:56 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Nikolay Kichukov, v9fs-developer, netdev, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

Christian Schoenebeck wrote on Mon, Jan 24, 2022 at 12:57:35PM +0100:
> > We're just starting a new development cycle for 5.18 while 5.17 is
> > stabilizing, so this mostly depends on the ability to check if a msize
> > given in parameter is valid as described in the first "STILL TO DO"
> > point listed in the cover letter.
> 
> I will ping the Redhat guys on the open virtio spec issue this week. If you 
> want I can CC you Dominique on the discussion regarding the virtio spec 
> changes. It's a somewhat dry topic though.

I don't have much expertise on virtio stuff so don't think I'll bring
much to the discussion, but always happy to fill my inbox :)
It's always good to keep an eye on things at least.

> > I personally would be happy considering this series for this cycle with
> > just a max msize of 4MB-8k and leave that further bump for later if
> > we're sure qemu will handle it.
> 
> I haven't actually checked whether there was any old QEMU version that did not 
> support exceeding the virtio queue size. So it might be possible that a very 
> ancient QEMU version might error out if msize > (128 * 4096 = 512k).

Even if the spec gets implemented we need the default msize to work for
reasonably older versions of qemu (at least a few years e.g. supported
versions of debian/rhel can go quite a while back), and ideally have a
somewhat sensible error if we go above some max...

> Besides QEMU, what other 9p server implementations are actually out there, and 
> how would they behave on this? A test on their side would definitely be a good 
> idea.

9p virtio would only be qemu as far as I know.

For tcp/fd there are a few:
 - https://github.com/chaos/diod (also supports rdma iirc, I don't have
any hardware for rdma tests anymore though)
 - https://github.com/nfs-ganesha/nfs-ganesha (also rdma)
 - I was pointed at https://github.com/lionkov/go9p in a recent bug
report
 - http://repo.cat-v.org/libixp/ is also a server implementation I
haven't tested with the linux client in a while but iirc it used to work


I normally run some tests with qemu (virtio) and ganesha (tcp) before
pushing to my linux-next branch, so we hopefully don't make too many
assumptions that are specific to a server


> > We're still seeing a boost for that and the smaller buffers for small
> > messages will benefit all transport types, so that would get in in
> > roughly two months for 5.18-rc1, then another two months for 5.18 to
> > actually be released and start hitting production code.
> > 
> > 
> > I'm not sure when exactly but I'll run some tests with it as well and
> > redo a proper code review within the next few weeks, so we can get this
> > in -next for a little while before the merge window.
> 
> Especially the buffer size reduction patches needs a proper review. Those 
> changes can be tricky. So far I have not encountered any issues with tests at 
> least. OTOH these patches could be pushed through separately already, no 
> matter what the decision regarding the virtio issue will be.

Yes, I've had a first look and it's quite different from what I'd have
done, but it didn't look bad and I just wanted to spend a bit more time
on it.
On a very high level I'm not fond of the logical duplication brought by
deciding the size in a different function (duplicates format strings for
checks and brings in a huge case with all formats) when we already have
one function per call which could take the size decision directly
without going through the format varargs, but it's not like the protocol
has evolved over the past ten years so it's not really a problem -- I
just need to get down to it and check it all matches up.

I also agree it's totally orthogonal to the virtio size extension so if
you want to wait for the new virtio standard I'll focus on this part
first.

-- 
Dominique

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/12] remove msize limit in virtio transport
  2022-01-24 12:56           ` Dominique Martinet
@ 2022-01-24 13:55             ` Christian Schoenebeck
  0 siblings, 0 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2022-01-24 13:55 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Nikolay Kichukov, v9fs-developer, netdev, Eric Van Hensbergen,
	Latchesar Ionkov, Greg Kurz, Vivek Goyal

On Montag, 24. Januar 2022 13:56:17 CET Dominique Martinet wrote:
> Christian Schoenebeck wrote on Mon, Jan 24, 2022 at 12:57:35PM +0100:
> > > We're just starting a new development cycle for 5.18 while 5.17 is
> > > stabilizing, so this mostly depends on the ability to check if a msize
> > > given in parameter is valid as described in the first "STILL TO DO"
> > > point listed in the cover letter.
> > 
> > I will ping the Redhat guys on the open virtio spec issue this week. If
> > you
> > want I can CC you Dominique on the discussion regarding the virtio spec
> > changes. It's a somewhat dry topic though.
> 
> I don't have much expertise on virtio stuff so don't think I'll bring
> much to the discussion, but always happy to fill my inbox :)
> It's always good to keep an eye on things at least.

Ok, I'll CC you from now on on the virtio spec front; if it gets too 
noisy just raise your hand.

> > > I personally would be happy considering this series for this cycle with
> > > just a max msize of 4MB-8k and leave that further bump for later if
> > > we're sure qemu will handle it.
> > 
> > I haven't actually checked whether there was any old QEMU version that did
> > not support exceeding the virtio queue size. So it might be possible that
> > a very ancient QEMU version might error out if msize > (128 * 4096 =
> > 512k).
> Even if the spec gets implemented we need the default msize to work for
> reasonably older versions of qemu (at least a few years e.g. supported
> versions of debian/rhel can go quite a while back), and ideally have a
> somewhat sensible error if we go above some max...

Once the virtio spec changes are accepted and implemented, that would not be 
an issue at all, as virtio changes are always made with backward compatibility 
in mind. The plan is to negotiate that new virtio feature on virtio subsystem 
level: if either side does not support the new virtio feature (either too old 
QEMU or too old kernel), then msize would automatically be limited to the old 
virtio size/behaviour (a.k.a. virtio "queue size"), and with QEMU as 9p server 
that would be max. msize 500k.

Therefore I suggest just waiting for the virtio spec changes to be complete 
and implemented. People who care about performance should then just use an 
updated kernel *and* updated QEMU version to achieve msize > 500k. IMO, no 
need to risk breaking some old kernel/QEMU combination if nobody asked for it 
anyway, and if somebody does, then we could still add some kind of
--force-at-your-own-risk switch later on.

> > Besides QEMU, what other 9p server implementations are actually out there,
> > and how would they behave on this? A test on their side would definitely
> > be a good idea.
> 
> 9p virtio would only be qemu as far as I know.
> 
> For tcp/fd there are a few:
>  - https://github.com/chaos/diod (also supports rdma iirc, I don't have
> any hardware for rdma tests anymore though)
>  - https://github.com/nfs-ganesha/nfs-ganesha (also rdma)
>  - I was pointed at https://github.com/lionkov/go9p in a recent bug
> report
>  - http://repo.cat-v.org/libixp/ is also a server implementation I
> haven't tested with the linux client in a while but iirc it used to work
> 
> 
> I normally run some tests with qemu (virtio) and ganesha (tcp) before
> pushing to my linux-next branch, so we hopefully don't make too many
> assumptions that are specific to a server

Good to know, thanks!

> > > We're still seeing a boost for that and the smaller buffers for small
> > > messages will benefit all transport types, so that would get in in
> > > roughly two months for 5.18-rc1, then another two months for 5.18 to
> > > actually be released and start hitting production code.
> > > 
> > > 
> > > I'm not sure when exactly but I'll run some tests with it as well and
> > > redo a proper code review within the next few weeks, so we can get this
> > > in -next for a little while before the merge window.
> > 
> > Especially the buffer size reduction patches needs a proper review. Those
> > changes can be tricky. So far I have not encountered any issues with tests
> > at least. OTOH these patches could be pushed through separately already,
> > no matter what the decision regarding the virtio issue will be.
> 
> Yes, I've had a first look and it's quite different from what I'd have
> done, but it didn't look bad and I just wanted to spend a bit more time
> on it.
> On a very high level I'm not fond of the logical duplication brought by
> deciding the size in a different function (duplicates format strings for
> checks and brings in a huge case with all formats) when we already have
> one function per call which could take the size decision directly
> without going through the format varargs, but it's not like the protocol
> has evolved over the past ten years so it's not really a problem -- I
> just need to get down to it and check it all matches up.

Yeah I know, the advantage though is that this separate function/switch-case 
approach merges many message types, so it is actually less code. And I tried 
to automate sanity checking with various BUG_ON() calls to prevent the size 
calculations from accidentally drifting apart with future changes.
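Roughly the idea is that several 9p message types collapse into one size
category (illustrative sketch only, not the actual patch 11 code):

	switch (type) {
	case P9_TCLUNK:
	case P9_TREMOVE:
		/* size[4] type[1] tag[2] fid[4] */
		return 4 + 1 + 2 + 4;
	/* ... further size categories ... */
	}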

> I also agree it's totally orthogonal to the virtio size extension so if
> you want to wait for the new virtio standard I'll focus on this part
> first.

IMO it would make sense to give these message size reduction patches priority 
for now, as long as the virtio spec changes are incomplete.

One more thing: so far I have just concentrated on behavioural aspects and 
testing. What I completely neglected so far was code style. If you want, I can 
send a v5 this week with code style (and only code style) fixed, if that 
helps you keep diff noise low for your review.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/12] remove msize limit in virtio transport
  2022-01-24 11:57         ` Christian Schoenebeck
  2022-01-24 12:56           ` Dominique Martinet
@ 2022-01-25  8:45           ` Nikolay Kichukov
  1 sibling, 0 replies; 32+ messages in thread
From: Nikolay Kichukov @ 2022-01-25  8:45 UTC (permalink / raw)
  To: Christian Schoenebeck, Dominique Martinet
  Cc: v9fs-developer, netdev, Eric Van Hensbergen, Latchesar Ionkov,
	Greg Kurz, Vivek Goyal, garlick.jim

On Mon, 2022-01-24 at 12:57 +0100, Christian Schoenebeck wrote:
> Besides QEMU, what other 9p server implementations are actually out
> there, and 
> how would they behave on this? A test on their side would definitely
> be a good 
> idea.
diod is a 9p network server; if these patches are purely virtio
transport specific, I believe they should not affect it. Here is the
source code for diod:
https://github.com/chaos/diod

Jim Garlick maintains it, added to CC here.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 12/12] net/9p: allocate appropriate reduced message buffers
  2021-12-30 13:23 ` [PATCH v4 12/12] net/9p: allocate appropriate reduced message buffers Christian Schoenebeck
@ 2022-04-02 14:05   ` Dominique Martinet
  2022-04-03 11:29     ` Christian Schoenebeck
  0 siblings, 1 reply; 32+ messages in thread
From: Dominique Martinet @ 2022-04-02 14:05 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: v9fs-developer, netdev, Eric Van Hensbergen, Latchesar Ionkov,
	Greg Kurz, Vivek Goyal, Nikolay Kichukov

Christian Schoenebeck wrote on Thu, Dec 30, 2021 at 02:23:18PM +0100:
> So far 'msize' was simply used for all 9p message types, which is far
> too much and slowed down performance tremendously with large values
> for user configurable 'msize' option.
> 
> Let's stop this waste by using the new p9_msg_buf_size() function for
> allocating more appropriate, smaller buffers according to what is
> actually sent over the wire.

By the way, thinking of protocols earlier made me realize this won't
work on RDMA transport...

unlike virtio/tcp/xen, RDMA doesn't "mailbox" messages: there's a pool
of posted buffers, and once a message has been received it looks for the
header in the received message and associates it with the matching
request, but there's no guarantee a small message will use a small
buffer...

This is also going to need some thought; perhaps just copying small
buffers and recycling the buffer if a large one was used? But there
might be a window with no buffer available and I'm not sure what would
happen, and I don't have any RDMA hardware available to test this right
now, so this will be fun.
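Something along these lines maybe (pure pseudocode, all names made up,
not trans_rdma.c as it is today):

	static void on_rdma_recv(struct p9_req_t *req, void *posted_buf, u32 len)
	{
		if (len <= req->rc.capacity) {
			/* small reply: copy it out of the oversized posted buffer */
			memcpy(req->rc.sdata, posted_buf, len);
			/* ... and immediately repost the big buffer to the pool */
			repost_recv_buffer(posted_buf);
		}
	}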


I'm not shooting this down (it's definitely interesting), but we might
need to make it optional until someone with RDMA hardware can validate a
solution.
-- 
Dominique

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 12/12] net/9p: allocate appropriate reduced message buffers
  2022-04-02 14:05   ` Dominique Martinet
@ 2022-04-03 11:29     ` Christian Schoenebeck
  2022-04-03 12:37       ` Dominique Martinet
  0 siblings, 1 reply; 32+ messages in thread
From: Christian Schoenebeck @ 2022-04-03 11:29 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: v9fs-developer, netdev, Eric Van Hensbergen, Latchesar Ionkov,
	Greg Kurz, Vivek Goyal, Nikolay Kichukov

On Samstag, 2. April 2022 16:05:36 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Thu, Dec 30, 2021 at 02:23:18PM +0100:
> > So far 'msize' was simply used for all 9p message types, which is far
> > too much and slowed down performance tremendously with large values
> > for user configurable 'msize' option.
> > 
> > Let's stop this waste by using the new p9_msg_buf_size() function for
> > allocating more appropriate, smaller buffers according to what is
> > actually sent over the wire.
> 
> By the way, thinking of protocols earlier made me realize this won't
> work on RDMA transport...
> 
> unlike virtio/tcp/xen, RDMA doesn't "mailbox" messages: there's a pool
> of posted buffers, and once a message has been received it looks for the
> header in the received message and associates it with the matching
> request, but there's no guarantee a small message will use a small
> buffer...
> 
> This is also going to need some thought, perhaps just copying small
> buffers and recycling the buffer if a large one was used? but there
> might be a window with no buffer available and I'm not sure what'd
> happen, and don't have any RDMA hardware available to test this right
> now so this will be fun.
> 
> 
> I'm not shooting this down (it's definitely interesting), but we might
> need to make it optional until someone with RDMA hardware can validate a
> solution.

So maybe I should just exclude the 9p RDMA transport from this 9p message size 
reduction change in v5 until somebody has had a chance to test this change with 
RDMA.

Which makes me wonder: what exact hardware, hypervisor and OS combinations 
actually support 9p & RDMA?

In the long term I could imagine adding RDMA transport support on the QEMU 9p 
side. There is already RDMA code in QEMU, however so far I think it is only 
used for migration.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 12/12] net/9p: allocate appropriate reduced message buffers
  2022-04-03 11:29     ` Christian Schoenebeck
@ 2022-04-03 12:37       ` Dominique Martinet
  2022-04-03 14:00         ` Christian Schoenebeck
  0 siblings, 1 reply; 32+ messages in thread
From: Dominique Martinet @ 2022-04-03 12:37 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: v9fs-developer, netdev, Eric Van Hensbergen, Latchesar Ionkov,
	Greg Kurz, Vivek Goyal, Nikolay Kichukov

Christian Schoenebeck wrote on Sun, Apr 03, 2022 at 01:29:53PM +0200:
> So maybe I should just exclude the 9p RDMA transport from this 9p message size 
> reduction change in v5 until somebody had a chance to test this change with 
> RDMA.

Yes, I'm pretty certain it won't work so we'll want to exclude it unless
we can extend the RDMA protocol to address buffers.

From my understanding, RDMA comes with two types of primitives:
 - recv/send, which 9p exclusively uses: just a pool of buffers
registered to the NIC that get filled on a first-come-first-serve basis
(I'm not sure if it's a first-fit, or if the message will be truncated, or
if it'll error out if the message doesn't fit... But basically, given
that's what we use for 9p, we have no way of guaranteeing that a read
reply will be filled in the big buffer allocated for it and not
something else)

If we're lucky the algorithm used is smallest-fit first, but it doesn't
look like it:
---
The order of the Receive Request consumptions in a Receive Queue is by
the order that they were posted to it.
When you have a SRQ, you cannot predict which Receive Request will be
consumed by which QP, so all the Receive Requests in that SRQ should be
able to contain the incoming message (in terms of length).
--- https://www.rdmamojo.com/2013/02/02/ibv_post_recv/ (in a comment)


 - read/write, which can be addressed: e.g. the remote end can specify a
cookie along with address/size and directly operate on remote memory
(hence the "remote direct memory access" name). There is also some cool
stuff that can be done, like atomic compare-and-swap or arithmetic
operations on remote memory, which doesn't really concern us.

Using read/writes like NFS over RDMA does would resolve the problem and
allow what they call "direct data placement", which is reading or
writing directly from the page cache or user buffer as a real zero-copy
operation. But it requires the cookie/address to be sent and the client to
act on it, so it's a real transport-specific protocol change; given
the low number of users I think that's something that could be
considered if someone wants to work on it.

Until then we'll be safer with that bit disabled...

> Which makes me wonder, what is that exact hardware, hypervisor, OS that 
> supports 9p & RDMA?

I've used it with mellanox infiniband cards in the past. These support
SRIOV virtual functions, so they are quite practical for VMs and let you do
the work with a single machine and no cable.

I'm pretty sure it'd work with any recent server hardware that supports
RoCE (I -think- it's getting more common?), or with emulation: ~10 years
ago I got it to run with softiwarp, which has since been merged in the kernel
(siw), so that might be the easiest way to run it now.

Server-side, both diod and nfs-ganesha support 9p over RDMA; I haven't
used diod recently but ganesha ought to work.


> On the long-term I can imagine to add RDMA transport support on QEMU 9p side. 

What would you expect it to be used for?

> There is already RDMA code in QEMU, however it is only used for migration by 
> QEMU so far I think.

Yes, looking at it a bit there's live migration over RDMA (I tested it
at my previous job), some handling for gluster+rdma, and a
paravirtualized RDMA device (pvrdma).
The docs for it say it works with soft-roce, so it would also probably
work for tests (I'm not sure what difference there is between rxe and
siw), but at this point you've just set up virtualized rdma on the host
anyway...

I'll try to get something set up for tests on my end as well; it's
definitely something I had on my todo...
-- 
Dominique

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 12/12] net/9p: allocate appropriate reduced message buffers
  2022-04-03 12:37       ` Dominique Martinet
@ 2022-04-03 14:00         ` Christian Schoenebeck
  0 siblings, 0 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2022-04-03 14:00 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: v9fs-developer, netdev, Eric Van Hensbergen, Latchesar Ionkov,
	Greg Kurz, Vivek Goyal, Nikolay Kichukov

On Sonntag, 3. April 2022 14:37:55 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Sun, Apr 03, 2022 at 01:29:53PM +0200:
> > So maybe I should just exclude the 9p RDMA transport from this 9p message
> > size reduction change in v5 until somebody had a chance to test this
> > change with RDMA.
> 
> Yes, I'm pretty certain it won't work so we'll want to exclude it unless
> we can extend the RDMA protocol to address buffers.

OK, agreed. It only needs a minor adjustment to this patch 12 to exclude the
RDMA transport (+2 lines or so). So no big deal.
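Something like this in p9_tag_alloc() for instance (hypothetical sketch only;
'pooled_rbuffers' would be a new flag on struct p9_trans_module, it does not
exist in this series):

	if (c->trans_mod->pooled_rbuffers) {
		/* RDMA posts a shared pool of receive buffers, so keep
		 * reply buffers at full msize on that transport */
		alloc_rsize = c->msize;
	}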

> > On the long-term I can imagine to add RDMA transport support on QEMU 9p
> > side.
> What would you expect it to be used for?

There are several potential use cases that would come to my mind, e.g:

- Separating storage hardware from host hardware. With virtio we are
  constrained to the same machine.

- Maybe also a candidate to achieve what the 9p 'synth' driver in QEMU tried 
  to achieve? That 'synth' driver is running in a separate process from the 
  QEMU process, with the goal of increasing safety. However it is currently 
  more or less abandoned as it is extremely slow, since 9p requests have to be
  dispatched like:

   guest -> QEMU (9p server) -> synth daemon -> QEMU (9p server) -> guest

  Maybe we could rid of those costly extra hops with RDMA, not sure though.

- Maybe also an alternative to virtio on the same machine: there are some 
  shortcomings in virtio that are tedious to address (see e.g. the current 
  struggle with purely formal negotiation issues just to relax the virtio spec 
  regarding its "Queue Size" requirements so that we could achieve higher 
  message sizes). I'm also not a big fan of virtio's assumption that the guest 
  should guess the host's response size in advance.

- Maybe as transport for macOS guest support in future? Upcoming QEMU 7.0 adds
  support for macOS 9p hosts, which revives the plan to add support for 9p
  to macOS guests as well. The question still is what transport to use for 
  macOS guests then.

However I don't know any RDMA internals yet, and as you already outlined, it
probably has some shortcomings that would need to be addressed with protocol
changes as well.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/12] remove msize limit in virtio transport
  2022-01-24 11:07       ` Dominique Martinet
  2022-01-24 11:57         ` Christian Schoenebeck
@ 2022-05-24  8:10         ` Nikolay Kichukov
  2022-05-24 11:29           ` Christian Schoenebeck
  1 sibling, 1 reply; 32+ messages in thread
From: Nikolay Kichukov @ 2022-05-24  8:10 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Christian Schoenebeck, v9fs-developer, netdev,
	Eric Van Hensbergen, Latchesar Ionkov, Greg Kurz, Vivek Goyal

Hello Dominique,

On Mon, 2022-01-24 at 20:07 +0900, Dominique Martinet wrote:
> Nikolay Kichukov wrote on Mon, Jan 24, 2022 at 11:21:08AM +0100:
> > It works, sorry for overlooking the 'known limitations' in the first
> > place. When do we expect these patches to be merged upstream?
> 
> We're just starting a new development cycle for 5.18 while 5.17 is
> stabilizing, so this mostly depends on the ability to check if a msize
> given in parameter is valid as described in the first "STILL TO DO"
> point listed in the cover letter.
> 
> I personally would be happy considering this series for this cycle
> with
> just a max msize of 4MB-8k and leave that further bump for later if
> we're sure qemu will handle it.
> We're still seeing a boost for that and the smaller buffers for small
> messages will benefit all transport types, so that would get in in
> roughly two months for 5.18-rc1, then another two months for 5.18 to
> actually be released and start hitting production code.
> 
> 
> I'm not sure when exactly but I'll run some tests with it as well and
> redo a proper code review within the next few weeks, so we can get
> this
> in -next for a little while before the merge window.
> 

Did these patches make it into 5.18? I see it just got released...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/12] remove msize limit in virtio transport
  2022-05-24  8:10         ` Nikolay Kichukov
@ 2022-05-24 11:29           ` Christian Schoenebeck
  0 siblings, 0 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2022-05-24 11:29 UTC (permalink / raw)
  To: Dominique Martinet, Nikolay Kichukov
  Cc: v9fs-developer, netdev, Eric Van Hensbergen, Latchesar Ionkov,
	Greg Kurz, Vivek Goyal

On Dienstag, 24. Mai 2022 10:10:31 CEST Nikolay Kichukov wrote:
> Hello Dominique,
> 
> On Mon, 2022-01-24 at 20:07 +0900, Dominique Martinet wrote:
> > Nikolay Kichukov wrote on Mon, Jan 24, 2022 at 11:21:08AM +0100:
> > > It works, sorry for overlooking the 'known limitations' in the first
> > > place. When do we expect these patches to be merged upstream?
> > 
> > We're just starting a new development cycle for 5.18 while 5.17 is
> > stabilizing, so this mostly depends on the ability to check if a msize
> > given in parameter is valid as described in the first "STILL TO DO"
> > point listed in the cover letter.
> > 
> > I personally would be happy considering this series for this cycle
> > with
> > just a max msize of 4MB-8k and leave that further bump for later if
> > we're sure qemu will handle it.
> > We're still seeing a boost for that and the smaller buffers for small
> > messages will benefit all transport types, so that would get in in
> > roughly two months for 5.18-rc1, then another two months for 5.18 to
> > actually be released and start hitting production code.
> > 
> > 
> > I'm not sure when exactly but I'll run some tests with it as well and
> > redo a proper code review within the next few weeks, so we can get
> > this
> > in -next for a little while before the merge window.
> 
> Did you make it into 5.18? I see it just got released...

No, not yet. I can send a v5 as outlined above, including an opt-out for the RDMA 
transport as Dominique noted; that wouldn't be much work for me.

However ATM I fear it would probably not help anybody, as we currently have 
two far more serious issues [1] regarding 9p's cache support introduced by the 
netfs changes: 

1. 9p cache enabled performs now worse than without any cache.

2. There are occasional misbehaviours, as the 9p cache opens FIDs in read-
only mode and then once in a while tries to write with those read-only FIDs, 
which causes EBADF errors.

[1] https://lore.kernel.org/lkml/9591612.lsmsJCMaJN@silver/

Issue (2.) can probably be fixed by just opening the FIDs in RW mode, as 
suggested by Dominique. But for performance issue (1.) nobody has had an idea yet.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/12] remove msize limit in virtio transport
  2021-12-30 13:23 [PATCH v4 00/12] remove msize limit in virtio transport Christian Schoenebeck
                   ` (12 preceding siblings ...)
  2022-01-20 22:43 ` [PATCH v4 00/12] remove msize limit in virtio transport Nikolay Kichukov
@ 2022-07-07 14:30 ` Christian Schoenebeck
       [not found]   ` <CAFkjPT=GAoViYd0E7CZQDq3ZjhmYT0DsBytfZXnE10JL0P8O-Q@mail.gmail.com>
  13 siblings, 1 reply; 32+ messages in thread
From: Christian Schoenebeck @ 2022-07-07 14:30 UTC (permalink / raw)
  To: Dominique Martinet, Greg Kurz
  Cc: v9fs-developer, netdev, Eric Van Hensbergen, Latchesar Ionkov,
	Nikolay Kichukov

On Donnerstag, 30. Dezember 2021 14:23:18 CEST Christian Schoenebeck wrote:
> This series aims to get rid of the current 500k 'msize' limitation in
> the 9p virtio transport, which is currently a bottleneck for performance
> of 9p mounts.
[...]
> KNOWN LIMITATION:
> 
> With this series applied I can run
> 
>   QEMU host <-> 9P virtio <-> Linux guest
> 
> with up to slightly below 4 MB msize [4186112 = (1024-2) * 4096]. If I try
> to run it with exactly 4 MB (4194304) it currently hits a limitation on
> QEMU side:
> 
>   qemu-system-x86_64: virtio: too many write descriptors in indirect table
> 
> That's because QEMU currently has a hard coded limit of max. 1024 virtio
> descriptors per vring slot (i.e. per virtio message), see to do (1.) below.
> 
> 
> STILL TO DO:
> 
>   1. Negotiating virtio "Queue Indirect Size" (MANDATORY):
> 
>     The QEMU issue described above must be addressed by negotiating the
>     maximum length of virtio indirect descriptor tables on virtio device
>     initialization. This would not only avoid the QEMU error above, but
>     would also allow msize of >4MB in future. Before that change can be done
>     on Linux and QEMU sides though, it first requires a change to the virtio
>     specs. Work on that on the virtio specs is in progress:
> 
>     https://github.com/oasis-tcs/virtio-spec/issues/122
> 
>     This is not really an issue for testing this series. Just stick to max.
>     msize=4186112 as described above and you will be fine. However for the
>     final PR this should obviously be addressed in a clean way.

As there is no progress on the virtio spec side in sight, I'm considering 
handling this issue in the upcoming v5 by simply assuming (hard coding) that the 
9p server supports up to 1024 virtio descriptors (memory segments) per 
round-trip message. That's maybe a bit unclean, but that's what other virtio 
drivers in the Linux kernel have done for many years as well, so I am not 
expecting any negative issues in practice.
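In other words, something like this (rough sketch only, names are made up;
the "- 2" matches the (1024-2) * 4096 = 4186112 figure from the cover letter):

	/* assumed hard-coded limit, not negotiated with the server */
	#define P9_VIRTIO_MAX_DESCS 1024

	static size_t p9_virtio_max_msize(void)
	{
		return (P9_VIRTIO_MAX_DESCS - 2) * PAGE_SIZE; /* 4186112 with 4k pages */
	}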

And I mean, when we talk about 9p + virtio, that actually implies QEMU being 
the 9p server, right? At least I am not aware of another 9p server 
implementation supporting virtio transport (nor any QEMU version that ever 
supported less than 1024 virtio descriptors). Maybe Microsoft WSL? Not sure.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/12] remove msize limit in virtio transport
       [not found]   ` <CAFkjPT=GAoViYd0E7CZQDq3ZjhmYT0DsBytfZXnE10JL0P8O-Q@mail.gmail.com>
@ 2022-07-08  1:15     ` Dominique Martinet
       [not found]       ` <CAFkjPTngeFh=0mPVW-Yf1Sxkxp_HDNUeANndoYN3-eU9_rGLuQ@mail.gmail.com>
  0 siblings, 1 reply; 32+ messages in thread
From: Dominique Martinet @ 2022-07-08  1:15 UTC (permalink / raw)
  To: Eric Van Hensbergen
  Cc: Christian Schoenebeck, Greg Kurz, Latchesar Ionkov,
	Nikolay Kichukov, netdev, v9fs-developer

Eric Van Hensbergen wrote on Fri, Jul 08, 2022 at 10:44:45AM +1000:
> there are other 9p virtio servers - several emulation platforms support it
> sans qemu.

Would you happen to have any concrete examples?
I'd be curious if there are some that'd be easy to set up for testing, for
example; my current validation setup lacks a bit of diversity...

I found https://github.com/moby/hyperkit for OSX but that doesn't really
help me, and can't see much else relevant in a quick search

--
Dominique

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/12] remove msize limit in virtio transport
       [not found]       ` <CAFkjPTngeFh=0mPVW-Yf1Sxkxp_HDNUeANndoYN3-eU9_rGLuQ@mail.gmail.com>
@ 2022-07-08 11:18         ` Christian Schoenebeck
  2022-07-08 11:40           ` Dominique Martinet
  0 siblings, 1 reply; 32+ messages in thread
From: Christian Schoenebeck @ 2022-07-08 11:18 UTC (permalink / raw)
  To: Dominique Martinet, Eric Van Hensbergen
  Cc: Greg Kurz, Latchesar Ionkov, Nikolay Kichukov, netdev, v9fs-developer

On Freitag, 8. Juli 2022 04:26:40 CEST Eric Van Hensbergen wrote:
> kvmtool might be the easiest I guess - I’m traveling right now but I can
> try and find some others.  The arm fast models have free versions that are
> downloadable as well.  I know I’ve seen some other less-traditional uses of
> virtio particularly in libos deployments but will take some time to rattle
> those from my memory.

Some examples would indeed be useful, thanks!

> On Fri, Jul 8, 2022 at 11:16 AM Dominique Martinet <asmadeus@codewreck.org>
> 
> wrote:
> > Eric Van Hensbergen wrote on Fri, Jul 08, 2022 at 10:44:45AM +1000:
> > > there are other 9p virtio servers - several emulation platforms support
> > 
> > it
> > 
> > > sans qemu.
> > 
> > Would you happen to have any concrete example?
> > I'd be curious if there are some that'd be easy to setup for test for
> > example; my current validation setup lacks a bit of diversity...
> > 
> > I found https://github.com/moby/hyperkit for OSX but that doesn't really
> > help me, and can't see much else relevant in a quick search

So that appears to be a 9p (@virtio-PCI) client for xhyve, with max. 256kB 
buffers <=> max. 68 virtio descriptors (memory segments) [1]:

/* XXX issues with larger buffers elsewhere in stack */
#define BUFSIZE (1 << 18)
#define MAXDESC (BUFSIZE / 4096 + 4)
#define VT9P_RINGSZ (BUFSIZE / 4096 * 4)

[1] https://github.com/moby/hyperkit/blob/master/src/lib/pci_virtio_9p.c#L27

But on xhyve side I don't see any 9p server implementation:
https://github.com/machyve/xhyve/search?q=9p
Maybe a 9p server is already implemented by Apple's Hypervisor framework. I 
don't find this documented anywhere though.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/12] remove msize limit in virtio transport
  2022-07-08 11:18         ` Christian Schoenebeck
@ 2022-07-08 11:40           ` Dominique Martinet
  2022-07-08 13:00             ` Christian Schoenebeck
  0 siblings, 1 reply; 32+ messages in thread
From: Dominique Martinet @ 2022-07-08 11:40 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Eric Van Hensbergen, Greg Kurz, Latchesar Ionkov,
	Nikolay Kichukov, netdev, v9fs-developer

Christian Schoenebeck wrote on Fri, Jul 08, 2022 at 01:18:40PM +0200:
> On Freitag, 8. Juli 2022 04:26:40 CEST Eric Van Hensbergen wrote:
> > kvmtool might be the easiest I guess - I’m traveling right now but I can
> > try and find some others.  The arm fast models have free versions that are
> > downloadable as well.  I know I’ve seem some other less-traditional uses of
> > virtio particularly in libos deployments but will take some time to rattle
> > those from my memory.
> 
> Some examples would indeed be useful, thanks!

https://github.com/kvmtool/kvmtool indeed has a 9p server; I think I
used to run it ages ago.
I'll give it a fresh spin, thanks for the reminder.

For this one it defines VIRTQUEUE_NUM to 128, so not quite 1024.


> > > I found https://github.com/moby/hyperkit for OSX but that doesn't really
> > > help me, and can't see much else relevant in a quick search
> 
> So that appears to be a 9p (@virtio-PCI) client for xhyve,

oh the 9p part is client code?
the readme says it's a server:
"It includes a complete hypervisor, based on xhyve/bhyve"
but I can't run it anyway, so I didn't check very hard.

> with max. 256kB buffers <=> max. 68 virtio descriptors (memory segments) [1]:

huh...

Well, as long as msize is set I assume it'll work out anyway? How does
virtio queue size work with e.g. parallel messages?

Anyway, even if the negotiation part gets done servers won't all get
implemented in a day, so we need to think of other servers a bit..

--
Dominique

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/12] remove msize limit in virtio transport
  2022-07-08 11:40           ` Dominique Martinet
@ 2022-07-08 13:00             ` Christian Schoenebeck
  0 siblings, 0 replies; 32+ messages in thread
From: Christian Schoenebeck @ 2022-07-08 13:00 UTC (permalink / raw)
  To: Dominique Martinet, Eric Van Hensbergen
  Cc: Greg Kurz, Latchesar Ionkov, Nikolay Kichukov, netdev, v9fs-developer

On Freitag, 8. Juli 2022 13:40:36 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Fri, Jul 08, 2022 at 01:18:40PM +0200:
> > On Freitag, 8. Juli 2022 04:26:40 CEST Eric Van Hensbergen wrote:
[...]
> https://github.com/kvmtool/kvmtool indeed has a 9p server, I think I
> used to run it ages ago.
> I'll give it a fresh spin, thanks for the reminder.
> 
> For this one it defines VIRTQUEUE_NUM to 128, so not quite 1024.

Yes, and it does *not* limit the client-supplied 'msize' either. It just always 
sends the same 'msize' value as-is back to the client. :/ So I would expect it 
to error (or worse) if the client tries msize > 512kB.
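
Something like the following is what I would expect a server to do on
Tversion instead (just a rough sketch with made-up names, not actual kvmtool
code; 'srv_max_msize' stands for whatever limit the transport imposes, e.g.
derived from the virtqueue size):

#include <stdint.h>

/* Hypothetical msize negotiation: clamp the client-requested msize to what
 * the server/transport can actually carry, instead of echoing the value
 * back as-is. The clamped result is what goes into the Rversion reply. */
static uint32_t negotiate_msize(uint32_t client_msize, uint32_t srv_max_msize)
{
	return client_msize < srv_max_msize ? client_msize : srv_max_msize;
}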

> > > > I found https://github.com/moby/hyperkit for OSX but that doesn't
> > > > really
> > > > help me, and can't see much else relevant in a quick search
> > 
> > So that appears to be a 9p (@virtio-PCI) client for xhyve,
> 
> oh the 9p part is client code?
> the readme says it's a server:
> "It includes a complete hypervisor, based on xhyve/bhyve"
> but I can't run it anyway, so I didn't check very hard.

Hmm, I actually just interpreted it as being a client because of this:

fprintf(stderr, "virtio-9p: unexpected EOF writing to server-- did the 9P server crash?\n");

But looking at it again, it seems you are right: at least I see that it also 
handles even-numbered 9p message types (i.e. T-messages), but only Twrite and 
Tflush? I don't see any Tversion or msize handling in general. [shrug]

> > with max. 256kB buffers <=> max. 68 virtio descriptors (memory segments) [1]:
> huh...
> 
> Well, as long as msize is set I assume it'll work out anyway?

If the server limited 'msize' appropriately, yes. But as the kvmtool example 
shows, that should probably not be taken for granted.

> How does virtio queue size work with e.g. parallel messages?

Simple question, complicated to answer.

From the virtio spec's PoV (up to and including the current virtio v1.2), the 
negotiated virtio queue size defines both the max. number of parallel 
(round-trip) messages *and* the max. number of virtio descriptors (memory 
segments) of *all* currently active/parallel messages in total. I "think" that 
is because of virtio's origin in virtualized network devices?

So yes, if you are very strict about what the virtio spec <= v1.2 says, and say 
you have a virtio queue size of 128 (e.g. hard coded by QEMU, kvmtool), and the 
client sends out a first 9p request with 128 memory segments, then the next 
(i.e. second) parallel 9p request sent to the server would already exceed the 
theoretically allowed max. number of virtio descriptors.
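
Purely as a back-of-the-envelope illustration of that strict reading (not
taken from any of the servers discussed; page size and numbers just mirror
the example above):

#include <stdio.h>

int main(void)
{
	const unsigned page_size = 4096;    /* assumed segment granularity */
	const unsigned queue_size = 128;    /* e.g. hard coded by QEMU, kvmtool */
	const unsigned msize = 512 * 1024;  /* 512 kB payload */

	/* descriptors (memory segments) a single message of 'msize' needs */
	unsigned desc_per_msg = (msize + page_size - 1) / page_size;  /* 128 */

	/* parallel messages allowed under the strict "total descriptors of
	 * all active messages <= queue size" interpretation */
	printf("descriptors per message: %u\n", desc_per_msg);
	printf("strictly allowed parallel messages: %u\n",
	       queue_size / desc_per_msg);                            /* 1 */
	return 0;
}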

But in practice, I don't see that this theoretical limitation exists in actual 
9p virtio server implementations. At least all server implementations I have 
seen so far seem to enforce the max. number of virtio descriptors for each 
request separately.

> Anyway, even if the negotiation part gets done servers won't all get
> implemented in a day, so we need to think of other servers a bit..

OTOH the kernel should have the name of the hypervisor/emulator somewhere, 
right? So the Linux client's max. number of virtio descriptors could probably 
be made dependent on that name?

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2022-07-08 13:00 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-30 13:23 [PATCH v4 00/12] remove msize limit in virtio transport Christian Schoenebeck
2021-12-30 13:23 ` [PATCH v4 12/12] net/9p: allocate appropriate reduced message buffers Christian Schoenebeck
2022-04-02 14:05   ` Dominique Martinet
2022-04-03 11:29     ` Christian Schoenebeck
2022-04-03 12:37       ` Dominique Martinet
2022-04-03 14:00         ` Christian Schoenebeck
2021-12-30 13:23 ` [PATCH v4 08/12] net/9p: limit 'msize' to KMALLOC_MAX_SIZE for all transports Christian Schoenebeck
2021-12-30 13:23 ` [PATCH v4 04/12] 9p/trans_virtio: introduce struct virtqueue_sg Christian Schoenebeck
2021-12-30 13:23 ` [PATCH v4 01/12] net/9p: show error message if user 'msize' cannot be satisfied Christian Schoenebeck
2021-12-30 13:23 ` [PATCH v4 10/12] 9p: add P9_ERRMAX for 9p2000 and 9p2000.u Christian Schoenebeck
2021-12-30 13:23 ` [PATCH v4 11/12] net/9p: add p9_msg_buf_size() Christian Schoenebeck
2021-12-30 13:23 ` [PATCH v4 06/12] 9p/trans_virtio: support larger msize values Christian Schoenebeck
2021-12-30 13:23 ` [PATCH v4 02/12] 9p/trans_virtio: separate allocation of scatter gather list Christian Schoenebeck
2021-12-30 13:23 ` [PATCH v4 09/12] net/9p: split message size argument into 't_size' and 'r_size' pair Christian Schoenebeck
2021-12-30 13:23 ` [PATCH v4 07/12] 9p/trans_virtio: resize sg lists to whatever is possible Christian Schoenebeck
2021-12-30 13:23 ` [PATCH v4 05/12] net/9p: add trans_maxsize to struct p9_client Christian Schoenebeck
2021-12-30 13:23 ` [PATCH v4 03/12] 9p/trans_virtio: turn amount of sg lists into runtime info Christian Schoenebeck
2022-01-20 22:43 ` [PATCH v4 00/12] remove msize limit in virtio transport Nikolay Kichukov
2022-01-22 13:34   ` Christian Schoenebeck
2022-01-24 10:21     ` Nikolay Kichukov
2022-01-24 11:07       ` Dominique Martinet
2022-01-24 11:57         ` Christian Schoenebeck
2022-01-24 12:56           ` Dominique Martinet
2022-01-24 13:55             ` Christian Schoenebeck
2022-01-25  8:45           ` Nikolay Kichukov
2022-05-24  8:10         ` Nikolay Kichukov
2022-05-24 11:29           ` Christian Schoenebeck
2022-07-07 14:30 ` Christian Schoenebeck
     [not found]   ` <CAFkjPT=GAoViYd0E7CZQDq3ZjhmYT0DsBytfZXnE10JL0P8O-Q@mail.gmail.com>
2022-07-08  1:15     ` Dominique Martinet
     [not found]       ` <CAFkjPTngeFh=0mPVW-Yf1Sxkxp_HDNUeANndoYN3-eU9_rGLuQ@mail.gmail.com>
2022-07-08 11:18         ` Christian Schoenebeck
2022-07-08 11:40           ` Dominique Martinet
2022-07-08 13:00             ` Christian Schoenebeck
