From: Christian Schoenebeck <linux_oss@crudebyte.com>
To: Dominique Martinet <asmadeus@codewreck.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>,
	Greg Kurz <groug@kaod.org>, Latchesar Ionkov <lucho@ionkov.net>,
	netdev@vger.kernel.org, v9fs-developer@lists.sourceforge.net
Subject: Re: [PATCH 2/2] net/9p: increase default msize to 128k
Date: Fri, 10 Sep 2021 16:04:56 +0200	[thread overview]
Message-ID: <6943365.098VrMUqUo@silver> (raw)
In-Reply-To: <YTVb3K37JxUWUdXN@codewreck.org>

On Sonntag, 5. September 2021 23:48:04 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Sun, Sep 05, 2021 at 03:33:11PM +0200:
> > > Subject: [PATCH] net/9p: increase tcp max msize to 1MB
> > 
> > Yes, makes sense.
> > 
> > Is the TCP transport driver of the Linux 9p client somewhat equally mature
> > to the virtio transport driver? Because I still have macOS support for 9p
> > in QEMU on my TODO list and accordingly a decision for a specific
> > transport type for macOS will be needed.
> 
> I'd say it should be just about as stable as virtio.. It's definitely
> getting a lot of tests like syzcaller as it's easy to open an arbitrary
> fd and pass that to kernel for fuzzing.
> 
> Performance-wise you won't have zero-copy, and the monolithic memory
> blocks requirement is the same, so it won't be as good as virtio but it
> can probably be used. The problem will more be one of setup -- for
> user networking we can always use qemu's networking implementation, but
> for passthrough or tap qemu won't easily be discoverable from the VM,
> I'm not sure about that.
> Does AF_VSOCK work on macOS? That could probably easily be added to
> trans_fd...

I haven't looked yet into what the latest options are on macOS. It will
probably take a while before I have time for that, as it is not high
priority for me.

With macOS 11 Apple added many new restrictions on what can still be done
beyond user space. For virtualization, for instance, you would now use the
hypervisor provided by macOS. Apple sent a bunch of patches last year to
support their HVF (Hypervisor.framework) accelerator in QEMU.

> > On Samstag, 27. Februar 2021 01:03:40 CEST Dominique Martinet wrote:
> > > Christian Schoenebeck wrote on Fri, Feb 26, 2021 at 02:49:12PM +0100:
> > > > Right now the client uses a hard coded amount of 128 elements. So what
> > > > about replacing VIRTQUEUE_NUM by a variable which is initialized with
> > > > a value according to the user's requested 'msize' option at init time?
> > > > 
> > > > According to the virtio specs the max. amount of elements in a
> > > > virtqueue is 32768. So 32768 * 4k = 128M as new upper limit would
> > > > already be a significant improvement and would not require too many
> > > > changes to the client code, right?
> > > 
> > > The current code inits the chan->sg at probe time (when the driver is
> > > loaded) and not at mount time, and it is currently embedded in the chan
> > > struct, so that would need allocating at mount time (p9_client_create ;
> > > either resizing if required or not sharing) but it doesn't sound too
> > > intrusive, yes.
> > > 
> > > I don't see more adherences to VIRTQUEUE_NUM that would hurt trying.
> > 
> > So probably as a first step I would only re-allocate the virtio elements
> > according to the user supplied (i.e. large) 'msize' value if the 9p driver
> > is not in use at that point, and stick to capping otherwise. That should
> > probably be fine for many users and would avoid need for some
> > synchronization measures and the traps it might bring. But again, I still
> > need to read more of the Linux client code first.
> 
> Right, looking at it again p9_virtio_create would already allow that so
> you just need to pick the most appropriate channel or create a new one
> if required, synchronization shouldn't be too much of a problem.
> 
> The 9p code is sometimes impressive in its foresight ;)

I just realized this is going to be much more work than I expected.
Apparently a single (unchained) scatterlist allocation is limited to
SG_MAX_SINGLE_ALLOC entries, which typically works out to 128:

/*
 * Maximum number of entries that will be allocated in one piece, if
 * a list larger than this is required then chaining will be utilized.
 */
#define SG_MAX_SINGLE_ALLOC		(PAGE_SIZE / sizeof(struct scatterlist))

/*
 * The maximum number of SG segments that we will put inside a
 * scatterlist (unless chaining is used). Should ideally fit inside a
 * single page, to avoid a higher order allocation.  We could define this
 * to SG_MAX_SINGLE_ALLOC to pack correctly at the highest order.  The
 * minimum value is 32
 */
#define SG_CHUNK_SIZE	128

Which explains the current hard coded default value of 128 sglist elements in 
the 9p virtio transport:

#define VIRTQUEUE_NUM	128
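
Just for a rough estimate (assuming 4 KiB pages and ignoring the few
elements reserved for the 9p message header), that hard coded limit
translates into an upper bound per 9p message of roughly:

	128 elements * 4096 bytes/page = 524288 bytes = 512 KiB

So no matter how large an msize the user requests, the virtio transport
currently cannot move much more than about half a megabyte per message.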

So struct virtio_chan would need to switch from a plain scatterlist array
to a different type, probably sg_table. The latter's API automatically
chains sg lists if the required amount of elements exceeds
SG_MAX_SINGLE_ALLOC; it repurposes the last sglist element of each chunk
to point to the next chunk. However, chaining is not supported on some
architectures:

int __sg_alloc_table(struct sg_table *table, unsigned int nents,
		     unsigned int max_ents, struct scatterlist *first_chunk,
		     unsigned int nents_first_chunk, gfp_t gfp_mask,
		     sg_alloc_fn *alloc_fn)
{
	...
#ifdef CONFIG_ARCH_NO_SG_CHAIN
	if (WARN_ON_ONCE(nents > max_ents))
		return -EINVAL;
#endif
	...
}

So those architectures would still end up with the current limitation of
128 sglist elements in the 9p virtio transport.

And it is also not clear to me whether the Linux sg_table API allows
growing a table that has already been allocated; say the virtio transport
retained the current pre-allocation of the hard coded 128 elements in
early p9_virtio_probe() and would then just append more sglists in
p9_virtio_create() if needed due to a large msize option provided by the
user. It "looks" like sg_alloc_table() might allow appending, but I am
not sure.
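
Independent of the question of appending, the simple mount-time allocation
would roughly look like this (just an untested sketch; 'sgt' would be a
new field replacing the current chan->sg array):

static int p9_virtio_alloc_sgt(struct virtio_chan *chan, size_t msize)
{
	unsigned int nents = DIV_ROUND_UP(msize, PAGE_SIZE);

#ifdef CONFIG_ARCH_NO_SG_CHAIN
	/* no chaining possible -> stick to the old 128 elements cap */
	nents = min_t(unsigned int, nents, SG_MAX_SINGLE_ALLOC);
#endif

	/* sg_alloc_table() transparently chains additional sglist chunks
	 * if nents exceeds SG_MAX_SINGLE_ALLOC */
	return sg_alloc_table(&chan->sgt, nents, GFP_KERNEL);
}

Plus a matching sg_free_table(&chan->sgt) on the release path, of course.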

Another type candidate would be struct sg_append_table, but its API wants a 
complete list of pages to be provided upon table allocation upfront:

/**
 * sg_alloc_append_table_from_pages - Allocate and initialize an append sg
 *                                    table from an array of pages
 * @sgt_append:  The sg append table to use
 * @pages:       Pointer to an array of page pointers
 * @n_pages:     Number of pages in the pages array
 * @offset:      Offset from start of the first page to the start of a buffer
 * @size:        Number of valid bytes in the buffer (after offset)
 * @max_segment: Maximum size of a scatterlist element in bytes
 * @left_pages:  Left pages caller have to set after this call
 * @gfp_mask:	 GFP allocation mask
 *
 * Description:
 *    In the first call it allocate and initialize an sg table from a list of
 *    pages, else reuse the scatterlist from sgt_append. Contiguous ranges of
 *    the pages are squashed into a single scatterlist entry up to the maximum
 *    size specified in @max_segment.  A user may provide an offset at a start
 *    and a size of valid data in a buffer specified by the page array. The
 *    returned sg table is released by sg_free_append_table
 *
 * Returns:
 *   0 on success, negative error on failure
 *
 * Notes:
 *   If this function returns non-0 (eg failure), the caller must call
 *   sg_free_append_table() to cleanup any leftover allocations.
 *
 *   In the fist call, sgt_append must by initialized.
 */
int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
		struct page **pages, unsigned int n_pages, unsigned int offset,
		unsigned long size, unsigned int max_segment,
		unsigned int left_pages, gfp_t gfp_mask)
{
	...
}

Which does not really fit here (ATM), as the 9p virtio transport
apparently maps the pages on demand per request, not already when the
client/transport is created.

Either way, the functions pack_sg_list() and pack_sg_list_p() in the 9p
virtio transport would additionally need to be adjusted to iterate over
chained sglists, roughly like:

static int pack_sg_list(struct sg_table *sgtable, ...)
{
	struct scatterlist *sg;
	unsigned int i;

	/* for_each_sgtable_sg() follows chain pointers transparently, so
	 * the current flat loop over sg[0..127] becomes one loop over all
	 * (possibly chained) elements of the table */
	for_each_sgtable_sg(sgtable, sg, i) {
		/* ... fill sg, e.g. via sg_set_buf() / sg_set_page() ... */
	}
	...
}

Which is not really a trivial change set. :/

On Montag, 6. September 2021 02:07:56 CEST Dominique Martinet wrote:
> Eric Van Hensbergen wrote on Sun, Sep 05, 2021 at 06:44:13PM -0500:
> > there will likely be a tradeoff with tcp in terms of latency to first
> > message so while
> > absolute bw may be higher processing time may suffer.  8k was default
> > msize
> > to more closely match it to jumbo frames on an ethernet.  of course all
> > that intuition is close to 30 years out of date….
> 
> It's not because the max size is 128k (or 1MB) that this much is sent
> over the wire every time -- if a message used to fit in 8KB, then its
> on-the-wire size won't change and speed/latency won't be affected for
> these.
> 
> For messages that do require more than 8KB (read/write/readdir) then you
> can fit more data per message, so for a given userspace request (feed me
> xyz amount of data) you'll have less client-server round-trips, and the
> final user-reflected latency will be better as well -- that's why
> e.g. NFS has been setting a max size of 1MB by default for a while now,
> and they allow even more (32MB iirc? not sure)

I can only speak for the QEMU server <-> virtio <-> Linux client setup
for now:

In this setup there is definitely no latency disadvantage from raising
msize. It is actually the exact opposite: low msize values significantly
increase overall latency, because for each 9p message there is not only
the latency of the message transfer between server and client, but also
latency added on the server side when handling each request, as the
server has to dispatch between fs driver worker thread(s) and the server
main thread at least twice per 9p request, in some cases even more often.

That's actually something I'm working on: reducing this to its
theoretical limit of 2 thread hops for every 9p request type on the QEMU
server side (WIP).

The only bottleneck situation I can think of when raising msize to a
giant value is when a guest app constantly reads/writes with the maximum
chunk size allowed by msize (which 'cat' for instance always does, as it
reads stat's st_blksize). In that case 9p requests of other threads/
processes would need to wait, because that single 9p consumer would
occupy all available pages with a single, huge 9p request.

That could maybe be addressed by multiplying msize by the number of
available CPU cores, or something like that, when allocating the sg
table on the client transport side.
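
E.g. something along these lines when sizing the table (again just a
sketch, nothing I have tested yet):

	/* hypothetical: size the sg table such that several maximum sized
	 * requests can be served concurrently */
	unsigned int nents = DIV_ROUND_UP(msize, PAGE_SIZE) * num_online_cpus();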

> I only did these tests years ago and no longer have access to
> the note that was written back then, but TCP also definitely benefits
> from > 64k msize as long as there's enough memory available.
> 
> The downside (because it's not free) is there though, you need more
> memory for 9p with big buffers even if we didn't need so much in the
> first place.
> The code using a slab now means that the memory is not locked per mount
> as it used to, but that also means allocations can fail if there is a
> big pressure after not having been released. OTOH as long as it's
> consistently used the buffers will be recycled so it's not necessarily
> too bad performance-wise in hot periods.
> 
> Ideally we'd need to rework transports to allow scatter-gather (iovec or
> similar API), and work with smaller allocations batched together on
> send, but I don't have time for something like that... If we do that we
> can probably get the best of both worlds -- and could consider >1MB, but
> that's unrealistic as of now, regardless of the transport.

Ok, but that's not an issue for the virtio transport, is it? The virtio
transport already uses scatter-gather. AFAICS in the virtio case the
issue might rather be that it claims pages on demand instead of pinning
them when the client/transport is created, so I guess it could run out
of free pages during its lifetime.

Best regards,
Christian Schoenebeck


