archive mirror
 help / color / mirror / Atom feed
From: Christian Schoenebeck <>
To: Dominique Martinet <>
Cc: Nikolay Kichukov <>,,,
	Eric Van Hensbergen <>,
	Latchesar Ionkov <>, Greg Kurz <>,
	Vivek Goyal <>
Subject: Re: [PATCH v3 6/7] 9p/trans_virtio: support larger msize values
Date: Sat, 20 Nov 2021 15:46:19 +0100	[thread overview]
Message-ID: <2717208.9V0g2NVZY4@silver> (raw)
In-Reply-To: <>

On Samstag, 20. November 2021 12:45:09 CET Dominique Martinet wrote:
> (Thanks for restarting this thread, this had totally slipped out of my
> mind...)

Hi guys,

I have (more or less) silently been working on these patches on all ends in 
the meantime. If desired I try to find some time next week to summarize 
current status of this overall effort and what still needs to be done.

> Nikolay Kichukov wrote on Sat, Nov 20, 2021 at 12:20:35PM +0100:
> > When the client mounts the share via virtio, requested msize is:
> > 10485760 or 104857600
> > 
> > however the mount succeeds with:
> > msize=507904 in the end as per the /proc filesystem. This is less than
> > the previous maximum value.
> (Not sure about this, I'll test these patches tomorrow, but since
> something failed I'm not surprised you have less than what you could
> have here: what do you get with a more reasonable value like 1M first?)

The highest 'msize' value possible for me with this particular version of the 
kernel patches is 4186212 (so slightly below 4MB).

Some benchmarks, linear reading a 12 GB file:

msize    average      notes

8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
1 MB     2551 MB/s    this msize would violate current virtio specs
2 MB     2521 MB/s    this msize would violate current virtio specs
4 MB     2628 MB/s    planned milestone

That does not mean it already makes sense to use these patches in this version 
as is to increase overall performance yet, but the numbers already suggest 
that increasing msize can improve performance on large sequential I/O. There 
are still some things to do though to fix other use patterns from slowing down 
with rising msize, which I will describe in a separate email.

I also have an experimental version of kernel patches and QEMU that goes as 
high as msize=128MB. But that's a highly hacked version that copies buffers 
between 9p client level and virtio level back and forth and with intentional 
large buffer sizes on every 9p message type. This was solely meant as a stress 
test, i.e. whether it was possible to go as high as virtio's theoretical 
protocol limit in the first place, and whether it was stable. This stress test 
was not about performance at all. And yes, I had it running with 128MB for 
weeks without issues (except of being very slow of course due to hacks used).

> > In addition to the above, when the kernel on the guest boots and loads
> > 9pfs support, the attached memory allocation failure trace is generated.
> > 
> > Is anyone else seeing similar and was anybody able to get msize set to
> > 10MB via virtio protocol with these patches?
> I don't think the kernel would ever allow this with the given code, as
> the "common" part of 9p is not smart enough to use scatter-gather and
> tries to do a big allocation (almost) the size of the msize:
> ---
>         clnt->fcall_cache =
>                 kmem_cache_create_usercopy("9p-fcall-cache", clnt->msize,
>                                            0, 0, P9_HDRSZ + 4,
>                                            clnt->msize - (P9_HDRSZ + 4),
>                                            NULL);
> ...
> 		fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
> ---
> So in practice, you will be capped at 2MB as that is the biggest the
> slab will be able to hand over in a single chunk.

I did not encounter a 2MB limit here. But kmalloc() clearly has a 4MB limit, 
so when trying an msize larger than 4MB it inevitably causes a memory 
allocation error. In my tests this allocation error would always happen 
immediately at mount time causing an instant kernel oops.

> It'll also make allocation failures quite likely as soon as the system
> has had some uptime (depending on your workload, look at /proc/buddyinfo
> if your machines normally have 2MB chunks available), so I would really
> not recommend running with buffers bigger than e.g. 512k on real
> workloads -- it looks great on benchmarks, especially as it's on its own
> slab so as long as you're doing a lot of requests it will dish out
> buffers fast, but it'll likely be unreliable over time.
> (oh, and we allocate two of these per request...)
> The next step to support large buffers really would be splitting htat
> buffer to:
>  1/ not allocate huge buffers for small metadata ops, e.g. an open call
> certainly doesn't need to allocate 10MB of memory
>  2/ support splitting the buffer in some scattergather array
> Ideally we'd only allocate on an as-need basis, most of the protocol
> calls bound how much data is supposed to come back and we know how much
> we want to send (it's a format string actually, but we can majorate it
> quite easily), so one would need to adjust all protocol calls to pass
> this info to p9_client_rpc/p9_client_zc_rpc so it only allocates buffers
> as required, if necessary in multiple reasonably-sized segments (I'd
> love 2MB hugepages-backed folios...), and have all transports use these
> buffers.

It is not that bad in sense of pending work. One major thing that needs to be 
done is to cap the majority of 9p message types to allocate only as much as 
they need, which is for most message types <8k. Right now they always simply 
kmalloc(msize), which hurts with increasing msize values. That task does not 
need many changes though.

> I have a rough idea on how to do all this but honestly less than 0 time
> for that, so happy to give advices or review any patch, but it's going
> to be a lot of work that stand in the way of really big IOs.

Reviews of the patches would actually help a lot to bring this overall effort 
forward, but probably rather starting with the upcoming next version of the 
kernel patches, not this old v3.

> > [    1.527981] 9p: Installing v9fs 9p2000 file system support
> > [    1.528173] ------------[ cut here ]------------
> > [    1.528174] WARNING: CPU: 1 PID: 791 at mm/page_alloc.c:5356
> > __alloc_pages+0x1ed/0x290
> This warning is exactly what I was saying about the allocation cap:
> you've requested an allocation that was bigger than the max __alloc_page
> is willing to provide (MAX_ORDER, 11, so 2MB as I was saying)

Best regards,
Christian Schoenebeck

  reply	other threads:[~2021-11-20 15:06 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-22 16:00 [PATCH v3 0/7] net/9p: remove msize limit in virtio transport Christian Schoenebeck
2021-09-22 16:00 ` [PATCH v3 1/7] net/9p: show error message if user 'msize' cannot be satisfied Christian Schoenebeck
2021-09-22 16:00 ` [PATCH v3 2/7] 9p/trans_virtio: separate allocation of scatter gather list Christian Schoenebeck
2021-09-22 16:00 ` [PATCH v3 3/7] 9p/trans_virtio: turn amount of sg lists into runtime info Christian Schoenebeck
2021-09-22 16:00 ` [PATCH v3 4/7] 9p/trans_virtio: introduce struct virtqueue_sg Christian Schoenebeck
2021-09-22 16:00 ` [PATCH v3 5/7] net/9p: add trans_maxsize to struct p9_client Christian Schoenebeck
2021-09-22 16:00 ` [PATCH v3 6/7] 9p/trans_virtio: support larger msize values Christian Schoenebeck
2021-11-20 11:20   ` Nikolay Kichukov
2021-11-20 11:45     ` Dominique Martinet
2021-11-20 14:46       ` Christian Schoenebeck [this message]
2021-11-20 21:28         ` Nikolay Kichukov
2021-11-20 23:02         ` Dominique Martinet
2021-11-21 16:57           ` Christian Schoenebeck
2021-11-21 22:12             ` Dominique Martinet
2021-11-22 13:32               ` Christian Schoenebeck
2021-11-22 22:35                 ` Dominique Martinet
2021-11-24 12:22                   ` Christian Schoenebeck
2021-09-22 16:00 ` [PATCH v3 7/7] 9p/trans_virtio: resize sg lists to whatever is possible Christian Schoenebeck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2717208.9V0g2NVZY4@silver \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).