On Wed, Nov 10, 2021 at 04:53:33PM +0100, Christian Schoenebeck wrote:
> On Mittwoch, 10. November 2021 16:14:19 CET Stefan Hajnoczi wrote:
> > On Wed, Nov 10, 2021 at 02:14:43PM +0100, Christian Schoenebeck wrote:
> > > On Mittwoch, 10. November 2021 11:05:50 CET Stefan Hajnoczi wrote:
> > > As you are apparently reluctant to change the virtio specs, what about
> > > introducing those discussed virtio capabilities either as experimental
> > > ones without spec changes, or even just as 9p-specific device
> > > capabilities for now? I mean, those could be revoked on both sides at
> > > any time anyway.
> >
> > I would like to understand the root cause before making changes.
> >
> > "It's faster when I do X" is useful information, but it doesn't
> > necessarily mean doing X is the solution. The "it's faster when I do X
> > because Y" part is missing in my mind. Once there is evidence that
> > shows Y, it will be clearer whether X is a good solution, whether there
> > is a more general solution, or whether it was just a side effect.
>
> I think I made it clear that the root cause of the observed performance
> gain with rising transmission size is latency (and also that performance
> is not the only reason for addressing this queue size issue).
>
> Each request roundtrip has a certain minimum latency: the virtio ring
> alone has its latency, plus the latency of the controller portion of the
> file server (e.g. permissions, sandbox checks, file IDs) that is
> executed with *every* request, plus the latency of dispatching the
> request handling between threads several times back and forth (also for
> each request).
>
> Therefore, when you split a large payload (e.g. reading a large file)
> into n smaller chunks, that individual per-request latency accumulates
> to n times the individual latency, eventually leading to degraded
> transmission speed as those requests are serialized.
It's easy to increase the block size in benchmarks, but real applications
offer less control over the I/O pattern. If latency in the device
implementation (QEMU) is the root cause, then reducing that latency speeds
up all applications, even those that cannot send huge requests.

One idea is request merging on the QEMU side: if the application sends 10
sequential read or write requests, coalesce them before the main part of
request processing begins in the device, then process a single large
request so that the fixed cost of the file server is spread across the 10
requests. (virtio-blk has request merging to help with the cost of lots of
small qcow2 I/O requests.)

The cool thing about this is that the guest does not need to change its
I/O pattern to benefit from the optimization.

Stefan
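For illustration only, a minimal sketch of what coalescing strictly
sequential reads could look like on the device side. The types and names
here are invented, and QEMU's actual virtio-blk merging differs in detail;
this just shows the shape of the idea:

```c
/* Hypothetical sketch of device-side request coalescing.
 * IoReq and coalesce_reads() are invented names, not QEMU APIs. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t offset; /* file offset of the read */
    uint32_t len;    /* bytes requested */
} IoReq;

/* Merge runs of strictly sequential reads in place so that the
 * expensive per-request file server path runs once per run instead
 * of once per guest request. Returns the new request count. */
static size_t coalesce_reads(IoReq *reqs, size_t n)
{
    if (n == 0) {
        return 0;
    }
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (reqs[i].offset == reqs[out].offset + reqs[out].len) {
            reqs[out].len += reqs[i].len; /* extend the current run */
        } else {
            reqs[++out] = reqs[i];        /* start a new run */
        }
    }
    return out + 1;
}
```

A real implementation would also have to split the result buffer back
into the original descriptors when completing each guest request, and cap
the merged size; the sketch leaves all of that out.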