* tools/virtiofs: Multi threading seems to hurt performance
@ 2020-09-18 21:34 ` Vivek Goyal
  0 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-18 21:34 UTC (permalink / raw)
  To: virtio-fs-list, qemu-devel; +Cc: Dr. David Alan Gilbert, Stefan Hajnoczi

Hi All,

virtiofsd default thread pool size is 64. To me it feels that in most of
the cases thread pool size 1 performs better than thread pool size 64.

I ran virtiofs-tests.

https://github.com/rhvgoyal/virtiofs-tests

And here are the comparison results. To me it seems that by default
we should switch to 1 thread (till we can figure out how to make
multi-thread performance better even when a single process is doing
I/O in the client).

I am especially interested in getting better performance for a
single process in the client. If that suffers, then it is pretty bad.

Especially look at randread, randwrite, seqwrite performance. seqread
seems pretty good anyway.

If I don't run the whole test suite and just run the randread-psync job,
my throughput jumps from around 40MB/s to 60MB/s. That's a huge
jump, I would say.

Thoughts?

Thanks
Vivek


NAME                    WORKLOAD                Bandwidth       IOPS            
cache-auto              seqread-psync           690(MiB/s)      172k            
cache-auto-1-thread     seqread-psync           729(MiB/s)      182k            

cache-auto              seqread-psync-multi     2578(MiB/s)     644k            
cache-auto-1-thread     seqread-psync-multi     2597(MiB/s)     649k            

cache-auto              seqread-mmap            660(MiB/s)      165k            
cache-auto-1-thread     seqread-mmap            672(MiB/s)      168k            

cache-auto              seqread-mmap-multi      2499(MiB/s)     624k            
cache-auto-1-thread     seqread-mmap-multi      2618(MiB/s)     654k            

cache-auto              seqread-libaio          286(MiB/s)      71k             
cache-auto-1-thread     seqread-libaio          260(MiB/s)      65k             

cache-auto              seqread-libaio-multi    1508(MiB/s)     377k            
cache-auto-1-thread     seqread-libaio-multi    986(MiB/s)      246k            

cache-auto              randread-psync          35(MiB/s)       9191            
cache-auto-1-thread     randread-psync          55(MiB/s)       13k             

cache-auto              randread-psync-multi    179(MiB/s)      44k             
cache-auto-1-thread     randread-psync-multi    209(MiB/s)      52k             

cache-auto              randread-mmap           32(MiB/s)       8273            
cache-auto-1-thread     randread-mmap           50(MiB/s)       12k             

cache-auto              randread-mmap-multi     161(MiB/s)      40k             
cache-auto-1-thread     randread-mmap-multi     185(MiB/s)      46k             

cache-auto              randread-libaio         268(MiB/s)      67k             
cache-auto-1-thread     randread-libaio         254(MiB/s)      63k             

cache-auto              randread-libaio-multi   256(MiB/s)      64k             
cache-auto-1-thread     randread-libaio-multi   155(MiB/s)      38k             

cache-auto              seqwrite-psync          23(MiB/s)       6026            
cache-auto-1-thread     seqwrite-psync          30(MiB/s)       7925            

cache-auto              seqwrite-psync-multi    100(MiB/s)      25k             
cache-auto-1-thread     seqwrite-psync-multi    154(MiB/s)      38k             

cache-auto              seqwrite-mmap           343(MiB/s)      85k             
cache-auto-1-thread     seqwrite-mmap           355(MiB/s)      88k             

cache-auto              seqwrite-mmap-multi     408(MiB/s)      102k            
cache-auto-1-thread     seqwrite-mmap-multi     438(MiB/s)      109k            

cache-auto              seqwrite-libaio         41(MiB/s)       10k             
cache-auto-1-thread     seqwrite-libaio         65(MiB/s)       16k             

cache-auto              seqwrite-libaio-multi   137(MiB/s)      34k             
cache-auto-1-thread     seqwrite-libaio-multi   214(MiB/s)      53k             

cache-auto              randwrite-psync         22(MiB/s)       5801            
cache-auto-1-thread     randwrite-psync         30(MiB/s)       7927            

cache-auto              randwrite-psync-multi   100(MiB/s)      25k             
cache-auto-1-thread     randwrite-psync-multi   151(MiB/s)      37k             

cache-auto              randwrite-mmap          31(MiB/s)       7984            
cache-auto-1-thread     randwrite-mmap          55(MiB/s)       13k             

cache-auto              randwrite-mmap-multi    124(MiB/s)      31k             
cache-auto-1-thread     randwrite-mmap-multi    213(MiB/s)      53k             

cache-auto              randwrite-libaio        40(MiB/s)       10k             
cache-auto-1-thread     randwrite-libaio        64(MiB/s)       16k             

cache-auto              randwrite-libaio-multi  139(MiB/s)      34k             
cache-auto-1-thread     randwrite-libaio-multi  212(MiB/s)      53k             



* Re: tools/virtiofs: Multi threading seems to hurt performance
  2020-09-18 21:34 ` [Virtio-fs] " Vivek Goyal
@ 2020-09-21  8:39   ` Stefan Hajnoczi
  -1 siblings, 0 replies; 107+ messages in thread
From: Stefan Hajnoczi @ 2020-09-21  8:39 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs-list, qemu-devel, Dr. David Alan Gilbert


On Fri, Sep 18, 2020 at 05:34:36PM -0400, Vivek Goyal wrote:
> And here are the comparison results. To me it seems that by default
> we should switch to 1 thread (till we can figure out how to make
> multi-thread performance better even when a single process is doing
> I/O in the client).

Let's understand the reason before making changes.

Questions:
 * Is "1-thread" --thread-pool-size=1?
 * Was DAX enabled?
 * How does cache=none perform?
 * Does commenting out vu_queue_get_avail_bytes() + fuse_log("%s:
   Queue %d gave evalue: %zx available: in: %u out: %u\n") in
   fv_queue_thread help?
 * How do the kvm_stat vmexit counters compare?
 * How does host mpstat -P ALL compare?
 * How does host perf record -a compare?
 * Does the Rust virtiofsd show the same pattern (it doesn't use glib
   thread pools)?

Stefan

> NAME                    WORKLOAD                Bandwidth       IOPS            
> cache-auto              seqread-psync           690(MiB/s)      172k            
> cache-auto-1-thread     seqread-psync           729(MiB/s)      182k            
> 
> cache-auto              seqread-psync-multi     2578(MiB/s)     644k            
> cache-auto-1-thread     seqread-psync-multi     2597(MiB/s)     649k            
> 
> cache-auto              seqread-mmap            660(MiB/s)      165k            
> cache-auto-1-thread     seqread-mmap            672(MiB/s)      168k            
> 
> cache-auto              seqread-mmap-multi      2499(MiB/s)     624k            
> cache-auto-1-thread     seqread-mmap-multi      2618(MiB/s)     654k            
> 
> cache-auto              seqread-libaio          286(MiB/s)      71k             
> cache-auto-1-thread     seqread-libaio          260(MiB/s)      65k             
> 
> cache-auto              seqread-libaio-multi    1508(MiB/s)     377k            
> cache-auto-1-thread     seqread-libaio-multi    986(MiB/s)      246k            
> 
> cache-auto              randread-psync          35(MiB/s)       9191            
> cache-auto-1-thread     randread-psync          55(MiB/s)       13k             
> 
> cache-auto              randread-psync-multi    179(MiB/s)      44k             
> cache-auto-1-thread     randread-psync-multi    209(MiB/s)      52k             
> 
> cache-auto              randread-mmap           32(MiB/s)       8273            
> cache-auto-1-thread     randread-mmap           50(MiB/s)       12k             
> 
> cache-auto              randread-mmap-multi     161(MiB/s)      40k             
> cache-auto-1-thread     randread-mmap-multi     185(MiB/s)      46k             
> 
> cache-auto              randread-libaio         268(MiB/s)      67k             
> cache-auto-1-thread     randread-libaio         254(MiB/s)      63k             
> 
> cache-auto              randread-libaio-multi   256(MiB/s)      64k             
> cache-auto-1-thread     randread-libaio-multi   155(MiB/s)      38k             
> 
> cache-auto              seqwrite-psync          23(MiB/s)       6026            
> cache-auto-1-thread     seqwrite-psync          30(MiB/s)       7925            
> 
> cache-auto              seqwrite-psync-multi    100(MiB/s)      25k             
> cache-auto-1-thread     seqwrite-psync-multi    154(MiB/s)      38k             
> 
> cache-auto              seqwrite-mmap           343(MiB/s)      85k             
> cache-auto-1-thread     seqwrite-mmap           355(MiB/s)      88k             
> 
> cache-auto              seqwrite-mmap-multi     408(MiB/s)      102k            
> cache-auto-1-thread     seqwrite-mmap-multi     438(MiB/s)      109k            
> 
> cache-auto              seqwrite-libaio         41(MiB/s)       10k             
> cache-auto-1-thread     seqwrite-libaio         65(MiB/s)       16k             
> 
> cache-auto              seqwrite-libaio-multi   137(MiB/s)      34k             
> cache-auto-1-thread     seqwrite-libaio-multi   214(MiB/s)      53k             
> 
> cache-auto              randwrite-psync         22(MiB/s)       5801            
> cache-auto-1-thread     randwrite-psync         30(MiB/s)       7927            
> 
> cache-auto              randwrite-psync-multi   100(MiB/s)      25k             
> cache-auto-1-thread     randwrite-psync-multi   151(MiB/s)      37k             
> 
> cache-auto              randwrite-mmap          31(MiB/s)       7984            
> cache-auto-1-thread     randwrite-mmap          55(MiB/s)       13k             
> 
> cache-auto              randwrite-mmap-multi    124(MiB/s)      31k             
> cache-auto-1-thread     randwrite-mmap-multi    213(MiB/s)      53k             
> 
> cache-auto              randwrite-libaio        40(MiB/s)       10k             
> cache-auto-1-thread     randwrite-libaio        64(MiB/s)       16k             
> 
> cache-auto              randwrite-libaio-multi  139(MiB/s)      34k             
> cache-auto-1-thread     randwrite-libaio-multi  212(MiB/s)      53k             
> 
> 
> 
> 
> 
> 



* Re: tools/virtiofs: Multi threading seems to hurt performance
  2020-09-18 21:34 ` [Virtio-fs] " Vivek Goyal
@ 2020-09-21  8:50   ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 107+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-21  8:50 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs-list, qemu-devel, Stefan Hajnoczi

* Vivek Goyal (vgoyal@redhat.com) wrote:
> Hi All,
> 
> virtiofsd default thread pool size is 64. To me it feels that in most of
> the cases thread pool size 1 performs better than thread pool size 64.
> 
> I ran virtiofs-tests.
> 
> https://github.com/rhvgoyal/virtiofs-tests
> 
> And here are the comparison results. To me it seems that by default
> we should switch to 1 thread (till we can figure out how to make
> multi-thread performance better even when a single process is doing
> I/O in the client).
> 
> I am especially interested in getting better performance for a
> single process in the client. If that suffers, then it is pretty bad.
> 
> Especially look at randread, randwrite, seqwrite performance. seqread
> seems pretty good anyway.
> 
> If I don't run the whole test suite and just run the randread-psync job,
> my throughput jumps from around 40MB/s to 60MB/s. That's a huge
> jump, I would say.
> 
> Thoughts?

What's your host setup; how many cores has the host got and how many did
you give the guest?

Dave

> Thanks
> Vivek
> 
> 
> NAME                    WORKLOAD                Bandwidth       IOPS            
> cache-auto              seqread-psync           690(MiB/s)      172k            
> cache-auto-1-thread     seqread-psync           729(MiB/s)      182k            
> 
> cache-auto              seqread-psync-multi     2578(MiB/s)     644k            
> cache-auto-1-thread     seqread-psync-multi     2597(MiB/s)     649k            
> 
> cache-auto              seqread-mmap            660(MiB/s)      165k            
> cache-auto-1-thread     seqread-mmap            672(MiB/s)      168k            
> 
> cache-auto              seqread-mmap-multi      2499(MiB/s)     624k            
> cache-auto-1-thread     seqread-mmap-multi      2618(MiB/s)     654k            
> 
> cache-auto              seqread-libaio          286(MiB/s)      71k             
> cache-auto-1-thread     seqread-libaio          260(MiB/s)      65k             
> 
> cache-auto              seqread-libaio-multi    1508(MiB/s)     377k            
> cache-auto-1-thread     seqread-libaio-multi    986(MiB/s)      246k            
> 
> cache-auto              randread-psync          35(MiB/s)       9191            
> cache-auto-1-thread     randread-psync          55(MiB/s)       13k             
> 
> cache-auto              randread-psync-multi    179(MiB/s)      44k             
> cache-auto-1-thread     randread-psync-multi    209(MiB/s)      52k             
> 
> cache-auto              randread-mmap           32(MiB/s)       8273            
> cache-auto-1-thread     randread-mmap           50(MiB/s)       12k             
> 
> cache-auto              randread-mmap-multi     161(MiB/s)      40k             
> cache-auto-1-thread     randread-mmap-multi     185(MiB/s)      46k             
> 
> cache-auto              randread-libaio         268(MiB/s)      67k             
> cache-auto-1-thread     randread-libaio         254(MiB/s)      63k             
> 
> cache-auto              randread-libaio-multi   256(MiB/s)      64k             
> cache-auto-1-thread     randread-libaio-multi   155(MiB/s)      38k             
> 
> cache-auto              seqwrite-psync          23(MiB/s)       6026            
> cache-auto-1-thread     seqwrite-psync          30(MiB/s)       7925            
> 
> cache-auto              seqwrite-psync-multi    100(MiB/s)      25k             
> cache-auto-1-thread     seqwrite-psync-multi    154(MiB/s)      38k             
> 
> cache-auto              seqwrite-mmap           343(MiB/s)      85k             
> cache-auto-1-thread     seqwrite-mmap           355(MiB/s)      88k             
> 
> cache-auto              seqwrite-mmap-multi     408(MiB/s)      102k            
> cache-auto-1-thread     seqwrite-mmap-multi     438(MiB/s)      109k            
> 
> cache-auto              seqwrite-libaio         41(MiB/s)       10k             
> cache-auto-1-thread     seqwrite-libaio         65(MiB/s)       16k             
> 
> cache-auto              seqwrite-libaio-multi   137(MiB/s)      34k             
> cache-auto-1-thread     seqwrite-libaio-multi   214(MiB/s)      53k             
> 
> cache-auto              randwrite-psync         22(MiB/s)       5801            
> cache-auto-1-thread     randwrite-psync         30(MiB/s)       7927            
> 
> cache-auto              randwrite-psync-multi   100(MiB/s)      25k             
> cache-auto-1-thread     randwrite-psync-multi   151(MiB/s)      37k             
> 
> cache-auto              randwrite-mmap          31(MiB/s)       7984            
> cache-auto-1-thread     randwrite-mmap          55(MiB/s)       13k             
> 
> cache-auto              randwrite-mmap-multi    124(MiB/s)      31k             
> cache-auto-1-thread     randwrite-mmap-multi    213(MiB/s)      53k             
> 
> cache-auto              randwrite-libaio        40(MiB/s)       10k             
> cache-auto-1-thread     randwrite-libaio        64(MiB/s)       16k             
> 
> cache-auto              randwrite-libaio-multi  139(MiB/s)      34k             
> cache-auto-1-thread     randwrite-libaio-multi  212(MiB/s)      53k             
> 
> 
> 
> 
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK





* Re: tools/virtiofs: Multi threading seems to hurt performance
  2020-09-21  8:50   ` [Virtio-fs] " Dr. David Alan Gilbert
@ 2020-09-21 13:35     ` Vivek Goyal
  -1 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-21 13:35 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs-list, qemu-devel, Stefan Hajnoczi

On Mon, Sep 21, 2020 at 09:50:19AM +0100, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgoyal@redhat.com) wrote:
> > Hi All,
> > 
> > virtiofsd default thread pool size is 64. To me it feels that in most of
> > the cases thread pool size 1 performs better than thread pool size 64.
> > 
> > I ran virtiofs-tests.
> > 
> > https://github.com/rhvgoyal/virtiofs-tests
> > 
> > And here are the comparison results. To me it seems that by default
> > we should switch to 1 thread (till we can figure out how to make
> > multi-thread performance better even when a single process is doing
> > I/O in the client).
> > 
> > I am especially interested in getting better performance for a
> > single process in the client. If that suffers, then it is pretty bad.
> > 
> > Especially look at randread, randwrite, seqwrite performance. seqread
> > seems pretty good anyway.
> > 
> > If I don't run the whole test suite and just run the randread-psync job,
> > my throughput jumps from around 40MB/s to 60MB/s. That's a huge
> > jump, I would say.
> > 
> > Thoughts?
> 
> What's your host setup; how many cores has the host got and how many did
> you give the guest?

The host has 2 processors with 16 cores each. With hyperthreading
enabled, that makes 32 logical cores per processor, or 64 logical cores
on the host.

I have given 32 to the guest.

Vivek





* Re: tools/virtiofs: Multi threading seems to hurt performance
  2020-09-21  8:39   ` [Virtio-fs] " Stefan Hajnoczi
@ 2020-09-21 13:39     ` Vivek Goyal
  -1 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-21 13:39 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs-list, qemu-devel, Dr. David Alan Gilbert

On Mon, Sep 21, 2020 at 09:39:23AM +0100, Stefan Hajnoczi wrote:
> On Fri, Sep 18, 2020 at 05:34:36PM -0400, Vivek Goyal wrote:
> > And here are the comparison results. To me it seems that by default
> > we should switch to 1 thread (till we can figure out how to make
> > multi-thread performance better even when a single process is doing
> > I/O in the client).
> 
> Let's understand the reason before making changes.
> 
> Questions:
>  * Is "1-thread" --thread-pool-size=1?

Yes.

>  * Was DAX enabled?

No.

>  * How does cache=none perform?

I just ran the random read workload with cache=none.

cache-none              randread-psync          45(MiB/s)       11k             
cache-none-1-thread     randread-psync          63(MiB/s)       15k

With 1 thread it offers more IOPS.

>  * Does commenting out vu_queue_get_avail_bytes() + fuse_log("%s:
>    Queue %d gave evalue: %zx available: in: %u out: %u\n") in
>    fv_queue_thread help?

Will try that.

>  * How do the kvm_stat vmexit counters compare?

This should be the same, shouldn't it? Changing the number of threads
serving requests should not change the number of vmexits.

>  * How does host mpstat -P ALL compare?

Never used mpstat. Will try running it and see if I can get something
meaningful.

>  * How does host perf record -a compare?

Will try it. I feel this might be too big and too verbose to get
something meaningful.

>  * Does the Rust virtiofsd show the same pattern (it doesn't use glib
>    thread pools)?

No idea. I have never tried the Rust implementation of virtiofsd.

But I suspect it has to do with the thread pool implementation and possibly
extra cacheline bouncing.
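
By cacheline bouncing I mean the usual false sharing effect. For reference,
it is easy to reproduce with a tiny standalone program that has nothing to
do with virtiofsd; a minimal sketch (build with gcc -O2 -pthread):

  #include <pthread.h>
  #include <stdio.h>

  /*
   * Two counters that share a cache line: concurrent writers keep pulling
   * the line back and forth between cores ("bouncing"). Padding the two
   * fields onto separate 64-byte lines makes the same loops run much faster.
   */
  static struct { volatile long a; volatile long b; } counters;

  static void *bump_a(void *arg)
  {
      for (long i = 0; i < 100000000; i++)
          counters.a++;
      return NULL;
  }

  static void *bump_b(void *arg)
  {
      for (long i = 0; i < 100000000; i++)
          counters.b++;
      return NULL;
  }

  int main(void)
  {
      pthread_t t1, t2;

      pthread_create(&t1, NULL, bump_a, NULL);
      pthread_create(&t2, NULL, bump_b, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);

      printf("a=%ld b=%ld\n", counters.a, counters.b);
      return 0;
  }

"perf c2c record" on the host should show whether virtiofsd is really
hitting contended lines like this.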

Thanks
Vivek





* Re: tools/virtiofs: Multi threading seems to hurt performance
  2020-09-21 13:35     ` [Virtio-fs] " Vivek Goyal
@ 2020-09-21 14:08       ` Daniel P. Berrangé
  -1 siblings, 0 replies; 107+ messages in thread
From: Daniel P. Berrangé @ 2020-09-21 14:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: virtio-fs-list, Dr. David Alan Gilbert, Stefan Hajnoczi, qemu-devel

On Mon, Sep 21, 2020 at 09:35:16AM -0400, Vivek Goyal wrote:
> On Mon, Sep 21, 2020 at 09:50:19AM +0100, Dr. David Alan Gilbert wrote:
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > Hi All,
> > > 
> > > virtiofsd default thread pool size is 64. To me it feels that in most of
> > > the cases thread pool size 1 performs better than thread pool size 64.
> > > 
> > > I ran virtiofs-tests.
> > > 
> > > https://github.com/rhvgoyal/virtiofs-tests
> > > 
> > > And here are the comparison results. To me it seems that by default
> > > we should switch to 1 thread (till we can figure out how to make
> > > multi-thread performance better even when a single process is doing
> > > I/O in the client).
> > > 
> > > I am especially interested in getting better performance for a
> > > single process in the client. If that suffers, then it is pretty bad.
> > > 
> > > Especially look at randread, randwrite, seqwrite performance. seqread
> > > seems pretty good anyway.
> > > 
> > > If I don't run the whole test suite and just run the randread-psync job,
> > > my throughput jumps from around 40MB/s to 60MB/s. That's a huge
> > > jump, I would say.
> > > 
> > > Thoughts?
> > 
> > What's your host setup; how many cores has the host got and how many did
> > you give the guest?
> 
> The host has 2 processors with 16 cores each. With hyperthreading
> enabled, that makes 32 logical cores per processor, or 64 logical cores
> on the host.
> 
> I have given 32 to the guest.

FWIW, I'd be inclined to disable hyperthreading in the BIOS for one
test to validate whether it is impacting the performance results seen.
Hyperthreads are weak compared to a real CPU, and could result in
misleading data even if you are limiting your guest to half of the host's
logical CPUs.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|





* Re: tools/virtiofs: Multi threading seems to hurt performance
  2020-09-18 21:34 ` [Virtio-fs] " Vivek Goyal
@ 2020-09-21 15:32   ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 107+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-21 15:32 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: jose.carlos.venegas.munoz, qemu-devel, cdupontd, virtio-fs-list,
	Stefan Hajnoczi, archana.m.shinde

Hi,
  I've been doing some of my own perf tests and I think I agree
about the thread pool size;  my test is a kernel build
and I've tried a bunch of different options.

My config:
  Host: 16 core AMD EPYC (32 thread), 128G RAM,
     5.9.0-rc4 kernel, rhel 8.2ish userspace.
  5.1.0 qemu/virtiofsd built from git.
  Guest: Fedora 32 from cloud image with just enough extra installed for
a kernel build.

  git cloned and checked out v5.8 of Linux into /dev/shm/linux on the host
fresh before each test.  Then log into the guest, make defconfig,
time make -j 16 bzImage,  make clean; time make -j 16 bzImage.
The numbers below are the 'real' time in the guest from the initial make
(the subsequent makes don't vary much).

Below are the details of what each of these means, but here are the
numbers first:

virtiofsdefault        4m0.978s
9pdefault              9m41.660s
virtiofscache=none    10m29.700s
9pmmappass             9m30.047s
9pmbigmsize           12m4.208s
9pmsecnone             9m21.363s
virtiofscache=noneT1   7m17.494s
virtiofsdefaultT1      3m43.326s

So the winner there by far is the 'virtiofsdefaultT1' - that's
the default virtiofs settings, but with --thread-pool-size=1 - so
yes, it gives a small benefit.
But interestingly, the cache=none virtiofs performance is pretty bad,
while thread-pool-size=1 on that makes a BIG improvement.


virtiofsdefault:
  ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux
  ./x86_64-softmmu/qemu-system-x86_64 -M pc,memory-backend=mem,accel=kvm -smp 8 -cpu host -m 32G,maxmem=64G,slots=1 -object memory-backend-memfd,id=mem,size=32G,share=on -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -chardev socket,id=char0,path=/tmp/vhostqemu -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=kernel
  mount -t virtiofs kernel /mnt

9pdefault
  ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs local,path=/dev/shm/linux,mount_tag=kernel,security_model=passthrough
  mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L

virtiofscache=none
  ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux -o cache=none
  ./x86_64-softmmu/qemu-system-x86_64 -M pc,memory-backend=mem,accel=kvm -smp 8 -cpu host -m 32G,maxmem=64G,slots=1 -object memory-backend-memfd,id=mem,size=32G,share=on -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -chardev socket,id=char0,path=/tmp/vhostqemu -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=kernel
  mount -t virtiofs kernel /mnt

9pmmappass
  ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs local,path=/dev/shm/linux,mount_tag=kernel,security_model=passthrough
  mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L,cache=mmap

9pmbigmsize
   ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs local,path=/dev/shm/linux,mount_tag=kernel,security_model=passthrough
   mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L,cache=mmap,msize=1048576

9pmsecnone
   ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs local,path=/dev/shm/linux,mount_tag=kernel,security_model=none
   mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L

virtiofscache=noneT1
   ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux -o cache=none --thread-pool-size=1
   mount -t virtiofs kernel /mnt

virtiofsdefaultT1
   ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux --thread-pool-size=1
    ./x86_64-softmmu/qemu-system-x86_64 -M pc,memory-backend=mem,accel=kvm -smp 8 -cpu host -m 32G,maxmem=64G,slots=1 -object memory-backend-memfd,id=mem,size=32G,share=on -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -chardev socket,id=char0,path=/tmp/vhostqemu -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=kernel
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK





* Re: tools/virtiofs: Multi threading seems to hurt performance
  2020-09-21 13:39     ` [Virtio-fs] " Vivek Goyal
@ 2020-09-21 16:57       ` Stefan Hajnoczi
  -1 siblings, 0 replies; 107+ messages in thread
From: Stefan Hajnoczi @ 2020-09-21 16:57 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs-list, qemu-devel, Dr. David Alan Gilbert


On Mon, Sep 21, 2020 at 09:39:44AM -0400, Vivek Goyal wrote:
> On Mon, Sep 21, 2020 at 09:39:23AM +0100, Stefan Hajnoczi wrote:
> > On Fri, Sep 18, 2020 at 05:34:36PM -0400, Vivek Goyal wrote:
> > > And here are the comparison results. To me it seems that by default
> > > we should switch to 1 thread (till we can figure out how to make
> > > multi-thread performance better even when a single process is doing
> > > I/O in the client).
> > 
> > Let's understand the reason before making changes.
> > 
> > Questions:
> >  * Is "1-thread" --thread-pool-size=1?
> 
> Yes.

Okay, I wanted to make sure 1-thread is still going through the glib
thread pool. So it's the same code path regardless of the
--thread-pool-size= value. This suggests the performance issue is
related to timing side-effects like lock contention, thread scheduling,
etc.
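
Put differently, assuming the dispatch goes through the glib API, even
--thread-pool-size=1 still means a per-request handoff from the queue
thread to a pool worker. A rough standalone sketch (illustrative names,
not the actual fv_queue_thread code):

  #include <glib.h>

  static void handle_request(gpointer req, gpointer user_data)
  {
      /* Runs on the (single) pool worker thread, not on the thread
       * that pushed the request. */
  }

  int main(void)
  {
      /* max_threads == 1 still means: lock the pool's queue, push the
       * item, wake the worker and wait for it to pick the request up. */
      GThreadPool *pool = g_thread_pool_new(handle_request, NULL,
                                            1 /* --thread-pool-size=1 */,
                                            TRUE /* exclusive */, NULL);

      g_thread_pool_push(pool, GINT_TO_POINTER(1), NULL);

      g_thread_pool_free(pool, FALSE, TRUE);
      return 0;
  }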

> >  * How do the kvm_stat vmexit counters compare?
> 
> > This should be the same, shouldn't it? Changing the number of threads
> > serving requests should not change the number of vmexits.

There is batching at the virtio and eventfd levels. I'm not sure if it's
coming into play here but you would see it by comparing vmexits and
eventfd reads. Having more threads can increase the number of
notifications and completion interrupts, which can make overall
performance worse in some cases.
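
The eventfd side of that batching is easy to see in isolation: kicks that
arrive before the consumer reads are coalesced into a single wakeup. A
standalone illustration, nothing virtiofs-specific:

  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/eventfd.h>

  int main(void)
  {
      int efd = eventfd(0, 0);
      uint64_t one = 1, n = 0;

      /* Three "kicks" arrive before the consumer gets to run... */
      write(efd, &one, sizeof(one));
      write(efd, &one, sizeof(one));
      write(efd, &one, sizeof(one));

      /* ...and are coalesced into one read of the accumulated counter. */
      read(efd, &n, sizeof(n));
      printf("handled %llu kicks with a single wakeup\n",
             (unsigned long long)n);

      close(efd);
      return 0;
  }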

> >  * How does host mpstat -P ALL compare?
> 
> Never used mpstat. Will try running it and see if I can get something
> meaningful.

Tools like top, vmstat, etc can give similar information. I'm wondering
what the host CPU utilization (guest/sys/user) looks like.

> But I suspect it has to do with the thread pool implementation and possibly
> extra cacheline bouncing.

I think perf can record cacheline bounces if you want to check.

Stefan



* Re: tools/virtiofs: Multi threading seems to hurt performance
  2020-09-18 21:34 ` [Virtio-fs] " Vivek Goyal
@ 2020-09-21 20:16   ` Vivek Goyal
  -1 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-21 20:16 UTC (permalink / raw)
  To: virtio-fs-list, qemu-devel
  Cc: Dr. David Alan Gilbert, Stefan Hajnoczi, Miklos Szeredi

On Fri, Sep 18, 2020 at 05:34:36PM -0400, Vivek Goyal wrote:
> Hi All,
> 
> virtiofsd default thread pool size is 64. To me it feels that in most of
> the cases thread pool size 1 performs better than thread pool size 64.
> 
> I ran virtiofs-tests.
> 
> https://github.com/rhvgoyal/virtiofs-tests

I spent more time debugging this. The first thing I noticed is that we
are using an "exclusive" glib thread pool.

https://developer.gnome.org/glib/stable/glib-Thread-Pools.html#g-thread-pool-new

This seems to run a pre-determined number of threads dedicated to that
thread pool. A little instrumentation of the code revealed that every new
request gets assigned to a new thread (despite the fact that the previous
thread has finished its job). So internally there might be some kind of
round-robin policy to choose the next thread for running the job.

I decided to switch to a "shared" pool instead, which seems to spin
up new threads only when there is enough work. Threads can also be shared
between pools.

And it looks like the test results are way better with "shared" pools. So
maybe we should switch to a shared pool by default (until somebody shows
in which cases exclusive pools are better).
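
As far as glib is concerned, the difference between the two modes is just
the "exclusive" boolean passed to g_thread_pool_new(). Here is a minimal
sketch of the two modes (not the virtiofsd code, just an illustration;
handle_request is a placeholder):

  #include <glib.h>

  /* Placeholder worker: stands in for the real request handler. */
  static void handle_request(gpointer data, gpointer user_data)
  {
      (void)data;
      (void)user_data;
  }

  int main(void)
  {
      /* exclusive=TRUE: all 64 threads are started immediately and stay
       * dedicated to this pool until it is freed. */
      GThreadPool *epool = g_thread_pool_new(handle_request, NULL, 64,
                                             TRUE, NULL);

      /* exclusive=FALSE: threads are created when needed, and idle
       * threads are shared with other non-exclusive pools. */
      GThreadPool *spool = g_thread_pool_new(handle_request, NULL, 64,
                                             FALSE, NULL);

      g_thread_pool_push(spool, GINT_TO_POINTER(1), NULL);

      /* immediate=FALSE, wait=TRUE: finish queued work, then free. */
      g_thread_pool_free(epool, FALSE, TRUE);
      g_thread_pool_free(spool, FALSE, TRUE);
      return 0;
  }

(Builds with: gcc pool.c $(pkg-config --cflags --libs glib-2.0).)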

The second thought that came to mind was the impact of NUMA. What if
the qemu and virtiofsd processes/threads are running on separate NUMA
nodes? That should increase memory access latency and overhead. So I
used "numactl --cpubind=0" to bind both qemu and virtiofsd to node 0.
My machine has two NUMA nodes (each with 32 logical processors).
Keeping both qemu and virtiofsd on the same node improves throughput
further.
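
Concretely that just means prefixing both commands with the same node
binding, for example (illustrative only; substitute whatever virtiofsd
and qemu command lines you are already using):

  numactl --cpubind=0 ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux -o cache=none
  numactl --cpubind=0 ./x86_64-softmmu/qemu-system-x86_64 ... -chardev socket,id=char0,path=/tmp/vhostqemu -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=kernel

(Adding --membind=0 as well would also pin the memory allocations to
node 0, if one wants to be stricter about it.)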

So here are the results.

vtfs-none-epool --> cache=none, exclusive thread pool.
vtfs-none-spool --> cache=none, shared thread pool.
vtfs-none-spool-numa --> cache=none, shared thread pool, same numa node


NAME                    WORKLOAD                Bandwidth       IOPS            
vtfs-none-epool         seqread-psync           36(MiB/s)       9392            
vtfs-none-spool         seqread-psync           68(MiB/s)       17k             
vtfs-none-spool-numa    seqread-psync           73(MiB/s)       18k             

vtfs-none-epool         seqread-psync-multi     210(MiB/s)      52k             
vtfs-none-spool         seqread-psync-multi     260(MiB/s)      65k             
vtfs-none-spool-numa    seqread-psync-multi     309(MiB/s)      77k             

vtfs-none-epool         seqread-libaio          286(MiB/s)      71k             
vtfs-none-spool         seqread-libaio          328(MiB/s)      82k             
vtfs-none-spool-numa    seqread-libaio          332(MiB/s)      83k             

vtfs-none-epool         seqread-libaio-multi    201(MiB/s)      50k             
vtfs-none-spool         seqread-libaio-multi    254(MiB/s)      63k             
vtfs-none-spool-numa    seqread-libaio-multi    276(MiB/s)      69k             

vtfs-none-epool         randread-psync          40(MiB/s)       10k             
vtfs-none-spool         randread-psync          64(MiB/s)       16k             
vtfs-none-spool-numa    randread-psync          72(MiB/s)       18k             

vtfs-none-epool         randread-psync-multi    211(MiB/s)      52k             
vtfs-none-spool         randread-psync-multi    252(MiB/s)      63k             
vtfs-none-spool-numa    randread-psync-multi    297(MiB/s)      74k             

vtfs-none-epool         randread-libaio         313(MiB/s)      78k             
vtfs-none-spool         randread-libaio         320(MiB/s)      80k             
vtfs-none-spool-numa    randread-libaio         330(MiB/s)      82k             

vtfs-none-epool         randread-libaio-multi   257(MiB/s)      64k             
vtfs-none-spool         randread-libaio-multi   274(MiB/s)      68k             
vtfs-none-spool-numa    randread-libaio-multi   319(MiB/s)      79k             

vtfs-none-epool         seqwrite-psync          34(MiB/s)       8926            
vtfs-none-spool         seqwrite-psync          55(MiB/s)       13k             
vtfs-none-spool-numa    seqwrite-psync          66(MiB/s)       16k             

vtfs-none-epool         seqwrite-psync-multi    196(MiB/s)      49k             
vtfs-none-spool         seqwrite-psync-multi    225(MiB/s)      56k             
vtfs-none-spool-numa    seqwrite-psync-multi    270(MiB/s)      67k             

vtfs-none-epool         seqwrite-libaio         257(MiB/s)      64k             
vtfs-none-spool         seqwrite-libaio         304(MiB/s)      76k             
vtfs-none-spool-numa    seqwrite-libaio         267(MiB/s)      66k             

vtfs-none-epool         seqwrite-libaio-multi   312(MiB/s)      78k             
vtfs-none-spool         seqwrite-libaio-multi   366(MiB/s)      91k             
vtfs-none-spool-numa    seqwrite-libaio-multi   381(MiB/s)      95k             

vtfs-none-epool         randwrite-psync         38(MiB/s)       9745            
vtfs-none-spool         randwrite-psync         55(MiB/s)       13k             
vtfs-none-spool-numa    randwrite-psync         67(MiB/s)       16k             

vtfs-none-epool         randwrite-psync-multi   186(MiB/s)      46k             
vtfs-none-spool         randwrite-psync-multi   240(MiB/s)      60k             
vtfs-none-spool-numa    randwrite-psync-multi   271(MiB/s)      67k             

vtfs-none-epool         randwrite-libaio        224(MiB/s)      56k             
vtfs-none-spool         randwrite-libaio        296(MiB/s)      74k             
vtfs-none-spool-numa    randwrite-libaio        290(MiB/s)      72k             

vtfs-none-epool         randwrite-libaio-multi  300(MiB/s)      75k             
vtfs-none-spool         randwrite-libaio-multi  350(MiB/s)      87k             
vtfs-none-spool-numa    randwrite-libaio-multi  383(MiB/s)      95k             

Thanks
Vivek



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: tools/virtiofs: Multi threading seems to hurt performance
  2020-09-21 15:32   ` [Virtio-fs] " Dr. David Alan Gilbert
@ 2020-09-22 10:25     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 107+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-22 10:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: jose.carlos.venegas.munoz, qemu-devel, cdupontd, virtio-fs-list,
	Stefan Hajnoczi, archana.m.shinde

* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> Hi,
>   I've been doing some of my own perf tests and I think I agree
> about the thread pool size;  my test is a kernel build
> and I've tried a bunch of different options.
> 
> My config:
>   Host: 16 core AMD EPYC (32 thread), 128G RAM,
>      5.9.0-rc4 kernel, rhel 8.2ish userspace.
>   5.1.0 qemu/virtiofsd built from git.
>   Guest: Fedora 32 from cloud image with just enough extra installed for
> a kernel build.
> 
>   git cloned and checked out v5.8 of Linux into /dev/shm/linux on the host
> fresh before each test.  Then log into the guest, make defconfig,
> time make -j 16 bzImage,  make clean; time make -j 16 bzImage 
> The numbers below are the 'real' time in the guest from the initial make
> (the subsequent makes don't vary much)
> 
> Below are the details of what each of these means, but here are the
> numbers first
> 
> virtiofsdefault        4m0.978s
> 9pdefault              9m41.660s
> virtiofscache=none    10m29.700s
> 9pmmappass             9m30.047s
> 9pmbigmsize           12m4.208s
> 9pmsecnone             9m21.363s
> virtiofscache=noneT1   7m17.494s
> virtiofsdefaultT1      3m43.326s
> 
> So the winner there by far is the 'virtiofsdefaultT1' - that's
> the default virtiofs settings, but with --thread-pool-size=1 - so
> yes it gives a small benefit.
> But interestingly the cache=none virtiofs performance is pretty bad,
> but thread-pool-size=1 on that makes a BIG improvement.

Here are the fio runs that Vivek asked me to run in my same environment
(there are some 0s in some of the mmap cases, and I've not investigated
why yet). virtiofs is looking good here in, I think, all of the cases;
there's some division over which config is better; cache=none
seems faster in some cases, which surprises me.

Dave


NAME                    WORKLOAD                Bandwidth       IOPS            
9pbigmsize              seqread-psync           108(MiB/s)      27k             
9pdefault               seqread-psync           105(MiB/s)      26k             
9pmmappass              seqread-psync           107(MiB/s)      26k             
9pmsecnone              seqread-psync           107(MiB/s)      26k             
virtiofscachenoneT1     seqread-psync           135(MiB/s)      33k             
virtiofscachenone       seqread-psync           115(MiB/s)      28k             
virtiofsdefaultT1       seqread-psync           2465(MiB/s)     616k            
virtiofsdefault         seqread-psync           2468(MiB/s)     617k            

9pbigmsize              seqread-psync-multi     357(MiB/s)      89k             
9pdefault               seqread-psync-multi     358(MiB/s)      89k             
9pmmappass              seqread-psync-multi     347(MiB/s)      86k             
9pmsecnone              seqread-psync-multi     364(MiB/s)      91k             
virtiofscachenoneT1     seqread-psync-multi     479(MiB/s)      119k            
virtiofscachenone       seqread-psync-multi     385(MiB/s)      96k             
virtiofsdefaultT1       seqread-psync-multi     5916(MiB/s)     1479k           
virtiofsdefault         seqread-psync-multi     8771(MiB/s)     2192k           

9pbigmsize              seqread-mmap            111(MiB/s)      27k             
9pdefault               seqread-mmap            101(MiB/s)      25k             
9pmmappass              seqread-mmap            114(MiB/s)      28k             
9pmsecnone              seqread-mmap            107(MiB/s)      26k             
virtiofscachenoneT1     seqread-mmap            0(KiB/s)        0               
virtiofscachenone       seqread-mmap            0(KiB/s)        0               
virtiofsdefaultT1       seqread-mmap            2896(MiB/s)     724k            
virtiofsdefault         seqread-mmap            2856(MiB/s)     714k            

9pbigmsize              seqread-mmap-multi      364(MiB/s)      91k             
9pdefault               seqread-mmap-multi      348(MiB/s)      87k             
9pmmappass              seqread-mmap-multi      354(MiB/s)      88k             
9pmsecnone              seqread-mmap-multi      340(MiB/s)      85k             
virtiofscachenoneT1     seqread-mmap-multi      0(KiB/s)        0               
virtiofscachenone       seqread-mmap-multi      0(KiB/s)        0               
virtiofsdefaultT1       seqread-mmap-multi      6057(MiB/s)     1514k           
virtiofsdefault         seqread-mmap-multi      9585(MiB/s)     2396k           

9pbigmsize              seqread-libaio          109(MiB/s)      27k             
9pdefault               seqread-libaio          103(MiB/s)      25k             
9pmmappass              seqread-libaio          107(MiB/s)      26k             
9pmsecnone              seqread-libaio          107(MiB/s)      26k             
virtiofscachenoneT1     seqread-libaio          671(MiB/s)      167k            
virtiofscachenone       seqread-libaio          538(MiB/s)      134k            
virtiofsdefaultT1       seqread-libaio          187(MiB/s)      46k             
virtiofsdefault         seqread-libaio          541(MiB/s)      135k            

9pbigmsize              seqread-libaio-multi    354(MiB/s)      88k             
9pdefault               seqread-libaio-multi    360(MiB/s)      90k             
9pmmappass              seqread-libaio-multi    356(MiB/s)      89k             
9pmsecnone              seqread-libaio-multi    344(MiB/s)      86k             
virtiofscachenoneT1     seqread-libaio-multi    488(MiB/s)      122k            
virtiofscachenone       seqread-libaio-multi    380(MiB/s)      95k             
virtiofsdefaultT1       seqread-libaio-multi    5577(MiB/s)     1394k           
virtiofsdefault         seqread-libaio-multi    5359(MiB/s)     1339k           

9pbigmsize              randread-psync          106(MiB/s)      26k             
9pdefault               randread-psync          106(MiB/s)      26k             
9pmmappass              randread-psync          120(MiB/s)      30k             
9pmsecnone              randread-psync          105(MiB/s)      26k             
virtiofscachenoneT1     randread-psync          154(MiB/s)      38k             
virtiofscachenone       randread-psync          134(MiB/s)      33k             
virtiofsdefaultT1       randread-psync          129(MiB/s)      32k             
virtiofsdefault         randread-psync          129(MiB/s)      32k             

9pbigmsize              randread-psync-multi    349(MiB/s)      87k             
9pdefault               randread-psync-multi    354(MiB/s)      88k             
9pmmappass              randread-psync-multi    360(MiB/s)      90k             
9pmsecnone              randread-psync-multi    352(MiB/s)      88k             
virtiofscachenoneT1     randread-psync-multi    449(MiB/s)      112k            
virtiofscachenone       randread-psync-multi    383(MiB/s)      95k             
virtiofsdefaultT1       randread-psync-multi    435(MiB/s)      108k            
virtiofsdefault         randread-psync-multi    368(MiB/s)      92k             

9pbigmsize              randread-mmap           100(MiB/s)      25k             
9pdefault               randread-mmap           89(MiB/s)       22k             
9pmmappass              randread-mmap           87(MiB/s)       21k             
9pmsecnone              randread-mmap           92(MiB/s)       23k             
virtiofscachenoneT1     randread-mmap           0(KiB/s)        0               
virtiofscachenone       randread-mmap           0(KiB/s)        0               
virtiofsdefaultT1       randread-mmap           111(MiB/s)      27k             
virtiofsdefault         randread-mmap           101(MiB/s)      25k             

9pbigmsize              randread-mmap-multi     335(MiB/s)      83k             
9pdefault               randread-mmap-multi     318(MiB/s)      79k             
9pmmappass              randread-mmap-multi     335(MiB/s)      83k             
9pmsecnone              randread-mmap-multi     323(MiB/s)      80k             
virtiofscachenoneT1     randread-mmap-multi     0(KiB/s)        0               
virtiofscachenone       randread-mmap-multi     0(KiB/s)        0               
virtiofsdefaultT1       randread-mmap-multi     422(MiB/s)      105k            
virtiofsdefault         randread-mmap-multi     345(MiB/s)      86k             

9pbigmsize              randread-libaio         84(MiB/s)       21k             
9pdefault               randread-libaio         89(MiB/s)       22k             
9pmmappass              randread-libaio         87(MiB/s)       21k             
9pmsecnone              randread-libaio         82(MiB/s)       20k             
virtiofscachenoneT1     randread-libaio         641(MiB/s)      160k            
virtiofscachenone       randread-libaio         527(MiB/s)      131k            
virtiofsdefaultT1       randread-libaio         205(MiB/s)      51k             
virtiofsdefault         randread-libaio         536(MiB/s)      134k            

9pbigmsize              randread-libaio-multi   265(MiB/s)      66k             
9pdefault               randread-libaio-multi   267(MiB/s)      66k             
9pmmappass              randread-libaio-multi   266(MiB/s)      66k             
9pmsecnone              randread-libaio-multi   269(MiB/s)      67k             
virtiofscachenoneT1     randread-libaio-multi   615(MiB/s)      153k            
virtiofscachenone       randread-libaio-multi   542(MiB/s)      135k            
virtiofsdefaultT1       randread-libaio-multi   595(MiB/s)      148k            
virtiofsdefault         randread-libaio-multi   552(MiB/s)      138k            

9pbigmsize              seqwrite-psync          106(MiB/s)      26k             
9pdefault               seqwrite-psync          106(MiB/s)      26k             
9pmmappass              seqwrite-psync          107(MiB/s)      26k             
9pmsecnone              seqwrite-psync          107(MiB/s)      26k             
virtiofscachenoneT1     seqwrite-psync          136(MiB/s)      34k             
virtiofscachenone       seqwrite-psync          112(MiB/s)      28k             
virtiofsdefaultT1       seqwrite-psync          132(MiB/s)      33k             
virtiofsdefault         seqwrite-psync          109(MiB/s)      27k             

9pbigmsize              seqwrite-psync-multi    353(MiB/s)      88k             
9pdefault               seqwrite-psync-multi    364(MiB/s)      91k             
9pmmappass              seqwrite-psync-multi    345(MiB/s)      86k             
9pmsecnone              seqwrite-psync-multi    350(MiB/s)      87k             
virtiofscachenoneT1     seqwrite-psync-multi    470(MiB/s)      117k            
virtiofscachenone       seqwrite-psync-multi    374(MiB/s)      93k             
virtiofsdefaultT1       seqwrite-psync-multi    470(MiB/s)      117k            
virtiofsdefault         seqwrite-psync-multi    373(MiB/s)      93k             

9pbigmsize              seqwrite-mmap           195(MiB/s)      48k             
9pdefault               seqwrite-mmap           0(KiB/s)        0               
9pmmappass              seqwrite-mmap           196(MiB/s)      49k             
9pmsecnone              seqwrite-mmap           0(KiB/s)        0               
virtiofscachenoneT1     seqwrite-mmap           0(KiB/s)        0               
virtiofscachenone       seqwrite-mmap           0(KiB/s)        0               
virtiofsdefaultT1       seqwrite-mmap           603(MiB/s)      150k            
virtiofsdefault         seqwrite-mmap           629(MiB/s)      157k            

9pbigmsize              seqwrite-mmap-multi     247(MiB/s)      61k             
9pdefault               seqwrite-mmap-multi     0(KiB/s)        0               
9pmmappass              seqwrite-mmap-multi     246(MiB/s)      61k             
9pmsecnone              seqwrite-mmap-multi     0(KiB/s)        0               
virtiofscachenoneT1     seqwrite-mmap-multi     0(KiB/s)        0               
virtiofscachenone       seqwrite-mmap-multi     0(KiB/s)        0               
virtiofsdefaultT1       seqwrite-mmap-multi     1787(MiB/s)     446k            
virtiofsdefault         seqwrite-mmap-multi     1692(MiB/s)     423k            

9pbigmsize              seqwrite-libaio         107(MiB/s)      26k             
9pdefault               seqwrite-libaio         107(MiB/s)      26k             
9pmmappass              seqwrite-libaio         106(MiB/s)      26k             
9pmsecnone              seqwrite-libaio         108(MiB/s)      27k             
virtiofscachenoneT1     seqwrite-libaio         595(MiB/s)      148k            
virtiofscachenone       seqwrite-libaio         524(MiB/s)      131k            
virtiofsdefaultT1       seqwrite-libaio         575(MiB/s)      143k            
virtiofsdefault         seqwrite-libaio         538(MiB/s)      134k            

9pbigmsize              seqwrite-libaio-multi   355(MiB/s)      88k             
9pdefault               seqwrite-libaio-multi   341(MiB/s)      85k             
9pmmappass              seqwrite-libaio-multi   354(MiB/s)      88k             
9pmsecnone              seqwrite-libaio-multi   350(MiB/s)      87k             
virtiofscachenoneT1     seqwrite-libaio-multi   609(MiB/s)      152k            
virtiofscachenone       seqwrite-libaio-multi   536(MiB/s)      134k            
virtiofsdefaultT1       seqwrite-libaio-multi   609(MiB/s)      152k            
virtiofsdefault         seqwrite-libaio-multi   538(MiB/s)      134k            

9pbigmsize              randwrite-psync         104(MiB/s)      26k             
9pdefault               randwrite-psync         106(MiB/s)      26k             
9pmmappass              randwrite-psync         105(MiB/s)      26k             
9pmsecnone              randwrite-psync         103(MiB/s)      25k             
virtiofscachenoneT1     randwrite-psync         125(MiB/s)      31k             
virtiofscachenone       randwrite-psync         110(MiB/s)      27k             
virtiofsdefaultT1       randwrite-psync         129(MiB/s)      32k             
virtiofsdefault         randwrite-psync         112(MiB/s)      28k             

9pbigmsize              randwrite-psync-multi   355(MiB/s)      88k             
9pdefault               randwrite-psync-multi   339(MiB/s)      84k             
9pmmappass              randwrite-psync-multi   343(MiB/s)      85k             
9pmsecnone              randwrite-psync-multi   344(MiB/s)      86k             
virtiofscachenoneT1     randwrite-psync-multi   461(MiB/s)      115k            
virtiofscachenone       randwrite-psync-multi   370(MiB/s)      92k             
virtiofsdefaultT1       randwrite-psync-multi   449(MiB/s)      112k            
virtiofsdefault         randwrite-psync-multi   364(MiB/s)      91k             

9pbigmsize              randwrite-mmap          98(MiB/s)       24k             
9pdefault               randwrite-mmap          0(KiB/s)        0               
9pmmappass              randwrite-mmap          97(MiB/s)       24k             
9pmsecnone              randwrite-mmap          0(KiB/s)        0               
virtiofscachenoneT1     randwrite-mmap          0(KiB/s)        0               
virtiofscachenone       randwrite-mmap          0(KiB/s)        0               
virtiofsdefaultT1       randwrite-mmap          102(MiB/s)      25k             
virtiofsdefault         randwrite-mmap          92(MiB/s)       23k             

9pbigmsize              randwrite-mmap-multi    246(MiB/s)      61k             
9pdefault               randwrite-mmap-multi    0(KiB/s)        0               
9pmmappass              randwrite-mmap-multi    239(MiB/s)      59k             
9pmsecnone              randwrite-mmap-multi    0(KiB/s)        0               
virtiofscachenoneT1     randwrite-mmap-multi    0(KiB/s)        0               
virtiofscachenone       randwrite-mmap-multi    0(KiB/s)        0               
virtiofsdefaultT1       randwrite-mmap-multi    279(MiB/s)      69k             
virtiofsdefault         randwrite-mmap-multi    225(MiB/s)      56k             

9pbigmsize              randwrite-libaio        110(MiB/s)      27k             
9pdefault               randwrite-libaio        111(MiB/s)      27k             
9pmmappass              randwrite-libaio        103(MiB/s)      25k             
9pmsecnone              randwrite-libaio        102(MiB/s)      25k             
virtiofscachenoneT1     randwrite-libaio        601(MiB/s)      150k            
virtiofscachenone       randwrite-libaio        525(MiB/s)      131k            
virtiofsdefaultT1       randwrite-libaio        618(MiB/s)      154k            
virtiofsdefault         randwrite-libaio        527(MiB/s)      131k            

9pbigmsize              randwrite-libaio-multi  332(MiB/s)      83k             
9pdefault               randwrite-libaio-multi  343(MiB/s)      85k             
9pmmappass              randwrite-libaio-multi  350(MiB/s)      87k             
9pmsecnone              randwrite-libaio-multi  334(MiB/s)      83k             
virtiofscachenoneT1     randwrite-libaio-multi  611(MiB/s)      152k            
virtiofscachenone       randwrite-libaio-multi  533(MiB/s)      133k            
virtiofsdefaultT1       randwrite-libaio-multi  599(MiB/s)      149k            
virtiofsdefault         randwrite-libaio-multi  531(MiB/s)      132k            

> 
> virtiofsdefault:
>   ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux
>   ./x86_64-softmmu/qemu-system-x86_64 -M pc,memory-backend=mem,accel=kvm -smp 8 -cpu host -m 32G,maxmem=64G,slots=1 -object memory-backend-memfd,id=mem,size=32G,share=on -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -chardev socket,id=char0,path=/tmp/vhostqemu -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=kernel
>   mount -t virtiofs kernel /mnt
> 
> 9pdefault
>   ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs local,path=/dev/shm/linux,mount_tag=kernel,security_model=passthrough
>   mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L
> 
> virtiofscache=none
>   ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux -o cache=none
>   ./x86_64-softmmu/qemu-system-x86_64 -M pc,memory-backend=mem,accel=kvm -smp 8 -cpu host -m 32G,maxmem=64G,slots=1 -object memory-backend-memfd,id=mem,size=32G,share=on -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -chardev socket,id=char0,path=/tmp/vhostqemu -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=kernel
>   mount -t virtiofs kernel /mnt
> 
> 9pmmappass
>   ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs local,path=/dev/shm/linux,mount_tag=kernel,security_model=passthrough
>   mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L,cache=mmap
> 
> 9pmbigmsize
>    ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs local,path=/dev/shm/linux,mount_tag=kernel,security_model=passthrough
>    mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L,cache=mmap,msize=1048576
> 
> 9pmsecnone
>    ./x86_64-softmmu/qemu-system-x86_64 -M pc,accel=kvm -smp 8 -cpu host -m 32G -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -virtfs local,path=/dev/shm/linux,mount_tag=kernel,security_model=none
>    mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L
> 
> virtiofscache=noneT1
>    ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux -o cache=none --thread-pool-size=1
>    mount -t virtiofs kernel /mnt
> 
> virtiofsdefaultT1
>    ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/dev/shm/linux --thread-pool-size=1
>     ./x86_64-softmmu/qemu-system-x86_64 -M pc,memory-backend=mem,accel=kvm -smp 8 -cpu host -m 32G,maxmem=64G,slots=1 -object memory-backend-memfd,id=mem,size=32G,share=on -drive if=virtio,file=/home/images/f-32-kernel.qcow2 -nographic -chardev socket,id=char0,path=/tmp/vhostqemu -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=kernel
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: tools/virtiofs: Multi threading seems to hurt performance
  2020-09-21 20:16   ` [Virtio-fs] " Vivek Goyal
@ 2020-09-22 11:09     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 107+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-22 11:09 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs-list, qemu-devel, Stefan Hajnoczi, Miklos Szeredi

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Fri, Sep 18, 2020 at 05:34:36PM -0400, Vivek Goyal wrote:
> > Hi All,
> > 
> > virtiofsd default thread pool size is 64. To me it feels that in most of
> > the cases thread pool size 1 performs better than thread pool size 64.
> > 
> > I ran virtiofs-tests.
> > 
> > https://github.com/rhvgoyal/virtiofs-tests
> 
> I spent more time debugging this. The first thing I noticed is that we
> are using an "exclusive" glib thread pool.
> 
> https://developer.gnome.org/glib/stable/glib-Thread-Pools.html#g-thread-pool-new
> 
> This seems to run a pre-determined number of threads dedicated to that
> thread pool. A little instrumentation of the code revealed that every new
> request gets assigned to a new thread (despite the fact that the previous
> thread has finished its job). So internally there might be some kind of
> round-robin policy to choose the next thread for running the job.
> 
> I decided to switch to "shared" pool instead where it seemed to spin
> up new threads only if there is enough work. Also threads can be shared
> between pools.
> 
> And looks like testing results are way better with "shared" pools. So
> may be we should switch to shared pool by default. (Till somebody shows
> in what cases exclusive pools are better).
> 
> Second thought which came to mind was what's the impact of NUMA. What
> if qemu and virtiofsd process/threads are running on separate NUMA
> node. That should increase memory access latency and increased overhead.
> So I used "numactl --cpubind=0" to bind both qemu and virtiofsd to node
> 0. My machine seems to have two numa nodes. (Each node is having 32
> logical processors). Keeping both qemu and virtiofsd on same node
> improves throughput further.
> 
> So here are the results.
> 
> vtfs-none-epool --> cache=none, exclusive thread pool.
> vtfs-none-spool --> cache=none, shared thread pool.
> vtfs-none-spool-numa --> cache=none, shared thread pool, same numa node

Do you have the numbers for:
   epool
   epool thread-pool-size=1
   spool

?

Dave

> 
> NAME                    WORKLOAD                Bandwidth       IOPS            
> vtfs-none-epool         seqread-psync           36(MiB/s)       9392            
> vtfs-none-spool         seqread-psync           68(MiB/s)       17k             
> vtfs-none-spool-numa    seqread-psync           73(MiB/s)       18k             
> 
> vtfs-none-epool         seqread-psync-multi     210(MiB/s)      52k             
> vtfs-none-spool         seqread-psync-multi     260(MiB/s)      65k             
> vtfs-none-spool-numa    seqread-psync-multi     309(MiB/s)      77k             
> 
> vtfs-none-epool         seqread-libaio          286(MiB/s)      71k             
> vtfs-none-spool         seqread-libaio          328(MiB/s)      82k             
> vtfs-none-spool-numa    seqread-libaio          332(MiB/s)      83k             
> 
> vtfs-none-epool         seqread-libaio-multi    201(MiB/s)      50k             
> vtfs-none-spool         seqread-libaio-multi    254(MiB/s)      63k             
> vtfs-none-spool-numa    seqread-libaio-multi    276(MiB/s)      69k             
> 
> vtfs-none-epool         randread-psync          40(MiB/s)       10k             
> vtfs-none-spool         randread-psync          64(MiB/s)       16k             
> vtfs-none-spool-numa    randread-psync          72(MiB/s)       18k             
> 
> vtfs-none-epool         randread-psync-multi    211(MiB/s)      52k             
> vtfs-none-spool         randread-psync-multi    252(MiB/s)      63k             
> vtfs-none-spool-numa    randread-psync-multi    297(MiB/s)      74k             
> 
> vtfs-none-epool         randread-libaio         313(MiB/s)      78k             
> vtfs-none-spool         randread-libaio         320(MiB/s)      80k             
> vtfs-none-spool-numa    randread-libaio         330(MiB/s)      82k             
> 
> vtfs-none-epool         randread-libaio-multi   257(MiB/s)      64k             
> vtfs-none-spool         randread-libaio-multi   274(MiB/s)      68k             
> vtfs-none-spool-numa    randread-libaio-multi   319(MiB/s)      79k             
> 
> vtfs-none-epool         seqwrite-psync          34(MiB/s)       8926            
> vtfs-none-spool         seqwrite-psync          55(MiB/s)       13k             
> vtfs-none-spool-numa    seqwrite-psync          66(MiB/s)       16k             
> 
> vtfs-none-epool         seqwrite-psync-multi    196(MiB/s)      49k             
> vtfs-none-spool         seqwrite-psync-multi    225(MiB/s)      56k             
> vtfs-none-spool-numa    seqwrite-psync-multi    270(MiB/s)      67k             
> 
> vtfs-none-epool         seqwrite-libaio         257(MiB/s)      64k             
> vtfs-none-spool         seqwrite-libaio         304(MiB/s)      76k             
> vtfs-none-spool-numa    seqwrite-libaio         267(MiB/s)      66k             
> 
> vtfs-none-epool         seqwrite-libaio-multi   312(MiB/s)      78k             
> vtfs-none-spool         seqwrite-libaio-multi   366(MiB/s)      91k             
> vtfs-none-spool-numa    seqwrite-libaio-multi   381(MiB/s)      95k             
> 
> vtfs-none-epool         randwrite-psync         38(MiB/s)       9745            
> vtfs-none-spool         randwrite-psync         55(MiB/s)       13k             
> vtfs-none-spool-numa    randwrite-psync         67(MiB/s)       16k             
> 
> vtfs-none-epool         randwrite-psync-multi   186(MiB/s)      46k             
> vtfs-none-spool         randwrite-psync-multi   240(MiB/s)      60k             
> vtfs-none-spool-numa    randwrite-psync-multi   271(MiB/s)      67k             
> 
> vtfs-none-epool         randwrite-libaio        224(MiB/s)      56k             
> vtfs-none-spool         randwrite-libaio        296(MiB/s)      74k             
> vtfs-none-spool-numa    randwrite-libaio        290(MiB/s)      72k             
> 
> vtfs-none-epool         randwrite-libaio-multi  300(MiB/s)      75k             
> vtfs-none-spool         randwrite-libaio-multi  350(MiB/s)      87k             
> vtfs-none-spool-numa    randwrite-libaio-multi  383(MiB/s)      95k             
> 
> Thanks
> Vivek
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: tools/virtiofs: Multi threading seems to hurt performance
  2020-09-22 10:25     ` [Virtio-fs] " Dr. David Alan Gilbert
@ 2020-09-22 17:47       ` Vivek Goyal
  -1 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-22 17:47 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: jose.carlos.venegas.munoz, qemu-devel, cdupontd, virtio-fs-list,
	Stefan Hajnoczi, archana.m.shinde

On Tue, Sep 22, 2020 at 11:25:31AM +0100, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > Hi,
> >   I've been doing some of my own perf tests and I think I agree
> > about the thread pool size;  my test is a kernel build
> > and I've tried a bunch of different options.
> > 
> > My config:
> >   Host: 16 core AMD EPYC (32 thread), 128G RAM,
> >      5.9.0-rc4 kernel, rhel 8.2ish userspace.
> >   5.1.0 qemu/virtiofsd built from git.
> >   Guest: Fedora 32 from cloud image with just enough extra installed for
> > a kernel build.
> > 
> >   git cloned and checked out v5.8 of Linux into /dev/shm/linux on the host
> > fresh before each test.  Then log into the guest, make defconfig,
> > time make -j 16 bzImage,  make clean; time make -j 16 bzImage 
> > The numbers below are the 'real' time in the guest from the initial make
> > (the subsequent makes don't vary much)
> > 
> > Below are the details of what each of these means, but here are the
> > numbers first
> > 
> > virtiofsdefault        4m0.978s
> > 9pdefault              9m41.660s
> > virtiofscache=none    10m29.700s
> > 9pmmappass             9m30.047s
> > 9pmbigmsize           12m4.208s
> > 9pmsecnone             9m21.363s
> > virtiofscache=noneT1   7m17.494s
> > virtiofsdefaultT1      3m43.326s
> > 
> > So the winner there by far is the 'virtiofsdefaultT1' - that's
> > the default virtiofs settings, but with --thread-pool-size=1 - so
> > yes it gives a small benefit.
> > But interestingly the cache=none virtiofs performance is pretty bad,
> > but thread-pool-size=1 on that makes a BIG improvement.
> 
> Here are fio runs that Vivek asked me to run in my same environment
> (there are some 0's in some of the mmap cases, and I've not investigated
> why yet).

cache=none does not allow mmap in the case of virtiofs. That's why you
are seeing 0.
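
(If you want to double check from inside the guest, a quick test like the
one below should show mmap() being refused on a cache=none virtiofs mount.
This is only an illustrative sketch; /mnt/testfile is a hypothetical path
on the virtiofs mount.)

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* create a small file on the virtiofs mount and try a shared mapping */
        int fd = open("/mnt/testfile", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, 4096) < 0) {
            perror("open/ftruncate");
            return 1;
        }
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            printf("mmap failed: %s\n", strerror(errno));
        else
            printf("mmap worked\n");
        close(fd);
        return 0;
    }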

> virtiofs is looking good here in, I think, all of the cases;
> there's some division over which config; cache=none
> seems faster in some cases, which surprises me.

I know cache=none is faster in the case of write workloads. It forces
direct writes, where we don't call file_remove_privs(), while cache=auto
goes through file_remove_privs(), and that adds a GETXATTR request to
every WRITE request.

Vivek



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: tools/virtiofs: Multi threading seems to hurt performance
  2020-09-22 11:09     ` [Virtio-fs] " Dr. David Alan Gilbert
@ 2020-09-22 22:56       ` Vivek Goyal
  -1 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-22 22:56 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: virtio-fs-list, qemu-devel, Stefan Hajnoczi, Miklos Szeredi

On Tue, Sep 22, 2020 at 12:09:46PM +0100, Dr. David Alan Gilbert wrote:
> 
> Do you have the numbers for:
>    epool
>    epool thread-pool-size=1
>    spool

Hi David,

Ok, I re-ran my numbers again after upgrading to the latest qemu and also
upgrading the host kernel to the latest upstream. Apart from comparing epool,
spool and 1Thread, I also ran their NUMA variants. That is, I launched
qemu and virtiofsd on node 0 of the machine (numactl --cpunodebind=0).
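
(For reference, the CPU binding that "numactl --cpunodebind=0" applies
can also be done programmatically with libnuma. The sketch below assumes
libnuma is installed; it is not something qemu or virtiofsd do themselves,
it just shows what the binding means.)

    /* Roughly what "numactl --cpunodebind=0 <cmd>" sets up before exec'ing
     * the command: restrict the current task (and its children, via
     * inherited affinity) to the CPUs of NUMA node 0.
     * Build with: gcc bind0.c -lnuma */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        if (numa_run_on_node(0) != 0) {
            perror("numa_run_on_node");
            return 1;
        }
        printf("pinned to CPUs of node 0; exec the workload here\n");
        return 0;
    }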

Results are kind of mixed. Here are my takeaways.

- Running on same numa node improves performance overall for exclusive,
  shared and exclusive-1T mode.

- In general both shared pool and exclusive-1T mode seem to perform
  better than exclusive mode, except for the case of randwrite-libaio.
  In some cases (seqread-libaio, seqwrite-libaio, seqwrite-libaio-multi)
  exclusive pool performs better than exclusive-1T.

- Looks like in some cases exclusive-1T performs better than shared
  pool. (randwrite-libaio, randwrite-psync-multi, seqwrite-psync-multi,
  seqwrite-psync, seqread-libaio-multi, seqread-psync-multi)


Overall, I feel that both exclusive-1T and shared perform better than
the exclusive pool. Results between exclusive-1T and the shared pool are mixed.
It seems like in many cases exclusive-1T performs better. I would say
that moving to the "shared" pool seems like a reasonable option.

Thanks
Vivek

NAME                    WORKLOAD                Bandwidth       IOPS            
vtfs-none-epool         seqread-psync           38(MiB/s)       9967            
vtfs-none-epool-1T      seqread-psync           66(MiB/s)       16k             
vtfs-none-spool         seqread-psync           67(MiB/s)       16k             
vtfs-none-epool-numa    seqread-psync           48(MiB/s)       12k             
vtfs-none-epool-1T-numa seqread-psync           74(MiB/s)       18k             
vtfs-none-spool-numa    seqread-psync           74(MiB/s)       18k             

vtfs-none-epool         seqread-psync-multi     204(MiB/s)      51k             
vtfs-none-epool-1T      seqread-psync-multi     325(MiB/s)      81k             
vtfs-none-spool         seqread-psync-multi     271(MiB/s)      67k             
vtfs-none-epool-numa    seqread-psync-multi     253(MiB/s)      63k             
vtfs-none-epool-1T-numa seqread-psync-multi     349(MiB/s)      87k             
vtfs-none-spool-numa    seqread-psync-multi     301(MiB/s)      75k             

vtfs-none-epool         seqread-libaio          301(MiB/s)      75k             
vtfs-none-epool-1T      seqread-libaio          273(MiB/s)      68k             
vtfs-none-spool         seqread-libaio          334(MiB/s)      83k             
vtfs-none-epool-numa    seqread-libaio          315(MiB/s)      78k             
vtfs-none-epool-1T-numa seqread-libaio          326(MiB/s)      81k             
vtfs-none-spool-numa    seqread-libaio          335(MiB/s)      83k             

vtfs-none-epool         seqread-libaio-multi    202(MiB/s)      50k             
vtfs-none-epool-1T      seqread-libaio-multi    308(MiB/s)      77k             
vtfs-none-spool         seqread-libaio-multi    247(MiB/s)      61k             
vtfs-none-epool-numa    seqread-libaio-multi    238(MiB/s)      59k             
vtfs-none-epool-1T-numa seqread-libaio-multi    307(MiB/s)      76k             
vtfs-none-spool-numa    seqread-libaio-multi    269(MiB/s)      67k             

vtfs-none-epool         randread-psync          41(MiB/s)       10k             
vtfs-none-epool-1T      randread-psync          67(MiB/s)       16k             
vtfs-none-spool         randread-psync          64(MiB/s)       16k             
vtfs-none-epool-numa    randread-psync          48(MiB/s)       12k             
vtfs-none-epool-1T-numa randread-psync          73(MiB/s)       18k             
vtfs-none-spool-numa    randread-psync          72(MiB/s)       18k             

vtfs-none-epool         randread-psync-multi    207(MiB/s)      51k             
vtfs-none-epool-1T      randread-psync-multi    313(MiB/s)      78k             
vtfs-none-spool         randread-psync-multi    265(MiB/s)      66k             
vtfs-none-epool-numa    randread-psync-multi    253(MiB/s)      63k             
vtfs-none-epool-1T-numa randread-psync-multi    340(MiB/s)      85k             
vtfs-none-spool-numa    randread-psync-multi    305(MiB/s)      76k             

vtfs-none-epool         randread-libaio         305(MiB/s)      76k             
vtfs-none-epool-1T      randread-libaio         308(MiB/s)      77k             
vtfs-none-spool         randread-libaio         329(MiB/s)      82k             
vtfs-none-epool-numa    randread-libaio         310(MiB/s)      77k             
vtfs-none-epool-1T-numa randread-libaio         328(MiB/s)      82k             
vtfs-none-spool-numa    randread-libaio         339(MiB/s)      84k             

vtfs-none-epool         randread-libaio-multi   265(MiB/s)      66k             
vtfs-none-epool-1T      randread-libaio-multi   267(MiB/s)      66k             
vtfs-none-spool         randread-libaio-multi   269(MiB/s)      67k             
vtfs-none-epool-numa    randread-libaio-multi   314(MiB/s)      78k             
vtfs-none-epool-1T-numa randread-libaio-multi   319(MiB/s)      79k             
vtfs-none-spool-numa    randread-libaio-multi   318(MiB/s)      79k             

vtfs-none-epool         seqwrite-psync          36(MiB/s)       9224            
vtfs-none-epool-1T      seqwrite-psync          67(MiB/s)       16k             
vtfs-none-spool         seqwrite-psync          61(MiB/s)       15k             
vtfs-none-epool-numa    seqwrite-psync          44(MiB/s)       11k             
vtfs-none-epool-1T-numa seqwrite-psync          69(MiB/s)       17k             
vtfs-none-spool-numa    seqwrite-psync          68(MiB/s)       17k             

vtfs-none-epool         seqwrite-psync-multi    193(MiB/s)      48k             
vtfs-none-epool-1T      seqwrite-psync-multi    299(MiB/s)      74k             
vtfs-none-spool         seqwrite-psync-multi    240(MiB/s)      60k             
vtfs-none-epool-numa    seqwrite-psync-multi    233(MiB/s)      58k             
vtfs-none-epool-1T-numa seqwrite-psync-multi    358(MiB/s)      89k             
vtfs-none-spool-numa    seqwrite-psync-multi    285(MiB/s)      71k             

vtfs-none-epool         seqwrite-libaio         265(MiB/s)      66k             
vtfs-none-epool-1T      seqwrite-libaio         245(MiB/s)      61k             
vtfs-none-spool         seqwrite-libaio         312(MiB/s)      78k             
vtfs-none-epool-numa    seqwrite-libaio         295(MiB/s)      73k             
vtfs-none-epool-1T-numa seqwrite-libaio         282(MiB/s)      70k             
vtfs-none-spool-numa    seqwrite-libaio         297(MiB/s)      74k             

vtfs-none-epool         seqwrite-libaio-multi   313(MiB/s)      78k             
vtfs-none-epool-1T      seqwrite-libaio-multi   299(MiB/s)      74k             
vtfs-none-spool         seqwrite-libaio-multi   315(MiB/s)      78k             
vtfs-none-epool-numa    seqwrite-libaio-multi   318(MiB/s)      79k             
vtfs-none-epool-1T-numa seqwrite-libaio-multi   410(MiB/s)      102k            
vtfs-none-spool-numa    seqwrite-libaio-multi   378(MiB/s)      94k             

vtfs-none-epool         randwrite-psync         33(MiB/s)       8629            
vtfs-none-epool-1T      randwrite-psync         61(MiB/s)       15k             
vtfs-none-spool         randwrite-psync         63(MiB/s)       15k             
vtfs-none-epool-numa    randwrite-psync         49(MiB/s)       12k             
vtfs-none-epool-1T-numa randwrite-psync         68(MiB/s)       17k             
vtfs-none-spool-numa    randwrite-psync         66(MiB/s)       16k             

vtfs-none-epool         randwrite-psync-multi   186(MiB/s)      46k             
vtfs-none-epool-1T      randwrite-psync-multi   300(MiB/s)      75k             
vtfs-none-spool         randwrite-psync-multi   233(MiB/s)      58k             
vtfs-none-epool-numa    randwrite-psync-multi   235(MiB/s)      58k             
vtfs-none-epool-1T-numa randwrite-psync-multi   355(MiB/s)      88k             
vtfs-none-spool-numa    randwrite-psync-multi   266(MiB/s)      66k             

vtfs-none-epool         randwrite-libaio        289(MiB/s)      72k             
vtfs-none-epool-1T      randwrite-libaio        284(MiB/s)      71k             
vtfs-none-spool         randwrite-libaio        278(MiB/s)      69k             
vtfs-none-epool-numa    randwrite-libaio        292(MiB/s)      73k             
vtfs-none-epool-1T-numa randwrite-libaio        294(MiB/s)      73k             
vtfs-none-spool-numa    randwrite-libaio        290(MiB/s)      72k             

vtfs-none-epool         randwrite-libaio-multi  317(MiB/s)      79k             
vtfs-none-epool-1T      randwrite-libaio-multi  323(MiB/s)      80k             
vtfs-none-spool         randwrite-libaio-multi  330(MiB/s)      82k             
vtfs-none-epool-numa    randwrite-libaio-multi  315(MiB/s)      78k             
vtfs-none-epool-1T-numa randwrite-libaio-multi  409(MiB/s)      102k            
vtfs-none-spool-numa    randwrite-libaio-multi  384(MiB/s)      96k             



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] tools/virtiofs: Multi threading seems to hurt performance
  2020-09-18 21:34 ` [Virtio-fs] " Vivek Goyal
                   ` (4 preceding siblings ...)
  (?)
@ 2020-09-23 12:50 ` Chirantan Ekbote
  2020-09-23 12:59   ` Vivek Goyal
  2020-09-25 11:35   ` Dr. David Alan Gilbert
  -1 siblings, 2 replies; 107+ messages in thread
From: Chirantan Ekbote @ 2020-09-23 12:50 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs-list, qemu-devel

On Sat, Sep 19, 2020 at 6:36 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> Hi All,
>
> virtiofsd default thread pool size is 64. To me it feels that in most of
> the cases thread pool size 1 performs better than thread pool size 64.
>
> I ran virtiofs-tests.
>
> https://github.com/rhvgoyal/virtiofs-tests
>
> And here are the comparison results. To me it seems that by default
> we should switch to 1 thread (Till we can figure out how to make
> multi thread performance better even when single process is doing
> I/O in client).
>

FWIW, we've observed the same behavior in crosvm. Using a thread pool
for the virtiofs server consistently gave us worse performance than
using a single thread.

Chirantan


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] tools/virtiofs: Multi threading seems to hurt performance
  2020-09-23 12:50 ` Chirantan Ekbote
@ 2020-09-23 12:59   ` Vivek Goyal
  2020-09-25 11:35   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-23 12:59 UTC (permalink / raw)
  To: Chirantan Ekbote; +Cc: virtio-fs-list, qemu-devel

On Wed, Sep 23, 2020 at 09:50:59PM +0900, Chirantan Ekbote wrote:
> On Sat, Sep 19, 2020 at 6:36 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > Hi All,
> >
> > virtiofsd default thread pool size is 64. To me it feels that in most of
> > the cases thread pool size 1 performs better than thread pool size 64.
> >
> > I ran virtiofs-tests.
> >
> > https://github.com/rhvgoyal/virtiofs-tests
> >
> > And here are the comparison results. To me it seems that by default
> > we should switch to 1 thread (Till we can figure out how to make
> > multi thread performance better even when single process is doing
> > I/O in client).
> >
> 
> FWIW, we've observed the same behavior in crosvm. Using a thread pool
> for the virtiofs server consistently gave us worse performance than
> using a single thread.

Thanks for sharing this information, Chirantan. The shared pool seems to
perform better than the exclusive pool. Single thread vs shared pool is
sort of a mixed result, but it looks like one thread beats the shared pool
results in many of the tests.

Maybe we will have to switch to a single thread as the default at some point
if the shared pool does not live up to expectations.

Vivek


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: tools/virtiofs: Multi threading seems to hurt performance
  2020-09-22 17:47       ` [Virtio-fs] " Vivek Goyal
@ 2020-09-24 21:33         ` Venegas Munoz, Jose Carlos
  -1 siblings, 0 replies; 107+ messages in thread
From: Venegas Munoz, Jose Carlos @ 2020-09-24 21:33 UTC (permalink / raw)
  To: Vivek Goyal, Dr. David Alan Gilbert
  Cc: virtio-fs-list, Shinde, Archana M, qemu-devel, Stefan Hajnoczi, cdupontd

[-- Attachment #1: Type: text/plain, Size: 4115 bytes --]

Hi Folks,

Sorry for the delay in describing how to reproduce the `fio` data.

I have some code to automate testing for multiple kata configs and collect info like:
- Kata-env, kata configuration.toml, qemu command, virtiofsd command.

See: 
https://github.com/jcvenegas/mrunner/


Last time we agreed to narrow the cases and configs to compare virtiofs and 9pfs

The configs were the following:

- qemu + virtiofs (cache=auto, dax=0) a.k.a. `kata-qemu-virtiofs` WITHOUT xattr
- qemu + 9pfs a.k.a `kata-qemu`

Please take a look at the html and raw results I attach in this mail.

## Can I say that the current status is:
- As David's tests and Vivek's points show, for the fio workload you are using, it seems that the best candidate should be cache=none.
   - In the comparison I took cache=auto as Vivek suggested; this makes sense as it seems that will be the default for kata.
   - Even if cache=none works better for this case, can I assume that cache=auto dax=0 will be better than any 9pfs config (once we find the root cause)?

- Vivek is taking a look at mmap mode in 9pfs, to see how different it is from current virtiofs implementations. In 9pfs for kata, this is what we use by default.

## I'd like to identify what should be next in the debugging/testing:

- Should I try to narrow it down by only testing with qemu?
- Should I try first with a new patch you already have?
- Probably try with qemu without a static build?
- Do the same test with thread-pool-size=1?

Please let me know how I can help.

Cheers.

On 22/09/20 12:47, "Vivek Goyal" <vgoyal@redhat.com> wrote:

    On Tue, Sep 22, 2020 at 11:25:31AM +0100, Dr. David Alan Gilbert wrote:
    > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
    > > Hi,
    > >   I've been doing some of my own perf tests and I think I agree
    > > about the thread pool size;  my test is a kernel build
    > > and I've tried a bunch of different options.
    > > 
    > > My config:
    > >   Host: 16 core AMD EPYC (32 thread), 128G RAM,
    > >      5.9.0-rc4 kernel, rhel 8.2ish userspace.
    > >   5.1.0 qemu/virtiofsd built from git.
    > >   Guest: Fedora 32 from cloud image with just enough extra installed for
    > > a kernel build.
    > > 
    > >   git cloned and checked out v5.8 of Linux into /dev/shm/linux on the host
    > > fresh before each test.  Then log into the guest, make defconfig,
    > > time make -j 16 bzImage,  make clean; time make -j 16 bzImage 
    > > The numbers below are the 'real' time in the guest from the initial make
    > > (the subsequent makes don't vary much)
    > > 
    > > Below are the details of what each of these means, but here are the
    > > numbers first
    > > 
    > > virtiofsdefault        4m0.978s
    > > 9pdefault              9m41.660s
    > > virtiofscache=none    10m29.700s
    > > 9pmmappass             9m30.047s
    > > 9pmbigmsize           12m4.208s
    > > 9pmsecnone             9m21.363s
    > > virtiofscache=noneT1   7m17.494s
    > > virtiofsdefaultT1      3m43.326s
    > > 
    > > So the winner there by far is the 'virtiofsdefaultT1' - that's
    > > the default virtiofs settings, but with --thread-pool-size=1 - so
    > > yes it gives a small benefit.
    > > But interestingly the cache=none virtiofs performance is pretty bad,
    > > but thread-pool-size=1 on that makes a BIG improvement.
    > 
    > Here are fio runs that Vivek asked me to run in my same environment
    > (there are some 0's in some of the mmap cases, and I've not investigated
    > why yet).

    cache=none does not allow mmap in the case of virtiofs. That's why you
    are seeing 0.

    > virtiofs is looking good here in, I think, all of the cases;
    > there's some division over which config; cache=none
    > seems faster in some cases, which surprises me.

    I know cache=none is faster in the case of write workloads. It forces
    direct writes, where we don't call file_remove_privs(), while cache=auto
    goes through file_remove_privs(), and that adds a GETXATTR request to
    every WRITE request.

    Vivek



[-- Attachment #2: results.tar.gz --]
[-- Type: application/x-gzip, Size: 18156 bytes --]

[-- Attachment #3: vitiofs 9pfs fio comparsion.html --]
[-- Type: text/html, Size: 29758 bytes --]

^ permalink raw reply	[flat|nested] 107+ messages in thread

* virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-24 21:33         ` [Virtio-fs] " Venegas Munoz, Jose Carlos
@ 2020-09-24 22:10           ` Vivek Goyal
  -1 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-24 22:10 UTC (permalink / raw)
  To: Venegas Munoz, Jose Carlos
  Cc: qemu-devel, cdupontd, Dr. David Alan Gilbert, virtio-fs-list,
	Stefan Hajnoczi, Shinde, Archana M

On Thu, Sep 24, 2020 at 09:33:01PM +0000, Venegas Munoz, Jose Carlos wrote:
> Hi Folks,
> 
> Sorry for the delay in describing how to reproduce the `fio` data.
> 
> I have some code to automate testing for multiple kata configs and collect info like:
> - Kata-env, kata configuration.toml, qemu command, virtiofsd command.
> 
> See: 
> https://github.com/jcvenegas/mrunner/
> 
> 
> Last time we agreed to narrow the cases and configs to compare virtiofs and 9pfs
> 
> The configs were the following:
> 
> - qemu + virtiofs (cache=auto, dax=0) a.k.a. `kata-qemu-virtiofs` WITHOUT xattr
> - qemu + 9pfs a.k.a `kata-qemu`
> 
> Please take a look at the html and raw results I attach in this mail.

Hi Carlos,

So you are running the following test.

fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --output=/output/fio.txt

And the following are your results.

9p
--
READ: bw=211MiB/s (222MB/s), 211MiB/s-211MiB/s (222MB/s-222MB/s), io=3070MiB (3219MB), run=14532-14532msec

WRITE: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s), io=1026MiB (1076MB), run=14532-14532msec

virtiofs
--------
Run status group 0 (all jobs):
   READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s), io=3070MiB (3219MB), run=19321-19321msec
  WRITE: bw=53.1MiB/s (55.7MB/s), 53.1MiB/s-53.1MiB/s (55.7MB/s-55.7MB/s), io=1026MiB (1076MB), run=19321-19321msec

So it looks like you are getting better performance with 9p in this case.

Can you apply the "shared pool" patch to qemu's virtiofsd and re-run this
test to see if you get any better results?

In my testing, with cache=none, virtiofs performed better than 9p in 
all the fio jobs I was running. For the case of cache=auto  for virtiofs
(with xattr enabled), 9p performed better in certain write workloads. I
have identified the root cause of that problem and am working on
HANDLE_KILLPRIV_V2 patches to improve WRITE performance of virtiofs
with cache=auto and xattr enabled.

I will post my 9p and virtiofs comparison numbers next week. In the
meantime it will be great if you could apply the following qemu patch, rebuild
qemu and re-run the above test.

https://www.redhat.com/archives/virtio-fs/2020-September/msg00081.html

Also, what's the status of the file cache on the host in both cases? Are
you booting the host fresh for these tests so that the cache is cold on the
host, or is the cache warm?

Thanks
Vivek



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: virtiofs vs 9p performance
  2020-09-24 22:10           ` [Virtio-fs] " Vivek Goyal
@ 2020-09-25  8:06             ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2020-09-25  8:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: Vivek Goyal, Venegas Munoz, Jose Carlos, cdupontd,
	Dr. David Alan Gilbert, virtio-fs-list, Stefan Hajnoczi, Shinde,
	Archana M, Greg Kurz

On Freitag, 25. September 2020 00:10:23 CEST Vivek Goyal wrote:
> In my testing, with cache=none, virtiofs performed better than 9p in
> all the fio jobs I was running. For the case of cache=auto  for virtiofs
> (with xattr enabled), 9p performed better in certain write workloads. I
> have identified the root cause of that problem and am working on
> HANDLE_KILLPRIV_V2 patches to improve WRITE performance of virtiofs
> with cache=auto and xattr enabled.

Please note, when it comes to performance aspects, you should set a reasonably
high value for 'msize' on the 9p client side:
https://wiki.qemu.org/Documentation/9psetup#msize

I'm also working on performance optimizations for 9p, BTW. There is plenty of
headroom, to put it mildly. For QEMU 5.2 I started by addressing readdir
requests:
https://wiki.qemu.org/ChangeLog/5.2#9pfs

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] tools/virtiofs: Multi threading seems to hurt performance
  2020-09-23 12:50 ` Chirantan Ekbote
  2020-09-23 12:59   ` Vivek Goyal
@ 2020-09-25 11:35   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 107+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-25 11:35 UTC (permalink / raw)
  To: Chirantan Ekbote; +Cc: virtio-fs-list, qemu-devel, Vivek Goyal

* Chirantan Ekbote (chirantan@chromium.org) wrote:
> On Sat, Sep 19, 2020 at 6:36 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > Hi All,
> >
> > virtiofsd default thread pool size is 64. To me it feels that in most of
> > the cases thread pool size 1 performs better than thread pool size 64.
> >
> > I ran virtiofs-tests.
> >
> > https://github.com/rhvgoyal/virtiofs-tests
> >
> > And here are the comparision results. To me it seems that by default
> > we should switch to 1 thread (Till we can figure out how to make
> > multi thread performance better even when single process is doing
> > I/O in client).
> >
> 
> FWIW, we've observed the same behavior in crosvm. Using a thread pool
> for the virtiofs server consistently gave us worse performance than
> using a single thread.

Interesting; so it's not just us doing something silly!
It does feel like you *should* be able to get some benefit from multiple
threads, so I guess some more investigation is needed at some point.

Dave

> Chirantan
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: tools/virtiofs: Multi threading seems to hurt performance
  2020-09-22 17:47       ` [Virtio-fs] " Vivek Goyal
@ 2020-09-25 12:11         ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 107+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-25 12:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: jose.carlos.venegas.munoz, qemu-devel, cdupontd, virtio-fs-list,
	Stefan Hajnoczi, archana.m.shinde

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Tue, Sep 22, 2020 at 11:25:31AM +0100, Dr. David Alan Gilbert wrote:
> > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > > Hi,
> > >   I've been doing some of my own perf tests and I think I agree
> > > about the thread pool size;  my test is a kernel build
> > > and I've tried a bunch of different options.
> > > 
> > > My config:
> > >   Host: 16 core AMD EPYC (32 thread), 128G RAM,
> > >      5.9.0-rc4 kernel, rhel 8.2ish userspace.
> > >   5.1.0 qemu/virtiofsd built from git.
> > >   Guest: Fedora 32 from cloud image with just enough extra installed for
> > > a kernel build.
> > > 
> > >   git cloned and checkout v5.8 of Linux into /dev/shm/linux on the host
> > > fresh before each test.  Then log into the guest, make defconfig,
> > > time make -j 16 bzImage,  make clean; time make -j 16 bzImage 
> > > The numbers below are the 'real' time in the guest from the initial make
> > > (the subsequent makes dont vary much)
> > > 
> > > Below are the details of what each of these means, but here are the
> > > numbers first
> > > 
> > > virtiofsdefault        4m0.978s
> > > 9pdefault              9m41.660s
> > > virtiofscache=none    10m29.700s
> > > 9pmmappass             9m30.047s
> > > 9pmbigmsize           12m4.208s
> > > 9pmsecnone             9m21.363s
> > > virtiofscache=noneT1   7m17.494s
> > > virtiofsdefaultT1      3m43.326s
> > > 
> > > So the winner there by far is the 'virtiofsdefaultT1' - that's
> > > the default virtiofs settings, but with --thread-pool-size=1 - so
> > > yes it gives a small benefit.
> > > But interestingly the cache=none virtiofs performance is pretty bad,
> > > but thread-pool-size=1 on that makes a BIG improvement.
> > 
> > Here are fio runs that Vivek asked me to run in my same environment
> > (there are some 0's in some of the mmap cases, and I've not investigated
> > why yet).
> 
> cache=none does not allow mmap in the case of virtiofs. That's why you
> are seeing 0.
> 
> > virtiofs is looking good here in, I think, all of the cases;
> > there's some division over which config; cache=none
> > seems faster in some cases, which surprises me.
> 
> I know cache=none is faster for write workloads. It forces
> direct writes, where we don't call file_remove_privs(), while cache=auto
> goes through file_remove_privs(), and that adds a GETXATTR request to
> every WRITE request.

Can you point me to how cache=auto causes the file_remove_privs?

Dave

> Vivek
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-24 22:10           ` [Virtio-fs] " Vivek Goyal
@ 2020-09-25 12:41             ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 107+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-25 12:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Venegas Munoz, Jose Carlos, qemu-devel, cdupontd, virtio-fs-list,
	Stefan Hajnoczi, Shinde, Archana M

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Thu, Sep 24, 2020 at 09:33:01PM +0000, Venegas Munoz, Jose Carlos wrote:
> > Hi Folks,
> > 
> > Sorry for the delay about how to reproduce `fio` data.
> > 
> > I have some code to automate testing for multiple kata configs and collect info like:
> > - Kata-env, kata configuration.toml, qemu command, virtiofsd command.
> > 
> > See: 
> > https://github.com/jcvenegas/mrunner/
> > 
> > 
> > Last time we agreed to narrow the cases and configs to compare virtiofs and 9pfs
> > 
> > The configs were the following:
> > 
> > - qemu + virtiofs (cache=auto, dax=0) a.k.a. `kata-qemu-virtiofs` WITHOUT xattr
> > - qemu + 9pfs a.k.a `kata-qemu`
> > 
> > Please take a look at the html and raw results I attached to this mail.
> 
> Hi Carlos,
> 
> So you are running following test.
> 
> fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --output=/output/fio.txt
> 
> And following are your results.
> 
> 9p
> --
> READ: bw=211MiB/s (222MB/s), 211MiB/s-211MiB/s (222MB/s-222MB/s), io=3070MiB (3219MB), run=14532-14532msec
> 
> WRITE: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s), io=1026MiB (1076MB), run=14532-14532msec
> 
> virtiofs
> --------
> Run status group 0 (all jobs):
>    READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s), io=3070MiB (3219MB), run=19321-19321msec
>   WRITE: bw=53.1MiB/s (55.7MB/s), 53.1MiB/s-53.1MiB/s (55.7MB/s-55.7MB/s), io=1026MiB (1076MB), run=19321-19321msec
> 
> So looks like you are getting better performance with 9p in this case.

That's interesting, because I've just tried similar again with my
ramdisk setup:

fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --output=aname.txt


virtiofs default options
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=64
fio-3.21
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)

test: (groupid=0, jobs=1): err= 0: pid=773: Fri Sep 25 12:28:32 2020
  read: IOPS=18.3k, BW=71.3MiB/s (74.8MB/s)(3070MiB/43042msec)
   bw (  KiB/s): min=70752, max=77280, per=100.00%, avg=73075.71, stdev=1603.47, samples=85
   iops        : min=17688, max=19320, avg=18268.92, stdev=400.86, samples=85
  write: IOPS=6102, BW=23.8MiB/s (24.0MB/s)(1026MiB/43042msec); 0 zone resets
   bw (  KiB/s): min=23128, max=25696, per=100.00%, avg=24420.40, stdev=583.08, samples=85
   iops        : min= 5782, max= 6424, avg=6105.09, stdev=145.76, samples=85
  cpu          : usr=0.10%, sys=30.09%, ctx=1245312, majf=0, minf=6
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=71.3MiB/s (74.8MB/s), 71.3MiB/s-71.3MiB/s (74.8MB/s-74.8MB/s), io=3070MiB (3219MB), run=43042-43042msec
  WRITE: bw=23.8MiB/s (24.0MB/s), 23.8MiB/s-23.8MiB/s (24.0MB/s-24.0MB/s), io=1026MiB (1076MB), run=43042-43042msec

virtiofs cache=none
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=64
fio-3.21
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=740: Fri Sep 25 12:30:57 2020
  read: IOPS=22.9k, BW=89.6MiB/s (93.0MB/s)(3070MiB/34256msec)
   bw (  KiB/s): min=89048, max=94240, per=100.00%, avg=91871.06, stdev=967.87, samples=68
   iops        : min=22262, max=23560, avg=22967.76, stdev=241.97, samples=68
  write: IOPS=7667, BW=29.0MiB/s (31.4MB/s)(1026MiB/34256msec); 0 zone resets
   bw (  KiB/s): min=29264, max=32248, per=100.00%, avg=30700.82, stdev=541.97, samples=68
   iops        : min= 7316, max= 8062, avg=7675.21, stdev=135.49, samples=68
  cpu          : usr=1.03%, sys=27.64%, ctx=1048635, majf=0, minf=5
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=89.6MiB/s (93.0MB/s), 89.6MiB/s-89.6MiB/s (93.0MB/s-93.0MB/s), io=3070MiB (3219MB), run=34256-34256msec
  WRITE: bw=29.0MiB/s (31.4MB/s), 29.0MiB/s-29.0MiB/s (31.4MB/s-31.4MB/s), io=1026MiB (1076MB), run=34256-34256msec

virtiofs cache=none thread-pool-size=1
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=64
fio-3.21
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=738: Fri Sep 25 12:33:17 2020
  read: IOPS=23.7k, BW=92.4MiB/s (96.9MB/s)(3070MiB/33215msec)
   bw (  KiB/s): min=89808, max=111952, per=100.00%, avg=94762.30, stdev=4507.43, samples=66
   iops        : min=22452, max=27988, avg=23690.58, stdev=1126.86, samples=66
  write: IOPS=7907, BW=30.9MiB/s (32.4MB/s)(1026MiB/33215msec); 0 zone resets
   bw (  KiB/s): min=29424, max=37112, per=100.00%, avg=31668.73, stdev=1558.69, samples=66
   iops        : min= 7356, max= 9278, avg=7917.18, stdev=389.67, samples=66
  cpu          : usr=0.43%, sys=29.07%, ctx=1048627, majf=0, minf=7
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=92.4MiB/s (96.9MB/s), 92.4MiB/s-92.4MiB/s (96.9MB/s-96.9MB/s), io=3070MiB (3219MB), run=33215-33215msec
  WRITE: bw=30.9MiB/s (32.4MB/s), 30.9MiB/s-30.9MiB/s (32.4MB/s-32.4MB/s), io=1026MiB (1076MB), run=33215-33215msec

9p ( mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L,cache=mmap,msize=1048576 )
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=64
fio-3.21
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=736: Fri Sep 25 12:36:00 2020
  read: IOPS=16.2k, BW=63.5MiB/s (66.6MB/s)(3070MiB/48366msec)
   bw (  KiB/s): min=63426, max=82776, per=100.00%, avg=65054.28, stdev=2014.88, samples=96
   iops        : min=15856, max=20694, avg=16263.34, stdev=503.74, samples=96
  write: IOPS=5430, BW=21.2MiB/s (22.2MB/s)(1026MiB/48366msec); 0 zone resets
   bw (  KiB/s): min=20916, max=27632, per=100.00%, avg=21740.64, stdev=735.73, samples=96
   iops        : min= 5229, max= 6908, avg=5434.99, stdev=183.95, samples=96
  cpu          : usr=1.60%, sys=14.28%, ctx=1049348, majf=0, minf=7
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=63.5MiB/s (66.6MB/s), 63.5MiB/s-63.5MiB/s (66.6MB/s-66.6MB/s), io=3070MiB (3219MB), run=48366-48366msec
  WRITE: bw=21.2MiB/s (22.2MB/s), 21.2MiB/s-21.2MiB/s (22.2MB/s-22.2MB/s), io=1026MiB (1076MB), run=48366-48366msec

So I'm still beating 9p; the thread-pool-size=1 seems to be great for
read performance here.
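
(For reference, the thread-pool-size=1 case above amounts to starting the
daemon roughly like this -- a sketch, with paths assumed:

    ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/path/to/share \
        -o cache=none --thread-pool-size=1

with the qemu command line unchanged.)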

Dave

> Can you apply the "shared pool" patch to qemu for virtiofsd and re-run this
> test to see if you get any better results?
> 
> In my testing, with cache=none, virtiofs performed better than 9p in
> all the fio jobs I was running. For the case of cache=auto for virtiofs
> (with xattr enabled), 9p performed better in certain write workloads. I
> have identified the root cause of that problem and am working on
> HANDLE_KILLPRIV_V2 patches to improve the WRITE performance of virtiofs
> with cache=auto and xattr enabled.
> 
> I will post my 9p and virtiofs comparison numbers next week. In the
> meantime it would be great if you could apply the following qemu patch,
> rebuild qemu and re-run the above test.
> 
> https://www.redhat.com/archives/virtio-fs/2020-September/msg00081.html
> 
> Also, what's the status of the file cache on the host in both cases? Are
> you booting the host fresh for these tests so that the cache is cold on the
> host, or is the cache warm?
> 
> Thanks
> Vivek
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-25 12:41             ` [Virtio-fs] " Dr. David Alan Gilbert
@ 2020-09-25 13:04               ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2020-09-25 13:04 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr. David Alan Gilbert, Vivek Goyal, Venegas Munoz, Jose Carlos,
	cdupontd, virtio-fs-list, Stefan Hajnoczi, Shinde, Archana M

On Freitag, 25. September 2020 14:41:39 CEST Dr. David Alan Gilbert wrote:
> > Hi Carlos,
> > 
> > So you are running following test.
> > 
> > fio --direct=1 --gtod_reduce=1 --name=test
> > --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G
> > --readwrite=randrw --rwmixread=75 --output=/output/fio.txt
> > 
> > And following are your results.
> > 
> > 9p
> > --
> > READ: bw=211MiB/s (222MB/s), 211MiB/s-211MiB/s (222MB/s-222MB/s),
> > io=3070MiB (3219MB), run=14532-14532msec
> > 
> > WRITE: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s),
> > io=1026MiB (1076MB), run=14532-14532msec
> > 
> > virtiofs
> > --------
> > 
> > Run status group 0 (all jobs):
> >    READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s),
> >    io=3070MiB (3219MB), run=19321-19321msec>   
> >   WRITE: bw=53.1MiB/s (55.7MB/s), 53.1MiB/s-53.1MiB/s (55.7MB/s-55.7MB/s),
> >   io=1026MiB (1076MB), run=19321-19321msec> 
> > So looks like you are getting better performance with 9p in this case.
> 
> That's interesting, because I've just tried similar again with my
> ramdisk setup:
> 
> fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio
> --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
> --output=aname.txt
> 
> 
> virtiofs default options
> test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21
> Starting 1 process
> test: Laying out IO file (1 file / 4096MiB)
> 
> test: (groupid=0, jobs=1): err= 0: pid=773: Fri Sep 25 12:28:32 2020
>   read: IOPS=18.3k, BW=71.3MiB/s (74.8MB/s)(3070MiB/43042msec)
>    bw (  KiB/s): min=70752, max=77280, per=100.00%, avg=73075.71,
> stdev=1603.47, samples=85 iops        : min=17688, max=19320, avg=18268.92,
> stdev=400.86, samples=85 write: IOPS=6102, BW=23.8MiB/s
> (24.0MB/s)(1026MiB/43042msec); 0 zone resets bw (  KiB/s): min=23128,
> max=25696, per=100.00%, avg=24420.40, stdev=583.08, samples=85 iops       
> : min= 5782, max= 6424, avg=6105.09, stdev=145.76, samples=85 cpu         
> : usr=0.10%, sys=30.09%, ctx=1245312, majf=0, minf=6 IO depths    :
> 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit    :
> 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete  :
> 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts:
> total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency   : target=0,
> window=0, percentile=100.00%, depth=64
> 
> Run status group 0 (all jobs):
>    READ: bw=71.3MiB/s (74.8MB/s), 71.3MiB/s-71.3MiB/s (74.8MB/s-74.8MB/s),
> io=3070MiB (3219MB), run=43042-43042msec WRITE: bw=23.8MiB/s (24.0MB/s),
> 23.8MiB/s-23.8MiB/s (24.0MB/s-24.0MB/s), io=1026MiB (1076MB),
> run=43042-43042msec
> 
> virtiofs cache=none
> test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21
> Starting 1 process
> 
> test: (groupid=0, jobs=1): err= 0: pid=740: Fri Sep 25 12:30:57 2020
>   read: IOPS=22.9k, BW=89.6MiB/s (93.0MB/s)(3070MiB/34256msec)
>    bw (  KiB/s): min=89048, max=94240, per=100.00%, avg=91871.06,
> stdev=967.87, samples=68 iops        : min=22262, max=23560, avg=22967.76,
> stdev=241.97, samples=68 write: IOPS=7667, BW=29.0MiB/s
> (31.4MB/s)(1026MiB/34256msec); 0 zone resets bw (  KiB/s): min=29264,
> max=32248, per=100.00%, avg=30700.82, stdev=541.97, samples=68 iops       
> : min= 7316, max= 8062, avg=7675.21, stdev=135.49, samples=68 cpu         
> : usr=1.03%, sys=27.64%, ctx=1048635, majf=0, minf=5 IO depths    :
> 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit    :
> 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete  :
> 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts:
> total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency   : target=0,
> window=0, percentile=100.00%, depth=64
> 
> Run status group 0 (all jobs):
>    READ: bw=89.6MiB/s (93.0MB/s), 89.6MiB/s-89.6MiB/s (93.0MB/s-93.0MB/s),
> io=3070MiB (3219MB), run=34256-34256msec WRITE: bw=29.0MiB/s (31.4MB/s),
> 29.0MiB/s-29.0MiB/s (31.4MB/s-31.4MB/s), io=1026MiB (1076MB),
> run=34256-34256msec
> 
> virtiofs cache=none thread-pool-size=1
> test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21
> Starting 1 process
> 
> test: (groupid=0, jobs=1): err= 0: pid=738: Fri Sep 25 12:33:17 2020
>   read: IOPS=23.7k, BW=92.4MiB/s (96.9MB/s)(3070MiB/33215msec)
>    bw (  KiB/s): min=89808, max=111952, per=100.00%, avg=94762.30,
> stdev=4507.43, samples=66 iops        : min=22452, max=27988, avg=23690.58,
> stdev=1126.86, samples=66 write: IOPS=7907, BW=30.9MiB/s
> (32.4MB/s)(1026MiB/33215msec); 0 zone resets bw (  KiB/s): min=29424,
> max=37112, per=100.00%, avg=31668.73, stdev=1558.69, samples=66 iops       
> : min= 7356, max= 9278, avg=7917.18, stdev=389.67, samples=66 cpu         
> : usr=0.43%, sys=29.07%, ctx=1048627, majf=0, minf=7 IO depths    :
> 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit    :
> 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete  :
> 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts:
> total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency   : target=0,
> window=0, percentile=100.00%, depth=64
> 
> Run status group 0 (all jobs):
>    READ: bw=92.4MiB/s (96.9MB/s), 92.4MiB/s-92.4MiB/s (96.9MB/s-96.9MB/s),
> io=3070MiB (3219MB), run=33215-33215msec WRITE: bw=30.9MiB/s (32.4MB/s),
> 30.9MiB/s-30.9MiB/s (32.4MB/s-32.4MB/s), io=1026MiB (1076MB),
> run=33215-33215msec
> 
> 9p ( mount -t 9p -o trans=virtio kernel /mnt
> -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): rw=randrw,
Bottleneck ------------------------------^

By increasing 'msize' you should see better 9P I/O results.

> bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync,
> iodepth=64 fio-3.21
> Starting 1 process
> 
> test: (groupid=0, jobs=1): err= 0: pid=736: Fri Sep 25 12:36:00 2020
>   read: IOPS=16.2k, BW=63.5MiB/s (66.6MB/s)(3070MiB/48366msec)
>    bw (  KiB/s): min=63426, max=82776, per=100.00%, avg=65054.28,
> stdev=2014.88, samples=96 iops        : min=15856, max=20694, avg=16263.34,
> stdev=503.74, samples=96 write: IOPS=5430, BW=21.2MiB/s
> (22.2MB/s)(1026MiB/48366msec); 0 zone resets bw (  KiB/s): min=20916,
> max=27632, per=100.00%, avg=21740.64, stdev=735.73, samples=96 iops       
> : min= 5229, max= 6908, avg=5434.99, stdev=183.95, samples=96 cpu         
> : usr=1.60%, sys=14.28%, ctx=1049348, majf=0, minf=7 IO depths    :
> 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit    :
> 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete  :
> 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts:
> total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency   : target=0,
> window=0, percentile=100.00%, depth=64
> 
> Run status group 0 (all jobs):
>    READ: bw=63.5MiB/s (66.6MB/s), 63.5MiB/s-63.5MiB/s (66.6MB/s-66.6MB/s),
> io=3070MiB (3219MB), run=48366-48366msec WRITE: bw=21.2MiB/s (22.2MB/s),
> 21.2MiB/s-21.2MiB/s (22.2MB/s-22.2MB/s), io=1026MiB (1076MB),
> run=48366-48366msec
> 
> So I'm still beating 9p; the thread-pool-size=1 seems to be great for
> read performance here.
> 
> Dave

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-25 13:04               ` [Virtio-fs] " Christian Schoenebeck
@ 2020-09-25 13:05                 ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 107+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-25 13:05 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Venegas Munoz, Jose Carlos, qemu-devel, cdupontd, virtio-fs-list,
	Stefan Hajnoczi, Shinde, Archana M, Vivek Goyal

* Christian Schoenebeck (qemu_oss@crudebyte.com) wrote:
> On Freitag, 25. September 2020 14:41:39 CEST Dr. David Alan Gilbert wrote:
> > > Hi Carlos,
> > > 
> > > So you are running following test.
> > > 
> > > fio --direct=1 --gtod_reduce=1 --name=test
> > > --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G
> > > --readwrite=randrw --rwmixread=75 --output=/output/fio.txt
> > > 
> > > And following are your results.
> > > 
> > > 9p
> > > --
> > > READ: bw=211MiB/s (222MB/s), 211MiB/s-211MiB/s (222MB/s-222MB/s),
> > > io=3070MiB (3219MB), run=14532-14532msec
> > > 
> > > WRITE: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s),
> > > io=1026MiB (1076MB), run=14532-14532msec
> > > 
> > > virtiofs
> > > --------
> > > 
> > > Run status group 0 (all jobs):
> > >    READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s),
> > >    io=3070MiB (3219MB), run=19321-19321msec>   
> > >   WRITE: bw=53.1MiB/s (55.7MB/s), 53.1MiB/s-53.1MiB/s (55.7MB/s-55.7MB/s),
> > >   io=1026MiB (1076MB), run=19321-19321msec> 
> > > So looks like you are getting better performance with 9p in this case.
> > 
> > That's interesting, because I've just tried similar again with my
> > ramdisk setup:
> > 
> > fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio
> > --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
> > --output=aname.txt
> > 
> > 
> > virtiofs default options
> > test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> > 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21
> > Starting 1 process
> > test: Laying out IO file (1 file / 4096MiB)
> > 
> > test: (groupid=0, jobs=1): err= 0: pid=773: Fri Sep 25 12:28:32 2020
> >   read: IOPS=18.3k, BW=71.3MiB/s (74.8MB/s)(3070MiB/43042msec)
> >    bw (  KiB/s): min=70752, max=77280, per=100.00%, avg=73075.71,
> > stdev=1603.47, samples=85 iops        : min=17688, max=19320, avg=18268.92,
> > stdev=400.86, samples=85 write: IOPS=6102, BW=23.8MiB/s
> > (24.0MB/s)(1026MiB/43042msec); 0 zone resets bw (  KiB/s): min=23128,
> > max=25696, per=100.00%, avg=24420.40, stdev=583.08, samples=85 iops       
> > : min= 5782, max= 6424, avg=6105.09, stdev=145.76, samples=85 cpu         
> > : usr=0.10%, sys=30.09%, ctx=1245312, majf=0, minf=6 IO depths    :
> > 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit    :
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete  :
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts:
> > total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency   : target=0,
> > window=0, percentile=100.00%, depth=64
> > 
> > Run status group 0 (all jobs):
> >    READ: bw=71.3MiB/s (74.8MB/s), 71.3MiB/s-71.3MiB/s (74.8MB/s-74.8MB/s),
> > io=3070MiB (3219MB), run=43042-43042msec WRITE: bw=23.8MiB/s (24.0MB/s),
> > 23.8MiB/s-23.8MiB/s (24.0MB/s-24.0MB/s), io=1026MiB (1076MB),
> > run=43042-43042msec
> > 
> > virtiofs cache=none
> > test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> > 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21
> > Starting 1 process
> > 
> > test: (groupid=0, jobs=1): err= 0: pid=740: Fri Sep 25 12:30:57 2020
> >   read: IOPS=22.9k, BW=89.6MiB/s (93.0MB/s)(3070MiB/34256msec)
> >    bw (  KiB/s): min=89048, max=94240, per=100.00%, avg=91871.06,
> > stdev=967.87, samples=68 iops        : min=22262, max=23560, avg=22967.76,
> > stdev=241.97, samples=68 write: IOPS=7667, BW=29.0MiB/s
> > (31.4MB/s)(1026MiB/34256msec); 0 zone resets bw (  KiB/s): min=29264,
> > max=32248, per=100.00%, avg=30700.82, stdev=541.97, samples=68 iops       
> > : min= 7316, max= 8062, avg=7675.21, stdev=135.49, samples=68 cpu         
> > : usr=1.03%, sys=27.64%, ctx=1048635, majf=0, minf=5 IO depths    :
> > 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit    :
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete  :
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts:
> > total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency   : target=0,
> > window=0, percentile=100.00%, depth=64
> > 
> > Run status group 0 (all jobs):
> >    READ: bw=89.6MiB/s (93.0MB/s), 89.6MiB/s-89.6MiB/s (93.0MB/s-93.0MB/s),
> > io=3070MiB (3219MB), run=34256-34256msec WRITE: bw=29.0MiB/s (31.4MB/s),
> > 29.0MiB/s-29.0MiB/s (31.4MB/s-31.4MB/s), io=1026MiB (1076MB),
> > run=34256-34256msec
> > 
> > virtiofs cache=none thread-pool-size=1
> > test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> > 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21
> > Starting 1 process
> > 
> > test: (groupid=0, jobs=1): err= 0: pid=738: Fri Sep 25 12:33:17 2020
> >   read: IOPS=23.7k, BW=92.4MiB/s (96.9MB/s)(3070MiB/33215msec)
> >    bw (  KiB/s): min=89808, max=111952, per=100.00%, avg=94762.30,
> > stdev=4507.43, samples=66 iops        : min=22452, max=27988, avg=23690.58,
> > stdev=1126.86, samples=66 write: IOPS=7907, BW=30.9MiB/s
> > (32.4MB/s)(1026MiB/33215msec); 0 zone resets bw (  KiB/s): min=29424,
> > max=37112, per=100.00%, avg=31668.73, stdev=1558.69, samples=66 iops       
> > : min= 7356, max= 9278, avg=7917.18, stdev=389.67, samples=66 cpu         
> > : usr=0.43%, sys=29.07%, ctx=1048627, majf=0, minf=7 IO depths    :
> > 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit    :
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete  :
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts:
> > total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency   : target=0,
> > window=0, percentile=100.00%, depth=64
> > 
> > Run status group 0 (all jobs):
> >    READ: bw=92.4MiB/s (96.9MB/s), 92.4MiB/s-92.4MiB/s (96.9MB/s-96.9MB/s),
> > io=3070MiB (3219MB), run=33215-33215msec WRITE: bw=30.9MiB/s (32.4MB/s),
> > 30.9MiB/s-30.9MiB/s (32.4MB/s-32.4MB/s), io=1026MiB (1076MB),
> > run=33215-33215msec
> > 
> > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): rw=randrw,
> Bottleneck ------------------------------^
> 
> By increasing 'msize' you should see better 9P I/O results.

OK, I thought that was bigger than the default;  what number should I
use?

Dave

> > bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync,
> > iodepth=64 fio-3.21
> > Starting 1 process
> > 
> > test: (groupid=0, jobs=1): err= 0: pid=736: Fri Sep 25 12:36:00 2020
> >   read: IOPS=16.2k, BW=63.5MiB/s (66.6MB/s)(3070MiB/48366msec)
> >    bw (  KiB/s): min=63426, max=82776, per=100.00%, avg=65054.28,
> > stdev=2014.88, samples=96 iops        : min=15856, max=20694, avg=16263.34,
> > stdev=503.74, samples=96 write: IOPS=5430, BW=21.2MiB/s
> > (22.2MB/s)(1026MiB/48366msec); 0 zone resets bw (  KiB/s): min=20916,
> > max=27632, per=100.00%, avg=21740.64, stdev=735.73, samples=96 iops       
> > : min= 5229, max= 6908, avg=5434.99, stdev=183.95, samples=96 cpu         
> > : usr=1.60%, sys=14.28%, ctx=1049348, majf=0, minf=7 IO depths    :
> > 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit    :
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete  :
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts:
> > total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency   : target=0,
> > window=0, percentile=100.00%, depth=64
> > 
> > Run status group 0 (all jobs):
> >    READ: bw=63.5MiB/s (66.6MB/s), 63.5MiB/s-63.5MiB/s (66.6MB/s-66.6MB/s),
> > io=3070MiB (3219MB), run=48366-48366msec WRITE: bw=21.2MiB/s (22.2MB/s),
> > 21.2MiB/s-21.2MiB/s (22.2MB/s-22.2MB/s), io=1026MiB (1076MB),
> > run=48366-48366msec
> > 
> > So I'm still beating 9p; the thread-pool-size=1 seems to be great for
> > read performance here.
> > 
> > Dave
> 
> Best regards,
> Christian Schoenebeck
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
@ 2020-09-25 13:05                 ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 107+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-25 13:05 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Venegas Munoz, Jose Carlos, qemu-devel, cdupontd, virtio-fs-list,
	Shinde, Archana M, Vivek Goyal

* Christian Schoenebeck (qemu_oss@crudebyte.com) wrote:
> On Freitag, 25. September 2020 14:41:39 CEST Dr. David Alan Gilbert wrote:
> > > Hi Carlos,
> > > 
> > > So you are running following test.
> > > 
> > > fio --direct=1 --gtod_reduce=1 --name=test
> > > --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G
> > > --readwrite=randrw --rwmixread=75 --output=/output/fio.txt
> > > 
> > > And following are your results.
> > > 
> > > 9p
> > > --
> > > READ: bw=211MiB/s (222MB/s), 211MiB/s-211MiB/s (222MB/s-222MB/s),
> > > io=3070MiB (3219MB), run=14532-14532msec
> > > 
> > > WRITE: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s),
> > > io=1026MiB (1076MB), run=14532-14532msec
> > > 
> > > virtiofs
> > > --------
> > > 
> > > Run status group 0 (all jobs):
> > >    READ: bw=159MiB/s (167MB/s), 159MiB/s-159MiB/s (167MB/s-167MB/s),
> > >    io=3070MiB (3219MB), run=19321-19321msec>   
> > >   WRITE: bw=53.1MiB/s (55.7MB/s), 53.1MiB/s-53.1MiB/s (55.7MB/s-55.7MB/s),
> > >   io=1026MiB (1076MB), run=19321-19321msec> 
> > > So looks like you are getting better performance with 9p in this case.
> > 
> > That's interesting, because I've just tried similar again with my
> > ramdisk setup:
> > 
> > fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio
> > --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
> > --output=aname.txt
> > 
> > 
> > virtiofs default options
> > test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> > 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21
> > Starting 1 process
> > test: Laying out IO file (1 file / 4096MiB)
> > 
> > test: (groupid=0, jobs=1): err= 0: pid=773: Fri Sep 25 12:28:32 2020
> >   read: IOPS=18.3k, BW=71.3MiB/s (74.8MB/s)(3070MiB/43042msec)
> >    bw (  KiB/s): min=70752, max=77280, per=100.00%, avg=73075.71,
> > stdev=1603.47, samples=85 iops        : min=17688, max=19320, avg=18268.92,
> > stdev=400.86, samples=85 write: IOPS=6102, BW=23.8MiB/s
> > (24.0MB/s)(1026MiB/43042msec); 0 zone resets bw (  KiB/s): min=23128,
> > max=25696, per=100.00%, avg=24420.40, stdev=583.08, samples=85 iops       
> > : min= 5782, max= 6424, avg=6105.09, stdev=145.76, samples=85 cpu         
> > : usr=0.10%, sys=30.09%, ctx=1245312, majf=0, minf=6 IO depths    :
> > 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit    :
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete  :
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts:
> > total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency   : target=0,
> > window=0, percentile=100.00%, depth=64
> > 
> > Run status group 0 (all jobs):
> >    READ: bw=71.3MiB/s (74.8MB/s), 71.3MiB/s-71.3MiB/s (74.8MB/s-74.8MB/s),
> > io=3070MiB (3219MB), run=43042-43042msec WRITE: bw=23.8MiB/s (24.0MB/s),
> > 23.8MiB/s-23.8MiB/s (24.0MB/s-24.0MB/s), io=1026MiB (1076MB),
> > run=43042-43042msec
> > 
> > virtiofs cache=none
> > test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> > 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21
> > Starting 1 process
> > 
> > test: (groupid=0, jobs=1): err= 0: pid=740: Fri Sep 25 12:30:57 2020
> >   read: IOPS=22.9k, BW=89.6MiB/s (93.0MB/s)(3070MiB/34256msec)
> >    bw (  KiB/s): min=89048, max=94240, per=100.00%, avg=91871.06,
> > stdev=967.87, samples=68 iops        : min=22262, max=23560, avg=22967.76,
> > stdev=241.97, samples=68 write: IOPS=7667, BW=29.0MiB/s
> > (31.4MB/s)(1026MiB/34256msec); 0 zone resets bw (  KiB/s): min=29264,
> > max=32248, per=100.00%, avg=30700.82, stdev=541.97, samples=68 iops       
> > : min= 7316, max= 8062, avg=7675.21, stdev=135.49, samples=68 cpu         
> > : usr=1.03%, sys=27.64%, ctx=1048635, majf=0, minf=5 IO depths    :
> > 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit    :
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete  :
> > 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts:
> > total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0 latency   : target=0,
> > window=0, percentile=100.00%, depth=64
> > 
> > Run status group 0 (all jobs):
> >    READ: bw=89.6MiB/s (93.0MB/s), 89.6MiB/s-89.6MiB/s (93.0MB/s-93.0MB/s),
> > io=3070MiB (3219MB), run=34256-34256msec WRITE: bw=29.0MiB/s (31.4MB/s),
> > 29.0MiB/s-29.0MiB/s (31.4MB/s-31.4MB/s), io=1026MiB (1076MB),
> > run=34256-34256msec
> > 
> > virtiofs cache=none thread-pool-size=1
> > test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> > 4096B-4096B, ioengine=psync, iodepth=64 fio-3.21
> > Starting 1 process
> > 
> > test: (groupid=0, jobs=1): err= 0: pid=738: Fri Sep 25 12:33:17 2020
> >   read: IOPS=23.7k, BW=92.4MiB/s (96.9MB/s)(3070MiB/33215msec)
> >    bw (  KiB/s): min=89808, max=111952, per=100.00%, avg=94762.30, stdev=4507.43, samples=66
> >    iops        : min=22452, max=27988, avg=23690.58, stdev=1126.86, samples=66
> >   write: IOPS=7907, BW=30.9MiB/s (32.4MB/s)(1026MiB/33215msec); 0 zone resets
> >    bw (  KiB/s): min=29424, max=37112, per=100.00%, avg=31668.73, stdev=1558.69, samples=66
> >    iops        : min= 7356, max= 9278, avg=7917.18, stdev=389.67, samples=66
> >   cpu          : usr=0.43%, sys=29.07%, ctx=1048627, majf=0, minf=7
> >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
> >      latency   : target=0, window=0, percentile=100.00%, depth=64
> > 
> > Run status group 0 (all jobs):
> >    READ: bw=92.4MiB/s (96.9MB/s), 92.4MiB/s-92.4MiB/s (96.9MB/s-96.9MB/s), io=3070MiB (3219MB), run=33215-33215msec
> >   WRITE: bw=30.9MiB/s (32.4MB/s), 30.9MiB/s-30.9MiB/s (32.4MB/s-32.4MB/s), io=1026MiB (1076MB), run=33215-33215msec
> > 
> > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): rw=randrw,
> Bottleneck ------------------------------^
> 
> By increasing 'msize' you would encounter better 9P I/O results.

OK, I thought that was bigger than the default;  what number should I
use?

Dave

> > bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync,
> > iodepth=64 fio-3.21
> > Starting 1 process
> > 
> > test: (groupid=0, jobs=1): err= 0: pid=736: Fri Sep 25 12:36:00 2020
> >   read: IOPS=16.2k, BW=63.5MiB/s (66.6MB/s)(3070MiB/48366msec)
> >    bw (  KiB/s): min=63426, max=82776, per=100.00%, avg=65054.28, stdev=2014.88, samples=96
> >    iops        : min=15856, max=20694, avg=16263.34, stdev=503.74, samples=96
> >   write: IOPS=5430, BW=21.2MiB/s (22.2MB/s)(1026MiB/48366msec); 0 zone resets
> >    bw (  KiB/s): min=20916, max=27632, per=100.00%, avg=21740.64, stdev=735.73, samples=96
> >    iops        : min= 5229, max= 6908, avg=5434.99, stdev=183.95, samples=96
> >   cpu          : usr=1.60%, sys=14.28%, ctx=1049348, majf=0, minf=7
> >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
> >      latency   : target=0, window=0, percentile=100.00%, depth=64
> > 
> > Run status group 0 (all jobs):
> >    READ: bw=63.5MiB/s (66.6MB/s), 63.5MiB/s-63.5MiB/s (66.6MB/s-66.6MB/s), io=3070MiB (3219MB), run=48366-48366msec
> >   WRITE: bw=21.2MiB/s (22.2MB/s), 21.2MiB/s-21.2MiB/s (22.2MB/s-22.2MB/s), io=1026MiB (1076MB), run=48366-48366msec
> > 
> > So I'm still beating 9p; the thread-pool-size=1 seems to be great for
> > read performance here.
> > 
> > Dave
> 
> Best regards,
> Christian Schoenebeck
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: tools/virtiofs: Multi threading seems to hurt performance
  2020-09-25 12:11         ` [Virtio-fs] " Dr. David Alan Gilbert
@ 2020-09-25 13:11           ` Vivek Goyal
  -1 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-25 13:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: jose.carlos.venegas.munoz, qemu-devel, cdupontd, virtio-fs-list,
	Stefan Hajnoczi, archana.m.shinde

On Fri, Sep 25, 2020 at 01:11:27PM +0100, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgoyal@redhat.com) wrote:
> > On Tue, Sep 22, 2020 at 11:25:31AM +0100, Dr. David Alan Gilbert wrote:
> > > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > > > Hi,
> > > >   I've been doing some of my own perf tests and I think I agree
> > > > about the thread pool size;  my test is a kernel build
> > > > and I've tried a bunch of different options.
> > > > 
> > > > My config:
> > > >   Host: 16 core AMD EPYC (32 thread), 128G RAM,
> > > >      5.9.0-rc4 kernel, rhel 8.2ish userspace.
> > > >   5.1.0 qemu/virtiofsd built from git.
> > > >   Guest: Fedora 32 from cloud image with just enough extra installed for
> > > > a kernel build.
> > > > 
> > > >   git cloned and checkout v5.8 of Linux into /dev/shm/linux on the host
> > > > fresh before each test.  Then log into the guest, make defconfig,
> > > > time make -j 16 bzImage,  make clean; time make -j 16 bzImage 
> > > > The numbers below are the 'real' time in the guest from the initial make
> > > > (the subsequent makes don't vary much)
> > > > 
> > > > Below are the details of what each of these means, but here are the
> > > > numbers first
> > > > 
> > > > virtiofsdefault        4m0.978s
> > > > 9pdefault              9m41.660s
> > > > virtiofscache=none    10m29.700s
> > > > 9pmmappass             9m30.047s
> > > > 9pmbigmsize           12m4.208s
> > > > 9pmsecnone             9m21.363s
> > > > virtiofscache=noneT1   7m17.494s
> > > > virtiofsdefaultT1      3m43.326s
> > > > 
> > > > So the winner there by far is the 'virtiofsdefaultT1' - that's
> > > > the default virtiofs settings, but with --thread-pool-size=1 - so
> > > > yes it gives a small benefit.
> > > > But interestingly the cache=none virtiofs performance is pretty bad,
> > > > but thread-pool-size=1 on that makes a BIG improvement.
> > > 
> > > Here are fio runs that Vivek asked me to run in my same environment
> > > (there are some 0's in some of the mmap cases, and I've not investigated
> > > why yet).
> > 
> > cache=none does not allow mmap in case of virtiofs. That's when you
> > are seeing 0.
> > 
> > > virtiofs is looking good here in, I think, all of the cases;
> > > there's some division over which config; cache=none
> > > seems faster in some cases, which surprises me.
> > 
> > I know cache=none is faster in case of write workloads. It forces
> > direct write where we don't call file_remove_privs(). While cache=auto
> > goes through file_remove_privs() and that adds a GETXATTR request to
> > every WRITE request.
> 
> Can you point me to how cache=auto causes the file_remove_privs?

fs/fuse/file.c

fuse_cache_write_iter() {
	err = file_remove_privs(file);
}

The above path is taken when cache=auto/cache=always is used. If virtiofsd
is running with noxattr, then it does not impose any cost. But if xattrs
are enabled, then every WRITE first results in a
getxattr(security.capability), and that slows down WRITEs tremendously.
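
For anyone who wants to reproduce the difference directly, a rough sketch of
the two virtiofsd invocations (option spellings from memory of the C
virtiofsd, and /srv/vm is just a placeholder path, so double-check against
your virtiofsd --help). Only the xattr-enabled run pays the extra
getxattr(security.capability) per WRITE with cache=auto:

	virtiofsd --socket-path=/tmp/vhostqemu -o source=/srv/vm -o cache=auto -o xattr
	virtiofsd --socket-path=/tmp/vhostqemu -o source=/srv/vm -o cache=auto -o noxattr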

When cache=none is used, we go through the following path instead.

fuse_direct_write_iter() does not call file_remove_privs(). We
set a flag in the WRITE request to tell the server to kill
suid/sgid/security.capability instead.

fuse_direct_io() {
	ia->write.in.write_flags |= FUSE_WRITE_KILL_PRIV
}

Vivek



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: virtiofs vs 9p performance
  2020-09-25  8:06             ` [Virtio-fs] " Christian Schoenebeck
@ 2020-09-25 13:13               ` Vivek Goyal
  -1 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-25 13:13 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Shinde, Archana M, Venegas Munoz, Jose Carlos, qemu-devel,
	Dr. David Alan Gilbert, virtio-fs-list, Greg Kurz,
	Stefan Hajnoczi, cdupontd

On Fri, Sep 25, 2020 at 10:06:41AM +0200, Christian Schoenebeck wrote:
> On Freitag, 25. September 2020 00:10:23 CEST Vivek Goyal wrote:
> > In my testing, with cache=none, virtiofs performed better than 9p in
> > all the fio jobs I was running. For the case of cache=auto  for virtiofs
> > (with xattr enabled), 9p performed better in certain write workloads. I
> > have identified root cause of that problem and working on
> > HANDLE_KILLPRIV_V2 patches to improve WRITE performance of virtiofs
> > with cache=auto and xattr enabled.
> 
> Please note, when it comes to performance aspects, you should set a reasonably 
> high value for 'msize' on 9p client side:
> https://wiki.qemu.org/Documentation/9psetup#msize

Interesting. I will try that. What does "msize" do? 

> 
> I'm also working on performance optimizations for 9p BTW. There is plenty of 
> headroom to put it mildly. For QEMU 5.2 I started by addressing readdir 
> requests:
> https://wiki.qemu.org/ChangeLog/5.2#9pfs

Nice. I guess this performance comparison between 9p and virtiofs is good.
Both the projects can try to identify weak points and improve performance.

Thanks
Vivek



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: virtiofs vs 9p performance
  2020-09-25 13:13               ` [Virtio-fs] " Vivek Goyal
@ 2020-09-25 15:47                 ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2020-09-25 15:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Vivek Goyal, Shinde, Archana M, Venegas Munoz, Jose Carlos,
	Dr. David Alan Gilbert, virtio-fs-list, Greg Kurz,
	Stefan Hajnoczi, cdupontd

On Freitag, 25. September 2020 15:13:56 CEST Vivek Goyal wrote:
> On Fri, Sep 25, 2020 at 10:06:41AM +0200, Christian Schoenebeck wrote:
> > On Freitag, 25. September 2020 00:10:23 CEST Vivek Goyal wrote:
> > > In my testing, with cache=none, virtiofs performed better than 9p in
> > > all the fio jobs I was running. For the case of cache=auto  for virtiofs
> > > (with xattr enabled), 9p performed better in certain write workloads. I
> > > have identified root cause of that problem and working on
> > > HANDLE_KILLPRIV_V2 patches to improve WRITE performance of virtiofs
> > > with cache=auto and xattr enabled.
> > 
> > Please note, when it comes to performance aspects, you should set a
> > reasonably high value for 'msize' on 9p client side:
> > https://wiki.qemu.org/Documentation/9psetup#msize
> 
> Interesting. I will try that. What does "msize" do?

Simple: it's the "maximum message size" ever to be used for communication
between host and guest, in both directions, that is.

So if that 'msize' value is too small, a potentially large 9p message would be
split into several smaller 9p messages, and each message adds latency, which is
the main problem.

Keep in mind: The default value with Linux clients for msize is still only 
8kB!

Think of doing 'dd bs=8192 if=/src.dat of=/dst.dat count=...' as analogy, 
which probably makes its impact on performance clear.

However, the negative impact of a small 'msize' value is not just limited to
raw file I/O like that; calling readdir() for instance on a guest directory
with several hundred files or more will likewise slow down tremendously, as
both sides have to transmit a large number of 9p messages back and forth
instead of just 2 messages (Treaddir and Rreaddir).
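
For completeness, raising it is just a mount option on the 9p client side,
e.g. (the 'kernel' mount tag and the value itself are placeholders):

	mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L,cache=mmap,msize=104857600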

> > I'm also working on performance optimizations for 9p BTW. There is plenty
> > of headroom to put it mildly. For QEMU 5.2 I started by addressing
> > readdir requests:
> > https://wiki.qemu.org/ChangeLog/5.2#9pfs
> 
> Nice. I guess this performance comparison between 9p and virtiofs is good.
> Both the projects can try to identify weak points and improve performance.

Yes, that's indeed handy being able to make comparisons.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-25 13:05                 ` [Virtio-fs] " Dr. David Alan Gilbert
@ 2020-09-25 16:05                   ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2020-09-25 16:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr. David Alan Gilbert, Venegas Munoz, Jose Carlos, cdupontd,
	virtio-fs-list, Stefan Hajnoczi, Shinde, Archana M, Vivek Goyal

On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote:
> > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): rw=randrw,
> > 
> > Bottleneck ------------------------------^
> > 
> > By increasing 'msize' you would encounter better 9P I/O results.
> 
> OK, I thought that was bigger than the default;  what number should I
> use?

It depends on the underlying storage hardware. In other words: you have to try
increasing the 'msize' value to a point where you no longer notice a negative
performance impact (or almost). That is fortunately quite easy to test on the
guest, like:

	dd if=/dev/zero of=test.dat bs=1G count=12
	time cat test.dat > /dev/null

I would start with an absolute minimum msize of 10MB. I would recommend 
something around 100MB maybe for a mechanical hard drive. With a PCIe flash 
you probably would rather pick several hundred MB or even more.

That unpleasant 'msize' issue is a limitation of the 9p protocol: the client
(guest) must suggest the value of msize on connection to the server (host). The
server can only lower it, not raise it. And the client in turn obviously cannot
see the host's storage device(s), so the client is unable to pick a good value
by itself. So it's a suboptimal handshake issue right now.
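
A minimal sweep over a few msize values could look like this (only a sketch;
it assumes the export tag is 'kernel' and that a large test.dat already exists
in the export, created as above):

	for m in 1048576 10485760 104857600; do
		mount -t 9p -o trans=virtio,version=9p2000.L,cache=mmap,msize=$m kernel /mnt
		time cat /mnt/test.dat > /dev/null
		umount /mnt
	done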

Many users don't even know this 'msize' parameter exists and hence run with 
the Linux kernel's default value of just 8kB. For QEMU 5.2 I addressed this by 
logging a performance warning on host side for making users at least aware 
about this issue. The long-term plan is to pass a good msize value from host 
to guest via virtio (like it's already done for the available export tags) and 
the Linux kernel would default to that instead.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-25 16:05                   ` [Virtio-fs] " Christian Schoenebeck
@ 2020-09-25 16:33                     ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2020-09-25 16:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr. David Alan Gilbert, Venegas Munoz, Jose Carlos, cdupontd,
	virtio-fs-list, Stefan Hajnoczi, Shinde, Archana M, Vivek Goyal

On Freitag, 25. September 2020 18:05:17 CEST Christian Schoenebeck wrote:
> On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote:
> > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): rw=randrw,
> > > 
> > > Bottleneck ------------------------------^
> > > 
> > > By increasing 'msize' you would encounter better 9P I/O results.
> > 
> > OK, I thought that was bigger than the default;  what number should I
> > use?
> 
> It depends on the underlying storage hardware. In other words: you have to
> try increasing the 'msize' value to a point where you no longer notice a
> negative performance impact (or almost). Which is fortunately quite easy to
> test on guest like:
> 
> 	dd if=/dev/zero of=test.dat bs=1G count=12
> 	time cat test.dat > /dev/null

I forgot: you should execute that 'dd' command on the host side, and the 'cat'
command on the guest side, to avoid any caching making the benchmark result look
better than it actually is. Because for finding a good 'msize' value you only
care about actual 9p data really being transmitted between host and guest.
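
In other words, something like this (a sketch; /srv/vm stands for whatever
directory is exported to the guest, and dropping the guest page cache is
optional but makes repeated runs comparable):

	# on the host, inside the exported directory:
	dd if=/dev/zero of=/srv/vm/test.dat bs=1G count=12

	# on the guest:
	echo 3 > /proc/sys/vm/drop_caches
	time cat /mnt/test.dat > /dev/null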

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-25 16:05                   ` [Virtio-fs] " Christian Schoenebeck
@ 2020-09-25 18:51                     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 107+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-25 18:51 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Venegas Munoz, Jose Carlos, qemu-devel, cdupontd, virtio-fs-list,
	Stefan Hajnoczi, Shinde, Archana M, Vivek Goyal

* Christian Schoenebeck (qemu_oss@crudebyte.com) wrote:
> On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote:
> > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0): rw=randrw,
> > > 
> > > Bottleneck ------------------------------^
> > > 
> > > By increasing 'msize' you would encounter better 9P I/O results.
> > 
> > OK, I thought that was bigger than the default;  what number should I
> > use?
> 
> It depends on the underlying storage hardware. In other words: you have to try 
> increasing the 'msize' value to a point where you no longer notice a negative 
> performance impact (or almost). Which is fortunately quite easy to test on 
> guest like:
> 
> 	dd if=/dev/zero of=test.dat bs=1G count=12
> 	time cat test.dat > /dev/null
> 
> I would start with an absolute minimum msize of 10MB. I would recommend 
> something around 100MB maybe for a mechanical hard drive. With a PCIe flash 
> you probably would rather pick several hundred MB or even more.
> 
> That unpleasant 'msize' issue is a limitation of the 9p protocol: client 
> (guest) must suggest the value of msize on connection to server (host). Server 
> can only lower, but not raise it. And the client in turn obviously cannot see 
> host's storage device(s), so client is unable to pick a good value by itself. 
> So it's a suboptimal handshake issue right now.

It doesn't seem to be making a vast difference here:



9p mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L,cache=mmap,msize=104857600

Run status group 0 (all jobs):
   READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s (65.6MB/s-65.6MB/s), io=3070MiB (3219MB), run=49099-49099msec
  WRITE: bw=20.9MiB/s (21.9MB/s), 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=1026MiB (1076MB), run=49099-49099msec

9p mount -t 9p -o trans=virtio kernel /mnt -oversion=9p2000.L,cache=mmap,msize=1048576000

Run status group 0 (all jobs):
   READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s (68.3MB/s-68.3MB/s), io=3070MiB (3219MB), run=47104-47104msec
  WRITE: bw=21.8MiB/s (22.8MB/s), 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), io=1026MiB (1076MB), run=47104-47104msec


Dave

> Many users don't even know this 'msize' parameter exists and hence run with 
> the Linux kernel's default value of just 8kB. For QEMU 5.2 I addressed this by 
> logging a performance warning on host side for making users at least aware 
> about this issue. The long-term plan is to pass a good msize value from host 
> to guest via virtio (like it's already done for the available export tags) and 
> the Linux kernel would default to that instead.
> 
> Best regards,
> Christian Schoenebeck
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-25 18:51                     ` [Virtio-fs] " Dr. David Alan Gilbert
@ 2020-09-27 12:14                       ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2020-09-27 12:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr. David Alan Gilbert, Venegas Munoz, Jose Carlos, cdupontd,
	virtio-fs-list, Stefan Hajnoczi, Shinde, Archana M, Vivek Goyal

On Freitag, 25. September 2020 20:51:47 CEST Dr. David Alan Gilbert wrote:
> * Christian Schoenebeck (qemu_oss@crudebyte.com) wrote:
> > On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote:
> > > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0):
> > > > > rw=randrw,
> > > > 
> > > > Bottleneck ------------------------------^
> > > > 
> > > > By increasing 'msize' you would encounter better 9P I/O results.
> > > 
> > > OK, I thought that was bigger than the default;  what number should I
> > > use?
> > 
> > It depends on the underlying storage hardware. In other words: you have to
> > try increasing the 'msize' value to a point where you no longer notice a
> > negative performance impact (or almost). Which is fortunately quite easy
> > to test on> 
> > guest like:
> > 	dd if=/dev/zero of=test.dat bs=1G count=12
> > 	time cat test.dat > /dev/null
> > 
> > I would start with an absolute minimum msize of 10MB. I would recommend
> > something around 100MB maybe for a mechanical hard drive. With a PCIe
> > flash
> > you probably would rather pick several hundred MB or even more.
> > 
> > That unpleasant 'msize' issue is a limitation of the 9p protocol: client
> > (guest) must suggest the value of msize on connection to server (host).
> > Server can only lower, but not raise it. And the client in turn obviously
> > cannot see host's storage device(s), so client is unable to pick a good
> > value by itself. So it's a suboptimal handshake issue right now.
> 
> It doesn't seem to be making a vast difference here:
> 
> 
> 
> 9p mount -t 9p -o trans=virtio kernel /mnt
> -oversion=9p2000.L,cache=mmap,msize=104857600
> 
> Run status group 0 (all jobs):
>    READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s (65.6MB/s-65.6MB/s),
> io=3070MiB (3219MB), run=49099-49099msec WRITE: bw=20.9MiB/s (21.9MB/s),
> 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=1026MiB (1076MB),
> run=49099-49099msec
> 
> 9p mount -t 9p -o trans=virtio kernel /mnt
> -oversion=9p2000.L,cache=mmap,msize=1048576000
> 
> Run status group 0 (all jobs):
>    READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s (68.3MB/s-68.3MB/s),
> io=3070MiB (3219MB), run=47104-47104msec WRITE: bw=21.8MiB/s (22.8MB/s),
> 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), io=1026MiB (1076MB),
> run=47104-47104msec
> 
> 
> Dave

Is that benchmark tool honoring 'iounit' to automatically run with max. I/O 
chunk sizes? What's that benchmark tool actually? And do you also see no 
improvement with a simple

	time cat largefile.dat > /dev/null

?

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-27 12:14                       ` [Virtio-fs] " Christian Schoenebeck
@ 2020-09-29 13:03                         ` Vivek Goyal
  -1 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-29 13:03 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Venegas Munoz, Jose Carlos, cdupontd, qemu-devel, virtio-fs-list,
	Stefan Hajnoczi, Shinde, Archana M, Dr. David Alan Gilbert

On Sun, Sep 27, 2020 at 02:14:43PM +0200, Christian Schoenebeck wrote:
> On Freitag, 25. September 2020 20:51:47 CEST Dr. David Alan Gilbert wrote:
> > * Christian Schoenebeck (qemu_oss@crudebyte.com) wrote:
> > > On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert wrote:
> > > > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0):
> > > > > > rw=randrw,
> > > > > 
> > > > > Bottleneck ------------------------------^
> > > > > 
> > > > > By increasing 'msize' you would encounter better 9P I/O results.
> > > > 
> > > > OK, I thought that was bigger than the default;  what number should I
> > > > use?
> > > 
> > > It depends on the underlying storage hardware. In other words: you have to
> > > try increasing the 'msize' value to a point where you no longer notice a
> > > negative performance impact (or almost). Which is fortunately quite easy
> > > to test on> 
> > > guest like:
> > > 	dd if=/dev/zero of=test.dat bs=1G count=12
> > > 	time cat test.dat > /dev/null
> > > 
> > > I would start with an absolute minimum msize of 10MB. I would recommend
> > > something around 100MB maybe for a mechanical hard drive. With a PCIe
> > > flash
> > > you probably would rather pick several hundred MB or even more.
> > > 
> > > That unpleasant 'msize' issue is a limitation of the 9p protocol: client
> > > (guest) must suggest the value of msize on connection to server (host).
> > > Server can only lower, but not raise it. And the client in turn obviously
> > > cannot see host's storage device(s), so client is unable to pick a good
> > > value by itself. So it's a suboptimal handshake issue right now.
> > 
> > It doesn't seem to be making a vast difference here:
> > 
> > 
> > 
> > 9p mount -t 9p -o trans=virtio kernel /mnt
> > -oversion=9p2000.L,cache=mmap,msize=104857600
> > 
> > Run status group 0 (all jobs):
> >    READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s (65.6MB/s-65.6MB/s),
> > io=3070MiB (3219MB), run=49099-49099msec WRITE: bw=20.9MiB/s (21.9MB/s),
> > 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=1026MiB (1076MB),
> > run=49099-49099msec
> > 
> > 9p mount -t 9p -o trans=virtio kernel /mnt
> > -oversion=9p2000.L,cache=mmap,msize=1048576000
> > 
> > Run status group 0 (all jobs):
> >    READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s (68.3MB/s-68.3MB/s),
> > io=3070MiB (3219MB), run=47104-47104msec WRITE: bw=21.8MiB/s (22.8MB/s),
> > 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), io=1026MiB (1076MB),
> > run=47104-47104msec
> > 
> > 
> > Dave
> 
> Is that benchmark tool honoring 'iounit' to automatically run with max. I/O 
> chunk sizes? What's that benchmark tool actually? And do you also see no 
> improvement with a simple
> 
> 	time cat largefile.dat > /dev/null

I am assuming that msize only helps with sequential I/O and not random
I/O.

Dave is running a random read and random write mix, and that's probably why
he is not seeing any improvement from the msize increase.

If we run a sequential workload (such as "cat largefile.dat"), that should
see an improvement with a larger msize.

Thanks
Vivek



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-25 12:41             ` [Virtio-fs] " Dr. David Alan Gilbert
@ 2020-09-29 13:17               ` Vivek Goyal
  -1 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-29 13:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Venegas Munoz, Jose Carlos, qemu-devel, cdupontd, virtio-fs-list,
	Stefan Hajnoczi, Shinde, Archana M

On Fri, Sep 25, 2020 at 01:41:39PM +0100, Dr. David Alan Gilbert wrote:

[..]
> So I'm still beating 9p; the thread-pool-size=1 seems to be great for
> read performance here.
> 

Hi Dave,

I spent some time making changes to virtiofs-tests so that I can test
a mix of random read and random write workloads. That testsuite runs
a workload 3 times and reports the average, so I like to use it to
reduce the run-to-run variation.

So I ran the following to mimic Carlos's workload.

$ ./run-fio-test.sh test -direct=1 -c <test-dir> fio-jobs/randrw-psync.job >
testresults.txt

$ ./parse-fio-results.sh testresults.txt

I am using an SSD on the host to back these files. Option "-c" always
creates new files for testing.
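
For reference, a roughly equivalent standalone fio invocation, reconstructed
from the fio headers and READ/WRITE totals earlier in this thread rather than
taken from the actual fio-jobs/randrw-psync.job file (the directory is a
placeholder):

	fio --name=randrw-psync --ioengine=psync --rw=randrw --rwmixread=75 \
	    --bs=4k --direct=1 --size=4g --directory=/mnt/virtiofs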

Following are my results in various configurations. I used cache=mmap mode
for 9p and cache=auto (and cache=none) modes for virtiofs. I also tested
the 9p default msize as well as msize=16m, and tested virtiofs with both an
exclusive and a shared thread pool.

NAME                    WORKLOAD                Bandwidth       IOPS            
9p-mmap-randrw          randrw-psync            42.8mb/14.3mb   10.7k/3666      
9p-mmap-msize16m        randrw-psync            42.8mb/14.3mb   10.7k/3674      
vtfs-auto-ex-randrw     randrw-psync            27.8mb/9547kb   7136/2386       
vtfs-auto-sh-randrw     randrw-psync            43.3mb/14.4mb   10.8k/3709      
vtfs-none-sh-randrw     randrw-psync            54.1mb/18.1mb   13.5k/4649      


- Increasing msize to 16m did not help with performance for this workload.
- virtiofs exclusive thread pool ("ex"), is slower than 9p.
- virtiofs shared thread pool ("sh"), matches the performance of 9p.
- virtiofs cache=none mode is faster than cache=auto mode for this
  workload.

Carlos, I am looking at more ways to optimize it further for virtiofs.
In the meantime, I think switching to the "shared" thread pool should
bring you very close to 9p in your setup.

Thanks
Vivek



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-29 13:03                         ` [Virtio-fs] " Vivek Goyal
@ 2020-09-29 13:28                           ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2020-09-29 13:28 UTC (permalink / raw)
  To: qemu-devel
  Cc: Vivek Goyal, Venegas Munoz, Jose Carlos, cdupontd,
	virtio-fs-list, Stefan Hajnoczi, Shinde, Archana M,
	Dr. David Alan Gilbert

On Dienstag, 29. September 2020 15:03:25 CEST Vivek Goyal wrote:
> On Sun, Sep 27, 2020 at 02:14:43PM +0200, Christian Schoenebeck wrote:
> > On Freitag, 25. September 2020 20:51:47 CEST Dr. David Alan Gilbert wrote:
> > > * Christian Schoenebeck (qemu_oss@crudebyte.com) wrote:
> > > > On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert 
wrote:
> > > > > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > > > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0):
> > > > > > > rw=randrw,
> > > > > > 
> > > > > > Bottleneck ------------------------------^
> > > > > > 
> > > > > > By increasing 'msize' you would encounter better 9P I/O results.
> > > > > 
> > > > > OK, I thought that was bigger than the default;  what number should
> > > > > I
> > > > > use?
> > > > 
> > > > It depends on the underlying storage hardware. In other words: you
> > > > have to
> > > > try increasing the 'msize' value to a point where you no longer notice
> > > > a
> > > > negative performance impact (or almost). Which is fortunately quite
> > > > easy
> > > > to test on>
> > > > 
> > > > guest like:
> > > > 	dd if=/dev/zero of=test.dat bs=1G count=12
> > > > 	time cat test.dat > /dev/null
> > > > 
> > > > I would start with an absolute minimum msize of 10MB. I would
> > > > recommend
> > > > something around 100MB maybe for a mechanical hard drive. With a PCIe
> > > > flash
> > > > you probably would rather pick several hundred MB or even more.
> > > > 
> > > > That unpleasant 'msize' issue is a limitation of the 9p protocol:
> > > > client
> > > > (guest) must suggest the value of msize on connection to server
> > > > (host).
> > > > Server can only lower, but not raise it. And the client in turn
> > > > obviously
> > > > cannot see host's storage device(s), so client is unable to pick a
> > > > good
> > > > value by itself. So it's a suboptimal handshake issue right now.
> > > 
> > > It doesn't seem to be making a vast difference here:
> > > 
> > > 
> > > 
> > > 9p mount -t 9p -o trans=virtio kernel /mnt
> > > -oversion=9p2000.L,cache=mmap,msize=104857600
> > > 
> > > Run status group 0 (all jobs):
> > >    READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s
> > >    (65.6MB/s-65.6MB/s),
> > > 
> > > io=3070MiB (3219MB), run=49099-49099msec WRITE: bw=20.9MiB/s (21.9MB/s),
> > > 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=1026MiB (1076MB),
> > > run=49099-49099msec
> > > 
> > > 9p mount -t 9p -o trans=virtio kernel /mnt
> > > -oversion=9p2000.L,cache=mmap,msize=1048576000
> > > 
> > > Run status group 0 (all jobs):
> > >    READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s
> > >    (68.3MB/s-68.3MB/s),
> > > 
> > > io=3070MiB (3219MB), run=47104-47104msec WRITE: bw=21.8MiB/s (22.8MB/s),
> > > 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), io=1026MiB (1076MB),
> > > run=47104-47104msec
> > > 
> > > 
> > > Dave
> > 
> > Is that benchmark tool honoring 'iounit' to automatically run with max.
> > I/O
> > chunk sizes? What's that benchmark tool actually? And do you also see no
> > improvement with a simple
> > 
> > 	time cat largefile.dat > /dev/null
> 
> I am assuming that msize only helps with sequential I/O and not random
> I/O.
> 
> Dave is running random read and random write mix and probably that's why
> he is not seeing any improvement with msize increase.
> 
> If we run sequential workload (as "cat largefile.dat"), that should
> see an improvement with msize increase.
> 
> Thanks
> Vivek

It depends on what's randomized. If the read chunk size is randomized, then yes,
you would probably see a smaller performance increase compared to a simple
'cat foo.dat'.

If only the read position is randomized, but the read chunk size honors
iounit, a.k.a. stat's st_blksize (i.e. reading with the most efficient block
size advertised by 9P), then I would still expect to see a performance
increase, because seeking is a no/low-cost factor in this case. Seeking in the
guest OS does not transmit a 9p message. The offset is rather passed with every
Tread message instead:
https://github.com/chaos/diod/blob/master/protocol.md
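
As a sketch of what "honoring iounit" means on the guest side (GNU stat's %o
format prints the optimal I/O transfer size hint, i.e. st_blksize; the file
path is only a placeholder):

	bs=$(stat -c %o /mnt/largefile.dat)
	dd if=/mnt/largefile.dat of=/dev/null bs="$bs"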

I mean, yes, random seeks reduce I/O performance in general of course, but in
a direct performance comparison, the difference in overhead of the 9p vs.
virtiofs network controller layer is most probably the most relevant aspect if
large I/O chunk sizes are used.

But OTOH: I haven't optimized anything in Tread handling in 9p (yet).

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

iounit, a.k.a. stat's st_blksize (i.e. reading with the most efficient block 
size advertised by 9P), then I would assume still seeing a performance 
increase. Because seeking is a no/low cost factor in this case. The guest OS 
seeking does not transmit a 9p message. The offset is rather passed with any 
Tread message instead:
https://github.com/chaos/diod/blob/master/protocol.md

I mean, yes, random seeks reduce I/O performance in general of course, but in 
direct performance comparison, the difference in overhead of the 9p vs. 
virtiofs network controller layer is most probably the most relevant aspect if 
large I/O chunk sizes are used.

But OTOH: I haven't optimized anything in Tread handling in 9p (yet).

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-29 13:17               ` [Virtio-fs] " Vivek Goyal
@ 2020-09-29 13:49                 ` Miklos Szeredi
  -1 siblings, 0 replies; 107+ messages in thread
From: Miklos Szeredi @ 2020-09-29 13:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: qemu-devel, Venegas Munoz, Jose Carlos, cdupontd,
	Dr. David Alan Gilbert, virtio-fs-list, Shinde, Archana M

On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:

> - virtiofs cache=none mode is faster than cache=auto mode for this
>   workload.

Not sure why.  One cause could be that readahead is not perfect at
detecting the random pattern.  Could we compare total I/O on the
server vs. total I/O by fio?

Thanks,
Miklos



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
@ 2020-09-29 13:49                 ` Miklos Szeredi
  0 siblings, 0 replies; 107+ messages in thread
From: Miklos Szeredi @ 2020-09-29 13:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: qemu-devel, Venegas Munoz, Jose Carlos, cdupontd, virtio-fs-list,
	Shinde, Archana M

On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:

> - virtiofs cache=none mode is faster than cache=auto mode for this
>   workload.

Not sure why.  One cause could be that readahead is not perfect at
detecting the random pattern.  Could we compare total I/O on the
server vs. total I/O by fio?

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-29 13:28                           ` [Virtio-fs] " Christian Schoenebeck
@ 2020-09-29 13:49                             ` Vivek Goyal
  -1 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-29 13:49 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Venegas Munoz, Jose Carlos, qemu-devel, cdupontd, virtio-fs-list,
	Stefan Hajnoczi, Shinde, Archana M, Dr. David Alan Gilbert

On Tue, Sep 29, 2020 at 03:28:06PM +0200, Christian Schoenebeck wrote:
> On Dienstag, 29. September 2020 15:03:25 CEST Vivek Goyal wrote:
> > On Sun, Sep 27, 2020 at 02:14:43PM +0200, Christian Schoenebeck wrote:
> > > On Freitag, 25. September 2020 20:51:47 CEST Dr. David Alan Gilbert wrote:
> > > > * Christian Schoenebeck (qemu_oss@crudebyte.com) wrote:
> > > > > On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert 
> wrote:
> > > > > > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > > > > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0):
> > > > > > > > rw=randrw,
> > > > > > > 
> > > > > > > Bottleneck ------------------------------^
> > > > > > > 
> > > > > > > By increasing 'msize' you would encounter better 9P I/O results.
> > > > > > 
> > > > > > OK, I thought that was bigger than the default;  what number should
> > > > > > I
> > > > > > use?
> > > > > 
> > > > > It depends on the underlying storage hardware. In other words: you
> > > > > have to
> > > > > try increasing the 'msize' value to a point where you no longer notice
> > > > > a
> > > > > negative performance impact (or almost). Which is fortunately quite
> > > > > easy
> > > > > to test on>
> > > > > 
> > > > > guest like:
> > > > > 	dd if=/dev/zero of=test.dat bs=1G count=12
> > > > > 	time cat test.dat > /dev/null
> > > > > 
> > > > > I would start with an absolute minimum msize of 10MB. I would
> > > > > recommend
> > > > > something around 100MB maybe for a mechanical hard drive. With a PCIe
> > > > > flash
> > > > > you probably would rather pick several hundred MB or even more.
> > > > > 
> > > > > That unpleasant 'msize' issue is a limitation of the 9p protocol:
> > > > > client
> > > > > (guest) must suggest the value of msize on connection to server
> > > > > (host).
> > > > > Server can only lower, but not raise it. And the client in turn
> > > > > obviously
> > > > > cannot see host's storage device(s), so client is unable to pick a
> > > > > good
> > > > > value by itself. So it's a suboptimal handshake issue right now.
> > > > 
> > > > It doesn't seem to be making a vast difference here:
> > > > 
> > > > 
> > > > 
> > > > 9p mount -t 9p -o trans=virtio kernel /mnt
> > > > -oversion=9p2000.L,cache=mmap,msize=104857600
> > > > 
> > > > Run status group 0 (all jobs):
> > > >    READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s
> > > >    (65.6MB/s-65.6MB/s),
> > > > 
> > > > io=3070MiB (3219MB), run=49099-49099msec WRITE: bw=20.9MiB/s (21.9MB/s),
> > > > 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=1026MiB (1076MB),
> > > > run=49099-49099msec
> > > > 
> > > > 9p mount -t 9p -o trans=virtio kernel /mnt
> > > > -oversion=9p2000.L,cache=mmap,msize=1048576000
> > > > 
> > > > Run status group 0 (all jobs):
> > > >    READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s
> > > >    (68.3MB/s-68.3MB/s),
> > > > 
> > > > io=3070MiB (3219MB), run=47104-47104msec WRITE: bw=21.8MiB/s (22.8MB/s),
> > > > 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), io=1026MiB (1076MB),
> > > > run=47104-47104msec
> > > > 
> > > > 
> > > > Dave
> > > 
> > > Is that benchmark tool honoring 'iounit' to automatically run with max.
> > > I/O
> > > chunk sizes? What's that benchmark tool actually? And do you also see no
> > > improvement with a simple
> > > 
> > > 	time cat largefile.dat > /dev/null
> > 
> > I am assuming that msize only helps with sequential I/O and not random
> > I/O.
> > 
> > Dave is running random read and random write mix and probably that's why
> > he is not seeing any improvement with msize increase.
> > 
> > If we run sequential workload (as "cat largefile.dat"), that should
> > see an improvement with msize increase.
> > 
> > Thanks
> > Vivek
> 
> Depends on what's randomized. If read chunk size is randomized, then yes, you 
> would probably see less performance increase compared to a simple
> 'cat foo.dat'.

We are using "fio" for testing and the read chunk size is not being
randomized. The chunk size (block size) is fixed at 4K for these tests.

> 
> If only the read position is randomized, but the read chunk size honors 
> iounit, a.k.a. stat's st_blksize (i.e. reading with the most efficient block 
> size advertised by 9P), then I would assume still seeing a performance 
> increase.

Yes, we are randomizing the read position. But there is no notion of looking
at st_blksize; it's fixed at 4K (notice the --bs=4k option on the fio
command line).
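
For reference, a roughly comparable standalone fio invocation (illustrative
parameters only, not the exact job file from virtiofs-tests, and the mount
point is a placeholder) would be something like:

	fio --name=randrw-psync --ioengine=psync --rw=randrw --bs=4k \
	    --size=4G --runtime=60 --time_based --directory=/mnt/virtiofs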

> Because seeking is a no/low cost factor in this case. The guest OS 
> seeking does not transmit a 9p message. The offset is rather passed with any 
> Tread message instead:
> https://github.com/chaos/diod/blob/master/protocol.md
> 
> I mean, yes, random seeks reduce I/O performance in general of course, but in 
> direct performance comparison, the difference in overhead of the 9p vs. 
> virtiofs network controller layer is most probably the most relevant aspect if 
> large I/O chunk sizes are used.
> 

Agreed that a large I/O chunk size will help with the performance numbers.
But the idea is to intentionally use a smaller I/O chunk size in some of
the tests to measure how efficient the communication path is.
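
(As a rough illustration: ~40 MB/s at a 4 KiB block size already means on
the order of 10k requests per second, so per-request overhead rather than
raw bandwidth is what these numbers end up measuring.)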

Thanks
Vivek



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
@ 2020-09-29 13:49                             ` Vivek Goyal
  0 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-29 13:49 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Venegas Munoz, Jose Carlos, qemu-devel, cdupontd, virtio-fs-list,
	Shinde, Archana M

On Tue, Sep 29, 2020 at 03:28:06PM +0200, Christian Schoenebeck wrote:
> On Dienstag, 29. September 2020 15:03:25 CEST Vivek Goyal wrote:
> > On Sun, Sep 27, 2020 at 02:14:43PM +0200, Christian Schoenebeck wrote:
> > > On Freitag, 25. September 2020 20:51:47 CEST Dr. David Alan Gilbert wrote:
> > > > * Christian Schoenebeck (qemu_oss@crudebyte.com) wrote:
> > > > > On Freitag, 25. September 2020 15:05:38 CEST Dr. David Alan Gilbert 
> wrote:
> > > > > > > > 9p ( mount -t 9p -o trans=virtio kernel /mnt
> > > > > > > > -oversion=9p2000.L,cache=mmap,msize=1048576 ) test: (g=0):
> > > > > > > > rw=randrw,
> > > > > > > 
> > > > > > > Bottleneck ------------------------------^
> > > > > > > 
> > > > > > > By increasing 'msize' you would encounter better 9P I/O results.
> > > > > > 
> > > > > > OK, I thought that was bigger than the default;  what number should
> > > > > > I
> > > > > > use?
> > > > > 
> > > > > It depends on the underlying storage hardware. In other words: you
> > > > > have to
> > > > > try increasing the 'msize' value to a point where you no longer notice
> > > > > a
> > > > > negative performance impact (or almost). Which is fortunately quite
> > > > > easy
> > > > > to test on>
> > > > > 
> > > > > guest like:
> > > > > 	dd if=/dev/zero of=test.dat bs=1G count=12
> > > > > 	time cat test.dat > /dev/null
> > > > > 
> > > > > I would start with an absolute minimum msize of 10MB. I would
> > > > > recommend
> > > > > something around 100MB maybe for a mechanical hard drive. With a PCIe
> > > > > flash
> > > > > you probably would rather pick several hundred MB or even more.
> > > > > 
> > > > > That unpleasant 'msize' issue is a limitation of the 9p protocol:
> > > > > client
> > > > > (guest) must suggest the value of msize on connection to server
> > > > > (host).
> > > > > Server can only lower, but not raise it. And the client in turn
> > > > > obviously
> > > > > cannot see host's storage device(s), so client is unable to pick a
> > > > > good
> > > > > value by itself. So it's a suboptimal handshake issue right now.
> > > > 
> > > > It doesn't seem to be making a vast difference here:
> > > > 
> > > > 
> > > > 
> > > > 9p mount -t 9p -o trans=virtio kernel /mnt
> > > > -oversion=9p2000.L,cache=mmap,msize=104857600
> > > > 
> > > > Run status group 0 (all jobs):
> > > >    READ: bw=62.5MiB/s (65.6MB/s), 62.5MiB/s-62.5MiB/s
> > > >    (65.6MB/s-65.6MB/s),
> > > > 
> > > > io=3070MiB (3219MB), run=49099-49099msec WRITE: bw=20.9MiB/s (21.9MB/s),
> > > > 20.9MiB/s-20.9MiB/s (21.9MB/s-21.9MB/s), io=1026MiB (1076MB),
> > > > run=49099-49099msec
> > > > 
> > > > 9p mount -t 9p -o trans=virtio kernel /mnt
> > > > -oversion=9p2000.L,cache=mmap,msize=1048576000
> > > > 
> > > > Run status group 0 (all jobs):
> > > >    READ: bw=65.2MiB/s (68.3MB/s), 65.2MiB/s-65.2MiB/s
> > > >    (68.3MB/s-68.3MB/s),
> > > > 
> > > > io=3070MiB (3219MB), run=47104-47104msec WRITE: bw=21.8MiB/s (22.8MB/s),
> > > > 21.8MiB/s-21.8MiB/s (22.8MB/s-22.8MB/s), io=1026MiB (1076MB),
> > > > run=47104-47104msec
> > > > 
> > > > 
> > > > Dave
> > > 
> > > Is that benchmark tool honoring 'iounit' to automatically run with max.
> > > I/O
> > > chunk sizes? What's that benchmark tool actually? And do you also see no
> > > improvement with a simple
> > > 
> > > 	time cat largefile.dat > /dev/null
> > 
> > I am assuming that msize only helps with sequential I/O and not random
> > I/O.
> > 
> > Dave is running random read and random write mix and probably that's why
> > he is not seeing any improvement with msize increase.
> > 
> > If we run sequential workload (as "cat largefile.dat"), that should
> > see an improvement with msize increase.
> > 
> > Thanks
> > Vivek
> 
> Depends on what's randomized. If read chunk size is randomized, then yes, you 
> would probably see less performance increase compared to a simple
> 'cat foo.dat'.

We are using "fio" for testing and the read chunk size is not being
randomized. The chunk size (block size) is fixed at 4K for these tests.

> 
> If only the read position is randomized, but the read chunk size honors 
> iounit, a.k.a. stat's st_blksize (i.e. reading with the most efficient block 
> size advertised by 9P), then I would assume still seeing a performance 
> increase.

Yes, we are randomizing the read position. But there is no notion of looking
at st_blksize; it's fixed at 4K (notice the --bs=4k option on the fio
command line).

> Because seeking is a no/low cost factor in this case. The guest OS 
> seeking does not transmit a 9p message. The offset is rather passed with any 
> Tread message instead:
> https://github.com/chaos/diod/blob/master/protocol.md
> 
> I mean, yes, random seeks reduce I/O performance in general of course, but in 
> direct performance comparison, the difference in overhead of the 9p vs. 
> virtiofs network controller layer is most probably the most relevant aspect if 
> large I/O chunk sizes are used.
> 

Agreed that a large I/O chunk size will help with the performance numbers.
But the idea is to intentionally use a smaller I/O chunk size in some of
the tests to measure how efficient the communication path is.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-29 13:49                             ` [Virtio-fs] " Vivek Goyal
@ 2020-09-29 13:59                               ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2020-09-29 13:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Vivek Goyal, Venegas Munoz, Jose Carlos, cdupontd,
	virtio-fs-list, Stefan Hajnoczi, Shinde, Archana M,
	Dr. David Alan Gilbert

On Dienstag, 29. September 2020 15:49:42 CEST Vivek Goyal wrote:
> > Depends on what's randomized. If read chunk size is randomized, then yes,
> > you would probably see less performance increase compared to a simple
> > 'cat foo.dat'.
> 
> We are using "fio" for testing and read chunk size is not being
> randomized. chunk size (block size) is fixed at 4K size for these tests.

Good to know, thanks!

> > If only the read position is randomized, but the read chunk size honors
> > iounit, a.k.a. stat's st_blksize (i.e. reading with the most efficient
> > block size advertised by 9P), then I would assume still seeing a
> > performance increase.
> 
> Yes, we are randomizing read position. But there is no notion of looking
> at st_blksize. Its fixed at 4K. (notice option --bs=4k in fio
> commandline).

Ah ok, then the results make sense.

With these block sizes you will indeed suffer a performance issue with 9p,
caused by several thread hops in Tread handling, which is due to be fixed.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
@ 2020-09-29 13:59                               ` Christian Schoenebeck
  0 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2020-09-29 13:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Venegas Munoz, Jose Carlos, cdupontd, virtio-fs-list, Shinde,
	Archana M, Vivek Goyal

On Dienstag, 29. September 2020 15:49:42 CEST Vivek Goyal wrote:
> > Depends on what's randomized. If read chunk size is randomized, then yes,
> > you would probably see less performance increase compared to a simple
> > 'cat foo.dat'.
> 
> We are using "fio" for testing and read chunk size is not being
> randomized. chunk size (block size) is fixed at 4K size for these tests.

Good to know, thanks!

> > If only the read position is randomized, but the read chunk size honors
> > iounit, a.k.a. stat's st_blksize (i.e. reading with the most efficient
> > block size advertised by 9P), then I would assume still seeing a
> > performance increase.
> 
> Yes, we are randomizing read position. But there is no notion of looking
> at st_blksize. Its fixed at 4K. (notice option --bs=4k in fio
> commandline).

Ah ok, then the results make sense.

With these block sizes you will indeed suffer a performance issue with 9p,
caused by several thread hops in Tread handling, which is due to be fixed.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-29 13:49                 ` Miklos Szeredi
@ 2020-09-29 14:01                   ` Vivek Goyal
  -1 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-29 14:01 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: qemu-devel, Venegas Munoz, Jose Carlos, cdupontd,
	Dr. David Alan Gilbert, virtio-fs-list, Shinde, Archana M

On Tue, Sep 29, 2020 at 03:49:04PM +0200, Miklos Szeredi wrote:
> On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > - virtiofs cache=none mode is faster than cache=auto mode for this
> >   workload.
> 
> Not sure why.  One cause could be that readahead is not perfect at
> detecting the random pattern.  Could we compare total I/O on the
> server vs. total I/O by fio?

Hi Miklos,

I will instrument the virtiofsd code to figure out total I/O.

One more potential issue I am staring at is refreshing the attrs on 
READ if fc->auto_inval_data is set.

fuse_cache_read_iter() {
        /*
         * In auto invalidate mode, always update attributes on read.
         * Otherwise, only update if we attempt to read past EOF (to ensure
         * i_size is up to date).
         */
        if (fc->auto_inval_data ||
            (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) {
                int err;
                err = fuse_update_attributes(inode, iocb->ki_filp);
                if (err)
                        return err;
        }
}

Given this is a mixed READ/WRITE workload, every WRITE will invalidate
attrs. And next READ will first do GETATTR() from server (and potentially
invalidate page cache) before doing READ.

This sounds suboptimal especially from the point of view of WRITEs
done by this client itself. I mean if another client has modified
the file, then doing GETATTR after a second makes sense. But there
should be some optimization to make sure our own WRITEs don't end
up doing GETATTR and invalidate page cache (because cache contents
are still valid).

I disabled ->auto_inval_data and that seemed to result in an 8-10%
gain in performance for this workload.

Thanks
Vivek



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
@ 2020-09-29 14:01                   ` Vivek Goyal
  0 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-29 14:01 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: qemu-devel, Venegas Munoz, Jose Carlos, cdupontd, virtio-fs-list,
	Shinde, Archana M

On Tue, Sep 29, 2020 at 03:49:04PM +0200, Miklos Szeredi wrote:
> On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > - virtiofs cache=none mode is faster than cache=auto mode for this
> >   workload.
> 
> Not sure why.  One cause could be that readahead is not perfect at
> detecting the random pattern.  Could we compare total I/O on the
> server vs. total I/O by fio?

Hi Miklos,

I will instrument the virtiofsd code to figure out total I/O.

One more potential issue I am staring at is refreshing the attrs on 
READ if fc->auto_inval_data is set.

fuse_cache_read_iter() {
        /*
         * In auto invalidate mode, always update attributes on read.
         * Otherwise, only update if we attempt to read past EOF (to ensure
         * i_size is up to date).
         */
        if (fc->auto_inval_data ||
            (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) {
                int err;
                err = fuse_update_attributes(inode, iocb->ki_filp);
                if (err)
                        return err;
        }
}

Given this is a mixed READ/WRITE workload, every WRITE will invalidate
attrs. And next READ will first do GETATTR() from server (and potentially
invalidate page cache) before doing READ.

This sounds suboptimal especially from the point of view of WRITEs
done by this client itself. I mean if another client has modified
the file, then doing GETATTR after a second makes sense. But there
should be some optimization to make sure our own WRITEs don't end
up doing GETATTR and invalidate page cache (because cache contents
are still valid).

I disabled ->auto_inval_data and that seemed to result in an 8-10%
gain in performance for this workload.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-29 14:01                   ` Vivek Goyal
@ 2020-09-29 14:54                     ` Miklos Szeredi
  -1 siblings, 0 replies; 107+ messages in thread
From: Miklos Szeredi @ 2020-09-29 14:54 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: qemu-devel, Venegas Munoz, Jose Carlos, cdupontd,
	Dr. David Alan Gilbert, virtio-fs-list, Shinde, Archana M

On Tue, Sep 29, 2020 at 4:01 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Sep 29, 2020 at 03:49:04PM +0200, Miklos Szeredi wrote:
> > On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > > - virtiofs cache=none mode is faster than cache=auto mode for this
> > >   workload.
> >
> > Not sure why.  One cause could be that readahead is not perfect at
> > detecting the random pattern.  Could we compare total I/O on the
> > server vs. total I/O by fio?
>
> Hi Miklos,
>
> I will instrument virtiosd code to figure out total I/O.
>
> One more potential issue I am staring at is refreshing the attrs on
> READ if fc->auto_inval_data is set.
>
> fuse_cache_read_iter() {
>         /*
>          * In auto invalidate mode, always update attributes on read.
>          * Otherwise, only update if we attempt to read past EOF (to ensure
>          * i_size is up to date).
>          */
>         if (fc->auto_inval_data ||
>             (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) {
>                 int err;
>                 err = fuse_update_attributes(inode, iocb->ki_filp);
>                 if (err)
>                         return err;
>         }
> }
>
> Given this is a mixed READ/WRITE workload, every WRITE will invalidate
> attrs. And next READ will first do GETATTR() from server (and potentially
> invalidate page cache) before doing READ.
>
> This sounds suboptimal especially from the point of view of WRITEs
> done by this client itself. I mean if another client has modified
> the file, then doing GETATTR after a second makes sense. But there
> should be some optimization to make sure our own WRITEs don't end
> up doing GETATTR and invalidate page cache (because cache contents
> are still valid).

Yeah, that sucks.

> I disabled ->auto_invalid_data and that seemed to result in 8-10%
> gain in performance for this workload.

Need to wrap my head around these caching issues.

Thanks,
Miklos



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
@ 2020-09-29 14:54                     ` Miklos Szeredi
  0 siblings, 0 replies; 107+ messages in thread
From: Miklos Szeredi @ 2020-09-29 14:54 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: qemu-devel, Venegas Munoz, Jose Carlos, cdupontd, virtio-fs-list,
	Shinde, Archana M

On Tue, Sep 29, 2020 at 4:01 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Sep 29, 2020 at 03:49:04PM +0200, Miklos Szeredi wrote:
> > On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > > - virtiofs cache=none mode is faster than cache=auto mode for this
> > >   workload.
> >
> > Not sure why.  One cause could be that readahead is not perfect at
> > detecting the random pattern.  Could we compare total I/O on the
> > server vs. total I/O by fio?
>
> Hi Miklos,
>
> I will instrument virtiosd code to figure out total I/O.
>
> One more potential issue I am staring at is refreshing the attrs on
> READ if fc->auto_inval_data is set.
>
> fuse_cache_read_iter() {
>         /*
>          * In auto invalidate mode, always update attributes on read.
>          * Otherwise, only update if we attempt to read past EOF (to ensure
>          * i_size is up to date).
>          */
>         if (fc->auto_inval_data ||
>             (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) {
>                 int err;
>                 err = fuse_update_attributes(inode, iocb->ki_filp);
>                 if (err)
>                         return err;
>         }
> }
>
> Given this is a mixed READ/WRITE workload, every WRITE will invalidate
> attrs. And next READ will first do GETATTR() from server (and potentially
> invalidate page cache) before doing READ.
>
> This sounds suboptimal especially from the point of view of WRITEs
> done by this client itself. I mean if another client has modified
> the file, then doing GETATTR after a second makes sense. But there
> should be some optimization to make sure our own WRITEs don't end
> up doing GETATTR and invalidate page cache (because cache contents
> are still valid).

Yeah, that sucks.

> I disabled ->auto_invalid_data and that seemed to result in 8-10%
> gain in performance for this workload.

Need to wrap my head around these caching issues.

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
  2020-09-29 13:49                 ` Miklos Szeredi
@ 2020-09-29 15:28                   ` Vivek Goyal
  -1 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-29 15:28 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: qemu-devel, Venegas Munoz, Jose Carlos, cdupontd,
	Dr. David Alan Gilbert, virtio-fs-list, Shinde, Archana M

On Tue, Sep 29, 2020 at 03:49:04PM +0200, Miklos Szeredi wrote:
> On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > - virtiofs cache=none mode is faster than cache=auto mode for this
> >   workload.
> 
> Not sure why.  One cause could be that readahead is not perfect at
> detecting the random pattern.  Could we compare total I/O on the
> server vs. total I/O by fio?

Ran tests with auto_inval_data disabled and compared with other results.

NAME                    WORKLOAD                BW(READ/WRITE)  IOPS(READ/WRITE)
vtfs-auto-ex-randrw     randrw-psync            27.8mb/9547kb   7136/2386
vtfs-auto-sh-randrw     randrw-psync            43.3mb/14.4mb   10.8k/3709
vtfs-auto-sh-noinval    randrw-psync            50.5mb/16.9mb   12.6k/4330
vtfs-none-sh-randrw     randrw-psync            54.1mb/18.1mb   13.5k/4649
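
(At the fixed 4 KiB block size the bandwidth and IOPS columns are consistent
with each other, e.g. 13.5k read IOPS * 4 KiB comes out at roughly the
54mb/s shown for the cache=none row.)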

With auto_inval_data disabled, this time I saw around a 20% performance jump
in READ, and it is now much closer to cache=none performance.

Thanks
Vivek



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)
@ 2020-09-29 15:28                   ` Vivek Goyal
  0 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2020-09-29 15:28 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: qemu-devel, Venegas Munoz, Jose Carlos, cdupontd, virtio-fs-list,
	Shinde, Archana M

On Tue, Sep 29, 2020 at 03:49:04PM +0200, Miklos Szeredi wrote:
> On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > - virtiofs cache=none mode is faster than cache=auto mode for this
> >   workload.
> 
> Not sure why.  One cause could be that readahead is not perfect at
> detecting the random pattern.  Could we compare total I/O on the
> server vs. total I/O by fio?

Ran tests with auto_inval_data disabled and compared with other results.

NAME                    WORKLOAD                BW(READ/WRITE)  IOPS(READ/WRITE)
vtfs-auto-ex-randrw     randrw-psync            27.8mb/9547kb   7136/2386
vtfs-auto-sh-randrw     randrw-psync            43.3mb/14.4mb   10.8k/3709
vtfs-auto-sh-noinval    randrw-psync            50.5mb/16.9mb   12.6k/4330
vtfs-none-sh-randrw     randrw-psync            54.1mb/18.1mb   13.5k/4649

With auto_inval_data disabled, this time I saw around a 20% performance jump
in READ, and it is now much closer to cache=none performance.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2020-09-25  8:06             ` [Virtio-fs] " Christian Schoenebeck
@ 2021-02-19 16:08               ` Vivek Goyal
  -1 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2021-02-19 16:08 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Shinde, Archana M, Venegas Munoz, Jose Carlos, qemu-devel,
	Dr. David Alan Gilbert, virtio-fs-list, Greg Kurz,
	Stefan Hajnoczi, cdupontd

On Fri, Sep 25, 2020 at 10:06:41AM +0200, Christian Schoenebeck wrote:
> On Freitag, 25. September 2020 00:10:23 CEST Vivek Goyal wrote:
> > In my testing, with cache=none, virtiofs performed better than 9p in
> > all the fio jobs I was running. For the case of cache=auto  for virtiofs
> > (with xattr enabled), 9p performed better in certain write workloads. I
> > have identified root cause of that problem and working on
> > HANDLE_KILLPRIV_V2 patches to improve WRITE performance of virtiofs
> > with cache=auto and xattr enabled.
> 
> Please note, when it comes to performance aspects, you should set a reasonable 
> high value for 'msize' on 9p client side:
> https://wiki.qemu.org/Documentation/9psetup#msize

Hi Christian,

I am not able to set msize to a higher value. If I try to specify msize
16MB, and then read back msize from /proc/mounts, it seems to cap it
at 512000. Is that intended?

$ mount -t 9p -o trans=virtio,version=9p2000.L,cache=none,msize=16777216 hostShared /mnt/virtio-9p

$ cat /proc/mounts | grep 9p
hostShared /mnt/virtio-9p 9p rw,sync,dirsync,relatime,access=client,msize=512000,trans=virtio 0 0

I am using 5.11 kernel.

Thanks
Vivek



^ permalink raw reply	[flat|nested] 107+ messages in thread

* [Virtio-fs] Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
@ 2021-02-19 16:08               ` Vivek Goyal
  0 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2021-02-19 16:08 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Shinde, Archana M, Venegas Munoz, Jose Carlos, qemu-devel,
	virtio-fs-list, cdupontd

On Fri, Sep 25, 2020 at 10:06:41AM +0200, Christian Schoenebeck wrote:
> On Freitag, 25. September 2020 00:10:23 CEST Vivek Goyal wrote:
> > In my testing, with cache=none, virtiofs performed better than 9p in
> > all the fio jobs I was running. For the case of cache=auto  for virtiofs
> > (with xattr enabled), 9p performed better in certain write workloads. I
> > have identified root cause of that problem and working on
> > HANDLE_KILLPRIV_V2 patches to improve WRITE performance of virtiofs
> > with cache=auto and xattr enabled.
> 
> Please note, when it comes to performance aspects, you should set a reasonable 
> high value for 'msize' on 9p client side:
> https://wiki.qemu.org/Documentation/9psetup#msize

Hi Christian,

I am not able to set msize to a higher value. If I try to specify msize
16MB, and then read back msize from /proc/mounts, it seems to cap it
at 512000. Is that intended?

$ mount -t 9p -o trans=virtio,version=9p2000.L,cache=none,msize=16777216 hostShared /mnt/virtio-9p

$ cat /proc/mounts | grep 9p
hostShared /mnt/virtio-9p 9p rw,sync,dirsync,relatime,access=client,msize=512000,trans=virtio 0 0

I am using 5.11 kernel.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2021-02-19 16:08               ` [Virtio-fs] " Vivek Goyal
@ 2021-02-19 17:33                 ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2021-02-19 17:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Vivek Goyal, Shinde, Archana M, Venegas Munoz, Jose Carlos,
	Dr. David Alan Gilbert, virtio-fs-list, Greg Kurz,
	Stefan Hajnoczi, cdupontd

On Freitag, 19. Februar 2021 17:08:48 CET Vivek Goyal wrote:
> On Fri, Sep 25, 2020 at 10:06:41AM +0200, Christian Schoenebeck wrote:
> > On Freitag, 25. September 2020 00:10:23 CEST Vivek Goyal wrote:
> > > In my testing, with cache=none, virtiofs performed better than 9p in
> > > all the fio jobs I was running. For the case of cache=auto  for virtiofs
> > > (with xattr enabled), 9p performed better in certain write workloads. I
> > > have identified root cause of that problem and working on
> > > HANDLE_KILLPRIV_V2 patches to improve WRITE performance of virtiofs
> > > with cache=auto and xattr enabled.
> > 
> > Please note, when it comes to performance aspects, you should set a
> > reasonable high value for 'msize' on 9p client side:
> > https://wiki.qemu.org/Documentation/9psetup#msize
> 
> Hi Christian,
> 
> I am not able to set msize to a higher value. If I try to specify msize
> 16MB, and then read back msize from /proc/mounts, it sees to cap it
> at 512000. Is that intended?

9p server side in QEMU does not perform any msize capping. The code in this
case is very simple, it's just what you see in function v9fs_version():

https://github.com/qemu/qemu/blob/6de76c5f324904c93e69f9a1e8e4fd0bd6f6b57a/hw/9pfs/9p.c#L1332

> $ mount -t 9p -o trans=virtio,version=9p2000.L,cache=none,msize=16777216
> hostShared /mnt/virtio-9p
> 
> $ cat /proc/mounts | grep 9p
> hostShared /mnt/virtio-9p 9p
> rw,sync,dirsync,relatime,access=client,msize=512000,trans=virtio 0 0
> 
> I am using 5.11 kernel.

Must be something on the client (guest kernel) side. I don't see this
happening here with guest kernel 4.9.0 in a quick test with my setup:

$ cat /etc/mtab | grep 9p
svnRoot / 9p rw,dirsync,relatime,trans=virtio,version=9p2000.L,msize=104857600,cache=mmap 0 0
$ 

Looks like the root cause of your issue is this:

struct p9_client *p9_client_create(const char *dev_name, char *options)
{
	...
	if (clnt->msize > clnt->trans_mod->maxsize)
		clnt->msize = clnt->trans_mod->maxsize;

https://github.com/torvalds/linux/blob/f40ddce88593482919761f74910f42f4b84c004b/net/9p/client.c#L1045

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
@ 2021-02-19 17:33                 ` Christian Schoenebeck
  0 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2021-02-19 17:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: cdupontd, Venegas Munoz, Jose Carlos, virtio-fs-list, Shinde,
	Archana M, Vivek Goyal

On Freitag, 19. Februar 2021 17:08:48 CET Vivek Goyal wrote:
> On Fri, Sep 25, 2020 at 10:06:41AM +0200, Christian Schoenebeck wrote:
> > On Freitag, 25. September 2020 00:10:23 CEST Vivek Goyal wrote:
> > > In my testing, with cache=none, virtiofs performed better than 9p in
> > > all the fio jobs I was running. For the case of cache=auto  for virtiofs
> > > (with xattr enabled), 9p performed better in certain write workloads. I
> > > have identified root cause of that problem and working on
> > > HANDLE_KILLPRIV_V2 patches to improve WRITE performance of virtiofs
> > > with cache=auto and xattr enabled.
> > 
> > Please note, when it comes to performance aspects, you should set a
> > reasonable high value for 'msize' on 9p client side:
> > https://wiki.qemu.org/Documentation/9psetup#msize
> 
> Hi Christian,
> 
> I am not able to set msize to a higher value. If I try to specify msize
> 16MB, and then read back msize from /proc/mounts, it sees to cap it
> at 512000. Is that intended?

9p server side in QEMU does not perform any msize capping. The code in this
case is very simple, it's just what you see in function v9fs_version():

https://github.com/qemu/qemu/blob/6de76c5f324904c93e69f9a1e8e4fd0bd6f6b57a/hw/9pfs/9p.c#L1332

> $ mount -t 9p -o trans=virtio,version=9p2000.L,cache=none,msize=16777216
> hostShared /mnt/virtio-9p
> 
> $ cat /proc/mounts | grep 9p
> hostShared /mnt/virtio-9p 9p
> rw,sync,dirsync,relatime,access=client,msize=512000,trans=virtio 0 0
> 
> I am using 5.11 kernel.

Must be something on the client (guest kernel) side. I don't see this
happening here with guest kernel 4.9.0 in a quick test with my setup:

$ cat /etc/mtab | grep 9p
svnRoot / 9p rw,dirsync,relatime,trans=virtio,version=9p2000.L,msize=104857600,cache=mmap 0 0
$ 

Looks like the root cause of your issue is this:

struct p9_client *p9_client_create(const char *dev_name, char *options)
{
	...
	if (clnt->msize > clnt->trans_mod->maxsize)
		clnt->msize = clnt->trans_mod->maxsize;

https://github.com/torvalds/linux/blob/f40ddce88593482919761f74910f42f4b84c004b/net/9p/client.c#L1045

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2021-02-19 17:33                 ` [Virtio-fs] " Christian Schoenebeck
@ 2021-02-19 19:01                   ` Vivek Goyal
  -1 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2021-02-19 19:01 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: cdupontd, Venegas Munoz, Jose Carlos, Greg Kurz, qemu-devel,
	virtio-fs-list, Stefan Hajnoczi, Shinde, Archana M,
	Dr. David Alan Gilbert

On Fri, Feb 19, 2021 at 06:33:46PM +0100, Christian Schoenebeck wrote:
> On Freitag, 19. Februar 2021 17:08:48 CET Vivek Goyal wrote:
> > On Fri, Sep 25, 2020 at 10:06:41AM +0200, Christian Schoenebeck wrote:
> > > On Freitag, 25. September 2020 00:10:23 CEST Vivek Goyal wrote:
> > > > In my testing, with cache=none, virtiofs performed better than 9p in
> > > > all the fio jobs I was running. For the case of cache=auto  for virtiofs
> > > > (with xattr enabled), 9p performed better in certain write workloads. I
> > > > have identified root cause of that problem and working on
> > > > HANDLE_KILLPRIV_V2 patches to improve WRITE performance of virtiofs
> > > > with cache=auto and xattr enabled.
> > > 
> > > Please note, when it comes to performance aspects, you should set a
> > > reasonable high value for 'msize' on 9p client side:
> > > https://wiki.qemu.org/Documentation/9psetup#msize
> > 
> > Hi Christian,
> > 
> > I am not able to set msize to a higher value. If I try to specify msize
> > 16MB, and then read back msize from /proc/mounts, it sees to cap it
> > at 512000. Is that intended?
> 
> 9p server side in QEMU does not perform any msize capping. The code in this
> case is very simple, it's just what you see in function v9fs_version():
> 
> https://github.com/qemu/qemu/blob/6de76c5f324904c93e69f9a1e8e4fd0bd6f6b57a/hw/9pfs/9p.c#L1332
> 
> > $ mount -t 9p -o trans=virtio,version=9p2000.L,cache=none,msize=16777216
> > hostShared /mnt/virtio-9p
> > 
> > $ cat /proc/mounts | grep 9p
> > hostShared /mnt/virtio-9p 9p
> > rw,sync,dirsync,relatime,access=client,msize=512000,trans=virtio 0 0
> > 
> > I am using 5.11 kernel.
> 
> Must be something on client (guest kernel) side. I don't see this here with
> guest kernel 4.9.0 happening with my setup in a quick test:
> 
> $ cat /etc/mtab | grep 9p
> svnRoot / 9p rw,dirsync,relatime,trans=virtio,version=9p2000.L,msize=104857600,cache=mmap 0 0
> $ 
> 
> Looks like the root cause of your issue is this:
> 
> struct p9_client *p9_client_create(const char *dev_name, char *options)
> {
> 	...
> 	if (clnt->msize > clnt->trans_mod->maxsize)
> 		clnt->msize = clnt->trans_mod->maxsize;
> 
> https://github.com/torvalds/linux/blob/f40ddce88593482919761f74910f42f4b84c004b/net/9p/client.c#L1045

That was introduced by a patch in 2011.

commit c9ffb05ca5b5098d6ea468c909dd384d90da7d54
Author: Venkateswararao Jujjuri (JV) <jvrao@linux.vnet.ibm.com>
Date:   Wed Jun 29 18:06:33 2011 -0700

    net/9p: Fix the msize calculation.

    msize represents the maximum PDU size that includes P9_IOHDRSZ.


Your kernel 4.9 is newer than this, so most likely you have this commit
too. I will spend some time later trying to debug this.

Vivek



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
@ 2021-02-19 19:01                   ` Vivek Goyal
  0 siblings, 0 replies; 107+ messages in thread
From: Vivek Goyal @ 2021-02-19 19:01 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: cdupontd, Venegas Munoz, Jose Carlos, qemu-devel, virtio-fs-list,
	Shinde, Archana M

On Fri, Feb 19, 2021 at 06:33:46PM +0100, Christian Schoenebeck wrote:
> On Freitag, 19. Februar 2021 17:08:48 CET Vivek Goyal wrote:
> > On Fri, Sep 25, 2020 at 10:06:41AM +0200, Christian Schoenebeck wrote:
> > > On Freitag, 25. September 2020 00:10:23 CEST Vivek Goyal wrote:
> > > > In my testing, with cache=none, virtiofs performed better than 9p in
> > > > all the fio jobs I was running. For the case of cache=auto  for virtiofs
> > > > (with xattr enabled), 9p performed better in certain write workloads. I
> > > > have identified root cause of that problem and working on
> > > > HANDLE_KILLPRIV_V2 patches to improve WRITE performance of virtiofs
> > > > with cache=auto and xattr enabled.
> > > 
> > > Please note, when it comes to performance aspects, you should set a
> > > reasonable high value for 'msize' on 9p client side:
> > > https://wiki.qemu.org/Documentation/9psetup#msize
> > 
> > Hi Christian,
> > 
> > I am not able to set msize to a higher value. If I try to specify msize
> > 16MB, and then read back msize from /proc/mounts, it sees to cap it
> > at 512000. Is that intended?
> 
> 9p server side in QEMU does not perform any msize capping. The code in this
> case is very simple, it's just what you see in function v9fs_version():
> 
> https://github.com/qemu/qemu/blob/6de76c5f324904c93e69f9a1e8e4fd0bd6f6b57a/hw/9pfs/9p.c#L1332
> 
> > $ mount -t 9p -o trans=virtio,version=9p2000.L,cache=none,msize=16777216
> > hostShared /mnt/virtio-9p
> > 
> > $ cat /proc/mounts | grep 9p
> > hostShared /mnt/virtio-9p 9p
> > rw,sync,dirsync,relatime,access=client,msize=512000,trans=virtio 0 0
> > 
> > I am using 5.11 kernel.
> 
> Must be something on client (guest kernel) side. I don't see this here with
> guest kernel 4.9.0 happening with my setup in a quick test:
> 
> $ cat /etc/mtab | grep 9p
> svnRoot / 9p rw,dirsync,relatime,trans=virtio,version=9p2000.L,msize=104857600,cache=mmap 0 0
> $ 
> 
> Looks like the root cause of your issue is this:
> 
> struct p9_client *p9_client_create(const char *dev_name, char *options)
> {
> 	...
> 	if (clnt->msize > clnt->trans_mod->maxsize)
> 		clnt->msize = clnt->trans_mod->maxsize;
> 
> https://github.com/torvalds/linux/blob/f40ddce88593482919761f74910f42f4b84c004b/net/9p/client.c#L1045

That was introduced by a patch in 2011.

commit c9ffb05ca5b5098d6ea468c909dd384d90da7d54
Author: Venkateswararao Jujjuri (JV) <jvrao@linux.vnet.ibm.com>
Date:   Wed Jun 29 18:06:33 2011 -0700

    net/9p: Fix the msize calculation.

    msize represents the maximum PDU size that includes P9_IOHDRSZ.


Your kernel 4.9 is newer than this, so most likely you have this commit
too. I will spend some time later trying to debug this.

Vivek


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2021-02-19 19:01                   ` [Virtio-fs] " Vivek Goyal
@ 2021-02-20 15:38                     ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2021-02-20 15:38 UTC (permalink / raw)
  To: qemu-devel
  Cc: Vivek Goyal, cdupontd, Venegas Munoz, Jose Carlos, Greg Kurz,
	virtio-fs-list, Stefan Hajnoczi, Shinde, Archana M,
	Dr. David Alan Gilbert

On Freitag, 19. Februar 2021 20:01:12 CET Vivek Goyal wrote:
> On Fri, Feb 19, 2021 at 06:33:46PM +0100, Christian Schoenebeck wrote:
> > On Freitag, 19. Februar 2021 17:08:48 CET Vivek Goyal wrote:
> > > On Fri, Sep 25, 2020 at 10:06:41AM +0200, Christian Schoenebeck wrote:
> > > > On Freitag, 25. September 2020 00:10:23 CEST Vivek Goyal wrote:
> > > > > In my testing, with cache=none, virtiofs performed better than 9p in
> > > > > all the fio jobs I was running. For the case of cache=auto  for
> > > > > virtiofs
> > > > > (with xattr enabled), 9p performed better in certain write
> > > > > workloads. I
> > > > > have identified root cause of that problem and working on
> > > > > HANDLE_KILLPRIV_V2 patches to improve WRITE performance of virtiofs
> > > > > with cache=auto and xattr enabled.
> > > > 
> > > > Please note, when it comes to performance aspects, you should set a
> > > > reasonable high value for 'msize' on 9p client side:
> > > > https://wiki.qemu.org/Documentation/9psetup#msize
> > > 
> > > Hi Christian,
> > > 
> > > I am not able to set msize to a higher value. If I try to specify msize
> > > 16MB, and then read back msize from /proc/mounts, it sees to cap it
> > > at 512000. Is that intended?
> > 
> > 9p server side in QEMU does not perform any msize capping. The code in
> > this
> > case is very simple, it's just what you see in function v9fs_version():
> > 
> > https://github.com/qemu/qemu/blob/6de76c5f324904c93e69f9a1e8e4fd0bd6f6b57a
> > /hw/9pfs/9p.c#L1332> 
> > > $ mount -t 9p -o trans=virtio,version=9p2000.L,cache=none,msize=16777216
> > > hostShared /mnt/virtio-9p
> > > 
> > > $ cat /proc/mounts | grep 9p
> > > hostShared /mnt/virtio-9p 9p
> > > rw,sync,dirsync,relatime,access=client,msize=512000,trans=virtio 0 0
> > > 
> > > I am using 5.11 kernel.
> > 
> > Must be something on client (guest kernel) side. I don't see this here
> > with
> > guest kernel 4.9.0 happening with my setup in a quick test:
> > 
> > $ cat /etc/mtab | grep 9p
> > svnRoot / 9p
> > rw,dirsync,relatime,trans=virtio,version=9p2000.L,msize=104857600,cache=m
> > map 0 0 $
> > 
> > Looks like the root cause of your issue is this:
> > 
> > struct p9_client *p9_client_create(const char *dev_name, char *options)
> > {
> > 
> > 	...
> > 	if (clnt->msize > clnt->trans_mod->maxsize)
> > 	
> > 		clnt->msize = clnt->trans_mod->maxsize;
> > 
> > https://github.com/torvalds/linux/blob/f40ddce88593482919761f74910f42f4b84
> > c004b/net/9p/client.c#L1045
> That was introduced by a patch 2011.
> 
> commit c9ffb05ca5b5098d6ea468c909dd384d90da7d54
> Author: Venkateswararao Jujjuri (JV) <jvrao@linux.vnet.ibm.com>
> Date:   Wed Jun 29 18:06:33 2011 -0700
> 
>     net/9p: Fix the msize calculation.
> 
>     msize represents the maximum PDU size that includes P9_IOHDRSZ.
> 
> 
> You kernel 4.9 is newer than this. So most likely you have this commit
> too. I will spend some time later trying to debug this.
> 
> Vivek

As the kernel code says trans_mod->maxsize, maybe it's something in virtio
on the QEMU side that does an automatic step back for some reason. I don't
see anything in the 9pfs virtio transport driver (hw/9pfs/virtio-9p-device.c
on QEMU side) that would do this, so I would also need to dig deeper.
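
One guest-side candidate worth checking (I have not verified this against a
5.11 tree, so treat it as an assumption) would be the 9p virtio transport
itself, net/9p/trans_virtio.c, which appears to register:

#define VIRTQUEUE_NUM	128
...
static struct p9_trans_module p9_virtio_trans = {
	.name = "virtio",
	...
	.maxsize = PAGE_SIZE * (VIRTQUEUE_NUM - 3),
	...
};

With a 4 KiB PAGE_SIZE that would be 4096 * (128 - 3) = 512000 bytes, which
matches the capped msize value exactly.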

Do you have some RAM limitation in your setup somewhere?

For comparison, this is how I started the VM:

~/git/qemu/build/qemu-system-x86_64 \
-machine pc,accel=kvm,usb=off,dump-guest-core=off -m 2048 \
-smp 4,sockets=4,cores=1,threads=1 -rtc base=utc \
-boot strict=on -kernel /home/bee/vm/stretch/boot/vmlinuz-4.9.0-13-amd64 \
-initrd /home/bee/vm/stretch/boot/initrd.img-4.9.0-13-amd64 \
-append 'root=svnRoot rw rootfstype=9p rootflags=trans=virtio,version=9p2000.L,msize=104857600,cache=mmap console=ttyS0' \
-fsdev local,security_model=mapped,multidevs=remap,id=fsdev-fs0,path=/home/bee/vm/stretch/ \
-device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=svnRoot \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-nographic

So the guest system is running entirely and solely on top of 9pfs (as root fs)
and hence it's mounted by the command line above, i.e. immediately when the
guest is booted, and RAM size is set to 2 GB.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
@ 2021-02-20 15:38                     ` Christian Schoenebeck
  0 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2021-02-20 15:38 UTC (permalink / raw)
  To: qemu-devel
  Cc: Venegas Munoz, Jose Carlos, cdupontd, virtio-fs-list, Shinde,
	Archana M, Vivek Goyal

On Freitag, 19. Februar 2021 20:01:12 CET Vivek Goyal wrote:
> On Fri, Feb 19, 2021 at 06:33:46PM +0100, Christian Schoenebeck wrote:
> > On Freitag, 19. Februar 2021 17:08:48 CET Vivek Goyal wrote:
> > > On Fri, Sep 25, 2020 at 10:06:41AM +0200, Christian Schoenebeck wrote:
> > > > On Freitag, 25. September 2020 00:10:23 CEST Vivek Goyal wrote:
> > > > > In my testing, with cache=none, virtiofs performed better than 9p in
> > > > > all the fio jobs I was running. For the case of cache=auto  for
> > > > > virtiofs
> > > > > (with xattr enabled), 9p performed better in certain write
> > > > > workloads. I
> > > > > have identified root cause of that problem and working on
> > > > > HANDLE_KILLPRIV_V2 patches to improve WRITE performance of virtiofs
> > > > > with cache=auto and xattr enabled.
> > > > 
> > > > Please note, when it comes to performance aspects, you should set a
> > > > reasonable high value for 'msize' on 9p client side:
> > > > https://wiki.qemu.org/Documentation/9psetup#msize
> > > 
> > > Hi Christian,
> > > 
> > > I am not able to set msize to a higher value. If I try to specify msize
> > > 16MB, and then read back msize from /proc/mounts, it sees to cap it
> > > at 512000. Is that intended?
> > 
> > 9p server side in QEMU does not perform any msize capping. The code in
> > this
> > case is very simple, it's just what you see in function v9fs_version():
> > 
> > https://github.com/qemu/qemu/blob/6de76c5f324904c93e69f9a1e8e4fd0bd6f6b57a
> > /hw/9pfs/9p.c#L1332> 
> > > $ mount -t 9p -o trans=virtio,version=9p2000.L,cache=none,msize=16777216
> > > hostShared /mnt/virtio-9p
> > > 
> > > $ cat /proc/mounts | grep 9p
> > > hostShared /mnt/virtio-9p 9p
> > > rw,sync,dirsync,relatime,access=client,msize=512000,trans=virtio 0 0
> > > 
> > > I am using 5.11 kernel.
> > 
> > Must be something on client (guest kernel) side. I don't see this here
> > with
> > guest kernel 4.9.0 happening with my setup in a quick test:
> > 
> > $ cat /etc/mtab | grep 9p
> > svnRoot / 9p
> > rw,dirsync,relatime,trans=virtio,version=9p2000.L,msize=104857600,cache=m
> > map 0 0 $
> > 
> > Looks like the root cause of your issue is this:
> > 
> > struct p9_client *p9_client_create(const char *dev_name, char *options)
> > {
> > 
> > 	...
> > 	if (clnt->msize > clnt->trans_mod->maxsize)
> > 	
> > 		clnt->msize = clnt->trans_mod->maxsize;
> > 
> > https://github.com/torvalds/linux/blob/f40ddce88593482919761f74910f42f4b84
> > c004b/net/9p/client.c#L1045
> That was introduced by a patch 2011.
> 
> commit c9ffb05ca5b5098d6ea468c909dd384d90da7d54
> Author: Venkateswararao Jujjuri (JV) <jvrao@linux.vnet.ibm.com>
> Date:   Wed Jun 29 18:06:33 2011 -0700
> 
>     net/9p: Fix the msize calculation.
> 
>     msize represents the maximum PDU size that includes P9_IOHDRSZ.
> 
> 
> You kernel 4.9 is newer than this. So most likely you have this commit
> too. I will spend some time later trying to debug this.
> 
> Vivek

As the kernel code says trans_mod->maxsize, maybe it's something in virtio
on the QEMU side that does an automatic step back for some reason. I don't
see anything in the 9pfs virtio transport driver (hw/9pfs/virtio-9p-device.c
on QEMU side) that would do this, so I would also need to dig deeper.

Do you have some RAM limitation in your setup somewhere?

For comparison, this is how I started the VM:

~/git/qemu/build/qemu-system-x86_64 \
-machine pc,accel=kvm,usb=off,dump-guest-core=off -m 2048 \
-smp 4,sockets=4,cores=1,threads=1 -rtc base=utc \
-boot strict=on -kernel /home/bee/vm/stretch/boot/vmlinuz-4.9.0-13-amd64 \
-initrd /home/bee/vm/stretch/boot/initrd.img-4.9.0-13-amd64 \
-append 'root=svnRoot rw rootfstype=9p rootflags=trans=virtio,version=9p2000.L,msize=104857600,cache=mmap console=ttyS0' \
-fsdev local,security_model=mapped,multidevs=remap,id=fsdev-fs0,path=/home/bee/vm/stretch/ \
-device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=svnRoot \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-nographic

So the guest system is running entirely and solely on top of 9pfs (as root fs)
and hence it's mounted by the command line above, i.e. immediately when the
guest is booted, and RAM size is set to 2 GB.

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2021-02-20 15:38                     ` [Virtio-fs] " Christian Schoenebeck
@ 2021-02-22 12:18                       ` Greg Kurz
  -1 siblings, 0 replies; 107+ messages in thread
From: Greg Kurz @ 2021-02-22 12:18 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Venegas Munoz, Jose Carlos, qemu-devel, cdupontd, virtio-fs-list,
	Dr. David Alan Gilbert, Stefan Hajnoczi, Shinde, Archana M,
	Vivek Goyal

On Sat, 20 Feb 2021 16:38:35 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Freitag, 19. Februar 2021 20:01:12 CET Vivek Goyal wrote:
> > On Fri, Feb 19, 2021 at 06:33:46PM +0100, Christian Schoenebeck wrote:
> > > On Freitag, 19. Februar 2021 17:08:48 CET Vivek Goyal wrote:
> > > > On Fri, Sep 25, 2020 at 10:06:41AM +0200, Christian Schoenebeck wrote:
> > > > > On Freitag, 25. September 2020 00:10:23 CEST Vivek Goyal wrote:
> > > > > > In my testing, with cache=none, virtiofs performed better than 9p in
> > > > > > all the fio jobs I was running. For the case of cache=auto  for
> > > > > > virtiofs
> > > > > > (with xattr enabled), 9p performed better in certain write
> > > > > > workloads. I
> > > > > > have identified root cause of that problem and working on
> > > > > > HANDLE_KILLPRIV_V2 patches to improve WRITE performance of virtiofs
> > > > > > with cache=auto and xattr enabled.
> > > > > 
> > > > > Please note, when it comes to performance aspects, you should set a
> > > > > reasonable high value for 'msize' on 9p client side:
> > > > > https://wiki.qemu.org/Documentation/9psetup#msize
> > > > 
> > > > Hi Christian,
> > > > 
> > > > I am not able to set msize to a higher value. If I try to specify msize
> > > > 16MB, and then read back msize from /proc/mounts, it sees to cap it
> > > > at 512000. Is that intended?
> > > 
> > > 9p server side in QEMU does not perform any msize capping. The code in
> > > this
> > > case is very simple, it's just what you see in function v9fs_version():
> > > 
> > > https://github.com/qemu/qemu/blob/6de76c5f324904c93e69f9a1e8e4fd0bd6f6b57a
> > > /hw/9pfs/9p.c#L1332> 
> > > > $ mount -t 9p -o trans=virtio,version=9p2000.L,cache=none,msize=16777216
> > > > hostShared /mnt/virtio-9p
> > > > 
> > > > $ cat /proc/mounts | grep 9p
> > > > hostShared /mnt/virtio-9p 9p
> > > > rw,sync,dirsync,relatime,access=client,msize=512000,trans=virtio 0 0
> > > > 
> > > > I am using 5.11 kernel.
> > > 
> > > Must be something on client (guest kernel) side. I don't see this here
> > > with
> > > guest kernel 4.9.0 happening with my setup in a quick test:
> > > 
> > > $ cat /etc/mtab | grep 9p
> > > svnRoot / 9p
> > > rw,dirsync,relatime,trans=virtio,version=9p2000.L,msize=104857600,cache=m
> > > map 0 0 $
> > > 
> > > Looks like the root cause of your issue is this:
> > > 
> > > struct p9_client *p9_client_create(const char *dev_name, char *options)
> > > {
> > > 
> > > 	...
> > > 	if (clnt->msize > clnt->trans_mod->maxsize)
> > > 	
> > > 		clnt->msize = clnt->trans_mod->maxsize;
> > > 
> > > https://github.com/torvalds/linux/blob/f40ddce88593482919761f74910f42f4b84
> > > c004b/net/9p/client.c#L1045
> > That was introduced by a patch 2011.
> > 
> > commit c9ffb05ca5b5098d6ea468c909dd384d90da7d54
> > Author: Venkateswararao Jujjuri (JV) <jvrao@linux.vnet.ibm.com>
> > Date:   Wed Jun 29 18:06:33 2011 -0700
> > 
> >     net/9p: Fix the msize calculation.
> > 
> >     msize represents the maximum PDU size that includes P9_IOHDRSZ.
> > 
> > 
> > You kernel 4.9 is newer than this. So most likely you have this commit
> > too. I will spend some time later trying to debug this.
> > 
> > Vivek
> 

Hi Vivek and Christian,

I can reproduce this with an up-to-date Fedora Rawhide guest.

Capping comes from here:

net/9p/trans_virtio.c:  .maxsize = PAGE_SIZE * (VIRTQUEUE_NUM - 3),

i.e. 4096 * (128 - 3) == 512000
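
Putting the two quoted snippets together, the negotiation effectively boils
down to something like this (a condensed, illustrative sketch only -- not the
actual kernel source; the constants are the ones shown above):

/* Illustrative sketch: how the transport's maxsize ends up capping the
 * user's msize= mount option, per the snippets quoted in this thread. */
#define VIRTQUEUE_NUM   128
#define PAGE_SIZE_9P    4096        /* PAGE_SIZE on this x86-64 guest */

static unsigned int effective_msize(unsigned int requested_msize)
{
	unsigned int maxsize = PAGE_SIZE_9P * (VIRTQUEUE_NUM - 3);  /* 512000 */

	/* what p9_client_create() does with whatever msize= was requested */
	if (requested_msize > maxsize)
		requested_msize = maxsize;
	return requested_msize;
}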

AFAICT this has been around since 2011, i.e. for as long as I have been a
maintainer, and I admit I had never tried such high msize settings
before.

commit b49d8b5d7007a673796f3f99688b46931293873e
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Wed Aug 17 16:56:04 2011 +0000

    net/9p: Fix kernel crash with msize 512K
    
    With msize equal to 512K (PAGE_SIZE * VIRTQUEUE_NUM), we hit multiple
    crashes. This patch fix those.
    
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
    Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>

The changelog doesn't help much, but it looks like it was a band-aid
for some more severe issues.

> As the kernel code sais trans_mod->maxsize, maybe its something in virtio on
> qemu side that does an automatic step back for some reason. I don't see
> something in the 9pfs virtio transport driver (hw/9pfs/virtio-9p-device.c on
> QEMU side) that would do this, so I would also need to dig deeper.
> 
> Do you have some RAM limitation in your setup somewhere?
> 
> For comparison, this is how I started the VM:
> 
> ~/git/qemu/build/qemu-system-x86_64 \
> -machine pc,accel=kvm,usb=off,dump-guest-core=off -m 2048 \
> -smp 4,sockets=4,cores=1,threads=1 -rtc base=utc \
> -boot strict=on -kernel /home/bee/vm/stretch/boot/vmlinuz-4.9.0-13-amd64 \
> -initrd /home/bee/vm/stretch/boot/initrd.img-4.9.0-13-amd64 \
> -append 'root=svnRoot rw rootfstype=9p rootflags=trans=virtio,version=9p2000.L,msize=104857600,cache=mmap console=ttyS0' \

The first obvious difference I see between your setup and mine is that
you're mounting the 9pfs as root from the kernel command line. Maybe, for
some reason, this has an impact on the check in p9_client_create()?

Can you reproduce with a scenario like Vivek's?

> -fsdev local,security_model=mapped,multidevs=remap,id=fsdev-fs0,path=/home/bee/vm/stretch/ \
> -device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=svnRoot \
> -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
> -nographic
> 
> So the guest system is running entirely and solely on top of 9pfs (as root fs)
> and hence it's mounted by above's CL i.e. immediately when the guest is
> booted, and RAM size is set to 2 GB.
> 
> Best regards,
> Christian Schoenebeck
> 
> 



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2021-02-22 12:18                       ` [Virtio-fs] " Greg Kurz
@ 2021-02-22 15:08                         ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2021-02-22 15:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: Greg Kurz, Venegas Munoz, Jose Carlos, cdupontd, virtio-fs-list,
	Dr. David Alan Gilbert, Stefan Hajnoczi, Shinde, Archana M,
	Vivek Goyal

On Montag, 22. Februar 2021 13:18:14 CET Greg Kurz wrote:
> On Sat, 20 Feb 2021 16:38:35 +0100
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > On Freitag, 19. Februar 2021 20:01:12 CET Vivek Goyal wrote:
> > > On Fri, Feb 19, 2021 at 06:33:46PM +0100, Christian Schoenebeck wrote:
> > > > On Freitag, 19. Februar 2021 17:08:48 CET Vivek Goyal wrote:
> > > > > On Fri, Sep 25, 2020 at 10:06:41AM +0200, Christian Schoenebeck 
wrote:
> > > > > > On Freitag, 25. September 2020 00:10:23 CEST Vivek Goyal wrote:
> > > > > > > In my testing, with cache=none, virtiofs performed better than
> > > > > > > 9p in
> > > > > > > all the fio jobs I was running. For the case of cache=auto  for
> > > > > > > virtiofs
> > > > > > > (with xattr enabled), 9p performed better in certain write
> > > > > > > workloads. I
> > > > > > > have identified root cause of that problem and working on
> > > > > > > HANDLE_KILLPRIV_V2 patches to improve WRITE performance of
> > > > > > > virtiofs
> > > > > > > with cache=auto and xattr enabled.
> > > > > > 
> > > > > > Please note, when it comes to performance aspects, you should set
> > > > > > a
> > > > > > reasonable high value for 'msize' on 9p client side:
> > > > > > https://wiki.qemu.org/Documentation/9psetup#msize
> > > > > 
> > > > > Hi Christian,
> > > > > 
> > > > > I am not able to set msize to a higher value. If I try to specify
> > > > > msize
> > > > > 16MB, and then read back msize from /proc/mounts, it sees to cap it
> > > > > at 512000. Is that intended?
> > > > 
> > > > 9p server side in QEMU does not perform any msize capping. The code in
> > > > this
> > > > case is very simple, it's just what you see in function
> > > > v9fs_version():
> > > > 
> > > > https://github.com/qemu/qemu/blob/6de76c5f324904c93e69f9a1e8e4fd0bd6f6
> > > > b57a
> > > > /hw/9pfs/9p.c#L1332>
> > > > 
> > > > > $ mount -t 9p -o
> > > > > trans=virtio,version=9p2000.L,cache=none,msize=16777216
> > > > > hostShared /mnt/virtio-9p
> > > > > 
> > > > > $ cat /proc/mounts | grep 9p
> > > > > hostShared /mnt/virtio-9p 9p
> > > > > rw,sync,dirsync,relatime,access=client,msize=512000,trans=virtio 0 0
> > > > > 
> > > > > I am using 5.11 kernel.
> > > > 
> > > > Must be something on client (guest kernel) side. I don't see this here
> > > > with
> > > > guest kernel 4.9.0 happening with my setup in a quick test:
> > > > 
> > > > $ cat /etc/mtab | grep 9p
> > > > svnRoot / 9p
> > > > rw,dirsync,relatime,trans=virtio,version=9p2000.L,msize=104857600,cach
> > > > e=m
> > > > map 0 0 $
> > > > 
> > > > Looks like the root cause of your issue is this:
> > > > 
> > > > struct p9_client *p9_client_create(const char *dev_name, char
> > > > *options)
> > > > {
> > > > 
> > > > 	...
> > > > 	if (clnt->msize > clnt->trans_mod->maxsize)
> > > > 	
> > > > 		clnt->msize = clnt->trans_mod->maxsize;
> > > > 
> > > > https://github.com/torvalds/linux/blob/f40ddce88593482919761f74910f42f
> > > > 4b84
> > > > c004b/net/9p/client.c#L1045
> > > 
> > > That was introduced by a patch 2011.
> > > 
> > > commit c9ffb05ca5b5098d6ea468c909dd384d90da7d54
> > > Author: Venkateswararao Jujjuri (JV) <jvrao@linux.vnet.ibm.com>
> > > Date:   Wed Jun 29 18:06:33 2011 -0700
> > > 
> > >     net/9p: Fix the msize calculation.
> > >     
> > >     msize represents the maximum PDU size that includes P9_IOHDRSZ.
> > > 
> > > You kernel 4.9 is newer than this. So most likely you have this commit
> > > too. I will spend some time later trying to debug this.
> > > 
> > > Vivek
> 
> Hi Vivek and Christian,
> 
> I reproduce with an up-to-date fedora rawhide guest.
> 
> Capping comes from here:
> 
> net/9p/trans_virtio.c:  .maxsize = PAGE_SIZE * (VIRTQUEUE_NUM - 3),
> 
> i.e. 4096 * (128 - 3) == 512000
> 
> AFAICT this has been around since 2011, i.e. always for me as a
> maintainer and I admit I had never tried such high msize settings
> before.
> 
> commit b49d8b5d7007a673796f3f99688b46931293873e
> Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Date:   Wed Aug 17 16:56:04 2011 +0000
> 
>     net/9p: Fix kernel crash with msize 512K
> 
>     With msize equal to 512K (PAGE_SIZE * VIRTQUEUE_NUM), we hit multiple
>     crashes. This patch fix those.
> 
>     Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>     Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
> 
> Changelog doesn't help much but it looks like it was a bandaid
> for some more severe issues.

I have never had a kernel crash when booting a Linux guest with a 9pfs root
fs and 100 MiB msize. Should we ask the virtio or 9p Linux client maintainers
if they can add some info on what this is about?

> > As the kernel code sais trans_mod->maxsize, maybe its something in virtio
> > on qemu side that does an automatic step back for some reason. I don't
> > see something in the 9pfs virtio transport driver
> > (hw/9pfs/virtio-9p-device.c on QEMU side) that would do this, so I would
> > also need to dig deeper.
> > 
> > Do you have some RAM limitation in your setup somewhere?
> > 
> > For comparison, this is how I started the VM:
> > 
> > ~/git/qemu/build/qemu-system-x86_64 \
> > -machine pc,accel=kvm,usb=off,dump-guest-core=off -m 2048 \
> > -smp 4,sockets=4,cores=1,threads=1 -rtc base=utc \
> > -boot strict=on -kernel /home/bee/vm/stretch/boot/vmlinuz-4.9.0-13-amd64 \
> > -initrd /home/bee/vm/stretch/boot/initrd.img-4.9.0-13-amd64 \
> > -append 'root=svnRoot rw rootfstype=9p
> > rootflags=trans=virtio,version=9p2000.L,msize=104857600,cache=mmap
> > console=ttyS0' \
> First obvious difference I see between your setup and mine is that
> you're mounting the 9pfs as root from the kernel command line. For
> some reason, maybe this has an impact on the check in p9_client_create() ?
> 
> Can you reproduce with a scenario like Vivek's one ?

Yep, confirmed. If I boot a guest from an image file first and then try to
manually mount a 9pfs share after the guest has booted, then I indeed get that
msize capping at just 512 kiB as well. That's far too small. :/

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2021-02-22 15:08                         ` [Virtio-fs] " Christian Schoenebeck
@ 2021-02-22 17:11                           ` Greg Kurz
  -1 siblings, 0 replies; 107+ messages in thread
From: Greg Kurz @ 2021-02-22 17:11 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Shinde, Archana M, Venegas Munoz, Jose Carlos, qemu-devel,
	Dr. David Alan Gilbert, virtio-fs-list, Stefan Hajnoczi,
	cdupontd, Vivek Goyal

On Mon, 22 Feb 2021 16:08:04 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

[...]

> I did not ever have a kernel crash when I boot a Linux guest with a 9pfs root 
> fs and 100 MiB msize.

Interesting.

> Should we ask virtio or 9p Linux client maintainers if 
> they can add some info what this is about?
> 

Probably worth trying that first, even if I'm not sure anyone has an
answer for that, since all the people who worked on virtio-9p at
the time have somehow deserted the project.

> > > As the kernel code sais trans_mod->maxsize, maybe its something in virtio
> > > on qemu side that does an automatic step back for some reason. I don't
> > > see something in the 9pfs virtio transport driver
> > > (hw/9pfs/virtio-9p-device.c on QEMU side) that would do this, so I would
> > > also need to dig deeper.
> > > 
> > > Do you have some RAM limitation in your setup somewhere?
> > > 
> > > For comparison, this is how I started the VM:
> > > 
> > > ~/git/qemu/build/qemu-system-x86_64 \
> > > -machine pc,accel=kvm,usb=off,dump-guest-core=off -m 2048 \
> > > -smp 4,sockets=4,cores=1,threads=1 -rtc base=utc \
> > > -boot strict=on -kernel /home/bee/vm/stretch/boot/vmlinuz-4.9.0-13-amd64 \
> > > -initrd /home/bee/vm/stretch/boot/initrd.img-4.9.0-13-amd64 \
> > > -append 'root=svnRoot rw rootfstype=9p
> > > rootflags=trans=virtio,version=9p2000.L,msize=104857600,cache=mmap
> > > console=ttyS0' \
> > First obvious difference I see between your setup and mine is that
> > you're mounting the 9pfs as root from the kernel command line. For
> > some reason, maybe this has an impact on the check in p9_client_create() ?
> > 
> > Can you reproduce with a scenario like Vivek's one ?
> 
> Yep, confirmed. If I boot a guest from an image file first and then try to 
> manually mount a 9pfs share after guest booted, then I get indeed that msize 
> capping of just 512 kiB as well. That's far too small. :/
> 

Maybe worth digging:
- why does no capping happen in your scenario?
- is the capping really needed?

Cheers,

--
Greg

> Best regards,
> Christian Schoenebeck
> 
> 



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2021-02-22 17:11                           ` [Virtio-fs] " Greg Kurz
@ 2021-02-23 13:39                             ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2021-02-23 13:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Greg Kurz, Shinde, Archana M, Venegas Munoz, Jose Carlos,
	Dr. David Alan Gilbert, virtio-fs-list, Stefan Hajnoczi,
	cdupontd, Vivek Goyal, Michael S. Tsirkin, Dominique Martinet,
	v9fs-developer

On Montag, 22. Februar 2021 18:11:59 CET Greg Kurz wrote:
> On Mon, 22 Feb 2021 16:08:04 +0100
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> 
> [...]
> 
> > I did not ever have a kernel crash when I boot a Linux guest with a 9pfs
> > root fs and 100 MiB msize.
> 
> Interesting.
> 
> > Should we ask virtio or 9p Linux client maintainers if
> > they can add some info what this is about?
> 
> Probably worth to try that first, even if I'm not sure anyone has a
> answer for that since all the people who worked on virtio-9p at
> the time have somehow deserted the project.

Michael, Dominique,

we are wondering here about the message size limitation of just 5 kiB in the 
9p Linux client (using virtio transport) which imposes a performance 
bottleneck, introduced by this kernel commit:

commit b49d8b5d7007a673796f3f99688b46931293873e
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Wed Aug 17 16:56:04 2011 +0000

    net/9p: Fix kernel crash with msize 512K
    
    With msize equal to 512K (PAGE_SIZE * VIRTQUEUE_NUM), we hit multiple
    crashes. This patch fix those.
    
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
    Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>

Is this a fundamental maximum message size that cannot be exceeded with virtio 
in general or is there another reason for this limit that still applies?

Full discussion:
https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg06343.html

> > > > As the kernel code sais trans_mod->maxsize, maybe its something in
> > > > virtio
> > > > on qemu side that does an automatic step back for some reason. I don't
> > > > see something in the 9pfs virtio transport driver
> > > > (hw/9pfs/virtio-9p-device.c on QEMU side) that would do this, so I
> > > > would
> > > > also need to dig deeper.
> > > > 
> > > > Do you have some RAM limitation in your setup somewhere?
> > > > 
> > > > For comparison, this is how I started the VM:
> > > > 
> > > > ~/git/qemu/build/qemu-system-x86_64 \
> > > > -machine pc,accel=kvm,usb=off,dump-guest-core=off -m 2048 \
> > > > -smp 4,sockets=4,cores=1,threads=1 -rtc base=utc \
> > > > -boot strict=on -kernel
> > > > /home/bee/vm/stretch/boot/vmlinuz-4.9.0-13-amd64 \
> > > > -initrd /home/bee/vm/stretch/boot/initrd.img-4.9.0-13-amd64 \
> > > > -append 'root=svnRoot rw rootfstype=9p
> > > > rootflags=trans=virtio,version=9p2000.L,msize=104857600,cache=mmap
> > > > console=ttyS0' \
> > > 
> > > First obvious difference I see between your setup and mine is that
> > > you're mounting the 9pfs as root from the kernel command line. For
> > > some reason, maybe this has an impact on the check in p9_client_create()
> > > ?
> > > 
> > > Can you reproduce with a scenario like Vivek's one ?
> > 
> > Yep, confirmed. If I boot a guest from an image file first and then try to
> > manually mount a 9pfs share after guest booted, then I get indeed that
> > msize capping of just 512 kiB as well. That's far too small. :/
> 
> Maybe worth digging :
> - why no capping happens in your scenario ?

Because I was wrong.

I just figured even in the 9p rootfs scenario it does indeed cap msize to 5kiB 
as well. The output of /etc/mtab on guest side was fooling me. I debugged this 
on 9p server side and the Linux 9p client always connects with a max. msize of 
5 kiB, no matter what you do.

> - is capping really needed ?
> 
> Cheers,

That's a good question, and it probably depends on whether there is a
limitation on the virtio side, which I don't have an answer for. Maybe Michael
or Dominique can answer this.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2021-02-23 13:39                             ` [Virtio-fs] " Christian Schoenebeck
@ 2021-02-23 14:07                               ` Michael S. Tsirkin
  -1 siblings, 0 replies; 107+ messages in thread
From: Michael S. Tsirkin @ 2021-02-23 14:07 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: cdupontd, Dominique Martinet, Venegas Munoz, Jose Carlos,
	qemu-devel, Dr. David Alan Gilbert, virtio-fs-list, Greg Kurz,
	Stefan Hajnoczi, v9fs-developer, Shinde, Archana M, Vivek Goyal

On Tue, Feb 23, 2021 at 02:39:48PM +0100, Christian Schoenebeck wrote:
> On Montag, 22. Februar 2021 18:11:59 CET Greg Kurz wrote:
> > On Mon, 22 Feb 2021 16:08:04 +0100
> > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > 
> > [...]
> > 
> > > I did not ever have a kernel crash when I boot a Linux guest with a 9pfs
> > > root fs and 100 MiB msize.
> > 
> > Interesting.
> > 
> > > Should we ask virtio or 9p Linux client maintainers if
> > > they can add some info what this is about?
> > 
> > Probably worth to try that first, even if I'm not sure anyone has a
> > answer for that since all the people who worked on virtio-9p at
> > the time have somehow deserted the project.
> 
> Michael, Dominique,
> 
> we are wondering here about the message size limitation of just 5 kiB in the 
> 9p Linux client (using virtio transport) which imposes a performance 
> bottleneck, introduced by this kernel commit:
> 
> commit b49d8b5d7007a673796f3f99688b46931293873e
> Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Date:   Wed Aug 17 16:56:04 2011 +0000
> 
>     net/9p: Fix kernel crash with msize 512K
>     
>     With msize equal to 512K (PAGE_SIZE * VIRTQUEUE_NUM), we hit multiple
>     crashes. This patch fix those.
>     
>     Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>     Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>

Well the change I see is:

-       .maxsize = PAGE_SIZE*VIRTQUEUE_NUM,
+       .maxsize = PAGE_SIZE * (VIRTQUEUE_NUM - 3),


so how come you say it changes 512K to 5K?
Looks more like 500K to me.

> Is this a fundamental maximum message size that cannot be exceeded with virtio 
> in general or is there another reason for this limit that still applies?
> 
> Full discussion:
> https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg06343.html
> 
> > > > > As the kernel code sais trans_mod->maxsize, maybe its something in
> > > > > virtio
> > > > > on qemu side that does an automatic step back for some reason. I don't
> > > > > see something in the 9pfs virtio transport driver
> > > > > (hw/9pfs/virtio-9p-device.c on QEMU side) that would do this, so I
> > > > > would
> > > > > also need to dig deeper.
> > > > > 
> > > > > Do you have some RAM limitation in your setup somewhere?
> > > > > 
> > > > > For comparison, this is how I started the VM:
> > > > > 
> > > > > ~/git/qemu/build/qemu-system-x86_64 \
> > > > > -machine pc,accel=kvm,usb=off,dump-guest-core=off -m 2048 \
> > > > > -smp 4,sockets=4,cores=1,threads=1 -rtc base=utc \
> > > > > -boot strict=on -kernel
> > > > > /home/bee/vm/stretch/boot/vmlinuz-4.9.0-13-amd64 \
> > > > > -initrd /home/bee/vm/stretch/boot/initrd.img-4.9.0-13-amd64 \
> > > > > -append 'root=svnRoot rw rootfstype=9p
> > > > > rootflags=trans=virtio,version=9p2000.L,msize=104857600,cache=mmap
> > > > > console=ttyS0' \
> > > > 
> > > > First obvious difference I see between your setup and mine is that
> > > > you're mounting the 9pfs as root from the kernel command line. For
> > > > some reason, maybe this has an impact on the check in p9_client_create()
> > > > ?
> > > > 
> > > > Can you reproduce with a scenario like Vivek's one ?
> > > 
> > > Yep, confirmed. If I boot a guest from an image file first and then try to
> > > manually mount a 9pfs share after guest booted, then I get indeed that
> > > msize capping of just 512 kiB as well. That's far too small. :/
> > 
> > Maybe worth digging :
> > - why no capping happens in your scenario ?
> 
> Because I was wrong.
> 
> I just figured even in the 9p rootfs scenario it does indeed cap msize to 5kiB 
> as well. The output of /etc/mtab on guest side was fooling me. I debugged this 
> on 9p server side and the Linux 9p client always connects with a max. msize of 
> 5 kiB, no matter what you do.
> 
> > - is capping really needed ?
> > 
> > Cheers,
> 
> That's a good question and probably depends on whether there is a limitation 
> on virtio side, which I don't have an answer for. Maybe Michael or Dominique 
> can answer this.
> 
> Best regards,
> Christian Schoenebeck
> 



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2021-02-23 14:07                               ` [Virtio-fs] " Michael S. Tsirkin
@ 2021-02-24 15:16                                 ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2021-02-24 15:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Greg Kurz, Shinde, Archana M, Venegas Munoz,
	Jose Carlos, Dr. David Alan Gilbert, virtio-fs-list,
	Stefan Hajnoczi, cdupontd, Vivek Goyal, Dominique Martinet,
	v9fs-developer

On Dienstag, 23. Februar 2021 15:07:31 CET Michael S. Tsirkin wrote:
> > Michael, Dominique,
> > 
> > we are wondering here about the message size limitation of just 5 kiB in
> > the 9p Linux client (using virtio transport) which imposes a performance
> > bottleneck, introduced by this kernel commit:
> > 
> > commit b49d8b5d7007a673796f3f99688b46931293873e
> > Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> > Date:   Wed Aug 17 16:56:04 2011 +0000
> > 
> >     net/9p: Fix kernel crash with msize 512K
> >     
> >     With msize equal to 512K (PAGE_SIZE * VIRTQUEUE_NUM), we hit multiple
> >     crashes. This patch fix those.
> >     
> >     Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> >     Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
> 
> Well the change I see is:
> 
> -       .maxsize = PAGE_SIZE*VIRTQUEUE_NUM,
> +       .maxsize = PAGE_SIZE * (VIRTQUEUE_NUM - 3),
> 
> 
> so how come you say it changes 512K to 5K?
> Looks more like 500K to me.

Misapprehension + typo(s) in my previous message, sorry Michael. That's 500k 
of course (not 5k), yes.

Let me rephrase that question: are you aware of anything in virtio that would
per se mandate an absolute hard-coded message size limit (e.g. from the virtio
spec's perspective or maybe some compatibility issue)?

If not, we would try getting rid of that hard-coded limit of the 9p client on
the kernel side in the first place, because the kernel's 9p client already has
a dynamic runtime option 'msize', and that hard-coded enforced limit (500k) is
a performance bottleneck, like I said.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2021-02-24 15:16                                 ` [Virtio-fs] " Christian Schoenebeck
@ 2021-02-24 15:43                                   ` Dominique Martinet
  -1 siblings, 0 replies; 107+ messages in thread
From: Dominique Martinet @ 2021-02-24 15:43 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: cdupontd, Michael S. Tsirkin, Venegas Munoz, Jose Carlos,
	Greg Kurz, qemu-devel, virtio-fs-list, Vivek Goyal,
	Stefan Hajnoczi, v9fs-developer, Shinde, Archana M,
	Dr. David Alan Gilbert

Christian Schoenebeck wrote on Wed, Feb 24, 2021 at 04:16:52PM +0100:
> Misapprehension + typo(s) in my previous message, sorry Michael. That's 500k 
> of course (not 5k), yes.
> 
> Let me rephrase that question: are you aware of something in virtio that would 
> per se mandate an absolute hard coded message size limit (e.g. from virtio 
> specs perspective or maybe some compatibility issue)?
> 
> If not, we would try getting rid of that hard coded limit of the 9p client on 
> kernel side in the first place, because the kernel's 9p client already has a 
> dynamic runtime option 'msize' and that hard coded enforced limit (500k) is a 
> performance bottleneck like I said.

We could probably set it at init time through virtio_max_dma_size(vdev)
like virtio_blk does (I just tried and get 2^64 so we can probably
expect virtually no limit there)
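
Roughly something like this in the transport's probe path (completely
untested sketch just to show the idea; virtio_max_dma_size() is the helper
virtio_blk uses, the surrounding names are from net/9p/trans_virtio.c as I
remember them, and the SZ_128M ceiling is an arbitrary assumption):

/* Untested sketch, not a patch: derive the transport's maxsize from the
 * device's DMA limit at probe time instead of hard-coding
 * PAGE_SIZE * (VIRTQUEUE_NUM - 3). */
static int p9_virtio_probe(struct virtio_device *vdev)
{
	size_t dma_max = virtio_max_dma_size(vdev);   /* reported 2^64 here */

	/* keep some sane ceiling so we don't promise the moon to userspace */
	p9_virtio_trans.maxsize = min_t(size_t, dma_max, SZ_128M);

	/* ... the existing probe code would continue unchanged ... */
	return 0;
}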

I'm not too familiar with virtio; feel free to try, and if it works send
me a patch -- the size drop from 512k to 500k is old enough that things
have probably changed in the background since then.


On the 9p side itself, unrelated to virtio, we don't want to make it
*too* big as the client code doesn't use any scatter-gather and will
want to allocate upfront contiguous buffers of the size that got
negotiated -- that can get ugly quite fast, but we can leave it up to
users to decide.
One of my very-long-term goals would be to tend to that; if someone has
cycles to work on it I'd gladly review any patch in that area.
A possible implementation path would be to have each transport declare
whether it supports it or not and handle it accordingly until all
transports have migrated, so one wouldn't need to care about e.g. rdma or xen
if you don't have hardware to test in the short term.
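
To make the scatter-gather concern above concrete, the allocation pattern is
roughly this (simplified sketch only, not the actual client code; the names
and flags here are assumptions):

/* Simplified sketch: without scatter-gather, every in-flight request wants
 * contiguous send/receive buffers of about msize bytes, so a very large
 * msize turns into very large contiguous allocations. */
struct sketch_req {
	void *sendbuf;   /* ~msize bytes, contiguous */
	void *recvbuf;   /* ~msize bytes, contiguous */
};

static int sketch_req_init(struct sketch_req *req, size_t msize)
{
	req->sendbuf = kmalloc(msize, GFP_NOFS);
	req->recvbuf = kmalloc(msize, GFP_NOFS);
	if (!req->sendbuf || !req->recvbuf)
		return -ENOMEM;   /* a 100 MiB contiguous allocation is likely to fail */
	return 0;
}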

The next best thing would be David's netfs helpers and sending
concurrent requests if you use cache, but that's not merged yet either
so it'll be a few cycles as well.


Cheers,
-- 
Dominique


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2021-02-24 15:43                                   ` [Virtio-fs] " Dominique Martinet
@ 2021-02-26 13:49                                     ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2021-02-26 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dominique Martinet, cdupontd, Michael S. Tsirkin, Venegas Munoz,
	Jose Carlos, Greg Kurz, virtio-fs-list, Vivek Goyal,
	Stefan Hajnoczi, v9fs-developer, Shinde, Archana M,
	Dr. David Alan Gilbert

On Mittwoch, 24. Februar 2021 16:43:57 CET Dominique Martinet wrote:
> Christian Schoenebeck wrote on Wed, Feb 24, 2021 at 04:16:52PM +0100:
> > Misapprehension + typo(s) in my previous message, sorry Michael. That's
> > 500k of course (not 5k), yes.
> > 
> > Let me rephrase that question: are you aware of something in virtio that
> > would per se mandate an absolute hard coded message size limit (e.g. from
> > virtio specs perspective or maybe some compatibility issue)?
> > 
> > If not, we would try getting rid of that hard coded limit of the 9p client
> > on kernel side in the first place, because the kernel's 9p client already
> > has a dynamic runtime option 'msize' and that hard coded enforced limit
> > (500k) is a performance bottleneck like I said.
> 
> We could probably set it at init time through virtio_max_dma_size(vdev)
> like virtio_blk does (I just tried and get 2^64 so we can probably
> expect virtually no limit there)
> 
> I'm not too familiar with virtio, feel free to try and if it works send
> me a patch -- the size drop from 512 to 500k is old enough that things
> probably have changed in the background since then.

Yes, agreed. I'm neither too familiar with virtio nor with the Linux 9p
client code yet. For that reason I consider a minimally invasive change as
a first step at least. AFAICS a "split virtqueue" setup is currently used:

https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006

Right now the client uses a hard-coded number of 128 elements. So what about
replacing VIRTQUEUE_NUM with a variable that is initialized according to the
user's requested 'msize' option at init time?

According to the virtio specs the maximum number of elements in a virtqueue
is 32768. So 32768 * 4k = 128M as a new upper limit would already be a
significant improvement and would not require too many changes to the client
code, right?
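
A minimal sketch of that idea (the names are illustrative; only
DIV_ROUND_UP, PAGE_SIZE and min_t are the real kernel macros):

/* Sketch only: derive the per-request element count from the requested
 * msize instead of the hard-coded VIRTQUEUE_NUM (128), clamped to the
 * spec maximum of 32768 descriptors. The device's actual queue size may
 * be smaller and would also have to be respected. */
#define P9_VIRTIO_MAX_DESCS     32768U          /* virtio spec limit */

static unsigned int p9_virtio_nr_descs(unsigned int msize)
{
        /* one descriptor per page of payload plus a little header room */
        unsigned int nr = DIV_ROUND_UP(msize, PAGE_SIZE) + 3;

        return min_t(unsigned int, nr, P9_VIRTIO_MAX_DESCS);
}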

> On the 9p side itself, unrelated to virtio, we don't want to make it
> *too* big as the client code doesn't use any scatter-gather and will
> want to allocate upfront contiguous buffers of the size that got
> negotiated -- that can get ugly quite fast, but we can leave it up to
> users to decide.

By "ugly" do you just mean that it occupies this memory for good as long as
the driver is loaded, or is there some runtime performance penalty to be
aware of as well?

> One of my very-long-term goal would be to tend to that, if someone has
> cycles to work on it I'd gladly review any patch in that area.
> A possible implementation path would be to have transport define
> themselves if they support it or not and handle it accordingly until all
> transports migrated, so one wouldn't need to care about e.g. rdma or xen
> if you don't have hardware to test in the short term.

Sounds like something that Greg suggested before for a slightly different,
though related, issue: right now the default 'msize' on the Linux client side
is 8k, which really hurts performance-wise as virtually all 9p messages have
to be split into a huge number of request and response messages. OTOH you
don't want to set this default value too high. So Greg noted that virtio could
suggest a default msize, i.e. a value that would suit the host's storage
hardware appropriately.

> The next best thing would be David's netfs helpers and sending
> concurrent requests if you use cache, but that's not merged yet either
> so it'll be a few cycles as well.

So right now the Linux client is always just handling one request at a time;
it sends a 9p request and waits for its response before processing the next
request?

If so, is there a reason to limit the planned concurrent request handling
feature to one of the cached modes? I mean, ordering of requests is already
handled on the 9p server side, so the client could just pass all messages on
in a lightweight way and assume the server takes care of it.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2021-02-26 13:49                                     ` [Virtio-fs] " Christian Schoenebeck
@ 2021-02-27  0:03                                       ` Dominique Martinet
  -1 siblings, 0 replies; 107+ messages in thread
From: Dominique Martinet @ 2021-02-27  0:03 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Shinde, Archana M, Michael S. Tsirkin, Venegas Munoz,
	Jose Carlos, Greg Kurz, qemu-devel, virtio-fs-list,
	Dr. David Alan Gilbert, Stefan Hajnoczi, v9fs-developer,
	cdupontd, Vivek Goyal

Christian Schoenebeck wrote on Fri, Feb 26, 2021 at 02:49:12PM +0100:
> Right now the client uses a hard coded amount of 128 elements. So what about
> replacing VIRTQUEUE_NUM by a variable which is initialized with a value
> according to the user's requested 'msize' option at init time?
> 
> According to the virtio specs the max. amount of elements in a virtqueue is
> 32768. So 32768 * 4k = 128M as new upper limit would already be a significant
> improvement and would not require too many changes to the client code, right?

The current code inits chan->sg at probe time (when the driver is loaded)
and not at mount time, and it is currently embedded in the chan struct, so
that would need allocating at mount time (p9_client_create; either resizing
if required or not sharing), but it doesn't sound too intrusive, yes.

I don't see any other dependencies on VIRTQUEUE_NUM that would get in the
way of trying this.
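
A rough sketch of that direction, assuming the sg array and its size become
members that can be set per mount (kcalloc and sg_init_table are the real
kernel helpers; the function and the sg/sg_n field layout are illustrative):

/* Sketch only: size and allocate the scatterlist at mount time from the
 * negotiated msize, instead of the fixed VIRTQUEUE_NUM array embedded in
 * the chan struct at probe time. */
static int p9_virtio_alloc_sg(struct virtio_chan *chan, unsigned int msize)
{
        unsigned int n = DIV_ROUND_UP(msize, PAGE_SIZE) + 3;
        struct scatterlist *sg;

        sg = kcalloc(n, sizeof(*sg), GFP_KERNEL);
        if (!sg)
                return -ENOMEM;

        sg_init_table(sg, n);
        chan->sg = sg;
        chan->sg_n = n;
        return 0;
}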

> > On the 9p side itself, unrelated to virtio, we don't want to make it
> > *too* big as the client code doesn't use any scatter-gather and will
> > want to allocate upfront contiguous buffers of the size that got
> > negotiated -- that can get ugly quite fast, but we can leave it up to
> > users to decide.
> 
> With ugly you just mean that it's occupying this memory for good as long as
> the driver is loaded, or is there some runtime performance penalty as well to
> be aware of?

The main problem is memory fragmentation, see /proc/buddyinfo on various
systems.
After a fresh boot memory is quite clean and there is no problem
allocating 2MB contiguous buffers, but after a while depending on the
workload it can be hard to even allocate large buffers.
I've had that problem at work in the past with an RDMA driver that wanted
to allocate 256KB and could get that to fail quite reliably with our
workload, so it really depends on what the client does.

In the 9p case, the memory used to be allocated for good and per client
(= mountpoint), so if you had 15 9p mounts that could each do e.g. 32
requests in parallel with 1MB buffers, you could lock up 500MB of idling
RAM. I changed that to a dedicated slab a while ago, so that should no
longer be so much of a problem -- the slab will keep the buffers around
as well if used frequently, so the performance hit wasn't bad even for
larger msizes.
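
For reference, the pattern looks roughly like this (a sketch; the actual
identifiers and flags in net/9p/client.c may differ):

/* Sketch of the dedicated-slab approach described above: buffers of the
 * negotiated msize are recycled by the slab instead of being pinned per
 * mount or pulled from the page allocator on every request. */
static struct kmem_cache *p9_fcall_cache;

static int p9_fcall_cache_init(unsigned int msize)
{
        p9_fcall_cache = kmem_cache_create("9p-fcall-cache", msize, 0,
                                           SLAB_RECLAIM_ACCOUNT, NULL);
        return p9_fcall_cache ? 0 : -ENOMEM;
}

static void *p9_fcall_alloc(gfp_t gfp)
{
        return kmem_cache_alloc(p9_fcall_cache, gfp);
}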


> > One of my very-long-term goal would be to tend to that, if someone has
> > cycles to work on it I'd gladly review any patch in that area.
> > A possible implementation path would be to have transport define
> > themselves if they support it or not and handle it accordingly until all
> > transports migrated, so one wouldn't need to care about e.g. rdma or xen
> > if you don't have hardware to test in the short term.
> 
> Sounds like something that Greg suggested before for a slightly different,
> even though related issue: right now the default 'msize' on Linux client side
> is 8k, which really hurts performance wise as virtually all 9p messages have
> to be split into a huge number of request and response messages. OTOH you
> don't want to set this default value too high. So Greg noted that virtio could
> suggest a default msize, i.e. a value that would suit host's storage hardware
> appropriately.

We can definitely increase the default, for all transports in my
opinion.
As a first step, 64 or 128k?

> > The next best thing would be David's netfs helpers and sending
> > concurrent requests if you use cache, but that's not merged yet either
> > so it'll be a few cycles as well.
> 
> So right now the Linux client is always just handling one request at a time;
> it sends a 9p request and waits for its response before processing the next
> request?

Requests are handled concurrently just fine - if you have multiple
processes all doing their own thing it will all go out in parallel.

The bottleneck people generally complain about (and where things hurt)
is a single process reading: there is currently no readahead as far as I
know, so reads are really sent one at a time, waiting for the reply before
sending the next.

> If so, is there a reason to limit the planned concurrent request handling
> feature to one of the cached modes? I mean ordering of requests is already
> handled on 9p server side, so client could just pass all messages in a
> lite-weight way and assume server takes care of it.

cache=none is difficult; we could pipeline requests up to the buffer
size the client requested, but that's it.
Still something worth doing if the msize is tiny and the client requests
4+MB in my opinion, but that's not something the VFS can help us with.

cache=mmap is basically cache=none with a hack to say "ok, for mmap
there's no choice so do use some" -- afaik mmap has its own readahead
mechanism, so this should actually prefetch things, but I don't know
about the parallelism of that mechanism and would say it's linear.

Other caching models (loose / fscache) actually share most of the code,
so whatever is done for one would apply to both; the discussion is still
underway with David/Willy and others, mostly about ceph/cifs, but it would
benefit everyone and I'm following closely.

-- 
Dominique


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2021-02-27  0:03                                       ` [Virtio-fs] " Dominique Martinet
@ 2021-03-03 14:04                                         ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2021-03-03 14:04 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dominique Martinet, Shinde, Archana M, Michael S. Tsirkin,
	Venegas Munoz, Jose Carlos, Greg Kurz, virtio-fs-list,
	Dr. David Alan Gilbert, Stefan Hajnoczi, v9fs-developer,
	cdupontd, Vivek Goyal

On Samstag, 27. Februar 2021 01:03:40 CET Dominique Martinet wrote:
> Christian Schoenebeck wrote on Fri, Feb 26, 2021 at 02:49:12PM +0100:
> > Right now the client uses a hard coded amount of 128 elements. So what
> > about replacing VIRTQUEUE_NUM by a variable which is initialized with a
> > value according to the user's requested 'msize' option at init time?
> > 
> > According to the virtio specs the max. amount of elements in a virtqueue
> > is
> > 32768. So 32768 * 4k = 128M as new upper limit would already be a
> > significant improvement and would not require too many changes to the
> > client code, right?
> The current code inits the chan->sg at probe time (when driver is
> loader) and not mount time, and it is currently embedded in the chan
> struct, so that would need allocating at mount time (p9_client_create ;
> either resizing if required or not sharing) but it doesn't sound too
> intrusive yes.
> 
> I don't see more adherenences to VIRTQUEUE_NUM that would hurt trying.

Ok, then I will look into changing this when I hopefully have some time in a
few weeks.

> > > On the 9p side itself, unrelated to virtio, we don't want to make it
> > > *too* big as the client code doesn't use any scatter-gather and will
> > > want to allocate upfront contiguous buffers of the size that got
> > > negotiated -- that can get ugly quite fast, but we can leave it up to
> > > users to decide.
> > 
> > With ugly you just mean that it's occupying this memory for good as long
> > as
> > the driver is loaded, or is there some runtime performance penalty as well
> > to be aware of?
> 
> The main problem is memory fragmentation, see /proc/buddyinfo on various
> systems.
> After a fresh boot memory is quite clean and there is no problem
> allocating 2MB contiguous buffers, but after a while depending on the
> workload it can be hard to even allocate large buffers.
> I've had that problem at work in the past with a RDMA driver that wanted
> to allocate 256KB and could get that to fail quite reliably with our
> workload, so it really depends on what the client does.
> 
> In the 9p case, the memory used to be allocated for good and per client
> (= mountpoint), so if you had 15 9p mounts that could do e.g. 32
> requests in parallel with 1MB buffers you could lock 500MB of idling
> ram. I changed that to a dedicated slab a while ago, so that should no
> longer be so much of a problem -- the slab will keep the buffers around
> as well if used frequently so the performance hit wasn't bad even for
> larger msizes

Ah ok, good to know.

BTW, qemu now handles multiple filesystems below one 9p share correctly by 
(optionally) remapping inode numbers from host side -> guest side 
appropriately to prevent potential file ID collisions. This might reduce the 
need to have a large number of 9p mount points on guest side.

For instance, I am running entire guest systems on a single 9p mount point 
as the root fs. The guest system is divided into multiple filesystems on 
the host side (e.g. multiple zfs datasets), not on the guest side.

> > > One of my very-long-term goal would be to tend to that, if someone has
> > > cycles to work on it I'd gladly review any patch in that area.
> > > A possible implementation path would be to have transport define
> > > themselves if they support it or not and handle it accordingly until all
> > > transports migrated, so one wouldn't need to care about e.g. rdma or xen
> > > if you don't have hardware to test in the short term.
> > 
> > Sounds like something that Greg suggested before for a slightly different,
> > even though related issue: right now the default 'msize' on Linux client
> > side is 8k, which really hurts performance wise as virtually all 9p
> > messages have to be split into a huge number of request and response
> > messages. OTOH you don't want to set this default value too high. So Greg
> > noted that virtio could suggest a default msize, i.e. a value that would
> > suit host's storage hardware appropriately.
> 
> We can definitely increase the default, for all transports in my
> opinion.
> As a first step, 64 or 128k?

Just to throw out some numbers first: when linearly reading a 12 GB file on 
the guest (i.e. "time cat test.dat > /dev/null") on a test machine, these 
are the results that I get (cache=mmap):

msize=16k: 2min7s (95 MB/s)
msize=64k: 17s (706 MB/s)
msize=128k: 12s (1000 MB/s)
msize=256k: 8s (1500 MB/s)
msize=512k: 6.5s (1846 MB/s)

Personally I would raise the default msize value at least to 128k.

> > > The next best thing would be David's netfs helpers and sending
> > > concurrent requests if you use cache, but that's not merged yet either
> > > so it'll be a few cycles as well.
> > 
> > So right now the Linux client is always just handling one request at a
> > time; it sends a 9p request and waits for its response before processing
> > the next request?
> 
> Requests are handled concurrently just fine - if you have multiple
> processes all doing their things it will all go out in parallel.
> 
> The bottleneck people generally complain about (and where things hurt)
> is if you have a single process reading then there is currently no
> readahead as far as I know, so reads are really sent one at a time,
> waiting for reply and sending next.

So that would also mean that if you are running a multi-threaded app (in one 
process) on the guest side, none of its I/O requests are handled in parallel 
right now. It would be desirable to have parallel requests for multi-threaded 
apps as well.

Personally I don't find raw I/O the worst performance issue right now. As you 
can see from the numbers above, if 'msize' is raised and I/O is performed 
with large chunk sizes (e.g. 'cat' automatically uses a chunk size according 
to the iounit advertised by stat), then the I/O results are okay.
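
As a small userspace illustration of that point (a sketch, not taken from
coreutils; the function name is made up): tools size their read buffer from
st_blksize, which per the above follows the iounit, so a larger msize
directly turns into larger read() calls.

#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Drain a file using st_blksize-sized chunks, like "cat > /dev/null". */
int drain_file(const char *path)
{
        int fd = open(path, O_RDONLY);
        struct stat st;
        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0) {
                close(fd);
                return -1;
        }

        size_t bufsz = st.st_blksize > 0 ? (size_t)st.st_blksize : 4096;
        char *buf = malloc(bufsz);
        if (!buf) {
                close(fd);
                return -1;
        }

        ssize_t n;
        while ((n = read(fd, buf, bufsz)) > 0)
                ;       /* discard the data */

        free(buf);
        close(fd);
        return n < 0 ? -1 : 0;
}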

What hurts IMO the most in practice is the sluggish behaviour regarding 
dentries ATM. The following is with cache=mmap (on guest side):

$ time ls /etc/ > /dev/null
real    0m0.091s
user    0m0.000s
sys     0m0.044s
$ time ls -l /etc/ > /dev/null
real    0m0.259s
user    0m0.008s
sys     0m0.016s
$ ls -l /etc/ | wc -l
113
$

With cache=loose there is some improvement; on the first "ls" run (when it's 
not in the dentry cache, I assume) the results are similar. The subsequent 
runs then improve to around 50ms for "ls" and around 70ms for "ls -l". But 
that's still far from the numbers I would expect.

Keep in mind that even when you just open() & read() a file, the directory 
components have to be walked to check ownership and permissions. I have 
seen huge slowdowns in deep directory structures for that reason.

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2021-03-03 14:04                                         ` [Virtio-fs] " Christian Schoenebeck
@ 2021-03-03 14:50                                           ` Dominique Martinet
  -1 siblings, 0 replies; 107+ messages in thread
From: Dominique Martinet @ 2021-03-03 14:50 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: cdupontd, Michael S. Tsirkin, Venegas Munoz, Jose Carlos,
	Greg Kurz, qemu-devel, virtio-fs-list, Vivek Goyal,
	Stefan Hajnoczi, v9fs-developer, Shinde, Archana M,
	Dr. David Alan Gilbert

Christian Schoenebeck wrote on Wed, Mar 03, 2021 at 03:04:21PM +0100:
> > We can definitely increase the default, for all transports in my
> > opinion.
> > As a first step, 64 or 128k?
> 
> Just to throw some numbers first; when linearly reading a 12 GB file on guest 
> (i.e. "time cat test.dat > /dev/null") on a test machine, these are the 
> results that I get (cache=mmap):
> 
> msize=16k: 2min7s (95 MB/s)
> msize=64k: 17s (706 MB/s)
> msize=128k: 12s (1000 MB/s)
> msize=256k: 8s (1500 MB/s)
> msize=512k: 6.5s (1846 MB/s)
> 
> Personally I would raise the default msize value at least to 128k.

Thanks for the numbers.
I'm still a bit worried about overly large chunks; let's go with 128k for
now -- I'll send a couple of patches increasing the tcp max/default as
well next week-ish.

> > The bottleneck people generally complain about (and where things hurt)
> > is if you have a single process reading then there is currently no
> > readahead as far as I know, so reads are really sent one at a time,
> > waiting for reply and sending next.
> 
> So that also means if you are running a multi-threaded app (in one process) on 
> guest side, then none of its I/O requests are handled in parallel right now. 
> It would be desirable to have parallel requests for multi-threaded apps as 
> well.

Threads are independent there as far as the kernel goes: if multiple
threads issue IO in parallel it will all be handled in parallel.
(The exception would be "lightweight threads" which don't spawn actual
OS threads, but in that case the IOs are generally sent asynchronously,
so that should work as well.)

> Personally I don't find raw I/O the worst performance issue right now. As you 
> can see from the numbers above, if 'msize' is raised and I/O being performed 
> with large chunk sizes (e.g. 'cat' automatically uses a chunk size according 
> to the iounit advertised by stat) then the I/O results are okay.
> 
> What hurts IMO the most in practice is the sluggish behaviour regarding 
> dentries ATM. The following is with cache=mmap (on guest side):
> 
> $ time ls /etc/ > /dev/null
> real    0m0.091s
> user    0m0.000s
> sys     0m0.044s
> $ time ls -l /etc/ > /dev/null
> real    0m0.259s
> user    0m0.008s
> sys     0m0.016s
> $ ls -l /etc/ | wc -l
> 113
> $

Yes, that is slow indeed... Unfortunately cache=none/mmap means only open
dentries are pinned, so that means a load of requests every time.

I was going to suggest something like readdirplus or prefetching
directory entry attributes in parallel / in the background, but since we're
not keeping any entries around we can't even do that in that mode.

> With cache=loose there is some improvement; on the first "ls" run (when its 
> not in the dentry cache I assume) the results are similar. The subsequent runs 
> then improve to around 50ms for "ls" and around 70ms for "ls -l". But that's 
> still far from numbers I would expect.

I'm surprised cached mode is that slow though; that is worth
investigating.
With that time range we are definitely sending more requests to the
server than I would expect for cache=loose -- some stat revalidation
perhaps? I thought there wasn't any.

I don't like cache=loose/fscache right now as the reclaim mechanism
doesn't work well as far as I'm aware (I've heard reports of 9p memory
usage growing ad nauseam in these modes), so while it's fine for
short-lived VMs it can't really be used for long periods of time as
is... That's been on my todo for a while too, but unfortunately no time
for that.


Ideally, if that gets fixed, it really should be the default with some
sort of cache revalidation like NFS does (if that hasn't changed, inode
stats have a lifetime after which they get revalidated on access, and
directory ctime changes lead to a fresh readdir); but we can't really
do that right now if it "leaks".

Some cap on the number of open fids could be worthwhile as well
perhaps, to spare server resources and keep internal lists short.

> Keep in mind, even when you just open() & read() a file, then directory 
> components have to be walked for checking ownership and permissions. I have 
> seen huge slowdowns in deep directory structures for that reason.

Yes, each component is walked one at a time. In theory the protocol
allows opening a path with all components specified in a single walk and
letting the server handle the intermediate directory checks, but the VFS
doesn't allow that.
Using relative paths or openat/fstatat/etc helps, but many programs
aren't very smart about that. Note it's not just a problem with 9p
though; even network filesystems with proper caching have a noticeable
performance cost with deep directory trees.
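
A tiny userspace example of what being smart about that means (plain POSIX,
nothing 9p-specific; the function is made up): keep a directory fd open and
resolve entries relative to it, so only the final component is walked per
access.

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Stat and open several entries of one (possibly deep) directory while
 * walking and permission-checking the intermediate components only once. */
int touch_entries(const char *dirpath, const char *const names[], int n)
{
        int dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
        if (dirfd < 0)
                return -1;

        for (int i = 0; i < n; i++) {
                struct stat st;

                if (fstatat(dirfd, names[i], &st, 0) != 0)
                        continue;

                int fd = openat(dirfd, names[i], O_RDONLY);
                if (fd >= 0)
                        close(fd);
        }

        close(dirfd);
        return 0;
}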


Anyway, there definitely is room for improvement; if you need ideas I
have plenty, but my time is more than limited right now and for the
foreseeable future... 9p work happens purely in my free time and there
isn't much of it at the moment :(

I'll make time as necessary for reviews & tests but that's about as much
as I can promise, sorry and good luck!

-- 
Dominique


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
  2021-03-03 14:50                                           ` [Virtio-fs] " Dominique Martinet
@ 2021-03-05 14:57                                             ` Christian Schoenebeck
  -1 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2021-03-05 14:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dominique Martinet, cdupontd, Michael S. Tsirkin, Venegas Munoz,
	Jose Carlos, Greg Kurz, virtio-fs-list, Vivek Goyal,
	Stefan Hajnoczi, v9fs-developer, Shinde, Archana M,
	Dr. David Alan Gilbert

On Mittwoch, 3. März 2021 15:50:37 CET Dominique Martinet wrote:
> Christian Schoenebeck wrote on Wed, Mar 03, 2021 at 03:04:21PM +0100:
> > > We can definitely increase the default, for all transports in my
> > > opinion.
> > > As a first step, 64 or 128k?
> > 
> > Just to throw some numbers first; when linearly reading a 12 GB file on
> > guest (i.e. "time cat test.dat > /dev/null") on a test machine, these are
> > the results that I get (cache=mmap):
> > 
> > msize=16k: 2min7s (95 MB/s)
> > msize=64k: 17s (706 MB/s)
> > msize=128k: 12s (1000 MB/s)
> > msize=256k: 8s (1500 MB/s)
> > msize=512k: 6.5s (1846 MB/s)
> > 
> > Personally I would raise the default msize value at least to 128k.
> 
> Thanks for the numbers.
> I'm still a bit worried about too large chunks, let's go with 128k for
> now -- I'll send a couple of patches increasing the tcp max/default as
> well next week-ish.

Ok, sounds good!

> > Personally I don't find raw I/O the worst performance issue right now. As
> > you can see from the numbers above, if 'msize' is raised and I/O being
> > performed with large chunk sizes (e.g. 'cat' automatically uses a chunk
> > size according to the iounit advertised by stat) then the I/O results are
> > okay.
> > 
> > What hurts IMO the most in practice is the sluggish behaviour regarding
> > dentries ATM. The following is with cache=mmap (on guest side):
> > 
> > $ time ls /etc/ > /dev/null
> > real    0m0.091s
> > user    0m0.000s
> > sys     0m0.044s
> > $ time ls -l /etc/ > /dev/null
> > real    0m0.259s
> > user    0m0.008s
> > sys     0m0.016s
> > $ ls -l /etc/ | wc -l
> > 113
> > $
> 
> Yes, that is slow indeed.. Unfortunately cache=none/mmap means only open
> dentries are pinned, so that means a load of requests everytime.
> 
> I was going to suggest something like readdirplus or prefetching
> directory entries attributes in parallel/background, but since we're not
> keeping any entries around we can't even do that in that mode.
> 
> > With cache=loose there is some improvement; on the first "ls" run (when
> > its
> > not in the dentry cache I assume) the results are similar. The subsequent
> > runs then improve to around 50ms for "ls" and around 70ms for "ls -l".
> > But that's still far from numbers I would expect.
> 
> I'm surprised cached mode is that slow though, that is worth
> investigating.
> With that time range we are definitely sending more requests to the
> server than I would expect for cache=loose, some stat revalidation
> perhaps? I thought there wasn't any.

Yes, it looks like more 9p requests are sent than actually required for 
readdir. But I haven't checked yet what's going on there in detail. That's 
definitely on my todo list, because this readdir/stat/direntry issue ATM 
really hurts the most IMO.

> I don't like cache=loose/fscache right now as the reclaim mechanism
> doesn't work well as far as I'm aware (I've heard reports of 9p memory
> usage growing ad nauseam in these modes), so while it's fine for
> short-lived VMs it can't really be used for long periods of time as
> is... That's been on my todo for a while too, but unfortunately no time
> for that.

Ok, that's new to me. But I fear the opposite is currently worse: with 
cache=mmap and a VM running for a longer time, 9p requests get slower and 
slower, e.g. at a certain point you're waiting something like 20s for one 
request. I haven't investigated the cause there yet either. It may very well 
be an issue on the QEMU side: I have some doubts about the fid reclaim 
algorithm on the 9p server side, which uses just a linked list. Maybe that 
list grows to ridiculous sizes and searching it in O(n) starts to hurt after 
a while.
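
Purely to illustrate that suspicion (a hypothetical direction, not the
actual hw/9pfs code): a glib hash table keyed by the fid number would make
the lookup O(1) on average instead of a list walk.

#include <glib.h>

typedef struct V9fsFidState V9fsFidState;   /* opaque here; real QEMU type */

static GHashTable *fid_table;

static void fid_table_init(void)
{
    fid_table = g_hash_table_new(g_direct_hash, g_direct_equal);
}

static void fid_insert(guint32 fid, V9fsFidState *fidp)
{
    g_hash_table_insert(fid_table, GUINT_TO_POINTER(fid), fidp);
}

static V9fsFidState *fid_lookup(guint32 fid)
{
    /* average O(1) instead of scanning a list of every open fid */
    return g_hash_table_lookup(fid_table, GUINT_TO_POINTER(fid));
}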

With cache=loose I don't see such tremendous slowdowns even on long runs, 
which might indicate that this symptom is indeed due to a problem on the 
QEMU side.

> Ideally if that gets fixed, it really should be the default with some
> sort of cache revalidation like NFS does (if that hasn't changed, inode
> stats have a lifetime after which they get revalidated on access, and
> directory ctime changes lead to a fresh readdir) ; but we can't really
> do that right now if it "leaks".
> 
> Some cap to the number of open fids could be appreciable as well
> perhaps, to spare server resources and keep internal lists short.

I just reviewed the fid reclaim code on the 9p server side to some extent 
because of a recent security issue in that area, but I haven't really 
thought through nor fully captured the authors' original ideas behind it 
yet. I still have some question marks here. Maybe Greg feels the same.

Probably when support for macOS is added (also on my todo list), the number 
of open fids will need to be limited anyway, because macOS is much more 
conservative and does not allow a large number of open files by default.

> Anyway, there definitely is room for improvement; if you need ideas I
> have plenty but my time is more than limited right now and for the
> forseeable future... 9p work is purely on my freetime and there isn't
> much at the moment :(
> 
> I'll make time as necessary for reviews & tests but that's about as much
> as I can promise, sorry and good luck!

I fear that applies to all developers right now. To my knowledge there is not 
a single developer who is either paid or able to spend reasonably large time 
slices on 9p issues.

From my side: my plan is to hunt down the worst 9p performance issues in 
order of their impact, but like anybody else, only when I find some free 
time slices for that.

#patience #optimistic

Best regards,
Christian Schoenebeck




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Virtio-fs] Can not set high msize with virtio-9p (Was: Re: virtiofs vs 9p performance)
@ 2021-03-05 14:57                                             ` Christian Schoenebeck
  0 siblings, 0 replies; 107+ messages in thread
From: Christian Schoenebeck @ 2021-03-05 14:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Dominique Martinet, Venegas Munoz,
	Jose Carlos, cdupontd, virtio-fs-list, v9fs-developer, Shinde,
	Archana M, Vivek Goyal

On Mittwoch, 3. März 2021 15:50:37 CET Dominique Martinet wrote:
> Christian Schoenebeck wrote on Wed, Mar 03, 2021 at 03:04:21PM +0100:
> > > We can definitely increase the default, for all transports in my
> > > opinion.
> > > As a first step, 64 or 128k?
> > 
> > Just to throw some numbers first; when linearly reading a 12 GB file on
> > guest (i.e. "time cat test.dat > /dev/null") on a test machine, these are
> > the results that I get (cache=mmap):
> > 
> > msize=16k: 2min7s (95 MB/s)
> > msize=64k: 17s (706 MB/s)
> > msize=128k: 12s (1000 MB/s)
> > msize=256k: 8s (1500 MB/s)
> > msize=512k: 6.5s (1846 MB/s)
> > 
> > Personally I would raise the default msize value at least to 128k.
> 
> Thanks for the numbers.
> I'm still a bit worried about too large chunks, let's go with 128k for
> now -- I'll send a couple of patches increasing the tcp max/default as
> well next week-ish.

Ok, sounds good!

> > Personally I don't find raw I/O the worst performance issue right now. As
> > you can see from the numbers above, if 'msize' is raised and I/O being
> > performed with large chunk sizes (e.g. 'cat' automatically uses a chunk
> > size according to the iounit advertised by stat) then the I/O results are
> > okay.
> > 
> > What hurts IMO the most in practice is the sluggish behaviour regarding
> > dentries ATM. The following is with cache=mmap (on guest side):
> > 
> > $ time ls /etc/ > /dev/null
> > real    0m0.091s
> > user    0m0.000s
> > sys     0m0.044s
> > $ time ls -l /etc/ > /dev/null
> > real    0m0.259s
> > user    0m0.008s
> > sys     0m0.016s
> > $ ls -l /etc/ | wc -l
> > 113
> > $
> 
> Yes, that is slow indeed.. Unfortunately cache=none/mmap means only open
> dentries are pinned, so that means a load of requests everytime.
> 
> I was going to suggest something like readdirplus or prefetching
> directory entries attributes in parallel/background, but since we're not
> keeping any entries around we can't even do that in that mode.
> 
> > With cache=loose there is some improvement; on the first "ls" run (when
> > its
> > not in the dentry cache I assume) the results are similar. The subsequent
> > runs then improve to around 50ms for "ls" and around 70ms for "ls -l".
> > But that's still far from numbers I would expect.
> 
> I'm surprised cached mode is that slow though, that is worth
> investigating.
> With that time range we are definitely sending more requests to the
> server than I would expect for cache=loose, some stat revalidation
> perhaps? I thought there wasn't any.

Yes, it looks like more 9p requests are sent than actually required for 
readdir. But I haven't checked yet what's going on there in detail. That's 
definitely on my todo list, because this readdir/stat/direntry issue ATM 
really hurts the most IMO.

> I don't like cache=loose/fscache right now as the reclaim mechanism
> doesn't work well as far as I'm aware (I've heard reports of 9p memory
> usage growing ad nauseam in these modes), so while it's fine for
> short-lived VMs it can't really be used for long periods of time as
> is... That's been on my todo for a while too, but unfortunately no time
> for that.

Ok, that's new to me. But I fear the opposite is currently worse: with
cache=mmap and a VM running for a longer time, 9p requests get slower and
slower, to the point where you can end up waiting around 20s for a single
request. I haven't investigated the cause here yet either. It may very well
be an issue on the QEMU side: I have some doubts about the fid reclaim
algorithm on the 9p server side, which uses just a linked list. Maybe that
list grows to ridiculous sizes and the O(n) search starts to hurt after a
while.

With cache=loose I don't see such tremendous slowdowns even on long runs,
which suggests that this symptom is indeed caused by something on the QEMU
side.

> Ideally if that gets fixed, it really should be the default with some
> sort of cache revalidation like NFS does (if that hasn't changed, inode
> stats have a lifetime after which they get revalidated on access, and
> directory ctime changes lead to a fresh readdir); but we can't really
> do that right now if it "leaks".
> 
> Some cap to the number of open fids could be appreciable as well
> perhaps, to spare server resources and keep internal lists short.

I recently reviewed the fid reclaim code on the 9p server side to some
extent, because of a security issue in that area, but I haven't fully thought
it through nor captured the authors' original ideas behind it yet. I still
have some question marks here. Maybe Greg feels the same.

Probably the number of open fids will need to be limited anyway once support
for macOS is added (also on my todo list), because macOS is much more
conservative and does not allow a large number of open files per process by
default.
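
For reference (from memory, so please double check): the default per-process
soft limit for open file descriptors on macOS is rather small, typically 256,
so an uncapped number of open fids in the 9p server would hit that wall
quickly. It can be inspected and raised per process like this:

  ulimit -n         # show the current soft limit (often 256 on macOS)
  ulimit -Hn        # show the hard limit
  ulimit -n 4096    # raise the soft limit, up to the hard limit

Just a sketch to illustrate the constraint, not a suggested workaround.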

> Anyway, there definitely is room for improvement; if you need ideas I
> have plenty, but my time is more than limited right now and for the
> foreseeable future... 9p work happens purely in my free time and there
> isn't much of it at the moment :(
> 
> I'll make time as necessary for reviews & tests but that's about as much
> as I can promise, sorry and good luck!

I fear that applies to all developers right now. To my knowledge there is
not a single developer who is either paid to work on 9p or otherwise able to
spend reasonably large time slices on 9p issues.

From my side: my plan is to hunt down the worst 9p performance issues in
order of their impact, but like everybody else, only as I find some free time
slices for that.

#patience #optimistic

Best regards,
Christian Schoenebeck



