Hi Folks,

Sorry for the delay on how to reproduce the `fio` data. I have some code to
automate testing for multiple Kata configs and to collect info like:

- kata-env, Kata configuration.toml, qemu command, virtiofsd command.

See: https://github.com/jcvenegas/mrunner/

Last time we agreed to narrow the cases and configs to compare virtiofs and
9pfs. The configs were the following:

- qemu + virtiofs (cache=auto, dax=0), a.k.a. `kata-qemu-virtiofs`, WITHOUT xattr
- qemu + 9pfs, a.k.a. `kata-qemu`

Please take a look at the HTML and raw results I attach in this mail.

## Can I say that the current status is:

- As David's tests and Vivek's comments point out, for the fio workload you
  are using it seems the best candidate should be cache=none.
- In the comparison I took cache=auto, as Vivek suggested; this makes sense
  as it seems that will be the default for Kata.
- Even if cache=none works better for this case, can I assume that cache=auto,
  dax=0 will be better than any 9pfs config (once we find the root cause)?
- Vivek is taking a look at the mmap mode from 9pfs, to see how different it
  is from the current virtiofs implementation. That mode is what we use by
  default for 9pfs in Kata.

## I'd like to identify what should be next in the debugging/testing:

- Should I try to narrow it down by testing only with qemu?
- Should I first try with a new patch you already have?
- Probably try with a qemu that is not statically built?
- Do the same test with thread-pool-size=1?
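For reference, here is a minimal sketch of the kind of fio run this comparison
exercises inside the guest. The mount path, job sizes, and runtime below are
placeholders for illustration; the exact job files are the ones behind the
attached results.

```bash
# Run a mixed random read/write job against the shared directory inside the
# guest. /mnt/shared and the job parameters are examples only.
fio --name=randrw-sharedfs \
    --directory=/mnt/shared \
    --rw=randrw --bs=4k --size=1G \
    --ioengine=libaio --direct=1 \
    --numjobs=4 --runtime=60 --time_based \
    --group_reporting
```

The mmap cases mentioned below are the same kind of job with `--ioengine=mmap`,
which is why they report 0 for virtiofs cache=none, where mmap is not allowed.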
Please let me know how I can help.

Cheers.

On 22/09/20 12:47, "Vivek Goyal" wrote:

On Tue, Sep 22, 2020 at 11:25:31AM +0100, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > Hi,
> >   I've been doing some of my own perf tests and I think I agree
> > about the thread pool size; my test is a kernel build
> > and I've tried a bunch of different options.
> >
> > My config:
> >   Host: 16 core AMD EPYC (32 thread), 128G RAM,
> >         5.9.0-rc4 kernel, rhel 8.2ish userspace.
> >         5.1.0 qemu/virtiofsd built from git.
> >   Guest: Fedora 32 from cloud image with just enough extra installed
> >         for a kernel build.
> >
> >   git cloned and checked out v5.8 of Linux into /dev/shm/linux on the
> >   host, fresh before each test. Then log into the guest, make defconfig,
> >   time make -j 16 bzImage, make clean; time make -j 16 bzImage.
> >   The numbers below are the 'real' time in the guest from the initial
> >   make (the subsequent makes don't vary much).
> >
> > Below are the details of what each of these means, but here are the
> > numbers first:
> >
> >   virtiofsdefault        4m0.978s
> >   9pdefault              9m41.660s
> >   virtiofscache=none    10m29.700s
> >   9pmmappass             9m30.047s
> >   9pmbigmsize           12m4.208s
> >   9pmsecnone             9m21.363s
> >   virtiofscache=noneT1   7m17.494s
> >   virtiofsdefaultT1      3m43.326s
> >
> > So the winner there by far is 'virtiofsdefaultT1' - that's the default
> > virtiofs settings, but with --thread-pool-size=1 - so yes, it gives a
> > small benefit.
> > But interestingly the cache=none virtiofs performance is pretty bad,
> > but thread-pool-size=1 on that makes a BIG improvement.
>
> Here are fio runs that Vivek asked me to run in my same environment
> (there are some 0's in some of the mmap cases, and I've not investigated
> why yet).

cache=none does not allow mmap in case of virtiofs. That's when you are
seeing 0.

> virtiofs is looking good here in I think all of the cases;
> there's some division over which config; cache=none
> seems faster in some cases which surprises me.

I know cache=none is faster in case of write workloads. It forces direct
write where we don't call file_remove_privs(). While cache=auto goes through
file_remove_privs(), and that adds a GETXATTR request to every WRITE request.

Vivek
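As a footnote for anyone reproducing these runs, here is a minimal sketch of
how the host-side knobs discussed in this thread (the virtiofsd cache mode and
--thread-pool-size) are set. This assumes the C virtiofsd shipped with QEMU
5.1; the socket path, source directory, mount tag, and guest image are
placeholders.

```bash
# Host: start virtiofsd with an explicit cache mode and thread pool size.
# cache=auto|none|always; --thread-pool-size=1 is the setting David found
# fastest with the default cache mode.
./virtiofsd --socket-path=/tmp/vhostqemu \
    -o source=/srv/shared -o cache=auto \
    --thread-pool-size=1 &

# Host: expose it to the guest as a vhost-user-fs device. Guest RAM must be
# a shared memory backend for vhost-user. No DAX cache window is configured,
# matching the dax=0 case above.
qemu-system-x86_64 -m 4G -smp 4 \
    -chardev socket,id=char0,path=/tmp/vhostqemu \
    -device vhost-user-fs-pci,chardev=char0,tag=myfs \
    -object memory-backend-file,id=mem,size=4G,mem-path=/dev/shm,share=on \
    -numa node,memdev=mem \
    -drive file=guest.img,if=virtio   # guest image is a placeholder

# Guest: mount the shared directory.
mount -t virtiofs myfs /mnt/shared
```

Swapping `-o cache=auto` for `-o cache=none` and dropping `--thread-pool-size=1`
gives roughly the other virtiofs variants compared in the numbers above.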