On Wed, 12 May 2021 17:05:53 +0100
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Fri, May 07, 2021 at 06:59:01PM +0200, Greg Kurz wrote:
> > Now that virtio-scsi-pci and virtio-blk-pci map 1 virtqueue per vCPU,
> > a serious slow down may be observed on setups with a big enough number
> > of vCPUs.
> > 
> > Exemple with a pseries guest on a bi-POWER9 socket system (128 HW threads):
> > 
> >               virtio-scsi      virtio-blk
> > 
> > 1		0m20.922s	0m21.346s
> > 2		0m21.230s	0m20.350s
> > 4		0m21.761s	0m20.997s
> > 8		0m22.770s	0m20.051s
> > 16		0m22.038s	0m19.994s
> > 32		0m22.928s	0m20.803s
> > 64		0m26.583s	0m22.953s
> > 128		0m41.273s	0m32.333s
> > 256		2m4.727s 	1m16.924s
> > 384		6m5.563s 	3m26.186s
> > 
> > Both perf and gprof indicate that QEMU is hogging CPUs when setting up
> > the ioeventfds:
> > 
> >  67.88%  swapper         [kernel.kallsyms]  [k] power_pmu_enable
> >   9.47%  qemu-kvm        [kernel.kallsyms]  [k] smp_call_function_single
> >   8.64%  qemu-kvm        [kernel.kallsyms]  [k] power_pmu_enable
> > =>2.79%  qemu-kvm        qemu-kvm           [.] memory_region_ioeventfd_before
> > =>2.12%  qemu-kvm        qemu-kvm           [.] address_space_update_ioeventfds
> >   0.56%  kworker/8:0-mm  [kernel.kallsyms]  [k] smp_call_function_single
> > 
> > address_space_update_ioeventfds() is called when committing an MR
> > transaction, i.e. for each ioeventfd with the current code base,
> > and it internally loops on all ioventfds:
> > 
> > static void address_space_update_ioeventfds(AddressSpace *as)
> > {
> > [...]
> >     FOR_EACH_FLAT_RANGE(fr, view) {
> >         for (i = 0; i < fr->mr->ioeventfd_nb; ++i) {
> > 
> > This means that the setup of ioeventfds for these devices has
> > quadratic time complexity.
> > 
> > This series simply changes the device models to extend the transaction
> > to all virtqueueues, like already done in the past in the generic
> > code with 710fccf80d78 ("virtio: improve virtio devices initialization
> > time").
> > 
> > Only virtio-scsi and virtio-blk are covered here, but a similar change
> > might also be beneficial to other device types such as host-scsi-pci,
> > vhost-user-scsi-pci and vhost-user-blk-pci.
> > 
> >               virtio-scsi      virtio-blk
> > 
> > 1		0m21.271s	0m22.076s
> > 2		0m20.912s	0m19.716s
> > 4		0m20.508s	0m19.310s
> > 8		0m21.374s	0m20.273s
> > 16		0m21.559s	0m21.374s
> > 32		0m22.532s	0m21.271s
> > 64		0m26.550s	0m22.007s
> > 128		0m29.115s	0m27.446s
> > 256		0m44.752s	0m41.004s
> > 384		1m2.884s	0m58.023s
> > 
> > This should fix https://bugzilla.redhat.com/show_bug.cgi?id=1927108
> > which reported the issue for virtio-scsi-pci.
> > 
> > Changes since v1:
> > - Add some comments (Stefan)
> > - Drop optimization on the error path in patch 2 (Stefan)
> > 
> > Changes since RFC:
> > 
> > As suggested by Stefan, splimplify the code by directly beginning and
> > committing the memory transaction from the device model, without all
> > the virtio specific proxying code and no changes needed in the memory
> > subsystem.
> > 
> > Greg Kurz (4):
> >   virtio-blk: Fix rollback path in virtio_blk_data_plane_start()
> >   virtio-blk: Configure all host notifiers in a single MR transaction
> >   virtio-scsi: Set host notifiers and callbacks separately
> >   virtio-scsi: Configure all host notifiers in a single MR transaction
> > 
> >  hw/block/dataplane/virtio-blk.c | 45 ++++++++++++++++++++-
> >  hw/scsi/virtio-scsi-dataplane.c | 72 ++++++++++++++++++++++++---------
> >  2 files changed, 97 insertions(+), 20 deletions(-)
> > 
> > -- 
> > 2.26.3
> > 
> 
> Thanks, applied to my block tree:
> https://gitlab.com/stefanha/qemu/commits/block
> 

Hi Stefan,

It seems that Michael already merged the previous version of this
patch set with its latest PR.

https://gitlab.com/qemu-project/qemu/-/commit/6005ee07c380cbde44292f5f6c96e7daa70f4f7d

It is thus missing the v1->v2 changes. Basically some comments to
clarify the optimization we're doing with the MR transaction and
the removal of the optimization on an error path.

The optimization on the error path isn't needed indeed but it
doesn't hurt. No need to change that now that the patches are
upstream.

I can post a follow-up patch to add the missing comments though.
While here, I'd even add these comments in the generic
virtio_device_*_ioeventfd_impl() calls as well, since they already
have the very same optimization.

Anyway, I guess you can drop the patches from your tree.

Cheers,

--
Greg

> Stefan