From: Sean Christopherson <seanjc@google.com>
To: Zdenek Kaspar <zkaspar82@gmail.com>
Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>
Subject: Re: Bad performance since 5.9-rc1
Date: Wed, 13 Jan 2021 12:17:19 -0800
Message-ID: <X/9VT6ZgLPZW3dxc@google.com>
In-Reply-To: <20210112121811.408e32fe.zkaspar82@gmail.com>

On Tue, Jan 12, 2021, Zdenek Kaspar wrote:
> On Tue, 22 Dec 2020 22:26:45 +0100
> Zdenek Kaspar <zkaspar82@gmail.com> wrote:
> 
> > On Tue, 22 Dec 2020 09:07:39 -0800
> > Sean Christopherson <seanjc@google.com> wrote:
> > 
> > > On Mon, Dec 21, 2020, Zdenek Kaspar wrote:
> > > > [  179.364305] WARNING: CPU: 0 PID: 369 at kvm_mmu_zap_oldest_mmu_pages+0xd1/0xe0 [kvm]
> > > > [  179.365415] Call Trace:
> > > > [  179.365443]  paging64_page_fault+0x244/0x8e0 [kvm]
> > > 
> > > This means the shadow page zapping is occurring because KVM is
> > > hitting the max number of allowed MMU shadow pages.  Can you
> > > provide your QEMU command line?  I can reproduce the performance
> > > degradation, but only by deliberately overriding the max number of
> > > MMU pages via `-machine kvm-shadow-mem` to be an absurdly low value.
> > > 
> > > > [  179.365596]  kvm_mmu_page_fault+0x376/0x550 [kvm]
> > > > [  179.365725]  kvm_arch_vcpu_ioctl_run+0xbaf/0x18f0 [kvm]
> > > > [  179.365772]  kvm_vcpu_ioctl+0x203/0x520 [kvm]
> > > > [  179.365938]  __x64_sys_ioctl+0x338/0x720
> > > > [  179.365992]  do_syscall_64+0x33/0x40
> > > > [  179.366013]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > It's one long line; I added "\" for mail readability:
> > 
> > qemu-system-x86_64 -machine type=q35,accel=kvm            \
> > -cpu host,host-cache-info=on -smp cpus=2,cores=2          \
> > -m size=1024 -global virtio-pci.disable-legacy=on         \
> > -global virtio-pci.disable-modern=off                     \
> > -device virtio-balloon                                    \
> > -device virtio-net,netdev=tap-build,mac=DE:AD:BE:EF:00:80 \
> > -object rng-random,filename=/dev/urandom,id=rng0          \
> > -device virtio-rng,rng=rng0                               \
> > -name build,process=qemu-build                            \
> > -drive file=/mnt/data/export/unix/kvm/build/openbsd-amd64.img,if=virtio,cache=none,format=raw,aio=native \
> > -netdev type=tap,id=tap-build,vhost=on                    \
> > -serial none                                              \
> > -parallel none                                            \
> > -monitor unix:/dev/shm/kvm-build.sock,server,nowait       \
> > -enable-kvm -daemonize -runas qemu
> > 
> > Z.
> 
> BTW, v5.11-rc3 with kvm-shadow-mem=1073741824 seems OK.
>
> Just curious what v5.8 does

Aha!  Figured it out.  The commit you bisected to in v5.9 broke the zapping
order.  The list of MMU pages is a FIFO list, meaning KVM adds entries to
the head, not the tail.  I botched the zapping flow and used
for_each instead of for_each_reverse, which meant KVM would zap the _newest_
pages instead of the _oldest_ pages.  So once a VM hit its limit, KVM would
constantly zap the shadow pages it just allocated.

This should resolve the performance regression, or at least make it far less
painful.  You may still see some performance degradation due to other changes
in the zapping, e.g. more aggressive recursive zapping.  If that's the case,
I can explore other tweaks, e.g. skipping higher levels when possible.  I'll
get a proper patch posted later today.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c478904af518..2c6e6fdb26ad 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2417,7 +2417,7 @@ static unsigned long kvm_mmu_zap_oldest_mmu_pages(struct kvm *kvm,
                return 0;

 restart:
-       list_for_each_entry_safe(sp, tmp, &kvm->arch.active_mmu_pages, link) {
+       list_for_each_entry_safe_reverse(sp, tmp, &kvm->arch.active_mmu_pages, link) {
                /*
                 * Don't zap active root pages, the page itself can't be freed
                 * and zapping it will just force vCPUs to realloc and reload.
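
To illustrate the list semantics, here's a toy userspace sketch (not kernel
code, it just mimics head-insertion plus a forward walk): new entries land at
the head of the list, so the tail holds the oldest entries, and only a reverse
walk visits oldest-first.

#include <stdio.h>

/* Stand-in for a shadow page on an "active pages" list. */
struct sp {
	int id;
	struct sp *next;
};

int main(void)
{
	static struct sp pages[4];
	struct sp *head = NULL;

	/* "Allocate" pages 0..3; like list_add(), each new page goes at the head. */
	for (int i = 0; i < 4; i++) {
		pages[i].id = i;
		pages[i].next = head;
		head = &pages[i];
	}

	/* Forward walk (analogous to list_for_each_entry_safe): visits the
	 * NEWEST pages first, i.e. prints 3 2 1 0 -- the buggy zap order.
	 * Zapping the OLDEST pages first requires starting from the tail,
	 * which is what list_for_each_entry_safe_reverse() provides. */
	for (struct sp *p = head; p; p = p->next)
		printf("%d ", p->id);
	printf("\n");
	return 0;
}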

Side topic, I still can't figure out how on earth your guest kernel is hitting
the default max number of MMU pages.  Even with large pages completely disabled, PTI
enabled, multiple guest processes running, etc... I hit OOM in the guest before
the host's shadow page limit kicks in.  I had to force the limit down to 25% of
the default to reproduce the bad behavior.  All I can figure is that BSD has a
substantially different paging scheme than Linux.
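
For reference, my rough math for the default limit (a standalone sketch; the
constants are what I recall from arch/x86 KVM's default-limit calculation
around this time, so double-check them against your tree): roughly 2% of guest
memory pages with a floor of 64, which for your 1 GiB guest is only ~5242
shadow pages.

#include <stdio.h>

/* Approximations of the KVM constants (verify against your kernel). */
#define KVM_PERMILLE_MMU_PAGES   20UL   /* default limit ~= 2% of guest pages */
#define KVM_MIN_ALLOC_MMU_PAGES  64UL

int main(void)
{
	unsigned long guest_bytes  = 1024UL << 20;         /* -m 1024 */
	unsigned long nr_pages     = guest_bytes / 4096;   /* 262144 guest pages */
	unsigned long nr_mmu_pages = nr_pages * KVM_PERMILLE_MMU_PAGES / 1000;

	if (nr_mmu_pages < KVM_MIN_ALLOC_MMU_PAGES)
		nr_mmu_pages = KVM_MIN_ALLOC_MMU_PAGES;

	printf("default shadow page limit: %lu\n", nr_mmu_pages);   /* 5242 */
	return 0;
}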

> so by any chance, is there a command to set the kvm-shadow-mem value via the qemu monitor?
> 
> Z.


Thread overview: 11+ messages
2020-11-19  3:05 Bad performance since 5.9-rc1 Zdenek Kaspar
2020-12-01  6:35 ` Zdenek Kaspar
2020-12-18 19:33   ` Zdenek Kaspar
2020-12-21 19:41     ` Sean Christopherson
2020-12-21 21:13       ` Zdenek Kaspar
2020-12-22 17:07         ` Sean Christopherson
2020-12-22 21:26           ` Zdenek Kaspar
2021-01-12 11:18             ` Zdenek Kaspar
2021-01-13 20:17               ` Sean Christopherson [this message]
2021-01-13 22:17                 ` Zdenek Kaspar
2020-12-02  0:31 ` Sean Christopherson
