Re: kvm splat in mmu_spte_clear_track_bits

From: Andrea Arcangeli <aarcange@redhat.com>
To: Adam Borowski <kilobyte@angband.pl>
Cc: "Takashi Iwai" <tiwai@suse.de>, "Bernhard Held" <berny156@gmx.de>,
	"Nadav Amit" <nadav.amit@gmail.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Wanpeng Li" <kernellwp@gmail.com>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	"Joerg Roedel" <jroedel@suse.de>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	kvm <kvm@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Michal Hocko" <mhocko@kernel.org>
Subject: Re: kvm splat in mmu_spte_clear_track_bits
Date: Tue, 29 Aug 2017 16:09:24 +0200
Message-ID: <20170829140924.GB21615@redhat.com>
In-Reply-To: <20170829125923.g3tp22bzsrcuruks@angband.pl>

Hello,

On Tue, Aug 29, 2017 at 02:59:23PM +0200, Adam Borowski wrote:
> On Tue, Aug 29, 2017 at 02:45:41PM +0200, Takashi Iwai wrote:
> > [Put more people to Cc, sorry for growing too much...]
> 
> We're all interested in 4.13.0 not crashing on us, so that's ok.
> 
> > On Tue, 29 Aug 2017 11:19:13 +0200,
> > Bernhard Held wrote:
> > > 
> > > On 08/28/2017 at 06:56 PM, Nadav Amit wrote:
> > > > Don’t blame me for the TLB stuff... My money is on aac2fea94f7a .
> > > 
> > > Amit, thanks for your courage to expose your patch!
> > > 
> > > I'm more and more confident that aac2fea94f7a is the culprit.  Maybe it
> > > just accelerates the triggering of the splash.  To be more sure the
> > > kernel needs to be tested for a couple of days.  It would be great if
> > > others could assist in testing aac2fea94f7a.
> > 
> > I'm testing with the revert for a while and it seems working.
> 
> With nothing but aac2fea94f7a reverted, no explosions for me either.

The aforementioned commit has 3 bugs.

1) mmu_notifier_invalidate_range cannot be used as a replacement for
   mmu_notifier_invalidate_range_start/end. For KVM
   mmu_notifier_invalidate_range is a noop and rightfully so. An MMU
   notifier implementation has to implement either the
   ->invalidate_range method or the invalidate_range_start/end
   methods, not both. And if you implement invalidate_range_start/end
   like KVM is forced to do, calling mmu_notifier_invalidate_range in
   common code is a noop for KVM.

   For those MMU notifiers that can get away with only implementing
   ->invalidate_range, the ->invalidate_range is implicitly called by
   mmu_notifier_invalidate_range_end(). And only those secondary MMUs
   that share the same pagetable with the primary MMU (like AMD
   iommuv2) can get away with only implementing ->invalidate_range.

   So all cases (THP on/off) are broken right now.

   To fix this it is enough to replace mmu_notifier_invalidate_range
   with mmu_notifier_invalidate_range_start;
   mmu_notifier_invalidate_range_end. Either that or call
   mmu_notifier_invalidate_page multiple times like before.

2) address + (1UL << compound_order(page)) is buggy, it should be
   address + (PAGE_SIZE << compound_order(page)): it's bytes not
   pages, 2M not 512.

3) The whole invalidate_range thing was an attempt to call a single
   invalidate while walking multiple 4k ptes that map the same THP
   (after a pmd virtual split without physical compound page THP
   split). It's unclear if the rmap_walk will always provide an
   address that is 2M aligned as parameter to try_to_unmap_one, in
   presence of THP. I think it also needs an
   address &= ~((PAGE_SIZE << compound_order(page)) - 1) to be safe.
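
To make the arithmetic in 2) and 3) concrete, here is a minimal
user-space sketch. It assumes 4k base pages and a 2M THP (order 9).
The helper names are made up for illustration and are not kernel
functions; the order argument stands in for compound_order(page).

```c
#define PAGE_SHIFT 12UL                 /* assumption: 4k base pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical helper: the buggy end, which adds the page count (512). */
static unsigned long range_end_buggy(unsigned long address, unsigned int order)
{
	return address + (1UL << order);
}

/* Hypothetical helper: the intended end, which adds the size in bytes (2M). */
static unsigned long range_end_bytes(unsigned long address, unsigned int order)
{
	return address + (PAGE_SIZE << order);
}

/* Hypothetical helper: round the address down to the compound page boundary. */
static unsigned long range_start_aligned(unsigned long address, unsigned int order)
{
	return address & ~((PAGE_SIZE << order) - 1);
}
```

With address 0x200000 and order 9 the buggy end covers only 512 bytes
past the start, while the byte-based end covers the full 2M. Rounding
down clears the low 21 bits, so an address in the middle of the THP
maps back to its 2M aligned start.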

The other bug where you can reproduce the same corruption with OOM is
unrelated and caused by the OOM reaper. The OOM reaper was even
corrupting data if a task was writing to disk and was stuck in OOM in
the write() syscall or in an async io write.

To fix the KVM corruption in the OOM reaper, it needs to call
mmu_notifier_invalidate_range_start/end around
oom_kill.c:unmap_page_range. This additional
mmu_notifier_invalidate_range_start will not be good for the OOM
reaper because it's yet another case (like the mmap_sem for writing)
that will prevent the OOM reaper from running, hindering its ability
to hide XFS OOM deadlocks and making those resurface. Not in the KVM
case, because we use a spinlock to serialize against the secondary MMU
activity and the KVM critical section under spinlock isn't going to
allocate memory, but range_start can schedule or block on slow
hardware where the secondary MMU is accessed through PCI (not the KVM
case).
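
To illustrate why the start/end pair is what KVM needs (here and in 1)
above), this is a toy user-space model of the notifier dispatch. It is
only a sketch: the struct, the two notifiers and the toy_* functions
are invented for illustration and are not the kernel definitions.

```c
#include <stddef.h>

/* Toy stand-in for struct mmu_notifier_ops; not the kernel definition. */
struct toy_notifier_ops {
	void (*invalidate_range_start)(unsigned long start, unsigned long end);
	void (*invalidate_range_end)(unsigned long start, unsigned long end);
	void (*invalidate_range)(unsigned long start, unsigned long end);
};

static int kvm_like_flushes;	/* secondary MMU with its own pagetables */
static int iommu_like_flushes;	/* secondary MMU sharing the primary pagetables */

static void kvm_like_start(unsigned long start, unsigned long end)
{
	(void)start; (void)end;
	kvm_like_flushes++;	/* would drop the shadow mappings of the range */
}

static void kvm_like_end(unsigned long start, unsigned long end)
{
	(void)start; (void)end;	/* would allow new mappings to be established */
}

static void iommu_like_invalidate(unsigned long start, unsigned long end)
{
	(void)start; (void)end;
	iommu_like_flushes++;	/* would flush the device TLB for the range */
}

/* KVM-like: start/end only. IOMMU-like: ->invalidate_range only. */
static const struct toy_notifier_ops kvm_like = {
	.invalidate_range_start = kvm_like_start,
	.invalidate_range_end = kvm_like_end,
};
static const struct toy_notifier_ops iommu_like = {
	.invalidate_range = iommu_like_invalidate,
};
static const struct toy_notifier_ops *toy_list[] = { &kvm_like, &iommu_like };
#define TOY_COUNT (sizeof(toy_list) / sizeof(toy_list[0]))

static void toy_invalidate_range(unsigned long start, unsigned long end)
{
	for (size_t i = 0; i < TOY_COUNT; i++)
		if (toy_list[i]->invalidate_range)
			toy_list[i]->invalidate_range(start, end);
}

static void toy_invalidate_range_start(unsigned long start, unsigned long end)
{
	for (size_t i = 0; i < TOY_COUNT; i++)
		if (toy_list[i]->invalidate_range_start)
			toy_list[i]->invalidate_range_start(start, end);
}

/* range_end implicitly calls ->invalidate_range for those who have it. */
static void toy_invalidate_range_end(unsigned long start, unsigned long end)
{
	for (size_t i = 0; i < TOY_COUNT; i++) {
		if (toy_list[i]->invalidate_range)
			toy_list[i]->invalidate_range(start, end);
		if (toy_list[i]->invalidate_range_end)
			toy_list[i]->invalidate_range_end(start, end);
	}
}
```

In this model a bare toy_invalidate_range() reaches only the
IOMMU-like notifier and the KVM-like one never sees it, while a
toy_invalidate_range_start() followed by toy_invalidate_range_end()
reaches both.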

My preference is still to make the OOM reaper a config option and let
it grow into the VM at zero cost if disabled, while at the same time
having the option to keep the VM simpler and spend the time fixing the
filesystem bugs instead (while still being able to reproduce them more
easily with OOM reaper disabled).

Thanks,
Andrea