From: Andrea Arcangeli <aarcange@redhat.com>
To: Adam Borowski <kilobyte@angband.pl>
Cc: "Takashi Iwai" <tiwai@suse.de>, "Bernhard Held" <berny156@gmx.de>,
"Nadav Amit" <nadav.amit@gmail.com>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Wanpeng Li" <kernellwp@gmail.com>,
"Radim Krčmář" <rkrcmar@redhat.com>,
"Joerg Roedel" <jroedel@suse.de>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Linus Torvalds" <torvalds@linux-foundation.org>,
kvm <kvm@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"Michal Hocko" <mhocko@kernel.org>
Subject: Re: kvm splat in mmu_spte_clear_track_bits
Date: Tue, 29 Aug 2017 16:09:24 +0200
Message-ID: <20170829140924.GB21615@redhat.com>
In-Reply-To: <20170829125923.g3tp22bzsrcuruks@angband.pl>

Hello,

On Tue, Aug 29, 2017 at 02:59:23PM +0200, Adam Borowski wrote:
> On Tue, Aug 29, 2017 at 02:45:41PM +0200, Takashi Iwai wrote:
> > [Put more people to Cc, sorry for growing too much...]
>
> We're all interested in 4.13.0 not crashing on us, so that's ok.
>
> > On Tue, 29 Aug 2017 11:19:13 +0200,
> > Bernhard Held wrote:
> > >
> > > On 08/28/2017 at 06:56 PM, Nadav Amit wrote:
> > > > Don’t blame me for the TLB stuff... My money is on aac2fea94f7a .
> > >
> > > Amit, thanks for your courage to expose your patch!
> > >
> > > I'm more and more confident that aac2fea94f7a is the culprit. Maybe it
> > > just accelerates the triggering of the splash. To be more sure the
> > > kernel needs to be tested for a couple of days. It would be great if
> > > others could assist in testing aac2fea94f7a.
> >
> > I'm testing with the revert for a while and it seems working.
>
> With nothing but aac2fea94f7a reverted, no explosions for me either.

The aforementioned commit has 3 bugs.

1) mmu_notifier_invalidate_range cannot be used as a replacement for
   mmu_notifier_invalidate_range_start/end. For KVM
   mmu_notifier_invalidate_range is a noop, and rightfully so. An MMU
   notifier implementation has to implement either the
   ->invalidate_range method or the invalidate_range_start/end
   methods, not both. And if you implement invalidate_range_start/end
   like KVM is forced to do, calling mmu_notifier_invalidate_range in
   common code is a noop for KVM.

   For those MMU notifiers that can get away with only implementing
   ->invalidate_range, the ->invalidate_range is implicitly called by
   mmu_notifier_invalidate_range_end(). And only those secondary MMUs
   that share the same pagetable with the primary MMU (like AMD
   iommuv2) can get away with only implementing ->invalidate_range.

   So all cases (THP on/off) are broken right now.

   To fix this it is enough to replace mmu_notifier_invalidate_range
   with a mmu_notifier_invalidate_range_start/
   mmu_notifier_invalidate_range_end pair. Either that, or call
   mmu_notifier_invalidate_page multiple times like before.

2) address + (1UL << compound_order(page)) is buggy, it should be
   address + (PAGE_SIZE << compound_order(page)): the range is in
   bytes, not pages, so 2M, not 512.

3) The whole invalidate_range thing was an attempt to call a single
   invalidate while walking multiple 4k ptes that map the same THP
   (after a pmd virtual split without a physical compound page THP
   split). It's unclear if the rmap_walk will always provide an
   address that is 2M aligned as parameter to try_to_unmap_one in
   the presence of THP. I think it also needs an
   address &= ~((PAGE_SIZE << compound_order(page)) - 1) to be safe.
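
As an illustration of bug 1 above, here is a small userspace model of the dispatch rules described there. It is only a sketch: the demo_* names, the kvm_like/iommu_like notifiers and the counters are hypothetical stand-ins, not the kernel's mmu_notifier code. It assumes, as stated above, that the range_end helper implicitly calls ->invalidate_range, and that a KVM-like notifier implements only the start/end pair.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's mmu_notifier_ops. */
struct demo_notifier_ops {
	void (*invalidate_range_start)(unsigned long start, unsigned long end);
	void (*invalidate_range_end)(unsigned long start, unsigned long end);
	void (*invalidate_range)(unsigned long start, unsigned long end);
};

/* Models mmu_notifier_invalidate_range(): only ->invalidate_range runs. */
static void demo_invalidate_range(const struct demo_notifier_ops *ops,
				  unsigned long start, unsigned long end)
{
	if (ops->invalidate_range)
		ops->invalidate_range(start, end);
}

/* Models mmu_notifier_invalidate_range_start(). */
static void demo_invalidate_range_start(const struct demo_notifier_ops *ops,
					unsigned long start, unsigned long end)
{
	if (ops->invalidate_range_start)
		ops->invalidate_range_start(start, end);
}

/* Models ..._range_end(): ->invalidate_range is called implicitly too. */
static void demo_invalidate_range_end(const struct demo_notifier_ops *ops,
				      unsigned long start, unsigned long end)
{
	if (ops->invalidate_range)
		ops->invalidate_range(start, end);
	if (ops->invalidate_range_end)
		ops->invalidate_range_end(start, end);
}

/* KVM-like secondary MMU: implements only the start/end pair. */
static int kvm_like_flushes;

static void kvm_like_start(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	kvm_like_flushes++;	/* the secondary mappings are dropped here */
}

static void kvm_like_end(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
}

static const struct demo_notifier_ops kvm_like_ops = {
	.invalidate_range_start	= kvm_like_start,
	.invalidate_range_end	= kvm_like_end,
	/* .invalidate_range intentionally left NULL */
};

/* IOMMU-like secondary MMU sharing the pagetables: ->invalidate_range only. */
static int iommu_like_flushes;

static void iommu_like_range(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	iommu_like_flushes++;
}

static const struct demo_notifier_ops iommu_like_ops = {
	.invalidate_range	= iommu_like_range,
};
```

In this model, calling demo_invalidate_range() on kvm_like_ops leaves kvm_like_flushes untouched, while a demo_invalidate_range_start()/demo_invalidate_range_end() pair reaches both kinds of notifier.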
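
The arithmetic behind bugs 2 and 3 above can be checked in userspace. This is a sketch under stated assumptions: DEMO_PAGE_SHIFT of 12 (4k base pages, as on x86-64) and order 9 for a 2M THP are assumed values, and the demo_* helpers are hypothetical names, not kernel functions.

```c
/* Assumed for illustration: 4 KiB base pages, as on x86-64. */
#define DEMO_PAGE_SHIFT	12
#define DEMO_PAGE_SIZE	(1UL << DEMO_PAGE_SHIFT)

/* Size in bytes of a compound page of the given order. */
static unsigned long demo_compound_bytes(unsigned int order)
{
	return DEMO_PAGE_SIZE << order;
}

/* What the buggy expression computes: a page count added as if it were bytes. */
static unsigned long demo_buggy_end(unsigned long address, unsigned int order)
{
	return address + (1UL << order);
}

/* Round the address down to the start of the compound page. */
static unsigned long demo_range_start(unsigned long address, unsigned int order)
{
	return address & ~(demo_compound_bytes(order) - 1);
}

/* End of the range that covers the whole compound page, in bytes. */
static unsigned long demo_range_end(unsigned long address, unsigned int order)
{
	return demo_range_start(address, order) + demo_compound_bytes(order);
}
```

With order 9, demo_compound_bytes() gives 2097152 bytes (2M), whereas demo_buggy_end() only advances the address by 512. For an address 0x3000 bytes into a THP, demo_range_start() returns the 2M-aligned base and demo_range_end() the base plus 2M.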

The other bug, where you can reproduce the same corruption with OOM, is
unrelated and caused by the OOM reaper. The OOM reaper was even
corrupting data if a task was writing to disk and got stuck in OOM in
the write() syscall or in an async io write.

To fix the KVM corruption in the OOM reaper, it needs to call
mmu_notifier_invalidate_range_start/end around
oom_kill.c:unmap_page_range. This additional
mmu_notifier_invalidate_range_start will not be good for the OOM
reaper because it's yet another case (like the mmap_sem for writing)
that will prevent the OOM reaper from running, so hindering its
ability to hide XFS OOM deadlocks, and making those resurface. Not in
the KVM case, because we use a spinlock to serialize against the
secondary MMU activity and the KVM critical section under the spinlock
isn't going to allocate memory, but range_start can schedule or block
on slow hardware where the secondary MMU is accessed through PCI (not
the KVM case).
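
Why the bracket matters for the OOM reaper can be shown with a toy model. This is a sketch only: struct demo_mm and all demo_* functions are hypothetical and reduce the address space to a single page mapped by both the primary MMU and one secondary MMU. It assumes that the range_start callback is where the secondary MMU drops its mapping.

```c
#include <stdbool.h>

/* Hypothetical model: one page mapped by the primary and a secondary MMU. */
struct demo_mm {
	bool primary_mapped;
	bool secondary_mapped;	/* e.g. a KVM shadow mapping */
};

/* Stands in for the range_start notifier: the secondary mapping is dropped. */
static void demo_notify_start(struct demo_mm *mm)
{
	mm->secondary_mapped = false;
}

/* Stands in for the range_end notifier: nothing left to do in this model. */
static void demo_notify_end(struct demo_mm *mm)
{
	(void)mm;
}

/* Stands in for unmap_page_range(): only the primary mapping is torn down. */
static void demo_unmap(struct demo_mm *mm)
{
	mm->primary_mapped = false;
}

/* Reaping without the notifier bracket. */
static void demo_reap_unbracketed(struct demo_mm *mm)
{
	demo_unmap(mm);
}

/* Reaping with the start/end bracket around the unmap. */
static void demo_reap_bracketed(struct demo_mm *mm)
{
	demo_notify_start(mm);
	demo_unmap(mm);
	demo_notify_end(mm);
}

/* True if the secondary MMU still reaches a page the primary MMU freed. */
static bool demo_stale(const struct demo_mm *mm)
{
	return mm->secondary_mapped && !mm->primary_mapped;
}
```

Starting from a page mapped on both sides, demo_reap_unbracketed() leaves demo_stale() true, while demo_reap_bracketed() leaves it false. The model says nothing about the blocking behaviour of range_start discussed above.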

My preference is still to make the OOM reaper a config option and let
it grow into the VM at zero cost if disabled, while at the same time
having the option to keep the VM simpler and spend the time fixing the
filesystem bugs instead (while still being able to reproduce them more
easily with OOM reaper disabled).

Thanks,
Andrea