From: Gary Fu <qfu@wavecomp.com>
To: "linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org>
Cc: Paul Burton <pburton@wavecomp.com>,
	"jhogan@kernel.org" <jhogan@kernel.org>,
	Archer Yan <ayan@wavecomp.com>, Gary Fu <qfu@wavecomp.com>
Subject: [PATCH] KVM: MIPS: Fix endless retry loop in kvm_mips_map_page() on non-preemptible kernels
Date: Mon, 9 Sep 2019 02:49:19 +0000
Message-ID: <20190909024838.2757-1-qfu@wavecomp.com>

Add a cond_resched() to give the scheduler a chance to run the madvise
task, to avoid an endless loop here on a non-preemptible kernel.

Otherwise kvm->mmu_notifier_count would have no chance to be decreased
back to 0 by the madvise task -> syscall -> zap_page_range ->
mmu_notifier_invalidate_range_end ->
__mmu_notifier_invalidate_range_end -> invalidate_range_end ->
kvm_mmu_notifier_invalidate_range_end, because the madvise task gets
scheduled out at the cond_resched() reached via unmap_single_vma ->
unmap_page_range -> zap_p4d_range -> zap_pud_range -> zap_pmd_range,
i.e. before mmu_notifier_invalidate_range_end is called in
zap_page_range.
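
For context, a simplified sketch of that bookkeeping, paraphrased from
the generic KVM mmu_notifier hooks in virt/kvm/kvm_main.c (SRCU, TLB
flushing and memory barriers omitted; this is an illustration, not the
exact upstream source):

static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
					const struct mmu_notifier_range *range)
{
	struct kvm *kvm = mmu_notifier_to_kvm(mn);

	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_count++;	/* an invalidation is now in progress */
	/* ... unmap the affected GPA range from the guest ... */
	spin_unlock(&kvm->mmu_lock);
	return 0;
}

static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
					const struct mmu_notifier_range *range)
{
	struct kvm *kvm = mmu_notifier_to_kvm(mn);

	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_seq++;	/* mappings may have changed */
	kvm->mmu_notifier_count--;	/* only reached once zap_page_range() completes */
	spin_unlock(&kvm->mmu_lock);
}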

When handling a GPA fault by creating a new GPA mapping in
kvm_mips_map_page, the code keeps retrying until a usable page is
available. In the low-memory case it is waiting for memory to be freed
by the madvise syscall issued with MADV_DONTNEED (QEMU application ->
madvise with MADV_DONTNEED -> syscall -> madvise_vma ->
madvise_dontneed_free -> madvise_dontneed_single_vma ->
zap_page_range). In zap_page_range, after the TLB entries for the
given address range have been cleared by unmap_single_vma,
__mmu_notifier_invalidate_range_end is called, which finally calls
kvm_mmu_notifier_invalidate_range_end to decrease mmu_notifier_count
back to 0. The retry loop in kvm_mips_map_page checks
mmu_notifier_count via mmu_notifier_retry() (sketched below); once the
count is 0, meaning no invalidation is in progress and the freed pages
can be mapped again, it leaves the retry loop and sets up a PTE for
the new GPA mapping.
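
The check in the retry loop is the generic mmu_notifier_retry()
helper; roughly (paraphrased from include/linux/kvm_host.h, with the
barrier comments dropped):

static inline int mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq)
{
	if (kvm->mmu_notifier_count)
		return 1;	/* an invalidation is still in progress: retry */
	if (kvm->mmu_notifier_seq != mmu_seq)
		return 1;	/* mappings changed since the pfn was looked up: retry */
	return 0;		/* safe to install the mapping */
}
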
During the TLB clearing mentioned above (in unmap_single_vma, inside
the madvise syscall), cond_resched() is called once per PMD to avoid
hogging the CPU for too long (e.g. when zapping a huge address range).
When this happens on a non-preemptible kernel, the retry loop in
kvm_mips_map_page spins endlessly: there is no scheduling point that
would let the madvise syscall run again and reach
__mmu_notifier_invalidate_range_end, so mmu_notifier_count stays at 1
forever.

Adding a scheduling point before every retry in kvm_mips_map_page
gives the madvise syscall (invoked by QEMU) a chance to be scheduled
back in, finish zapping the pages in the given range and bring
mmu_notifier_count back to 0, letting the kvm_mips_map_page task break
out of the loop.
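
With the change applied, the retry path in kvm_mips_map_page() looks
roughly like the abridged sketch below (a paraphrase of
arch/mips/kvm/mmu.c with error handling and the original comments
trimmed, not the exact source):

retry:
	mmu_seq = kvm->mmu_notifier_seq;	/* snapshot before the pfn lookup */
	smp_rmb();
	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writeable);

	spin_lock(&kvm->mmu_lock);
	if (mmu_notifier_retry(kvm, mmu_seq)) {
		/* An invalidation raced with us; drop the pfn and try again. */
		spin_unlock(&kvm->mmu_lock);
		kvm_release_pfn_clean(pfn);
		cond_resched();	/* new: let madvise reach invalidate_range_end() */
		goto retry;
	}
	/* mmu_notifier_count is 0: install the new GPA->PFN mapping under mmu_lock. */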

Signed-off-by: Gary Fu <qfu@wavecomp.com>
---
 arch/mips/kvm/mmu.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index 97e538a8c1be..26bac7e1ea85 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -746,6 +746,18 @@ static int kvm_mips_map_page(struct kvm_vcpu *vcpu, unsigned long gpa,
 		 */
 		spin_unlock(&kvm->mmu_lock);
 		kvm_release_pfn_clean(pfn);
+		/*
+		 * Give the scheduler a chance to run the madvise task before
+		 * retrying, otherwise this loop can spin forever on a
+		 * non-preemptible kernel: the madvise syscall that frees the
+		 * memory we are waiting for gets scheduled out at the
+		 * cond_resched() in zap_pmd_range(), before it ever reaches
+		 * __mmu_notifier_invalidate_range_end(), so
+		 * mmu_notifier_count never drops back to 0 and
+		 * mmu_notifier_retry() never lets us out of this loop.
+		 * See the commit log for the full call chains.
+		 */
+		cond_resched();
 		goto retry;
 	}
 
-- 
2.17.1

