All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Zijlstra <peterz@infradead.org>
To: Stefan Bader <stefan.bader@canonical.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	kvm@vger.kernel.org, Marcelo Tosatti <mtosatti@redhat.com>,
	MASAO TAKAHASHI <masao-takahashi@kanno.co.jp>,
	Joerg Roedel <joro@8bytes.org>, Borislav Petkov <bp@alien8.de>
Subject: Re: Another preempt folding issue?
Date: Tue, 11 Feb 2014 20:45:53 +0100	[thread overview]
Message-ID: <20140211194553.GZ9987@twins.programming.kicks-ass.net> (raw)
In-Reply-To: <52FA6D4B.7020709@canonical.com>

On Tue, Feb 11, 2014 at 07:34:51PM +0100, Stefan Bader wrote:
> Hi Peter,
> 
> I am currently looking at a weird issue that manifest itself when trying to run
> kvm enabled qemu on a i386 host (v3.13 kernel, oh and potentially important the
> cpu is 64bit capable, so qemu-system-x86_64 is called). 

AMD or Intel machine?

> Sooner or later this
> causes softlockup messages on the host. I tracked this down to __vcpu_run in
> arch/x86/kvm/x86.c which does a loop which in that case never seems to make
> progress or exit.
> 
> What I found is that vcpu_enter_guest will exit quickly without causing the loop
> to exit when need_resched() is true. Looking at a crash dump I took, this was
> the case (thread_info->flags had TIF_NEED_RESCHED set). So after immediately
> returning __vcpu_run has the following code:
> 
>         if (need_resched()) {
>                 srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>                 kvm_resched(vcpu); // now cond_resched();
>                 vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>         }
> 
> The kvm_resched basically would end up doing a cond_resched() which now checks
> preempt_count() to be 0. If that is zero it will do the reschedule, otherwise it
> just does nothing. Looking at the percpu variables in the dump, I saw that
> the preempt_count was 0x8000000 (actually it was 0x80110000 but that was me
> triggering the kexec crashdump with sysrq-c).
> 
> I saw that there have been some changes in the upstream kernel and have picked
> the following:
> 1) x86, acpi, idle: Restructure the mwait idle routines
> 2) x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
> 3) sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
> 4) sched/preempt/x86: Fix voluntary preempt for x86
> 
> Patch 1) and 2) as dependencies of 3) (to get the mwait function correct and to
> the other file). Finally 4) is fixing up 3). [maybe worth suggesting to do for
> 3.13.y stable].
> 
> Still, with all those I got the softlockup. Since I knew from the dump info that
> something is wrong with the folding, I made the pragmatic approach and added the
> following:
> 
>         if (need_resched()) {
>                 srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> +               preempt_fold_need_resched();
>                 kvm_resched(vcpu); // now cond_resched();
>                 vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>         }
> 
> And this lets the kvm guest run without the softlockups! However I am less than
> convinced that this is the right thing to do. Somehow something done when
> converting the preempt_count into percpu has caused at least the i386 side to
> get into this mess (as there has not been any whining about 64bit). Just fail to
> see what.

I've been looking at the same thing too; I've got a trace from someone
who can reproduce and its just not making sense.

Looks like the hardware is loosing an interrupt (or worse).

With the below patch on top of current -git (might be whitespace damaged
due to copy-paste).

---
 arch/x86/kvm/x86.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 39c28f09dfd5..f8b2fc7ce491 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6115,7 +6115,11 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
                }
                if (need_resched()) {
                        srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
-                       cond_resched();
+                       if (need_resched() != should_resched()) {
+                               tracing_stop();
+                               printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n");
+                       }
+                       schedule();
                        vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
                }
        }

---

I got the following from an Athlon X2 (fam 0xf):

On Mon, Feb 10, 2014 at 05:23:15PM -0200, Marcelo Tosatti wrote:
> What is the problem exactly?

>From the trace:

 qemu-system-x86-2455  [001] d.s3   393.630586: smp_reschedule_interrupt <-reschedule_interrupt
 qemu-system-x86-2455  [001] d.s3   393.630586: scheduler_ipi <-smp_reschedule_interrupt
 qemu-system-x86-2455  [001] ..s3   393.630586: preempt_count_sub <-_raw_spin_unlock_irqrestore
 qemu-system-x86-2455  [001] ..s2   393.630586: _raw_spin_lock_irq <-run_timer_softirq
 qemu-system-x86-2455  [001] d.s2   393.630586: preempt_count_add <-_raw_spin_lock_irq
 qemu-system-x86-2455  [001] d.s3   393.630586: _raw_spin_unlock_irq <-run_timer_softirq
 qemu-system-x86-2455  [001] ..s3   393.630586: preempt_count_sub <-_raw_spin_unlock_irq
 qemu-system-x86-2455  [001] ..s2   393.630586: rcu_bh_qs <-__do_softirq
 qemu-system-x86-2455  [001] d.s2   393.630586: irqtime_account_irq <-__do_softirq
 qemu-system-x86-2455  [001] d.s2   393.630586: __local_bh_enable <-__do_softirq
 qemu-system-x86-2455  [001] d.s2   393.630586: preempt_count_sub <-__local_bh_enable


           dmesg-2458  [000] d.s5   393.630614: resched_task <-check_preempt_wakeup
           dmesg-2458  [000] d.s5   393.630614: native_smp_send_reschedule <-resched_task
           dmesg-2458  [000] d.s5   393.630614: default_send_IPI_mask_logical <-native_smp_send_reschedule


 qemu-system-x86-2455  [001] .n..   393.630636: preempt_count_add <-kmap_atomic_prot
 qemu-system-x86-2455  [001] .n.1   393.630639: __kunmap_atomic <-clear_huge_page
 qemu-system-x86-2455  [001] .n.1   393.630639: preempt_count_sub <-__kunmap_atomic
 qemu-system-x86-2455  [001] .n..   393.630639: _cond_resched <-clear_huge_page
 qemu-system-x86-2455  [001] .n..   393.630640: kmap_atomic <-clear_huge_page
 qemu-system-x86-2455  [001] .n..   393.630640: kmap_atomic_prot <-kmap_atomic
 qemu-system-x86-2455  [001] .n..   393.630640: preempt_count_add <-kmap_atomic_prot


The resched_task() in the middle does:

  set_tsk_need_resched();
  smp_mb();
  smp_send_reschedule();

This means that the remote cpu must observe TIF_NEED_RESCHED (.n.. in
the traces) when the IPI (smp_reschedule_interrupt) hits.

Now given the machine has unsynchronized TSC the order in the above is
not necessarily the true order, but there's only two possible scenarios:

1) its the right order; the smp_reschedule_interrupt() is from a
 previous resched_task()/rcu kick or any of the other users that sends
 it.

 In which case we should get another smp_reschedule_interrupt() _after_
 the resched_task() which should observe the 'n' and promote it to an
 'N'.

 This doesn't happen and we're stuck with the 'n' state..

2) its the wrong order and the smp_reschedule_interrupt() actually
 happened after the resched_task(), but in that case we should have
 observed the 'n', we did not.

Now given that 2) would mean a bug in the coherency fabric and 1) would
mean a lost interrupt, 1 seems like the more likely suspect.

Joerg wanted the extra kvm tracepoints to see what the vm-exit reason
is.


  reply	other threads:[~2014-02-11 19:46 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-02-11 18:34 Another preempt folding issue? Stefan Bader
2014-02-11 19:45 ` Peter Zijlstra [this message]
2014-02-12  8:20   ` Stefan Bader
2014-02-12 10:37     ` Peter Zijlstra
2014-02-12 10:40       ` Borislav Petkov
2014-02-12 11:09         ` Stefan Bader
2014-02-12 11:54           ` Peter Zijlstra
2014-02-13 17:00             ` Stefan Bader
2014-02-13 17:38               ` Peter Zijlstra
2014-02-13 18:03                 ` Stefan Bader
2014-02-13 18:26                   ` Peter Zijlstra
2014-02-14 13:34                     ` Borislav Petkov
2014-02-14 13:40                       ` Stefan Bader
2014-02-14 14:24                       ` Stefan Bader
2014-02-14 14:47                         ` Borislav Petkov
2014-02-14 17:02                           ` Stefan Bader
2014-02-14 17:21                             ` Peter Zijlstra
2014-02-20 15:38                               ` Stefan Bader
2014-02-20 15:50                                 ` Peter Zijlstra
2014-03-24 17:39                                   ` Paolo Bonzini
2014-03-25  8:23                                     ` Stefan Bader
2014-02-14 17:33                             ` Borislav Petkov
2014-02-14 18:23                               ` Stefan Bader
2014-02-14 19:03                                 ` Stefan Bader
2014-02-14 15:21                         ` Another preempt folding issue? (maybe bisect) Borislav Petkov
2014-02-14 15:28                           ` Stefan Bader
2014-02-14 15:44                             ` Borislav Petkov
2014-02-14 16:21                           ` Peter Zijlstra
2014-02-13 18:25               ` Another preempt folding issue? Peter Zijlstra
2014-02-14 10:55                 ` Stefan Bader
2014-02-14 13:17                   ` Peter Zijlstra
2014-02-14 11:24                 ` Stefan Bader
2014-02-14 11:49                   ` Peter Zijlstra
2014-02-12 11:12         ` Joerg Roedel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140211194553.GZ9987@twins.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=bp@alien8.de \
    --cc=joro@8bytes.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=masao-takahashi@kanno.co.jp \
    --cc=mtosatti@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=stefan.bader@canonical.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.