From mboxrd@z Thu Jan  1 00:00:00 1970
From: Marcelo Tosatti <mtosatti@redhat.com>
Subject: Re: Strange CPU usage pattern in SMP guest
Date: Tue, 23 Mar 2010 18:18:08 -0300
Message-ID: <20100323211808.GA25813@amt.cnet>
References: <20100321001304.B8EAF30301DA@mail.linux-ag.de>
 <4BA5F03C.1020900@redhat.com>
 <20100321120236.55228A0015@mail.linux-ag.de>
 <4BA60EDC.6080202@redhat.com>
 <20100321145548.CC027A0015@mail.linux-ag.de>
 <4BA63892.6090006@redhat.com>
 <20100322125120.DE032A0015@mail.linux-ag.de>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="SLDf9lqlvOQaIe6s"
Cc: Avi Kivity <avi@redhat.com>, kvm@vger.kernel.org
To: Sebastian Hetze <s.hetze@linux-ag.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:27804 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752518Ab0CWVTX (ORCPT <rfc822;kvm@vger.kernel.org>);
	Tue, 23 Mar 2010 17:19:23 -0400
Content-Disposition: inline
In-Reply-To: <20100322125120.DE032A0015@mail.linux-ag.de>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>


--SLDf9lqlvOQaIe6s
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Mon, Mar 22, 2010 at 01:51:20PM +0100, Sebastian Hetze wrote:
> On Sun, Mar 21, 2010 at 05:17:38PM +0200, Avi Kivity wrote:
> > On 03/21/2010 04:55 PM, Sebastian Hetze wrote:
> >> On Sun, Mar 21, 2010 at 02:19:40PM +0200, Avi Kivity wrote:
> >>    
> >>> On 03/21/2010 02:02 PM, Sebastian Hetze wrote:
> >>>      
> >>>> 12:46:02     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
> >>>> 12:46:03     all    0,20   11,35   10,96    8,96    0,40    2,99    0,00    0,00   65,14
> >>>> 12:46:03       0    1,00   11,00    7,00   15,00    0,00    1,00    0,00    0,00   65,00
> >>>> 12:46:03       1    0,00    7,14    2,04    6,12    1,02   11,22    0,00    0,00   72,45
> >>>> 12:46:03       2    0,00   15,00    1,00   12,00    0,00    1,00    0,00    0,00   71,00
> >>>> 12:46:03       3    0,00   11,00   23,00    8,00    0,00    0,00    0,00    0,00   58,00
> >>>> 12:46:03       4    0,00    0,00   50,00    0,00    0,00    0,00    0,00    0,00   50,00
> >>>> 12:46:03       5    0,00   13,00   20,00    4,00    0,00    1,00    0,00    0,00   62,00
> >>>>
> >>>> So it is only CPU4 that is showing this strange behaviour.
> >>>>
> >>>>        
> >>> Can you adjust irqtop to only count cpu4?  or even just post a few 'cat
> >>> /proc/interrupts' from that guest.
> >>>
> >>> Most likely the timer interrupt for cpu4 died.
> >>>      
> >> I've added two keys +/- to your irqtop to focus up and down
> >> in the row of available CPUs.
> >> The irqtop for CPU4 shows a constant number of 6 local timer interrupts
> >> per update, while the other CPUs show various higher values:
> >>
> >> irqtop for cpu 4
> >>
> >>   eth0                                      188
> >>   Rescheduling interrupts                   162
> >>   Local timer interrupts                      6
> >>   ata_piix                                    3
> >>   TLB shootdowns                              1
> >>   Spurious interrupts                         0
> >>   Machine check exceptions                    0
> >>
> >>
> >> irqtop for cpu 5
> >>
> >>   eth0                                      257
> >>   Local timer interrupts                    251
> >>   Rescheduling interrupts                   237
> >>   Spurious interrupts                         0
> >>   Machine check exceptions                    0
> >>
> >> So the timer interrupt for cpu4 is not completely dead but somehow
> >> broken.
> >
> > That is incredibly weird.
> >
> >> What can cause this problem? Any way to speed it up again?
> >>    
> >
> > The host has 8 cpus and is only running this 6 vcpu guest, yes?
> >
> > Can you confirm the other vcpus are ticking at 250 Hz?
> >
> > What does 'top' show running on cpu 4?  Pressing 'f' 'j' will add a  
> > last-used-cpu field in the display.
> >
> > Marcelo, any ideas?
> 
> Just to let you know, right after startup, all vcpus work fine.
> 
> The following message might be related to the problem:
> hrtimer: interrupt too slow, forcing clock min delta to 165954639 ns
> 
> The guest is an 32bit system running on an 64bit host.

Sebastian,

Please apply the attached patch to your guest kernel.


--SLDf9lqlvOQaIe6s
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="hrtimer-tune-hrtimer_interrupt-hang-logic.patch"

commit ff28aa973164f8e5f4b2633e178162d121c4e5eb
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Fri Nov 13 17:05:44 2009 +0100

    hrtimer: Tune hrtimer_interrupt hang logic
    
    The hrtimer_interrupt hang logic adjusts min_delta_ns based on the
    execution time of the hrtimer callbacks.
    
    This is error-prone for virtual machines, where a guest vcpu can be
    scheduled out during the execution of the callbacks (and the callbacks
    themselves can do operations that translate to blocking operations in
    the hypervisor), which in can lead to large min_delta_ns rendering the
    system unusable.
    
    Replace the current heuristics with something more reliable. Allow the
    interrupt code to try 3 times to catch up with the lost time. If that
    fails use the total time spent in the interrupt handler to defer the
    next timer interrupt so the system can catch up with other things
    which got delayed. Limit that deferment to 100ms.
    
    The retry events and the maximum time spent in the interrupt handler
    are recorded and exposed via /proc/timer_list
    
    Inspired by a patch from Marcelo.
    
    Reported-by: Michael Tokarev <mjt@tls.msk.ru>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: kvm@vger.kernel.org

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 4759917..2e1064f 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -164,10 +164,11 @@ struct hrtimer_clock_base {
  * @expires_next:	absolute time of the next event which was scheduled
  *			via clock_set_next_event()
  * @hres_active:	State of high resolution mode
- * @check_clocks:	Indictator, when set evaluate time source and clock
- *			event devices whether high resolution mode can be
- *			activated.
- * @nr_events:		Total number of timer interrupt events
+ * @hang_detected:	The last hrtimer interrupt detected a hang
+ * @nr_events:		Total number of hrtimer interrupt events
+ * @nr_retries:		Total number of hrtimer interrupt retries
+ * @nr_hangs:		Total number of hrtimer interrupt hangs
+ * @max_hang_time:	Maximum time spent in hrtimer_interrupt
  */
 struct hrtimer_cpu_base {
 	spinlock_t			lock;
@@ -175,7 +176,11 @@ struct hrtimer_cpu_base {
 #ifdef CONFIG_HIGH_RES_TIMERS
 	ktime_t				expires_next;
 	int				hres_active;
+	int				hang_detected;
 	unsigned long			nr_events;
+	unsigned long			nr_retries;
+	unsigned long			nr_hangs;
+	ktime_t				max_hang_time;
 #endif
 };
 
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 49da79a..4b0638b 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -584,7 +584,7 @@ static void hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base)
 static int hrtimer_reprogram(struct hrtimer *timer,
 			     struct hrtimer_clock_base *base)
 {
-	ktime_t *expires_next = &__get_cpu_var(hrtimer_bases).expires_next;
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
 	ktime_t expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
 	int res;
 
@@ -609,7 +609,16 @@ static int hrtimer_reprogram(struct hrtimer *timer,
 	if (expires.tv64 < 0)
 		return -ETIME;
 
-	if (expires.tv64 >= expires_next->tv64)
+	if (expires.tv64 >= cpu_base->expires_next.tv64)
+		return 0;
+
+	/*
+	 * If a hang was detected in the last timer interrupt then we
+	 * do not schedule a timer which is earlier than the expiry
+	 * which we enforced in the hang detection. We want the system
+	 * to make progress.
+	 */
+	if (cpu_base->hang_detected)
 		return 0;
 
 	/*
@@ -617,7 +626,7 @@ static int hrtimer_reprogram(struct hrtimer *timer,
 	 */
 	res = tick_program_event(expires, 0);
 	if (!IS_ERR_VALUE(res))
-		*expires_next = expires;
+		cpu_base->expires_next = expires;
 	return res;
 }
 
@@ -1234,29 +1243,6 @@ static void __run_hrtimer(struct hrtimer *timer)
 
 #ifdef CONFIG_HIGH_RES_TIMERS
 
-static int force_clock_reprogram;
-
-/*
- * After 5 iteration's attempts, we consider that hrtimer_interrupt()
- * is hanging, which could happen with something that slows the interrupt
- * such as the tracing. Then we force the clock reprogramming for each future
- * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
- * threshold that we will overwrite.
- * The next tick event will be scheduled to 3 times we currently spend on
- * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
- * 1/4 of their time to process the hrtimer interrupts. This is enough to
- * let it running without serious starvation.
- */
-
-static inline void
-hrtimer_interrupt_hanging(struct clock_event_device *dev,
-			ktime_t try_time)
-{
-	force_clock_reprogram = 1;
-	dev->min_delta_ns = (unsigned long)try_time.tv64 * 3;
-	printk(KERN_WARNING "hrtimer: interrupt too slow, "
-		"forcing clock min delta to %lu ns\n", dev->min_delta_ns);
-}
 /*
  * High resolution timer interrupt
  * Called with interrupts disabled
@@ -1265,21 +1251,15 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 {
 	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
 	struct hrtimer_clock_base *base;
-	ktime_t expires_next, now;
-	int nr_retries = 0;
-	int i;
+	ktime_t expires_next, now, entry_time, delta;
+	int i, retries = 0;
 
 	BUG_ON(!cpu_base->hres_active);
 	cpu_base->nr_events++;
 	dev->next_event.tv64 = KTIME_MAX;
 
- retry:
-	/* 5 retries is enough to notice a hang */
-	if (!(++nr_retries % 5))
-		hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
-
-	now = ktime_get();
-
+	entry_time = now = ktime_get();
+retry:
 	expires_next.tv64 = KTIME_MAX;
 
 	spin_lock(&cpu_base->lock);
@@ -1341,10 +1321,48 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 	spin_unlock(&cpu_base->lock);
 
 	/* Reprogramming necessary ? */
-	if (expires_next.tv64 != KTIME_MAX) {
-		if (tick_program_event(expires_next, force_clock_reprogram))
-			goto retry;
+	if (expires_next.tv64 == KTIME_MAX ||
+	    !tick_program_event(expires_next, 0)) {
+		cpu_base->hang_detected = 0;
+		return;
 	}
+
+	/*
+	 * The next timer was already expired due to:
+	 * - tracing
+	 * - long lasting callbacks
+	 * - being scheduled away when running in a VM
+	 *
+	 * We need to prevent that we loop forever in the hrtimer
+	 * interrupt routine. We give it 3 attempts to avoid
+	 * overreacting on some spurious event.
+	 */
+	now = ktime_get();
+	cpu_base->nr_retries++;
+	if (++retries < 3)
+		goto retry;
+	/*
+	 * Give the system a chance to do something else than looping
+	 * here. We stored the entry time, so we know exactly how long
+	 * we spent here. We schedule the next event this amount of
+	 * time away.
+	 */
+	cpu_base->nr_hangs++;
+	cpu_base->hang_detected = 1;
+	delta = ktime_sub(now, entry_time);
+	if (delta.tv64 > cpu_base->max_hang_time.tv64)
+		cpu_base->max_hang_time = delta;
+	/*
+	 * Limit it to a sensible value as we enforce a longer
+	 * delay. Give the CPU at least 100ms to catch up.
+	 */
+	if (delta.tv64 > 100 * NSEC_PER_MSEC)
+		expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
+	else
+		expires_next = ktime_add(now, delta);
+	tick_program_event(expires_next, 1);
+	printk_once(KERN_WARNING "hrtimer: interrupt took %llu ns\n",
+		    ktime_to_ns(delta));
 }
 
 /*
diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index fddd69d..7cd905c 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -150,6 +150,9 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
 	P_ns(expires_next);
 	P(hres_active);
 	P(nr_events);
+	P(nr_retries);
+	P(nr_hangs);
+	P_ns(max_hang_time);
 #endif
 #undef P
 #undef P_ns
@@ -252,7 +255,7 @@ static int timer_list_show(struct seq_file *m, void *v)
 	u64 now = ktime_to_ns(ktime_get());
 	int cpu;
 
-	SEQ_printf(m, "Timer List Version: v0.4\n");
+	SEQ_printf(m, "Timer List Version: v0.5\n");
 	SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES);
 	SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now);
 

--SLDf9lqlvOQaIe6s--