All of lore.kernel.org
 help / color / mirror / Atom feed
* kvm guest: hrtimer: interrupt too slow
@ 2009-09-29 13:12 Michael Tokarev
  2009-09-29 13:47 ` Avi Kivity
  2009-10-03 23:12 ` Marcelo Tosatti
  0 siblings, 2 replies; 29+ messages in thread
From: Michael Tokarev @ 2009-09-29 13:12 UTC (permalink / raw)
  To: kvm

Hello.

I'm having quite an.. unusable system here.
It's not really a regresssion with 0.11.0,
it was something similar before, but with
0.11.0 and/or 2.6.31 it become much worse.

The thing is that after some uptime, kvm
guest prints something like this:

hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns

after which system (guest) speeed becomes
very slow.  The above message is from
2.6.31 guest running wiht 0.11.0 & 2.6.31
host.  Before I tried it with 0.10.6 and
2.6.30 or 2.6.27, and the delta were a
bit less than that:

hrtimer: interrupt too slow, forcing clock min delta to 152222415 ns
hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns

etc.

Before, guest was just somewhat slow.  But
now, it reached the state when it's almost
unusable.

Especially "nice" it's visible in xterm
(running on that guest): typing a single
char in xterm results in it being displayed
after 1..2 sec pause.  With applications
like web browser (where all the font
rendering is done on the client and X
server receives bitmap instead of a
character code to draw, it isn't that
bad.

I suspect it's only happening with small
network packets.

The above message (with large min delta)
occured after about 20 hours uptime.  Similar
(give or take) delays were observed previously
as well.  That to say, -- it's not something
you notice immediately.

The problem is quite consistent, that is,
different guests shows it sooner or later.

Right now I'm running 2.6.31-amd64 host,
32bit qemu-kvm-0.11.0, 2.6.31-i686 uniprocessor
guest, virtio networking and block devices,
kvm-clock is enabled.  Similar configuration
were used before (with different versions).
The hardware is Phenom 9750 (4core) running
on Amd780g/sb700 chipset.

Any hints on what to do with all this?

Thanks!

/mjt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-09-29 13:12 kvm guest: hrtimer: interrupt too slow Michael Tokarev
@ 2009-09-29 13:47 ` Avi Kivity
  2009-09-29 13:58   ` Michael Tokarev
  2009-10-03 23:12 ` Marcelo Tosatti
  1 sibling, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2009-09-29 13:47 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: kvm, Glauber Costa

On 09/29/2009 03:12 PM, Michael Tokarev wrote:
> Hello.
>
> I'm having quite an.. unusable system here.
> It's not really a regresssion with 0.11.0,
> it was something similar before, but with
> 0.11.0 and/or 2.6.31 it become much worse.
>
> The thing is that after some uptime, kvm
> guest prints something like this:
>
> hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns
>
> after which system (guest) speeed becomes
> very slow.  The above message is from
> 2.6.31 guest running wiht 0.11.0 & 2.6.31
> host.  Before I tried it with 0.10.6 and
> 2.6.30 or 2.6.27, and the delta were a
> bit less than that:
>
> hrtimer: interrupt too slow, forcing clock min delta to 152222415 ns
> hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns
>
> etc.
>
> Before, guest was just somewhat slow.  But
> now, it reached the state when it's almost
> unusable.
>
> Especially "nice" it's visible in xterm
> (running on that guest): typing a single
> char in xterm results in it being displayed
> after 1..2 sec pause.  With applications
> like web browser (where all the font
> rendering is done on the client and X
> server receives bitmap instead of a
> character code to draw, it isn't that
> bad.
>
> I suspect it's only happening with small
> network packets.
>
> The above message (with large min delta)
> occured after about 20 hours uptime.  Similar
> (give or take) delays were observed previously
> as well.  That to say, -- it's not something
> you notice immediately.
>
> The problem is quite consistent, that is,
> different guests shows it sooner or later.
>
> Right now I'm running 2.6.31-amd64 host,
> 32bit qemu-kvm-0.11.0, 2.6.31-i686 uniprocessor
> guest, virtio networking and block devices,
> kvm-clock is enabled.  Similar configuration
> were used before (with different versions).
> The hardware is Phenom 9750 (4core) running
> on Amd780g/sb700 chipset.
>
> Any hints on what to do with all this?
>

What happens if you use hpet or pmtimer as guest clocksource?

Please post host /proc/cpuinfo.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-09-29 13:47 ` Avi Kivity
@ 2009-09-29 13:58   ` Michael Tokarev
  2009-10-05 10:47     ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Michael Tokarev @ 2009-09-29 13:58 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Glauber Costa

Avi Kivity wrote:
> On 09/29/2009 03:12 PM, Michael Tokarev wrote:
[]
>> The thing is that after some uptime, kvm
>> guest prints something like this:
>>
>> hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns
[]
> What happens if you use hpet or pmtimer as guest clocksource?

For all the guests I have handy, only 2 clocksources
are available: kvm-clock and acpi_pm.  The host itself
has hpet turned off because it itself had issues with
hpet, and after many tries we finally turned it off
(there were several long and painy threads on lkml
about this).

Dunno why pmtimer isn't available.

I tried switching to acpi_pm on a running guest (which
is still runnining in very slow mode), but that did not
make any difference - i.e., it did not become "fast"
again.

> Please post host /proc/cpuinfo.

Here's the cpuinfo from host (for last core):

processor	: 3
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 2
model name	: AMD Phenom(tm) 9750 Quad-Core Processor
stepping	: 3
cpu MHz		: 1200.000
cache size	: 512 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs
bogomips	: 4814.78
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

It has cpufreq enabled (ondemand), but turning that off
does not change anything (that was the first thing I
tried, but AFTER the guest become slow).

Thanks!

/mjt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-09-29 13:12 kvm guest: hrtimer: interrupt too slow Michael Tokarev
  2009-09-29 13:47 ` Avi Kivity
@ 2009-10-03 23:12 ` Marcelo Tosatti
       [not found]   ` <4AC88E7E.8050909@msgid.tls.msk.ru>
  2009-10-07 23:17   ` Frederic Weisbecker
  1 sibling, 2 replies; 29+ messages in thread
From: Marcelo Tosatti @ 2009-10-03 23:12 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: kvm, Frederic Weisbecker, Ingo Molnar

Michael,

Can you please give the patch below a try please? (without acpi_pm timer 
or priority adjustments for the guest).

On Tue, Sep 29, 2009 at 05:12:17PM +0400, Michael Tokarev wrote:
> Hello.
>
> I'm having quite an.. unusable system here.
> It's not really a regresssion with 0.11.0,
> it was something similar before, but with
> 0.11.0 and/or 2.6.31 it become much worse.
>
> The thing is that after some uptime, kvm
> guest prints something like this:
>
> hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns
>
> after which system (guest) speeed becomes
> very slow.  The above message is from
> 2.6.31 guest running wiht 0.11.0 & 2.6.31
> host.  Before I tried it with 0.10.6 and
> 2.6.30 or 2.6.27, and the delta were a
> bit less than that:
>
> hrtimer: interrupt too slow, forcing clock min delta to 152222415 ns
> hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns

It seems the way hrtimer_interrupt_hanging calculates min_delta is
wrong (especially to virtual machines). The guest vcpu can be scheduled
out during the execution of the hrtimer callbacks (and the callbacks
themselves can do operations that translate to blocking operations in
the hypervisor).

So high min_delta values can be calculated if, for example, a single
hrtimer_interrupt run takes two host time slices to execute, while some
other higher priority task runs for N slices in between.

Using the hrtimer_interrupt execution time (which can be the worse
case at any given time), as the min_delta is problematic.

So simply increase min_delta_ns by 50% once every detected failure,
which will eventually lead to an acceptable threshold (the algorithm
should scale back to down lower min_delta, to adjust back to wealthier
times, too).

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 49da79a..8997978 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer)
 
 #ifdef CONFIG_HIGH_RES_TIMERS
 
-static int force_clock_reprogram;
-
 /*
  * After 5 iteration's attempts, we consider that hrtimer_interrupt()
  * is hanging, which could happen with something that slows the interrupt
- * such as the tracing. Then we force the clock reprogramming for each future
- * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
- * threshold that we will overwrite.
- * The next tick event will be scheduled to 3 times we currently spend on
- * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
- * 1/4 of their time to process the hrtimer interrupts. This is enough to
- * let it running without serious starvation.
+ * such as the tracing, so we increase min_delta_ns.
  */
 
 static inline void
-hrtimer_interrupt_hanging(struct clock_event_device *dev,
-			ktime_t try_time)
+hrtimer_interrupt_hanging(struct clock_event_device *dev)
 {
-	force_clock_reprogram = 1;
-	dev->min_delta_ns = (unsigned long)try_time.tv64 * 3;
-	printk(KERN_WARNING "hrtimer: interrupt too slow, "
-		"forcing clock min delta to %lu ns\n", dev->min_delta_ns);
+	dev->min_delta_ns += dev->min_delta_ns >> 1;
+	if (printk_ratelimit())
+		printk(KERN_WARNING "hrtimer: interrupt too slow, "
+			"forcing clock min delta to %lu ns\n",
+			dev->min_delta_ns);
 }
 /*
  * High resolution timer interrupt
@@ -1276,7 +1268,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
  retry:
 	/* 5 retries is enough to notice a hang */
 	if (!(++nr_retries % 5))
-		hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
+		hrtimer_interrupt_hanging(dev);
 
 	now = ktime_get();
 
@@ -1342,7 +1334,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 
 	/* Reprogramming necessary ? */
 	if (expires_next.tv64 != KTIME_MAX) {
-		if (tick_program_event(expires_next, force_clock_reprogram))
+		if (tick_program_event(expires_next, 0))
 			goto retry;
 	}
 }

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
       [not found]   ` <4AC88E7E.8050909@msgid.tls.msk.ru>
@ 2009-10-05  0:50     ` Marcelo Tosatti
  2009-10-05  9:31       ` Michael Tokarev
  0 siblings, 1 reply; 29+ messages in thread
From: Marcelo Tosatti @ 2009-10-05  0:50 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: kvm, Frederic Weisbecker, Ingo Molnar

On Sun, Oct 04, 2009 at 04:01:02PM +0400, Michael Tokarev wrote:
> Marcelo Tosatti wrote:
>> Michael,
>>
>> Can you please give the patch below a try please? (without acpi_pm 
>> timer or priority adjustments for the guest).
>
> Sure.  I'll try it out in a hour or two, while I can experiment freely because
> it's weekend.
>
> But I wonder...
> []
>>> hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns
>>
>> It seems the way hrtimer_interrupt_hanging calculates min_delta is
>> wrong (especially to virtual machines). The guest vcpu can be scheduled
>> out during the execution of the hrtimer callbacks (and the callbacks
>> themselves can do operations that translate to blocking operations in
>> the hypervisor).
>>
>> So high min_delta values can be calculated if, for example, a single
>> hrtimer_interrupt run takes two host time slices to execute, while some
>> other higher priority task runs for N slices in between.
>>
>> Using the hrtimer_interrupt execution time (which can be the worse
>> case at any given time), as the min_delta is problematic.
>>
>> So simply increase min_delta_ns by 50% once every detected failure,
>> which will eventually lead to an acceptable threshold (the algorithm
>> should scale back to down lower min_delta, to adjust back to wealthier
>> times, too).
>
> ..I wonder what should I check for.  I mean, the end result of this patch
> is not entirely clear to me, what should it change?  I see that instead
> of the now-calculated-after-error (probably very large) min_delta, it's
> increased to a smaller value.
>
> So I should be getting more such messages (forcing min_delta to $foo), but
> the "responsiveness" of the guest should stay in more or less acceptable
> range (unless it will continue erroring, in which case the "responsiveness"
> will be slowly reduced).

Right.

> Yes indeed, it's better than current situation, when the guest works fine
> initially but out of the sudden it switches to a wild very-slow-to-reply
> mode.  But it does not look like a right solution either, even if the
> back adjustment (mentioned in the last statement above) will be implemented.
> Unless I completely missed the point...
>
> Neverless, the question stands: what I'm looking for now, when the patch is
> applied?  I can't measure the "responsiveness", especially since the min_delta
> gets set to different (large) values each time (I mean current situation
> without the patch).

You should see min_delta_ns increase to a much smaller value, hopefully in
the 2000-10000 range.

min_delta_ns is the minimum delay a high resolution timer can have. You
had it set in the hundreds of milliseconds range.

>
> Thanks!
>
> /mjt
>
>> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
>> index 49da79a..8997978 100644
>> --- a/kernel/hrtimer.c
>> +++ b/kernel/hrtimer.c
>> @@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer)
>>   #ifdef CONFIG_HIGH_RES_TIMERS
>>  -static int force_clock_reprogram;
>> -
>>  /*
>>   * After 5 iteration's attempts, we consider that hrtimer_interrupt()
>>   * is hanging, which could happen with something that slows the interrupt
>> - * such as the tracing. Then we force the clock reprogramming for each future
>> - * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
>> - * threshold that we will overwrite.
>> - * The next tick event will be scheduled to 3 times we currently spend on
>> - * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
>> - * 1/4 of their time to process the hrtimer interrupts. This is enough to
>> - * let it running without serious starvation.
>> + * such as the tracing, so we increase min_delta_ns.
>>   */
>>   static inline void
>> -hrtimer_interrupt_hanging(struct clock_event_device *dev,
>> -			ktime_t try_time)
>> +hrtimer_interrupt_hanging(struct clock_event_device *dev)
>>  {
>> -	force_clock_reprogram = 1;
>> -	dev->min_delta_ns = (unsigned long)try_time.tv64 * 3;
>> -	printk(KERN_WARNING "hrtimer: interrupt too slow, "
>> -		"forcing clock min delta to %lu ns\n", dev->min_delta_ns);
>> +	dev->min_delta_ns += dev->min_delta_ns >> 1;
>> +	if (printk_ratelimit())
>> +		printk(KERN_WARNING "hrtimer: interrupt too slow, "
>> +			"forcing clock min delta to %lu ns\n",
>> +			dev->min_delta_ns);
>>  }
>>  /*
>>   * High resolution timer interrupt
>> @@ -1276,7 +1268,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
>>   retry:
>>  	/* 5 retries is enough to notice a hang */
>>  	if (!(++nr_retries % 5))
>> -		hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
>> +		hrtimer_interrupt_hanging(dev);
>>   	now = ktime_get();
>>  @@ -1342,7 +1334,7 @@ void hrtimer_interrupt(struct clock_event_device 
>> *dev)
>>   	/* Reprogramming necessary ? */
>>  	if (expires_next.tv64 != KTIME_MAX) {
>> -		if (tick_program_event(expires_next, force_clock_reprogram))
>> +		if (tick_program_event(expires_next, 0))
>>  			goto retry;
>>  	}
>>  }

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-05  0:50     ` Marcelo Tosatti
@ 2009-10-05  9:31       ` Michael Tokarev
  2009-10-06 13:30         ` Michael Tokarev
  0 siblings, 1 reply; 29+ messages in thread
From: Michael Tokarev @ 2009-10-05  9:31 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, Frederic Weisbecker, Ingo Molnar

Marcelo Tosatti wrote:
[]
> You should see min_delta_ns increase to a much smaller value, hopefully in
> the 2000-10000 range.

That's what I see:

Oct  4 16:13:09 paltus kernel: [boot]
Oct  5 12:35:51 paltus kernel: hrtimer: interrupt too slow, forcing clock min delta to 1500 ns
Oct  5 12:47:29 paltus kernel: hrtimer: interrupt too slow, forcing clock min delta to 2250 ns

I tried to stress-load the machine yesterday (Oct-4), including massive
network usage (I think it's the network that triggers the issue most
reliable), but wasn't able to trigger it.  Today, at Monday, at normal
system load, it triggered "by its own", increasing min_delta to 50%
as in the patch.  The system runs smoothly so far, but it's to be
expected (till it increases min delta to some 10-digit number :).

But I'm unsure still if that's the right approach...  The root
cause seems to be the problem, and this patch just tries to
work around it in a better-than-before way.  But it might be
just me... ;)

I'll continue watching it, since the machine(s) in question are
important for us.

Thanks!

/mjt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-09-29 13:58   ` Michael Tokarev
@ 2009-10-05 10:47     ` Avi Kivity
  0 siblings, 0 replies; 29+ messages in thread
From: Avi Kivity @ 2009-10-05 10:47 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: kvm, Glauber Costa

On 09/29/2009 03:58 PM, Michael Tokarev wrote:
> Avi Kivity wrote:
>> On 09/29/2009 03:12 PM, Michael Tokarev wrote:
> []
>>> The thing is that after some uptime, kvm
>>> guest prints something like this:
>>>
>>> hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns
> []
>> What happens if you use hpet or pmtimer as guest clocksource?
>
> For all the guests I have handy, only 2 clocksources
> are available: kvm-clock and acpi_pm.  The host itself
> has hpet turned off because it itself had issues with
> hpet, and after many tries we finally turned it off
> (there were several long and painy threads on lkml
> about this).
>
> Dunno why pmtimer isn't available.

acpi_pm is pmtimer.

>
> I tried switching to acpi_pm on a running guest (which
> is still runnining in very slow mode), but that did not
> make any difference - i.e., it did not become "fast"
> again.
>
>> Please post host /proc/cpuinfo.
>
> Here's the cpuinfo from host (for last core):
>
> flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
> mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext 
> fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good 
> nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm 
> extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs

Has both constant_tsc and nonstop_tsc, wonder why it's making so much 
trouble.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-05  9:31       ` Michael Tokarev
@ 2009-10-06 13:30         ` Michael Tokarev
  0 siblings, 0 replies; 29+ messages in thread
From: Michael Tokarev @ 2009-10-06 13:30 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, Frederic Weisbecker, Ingo Molnar

Michael Tokarev wrote:
> Marcelo Tosatti wrote:
> []
>> You should see min_delta_ns increase to a much smaller value, 
>> hopefully in
>> the 2000-10000 range.
> 
> That's what I see:
> 

Oct  4 16:13:09 paltus kernel: [boot]
Oct  5 12:35:51 paltus kernel: hrtimer: interrupt too slow, forcing clock min delta to 1500 ns
Oct  5 12:47:29 paltus kernel: hrtimer: interrupt too slow, forcing clock min delta to 2250 ns
Oct  5 15:46:41 paltus kernel: hrtimer: interrupt too slow, forcing clock min delta to 3375 ns
Oct  5 23:13:08 paltus kernel: hrtimer: interrupt too slow, forcing clock min delta to 5062 ns
Oct  6 17:14:38 paltus kernel: hrtimer: interrupt too slow, forcing clock min delta to 7593 ns

So it's increasing, slowly, so far.  Hopefully it'll slow down
further.

I want to reboot the host today, so it'll start over.

/mjt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-03 23:12 ` Marcelo Tosatti
       [not found]   ` <4AC88E7E.8050909@msgid.tls.msk.ru>
@ 2009-10-07 23:17   ` Frederic Weisbecker
  2009-10-08  0:54     ` Marcelo Tosatti
  2009-10-08  8:09     ` kvm guest: hrtimer: interrupt too slow Michael Tokarev
  1 sibling, 2 replies; 29+ messages in thread
From: Frederic Weisbecker @ 2009-10-07 23:17 UTC (permalink / raw)
  To: Marcelo Tosatti, Thomas Gleixner; +Cc: Michael Tokarev, kvm, Ingo Molnar

(Adding Thomas in Cc)


On Sat, Oct 03, 2009 at 08:12:05PM -0300, Marcelo Tosatti wrote:
> Michael,
> 
> Can you please give the patch below a try please? (without acpi_pm timer 
> or priority adjustments for the guest).
> 
> On Tue, Sep 29, 2009 at 05:12:17PM +0400, Michael Tokarev wrote:
> > Hello.
> >
> > I'm having quite an.. unusable system here.
> > It's not really a regresssion with 0.11.0,
> > it was something similar before, but with
> > 0.11.0 and/or 2.6.31 it become much worse.
> >
> > The thing is that after some uptime, kvm
> > guest prints something like this:
> >
> > hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns
> >
> > after which system (guest) speeed becomes
> > very slow.  The above message is from
> > 2.6.31 guest running wiht 0.11.0 & 2.6.31
> > host.  Before I tried it with 0.10.6 and
> > 2.6.30 or 2.6.27, and the delta were a
> > bit less than that:
> >
> > hrtimer: interrupt too slow, forcing clock min delta to 152222415 ns
> > hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns
> 
> It seems the way hrtimer_interrupt_hanging calculates min_delta is
> wrong (especially to virtual machines). The guest vcpu can be scheduled
> out during the execution of the hrtimer callbacks (and the callbacks
> themselves can do operations that translate to blocking operations in
> the hypervisor).
> 
> So high min_delta values can be calculated if, for example, a single
> hrtimer_interrupt run takes two host time slices to execute, while some
> other higher priority task runs for N slices in between.
> 
> Using the hrtimer_interrupt execution time (which can be the worse
> case at any given time), as the min_delta is problematic.
> 
> So simply increase min_delta_ns by 50% once every detected failure,
> which will eventually lead to an acceptable threshold (the algorithm
> should scale back to down lower min_delta, to adjust back to wealthier
> times, too).
> 
> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index 49da79a..8997978 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer)
>  
>  #ifdef CONFIG_HIGH_RES_TIMERS
>  
> -static int force_clock_reprogram;
> -
>  /*
>   * After 5 iteration's attempts, we consider that hrtimer_interrupt()
>   * is hanging, which could happen with something that slows the interrupt
> - * such as the tracing. Then we force the clock reprogramming for each future
> - * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
> - * threshold that we will overwrite.
> - * The next tick event will be scheduled to 3 times we currently spend on
> - * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
> - * 1/4 of their time to process the hrtimer interrupts. This is enough to
> - * let it running without serious starvation.
> + * such as the tracing, so we increase min_delta_ns.
>   */
>  
>  static inline void
> -hrtimer_interrupt_hanging(struct clock_event_device *dev,
> -			ktime_t try_time)
> +hrtimer_interrupt_hanging(struct clock_event_device *dev)
>  {
> -	force_clock_reprogram = 1;
> -	dev->min_delta_ns = (unsigned long)try_time.tv64 * 3;
> -	printk(KERN_WARNING "hrtimer: interrupt too slow, "
> -		"forcing clock min delta to %lu ns\n", dev->min_delta_ns);
> +	dev->min_delta_ns += dev->min_delta_ns >> 1;


I haven't thought about the guest that could be scheduled out in
the middle of the timers servicing, making wrong this check based
of the time spent in hrtimer_interrupt().

I guess there is no easy/generic/cheap way to rebase this check
on the _virtual_ time spent in the timers servicing. By virtual,
I mean the time spent in the guest only.

In a non-guest kernel, the old check forces an adaptive rate
sharing:

- we spent n nanosecs to service the batch of timers.
- we are hanging
- we want at least 3/4 of time reserved for non-timer
  servicing in the kernel, this is a minimum prerequisite
  for the system to not starve
- adapt the min_clock_delta against to fit the above constraint

All that does not make sense anymore in a guest. The hang detection
and warnings, the recalibrations of the min_clock_deltas are completely
wrong in this context.
Not only does it spuriously warn, but the minimum timer is increasing
slowly and the guest progressively suffers from higher and higher
latencies.

That's really bad.

Your patch lowers the immediate impact and makes this illness evolving
smoother by scaling down the recalibration to the min_clock_delta.
This appeases the bug but doesn't solve it. I fear it could be even
worse because it makes it more discreet.


May be can we instead increase the minimum threshold of loop in the
hrtimer interrupt before considering it as a hang? Hmm, but a too high
number could make this check useless, depending of the number of pending
timers, which is a finite number.

Well, actually I'm not confident anymore in this check. Or actually we
should change it. May be we can rebase it on the time spent on the hrtimer
interrupt (and check it every 10 loops of reprocessing in hrtimer_interrupts).

Would a mimimum threshold of 5 seconds spent in hrtimer_interrupt() be
a reasonable check to perform?
We should probably base our check on such kind of high boundary.
What we want is an ultimate rescue against hard hangs anyway, not
something that can solve the hang source itself. After the min_clock_delta
recalibration, the system will be unstable (eg: high latencies).
So if this must behave as a hammer, let's ensure we really need this hammer,
even if we need to wait for few seconds before it triggers.


What do you think?

Thanks,
Frederic.




> +	if (printk_ratelimit())
> +		printk(KERN_WARNING "hrtimer: interrupt too slow, "
> +			"forcing clock min delta to %lu ns\n",
> +			dev->min_delta_ns);
>  }
>  /*
>   * High resolution timer interrupt
> @@ -1276,7 +1268,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
>   retry:
>  	/* 5 retries is enough to notice a hang */
>  	if (!(++nr_retries % 5))
> -		hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
> +		hrtimer_interrupt_hanging(dev);
>  
>  	now = ktime_get();
>  
> @@ -1342,7 +1334,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
>  
>  	/* Reprogramming necessary ? */
>  	if (expires_next.tv64 != KTIME_MAX) {
> -		if (tick_program_event(expires_next, force_clock_reprogram))
> +		if (tick_program_event(expires_next, 0))
>  			goto retry;
>  	}
>  }


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-07 23:17   ` Frederic Weisbecker
@ 2009-10-08  0:54     ` Marcelo Tosatti
  2009-10-08  7:54       ` Michael Tokarev
  2009-10-08  8:05       ` Thomas Gleixner
  2009-10-08  8:09     ` kvm guest: hrtimer: interrupt too slow Michael Tokarev
  1 sibling, 2 replies; 29+ messages in thread
From: Marcelo Tosatti @ 2009-10-08  0:54 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Thomas Gleixner, Michael Tokarev, kvm, Ingo Molnar

On Thu, Oct 08, 2009 at 01:17:35AM +0200, Frederic Weisbecker wrote:
> (Adding Thomas in Cc)
> 
> 
> On Sat, Oct 03, 2009 at 08:12:05PM -0300, Marcelo Tosatti wrote:
> > Michael,
> > 
> > Can you please give the patch below a try please? (without acpi_pm timer 
> > or priority adjustments for the guest).
> > 
> > On Tue, Sep 29, 2009 at 05:12:17PM +0400, Michael Tokarev wrote:
> > > Hello.
> > >
> > > I'm having quite an.. unusable system here.
> > > It's not really a regresssion with 0.11.0,
> > > it was something similar before, but with
> > > 0.11.0 and/or 2.6.31 it become much worse.
> > >
> > > The thing is that after some uptime, kvm
> > > guest prints something like this:
> > >
> > > hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns
> > >
> > > after which system (guest) speeed becomes
> > > very slow.  The above message is from
> > > 2.6.31 guest running wiht 0.11.0 & 2.6.31
> > > host.  Before I tried it with 0.10.6 and
> > > 2.6.30 or 2.6.27, and the delta were a
> > > bit less than that:
> > >
> > > hrtimer: interrupt too slow, forcing clock min delta to 152222415 ns
> > > hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns
> > 
> > It seems the way hrtimer_interrupt_hanging calculates min_delta is
> > wrong (especially to virtual machines). The guest vcpu can be scheduled
> > out during the execution of the hrtimer callbacks (and the callbacks
> > themselves can do operations that translate to blocking operations in
> > the hypervisor).
> > 
> > So high min_delta values can be calculated if, for example, a single
> > hrtimer_interrupt run takes two host time slices to execute, while some
> > other higher priority task runs for N slices in between.
> > 
> > Using the hrtimer_interrupt execution time (which can be the worse
> > case at any given time), as the min_delta is problematic.
> > 
> > So simply increase min_delta_ns by 50% once every detected failure,
> > which will eventually lead to an acceptable threshold (the algorithm
> > should scale back to down lower min_delta, to adjust back to wealthier
> > times, too).
> > 
> > diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> > index 49da79a..8997978 100644
> > --- a/kernel/hrtimer.c
> > +++ b/kernel/hrtimer.c
> > @@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer)
> >  
> >  #ifdef CONFIG_HIGH_RES_TIMERS
> >  
> > -static int force_clock_reprogram;
> > -
> >  /*
> >   * After 5 iteration's attempts, we consider that hrtimer_interrupt()
> >   * is hanging, which could happen with something that slows the interrupt
> > - * such as the tracing. Then we force the clock reprogramming for each future
> > - * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
> > - * threshold that we will overwrite.
> > - * The next tick event will be scheduled to 3 times we currently spend on
> > - * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
> > - * 1/4 of their time to process the hrtimer interrupts. This is enough to
> > - * let it running without serious starvation.
> > + * such as the tracing, so we increase min_delta_ns.
> >   */
> >  
> >  static inline void
> > -hrtimer_interrupt_hanging(struct clock_event_device *dev,
> > -			ktime_t try_time)
> > +hrtimer_interrupt_hanging(struct clock_event_device *dev)
> >  {
> > -	force_clock_reprogram = 1;
> > -	dev->min_delta_ns = (unsigned long)try_time.tv64 * 3;
> > -	printk(KERN_WARNING "hrtimer: interrupt too slow, "
> > -		"forcing clock min delta to %lu ns\n", dev->min_delta_ns);
> > +	dev->min_delta_ns += dev->min_delta_ns >> 1;
> 
> 
> I haven't thought about the guest that could be scheduled out in
> the middle of the timers servicing, making wrong this check based
> of the time spent in hrtimer_interrupt().
> 
> I guess there is no easy/generic/cheap way to rebase this check
> on the _virtual_ time spent in the timers servicing. By virtual,
> I mean the time spent in the guest only.
> 
> In a non-guest kernel, the old check forces an adaptive rate
> sharing:
> 
> - we spent n nanosecs to service the batch of timers.
> - we are hanging
> - we want at least 3/4 of time reserved for non-timer
>   servicing in the kernel, this is a minimum prerequisite
>   for the system to not starve
> - adapt the min_clock_delta against to fit the above constraint
> 
> All that does not make sense anymore in a guest. The hang detection
> and warnings, the recalibrations of the min_clock_deltas are completely
> wrong in this context.
> Not only does it spuriously warn, but the minimum timer is increasing
> slowly and the guest progressively suffers from higher and higher
> latencies.
> 
> That's really bad.
> 
> Your patch lowers the immediate impact and makes this illness evolving
> smoother by scaling down the recalibration to the min_clock_delta.
> This appeases the bug but doesn't solve it. I fear it could be even
> worse because it makes it more discreet.

True.

> May be can we instead increase the minimum threshold of loop in the
> hrtimer interrupt before considering it as a hang? Hmm, but a too high
> number could make this check useless, depending of the number of pending
> timers, which is a finite number.
> 
> Well, actually I'm not confident anymore in this check. Or actually we
> should change it. May be we can rebase it on the time spent on the hrtimer
> interrupt (and check it every 10 loops of reprocessing in hrtimer_interrupts).
> 
> Would a mimimum threshold of 5 seconds spent in hrtimer_interrupt() be
> a reasonable check to perform?
> We should probably base our check on such kind of high boundary.
> What we want is an ultimate rescue against hard hangs anyway, not
> something that can solve the hang source itself. After the min_clock_delta
> recalibration, the system will be unstable (eg: high latencies).
> So if this must behave as a hammer, let's ensure we really need this hammer,
> even if we need to wait for few seconds before it triggers.
> 
> 
> What do you think?

What about getting rid of the retry loop, instead? So something
like:

- run hrtimer callbacks (once)
- while (tick_program_event(expires))
	expires = ktime_add_ns(expires, dev->min_delta_ns)

This way there's no static tuning involved.

Its not clear to me why the loop is there in the first place.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-08  0:54     ` Marcelo Tosatti
@ 2009-10-08  7:54       ` Michael Tokarev
  2009-10-08  8:06         ` Thomas Gleixner
  2009-10-08  8:05       ` Thomas Gleixner
  1 sibling, 1 reply; 29+ messages in thread
From: Michael Tokarev @ 2009-10-08  7:54 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Frederic Weisbecker, Thomas Gleixner, kvm, Ingo Molnar

[skip everything]

Yesterday I was "lucky" enough to actually watch what's
going on when the delay actually happens.

I run desktop environment on a kvm virtual machine here.
The server is on diskless terminal, and the rest, incl.
the window manager etc, is started from a VM.

And yesterday, during normal system load (nothing extra,
and not idle either, and all the other guests were running
under normal load too), I had a stall of everyhing on this
X session for about 2..3, maybe 5 secounds.

It felt like completely stuck machine. Nothing were moving
on the screen, no reaction to the keyboard etc.

And after several seconds it returned to normal.  With
the familiar message in dmesg -- increasing hrtimer etc,
to the next 50%.  (Without a patch from Marcelo at this
time it shuold increase min_delta to a large number).

To summarize: there's something, well, more interesting
going on here.  In addition to the scheduling issues that
causes timers to be calculated on the "wrong" CPU etc as
mentioned before, we've some real stalls somewhere, and
that only happens on SMP guests (so far anyway).

Thanks!

/mjt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-08  0:54     ` Marcelo Tosatti
  2009-10-08  7:54       ` Michael Tokarev
@ 2009-10-08  8:05       ` Thomas Gleixner
  2009-10-08 19:22         ` Marcelo Tosatti
  1 sibling, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2009-10-08  8:05 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Frederic Weisbecker, Michael Tokarev, kvm, Ingo Molnar

On Wed, 7 Oct 2009, Marcelo Tosatti wrote:
> On Thu, Oct 08, 2009 at 01:17:35AM +0200, Frederic Weisbecker wrote:
> What about getting rid of the retry loop, instead? So something
> like:
> 
> - run hrtimer callbacks (once)
> - while (tick_program_event(expires))
> 	expires = ktime_add_ns(expires, dev->min_delta_ns)
> 
> This way there's no static tuning involved.

And what does that buy us ? We get an timer interrupt right away, so
it's not that much different from the retry loop. See below.

> Its not clear to me why the loop is there in the first place.

We get a timer interrupt and handle the expired timers and find out
the timer which is going to expire next to reprogram the hardware. Now
when we program that expiry time we find out that the timer is already
expired. So instead of programming the hardware to fire an interrupt
in the very near future which you would do with your loop above we
stay in the interrupt handler and expire the timer and any other by
now expired timers right away.

The hang check is just there to avoid starving (slow) machines. We do
this by spreading the timer interrupts out so that the system can do
something else than expiring timers.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-08  7:54       ` Michael Tokarev
@ 2009-10-08  8:06         ` Thomas Gleixner
  2009-10-08  8:14           ` Michael Tokarev
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2009-10-08  8:06 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Marcelo Tosatti, Frederic Weisbecker, kvm, Ingo Molnar

On Thu, 8 Oct 2009, Michael Tokarev wrote:
> Yesterday I was "lucky" enough to actually watch what's
> going on when the delay actually happens.
> 
> I run desktop environment on a kvm virtual machine here.
> The server is on diskless terminal, and the rest, incl.
> the window manager etc, is started from a VM.
> 
> And yesterday, during normal system load (nothing extra,
> and not idle either, and all the other guests were running
> under normal load too), I had a stall of everyhing on this
> X session for about 2..3, maybe 5 secounds.
> 
> It felt like completely stuck machine. Nothing were moving
> on the screen, no reaction to the keyboard etc.
> 
> And after several seconds it returned to normal.  With
> the familiar message in dmesg -- increasing hrtimer etc,
> to the next 50%.  (Without a patch from Marcelo at this
> time it shuold increase min_delta to a large number).
> 
> To summarize: there's something, well, more interesting
> going on here.  In addition to the scheduling issues that
> causes timers to be calculated on the "wrong" CPU etc as

Care to elaborate ?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-07 23:17   ` Frederic Weisbecker
  2009-10-08  0:54     ` Marcelo Tosatti
@ 2009-10-08  8:09     ` Michael Tokarev
  1 sibling, 0 replies; 29+ messages in thread
From: Michael Tokarev @ 2009-10-08  8:09 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Marcelo Tosatti, Thomas Gleixner, kvm, Ingo Molnar

[]
>>> hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns
[]
> All that does not make sense anymore in a guest. The hang detection
> and warnings, the recalibrations of the min_clock_deltas are completely
> wrong in this context.
> Not only does it spuriously warn, but the minimum timer is increasing
> slowly and the guest progressively suffers from higher and higher
> latencies.

Well, it's not "slowly", -- that huge jump shown above is typical.
If my calculations are correct, that's about 0.5 sec min_delta.

> That's really bad.

*nod* :)

> Your patch lowers the immediate impact and makes this illness evolving
> smoother by scaling down the recalibration to the min_clock_delta.
> This appeases the bug but doesn't solve it. I fear it could be even
> worse because it makes it more discreet.

Well, long-term it's not worse still.  New code has a chance to hitting
the same values for min_delta in a long run, but this chance is so small
and the time spent is so long that it can be forgotten about completely.

> May be can we instead increase the minimum threshold of loop in the
> hrtimer interrupt before considering it as a hang? Hmm, but a too high
> number could make this check useless, depending of the number of pending
> timers, which is a finite number.
> 
> Well, actually I'm not confident anymore in this check. Or actually we
> should change it. May be we can rebase it on the time spent on the hrtimer
> interrupt (and check it every 10 loops of reprocessing in hrtimer_interrupts).
> 
> Would a mimimum threshold of 5 seconds spent in hrtimer_interrupt() be
> a reasonable check to perform?
> We should probably base our check on such kind of high boundary.
> What we want is an ultimate rescue against hard hangs anyway, not
> something that can solve the hang source itself. After the min_clock_delta
> recalibration, the system will be unstable (eg: high latencies).
> So if this must behave as a hammer, let's ensure we really need this hammer,
> even if we need to wait for few seconds before it triggers.

By the way, all other cases I've seen this message (hrtimer: interrupt too slow..)
triggering, the problems were elsewhere and re-calibrating timer was not a good
idea anyway, because the problem was elsewhere and changing timer didn't solve
it.

Back into the vm issue at hand.  I (almost) understand what's happening in
the discussion above, but I does not see how it is possible to have such a
*huge* delays explained by scheduling on a different CPU etc.  The delays
are measured in *seconds*, not nano- or micro-secs etc.

I can imagine, say, swapping on host that causes the whole guest to be
swapped out for a while during the timer interrupt handling for example.
But it is NOT what's happening here, at least not that I can see it.
Yes host had some swapping:

  pswpin 17535
  pswpout 41602

but it's not massive and I know when exactly it happened - when I was testing
something else.  Right now free(1) reports:

              total       used       free     shared    buffers     cached
Mem:       8155280    8105704      49576          0    1209136      27440
-/+ buffers/cache:    6869128    1286152
Swap:      8388856     124112    8264744

(and f*ng vmstat that, again, does not show swapping activity at all)

So, I think, the problem is somewhere elsewhere.

By the way, I *think* it only happens with kvm_clock, and does not
happen with acpi_pm clocksource.  Is it worth to check?

Thanks!

/mjt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-08  8:06         ` Thomas Gleixner
@ 2009-10-08  8:14           ` Michael Tokarev
  2009-10-08  9:29             ` Thomas Gleixner
  0 siblings, 1 reply; 29+ messages in thread
From: Michael Tokarev @ 2009-10-08  8:14 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Marcelo Tosatti, Frederic Weisbecker, kvm, Ingo Molnar

Thomas Gleixner wrote:
> On Thu, 8 Oct 2009, Michael Tokarev wrote:
>> Yesterday I was "lucky" enough to actually watch what's
>> going on when the delay actually happens.
>>
>> I run desktop environment on a kvm virtual machine here.
>> The server is on diskless terminal, and the rest, incl.
>> the window manager etc, is started from a VM.
>>
>> And yesterday, during normal system load (nothing extra,
>> and not idle either, and all the other guests were running
>> under normal load too), I had a stall of everyhing on this
>> X session for about 2..3, maybe 5 secounds.
>>
>> It felt like completely stuck machine. Nothing were moving
>> on the screen, no reaction to the keyboard etc.
>>
>> And after several seconds it returned to normal.  With
>> the familiar message in dmesg -- increasing hrtimer etc,
>> to the next 50%.  (Without a patch from Marcelo at this
>> time it shuold increase min_delta to a large number).
>>
>> To summarize: there's something, well, more interesting
>> going on here.  In addition to the scheduling issues that
>> causes timers to be calculated on the "wrong" CPU etc as
> 
> Care to elaborate ?

Such huge delays (in terms of seconds, not ms or ns) - I don't
understand how such delays can be explained by sheduling to the
different cpu etc.  That's what I mean.  I know very little about
all this low-level stuff so I may be completely out of context,
but such explanation does not look right to me, simple as that.
By "scheduling mistakes" we can get mistakes in range of millisecs,
but not secs.

/mjt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-08  8:14           ` Michael Tokarev
@ 2009-10-08  9:29             ` Thomas Gleixner
  2009-10-08 14:06               ` Michael Tokarev
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2009-10-08  9:29 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Marcelo Tosatti, Frederic Weisbecker, kvm, Ingo Molnar

On Thu, 8 Oct 2009, Michael Tokarev wrote:

> Thomas Gleixner wrote:
> > On Thu, 8 Oct 2009, Michael Tokarev wrote:
> > > Yesterday I was "lucky" enough to actually watch what's
> > > going on when the delay actually happens.
> > > 
> > > I run desktop environment on a kvm virtual machine here.
> > > The server is on diskless terminal, and the rest, incl.
> > > the window manager etc, is started from a VM.
> > > 
> > > And yesterday, during normal system load (nothing extra,
> > > and not idle either, and all the other guests were running
> > > under normal load too), I had a stall of everyhing on this
> > > X session for about 2..3, maybe 5 secounds.
> > > 
> > > It felt like completely stuck machine. Nothing were moving
> > > on the screen, no reaction to the keyboard etc.
> > > 
> > > And after several seconds it returned to normal.  With
> > > the familiar message in dmesg -- increasing hrtimer etc,
> > > to the next 50%.  (Without a patch from Marcelo at this
> > > time it shuold increase min_delta to a large number).
> > > 
> > > To summarize: there's something, well, more interesting
> > > going on here.  In addition to the scheduling issues that
> > > causes timers to be calculated on the "wrong" CPU etc as
> > 
> > Care to elaborate ?
> 
> Such huge delays (in terms of seconds, not ms or ns) - I don't
> understand how such delays can be explained by sheduling to the
> different cpu etc.  That's what I mean.  I know very little about
> all this low-level stuff so I may be completely out of context,
> but such explanation does not look right to me, simple as that.
> By "scheduling mistakes" we can get mistakes in range of millisecs,
> but not secs.

I'm really missing the big picture here. 

What means "causes timers to be calculated on the "wrong" CPU etc" ?
And what do you consider a "scheduling mistake" ?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-08  9:29             ` Thomas Gleixner
@ 2009-10-08 14:06               ` Michael Tokarev
  2009-10-08 15:06                 ` Thomas Gleixner
  2009-10-08 19:52                 ` Marcelo Tosatti
  0 siblings, 2 replies; 29+ messages in thread
From: Michael Tokarev @ 2009-10-08 14:06 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Marcelo Tosatti, Frederic Weisbecker, kvm, Ingo Molnar

Thomas Gleixner wrote:
> On Thu, 8 Oct 2009, Michael Tokarev wrote:
> 
>> Thomas Gleixner wrote:
>>> On Thu, 8 Oct 2009, Michael Tokarev wrote:
>>>> Yesterday I was "lucky" enough to actually watch what's
>>>> going on when the delay actually happens.
>>>>
>>>> I run desktop environment on a kvm virtual machine here.
>>>> The server is on diskless terminal, and the rest, incl.
>>>> the window manager etc, is started from a VM.
>>>>
>>>> And yesterday, during normal system load (nothing extra,
>>>> and not idle either, and all the other guests were running
>>>> under normal load too), I had a stall of everyhing on this
>>>> X session for about 2..3, maybe 5 secounds.
>>>>
>>>> It felt like completely stuck machine. Nothing were moving
>>>> on the screen, no reaction to the keyboard etc.
>>>>
>>>> And after several seconds it returned to normal.  With
>>>> the familiar message in dmesg -- increasing hrtimer etc,
>>>> to the next 50%.  (Without a patch from Marcelo at this
>>>> time it shuold increase min_delta to a large number).
>>>>
>>>> To summarize: there's something, well, more interesting
>>>> going on here.  In addition to the scheduling issues that
>>>> causes timers to be calculated on the "wrong" CPU etc as
>>> Care to elaborate ?
>> Such huge delays (in terms of seconds, not ms or ns) - I don't
>> understand how such delays can be explained by sheduling to the
>> different cpu etc.  That's what I mean.  I know very little about
>> all this low-level stuff so I may be completely out of context,
>> but such explanation does not look right to me, simple as that.
>> By "scheduling mistakes" we can get mistakes in range of millisecs,
>> but not secs.
> 
> I'm really missing the big picture here. 
> 
> What means "causes timers to be calculated on the "wrong" CPU etc" ?
> And what do you consider a "scheduling mistake" ?

 From the initial diagnostics by Marcelo:

 > It seems the way hrtimer_interrupt_hanging calculates min_delta is
 > wrong (especially to virtual machines). The guest vcpu can be scheduled
 > out during the execution of the hrtimer callbacks (and the callbacks
 > themselves can do operations that translate to blocking operations in
 > the hypervisor).
 >
 > So high min_delta values can be calculated if, for example, a single
 > hrtimer_interrupt run takes two host time slices to execute, while some
 > other higher priority task runs for N slices in between.

 From this I conclude that the huge min_delta is due to some other task(s)
on the host being run while this guest is in hrtimer callback.  But I
fail to see why that process on the host takes SO MUCH time, to warrant
resulting min_delta to 0.5s, or to cause delays for 3..5 seconds in
guest.  It's ok to have delays in range of several extra milliseconds,
but for *seconds* is too much.

Note again that neither host nor guest are not under high load when
this jump happens.  Also note that there's no high-priority processes
running on the host, all are of the same priority level, including
all the guests.

Note also that so far I only see it on SMP guests, never on UP
guests.  And only on guests with kvm_clock, not with acpi_pm
clocksource.

What I'm trying to say is that it looks like there's something
else wrong here in the guest code.  Huge stalls, huge delays
while in hrtimer callback (i think it jappens always when such
delay is happening, it's just noticed by hrtimer code) -- that's
the root cause of all this, (probably) wrong logic in hrtimer
calibration just shows the results of something that's wrong
elsewhere.

/mjt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-08 14:06               ` Michael Tokarev
@ 2009-10-08 15:06                 ` Thomas Gleixner
  2009-10-08 19:52                 ` Marcelo Tosatti
  1 sibling, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2009-10-08 15:06 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Marcelo Tosatti, Frederic Weisbecker, kvm, Ingo Molnar

On Thu, 8 Oct 2009, Michael Tokarev wrote:
> Thomas Gleixner wrote:
> > 
> > I'm really missing the big picture here. 
> > What means "causes timers to be calculated on the "wrong" CPU etc" ?
> > And what do you consider a "scheduling mistake" ?
> 
> From the initial diagnostics by Marcelo:
> 
> > It seems the way hrtimer_interrupt_hanging calculates min_delta is
> > wrong (especially to virtual machines). The guest vcpu can be scheduled
> > out during the execution of the hrtimer callbacks (and the callbacks
> > themselves can do operations that translate to blocking operations in
> > the hypervisor).
> >
> > So high min_delta values can be calculated if, for example, a single
> > hrtimer_interrupt run takes two host time slices to execute, while some
> > other higher priority task runs for N slices in between.
> 
> From this I conclude that the huge min_delta is due to some other task(s)
> on the host being run while this guest is in hrtimer callback.  But I
> fail to see why that process on the host takes SO MUCH time, to warrant
> resulting min_delta to 0.5s, or to cause delays for 3..5 seconds in
> guest.  It's ok to have delays in range of several extra milliseconds,
> but for *seconds* is too much.
> 
> Note again that neither host nor guest are not under high load when
> this jump happens.  Also note that there's no high-priority processes
> running on the host, all are of the same priority level, including
> all the guests.
> 
> Note also that so far I only see it on SMP guests, never on UP
> guests.  And only on guests with kvm_clock, not with acpi_pm
> clocksource.
> 
> What I'm trying to say is that it looks like there's something
> else wrong here in the guest code.  Huge stalls, huge delays
> while in hrtimer callback (i think it jappens always when such
> delay is happening, it's just noticed by hrtimer code) -- that's
> the root cause of all this, (probably) wrong logic in hrtimer
> calibration just shows the results of something that's wrong
> elsewhere.

Ah, ok. That makes sense. The hrtimer interrupt hang check detects
that the CPU was "stolen" for whatever reasons. I'm wondering why this
happens several times in a row - it takes at least 4 iterations until
it decides to make the interval larger.

Thanks,

	tglx
 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-08  8:05       ` Thomas Gleixner
@ 2009-10-08 19:22         ` Marcelo Tosatti
  2009-10-08 20:25           ` Thomas Gleixner
  0 siblings, 1 reply; 29+ messages in thread
From: Marcelo Tosatti @ 2009-10-08 19:22 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Frederic Weisbecker, Michael Tokarev, kvm, Ingo Molnar

On Thu, Oct 08, 2009 at 10:05:01AM +0200, Thomas Gleixner wrote:
> On Wed, 7 Oct 2009, Marcelo Tosatti wrote:
> > On Thu, Oct 08, 2009 at 01:17:35AM +0200, Frederic Weisbecker wrote:
> > What about getting rid of the retry loop, instead? So something
> > like:
> > 
> > - run hrtimer callbacks (once)
> > - while (tick_program_event(expires))
> > 	expires = ktime_add_ns(expires, dev->min_delta_ns)
> > 
> > This way there's no static tuning involved.
> 
> And what does that buy us ? We get an timer interrupt right away, so
> it's not that much different from the retry loop. See below.
> 
> > Its not clear to me why the loop is there in the first place.
> 
> We get a timer interrupt and handle the expired timers and find out
> the timer which is going to expire next to reprogram the hardware. Now
> when we program that expiry time we find out that the timer is already
> expired. So instead of programming the hardware to fire an interrupt
> in the very near future which you would do with your loop above we
> stay in the interrupt handler and expire the timer and any other by
> now expired timers right away.
> 
> The hang check is just there to avoid starving (slow) machines. We do
> this by spreading the timer interrupts out so that the system can do
> something else than expiring timers.

OK, makes sense.

So why not program only the next tick using the heuristic, without 
touching min_delta_ns?



diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index c03f221..4fcb670 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1178,29 +1178,16 @@ static void __run_hrtimer(struct hrtimer *timer)
 
 #ifdef CONFIG_HIGH_RES_TIMERS
 
-static int force_clock_reprogram;
-
 /*
  * After 5 iteration's attempts, we consider that hrtimer_interrupt()
  * is hanging, which could happen with something that slows the interrupt
- * such as the tracing. Then we force the clock reprogramming for each future
- * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
- * threshold that we will overwrite.
+ * such as the tracing. 
  * The next tick event will be scheduled to 3 times we currently spend on
  * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
  * 1/4 of their time to process the hrtimer interrupts. This is enough to
  * let it running without serious starvation.
  */
 
-static inline void
-hrtimer_interrupt_hanging(struct clock_event_device *dev,
-			ktime_t try_time)
-{
-	force_clock_reprogram = 1;
-	dev->min_delta_ns = (unsigned long)try_time.tv64 * 3;
-	printk(KERN_WARNING "hrtimer: interrupt too slow, "
-		"forcing clock min delta to %lu ns\n", dev->min_delta_ns);
-}
 /*
  * High resolution timer interrupt
  * Called with interrupts disabled
@@ -1219,8 +1206,16 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 
  retry:
 	/* 5 retries is enough to notice a hang */
-	if (!(++nr_retries % 5))
-		hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
+	if (!(++nr_retries % 5)) {
+		ktime_t try_time = ktime_sub(ktime_get(), now);
+
+		do {
+			for (i = 0; i < 3; i++)
+				expires_next = ktime_add(expires_next,try_time);
+		} while (tick_program_event(expires_next, 0));
+
+		return;
+	}
 
 	now = ktime_get();
 
@@ -1286,7 +1281,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 
 	/* Reprogramming necessary ? */
 	if (expires_next.tv64 != KTIME_MAX) {
-		if (tick_program_event(expires_next, force_clock_reprogram))
+		if (tick_program_event(expires_next, 0))
 			goto retry;
 	}
 }


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-08 14:06               ` Michael Tokarev
  2009-10-08 15:06                 ` Thomas Gleixner
@ 2009-10-08 19:52                 ` Marcelo Tosatti
  2009-10-09 21:22                   ` Michael Tokarev
  1 sibling, 1 reply; 29+ messages in thread
From: Marcelo Tosatti @ 2009-10-08 19:52 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Thomas Gleixner, Frederic Weisbecker, kvm

On Thu, Oct 08, 2009 at 06:06:39PM +0400, Michael Tokarev wrote:
> Thomas Gleixner wrote:
>> On Thu, 8 Oct 2009, Michael Tokarev wrote:
>>
>>> Thomas Gleixner wrote:
>>>> On Thu, 8 Oct 2009, Michael Tokarev wrote:
>>>>> Yesterday I was "lucky" enough to actually watch what's
>>>>> going on when the delay actually happens.
>>>>>
>>>>> I run desktop environment on a kvm virtual machine here.
>>>>> The server is on diskless terminal, and the rest, incl.
>>>>> the window manager etc, is started from a VM.
>>>>>
>>>>> And yesterday, during normal system load (nothing extra,
>>>>> and not idle either, and all the other guests were running
>>>>> under normal load too), I had a stall of everyhing on this
>>>>> X session for about 2..3, maybe 5 secounds.
>>>>>
>>>>> It felt like completely stuck machine. Nothing were moving
>>>>> on the screen, no reaction to the keyboard etc.
>>>>>
>>>>> And after several seconds it returned to normal.  With
>>>>> the familiar message in dmesg -- increasing hrtimer etc,
>>>>> to the next 50%.  (Without a patch from Marcelo at this
>>>>> time it shuold increase min_delta to a large number).
>>>>>
>>>>> To summarize: there's something, well, more interesting
>>>>> going on here.  In addition to the scheduling issues that
>>>>> causes timers to be calculated on the "wrong" CPU etc as
>>>> Care to elaborate ?
>>> Such huge delays (in terms of seconds, not ms or ns) - I don't
>>> understand how such delays can be explained by sheduling to the
>>> different cpu etc.  That's what I mean.  I know very little about
>>> all this low-level stuff so I may be completely out of context,
>>> but such explanation does not look right to me, simple as that.
>>> By "scheduling mistakes" we can get mistakes in range of millisecs,
>>> but not secs.
>>
>> I'm really missing the big picture here. 
>>
>> What means "causes timers to be calculated on the "wrong" CPU etc" ?
>> And what do you consider a "scheduling mistake" ?
>
> From the initial diagnostics by Marcelo:
>
> > It seems the way hrtimer_interrupt_hanging calculates min_delta is
> > wrong (especially to virtual machines). The guest vcpu can be scheduled
> > out during the execution of the hrtimer callbacks (and the callbacks
> > themselves can do operations that translate to blocking operations in
> > the hypervisor).
> >
> > So high min_delta values can be calculated if, for example, a single
> > hrtimer_interrupt run takes two host time slices to execute, while some
> > other higher priority task runs for N slices in between.
>
> From this I conclude that the huge min_delta is due to some other task(s)
> on the host being run while this guest is in hrtimer callback.  But I
> fail to see why that process on the host takes SO MUCH time, to warrant
> resulting min_delta to 0.5s, or to cause delays for 3..5 seconds in
> guest.  It's ok to have delays in range of several extra milliseconds,
> but for *seconds* is too much.
>
> Note again that neither host nor guest are not under high load when
> this jump happens.  Also note that there's no high-priority processes
> running on the host, all are of the same priority level, including
> all the guests.
>
> Note also that so far I only see it on SMP guests, never on UP
> guests.  And only on guests with kvm_clock, not with acpi_pm
> clocksource.
>
> What I'm trying to say is that it looks like there's something
> else wrong here in the guest code.  Huge stalls, huge delays
> while in hrtimer callback (i think it jappens always when such
> delay is happening, it's just noticed by hrtimer code) -- that's
> the root cause of all this, (probably) wrong logic in hrtimer
> calibration just shows the results of something that's wrong
> elsewhere.

True.

Would be useful to collect sar (sar -B -b -u) output every one second
in both host/guest. You already mentioned load was low, but this should
give more details.

Was there swapping going on?



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-08 19:22         ` Marcelo Tosatti
@ 2009-10-08 20:25           ` Thomas Gleixner
  2009-10-08 21:02             ` Michael Tokarev
  2009-10-10 17:32             ` [PATCH] tune hrtimer_interrupt hang logic Marcelo Tosatti
  0 siblings, 2 replies; 29+ messages in thread
From: Thomas Gleixner @ 2009-10-08 20:25 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Frederic Weisbecker, Michael Tokarev, kvm, Ingo Molnar

On Thu, 8 Oct 2009, Marcelo Tosatti wrote:
> On Thu, Oct 08, 2009 at 10:05:01AM +0200, Thomas Gleixner wrote:
> > On Wed, 7 Oct 2009, Marcelo Tosatti wrote:
> > > On Thu, Oct 08, 2009 at 01:17:35AM +0200, Frederic Weisbecker wrote:
> > > What about getting rid of the retry loop, instead? So something
> > > like:
> > > 
> > > - run hrtimer callbacks (once)
> > > - while (tick_program_event(expires))
> > > 	expires = ktime_add_ns(expires, dev->min_delta_ns)
> > > 
> > > This way there's no static tuning involved.
> > 
> > And what does that buy us ? We get an timer interrupt right away, so
> > it's not that much different from the retry loop. See below.
> > 
> > > Its not clear to me why the loop is there in the first place.
> > 
> > We get a timer interrupt and handle the expired timers and find out
> > the timer which is going to expire next to reprogram the hardware. Now
> > when we program that expiry time we find out that the timer is already
> > expired. So instead of programming the hardware to fire an interrupt
> > in the very near future which you would do with your loop above we
> > stay in the interrupt handler and expire the timer and any other by
> > now expired timers right away.
> > 
> > The hang check is just there to avoid starving (slow) machines. We do
> > this by spreading the timer interrupts out so that the system can do
> > something else than expiring timers.
> 
> OK, makes sense.
> 
> So why not program only the next tick using the heuristic, without 
> touching min_delta_ns?

That makes a certain amount of sense albeit I really hate that
heuristics crap especially when we know that we run as a guest. 

We better add a function pointer to the clock event device struct
which defaults to "force it to be slow" for real hardware and can be
overridden by paravirt guests.

Also it's not clear to me why the problem does only happen with
kvm_clock and not with acpi_pm timer emulation (according to the
reporter) and is restricted to SMP guests.

>   retry:
>  	/* 5 retries is enough to notice a hang */
> -	if (!(++nr_retries % 5))
> -		hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
> +	if (!(++nr_retries % 5)) {
> +		ktime_t try_time = ktime_sub(ktime_get(), now);
> +
> +		do {
> +			for (i = 0; i < 3; i++)
> +				expires_next = ktime_add(expires_next,try_time);
> +		} while (tick_program_event(expires_next, 0));

  This needs at least a WARN_ON_ONCE() or some other way (sysfs, proc,
  ...) where we can find out how often this happens.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-08 20:25           ` Thomas Gleixner
@ 2009-10-08 21:02             ` Michael Tokarev
  2009-10-10 17:32             ` [PATCH] tune hrtimer_interrupt hang logic Marcelo Tosatti
  1 sibling, 0 replies; 29+ messages in thread
From: Michael Tokarev @ 2009-10-08 21:02 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Marcelo Tosatti, Frederic Weisbecker, kvm, Ingo Molnar

Thomas Gleixner wrote:
[]
> Also it's not clear to me why the problem does only happen with
> kvm_clock and not with acpi_pm timer emulation (according to the
> reporter) and is restricted to SMP guests.

I just reproduced it with acpi_pm.  I explained it already to Marcelo,
the problem is that the issue is difficult to trigger.  I still don't
have any pointers as of how to trigger it, all my attempts so far, to
create network, disk or cpu load, failed.  So the only way is to run
the guest and wait, in a hope it'll be there.  And I restarted the
"guinea pig" guest today (which happens to be our main office server :),
and voila, after ~4 hours uptime said the same thing about hrtimer.
That's lucky time, since it may run stable for several days...
It just happens (and I mentioned it each time) that I didn't *see* the
issue with acpi_pm.  Now I see it with acpi_pm too.

Speaking of smp -- well, that one is of the same category.  Maybe smp
just makes the issue easier to trigger but it exists with UP guests too,
maybe it's SMP-specific - I don't know.  What I know for sure is that
out of 4 guests here (running on the same host), 2 are SMP and 2 UP,
loaded approximately equally (according to the number of CPUs), and
two SMP guests shows the issue quite easily, while for 2 UP guests
I don't see anything in the logs for last 2 months.

The issue isn't unique to my machines, other people reported it too
in #kvm, including at least one active participant there.  For him,
issues stopped when he switched from SMP to UP guest.  Yet there's
no definite knowlege if the issue is really SMP-specific or not.

>>   retry:
>>  	/* 5 retries is enough to notice a hang */
>> -	if (!(++nr_retries % 5))
>> -		hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
>> +	if (!(++nr_retries % 5)) {
>> +		ktime_t try_time = ktime_sub(ktime_get(), now);
>> +
>> +		do {
>> +			for (i = 0; i < 3; i++)
>> +				expires_next = ktime_add(expires_next,try_time);
>> +		} while (tick_program_event(expires_next, 0));
> 
>   This needs at least a WARN_ON_ONCE() or some other way (sysfs, proc,
>   ...) where we can find out how often this happens.

Definitely.  Or printk_ratelimit.

Before Marcelo come with his first version I was thinking about exposing
that min_delta over procfs to be able to reset it back to a reasonable
value.. ;)

Thanks!

/mjt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-08 19:52                 ` Marcelo Tosatti
@ 2009-10-09 21:22                   ` Michael Tokarev
  2009-10-09 22:27                     ` Frederic Weisbecker
  0 siblings, 1 reply; 29+ messages in thread
From: Michael Tokarev @ 2009-10-09 21:22 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Thomas Gleixner, Frederic Weisbecker, kvm

Marcelo Tosatti wrote:
[snip]

> Would be useful to collect sar (sar -B -b -u) output every one second
> in both host/guest. You already mentioned load was low, but this should
> give more details.

Here we go:  http://www.corpit.ru/mjt/hrtimer-interrupt-too-slow/

Two incindents - cases when hrtimer: interrupt is too slow were
reported in the guest (with Marcelo's patch so that min_delta is
increased to 50% only), happened at 11:44:48 and 21:46:56 (as
shown in guest-dmesg file).  For both, there's `sar -BWbd' output
for a 2-minute interval (starting one minute before the delay
and ending one minute after) from both host and guest.

> Was there swapping going on?

Not as far as I can see, and sar output agrees.

/mjt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-09 21:22                   ` Michael Tokarev
@ 2009-10-09 22:27                     ` Frederic Weisbecker
  2009-10-09 22:34                       ` Michael Tokarev
  0 siblings, 1 reply; 29+ messages in thread
From: Frederic Weisbecker @ 2009-10-09 22:27 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Marcelo Tosatti, Thomas Gleixner, kvm

On Sat, Oct 10, 2009 at 01:22:16AM +0400, Michael Tokarev wrote:
> Marcelo Tosatti wrote:
> [snip]
>
>> Would be useful to collect sar (sar -B -b -u) output every one second
>> in both host/guest. You already mentioned load was low, but this should
>> give more details.
>
> Here we go:  http://www.corpit.ru/mjt/hrtimer-interrupt-too-slow/
>
> Two incindents - cases when hrtimer: interrupt is too slow were
> reported in the guest (with Marcelo's patch so that min_delta is
> increased to 50% only), happened at 11:44:48 and 21:46:56 (as
> shown in guest-dmesg file).  For both, there's `sar -BWbd' output
> for a 2-minute interval (starting one minute before the delay
> and ending one minute after) from both host and guest.
>
>> Was there swapping going on?
>
> Not as far as I can see, and sar output agrees.


But I can read this from you guest traces:

                pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff

11:44:45         0.00     32.32    174.75      0.00    176.77      0.00      0.00      0.00      0.00
11:44:46         0.00     16.00    789.00      0.00    323.00      0.00      0.00      0.00      0.00
11:44:47         0.00     32.32    907.07      0.00    277.78      0.00      0.00      0.00      0.00
11:44:48        27.59     22.99     44.83      0.00    150.57      0.00      0.00      0.00      0.00
11:44:49         0.00     33.68     22.11      0.00    218.95      0.00      0.00      0.00      0.00
11:44:50         0.00    101.01     17.17      0.00    151.52      0.00      0.00      0.00      0.00
11:44:51         0.00     15.69     16.67      0.00    126.47      0.00      0.00      0.00      0.00
[...]
21:46:52         0.00     40.00     17.00      0.00     82.00      0.00      0.00      0.00      0.00
21:46:53         0.00     31.68     18.81      0.00     94.06      0.00      0.00      0.00      0.00
21:46:54         0.00     31.68     16.83      0.00     90.10      0.00      0.00      0.00      0.00
21:46:55         0.00    108.00     17.00      0.00     89.00      0.00      0.00      0.00      0.00
21:46:56         9.76    482.93   3890.24      0.00    439.02      0.00      0.00      0.00      0.00
21:46:57         0.00    760.00   8627.00      0.00   1133.00      0.00      0.00      0.00      0.00
21:46:58         0.00     84.85   2612.12      0.00    138.38      0.00      0.00      0.00      0.00
21:46:59         0.00     16.00     17.00      0.00     90.00      0.00      0.00      0.00      0.00
21:47:00         0.00     36.36     17.17      0.00     90.91      0.00      0.00      0.00      0.00



So it looks like there was some swapping in when the hrtimer (spuriously)
hanged.

Not sure how to correlate these results with the incident though...




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-09 22:27                     ` Frederic Weisbecker
@ 2009-10-09 22:34                       ` Michael Tokarev
  2009-10-10  9:18                         ` Michael Tokarev
  2009-10-10 17:37                         ` Marcelo Tosatti
  0 siblings, 2 replies; 29+ messages in thread
From: Michael Tokarev @ 2009-10-09 22:34 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Marcelo Tosatti, Thomas Gleixner, kvm

Frederic Weisbecker wrote:
> On Sat, Oct 10, 2009 at 01:22:16AM +0400, Michael Tokarev wrote:
>> Marcelo Tosatti wrote:
>> [snip]
>>
>>> Would be useful to collect sar (sar -B -b -u) output every one second
>>> in both host/guest. You already mentioned load was low, but this should
>>> give more details.
>> Here we go:  http://www.corpit.ru/mjt/hrtimer-interrupt-too-slow/
>>
>> Two incindents - cases when hrtimer: interrupt is too slow were
>> reported in the guest (with Marcelo's patch so that min_delta is
>> increased to 50% only), happened at 11:44:48 and 21:46:56 (as
>> shown in guest-dmesg file).  For both, there's `sar -BWbd' output
>> for a 2-minute interval (starting one minute before the delay
>> and ending one minute after) from both host and guest.
>>
>>> Was there swapping going on?
>> Not as far as I can see, and sar output agrees.
> 
> But I can read this from you guest traces:
> 
>                 pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
> 
> 11:44:45         0.00     32.32    174.75      0.00    176.77      0.00      0.00      0.00      0.00
> 11:44:46         0.00     16.00    789.00      0.00    323.00      0.00      0.00      0.00      0.00
> 11:44:47         0.00     32.32    907.07      0.00    277.78      0.00      0.00      0.00      0.00
> 11:44:48        27.59     22.99     44.83      0.00    150.57      0.00      0.00      0.00      0.00
> 11:44:49         0.00     33.68     22.11      0.00    218.95      0.00      0.00      0.00      0.00
> 11:44:50         0.00    101.01     17.17      0.00    151.52      0.00      0.00      0.00      0.00
> 11:44:51         0.00     15.69     16.67      0.00    126.47      0.00      0.00      0.00      0.00
> [...]
> 21:46:52         0.00     40.00     17.00      0.00     82.00      0.00      0.00      0.00      0.00
> 21:46:53         0.00     31.68     18.81      0.00     94.06      0.00      0.00      0.00      0.00
> 21:46:54         0.00     31.68     16.83      0.00     90.10      0.00      0.00      0.00      0.00
> 21:46:55         0.00    108.00     17.00      0.00     89.00      0.00      0.00      0.00      0.00
> 21:46:56         9.76    482.93   3890.24      0.00    439.02      0.00      0.00      0.00      0.00
> 21:46:57         0.00    760.00   8627.00      0.00   1133.00      0.00      0.00      0.00      0.00
> 21:46:58         0.00     84.85   2612.12      0.00    138.38      0.00      0.00      0.00      0.00
> 21:46:59         0.00     16.00     17.00      0.00     90.00      0.00      0.00      0.00      0.00
> 21:47:00         0.00     36.36     17.17      0.00     90.91      0.00      0.00      0.00      0.00
> 
> 
> So it looks like there was some swapping in when the hrtimer (spuriously)
> hanged.

It's not swapping.  Swapping is in a separate table, with columns titled
pswpin/s and pswpout/s -- first table.

On my home machine with no swap at all, 4gig memory and 2gig free,
pgpgin and pgpgout are increasing too.

Also, while in the second case (21:46:56) there's actually some
noticeable activity (page faults at least), in first case that
activity is modest.

Note there's no documentation for /proc/vmstat file :)

/mjt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-09 22:34                       ` Michael Tokarev
@ 2009-10-10  9:18                         ` Michael Tokarev
  2009-10-10  9:24                           ` Frederic Weisbecker
  2009-10-10 17:37                         ` Marcelo Tosatti
  1 sibling, 1 reply; 29+ messages in thread
From: Michael Tokarev @ 2009-10-10  9:18 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Marcelo Tosatti, Thomas Gleixner, kvm

Michael Tokarev wrote:
> Frederic Weisbecker wrote:
[]
>>>> Was there swapping going on?
>>> Not as far as I can see, and sar output agrees.
>>
>> But I can read this from you guest traces:

I missed this one yesterday.  Note it's GUEST traces
indeed.  Higher (read: non-zero) pgp{in,out} and faults
values happens in *guest*, not on host (original question
was if we've swapping in HOST, which'd explain the timer
issues)

[cutting extra all-zero columns]
>>                 pgpgin/s pgpgout/s   fault/s  pgfree/s
>> 11:44:47         0.00     32.32    907.07       277.78
>> 11:44:48        27.59     22.99     44.83       150.57
>> 11:44:49         0.00     33.68     22.11       218.95
>> [...]
>> 21:46:54         0.00     31.68     16.83        90.10
>> 21:46:55         0.00    108.00     17.00        89.00
>> 21:46:56         9.76    482.93   3890.24       439.02
>> 21:46:57         0.00    760.00   8627.00      1133.00
>> 21:46:58         0.00     84.85   2612.12       138.38
>> 21:46:59         0.00     16.00     17.00        90.00
>>
>> So it looks like there was some swapping in when the hrtimer (spuriously)
>> hanged.

One possible guess.  Since the guest hanged for some time, the
higher values there might be a result of accumulated values for
several seconds.

> It's not swapping.  Swapping is in a separate table, with columns titled
> pswpin/s and pswpout/s -- first table.
> 
> On my home machine with no swap at all, 4gig memory and 2gig free,
> pgpgin and pgpgout are increasing too.
> 
> Also, while in the second case (21:46:56) there's actually some
> noticeable activity (page faults at least), in first case that
> activity is modest.
> 
> Note there's no documentation for /proc/vmstat file :)
> 
> /mjt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-10  9:18                         ` Michael Tokarev
@ 2009-10-10  9:24                           ` Frederic Weisbecker
  0 siblings, 0 replies; 29+ messages in thread
From: Frederic Weisbecker @ 2009-10-10  9:24 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Marcelo Tosatti, Thomas Gleixner, kvm

On Sat, Oct 10, 2009 at 01:18:16PM +0400, Michael Tokarev wrote:
> Michael Tokarev wrote:
>> Frederic Weisbecker wrote:
> []
>>>>> Was there swapping going on?
>>>> Not as far as I can see, and sar output agrees.
>>>
>>> But I can read this from you guest traces:
>
> I missed this one yesterday.  Note it's GUEST traces
> indeed.  Higher (read: non-zero) pgp{in,out} and faults
> values happens in *guest*, not on host (original question
> was if we've swapping in HOST, which'd explain the timer
> issues)



Yeah indeed. But still, that's a strange happenstance.



> [cutting extra all-zero columns]
>>>                 pgpgin/s pgpgout/s   fault/s  pgfree/s
>>> 11:44:47         0.00     32.32    907.07       277.78
>>> 11:44:48        27.59     22.99     44.83       150.57
>>> 11:44:49         0.00     33.68     22.11       218.95
>>> [...]
>>> 21:46:54         0.00     31.68     16.83        90.10
>>> 21:46:55         0.00    108.00     17.00        89.00
>>> 21:46:56         9.76    482.93   3890.24       439.02
>>> 21:46:57         0.00    760.00   8627.00      1133.00
>>> 21:46:58         0.00     84.85   2612.12       138.38
>>> 21:46:59         0.00     16.00     17.00        90.00
>>>
>>> So it looks like there was some swapping in when the hrtimer (spuriously)
>>> hanged.
>
> One possible guess.  Since the guest hanged for some time, the
> higher values there might be a result of accumulated values for
> several seconds.



May be yeah. I don't know enough about virtual internals so...


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] tune hrtimer_interrupt hang logic
  2009-10-08 20:25           ` Thomas Gleixner
  2009-10-08 21:02             ` Michael Tokarev
@ 2009-10-10 17:32             ` Marcelo Tosatti
  1 sibling, 0 replies; 29+ messages in thread
From: Marcelo Tosatti @ 2009-10-10 17:32 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Frederic Weisbecker, Michael Tokarev, kvm, Ingo Molnar


The hrtimer_interrupt hang logic adjusts min_delta_ns based on the
execution time of the hrtimer callbacks.

This is error-prone for virtual machines, where a guest vcpu can be
scheduled out during the execution of the callbacks (and the callbacks
themselves can do operations that translate to blocking operations in
the hypervisor), which in can lead to large min_delta_ns rendering the
system unusable.

Change the logic to simply schedule the next interrupt using the 1/4 
ratio, while keeping min_delta_ns intact.

Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index ff037f0..88a8ca5 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -245,6 +245,8 @@ static inline ktime_t hrtimer_expires_remaining(const struct hrtimer *timer)
 #ifdef CONFIG_HIGH_RES_TIMERS
 struct clock_event_device;
 
+extern int hrtimer_interrupt_hang;
+
 extern void clock_was_set(void);
 extern void hres_timers_resume(void);
 extern void hrtimer_interrupt(struct clock_event_device *dev);
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 6d70204..6b81888 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1219,29 +1219,17 @@ static void __run_hrtimer(struct hrtimer *timer, ktime_t *now)
 
 #ifdef CONFIG_HIGH_RES_TIMERS
 
-static int force_clock_reprogram;
-
 /*
  * After 5 iteration's attempts, we consider that hrtimer_interrupt()
  * is hanging, which could happen with something that slows the interrupt
- * such as the tracing. Then we force the clock reprogramming for each future
- * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
- * threshold that we will overwrite.
+ * such as the tracing. 
  * The next tick event will be scheduled to 3 times we currently spend on
  * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
  * 1/4 of their time to process the hrtimer interrupts. This is enough to
  * let it running without serious starvation.
  */
+int __read_mostly hrtimer_interrupt_hang;
 
-static inline void
-hrtimer_interrupt_hanging(struct clock_event_device *dev,
-			ktime_t try_time)
-{
-	force_clock_reprogram = 1;
-	dev->min_delta_ns = (unsigned long)try_time.tv64 * 3;
-	printk(KERN_WARNING "hrtimer: interrupt too slow, "
-		"forcing clock min delta to %lu ns\n", dev->min_delta_ns);
-}
 /*
  * High resolution timer interrupt
  * Called with interrupts disabled
@@ -1260,8 +1248,27 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 
  retry:
 	/* 5 retries is enough to notice a hang */
-	if (!(++nr_retries % 5))
-		hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
+	if (!(++nr_retries % 5)) {
+		int tries = 0;
+		ktime_t try_time = ktime_sub(ktime_get(), now);
+
+		if (ktime_to_ns(try_time) < dev->min_delta_ns)
+			try_time = ns_to_ktime(dev->min_delta_ns);
+
+		do {
+			tries++;
+			for (i = 0; i < 3*tries; i++)
+				expires_next = ktime_add(expires_next,try_time);
+
+			printk_once(KERN_WARNING "hrtimer: interrupt too slow, "
+					"scheduling tick %lld ns ahead\n",
+					ktime_to_ns(ktime_sub(expires_next,
+						    ktime_get())));
+		} while (tick_program_event(expires_next, 1));
+
+		hrtimer_interrupt_hang++;
+		return;
+	}
 
 	now = ktime_get();
 
@@ -1327,7 +1334,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 
 	/* Reprogramming necessary ? */
 	if (expires_next.tv64 != KTIME_MAX) {
-		if (tick_program_event(expires_next, force_clock_reprogram))
+		if (tick_program_event(expires_next, 0))
 			goto retry;
 	}
 }
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 0d949c5..7223853 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1032,6 +1032,17 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_HIGH_RES_TIMERS
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "hrtimer_interrupt_hang",
+		.data		= &hrtimer_interrupt_hang,
+		.maxlen		= sizeof (int),
+		.mode		= 0444,
+		.proc_handler	= &proc_dointvec,
+	
+	},
+#endif
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: kvm guest: hrtimer: interrupt too slow
  2009-10-09 22:34                       ` Michael Tokarev
  2009-10-10  9:18                         ` Michael Tokarev
@ 2009-10-10 17:37                         ` Marcelo Tosatti
  1 sibling, 0 replies; 29+ messages in thread
From: Marcelo Tosatti @ 2009-10-10 17:37 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Frederic Weisbecker, Thomas Gleixner, kvm

On Sat, Oct 10, 2009 at 02:34:22AM +0400, Michael Tokarev wrote:
> Frederic Weisbecker wrote:
>> On Sat, Oct 10, 2009 at 01:22:16AM +0400, Michael Tokarev wrote:
>>> Marcelo Tosatti wrote:
>>> [snip]
>>>
>>>> Would be useful to collect sar (sar -B -b -u) output every one second
>>>> in both host/guest. You already mentioned load was low, but this should
>>>> give more details.
>>> Here we go:  http://www.corpit.ru/mjt/hrtimer-interrupt-too-slow/
>>>
>>> Two incindents - cases when hrtimer: interrupt is too slow were
>>> reported in the guest (with Marcelo's patch so that min_delta is
>>> increased to 50% only), happened at 11:44:48 and 21:46:56 (as
>>> shown in guest-dmesg file).  For both, there's `sar -BWbd' output
>>> for a 2-minute interval (starting one minute before the delay
>>> and ending one minute after) from both host and guest.
>>>
>>>> Was there swapping going on?
>>> Not as far as I can see, and sar output agrees.
>>
>> But I can read this from you guest traces:
>>
>>                 pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
>>
>> 11:44:45         0.00     32.32    174.75      0.00    176.77      0.00      0.00      0.00      0.00
>> 11:44:46         0.00     16.00    789.00      0.00    323.00      0.00      0.00      0.00      0.00
>> 11:44:47         0.00     32.32    907.07      0.00    277.78      0.00      0.00      0.00      0.00
>> 11:44:48        27.59     22.99     44.83      0.00    150.57      0.00      0.00      0.00      0.00
>> 11:44:49         0.00     33.68     22.11      0.00    218.95      0.00      0.00      0.00      0.00
>> 11:44:50         0.00    101.01     17.17      0.00    151.52      0.00      0.00      0.00      0.00
>> 11:44:51         0.00     15.69     16.67      0.00    126.47      0.00      0.00      0.00      0.00
>> [...]
>> 21:46:52         0.00     40.00     17.00      0.00     82.00      0.00      0.00      0.00      0.00
>> 21:46:53         0.00     31.68     18.81      0.00     94.06      0.00      0.00      0.00      0.00
>> 21:46:54         0.00     31.68     16.83      0.00     90.10      0.00      0.00      0.00      0.00
>> 21:46:55         0.00    108.00     17.00      0.00     89.00      0.00      0.00      0.00      0.00
>> 21:46:56         9.76    482.93   3890.24      0.00    439.02      0.00      0.00      0.00      0.00
>> 21:46:57         0.00    760.00   8627.00      0.00   1133.00      0.00      0.00      0.00      0.00
>> 21:46:58         0.00     84.85   2612.12      0.00    138.38      0.00      0.00      0.00      0.00
>> 21:46:59         0.00     16.00     17.00      0.00     90.00      0.00      0.00      0.00      0.00
>> 21:47:00         0.00     36.36     17.17      0.00     90.91      0.00      0.00      0.00      0.00
>>
>>
>> So it looks like there was some swapping in when the hrtimer (spuriously)
>> hanged.
>
> It's not swapping.  Swapping is in a separate table, with columns titled
> pswpin/s and pswpout/s -- first table.
>
> On my home machine with no swap at all, 4gig memory and 2gig free,
> pgpgin and pgpgout are increasing too.
>
> Also, while in the second case (21:46:56) there's actually some
> noticeable activity (page faults at least), in first case that
> activity is modest.

Yes. It still could be scheduling though (but you mention the host is
idle, you forgot cpu stats in sar).

It might be a QEMU bug, can you share your qemu-kvm command line?
There's always (guest) disk i/o when the hang happens.

Also please give the latest hrtimer_interrupt patch a try.

Thanks!


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2009-10-10 17:38 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-29 13:12 kvm guest: hrtimer: interrupt too slow Michael Tokarev
2009-09-29 13:47 ` Avi Kivity
2009-09-29 13:58   ` Michael Tokarev
2009-10-05 10:47     ` Avi Kivity
2009-10-03 23:12 ` Marcelo Tosatti
     [not found]   ` <4AC88E7E.8050909@msgid.tls.msk.ru>
2009-10-05  0:50     ` Marcelo Tosatti
2009-10-05  9:31       ` Michael Tokarev
2009-10-06 13:30         ` Michael Tokarev
2009-10-07 23:17   ` Frederic Weisbecker
2009-10-08  0:54     ` Marcelo Tosatti
2009-10-08  7:54       ` Michael Tokarev
2009-10-08  8:06         ` Thomas Gleixner
2009-10-08  8:14           ` Michael Tokarev
2009-10-08  9:29             ` Thomas Gleixner
2009-10-08 14:06               ` Michael Tokarev
2009-10-08 15:06                 ` Thomas Gleixner
2009-10-08 19:52                 ` Marcelo Tosatti
2009-10-09 21:22                   ` Michael Tokarev
2009-10-09 22:27                     ` Frederic Weisbecker
2009-10-09 22:34                       ` Michael Tokarev
2009-10-10  9:18                         ` Michael Tokarev
2009-10-10  9:24                           ` Frederic Weisbecker
2009-10-10 17:37                         ` Marcelo Tosatti
2009-10-08  8:05       ` Thomas Gleixner
2009-10-08 19:22         ` Marcelo Tosatti
2009-10-08 20:25           ` Thomas Gleixner
2009-10-08 21:02             ` Michael Tokarev
2009-10-10 17:32             ` [PATCH] tune hrtimer_interrupt hang logic Marcelo Tosatti
2009-10-08  8:09     ` kvm guest: hrtimer: interrupt too slow Michael Tokarev

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.