All of lore.kernel.org
 help / color / mirror / Atom feed
* [BUG] Paravirtual time accounting / IRQ time accounting
@ 2014-03-19  9:42 lwcheng
  2014-03-20 15:01 ` Glauber Costa
  0 siblings, 1 reply; 8+ messages in thread
From: lwcheng @ 2014-03-19  9:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: glommer

In consolidated environments, when there are multiple virtual machines (VMs)
running on one CPU core, timekeeping will be a problem to the guest OS.
Here, I report my findings about Linux process scheduler.


Description
------------
Linux CFS relies on rq->clock_task to charge each task, determine  
vruntime, etc.

When CONFIG_IRQ_TIME_ACCOUNTING is enabled, the time spent on serving IRQ
will be excluded from updating rq->clock_task.
When CONFIG_PARAVIRT_TIME_ACCOUNTING is enabled, the time stolen by  
the hypervisor
will also be excluded from updating rq->clock_task.

With "both" CONFIG_IRQ_TIME_ACCOUNTING and  
CONFIG_PARAVIRT_TIME_ACCOUNTING enabled,
I put three KVM guests on one core and run hackbench in each guest. I  
find that
in the guests, rq->clock_task stays *unchanged*. The malfunction  
embarrasses CFS.
------------


Analysis
------------
[src/kernel/sched/core.c]
static void update_rq_clock_task(struct rq *rq, s64 delta)
{
     ... ...
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
     irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
     ... ...
     rq->prev_irq_time += irq_delta;
     delta -= irq_delta;
#endif

#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
     if (static_key_false((&paravirt_steal_rq_enabled))) {
         steal = paravirt_steal_clock(cpu_of(rq));
         steal -= rq->prev_steal_time_rq;
         ... ...
         rq->prev_steal_time_rq += steal;
         delta -= steal;
     }
#endif

     rq->clock_task += delta;
     ... ...
}
--
"delta" -> the intended increment to rq->clock_task
"irq_delta" -> the time spent on serving IRQ (hard + soft)
"steal" -> the time stolen by the underlying hypervisor
--
"irq_delta" is calculated based on sched_clock_cpu(), which is vulnerable
to VM scheduling delays. "irq_delta" can include part or whole of "steal".
I observe that [irq_delta + steal >> delta].
As a result, "delta" becomes zero. That is why rq->clock_task stops.
------------

Please confirm this bug. Thanks.


Luwei Cheng
--
CS student
The University of Hong Kong

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] Paravirtual time accounting / IRQ time accounting
  2014-03-19  9:42 [BUG] Paravirtual time accounting / IRQ time accounting lwcheng
@ 2014-03-20 15:01 ` Glauber Costa
  2014-03-21  5:50   ` Mike Galbraith
  2014-03-21 11:31   ` Rik van Riel
  0 siblings, 2 replies; 8+ messages in thread
From: Glauber Costa @ 2014-03-20 15:01 UTC (permalink / raw)
  To: lwcheng, Rik van Riel, Peter Zijlstra; +Cc: LKML

On Wed, Mar 19, 2014 at 6:42 AM,  <lwcheng@cs.hku.hk> wrote:
> In consolidated environments, when there are multiple virtual machines (VMs)
> running on one CPU core, timekeeping will be a problem to the guest OS.
> Here, I report my findings about Linux process scheduler.
>
>
> Description
> ------------
> Linux CFS relies on rq->clock_task to charge each task, determine vruntime,
> etc.
>
> When CONFIG_IRQ_TIME_ACCOUNTING is enabled, the time spent on serving IRQ
> will be excluded from updating rq->clock_task.
> When CONFIG_PARAVIRT_TIME_ACCOUNTING is enabled, the time stolen by the
> hypervisor
> will also be excluded from updating rq->clock_task.
>
> With "both" CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_PARAVIRT_TIME_ACCOUNTING
> enabled,
> I put three KVM guests on one core and run hackbench in each guest. I find
> that
> in the guests, rq->clock_task stays *unchanged*. The malfunction embarrasses
> CFS.
> ------------
>
>
> Analysis
> ------------
> [src/kernel/sched/core.c]
> static void update_rq_clock_task(struct rq *rq, s64 delta)
> {
>     ... ...
> #ifdef CONFIG_IRQ_TIME_ACCOUNTING
>     irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
>     ... ...
>     rq->prev_irq_time += irq_delta;
>     delta -= irq_delta;
> #endif
>
> #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
>     if (static_key_false((&paravirt_steal_rq_enabled))) {
>         steal = paravirt_steal_clock(cpu_of(rq));
>         steal -= rq->prev_steal_time_rq;
>         ... ...
>         rq->prev_steal_time_rq += steal;
>         delta -= steal;
>     }
> #endif
>
>     rq->clock_task += delta;
>     ... ...
> }
> --
> "delta" -> the intended increment to rq->clock_task
> "irq_delta" -> the time spent on serving IRQ (hard + soft)
> "steal" -> the time stolen by the underlying hypervisor
> --
> "irq_delta" is calculated based on sched_clock_cpu(), which is vulnerable
> to VM scheduling delays.

This looks like a real problem indeed. The main problem in searching
for a solution, is that of course not all of the irq time is steal
time and vice versa. In this case, we could subtract irq_time from
steal, and add only the steal part time that is in excess. I don't
think this is 100 % guaranteed, but maybe it is a good approximation.

Rik, do you have an opinion on this ?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] Paravirtual time accounting / IRQ time accounting
  2014-03-20 15:01 ` Glauber Costa
@ 2014-03-21  5:50   ` Mike Galbraith
  2014-03-22  6:47     ` lwcheng
  2014-03-21 11:31   ` Rik van Riel
  1 sibling, 1 reply; 8+ messages in thread
From: Mike Galbraith @ 2014-03-21  5:50 UTC (permalink / raw)
  To: Glauber Costa; +Cc: lwcheng, Rik van Riel, Peter Zijlstra, LKML

On Thu, 2014-03-20 at 12:01 -0300, Glauber Costa wrote: 
> On Wed, Mar 19, 2014 at 6:42 AM,  <lwcheng@cs.hku.hk> wrote:
> > In consolidated environments, when there are multiple virtual machines (VMs)
> > running on one CPU core, timekeeping will be a problem to the guest OS.
> > Here, I report my findings about Linux process scheduler.
> >
> >
> > Description
> > ------------
> > Linux CFS relies on rq->clock_task to charge each task, determine vruntime,
> > etc.
> >
> > When CONFIG_IRQ_TIME_ACCOUNTING is enabled, the time spent on serving IRQ
> > will be excluded from updating rq->clock_task.
> > When CONFIG_PARAVIRT_TIME_ACCOUNTING is enabled, the time stolen by the
> > hypervisor
> > will also be excluded from updating rq->clock_task.
> >
> > With "both" CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_PARAVIRT_TIME_ACCOUNTING
> > enabled,
> > I put three KVM guests on one core and run hackbench in each guest. I find
> > that
> > in the guests, rq->clock_task stays *unchanged*. The malfunction embarrasses
> > CFS.
> > ------------
> >
> >
> > Analysis
> > ------------
> > [src/kernel/sched/core.c]
> > static void update_rq_clock_task(struct rq *rq, s64 delta)
> > {
> >     ... ...
> > #ifdef CONFIG_IRQ_TIME_ACCOUNTING
> >     irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
> >     ... ...
> >     rq->prev_irq_time += irq_delta;
> >     delta -= irq_delta;
> > #endif
> >
> > #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> >     if (static_key_false((&paravirt_steal_rq_enabled))) {
> >         steal = paravirt_steal_clock(cpu_of(rq));
> >         steal -= rq->prev_steal_time_rq;
> >         ... ...
> >         rq->prev_steal_time_rq += steal;
> >         delta -= steal;
> >     }
> > #endif
> >
> >     rq->clock_task += delta;
> >     ... ...
> > }
> > --
> > "delta" -> the intended increment to rq->clock_task
> > "irq_delta" -> the time spent on serving IRQ (hard + soft)
> > "steal" -> the time stolen by the underlying hypervisor
> > --
> > "irq_delta" is calculated based on sched_clock_cpu(), which is vulnerable
> > to VM scheduling delays.
> 
> This looks like a real problem indeed. The main problem in searching
> for a solution, is that of course not all of the irq time is steal
> time and vice versa. In this case, we could subtract irq_time from
> steal, and add only the steal part time that is in excess. I don't
> think this is 100 % guaranteed, but maybe it is a good approximation.
> 
> Rik, do you have an opinion on this ?

Hrm, on my little Q6600 box, I'm running 3 VMS all pinned to CPU3, all
running hackbench -l zillion, one of them also running crash, staring at
it's sole rq->clock_task as I write this, with kernels (3.11.10) on both
host and guest configured as reported.

  clock_task = 631322187004, 
  clock_task = 631387807452, 
  clock_task = 631474214294, 
  clock_task = 631523864893, 
  clock_task = 631604646268, 
  clock_task = 631643276025, 

Maybe 3 VMs isn't enough overload for such a beastly CPU.  Top reports
some very funky utilization numbers, but other than that, the things
seem to work fine here.  perf thinks scheduling work too.

-Mike


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] Paravirtual time accounting / IRQ time accounting
  2014-03-20 15:01 ` Glauber Costa
  2014-03-21  5:50   ` Mike Galbraith
@ 2014-03-21 11:31   ` Rik van Riel
  2014-03-22  7:15     ` lwcheng
  1 sibling, 1 reply; 8+ messages in thread
From: Rik van Riel @ 2014-03-21 11:31 UTC (permalink / raw)
  To: Glauber Costa, lwcheng, Peter Zijlstra; +Cc: LKML

On 03/20/2014 11:01 AM, Glauber Costa wrote:
> On Wed, Mar 19, 2014 at 6:42 AM,  <lwcheng@cs.hku.hk> wrote:

>> ------------
>> [src/kernel/sched/core.c]
>> static void update_rq_clock_task(struct rq *rq, s64 delta)
>> {
>>     ... ...
>> #ifdef CONFIG_IRQ_TIME_ACCOUNTING
>>     irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
>>     ... ...
>>     rq->prev_irq_time += irq_delta;
>>     delta -= irq_delta;
>> #endif
>>
>> #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
>>     if (static_key_false((&paravirt_steal_rq_enabled))) {
>>         steal = paravirt_steal_clock(cpu_of(rq));
>>         steal -= rq->prev_steal_time_rq;
>>         ... ...
>>         rq->prev_steal_time_rq += steal;
>>         delta -= steal;
>>     }
>> #endif
>>
>>     rq->clock_task += delta;
>>     ... ...
>> }
>> --
>> "delta" -> the intended increment to rq->clock_task
>> "irq_delta" -> the time spent on serving IRQ (hard + soft)
>> "steal" -> the time stolen by the underlying hypervisor
>> --
>> "irq_delta" is calculated based on sched_clock_cpu(), which is vulnerable
>> to VM scheduling delays.
> 
> This looks like a real problem indeed. The main problem in searching
> for a solution, is that of course not all of the irq time is steal
> time and vice versa. In this case, we could subtract irq_time from
> steal, and add only the steal part time that is in excess. I don't
> think this is 100 % guaranteed, but maybe it is a good approximation.
> 
> Rik, do you have an opinion on this ?

The other way around may be better, since steal time (when it
happens) is likely to be of "time slice" magnitude, while irq
time will happen more frequently, and in dozens-of-microseconds
intervals.

Furthermore, we have no way to measure what the irq time is,
except by looking at how much real time elapsed. For steal time,
however, the hypervisor tells us exactly how much time was stolen.

That means when steal time and irq time happen simultaneously,
the amount of steal time should always be smaller than the
calculated irq time for that period.

actual irq_time = calculated irq time - reported steal time;

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] Paravirtual time accounting / IRQ time accounting
  2014-03-21  5:50   ` Mike Galbraith
@ 2014-03-22  6:47     ` lwcheng
  2014-03-22  7:44       ` Mike Galbraith
  0 siblings, 1 reply; 8+ messages in thread
From: lwcheng @ 2014-03-22  6:47 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Glauber Costa, Rik van Riel, Peter Zijlstra, LKML


Quoting Mike Galbraith <umgwanakikbuti@gmail.com>:

> On Thu, 2014-03-20 at 12:01 -0300, Glauber Costa wrote:
>> On Wed, Mar 19, 2014 at 6:42 AM,  <lwcheng@cs.hku.hk> wrote:
>> > In consolidated environments, when there are multiple virtual  
>> machines (VMs)
>> > running on one CPU core, timekeeping will be a problem to the guest OS.
>> > Here, I report my findings about Linux process scheduler.
>> >
>> >
>> > Description
>> > ------------
>> > Linux CFS relies on rq->clock_task to charge each task, determine  
>> vruntime,
>> > etc.
>> >
>> > When CONFIG_IRQ_TIME_ACCOUNTING is enabled, the time spent on serving IRQ
>> > will be excluded from updating rq->clock_task.
>> > When CONFIG_PARAVIRT_TIME_ACCOUNTING is enabled, the time stolen by the
>> > hypervisor
>> > will also be excluded from updating rq->clock_task.
>> >
>> > With "both" CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_PARAVIRT_TIME_ACCOUNTING
>> > enabled,
>> > I put three KVM guests on one core and run hackbench in each guest. I find
>> > that
>> > in the guests, rq->clock_task stays *unchanged*. The malfunction  
>> embarrasses
>> > CFS.
>> > ------------
>> >
>> >
>> > Analysis
>> > ------------
>> > [src/kernel/sched/core.c]
>> > static void update_rq_clock_task(struct rq *rq, s64 delta)
>> > {
>> >     ... ...
>> > #ifdef CONFIG_IRQ_TIME_ACCOUNTING
>> >     irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
>> >     ... ...
>> >     rq->prev_irq_time += irq_delta;
>> >     delta -= irq_delta;
>> > #endif
>> >
>> > #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
>> >     if (static_key_false((&paravirt_steal_rq_enabled))) {
>> >         steal = paravirt_steal_clock(cpu_of(rq));
>> >         steal -= rq->prev_steal_time_rq;
>> >         ... ...
>> >         rq->prev_steal_time_rq += steal;
>> >         delta -= steal;
>> >     }
>> > #endif
>> >
>> >     rq->clock_task += delta;
>> >     ... ...
>> > }
>> > --
>> > "delta" -> the intended increment to rq->clock_task
>> > "irq_delta" -> the time spent on serving IRQ (hard + soft)
>> > "steal" -> the time stolen by the underlying hypervisor
>> > --
>> > "irq_delta" is calculated based on sched_clock_cpu(), which is vulnerable
>> > to VM scheduling delays.
>>
>> This looks like a real problem indeed. The main problem in searching
>> for a solution, is that of course not all of the irq time is steal
>> time and vice versa. In this case, we could subtract irq_time from
>> steal, and add only the steal part time that is in excess. I don't
>> think this is 100 % guaranteed, but maybe it is a good approximation.
>>
>> Rik, do you have an opinion on this ?
>
> Hrm, on my little Q6600 box, I'm running 3 VMS all pinned to CPU3, all
> running hackbench -l zillion, one of them also running crash, staring at
> it's sole rq->clock_task as I write this, with kernels (3.11.10) on both
> host and guest configured as reported.
>
>   clock_task = 631322187004,
>   clock_task = 631387807452,
>   clock_task = 631474214294,
>   clock_task = 631523864893,
>   clock_task = 631604646268,
>   clock_task = 631643276025,
>
> Maybe 3 VMs isn't enough overload for such a beastly CPU.  Top reports
> some very funky utilization numbers, but other than that, the things
> seem to work fine here.  perf thinks scheduling work too.
>
> -Mike
>

I checked the source code again. I forgot to mention that I commented out
steal_ticks() in my experiments (sorry):
[kernel/sched/core.c]
-------------------------------
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
     if (static_key_false((&paravirt_steal_rq_enabled))) {
         /*
         u64 st;
         */
         steal = paravirt_steal_clock(cpu_of(rq));
         steal -= rq->prev_steal_time_rq;

         if (unlikely(steal > delta))
             steal = delta;
         /*
         st = steal_ticks(steal);
         steal = st * TICK_NSEC;
         */
         rq->prev_steal_time_rq += steal;

         delta -= steal;
     }
#endif

     rq->clock_task += delta;
-------------------------------

I do it just for "accuracy", because I fully trust "steal" reported
by the hypervisor. I do not quite understand why it is trimmed again
using steal_ticks(). Please enlighten me.

Even when "steal == delta", as long as "steal" is not exactly
(N * TICK_NSEC), after steal_ticks(), it will be always smaller
than "delta".
So, rq->clock_task can still progress, but may not in the precise way.
Each time, the error is within the range (0, TICK_NSEC).

When CONFIG_IRQ_TIME_ACCOUNTING is disabled, deleting steal_ticks()
does not affect the update to rq->clock_task.

I tested both 3.10.0 and 3.13.5. The results are consistent.

-Luwei





^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] Paravirtual time accounting / IRQ time accounting
  2014-03-21 11:31   ` Rik van Riel
@ 2014-03-22  7:15     ` lwcheng
  2014-03-22 14:57       ` Rik van Riel
  0 siblings, 1 reply; 8+ messages in thread
From: lwcheng @ 2014-03-22  7:15 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Glauber Costa, Peter Zijlstra, LKML


Quoting Rik van Riel <riel@redhat.com>:

> On 03/20/2014 11:01 AM, Glauber Costa wrote:
>> On Wed, Mar 19, 2014 at 6:42 AM,  <lwcheng@cs.hku.hk> wrote:
>
>>> ------------
>>> [src/kernel/sched/core.c]
>>> static void update_rq_clock_task(struct rq *rq, s64 delta)
>>> {
>>>     ... ...
>>> #ifdef CONFIG_IRQ_TIME_ACCOUNTING
>>>     irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
>>>     ... ...
>>>     rq->prev_irq_time += irq_delta;
>>>     delta -= irq_delta;
>>> #endif
>>>
>>> #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
>>>     if (static_key_false((&paravirt_steal_rq_enabled))) {
>>>         steal = paravirt_steal_clock(cpu_of(rq));
>>>         steal -= rq->prev_steal_time_rq;
>>>         ... ...
>>>         rq->prev_steal_time_rq += steal;
>>>         delta -= steal;
>>>     }
>>> #endif
>>>
>>>     rq->clock_task += delta;
>>>     ... ...
>>> }
>>> --
>>> "delta" -> the intended increment to rq->clock_task
>>> "irq_delta" -> the time spent on serving IRQ (hard + soft)
>>> "steal" -> the time stolen by the underlying hypervisor
>>> --
>>> "irq_delta" is calculated based on sched_clock_cpu(), which is vulnerable
>>> to VM scheduling delays.
>>
>> This looks like a real problem indeed. The main problem in searching
>> for a solution, is that of course not all of the irq time is steal
>> time and vice versa. In this case, we could subtract irq_time from
>> steal, and add only the steal part time that is in excess. I don't
>> think this is 100 % guaranteed, but maybe it is a good approximation.
>>
>> Rik, do you have an opinion on this ?
>
> The other way around may be better, since steal time (when it
> happens) is likely to be of "time slice" magnitude, while irq
> time will happen more frequently, and in dozens-of-microseconds
> intervals.
>
> Furthermore, we have no way to measure what the irq time is,
> except by looking at how much real time elapsed. For steal time,
> however, the hypervisor tells us exactly how much time was stolen.
>
> That means when steal time and irq time happen simultaneously,
> the amount of steal time should always be smaller than the
> calculated irq time for that period.
>
> actual irq_time = calculated irq time - reported steal time;
>
> --
> All rights reversed
>

I observe that sometimes irq_time only includes "part" of steal_time.

Like you said, irq_time is in dozens-of-microseconds. In VMs, as all
devices seen are virtual ones, irq_time seems to be not as desired
as it is in physical hosts.

A quick (but not radical) solution may be:
disable CONFIG_IRQ_TIME_ACCOUNTING if CONFIG_PARAVIRT is selected.
Just adopt tick-based accounting: CONFIG_TICK_CPU_ACCOUNTING

I am thinking what irq_time really *means* in VMs.

-Luwei



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] Paravirtual time accounting / IRQ time accounting
  2014-03-22  6:47     ` lwcheng
@ 2014-03-22  7:44       ` Mike Galbraith
  0 siblings, 0 replies; 8+ messages in thread
From: Mike Galbraith @ 2014-03-22  7:44 UTC (permalink / raw)
  To: lwcheng; +Cc: Glauber Costa, Rik van Riel, Peter Zijlstra, LKML

On Sat, 2014-03-22 at 14:47 +0800, lwcheng@cs.hku.hk wrote:

> I checked the source code again. I forgot to mention that I commented out
> steal_ticks() in my experiments (sorry):

Oh well, it wasn't a _complete_ waste of time.  I now have a gaggle of
cute little alien boxen to dissect should I ever again feel inspired to
explore non silicon based CPU anatomy.

-Mike


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [BUG] Paravirtual time accounting / IRQ time accounting
  2014-03-22  7:15     ` lwcheng
@ 2014-03-22 14:57       ` Rik van Riel
  0 siblings, 0 replies; 8+ messages in thread
From: Rik van Riel @ 2014-03-22 14:57 UTC (permalink / raw)
  To: lwcheng; +Cc: Glauber Costa, Peter Zijlstra, LKML

On 03/22/2014 03:15 AM, lwcheng@cs.hku.hk wrote:

> I observe that sometimes irq_time only includes "part" of steal_time.
> 
> Like you said, irq_time is in dozens-of-microseconds. In VMs, as all
> devices seen are virtual ones, irq_time seems to be not as desired
> as it is in physical hosts.
> 
> A quick (but not radical) solution may be:
> disable CONFIG_IRQ_TIME_ACCOUNTING if CONFIG_PARAVIRT is selected.
> Just adopt tick-based accounting: CONFIG_TICK_CPU_ACCOUNTING
> 
> I am thinking what irq_time really *means* in VMs.

That is not really an option for Linux distributions, which tend
to run the same kernel image on the host and in the guest...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2014-03-22 14:57 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-19  9:42 [BUG] Paravirtual time accounting / IRQ time accounting lwcheng
2014-03-20 15:01 ` Glauber Costa
2014-03-21  5:50   ` Mike Galbraith
2014-03-22  6:47     ` lwcheng
2014-03-22  7:44       ` Mike Galbraith
2014-03-21 11:31   ` Rik van Riel
2014-03-22  7:15     ` lwcheng
2014-03-22 14:57       ` Rik van Riel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.