Re: [PATCH] time,virt: resync steal time when guest & host lose sync

From: Wanpeng Li <kernellwp@gmail.com>
To: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>,
	Ingo Molnar <mingo@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Wanpeng Li <wanpeng.li@hotmail.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Radim Krcmar <rkrcmar@redhat.com>, Mike Galbraith <efault@gmx.de>
Subject: Re: [PATCH] time,virt: resync steal time when guest & host lose sync
Date: Tue, 16 Aug 2016 14:54:50 +0800	[thread overview]
Message-ID: <CANRm+CyxRQjewCNQ96CKw0ZSk0zW5CTAuPwjsmCVSdFKs_Sfpg@mail.gmail.com> (raw)
In-Reply-To: <1471313483.32433.33.camel@redhat.com>

2016-08-16 10:11 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Tue, 2016-08-16 at 09:31 +0800, Wanpeng Li wrote:
>> 2016-08-15 23:00 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> > On Mon, 2016-08-15 at 16:53 +0800, Wanpeng Li wrote:
>> > > 2016-08-12 23:58 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> > > [...]
>> > > > Wanpeng, does the patch below work for you?
>> > >
>> > > It will break steal time for full dynticks guest, and there is a
>> > > calltrace of thread_group_cputime_adjusted call stack, RIP is
>> > > cputime_adjust+0xff/0x130.
>> >
>> > How?  This patch is equivalent to passing ULONG_MAX to
>> > steal_account_process_time, which you tried to no ill
>> > effect before.
>>
>> https://lkml.org/lkml/2016/6/8/404/ Paolo original suggested to add
>> the max cputime limit to the vtime, when the cpu is running in nohz
>> full mode and stop the tick, jiffies will be updated depends on clock
>> source instead of clock event device in
>> guest(tick_nohz_update_jiffies() callsite, ktime_get()), so it will
>> not be affected by lost clock ticks, my patch keeps the limit for
>> vtime and remove the limit to non-vtime. However, your patch removes
>> the limit for both scenarios and results in the below calltrace for
>> vtime.
>
> I understand what it does.
>
> What I would like to understand is WHY enforcing the limit
> is the right thing when using vtime, and the wrong thing
> in all other scenarios.

I observed that function get_vtime_delta() underflow which means that
delta < other when debugging your bugfix patch, I believe that is why
Paolo suggested to add the max cputime limit to vtime, he also pointed
out the potentional underflow before
https://lkml.org/lkml/2016/6/8/404/

>
> Can you explain why you change the limit to ULONG_MAX in
> three call sites, but not in the last one?
>
> What is different about the first three, versus the last
> one?
>
> Are you sure it should be changed in three places, and
> not in eg. two?
>
> This seems like something we should try to understand,
> rather than patch up haphazardly.
>
> The changelog of your patch could use an explanation of
> why the change is the correct way to go.
>
>> >
>> > Do you have the full call trace?
>
> OK, so you are seeing a divide by zero in cputime_adjust.
>
> Specifically, this would be scale_stime getting passed
> a (utime + stime) that adds up to 0. Stranger still, that
> only gets called if neither utime or stime is 0, meaning
> that one of utime or stime is negative, at the exact same
> magnitude as the other.
>
> Looking at thread_group_cputime(), I see some room for
> rounding errors.
>
>         do {
>                 seq = nextseq;
>                 flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock,
> &seq);
>                 times->utime = sig->utime;
>                 times->stime = sig->stime;
>                 times->sum_exec_runtime = sig->sum_sched_runtime;
>
>                 for_each_thread(tsk, t) {
>                         task_cputime(t, &utime, &stime);
>                         times->utime += utime;
>                         times->stime += stime;
>                         times->sum_exec_runtime +=
> task_sched_runtime(t);
>                 }
>                 /* If lockless access failed, take the lock. */
>                 nextseq = 1;
>         } while (need_seqretry(&sig->stats_lock, seq));
>
> Specifically, task_cputime calls vtime_delta, which works
> off jiffies, while task_sched_runtime works straight off
> the sched clock.

I try to replace task_sched_runtime() by t->se.sum_exec_runtime just
for testing (refer to Commit d670ec13178d0 "posix-cpu-timers: Cure SMP
wobbles"), divide zero still appear.

>
> I can see how this would lead to a non-zero ->sum_exec_runtime,
> while ->utime and/or ->stime are still at zero. This is fine.
>
> What I do not see is how ->utime or ->stime could end up negative,
> which is what would be needed to hit that divide by zero.

stime is negative since underflow incurred in function
get_vtime_delta(), however, it is u64, so it is a very big number, and
the total is 0 after for loop in scale_time(), but not 0 before that.

Regards,
Wanpeng Li