From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753160AbcHPGyy (ORCPT <rfc822;w@1wt.eu>);
	Tue, 16 Aug 2016 02:54:54 -0400
Received: from mail-wm0-f47.google.com ([74.125.82.47]:37920 "EHLO
	mail-wm0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752974AbcHPGyx (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Tue, 16 Aug 2016 02:54:53 -0400
MIME-Version: 1.0
In-Reply-To: <1471313483.32433.33.camel@redhat.com>
References: <1468421405-20056-1-git-send-email-fweisbec@gmail.com>
 <1468421405-20056-2-git-send-email-fweisbec@gmail.com> <CANRm+CyYHLaihuk+ckQg42-Lo_3vFHaS8gU=GmXh8Rfq5mMpaA@mail.gmail.com>
 <1470751579.13905.77.camel@redhat.com> <CANRm+CzxWo9i_=Ygn4CAF0L5=1avtnX=ahwMK96fj3hnQW-_zA@mail.gmail.com>
 <CANRm+CzXSCNGoVbuOB0Ruj2nmfHNRfcO3eB-91Z-fnBOnn-gbQ@mail.gmail.com>
 <20160810125212.78564dc2@annuminas.surriel.com> <CANRm+Cye=ipm5XwhBV-AxjCMZRQTb3ahuYEa6KXBMe_HzxPa2Q@mail.gmail.com>
 <1470969892.13905.120.camel@redhat.com> <CANRm+CzUAinBTigNVW37r4odfZmP6gT9vuHY1tnQrXGXp3i6jA@mail.gmail.com>
 <20160812115803.0f26211c@annuminas.surriel.com> <CANRm+CzF5RLS=yLp7xJqkUG3HO4R-xUxGsOeZUBHj8gyEiD5gA@mail.gmail.com>
 <1471273244.32433.22.camel@redhat.com> <CANRm+CwnHp-fNK41oBWGUaYdOyZdiGCihTsfWx9i6cd_30QvHQ@mail.gmail.com>
 <1471313483.32433.33.camel@redhat.com>
From: Wanpeng Li <kernellwp@gmail.com>
Date: Tue, 16 Aug 2016 14:54:50 +0800
Message-ID: <CANRm+CyxRQjewCNQ96CKw0ZSk0zW5CTAuPwjsmCVSdFKs_Sfpg@mail.gmail.com>
Subject: Re: [PATCH] time,virt: resync steal time when guest & host lose sync
To: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>, Ingo Molnar <mingo@kernel.org>,
        LKML <linux-kernel@vger.kernel.org>,
        Paolo Bonzini <pbonzini@redhat.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Wanpeng Li <wanpeng.li@hotmail.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        Radim Krcmar <rkrcmar@redhat.com>, Mike Galbraith <efault@gmx.de>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

2016-08-16 10:11 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Tue, 2016-08-16 at 09:31 +0800, Wanpeng Li wrote:
>> 2016-08-15 23:00 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> > On Mon, 2016-08-15 at 16:53 +0800, Wanpeng Li wrote:
>> > > 2016-08-12 23:58 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> > > [...]
>> > > > Wanpeng, does the patch below work for you?
>> > >
>> > > It will break steal time for full dynticks guest, and there is a
>> > > calltrace of thread_group_cputime_adjusted call stack, RIP is
>> > > cputime_adjust+0xff/0x130.
>> >
>> > How?  This patch is equivalent to passing ULONG_MAX to
>> > steal_account_process_time, which you tried to no ill
>> > effect before.
>>
>> https://lkml.org/lkml/2016/6/8/404/ Paolo original suggested to add
>> the max cputime limit to the vtime, when the cpu is running in nohz
>> full mode and stop the tick, jiffies will be updated depends on clock
>> source instead of clock event device in
>> guest(tick_nohz_update_jiffies() callsite, ktime_get()), so it will
>> not be affected by lost clock ticks, my patch keeps the limit for
>> vtime and remove the limit to non-vtime. However, your patch removes
>> the limit for both scenarios and results in the below calltrace for
>> vtime.
>
> I understand what it does.
>
> What I would like to understand is WHY enforcing the limit
> is the right thing when using vtime, and the wrong thing
> in all other scenarios.

I observed that function get_vtime_delta() underflow which means that
delta < other when debugging your bugfix patch, I believe that is why
Paolo suggested to add the max cputime limit to vtime, he also pointed
out the potentional underflow before
https://lkml.org/lkml/2016/6/8/404/

>
> Can you explain why you change the limit to ULONG_MAX in
> three call sites, but not in the last one?
>
> What is different about the first three, versus the last
> one?
>
> Are you sure it should be changed in three places, and
> not in eg. two?
>
> This seems like something we should try to understand,
> rather than patch up haphazardly.
>
> The changelog of your patch could use an explanation of
> why the change is the correct way to go.
>
>> >
>> > Do you have the full call trace?
>
> OK, so you are seeing a divide by zero in cputime_adjust.
>
> Specifically, this would be scale_stime getting passed
> a (utime + stime) that adds up to 0. Stranger still, that
> only gets called if neither utime or stime is 0, meaning
> that one of utime or stime is negative, at the exact same
> magnitude as the other.
>
> Looking at thread_group_cputime(), I see some room for
> rounding errors.
>
>         do {
>                 seq = nextseq;
>                 flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock,
> &seq);
>                 times->utime = sig->utime;
>                 times->stime = sig->stime;
>                 times->sum_exec_runtime = sig->sum_sched_runtime;
>
>                 for_each_thread(tsk, t) {
>                         task_cputime(t, &utime, &stime);
>                         times->utime += utime;
>                         times->stime += stime;
>                         times->sum_exec_runtime +=
> task_sched_runtime(t);
>                 }
>                 /* If lockless access failed, take the lock. */
>                 nextseq = 1;
>         } while (need_seqretry(&sig->stats_lock, seq));
>
> Specifically, task_cputime calls vtime_delta, which works
> off jiffies, while task_sched_runtime works straight off
> the sched clock.

I try to replace task_sched_runtime() by t->se.sum_exec_runtime just
for testing (refer to Commit d670ec13178d0 "posix-cpu-timers: Cure SMP
wobbles"), divide zero still appear.

>
> I can see how this would lead to a non-zero ->sum_exec_runtime,
> while ->utime and/or ->stime are still at zero. This is fine.
>
> What I do not see is how ->utime or ->stime could end up negative,
> which is what would be needed to hit that divide by zero.

stime is negative since underflow incurred in function
get_vtime_delta(), however, it is u64, so it is a very big number, and
the total is 0 after for loop in scale_time(), but not 0 before that.

Regards,
Wanpeng Li