Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.

From: Thomas Gleixner <tglx@linutronix.de>
To: John Stultz <john.stultz@linaro.org>
Cc: David Gibson <david@gibson.dropbear.id.au>,
	lkml <linux-kernel@vger.kernel.org>,
	Liav Rehana <liavr@mellanox.com>,
	Chris Metcalf <cmetcalf@mellanox.com>,
	Richard Cochran <richardcochran@gmail.com>,
	Ingo Molnar <mingo@kernel.org>,
	Prarit Bhargava <prarit@redhat.com>,
	Laurent Vivier <lvivier@redhat.com>,
	"Christopher S . Hall" <christopher.s.hall@intel.com>,
	"4.6+" <stable@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation.
Date: Thu, 1 Dec 2016 23:44:02 +0100 (CET)
Message-ID: <alpine.DEB.2.20.1612012315261.3666@nanos>
In-Reply-To: <CALAqxLXg2i6uiWcq21LK-ZsPvtugbuJa7Y8U0upXczS_o9aZOQ@mail.gmail.com>

On Thu, 1 Dec 2016, John Stultz wrote:

> On Thu, Dec 1, 2016 at 12:46 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Thu, 1 Dec 2016, John Stultz wrote:
> >> I would also suggest:
> >> 3) If the systems are halted for longer than the timekeeping core
> >> expects, the system will "miss" or "lose" some portion of that halted
> >> time, but otherwise the system will function properly.  Which is the
> >> result with this patch.
> >
> > Wrong. This is not the result with this patch.
> >
> > If the time advances enough to overflow the unsigned mult, which is
> > entirely possible as it takes just twice the time of the negative overflow,
> > then time will go backwards again and that's not 'miss' or 'lose', that's
> > just broken.
> 
> Eh? If you overflow the 64bits on the mult, the shift (which is likely
> large if you're actually hitting the overflow) brings the value back
> down to a smaller value. Time doesn't go backwards, it's just smaller
> than it ought to be (since the high bits were lost).

WTF?

If the mult overflows, what on earth guarantees that any of the upper bits
is set?

A very simple example:

T1:
   u64 delta = 0x1000000000 - 1;
   u64 mult  = 0x10000000;
   u64 res;

   res = delta * mult;

==> res == 0xfffffffff0000000

T2:
   u64 delta = 0x1000000000;
   u64 mult  = 0x10000000;
   u64 res;

   res = delta * mult;

==> res == 0

because delta * mult == 1 << 64

Ergo: T2 < T1, AKA: Time goes backwards.

Maybe it's just me not understanding how the bits are set by the following
shift....
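
For anyone who wants to try this in user space, here is a minimal sketch of
the same example, including the shift. The helper name ns_from_cycles_u64()
and the shift value of 28 are assumptions made for illustration only; this is
not the kernel's actual conversion code.

```c
#include <stdint.h>

/*
 * Hypothetical helper: converts a cycle delta to nanoseconds with plain
 * 64-bit math, i.e. (delta * mult) >> shift. The multiplication silently
 * wraps modulo 2^64 when the product does not fit.
 */
static inline uint64_t ns_from_cycles_u64(uint64_t delta, uint32_t mult,
					   uint32_t shift)
{
	return (delta * mult) >> shift;
}

/*
 * With mult = 0x10000000 and an assumed shift of 28:
 *
 *   T1: delta = 0x1000000000 - 1  ->  product 0xfffffffff0000000
 *                                 ->  result  0xfffffffff
 *   T2: delta = 0x1000000000      ->  product wraps to 0
 *                                 ->  result  0
 *
 * The shift cannot bring back bits which were lost in the multiplication,
 * so T2 < T1 no matter which shift is chosen.
 */
```

Calling the helper with the T1 and T2 values above shows the later reading
being smaller than the earlier one.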

> > If we want to prevent that, then we either have to clamp the delta value,
> > which is the worst choice or use 128bit math to avoid the overflow.
> 
> I'm not convinced yet either of these approaches are really needed.

Then please explain how you solve the issue without time going backwards
and not impacting the fast path.
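
To make the two options mentioned above concrete, here is a rough user space
sketch of both: clamping the delta and using 128bit math. The function names
are made up for illustration, unsigned __int128 is a GCC/Clang extension, and
mult is assumed to be non-zero. It says nothing about how this would be wired
into the fast path.

```c
#include <stdint.h>

/*
 * Option 1 (hypothetical sketch): clamp delta to the largest value whose
 * product with mult still fits into 64 bits. Time never goes backwards,
 * but it stops advancing once the clamp is hit. Assumes mult != 0.
 */
static inline uint64_t ns_from_cycles_clamped(uint64_t delta, uint32_t mult,
					       uint32_t shift)
{
	uint64_t max_delta = UINT64_MAX / mult;

	if (delta > max_delta)
		delta = max_delta;
	return (delta * mult) >> shift;
}

/*
 * Option 2 (hypothetical sketch): do the multiplication in 128 bits so the
 * product cannot wrap, then shift and truncate to 64 bits.
 */
static inline uint64_t ns_from_cycles_128(uint64_t delta, uint32_t mult,
					   uint32_t shift)
{
	unsigned __int128 prod = (unsigned __int128)delta * mult;

	return (uint64_t)(prod >> shift);
}
```

With mult = 0x10000000 and an assumed shift of 28, the clamped variant
returns 0xfffffffff for both delta = 0x1000000000 - 1 and delta =
0x1000000000, while the 128bit variant returns 0xfffffffff and 0x1000000000
respectively, so only the latter keeps advancing.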

> >> I'm not sure if it's really worth trying to recover that time or be
> >> perfect in those situations. Especially since on narrow clocksources
> >> you'll have the same result.
> >
> > We can deal with the 64bit overflow at least for wide clocksources which
> > all virtualization infected architectures provide in a sane way.
> 
> Another approach would be to push back on the virtualization
> environments to step in and virtualize a solution if they've idled a
> host for too long. They could do like the old tick-based
> virtualization environments used to and trigger a few timer interrupts
> while slowly removing a fake negative clocksource offset to allow time
> to catch up more normally after a long stall.

And that's going to happen after we retired, right?

Aside from that it's just silly hackery and won't ever work reliably because
there is no guarantee that the guest can handle the interrupts _before_ it
trips over the time going backwards issue. You can call ktime_get() in
interrupt disabled code.

> Or they could require clocksources that have smaller shift values to
> allow longer idle periods.

Could require? You have to do that in the guest kernel for the price of
less accuracy. The hypervisor won't help with that.

> > For bare metal systems with narrow clocksources the whole issue is non
> > existent. We can make the 128bit math depend on both a config switch and a
> > static key, so bare metal will not have to take the burden.
> 
> Bare metal machines also sometimes run virtualization. I'm not sure
> the two are usefully exclusive.

Bare metal does not have the problem, whether the system is used as a
hypervisor or not. The guests CANNOT prevent the host from running the tick
interrupt, but the host very well can prevent the guest from running.

If you are talking about S390/PPC style hypervisors which pretend that
Linux is running on bare metal, then yes Linux is still a guest and prone
to the same issue, if that hypervisor supports overcommitment and is silly
enough to keep the guests scheduled out long enough.

Thanks,

	tglx