From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751699AbdCYShO (ORCPT <rfc822;w@1wt.eu>);
        Sat, 25 Mar 2017 14:37:14 -0400
Received: from Galois.linutronix.de ([146.0.238.70]:41791 "EHLO
        Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751446AbdCYShN (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Sat, 25 Mar 2017 14:37:13 -0400
Date: Sat, 25 Mar 2017 19:36:31 +0100 (CET)
From: Thomas Gleixner <tglx@linutronix.de>
To: Pasha Tatashin <pasha.tatashin@oracle.com>
cc: x86@kernel.org, LKML <linux-kernel@vger.kernel.org>, mingo@redhat.com,
        Peter Zijlstra <peterz@infradead.org>,
        "H. Peter Anvin" <hpa@zytor.com>
Subject: Re: [v2 0/9] Early boot time stamps for x86
In-Reply-To: <a0988fb9-9a41-9090-7718-ecc8aa344dc2@oracle.com>
Message-ID: <alpine.DEB.2.20.1703251730090.4363@nanos>
References: <1490368899-877997-1-git-send-email-pasha.tatashin@oracle.com> <alpine.DEB.2.20.1703251119180.5612@nanos> <a0988fb9-9a41-9090-7718-ecc8aa344dc2@oracle.com>
User-Agent: Alpine 2.20 (DEB 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, 25 Mar 2017, Pasha Tatashin wrote:
> The second versions was actually meant as a reply to your e-mail: the code
> differences were minimal: the main differences were in the cover letter. You
> mentioned that it is not necessary to have early boot time stamps, and  I
> wanted to show examples how this data is useful to track scalability bugs and
> avoid future regressions.

I did not say that early timestamps are not useful. I said they fall not
into the category of things which are absolutely necessary in the earliest
boot process. Can you spot the difference?

If you want to show me something to prove a point, then why don't you reply
to my mail as anybody else does who wants a to start a conversation or
discussion?

Ignoring my remarks and then sending another hastily cobbled together patch
set is not in any way a form of conversation/discussion or follow up to my
mail.

> Anyway, you asked for some time to think about this problem,

I certainly did not ask for some time. ...

I told you that I need to find a free time slot to look into that in
detail. I told you as well, that some of your decisions (earliest boot
code, code duplication etc.) are not acceptable.  The fact that I told you
that I need to find a free time slot also means that I'm going to tell you
in more detail why and what needs to be done.

> ... I won't send any replies to this thread for the next two weeks.

Not replying to mail for a fixed amount of time is just silly. What are you
going to do after that? Resend another half baken patch series which
ignores every review comment?

The review/feedback mechanism does not work that way. If you get feedback,
then you either accept it and act accordingly or you discuss it as a
response to that feedback. When that discussion is settled, then it's time
to rework stuff and send out a new version after making sure that
everything is addressed.

> So, please consider this solution. The feature is well abstracted, does
> not harm the performance of the fast path, and if necessary it can also
> be made optional with something like: CONFIG_HAVE_UNSTABLE_EARLY_CLOCK

Well abstracted? I have no idea what your understanding of abstractions is,
but it's definitely different from mine. This is not abstracted, it's glued
into the code in the 'works for me' mode. It's a proof of concept if at
all. And no, this won't have a CONFIG switch to clutter the code even more.

First of all, this "solution" is only valid for a very restricted set of
systems and breaks others in very subtle ways.

FYI, the architecture name is x86, which handles 32bit/64bit, various
vendors and hypervisors. It's not 'arch/mymachine'.

So this needs to be integrated into the existing TSC/CPU calibration
mechanism which works across all supported systems. And that means, that
early TSC calibration can't be done before x86_init.oem.arch_setup() and
hypervisor_init_platform() have been invoked.

By that time, which is still pretty early in the boot process:

      - boot_cpu_data has been preinitialized

      - calibration functions setup has been done

So this won't ignore decisions made by the various platform quirks and does
not require pointless copies of existing code and absurd hacks to make it
work. Also at this point the most fragile parts of the boot process have
been handled.

Putting the early TSC initialization after hypervisor_init_platform() makes
this available for _ALL_ systems/platforms as infrastructure and allows the
platforms to add the required bits of support via a new set of function
pointers in struct x86_platform_ops, which btw. is a proper form of
abstraction. For some of them it's just switching the calibration function
to the early one, for others this won't be possible ever.

The time between x86_64_start_kernel() and that point is in the low single
digit millisecond range or less than a millisecond and therefor completely
irrelevant. The first timestamps before that point will be 0 as they used
to be.

You can argue in circles about that, it's simply not debatable.

Doing that allows to simply split the existing TSC init into an early and
late part and that late part is going to switch over from early sched clock
to the regular one w/o any _fini() calls or other absurdities.

The same applies to the sched core clock code. This can be integrated w/o
all that extra early/fini() hackery cleanly. All it takes is a weak
sched_clock_early() function which returns 0 to start with. You probably
can figure out how to overload it.

But that's only the sched clock part of the problem. If this early
sched_clock() is available, then this needs to be fed into the setup of
timekeeping as well, otherwise dmesg tells 50 seconds into boot and clock
monotonic/boottime say 5, which would be confusing at best. That has not be
solved in the first step, but definitely before something like this is
going to be merged.

Thanks,

	tglx