linux-kernel.vger.kernel.org archive mirror
* [PATCH v11 0/6] Early boot time stamps for x86
@ 2018-06-20 21:26 Pavel Tatashin
  2018-06-20 21:26 ` [PATCH v11 1/6] x86/tsc: redefine notsc to behave as tsc=unstable Pavel Tatashin
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Pavel Tatashin @ 2018-06-20 21:26 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes

changelog
---------
v11 - v10
	- Addressed all the comments from Thomas Gleixner.
	- I added one more patch:
	  "x86/tsc: prepare for early sched_clock" which fixes a problem
	  that I discovered while testing. I am not particularly happy with
	  the fix, as it adds a new argument that is used only in one
	  place, but if you have a suggestion for a different approach on
	  how to address this problem please let me know.

v10 - v9
	- Added another patch to this series that removes the dependency
	  between the KVM clock and the memblock allocator. The benefit is
	  that all clocks can now be initialized even earlier.
v9 - v8
	- Addressed more comments from Dou Liyang

v8 - v7
	- Addressed comments from Dou Liyang:
	- Moved tsc_early_init() and tsc_early_fini() to be all inside
	  tsc.c, and changed them to be static.
	- Removed warning when notsc parameter is used.
	- Merged with:
	  https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

v7 - v6
	- Removed tsc_disabled flag; now notsc is equivalent to
	  tsc=unstable
	- Simplified changes to sched/clock.c by removing
	  sched_clock_early() and friends, as requested by Peter Zijlstra.
	  We now always use sched_clock()
	- Modified x86 sched_clock() to return either early boot time or
	  regular time.
	- Added another example of why early boot time is important

v6 - v5
	- Added a new patch:
		time: sync read_boot_clock64() with persistent clock
	  which fixes a missing __init macro and enables the time
	  discrepancy fix that was noted by Thomas Gleixner
	- Split "x86/time: read_boot_clock64() implementation" into a
	  separate patch

v5 - v4
	- Fix compiler warnings on systems with stable clocks.

v4 - v3
	- Fixed tsc_early_fini() call to be in the 2nd patch as reported
	  by Dou Liyang
	- Improved comment before __use_sched_clock_early to explain why
	  we need both booleans.
	- Simplified valid_clock logic in read_boot_clock64().

v3 - v2
	- Addressed comment from Thomas Gleixner
	- Timestamps are available a little later in boot but still much
	  earlier than in mainline. This significantly simplified this
	  work.

v2 - v1
	In patch "x86/tsc: tsc early":
	- added tsc_adjusted_early()
	- fixed 32-bit compile error by using do_div()

The early boot time stamps were discussed recently in these threads:
http://lkml.kernel.org/r/1527672059-6225-1-git-send-email-feng.tang@intel.com
http://lkml.kernel.org/r/1527672059-6225-2-git-send-email-feng.tang@intel.com

I updated my series to the latest mainline and am sending it again.

Peter mentioned he did not like patches 6 and 7, and we can discuss a
better way to do that, but I think patches 1-5 can be accepted separately,
since they already enable early timestamps on platforms where sched_clock()
is available early, such as KVM.

This series adds early boot time stamp support for x86 machines.
The SPARC patches for early boot time stamps are already integrated into
mainline Linux.

Sample output
-------------
Before:
https://paste.ubuntu.com/26133428/

After:
https://paste.ubuntu.com/26133523/

For examples of how early time stamps are used, see the following:
Example 1:
https://lwn.net/Articles/734374/
- Without early boot time stamps we would not know about the extra time
  that is spent zeroing struct pages early in boot even when deferred
  page initialization is used.

Example 2:
https://patchwork.kernel.org/patch/10021247/
- If early boot timestamps were available, the engineer who introduced
  this bug would have noticed the extra time that is spent early in boot.

Example 3:
http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.com
- Needed early time stamps to show improvement

Pavel Tatashin (6):
  x86/tsc: redefine notsc to behave as tsc=unstable
  kvm/x86: remove kvm memblock dependency
  time: replace read_boot_clock64() with
    read_persistent_wall_and_boot_offset()
  x86/tsc: prepare for early sched_clock.
  sched: early boot clock
  x86/tsc: use tsc early

 .../admin-guide/kernel-parameters.txt         |  2 -
 Documentation/x86/x86_64/boot-options.txt     |  4 +-
 arch/arm/kernel/time.c                        | 12 +--
 arch/s390/kernel/time.c                       | 11 ++-
 arch/x86/kernel/kvm.c                         |  1 +
 arch/x86/kernel/kvmclock.c                    | 64 ++-----------
 arch/x86/kernel/setup.c                       |  7 +-
 arch/x86/kernel/tsc.c                         | 89 +++++++++++--------
 include/linux/timekeeping.h                   |  3 +-
 kernel/sched/clock.c                          | 10 ++-
 kernel/time/timekeeping.c                     | 61 +++++++------
 11 files changed, 117 insertions(+), 147 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v11 1/6] x86/tsc: redefine notsc to behave as tsc=unstable
  2018-06-20 21:26 [PATCH v11 0/6] Early boot time stamps for x86 Pavel Tatashin
@ 2018-06-20 21:26 ` Pavel Tatashin
  2018-06-20 21:26 ` [PATCH v11 2/6] kvm/x86: remove kvm memblock dependency Pavel Tatashin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Pavel Tatashin @ 2018-06-20 21:26 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes

Currently, the notsc kernel parameter disables the use of the TSC register
by sched_clock(). However, this parameter does not prevent Linux from
accessing the TSC in other places in the kernel.

The only rationale for booting with notsc is to avoid timing discrepancies
on multi-socket systems where different TSC frequencies may be present, and
thus to fall back to jiffies as the clock source.

However, there is another way to solve the above problem: boot with the
tsc=unstable parameter. This parameter allows sched_clock() to use the TSC,
but in case the TSC is outside of the expected interval it is corrected
back to a sane value.

This means there is no reason to keep notsc, and it could be removed. But,
for compatibility reasons, we keep this parameter and change its definition
to be the same as tsc=unstable.
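
For reference, this is roughly what the existing tsc= parameter handling in
arch/x86/kernel/tsc.c looks like (a simplified sketch from the same file,
other options omitted), which is the path notsc now shares:

  static int __init tsc_setup(char *str)
  {
  	if (!strcmp(str, "unstable"))
  		mark_tsc_unstable("boot parameter");
  	return 1;
  }
  __setup("tsc=", tsc_setup);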

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
---
 .../admin-guide/kernel-parameters.txt          |  2 --
 Documentation/x86/x86_64/boot-options.txt      |  4 +---
 arch/x86/kernel/tsc.c                          | 18 +++---------------
 3 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index efc7aa7a0670..f7123d28f318 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2835,8 +2835,6 @@
 
 	nosync		[HW,M68K] Disables sync negotiation for all devices.
 
-	notsc		[BUGS=X86-32] Disable Time Stamp Counter
-
 	nowatchdog	[KNL] Disable both lockup detectors, i.e.
 			soft-lockup and NMI watchdog (hard-lockup).
 
diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 8d109ef67ab6..66114ab4f9fe 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -92,9 +92,7 @@ APICs
 Timing
 
   notsc
-  Don't use the CPU time stamp counter to read the wall time.
-  This can be used to work around timing problems on multiprocessor systems
-  with not properly synchronized CPUs.
+  Deprecated, use tsc=unstable instead.
 
   nohpet
   Don't use the HPET timer.
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 74392d9d51e0..186395041725 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -38,11 +38,6 @@ EXPORT_SYMBOL(tsc_khz);
  */
 static int __read_mostly tsc_unstable;
 
-/* native_sched_clock() is called before tsc_init(), so
-   we must start with the TSC soft disabled to prevent
-   erroneous rdtsc usage on !boot_cpu_has(X86_FEATURE_TSC) processors */
-static int __read_mostly tsc_disabled = -1;
-
 static DEFINE_STATIC_KEY_FALSE(__use_tsc);
 
 int tsc_clocksource_reliable;
@@ -248,8 +243,7 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
 #ifdef CONFIG_X86_TSC
 int __init notsc_setup(char *str)
 {
-	pr_warn("Kernel compiled with CONFIG_X86_TSC, cannot disable TSC completely\n");
-	tsc_disabled = 1;
+	mark_tsc_unstable("boot parameter notsc");
 	return 1;
 }
 #else
@@ -1307,7 +1301,7 @@ static void tsc_refine_calibration_work(struct work_struct *work)
 
 static int __init init_tsc_clocksource(void)
 {
-	if (!boot_cpu_has(X86_FEATURE_TSC) || tsc_disabled > 0 || !tsc_khz)
+	if (!boot_cpu_has(X86_FEATURE_TSC) || !tsc_khz)
 		return 0;
 
 	if (tsc_unstable)
@@ -1414,12 +1408,6 @@ void __init tsc_init(void)
 		set_cyc2ns_scale(tsc_khz, cpu, cyc);
 	}
 
-	if (tsc_disabled > 0)
-		return;
-
-	/* now allow native_sched_clock() to use rdtsc */
-
-	tsc_disabled = 0;
 	static_branch_enable(&__use_tsc);
 
 	if (!no_sched_irq_time)
@@ -1455,7 +1443,7 @@ unsigned long calibrate_delay_is_known(void)
 	int constant_tsc = cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC);
 	const struct cpumask *mask = topology_core_cpumask(cpu);
 
-	if (tsc_disabled || !constant_tsc || !mask)
+	if (!constant_tsc || !mask)
 		return 0;
 
 	sibling = cpumask_any_but(mask, cpu);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v11 2/6] kvm/x86: remove kvm memblock dependency
  2018-06-20 21:26 [PATCH v11 0/6] Early boot time stamps for x86 Pavel Tatashin
  2018-06-20 21:26 ` [PATCH v11 1/6] x86/tsc: redefine notsc to behave as tsc=unstable Pavel Tatashin
@ 2018-06-20 21:26 ` Pavel Tatashin
  2018-06-20 21:26 ` [PATCH v11 3/6] time: replace read_boot_clock64() with read_persistent_wall_and_boot_offset() Pavel Tatashin
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Pavel Tatashin @ 2018-06-20 21:26 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes

The KVM clock is initialized later compared to other hypervisors because it
has a dependency on the memblock allocator.

Let's bring it in line with other hypervisors by removing this dependency
and using memory from the BSS instead of allocating it (see the sketch
below the list of benefits).

The benefits:
- remove the ifdef from common code
- earlier availability of the TSC
- remove the dependency on memblock, and reduce code
- earlier KVM sched_clock()
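
A rough, generic sketch of the pattern (hypothetical struct and names; the
real buffers and sizes are in the diff below). The trade-off is that the
buffer now occupies BSS on every kernel, whether or not it runs as a KVM
guest:

  /* Fixed, page-aligned buffer in BSS, valid from the very start of boot,
   * replacing an early memblock_alloc() + __va() sequence. */
  #define REC_BUF_SIZE	PAGE_ALIGN(sizeof(struct example_rec) * NR_CPUS)
  static u8 rec_buf[REC_BUF_SIZE] __aligned(PAGE_SIZE);
  static struct example_rec *recs;

  static void __init example_rec_init(void)
  {
  	recs = (struct example_rec *)rec_buf;	/* no allocator needed */
  }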

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/kernel/kvm.c      |  1 +
 arch/x86/kernel/kvmclock.c | 64 ++++++--------------------------------
 arch/x86/kernel/setup.c    |  7 ++---
 3 files changed, 12 insertions(+), 60 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5b2300b818af..c65c232d3ddd 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -628,6 +628,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
 	.name			= "KVM",
 	.detect			= kvm_detect,
 	.type			= X86_HYPER_KVM,
+	.init.init_platform	= kvmclock_init,
 	.init.guest_late_init	= kvm_guest_init,
 	.init.x2apic_available	= kvm_para_available,
 };
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index bf8d1eb7fca3..01558d41ec2c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -23,9 +23,9 @@
 #include <asm/apic.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
-#include <linux/memblock.h>
 #include <linux/sched.h>
 #include <linux/sched/clock.h>
+#include <linux/mm.h>
 
 #include <asm/mem_encrypt.h>
 #include <asm/x86_init.h>
@@ -44,6 +44,11 @@ static int parse_no_kvmclock(char *arg)
 }
 early_param("no-kvmclock", parse_no_kvmclock);
 
+/* Aligned to page sizes to match what's mapped via vsyscalls to userspace */
+#define HV_CLOCK_SIZE	(sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
+#define WALL_CLOCK_SIZE	(sizeof(struct pvclock_wall_clock))
+static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
+static u8 wall_clock_mem[PAGE_ALIGN(WALL_CLOCK_SIZE)] __aligned(PAGE_SIZE);
 /* The hypervisor will put information about time periodically here */
 static struct pvclock_vsyscall_time_info *hv_clock;
 static struct pvclock_wall_clock *wall_clock;
@@ -244,43 +249,12 @@ static void kvm_shutdown(void)
 	native_machine_shutdown();
 }
 
-static phys_addr_t __init kvm_memblock_alloc(phys_addr_t size,
-					     phys_addr_t align)
-{
-	phys_addr_t mem;
-
-	mem = memblock_alloc(size, align);
-	if (!mem)
-		return 0;
-
-	if (sev_active()) {
-		if (early_set_memory_decrypted((unsigned long)__va(mem), size))
-			goto e_free;
-	}
-
-	return mem;
-e_free:
-	memblock_free(mem, size);
-	return 0;
-}
-
-static void __init kvm_memblock_free(phys_addr_t addr, phys_addr_t size)
-{
-	if (sev_active())
-		early_set_memory_encrypted((unsigned long)__va(addr), size);
-
-	memblock_free(addr, size);
-}
-
 void __init kvmclock_init(void)
 {
 	struct pvclock_vcpu_time_info *vcpu_time;
-	unsigned long mem, mem_wall_clock;
-	int size, cpu, wall_clock_size;
+	int cpu;
 	u8 flags;
 
-	size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
 	if (!kvm_para_available())
 		return;
 
@@ -290,28 +264,11 @@ void __init kvmclock_init(void)
 	} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
 		return;
 
-	wall_clock_size = PAGE_ALIGN(sizeof(struct pvclock_wall_clock));
-	mem_wall_clock = kvm_memblock_alloc(wall_clock_size, PAGE_SIZE);
-	if (!mem_wall_clock)
-		return;
-
-	wall_clock = __va(mem_wall_clock);
-	memset(wall_clock, 0, wall_clock_size);
-
-	mem = kvm_memblock_alloc(size, PAGE_SIZE);
-	if (!mem) {
-		kvm_memblock_free(mem_wall_clock, wall_clock_size);
-		wall_clock = NULL;
-		return;
-	}
-
-	hv_clock = __va(mem);
-	memset(hv_clock, 0, size);
+	wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
+	hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
 
 	if (kvm_register_clock("primary cpu clock")) {
 		hv_clock = NULL;
-		kvm_memblock_free(mem, size);
-		kvm_memblock_free(mem_wall_clock, wall_clock_size);
 		wall_clock = NULL;
 		return;
 	}
@@ -354,13 +311,10 @@ int __init kvm_setup_vsyscall_timeinfo(void)
 	int cpu;
 	u8 flags;
 	struct pvclock_vcpu_time_info *vcpu_time;
-	unsigned int size;
 
 	if (!hv_clock)
 		return 0;
 
-	size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
 	cpu = get_cpu();
 
 	vcpu_time = &hv_clock[cpu].pvti;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2f86d883dd95..5194d9c38a43 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1012,6 +1012,8 @@ void __init setup_arch(char **cmdline_p)
 	 */
 	init_hypervisor_platform();
 
+	tsc_early_delay_calibrate();
+
 	x86_init.resources.probe_roms();
 
 	/* after parse_early_param, so could debug it */
@@ -1197,11 +1199,6 @@ void __init setup_arch(char **cmdline_p)
 
 	memblock_find_dma_reserve();
 
-#ifdef CONFIG_KVM_GUEST
-	kvmclock_init();
-#endif
-
-	tsc_early_delay_calibrate();
 	if (!early_xdbc_setup_hardware())
 		early_xdbc_register_console();
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v11 3/6] time: replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
  2018-06-20 21:26 [PATCH v11 0/6] Early boot time stamps for x86 Pavel Tatashin
  2018-06-20 21:26 ` [PATCH v11 1/6] x86/tsc: redefine notsc to behave as tsc=unstable Pavel Tatashin
  2018-06-20 21:26 ` [PATCH v11 2/6] kvm/x86: remove kvm memblock dependency Pavel Tatashin
@ 2018-06-20 21:26 ` Pavel Tatashin
  2018-06-21 16:13   ` Thomas Gleixner
  2018-06-20 21:26 ` [PATCH v11 4/6] x86/tsc: prepare for early sched_clock Pavel Tatashin
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Pavel Tatashin @ 2018-06-20 21:26 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes

If the architecture does not support an exact boot time, it is challenging
to estimate the boot time without having a reference to the current
persistent clock value. Yet, we cannot read the persistent clock time
again, because this may lead to math discrepancies with the caller of
read_boot_clock64(), who may have read the persistent clock at a different
time.

This is why it is better to provide two values simultaneously: the
persistent clock value and the boot time.

Thus, we replace read_boot_clock64() with:
read_persistent_wall_and_boot_offset(wall_time, boot_offset)

where wall_time is the value returned by read_persistent_clock64(), and
boot_offset is wall_time - boot time.

We calculate boot_offset using the current value of local_clock(), so
architectures that do not have a dedicated boot clock but do have an early
sched_clock(), such as SPARCv9, x86, and possibly more, will benefit from
this change by getting a better and more consistent estimate of the boot
time, without the need for an arch-specific implementation.
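
To make the math concrete, here is a small userspace-style sketch of what
the default (weak) implementation provides and what timekeeping_init()
derives from it; all values are made up:

  #include <stdio.h>

  int main(void)
  {
  	/* Hypothetical readings at timekeeping_init() time, in seconds. */
  	long long wall_time   = 1529533000;	/* persistent clock (RTC)        */
  	long long boot_offset = 2;		/* local_clock(): time since boot */

  	/* The machine therefore booted at wall_time - boot_offset. */
  	long long boot_time = wall_time - boot_offset;

  	/* timekeeping_init() wants wall_time + wall_to_mono to equal the
  	 * time since boot, so: */
  	long long wall_to_mono = boot_offset - wall_time;

  	printf("boot_time=%lld wall_to_mono=%lld\n", boot_time, wall_to_mono);
  	return 0;
  }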

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/arm/kernel/time.c      | 12 +-------
 arch/s390/kernel/time.c     | 11 +++++--
 include/linux/timekeeping.h |  3 +-
 kernel/time/timekeeping.c   | 61 +++++++++++++++++++------------------
 4 files changed, 43 insertions(+), 44 deletions(-)

diff --git a/arch/arm/kernel/time.c b/arch/arm/kernel/time.c
index cf2701cb0de8..0a6a457b13c7 100644
--- a/arch/arm/kernel/time.c
+++ b/arch/arm/kernel/time.c
@@ -83,29 +83,19 @@ static void dummy_clock_access(struct timespec64 *ts)
 }
 
 static clock_access_fn __read_persistent_clock = dummy_clock_access;
-static clock_access_fn __read_boot_clock = dummy_clock_access;
 
 void read_persistent_clock64(struct timespec64 *ts)
 {
 	__read_persistent_clock(ts);
 }
 
-void read_boot_clock64(struct timespec64 *ts)
-{
-	__read_boot_clock(ts);
-}
-
 int __init register_persistent_clock(clock_access_fn read_boot,
 				     clock_access_fn read_persistent)
 {
 	/* Only allow the clockaccess functions to be registered once */
-	if (__read_persistent_clock == dummy_clock_access &&
-	    __read_boot_clock == dummy_clock_access) {
-		if (read_boot)
-			__read_boot_clock = read_boot;
+	if (__read_persistent_clock == dummy_clock_access) {
 		if (read_persistent)
 			__read_persistent_clock = read_persistent;
-
 		return 0;
 	}
 
diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index cf561160ea88..a69355166f97 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -221,17 +221,22 @@ void read_persistent_clock64(struct timespec64 *ts)
 	ext_to_timespec64(clk, ts);
 }
 
-void read_boot_clock64(struct timespec64 *ts)
+void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+						 struct timespec64 *boot_offset)
 {
 	unsigned char clk[STORE_CLOCK_EXT_SIZE];
+	struct timespec64 boot_time;
 	__u64 delta;
 
 	delta = initial_leap_seconds + TOD_UNIX_EPOCH;
-	memcpy(clk, tod_clock_base, 16);
+	memcpy(clk, tod_clock_base, STORE_CLOCK_EXT_SIZE);
 	*(__u64 *) &clk[1] -= delta;
 	if (*(__u64 *) &clk[1] > delta)
 		clk[0]--;
-	ext_to_timespec64(clk, ts);
+	ext_to_timespec64(clk, &boot_time);
+
+	read_persistent_clock64(wall_time);
+	*boot_offset = timespec64_sub(*wall_time, boot_time);
 }
 
 static u64 read_tod_clock(struct clocksource *cs)
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 86bc2026efce..686bc27acef0 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -243,7 +243,8 @@ extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot);
 extern int persistent_clock_is_local;
 
 extern void read_persistent_clock64(struct timespec64 *ts);
-extern void read_boot_clock64(struct timespec64 *ts);
+void read_persistent_wall_and_boot_offset(struct timespec64 *wall_clock,
+					  struct timespec64 *boot_offset);
 extern int update_persistent_clock64(struct timespec64 now);
 
 /*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 4786df904c22..aface5c13e7d 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -17,6 +17,7 @@
 #include <linux/nmi.h>
 #include <linux/sched.h>
 #include <linux/sched/loadavg.h>
+#include <linux/sched/clock.h>
 #include <linux/syscore_ops.h>
 #include <linux/clocksource.h>
 #include <linux/jiffies.h>
@@ -1496,18 +1497,23 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
 }
 
 /**
- * read_boot_clock64 -  Return time of the system start.
+ * read_persistent_wall_and_boot_offset - Read persistent clock, and also offset
+ *                                        from the boot.
  *
  * Weak dummy function for arches that do not yet support it.
- * Function to read the exact time the system has been started.
- * Returns a timespec64 with tv_sec=0 and tv_nsec=0 if unsupported.
- *
- *  XXX - Do be sure to remove it once all arches implement it.
+ * wall_time	- current time as returned by persistent clock
+ * boot_offset	- offset that is defined as wall_time - boot_time
+ * The default function calculates offset based on the current value of
+ * local_clock(). This way architectures that support sched_clock() but don't
+ * support dedicated boot time clock will provide the best estimate of the
+ * boot time.
  */
-void __weak read_boot_clock64(struct timespec64 *ts)
+void __weak __init
+read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+				     struct timespec64 *boot_offset)
 {
-	ts->tv_sec = 0;
-	ts->tv_nsec = 0;
+	read_persistent_clock64(wall_time);
+	*boot_offset = ns_to_timespec64(local_clock());
 }
 
 /* Flag for if timekeeping_resume() has injected sleeptime */
@@ -1521,28 +1527,28 @@ static bool persistent_clock_exists;
  */
 void __init timekeeping_init(void)
 {
+	struct timespec64 wall_time, boot_offset, wall_to_mono;
 	struct timekeeper *tk = &tk_core.timekeeper;
 	struct clocksource *clock;
 	unsigned long flags;
-	struct timespec64 now, boot, tmp;
-
-	read_persistent_clock64(&now);
-	if (!timespec64_valid_strict(&now)) {
-		pr_warn("WARNING: Persistent clock returned invalid value!\n"
-			"         Check your CMOS/BIOS settings.\n");
-		now.tv_sec = 0;
-		now.tv_nsec = 0;
-	} else if (now.tv_sec || now.tv_nsec)
-		persistent_clock_exists = true;
 
-	read_boot_clock64(&boot);
-	if (!timespec64_valid_strict(&boot)) {
-		pr_warn("WARNING: Boot clock returned invalid value!\n"
-			"         Check your CMOS/BIOS settings.\n");
-		boot.tv_sec = 0;
-		boot.tv_nsec = 0;
+	read_persistent_wall_and_boot_offset(&wall_time, &boot_offset);
+	if (timespec64_valid_strict(&wall_time) &&
+	    timespec64_to_ns(&wall_time)) {
+		persistent_clock_exists = true;
+	} else {
+		pr_warn("Persistent clock returned invalid value");
+		wall_time = (struct timespec64){0};
 	}
 
+	if (timespec64_compare(&wall_time, &boot_offset) < 0)
+		boot_offset = (struct timespec64){0};
+
+	/* We want to set wall_to_mono, so the following is true:
+	 * wall time + wall_to_mono = boot time
+	 */
+	wall_to_mono = timespec64_sub(boot_offset, wall_time);
+
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
 	write_seqcount_begin(&tk_core.seq);
 	ntp_init();
@@ -1552,13 +1558,10 @@ void __init timekeeping_init(void)
 		clock->enable(clock);
 	tk_setup_internals(tk, clock);
 
-	tk_set_xtime(tk, &now);
+	tk_set_xtime(tk, &wall_time);
 	tk->raw_sec = 0;
-	if (boot.tv_sec == 0 && boot.tv_nsec == 0)
-		boot = tk_xtime(tk);
 
-	set_normalized_timespec64(&tmp, -boot.tv_sec, -boot.tv_nsec);
-	tk_set_wall_to_mono(tk, tmp);
+	tk_set_wall_to_mono(tk, wall_to_mono);
 
 	timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v11 4/6] x86/tsc: prepare for early sched_clock
  2018-06-20 21:26 [PATCH v11 0/6] Early boot time stamps for x86 Pavel Tatashin
                   ` (2 preceding siblings ...)
  2018-06-20 21:26 ` [PATCH v11 3/6] time: replace read_boot_clock64() with read_persistent_wall_and_boot_offset() Pavel Tatashin
@ 2018-06-20 21:26 ` Pavel Tatashin
  2018-06-20 21:26 ` [PATCH v11 5/6] sched: early boot clock Pavel Tatashin
  2018-06-20 21:27 ` [PATCH v11 6/6] x86/tsc: use tsc early Pavel Tatashin
  5 siblings, 0 replies; 13+ messages in thread
From: Pavel Tatashin @ 2018-06-20 21:26 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes

We will change sched_clock() to be called early. But, during boot
sched_clock() changes its output without notifying us about the change of
clock source.

This happens in tsc_init(), when static_branch_enable(&__use_tsc) is
called.

native_sched_clock() changes from outputting jiffies-based time to reading
the TSC, but the scheduler is not notified in any way. So, to preserve
continuity at this point, we add the current sched_clock() offset to the
cyc2ns calculation.

Without this change, the output would look like this:

[    0.004000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.009000] tsc: Fast TSC calibration using PIT
[    0.010000] tsc: Detected 3192.137 MHz processor
[    0.011000] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2e03465ceb2,    max_idle_ns: 440795259855 ns

After static_branch_enable(&__use_tsc) is called, the timestamps become
precise but jump backwards:

[    0.002233] Calibrating delay loop (skipped), value calculated using timer frequency..     6384.27 BogoMIPS (lpj=3192137)
[    0.002516] pid_max: default: 32768 minimum: 301
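
With this change, the new TSC-based scale is anchored so that, at the
moment of the switch, sched_clock() reads the same value the jiffies-based
clock just returned, instead of restarting from the raw TSC conversion. A
small userspace-style sketch with made-up numbers (and a simplified
cycles_2_ns(), no mul/shr scaling):

  #include <stdio.h>
  #include <stdint.h>

  static const int64_t tsc_khz = 3192137;	/* calibrated TSC frequency        */
  static int64_t cyc2ns_offset;			/* what set_cyc2ns_scale() anchors */

  static int64_t cycles_2_ns(int64_t cyc)
  {
  	return cyc * 1000000 / tsc_khz + cyc2ns_offset;
  }

  int main(void)
  {
  	int64_t tsc_now   = 7128000;	/* TSC value read in tsc_init()       */
  	int64_t sched_now = 11000000;	/* jiffies-based sched_clock(): 11 ms */

  	/* Without the offset the clock would restart from ~2.2 ms: */
  	printf("raw conversion: %lld ns\n",
  	       (long long)(tsc_now * 1000000 / tsc_khz));

  	/* Anchor the new scale, as set_cyc2ns_scale() now does via ns_now: */
  	cyc2ns_offset = sched_now - tsc_now * 1000000 / tsc_khz;
  	printf("first TSC-based read: %lld ns\n",
  	       (long long)cycles_2_ns(tsc_now));
  	return 0;
  }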

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/kernel/tsc.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 186395041725..654a01cc0358 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -133,7 +133,9 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
 	return ns;
 }
 
-static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+static void set_cyc2ns_scale(unsigned long khz, int cpu,
+			     unsigned long long tsc_now,
+			     unsigned long long sched_now)
 {
 	unsigned long long ns_now;
 	struct cyc2ns_data data;
@@ -146,7 +148,7 @@ static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_
 	if (!khz)
 		goto done;
 
-	ns_now = cycles_2_ns(tsc_now);
+	ns_now = cycles_2_ns(tsc_now) + sched_now;
 
 	/*
 	 * Compute a new multiplier as per the above comment and ensure our
@@ -936,7 +938,7 @@ static int time_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
 		if (!(freq->flags & CPUFREQ_CONST_LOOPS))
 			mark_tsc_unstable("cpufreq changes");
 
-		set_cyc2ns_scale(tsc_khz, freq->cpu, rdtsc());
+		set_cyc2ns_scale(tsc_khz, freq->cpu, rdtsc(), 0);
 	}
 
 	return 0;
@@ -1285,7 +1287,7 @@ static void tsc_refine_calibration_work(struct work_struct *work)
 
 	/* Update the sched_clock() rate to match the clocksource one */
 	for_each_possible_cpu(cpu)
-		set_cyc2ns_scale(tsc_khz, cpu, tsc_stop);
+		set_cyc2ns_scale(tsc_khz, cpu, tsc_stop, 0);
 
 out:
 	if (tsc_unstable)
@@ -1356,7 +1358,7 @@ void __init tsc_early_delay_calibrate(void)
 
 void __init tsc_init(void)
 {
-	u64 lpj, cyc;
+	u64 lpj, cyc, sch;
 	int cpu;
 
 	if (!boot_cpu_has(X86_FEATURE_TSC)) {
@@ -1403,9 +1405,10 @@ void __init tsc_init(void)
 	 * up if their speed diverges)
 	 */
 	cyc = rdtsc();
+	sch = local_clock();
 	for_each_possible_cpu(cpu) {
 		cyc2ns_init(cpu);
-		set_cyc2ns_scale(tsc_khz, cpu, cyc);
+		set_cyc2ns_scale(tsc_khz, cpu, cyc, sch);
 	}
 
 	static_branch_enable(&__use_tsc);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v11 5/6] sched: early boot clock
  2018-06-20 21:26 [PATCH v11 0/6] Early boot time stamps for x86 Pavel Tatashin
                   ` (3 preceding siblings ...)
  2018-06-20 21:26 ` [PATCH v11 4/6] x86/tsc: prepare for early sched_clock Pavel Tatashin
@ 2018-06-20 21:26 ` Pavel Tatashin
  2018-06-20 21:27 ` [PATCH v11 6/6] x86/tsc: use tsc early Pavel Tatashin
  5 siblings, 0 replies; 13+ messages in thread
From: Pavel Tatashin @ 2018-06-20 21:26 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes

Allow sched_clock() to be used before sched_clock_init() and
sched_clock_init_late() are called. This provides us with a way to get
early boot timestamps on machines with unstable clocks.
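
The transition in sched_clock_init_late() then picks __gtod_offset so that
the clock value does not jump when the unstable path switches from the
early sched_clock()-based time to the GTOD-based time. A userspace-style
sketch with made-up nanosecond values:

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
  	int64_t sched_clock_now    = 2000000000;	/* raw sched_clock()    */
  	int64_t sched_clock_offset = 0;			/* __sched_clock_offset */
  	int64_t ktime_now          = 1995000000;	/* ktime_get_ns()       */

  	/* __gtod_offset = sched_clock() + __sched_clock_offset - ktime_get_ns() */
  	int64_t gtod_offset = sched_clock_now + sched_clock_offset - ktime_now;

  	int64_t early_clock = sched_clock_now + sched_clock_offset;
  	int64_t gtod_clock  = ktime_now + gtod_offset;

  	printf("early=%lld gtod=%lld continuous=%d\n",
  	       (long long)early_clock, (long long)gtod_clock,
  	       early_clock == gtod_clock);
  	return 0;
  }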

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 kernel/sched/clock.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 10c83e73837a..f034392b0f6c 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -205,6 +205,11 @@ void clear_sched_clock_stable(void)
  */
 static int __init sched_clock_init_late(void)
 {
+	/* Transition to unstable clock from early clock */
+	local_irq_disable();
+	__gtod_offset = sched_clock() + __sched_clock_offset - ktime_get_ns();
+	local_irq_enable();
+
 	sched_clock_running = 2;
 	/*
 	 * Ensure that it is impossible to not do a static_key update.
@@ -350,8 +355,9 @@ u64 sched_clock_cpu(int cpu)
 	if (sched_clock_stable())
 		return sched_clock() + __sched_clock_offset;
 
-	if (unlikely(!sched_clock_running))
-		return 0ull;
+	/* Use early clock until sched_clock_init_late() */
+	if (unlikely(sched_clock_running < 2))
+		return sched_clock() + __sched_clock_offset;
 
 	preempt_disable_notrace();
 	scd = cpu_sdc(cpu);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v11 6/6] x86/tsc: use tsc early
  2018-06-20 21:26 [PATCH v11 0/6] Early boot time stamps for x86 Pavel Tatashin
                   ` (4 preceding siblings ...)
  2018-06-20 21:26 ` [PATCH v11 5/6] sched: early boot clock Pavel Tatashin
@ 2018-06-20 21:27 ` Pavel Tatashin
  2018-06-21  5:51   ` Feng Tang
  5 siblings, 1 reply; 13+ messages in thread
From: Pavel Tatashin @ 2018-06-20 21:27 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes

We want timestamps and a high resolution clock to be available to us as
early as possible in boot. But, native_sched_clock() outputs time based
either on the TSC after tsc_init() is called later in boot, or on jiffies
once clock interrupts are enabled, which also happens later in boot.

On the other hand, we know the TSC frequency as early as when
tsc_early_delay_calibrate() is called. So, we use the early TSC calibration
to output timestamps early. Later in boot, when tsc_init() is called, we
calibrate the TSC again using more precise methods, and start using that.

Since sched_clock() is in a hot path, we want to make sure that no
regressions are introduced to this function after the machine is booted.
This is why we use a static branch that is enabled by default, but is
disabled once we have initialized a permanent clock source.
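
A generic sketch of the pattern described above, with hypothetical
early_clock_read()/permanent_clock_read() helpers standing in for the real
code in the patch below: the key starts out true so early callers need no
setup, and it is flipped off exactly once, after which the branch site is
patched to a NOP and the hot path falls straight through:

  static DEFINE_STATIC_KEY_TRUE(early_path_enabled);

  static u64 read_clock(void)
  {
  	if (static_branch_unlikely(&early_path_enabled))
  		return early_clock_read();	/* used only until the switch */

  	return permanent_clock_read();
  }

  static void __init permanent_clocksource_ready(void)
  {
  	static_branch_disable(&early_path_enabled);	/* one-time transition */
  }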

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/kernel/tsc.c | 64 ++++++++++++++++++++++++++++++-------------
 1 file changed, 45 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 654a01cc0358..1dd69612c69c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -39,6 +39,9 @@ EXPORT_SYMBOL(tsc_khz);
 static int __read_mostly tsc_unstable;
 
 static DEFINE_STATIC_KEY_FALSE(__use_tsc);
+static DEFINE_STATIC_KEY_TRUE(tsc_early_enabled);
+
+static bool tsc_early_sched_clock;
 
 int tsc_clocksource_reliable;
 
@@ -133,22 +136,13 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
 	return ns;
 }
 
-static void set_cyc2ns_scale(unsigned long khz, int cpu,
-			     unsigned long long tsc_now,
-			     unsigned long long sched_now)
+static void __set_cyc2ns_scale(unsigned long khz, int cpu,
+			       unsigned long long tsc_now,
+			       unsigned long long sched_now)
 {
-	unsigned long long ns_now;
+	unsigned long long ns_now = cycles_2_ns(tsc_now) + sched_now;
 	struct cyc2ns_data data;
 	struct cyc2ns *c2n;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	sched_clock_idle_sleep_event();
-
-	if (!khz)
-		goto done;
-
-	ns_now = cycles_2_ns(tsc_now) + sched_now;
 
 	/*
 	 * Compute a new multiplier as per the above comment and ensure our
@@ -178,22 +172,47 @@ static void set_cyc2ns_scale(unsigned long khz, int cpu,
 	c2n->data[0] = data;
 	raw_write_seqcount_latch(&c2n->seq);
 	c2n->data[1] = data;
+}
+
+static void set_cyc2ns_scale(unsigned long khz, int cpu,
+			     unsigned long long tsc_now,
+			     unsigned long long sched_now)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	sched_clock_idle_sleep_event();
+
+	if (khz)
+		__set_cyc2ns_scale(khz, cpu, tsc_now, sched_now);
 
-done:
 	sched_clock_idle_wakeup_event();
 	local_irq_restore(flags);
 }
 
+static void __init sched_clock_early_init(unsigned int khz)
+{
+	cyc2ns_init(smp_processor_id());
+	__set_cyc2ns_scale(khz, smp_processor_id(), rdtsc(), 0);
+	tsc_early_sched_clock = true;
+}
+
+static void __init sched_clock_early_exit(void)
+{
+	static_branch_disable(&tsc_early_enabled);
+}
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  */
 u64 native_sched_clock(void)
 {
-	if (static_branch_likely(&__use_tsc)) {
-		u64 tsc_now = rdtsc();
+	if (static_branch_likely(&__use_tsc))
+		return cycles_2_ns(rdtsc());
 
-		/* return the value in ns */
-		return cycles_2_ns(tsc_now);
+	if (static_branch_unlikely(&tsc_early_enabled)) {
+		if (tsc_early_sched_clock)
+			return cycles_2_ns(rdtsc());
 	}
 
 	/*
@@ -1354,9 +1373,10 @@ void __init tsc_early_delay_calibrate(void)
 	lpj = tsc_khz * 1000;
 	do_div(lpj, HZ);
 	loops_per_jiffy = lpj;
+	sched_clock_early_init(tsc_khz);
 }
 
-void __init tsc_init(void)
+static void __init __tsc_init(void)
 {
 	u64 lpj, cyc, sch;
 	int cpu;
@@ -1433,6 +1453,12 @@ void __init tsc_init(void)
 	detect_art();
 }
 
+void __init tsc_init(void)
+{
+	__tsc_init();
+	sched_clock_early_exit();
+}
+
 #ifdef CONFIG_SMP
 /*
  * If we have a constant TSC and are using the TSC for the delay loop,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v11 6/6] x86/tsc: use tsc early
  2018-06-20 21:27 ` [PATCH v11 6/6] x86/tsc: use tsc early Pavel Tatashin
@ 2018-06-21  5:51   ` Feng Tang
  2018-06-21 11:16     ` Pavel Tatashin
  0 siblings, 1 reply; 13+ messages in thread
From: Feng Tang @ 2018-06-21  5:51 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, pmladek, gnomes

Hi Pavel,

On Wed, Jun 20, 2018 at 05:27:00PM -0400, Pavel Tatashin wrote:
> We want to get timestamps and high resultion clock available to us as early
> as possible in boot. But, native_sched_clock() outputs time based either on
> tsc after tsc_init() is called later in boot, or using jiffies when clock
> interrupts are enabled, which is also happens later in boot.
> 
> On the other hand, we know tsc frequency from as early as when
> tsc_early_delay_calibrate() is called. So, we use the early tsc calibration
> to output timestamps early. Later in boot when tsc_init() is called we
> calibrate tsc again using more precise methods, and start using that.
> 
> Since sched_clock() is in a hot path, we want to make sure that no
> regressions are introduced to this function after machine is booted, this
> is why we are using static branch that is enabled by default, but is
> disabled once we have initialized a permanent clock source.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> ---
>  arch/x86/kernel/tsc.c | 64 ++++++++++++++++++++++++++++++-------------
>  1 file changed, 45 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 654a01cc0358..1dd69612c69c 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -39,6 +39,9 @@ EXPORT_SYMBOL(tsc_khz);
>  static int __read_mostly tsc_unstable;
>  
>  static DEFINE_STATIC_KEY_FALSE(__use_tsc);
> +static DEFINE_STATIC_KEY_TRUE(tsc_early_enabled);

Do we still need to add a static_key, after Peter worked out the patch
to enable early jump_label_init()?

Thanks,
Feng

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v11 6/6] x86/tsc: use tsc early
  2018-06-21  5:51   ` Feng Tang
@ 2018-06-21 11:16     ` Pavel Tatashin
  2018-06-21 13:30       ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Pavel Tatashin @ 2018-06-21 11:16 UTC (permalink / raw)
  To: feng.tang
  Cc: Steven Sistare, Daniel Jordan, linux, schwidefsky,
	Heiko Carstens, John Stultz, sboyd, x86, LKML, mingo, tglx, hpa,
	douly.fnst, peterz, prarit, Petr Mladek, gnomes

> Do we still need add a static_key? after Peter worked out the patch
> to enable ealy jump_label_init?

Hi Feng,

With Pete's patch we will still need at least one static branch, but,
as I replied to Pete's email, I like the idea of calling
jump_label_init() early; in my opinion, though, it should be separate
work, together with a tsc.c cleanup patch.

Thank you,
Pavel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v11 6/6] x86/tsc: use tsc early
  2018-06-21 11:16     ` Pavel Tatashin
@ 2018-06-21 13:30       ` Peter Zijlstra
  2018-06-21 13:38         ` Pavel Tatashin
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2018-06-21 13:30 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: feng.tang, Steven Sistare, Daniel Jordan, linux, schwidefsky,
	Heiko Carstens, John Stultz, sboyd, x86, LKML, mingo, tglx, hpa,
	douly.fnst, prarit, Petr Mladek, gnomes, Borislav Petkov

On Thu, Jun 21, 2018 at 07:16:32AM -0400, Pavel Tatashin wrote:
> > Do we still need add a static_key? after Peter worked out the patch
> > to enable ealy jump_label_init?
> 
> Hi Feng,
> 
> With Pete's patch we will still need at least one static branch, but
> as I replied to Pete's email I like the idea of initializing
> jump_label_init() early, but in my opinion it should be a separate
> work, with tsc.c cleanup patch.

Bah, no, we don't make a mess first and then maybe clean it up.

Have a look at the below. The patch is a mess, but I have two sick kids
on my hands; please clean up / split where appropriate.

Seems to work though:

Booting the kernel.
[    0.000000] microcode: microcode updated early to revision 0x428, date = 2014-05-29
[    0.000000] Linux version 4.17.0-09589-g7a36b8fc167a-dirty (root@ivb-ep) (gcc version 7.3.0 (Debian 7.3.0-3)) #360 SMP PREEMPT Thu Jun 21 15:03:32 CEST 2018
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.17.0-09589-g7a36b8fc167a-dirty root=UUID=ee91c0f0-977f-434d-bfaa-92daf7cdbe07 ro possible_cpus=40 debug ignore_loglevel sysrq_always_enabled ftrace=nop earlyprintk=serial,ttyS0,115200 console=ttyS0,115200 no_console_suspend force_early_printk sched_debug
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[    0.000000] key: ffffffff83033e90 enabled: 0 enable: 1
[    0.000000] transform: setup_arch+0x104/0xc6a type: 1
[    0.000000] transform: setup_arch+0xcf/0xc6a type: 1
[    0.000000] transform: setup_arch+0xf3/0xc6a type: 0
[    0.000000] transform: setup_arch+0xbe/0xc6a type: 0
[    0.000000] post-likely
[    0.000000] post-unlikely

---
 arch/x86/include/asm/jump_label.h |  2 ++
 arch/x86/kernel/alternative.c     |  2 ++
 arch/x86/kernel/cpu/amd.c         | 13 +++++++-----
 arch/x86/kernel/cpu/common.c      |  2 ++
 arch/x86/kernel/jump_label.c      | 43 +++++++++++++++++++++++++++++++++------
 arch/x86/kernel/setup.c           | 19 +++++++++++++++--
 include/linux/jump_label.h        | 25 +++++++++++++++++++----
 kernel/jump_label.c               | 16 ---------------
 8 files changed, 89 insertions(+), 33 deletions(-)

diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 8c0de4282659..555fb57ea872 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -74,6 +74,8 @@ struct jump_entry {
 	jump_label_t key;
 };
 
+extern void jump_label_update_early(struct static_key *key, bool enable);
+
 #else	/* __ASSEMBLY__ */
 
 .macro STATIC_JUMP_IF_TRUE target, key, def
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index a481763a3776..874bb274af2f 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -215,6 +215,7 @@ void __init arch_init_ideal_nops(void)
 			   ideal_nops = p6_nops;
 		} else {
 #ifdef CONFIG_X86_64
+			/* FEATURE_NOPL is unconditionally true on 64bit so this is dead code */
 			ideal_nops = k8_nops;
 #else
 			ideal_nops = intel_nops;
@@ -668,6 +669,7 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
 	local_irq_save(flags);
 	memcpy(addr, opcode, len);
 	local_irq_restore(flags);
+	sync_core();
 	/* Could also do a CLFLUSH here to speed up CPU recovery; but
 	   that causes hangs on some VIA CPUs. */
 	return addr;
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 082d7875cef8..355105aebc4e 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -232,8 +232,6 @@ static void init_amd_k7(struct cpuinfo_x86 *c)
 		}
 	}
 
-	set_cpu_cap(c, X86_FEATURE_K7);
-
 	/* calling is from identify_secondary_cpu() ? */
 	if (!c->cpu_index)
 		return;
@@ -615,6 +613,14 @@ static void early_init_amd(struct cpuinfo_x86 *c)
 
 	early_init_amd_mc(c);
 
+#ifdef CONFIG_X86_32
+	if (c->x86 == 6)
+		set_cpu_cap(c, X86_FEATURE_K7);
+#endif
+
+	if (c->x86 >= 0xf)
+		set_cpu_cap(c, X86_FEATURE_K8);
+
 	rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);
 
 	/*
@@ -861,9 +867,6 @@ static void init_amd(struct cpuinfo_x86 *c)
 
 	init_amd_cacheinfo(c);
 
-	if (c->x86 >= 0xf)
-		set_cpu_cap(c, X86_FEATURE_K8);
-
 	if (cpu_has(c, X86_FEATURE_XMM2)) {
 		unsigned long long val;
 		int ret;
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 910b47ee8078..2a4024f7a222 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1086,6 +1086,8 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
 	 */
 	if (!pgtable_l5_enabled())
 		setup_clear_cpu_cap(X86_FEATURE_LA57);
+
+	detect_nopl(c);
 }
 
 void __init early_cpu_init(void)
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index e56c95be2808..abebe1318e6b 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -52,16 +52,14 @@ static void __jump_label_transform(struct jump_entry *entry,
 			 * Jump label is enabled for the first time.
 			 * So we expect a default_nop...
 			 */
-			if (unlikely(memcmp((void *)entry->code, default_nop, 5)
-				     != 0))
+			if (unlikely(memcmp((void *)entry->code, default_nop, 5) != 0))
 				bug_at((void *)entry->code, __LINE__);
 		} else {
 			/*
 			 * ...otherwise expect an ideal_nop. Otherwise
 			 * something went horribly wrong.
 			 */
-			if (unlikely(memcmp((void *)entry->code, ideal_nop, 5)
-				     != 0))
+			if (unlikely(memcmp((void *)entry->code, ideal_nop, 5) != 0))
 				bug_at((void *)entry->code, __LINE__);
 		}
 
@@ -80,8 +78,8 @@ static void __jump_label_transform(struct jump_entry *entry,
 				bug_at((void *)entry->code, __LINE__);
 		} else {
 			code.jump = 0xe9;
-			code.offset = entry->target -
-				(entry->code + JUMP_LABEL_NOP_SIZE);
+			code.offset = entry->target - (entry->code + JUMP_LABEL_NOP_SIZE);
+
 			if (unlikely(memcmp((void *)entry->code, &code, 5) != 0))
 				bug_at((void *)entry->code, __LINE__);
 		}
@@ -140,4 +138,37 @@ __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry,
 		__jump_label_transform(entry, type, text_poke_early, 1);
 }
 
+void jump_label_update_early(struct static_key *key, bool enable)
+{
+	struct jump_entry *entry, *stop = __stop___jump_table;
+
+	/*
+	 * We need the table sorted and key->entries set up.
+	 */
+	WARN_ON_ONCE(!static_key_initialized);
+
+	entry = static_key_entries(key);
+
+	/*
+	 * Sanity check for early users, there had better be a core kernel user.
+	 */
+	if (!entry || !entry->code || !core_kernel_text(entry->code)) {
+		WARN_ON(1);
+		return;
+	}
+
+	printk("key: %px enabled: %d enable: %d\n", key, atomic_read(&key->enabled), (int)enable);
+
+	if (!(!!atomic_read(&key->enabled) ^ !!enable))
+		return;
+
+	for ( ; (entry < stop) && (jump_entry_key(entry) == key); entry++) {
+		enum jump_label_type type = enable ^ jump_entry_branch(entry);
+		printk("transform: %pS type: %d\n", (void *)entry->code, type);
+		__jump_label_transform(entry, type, text_poke_early, 0);
+	}
+
+	atomic_set_release(&key->enabled, !!enable);
+}
+
 #endif
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2f86d883dd95..3731245b8ec7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -805,6 +805,8 @@ dump_kernel_offset(struct notifier_block *self, unsigned long v, void *p)
 	return 0;
 }
 
+static DEFINE_STATIC_KEY_FALSE(__test);
+
 /*
  * Determine if we were loaded by an EFI loader.  If so, then we have also been
  * passed the efi memmap, systab, etc., so we should use these data structures
@@ -866,6 +868,21 @@ void __init setup_arch(char **cmdline_p)
 
 	idt_setup_early_traps();
 	early_cpu_init();
+	arch_init_ideal_nops();
+	jump_label_init();
+
+	if (static_branch_likely(&__test))
+		printk("pre-likely\n");
+	if (static_branch_unlikely(&__test))
+		printk("pre-unlikely\n");
+
+	jump_label_update_early(&__test.key, true);
+
+	if (static_branch_likely(&__test))
+		printk("post-likely\n");
+	if (static_branch_unlikely(&__test))
+		printk("post-unlikely\n");
+
 	early_ioremap_init();
 
 	setup_olpc_ofw_pgd();
@@ -1272,8 +1289,6 @@ void __init setup_arch(char **cmdline_p)
 
 	mcheck_init();
 
-	arch_init_ideal_nops();
-
 	register_refined_jiffies(CLOCK_TICK_RATE);
 
 #ifdef CONFIG_EFI
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index b46b541c67c4..7a693d0fb5b5 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -79,6 +79,7 @@
 
 #include <linux/types.h>
 #include <linux/compiler.h>
+#include <linux/bug.h>
 
 extern bool static_key_initialized;
 
@@ -110,6 +111,17 @@ struct static_key {
 	};
 };
 
+#define JUMP_TYPE_FALSE		0UL
+#define JUMP_TYPE_TRUE		1UL
+#define JUMP_TYPE_LINKED	2UL
+#define JUMP_TYPE_MASK		3UL
+
+static inline struct jump_entry *static_key_entries(struct static_key *key)
+{
+	WARN_ON_ONCE(key->type & JUMP_TYPE_LINKED);
+	return (struct jump_entry *)(key->type & ~JUMP_TYPE_MASK);
+}
+
 #else
 struct static_key {
 	atomic_t enabled;
@@ -132,10 +144,15 @@ struct module;
 
 #ifdef HAVE_JUMP_LABEL
 
-#define JUMP_TYPE_FALSE		0UL
-#define JUMP_TYPE_TRUE		1UL
-#define JUMP_TYPE_LINKED	2UL
-#define JUMP_TYPE_MASK		3UL
+static inline struct static_key *jump_entry_key(struct jump_entry *entry)
+{
+	return (struct static_key *)((unsigned long)entry->key & ~1UL);
+}
+
+static inline bool jump_entry_branch(struct jump_entry *entry)
+{
+	return (unsigned long)entry->key & 1UL;
+}
 
 static __always_inline bool static_key_false(struct static_key *key)
 {
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 01ebdf1f9f40..9710fa7582aa 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -295,12 +295,6 @@ void __weak __init_or_module arch_jump_label_transform_static(struct jump_entry
 	arch_jump_label_transform(entry, type);
 }
 
-static inline struct jump_entry *static_key_entries(struct static_key *key)
-{
-	WARN_ON_ONCE(key->type & JUMP_TYPE_LINKED);
-	return (struct jump_entry *)(key->type & ~JUMP_TYPE_MASK);
-}
-
 static inline bool static_key_type(struct static_key *key)
 {
 	return key->type & JUMP_TYPE_TRUE;
@@ -321,16 +315,6 @@ static inline void static_key_set_linked(struct static_key *key)
 	key->type |= JUMP_TYPE_LINKED;
 }
 
-static inline struct static_key *jump_entry_key(struct jump_entry *entry)
-{
-	return (struct static_key *)((unsigned long)entry->key & ~1UL);
-}
-
-static bool jump_entry_branch(struct jump_entry *entry)
-{
-	return (unsigned long)entry->key & 1UL;
-}
-
 /***
  * A 'struct static_key' uses a union such that it either points directly
  * to a table of 'struct jump_entry' or to a linked list of modules which in

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v11 6/6] x86/tsc: use tsc early
  2018-06-21 13:30       ` Peter Zijlstra
@ 2018-06-21 13:38         ` Pavel Tatashin
  0 siblings, 0 replies; 13+ messages in thread
From: Pavel Tatashin @ 2018-06-21 13:38 UTC (permalink / raw)
  To: peterz
  Cc: feng.tang, Steven Sistare, Daniel Jordan, linux, schwidefsky,
	Heiko Carstens, John Stultz, sboyd, x86, LKML, mingo, tglx, hpa,
	douly.fnst, prarit, Petr Mladek, gnomes, bp

> Bah, no, we don't make a mess first and then maybe clean it up.

OK, I will add this path to the series.

>
> Have a look at the below. The patch is a mess, but I have two sick kids
> on hands

Sorry to hear that, I hope your kids will get better soon.

> , please clean up / split where appropriate.

OK

Thank you,
Pavel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v11 3/6] time: replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
  2018-06-20 21:26 ` [PATCH v11 3/6] time: replace read_boot_clock64() with read_persistent_wall_and_boot_offset() Pavel Tatashin
@ 2018-06-21 16:13   ` Thomas Gleixner
  2018-06-21 16:51     ` Pavel Tatashin
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2018-06-21 16:13 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	hpa, douly.fnst, peterz, prarit, feng.tang, pmladek, gnomes

On Wed, 20 Jun 2018, Pavel Tatashin wrote:

> If architecture does not support exact boot time, it is challenging to
> estimate boot time without having a reference to the current persistent
> clock value. Yet, we cannot read the persistent clock time again, because
> this may lead to math discrepancies with the caller of read_boot_clock64()
> who have read the persistent clock at a different time.
> 
> This is why it is better to provide two values simultaneously: the
> persistent clock value, and the boot time.
> 
> Thus, we replace read_boot_clock64() with:
> read_persistent_wall_and_boot_offset(wall_time, boot_offset)
> 
> Where wall_time is returned by read_persistent_clock()
> And boot_offset is wall_time - boot time
> 
> We calculate boot_offset using the current value of local_clock() so
> architectures, that do not have a dedicated boot_clock but have early
> sched_clock(), such as SPARCv9, x86, and possibly more will benefit from
> this change by getting a better and more consistent estimate of the boot
> time without need for an arch specific implementation.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> ---
>  arch/arm/kernel/time.c      | 12 +-------
>  arch/s390/kernel/time.c     | 11 +++++--
>  include/linux/timekeeping.h |  3 +-
>  kernel/time/timekeeping.c   | 61 +++++++++++++++++++------------------

Please don't make that a wholesale patch. I surely indicated the steps
which are required and the steps can be done as separate patches easily,

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v11 3/6] time: replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
  2018-06-21 16:13   ` Thomas Gleixner
@ 2018-06-21 16:51     ` Pavel Tatashin
  0 siblings, 0 replies; 13+ messages in thread
From: Pavel Tatashin @ 2018-06-21 16:51 UTC (permalink / raw)
  To: tglx
  Cc: Steven Sistare, Daniel Jordan, linux, schwidefsky,
	Heiko Carstens, John Stultz, sboyd, x86, LKML, mingo, hpa,
	douly.fnst, peterz, prarit, feng.tang, Petr Mladek, gnomes

> Please don't make that a wholesale patch. I surely indicated the steps
> which are required and the steps can be done as separate patches easily,

Hi Thomas,

I will split it into several patches in the next version.

Thank you,
Pavel

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2018-06-21 16:51 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-20 21:26 [PATCH v11 0/6] Early boot time stamps for x86 Pavel Tatashin
2018-06-20 21:26 ` [PATCH v11 1/6] x86/tsc: redefine notsc to behave as tsc=unstable Pavel Tatashin
2018-06-20 21:26 ` [PATCH v11 2/6] kvm/x86: remove kvm memblock dependency Pavel Tatashin
2018-06-20 21:26 ` [PATCH v11 3/6] time: replace read_boot_clock64() with read_persistent_wall_and_boot_offset() Pavel Tatashin
2018-06-21 16:13   ` Thomas Gleixner
2018-06-21 16:51     ` Pavel Tatashin
2018-06-20 21:26 ` [PATCH v11 4/6] x86/tsc: prepare for early sched_clock Pavel Tatashin
2018-06-20 21:26 ` [PATCH v11 5/6] sched: early boot clock Pavel Tatashin
2018-06-20 21:27 ` [PATCH v11 6/6] x86/tsc: use tsc early Pavel Tatashin
2018-06-21  5:51   ` Feng Tang
2018-06-21 11:16     ` Pavel Tatashin
2018-06-21 13:30       ` Peter Zijlstra
2018-06-21 13:38         ` Pavel Tatashin
