linux-kernel.vger.kernel.org archive mirror
* [PATCH v15 00/26] Early boot time stamps
@ 2018-07-19 20:55 Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 01/26] x86/kvmclock: Remove memblock dependency Pavel Tatashin
                   ` (26 more replies)
  0 siblings, 27 replies; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

changelog
---------
v15 - v14
	Repo: https://github.com/soleen/time_15.git
	- dropped "x86/kvmclock: Avoid TSC recalibration" as Paolo Bonzini
	  suggested
	- Fixed when sched_clock_running is set in "sched: early boot clock",
	  and moved __sched_clock_gtod_offset inside the IRQ-disabled
	  region, as Peter noticed.
	- Addressed comments from Dou Liyang: added missing __inits and
	  X86_FEATURE_TSC_DEADLINE_TIMER, and fixed spelling.
	- Fixed xen_sched_clock_offset on xen hvm (noticed by Boris
	  Ostrovsky).
	- Added two patches to address Peter Zijlstra's request to split
	  native cpu calibration into early and late parts. The patches
	  are:
	  x86/tsc:  split native_calibrate_cpu() into early and late parts
	  x86/tsc:  use tsc_calibrate_cpu_early and
	    pit_hpet_ptimer_calibrate_cpu

v14 - v13
	- Included Thomas' KVM clock series, addressed comments from
	  reviewers.
	http://lkml.kernel.org/r/20180706161307.733337643@linutronix.de
	- Fixed xen hvm panic reported by Boris
	- Fixed build issue on microblaze

v13 - v12
	- Addressed comments from Thomas Gleixner.
	- Addressed comments from Peter Zijlstra.
	- Added a patch from Borislav Petkov
	- Added a new patch: sched: use static key for sched_clock_running
	- Added xen pv fixes, so clock is initialized when other
	  hypervisors initialize their clocks.
	Note: I am including "kvm/x86: remove kvm memblock dependency",
	which is part of this series:
	http://lkml.kernel.org/r/20180706161307.733337643@linutronix.de
	because without that patch it is not possible to test this series
	on KVM.

v12 - v11
	- split "time: replace read_boot_clock64() with
	  read_persistent_wall_and_boot_offset()" into four patches
	- Added two patches: one fixes an existing bug with text_poke(),
	  the other enables static branches early. Note, because I found
	  and fixed the text_poke() bug, enabling static branching became
	  super easy, as no changes to jump_label* are needed.
	- Modified "x86/tsc: use tsc early" to use static branches early;
	  thus native_sched_clock() is not changed at all.
v11 - v10
	- Addressed all the comments from Thomas Gleixner.
	- I added one more patch:
	  "x86/tsc: prepare for early sched_clock" which fixes a problem
	  that I discovered while testing. I am not particularly happy with
	  the fix, as it adds a new argument that is used only in one
	  place, but if you have a suggestion for a different approach on
	  how to address this problem please let me know.

v10 - v9
	- Added another patch to this series that removes the dependency
	  between the KVM clock and the memblock allocator. The benefit is
	  that all clocks can now be initialized even earlier.
v9 - v8
	- Addressed more comments from Dou Liyang

v8 - v7
	- Addressed comments from Dou Liyang:
	- Moved tsc_early_init() and tsc_early_fini() to be all inside
	  tsc.c, and changed them to be static.
	- Removed warning when notsc parameter is used.
	- Merged with:
	  https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

v7 - v6
	- Removed tsc_disabled flag; now notsc is equivalent to
	  tsc=unstable
	- Simplified changes to sched/clock.c by removing
	  sched_clock_early() and friends, as requested by Peter Zijlstra.
	  We now always use sched_clock()
	- Modified x86 sched_clock() to return either early boot time or
	  regular.
	- Added another example of why early boot time is important

v6 - v5
	- Added a new patch:
		time: sync read_boot_clock64() with persistent clock
	  which fixes a missing __init macro and enables the time
	  discrepancy fix that was noted by Thomas Gleixner
	- Split "x86/time: read_boot_clock64() implementation" into a
	  separate patch

v5 - v4
	- Fix compiler warnings on systems with stable clocks.

v4 - v3
	- Fixed tsc_early_fini() call to be in the 2nd patch as reported
	  by Dou Liyang
	- Improved comment before __use_sched_clock_early to explain why
	  we need both booleans.
	- Simplified valid_clock logic in read_boot_clock64().

v3 - v2
	- Addressed comment from Thomas Gleixner
	- Timestamps are available a little later in boot but still much
	  earlier than in mainline. This significantly simplified this
	  work.

v2 - v1
	In patch "x86/tsc: tsc early":
	- added tsc_adjusted_early()
	- fixed a 32-bit compile error by using do_div()

The early boot time stamps were discussed recently in these threads:
http://lkml.kernel.org/r/1527672059-6225-1-git-send-email-feng.tang@intel.com
http://lkml.kernel.org/r/1527672059-6225-2-git-send-email-feng.tang@intel.com

I updated my series to the latest mainline and am sending it again.

Peter mentioned he did not like patches 6 and 7, and we can discuss a
better way to do that, but I think patches 1-5 can be accepted separately,
since they already enable early timestamps on platforms where
sched_clock() is available early, such as KVM.

This series adds early boot time stamp support for x86 machines.
SPARC patches for early boot time stamps are already integrated into
mainline Linux.

Sample output
-------------
Before:
https://paste.ubuntu.com/26133428/

After:
https://paste.ubuntu.com/26133523/

For examples of how early time stamps are used, see the following:
Example 1:
https://lwn.net/Articles/734374/
- Without early boot time stamps we would not know about the extra time
  that is spent zeroing struct pages early in boot, even when deferred
  page initialization is enabled.

Example 2:
https://patchwork.kernel.org/patch/10021247/
- If early boot timestamps were available, the engineer who introduced
  this bug would have noticed the extra time that is spent early in boot.

Example 3:
http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.com
- Needed early time stamps to show improvement
Borislav Petkov (1):
  x86/CPU: Call detect_nopl() only on the BSP

Pavel Tatashin (19):
  x86/kvmclock: Remove memblock dependency
  x86: text_poke() may access uninitialized struct pages
  x86: initialize static branching early
  x86/tsc: redefine notsc to behave as tsc=unstable
  x86/xen/time: initialize pv xen time in init_hypervisor_platform
  x86/xen/time: output xen sched_clock time from 0
  s390/time: add read_persistent_wall_and_boot_offset()
  time: replace read_boot_clock64() with
    read_persistent_wall_and_boot_offset()
  time: default boot time offset to local_clock()
  s390/time: remove read_boot_clock64()
  ARM/time: remove read_boot_clock64()
  x86/tsc: calibrate tsc only once
  x86/tsc: initialize cyc2ns when tsc freq. is determined
  x86/tsc: use tsc early
  sched: move sched clock initialization and merge with generic clock
  sched: early boot clock
  sched: use static key for sched_clock_running
  x86/tsc:  split native_calibrate_cpu() into early and late parts
  x86/tsc:  use tsc_calibrate_cpu_early and
    pit_hpet_ptimer_calibrate_cpu

Thomas Gleixner (6):
  x86/kvmclock: Remove page size requirement from wall_clock
  x86/kvmclock: Decrapify kvm_register_clock()
  x86/kvmclock: Cleanup the code
  x86/kvmclock: Mark variables __initdata and __ro_after_init
  x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock
  x86/kvmclock: Switch kvmclock data to a PER_CPU variable

 .../admin-guide/kernel-parameters.txt         |   2 -
 Documentation/x86/x86_64/boot-options.txt     |   4 +-
 arch/arm/include/asm/mach/time.h              |   3 +-
 arch/arm/kernel/time.c                        |  15 +-
 arch/arm/plat-omap/counter_32k.c              |   2 +-
 arch/s390/kernel/time.c                       |  15 +-
 arch/x86/include/asm/kvm_guest.h              |   7 -
 arch/x86/include/asm/kvm_para.h               |   1 -
 arch/x86/include/asm/text-patching.h          |   1 +
 arch/x86/include/asm/tsc.h                    |   4 +-
 arch/x86/kernel/alternative.c                 |   7 +
 arch/x86/kernel/cpu/amd.c                     |  13 +-
 arch/x86/kernel/cpu/common.c                  |  40 +--
 arch/x86/kernel/jump_label.c                  |  11 +-
 arch/x86/kernel/kvm.c                         |  14 +-
 arch/x86/kernel/kvmclock.c                    | 256 +++++++-----------
 arch/x86/kernel/setup.c                       |  10 +-
 arch/x86/kernel/tsc.c                         | 253 +++++++++--------
 arch/x86/kernel/x86_init.c                    |   2 +-
 arch/x86/xen/enlighten_pv.c                   |  51 ++--
 arch/x86/xen/mmu_pv.c                         |   6 +-
 arch/x86/xen/suspend_pv.c                     |   5 +-
 arch/x86/xen/time.c                           |  18 +-
 arch/x86/xen/xen-ops.h                        |   6 +-
 drivers/clocksource/tegra20_timer.c           |   2 +-
 include/linux/sched_clock.h                   |   5 +-
 include/linux/timekeeping.h                   |   3 +-
 init/main.c                                   |   4 +-
 kernel/sched/clock.c                          |  59 ++--
 kernel/sched/core.c                           |   1 -
 kernel/sched/debug.c                          |   2 -
 kernel/time/sched_clock.c                     |   2 +-
 kernel/time/timekeeping.c                     |  62 +++--
 33 files changed, 439 insertions(+), 447 deletions(-)
 delete mode 100644 arch/x86/include/asm/kvm_guest.h

-- 
2.18.0


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH v15 01/26] x86/kvmclock: Remove memblock dependency
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:21   ` [tip:x86/timers] " tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 02/26] x86/kvmclock: Remove page size requirement from wall_clock Pavel Tatashin
                   ` (25 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

KVM clock is initialized later compared to other hypervisor clocks because
it has a dependency on the memblock allocator.

Bring it in line with other hypervisors by using memory from the BSS
instead of allocating it.

The benefits:

  - Remove ifdef from common code
  - Earlier availability of the clock
  - Remove dependency on memblock, and reduce code

The downside:

  - Static allocation of the per-CPU data structures, sized NR_CPUS * 64
    bytes. This will be addressed in follow-up patches.

[ tglx: Split out from larger series ]

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kernel/kvm.c      |  1 +
 arch/x86/kernel/kvmclock.c | 66 +++++++-------------------------------
 arch/x86/kernel/setup.c    |  4 ---
 3 files changed, 12 insertions(+), 59 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5b2300b818af..c65c232d3ddd 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -628,6 +628,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
 	.name			= "KVM",
 	.detect			= kvm_detect,
 	.type			= X86_HYPER_KVM,
+	.init.init_platform	= kvmclock_init,
 	.init.guest_late_init	= kvm_guest_init,
 	.init.x2apic_available	= kvm_para_available,
 };
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 3b8e7c13c614..1f6ac5aaa904 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -23,9 +23,9 @@
 #include <asm/apic.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
-#include <linux/memblock.h>
 #include <linux/sched.h>
 #include <linux/sched/clock.h>
+#include <linux/mm.h>
 
 #include <asm/mem_encrypt.h>
 #include <asm/x86_init.h>
@@ -44,6 +44,13 @@ static int parse_no_kvmclock(char *arg)
 }
 early_param("no-kvmclock", parse_no_kvmclock);
 
+/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
+#define HV_CLOCK_SIZE	(sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
+#define WALL_CLOCK_SIZE	(sizeof(struct pvclock_wall_clock))
+
+static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
+static u8 wall_clock_mem[PAGE_ALIGN(WALL_CLOCK_SIZE)] __aligned(PAGE_SIZE);
+
 /* The hypervisor will put information about time periodically here */
 static struct pvclock_vsyscall_time_info *hv_clock;
 static struct pvclock_wall_clock *wall_clock;
@@ -245,43 +252,12 @@ static void kvm_shutdown(void)
 	native_machine_shutdown();
 }
 
-static phys_addr_t __init kvm_memblock_alloc(phys_addr_t size,
-					     phys_addr_t align)
-{
-	phys_addr_t mem;
-
-	mem = memblock_alloc(size, align);
-	if (!mem)
-		return 0;
-
-	if (sev_active()) {
-		if (early_set_memory_decrypted((unsigned long)__va(mem), size))
-			goto e_free;
-	}
-
-	return mem;
-e_free:
-	memblock_free(mem, size);
-	return 0;
-}
-
-static void __init kvm_memblock_free(phys_addr_t addr, phys_addr_t size)
-{
-	if (sev_active())
-		early_set_memory_encrypted((unsigned long)__va(addr), size);
-
-	memblock_free(addr, size);
-}
-
 void __init kvmclock_init(void)
 {
 	struct pvclock_vcpu_time_info *vcpu_time;
-	unsigned long mem, mem_wall_clock;
-	int size, cpu, wall_clock_size;
+	int cpu;
 	u8 flags;
 
-	size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
 	if (!kvm_para_available())
 		return;
 
@@ -291,28 +267,11 @@ void __init kvmclock_init(void)
 	} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
 		return;
 
-	wall_clock_size = PAGE_ALIGN(sizeof(struct pvclock_wall_clock));
-	mem_wall_clock = kvm_memblock_alloc(wall_clock_size, PAGE_SIZE);
-	if (!mem_wall_clock)
-		return;
-
-	wall_clock = __va(mem_wall_clock);
-	memset(wall_clock, 0, wall_clock_size);
-
-	mem = kvm_memblock_alloc(size, PAGE_SIZE);
-	if (!mem) {
-		kvm_memblock_free(mem_wall_clock, wall_clock_size);
-		wall_clock = NULL;
-		return;
-	}
-
-	hv_clock = __va(mem);
-	memset(hv_clock, 0, size);
+	wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
+	hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
 
 	if (kvm_register_clock("primary cpu clock")) {
 		hv_clock = NULL;
-		kvm_memblock_free(mem, size);
-		kvm_memblock_free(mem_wall_clock, wall_clock_size);
 		wall_clock = NULL;
 		return;
 	}
@@ -357,13 +316,10 @@ int __init kvm_setup_vsyscall_timeinfo(void)
 	int cpu;
 	u8 flags;
 	struct pvclock_vcpu_time_info *vcpu_time;
-	unsigned int size;
 
 	if (!hv_clock)
 		return 0;
 
-	size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
 	cpu = get_cpu();
 
 	vcpu_time = &hv_clock[cpu].pvti;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2f86d883dd95..da1dbd99cb6e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1197,10 +1197,6 @@ void __init setup_arch(char **cmdline_p)
 
 	memblock_find_dma_reserve();
 
-#ifdef CONFIG_KVM_GUEST
-	kvmclock_init();
-#endif
-
 	tsc_early_delay_calibrate();
 	if (!early_xdbc_setup_hardware())
 		early_xdbc_register_console();
-- 
2.18.0



* [PATCH v15 02/26] x86/kvmclock: Remove page size requirement from wall_clock
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 01/26] x86/kvmclock: Remove memblock dependency Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:22   ` [tip:x86/timers] " tip-bot for Thomas Gleixner
  2018-07-19 20:55 ` [PATCH v15 03/26] x86/kvmclock: Decrapify kvm_register_clock() Pavel Tatashin
                   ` (24 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

From: Thomas Gleixner <tglx@linutronix.de>

There is no requirement for wall_clock data to be page aligned or page
sized.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kernel/kvmclock.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 1f6ac5aaa904..a995d7d7164c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -46,14 +46,12 @@ early_param("no-kvmclock", parse_no_kvmclock);
 
 /* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
 #define HV_CLOCK_SIZE	(sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
-#define WALL_CLOCK_SIZE	(sizeof(struct pvclock_wall_clock))
 
 static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
-static u8 wall_clock_mem[PAGE_ALIGN(WALL_CLOCK_SIZE)] __aligned(PAGE_SIZE);
 
 /* The hypervisor will put information about time periodically here */
 static struct pvclock_vsyscall_time_info *hv_clock;
-static struct pvclock_wall_clock *wall_clock;
+static struct pvclock_wall_clock wall_clock;
 
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
@@ -66,15 +64,15 @@ static void kvm_get_wallclock(struct timespec64 *now)
 	int low, high;
 	int cpu;
 
-	low = (int)slow_virt_to_phys(wall_clock);
-	high = ((u64)slow_virt_to_phys(wall_clock) >> 32);
+	low = (int)slow_virt_to_phys(&wall_clock);
+	high = ((u64)slow_virt_to_phys(&wall_clock) >> 32);
 
 	native_write_msr(msr_kvm_wall_clock, low, high);
 
 	cpu = get_cpu();
 
 	vcpu_time = &hv_clock[cpu].pvti;
-	pvclock_read_wallclock(wall_clock, vcpu_time, now);
+	pvclock_read_wallclock(&wall_clock, vcpu_time, now);
 
 	put_cpu();
 }
@@ -267,12 +265,10 @@ void __init kvmclock_init(void)
 	} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
 		return;
 
-	wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
 	hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
 
 	if (kvm_register_clock("primary cpu clock")) {
 		hv_clock = NULL;
-		wall_clock = NULL;
 		return;
 	}
 
-- 
2.18.0



* [PATCH v15 03/26] x86/kvmclock: Decrapify kvm_register_clock()
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 01/26] x86/kvmclock: Remove memblock dependency Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 02/26] x86/kvmclock: Remove page size requirement from wall_clock Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:23   ` [tip:x86/timers] " tip-bot for Thomas Gleixner
  2018-07-19 20:55 ` [PATCH v15 04/26] x86/kvmclock: Cleanup the code Pavel Tatashin
                   ` (23 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

From: Thomas Gleixner <tglx@linutronix.de>

The return value is pointless because the wrmsr cannot fail if
KVM_FEATURE_CLOCKSOURCE or KVM_FEATURE_CLOCKSOURCE2 is set.

kvm_register_clock() is only called locally, so it can be static.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/kvm_para.h |  1 -
 arch/x86/kernel/kvmclock.c      | 33 ++++++++++-----------------------
 2 files changed, 10 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 3aea2658323a..4c723632c036 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -7,7 +7,6 @@
 #include <uapi/asm/kvm_para.h>
 
 extern void kvmclock_init(void);
-extern int kvm_register_clock(char *txt);
 
 #ifdef CONFIG_KVM_GUEST
 bool kvm_check_and_clear_guest_paused(void);
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index a995d7d7164c..f0a0aef5e9fa 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -187,23 +187,19 @@ struct clocksource kvm_clock = {
 };
 EXPORT_SYMBOL_GPL(kvm_clock);
 
-int kvm_register_clock(char *txt)
+static void kvm_register_clock(char *txt)
 {
-	int cpu = smp_processor_id();
-	int low, high, ret;
 	struct pvclock_vcpu_time_info *src;
+	int cpu = smp_processor_id();
+	u64 pa;
 
 	if (!hv_clock)
-		return 0;
+		return;
 
 	src = &hv_clock[cpu].pvti;
-	low = (int)slow_virt_to_phys(src) | 1;
-	high = ((u64)slow_virt_to_phys(src) >> 32);
-	ret = native_write_msr_safe(msr_kvm_system_time, low, high);
-	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
-	       cpu, high, low, txt);
-
-	return ret;
+	pa = slow_virt_to_phys(src) | 0x01ULL;
+	wrmsrl(msr_kvm_system_time, pa);
+	pr_info("kvm-clock: cpu %d, msr %llx, %s\n", cpu, pa, txt);
 }
 
 static void kvm_save_sched_clock_state(void)
@@ -218,11 +214,7 @@ static void kvm_restore_sched_clock_state(void)
 #ifdef CONFIG_X86_LOCAL_APIC
 static void kvm_setup_secondary_clock(void)
 {
-	/*
-	 * Now that the first cpu already had this clocksource initialized,
-	 * we shouldn't fail.
-	 */
-	WARN_ON(kvm_register_clock("secondary cpu clock"));
+	kvm_register_clock("secondary cpu clock");
 }
 #endif
 
@@ -265,16 +257,11 @@ void __init kvmclock_init(void)
 	} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
 		return;
 
-	hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
-
-	if (kvm_register_clock("primary cpu clock")) {
-		hv_clock = NULL;
-		return;
-	}
-
 	printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
 		msr_kvm_system_time, msr_kvm_wall_clock);
 
+	hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
+	kvm_register_clock("primary cpu clock");
 	pvclock_set_pvti_cpu0_va(hv_clock);
 
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
-- 
2.18.0



* [PATCH v15 04/26] x86/kvmclock: Cleanup the code
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (2 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 03/26] x86/kvmclock: Decrapify kvm_register_clock() Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:23   ` [tip:x86/timers] " tip-bot for Thomas Gleixner
  2018-07-19 20:55 ` [PATCH v15 05/26] x86/kvmclock: Mark variables __initdata and __ro_after_init Pavel Tatashin
                   ` (22 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

From: Thomas Gleixner <tglx@linutronix.de>

- Clean up the MSR write for the wall clock. The type casts to (int) are
  sloppy because the wrmsr parameters are u32 and, aside from that,
  wrmsrl() already provides the high/low split for free.

- Remove the pointless get_cpu()/put_cpu() dance from various
  functions. Either they are called during early init, where the CPU is
  guaranteed to be 0, or they are already called from non-preemptible
  context where smp_processor_id() can be used safely.

- Simplify the convoluted check for kvmclock in the init function.

- Mark the parameter parsing function __init. No point in keeping it
  around.

- Convert to pr_info()

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kernel/kvmclock.c | 72 ++++++++++++--------------------------
 1 file changed, 22 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index f0a0aef5e9fa..4afb03e49a4f 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -37,7 +37,7 @@ static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
 static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
 static u64 kvm_sched_clock_offset;
 
-static int parse_no_kvmclock(char *arg)
+static int __init parse_no_kvmclock(char *arg)
 {
 	kvmclock = 0;
 	return 0;
@@ -61,13 +61,9 @@ static struct pvclock_wall_clock wall_clock;
 static void kvm_get_wallclock(struct timespec64 *now)
 {
 	struct pvclock_vcpu_time_info *vcpu_time;
-	int low, high;
 	int cpu;
 
-	low = (int)slow_virt_to_phys(&wall_clock);
-	high = ((u64)slow_virt_to_phys(&wall_clock) >> 32);
-
-	native_write_msr(msr_kvm_wall_clock, low, high);
+	wrmsrl(msr_kvm_wall_clock, slow_virt_to_phys(&wall_clock));
 
 	cpu = get_cpu();
 
@@ -117,11 +113,11 @@ static inline void kvm_sched_clock_init(bool stable)
 	kvm_sched_clock_offset = kvm_clock_read();
 	pv_time_ops.sched_clock = kvm_sched_clock_read;
 
-	printk(KERN_INFO "kvm-clock: using sched offset of %llu cycles\n",
-			kvm_sched_clock_offset);
+	pr_info("kvm-clock: using sched offset of %llu cycles",
+		kvm_sched_clock_offset);
 
 	BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
-	         sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
+		sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
 }
 
 /*
@@ -135,16 +131,8 @@ static inline void kvm_sched_clock_init(bool stable)
  */
 static unsigned long kvm_get_tsc_khz(void)
 {
-	struct pvclock_vcpu_time_info *src;
-	int cpu;
-	unsigned long tsc_khz;
-
-	cpu = get_cpu();
-	src = &hv_clock[cpu].pvti;
-	tsc_khz = pvclock_tsc_khz(src);
-	put_cpu();
 	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
-	return tsc_khz;
+	return pvclock_tsc_khz(&hv_clock[0].pvti);
 }
 
 static void kvm_get_preset_lpj(void)
@@ -161,29 +149,27 @@ static void kvm_get_preset_lpj(void)
 
 bool kvm_check_and_clear_guest_paused(void)
 {
-	bool ret = false;
 	struct pvclock_vcpu_time_info *src;
-	int cpu = smp_processor_id();
+	bool ret = false;
 
 	if (!hv_clock)
 		return ret;
 
-	src = &hv_clock[cpu].pvti;
+	src = &hv_clock[smp_processor_id()].pvti;
 	if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
 		src->flags &= ~PVCLOCK_GUEST_STOPPED;
 		pvclock_touch_watchdogs();
 		ret = true;
 	}
-
 	return ret;
 }
 
 struct clocksource kvm_clock = {
-	.name = "kvm-clock",
-	.read = kvm_clock_get_cycles,
-	.rating = 400,
-	.mask = CLOCKSOURCE_MASK(64),
-	.flags = CLOCK_SOURCE_IS_CONTINUOUS,
+	.name	= "kvm-clock",
+	.read	= kvm_clock_get_cycles,
+	.rating	= 400,
+	.mask	= CLOCKSOURCE_MASK(64),
+	.flags	= CLOCK_SOURCE_IS_CONTINUOUS,
 };
 EXPORT_SYMBOL_GPL(kvm_clock);
 
@@ -199,7 +185,7 @@ static void kvm_register_clock(char *txt)
 	src = &hv_clock[cpu].pvti;
 	pa = slow_virt_to_phys(src) | 0x01ULL;
 	wrmsrl(msr_kvm_system_time, pa);
-	pr_info("kvm-clock: cpu %d, msr %llx, %s\n", cpu, pa, txt);
+	pr_info("kvm-clock: cpu %d, msr %llx, %s", cpu, pa, txt);
 }
 
 static void kvm_save_sched_clock_state(void)
@@ -244,20 +230,19 @@ static void kvm_shutdown(void)
 
 void __init kvmclock_init(void)
 {
-	struct pvclock_vcpu_time_info *vcpu_time;
-	int cpu;
 	u8 flags;
 
-	if (!kvm_para_available())
+	if (!kvm_para_available() || !kvmclock)
 		return;
 
-	if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
+	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
 		msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
 		msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
-	} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
+	} else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
 		return;
+	}
 
-	printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
+	pr_info("kvm-clock: Using msrs %x and %x",
 		msr_kvm_system_time, msr_kvm_wall_clock);
 
 	hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
@@ -267,20 +252,15 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 
-	cpu = get_cpu();
-	vcpu_time = &hv_clock[cpu].pvti;
-	flags = pvclock_read_flags(vcpu_time);
-
+	flags = pvclock_read_flags(&hv_clock[0].pvti);
 	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
-	put_cpu();
 
 	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
 	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
 	x86_platform.get_wallclock = kvm_get_wallclock;
 	x86_platform.set_wallclock = kvm_set_wallclock;
 #ifdef CONFIG_X86_LOCAL_APIC
-	x86_cpuinit.early_percpu_clock_init =
-		kvm_setup_secondary_clock;
+	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
 #endif
 	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
 	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
@@ -296,20 +276,12 @@ void __init kvmclock_init(void)
 int __init kvm_setup_vsyscall_timeinfo(void)
 {
 #ifdef CONFIG_X86_64
-	int cpu;
 	u8 flags;
-	struct pvclock_vcpu_time_info *vcpu_time;
 
 	if (!hv_clock)
 		return 0;
 
-	cpu = get_cpu();
-
-	vcpu_time = &hv_clock[cpu].pvti;
-	flags = pvclock_read_flags(vcpu_time);
-
-	put_cpu();
-
+	flags = pvclock_read_flags(&hv_clock[0].pvti);
 	if (!(flags & PVCLOCK_TSC_STABLE_BIT))
 		return 1;
 
-- 
2.18.0



* [PATCH v15 05/26] x86/kvmclock: Mark variables __initdata and __ro_after_init
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (3 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 04/26] x86/kvmclock: Cleanup the code Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:24   ` [tip:x86/timers] " tip-bot for Thomas Gleixner
  2018-07-19 20:55 ` [PATCH v15 06/26] x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock Pavel Tatashin
                   ` (21 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

From: Thomas Gleixner <tglx@linutronix.de>

The kvmclock parameter is init data and the other variables are not
modified after init.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kernel/kvmclock.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 4afb03e49a4f..78aec160f5e0 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -32,10 +32,10 @@
 #include <asm/reboot.h>
 #include <asm/kvmclock.h>
 
-static int kvmclock __ro_after_init = 1;
-static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
-static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
-static u64 kvm_sched_clock_offset;
+static int kvmclock __initdata = 1;
+static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
+static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
+static u64 kvm_sched_clock_offset __ro_after_init;
 
 static int __init parse_no_kvmclock(char *arg)
 {
@@ -50,7 +50,7 @@ early_param("no-kvmclock", parse_no_kvmclock);
 static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
 
 /* The hypervisor will put information about time periodically here */
-static struct pvclock_vsyscall_time_info *hv_clock;
+static struct pvclock_vsyscall_time_info *hv_clock __ro_after_init;
 static struct pvclock_wall_clock wall_clock;
 
 /*
-- 
2.18.0



* [PATCH v15 06/26] x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (4 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 05/26] x86/kvmclock: Mark variables __initdata and __ro_after_init Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:24   ` [tip:x86/timers] " tip-bot for Thomas Gleixner
  2018-07-19 20:55 ` [PATCH v15 07/26] x86/kvmclock: Switch kvmclock data to a PER_CPU variable Pavel Tatashin
                   ` (20 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

From: Thomas Gleixner <tglx@linutronix.de>

There is no point in having this in the KVM guest code and calling it from
there. It can be called from an initcall, and the parameter is cleared
when the hypervisor is not KVM.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/kvm_guest.h |  7 -----
 arch/x86/kernel/kvm.c            | 13 ----------
 arch/x86/kernel/kvmclock.c       | 44 ++++++++++++++++++++------------
 3 files changed, 27 insertions(+), 37 deletions(-)
 delete mode 100644 arch/x86/include/asm/kvm_guest.h

diff --git a/arch/x86/include/asm/kvm_guest.h b/arch/x86/include/asm/kvm_guest.h
deleted file mode 100644
index 46185263d9c2..000000000000
--- a/arch/x86/include/asm/kvm_guest.h
+++ /dev/null
@@ -1,7 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_X86_KVM_GUEST_H
-#define _ASM_X86_KVM_GUEST_H
-
-int kvm_setup_vsyscall_timeinfo(void);
-
-#endif /* _ASM_X86_KVM_GUEST_H */
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c65c232d3ddd..a560750cc76f 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -45,7 +45,6 @@
 #include <asm/apic.h>
 #include <asm/apicdef.h>
 #include <asm/hypervisor.h>
-#include <asm/kvm_guest.h>
 
 static int kvmapf = 1;
 
@@ -66,15 +65,6 @@ static int __init parse_no_stealacc(char *arg)
 
 early_param("no-steal-acc", parse_no_stealacc);
 
-static int kvmclock_vsyscall = 1;
-static int __init parse_no_kvmclock_vsyscall(char *arg)
-{
-        kvmclock_vsyscall = 0;
-        return 0;
-}
-
-early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
-
 static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU_DECRYPTED(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
@@ -560,9 +550,6 @@ static void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
 		apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
-	if (kvmclock_vsyscall)
-		kvm_setup_vsyscall_timeinfo();
-
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 78aec160f5e0..7d690d2238f8 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -27,12 +27,14 @@
 #include <linux/sched/clock.h>
 #include <linux/mm.h>
 
+#include <asm/hypervisor.h>
 #include <asm/mem_encrypt.h>
 #include <asm/x86_init.h>
 #include <asm/reboot.h>
 #include <asm/kvmclock.h>
 
 static int kvmclock __initdata = 1;
+static int kvmclock_vsyscall __initdata = 1;
 static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
 static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
 static u64 kvm_sched_clock_offset __ro_after_init;
@@ -44,6 +46,13 @@ static int __init parse_no_kvmclock(char *arg)
 }
 early_param("no-kvmclock", parse_no_kvmclock);
 
+static int __init parse_no_kvmclock_vsyscall(char *arg)
+{
+	kvmclock_vsyscall = 0;
+	return 0;
+}
+early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
+
 /* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
 #define HV_CLOCK_SIZE	(sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
 
@@ -228,6 +237,24 @@ static void kvm_shutdown(void)
 	native_machine_shutdown();
 }
 
+static int __init kvm_setup_vsyscall_timeinfo(void)
+{
+#ifdef CONFIG_X86_64
+	u8 flags;
+
+	if (!hv_clock || !kvmclock_vsyscall)
+		return 0;
+
+	flags = pvclock_read_flags(&hv_clock[0].pvti);
+	if (!(flags & PVCLOCK_TSC_STABLE_BIT))
+		return 1;
+
+	kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
+#endif
+	return 0;
+}
+early_initcall(kvm_setup_vsyscall_timeinfo);
+
 void __init kvmclock_init(void)
 {
 	u8 flags;
@@ -272,20 +299,3 @@ void __init kvmclock_init(void)
 	clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
 	pv_info.name = "KVM";
 }
-
-int __init kvm_setup_vsyscall_timeinfo(void)
-{
-#ifdef CONFIG_X86_64
-	u8 flags;
-
-	if (!hv_clock)
-		return 0;
-
-	flags = pvclock_read_flags(&hv_clock[0].pvti);
-	if (!(flags & PVCLOCK_TSC_STABLE_BIT))
-		return 1;
-
-	kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
-#endif
-	return 0;
-}
-- 
2.18.0



* [PATCH v15 07/26] x86/kvmclock: Switch kvmclock data to a PER_CPU variable
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (5 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 06/26] x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:25   ` [tip:x86/timers] " tip-bot for Thomas Gleixner
  2018-07-19 20:55 ` [PATCH v15 08/26] x86: text_poke() may access uninitialized struct pages Pavel Tatashin
                   ` (19 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

From: Thomas Gleixner <tglx@linutronix.de>

The previous removal of the memblock dependency from kvmclock introduced a
static data array sized 64 bytes * CONFIG_NR_CPUS. That's wasteful on large
systems when kvmclock is not used.

Replace it with:

 - A static page-sized array of pvclock data. It's page sized because the
   pvclock data of the boot CPU is mapped into the VDSO; otherwise random
   other data would be exposed through the VDSO.

 - A PER_CPU variable of pvclock data pointers. This is used to access the
   pvclock data storage on each CPU.

The setup is done in two stages:

 - Early boot stores the pointer to the static page for the boot CPU in
   the per cpu data.

 - In the preparatory stage of CPU hotplug, assign either an element of
   the static array (when the CPU number is in that range) or allocate
   memory and initialize the per cpu pointer.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kernel/kvmclock.c | 99 ++++++++++++++++++++++++--------------
 1 file changed, 62 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 7d690d2238f8..91b94c0ae4e3 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -23,6 +23,7 @@
 #include <asm/apic.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
+#include <linux/cpuhotplug.h>
 #include <linux/sched.h>
 #include <linux/sched/clock.h>
 #include <linux/mm.h>
@@ -55,12 +56,23 @@ early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
 
 /* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
 #define HV_CLOCK_SIZE	(sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
+#define HVC_BOOT_ARRAY_SIZE \
+	(PAGE_SIZE / sizeof(struct pvclock_vsyscall_time_info))
 
-static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
-
-/* The hypervisor will put information about time periodically here */
-static struct pvclock_vsyscall_time_info *hv_clock __ro_after_init;
+static struct pvclock_vsyscall_time_info
+			hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __aligned(PAGE_SIZE);
 static struct pvclock_wall_clock wall_clock;
+static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
+
+static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
+{
+	return &this_cpu_read(hv_clock_per_cpu)->pvti;
+}
+
+static inline struct pvclock_vsyscall_time_info *this_cpu_hvclock(void)
+{
+	return this_cpu_read(hv_clock_per_cpu);
+}
 
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
@@ -69,17 +81,10 @@ static struct pvclock_wall_clock wall_clock;
  */
 static void kvm_get_wallclock(struct timespec64 *now)
 {
-	struct pvclock_vcpu_time_info *vcpu_time;
-	int cpu;
-
 	wrmsrl(msr_kvm_wall_clock, slow_virt_to_phys(&wall_clock));
-
-	cpu = get_cpu();
-
-	vcpu_time = &hv_clock[cpu].pvti;
-	pvclock_read_wallclock(&wall_clock, vcpu_time, now);
-
-	put_cpu();
+	preempt_disable();
+	pvclock_read_wallclock(&wall_clock, this_cpu_pvti(), now);
+	preempt_enable();
 }
 
 static int kvm_set_wallclock(const struct timespec64 *now)
@@ -89,14 +94,10 @@ static int kvm_set_wallclock(const struct timespec64 *now)
 
 static u64 kvm_clock_read(void)
 {
-	struct pvclock_vcpu_time_info *src;
 	u64 ret;
-	int cpu;
 
 	preempt_disable_notrace();
-	cpu = smp_processor_id();
-	src = &hv_clock[cpu].pvti;
-	ret = pvclock_clocksource_read(src);
+	ret = pvclock_clocksource_read(this_cpu_pvti());
 	preempt_enable_notrace();
 	return ret;
 }
@@ -141,7 +142,7 @@ static inline void kvm_sched_clock_init(bool stable)
 static unsigned long kvm_get_tsc_khz(void)
 {
 	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
-	return pvclock_tsc_khz(&hv_clock[0].pvti);
+	return pvclock_tsc_khz(this_cpu_pvti());
 }
 
 static void kvm_get_preset_lpj(void)
@@ -158,15 +159,14 @@ static void kvm_get_preset_lpj(void)
 
 bool kvm_check_and_clear_guest_paused(void)
 {
-	struct pvclock_vcpu_time_info *src;
+	struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
 	bool ret = false;
 
-	if (!hv_clock)
+	if (!src)
 		return ret;
 
-	src = &hv_clock[smp_processor_id()].pvti;
-	if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
-		src->flags &= ~PVCLOCK_GUEST_STOPPED;
+	if ((src->pvti.flags & PVCLOCK_GUEST_STOPPED) != 0) {
+		src->pvti.flags &= ~PVCLOCK_GUEST_STOPPED;
 		pvclock_touch_watchdogs();
 		ret = true;
 	}
@@ -184,17 +184,15 @@ EXPORT_SYMBOL_GPL(kvm_clock);
 
 static void kvm_register_clock(char *txt)
 {
-	struct pvclock_vcpu_time_info *src;
-	int cpu = smp_processor_id();
+	struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
 	u64 pa;
 
-	if (!hv_clock)
+	if (!src)
 		return;
 
-	src = &hv_clock[cpu].pvti;
-	pa = slow_virt_to_phys(src) | 0x01ULL;
+	pa = slow_virt_to_phys(&src->pvti) | 0x01ULL;
 	wrmsrl(msr_kvm_system_time, pa);
-	pr_info("kvm-clock: cpu %d, msr %llx, %s", cpu, pa, txt);
+	pr_info("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
 }
 
 static void kvm_save_sched_clock_state(void)
@@ -242,12 +240,12 @@ static int __init kvm_setup_vsyscall_timeinfo(void)
 #ifdef CONFIG_X86_64
 	u8 flags;
 
-	if (!hv_clock || !kvmclock_vsyscall)
+	if (!per_cpu(hv_clock_per_cpu, 0) || !kvmclock_vsyscall)
 		return 0;
 
-	flags = pvclock_read_flags(&hv_clock[0].pvti);
+	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
 	if (!(flags & PVCLOCK_TSC_STABLE_BIT))
-		return 1;
+		return 0;
 
 	kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
 #endif
@@ -255,6 +253,28 @@ static int __init kvm_setup_vsyscall_timeinfo(void)
 }
 early_initcall(kvm_setup_vsyscall_timeinfo);
 
+static int kvmclock_setup_percpu(unsigned int cpu)
+{
+	struct pvclock_vsyscall_time_info *p = per_cpu(hv_clock_per_cpu, cpu);
+
+	/*
+	 * The per cpu area setup replicates CPU0 data to all cpu
+	 * pointers. So carefully check. CPU0 has been set up in init
+	 * already.
+	 */
+	if (!cpu || (p && p != per_cpu(hv_clock_per_cpu, 0)))
+		return 0;
+
+	/* Use the static page for the first CPUs, allocate otherwise */
+	if (cpu < HVC_BOOT_ARRAY_SIZE)
+		p = &hv_clock_boot[cpu];
+	else
+		p = kzalloc(sizeof(*p), GFP_KERNEL);
+
+	per_cpu(hv_clock_per_cpu, cpu) = p;
+	return p ? 0 : -ENOMEM;
+}
+
 void __init kvmclock_init(void)
 {
 	u8 flags;
@@ -269,17 +289,22 @@ void __init kvmclock_init(void)
 		return;
 	}
 
+	if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu",
+			      kvmclock_setup_percpu, NULL) < 0) {
+		return;
+	}
+
 	pr_info("kvm-clock: Using msrs %x and %x",
 		msr_kvm_system_time, msr_kvm_wall_clock);
 
-	hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
+	this_cpu_write(hv_clock_per_cpu, &hv_clock_boot[0]);
 	kvm_register_clock("primary cpu clock");
-	pvclock_set_pvti_cpu0_va(hv_clock);
+	pvclock_set_pvti_cpu0_va(hv_clock_boot);
 
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 
-	flags = pvclock_read_flags(&hv_clock[0].pvti);
+	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
 	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
 
 	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
-- 
2.18.0



* [PATCH v15 08/26] x86: text_poke() may access uninitialized struct pages
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (6 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 07/26] x86/kvmclock: Switch kvmclock data to a PER_CPU variable Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:25   ` [tip:x86/timers] x86/alternatives, jumplabel: Use text_poke_early() before mm_init() tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 09/26] x86: initialize static branching early Pavel Tatashin
                   ` (18 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

It is supposed to be safe to modify static branches after jump_label_init().
But, because static-key modifying code eventually calls text_poke(), we
may end up accessing struct pages that have not been initialized.

Here is how to quickly reproduce the problem. Insert code like this
into init/main.c:

| +static DEFINE_STATIC_KEY_FALSE(__test);
| asmlinkage __visible void __init start_kernel(void)
| {
|        char *command_line;
|@@ -587,6 +609,10 @@ asmlinkage __visible void __init start_kernel(void)
|        vfs_caches_init_early();
|        sort_main_extable();
|        trap_init();
|+       {
|+       static_branch_enable(&__test);
|+       WARN_ON(!static_branch_likely(&__test));
|+       }
|        mm_init();

The following warning shows up:
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:701 text_poke+0x20d/0x230
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.18.0-rc1_pt_t1 #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.11.0-20171110_100015-anatol 04/01/2014
RIP: 0010:text_poke+0x20d/0x230
Code: 0f 0b 4c 89 e2 4c 89 ee 4c 89 f7 e8 7d 4b 9b 00 31 d2 31 f6 bf 86 02
00 00 48 8b 05 95 8e 24 01 e8 78 18 d8 00 e9 55 ff ff ff <0f> 0b e9 54 fe
ff ff 48 8b 05 75 a8 38 01 e9 64 fe ff ff 48 8b 1d
RSP: 0000:ffffffff94e03e30 EFLAGS: 00010046
RAX: 0100000000000000 RBX: fffff7b2c011f300 RCX: ffffffff94fcccf4
RDX: 0000000000000001 RSI: ffffffff94e03e77 RDI: ffffffff94fcccef
RBP: ffffffff94fcccef R08: 00000000fffffe00 R09: 00000000000000a0
R10: 0000000000000000 R11: 0000000000000040 R12: 0000000000000001
R13: ffffffff94e03e77 R14: ffffffff94fcdcef R15: fffff7b2c0000000
FS:  0000000000000000(0000) GS:ffff9adc87c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff9adc8499d000 CR3: 000000000460a001 CR4: 00000000000606b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 ? start_kernel+0x23e/0x4c8
 ? start_kernel+0x23f/0x4c8
 ? text_poke_bp+0x50/0xda
 ? arch_jump_label_transform+0x89/0xe0
 ? __jump_label_update+0x78/0xb0
 ? static_key_enable_cpuslocked+0x4d/0x80
 ? static_key_enable+0x11/0x20
 ? start_kernel+0x23e/0x4c8
 ? secondary_startup_64+0xa5/0xb0
---[ end trace abdc99c031b8a90a ]---

If the code above is moved after mm_init(), no warning is shown, as struct
pages are initialized during handover from memblock.

Use text_poke_early() for static branching until early boot IRQs are
enabled, at which point switch to text_poke(). Also, ensure text_poke() is
never invoked while uninitialized memory access may happen, by adding a
BUG_ON(!after_bootmem) assertion.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/include/asm/text-patching.h |  1 +
 arch/x86/kernel/alternative.c        |  7 +++++++
 arch/x86/kernel/jump_label.c         | 11 +++++++----
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 2ecd34e2d46c..e85ff65c43c3 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -37,5 +37,6 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern int after_bootmem;
 
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index a481763a3776..014f214da581 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -668,6 +668,7 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
 	local_irq_save(flags);
 	memcpy(addr, opcode, len);
 	local_irq_restore(flags);
+	sync_core();
 	/* Could also do a CLFLUSH here to speed up CPU recovery; but
 	   that causes hangs on some VIA CPUs. */
 	return addr;
@@ -693,6 +694,12 @@ void *text_poke(void *addr, const void *opcode, size_t len)
 	struct page *pages[2];
 	int i;
 
+	/*
+	 * While the boot memory allocator is running we cannot use struct
+	 * pages as they are not yet initialized.
+	 */
+	BUG_ON(!after_bootmem);
+
 	if (!core_kernel_text((unsigned long)addr)) {
 		pages[0] = vmalloc_to_page(addr);
 		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index e56c95be2808..eeea935e9bb5 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -37,15 +37,18 @@ static void bug_at(unsigned char *ip, int line)
 	BUG();
 }
 
-static void __jump_label_transform(struct jump_entry *entry,
-				   enum jump_label_type type,
-				   void *(*poker)(void *, const void *, size_t),
-				   int init)
+static void __ref __jump_label_transform(struct jump_entry *entry,
+					 enum jump_label_type type,
+					 void *(*poker)(void *, const void *, size_t),
+					 int init)
 {
 	union jump_code_union code;
 	const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
 	const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
 
+	if (early_boot_irqs_disabled)
+		poker = text_poke_early;
+
 	if (type == JUMP_LABEL_JMP) {
 		if (init) {
 			/*
-- 
2.18.0



* [PATCH v15 09/26] x86: initialize static branching early
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (7 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 08/26] x86: text_poke() may access uninitialized struct pages Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:26   ` [tip:x86/timers] x86/jump_label: Initialize " tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 10/26] x86/CPU: Call detect_nopl() only on the BSP Pavel Tatashin
                   ` (17 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

Static branching is useful for hot-patching branches that are used in a hot
path but are infrequently changed.

The x86 clock framework is one example: it uses static branches to set up
the best clock during boot and never changes it again.

Since we plan to enable the clock early, we need static branching
functionality early as well.

Static branching requires patching NOP instructions; thus, we need
arch_init_ideal_nops() to be called prior to jump_label_init().

Here we do all the necessary steps to call arch_init_ideal_nops()
right after early_cpu_init().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/cpu/amd.c    | 13 +++++++-----
 arch/x86/kernel/cpu/common.c | 38 +++++++++++++++++++-----------------
 arch/x86/kernel/setup.c      |  4 ++--
 3 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 38915fbfae73..b732438c1a1e 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -232,8 +232,6 @@ static void init_amd_k7(struct cpuinfo_x86 *c)
 		}
 	}
 
-	set_cpu_cap(c, X86_FEATURE_K7);
-
 	/* calling is from identify_secondary_cpu() ? */
 	if (!c->cpu_index)
 		return;
@@ -617,6 +615,14 @@ static void early_init_amd(struct cpuinfo_x86 *c)
 
 	early_init_amd_mc(c);
 
+#ifdef CONFIG_X86_32
+	if (c->x86 == 6)
+		set_cpu_cap(c, X86_FEATURE_K7);
+#endif
+
+	if (c->x86 >= 0xf)
+		set_cpu_cap(c, X86_FEATURE_K8);
+
 	rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);
 
 	/*
@@ -863,9 +869,6 @@ static void init_amd(struct cpuinfo_x86 *c)
 
 	init_amd_cacheinfo(c);
 
-	if (c->x86 >= 0xf)
-		set_cpu_cap(c, X86_FEATURE_K8);
-
 	if (cpu_has(c, X86_FEATURE_XMM2)) {
 		unsigned long long val;
 		int ret;
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index eb4cb3efd20e..71281ac43b15 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1015,6 +1015,24 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 	setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
 }
 
+/*
+ * The NOPL instruction is supposed to exist on all CPUs of family >= 6;
+ * unfortunately, that's not true in practice because of early VIA
+ * chips and (more importantly) broken virtualizers that are not easy
+ * to detect. In the latter case it doesn't even *fail* reliably, so
+ * probing for it doesn't even work. Disable it completely on 32-bit
+ * unless we can find a reliable way to detect all the broken cases.
+ * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
+ */
+static void detect_nopl(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_X86_32
+	clear_cpu_cap(c, X86_FEATURE_NOPL);
+#else
+	set_cpu_cap(c, X86_FEATURE_NOPL);
+#endif
+}
+
 /*
  * Do minimum CPU detection early.
  * Fields really needed: vendor, cpuid_level, family, model, mask,
@@ -1089,6 +1107,8 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
 	 */
 	if (!pgtable_l5_enabled())
 		setup_clear_cpu_cap(X86_FEATURE_LA57);
+
+	detect_nopl(c);
 }
 
 void __init early_cpu_init(void)
@@ -1124,24 +1144,6 @@ void __init early_cpu_init(void)
 	early_identify_cpu(&boot_cpu_data);
 }
 
-/*
- * The NOPL instruction is supposed to exist on all CPUs of family >= 6;
- * unfortunately, that's not true in practice because of early VIA
- * chips and (more importantly) broken virtualizers that are not easy
- * to detect. In the latter case it doesn't even *fail* reliably, so
- * probing for it doesn't even work. Disable it completely on 32-bit
- * unless we can find a reliable way to detect all the broken cases.
- * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
- */
-static void detect_nopl(struct cpuinfo_x86 *c)
-{
-#ifdef CONFIG_X86_32
-	clear_cpu_cap(c, X86_FEATURE_NOPL);
-#else
-	set_cpu_cap(c, X86_FEATURE_NOPL);
-#endif
-}
-
 static void detect_null_seg_behavior(struct cpuinfo_x86 *c)
 {
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index da1dbd99cb6e..7490de925a81 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -866,6 +866,8 @@ void __init setup_arch(char **cmdline_p)
 
 	idt_setup_early_traps();
 	early_cpu_init();
+	arch_init_ideal_nops();
+	jump_label_init();
 	early_ioremap_init();
 
 	setup_olpc_ofw_pgd();
@@ -1268,8 +1270,6 @@ void __init setup_arch(char **cmdline_p)
 
 	mcheck_init();
 
-	arch_init_ideal_nops();
-
 	register_refined_jiffies(CLOCK_TICK_RATE);
 
 #ifdef CONFIG_EFI
-- 
2.18.0



* [PATCH v15 10/26] x86/CPU: Call detect_nopl() only on the BSP
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (8 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 09/26] x86: initialize static branching early Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:26   ` [tip:x86/timers] " tip-bot for Borislav Petkov
  2018-07-19 20:55 ` [PATCH v15 11/26] x86/tsc: redefine notsc to behave as tsc=unstable Pavel Tatashin
                   ` (16 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

From: Borislav Petkov <bp@alien8.de>

Make it use the setup_* variants, have it be called only on the BSP, and
drop the call in generic_identify() - X86_FEATURE_NOPL will be replicated
to the APs through the forced caps. This helps keep the mess at a
manageable level.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/kernel/cpu/common.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 71281ac43b15..46408a8cdf62 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1024,12 +1024,12 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
  * unless we can find a reliable way to detect all the broken cases.
  * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
  */
-static void detect_nopl(struct cpuinfo_x86 *c)
+static void detect_nopl(void)
 {
 #ifdef CONFIG_X86_32
-	clear_cpu_cap(c, X86_FEATURE_NOPL);
+	setup_clear_cpu_cap(X86_FEATURE_NOPL);
 #else
-	set_cpu_cap(c, X86_FEATURE_NOPL);
+	setup_force_cpu_cap(X86_FEATURE_NOPL);
 #endif
 }
 
@@ -1108,7 +1108,7 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
 	if (!pgtable_l5_enabled())
 		setup_clear_cpu_cap(X86_FEATURE_LA57);
 
-	detect_nopl(c);
+	detect_nopl();
 }
 
 void __init early_cpu_init(void)
@@ -1206,8 +1206,6 @@ static void generic_identify(struct cpuinfo_x86 *c)
 
 	get_model_name(c); /* Default name */
 
-	detect_nopl(c);
-
 	detect_null_seg_behavior(c);
 
 	/*
-- 
2.18.0



* [PATCH v15 11/26] x86/tsc: redefine notsc to behave as tsc=unstable
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (9 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 10/26] x86/CPU: Call detect_nopl() only on the BSP Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:27   ` [tip:x86/timers] x86/tsc: Redefine " tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 12/26] x86/xen/time: initialize pv xen time in init_hypervisor_platform Pavel Tatashin
                   ` (15 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

Currently, the notsc kernel parameter disables the use of the TSC register
by sched_clock(). However, this parameter does not prevent Linux from
accessing the TSC in other places in the kernel.

The only rationale for booting with notsc is to avoid timing discrepancies
on multi-socket systems where different TSC frequencies may be present, and
thus to fall back to jiffies as the clock source.

However, there is another way to solve the above problem: boot with the
tsc=unstable parameter. This parameter allows sched_clock() to use the TSC,
but if the TSC is outside of the expected interval it is corrected back to
a sane value.

Thus, there is no reason to keep notsc, and it could be removed. But, for
compatibility reasons we keep the parameter and change its definition to be
the same as tsc=unstable.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---
 .../admin-guide/kernel-parameters.txt          |  2 --
 Documentation/x86/x86_64/boot-options.txt      |  4 +---
 arch/x86/kernel/tsc.c                          | 18 +++---------------
 3 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 533ff5c68970..5aed30cd0350 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2835,8 +2835,6 @@
 
 	nosync		[HW,M68K] Disables sync negotiation for all devices.
 
-	notsc		[BUGS=X86-32] Disable Time Stamp Counter
-
 	nowatchdog	[KNL] Disable both lockup detectors, i.e.
 			soft-lockup and NMI watchdog (hard-lockup).
 
diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 8d109ef67ab6..66114ab4f9fe 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -92,9 +92,7 @@ APICs
 Timing
 
   notsc
-  Don't use the CPU time stamp counter to read the wall time.
-  This can be used to work around timing problems on multiprocessor systems
-  with not properly synchronized CPUs.
+  Deprecated, use tsc=unstable instead.
 
   nohpet
   Don't use the HPET timer.
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 74392d9d51e0..186395041725 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -38,11 +38,6 @@ EXPORT_SYMBOL(tsc_khz);
  */
 static int __read_mostly tsc_unstable;
 
-/* native_sched_clock() is called before tsc_init(), so
-   we must start with the TSC soft disabled to prevent
-   erroneous rdtsc usage on !boot_cpu_has(X86_FEATURE_TSC) processors */
-static int __read_mostly tsc_disabled = -1;
-
 static DEFINE_STATIC_KEY_FALSE(__use_tsc);
 
 int tsc_clocksource_reliable;
@@ -248,8 +243,7 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
 #ifdef CONFIG_X86_TSC
 int __init notsc_setup(char *str)
 {
-	pr_warn("Kernel compiled with CONFIG_X86_TSC, cannot disable TSC completely\n");
-	tsc_disabled = 1;
+	mark_tsc_unstable("boot parameter notsc");
 	return 1;
 }
 #else
@@ -1307,7 +1301,7 @@ static void tsc_refine_calibration_work(struct work_struct *work)
 
 static int __init init_tsc_clocksource(void)
 {
-	if (!boot_cpu_has(X86_FEATURE_TSC) || tsc_disabled > 0 || !tsc_khz)
+	if (!boot_cpu_has(X86_FEATURE_TSC) || !tsc_khz)
 		return 0;
 
 	if (tsc_unstable)
@@ -1414,12 +1408,6 @@ void __init tsc_init(void)
 		set_cyc2ns_scale(tsc_khz, cpu, cyc);
 	}
 
-	if (tsc_disabled > 0)
-		return;
-
-	/* now allow native_sched_clock() to use rdtsc */
-
-	tsc_disabled = 0;
 	static_branch_enable(&__use_tsc);
 
 	if (!no_sched_irq_time)
@@ -1455,7 +1443,7 @@ unsigned long calibrate_delay_is_known(void)
 	int constant_tsc = cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC);
 	const struct cpumask *mask = topology_core_cpumask(cpu);
 
-	if (tsc_disabled || !constant_tsc || !mask)
+	if (!constant_tsc || !mask)
 		return 0;
 
 	sibling = cpumask_any_but(mask, cpu);
-- 
2.18.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v15 12/26] x86/xen/time: initialize pv xen time in init_hypervisor_platform
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (10 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 11/26] x86/tsc: redefine notsc to behave as tsc=unstable Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:27   ` [tip:x86/timers] x86/xen/time: Initialize pv xen time in init_hypervisor_platform() tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 13/26] x86/xen/time: output xen sched_clock time from 0 Pavel Tatashin
                   ` (14 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

In every hypervisor except Xen PV, time ops are initialized in
init_hypervisor_platform().

Xen PV domains initialize time ops in x86_init.paging.pagetable_init(),
by calling xen_setup_shared_info(). This is poor design, as time is
needed before the memory allocator is up.

xen_setup_shared_info() is called from two places: during boot, and
after suspend. Split the content of xen_setup_shared_info() across
three places:

1. Add the clock-relevant data to the new xen pv init_platform vector,
   and set the clock ops there.

2. Move xen_setup_vcpu_info_placement() to the new
   xen_pv_guest_late_init() call.

3. Move the re-initialization of parts of the shared info copy to
   xen_pv_post_suspend(), to be symmetric with xen_pv_pre_suspend().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/xen/enlighten_pv.c | 51 +++++++++++++++++--------------------
 arch/x86/xen/mmu_pv.c       |  6 ++---
 arch/x86/xen/suspend_pv.c   |  5 ++--
 arch/x86/xen/time.c         |  7 +++--
 arch/x86/xen/xen-ops.h      |  6 ++---
 5 files changed, 34 insertions(+), 41 deletions(-)

diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 439a94bf89ad..105a57d73701 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -119,6 +119,27 @@ static void __init xen_banner(void)
 	       version >> 16, version & 0xffff, extra.extraversion,
 	       xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "");
 }
+
+static void __init xen_pv_init_platform(void)
+{
+	set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
+	HYPERVISOR_shared_info = (void *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+
+	/* xen clock uses per-cpu vcpu_info, need to init it for boot cpu */
+	xen_vcpu_info_reset(0);
+
+	/* pvclock is in shared info area */
+	xen_init_time_ops();
+}
+
+static void __init xen_pv_guest_late_init(void)
+{
+#ifndef CONFIG_SMP
+	/* Setup shared vcpu info for non-smp configurations */
+	xen_setup_vcpu_info_placement();
+#endif
+}
+
 /* Check if running on Xen version (major, minor) or later */
 bool
 xen_running_on_version_or_later(unsigned int major, unsigned int minor)
@@ -947,34 +968,8 @@ static void xen_write_msr(unsigned int msr, unsigned low, unsigned high)
 	xen_write_msr_safe(msr, low, high);
 }
 
-void xen_setup_shared_info(void)
-{
-	set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
-
-	HYPERVISOR_shared_info =
-		(struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
-
-	xen_setup_mfn_list_list();
-
-	if (system_state == SYSTEM_BOOTING) {
-#ifndef CONFIG_SMP
-		/*
-		 * In UP this is as good a place as any to set up shared info.
-		 * Limit this to boot only, at restore vcpu setup is done via
-		 * xen_vcpu_restore().
-		 */
-		xen_setup_vcpu_info_placement();
-#endif
-		/*
-		 * Now that shared info is set up we can start using routines
-		 * that point to pvclock area.
-		 */
-		xen_init_time_ops();
-	}
-}
-
 /* This is called once we have the cpu_possible_mask */
-void __ref xen_setup_vcpu_info_placement(void)
+void __init xen_setup_vcpu_info_placement(void)
 {
 	int cpu;
 
@@ -1228,6 +1223,8 @@ asmlinkage __visible void __init xen_start_kernel(void)
 	x86_init.irqs.intr_mode_init	= x86_init_noop;
 	x86_init.oem.arch_setup = xen_arch_setup;
 	x86_init.oem.banner = xen_banner;
+	x86_init.hyper.init_platform = xen_pv_init_platform;
+	x86_init.hyper.guest_late_init = xen_pv_guest_late_init;
 
 	/*
 	 * Set up some pagetable state before starting to set any ptes.
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 2c30cabfda90..52206ad81e4b 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1230,8 +1230,7 @@ static void __init xen_pagetable_p2m_free(void)
 	 * We roundup to the PMD, which means that if anybody at this stage is
 	 * using the __ka address of xen_start_info or
 	 * xen_start_info->shared_info they are in going to crash. Fortunatly
-	 * we have already revectored in xen_setup_kernel_pagetable and in
-	 * xen_setup_shared_info.
+	 * we have already revectored in xen_setup_kernel_pagetable.
 	 */
 	size = roundup(size, PMD_SIZE);
 
@@ -1292,8 +1291,7 @@ static void __init xen_pagetable_init(void)
 
 	/* Remap memory freed due to conflicts with E820 map */
 	xen_remap_memory();
-
-	xen_setup_shared_info();
+	xen_setup_mfn_list_list();
 }
 static void xen_write_cr2(unsigned long cr2)
 {
diff --git a/arch/x86/xen/suspend_pv.c b/arch/x86/xen/suspend_pv.c
index a2e0f110af56..8303b58c79a9 100644
--- a/arch/x86/xen/suspend_pv.c
+++ b/arch/x86/xen/suspend_pv.c
@@ -27,8 +27,9 @@ void xen_pv_pre_suspend(void)
 void xen_pv_post_suspend(int suspend_cancelled)
 {
 	xen_build_mfn_list_list();
-
-	xen_setup_shared_info();
+	set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
+	HYPERVISOR_shared_info = (void *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+	xen_setup_mfn_list_list();
 
 	if (suspend_cancelled) {
 		xen_start_info->store_mfn =
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index e0f1bcf01d63..53bb7a8d10b5 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -40,7 +40,7 @@ static unsigned long xen_tsc_khz(void)
 	return pvclock_tsc_khz(info);
 }
 
-u64 xen_clocksource_read(void)
+static u64 xen_clocksource_read(void)
 {
         struct pvclock_vcpu_time_info *src;
 	u64 ret;
@@ -503,7 +503,7 @@ static void __init xen_time_init(void)
 		pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
 }
 
-void __ref xen_init_time_ops(void)
+void __init xen_init_time_ops(void)
 {
 	pv_time_ops = xen_time_ops;
 
@@ -542,8 +542,7 @@ void __init xen_hvm_init_time_ops(void)
 		return;
 
 	if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
-		printk(KERN_INFO "Xen doesn't support pvclock on HVM,"
-				"disable pv timer\n");
+		pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
 		return;
 	}
 
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 3b34745d0a52..e78684597f57 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -31,7 +31,6 @@ extern struct shared_info xen_dummy_shared_info;
 extern struct shared_info *HYPERVISOR_shared_info;
 
 void xen_setup_mfn_list_list(void);
-void xen_setup_shared_info(void);
 void xen_build_mfn_list_list(void);
 void xen_setup_machphys_mapping(void);
 void xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn);
@@ -68,12 +67,11 @@ void xen_init_irq_ops(void);
 void xen_setup_timer(int cpu);
 void xen_setup_runstate_info(int cpu);
 void xen_teardown_timer(int cpu);
-u64 xen_clocksource_read(void);
 void xen_setup_cpu_clockevents(void);
 void xen_save_time_memory_area(void);
 void xen_restore_time_memory_area(void);
-void __ref xen_init_time_ops(void);
-void __init xen_hvm_init_time_ops(void);
+void xen_init_time_ops(void);
+void xen_hvm_init_time_ops(void);
 
 irqreturn_t xen_debug_interrupt(int irq, void *dev_id);
 
-- 
2.18.0



* [PATCH v15 13/26] x86/xen/time: output xen sched_clock time from 0
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (11 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 12/26] x86/xen/time: initialize pv xen time in init_hypervisor_platform Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:28   ` [tip:x86/timers] x86/xen/time: Output " tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 14/26] s390/time: add read_persistent_wall_and_boot_offset() Pavel Tatashin
                   ` (13 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

sched_clock() is expected to count from 0 when the system boots. Add
an offset, xen_sched_clock_offset (similar to what other hypervisors
do, e.g. kvm_sched_clock_offset), so that sched_clock() counts from 0
from the moment time is first initialized.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/xen/time.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 53bb7a8d10b5..c84f1e039d84 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -31,6 +31,8 @@
 /* Xen may fire a timer up to this many ns early */
 #define TIMER_SLOP	100000
 
+static u64 xen_sched_clock_offset __read_mostly;
+
 /* Get the TSC speed from Xen */
 static unsigned long xen_tsc_khz(void)
 {
@@ -57,6 +59,11 @@ static u64 xen_clocksource_get_cycles(struct clocksource *cs)
 	return xen_clocksource_read();
 }
 
+static u64 xen_sched_clock(void)
+{
+	return xen_clocksource_read() - xen_sched_clock_offset;
+}
+
 static void xen_read_wallclock(struct timespec64 *ts)
 {
 	struct shared_info *s = HYPERVISOR_shared_info;
@@ -367,7 +374,7 @@ void xen_timer_resume(void)
 }
 
 static const struct pv_time_ops xen_time_ops __initconst = {
-	.sched_clock = xen_clocksource_read,
+	.sched_clock = xen_sched_clock,
 	.steal_clock = xen_steal_clock,
 };
 
@@ -505,6 +512,7 @@ static void __init xen_time_init(void)
 
 void __init xen_init_time_ops(void)
 {
+	xen_sched_clock_offset = xen_clocksource_read();
 	pv_time_ops = xen_time_ops;
 
 	x86_init.timers.timer_init = xen_time_init;
@@ -546,6 +554,7 @@ void __init xen_hvm_init_time_ops(void)
 		return;
 	}
 
+	xen_sched_clock_offset = xen_clocksource_read();
 	pv_time_ops = xen_time_ops;
 	x86_init.timers.setup_percpu_clockev = xen_time_init;
 	x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
-- 
2.18.0



* [PATCH v15 14/26] s390/time: add read_persistent_wall_and_boot_offset()
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (12 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 13/26] x86/xen/time: output xen sched_clock time from 0 Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:28   ` [tip:x86/timers] s390/time: Add read_persistent_wall_and_boot_offset() tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 15/26] time: replace read_boot_clock64() with read_persistent_wall_and_boot_offset() Pavel Tatashin
                   ` (12 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

read_persistent_wall_and_boot_offset() will replace read_boot_clock64()
because on some architectures it is more convenient to read both sources
together, as one may depend on the other. For s390, the implementation
is the same as read_boot_clock64(), except that it also calls
read_persistent_clock64() and returns its value.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/kernel/time.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index cf561160ea88..d1f5447d5687 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -221,6 +221,24 @@ void read_persistent_clock64(struct timespec64 *ts)
 	ext_to_timespec64(clk, ts);
 }
 
+void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+						 struct timespec64 *boot_offset)
+{
+	unsigned char clk[STORE_CLOCK_EXT_SIZE];
+	struct timespec64 boot_time;
+	__u64 delta;
+
+	delta = initial_leap_seconds + TOD_UNIX_EPOCH;
+	memcpy(clk, tod_clock_base, STORE_CLOCK_EXT_SIZE);
+	*(__u64 *)&clk[1] -= delta;
+	if (*(__u64 *)&clk[1] > delta)
+		clk[0]--;
+	ext_to_timespec64(clk, &boot_time);
+
+	read_persistent_clock64(wall_time);
+	*boot_offset = timespec64_sub(*wall_time, boot_time);
+}
+
 void read_boot_clock64(struct timespec64 *ts)
 {
 	unsigned char clk[STORE_CLOCK_EXT_SIZE];
-- 
2.18.0



* [PATCH v15 15/26] time: replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (13 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 14/26] s390/time: add read_persistent_wall_and_boot_offset() Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:29   ` [tip:x86/timers] timekeeping: Replace " tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 16/26] time: default boot time offset to local_clock() Pavel Tatashin
                   ` (11 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

If an architecture does not support an exact boot time, it is
challenging to estimate boot time without a reference to the current
persistent clock value. Yet, it cannot read the persistent clock again,
because that may lead to math discrepancies with the caller of
read_boot_clock64(), who may have read the persistent clock at a
different time.

This is why it is better to provide two values simultaneously: the
persistent clock value, and the boot time.

Replace read_boot_clock64() with:
read_persistent_wall_and_boot_offset(wall_time, boot_offset)

where wall_time is returned by read_persistent_clock64(), and
boot_offset is wall_time - boot time, which defaults to 0.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 include/linux/timekeeping.h |  3 +-
 kernel/time/timekeeping.c   | 59 +++++++++++++++++++------------------
 2 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 86bc2026efce..686bc27acef0 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -243,7 +243,8 @@ extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot);
 extern int persistent_clock_is_local;
 
 extern void read_persistent_clock64(struct timespec64 *ts);
-extern void read_boot_clock64(struct timespec64 *ts);
+void read_persistent_wall_and_boot_offset(struct timespec64 *wall_clock,
+					  struct timespec64 *boot_offset);
 extern int update_persistent_clock64(struct timespec64 now);
 
 /*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 4786df904c22..cb738f825c12 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -17,6 +17,7 @@
 #include <linux/nmi.h>
 #include <linux/sched.h>
 #include <linux/sched/loadavg.h>
+#include <linux/sched/clock.h>
 #include <linux/syscore_ops.h>
 #include <linux/clocksource.h>
 #include <linux/jiffies.h>
@@ -1496,18 +1497,20 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
 }
 
 /**
- * read_boot_clock64 -  Return time of the system start.
+ * read_persistent_wall_and_boot_offset - Read persistent clock, and also offset
+ *                                        from the boot.
  *
  * Weak dummy function for arches that do not yet support it.
- * Function to read the exact time the system has been started.
- * Returns a timespec64 with tv_sec=0 and tv_nsec=0 if unsupported.
- *
- *  XXX - Do be sure to remove it once all arches implement it.
+ * wall_time	- current time as returned by persistent clock
+ * boot_offset	- offset that is defined as wall_time - boot_time
+ *		  default to 0.
  */
-void __weak read_boot_clock64(struct timespec64 *ts)
+void __weak __init
+read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+				     struct timespec64 *boot_offset)
 {
-	ts->tv_sec = 0;
-	ts->tv_nsec = 0;
+	read_persistent_clock64(wall_time);
+	*boot_offset = (struct timespec64){0};
 }
 
 /* Flag for if timekeeping_resume() has injected sleeptime */
@@ -1521,28 +1524,29 @@ static bool persistent_clock_exists;
  */
 void __init timekeeping_init(void)
 {
+	struct timespec64 wall_time, boot_offset, wall_to_mono;
 	struct timekeeper *tk = &tk_core.timekeeper;
 	struct clocksource *clock;
 	unsigned long flags;
-	struct timespec64 now, boot, tmp;
-
-	read_persistent_clock64(&now);
-	if (!timespec64_valid_strict(&now)) {
-		pr_warn("WARNING: Persistent clock returned invalid value!\n"
-			"         Check your CMOS/BIOS settings.\n");
-		now.tv_sec = 0;
-		now.tv_nsec = 0;
-	} else if (now.tv_sec || now.tv_nsec)
-		persistent_clock_exists = true;
 
-	read_boot_clock64(&boot);
-	if (!timespec64_valid_strict(&boot)) {
-		pr_warn("WARNING: Boot clock returned invalid value!\n"
-			"         Check your CMOS/BIOS settings.\n");
-		boot.tv_sec = 0;
-		boot.tv_nsec = 0;
+	read_persistent_wall_and_boot_offset(&wall_time, &boot_offset);
+	if (timespec64_valid_strict(&wall_time) &&
+	    timespec64_to_ns(&wall_time) > 0) {
+		persistent_clock_exists = true;
+	} else {
+		pr_warn("Persistent clock returned invalid value");
+		wall_time = (struct timespec64){0};
 	}
 
+	if (timespec64_compare(&wall_time, &boot_offset) < 0)
+		boot_offset = (struct timespec64){0};
+
+	/*
+	 * We want to set wall_to_mono, so the following is true:
+	 * wall time + wall_to_mono = boot time
+	 */
+	wall_to_mono = timespec64_sub(boot_offset, wall_time);
+
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
 	write_seqcount_begin(&tk_core.seq);
 	ntp_init();
@@ -1552,13 +1556,10 @@ void __init timekeeping_init(void)
 		clock->enable(clock);
 	tk_setup_internals(tk, clock);
 
-	tk_set_xtime(tk, &now);
+	tk_set_xtime(tk, &wall_time);
 	tk->raw_sec = 0;
-	if (boot.tv_sec == 0 && boot.tv_nsec == 0)
-		boot = tk_xtime(tk);
 
-	set_normalized_timespec64(&tmp, -boot.tv_sec, -boot.tv_nsec);
-	tk_set_wall_to_mono(tk, tmp);
+	tk_set_wall_to_mono(tk, wall_to_mono);
 
 	timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
 
-- 
2.18.0



* [PATCH v15 16/26] time: default boot time offset to local_clock()
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (14 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 15/26] time: replace read_boot_clock64() with read_persistent_wall_and_boot_offset() Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:29   ` [tip:x86/timers] timekeeping: Default " tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 17/26] s390/time: remove read_boot_clock64() Pavel Tatashin
                   ` (10 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

read_persistent_wall_and_boot_offset() is called during boot to read
the persistent clock, and also to return the offset between the boot
time and the value of the persistent clock.

Change the default boot_offset from zero to local_clock(), so that
architectures that do not have a dedicated boot clock but do have an
early sched_clock(), such as SPARCv9 and x86, benefit from this change
by getting a better and more consistent estimate of the boot time
without the need for an arch-specific implementation.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 kernel/time/timekeeping.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index cb738f825c12..30d7f64ffc87 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1503,14 +1503,17 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
  * Weak dummy function for arches that do not yet support it.
  * wall_time	- current time as returned by persistent clock
  * boot_offset	- offset that is defined as wall_time - boot_time
- *		  default to 0.
+ * The default function calculates offset based on the current value of
+ * local_clock(). This way architectures that support sched_clock() but don't
+ * support dedicated boot time clock will provide the best estimate of the
+ * boot time.
  */
 void __weak __init
 read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
 				     struct timespec64 *boot_offset)
 {
 	read_persistent_clock64(wall_time);
-	*boot_offset = (struct timespec64){0};
+	*boot_offset = ns_to_timespec64(local_clock());
 }
 
 /* Flag for if timekeeping_resume() has injected sleeptime */
-- 
2.18.0



* [PATCH v15 17/26] s390/time: remove read_boot_clock64()
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (15 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 16/26] time: default boot time offset to local_clock() Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:30   ` [tip:x86/timers] s390/time: Remove read_boot_clock64() tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 18/26] ARM/time: remove read_boot_clock64() Pavel Tatashin
                   ` (9 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

read_boot_clock64() was replaced by read_persistent_wall_and_boot_offset(),
so remove it.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/s390/kernel/time.c | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index d1f5447d5687..e8766beee5ad 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -239,19 +239,6 @@ void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
 	*boot_offset = timespec64_sub(*wall_time, boot_time);
 }
 
-void read_boot_clock64(struct timespec64 *ts)
-{
-	unsigned char clk[STORE_CLOCK_EXT_SIZE];
-	__u64 delta;
-
-	delta = initial_leap_seconds + TOD_UNIX_EPOCH;
-	memcpy(clk, tod_clock_base, 16);
-	*(__u64 *) &clk[1] -= delta;
-	if (*(__u64 *) &clk[1] > delta)
-		clk[0]--;
-	ext_to_timespec64(clk, ts);
-}
-
 static u64 read_tod_clock(struct clocksource *cs)
 {
 	unsigned long long now, adj;
-- 
2.18.0



* [PATCH v15 18/26] ARM/time: remove read_boot_clock64()
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (16 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 17/26] s390/time: remove read_boot_clock64() Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:30   ` [tip:x86/timers] ARM/time: Remove read_boot_clock64() tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 19/26] x86/tsc: calibrate tsc only once Pavel Tatashin
                   ` (8 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

read_boot_clock64() has been replaced with
read_persistent_wall_and_boot_offset().

The default implementation of read_persistent_wall_and_boot_offset()
provides a better fallback than ARM's current read_boot_clock64()
stubs, which have no users, so remove the old code.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/arm/include/asm/mach/time.h    |  3 +--
 arch/arm/kernel/time.c              | 15 ++-------------
 arch/arm/plat-omap/counter_32k.c    |  2 +-
 drivers/clocksource/tegra20_timer.c |  2 +-
 4 files changed, 5 insertions(+), 17 deletions(-)

diff --git a/arch/arm/include/asm/mach/time.h b/arch/arm/include/asm/mach/time.h
index 0f79e4dec7f9..4ac3a019a46f 100644
--- a/arch/arm/include/asm/mach/time.h
+++ b/arch/arm/include/asm/mach/time.h
@@ -13,7 +13,6 @@
 extern void timer_tick(void);
 
 typedef void (*clock_access_fn)(struct timespec64 *);
-extern int register_persistent_clock(clock_access_fn read_boot,
-				     clock_access_fn read_persistent);
+extern int register_persistent_clock(clock_access_fn read_persistent);
 
 #endif
diff --git a/arch/arm/kernel/time.c b/arch/arm/kernel/time.c
index cf2701cb0de8..078b259ead4e 100644
--- a/arch/arm/kernel/time.c
+++ b/arch/arm/kernel/time.c
@@ -83,29 +83,18 @@ static void dummy_clock_access(struct timespec64 *ts)
 }
 
 static clock_access_fn __read_persistent_clock = dummy_clock_access;
-static clock_access_fn __read_boot_clock = dummy_clock_access;
 
 void read_persistent_clock64(struct timespec64 *ts)
 {
 	__read_persistent_clock(ts);
 }
 
-void read_boot_clock64(struct timespec64 *ts)
-{
-	__read_boot_clock(ts);
-}
-
-int __init register_persistent_clock(clock_access_fn read_boot,
-				     clock_access_fn read_persistent)
+int __init register_persistent_clock(clock_access_fn read_persistent)
 {
 	/* Only allow the clockaccess functions to be registered once */
-	if (__read_persistent_clock == dummy_clock_access &&
-	    __read_boot_clock == dummy_clock_access) {
-		if (read_boot)
-			__read_boot_clock = read_boot;
+	if (__read_persistent_clock == dummy_clock_access) {
 		if (read_persistent)
 			__read_persistent_clock = read_persistent;
-
 		return 0;
 	}
 
diff --git a/arch/arm/plat-omap/counter_32k.c b/arch/arm/plat-omap/counter_32k.c
index 2438b96004c1..fcc5bfec8bd1 100644
--- a/arch/arm/plat-omap/counter_32k.c
+++ b/arch/arm/plat-omap/counter_32k.c
@@ -110,7 +110,7 @@ int __init omap_init_clocksource_32k(void __iomem *vbase)
 	}
 
 	sched_clock_register(omap_32k_read_sched_clock, 32, 32768);
-	register_persistent_clock(NULL, omap_read_persistent_clock64);
+	register_persistent_clock(omap_read_persistent_clock64);
 	pr_info("OMAP clocksource: 32k_counter at 32768 Hz\n");
 
 	return 0;
diff --git a/drivers/clocksource/tegra20_timer.c b/drivers/clocksource/tegra20_timer.c
index c337a8100a7b..2242a36fc5b0 100644
--- a/drivers/clocksource/tegra20_timer.c
+++ b/drivers/clocksource/tegra20_timer.c
@@ -259,6 +259,6 @@ static int __init tegra20_init_rtc(struct device_node *np)
 	else
 		clk_prepare_enable(clk);
 
-	return register_persistent_clock(NULL, tegra_read_persistent_clock64);
+	return register_persistent_clock(tegra_read_persistent_clock64);
 }
 TIMER_OF_DECLARE(tegra20_rtc, "nvidia,tegra20-rtc", tegra20_init_rtc);
-- 
2.18.0



* [PATCH v15 19/26] x86/tsc: calibrate tsc only once
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (17 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 18/26] ARM/time: remove read_boot_clock64() Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:31   ` [tip:x86/timers] x86/tsc: Calibrate " tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 20/26] x86/tsc: initialize cyc2ns when tsc freq. is determined Pavel Tatashin
                   ` (7 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

During boot, the TSC is calibrated twice: once in
tsc_early_delay_calibrate(), and a second time in tsc_init().

Rename tsc_early_delay_calibrate() to tsc_early_init(), rework it so
that the calibration is done only once, early, and make tsc_init() use
the values already determined in tsc_early_init().

Sometimes it is not possible to determine the TSC frequency early,
because a required subsystem is not yet initialized; in that case, try
again later in tsc_init().

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/include/asm/tsc.h |  2 +-
 arch/x86/kernel/setup.c    |  2 +-
 arch/x86/kernel/tsc.c      | 87 ++++++++++++++++++++------------------
 3 files changed, 49 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 2701d221583a..c4368ff73652 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -33,7 +33,7 @@ static inline cycles_t get_cycles(void)
 extern struct system_counterval_t convert_art_to_tsc(u64 art);
 extern struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns);
 
-extern void tsc_early_delay_calibrate(void);
+extern void tsc_early_init(void);
 extern void tsc_init(void);
 extern void mark_tsc_unstable(char *reason);
 extern int unsynchronized_tsc(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 7490de925a81..5d32c55aeb8b 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1014,6 +1014,7 @@ void __init setup_arch(char **cmdline_p)
 	 */
 	init_hypervisor_platform();
 
+	tsc_early_init();
 	x86_init.resources.probe_roms();
 
 	/* after parse_early_param, so could debug it */
@@ -1199,7 +1200,6 @@ void __init setup_arch(char **cmdline_p)
 
 	memblock_find_dma_reserve();
 
-	tsc_early_delay_calibrate();
 	if (!early_xdbc_setup_hardware())
 		early_xdbc_register_console();
 
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 186395041725..4cab2236169e 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -33,6 +33,8 @@ EXPORT_SYMBOL(cpu_khz);
 unsigned int __read_mostly tsc_khz;
 EXPORT_SYMBOL(tsc_khz);
 
+#define KHZ	1000
+
 /*
  * TSC can be unstable due to cpufreq or due to unsynced TSCs
  */
@@ -1335,34 +1337,10 @@ static int __init init_tsc_clocksource(void)
  */
 device_initcall(init_tsc_clocksource);
 
-void __init tsc_early_delay_calibrate(void)
-{
-	unsigned long lpj;
-
-	if (!boot_cpu_has(X86_FEATURE_TSC))
-		return;
-
-	cpu_khz = x86_platform.calibrate_cpu();
-	tsc_khz = x86_platform.calibrate_tsc();
-
-	tsc_khz = tsc_khz ? : cpu_khz;
-	if (!tsc_khz)
-		return;
-
-	lpj = tsc_khz * 1000;
-	do_div(lpj, HZ);
-	loops_per_jiffy = lpj;
-}
-
-void __init tsc_init(void)
+static bool __init determine_cpu_tsc_frequencies(void)
 {
-	u64 lpj, cyc;
-	int cpu;
-
-	if (!boot_cpu_has(X86_FEATURE_TSC)) {
-		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
-		return;
-	}
+	/* Make sure that cpu and tsc are not already calibrated */
+	WARN_ON(cpu_khz || tsc_khz);
 
 	cpu_khz = x86_platform.calibrate_cpu();
 	tsc_khz = x86_platform.calibrate_tsc();
@@ -1377,20 +1355,52 @@ void __init tsc_init(void)
 	else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
 		cpu_khz = tsc_khz;
 
-	if (!tsc_khz) {
-		mark_tsc_unstable("could not calculate TSC khz");
-		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
-		return;
-	}
+	if (tsc_khz == 0)
+		return false;
 
 	pr_info("Detected %lu.%03lu MHz processor\n",
-		(unsigned long)cpu_khz / 1000,
-		(unsigned long)cpu_khz % 1000);
+		(unsigned long)cpu_khz / KHZ,
+		(unsigned long)cpu_khz % KHZ);
 
 	if (cpu_khz != tsc_khz) {
 		pr_info("Detected %lu.%03lu MHz TSC",
-			(unsigned long)tsc_khz / 1000,
-			(unsigned long)tsc_khz % 1000);
+			(unsigned long)tsc_khz / KHZ,
+			(unsigned long)tsc_khz % KHZ);
+	}
+	return true;
+}
+
+static unsigned long __init get_loops_per_jiffy(void)
+{
+	unsigned long lpj = tsc_khz * KHZ;
+
+	do_div(lpj, HZ);
+	return lpj;
+}
+
+void __init tsc_early_init(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_TSC))
+		return;
+	if (!determine_cpu_tsc_frequencies())
+		return;
+	loops_per_jiffy = get_loops_per_jiffy();
+}
+
+void __init tsc_init(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_TSC)) {
+		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+		return;
+	}
+
+	if (!tsc_khz) {
+		/* We failed to determine frequencies earlier, try again */
+		if (!determine_cpu_tsc_frequencies()) {
+			mark_tsc_unstable("could not calculate TSC khz");
+			setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+			return;
+		}
 	}
 
 	/* Sanitize TSC ADJUST before cyc2ns gets initialized */
@@ -1413,10 +1423,7 @@ void __init tsc_init(void)
 	if (!no_sched_irq_time)
 		enable_sched_clock_irqtime();
 
-	lpj = ((u64)tsc_khz * 1000);
-	do_div(lpj, HZ);
-	lpj_fine = lpj;
-
+	lpj_fine = get_loops_per_jiffy();
 	use_tsc_delay();
 
 	check_system_tsc_reliable();
-- 
2.18.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v15 20/26] x86/tsc: initialize cyc2ns when tsc freq. is determined
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (18 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 19/26] x86/tsc: calibrate tsc only once Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:31   ` [tip:x86/timers] x86/tsc: Initialize cyc2ns when tsc frequency " tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 21/26] x86/tsc: use tsc early Pavel Tatashin
                   ` (6 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

cyc2ns converts TSC cycles to nanoseconds, and it is handled in a per-cpu
data structure.

Currently, the setup code for the cyc2ns data of every possible CPU goes
through the same sequence of calculations as for the boot CPU, but it is
based on the same TSC frequency as the boot CPU, so this is not necessary.

Initialize the boot CPU's cyc2ns data when the TSC frequency is determined.
Copy the calculated data from the boot CPU to the other CPUs in tsc_init().

In addition, do the following:

- Remove the unnecessary zeroing of cyc2ns data by removing cyc2ns_data_init()
- Split set_cyc2ns_scale() into two functions: set_cyc2ns_scale() can be
  called when the system is up, and wraps __set_cyc2ns_scale(), which can be
  called directly while the system is booting and avoids saving/restoring
  IRQs and the idle sleep/wakeup events.
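
The per-cpu cyc2ns data encodes the conversion ns = cycles * mul >> shift.
A simplified userspace sketch of that conversion, assuming a fixed shift of
32 and a 64-bit multiplier (the kernel derives a 32-bit mul and a variable
shift via clocks_calc_mult_shift()):

```c
#include <assert.h>
#include <stdint.h>

#define CYC2NS_SHIFT 32	/* assumed fixed shift; the kernel picks it dynamically */

/* Multiplier so that ns = cyc * mul >> shift; ns per cycle is 1e6 / khz. */
static uint64_t cyc2ns_mul(unsigned long khz)
{
	return (1000000ULL << CYC2NS_SHIFT) / khz;
}

/* cyc * mul must fit in 64 bits; fine for the cycle counts used here. */
static uint64_t cycles_2_ns(uint64_t cyc, uint64_t mul)
{
	return (cyc * mul) >> CYC2NS_SHIFT;
}
```

One second worth of cycles on a 2.4 GHz TSC converts back to roughly 1e9 ns,
with a small truncation error from the fixed-point multiplier.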

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/kernel/tsc.c | 94 ++++++++++++++++++++++++-------------------
 1 file changed, 53 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 4cab2236169e..7ea0718a4c75 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -103,23 +103,6 @@ void cyc2ns_read_end(void)
  *                      -johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
 
-static void cyc2ns_data_init(struct cyc2ns_data *data)
-{
-	data->cyc2ns_mul = 0;
-	data->cyc2ns_shift = 0;
-	data->cyc2ns_offset = 0;
-}
-
-static void __init cyc2ns_init(int cpu)
-{
-	struct cyc2ns *c2n = &per_cpu(cyc2ns, cpu);
-
-	cyc2ns_data_init(&c2n->data[0]);
-	cyc2ns_data_init(&c2n->data[1]);
-
-	seqcount_init(&c2n->seq);
-}
-
 static inline unsigned long long cycles_2_ns(unsigned long long cyc)
 {
 	struct cyc2ns_data data;
@@ -135,18 +118,11 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
 	return ns;
 }
 
-static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
 {
 	unsigned long long ns_now;
 	struct cyc2ns_data data;
 	struct cyc2ns *c2n;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	sched_clock_idle_sleep_event();
-
-	if (!khz)
-		goto done;
 
 	ns_now = cycles_2_ns(tsc_now);
 
@@ -178,12 +154,55 @@ static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_
 	c2n->data[0] = data;
 	raw_write_seqcount_latch(&c2n->seq);
 	c2n->data[1] = data;
+}
+
+static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	sched_clock_idle_sleep_event();
+
+	if (khz)
+		__set_cyc2ns_scale(khz, cpu, tsc_now);
 
-done:
 	sched_clock_idle_wakeup_event();
 	local_irq_restore(flags);
 }
 
+/*
+ * Initialize cyc2ns for boot cpu
+ */
+static void __init cyc2ns_init_boot_cpu(void)
+{
+	struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
+
+	seqcount_init(&c2n->seq);
+	__set_cyc2ns_scale(tsc_khz, smp_processor_id(), rdtsc());
+}
+
+/*
+ * Secondary CPUs do not run through cyc2ns_init(), so set up
+ * all the scale factors for all CPUs, assuming the same
+ * speed as the bootup CPU. (cpufreq notifiers will fix this
+ * up if their speed diverges)
+ */
+static void __init cyc2ns_init_secondary_cpus(void)
+{
+	unsigned int cpu, this_cpu = smp_processor_id();
+	struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
+	struct cyc2ns_data *data = c2n->data;
+
+	for_each_possible_cpu(cpu) {
+		if (cpu != this_cpu) {
+			seqcount_init(&c2n->seq);
+			c2n = per_cpu_ptr(&cyc2ns, cpu);
+			c2n->data[0] = data[0];
+			c2n->data[1] = data[1];
+		}
+	}
+}
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  */
@@ -1385,6 +1404,10 @@ void __init tsc_early_init(void)
 	if (!determine_cpu_tsc_frequencies())
 		return;
 	loops_per_jiffy = get_loops_per_jiffy();
+
+	/* Sanitize TSC ADJUST before cyc2ns gets initialized */
+	tsc_store_and_check_tsc_adjust(true);
+	cyc2ns_init_boot_cpu();
 }
 
 void __init tsc_init(void)
@@ -1401,23 +1424,12 @@ void __init tsc_init(void)
 			setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
 			return;
 		}
+		/* Sanitize TSC ADJUST before cyc2ns gets initialized */
+		tsc_store_and_check_tsc_adjust(true);
+		cyc2ns_init_boot_cpu();
 	}
 
-	/* Sanitize TSC ADJUST before cyc2ns gets initialized */
-	tsc_store_and_check_tsc_adjust(true);
-
-	/*
-	 * Secondary CPUs do not run through tsc_init(), so set up
-	 * all the scale factors for all CPUs, assuming the same
-	 * speed as the bootup CPU. (cpufreq notifiers will fix this
-	 * up if their speed diverges)
-	 */
-	cyc = rdtsc();
-	for_each_possible_cpu(cpu) {
-		cyc2ns_init(cpu);
-		set_cyc2ns_scale(tsc_khz, cpu, cyc);
-	}
-
+	cyc2ns_init_secondary_cpus();
 	static_branch_enable(&__use_tsc);
 
 	if (!no_sched_irq_time)
-- 
2.18.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v15 21/26] x86/tsc: use tsc early
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (19 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 20/26] x86/tsc: initialize cyc2ns when tsc freq. is determined Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:32   ` [tip:x86/timers] x86/tsc: Use TSC as sched clock early tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 22/26] sched: move sched clock initialization and merge with generic clock Pavel Tatashin
                   ` (5 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

Get timestamps and a high resolution clock available to us as early as
possible.

native_sched_clock() outputs time based either on the TSC, after tsc_init()
is called later in boot, or on jiffies, once clock interrupts are enabled,
which also happens later in boot.

On the other hand, the TSC frequency is known as early as tsc_early_init()
is called.

Use the early TSC calibration to output timestamps early.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/kernel/tsc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 7ea0718a4c75..9277ae9b68b3 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1408,6 +1408,7 @@ void __init tsc_early_init(void)
 	/* Sanitize TSC ADJUST before cyc2ns gets initialized */
 	tsc_store_and_check_tsc_adjust(true);
 	cyc2ns_init_boot_cpu();
+	static_branch_enable(&__use_tsc);
 }
 
 void __init tsc_init(void)
-- 
2.18.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v15 22/26] sched: move sched clock initialization and merge with generic clock
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (20 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 21/26] x86/tsc: use tsc early Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:32   ` [tip:x86/timers] sched/clock: Move " tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 23/26] sched: early boot clock Pavel Tatashin
                   ` (4 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

sched_clock_postinit() initializes a generic clock on systems where no
other clock is provided. This function may be called only after
timekeeping_init().

Rename sched_clock_postinit() to generic_sched_clock_init() and call it from
sched_clock_init(). Move the call to sched_clock_init() to after
time_init().

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 include/linux/sched_clock.h |  5 ++---
 init/main.c                 |  4 ++--
 kernel/sched/clock.c        | 27 +++++++++++++++++----------
 kernel/sched/core.c         |  1 -
 kernel/time/sched_clock.c   |  2 +-
 5 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched_clock.h b/include/linux/sched_clock.h
index 411b52e424e1..abe28d5cb3f4 100644
--- a/include/linux/sched_clock.h
+++ b/include/linux/sched_clock.h
@@ -9,17 +9,16 @@
 #define LINUX_SCHED_CLOCK
 
 #ifdef CONFIG_GENERIC_SCHED_CLOCK
-extern void sched_clock_postinit(void);
+extern void generic_sched_clock_init(void);
 
 extern void sched_clock_register(u64 (*read)(void), int bits,
 				 unsigned long rate);
 #else
-static inline void sched_clock_postinit(void) { }
+static inline void generic_sched_clock_init(void) { }
 
 static inline void sched_clock_register(u64 (*read)(void), int bits,
 					unsigned long rate)
 {
-	;
 }
 #endif
 
diff --git a/init/main.c b/init/main.c
index 3b4ada11ed52..162d931c9511 100644
--- a/init/main.c
+++ b/init/main.c
@@ -79,7 +79,7 @@
 #include <linux/pti.h>
 #include <linux/blkdev.h>
 #include <linux/elevator.h>
-#include <linux/sched_clock.h>
+#include <linux/sched/clock.h>
 #include <linux/sched/task.h>
 #include <linux/sched/task_stack.h>
 #include <linux/context_tracking.h>
@@ -642,7 +642,7 @@ asmlinkage __visible void __init start_kernel(void)
 	softirq_init();
 	timekeeping_init();
 	time_init();
-	sched_clock_postinit();
+	sched_clock_init();
 	printk_safe_init();
 	perf_event_init();
 	profile_init();
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 10c83e73837a..0e9dbb2d9aea 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -53,6 +53,7 @@
  *
  */
 #include "sched.h"
+#include <linux/sched_clock.h>
 
 /*
  * Scheduler clock - returns current time in nanosec units.
@@ -68,11 +69,6 @@ EXPORT_SYMBOL_GPL(sched_clock);
 
 __read_mostly int sched_clock_running;
 
-void sched_clock_init(void)
-{
-	sched_clock_running = 1;
-}
-
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 /*
  * We must start with !__sched_clock_stable because the unstable -> stable
@@ -199,6 +195,15 @@ void clear_sched_clock_stable(void)
 		__clear_sched_clock_stable();
 }
 
+static void __sched_clock_gtod_offset(void)
+{
+	__gtod_offset = (sched_clock() + __sched_clock_offset) - ktime_get_ns();
+}
+
+void __init sched_clock_init(void)
+{
+	sched_clock_running = 1;
+}
 /*
  * We run this as late_initcall() such that it runs after all built-in drivers,
  * notably: acpi_processor and intel_idle, which can mark the TSC as unstable.
@@ -385,8 +390,6 @@ void sched_clock_tick(void)
 
 void sched_clock_tick_stable(void)
 {
-	u64 gtod, clock;
-
 	if (!sched_clock_stable())
 		return;
 
@@ -398,9 +401,7 @@ void sched_clock_tick_stable(void)
 	 * TSC to be unstable, any computation will be computing crap.
 	 */
 	local_irq_disable();
-	gtod = ktime_get_ns();
-	clock = sched_clock();
-	__gtod_offset = (clock + __sched_clock_offset) - gtod;
+	__sched_clock_gtod_offset();
 	local_irq_enable();
 }
 
@@ -434,6 +435,12 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
 
 #else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
 
+void __init sched_clock_init(void)
+{
+	sched_clock_running = 1;
+	generic_sched_clock_init();
+}
+
 u64 sched_clock_cpu(int cpu)
 {
 	if (unlikely(!sched_clock_running))
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe365c9a08e9..552406e9713b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5954,7 +5954,6 @@ void __init sched_init(void)
 	int i, j;
 	unsigned long alloc_size = 0, ptr;
 
-	sched_clock_init();
 	wait_bit_init();
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 2d8f05aad442..cbc72c2c1fca 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -237,7 +237,7 @@ sched_clock_register(u64 (*read)(void), int bits, unsigned long rate)
 	pr_debug("Registered %pF as sched_clock source\n", read);
 }
 
-void __init sched_clock_postinit(void)
+void __init generic_sched_clock_init(void)
 {
 	/*
 	 * If no sched_clock() function has been provided at that point,
-- 
2.18.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v15 23/26] sched: early boot clock
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (21 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 22/26] sched: move sched clock initialization and merge with generic clock Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:33   ` [tip:x86/timers] sched/clock: Enable sched clock early tip-bot for Pavel Tatashin
                     ` (2 more replies)
  2018-07-19 20:55 ` [PATCH v15 24/26] sched: use static key for sched_clock_running Pavel Tatashin
                   ` (3 subsequent siblings)
  26 siblings, 3 replies; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

Allow sched_clock() to be used before sched_clock_init() is called.
This provides a way to get early boot timestamps on machines with
unstable clocks.
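
The continuity trick in sched_clock_init() is just an offset capture: record
the difference between the raw clock and the gtod-based clock, so readings
stay monotonic across the switch. A toy sketch with assumed example values:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mirrors the idea of __sched_clock_gtod_offset(): the offset that, added
 * to a gtod-based reading, continues where the raw sched_clock() left off.
 */
static uint64_t gtod_offset(uint64_t raw_ns, uint64_t sched_clock_offset,
			    uint64_t gtod_ns)
{
	return (raw_ns + sched_clock_offset) - gtod_ns;
}
```

If the raw clock reads 5 s at init while gtod-based time reads 2 s, the
captured 3 s offset makes subsequent gtod-based readings continue from 5 s.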

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 init/main.c          |  2 +-
 kernel/sched/clock.c | 20 +++++++++++++++++++-
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/init/main.c b/init/main.c
index 162d931c9511..ff0a24170b95 100644
--- a/init/main.c
+++ b/init/main.c
@@ -642,7 +642,6 @@ asmlinkage __visible void __init start_kernel(void)
 	softirq_init();
 	timekeeping_init();
 	time_init();
-	sched_clock_init();
 	printk_safe_init();
 	perf_event_init();
 	profile_init();
@@ -697,6 +696,7 @@ asmlinkage __visible void __init start_kernel(void)
 	acpi_early_init();
 	if (late_time_init)
 		late_time_init();
+	sched_clock_init();
 	calibrate_delay();
 	pid_idr_init();
 	anon_vma_init();
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 0e9dbb2d9aea..422cd63f8f17 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -202,7 +202,25 @@ static void __sched_clock_gtod_offset(void)
 
 void __init sched_clock_init(void)
 {
+	unsigned long flags;
+
+	/*
+	 * Set __gtod_offset such that once we mark sched_clock_running,
+	 * sched_clock_tick() continues where sched_clock() left off.
+	 *
+	 * Even if TSC is buggered, we're still UP at this point so it
+	 * can't really be out of sync.
+	 */
+	local_irq_save(flags);
+	__sched_clock_gtod_offset();
+	local_irq_restore(flags);
+
 	sched_clock_running = 1;
+
+	/* Now that sched_clock_running is set adjust scd */
+	local_irq_save(flags);
+	sched_clock_tick();
+	local_irq_restore(flags);
 }
 /*
  * We run this as late_initcall() such that it runs after all built-in drivers,
@@ -356,7 +374,7 @@ u64 sched_clock_cpu(int cpu)
 		return sched_clock() + __sched_clock_offset;
 
 	if (unlikely(!sched_clock_running))
-		return 0ull;
+		return sched_clock();
 
 	preempt_disable_notrace();
 	scd = cpu_sdc(cpu);
-- 
2.18.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v15 24/26] sched: use static key for sched_clock_running
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (22 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 23/26] sched: early boot clock Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:33   ` [tip:x86/timers] sched/clock: Use " tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 25/26] x86/tsc: split native_calibrate_cpu() into early and late parts Pavel Tatashin
                   ` (2 subsequent siblings)
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

sched_clock_running may be read every time sched_clock_cpu() is called.
Yet, this variable is updated only twice during boot and never changes
again; therefore it is better to make it a static key.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/clock.c | 16 ++++++++--------
 kernel/sched/debug.c |  2 --
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 422cd63f8f17..c5c47ad3f386 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -67,7 +67,7 @@ unsigned long long __weak sched_clock(void)
 }
 EXPORT_SYMBOL_GPL(sched_clock);
 
-__read_mostly int sched_clock_running;
+static DEFINE_STATIC_KEY_FALSE(sched_clock_running);
 
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 /*
@@ -191,7 +191,7 @@ void clear_sched_clock_stable(void)
 
 	smp_mb(); /* matches sched_clock_init_late() */
 
-	if (sched_clock_running == 2)
+	if (static_key_count(&sched_clock_running.key) == 2)
 		__clear_sched_clock_stable();
 }
 
@@ -215,7 +215,7 @@ void __init sched_clock_init(void)
 	__sched_clock_gtod_offset();
 	local_irq_restore(flags);
 
-	sched_clock_running = 1;
+	static_branch_inc(&sched_clock_running);
 
 	/* Now that sched_clock_running is set adjust scd */
 	local_irq_save(flags);
@@ -228,7 +228,7 @@ void __init sched_clock_init(void)
  */
 static int __init sched_clock_init_late(void)
 {
-	sched_clock_running = 2;
+	static_branch_inc(&sched_clock_running);
 	/*
 	 * Ensure that it is impossible to not do a static_key update.
 	 *
@@ -373,7 +373,7 @@ u64 sched_clock_cpu(int cpu)
 	if (sched_clock_stable())
 		return sched_clock() + __sched_clock_offset;
 
-	if (unlikely(!sched_clock_running))
+	if (!static_branch_unlikely(&sched_clock_running))
 		return sched_clock();
 
 	preempt_disable_notrace();
@@ -396,7 +396,7 @@ void sched_clock_tick(void)
 	if (sched_clock_stable())
 		return;
 
-	if (unlikely(!sched_clock_running))
+	if (!static_branch_unlikely(&sched_clock_running))
 		return;
 
 	lockdep_assert_irqs_disabled();
@@ -455,13 +455,13 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
 
 void __init sched_clock_init(void)
 {
-	sched_clock_running = 1;
+	static_branch_inc(&sched_clock_running);
 	generic_sched_clock_init();
 }
 
 u64 sched_clock_cpu(int cpu)
 {
-	if (unlikely(!sched_clock_running))
+	if (!static_branch_unlikely(&sched_clock_running))
 		return 0;
 
 	return sched_clock();
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e593b4118578..b0212f489a33 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -623,8 +623,6 @@ void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
 #undef PU
 }
 
-extern __read_mostly int sched_clock_running;
-
 static void print_cpu(struct seq_file *m, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-- 
2.18.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v15 25/26] x86/tsc:  split native_calibrate_cpu() into early and late parts
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (23 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 24/26] sched: use static key for sched_clock_running Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:34   ` [tip:x86/timers] x86/tsc: Split " tip-bot for Pavel Tatashin
  2018-07-19 20:55 ` [PATCH v15 26/26] x86/tsc: use tsc_calibrate_cpu_early and pit_hpet_ptimer_calibrate_cpu Pavel Tatashin
  2018-07-19 22:34 ` [PATCH v15 00/26] Early boot time stamps Thomas Gleixner
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

Early in boot, the CPU can be calibrated using the MSR, CPUID, and quick PIT
methods. The other methods (PIT/HPET/PMTIMER) become available only after
ACPI is initialized.

Split native_calibrate_cpu() into early and late parts so they can be called
separately during early and late TSC calibration.
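
The split results in a simple fallback chain: try the cheap early methods
first, then the ACPI-dependent ones. A userspace sketch with stubbed
calibration methods (all stub names and return values here are invented for
illustration; each returns a frequency in kHz, or 0 on failure):

```c
#include <assert.h>

/* Stub methods (hypothetical values): return kHz, or 0 if unavailable. */
static unsigned long calibrate_cpuid(void)           { return 0; }
static unsigned long calibrate_msr(void)             { return 0; }
static unsigned long calibrate_quick_pit(void)       { return 2400000UL; }
static unsigned long calibrate_pit_hpet_ptimer(void) { return 2399980UL; }

/* Early part: only methods usable before ACPI is up. */
static unsigned long calibrate_cpu_early(void)
{
	unsigned long khz = calibrate_cpuid();

	if (!khz)
		khz = calibrate_msr();
	if (!khz)
		khz = calibrate_quick_pit();
	return khz;
}

/* Late part: fall back to PIT/HPET/PMTIMER if the early methods failed. */
static unsigned long calibrate_cpu(void)
{
	unsigned long khz = calibrate_cpu_early();

	if (!khz)
		khz = calibrate_pit_hpet_ptimer();
	return khz;
}
```

Here the quick PIT stub succeeds, so the late methods are never consulted.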

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/include/asm/tsc.h |  1 +
 arch/x86/kernel/tsc.c      | 54 +++++++++++++++++++++++++-------------
 2 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index c4368ff73652..88140e4f2292 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -40,6 +40,7 @@ extern int unsynchronized_tsc(void);
 extern int check_tsc_unstable(void);
 extern void mark_tsc_async_resets(char *reason);
 extern unsigned long native_calibrate_cpu(void);
+extern unsigned long native_calibrate_cpu_early(void);
 extern unsigned long native_calibrate_tsc(void);
 extern unsigned long long native_sched_clock_from_tsc(u64 tsc);
 
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 9277ae9b68b3..60586779b02c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -680,30 +680,17 @@ static unsigned long cpu_khz_from_cpuid(void)
 	return eax_base_mhz * 1000;
 }
 
-/**
- * native_calibrate_cpu - calibrate the cpu on boot
+/*
+ * calibrate cpu using pit, hpet, and ptimer methods. They are available
+ * later in boot after acpi is initialized.
  */
-unsigned long native_calibrate_cpu(void)
+static unsigned long pit_hpet_ptimer_calibrate_cpu(void)
 {
 	u64 tsc1, tsc2, delta, ref1, ref2;
 	unsigned long tsc_pit_min = ULONG_MAX, tsc_ref_min = ULONG_MAX;
-	unsigned long flags, latch, ms, fast_calibrate;
+	unsigned long flags, latch, ms;
 	int hpet = is_hpet_enabled(), i, loopmin;
 
-	fast_calibrate = cpu_khz_from_cpuid();
-	if (fast_calibrate)
-		return fast_calibrate;
-
-	fast_calibrate = cpu_khz_from_msr();
-	if (fast_calibrate)
-		return fast_calibrate;
-
-	local_irq_save(flags);
-	fast_calibrate = quick_pit_calibrate();
-	local_irq_restore(flags);
-	if (fast_calibrate)
-		return fast_calibrate;
-
 	/*
 	 * Run 5 calibration loops to get the lowest frequency value
 	 * (the best estimate). We use two different calibration modes
@@ -846,6 +833,37 @@ unsigned long native_calibrate_cpu(void)
 	return tsc_pit_min;
 }
 
+/**
+ * native_calibrate_cpu_early - can calibrate the cpu early in boot
+ */
+unsigned long native_calibrate_cpu_early(void)
+{
+	unsigned long flags, fast_calibrate = cpu_khz_from_cpuid();
+
+	if (!fast_calibrate)
+		fast_calibrate = cpu_khz_from_msr();
+	if (!fast_calibrate) {
+		local_irq_save(flags);
+		fast_calibrate = quick_pit_calibrate();
+		local_irq_restore(flags);
+	}
+	return fast_calibrate;
+}
+
+
+/**
+ * native_calibrate_cpu - calibrate the cpu
+ */
+unsigned long native_calibrate_cpu(void)
+{
+	unsigned long tsc_freq = native_calibrate_cpu_early();
+
+	if (!tsc_freq)
+		tsc_freq = pit_hpet_ptimer_calibrate_cpu();
+
+	return tsc_freq;
+}
+
 void recalibrate_cpu_khz(void)
 {
 #ifndef CONFIG_SMP
-- 
2.18.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v15 26/26] x86/tsc:  use tsc_calibrate_cpu_early and pit_hpet_ptimer_calibrate_cpu
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (24 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 25/26] x86/tsc: split native_calibrate_cpu() into early and late parts Pavel Tatashin
@ 2018-07-19 20:55 ` Pavel Tatashin
  2018-07-19 22:35   ` [tip:x86/timers] x86/tsc: Make use of tsc_calibrate_cpu_early() tip-bot for Pavel Tatashin
  2018-07-19 22:34 ` [PATCH v15 00/26] Early boot time stamps Thomas Gleixner
  26 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-19 20:55 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, pasha.tatashin, boris.ostrovsky, jgross,
	pbonzini

Early in boot, use tsc_calibrate_cpu_early and switch to
tsc_calibrate_cpu() only later. Do this unconditionally, because it is
unknown what methods other CPUs will use to calibrate once they are
onlined.

If the TSC frequency is still unknown by the time tsc_init() is called,
calibrate using only pit_hpet_ptimer_calibrate_cpu(), as this function
contains the only methods that have not been tried earlier in boot.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 arch/x86/include/asm/tsc.h |  1 -
 arch/x86/kernel/tsc.c      | 25 +++++++++++++++++++------
 arch/x86/kernel/x86_init.c |  2 +-
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 88140e4f2292..eb5bbfeccb66 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -39,7 +39,6 @@ extern void mark_tsc_unstable(char *reason);
 extern int unsynchronized_tsc(void);
 extern int check_tsc_unstable(void);
 extern void mark_tsc_async_resets(char *reason);
-extern unsigned long native_calibrate_cpu(void);
 extern unsigned long native_calibrate_cpu_early(void);
 extern unsigned long native_calibrate_tsc(void);
 extern unsigned long long native_sched_clock_from_tsc(u64 tsc);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 60586779b02c..02e416b87ac1 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -854,7 +854,7 @@ unsigned long native_calibrate_cpu_early(void)
 /**
  * native_calibrate_cpu - calibrate the cpu
  */
-unsigned long native_calibrate_cpu(void)
+static unsigned long native_calibrate_cpu(void)
 {
 	unsigned long tsc_freq = native_calibrate_cpu_early();
 
@@ -1374,13 +1374,19 @@ static int __init init_tsc_clocksource(void)
  */
 device_initcall(init_tsc_clocksource);
 
-static bool __init determine_cpu_tsc_frequencies(void)
+static bool __init determine_cpu_tsc_frequencies(bool early)
 {
 	/* Make sure that cpu and tsc are not already calibrated */
 	WARN_ON(cpu_khz || tsc_khz);
 
-	cpu_khz = x86_platform.calibrate_cpu();
-	tsc_khz = x86_platform.calibrate_tsc();
+	if (early) {
+		cpu_khz = x86_platform.calibrate_cpu();
+		tsc_khz = x86_platform.calibrate_tsc();
+	} else {
+		/* We should not be here with non-native cpu calibration */
+		WARN_ON(x86_platform.calibrate_cpu != native_calibrate_cpu);
+		cpu_khz = pit_hpet_ptimer_calibrate_cpu();
+	}
 
 	/*
 	 * Trust non-zero tsc_khz as authorative,
@@ -1419,7 +1425,7 @@ void __init tsc_early_init(void)
 {
 	if (!boot_cpu_has(X86_FEATURE_TSC))
 		return;
-	if (!determine_cpu_tsc_frequencies())
+	if (!determine_cpu_tsc_frequencies(true))
 		return;
 	loops_per_jiffy = get_loops_per_jiffy();
 
@@ -1431,6 +1437,13 @@ void __init tsc_early_init(void)
 
 void __init tsc_init(void)
 {
+	/*
+	 * native_calibrate_cpu_early can only calibrate using methods that are
+	 * available early in boot.
+	 */
+	if (x86_platform.calibrate_cpu == native_calibrate_cpu_early)
+		x86_platform.calibrate_cpu = native_calibrate_cpu;
+
 	if (!boot_cpu_has(X86_FEATURE_TSC)) {
 		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
 		return;
@@ -1438,7 +1451,7 @@ void __init tsc_init(void)
 
 	if (!tsc_khz) {
 		/* We failed to determine frequencies earlier, try again */
-		if (!determine_cpu_tsc_frequencies()) {
+		if (!determine_cpu_tsc_frequencies(false)) {
 			mark_tsc_unstable("could not calculate TSC khz");
 			setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
 			return;
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 3ab867603e81..2792b5573818 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -109,7 +109,7 @@ struct x86_cpuinit_ops x86_cpuinit = {
 static void default_nmi_init(void) { };
 
 struct x86_platform_ops x86_platform __ro_after_init = {
-	.calibrate_cpu			= native_calibrate_cpu,
+	.calibrate_cpu			= native_calibrate_cpu_early,
 	.calibrate_tsc			= native_calibrate_tsc,
 	.get_wallclock			= mach_get_cmos_time,
 	.set_wallclock			= mach_set_rtc_mmss,
-- 
2.18.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [tip:x86/timers] x86/kvmclock: Remove memblock dependency
  2018-07-19 20:55 ` [PATCH v15 01/26] x86/kvmclock: Remove memblock dependency Pavel Tatashin
@ 2018-07-19 22:21   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:21 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, pbonzini, pasha.tatashin, linux-kernel, tglx, mingo

Commit-ID:  368a540e0232ad446931f5a4e8a5e06f69f21343
Gitweb:     https://git.kernel.org/tip/368a540e0232ad446931f5a4e8a5e06f69f21343
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:20 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:36 +0200

x86/kvmclock: Remove memblock dependency

KVM clock is initialized later than other hypervisor clocks because it
depends on the memblock allocator.

Bring it in line with other hypervisors by using memory from the BSS
instead of allocating it.

The benefits:

  - Remove ifdef from common code
  - Earlier availability of the clock
  - Remove dependency on memblock, and reduce code

The downside:

  - Static allocation of the per-CPU data structures, sized NR_CPUS * 64
    bytes. This will be addressed in follow-up patches.

[ tglx: Split out from larger series ]

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Link: https://lkml.kernel.org/r/20180719205545.16512-2-pasha.tatashin@oracle.com

---
 arch/x86/kernel/kvm.c      |  1 +
 arch/x86/kernel/kvmclock.c | 66 ++++++++--------------------------------------
 arch/x86/kernel/setup.c    |  4 ---
 3 files changed, 12 insertions(+), 59 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5b2300b818af..c65c232d3ddd 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -628,6 +628,7 @@ const __initconst struct hypervisor_x86 x86_hyper_kvm = {
 	.name			= "KVM",
 	.detect			= kvm_detect,
 	.type			= X86_HYPER_KVM,
+	.init.init_platform	= kvmclock_init,
 	.init.guest_late_init	= kvm_guest_init,
 	.init.x2apic_available	= kvm_para_available,
 };
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 3b8e7c13c614..1f6ac5aaa904 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -23,9 +23,9 @@
 #include <asm/apic.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
-#include <linux/memblock.h>
 #include <linux/sched.h>
 #include <linux/sched/clock.h>
+#include <linux/mm.h>
 
 #include <asm/mem_encrypt.h>
 #include <asm/x86_init.h>
@@ -44,6 +44,13 @@ static int parse_no_kvmclock(char *arg)
 }
 early_param("no-kvmclock", parse_no_kvmclock);
 
+/* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
+#define HV_CLOCK_SIZE	(sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
+#define WALL_CLOCK_SIZE	(sizeof(struct pvclock_wall_clock))
+
+static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
+static u8 wall_clock_mem[PAGE_ALIGN(WALL_CLOCK_SIZE)] __aligned(PAGE_SIZE);
+
 /* The hypervisor will put information about time periodically here */
 static struct pvclock_vsyscall_time_info *hv_clock;
 static struct pvclock_wall_clock *wall_clock;
@@ -245,43 +252,12 @@ static void kvm_shutdown(void)
 	native_machine_shutdown();
 }
 
-static phys_addr_t __init kvm_memblock_alloc(phys_addr_t size,
-					     phys_addr_t align)
-{
-	phys_addr_t mem;
-
-	mem = memblock_alloc(size, align);
-	if (!mem)
-		return 0;
-
-	if (sev_active()) {
-		if (early_set_memory_decrypted((unsigned long)__va(mem), size))
-			goto e_free;
-	}
-
-	return mem;
-e_free:
-	memblock_free(mem, size);
-	return 0;
-}
-
-static void __init kvm_memblock_free(phys_addr_t addr, phys_addr_t size)
-{
-	if (sev_active())
-		early_set_memory_encrypted((unsigned long)__va(addr), size);
-
-	memblock_free(addr, size);
-}
-
 void __init kvmclock_init(void)
 {
 	struct pvclock_vcpu_time_info *vcpu_time;
-	unsigned long mem, mem_wall_clock;
-	int size, cpu, wall_clock_size;
+	int cpu;
 	u8 flags;
 
-	size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
 	if (!kvm_para_available())
 		return;
 
@@ -291,28 +267,11 @@ void __init kvmclock_init(void)
 	} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
 		return;
 
-	wall_clock_size = PAGE_ALIGN(sizeof(struct pvclock_wall_clock));
-	mem_wall_clock = kvm_memblock_alloc(wall_clock_size, PAGE_SIZE);
-	if (!mem_wall_clock)
-		return;
-
-	wall_clock = __va(mem_wall_clock);
-	memset(wall_clock, 0, wall_clock_size);
-
-	mem = kvm_memblock_alloc(size, PAGE_SIZE);
-	if (!mem) {
-		kvm_memblock_free(mem_wall_clock, wall_clock_size);
-		wall_clock = NULL;
-		return;
-	}
-
-	hv_clock = __va(mem);
-	memset(hv_clock, 0, size);
+	wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
+	hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
 
 	if (kvm_register_clock("primary cpu clock")) {
 		hv_clock = NULL;
-		kvm_memblock_free(mem, size);
-		kvm_memblock_free(mem_wall_clock, wall_clock_size);
 		wall_clock = NULL;
 		return;
 	}
@@ -357,13 +316,10 @@ int __init kvm_setup_vsyscall_timeinfo(void)
 	int cpu;
 	u8 flags;
 	struct pvclock_vcpu_time_info *vcpu_time;
-	unsigned int size;
 
 	if (!hv_clock)
 		return 0;
 
-	size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
-
 	cpu = get_cpu();
 
 	vcpu_time = &hv_clock[cpu].pvti;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2f86d883dd95..da1dbd99cb6e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1197,10 +1197,6 @@ void __init setup_arch(char **cmdline_p)
 
 	memblock_find_dma_reserve();
 
-#ifdef CONFIG_KVM_GUEST
-	kvmclock_init();
-#endif
-
 	tsc_early_delay_calibrate();
 	if (!early_xdbc_setup_hardware())
 		early_xdbc_register_console();


* [tip:x86/timers] x86/kvmclock: Remove page size requirement from wall_clock
  2018-07-19 20:55 ` [PATCH v15 02/26] x86/kvmclock: Remove page size requirement from wall_clock Pavel Tatashin
@ 2018-07-19 22:22   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Thomas Gleixner @ 2018-07-19 22:22 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, hpa, linux-kernel, pasha.tatashin, mingo, pbonzini

Commit-ID:  7ef363a39514ed8a6f2333fbae1875ac0953715a
Gitweb:     https://git.kernel.org/tip/7ef363a39514ed8a6f2333fbae1875ac0953715a
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 19 Jul 2018 16:55:21 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:36 +0200

x86/kvmclock: Remove page size requirement from wall_clock

There is no requirement for wall_clock data to be page aligned or page
sized.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Link: https://lkml.kernel.org/r/20180719205545.16512-3-pasha.tatashin@oracle.com

---
 arch/x86/kernel/kvmclock.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 1f6ac5aaa904..a995d7d7164c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -46,14 +46,12 @@ early_param("no-kvmclock", parse_no_kvmclock);
 
 /* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
 #define HV_CLOCK_SIZE	(sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
-#define WALL_CLOCK_SIZE	(sizeof(struct pvclock_wall_clock))
 
 static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
-static u8 wall_clock_mem[PAGE_ALIGN(WALL_CLOCK_SIZE)] __aligned(PAGE_SIZE);
 
 /* The hypervisor will put information about time periodically here */
 static struct pvclock_vsyscall_time_info *hv_clock;
-static struct pvclock_wall_clock *wall_clock;
+static struct pvclock_wall_clock wall_clock;
 
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
@@ -66,15 +64,15 @@ static void kvm_get_wallclock(struct timespec64 *now)
 	int low, high;
 	int cpu;
 
-	low = (int)slow_virt_to_phys(wall_clock);
-	high = ((u64)slow_virt_to_phys(wall_clock) >> 32);
+	low = (int)slow_virt_to_phys(&wall_clock);
+	high = ((u64)slow_virt_to_phys(&wall_clock) >> 32);
 
 	native_write_msr(msr_kvm_wall_clock, low, high);
 
 	cpu = get_cpu();
 
 	vcpu_time = &hv_clock[cpu].pvti;
-	pvclock_read_wallclock(wall_clock, vcpu_time, now);
+	pvclock_read_wallclock(&wall_clock, vcpu_time, now);
 
 	put_cpu();
 }
@@ -267,12 +265,10 @@ void __init kvmclock_init(void)
 	} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
 		return;
 
-	wall_clock = (struct pvclock_wall_clock *)wall_clock_mem;
 	hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
 
 	if (kvm_register_clock("primary cpu clock")) {
 		hv_clock = NULL;
-		wall_clock = NULL;
 		return;
 	}
 


* [tip:x86/timers] x86/kvmclock: Decrapify kvm_register_clock()
  2018-07-19 20:55 ` [PATCH v15 03/26] x86/kvmclock: Decrapify kvm_register_clock() Pavel Tatashin
@ 2018-07-19 22:23   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Thomas Gleixner @ 2018-07-19 22:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, linux-kernel, pbonzini, hpa, tglx, pasha.tatashin

Commit-ID:  7a5ddc8fe0ea9518cd7fb6a929cac7d864c6f300
Gitweb:     https://git.kernel.org/tip/7a5ddc8fe0ea9518cd7fb6a929cac7d864c6f300
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 19 Jul 2018 16:55:22 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:36 +0200

x86/kvmclock: Decrapify kvm_register_clock()

The return value is pointless because the wrmsr cannot fail if
KVM_FEATURE_CLOCKSOURCE or KVM_FEATURE_CLOCKSOURCE2 is set.

kvm_register_clock() is only called locally so wants to be static.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Link: https://lkml.kernel.org/r/20180719205545.16512-4-pasha.tatashin@oracle.com

---
 arch/x86/include/asm/kvm_para.h |  1 -
 arch/x86/kernel/kvmclock.c      | 33 ++++++++++-----------------------
 2 files changed, 10 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 3aea2658323a..4c723632c036 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -7,7 +7,6 @@
 #include <uapi/asm/kvm_para.h>
 
 extern void kvmclock_init(void);
-extern int kvm_register_clock(char *txt);
 
 #ifdef CONFIG_KVM_GUEST
 bool kvm_check_and_clear_guest_paused(void);
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index a995d7d7164c..f0a0aef5e9fa 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -187,23 +187,19 @@ struct clocksource kvm_clock = {
 };
 EXPORT_SYMBOL_GPL(kvm_clock);
 
-int kvm_register_clock(char *txt)
+static void kvm_register_clock(char *txt)
 {
-	int cpu = smp_processor_id();
-	int low, high, ret;
 	struct pvclock_vcpu_time_info *src;
+	int cpu = smp_processor_id();
+	u64 pa;
 
 	if (!hv_clock)
-		return 0;
+		return;
 
 	src = &hv_clock[cpu].pvti;
-	low = (int)slow_virt_to_phys(src) | 1;
-	high = ((u64)slow_virt_to_phys(src) >> 32);
-	ret = native_write_msr_safe(msr_kvm_system_time, low, high);
-	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
-	       cpu, high, low, txt);
-
-	return ret;
+	pa = slow_virt_to_phys(src) | 0x01ULL;
+	wrmsrl(msr_kvm_system_time, pa);
+	pr_info("kvm-clock: cpu %d, msr %llx, %s\n", cpu, pa, txt);
 }
 
 static void kvm_save_sched_clock_state(void)
@@ -218,11 +214,7 @@ static void kvm_restore_sched_clock_state(void)
 #ifdef CONFIG_X86_LOCAL_APIC
 static void kvm_setup_secondary_clock(void)
 {
-	/*
-	 * Now that the first cpu already had this clocksource initialized,
-	 * we shouldn't fail.
-	 */
-	WARN_ON(kvm_register_clock("secondary cpu clock"));
+	kvm_register_clock("secondary cpu clock");
 }
 #endif
 
@@ -265,16 +257,11 @@ void __init kvmclock_init(void)
 	} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
 		return;
 
-	hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
-
-	if (kvm_register_clock("primary cpu clock")) {
-		hv_clock = NULL;
-		return;
-	}
-
 	printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
 		msr_kvm_system_time, msr_kvm_wall_clock);
 
+	hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
+	kvm_register_clock("primary cpu clock");
 	pvclock_set_pvti_cpu0_va(hv_clock);
 
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))


* [tip:x86/timers] x86/kvmclock: Cleanup the code
  2018-07-19 20:55 ` [PATCH v15 04/26] x86/kvmclock: Cleanup the code Pavel Tatashin
@ 2018-07-19 22:23   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Thomas Gleixner @ 2018-07-19 22:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, mingo, pbonzini, linux-kernel, pasha.tatashin, tglx

Commit-ID:  146c394d0c3c8e88df433a179c2b0b85fd8cf247
Gitweb:     https://git.kernel.org/tip/146c394d0c3c8e88df433a179c2b0b85fd8cf247
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 19 Jul 2018 16:55:23 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:37 +0200

x86/kvmclock: Cleanup the code

- Clean up the MSR write for the wall clock. The type casts to (int) are
  sloppy because the wrmsr parameters are u32, and aside from that wrmsrl()
  already provides the high/low split for free.

- Remove the pointless get_cpu()/put_cpu() dance from various
  functions. Either they are called during early init where the CPU is
  guaranteed to be 0, or they are already called from non-preemptible
  context where smp_processor_id() can be used safely.

- Simplify the convoluted check for kvmclock in the init function.

- Mark the parameter parsing function __init. No point in keeping it
  around.

- Convert to pr_info()

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Link: https://lkml.kernel.org/r/20180719205545.16512-5-pasha.tatashin@oracle.com

---
 arch/x86/kernel/kvmclock.c | 72 ++++++++++++++--------------------------------
 1 file changed, 22 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index f0a0aef5e9fa..4afb03e49a4f 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -37,7 +37,7 @@ static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
 static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
 static u64 kvm_sched_clock_offset;
 
-static int parse_no_kvmclock(char *arg)
+static int __init parse_no_kvmclock(char *arg)
 {
 	kvmclock = 0;
 	return 0;
@@ -61,13 +61,9 @@ static struct pvclock_wall_clock wall_clock;
 static void kvm_get_wallclock(struct timespec64 *now)
 {
 	struct pvclock_vcpu_time_info *vcpu_time;
-	int low, high;
 	int cpu;
 
-	low = (int)slow_virt_to_phys(&wall_clock);
-	high = ((u64)slow_virt_to_phys(&wall_clock) >> 32);
-
-	native_write_msr(msr_kvm_wall_clock, low, high);
+	wrmsrl(msr_kvm_wall_clock, slow_virt_to_phys(&wall_clock));
 
 	cpu = get_cpu();
 
@@ -117,11 +113,11 @@ static inline void kvm_sched_clock_init(bool stable)
 	kvm_sched_clock_offset = kvm_clock_read();
 	pv_time_ops.sched_clock = kvm_sched_clock_read;
 
-	printk(KERN_INFO "kvm-clock: using sched offset of %llu cycles\n",
-			kvm_sched_clock_offset);
+	pr_info("kvm-clock: using sched offset of %llu cycles",
+		kvm_sched_clock_offset);
 
 	BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
-	         sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
+		sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
 }
 
 /*
@@ -135,16 +131,8 @@ static inline void kvm_sched_clock_init(bool stable)
  */
 static unsigned long kvm_get_tsc_khz(void)
 {
-	struct pvclock_vcpu_time_info *src;
-	int cpu;
-	unsigned long tsc_khz;
-
-	cpu = get_cpu();
-	src = &hv_clock[cpu].pvti;
-	tsc_khz = pvclock_tsc_khz(src);
-	put_cpu();
 	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
-	return tsc_khz;
+	return pvclock_tsc_khz(&hv_clock[0].pvti);
 }
 
 static void kvm_get_preset_lpj(void)
@@ -161,29 +149,27 @@ static void kvm_get_preset_lpj(void)
 
 bool kvm_check_and_clear_guest_paused(void)
 {
-	bool ret = false;
 	struct pvclock_vcpu_time_info *src;
-	int cpu = smp_processor_id();
+	bool ret = false;
 
 	if (!hv_clock)
 		return ret;
 
-	src = &hv_clock[cpu].pvti;
+	src = &hv_clock[smp_processor_id()].pvti;
 	if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
 		src->flags &= ~PVCLOCK_GUEST_STOPPED;
 		pvclock_touch_watchdogs();
 		ret = true;
 	}
-
 	return ret;
 }
 
 struct clocksource kvm_clock = {
-	.name = "kvm-clock",
-	.read = kvm_clock_get_cycles,
-	.rating = 400,
-	.mask = CLOCKSOURCE_MASK(64),
-	.flags = CLOCK_SOURCE_IS_CONTINUOUS,
+	.name	= "kvm-clock",
+	.read	= kvm_clock_get_cycles,
+	.rating	= 400,
+	.mask	= CLOCKSOURCE_MASK(64),
+	.flags	= CLOCK_SOURCE_IS_CONTINUOUS,
 };
 EXPORT_SYMBOL_GPL(kvm_clock);
 
@@ -199,7 +185,7 @@ static void kvm_register_clock(char *txt)
 	src = &hv_clock[cpu].pvti;
 	pa = slow_virt_to_phys(src) | 0x01ULL;
 	wrmsrl(msr_kvm_system_time, pa);
-	pr_info("kvm-clock: cpu %d, msr %llx, %s\n", cpu, pa, txt);
+	pr_info("kvm-clock: cpu %d, msr %llx, %s", cpu, pa, txt);
 }
 
 static void kvm_save_sched_clock_state(void)
@@ -244,20 +230,19 @@ static void kvm_shutdown(void)
 
 void __init kvmclock_init(void)
 {
-	struct pvclock_vcpu_time_info *vcpu_time;
-	int cpu;
 	u8 flags;
 
-	if (!kvm_para_available())
+	if (!kvm_para_available() || !kvmclock)
 		return;
 
-	if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
+	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
 		msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
 		msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
-	} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
+	} else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
 		return;
+	}
 
-	printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
+	pr_info("kvm-clock: Using msrs %x and %x",
 		msr_kvm_system_time, msr_kvm_wall_clock);
 
 	hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
@@ -267,20 +252,15 @@ void __init kvmclock_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 
-	cpu = get_cpu();
-	vcpu_time = &hv_clock[cpu].pvti;
-	flags = pvclock_read_flags(vcpu_time);
-
+	flags = pvclock_read_flags(&hv_clock[0].pvti);
 	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
-	put_cpu();
 
 	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
 	x86_platform.calibrate_cpu = kvm_get_tsc_khz;
 	x86_platform.get_wallclock = kvm_get_wallclock;
 	x86_platform.set_wallclock = kvm_set_wallclock;
 #ifdef CONFIG_X86_LOCAL_APIC
-	x86_cpuinit.early_percpu_clock_init =
-		kvm_setup_secondary_clock;
+	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
 #endif
 	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
 	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
@@ -296,20 +276,12 @@ void __init kvmclock_init(void)
 int __init kvm_setup_vsyscall_timeinfo(void)
 {
 #ifdef CONFIG_X86_64
-	int cpu;
 	u8 flags;
-	struct pvclock_vcpu_time_info *vcpu_time;
 
 	if (!hv_clock)
 		return 0;
 
-	cpu = get_cpu();
-
-	vcpu_time = &hv_clock[cpu].pvti;
-	flags = pvclock_read_flags(vcpu_time);
-
-	put_cpu();
-
+	flags = pvclock_read_flags(&hv_clock[0].pvti);
 	if (!(flags & PVCLOCK_TSC_STABLE_BIT))
 		return 1;
 


* [tip:x86/timers] x86/kvmclock: Mark variables __initdata and __ro_after_init
  2018-07-19 20:55 ` [PATCH v15 05/26] x86/kvmclock: Mark variables __initdata and __ro_after_init Pavel Tatashin
@ 2018-07-19 22:24   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Thomas Gleixner @ 2018-07-19 22:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: pasha.tatashin, pbonzini, mingo, tglx, hpa, linux-kernel

Commit-ID:  42f8df935efefba51d0c5321b1325436523e3377
Gitweb:     https://git.kernel.org/tip/42f8df935efefba51d0c5321b1325436523e3377
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 19 Jul 2018 16:55:24 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:37 +0200

x86/kvmclock: Mark variables __initdata and __ro_after_init

The kvmclock parameter is init data and the other variables are not
modified after init.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Link: https://lkml.kernel.org/r/20180719205545.16512-6-pasha.tatashin@oracle.com

---
 arch/x86/kernel/kvmclock.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 4afb03e49a4f..78aec160f5e0 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -32,10 +32,10 @@
 #include <asm/reboot.h>
 #include <asm/kvmclock.h>
 
-static int kvmclock __ro_after_init = 1;
-static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
-static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
-static u64 kvm_sched_clock_offset;
+static int kvmclock __initdata = 1;
+static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
+static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
+static u64 kvm_sched_clock_offset __ro_after_init;
 
 static int __init parse_no_kvmclock(char *arg)
 {
@@ -50,7 +50,7 @@ early_param("no-kvmclock", parse_no_kvmclock);
 static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
 
 /* The hypervisor will put information about time periodically here */
-static struct pvclock_vsyscall_time_info *hv_clock;
+static struct pvclock_vsyscall_time_info *hv_clock __ro_after_init;
 static struct pvclock_wall_clock wall_clock;
 
 /*


* [tip:x86/timers] x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock
  2018-07-19 20:55 ` [PATCH v15 06/26] x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock Pavel Tatashin
@ 2018-07-19 22:24   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Thomas Gleixner @ 2018-07-19 22:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, pbonzini, linux-kernel, hpa, mingo, pasha.tatashin

Commit-ID:  e499a9b6dc488aff7f284bee51936f510ab7ad15
Gitweb:     https://git.kernel.org/tip/e499a9b6dc488aff7f284bee51936f510ab7ad15
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 19 Jul 2018 16:55:25 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:37 +0200

x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock

There is no point in keeping this in the KVM guest code and calling it from
there. It can be called from an initcall, and the parameter is cleared when
the hypervisor is not KVM.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Link: https://lkml.kernel.org/r/20180719205545.16512-7-pasha.tatashin@oracle.com

---
 arch/x86/include/asm/kvm_guest.h |  7 -------
 arch/x86/kernel/kvm.c            | 13 ------------
 arch/x86/kernel/kvmclock.c       | 44 ++++++++++++++++++++++++----------------
 3 files changed, 27 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/kvm_guest.h b/arch/x86/include/asm/kvm_guest.h
deleted file mode 100644
index 46185263d9c2..000000000000
--- a/arch/x86/include/asm/kvm_guest.h
+++ /dev/null
@@ -1,7 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_X86_KVM_GUEST_H
-#define _ASM_X86_KVM_GUEST_H
-
-int kvm_setup_vsyscall_timeinfo(void);
-
-#endif /* _ASM_X86_KVM_GUEST_H */
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c65c232d3ddd..a560750cc76f 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -45,7 +45,6 @@
 #include <asm/apic.h>
 #include <asm/apicdef.h>
 #include <asm/hypervisor.h>
-#include <asm/kvm_guest.h>
 
 static int kvmapf = 1;
 
@@ -66,15 +65,6 @@ static int __init parse_no_stealacc(char *arg)
 
 early_param("no-steal-acc", parse_no_stealacc);
 
-static int kvmclock_vsyscall = 1;
-static int __init parse_no_kvmclock_vsyscall(char *arg)
-{
-        kvmclock_vsyscall = 0;
-        return 0;
-}
-
-early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
-
 static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU_DECRYPTED(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
@@ -560,9 +550,6 @@ static void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
 		apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
-	if (kvmclock_vsyscall)
-		kvm_setup_vsyscall_timeinfo();
-
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
 	smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 78aec160f5e0..7d690d2238f8 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -27,12 +27,14 @@
 #include <linux/sched/clock.h>
 #include <linux/mm.h>
 
+#include <asm/hypervisor.h>
 #include <asm/mem_encrypt.h>
 #include <asm/x86_init.h>
 #include <asm/reboot.h>
 #include <asm/kvmclock.h>
 
 static int kvmclock __initdata = 1;
+static int kvmclock_vsyscall __initdata = 1;
 static int msr_kvm_system_time __ro_after_init = MSR_KVM_SYSTEM_TIME;
 static int msr_kvm_wall_clock __ro_after_init = MSR_KVM_WALL_CLOCK;
 static u64 kvm_sched_clock_offset __ro_after_init;
@@ -44,6 +46,13 @@ static int __init parse_no_kvmclock(char *arg)
 }
 early_param("no-kvmclock", parse_no_kvmclock);
 
+static int __init parse_no_kvmclock_vsyscall(char *arg)
+{
+	kvmclock_vsyscall = 0;
+	return 0;
+}
+early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
+
 /* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
 #define HV_CLOCK_SIZE	(sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
 
@@ -228,6 +237,24 @@ static void kvm_shutdown(void)
 	native_machine_shutdown();
 }
 
+static int __init kvm_setup_vsyscall_timeinfo(void)
+{
+#ifdef CONFIG_X86_64
+	u8 flags;
+
+	if (!hv_clock || !kvmclock_vsyscall)
+		return 0;
+
+	flags = pvclock_read_flags(&hv_clock[0].pvti);
+	if (!(flags & PVCLOCK_TSC_STABLE_BIT))
+		return 1;
+
+	kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
+#endif
+	return 0;
+}
+early_initcall(kvm_setup_vsyscall_timeinfo);
+
 void __init kvmclock_init(void)
 {
 	u8 flags;
@@ -272,20 +299,3 @@ void __init kvmclock_init(void)
 	clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
 	pv_info.name = "KVM";
 }
-
-int __init kvm_setup_vsyscall_timeinfo(void)
-{
-#ifdef CONFIG_X86_64
-	u8 flags;
-
-	if (!hv_clock)
-		return 0;
-
-	flags = pvclock_read_flags(&hv_clock[0].pvti);
-	if (!(flags & PVCLOCK_TSC_STABLE_BIT))
-		return 1;
-
-	kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
-#endif
-	return 0;
-}


* [tip:x86/timers] x86/kvmclock: Switch kvmclock data to a PER_CPU variable
  2018-07-19 20:55 ` [PATCH v15 07/26] x86/kvmclock: Switch kvmclock data to a PER_CPU variable Pavel Tatashin
@ 2018-07-19 22:25   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Thomas Gleixner @ 2018-07-19 22:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, pbonzini, pasha.tatashin, hpa, mingo, linux-kernel

Commit-ID:  95a3d4454bb1cf5bfd666c27fdd2dc188e17c14d
Gitweb:     https://git.kernel.org/tip/95a3d4454bb1cf5bfd666c27fdd2dc188e17c14d
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Thu, 19 Jul 2018 16:55:26 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:38 +0200

x86/kvmclock: Switch kvmclock data to a PER_CPU variable

The previous removal of the memblock dependency from kvmclock introduced a
static data array sized 64 bytes * CONFIG_NR_CPUS. That's wasteful on large
systems when kvmclock is not used.

Replace it with:

 - A static page sized array of pvclock data. It's page sized because the
   pvclock data of the boot CPU is mapped into the vDSO; otherwise random
   other data would be exposed to user space.

 - A PER_CPU variable of pvclock data pointers. This is used to access the
   pvclock data storage on each CPU.

The setup is done in two stages:

 - Early boot stores the pointer to the static page for the boot CPU in
   the per cpu data.

 - In the preparatory stage of CPU hotplug, assign either an element of
   the static array (when the CPU number is in that range) or allocate
   memory, and initialize the per cpu pointer.
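This two-stage setup can be modeled with a rough user-space sketch (names mirror the patch; the per-CPU variable becomes a plain array, kzalloc() becomes calloc(), and the record is padded larger than the kernel's 64 bytes so the static array is small enough for the allocation path to be visible):

```c
#include <stdlib.h>

#define PAGE_SIZE 4096
#define NR_CPUS   8

/* stand-in for the real pvclock record (the kernel's is 64 bytes) */
struct pvclock_vsyscall_time_info { char pad[1024]; };

#define HVC_BOOT_ARRAY_SIZE \
	(PAGE_SIZE / sizeof(struct pvclock_vsyscall_time_info))

static struct pvclock_vsyscall_time_info hv_clock_boot[HVC_BOOT_ARRAY_SIZE];
static struct pvclock_vsyscall_time_info *hv_clock_per_cpu[NR_CPUS];

/* Stage 1: early boot wires the boot CPU to the static page. */
static void kvmclock_init_boot_cpu(void)
{
	hv_clock_per_cpu[0] = &hv_clock_boot[0];
}

/* Stage 2: CPU hotplug prepare picks a static slot or allocates one. */
static int kvmclock_setup_percpu(unsigned int cpu)
{
	struct pvclock_vsyscall_time_info *p = hv_clock_per_cpu[cpu];

	/* CPU0 was set up in init already; keep any valid pointer in place */
	if (!cpu || (p && p != hv_clock_per_cpu[0]))
		return 0;

	/* Use the static page for the first CPUs, allocate otherwise */
	if (cpu < HVC_BOOT_ARRAY_SIZE)
		p = &hv_clock_boot[cpu];
	else
		p = calloc(1, sizeof(*p));

	hv_clock_per_cpu[cpu] = p;
	return p ? 0 : -1;
}
```

Calling the stage-2 hook twice for the same CPU is harmless: an already-initialized pointer is detected and left alone, which matters because the per-cpu area setup replicates CPU0's value to all pointers first.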

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Link: https://lkml.kernel.org/r/20180719205545.16512-8-pasha.tatashin@oracle.com

---
 arch/x86/kernel/kvmclock.c | 99 +++++++++++++++++++++++++++++-----------------
 1 file changed, 62 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 7d690d2238f8..91b94c0ae4e3 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -23,6 +23,7 @@
 #include <asm/apic.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
+#include <linux/cpuhotplug.h>
 #include <linux/sched.h>
 #include <linux/sched/clock.h>
 #include <linux/mm.h>
@@ -55,12 +56,23 @@ early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
 
 /* Aligned to page sizes to match whats mapped via vsyscalls to userspace */
 #define HV_CLOCK_SIZE	(sizeof(struct pvclock_vsyscall_time_info) * NR_CPUS)
+#define HVC_BOOT_ARRAY_SIZE \
+	(PAGE_SIZE / sizeof(struct pvclock_vsyscall_time_info))
 
-static u8 hv_clock_mem[PAGE_ALIGN(HV_CLOCK_SIZE)] __aligned(PAGE_SIZE);
-
-/* The hypervisor will put information about time periodically here */
-static struct pvclock_vsyscall_time_info *hv_clock __ro_after_init;
+static struct pvclock_vsyscall_time_info
+			hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __aligned(PAGE_SIZE);
 static struct pvclock_wall_clock wall_clock;
+static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
+
+static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
+{
+	return &this_cpu_read(hv_clock_per_cpu)->pvti;
+}
+
+static inline struct pvclock_vsyscall_time_info *this_cpu_hvclock(void)
+{
+	return this_cpu_read(hv_clock_per_cpu);
+}
 
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
@@ -69,17 +81,10 @@ static struct pvclock_wall_clock wall_clock;
  */
 static void kvm_get_wallclock(struct timespec64 *now)
 {
-	struct pvclock_vcpu_time_info *vcpu_time;
-	int cpu;
-
 	wrmsrl(msr_kvm_wall_clock, slow_virt_to_phys(&wall_clock));
-
-	cpu = get_cpu();
-
-	vcpu_time = &hv_clock[cpu].pvti;
-	pvclock_read_wallclock(&wall_clock, vcpu_time, now);
-
-	put_cpu();
+	preempt_disable();
+	pvclock_read_wallclock(&wall_clock, this_cpu_pvti(), now);
+	preempt_enable();
 }
 
 static int kvm_set_wallclock(const struct timespec64 *now)
@@ -89,14 +94,10 @@ static int kvm_set_wallclock(const struct timespec64 *now)
 
 static u64 kvm_clock_read(void)
 {
-	struct pvclock_vcpu_time_info *src;
 	u64 ret;
-	int cpu;
 
 	preempt_disable_notrace();
-	cpu = smp_processor_id();
-	src = &hv_clock[cpu].pvti;
-	ret = pvclock_clocksource_read(src);
+	ret = pvclock_clocksource_read(this_cpu_pvti());
 	preempt_enable_notrace();
 	return ret;
 }
@@ -141,7 +142,7 @@ static inline void kvm_sched_clock_init(bool stable)
 static unsigned long kvm_get_tsc_khz(void)
 {
 	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
-	return pvclock_tsc_khz(&hv_clock[0].pvti);
+	return pvclock_tsc_khz(this_cpu_pvti());
 }
 
 static void kvm_get_preset_lpj(void)
@@ -158,15 +159,14 @@ static void kvm_get_preset_lpj(void)
 
 bool kvm_check_and_clear_guest_paused(void)
 {
-	struct pvclock_vcpu_time_info *src;
+	struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
 	bool ret = false;
 
-	if (!hv_clock)
+	if (!src)
 		return ret;
 
-	src = &hv_clock[smp_processor_id()].pvti;
-	if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
-		src->flags &= ~PVCLOCK_GUEST_STOPPED;
+	if ((src->pvti.flags & PVCLOCK_GUEST_STOPPED) != 0) {
+		src->pvti.flags &= ~PVCLOCK_GUEST_STOPPED;
 		pvclock_touch_watchdogs();
 		ret = true;
 	}
@@ -184,17 +184,15 @@ EXPORT_SYMBOL_GPL(kvm_clock);
 
 static void kvm_register_clock(char *txt)
 {
-	struct pvclock_vcpu_time_info *src;
-	int cpu = smp_processor_id();
+	struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
 	u64 pa;
 
-	if (!hv_clock)
+	if (!src)
 		return;
 
-	src = &hv_clock[cpu].pvti;
-	pa = slow_virt_to_phys(src) | 0x01ULL;
+	pa = slow_virt_to_phys(&src->pvti) | 0x01ULL;
 	wrmsrl(msr_kvm_system_time, pa);
-	pr_info("kvm-clock: cpu %d, msr %llx, %s", cpu, pa, txt);
+	pr_info("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
 }
 
 static void kvm_save_sched_clock_state(void)
@@ -242,12 +240,12 @@ static int __init kvm_setup_vsyscall_timeinfo(void)
 #ifdef CONFIG_X86_64
 	u8 flags;
 
-	if (!hv_clock || !kvmclock_vsyscall)
+	if (!per_cpu(hv_clock_per_cpu, 0) || !kvmclock_vsyscall)
 		return 0;
 
-	flags = pvclock_read_flags(&hv_clock[0].pvti);
+	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
 	if (!(flags & PVCLOCK_TSC_STABLE_BIT))
-		return 1;
+		return 0;
 
 	kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
 #endif
@@ -255,6 +253,28 @@ static int __init kvm_setup_vsyscall_timeinfo(void)
 }
 early_initcall(kvm_setup_vsyscall_timeinfo);
 
+static int kvmclock_setup_percpu(unsigned int cpu)
+{
+	struct pvclock_vsyscall_time_info *p = per_cpu(hv_clock_per_cpu, cpu);
+
+	/*
+	 * The per cpu area setup replicates CPU0 data to all cpu
+	 * pointers. So carefully check. CPU0 has been set up in init
+	 * already.
+	 */
+	if (!cpu || (p && p != per_cpu(hv_clock_per_cpu, 0)))
+		return 0;
+
+	/* Use the static page for the first CPUs, allocate otherwise */
+	if (cpu < HVC_BOOT_ARRAY_SIZE)
+		p = &hv_clock_boot[cpu];
+	else
+		p = kzalloc(sizeof(*p), GFP_KERNEL);
+
+	per_cpu(hv_clock_per_cpu, cpu) = p;
+	return p ? 0 : -ENOMEM;
+}
+
 void __init kvmclock_init(void)
 {
 	u8 flags;
@@ -269,17 +289,22 @@ void __init kvmclock_init(void)
 		return;
 	}
 
+	if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu",
+			      kvmclock_setup_percpu, NULL) < 0) {
+		return;
+	}
+
 	pr_info("kvm-clock: Using msrs %x and %x",
 		msr_kvm_system_time, msr_kvm_wall_clock);
 
-	hv_clock = (struct pvclock_vsyscall_time_info *)hv_clock_mem;
+	this_cpu_write(hv_clock_per_cpu, &hv_clock_boot[0]);
 	kvm_register_clock("primary cpu clock");
-	pvclock_set_pvti_cpu0_va(hv_clock);
+	pvclock_set_pvti_cpu0_va(hv_clock_boot);
 
 	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
 		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 
-	flags = pvclock_read_flags(&hv_clock[0].pvti);
+	flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
 	kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
 
 	x86_platform.calibrate_tsc = kvm_get_tsc_khz;


* [tip:x86/timers] x86/alternatives, jumplabel: Use text_poke_early() before mm_init()
  2018-07-19 20:55 ` [PATCH v15 08/26] x86: text_poke() may access uninitialized struct pages Pavel Tatashin
@ 2018-07-19 22:25   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:25 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: pasha.tatashin, mingo, tglx, linux-kernel, hpa

Commit-ID:  6fffacb30349e0903602d664f7ab6fc87e85162e
Gitweb:     https://git.kernel.org/tip/6fffacb30349e0903602d664f7ab6fc87e85162e
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:27 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:38 +0200

x86/alternatives, jumplabel: Use text_poke_early() before mm_init()

It is supposed to be safe to modify static branches after jump_label_init().
But because the static key modifying code eventually calls text_poke(), it
can end up accessing a struct page which has not been initialized yet.

Here is how to quickly reproduce the problem. Insert code like this
into init/main.c:

| +static DEFINE_STATIC_KEY_FALSE(__test);
| asmlinkage __visible void __init start_kernel(void)
| {
|        char *command_line;
|@@ -587,6 +609,10 @@ asmlinkage __visible void __init start_kernel(void)
|        vfs_caches_init_early();
|        sort_main_extable();
|        trap_init();
|+       {
|+       static_branch_enable(&__test);
|+       WARN_ON(!static_branch_likely(&__test));
|+       }
|        mm_init();

The following warning shows up:
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:701 text_poke+0x20d/0x230
RIP: 0010:text_poke+0x20d/0x230
Call Trace:
 ? text_poke_bp+0x50/0xda
 ? arch_jump_label_transform+0x89/0xe0
 ? __jump_label_update+0x78/0xb0
 ? static_key_enable_cpuslocked+0x4d/0x80
 ? static_key_enable+0x11/0x20
 ? start_kernel+0x23e/0x4c8
 ? secondary_startup_64+0xa5/0xb0

---[ end trace abdc99c031b8a90a ]---

If the code above is moved after mm_init(), no warning is shown, as struct
pages are initialized during handover from memblock.

Use text_poke_early() in static branching until early boot IRQs are enabled,
and switch to text_poke() from there. Also, ensure text_poke() is never
invoked while uninitialized memory access may happen, by adding a
!after_bootmem assertion.
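The resulting selection logic can be sketched in plain user-space C (a model, not the kernel code: after_bootmem and early_boot_irqs_disabled are stand-ins for the kernel globals, abort() models BUG_ON(), and the function bodies are reduced to memcpy()):

```c
#include <stdlib.h>
#include <string.h>

static int after_bootmem;                /* set once memblock hands over */
static int early_boot_irqs_disabled = 1; /* cleared once early boot IRQs run */

typedef void *(*poker_t)(void *, const void *, size_t);

static void *text_poke_early(void *addr, const void *opcode, size_t len)
{
	return memcpy(addr, opcode, len);  /* safe before mm_init() */
}

static void *text_poke(void *addr, const void *opcode, size_t len)
{
	/* struct pages are unusable while the boot allocator still runs */
	if (!after_bootmem)
		abort();                   /* models BUG_ON(!after_bootmem) */
	return memcpy(addr, opcode, len);
}

/* mirrors the check added to __jump_label_transform() by this patch */
static poker_t pick_poker(poker_t poker)
{
	if (early_boot_irqs_disabled)
		poker = text_poke_early;
	return poker;
}
```

Any caller going through pick_poker() is silently rerouted to the early variant until IRQs are enabled, and a direct text_poke() while the boot allocator is still active trips the assertion instead of corrupting memory.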

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-9-pasha.tatashin@oracle.com
---
 arch/x86/include/asm/text-patching.h |  1 +
 arch/x86/kernel/alternative.c        |  7 +++++++
 arch/x86/kernel/jump_label.c         | 11 +++++++----
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 2ecd34e2d46c..e85ff65c43c3 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -37,5 +37,6 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern int after_bootmem;
 
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index a481763a3776..014f214da581 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -668,6 +668,7 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
 	local_irq_save(flags);
 	memcpy(addr, opcode, len);
 	local_irq_restore(flags);
+	sync_core();
 	/* Could also do a CLFLUSH here to speed up CPU recovery; but
 	   that causes hangs on some VIA CPUs. */
 	return addr;
@@ -693,6 +694,12 @@ void *text_poke(void *addr, const void *opcode, size_t len)
 	struct page *pages[2];
 	int i;
 
+	/*
+	 * While boot memory allocator is runnig we cannot use struct
+	 * pages as they are not yet initialized.
+	 */
+	BUG_ON(!after_bootmem);
+
 	if (!core_kernel_text((unsigned long)addr)) {
 		pages[0] = vmalloc_to_page(addr);
 		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index e56c95be2808..eeea935e9bb5 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -37,15 +37,18 @@ static void bug_at(unsigned char *ip, int line)
 	BUG();
 }
 
-static void __jump_label_transform(struct jump_entry *entry,
-				   enum jump_label_type type,
-				   void *(*poker)(void *, const void *, size_t),
-				   int init)
+static void __ref __jump_label_transform(struct jump_entry *entry,
+					 enum jump_label_type type,
+					 void *(*poker)(void *, const void *, size_t),
+					 int init)
 {
 	union jump_code_union code;
 	const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
 	const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
 
+	if (early_boot_irqs_disabled)
+		poker = text_poke_early;
+
 	if (type == JUMP_LABEL_JMP) {
 		if (init) {
 			/*


* [tip:x86/timers] x86/jump_label: Initialize static branching early
  2018-07-19 20:55 ` [PATCH v15 09/26] x86: initialize static branching early Pavel Tatashin
@ 2018-07-19 22:26   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, linux-kernel, tglx, mingo, peterz, hpa, pasha.tatashin

Commit-ID:  8990cac6e5ea7fa57607736019fe8dca961b998f
Gitweb:     https://git.kernel.org/tip/8990cac6e5ea7fa57607736019fe8dca961b998f
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:28 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:38 +0200

x86/jump_label: Initialize static branching early

Static branching is useful for runtime patching of branches that are used
in hot paths but are infrequently changed.

The x86 clock framework is one example that uses static branches to setup
the best clock during boot and never changes it again.

It is desirable to enable the TSC based sched clock early to allow fine
grained boot time analysis early on. That requires the static branching
functionality to be functional early as well.

Static branching requires patching nop instructions, thus,
arch_init_ideal_nops() must be called prior to jump_label_init().

Do all the necessary steps to call arch_init_ideal_nops() right after
early_cpu_init(), which also allows inserting a call to jump_label_init()
right after that. jump_label_init() will be called again from the generic
init code, but the code is protected against reinitialization already.
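That reinitialization guard can be modeled with a tiny sketch (the function bodies here are hypothetical stand-ins; only the call-twice pattern and the ordering dependency on arch_init_ideal_nops() mirror the patch):

```c
#include <assert.h>

static int nops_ready;        /* arch_init_ideal_nops() has run */
static int jump_label_ready;  /* guard against reinitialization */
static int init_count;        /* how many times init actually ran */

static void arch_init_ideal_nops(void)
{
	nops_ready = 1;
}

/* Safe to call twice: the later call from generic init code is a no-op. */
static void jump_label_init(void)
{
	if (jump_label_ready)
		return;
	assert(nops_ready);  /* patching needs the ideal NOPs picked first */
	init_count++;
	jump_label_ready = 1;
}
```

The early call from setup_arch() does the real work; the second call from generic init code returns immediately, so moving the first call earlier does not require removing the later one.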

[ tglx: Massaged changelog ]

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-10-pasha.tatashin@oracle.com

---
 arch/x86/kernel/cpu/amd.c    | 13 ++++++++-----
 arch/x86/kernel/cpu/common.c | 38 ++++++++++++++++++++------------------
 arch/x86/kernel/setup.c      |  4 ++--
 3 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 38915fbfae73..b732438c1a1e 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -232,8 +232,6 @@ static void init_amd_k7(struct cpuinfo_x86 *c)
 		}
 	}
 
-	set_cpu_cap(c, X86_FEATURE_K7);
-
 	/* calling is from identify_secondary_cpu() ? */
 	if (!c->cpu_index)
 		return;
@@ -617,6 +615,14 @@ static void early_init_amd(struct cpuinfo_x86 *c)
 
 	early_init_amd_mc(c);
 
+#ifdef CONFIG_X86_32
+	if (c->x86 == 6)
+		set_cpu_cap(c, X86_FEATURE_K7);
+#endif
+
+	if (c->x86 >= 0xf)
+		set_cpu_cap(c, X86_FEATURE_K8);
+
 	rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);
 
 	/*
@@ -863,9 +869,6 @@ static void init_amd(struct cpuinfo_x86 *c)
 
 	init_amd_cacheinfo(c);
 
-	if (c->x86 >= 0xf)
-		set_cpu_cap(c, X86_FEATURE_K8);
-
 	if (cpu_has(c, X86_FEATURE_XMM2)) {
 		unsigned long long val;
 		int ret;
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index eb4cb3efd20e..71281ac43b15 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1015,6 +1015,24 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 	setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
 }
 
+/*
+ * The NOPL instruction is supposed to exist on all CPUs of family >= 6;
+ * unfortunately, that's not true in practice because of early VIA
+ * chips and (more importantly) broken virtualizers that are not easy
+ * to detect. In the latter case it doesn't even *fail* reliably, so
+ * probing for it doesn't even work. Disable it completely on 32-bit
+ * unless we can find a reliable way to detect all the broken cases.
+ * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
+ */
+static void detect_nopl(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_X86_32
+	clear_cpu_cap(c, X86_FEATURE_NOPL);
+#else
+	set_cpu_cap(c, X86_FEATURE_NOPL);
+#endif
+}
+
 /*
  * Do minimum CPU detection early.
  * Fields really needed: vendor, cpuid_level, family, model, mask,
@@ -1089,6 +1107,8 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
 	 */
 	if (!pgtable_l5_enabled())
 		setup_clear_cpu_cap(X86_FEATURE_LA57);
+
+	detect_nopl(c);
 }
 
 void __init early_cpu_init(void)
@@ -1124,24 +1144,6 @@ void __init early_cpu_init(void)
 	early_identify_cpu(&boot_cpu_data);
 }
 
-/*
- * The NOPL instruction is supposed to exist on all CPUs of family >= 6;
- * unfortunately, that's not true in practice because of early VIA
- * chips and (more importantly) broken virtualizers that are not easy
- * to detect. In the latter case it doesn't even *fail* reliably, so
- * probing for it doesn't even work. Disable it completely on 32-bit
- * unless we can find a reliable way to detect all the broken cases.
- * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
- */
-static void detect_nopl(struct cpuinfo_x86 *c)
-{
-#ifdef CONFIG_X86_32
-	clear_cpu_cap(c, X86_FEATURE_NOPL);
-#else
-	set_cpu_cap(c, X86_FEATURE_NOPL);
-#endif
-}
-
 static void detect_null_seg_behavior(struct cpuinfo_x86 *c)
 {
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index da1dbd99cb6e..7490de925a81 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -866,6 +866,8 @@ void __init setup_arch(char **cmdline_p)
 
 	idt_setup_early_traps();
 	early_cpu_init();
+	arch_init_ideal_nops();
+	jump_label_init();
 	early_ioremap_init();
 
 	setup_olpc_ofw_pgd();
@@ -1268,8 +1270,6 @@ void __init setup_arch(char **cmdline_p)
 
 	mcheck_init();
 
-	arch_init_ideal_nops();
-
 	register_refined_jiffies(CLOCK_TICK_RATE);
 
 #ifdef CONFIG_EFI


* [tip:x86/timers] x86/CPU: Call detect_nopl() only on the BSP
  2018-07-19 20:55 ` [PATCH v15 10/26] x86/CPU: Call detect_nopl() only on the BSP Pavel Tatashin
@ 2018-07-19 22:26   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Borislav Petkov @ 2018-07-19 22:26 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: tglx, bp, linux-kernel, bp, pasha.tatashin, mingo, hpa

Commit-ID:  9b3661cd7e5400689ed168a7275e75af333177e6
Gitweb:     https://git.kernel.org/tip/9b3661cd7e5400689ed168a7275e75af333177e6
Author:     Borislav Petkov <bp@alien8.de>
AuthorDate: Thu, 19 Jul 2018 16:55:29 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:39 +0200

x86/CPU: Call detect_nopl() only on the BSP

Make it use the setup_* variants, have it be called only on the BSP, and
drop the call in generic_identify() - X86_FEATURE_NOPL will be replicated
to the APs through the forced caps. This helps to keep the mess at a
manageable level.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-11-pasha.tatashin@oracle.com

---
 arch/x86/kernel/cpu/common.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 71281ac43b15..46408a8cdf62 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1024,12 +1024,12 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
  * unless we can find a reliable way to detect all the broken cases.
  * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
  */
-static void detect_nopl(struct cpuinfo_x86 *c)
+static void detect_nopl(void)
 {
 #ifdef CONFIG_X86_32
-	clear_cpu_cap(c, X86_FEATURE_NOPL);
+	setup_clear_cpu_cap(X86_FEATURE_NOPL);
 #else
-	set_cpu_cap(c, X86_FEATURE_NOPL);
+	setup_force_cpu_cap(X86_FEATURE_NOPL);
 #endif
 }
 
@@ -1108,7 +1108,7 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
 	if (!pgtable_l5_enabled())
 		setup_clear_cpu_cap(X86_FEATURE_LA57);
 
-	detect_nopl(c);
+	detect_nopl();
 }
 
 void __init early_cpu_init(void)
@@ -1206,8 +1206,6 @@ static void generic_identify(struct cpuinfo_x86 *c)
 
 	get_model_name(c); /* Default name */
 
-	detect_nopl(c);
-
 	detect_null_seg_behavior(c);
 
 	/*


* [tip:x86/timers] x86/tsc: Redefine notsc to behave as tsc=unstable
  2018-07-19 20:55 ` [PATCH v15 11/26] x86/tsc: redefine notsc to behave as tsc=unstable Pavel Tatashin
@ 2018-07-19 22:27   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: douly.fnst, hpa, pasha.tatashin, linux-kernel, mingo, tglx

Commit-ID:  fe9af81e524e8a86bdd59c0cc0d9e2b0ccaf840f
Gitweb:     https://git.kernel.org/tip/fe9af81e524e8a86bdd59c0cc0d9e2b0ccaf840f
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:30 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:39 +0200

x86/tsc: Redefine notsc to behave as tsc=unstable

Currently, the notsc kernel parameter disables the use of the TSC by
sched_clock(). However, this parameter does not prevent the kernel from
accessing the TSC in other places.

The only rationale to boot with notsc is to avoid timing discrepancies on
multi-socket systems where the TSCs are not properly synchronized, and thus
exclude the TSC from being used for timekeeping. But that prevents using
the TSC as sched_clock() as well, which is not necessary, as the core
sched_clock() implementation can handle non-synchronized TSC based sched
clocks just fine.

However, there is another method to solve the above problem: booting with
tsc=unstable parameter. This parameter allows sched_clock() to use TSC and
just excludes it from timekeeping.

So there is no real reason to keep notsc, but for compatibility reasons the
parameter has to stay. Make it behave like 'tsc=unstable' instead.

[ tglx: Massaged changelog ]

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-12-pasha.tatashin@oracle.com

---
 Documentation/admin-guide/kernel-parameters.txt |  2 --
 Documentation/x86/x86_64/boot-options.txt       |  4 +---
 arch/x86/kernel/tsc.c                           | 18 +++---------------
 3 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 533ff5c68970..5aed30cd0350 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2835,8 +2835,6 @@
 
 	nosync		[HW,M68K] Disables sync negotiation for all devices.
 
-	notsc		[BUGS=X86-32] Disable Time Stamp Counter
-
 	nowatchdog	[KNL] Disable both lockup detectors, i.e.
 			soft-lockup and NMI watchdog (hard-lockup).
 
diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 8d109ef67ab6..66114ab4f9fe 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -92,9 +92,7 @@ APICs
 Timing
 
   notsc
-  Don't use the CPU time stamp counter to read the wall time.
-  This can be used to work around timing problems on multiprocessor systems
-  with not properly synchronized CPUs.
+  Deprecated, use tsc=unstable instead.
 
   nohpet
   Don't use the HPET timer.
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 74392d9d51e0..186395041725 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -38,11 +38,6 @@ EXPORT_SYMBOL(tsc_khz);
  */
 static int __read_mostly tsc_unstable;
 
-/* native_sched_clock() is called before tsc_init(), so
-   we must start with the TSC soft disabled to prevent
-   erroneous rdtsc usage on !boot_cpu_has(X86_FEATURE_TSC) processors */
-static int __read_mostly tsc_disabled = -1;
-
 static DEFINE_STATIC_KEY_FALSE(__use_tsc);
 
 int tsc_clocksource_reliable;
@@ -248,8 +243,7 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
 #ifdef CONFIG_X86_TSC
 int __init notsc_setup(char *str)
 {
-	pr_warn("Kernel compiled with CONFIG_X86_TSC, cannot disable TSC completely\n");
-	tsc_disabled = 1;
+	mark_tsc_unstable("boot parameter notsc");
 	return 1;
 }
 #else
@@ -1307,7 +1301,7 @@ unreg:
 
 static int __init init_tsc_clocksource(void)
 {
-	if (!boot_cpu_has(X86_FEATURE_TSC) || tsc_disabled > 0 || !tsc_khz)
+	if (!boot_cpu_has(X86_FEATURE_TSC) || !tsc_khz)
 		return 0;
 
 	if (tsc_unstable)
@@ -1414,12 +1408,6 @@ void __init tsc_init(void)
 		set_cyc2ns_scale(tsc_khz, cpu, cyc);
 	}
 
-	if (tsc_disabled > 0)
-		return;
-
-	/* now allow native_sched_clock() to use rdtsc */
-
-	tsc_disabled = 0;
 	static_branch_enable(&__use_tsc);
 
 	if (!no_sched_irq_time)
@@ -1455,7 +1443,7 @@ unsigned long calibrate_delay_is_known(void)
 	int constant_tsc = cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC);
 	const struct cpumask *mask = topology_core_cpumask(cpu);
 
-	if (tsc_disabled || !constant_tsc || !mask)
+	if (!constant_tsc || !mask)
 		return 0;
 
 	sibling = cpumask_any_but(mask, cpu);


* [tip:x86/timers] x86/xen/time: Initialize pv xen time in init_hypervisor_platform()
  2018-07-19 20:55 ` [PATCH v15 12/26] x86/xen/time: initialize pv xen time in init_hypervisor_platform Pavel Tatashin
@ 2018-07-19 22:27   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:27 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, tglx, pasha.tatashin, hpa, mingo

Commit-ID:  7b25b9cb0dad8395b5cf5a02196d0e88ccda67d5
Gitweb:     https://git.kernel.org/tip/7b25b9cb0dad8395b5cf5a02196d0e88ccda67d5
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:31 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:39 +0200

x86/xen/time: Initialize pv xen time in init_hypervisor_platform()

In every hypervisor except Xen PV, time ops are initialized in
init_hypervisor_platform().

Xen PV domains initialize time ops in x86_init.paging.pagetable_init(),
by calling xen_setup_shared_info(), which is a poor design, as time is
needed prior to the memory allocator.

xen_setup_shared_info() is called from two places: during boot, and
after suspend. Split the content of xen_setup_shared_info() into
three places:

1. Add the clock relevant data into the new Xen PV init_platform vector,
   and set clock ops there.

2. Move xen_setup_vcpu_info_placement() to the new xen_pv_guest_late_init()
   call.

3. Move re-initialization of parts of the shared info to
   xen_pv_post_suspend(), to be symmetric with xen_pv_pre_suspend().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-13-pasha.tatashin@oracle.com

---
 arch/x86/xen/enlighten_pv.c | 51 +++++++++++++++++++++------------------------
 arch/x86/xen/mmu_pv.c       |  6 ++----
 arch/x86/xen/suspend_pv.c   |  5 +++--
 arch/x86/xen/time.c         |  7 +++----
 arch/x86/xen/xen-ops.h      |  6 ++----
 5 files changed, 34 insertions(+), 41 deletions(-)

diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 439a94bf89ad..105a57d73701 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -119,6 +119,27 @@ static void __init xen_banner(void)
 	       version >> 16, version & 0xffff, extra.extraversion,
 	       xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "");
 }
+
+static void __init xen_pv_init_platform(void)
+{
+	set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
+	HYPERVISOR_shared_info = (void *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+
+	/* xen clock uses per-cpu vcpu_info, need to init it for boot cpu */
+	xen_vcpu_info_reset(0);
+
+	/* pvclock is in shared info area */
+	xen_init_time_ops();
+}
+
+static void __init xen_pv_guest_late_init(void)
+{
+#ifndef CONFIG_SMP
+	/* Setup shared vcpu info for non-smp configurations */
+	xen_setup_vcpu_info_placement();
+#endif
+}
+
 /* Check if running on Xen version (major, minor) or later */
 bool
 xen_running_on_version_or_later(unsigned int major, unsigned int minor)
@@ -947,34 +968,8 @@ static void xen_write_msr(unsigned int msr, unsigned low, unsigned high)
 	xen_write_msr_safe(msr, low, high);
 }
 
-void xen_setup_shared_info(void)
-{
-	set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
-
-	HYPERVISOR_shared_info =
-		(struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
-
-	xen_setup_mfn_list_list();
-
-	if (system_state == SYSTEM_BOOTING) {
-#ifndef CONFIG_SMP
-		/*
-		 * In UP this is as good a place as any to set up shared info.
-		 * Limit this to boot only, at restore vcpu setup is done via
-		 * xen_vcpu_restore().
-		 */
-		xen_setup_vcpu_info_placement();
-#endif
-		/*
-		 * Now that shared info is set up we can start using routines
-		 * that point to pvclock area.
-		 */
-		xen_init_time_ops();
-	}
-}
-
 /* This is called once we have the cpu_possible_mask */
-void __ref xen_setup_vcpu_info_placement(void)
+void __init xen_setup_vcpu_info_placement(void)
 {
 	int cpu;
 
@@ -1228,6 +1223,8 @@ asmlinkage __visible void __init xen_start_kernel(void)
 	x86_init.irqs.intr_mode_init	= x86_init_noop;
 	x86_init.oem.arch_setup = xen_arch_setup;
 	x86_init.oem.banner = xen_banner;
+	x86_init.hyper.init_platform = xen_pv_init_platform;
+	x86_init.hyper.guest_late_init = xen_pv_guest_late_init;
 
 	/*
 	 * Set up some pagetable state before starting to set any ptes.
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 2c30cabfda90..52206ad81e4b 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1230,8 +1230,7 @@ static void __init xen_pagetable_p2m_free(void)
 	 * We roundup to the PMD, which means that if anybody at this stage is
 	 * using the __ka address of xen_start_info or
 	 * xen_start_info->shared_info they are in going to crash. Fortunatly
-	 * we have already revectored in xen_setup_kernel_pagetable and in
-	 * xen_setup_shared_info.
+	 * we have already revectored in xen_setup_kernel_pagetable.
 	 */
 	size = roundup(size, PMD_SIZE);
 
@@ -1292,8 +1291,7 @@ static void __init xen_pagetable_init(void)
 
 	/* Remap memory freed due to conflicts with E820 map */
 	xen_remap_memory();
-
-	xen_setup_shared_info();
+	xen_setup_mfn_list_list();
 }
 static void xen_write_cr2(unsigned long cr2)
 {
diff --git a/arch/x86/xen/suspend_pv.c b/arch/x86/xen/suspend_pv.c
index a2e0f110af56..8303b58c79a9 100644
--- a/arch/x86/xen/suspend_pv.c
+++ b/arch/x86/xen/suspend_pv.c
@@ -27,8 +27,9 @@ void xen_pv_pre_suspend(void)
 void xen_pv_post_suspend(int suspend_cancelled)
 {
 	xen_build_mfn_list_list();
-
-	xen_setup_shared_info();
+	set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_start_info->shared_info);
+	HYPERVISOR_shared_info = (void *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+	xen_setup_mfn_list_list();
 
 	if (suspend_cancelled) {
 		xen_start_info->store_mfn =
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index e0f1bcf01d63..53bb7a8d10b5 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -40,7 +40,7 @@ static unsigned long xen_tsc_khz(void)
 	return pvclock_tsc_khz(info);
 }
 
-u64 xen_clocksource_read(void)
+static u64 xen_clocksource_read(void)
 {
         struct pvclock_vcpu_time_info *src;
 	u64 ret;
@@ -503,7 +503,7 @@ static void __init xen_time_init(void)
 		pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
 }
 
-void __ref xen_init_time_ops(void)
+void __init xen_init_time_ops(void)
 {
 	pv_time_ops = xen_time_ops;
 
@@ -542,8 +542,7 @@ void __init xen_hvm_init_time_ops(void)
 		return;
 
 	if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
-		printk(KERN_INFO "Xen doesn't support pvclock on HVM,"
-				"disable pv timer\n");
+		pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
 		return;
 	}
 
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 3b34745d0a52..e78684597f57 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -31,7 +31,6 @@ extern struct shared_info xen_dummy_shared_info;
 extern struct shared_info *HYPERVISOR_shared_info;
 
 void xen_setup_mfn_list_list(void);
-void xen_setup_shared_info(void);
 void xen_build_mfn_list_list(void);
 void xen_setup_machphys_mapping(void);
 void xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn);
@@ -68,12 +67,11 @@ void xen_init_irq_ops(void);
 void xen_setup_timer(int cpu);
 void xen_setup_runstate_info(int cpu);
 void xen_teardown_timer(int cpu);
-u64 xen_clocksource_read(void);
 void xen_setup_cpu_clockevents(void);
 void xen_save_time_memory_area(void);
 void xen_restore_time_memory_area(void);
-void __ref xen_init_time_ops(void);
-void __init xen_hvm_init_time_ops(void);
+void xen_init_time_ops(void);
+void xen_hvm_init_time_ops(void);
 
 irqreturn_t xen_debug_interrupt(int irq, void *dev_id);
 

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [tip:x86/timers] x86/xen/time: Output xen sched_clock time from 0
  2018-07-19 20:55 ` [PATCH v15 13/26] x86/xen/time: output xen sched_clock time from 0 Pavel Tatashin
@ 2018-07-19 22:28   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:28 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hpa, pasha.tatashin, linux-kernel, mingo, tglx

Commit-ID:  38669ba205d178d2d38bfd194a196d65a44d5af2
Gitweb:     https://git.kernel.org/tip/38669ba205d178d2d38bfd194a196d65a44d5af2
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:32 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:40 +0200

x86/xen/time: Output xen sched_clock time from 0

sched_clock() is expected to count from 0 when the system boots.

Add an offset, xen_sched_clock_offset (similar to what other hypervisors
do, e.g. kvm_sched_clock_offset), so that sched_clock() counts from 0
from the point when time is first initialized.
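
The offset pattern can be sketched in plain C with a stand-in counter in place of the Xen pvclock read (fake_counter and the other names here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the raw clocksource; xen_clocksource_read() in the patch. */
static uint64_t fake_counter;

static uint64_t clocksource_read(void)
{
	return fake_counter;
}

static uint64_t sched_clock_offset;

/* Called once when time ops are installed, mirroring xen_init_time_ops(). */
static void init_time_ops(void)
{
	sched_clock_offset = clocksource_read();
}

/* sched_clock() now counts from 0 as of init time. */
static uint64_t sched_clock(void)
{
	return clocksource_read() - sched_clock_offset;
}
```

Whatever value the clocksource held at init becomes the new zero; later reads report only the elapsed delta.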

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-14-pasha.tatashin@oracle.com

---
 arch/x86/xen/time.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 53bb7a8d10b5..c84f1e039d84 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -31,6 +31,8 @@
 /* Xen may fire a timer up to this many ns early */
 #define TIMER_SLOP	100000
 
+static u64 xen_sched_clock_offset __read_mostly;
+
 /* Get the TSC speed from Xen */
 static unsigned long xen_tsc_khz(void)
 {
@@ -57,6 +59,11 @@ static u64 xen_clocksource_get_cycles(struct clocksource *cs)
 	return xen_clocksource_read();
 }
 
+static u64 xen_sched_clock(void)
+{
+	return xen_clocksource_read() - xen_sched_clock_offset;
+}
+
 static void xen_read_wallclock(struct timespec64 *ts)
 {
 	struct shared_info *s = HYPERVISOR_shared_info;
@@ -367,7 +374,7 @@ void xen_timer_resume(void)
 }
 
 static const struct pv_time_ops xen_time_ops __initconst = {
-	.sched_clock = xen_clocksource_read,
+	.sched_clock = xen_sched_clock,
 	.steal_clock = xen_steal_clock,
 };
 
@@ -505,6 +512,7 @@ static void __init xen_time_init(void)
 
 void __init xen_init_time_ops(void)
 {
+	xen_sched_clock_offset = xen_clocksource_read();
 	pv_time_ops = xen_time_ops;
 
 	x86_init.timers.timer_init = xen_time_init;
@@ -546,6 +554,7 @@ void __init xen_hvm_init_time_ops(void)
 		return;
 	}
 
+	xen_sched_clock_offset = xen_clocksource_read();
 	pv_time_ops = xen_time_ops;
 	x86_init.timers.setup_percpu_clockev = xen_time_init;
 	x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [tip:x86/timers] s390/time: Add read_persistent_wall_and_boot_offset()
  2018-07-19 20:55 ` [PATCH v15 14/26] s390/time: add read_persistent_wall_and_boot_offset() Pavel Tatashin
@ 2018-07-19 22:28   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: schwidefsky, hpa, mingo, linux-kernel, pasha.tatashin, tglx

Commit-ID:  be2e0e4257678408b0ab00ea9e743b9094e393e8
Gitweb:     https://git.kernel.org/tip/be2e0e4257678408b0ab00ea9e743b9094e393e8
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:33 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:40 +0200

s390/time: Add read_persistent_wall_and_boot_offset()

read_persistent_wall_and_boot_offset() will replace read_boot_clock64()
because on some architectures it is more convenient to read both sources
as one may depend on the other. For s390, the implementation is the same
as read_boot_clock64(), but it also calls read_persistent_clock64() and
returns its value.
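
The TOD adjustment in the diff below subtracts a 64-bit delta from a wider clock value with a manual borrow. A generic sketch of that multiword-subtraction pattern (the struct layout and names here are illustrative, not the s390 TOD format):

```c
#include <assert.h>
#include <stdint.h>

/* A two-word value: hi carries the bits above the low 64. */
struct wide {
	uint64_t hi;
	uint64_t lo;
};

/* Subtract a 64-bit delta, borrowing from the high word when the
 * low word underflows (unsigned subtraction wraps modulo 2^64). */
static struct wide wide_sub64(struct wide v, uint64_t delta)
{
	if (v.lo < delta)
		v.hi--;
	v.lo -= delta;
	return v;
}
```

The underflow test must be done before (or independently of) the subtraction itself, since the wrapped low word alone does not say whether a borrow occurred.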

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-15-pasha.tatashin@oracle.com

---
 arch/s390/kernel/time.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index cf561160ea88..d1f5447d5687 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -221,6 +221,24 @@ void read_persistent_clock64(struct timespec64 *ts)
 	ext_to_timespec64(clk, ts);
 }
 
+void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+						 struct timespec64 *boot_offset)
+{
+	unsigned char clk[STORE_CLOCK_EXT_SIZE];
+	struct timespec64 boot_time;
+	__u64 delta;
+
+	delta = initial_leap_seconds + TOD_UNIX_EPOCH;
+	memcpy(clk, tod_clock_base, STORE_CLOCK_EXT_SIZE);
+	*(__u64 *)&clk[1] -= delta;
+	if (*(__u64 *)&clk[1] > delta)
+		clk[0]--;
+	ext_to_timespec64(clk, &boot_time);
+
+	read_persistent_clock64(wall_time);
+	*boot_offset = timespec64_sub(*wall_time, boot_time);
+}
+
 void read_boot_clock64(struct timespec64 *ts)
 {
 	unsigned char clk[STORE_CLOCK_EXT_SIZE];

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [tip:x86/timers] timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
  2018-07-19 20:55 ` [PATCH v15 15/26] time: replace read_boot_clock64() with read_persistent_wall_and_boot_offset() Pavel Tatashin
@ 2018-07-19 22:29   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:29 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hpa, mingo, linux-kernel, pasha.tatashin, tglx

Commit-ID:  3eca993740b8eb40f514b90b1877a4dbcf0a6710
Gitweb:     https://git.kernel.org/tip/3eca993740b8eb40f514b90b1877a4dbcf0a6710
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:34 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:40 +0200

timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()

If an architecture does not support an exact boot time, it is challenging
to estimate the boot time without a reference to the current persistent
clock value. Yet, it cannot read the persistent clock again, because this
may lead to math discrepancies with the caller of read_boot_clock64(),
which may have read the persistent clock at a different time.

This is why it is better to provide two values simultaneously: the
persistent clock value and the boot time.

Replace read_boot_clock64() with:
read_persistent_wall_and_boot_offset(wall_time, boot_offset)

where wall_time is the value returned by read_persistent_clock(), and
boot_offset is wall_time - boot_time, which defaults to 0.
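
The relation boot_offset = wall_time - boot_time can be illustrated with a simplified stand-in for struct timespec64 and timespec64_sub() (the ts64 names are shorthand used only here):

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000L

/* Simplified stand-in for struct timespec64. */
struct ts64 {
	int64_t tv_sec;
	long tv_nsec;
};

/* r = a - b, normalizing tv_nsec into [0, NSEC_PER_SEC). */
static struct ts64 ts64_sub(struct ts64 a, struct ts64 b)
{
	struct ts64 r = { a.tv_sec - b.tv_sec, a.tv_nsec - b.tv_nsec };

	if (r.tv_nsec < 0) {
		r.tv_sec--;
		r.tv_nsec += NSEC_PER_SEC;
	}
	return r;
}
```

With wall_time = 100.5s and boot_time = 40.7s, the offset comes out as 59.8s, i.e. how long after boot the persistent clock was sampled.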

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-16-pasha.tatashin@oracle.com

---
 include/linux/timekeeping.h |  3 ++-
 kernel/time/timekeeping.c   | 59 +++++++++++++++++++++++----------------------
 2 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 86bc2026efce..686bc27acef0 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -243,7 +243,8 @@ extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot);
 extern int persistent_clock_is_local;
 
 extern void read_persistent_clock64(struct timespec64 *ts);
-extern void read_boot_clock64(struct timespec64 *ts);
+void read_persistent_clock_and_boot_offset(struct timespec64 *wall_clock,
+					   struct timespec64 *boot_offset);
 extern int update_persistent_clock64(struct timespec64 now);
 
 /*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 4786df904c22..cb738f825c12 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -17,6 +17,7 @@
 #include <linux/nmi.h>
 #include <linux/sched.h>
 #include <linux/sched/loadavg.h>
+#include <linux/sched/clock.h>
 #include <linux/syscore_ops.h>
 #include <linux/clocksource.h>
 #include <linux/jiffies.h>
@@ -1496,18 +1497,20 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
 }
 
 /**
- * read_boot_clock64 -  Return time of the system start.
+ * read_persistent_wall_and_boot_offset - Read persistent clock, and also offset
+ *                                        from the boot.
  *
  * Weak dummy function for arches that do not yet support it.
- * Function to read the exact time the system has been started.
- * Returns a timespec64 with tv_sec=0 and tv_nsec=0 if unsupported.
- *
- *  XXX - Do be sure to remove it once all arches implement it.
+ * wall_time	- current time as returned by persistent clock
+ * boot_offset	- offset that is defined as wall_time - boot_time
+ *		  default to 0.
  */
-void __weak read_boot_clock64(struct timespec64 *ts)
+void __weak __init
+read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
+				     struct timespec64 *boot_offset)
 {
-	ts->tv_sec = 0;
-	ts->tv_nsec = 0;
+	read_persistent_clock64(wall_time);
+	*boot_offset = (struct timespec64){0};
 }
 
 /* Flag for if timekeeping_resume() has injected sleeptime */
@@ -1521,28 +1524,29 @@ static bool persistent_clock_exists;
  */
 void __init timekeeping_init(void)
 {
+	struct timespec64 wall_time, boot_offset, wall_to_mono;
 	struct timekeeper *tk = &tk_core.timekeeper;
 	struct clocksource *clock;
 	unsigned long flags;
-	struct timespec64 now, boot, tmp;
-
-	read_persistent_clock64(&now);
-	if (!timespec64_valid_strict(&now)) {
-		pr_warn("WARNING: Persistent clock returned invalid value!\n"
-			"         Check your CMOS/BIOS settings.\n");
-		now.tv_sec = 0;
-		now.tv_nsec = 0;
-	} else if (now.tv_sec || now.tv_nsec)
-		persistent_clock_exists = true;
 
-	read_boot_clock64(&boot);
-	if (!timespec64_valid_strict(&boot)) {
-		pr_warn("WARNING: Boot clock returned invalid value!\n"
-			"         Check your CMOS/BIOS settings.\n");
-		boot.tv_sec = 0;
-		boot.tv_nsec = 0;
+	read_persistent_wall_and_boot_offset(&wall_time, &boot_offset);
+	if (timespec64_valid_strict(&wall_time) &&
+	    timespec64_to_ns(&wall_time) > 0) {
+		persistent_clock_exists = true;
+	} else {
+		pr_warn("Persistent clock returned invalid value");
+		wall_time = (struct timespec64){0};
 	}
 
+	if (timespec64_compare(&wall_time, &boot_offset) < 0)
+		boot_offset = (struct timespec64){0};
+
+	/*
+	 * We want set wall_to_mono, so the following is true:
+	 * wall time + wall_to_mono = boot time
+	 */
+	wall_to_mono = timespec64_sub(boot_offset, wall_time);
+
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
 	write_seqcount_begin(&tk_core.seq);
 	ntp_init();
@@ -1552,13 +1556,10 @@ void __init timekeeping_init(void)
 		clock->enable(clock);
 	tk_setup_internals(tk, clock);
 
-	tk_set_xtime(tk, &now);
+	tk_set_xtime(tk, &wall_time);
 	tk->raw_sec = 0;
-	if (boot.tv_sec == 0 && boot.tv_nsec == 0)
-		boot = tk_xtime(tk);
 
-	set_normalized_timespec64(&tmp, -boot.tv_sec, -boot.tv_nsec);
-	tk_set_wall_to_mono(tk, tmp);
+	tk_set_wall_to_mono(tk, wall_to_mono);
 
 	timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
 

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [tip:x86/timers] timekeeping: Default boot time offset to local_clock()
  2018-07-19 20:55 ` [PATCH v15 16/26] time: default boot time offset to local_clock() Pavel Tatashin
@ 2018-07-19 22:29   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:29 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: pasha.tatashin, mingo, hpa, tglx, linux-kernel

Commit-ID:  4b1b7f8054896cee25669f6cea7cb6dd17f508f7
Gitweb:     https://git.kernel.org/tip/4b1b7f8054896cee25669f6cea7cb6dd17f508f7
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:35 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:41 +0200

timekeeping: Default boot time offset to local_clock()

read_persistent_wall_and_boot_offset() is called during boot to read
both the persistent clock and also return the offset between the boot time
and the value of persistent clock.

Change the default boot_offset from zero to local_clock(), so that
architectures that do not have a dedicated boot clock but do have an early
sched_clock(), such as SPARCv9, x86, and possibly more, benefit by getting
a better and more consistent estimate of the boot time without the need
for an arch-specific implementation.
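
A minimal sketch of the new default, with stand-ins for local_clock() and ns_to_timespec64() (all names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

struct ts64 {
	int64_t tv_sec;
	long tv_nsec;
};

/* Minimal ns_to_timespec64() for non-negative input. */
static struct ts64 ns_to_ts64(uint64_t ns)
{
	struct ts64 r = { (int64_t)(ns / NSEC_PER_SEC),
			  (long)(ns % NSEC_PER_SEC) };
	return r;
}

/* Stand-in local_clock(): sched_clock time in ns since early boot. */
static uint64_t fake_local_clock_ns;

static uint64_t local_clock(void)
{
	return fake_local_clock_ns;
}

/* The weak default: boot_offset is simply "how long we have been
 * running according to sched_clock", converted to a timespec. */
static struct ts64 default_boot_offset(void)
{
	return ns_to_ts64(local_clock());
}
```

If sched_clock() says 3.5s have elapsed when the persistent clock is read, the boot offset defaults to 3.5s instead of 0.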

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-17-pasha.tatashin@oracle.com

---
 kernel/time/timekeeping.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index cb738f825c12..30d7f64ffc87 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1503,14 +1503,17 @@ void __weak read_persistent_clock64(struct timespec64 *ts64)
  * Weak dummy function for arches that do not yet support it.
  * wall_time	- current time as returned by persistent clock
  * boot_offset	- offset that is defined as wall_time - boot_time
- *		  default to 0.
+ * The default function calculates offset based on the current value of
+ * local_clock(). This way architectures that support sched_clock() but don't
+ * support dedicated boot time clock will provide the best estimate of the
+ * boot time.
  */
 void __weak __init
 read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
 				     struct timespec64 *boot_offset)
 {
 	read_persistent_clock64(wall_time);
-	*boot_offset = (struct timespec64){0};
+	*boot_offset = ns_to_timespec64(local_clock());
 }
 
 /* Flag for if timekeeping_resume() has injected sleeptime */

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [tip:x86/timers] s390/time: Remove read_boot_clock64()
  2018-07-19 20:55 ` [PATCH v15 17/26] s390/time: remove read_boot_clock64() Pavel Tatashin
@ 2018-07-19 22:30   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:30 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: pasha.tatashin, hpa, linux-kernel, tglx, mingo

Commit-ID:  00067a6db2e95f3b9d9a017b3be3c715d54cc0de
Gitweb:     https://git.kernel.org/tip/00067a6db2e95f3b9d9a017b3be3c715d54cc0de
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:36 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:41 +0200

s390/time: Remove read_boot_clock64()

read_boot_clock64() was replaced by read_persistent_wall_and_boot_offset()
so remove it.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-18-pasha.tatashin@oracle.com

---
 arch/s390/kernel/time.c | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index d1f5447d5687..e8766beee5ad 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -239,19 +239,6 @@ void __init read_persistent_wall_and_boot_offset(struct timespec64 *wall_time,
 	*boot_offset = timespec64_sub(*wall_time, boot_time);
 }
 
-void read_boot_clock64(struct timespec64 *ts)
-{
-	unsigned char clk[STORE_CLOCK_EXT_SIZE];
-	__u64 delta;
-
-	delta = initial_leap_seconds + TOD_UNIX_EPOCH;
-	memcpy(clk, tod_clock_base, 16);
-	*(__u64 *) &clk[1] -= delta;
-	if (*(__u64 *) &clk[1] > delta)
-		clk[0]--;
-	ext_to_timespec64(clk, ts);
-}
-
 static u64 read_tod_clock(struct clocksource *cs)
 {
 	unsigned long long now, adj;

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [tip:x86/timers] ARM/time: Remove read_boot_clock64()
  2018-07-19 20:55 ` [PATCH v15 18/26] ARM/time: remove read_boot_clock64() Pavel Tatashin
@ 2018-07-19 22:30   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:30 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: mingo, linux-kernel, pasha.tatashin, hpa, tglx

Commit-ID:  227e3958a780499b3ec41c36d4752ac4f4962874
Gitweb:     https://git.kernel.org/tip/227e3958a780499b3ec41c36d4752ac4f4962874
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:37 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:41 +0200

ARM/time: Remove read_boot_clock64()

read_boot_clock64() is deleted, and replaced with
read_persistent_wall_and_boot_offset().

The default implementation of read_persistent_wall_and_boot_offset()
provides a better fallback than the current read_boot_clock64() stubs on
ARM, which have no users, so remove the old code.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-19-pasha.tatashin@oracle.com

---
 arch/arm/include/asm/mach/time.h    |  3 +--
 arch/arm/kernel/time.c              | 15 ++-------------
 arch/arm/plat-omap/counter_32k.c    |  2 +-
 drivers/clocksource/tegra20_timer.c |  2 +-
 4 files changed, 5 insertions(+), 17 deletions(-)

diff --git a/arch/arm/include/asm/mach/time.h b/arch/arm/include/asm/mach/time.h
index 0f79e4dec7f9..4ac3a019a46f 100644
--- a/arch/arm/include/asm/mach/time.h
+++ b/arch/arm/include/asm/mach/time.h
@@ -13,7 +13,6 @@
 extern void timer_tick(void);
 
 typedef void (*clock_access_fn)(struct timespec64 *);
-extern int register_persistent_clock(clock_access_fn read_boot,
-				     clock_access_fn read_persistent);
+extern int register_persistent_clock(clock_access_fn read_persistent);
 
 #endif
diff --git a/arch/arm/kernel/time.c b/arch/arm/kernel/time.c
index cf2701cb0de8..078b259ead4e 100644
--- a/arch/arm/kernel/time.c
+++ b/arch/arm/kernel/time.c
@@ -83,29 +83,18 @@ static void dummy_clock_access(struct timespec64 *ts)
 }
 
 static clock_access_fn __read_persistent_clock = dummy_clock_access;
-static clock_access_fn __read_boot_clock = dummy_clock_access;
 
 void read_persistent_clock64(struct timespec64 *ts)
 {
 	__read_persistent_clock(ts);
 }
 
-void read_boot_clock64(struct timespec64 *ts)
-{
-	__read_boot_clock(ts);
-}
-
-int __init register_persistent_clock(clock_access_fn read_boot,
-				     clock_access_fn read_persistent)
+int __init register_persistent_clock(clock_access_fn read_persistent)
 {
 	/* Only allow the clockaccess functions to be registered once */
-	if (__read_persistent_clock == dummy_clock_access &&
-	    __read_boot_clock == dummy_clock_access) {
-		if (read_boot)
-			__read_boot_clock = read_boot;
+	if (__read_persistent_clock == dummy_clock_access) {
 		if (read_persistent)
 			__read_persistent_clock = read_persistent;
-
 		return 0;
 	}
 
diff --git a/arch/arm/plat-omap/counter_32k.c b/arch/arm/plat-omap/counter_32k.c
index 2438b96004c1..fcc5bfec8bd1 100644
--- a/arch/arm/plat-omap/counter_32k.c
+++ b/arch/arm/plat-omap/counter_32k.c
@@ -110,7 +110,7 @@ int __init omap_init_clocksource_32k(void __iomem *vbase)
 	}
 
 	sched_clock_register(omap_32k_read_sched_clock, 32, 32768);
-	register_persistent_clock(NULL, omap_read_persistent_clock64);
+	register_persistent_clock(omap_read_persistent_clock64);
 	pr_info("OMAP clocksource: 32k_counter at 32768 Hz\n");
 
 	return 0;
diff --git a/drivers/clocksource/tegra20_timer.c b/drivers/clocksource/tegra20_timer.c
index c337a8100a7b..2242a36fc5b0 100644
--- a/drivers/clocksource/tegra20_timer.c
+++ b/drivers/clocksource/tegra20_timer.c
@@ -259,6 +259,6 @@ static int __init tegra20_init_rtc(struct device_node *np)
 	else
 		clk_prepare_enable(clk);
 
-	return register_persistent_clock(NULL, tegra_read_persistent_clock64);
+	return register_persistent_clock(tegra_read_persistent_clock64);
 }
 TIMER_OF_DECLARE(tegra20_rtc, "nvidia,tegra20-rtc", tegra20_init_rtc);

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [tip:x86/timers] x86/tsc: Calibrate tsc only once
  2018-07-19 20:55 ` [PATCH v15 19/26] x86/tsc: calibrate tsc only once Pavel Tatashin
@ 2018-07-19 22:31   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:31 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: pasha.tatashin, linux-kernel, tglx, hpa, mingo

Commit-ID:  cf7a63ef4e0203f6f33284c69e8188d91422de83
Gitweb:     https://git.kernel.org/tip/cf7a63ef4e0203f6f33284c69e8188d91422de83
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:38 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:42 +0200

x86/tsc: Calibrate tsc only once

During boot the TSC is calibrated twice: once in
tsc_early_delay_calibrate(), and a second time in tsc_init().

Rename tsc_early_delay_calibrate() to tsc_early_init(), rework it so the
calibration is done only early, and make tsc_init() use the values already
determined in tsc_early_init().

Sometimes it is not possible to determine the TSC frequency early, because
a required subsystem is not yet initialized; in such a case, try again
later in tsc_init().
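
The calibrate-early-and-retry-late flow can be sketched as follows (stand-in names; the real code lives in arch/x86/kernel/tsc.c):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in platform calibration: returns 0 while the subsystem it
 * depends on is not yet up, and a frequency in kHz afterwards. */
static unsigned long platform_calibrate_khz;

static unsigned long calibrate_tsc(void)
{
	return platform_calibrate_khz;
}

static unsigned long tsc_khz;

static bool determine_tsc_frequency(void)
{
	tsc_khz = calibrate_tsc();
	return tsc_khz != 0;
}

/* Early attempt: may fail silently if calibration is not yet possible. */
static void tsc_early_init(void)
{
	determine_tsc_frequency();
}

/* Late init: only recalibrate if the early attempt failed. */
static void tsc_init(void)
{
	if (!tsc_khz)
		determine_tsc_frequency();
}
```

The guard in tsc_init() is what removes the double calibration: once a frequency is known, the second pass reuses it instead of measuring again.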

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-20-pasha.tatashin@oracle.com

---
 arch/x86/include/asm/tsc.h |  2 +-
 arch/x86/kernel/setup.c    |  2 +-
 arch/x86/kernel/tsc.c      | 87 +++++++++++++++++++++++++---------------------
 3 files changed, 49 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 2701d221583a..c4368ff73652 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -33,7 +33,7 @@ static inline cycles_t get_cycles(void)
 extern struct system_counterval_t convert_art_to_tsc(u64 art);
 extern struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns);
 
-extern void tsc_early_delay_calibrate(void);
+extern void tsc_early_init(void);
 extern void tsc_init(void);
 extern void mark_tsc_unstable(char *reason);
 extern int unsynchronized_tsc(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 7490de925a81..5d32c55aeb8b 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1014,6 +1014,7 @@ void __init setup_arch(char **cmdline_p)
 	 */
 	init_hypervisor_platform();
 
+	tsc_early_init();
 	x86_init.resources.probe_roms();
 
 	/* after parse_early_param, so could debug it */
@@ -1199,7 +1200,6 @@ void __init setup_arch(char **cmdline_p)
 
 	memblock_find_dma_reserve();
 
-	tsc_early_delay_calibrate();
 	if (!early_xdbc_setup_hardware())
 		early_xdbc_register_console();
 
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 186395041725..4cab2236169e 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -33,6 +33,8 @@ EXPORT_SYMBOL(cpu_khz);
 unsigned int __read_mostly tsc_khz;
 EXPORT_SYMBOL(tsc_khz);
 
+#define KHZ	1000
+
 /*
  * TSC can be unstable due to cpufreq or due to unsynced TSCs
  */
@@ -1335,34 +1337,10 @@ unreg:
  */
 device_initcall(init_tsc_clocksource);
 
-void __init tsc_early_delay_calibrate(void)
-{
-	unsigned long lpj;
-
-	if (!boot_cpu_has(X86_FEATURE_TSC))
-		return;
-
-	cpu_khz = x86_platform.calibrate_cpu();
-	tsc_khz = x86_platform.calibrate_tsc();
-
-	tsc_khz = tsc_khz ? : cpu_khz;
-	if (!tsc_khz)
-		return;
-
-	lpj = tsc_khz * 1000;
-	do_div(lpj, HZ);
-	loops_per_jiffy = lpj;
-}
-
-void __init tsc_init(void)
+static bool __init determine_cpu_tsc_frequencies(void)
 {
-	u64 lpj, cyc;
-	int cpu;
-
-	if (!boot_cpu_has(X86_FEATURE_TSC)) {
-		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
-		return;
-	}
+	/* Make sure that cpu and tsc are not already calibrated */
+	WARN_ON(cpu_khz || tsc_khz);
 
 	cpu_khz = x86_platform.calibrate_cpu();
 	tsc_khz = x86_platform.calibrate_tsc();
@@ -1377,20 +1355,52 @@ void __init tsc_init(void)
 	else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
 		cpu_khz = tsc_khz;
 
-	if (!tsc_khz) {
-		mark_tsc_unstable("could not calculate TSC khz");
-		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
-		return;
-	}
+	if (tsc_khz == 0)
+		return false;
 
 	pr_info("Detected %lu.%03lu MHz processor\n",
-		(unsigned long)cpu_khz / 1000,
-		(unsigned long)cpu_khz % 1000);
+		(unsigned long)cpu_khz / KHZ,
+		(unsigned long)cpu_khz % KHZ);
 
 	if (cpu_khz != tsc_khz) {
 		pr_info("Detected %lu.%03lu MHz TSC",
-			(unsigned long)tsc_khz / 1000,
-			(unsigned long)tsc_khz % 1000);
+			(unsigned long)tsc_khz / KHZ,
+			(unsigned long)tsc_khz % KHZ);
+	}
+	return true;
+}
+
+static unsigned long __init get_loops_per_jiffy(void)
+{
+	unsigned long lpj = tsc_khz * KHZ;
+
+	do_div(lpj, HZ);
+	return lpj;
+}
+
+void __init tsc_early_init(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_TSC))
+		return;
+	if (!determine_cpu_tsc_frequencies())
+		return;
+	loops_per_jiffy = get_loops_per_jiffy();
+}
+
+void __init tsc_init(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_TSC)) {
+		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+		return;
+	}
+
+	if (!tsc_khz) {
+		/* We failed to determine frequencies earlier, try again */
+		if (!determine_cpu_tsc_frequencies()) {
+			mark_tsc_unstable("could not calculate TSC khz");
+			setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+			return;
+		}
 	}
 
 	/* Sanitize TSC ADJUST before cyc2ns gets initialized */
@@ -1413,10 +1423,7 @@ void __init tsc_init(void)
 	if (!no_sched_irq_time)
 		enable_sched_clock_irqtime();
 
-	lpj = ((u64)tsc_khz * 1000);
-	do_div(lpj, HZ);
-	lpj_fine = lpj;
-
+	lpj_fine = get_loops_per_jiffy();
 	use_tsc_delay();
 
 	check_system_tsc_reliable();
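The loops_per_jiffy value the patch derives is just the TSC rate divided by the tick rate. A minimal user-space sketch of that arithmetic (HZ=250 is an assumed CONFIG_HZ value; the parameterized helper stands in for the kernel's global tsc_khz):

```c
#include <assert.h>

#define KHZ	1000UL
#define HZ	250UL	/* assumed CONFIG_HZ value */

/* Sketch of get_loops_per_jiffy(): convert the TSC frequency in kHz
 * into the number of TSC cycles that elapse in one scheduler tick. */
static unsigned long get_loops_per_jiffy(unsigned long tsc_khz)
{
	unsigned long lpj = tsc_khz * KHZ;	/* TSC cycles per second */

	return lpj / HZ;			/* cycles per jiffy */
}
```

For a hypothetical 2.5 GHz TSC (tsc_khz = 2500000), this yields 10,000,000 cycles per jiffy at HZ=250.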

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [tip:x86/timers] x86/tsc: Initialize cyc2ns when tsc frequency is determined
  2018-07-19 20:55 ` [PATCH v15 20/26] x86/tsc: initialize cyc2ns when tsc freq. is determined Pavel Tatashin
@ 2018-07-19 22:31   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:31 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: mingo, tglx, pasha.tatashin, hpa, linux-kernel

Commit-ID:  e2a9ca29b5edc89da2fddeae30e1070b272395c5
Gitweb:     https://git.kernel.org/tip/e2a9ca29b5edc89da2fddeae30e1070b272395c5
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:39 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:42 +0200

x86/tsc: Initialize cyc2ns when tsc frequency is determined

cyc2ns converts tsc to nanoseconds, and it is handled in a per-cpu data
structure.

Currently, the setup code for the c2ns data of every possible CPU goes through
the same sequence of calculations as for the boot CPU, even though it is based
on the same tsc frequency as the boot CPU, so the repeated calculation is
unnecessary.

Initialize the c2ns data for the boot CPU when the tsc frequency is
determined. Copy the calculated data from the boot CPU to the other CPUs in
tsc_init().

In addition do the following:

 - Remove unnecessary zeroing of c2ns data by removing cyc2ns_data_init()

 - Split set_cyc2ns_scale() into two functions, so set_cyc2ns_scale() can be
   called when the system is up, wrapping __set_cyc2ns_scale(), which can be
   called directly while the system is booting and avoids the IRQ save/restore
   and the idle sleep/wakeup events.
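The split above is a common wrapper pattern: the raw update lives in an inner function, and the runtime entry point adds the IRQ and idle-event bookkeeping around it. A hedged user-space sketch, with hypothetical flags and counters standing in for the kernel's IRQ state and sched-clock idle events:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's bookkeeping. */
static bool irqs_disabled_flag;
static int sleep_events, wakeup_events;
static unsigned long scale;	/* stands in for the per-cpu cyc2ns data */

/* Raw update: safe to call directly during early boot, before the
 * IRQ and sched-clock idle machinery matter. */
static void __set_scale(unsigned long khz)
{
	scale = khz;
}

/* Runtime wrapper: disable IRQs and bracket the update with idle
 * sleep/wakeup events, as set_cyc2ns_scale() does in the patch. */
static void set_scale(unsigned long khz)
{
	irqs_disabled_flag = true;
	sleep_events++;

	if (khz)
		__set_scale(khz);

	wakeup_events++;
	irqs_disabled_flag = false;
}
```

Note that a zero khz skips the update but still performs the sleep/wakeup bracketing, matching the restructured done-label-free flow in the diff.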

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-21-pasha.tatashin@oracle.com

---
 arch/x86/kernel/tsc.c | 94 +++++++++++++++++++++++++++++----------------------
 1 file changed, 53 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 4cab2236169e..7ea0718a4c75 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -103,23 +103,6 @@ void cyc2ns_read_end(void)
  *                      -johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
 
-static void cyc2ns_data_init(struct cyc2ns_data *data)
-{
-	data->cyc2ns_mul = 0;
-	data->cyc2ns_shift = 0;
-	data->cyc2ns_offset = 0;
-}
-
-static void __init cyc2ns_init(int cpu)
-{
-	struct cyc2ns *c2n = &per_cpu(cyc2ns, cpu);
-
-	cyc2ns_data_init(&c2n->data[0]);
-	cyc2ns_data_init(&c2n->data[1]);
-
-	seqcount_init(&c2n->seq);
-}
-
 static inline unsigned long long cycles_2_ns(unsigned long long cyc)
 {
 	struct cyc2ns_data data;
@@ -135,18 +118,11 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
 	return ns;
 }
 
-static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
 {
 	unsigned long long ns_now;
 	struct cyc2ns_data data;
 	struct cyc2ns *c2n;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	sched_clock_idle_sleep_event();
-
-	if (!khz)
-		goto done;
 
 	ns_now = cycles_2_ns(tsc_now);
 
@@ -178,12 +154,55 @@ static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_
 	c2n->data[0] = data;
 	raw_write_seqcount_latch(&c2n->seq);
 	c2n->data[1] = data;
+}
+
+static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	sched_clock_idle_sleep_event();
+
+	if (khz)
+		__set_cyc2ns_scale(khz, cpu, tsc_now);
 
-done:
 	sched_clock_idle_wakeup_event();
 	local_irq_restore(flags);
 }
 
+/*
+ * Initialize cyc2ns for boot cpu
+ */
+static void __init cyc2ns_init_boot_cpu(void)
+{
+	struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
+
+	seqcount_init(&c2n->seq);
+	__set_cyc2ns_scale(tsc_khz, smp_processor_id(), rdtsc());
+}
+
+/*
+ * Secondary CPUs do not run through cyc2ns_init(), so set up
+ * all the scale factors for all CPUs, assuming the same
+ * speed as the bootup CPU. (cpufreq notifiers will fix this
+ * up if their speed diverges)
+ */
+static void __init cyc2ns_init_secondary_cpus(void)
+{
+	unsigned int cpu, this_cpu = smp_processor_id();
+	struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
+	struct cyc2ns_data *data = c2n->data;
+
+	for_each_possible_cpu(cpu) {
+		if (cpu != this_cpu) {
+			seqcount_init(&c2n->seq);
+			c2n = per_cpu_ptr(&cyc2ns, cpu);
+			c2n->data[0] = data[0];
+			c2n->data[1] = data[1];
+		}
+	}
+}
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  */
@@ -1385,6 +1404,10 @@ void __init tsc_early_init(void)
 	if (!determine_cpu_tsc_frequencies())
 		return;
 	loops_per_jiffy = get_loops_per_jiffy();
+
+	/* Sanitize TSC ADJUST before cyc2ns gets initialized */
+	tsc_store_and_check_tsc_adjust(true);
+	cyc2ns_init_boot_cpu();
 }
 
 void __init tsc_init(void)
@@ -1401,23 +1424,12 @@ void __init tsc_init(void)
 			setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
 			return;
 		}
+		/* Sanitize TSC ADJUST before cyc2ns gets initialized */
+		tsc_store_and_check_tsc_adjust(true);
+		cyc2ns_init_boot_cpu();
 	}
 
-	/* Sanitize TSC ADJUST before cyc2ns gets initialized */
-	tsc_store_and_check_tsc_adjust(true);
-
-	/*
-	 * Secondary CPUs do not run through tsc_init(), so set up
-	 * all the scale factors for all CPUs, assuming the same
-	 * speed as the bootup CPU. (cpufreq notifiers will fix this
-	 * up if their speed diverges)
-	 */
-	cyc = rdtsc();
-	for_each_possible_cpu(cpu) {
-		cyc2ns_init(cpu);
-		set_cyc2ns_scale(tsc_khz, cpu, cyc);
-	}
-
+	cyc2ns_init_secondary_cpus();
 	static_branch_enable(&__use_tsc);
 
 	if (!no_sched_irq_time)


* [tip:x86/timers] x86/tsc: Use TSC as sched clock early
  2018-07-19 20:55 ` [PATCH v15 21/26] x86/tsc: use tsc early Pavel Tatashin
@ 2018-07-19 22:32   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:32 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, mingo, tglx, pasha.tatashin, hpa

Commit-ID:  4763f03d3d186ce8a1125844790152d76804ad60
Gitweb:     https://git.kernel.org/tip/4763f03d3d186ce8a1125844790152d76804ad60
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:40 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:42 +0200

x86/tsc: Use TSC as sched clock early

All prerequisites for enabling TSC as sched clock early in the boot
process are available now:

 - Early attempt of TSC calibration

 - Early availability of static branch patching

If TSC frequency can be established in the early calibration, enable the
static key which switches sched clock to use TSC.

[ tglx: Massaged changelog ]

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-22-pasha.tatashin@oracle.com

---
 arch/x86/kernel/tsc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 7ea0718a4c75..9277ae9b68b3 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1408,6 +1408,7 @@ void __init tsc_early_init(void)
 	/* Sanitize TSC ADJUST before cyc2ns gets initialized */
 	tsc_store_and_check_tsc_adjust(true);
 	cyc2ns_init_boot_cpu();
+	static_branch_enable(&__use_tsc);
 }
 
 void __init tsc_init(void)


* [tip:x86/timers] sched/clock: Move sched clock initialization and merge with generic clock
  2018-07-19 20:55 ` [PATCH v15 22/26] sched: move sched clock initialization and merge with generic clock Pavel Tatashin
@ 2018-07-19 22:32   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:32 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: tglx, hpa, linux-kernel, pasha.tatashin, peterz, mingo

Commit-ID:  5d2a4e91a541cb04d20d11602f0f9340291322ac
Gitweb:     https://git.kernel.org/tip/5d2a4e91a541cb04d20d11602f0f9340291322ac
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:41 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:43 +0200

sched/clock: Move sched clock initialization and merge with generic clock

sched_clock_postinit() initializes a generic clock on systems where no
other clock is provided. This function may be called only after
timekeeping_init().

Rename sched_clock_postinit() to generic_sched_clock_init() and call it from
sched_clock_init(). Move the call to sched_clock_init() until after
time_init().

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-23-pasha.tatashin@oracle.com

---
 include/linux/sched_clock.h |  5 ++---
 init/main.c                 |  4 ++--
 kernel/sched/clock.c        | 27 +++++++++++++++++----------
 kernel/sched/core.c         |  1 -
 kernel/time/sched_clock.c   |  2 +-
 5 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched_clock.h b/include/linux/sched_clock.h
index 411b52e424e1..abe28d5cb3f4 100644
--- a/include/linux/sched_clock.h
+++ b/include/linux/sched_clock.h
@@ -9,17 +9,16 @@
 #define LINUX_SCHED_CLOCK
 
 #ifdef CONFIG_GENERIC_SCHED_CLOCK
-extern void sched_clock_postinit(void);
+extern void generic_sched_clock_init(void);
 
 extern void sched_clock_register(u64 (*read)(void), int bits,
 				 unsigned long rate);
 #else
-static inline void sched_clock_postinit(void) { }
+static inline void generic_sched_clock_init(void) { }
 
 static inline void sched_clock_register(u64 (*read)(void), int bits,
 					unsigned long rate)
 {
-	;
 }
 #endif
 
diff --git a/init/main.c b/init/main.c
index 3b4ada11ed52..162d931c9511 100644
--- a/init/main.c
+++ b/init/main.c
@@ -79,7 +79,7 @@
 #include <linux/pti.h>
 #include <linux/blkdev.h>
 #include <linux/elevator.h>
-#include <linux/sched_clock.h>
+#include <linux/sched/clock.h>
 #include <linux/sched/task.h>
 #include <linux/sched/task_stack.h>
 #include <linux/context_tracking.h>
@@ -642,7 +642,7 @@ asmlinkage __visible void __init start_kernel(void)
 	softirq_init();
 	timekeeping_init();
 	time_init();
-	sched_clock_postinit();
+	sched_clock_init();
 	printk_safe_init();
 	perf_event_init();
 	profile_init();
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 10c83e73837a..0e9dbb2d9aea 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -53,6 +53,7 @@
  *
  */
 #include "sched.h"
+#include <linux/sched_clock.h>
 
 /*
  * Scheduler clock - returns current time in nanosec units.
@@ -68,11 +69,6 @@ EXPORT_SYMBOL_GPL(sched_clock);
 
 __read_mostly int sched_clock_running;
 
-void sched_clock_init(void)
-{
-	sched_clock_running = 1;
-}
-
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 /*
  * We must start with !__sched_clock_stable because the unstable -> stable
@@ -199,6 +195,15 @@ void clear_sched_clock_stable(void)
 		__clear_sched_clock_stable();
 }
 
+static void __sched_clock_gtod_offset(void)
+{
+	__gtod_offset = (sched_clock() + __sched_clock_offset) - ktime_get_ns();
+}
+
+void __init sched_clock_init(void)
+{
+	sched_clock_running = 1;
+}
 /*
  * We run this as late_initcall() such that it runs after all built-in drivers,
  * notably: acpi_processor and intel_idle, which can mark the TSC as unstable.
@@ -385,8 +390,6 @@ void sched_clock_tick(void)
 
 void sched_clock_tick_stable(void)
 {
-	u64 gtod, clock;
-
 	if (!sched_clock_stable())
 		return;
 
@@ -398,9 +401,7 @@ void sched_clock_tick_stable(void)
 	 * TSC to be unstable, any computation will be computing crap.
 	 */
 	local_irq_disable();
-	gtod = ktime_get_ns();
-	clock = sched_clock();
-	__gtod_offset = (clock + __sched_clock_offset) - gtod;
+	__sched_clock_gtod_offset();
 	local_irq_enable();
 }
 
@@ -434,6 +435,12 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
 
 #else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
 
+void __init sched_clock_init(void)
+{
+	sched_clock_running = 1;
+	generic_sched_clock_init();
+}
+
 u64 sched_clock_cpu(int cpu)
 {
 	if (unlikely(!sched_clock_running))
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe365c9a08e9..552406e9713b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5954,7 +5954,6 @@ void __init sched_init(void)
 	int i, j;
 	unsigned long alloc_size = 0, ptr;
 
-	sched_clock_init();
 	wait_bit_init();
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 2d8f05aad442..cbc72c2c1fca 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -237,7 +237,7 @@ sched_clock_register(u64 (*read)(void), int bits, unsigned long rate)
 	pr_debug("Registered %pF as sched_clock source\n", read);
 }
 
-void __init sched_clock_postinit(void)
+void __init generic_sched_clock_init(void)
 {
 	/*
 	 * If no sched_clock() function has been provided at that point,


* [tip:x86/timers] sched/clock: Enable sched clock early
  2018-07-19 20:55 ` [PATCH v15 23/26] sched: early boot clock Pavel Tatashin
@ 2018-07-19 22:33   ` tip-bot for Pavel Tatashin
  2018-07-24 19:52     ` Guenter Roeck
  2018-07-20  8:09   ` [PATCH v15 23/26] sched: early boot clock Peter Zijlstra
  2018-11-06  5:42   ` [PATCH v15 23/26] sched: early boot clock Dominique Martinet
  2 siblings, 1 reply; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:33 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: pasha.tatashin, linux-kernel, hpa, mingo, tglx

Commit-ID:  857baa87b6422bcfb84ed3631d6839920cb5b09d
Gitweb:     https://git.kernel.org/tip/857baa87b6422bcfb84ed3631d6839920cb5b09d
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:42 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:43 +0200

sched/clock: Enable sched clock early

Allow sched_clock() to be used before sched_clock_init() is called.  This
provides a way to get early boot timestamps on machines with unstable
clocks.
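The key step in the diff below is computing __gtod_offset so that the generic sched-clock path continues exactly where the raw clock left off. A sketch of that offset arithmetic, with hypothetical nanosecond readings at the switch point:

```c
#include <assert.h>

/* Hypothetical clock readings at the moment of the switch, in ns. */
static unsigned long long raw_sched_clock = 5000000000ULL; /* raw TSC clock */
static unsigned long long sched_clock_offset = 100ULL;
static unsigned long long ktime_ns = 1000000ULL;           /* gtod time */
static unsigned long long gtod_offset;

/* Mirrors __sched_clock_gtod_offset(): choose the offset so that
 * (ktime_ns + gtod_offset) equals the raw clock reading, making the
 * post-init clock continuous with the early one. */
static void compute_gtod_offset(void)
{
	gtod_offset = (raw_sched_clock + sched_clock_offset) - ktime_ns;
}
```

With these sample values the two clock expressions agree at the switch, so timestamps do not jump when sched_clock_running is set.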

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-24-pasha.tatashin@oracle.com

---
 init/main.c          |  2 +-
 kernel/sched/clock.c | 20 +++++++++++++++++++-
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/init/main.c b/init/main.c
index 162d931c9511..ff0a24170b95 100644
--- a/init/main.c
+++ b/init/main.c
@@ -642,7 +642,6 @@ asmlinkage __visible void __init start_kernel(void)
 	softirq_init();
 	timekeeping_init();
 	time_init();
-	sched_clock_init();
 	printk_safe_init();
 	perf_event_init();
 	profile_init();
@@ -697,6 +696,7 @@ asmlinkage __visible void __init start_kernel(void)
 	acpi_early_init();
 	if (late_time_init)
 		late_time_init();
+	sched_clock_init();
 	calibrate_delay();
 	pid_idr_init();
 	anon_vma_init();
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 0e9dbb2d9aea..422cd63f8f17 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -202,7 +202,25 @@ static void __sched_clock_gtod_offset(void)
 
 void __init sched_clock_init(void)
 {
+	unsigned long flags;
+
+	/*
+	 * Set __gtod_offset such that once we mark sched_clock_running,
+	 * sched_clock_tick() continues where sched_clock() left off.
+	 *
+	 * Even if TSC is buggered, we're still UP at this point so it
+	 * can't really be out of sync.
+	 */
+	local_irq_save(flags);
+	__sched_clock_gtod_offset();
+	local_irq_restore(flags);
+
 	sched_clock_running = 1;
+
+	/* Now that sched_clock_running is set adjust scd */
+	local_irq_save(flags);
+	sched_clock_tick();
+	local_irq_restore(flags);
 }
 /*
  * We run this as late_initcall() such that it runs after all built-in drivers,
@@ -356,7 +374,7 @@ u64 sched_clock_cpu(int cpu)
 		return sched_clock() + __sched_clock_offset;
 
 	if (unlikely(!sched_clock_running))
-		return 0ull;
+		return sched_clock();
 
 	preempt_disable_notrace();
 	scd = cpu_sdc(cpu);


* [tip:x86/timers] sched/clock: Use static key for sched_clock_running
  2018-07-19 20:55 ` [PATCH v15 24/26] sched: use static key for sched_clock_running Pavel Tatashin
@ 2018-07-19 22:33   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:33 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: tglx, linux-kernel, peterz, mingo, pasha.tatashin, hpa

Commit-ID:  46457ea464f5341d1f9dad8dd213805d45f7f117
Gitweb:     https://git.kernel.org/tip/46457ea464f5341d1f9dad8dd213805d45f7f117
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:43 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:43 +0200

sched/clock: Use static key for sched_clock_running

sched_clock_running may be read every time sched_clock_cpu() is called.
Yet, this variable is updated only twice during boot and never changes
again; therefore it is better to make it a static key.
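The diff relies on the static key's counter semantics: static_branch_inc() is called once in sched_clock_init() and again in sched_clock_init_late(), and the old "== 2" test becomes a check on the key's count. A hedged user-space model of just that counting behavior (a plain integer stands in for the patched branch):

```c
#include <assert.h>

/* Minimal model of DEFINE_STATIC_KEY_FALSE semantics: each
 * static_branch_inc() bumps the count, and the branch reads as
 * "enabled" whenever the count is positive. */
static int sched_clock_running_count;

static void model_static_branch_inc(void)
{
	sched_clock_running_count++;
}

static int model_branch_enabled(void)
{
	return sched_clock_running_count > 0;
}

static void model_sched_clock_init(void)	/* early boot */
{
	model_static_branch_inc();
}

static void model_sched_clock_init_late(void)	/* late_initcall */
{
	model_static_branch_inc();
}
```

After the late initcall the count reaches 2, which is what the patched clear_sched_clock_stable() tests for via static_key_count().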

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-25-pasha.tatashin@oracle.com

---
 kernel/sched/clock.c | 16 ++++++++--------
 kernel/sched/debug.c |  2 --
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 422cd63f8f17..c5c47ad3f386 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -67,7 +67,7 @@ unsigned long long __weak sched_clock(void)
 }
 EXPORT_SYMBOL_GPL(sched_clock);
 
-__read_mostly int sched_clock_running;
+static DEFINE_STATIC_KEY_FALSE(sched_clock_running);
 
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 /*
@@ -191,7 +191,7 @@ void clear_sched_clock_stable(void)
 
 	smp_mb(); /* matches sched_clock_init_late() */
 
-	if (sched_clock_running == 2)
+	if (static_key_count(&sched_clock_running.key) == 2)
 		__clear_sched_clock_stable();
 }
 
@@ -215,7 +215,7 @@ void __init sched_clock_init(void)
 	__sched_clock_gtod_offset();
 	local_irq_restore(flags);
 
-	sched_clock_running = 1;
+	static_branch_inc(&sched_clock_running);
 
 	/* Now that sched_clock_running is set adjust scd */
 	local_irq_save(flags);
@@ -228,7 +228,7 @@ void __init sched_clock_init(void)
  */
 static int __init sched_clock_init_late(void)
 {
-	sched_clock_running = 2;
+	static_branch_inc(&sched_clock_running);
 	/*
 	 * Ensure that it is impossible to not do a static_key update.
 	 *
@@ -373,7 +373,7 @@ u64 sched_clock_cpu(int cpu)
 	if (sched_clock_stable())
 		return sched_clock() + __sched_clock_offset;
 
-	if (unlikely(!sched_clock_running))
+	if (!static_branch_unlikely(&sched_clock_running))
 		return sched_clock();
 
 	preempt_disable_notrace();
@@ -396,7 +396,7 @@ void sched_clock_tick(void)
 	if (sched_clock_stable())
 		return;
 
-	if (unlikely(!sched_clock_running))
+	if (!static_branch_unlikely(&sched_clock_running))
 		return;
 
 	lockdep_assert_irqs_disabled();
@@ -455,13 +455,13 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
 
 void __init sched_clock_init(void)
 {
-	sched_clock_running = 1;
+	static_branch_inc(&sched_clock_running);
 	generic_sched_clock_init();
 }
 
 u64 sched_clock_cpu(int cpu)
 {
-	if (unlikely(!sched_clock_running))
+	if (!static_branch_unlikely(&sched_clock_running))
 		return 0;
 
 	return sched_clock();
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e593b4118578..b0212f489a33 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -623,8 +623,6 @@ void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
 #undef PU
 }
 
-extern __read_mostly int sched_clock_running;
-
 static void print_cpu(struct seq_file *m, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);


* Re: [PATCH v15 00/26] Early boot time stamps
  2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
                   ` (25 preceding siblings ...)
  2018-07-19 20:55 ` [PATCH v15 26/26] x86/tsc: use tsc_calibrate_cpu_early and pit_hpet_ptimer_calibrate_cpu Pavel Tatashin
@ 2018-07-19 22:34 ` Thomas Gleixner
  26 siblings, 0 replies; 76+ messages in thread
From: Thomas Gleixner @ 2018-07-19 22:34 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	hpa, douly.fnst, peterz, prarit, feng.tang, pmladek, gnomes,
	linux-s390, boris.ostrovsky, jgross, pbonzini

Pavel,

On Thu, 19 Jul 2018, Pavel Tatashin wrote:

> changelog
> ---------
> v15 - v14

I've applied the series to tip:x86/timers and pushed it out.

Thanks for the patience and for going the extra miles to reach your initial
goal of early TSC timestamps. The overall result looks very reasonable and
is not only a functional improvement: Quite some old ballast and duct tape
has been cleaned up on the way.

Thanks,

	tglx


* [tip:x86/timers] x86/tsc: Split native_calibrate_cpu() into early and late parts
  2018-07-19 20:55 ` [PATCH v15 25/26] x86/tsc: split native_calibrate_cpu() into early and late parts Pavel Tatashin
@ 2018-07-19 22:34   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:34 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, mingo, tglx, hpa, pasha.tatashin

Commit-ID:  03821f451d2d2d7599061244734245be139014ea
Gitweb:     https://git.kernel.org/tip/03821f451d2d2d7599061244734245be139014ea
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:44 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:44 +0200

x86/tsc: Split native_calibrate_cpu() into early and late parts

During early boot, TSC and CPU frequency can be calibrated using the MSR,
CPUID, and quick PIT calibration methods. The other methods (PIT/HPET/PMTIMER)
are available only after ACPI is initialized.

Split native_calibrate_cpu() into early and late parts so they can be
called separately during early and late tsc calibration.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-26-pasha.tatashin@oracle.com

---
 arch/x86/include/asm/tsc.h |  1 +
 arch/x86/kernel/tsc.c      | 54 ++++++++++++++++++++++++++++++----------------
 2 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index c4368ff73652..88140e4f2292 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -40,6 +40,7 @@ extern int unsynchronized_tsc(void);
 extern int check_tsc_unstable(void);
 extern void mark_tsc_async_resets(char *reason);
 extern unsigned long native_calibrate_cpu(void);
+extern unsigned long native_calibrate_cpu_early(void);
 extern unsigned long native_calibrate_tsc(void);
 extern unsigned long long native_sched_clock_from_tsc(u64 tsc);
 
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 9277ae9b68b3..60586779b02c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -680,30 +680,17 @@ static unsigned long cpu_khz_from_cpuid(void)
 	return eax_base_mhz * 1000;
 }
 
-/**
- * native_calibrate_cpu - calibrate the cpu on boot
+/*
+ * calibrate cpu using pit, hpet, and ptimer methods. They are available
+ * later in boot after acpi is initialized.
  */
-unsigned long native_calibrate_cpu(void)
+static unsigned long pit_hpet_ptimer_calibrate_cpu(void)
 {
 	u64 tsc1, tsc2, delta, ref1, ref2;
 	unsigned long tsc_pit_min = ULONG_MAX, tsc_ref_min = ULONG_MAX;
-	unsigned long flags, latch, ms, fast_calibrate;
+	unsigned long flags, latch, ms;
 	int hpet = is_hpet_enabled(), i, loopmin;
 
-	fast_calibrate = cpu_khz_from_cpuid();
-	if (fast_calibrate)
-		return fast_calibrate;
-
-	fast_calibrate = cpu_khz_from_msr();
-	if (fast_calibrate)
-		return fast_calibrate;
-
-	local_irq_save(flags);
-	fast_calibrate = quick_pit_calibrate();
-	local_irq_restore(flags);
-	if (fast_calibrate)
-		return fast_calibrate;
-
 	/*
 	 * Run 5 calibration loops to get the lowest frequency value
 	 * (the best estimate). We use two different calibration modes
@@ -846,6 +833,37 @@ unsigned long native_calibrate_cpu(void)
 	return tsc_pit_min;
 }
 
+/**
+ * native_calibrate_cpu_early - can calibrate the cpu early in boot
+ */
+unsigned long native_calibrate_cpu_early(void)
+{
+	unsigned long flags, fast_calibrate = cpu_khz_from_cpuid();
+
+	if (!fast_calibrate)
+		fast_calibrate = cpu_khz_from_msr();
+	if (!fast_calibrate) {
+		local_irq_save(flags);
+		fast_calibrate = quick_pit_calibrate();
+		local_irq_restore(flags);
+	}
+	return fast_calibrate;
+}
+
+
+/**
+ * native_calibrate_cpu - calibrate the cpu
+ */
+unsigned long native_calibrate_cpu(void)
+{
+	unsigned long tsc_freq = native_calibrate_cpu_early();
+
+	if (!tsc_freq)
+		tsc_freq = pit_hpet_ptimer_calibrate_cpu();
+
+	return tsc_freq;
+}
+
 void recalibrate_cpu_khz(void)
 {
 #ifndef CONFIG_SMP


* [tip:x86/timers] x86/tsc: Make use of tsc_calibrate_cpu_early()
  2018-07-19 20:55 ` [PATCH v15 26/26] x86/tsc: use tsc_calibrate_cpu_early and pit_hpet_ptimer_calibrate_cpu Pavel Tatashin
@ 2018-07-19 22:35   ` tip-bot for Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Pavel Tatashin @ 2018-07-19 22:35 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hpa, tglx, pasha.tatashin, mingo, linux-kernel

Commit-ID:  8dbe438589f373544a1af8b4a859e4da853c0f90
Gitweb:     https://git.kernel.org/tip/8dbe438589f373544a1af8b4a859e4da853c0f90
Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
AuthorDate: Thu, 19 Jul 2018 16:55:45 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 00:02:44 +0200

x86/tsc: Make use of tsc_calibrate_cpu_early()

During early boot, enable tsc_calibrate_cpu_early() and switch to
tsc_calibrate_cpu() only later. Do this unconditionally, because it is
unknown what methods other CPUs will use to calibrate once they are
onlined.

If the tsc frequency is still unknown by the time tsc_init() is called, use
only pit_hpet_ptimer_calibrate_cpu() to calibrate, as this function contains
the only methods which have not been tried earlier.
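The early/late distinction ends up as a boolean parameter to the frequency-determination helper. A simplified model of that gating (the stub frequencies and the tsc-falls-back-to-cpu rule are assumptions; the real function also reconciles diverging cpu_khz/tsc_khz values):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical calibration results, in kHz; 0 means unavailable. */
static unsigned long early_cpu_method_khz, late_cpu_method_khz;
static unsigned long tsc_method_khz;
static unsigned long cpu_khz, tsc_khz;

/* Loosely mirrors determine_cpu_tsc_frequencies(bool early): the early
 * pass may use only the fast CPU-calibration path, while the late pass
 * may also use PIT/HPET/PMTIMER. */
static bool determine_cpu_tsc_frequencies(bool early)
{
	cpu_khz = early ? early_cpu_method_khz : late_cpu_method_khz;
	tsc_khz = tsc_method_khz;

	if (!tsc_khz)
		tsc_khz = cpu_khz;	/* fall back to the CPU frequency */

	return tsc_khz != 0;
}
```

An early pass that fails leaves tsc_khz at zero, which is exactly the condition tsc_init() checks before retrying with the late methods.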

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-27-pasha.tatashin@oracle.com

---
 arch/x86/include/asm/tsc.h |  1 -
 arch/x86/kernel/tsc.c      | 25 +++++++++++++++++++------
 arch/x86/kernel/x86_init.c |  2 +-
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 88140e4f2292..eb5bbfeccb66 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -39,7 +39,6 @@ extern void mark_tsc_unstable(char *reason);
 extern int unsynchronized_tsc(void);
 extern int check_tsc_unstable(void);
 extern void mark_tsc_async_resets(char *reason);
-extern unsigned long native_calibrate_cpu(void);
 extern unsigned long native_calibrate_cpu_early(void);
 extern unsigned long native_calibrate_tsc(void);
 extern unsigned long long native_sched_clock_from_tsc(u64 tsc);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 60586779b02c..02e416b87ac1 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -854,7 +854,7 @@ unsigned long native_calibrate_cpu_early(void)
 /**
  * native_calibrate_cpu - calibrate the cpu
  */
-unsigned long native_calibrate_cpu(void)
+static unsigned long native_calibrate_cpu(void)
 {
 	unsigned long tsc_freq = native_calibrate_cpu_early();
 
@@ -1374,13 +1374,19 @@ unreg:
  */
 device_initcall(init_tsc_clocksource);
 
-static bool __init determine_cpu_tsc_frequencies(void)
+static bool __init determine_cpu_tsc_frequencies(bool early)
 {
 	/* Make sure that cpu and tsc are not already calibrated */
 	WARN_ON(cpu_khz || tsc_khz);
 
-	cpu_khz = x86_platform.calibrate_cpu();
-	tsc_khz = x86_platform.calibrate_tsc();
+	if (early) {
+		cpu_khz = x86_platform.calibrate_cpu();
+		tsc_khz = x86_platform.calibrate_tsc();
+	} else {
+		/* We should not be here with non-native cpu calibration */
+		WARN_ON(x86_platform.calibrate_cpu != native_calibrate_cpu);
+		cpu_khz = pit_hpet_ptimer_calibrate_cpu();
+	}
 
 	/*
 	 * Trust non-zero tsc_khz as authorative,
@@ -1419,7 +1425,7 @@ void __init tsc_early_init(void)
 {
 	if (!boot_cpu_has(X86_FEATURE_TSC))
 		return;
-	if (!determine_cpu_tsc_frequencies())
+	if (!determine_cpu_tsc_frequencies(true))
 		return;
 	loops_per_jiffy = get_loops_per_jiffy();
 
@@ -1431,6 +1437,13 @@ void __init tsc_early_init(void)
 
 void __init tsc_init(void)
 {
+	/*
+	 * native_calibrate_cpu_early can only calibrate using methods that are
+	 * available early in boot.
+	 */
+	if (x86_platform.calibrate_cpu == native_calibrate_cpu_early)
+		x86_platform.calibrate_cpu = native_calibrate_cpu;
+
 	if (!boot_cpu_has(X86_FEATURE_TSC)) {
 		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
 		return;
@@ -1438,7 +1451,7 @@ void __init tsc_init(void)
 
 	if (!tsc_khz) {
 		/* We failed to determine frequencies earlier, try again */
-		if (!determine_cpu_tsc_frequencies()) {
+		if (!determine_cpu_tsc_frequencies(false)) {
 			mark_tsc_unstable("could not calculate TSC khz");
 			setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
 			return;
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 3ab867603e81..2792b5573818 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -109,7 +109,7 @@ struct x86_cpuinit_ops x86_cpuinit = {
 static void default_nmi_init(void) { };
 
 struct x86_platform_ops x86_platform __ro_after_init = {
-	.calibrate_cpu			= native_calibrate_cpu,
+	.calibrate_cpu			= native_calibrate_cpu_early,
 	.calibrate_tsc			= native_calibrate_tsc,
 	.get_wallclock			= mach_get_cmos_time,
 	.set_wallclock			= mach_set_rtc_mmss,

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v15 23/26] sched: early boot clock
  2018-07-19 20:55 ` [PATCH v15 23/26] sched: early boot clock Pavel Tatashin
  2018-07-19 22:33   ` [tip:x86/timers] sched/clock: Enable sched clock early tip-bot for Pavel Tatashin
@ 2018-07-20  8:09   ` Peter Zijlstra
  2018-07-20 10:00     ` [tip:x86/timers] sched/clock: Close a hole in sched_clock_init() tip-bot for Peter Zijlstra
  2018-11-06  5:42   ` [PATCH v15 23/26] sched: early boot clock Dominique Martinet
  2 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2018-07-20  8:09 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, prarit, feng.tang, pmladek, gnomes,
	linux-s390, boris.ostrovsky, jgross, pbonzini

On Thu, Jul 19, 2018 at 04:55:42PM -0400, Pavel Tatashin wrote:
> diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
> index 0e9dbb2d9aea..422cd63f8f17 100644
> --- a/kernel/sched/clock.c
> +++ b/kernel/sched/clock.c
> @@ -202,7 +202,25 @@ static void __sched_clock_gtod_offset(void)
>  
>  void __init sched_clock_init(void)
>  {
> +	unsigned long flags;
> +
> +	/*
> +	 * Set __gtod_offset such that once we mark sched_clock_running,
> +	 * sched_clock_tick() continues where sched_clock() left off.
> +	 *
> +	 * Even if TSC is buggered, we're still UP at this point so it
> +	 * can't really be out of sync.
> +	 */
> +	local_irq_save(flags);
> +	__sched_clock_gtod_offset();
> +	local_irq_restore(flags);
> +
>  	sched_clock_running = 1;
> +
> +	/* Now that sched_clock_running is set adjust scd */
> +	local_irq_save(flags);
> +	sched_clock_tick();
> +	local_irq_restore(flags);
>  }

Sorry, that's still wrong. Because the moment you enable
sched_clock_running we need to have everything set-up for it to run.

The above looks double weird because you could've just done that =1
under the same IRQ-disable section and it would've mostly been OK
(except for NMIs). But the reason it's weird like that is because you're
going to change it into a static key later on.

The below cures things.

---
Subject: sched/clock: Close a hole in sched_clock_init()

All data required for the 'unstable' sched_clock must be set up _before_
enabling it -- setting sched_clock_running. This includes the
__gtod_offset but also a recent scd stamp.

Make the gtod-offset update also set the scd stamp -- it requires the
same two clock reads _anyway_. This doesn't hurt in the
sched_clock_tick_stable() case and ensures sched_clock_init() gets
everything set up before use.

Also switch to unconditional IRQ-disable/enable because the static key
stuff already requires that this not be run with IRQs disabled.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/clock.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index c5c47ad3f386..811a39aca1ce 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -197,13 +197,14 @@ void clear_sched_clock_stable(void)
 
 static void __sched_clock_gtod_offset(void)
 {
-	__gtod_offset = (sched_clock() + __sched_clock_offset) - ktime_get_ns();
+	struct sched_clock_data *scd = this_scd();
+
+	__scd_stamp(scd);
+	__gtod_offset = (scd->tick_raw + __sched_clock_offset) - scd->tick_gtod;
 }
 
 void __init sched_clock_init(void)
 {
-	unsigned long flags;
-
 	/*
 	 * Set __gtod_offset such that once we mark sched_clock_running,
 	 * sched_clock_tick() continues where sched_clock() left off.
@@ -211,16 +212,11 @@ void __init sched_clock_init(void)
 	 * Even if TSC is buggered, we're still UP at this point so it
 	 * can't really be out of sync.
 	 */
-	local_irq_save(flags);
+	local_irq_disable();
 	__sched_clock_gtod_offset();
-	local_irq_restore(flags);
+	local_irq_enable();
 
 	static_branch_inc(&sched_clock_running);
-
-	/* Now that sched_clock_running is set adjust scd */
-	local_irq_save(flags);
-	sched_clock_tick();
-	local_irq_restore(flags);
 }
 /*
  * We run this as late_initcall() such that it runs after all built-in drivers,


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [tip:x86/timers] sched/clock: Close a hole in sched_clock_init()
  2018-07-20  8:09   ` [PATCH v15 23/26] sched: early boot clock Peter Zijlstra
@ 2018-07-20 10:00     ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Peter Zijlstra @ 2018-07-20 10:00 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: mingo, tglx, peterz, linux-kernel, pasha.tatashin, hpa

Commit-ID:  9407f5a7ee77c631d1e100436132437cf6237e45
Gitweb:     https://git.kernel.org/tip/9407f5a7ee77c631d1e100436132437cf6237e45
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 20 Jul 2018 10:09:11 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 11:58:00 +0200

sched/clock: Close a hole in sched_clock_init()

All data required for the 'unstable' sched_clock must be set up _before_
enabling it -- setting sched_clock_running. This includes the
__gtod_offset but also a recent scd stamp.

Make the gtod-offset update also set the scd stamp -- it requires the
same two clock reads _anyway_. This doesn't hurt in the
sched_clock_tick_stable() case and ensures sched_clock_init() gets
everything set up before use.

Also switch to unconditional IRQ-disable/enable because the static key
stuff already requires that this not be run with IRQs disabled.

Fixes: 857baa87b642 ("sched/clock: Enable sched clock early")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180720080911.GM2494@hirez.programming.kicks-ass.net
---
 kernel/sched/clock.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index c5c47ad3f386..811a39aca1ce 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -197,13 +197,14 @@ void clear_sched_clock_stable(void)
 
 static void __sched_clock_gtod_offset(void)
 {
-	__gtod_offset = (sched_clock() + __sched_clock_offset) - ktime_get_ns();
+	struct sched_clock_data *scd = this_scd();
+
+	__scd_stamp(scd);
+	__gtod_offset = (scd->tick_raw + __sched_clock_offset) - scd->tick_gtod;
 }
 
 void __init sched_clock_init(void)
 {
-	unsigned long flags;
-
 	/*
 	 * Set __gtod_offset such that once we mark sched_clock_running,
 	 * sched_clock_tick() continues where sched_clock() left off.
@@ -211,16 +212,11 @@ void __init sched_clock_init(void)
 	 * Even if TSC is buggered, we're still UP at this point so it
 	 * can't really be out of sync.
 	 */
-	local_irq_save(flags);
+	local_irq_disable();
 	__sched_clock_gtod_offset();
-	local_irq_restore(flags);
+	local_irq_enable();
 
 	static_branch_inc(&sched_clock_running);
-
-	/* Now that sched_clock_running is set adjust scd */
-	local_irq_save(flags);
-	sched_clock_tick();
-	local_irq_restore(flags);
 }
 /*
  * We run this as late_initcall() such that it runs after all built-in drivers,

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [tip:x86/timers] sched/clock: Enable sched clock early
  2018-07-19 22:33   ` [tip:x86/timers] sched/clock: Enable sched clock early tip-bot for Pavel Tatashin
@ 2018-07-24 19:52     ` Guenter Roeck
  2018-07-24 20:22       ` Pavel Tatashin
  0 siblings, 1 reply; 76+ messages in thread
From: Guenter Roeck @ 2018-07-24 19:52 UTC (permalink / raw)
  To: linux-kernel, pasha.tatashin, mingo, tglx, hpa; +Cc: linux-tip-commits

Hi,

On Thu, Jul 19, 2018 at 03:33:21PM -0700, tip-bot for Pavel Tatashin wrote:
> Commit-ID:  857baa87b6422bcfb84ed3631d6839920cb5b09d
> Gitweb:     https://git.kernel.org/tip/857baa87b6422bcfb84ed3631d6839920cb5b09d
> Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
> AuthorDate: Thu, 19 Jul 2018 16:55:42 -0400
> Committer:  Thomas Gleixner <tglx@linutronix.de>
> CommitDate: Fri, 20 Jul 2018 00:02:43 +0200
> 
> sched/clock: Enable sched clock early
> 
> Allow sched_clock() to be used before sched_clock_init() is called.  This
> provides a way to get early boot timestamps on machines with unstable
> clocks.
> 

This patch causes a regression when running a qemu emulation with
arm:integratorcp.

...
Console: colour dummy device 80x30
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at kernel/time/sched_clock.c:180
sched_clock_register+0x44/0x278
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.18.0-rc6-next-20180724 #1
Hardware name: ARM Integrator/CP (Device Tree)
[<c0010cb4>] (unwind_backtrace) from [<c000dc24>] (show_stack+0x10/0x18)
[<c000dc24>] (show_stack) from [<c03ffb94>] (dump_stack+0x18/0x24)
[<c03ffb94>] (dump_stack) from [<c001a000>] (__warn+0xc8/0xf0)
[<c001a000>] (__warn) from [<c001a13c>] (warn_slowpath_null+0x3c/0x4c)
[<c001a13c>] (warn_slowpath_null) from [<c052367c>] (sched_clock_register+0x44/0x278)
[<c052367c>] (sched_clock_register) from [<c05238d8>] (generic_sched_clock_init+0x28/0x88)
[<c05238d8>] (generic_sched_clock_init) from [<c0521a00>] (sched_clock_init+0x54/0x74)
[<c0521a00>] (sched_clock_init) from [<c0519c18>] (start_kernel+0x310/0x3e4)
[<c0519c18>] (start_kernel) from [<00000000>] (  (null))
---[ end trace 08080eb81afa002c ]---
sched_clock: 32 bits at 100 Hz, resolution 10000000ns, wraps every 21474836475000000ns
...

A complete boot log is available at
http://kerneltests.org/builders/qemu-arm-next/builds/979/steps/qemubuildcommand/logs/stdio

Unfortunately, reverting the patch results in conflicts, so I am unable
to confirm that it is the only culprit.

From the context and from looking into the patch, it appears that this
can happen in any system if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is not
enabled.

Bisect log is attached.

Guenter

---
# bad: [3946cd385042069ec57d3f04240def53b4eed7e5] Add linux-next specific files for 20180724
# good: [d72e90f33aa4709ebecc5005562f52335e106a60] Linux 4.18-rc6
git bisect start 'HEAD' 'v4.18-rc6'
# good: [f5fa891e325acf096c0f79e1d1b922002e251e5a] Merge remote-tracking branch 'crypto/master'
git bisect good f5fa891e325acf096c0f79e1d1b922002e251e5a
# good: [cb6471f6bcfdacbeef9c23ba9dac00e67bd3c3a4] Merge remote-tracking branch 'spi/for-next'
git bisect good cb6471f6bcfdacbeef9c23ba9dac00e67bd3c3a4
# bad: [6b5bfa57bf4553d051be65d85d021465041406d8] Merge remote-tracking branch 'char-misc/char-misc-next'
git bisect bad 6b5bfa57bf4553d051be65d85d021465041406d8
# bad: [675a67e9ef3c041999f412cb75418d2b0def3854] Merge remote-tracking branch 'rcu/rcu/next'
git bisect bad 675a67e9ef3c041999f412cb75418d2b0def3854
# good: [e78b01a51131f25fc2d881bc43001575c129069c] Merge branch 'perf/core'
git bisect good e78b01a51131f25fc2d881bc43001575c129069c
# good: [4e581bce514f4107ce84525f0f75f89c92b4140e] Merge branch 'x86/cpu'
git bisect good 4e581bce514f4107ce84525f0f75f89c92b4140e
# good: [20fa22e90e54e2d21cace7ba083598531670f7cf] Merge branch 'x86/pti'
git bisect good 20fa22e90e54e2d21cace7ba083598531670f7cf
# good: [4763f03d3d186ce8a1125844790152d76804ad60] x86/tsc: Use TSC as sched clock early
git bisect good 4763f03d3d186ce8a1125844790152d76804ad60
# good: [5f9ef44c7d1c59d0eda1d86e31d981bdffe2a725] tools/memory-model: Rename litmus tests to comply to norm7
git bisect good 5f9ef44c7d1c59d0eda1d86e31d981bdffe2a725
# bad: [fc3d25e1c8f6a9232530db02a1072033e22e0fe3] Merge branch 'x86/timers'
git bisect bad fc3d25e1c8f6a9232530db02a1072033e22e0fe3
# bad: [46457ea464f5341d1f9dad8dd213805d45f7f117] sched/clock: Use static key for sched_clock_running
git bisect bad 46457ea464f5341d1f9dad8dd213805d45f7f117
# bad: [857baa87b6422bcfb84ed3631d6839920cb5b09d] sched/clock: Enable sched clock early
git bisect bad 857baa87b6422bcfb84ed3631d6839920cb5b09d
# good: [5d2a4e91a541cb04d20d11602f0f9340291322ac] sched/clock: Move sched clock initialization and merge with generic clock
git bisect good 5d2a4e91a541cb04d20d11602f0f9340291322ac
# first bad commit: [857baa87b6422bcfb84ed3631d6839920cb5b09d] sched/clock: Enable sched clock early

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [tip:x86/timers] sched/clock: Enable sched clock early
  2018-07-24 19:52     ` Guenter Roeck
@ 2018-07-24 20:22       ` Pavel Tatashin
  2018-07-25  0:36         ` Pavel Tatashin
  0 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-24 20:22 UTC (permalink / raw)
  To: linux; +Cc: LKML, mingo, tglx, hpa, linux-tip-commits

On Tue, Jul 24, 2018 at 3:54 PM Guenter Roeck <linux@roeck-us.net> wrote:
>
> Hi,
>
> On Thu, Jul 19, 2018 at 03:33:21PM -0700, tip-bot for Pavel Tatashin wrote:
> > Commit-ID:  857baa87b6422bcfb84ed3631d6839920cb5b09d
> > Gitweb:     https://git.kernel.org/tip/857baa87b6422bcfb84ed3631d6839920cb5b09d
> > Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
> > AuthorDate: Thu, 19 Jul 2018 16:55:42 -0400
> > Committer:  Thomas Gleixner <tglx@linutronix.de>
> > CommitDate: Fri, 20 Jul 2018 00:02:43 +0200
> >
> > sched/clock: Enable sched clock early
> >
> > Allow sched_clock() to be used before sched_clock_init() is called.  This
> > provides a way to get early boot timestamps on machines with unstable
> > clocks.
> >
>
> This patch causes a regression when running a qemu emulation with
> arm:integratorcp.

Thank you for the report. I will study it.

>
> ...
> Console: colour dummy device 80x30
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 0 at kernel/time/sched_clock.c:180
> sched_clock_register+0x44/0x278
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper Not tainted 4.18.0-rc6-next-20180724 #1
> Hardware name: ARM Integrator/CP (Device Tree)
> [<c0010cb4>] (unwind_backtrace) from [<c000dc24>] (show_stack+0x10/0x18)
> [<c000dc24>] (show_stack) from [<c03ffb94>] (dump_stack+0x18/0x24)
> [<c03ffb94>] (dump_stack) from [<c001a000>] (__warn+0xc8/0xf0)
> [<c001a000>] (__warn) from [<c001a13c>] (warn_slowpath_null+0x3c/0x4c)
> [<c001a13c>] (warn_slowpath_null) from [<c052367c>] (sched_clock_register+0x44/0x278)
> [<c052367c>] (sched_clock_register) from [<c05238d8>] (generic_sched_clock_init+0x28/0x88)
> [<c05238d8>] (generic_sched_clock_init) from [<c0521a00>] (sched_clock_init+0x54/0x74)
> [<c0521a00>] (sched_clock_init) from [<c0519c18>] (start_kernel+0x310/0x3e4)
> [<c0519c18>] (start_kernel) from [<00000000>] (  (null))
> ---[ end trace 08080eb81afa002c ]---
> sched_clock: 32 bits at 100 Hz, resolution 10000000ns, wraps every 21474836475000000ns
> ...
>
> A complete boot log is available at
> http://kerneltests.org/builders/qemu-arm-next/builds/979/steps/qemubuildcommand/logs/stdio
>
> Unfortunately, reverting the patch results in conflicts, so I am unable
> to confirm that it is the only culprit.
>
> From the context and from looking into the patch, it appears that this
> can happen in any system if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is not
> enabled.
>
> Bisect log is attached.
>
> Guenter
>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [tip:x86/timers] sched/clock: Enable sched clock early
  2018-07-24 20:22       ` Pavel Tatashin
@ 2018-07-25  0:36         ` Pavel Tatashin
  2018-07-25  1:24           ` Guenter Roeck
  0 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-25  0:36 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: LKML, mingo, tglx, hpa, linux-tip-commits

On Tue, Jul 24, 2018 at 4:22 PM Pavel Tatashin
<pasha.tatashin@oracle.com> wrote:
>
> On Tue, Jul 24, 2018 at 3:54 PM Guenter Roeck <linux@roeck-us.net> wrote:
> >
> > Hi,
> >
> > On Thu, Jul 19, 2018 at 03:33:21PM -0700, tip-bot for Pavel Tatashin wrote:
> > > Commit-ID:  857baa87b6422bcfb84ed3631d6839920cb5b09d
> > > Gitweb:     https://git.kernel.org/tip/857baa87b6422bcfb84ed3631d6839920cb5b09d
> > > Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
> > > AuthorDate: Thu, 19 Jul 2018 16:55:42 -0400
> > > Committer:  Thomas Gleixner <tglx@linutronix.de>
> > > CommitDate: Fri, 20 Jul 2018 00:02:43 +0200
> > >
> > > sched/clock: Enable sched clock early
> > >
> > > Allow sched_clock() to be used before sched_clock_init() is called.  This
> > > provides a way to get early boot timestamps on machines with unstable
> > > clocks.
> > >
> >
> > This patch causes a regression when running a qemu emulation with
> > arm:integratorcp.
>
> Thank you for the report. I will study it.
>
> >
> > ...
> > Console: colour dummy device 80x30
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 0 at kernel/time/sched_clock.c:180
> > sched_clock_register+0x44/0x278
> > Modules linked in:
> > CPU: 0 PID: 0 Comm: swapper Not tainted 4.18.0-rc6-next-20180724 #1
> > Hardware name: ARM Integrator/CP (Device Tree)
> > [<c0010cb4>] (unwind_backtrace) from [<c000dc24>] (show_stack+0x10/0x18)
> > [<c000dc24>] (show_stack) from [<c03ffb94>] (dump_stack+0x18/0x24)
> > [<c03ffb94>] (dump_stack) from [<c001a000>] (__warn+0xc8/0xf0)
> > [<c001a000>] (__warn) from [<c001a13c>] (warn_slowpath_null+0x3c/0x4c)
> > [<c001a13c>] (warn_slowpath_null) from [<c052367c>] (sched_clock_register+0x44/0x278)
> > [<c052367c>] (sched_clock_register) from [<c05238d8>] (generic_sched_clock_init+0x28/0x88)
> > [<c05238d8>] (generic_sched_clock_init) from [<c0521a00>] (sched_clock_init+0x54/0x74)
> > [<c0521a00>] (sched_clock_init) from [<c0519c18>] (start_kernel+0x310/0x3e4)
> > [<c0519c18>] (start_kernel) from [<00000000>] (  (null))
> > ---[ end trace 08080eb81afa002c ]---
> > sched_clock: 32 bits at 100 Hz, resolution 10000000ns, wraps every 21474836475000000ns
> > ...
> >
> > A complete boot log is available at
> > http://kerneltests.org/builders/qemu-arm-next/builds/979/steps/qemubuildcommand/logs/stdio
> >
> > Unfortunately, reverting the patch results in conflicts, so I am unable
> > to confirm that it is the only culprit.
> >
> > From the context and from looking into the patch, it appears that this
> > can happen in any system if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is not
> > enabled.

Do you have the complete config, and also the qemu args that were used? I
tried an arm defconfig and ran it in qemu, but could not reproduce the
problem.

Thank you,
Pavel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [tip:x86/timers] sched/clock: Enable sched clock early
  2018-07-25  0:36         ` Pavel Tatashin
@ 2018-07-25  1:24           ` Guenter Roeck
  2018-07-25  2:05             ` Pavel Tatashin
  0 siblings, 1 reply; 76+ messages in thread
From: Guenter Roeck @ 2018-07-25  1:24 UTC (permalink / raw)
  To: Pavel Tatashin; +Cc: LKML, mingo, tglx, hpa, linux-tip-commits

On 07/24/2018 05:36 PM, Pavel Tatashin wrote:
> On Tue, Jul 24, 2018 at 4:22 PM Pavel Tatashin
> <pasha.tatashin@oracle.com> wrote:
>>
>> On Tue, Jul 24, 2018 at 3:54 PM Guenter Roeck <linux@roeck-us.net> wrote:
>>>
>>> Hi,
>>>
>>> On Thu, Jul 19, 2018 at 03:33:21PM -0700, tip-bot for Pavel Tatashin wrote:
>>>> Commit-ID:  857baa87b6422bcfb84ed3631d6839920cb5b09d
>>>> Gitweb:     https://git.kernel.org/tip/857baa87b6422bcfb84ed3631d6839920cb5b09d
>>>> Author:     Pavel Tatashin <pasha.tatashin@oracle.com>
>>>> AuthorDate: Thu, 19 Jul 2018 16:55:42 -0400
>>>> Committer:  Thomas Gleixner <tglx@linutronix.de>
>>>> CommitDate: Fri, 20 Jul 2018 00:02:43 +0200
>>>>
>>>> sched/clock: Enable sched clock early
>>>>
>>>> Allow sched_clock() to be used before sched_clock_init() is called.  This
>>>> provides a way to get early boot timestamps on machines with unstable
>>>> clocks.
>>>>
>>>
>>> This patch causes a regression when running a qemu emulation with
>>> arm:integratorcp.
>>
>> Thank you for the report. I will study it.
>>
>>>
>>> ...
>>> Console: colour dummy device 80x30
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 0 PID: 0 at kernel/time/sched_clock.c:180
>>> sched_clock_register+0x44/0x278
>>> Modules linked in:
>>> CPU: 0 PID: 0 Comm: swapper Not tainted 4.18.0-rc6-next-20180724 #1
>>> Hardware name: ARM Integrator/CP (Device Tree)
>>> [<c0010cb4>] (unwind_backtrace) from [<c000dc24>] (show_stack+0x10/0x18)
>>> [<c000dc24>] (show_stack) from [<c03ffb94>] (dump_stack+0x18/0x24)
>>> [<c03ffb94>] (dump_stack) from [<c001a000>] (__warn+0xc8/0xf0)
>>> [<c001a000>] (__warn) from [<c001a13c>] (warn_slowpath_null+0x3c/0x4c)
>>> [<c001a13c>] (warn_slowpath_null) from [<c052367c>] (sched_clock_register+0x44/0x278)
>>> [<c052367c>] (sched_clock_register) from [<c05238d8>] (generic_sched_clock_init+0x28/0x88)
>>> [<c05238d8>] (generic_sched_clock_init) from [<c0521a00>] (sched_clock_init+0x54/0x74)
>>> [<c0521a00>] (sched_clock_init) from [<c0519c18>] (start_kernel+0x310/0x3e4)
>>> [<c0519c18>] (start_kernel) from [<00000000>] (  (null))
>>> ---[ end trace 08080eb81afa002c ]---
>>> sched_clock: 32 bits at 100 Hz, resolution 10000000ns, wraps every 21474836475000000ns
>>> ...
>>>
>>> A complete boot log is available at
>>> http://kerneltests.org/builders/qemu-arm-next/builds/979/steps/qemubuildcommand/logs/stdio
>>>
>>> Unfortunately, reverting the patch results in conflicts, so I am unable
>>> to confirm that it is the only culprit.
>>>
>>>  From the context and from looking into the patch, it appears that this
>>> can happen in any system if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is not
>>> enabled.
> 
> Do you have a complete config, and also qemu args that were used? I
> have tried defconfig arm, and run in qemu, could not reproduce the
> problem.
> 

integrator_defconfig+CONFIG_DEVTMPFS=y+CONFIG_DEVTMPFS_MOUNT=y

Qemu command line is
	qemu-system-arm -M integratorcp  -m 128 \
	-kernel arch/arm/boot/zImage -no-reboot \
	-initrd busybox-armv4.cpio \
	--append "rdinit=/sbin/init console=ttyAMA0,115200" \
	-serial stdio -monitor null -nographic \
	-dtb arch/arm/boot/dts/integratorcp.dtb

The scripts and files used are available from git@github.com:groeck/linux-build-test.git.
qemu is version 2.12.

Guenter



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [tip:x86/timers] sched/clock: Enable sched clock early
  2018-07-25  1:24           ` Guenter Roeck
@ 2018-07-25  2:05             ` Pavel Tatashin
  2018-07-25  2:41               ` Pavel Tatashin
  0 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-25  2:05 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: LKML, mingo, tglx, hpa, linux-tip-commits

> integrator_defconfig+CONFIG_DEVTMPFS=y+CONFIG_DEVTMPFS_MOUNT=y
>
> Qemu command line is
>         qemu-system-arm -M integratorcp  -m 128 \
>         -kernel arch/arm/boot/zImage -no-reboot \
>         -initrd busybox-armv4.cpio \
>         --append "rdinit=/sbin/init console=ttyAMA0,115200" \
>         -serial stdio -monitor null -nographic \
>         -dtb arch/arm/boot/dts/integratorcp.dtb
>
> The scripts and files used are available from git@github.com:groeck/linux-build-test.git.
> qemu is version 2.12.

Reproduced. Thank you.


* Re: [tip:x86/timers] sched/clock: Enable sched clock early
  2018-07-25  2:05             ` Pavel Tatashin
@ 2018-07-25  2:41               ` Pavel Tatashin
  2018-07-30 12:36                 ` Peter Zijlstra
  0 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-25  2:41 UTC (permalink / raw)
  To: Guenter Roeck, peterz; +Cc: LKML, mingo, tglx, hpa, linux-tip-commits

Peter,

The problem is in this stack

start_kernel
  local_irq_enable
  late_time_init
  sched_clock_init
    generic_sched_clock_init
      sched_clock_register
        WARN_ON(!irqs_disabled());

Before this work, sched_clock_init() was called prior to enabling
interrupts, but now it is called after. So, we hit this WARN_ON() in
sched_clock_register().

The question is: why do we need this warning in sched_clock_register()?
I guess it is because we want to make this section of the code atomic:

195         new_epoch = read();  <- from here
196         cyc = cd.actual_read_sched_clock();
197         ns = rd.epoch_ns + cyc_to_ns((cyc - rd.epoch_cyc) & rd.sched_clock_mask, rd.mult, rd.shift);
198         cd.actual_read_sched_clock = read;
199
200         rd.read_sched_clock     = read;
201         rd.sched_clock_mask     = new_mask;
202         rd.mult                 = new_mult;
203         rd.shift                = new_shift;
204         rd.epoch_cyc            = new_epoch;
205         rd.epoch_ns             = ns;
206
207         update_clock_read_data(&rd); <- to here

If we need it, we can surround the sched_clock_register() with
local_irq_disable/local_irq_enable:

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index cbc72c2c1fca..5015b165b55b 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -243,8 +243,11 @@ void __init generic_sched_clock_init(void)
         * If no sched_clock() function has been provided at that point,
         * make it the final one one.
         */
-       if (cd.actual_read_sched_clock == jiffy_sched_clock_read)
+       if (cd.actual_read_sched_clock == jiffy_sched_clock_read) {
+               local_irq_disable();
                sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);
+               local_irq_enable();
+       }

        update_sched_clock();

Thank you,
Pavel


* Re: [tip:x86/timers] sched/clock: Enable sched clock early
  2018-07-25  2:41               ` Pavel Tatashin
@ 2018-07-30 12:36                 ` Peter Zijlstra
  2018-07-30 13:44                   ` Pavel Tatashin
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2018-07-30 12:36 UTC (permalink / raw)
  To: Pavel Tatashin; +Cc: Guenter Roeck, LKML, mingo, tglx, hpa, linux-tip-commits

On Tue, Jul 24, 2018 at 10:41:19PM -0400, Pavel Tatashin wrote:

> If we need it, we can surround the sched_clock_register() with
> local_irq_disable/local_irq_enable:
> 
> diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
> index cbc72c2c1fca..5015b165b55b 100644
> --- a/kernel/time/sched_clock.c
> +++ b/kernel/time/sched_clock.c
> @@ -243,8 +243,11 @@ void __init generic_sched_clock_init(void)
>          * If no sched_clock() function has been provided at that point,
>          * make it the final one one.
>          */
> -       if (cd.actual_read_sched_clock == jiffy_sched_clock_read)
> +       if (cd.actual_read_sched_clock == jiffy_sched_clock_read) {
> +               local_irq_disable();
>                 sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);
> +               local_irq_enable();
> +       }
> 
>         update_sched_clock();

I'm thinking maybe disable IRQs for that entire function, instead of
just the register call.


* Re: [tip:x86/timers] sched/clock: Enable sched clock early
  2018-07-30 12:36                 ` Peter Zijlstra
@ 2018-07-30 13:44                   ` Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: Pavel Tatashin @ 2018-07-30 13:44 UTC (permalink / raw)
  To: peterz; +Cc: Guenter Roeck, LKML, mingo, tglx, hpa, linux-tip-commits

> > -       if (cd.actual_read_sched_clock == jiffy_sched_clock_read)
> > +       if (cd.actual_read_sched_clock == jiffy_sched_clock_read) {
> > +               local_irq_disable();
> >                 sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);
> > +               local_irq_enable();
> > +       }
> >
> >         update_sched_clock();
>
> I'm thinking maybe disable IRQs for that entire function, instead of
> just the register call.

Sure, I will send a patch.

Thank you,
Pavel


* Re: [PATCH v15 23/26] sched: early boot clock
  2018-07-19 20:55 ` [PATCH v15 23/26] sched: early boot clock Pavel Tatashin
  2018-07-19 22:33   ` [tip:x86/timers] sched/clock: Enable sched clock early tip-bot for Pavel Tatashin
  2018-07-20  8:09   ` [PATCH v15 23/26] sched: early boot clock Peter Zijlstra
@ 2018-11-06  5:42   ` Dominique Martinet
  2018-11-06 11:35     ` Steven Sistare
  2 siblings, 1 reply; 76+ messages in thread
From: Dominique Martinet @ 2018-11-06  5:42 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, linux, schwidefsky,
	heiko.carstens, john.stultz, sboyd, x86, linux-kernel, mingo,
	tglx, hpa, douly.fnst, peterz, prarit, feng.tang, pmladek,
	gnomes, linux-s390, boris.ostrovsky, jgross, pbonzini,
	virtualization, kvm, qemu-devel

(added various kvm/virtualization lists in Cc as well as qemu as I don't
know who's "wrong" here)

Pavel Tatashin wrote on Thu, Jul 19, 2018:
> Allow sched_clock() to be used before sched_clock_init() is called.
> This provides a way to get early boot timestamps on machines with
> unstable clocks.

This isn't something I understand, but bisect tells me this patch
(landed as 857baa87b64 ("sched/clock: Enable sched clock early")) makes
a VM running with kvmclock take a step in the uptime/printk timer early
in the boot sequence, as illustrated below. The step seems to be related
to the amount of time the host was suspended while qemu was running
before the reboot.

$ dmesg
...
[    0.000000] SMBIOS 2.8 present.
[    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
[    0.000000] Hypervisor detected: KVM
[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[283120.529821] kvm-clock: cpu 0, msr 321a8001, primary cpu clock
[283120.529822] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[283120.529824] tsc: Detected 2592.000 MHz processor
...

(The VM is x86_64 on x86_64, I can provide my .config on request but
don't think it's related)


It's rather annoying for me as I often reboot VMs and rely on the
'uptime' command to check if I did just reboot or not as I have the
attention span of a goldfish; I'd rather not have to find something else
to check if I did just reboot or not.

Note that if the qemu process is restarted, there is no offset anymore.

Unfortunately I just did that, so I cannot say with confidence (putting
my laptop to sleep for 30s only led to a 2s offset, and I do not want to
wait longer right now), but it looks like the clock is still mostly
correct after a reboot with my VM's ntp client disabled. I will follow
up tomorrow if that turns out to be wrong.


Happy to try to help fixing this in any way, as written above the quote
I'm not even actually sure who is wrong here.

Thanks!



(As a side, mostly unrelated note: insert swearing here about cf7a63ef4
not compiling earlier in this series; some variable declarations were
removed before their use. It was fixed in the next patch, but I didn't
notice that the kernel hadn't fully rebuilt and wasted time in my bisect
heading the wrong way...)

> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> ---
>  init/main.c          |  2 +-
>  kernel/sched/clock.c | 20 +++++++++++++++++++-
>  2 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/init/main.c b/init/main.c
> index 162d931c9511..ff0a24170b95 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -642,7 +642,6 @@ asmlinkage __visible void __init start_kernel(void)
>  	softirq_init();
>  	timekeeping_init();
>  	time_init();
> -	sched_clock_init();
>  	printk_safe_init();
>  	perf_event_init();
>  	profile_init();
> @@ -697,6 +696,7 @@ asmlinkage __visible void __init start_kernel(void)
>  	acpi_early_init();
>  	if (late_time_init)
>  		late_time_init();
> +	sched_clock_init();
>  	calibrate_delay();
>  	pid_idr_init();
>  	anon_vma_init();
> diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
> index 0e9dbb2d9aea..422cd63f8f17 100644
> --- a/kernel/sched/clock.c
> +++ b/kernel/sched/clock.c
> @@ -202,7 +202,25 @@ static void __sched_clock_gtod_offset(void)
>  
>  void __init sched_clock_init(void)
>  {
> +	unsigned long flags;
> +
> +	/*
> +	 * Set __gtod_offset such that once we mark sched_clock_running,
> +	 * sched_clock_tick() continues where sched_clock() left off.
> +	 *
> +	 * Even if TSC is buggered, we're still UP at this point so it
> +	 * can't really be out of sync.
> +	 */
> +	local_irq_save(flags);
> +	__sched_clock_gtod_offset();
> +	local_irq_restore(flags);
> +
>  	sched_clock_running = 1;
> +
> +	/* Now that sched_clock_running is set adjust scd */
> +	local_irq_save(flags);
> +	sched_clock_tick();
> +	local_irq_restore(flags);
>  }
>  /*
>   * We run this as late_initcall() such that it runs after all built-in drivers,
> @@ -356,7 +374,7 @@ u64 sched_clock_cpu(int cpu)
>  		return sched_clock() + __sched_clock_offset;
>  
>  	if (unlikely(!sched_clock_running))
> -		return 0ull;
> +		return sched_clock();
>  
>  	preempt_disable_notrace();
>  	scd = cpu_sdc(cpu);
-- 
Dominique Martinet | Asmadeus


* Re: [PATCH v15 23/26] sched: early boot clock
  2018-11-06  5:42   ` [PATCH v15 23/26] sched: early boot clock Dominique Martinet
@ 2018-11-06 11:35     ` Steven Sistare
  2019-01-02 20:20       ` Salvatore Bonaccorso
  0 siblings, 1 reply; 76+ messages in thread
From: Steven Sistare @ 2018-11-06 11:35 UTC (permalink / raw)
  To: Dominique Martinet, Pavel Tatashin
  Cc: daniel.m.jordan, linux, schwidefsky, heiko.carstens, john.stultz,
	sboyd, x86, linux-kernel, mingo, tglx, hpa, douly.fnst, peterz,
	prarit, feng.tang, pmladek, gnomes, linux-s390, boris.ostrovsky,
	jgross, pbonzini, virtualization, kvm, qemu-devel

Pavel has a new email address, cc'd - steve

On 11/6/2018 12:42 AM, Dominique Martinet wrote:
> (added various kvm/virtualization lists in Cc as well as qemu as I don't
> know who's "wrong" here)
> 
> Pavel Tatashin wrote on Thu, Jul 19, 2018:
>> Allow sched_clock() to be used before sched_clock_init() is called.
>> This provides a way to get early boot timestamps on machines with
>> unstable clocks.
> 
> This isn't something I understand, but bisect tells me this patch
> (landed as 857baa87b64 ("sched/clock: Enable sched clock early")) makes
> a VM running with kvmclock take a step in the uptime/printk timer early
> in the boot sequence, as illustrated below. The step seems to be related
> to the amount of time the host was suspended while qemu was running
> before the reboot.
> 
> $ dmesg
> ...
> [    0.000000] SMBIOS 2.8 present.
> [    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
> [    0.000000] Hypervisor detected: KVM
> [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [283120.529821] kvm-clock: cpu 0, msr 321a8001, primary cpu clock
> [283120.529822] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [283120.529824] tsc: Detected 2592.000 MHz processor
> ...
> 
> (The VM is x86_64 on x86_64, I can provide my .config on request but
> don't think it's related)
> 
> 
> It's rather annoying for me as I often reboot VMs and rely on the
> 'uptime' command to check if I did just reboot or not as I have the
> attention span of a goldfish; I'd rather not have to find something else
> to check if I did just reboot or not.
> 
> Note that if the qemu process is restarted, there is no offset anymore.
> 
> Unfortunately I just did that, so I cannot say with confidence (putting
> my laptop to sleep for 30s only led to a 2s offset, and I do not want to
> wait longer right now), but it looks like the clock is still mostly
> correct after a reboot with my VM's ntp client disabled. I will follow
> up tomorrow if that turns out to be wrong.
> 
> 
> Happy to try to help fixing this in any way, as written above the quote
> I'm not even actually sure who is wrong here.
> 
> Thanks!
> 
> 
> 
> (As a side, mostly unrelated note: insert swearing here about cf7a63ef4
> not compiling earlier in this series; some variable declarations were
> removed before their use. It was fixed in the next patch, but I didn't
> notice that the kernel hadn't fully rebuilt and wasted time in my bisect
> heading the wrong way...)
> 
>> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
>> ---
>>  init/main.c          |  2 +-
>>  kernel/sched/clock.c | 20 +++++++++++++++++++-
>>  2 files changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/init/main.c b/init/main.c
>> index 162d931c9511..ff0a24170b95 100644
>> --- a/init/main.c
>> +++ b/init/main.c
>> @@ -642,7 +642,6 @@ asmlinkage __visible void __init start_kernel(void)
>>  	softirq_init();
>>  	timekeeping_init();
>>  	time_init();
>> -	sched_clock_init();
>>  	printk_safe_init();
>>  	perf_event_init();
>>  	profile_init();
>> @@ -697,6 +696,7 @@ asmlinkage __visible void __init start_kernel(void)
>>  	acpi_early_init();
>>  	if (late_time_init)
>>  		late_time_init();
>> +	sched_clock_init();
>>  	calibrate_delay();
>>  	pid_idr_init();
>>  	anon_vma_init();
>> diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
>> index 0e9dbb2d9aea..422cd63f8f17 100644
>> --- a/kernel/sched/clock.c
>> +++ b/kernel/sched/clock.c
>> @@ -202,7 +202,25 @@ static void __sched_clock_gtod_offset(void)
>>  
>>  void __init sched_clock_init(void)
>>  {
>> +	unsigned long flags;
>> +
>> +	/*
>> +	 * Set __gtod_offset such that once we mark sched_clock_running,
>> +	 * sched_clock_tick() continues where sched_clock() left off.
>> +	 *
>> +	 * Even if TSC is buggered, we're still UP at this point so it
>> +	 * can't really be out of sync.
>> +	 */
>> +	local_irq_save(flags);
>> +	__sched_clock_gtod_offset();
>> +	local_irq_restore(flags);
>> +
>>  	sched_clock_running = 1;
>> +
>> +	/* Now that sched_clock_running is set adjust scd */
>> +	local_irq_save(flags);
>> +	sched_clock_tick();
>> +	local_irq_restore(flags);
>>  }
>>  /*
>>   * We run this as late_initcall() such that it runs after all built-in drivers,
>> @@ -356,7 +374,7 @@ u64 sched_clock_cpu(int cpu)
>>  		return sched_clock() + __sched_clock_offset;
>>  
>>  	if (unlikely(!sched_clock_running))
>> -		return 0ull;
>> +		return sched_clock();
>>  
>>  	preempt_disable_notrace();
>>  	scd = cpu_sdc(cpu);


* Re: [PATCH v15 23/26] sched: early boot clock
  2018-11-06 11:35     ` Steven Sistare
@ 2019-01-02 20:20       ` Salvatore Bonaccorso
  2019-01-03 21:28         ` Pavel Tatashin
  0 siblings, 1 reply; 76+ messages in thread
From: Salvatore Bonaccorso @ 2019-01-02 20:20 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Dominique Martinet, Pavel Tatashin, daniel.m.jordan, linux,
	schwidefsky, heiko.carstens, john.stultz, sboyd, x86,
	linux-kernel, mingo, tglx, hpa, douly.fnst, peterz, prarit,
	feng.tang, pmladek, gnomes, linux-s390, boris.ostrovsky, jgross,
	pbonzini, virtualization, kvm, qemu-devel

Hi,

On Tue, Nov 06, 2018 at 06:35:36AM -0500, Steven Sistare wrote:
> Pavel has a new email address, cc'd - steve
> 
> On 11/6/2018 12:42 AM, Dominique Martinet wrote:
> > (added various kvm/virtualization lists in Cc as well as qemu as I don't
> > know who's "wrong" here)
> > 
> > Pavel Tatashin wrote on Thu, Jul 19, 2018:
> >> Allow sched_clock() to be used before sched_clock_init() is called.
> >> This provides a way to get early boot timestamps on machines with
> >> unstable clocks.
> > 
> > This isn't something I understand, but bisect tells me this patch
> > (landed as 857baa87b64 ("sched/clock: Enable sched clock early")) makes
> > a VM running with kvmclock take a step in the uptime/printk timer early
> > in the boot sequence, as illustrated below. The step seems to be related
> > to the amount of time the host was suspended while qemu was running
> > before the reboot.
> > 
> > $ dmesg
> > ...
> > [    0.000000] SMBIOS 2.8 present.
> > [    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
> > [    0.000000] Hypervisor detected: KVM
> > [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> > [283120.529821] kvm-clock: cpu 0, msr 321a8001, primary cpu clock
> > [283120.529822] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> > [283120.529824] tsc: Detected 2592.000 MHz processor
> > ...
> > 
> > (The VM is x86_64 on x86_64, I can provide my .config on request but
> > don't think it's related)
> > 
> > 
> > It's rather annoying for me as I often reboot VMs and rely on the
> > 'uptime' command to check if I did just reboot or not as I have the
> > attention span of a goldfish; I'd rather not have to find something else
> > to check if I did just reboot or not.
> > 
> > Note that if the qemu process is restarted, there is no offset anymore.
> > 
> > Unfortunately I just did that, so I cannot say with confidence (putting
> > my laptop to sleep for 30s only led to a 2s offset, and I do not want to
> > wait longer right now), but it looks like the clock is still mostly
> > correct after a reboot with my VM's ntp client disabled. I will follow
> > up tomorrow if that turns out to be wrong.
> > 
> > 
> > Happy to try to help fixing this in any way, as written above the quote
> > I'm not even actually sure who is wrong here.

A user in Debian reported the same/similar issue (with 4.19.13):

https://bugs.debian.org/918036

Regards,
Salvatore


* Re: [PATCH v15 23/26] sched: early boot clock
  2019-01-02 20:20       ` Salvatore Bonaccorso
@ 2019-01-03 21:28         ` Pavel Tatashin
  2019-01-03 23:43           ` Dominique Martinet
  2019-01-04  7:30           ` [PATCH v15 23/26] sched: early boot clock (was Re: Bug#918036: linux: uptime after reboot wrong (kvm-clock related?)) Thorsten Glaser
  0 siblings, 2 replies; 76+ messages in thread
From: Pavel Tatashin @ 2019-01-03 21:28 UTC (permalink / raw)
  To: Salvatore Bonaccorso
  Cc: Steven Sistare, Dominique Martinet, Pavel Tatashin,
	Daniel Jordan, linux, schwidefsky, heiko.carstens, john.stultz,
	sboyd, x86, LKML, mingo, Thomas Gleixner, hpa, douly.fnst,
	Peter Zijlstra, prarit, feng.tang, pmladek, gnomes, linux-s390,
	boris.ostrovsky, jgross, pbonzini, virtualization, kvm,
	qemu-devel

Could you please send the config file and qemu arguments that were
used to reproduce this problem.

Thank you,
Pasha

On Wed, Jan 2, 2019 at 3:20 PM Salvatore Bonaccorso <carnil@debian.org> wrote:
>
> Hi,
>
> On Tue, Nov 06, 2018 at 06:35:36AM -0500, Steven Sistare wrote:
> > Pavel has a new email address, cc'd - steve
> >
> > On 11/6/2018 12:42 AM, Dominique Martinet wrote:
> > > (added various kvm/virtualization lists in Cc as well as qemu as I don't
> > > know who's "wrong" here)
> > >
> > > Pavel Tatashin wrote on Thu, Jul 19, 2018:
> > >> Allow sched_clock() to be used before sched_clock_init() is called.
> > >> This provides a way to get early boot timestamps on machines with
> > >> unstable clocks.
> > >
> > > This isn't something I understand, but bisect tells me this patch
> > > (landed as 857baa87b64 ("sched/clock: Enable sched clock early")) makes
> > > a VM running with kvmclock take a step in the uptime/printk timer early
> > > in the boot sequence, as illustrated below. The step seems to be related
> > > to the amount of time the host was suspended while qemu was running
> > > before the reboot.
> > >
> > > $ dmesg
> > > ...
> > > [    0.000000] SMBIOS 2.8 present.
> > > [    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
> > > [    0.000000] Hypervisor detected: KVM
> > > [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> > > [283120.529821] kvm-clock: cpu 0, msr 321a8001, primary cpu clock
> > > [283120.529822] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> > > [283120.529824] tsc: Detected 2592.000 MHz processor
> > > ...
> > >
> > > (The VM is x86_64 on x86_64, I can provide my .config on request but
> > > don't think it's related)
> > >
> > >
> > > It's rather annoying for me as I often reboot VMs and rely on the
> > > 'uptime' command to check if I did just reboot or not as I have the
> > > attention span of a goldfish; I'd rather not have to find something else
> > > to check if I did just reboot or not.
> > >
> > > Note that if the qemu process is restarted, there is no offset anymore.
> > >
> > > Unfortunately I just did that, so I cannot say with confidence (putting
> > > my laptop to sleep for 30s only led to a 2s offset, and I do not want to
> > > wait longer right now), but it looks like the clock is still mostly
> > > correct after a reboot with my VM's ntp client disabled. I will follow
> > > up tomorrow if that turns out to be wrong.
> > >
> > >
> > > Happy to try to help fixing this in any way, as written above the quote
> > > I'm not even actually sure who is wrong here.
>
> A user in Debian reported the same/similar issue (with 4.19.13):
>
> https://bugs.debian.org/918036
>
> Regards,
> Salvatore


* Re: [PATCH v15 23/26] sched: early boot clock
  2019-01-03 21:28         ` Pavel Tatashin
@ 2019-01-03 23:43           ` Dominique Martinet
  2019-01-07 18:17             ` Pavel Tatashin
  2019-01-04  7:30           ` [PATCH v15 23/26] sched: early boot clock (was Re: Bug#918036: linux: uptime after reboot wrong (kvm-clock related?)) Thorsten Glaser
  1 sibling, 1 reply; 76+ messages in thread
From: Dominique Martinet @ 2019-01-03 23:43 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Salvatore Bonaccorso, Steven Sistare, Pavel Tatashin,
	Daniel Jordan, linux, schwidefsky, heiko.carstens, john.stultz,
	sboyd, x86, LKML, mingo, Thomas Gleixner, hpa, douly.fnst,
	Peter Zijlstra, prarit, feng.tang, pmladek, gnomes, linux-s390,
	boris.ostrovsky, jgross, pbonzini, virtualization, kvm,
	qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1574 bytes --]

Pavel Tatashin wrote on Thu, Jan 03, 2019:
> Could you please send the config file and qemu arguments that were
> used to reproduce this problem.

Running qemu by hand, nothing fancy; e.g., this works:

# qemu-system-x86_64 -m 1G -smp 4 -drive file=/root/kvm-wrapper/disks/f2.img,if=virtio -serial mon:stdio --enable-kvm -cpu Haswell -device virtio-rng-pci -nographic

(used a specific cpu just in case, but normally running with -cpu host
on a Skylake machine; can probably go older)


qemu is the stock Fedora 29 build:
$ qemu-system-x86_64 --version
QEMU emulator version 3.0.0 (qemu-3.0.0-3.fc29)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers


The compressed .config is attached to the mail; it can likely be trimmed
down some as well, but that takes more time on my side...
I didn't rebuild the kernel so I am not 100% sure (it comes from
/proc/config.gz), but it should work on a 4.20-rc2 kernel as written in
the first few lines; 857baa87b64, which I referred to in another mail,
was merged in 4.19-rc1, so anything past that is probably OK for
reproducing...


Re-checked today with these exact options (fresh VM start; then suspend
laptop for a bit, then reboot VM):
[    0.000000] Hypervisor detected: KVM
[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 2477.907447] kvm-clock: cpu 0, msr 153a4001, primary cpu clock
[ 2477.907448] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 2477.907450] tsc: Detected 2592.000 MHz processor


As offered previously, happy to help in any way.

Thanks,
-- 
Dominique

[-- Attachment #2: config.xz --]
[-- Type: application/octet-stream, Size: 19476 bytes --]


* Re: [PATCH v15 23/26] sched: early boot clock (was Re: Bug#918036: linux: uptime after reboot wrong (kvm-clock related?))
  2019-01-03 21:28         ` Pavel Tatashin
  2019-01-03 23:43           ` Dominique Martinet
@ 2019-01-04  7:30           ` Thorsten Glaser
  1 sibling, 0 replies; 76+ messages in thread
From: Thorsten Glaser @ 2019-01-04  7:30 UTC (permalink / raw)
  To: Salvatore Bonaccorso, 918036, Pavel Tatashin
  Cc: Steven Sistare, Dominique Martinet, Pavel Tatashin,
	Daniel Jordan, linux, schwidefsky, heiko.carstens, john.stultz,
	sboyd, x86, LKML, mingo, Thomas Gleixner, hpa, douly.fnst,
	Peter Zijlstra, prarit, feng.tang, pmladek, gnomes, linux-s390,
	boris.ostrovsky, jgross, pbonzini, virtualization, kvm,
	qemu-devel

Hi Salvatore,

>p.s.: my earlier reply to you seems to have been rejected and never
>      reached you; hope this one does now.

if you sent from Googlemail, it may reach me in the next few weeks or
never *shrug*; they don’t play nice with greylisting. The -submitter
or @d.o works, though. I’m following up from my $dayjob address as
the issue occurred there (which is also Googlemail, unfortunately).

>There was now a followup on this, and if you can I think it's best if
>you can followup there.
>
>https://lore.kernel.org/lkml/CA+CK2bC70pnL0Wimb0xt99J4nNfi8W3zuUHgAk-jsPuOP9jpHA@mail.gmail.com/

OK, doing now:


Pavel Tatashin wrote:

>Could you please send the config file and qemu arguments that were
>used to reproduce this problem.

This is from a libvirt-managed system. The arguments as shown by
“ps axwww” are:

qemu-system-x86_64 -enable-kvm -name ci-busyapps -S -machine pc-1.1,accel=kvm,usb=off -m 8192 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 09536d92-dd73-8993-78fb-e0c885acf763 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/ci-busyapps.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/vms/ci-busyapps,format=raw,if=none,id=drive-virtio-disk0,cache=none,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=24,id=hostnet0,vhost=on,vhostfd=25 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:6e:fd,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on

I’ve attached the kernel configuration; this is a stock Debian
unstable/amd64 system, just upgraded. After upgrading the guest,
I merely issued a “reboot” in the guest and did not stop/start
qemu.

The host is Debian jessie/amd64 (Linux 3.16.0-7-amd64 / 3.16.59-1)
in case that matters.

Thanks,
//mirabilos
-- 
tarent solutions GmbH
Rochusstraße 2-4, D-53123 Bonn • http://www.tarent.de/
Tel: +49 228 54881-393 • Fax: +49 228 54881-235
HRB 5168 (AG Bonn) • USt-ID (VAT): DE122264941
Geschäftsführer: Dr. Stefan Barth, Kai Ebenrett, Boris Esser, Alexander Steeg


* Re: [PATCH v15 23/26] sched: early boot clock
  2019-01-03 23:43           ` Dominique Martinet
@ 2019-01-07 18:17             ` Pavel Tatashin
  2019-01-07 23:48               ` Dominique Martinet
  0 siblings, 1 reply; 76+ messages in thread
From: Pavel Tatashin @ 2019-01-07 18:17 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Salvatore Bonaccorso, Steven Sistare, Pavel Tatashin,
	Daniel Jordan, linux, schwidefsky, heiko.carstens, John Stultz,
	sboyd, x86, LKML, mingo, Thomas Gleixner, hpa, douly.fnst,
	Peter Zijlstra, prarit, feng.tang, Petr Mladek, gnomes,
	linux-s390, boris.ostrovsky, jgross, pbonzini, virtualization,
	kvm, qemu-devel

On Thu, Jan 3, 2019 at 6:43 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Pavel Tatashin wrote on Thu, Jan 03, 2019:
> > Could you please send the config file and qemu arguments that were
> > used to reproduce this problem.
>
> Running qemu by hand, nothing fancy; e.g., this works:
>
> # qemu-system-x86_64 -m 1G -smp 4 -drive file=/root/kvm-wrapper/disks/f2.img,if=virtio -serial mon:stdio --enable-kvm -cpu Haswell -device virtio-rng-pci -nographic
>
> (used a specific cpu just in case, but normally running with -cpu host
> on a Skylake machine; can probably go older)
>
>
> qemu is the stock Fedora 29 build:
> $ qemu-system-x86_64 --version
> QEMU emulator version 3.0.0 (qemu-3.0.0-3.fc29)
> Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
>
>
> The compressed .config is attached to the mail; it can likely be trimmed
> down some as well, but that takes more time on my side...
> I didn't rebuild the kernel so I am not 100% sure (it comes from
> /proc/config.gz), but it should work on a 4.20-rc2 kernel as written in
> the first few lines; 857baa87b64, which I referred to in another mail,
> was merged in 4.19-rc1, so anything past that is probably OK for
> reproducing...
>
>
> Re-checked today with these exact options (fresh VM start; then suspend
> laptop for a bit, then reboot VM):
> [    0.000000] Hypervisor detected: KVM
> [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [ 2477.907447] kvm-clock: cpu 0, msr 153a4001, primary cpu clock
> [ 2477.907448] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [ 2477.907450] tsc: Detected 2592.000 MHz processor

I could not reproduce the problem. Did you suspend to memory between
wake ups? Does this time jump happen every time, even if your laptop
sleeps for a minute?

I have tried with qemu 2.6 and 3.1 on Ubuntu, testing 4.20-rc2.

Pasha

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v15 23/26] sched: early boot clock
  2019-01-07 18:17             ` Pavel Tatashin
@ 2019-01-07 23:48               ` Dominique Martinet
  2019-01-08  1:04                 ` Pavel Tatashin
  0 siblings, 1 reply; 76+ messages in thread
From: Dominique Martinet @ 2019-01-07 23:48 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Salvatore Bonaccorso, Steven Sistare, Pavel Tatashin,
	Daniel Jordan, linux, schwidefsky, heiko.carstens, John Stultz,
	sboyd, x86, LKML, mingo, Thomas Gleixner, hpa, douly.fnst,
	Peter Zijlstra, prarit, feng.tang, Petr Mladek, gnomes,
	linux-s390, boris.ostrovsky, jgross, pbonzini, virtualization,
	kvm, qemu-devel

Pavel Tatashin wrote on Mon, Jan 07, 2019:
> I could not reproduce the problem. Did you suspend to memory between
> wake ups? Does this time jump happen every time, even if your laptop
> sleeps for a minute?

I'm not sure I understand "suspend to memory between the wake ups".
The full sequence is:
 - start a VM (just in case, I let it boot till the end)
 - suspend to memory (aka systemctl suspend) the host
 - after resuming the host, soft reboot the VM (login through
serial/ssh/whatever and reboot or in the qemu console 'system_reset')

I've just slept exactly one minute and reproduced again with the fedora
stock kernel now (4.19.13-300.fc29.x86_64) in the VM.

Interestingly I'm not getting the same offset between multiple reboots
now despite not suspending again; but if I don't suspend I cannot seem
to get it to give an offset at all (only tried for a few minutes; this
might not be true); OTOH I pushed my luck further and even with a
five-second sleep I'm getting a noticeable offset on first VM reboot after
resume:

[    0.000000] Hypervisor detected: KVM
[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[  179.362163] kvm-clock: cpu 0, msr 13c01001, primary cpu clock
[  179.362163] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns

Honestly, I'm not sure what more information I could give; I'll try on
some other hardware than my laptop (if I can get a server to resume
after suspend through IPMI or wake-on-LAN); but I don't have anything I
could install Ubuntu on to try their qemu version... although I really
don't want to believe that's the difference...

Thanks,
-- 
Dominique

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v15 23/26] sched: early boot clock
  2019-01-07 23:48               ` Dominique Martinet
@ 2019-01-08  1:04                 ` Pavel Tatashin
  2019-01-08  1:09                   ` Dominique Martinet
  2019-01-26  2:04                   ` Jon DeVree
  0 siblings, 2 replies; 76+ messages in thread
From: Pavel Tatashin @ 2019-01-08  1:04 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Salvatore Bonaccorso, Steven Sistare, Pavel Tatashin,
	Daniel Jordan, linux, schwidefsky, heiko.carstens, John Stultz,
	sboyd, x86, LKML, mingo, Thomas Gleixner, hpa, douly.fnst,
	Peter Zijlstra, prarit, feng.tang, Petr Mladek, gnomes,
	linux-s390, boris.ostrovsky, jgross, pbonzini, virtualization,
	kvm, qemu-devel

I did exactly the same sequence on Kaby Lake CPU and could not
reproduce it. What is your host CPU?

Thank you,
Pasha

On Mon, Jan 7, 2019 at 6:48 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Pavel Tatashin wrote on Mon, Jan 07, 2019:
> > I could not reproduce the problem. Did you suspend to memory between
> > wake ups? Does this time jump happen every time, even if your laptop
> > sleeps for a minute?
>
> I'm not sure I understand "suspend to memory between the wake ups".
> The full sequence is:
>  - start a VM (just in case, I let it boot till the end)
>  - suspend to memory (aka systemctl suspend) the host
>  - after resuming the host, soft reboot the VM (login through
> serial/ssh/whatever and reboot or in the qemu console 'system_reset')
>
> I've just slept exactly one minute and reproduced again with the fedora
> stock kernel now (4.19.13-300.fc29.x86_64) in the VM.
>
> Interestingly I'm not getting the same offset between multiple reboots
> now despite not suspending again; but if I don't suspend I cannot seem
> to get it to give an offset at all (only tried for a few minutes; this
> might not be true); OTOH I pushed my luck further and even with a
> five-second sleep I'm getting a noticeable offset on first VM reboot after
> resume:
>
> [    0.000000] Hypervisor detected: KVM
> [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [  179.362163] kvm-clock: cpu 0, msr 13c01001, primary cpu clock
> [  179.362163] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
>
> Honestly, I'm not sure what more information I could give; I'll try on
> some other hardware than my laptop (if I can get a server to resume
> after suspend through IPMI or wake-on-LAN); but I don't have anything I
> could install Ubuntu on to try their qemu version... although I really
> don't want to believe that's the difference...
>
> Thanks,
> --
> Dominique

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v15 23/26] sched: early boot clock
  2019-01-08  1:04                 ` Pavel Tatashin
@ 2019-01-08  1:09                   ` Dominique Martinet
  2019-01-26  2:04                   ` Jon DeVree
  1 sibling, 0 replies; 76+ messages in thread
From: Dominique Martinet @ 2019-01-08  1:09 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Salvatore Bonaccorso, Steven Sistare, Pavel Tatashin,
	Daniel Jordan, linux, schwidefsky, heiko.carstens, John Stultz,
	sboyd, x86, LKML, mingo, Thomas Gleixner, hpa, douly.fnst,
	Peter Zijlstra, prarit, feng.tang, Petr Mladek, gnomes,
	linux-s390, boris.ostrovsky, jgross, pbonzini, virtualization,
	kvm, qemu-devel

Pavel Tatashin wrote on Mon, Jan 07, 2019:
> I did exactly the same sequence on Kaby Lake CPU and could not
> reproduce it. What is your host CPU?

skylake consumer laptop CPU: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz

I don't have any kaby lake around; I have access to older servers though...
-- 
Dominique

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v15 23/26] sched: early boot clock
  2019-01-08  1:04                 ` Pavel Tatashin
  2019-01-08  1:09                   ` Dominique Martinet
@ 2019-01-26  2:04                   ` Jon DeVree
  2019-01-26 16:11                     ` Pavel Tatashin
  1 sibling, 1 reply; 76+ messages in thread
From: Jon DeVree @ 2019-01-26  2:04 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Dominique Martinet, Salvatore Bonaccorso, Steven Sistare,
	Pavel Tatashin, Daniel Jordan, linux, schwidefsky,
	heiko.carstens, John Stultz, sboyd, x86, LKML, mingo,
	Thomas Gleixner, hpa, douly.fnst, Peter Zijlstra, prarit,
	feng.tang, Petr Mladek, gnomes, linux-s390, boris.ostrovsky,
	jgross, pbonzini, virtualization, kvm, qemu-devel

On Mon, Jan 07, 2019 at 20:04:41 -0500, Pavel Tatashin wrote:
> I did exactly the same sequence on Kaby Lake CPU and could not
> reproduce it. What is your host CPU?
> 

I have some machines which display this bug and others that don't, so I
was able to figure out the difference between their configurations.

TL;DR: the bug appears to depend on whether or not
/sys/devices/system/clocksource/clocksource0/current_clocksource is set
to TSC in the hypervisor.

This is the log from a machine with the bug:

[    0.000000] Hypervisor detected: KVM
[    0.000000] kvm-clock: Using msrs 12 and 11
[1162908.013830] kvm-clock: cpu 0, msr 3e0fea001, primary cpu clock
[1162908.013830] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[1162908.013834] tsc: Detected 1899.888 MHz processor

This is the log from a machine without the bug:

[    0.000000] Hypervisor detected: KVM
[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[    0.000000] kvm-clock: cpu 0, msr 149fea001, primary cpu clock
[    0.000000] kvm-clock: using sched offset of 1558436482528906 cycles
[    0.000002] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000004] tsc: Detected 2097.570 MHz processor

Note the additional line of output on the machine without the bug:

[    0.000000] kvm-clock: using sched offset of 1558436482528906 cycles

This is printed from kvm_sched_clock_init() in
arch/x86/kernel/kvmclock.c based on whether or not the clock is stable.
For the clock to be stable both KVM_FEATURE_CLOCKSOURCE_STABLE_BIT and
PVCLOCK_TSC_STABLE_BIT have to be set.  Both of these are controlled by
the hypervisor kernel.

* KVM_FEATURE_CLOCKSOURCE_STABLE_BIT is always set by the hypervisor
  starting with Linux v2.6.35 - 371bcf646d17 ("KVM: x86: Tell the guest
  we'll warn it about tsc stability")
* PVCLOCK_TSC_STABLE_BIT is set starting in Linux v3.8 but only if the
  clocksource is the TSC - d828199e8444 ("KVM: x86: implement
  PVCLOCK_TSC_STABLE_BIT pvclock flag")

I changed the clocksource of a hypervisor that wasn't having issues from
TSC to HPET and when I started up a guest VM the bug suddenly appeared.
I shut down the guest, set the hypervisor's clocksource back to TSC,
started up the guest and the bug went away again.

You don't actually have to reboot the guest before the bug is visible
either, just letting the guest sit at the GRUB menu for a minute or two
before loading Linux is enough to make the bug plainly visible in the
printk timestamps.

I don't know enough to actually fix the bug, but hopefully this is
enough to allow everyone else to reproduce it and come up with a fix.

-- 
Jon
X(7): A program for managing terminal windows. See also screen(1) and tmux(1).

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v15 23/26] sched: early boot clock
  2019-01-26  2:04                   ` Jon DeVree
@ 2019-01-26 16:11                     ` Pavel Tatashin
  0 siblings, 0 replies; 76+ messages in thread
From: Pavel Tatashin @ 2019-01-26 16:11 UTC (permalink / raw)
  To: Dominique Martinet, Salvatore Bonaccorso, Steven Sistare,
	Pavel Tatashin, Daniel Jordan, linux, schwidefsky,
	heiko.carstens, John Stultz, sboyd, x86, LKML, mingo,
	Thomas Gleixner, hpa, douly.fnst, Peter Zijlstra, prarit,
	feng.tang, Petr Mladek, gnomes, linux-s390, boris.ostrovsky,
	jgross, pbonzini, virtualization, kvm, qemu-devel

On 19-01-25 21:04:10, Jon DeVree wrote:
> 
> * KVM_FEATURE_CLOCKSOURCE_STABLE_BIT is always set by the hypervisor
>   starting with Linux v2.6.35 - 371bcf646d17 ("KVM: x86: Tell the guest
>   we'll warn it about tsc stability")
> * PVCLOCK_TSC_STABLE_BIT is set starting in Linux v3.8 but only if the
>   clocksource is the TSC - d828199e8444 ("KVM: x86: implement
>   PVCLOCK_TSC_STABLE_BIT pvclock flag")
> 
> I changed the clocksource of a hypervisor that wasn't having issues from
> TSC to HPET and when I started up a guest VM the bug suddenly appeared.
> I shut down the guest, set the hypervisor's clocksource back to TSC,
> started up the guest and the bug went away again.
> 
> You don't actually have to reboot the guest before the bug is visible
> either, just letting the guest sit at the GRUB menu for a minute or two
> before loading Linux is enough to make the bug plainly visible in the
> printk timestamps.
> 
> I don't know enough to actually fix the bug, but hopefully this is
> enough to allow everyone else to reproduce it and come up with a fix.

Thank you very much for your analysis, I am now able to reproduce the
problem by setting clocksource on my machine to hpet. I will soon submit a
patch with a fix.

Pasha

> 
> -- 
> Jon
> X(7): A program for managing terminal windows. See also screen(1) and tmux(1).

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2019-01-26 16:11 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-19 20:55 [PATCH v15 00/26] Early boot time stamps Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 01/26] x86/kvmclock: Remove memblock dependency Pavel Tatashin
2018-07-19 22:21   ` [tip:x86/timers] " tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 02/26] x86/kvmclock: Remove page size requirement from wall_clock Pavel Tatashin
2018-07-19 22:22   ` [tip:x86/timers] " tip-bot for Thomas Gleixner
2018-07-19 20:55 ` [PATCH v15 03/26] x86/kvmclock: Decrapify kvm_register_clock() Pavel Tatashin
2018-07-19 22:23   ` [tip:x86/timers] " tip-bot for Thomas Gleixner
2018-07-19 20:55 ` [PATCH v15 04/26] x86/kvmclock: Cleanup the code Pavel Tatashin
2018-07-19 22:23   ` [tip:x86/timers] " tip-bot for Thomas Gleixner
2018-07-19 20:55 ` [PATCH v15 05/26] x86/kvmclock: Mark variables __initdata and __ro_after_init Pavel Tatashin
2018-07-19 22:24   ` [tip:x86/timers] " tip-bot for Thomas Gleixner
2018-07-19 20:55 ` [PATCH v15 06/26] x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock Pavel Tatashin
2018-07-19 22:24   ` [tip:x86/timers] " tip-bot for Thomas Gleixner
2018-07-19 20:55 ` [PATCH v15 07/26] x86/kvmclock: Switch kvmclock data to a PER_CPU variable Pavel Tatashin
2018-07-19 22:25   ` [tip:x86/timers] " tip-bot for Thomas Gleixner
2018-07-19 20:55 ` [PATCH v15 08/26] x86: text_poke() may access uninitialized struct pages Pavel Tatashin
2018-07-19 22:25   ` [tip:x86/timers] x86/alternatives, jumplabel: Use text_poke_early() before mm_init() tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 09/26] x86: initialize static branching early Pavel Tatashin
2018-07-19 22:26   ` [tip:x86/timers] x86/jump_label: Initialize " tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 10/26] x86/CPU: Call detect_nopl() only on the BSP Pavel Tatashin
2018-07-19 22:26   ` [tip:x86/timers] " tip-bot for Borislav Petkov
2018-07-19 20:55 ` [PATCH v15 11/26] x86/tsc: redefine notsc to behave as tsc=unstable Pavel Tatashin
2018-07-19 22:27   ` [tip:x86/timers] x86/tsc: Redefine " tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 12/26] x86/xen/time: initialize pv xen time in init_hypervisor_platform Pavel Tatashin
2018-07-19 22:27   ` [tip:x86/timers] x86/xen/time: Initialize pv xen time in init_hypervisor_platform() tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 13/26] x86/xen/time: output xen sched_clock time from 0 Pavel Tatashin
2018-07-19 22:28   ` [tip:x86/timers] x86/xen/time: Output " tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 14/26] s390/time: add read_persistent_wall_and_boot_offset() Pavel Tatashin
2018-07-19 22:28   ` [tip:x86/timers] s390/time: Add read_persistent_wall_and_boot_offset() tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 15/26] time: replace read_boot_clock64() with read_persistent_wall_and_boot_offset() Pavel Tatashin
2018-07-19 22:29   ` [tip:x86/timers] timekeeping: Replace " tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 16/26] time: default boot time offset to local_clock() Pavel Tatashin
2018-07-19 22:29   ` [tip:x86/timers] timekeeping: Default " tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 17/26] s390/time: remove read_boot_clock64() Pavel Tatashin
2018-07-19 22:30   ` [tip:x86/timers] s390/time: Remove read_boot_clock64() tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 18/26] ARM/time: remove read_boot_clock64() Pavel Tatashin
2018-07-19 22:30   ` [tip:x86/timers] ARM/time: Remove read_boot_clock64() tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 19/26] x86/tsc: calibrate tsc only once Pavel Tatashin
2018-07-19 22:31   ` [tip:x86/timers] x86/tsc: Calibrate " tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 20/26] x86/tsc: initialize cyc2ns when tsc freq. is determined Pavel Tatashin
2018-07-19 22:31   ` [tip:x86/timers] x86/tsc: Initialize cyc2ns when tsc frequency " tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 21/26] x86/tsc: use tsc early Pavel Tatashin
2018-07-19 22:32   ` [tip:x86/timers] x86/tsc: Use TSC as sched clock early tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 22/26] sched: move sched clock initialization and merge with generic clock Pavel Tatashin
2018-07-19 22:32   ` [tip:x86/timers] sched/clock: Move " tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 23/26] sched: early boot clock Pavel Tatashin
2018-07-19 22:33   ` [tip:x86/timers] sched/clock: Enable sched clock early tip-bot for Pavel Tatashin
2018-07-24 19:52     ` Guenter Roeck
2018-07-24 20:22       ` Pavel Tatashin
2018-07-25  0:36         ` Pavel Tatashin
2018-07-25  1:24           ` Guenter Roeck
2018-07-25  2:05             ` Pavel Tatashin
2018-07-25  2:41               ` Pavel Tatashin
2018-07-30 12:36                 ` Peter Zijlstra
2018-07-30 13:44                   ` Pavel Tatashin
2018-07-20  8:09   ` [PATCH v15 23/26] sched: early boot clock Peter Zijlstra
2018-07-20 10:00     ` [tip:x86/timers] sched/clock: Close a hole in sched_clock_init() tip-bot for Peter Zijlstra
2018-11-06  5:42   ` [PATCH v15 23/26] sched: early boot clock Dominique Martinet
2018-11-06 11:35     ` Steven Sistare
2019-01-02 20:20       ` Salvatore Bonaccorso
2019-01-03 21:28         ` Pavel Tatashin
2019-01-03 23:43           ` Dominique Martinet
2019-01-07 18:17             ` Pavel Tatashin
2019-01-07 23:48               ` Dominique Martinet
2019-01-08  1:04                 ` Pavel Tatashin
2019-01-08  1:09                   ` Dominique Martinet
2019-01-26  2:04                   ` Jon DeVree
2019-01-26 16:11                     ` Pavel Tatashin
2019-01-04  7:30           ` [PATCH v15 23/26] sched: early boot clock (was Re: Bug#918036: linux: uptime after reboot wrong (kvm-clock related?)) Thorsten Glaser
2018-07-19 20:55 ` [PATCH v15 24/26] sched: use static key for sched_clock_running Pavel Tatashin
2018-07-19 22:33   ` [tip:x86/timers] sched/clock: Use " tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 25/26] x86/tsc: split native_calibrate_cpu() into early and late parts Pavel Tatashin
2018-07-19 22:34   ` [tip:x86/timers] x86/tsc: Split " tip-bot for Pavel Tatashin
2018-07-19 20:55 ` [PATCH v15 26/26] x86/tsc: use tsc_calibrate_cpu_early and pit_hpet_ptimer_calibrate_cpu Pavel Tatashin
2018-07-19 22:35   ` [tip:x86/timers] x86/tsc: Make use of tsc_calibrate_cpu_early() tip-bot for Pavel Tatashin
2018-07-19 22:34 ` [PATCH v15 00/26] Early boot time stamps Thomas Gleixner
     [not found] <154644530361.2390.18185252504260044930.reportbug@ci-busyapps.lan.tarent.de>
     [not found] ` <20190102163939.GB13145@eldamar.local>
