* [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
@ 2016-08-10 18:29 Waiman Long
  2016-08-10 18:37 ` Long, Wai Man
  2016-08-11 19:32 ` Dave Hansen
  0 siblings, 2 replies; 17+ messages in thread
From: Waiman Long @ 2016-08-10 18:29 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: linux-kernel, x86, Jiang Liu, Borislav Petkov, Andy Lutomirski,
	Prarit Bhargava, Scott J Norton, Douglas Hatch, Randy Wright,
	Waiman Long

On a large system with many CPUs, using HPET as the clock source can
have a significant impact on the overall system performance because
of the following reasons:
 1) There is a single HPET counter shared by all the CPUs.
 2) HPET counter reading is a very slow operation.

Using HPET as the default clock source may happen when, for example,
the TSC clock calibration exceeds the allowable tolerance. Sometimes
the performance slowdown can be so severe that the system may crash
because of an NMI watchdog soft lockup.

This patch attempts to reduce HPET read contention by using the fact
that if more than one CPU is trying to access the HPET at the same
time, it is more efficient to have one CPU in the group read the HPET
counter and share it with the rest of the group than to have each
group member read the HPET counter individually.

This is done by using a combination word with a sequence number and
a bit lock. The CPU that gets the bit lock will be responsible for
reading the HPET counter and updating the sequence number. The others
will monitor the change in sequence number and grab the HPET counter
value accordingly. This change is enabled only on SMP configurations.

On a 4-socket Haswell-EX box with 72 cores (HT off), running the
AIM7 compute workload (1500 users) on a 4.6-rc1 kernel (HZ=1000)
with and without the patch produced the following performance numbers
(with HPET or TSC as the clock source):

TSC		= 646515 jobs/min
HPET w/o patch	= 566708 jobs/min
HPET with patch	= 638791 jobs/min

The perf profile showed a reduction of the %CPU time consumed by
read_hpet from 4.99% without patch to 1.41% with patch.

On a 16-socket IvyBridge-EX system with 240 cores (HT on), on the
other hand, the performance numbers of the same benchmark were:

TSC		= 3145329 jobs/min
HPET w/o patch	= 1108537 jobs/min
HPET with patch	= 3019934 jobs/min

The corresponding perf profile showed a drop in the CPU consumption of
the read_hpet function from more than 34% to just 2.96%.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 v3->v4:
  - Move hpet_save inside the CONFIG_SMP block to fix a compilation
    warning in non-SMP build.

 v2->v3:
  - Make the hpet optimization the default for SMP configuration. So
    no documentation change is needed.
  - Remove threshold checking code as it should not be necessary and
    can be potentially unsafe.

 v1->v2:
  - Reduce the CPU threshold to 32.
  - Add a kernel parameter to explicitly enable or disable hpet
    optimization.
  - Change hpet_save.hpet type to u32 to make sure that read & write
    is atomic on i386.

 arch/x86/kernel/hpet.c |   84 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 84 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index a1f0e4a..bc5bb53 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -759,12 +759,96 @@ static int hpet_cpuhp_notify(struct notifier_block *n,
 #endif
 
 /*
+ * Reading the HPET counter is a very slow operation. If a large number of
+ * CPUs are trying to access the HPET counter simultaneously, it can cause
+ * massive delay and slow down system performance dramatically. This may
+ * happen when HPET is the default clock source instead of TSC. For a
+ * really large system with hundreds of CPUs, the slowdown may be so
+ * severe that it may actually crash the system because of an NMI watchdog
+ * soft lockup, for example.
+ *
+ * If multiple CPUs are trying to access the HPET counter at the same time,
+ * we don't actually need to read the counter multiple times. Instead, the
+ * other CPUs can use the counter value read by the first CPU in the group.
+ *
+ * A sequence number whose lsb is a lock bit is used to control which CPU
+ * has the right to read the HPET counter directly and which CPUs are going
+ * to get the indirect value read by the lock holder. For the latter group,
+ * if the sequence number differs from the expected locked value, they
+ * can assume that the saved HPET value is up-to-date and return it.
+ */
+#define HPET_SEQ_LOCKED(seq)	((seq) & 1)	/* Odd == locked */
+
+/*
  * Clock source related code
  */
+#ifdef CONFIG_SMP
+static struct {
+	/* Sequence number + bit lock */
+	int seq ____cacheline_aligned_in_smp;
+
+	/* Current HPET value		*/
+	u32 hpet ____cacheline_aligned_in_smp;
+} hpet_save;
+
+static cycle_t read_hpet(struct clocksource *cs)
+{
+	int seq;
+
+	seq = READ_ONCE(hpet_save.seq);
+	if (!HPET_SEQ_LOCKED(seq)) {
+		int old, new = seq + 1;
+		unsigned long flags;
+
+		local_irq_save(flags);
+		/*
+		 * Set the lock bit (lsb) to get the right to read HPET
+		 * counter directly. If successful, read the counter, save
+		 * its value, and increment the sequence number. Otherwise,
+		 * increment the sequence number to the expected locked value
+		 * for comparison later on.
+		 */
+		old = cmpxchg(&hpet_save.seq, seq, new);
+		if (old == seq) {
+			u32 time;
+
+			time = hpet_save.hpet = hpet_readl(HPET_COUNTER);
+
+			/* Unlock */
+			smp_store_release(&hpet_save.seq, new + 1);
+			local_irq_restore(flags);
+			return (cycle_t)time;
+		}
+		local_irq_restore(flags);
+		seq = new;
+	}
+
+	/*
+	 * Wait until the locked sequence number changes which indicates
+	 * that the saved HPET value is up-to-date.
+	 */
+	while (READ_ONCE(hpet_save.seq) == seq) {
+		/*
+		 * Since reading the HPET is much slower than a single
+		 * cpu_relax() instruction, we use two here in an attempt
+		 * to reduce the amount of cacheline contention in the
+		 * hpet_save.seq cacheline.
+		 */
+		cpu_relax();
+		cpu_relax();
+	}
+
+	return (cycle_t)READ_ONCE(hpet_save.hpet);
+}
+#else /* CONFIG_SMP */
+/*
+ * For UP
+ */
 static cycle_t read_hpet(struct clocksource *cs)
 {
 	return (cycle_t)hpet_readl(HPET_COUNTER);
 }
+#endif /* CONFIG_SMP */
 
 static struct clocksource clocksource_hpet = {
 	.name		= "hpet",
-- 
1.7.1


* RE: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-08-10 18:29 [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention Waiman Long
@ 2016-08-10 18:37 ` Long, Wai Man
  2016-08-10 19:01   ` Prarit Bhargava
  2016-08-11 19:32 ` Dave Hansen
  1 sibling, 1 reply; 17+ messages in thread
From: Long, Wai Man @ 2016-08-10 18:37 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: linux-kernel, x86, Borislav Petkov, Andy Lutomirski,
	Prarit Bhargava, Norton, Scott J, Hatch,
	Douglas B (HPE Servers - Linux),
	Wright, Randy (HPE Servers Linux)

Hi,

I would like to restart the discussion about the merit of this patch.

This patch was created in response to a problem we have on the 16-socket Broadwell-EX systems (up to 768 logical CPUs) that were under development. About 10% of the kernel boots experienced soft lockups:

[   71.618132] NetLabel: Initializing
[   71.621967] NetLabel:  domain hash size = 128
[   71.626848] NetLabel:  protocols = UNLABELED CIPSOv4
[   71.632418] NetLabel:  unlabeled traffic allowed by default
[   71.638679] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0, 0, 0, 0, 0
[   71.646504] hpet0: 8 comparators, 64-bit 14.318180 MHz counter
[   71.655313] Switching to clocksource hpet
[   95.679135] BUG: soft lockup - CPU#144 stuck for 23s! [swapper/144:0]
[   95.693363] BUG: soft lockup - CPU#145 stuck for 23s! [swapper/145:0]
[   95.694203] Modules linked in:
[   95.694697] CPU: 145 PID: 0 Comm: swapper/145 Not tainted
3.10.0-327.el7.x86_64 #1
[   95.695580] BUG: soft lockup - CPU#582 stuck for 23s! [swapper/582:0]
[   95.696145] Hardware name: HP Superdome2 16s x86, BIOS Bundle: 
008.001.006
SFW: 041.063.152 01/16/2016
[   95.698128] BUG: soft lockup - CPU#357 stuck for 23s! [swapper/357:0]
[   95.699774] task: ffff8cf7fecf4500 ti: ffff89787c748000 task.ti: 
ffff89787c748000

During the bootup process, there is a short period where the clocksource is switched to hpet to calibrate the TSCs. It is switched back to tsc once the calibration is done, and it is during this short period that the soft lockups may happen.

Prarit also hit this problem with a smaller Intel box that has 96 cores (192 threads). Maybe he can supply more information about what he has seen.

Cheers,
Longman


* Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-08-10 18:37 ` Long, Wai Man
@ 2016-08-10 19:01   ` Prarit Bhargava
  0 siblings, 0 replies; 17+ messages in thread
From: Prarit Bhargava @ 2016-08-10 19:01 UTC (permalink / raw)
  To: Long, Wai Man, Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: linux-kernel, x86, Borislav Petkov, Andy Lutomirski, Norton,
	Scott J, Hatch, Douglas B (HPE Servers - Linux),
	Wright, Randy (HPE Servers Linux)



On 08/10/2016 02:37 PM, Long, Wai Man wrote:
> Hi,
> 
> I would like to restart the discussion about the merit of this patch.
> 
> This patch was created in response to a problem we have on the 16-socket Broadwell-EX systems (up to 768 logical CPUs) that were under development. About 10% of the kernel boots experienced soft lockups:
> 
> [   71.618132] NetLabel: Initializing
> [   71.621967] NetLabel:  domain hash size = 128
> [   71.626848] NetLabel:  protocols = UNLABELED CIPSOv4
> [   71.632418] NetLabel:  unlabeled traffic allowed by default
> [   71.638679] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0, 0, 0, 0, 0
> [   71.646504] hpet0: 8 comparators, 64-bit 14.318180 MHz counter
> [   71.655313] Switching to clocksource hpet
> [   95.679135] BUG: soft lockup - CPU#144 stuck for 23s! [swapper/144:0]
> [   95.693363] BUG: soft lockup - CPU#145 stuck for 23s! [swapper/145:0]
> [   95.694203] Modules linked in:
> [   95.694697] CPU: 145 PID: 0 Comm: swapper/145 Not tainted
> 3.10.0-327.el7.x86_64 #1
> [   95.695580] BUG: soft lockup - CPU#582 stuck for 23s! [swapper/582:0]
> [   95.696145] Hardware name: HP Superdome2 16s x86, BIOS Bundle: 
> 008.001.006
> SFW: 041.063.152 01/16/2016
> [   95.698128] BUG: soft lockup - CPU#357 stuck for 23s! [swapper/357:0]
> [   95.699774] task: ffff8cf7fecf4500 ti: ffff89787c748000 task.ti: 
> ffff89787c748000
> 
> During the bootup process, there is a short period where the clocksource is switched to hpet to calibrate the TSCs. It is switched back to tsc once the calibration is done, and it is during this short period that the soft lockups may happen.
> 
> Prarit also hit this problem with a smaller Intel box that has 96 cores (192 threads). Maybe he can supply more information about what he has seen.
> 

I've hit this on a system with 192 threads.  The TSC is functional and has
passed the TSC sync checks during boot.  When the HPET is used to resynchronize
the TSC, I occasionally see

PCI: Using ACPI for IRQ routing
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel:  unlabeled traffic allowed by default
hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0, 0, 0, 0, 0
hpet0: 8 comparators, 64-bit 24.000000 MHz counter
Switched to clocksource hpet

followed by the same NMI flood that Waiman described.  After some debugging I
came to the same conclusion that Waiman had: the HPET is causing contention
when many threads on the system access it rapidly.

After applying his patch the problem no longer occurs.

P.


* Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-08-10 18:29 [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention Waiman Long
  2016-08-10 18:37 ` Long, Wai Man
@ 2016-08-11 19:32 ` Dave Hansen
  2016-08-11 23:22   ` Waiman Long
  1 sibling, 1 reply; 17+ messages in thread
From: Dave Hansen @ 2016-08-11 19:32 UTC (permalink / raw)
  To: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: linux-kernel, x86, Jiang Liu, Borislav Petkov, Andy Lutomirski,
	Prarit Bhargava, Scott J Norton, Douglas Hatch, Randy Wright,
	John Stultz

On 08/10/2016 11:29 AM, Waiman Long wrote:
> +static cycle_t read_hpet(struct clocksource *cs)
> +{
> +	int seq;
> +
> +	seq = READ_ONCE(hpet_save.seq);
> +	if (!HPET_SEQ_LOCKED(seq)) {
...
> +	}
> +
> +	/*
> +	 * Wait until the locked sequence number changes which indicates
> +	 * that the saved HPET value is up-to-date.
> +	 */
> +	while (READ_ONCE(hpet_save.seq) == seq) {
> +		/*
> +		 * Since reading the HPET is much slower than a single
> +		 * cpu_relax() instruction, we use two here in an attempt
> +		 * to reduce the amount of cacheline contention in the
> +		 * hpet_save.seq cacheline.
> +		 */
> +		cpu_relax();
> +		cpu_relax();
> +	}
> +
> +	return (cycle_t)READ_ONCE(hpet_save.hpet);
> +}

It's a real bummer that this all has to be open-coded.  I have to wonder
if there were any alternatives that you tried that were simpler.

Is READ_ONCE()/smp_store_release() really strong enough here?  It
guarantees ordering, but you need ordering *and* a guarantee that your
write is visible to the reader.  Don't you need actual barriers for
that?  Otherwise, you might be seeing a stale HPET value, and the spin
loop that you did waiting for it to be up-to-date was worthless.  The
seqlock code uses barriers, btw.
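
For reference, the seqlock read side looks schematically like this
(a simplified sketch; 'lock' is a seqlock_t and 'shared_value' the
data it protects -- placeholder names, not code from the patch):

	unsigned seq;
	u32 val;

	do {
		seq = read_seqbegin(&lock);	/* includes smp_rmb() */
		val = READ_ONCE(shared_value);
	} while (read_seqretry(&lock, seq));	/* includes smp_rmb() */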

Also, since you're fundamentally reading a second-hand HPET value, does
that have any impact on the precision of the HPET as a timesource?  Or,
is it so coarse already that this isn't an issue?


* Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-08-11 19:32 ` Dave Hansen
@ 2016-08-11 23:22   ` Waiman Long
  2016-08-12  0:31     ` Dave Hansen
  0 siblings, 1 reply; 17+ messages in thread
From: Waiman Long @ 2016-08-11 23:22 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel, x86,
	Jiang Liu, Borislav Petkov, Andy Lutomirski, Prarit Bhargava,
	Scott J Norton, Douglas Hatch, Randy Wright, John Stultz

On 08/11/2016 03:32 PM, Dave Hansen wrote:
> On 08/10/2016 11:29 AM, Waiman Long wrote:
>> +static cycle_t read_hpet(struct clocksource *cs)
>> +{
>> +	int seq;
>> +
>> +	seq = READ_ONCE(hpet_save.seq);
>> +	if (!HPET_SEQ_LOCKED(seq)) {
> ...
>> +	}
>> +
>> +	/*
>> +	 * Wait until the locked sequence number changes which indicates
>> +	 * that the saved HPET value is up-to-date.
>> +	 */
>> +	while (READ_ONCE(hpet_save.seq) == seq) {
>> +		/*
>> +		 * Since reading the HPET is much slower than a single
>> +		 * cpu_relax() instruction, we use two here in an attempt
>> +		 * to reduce the amount of cacheline contention in the
>> +		 * hpet_save.seq cacheline.
>> +		 */
>> +		cpu_relax();
>> +		cpu_relax();
>> +	}
>> +
>> +	return (cycle_t)READ_ONCE(hpet_save.hpet);
>> +}
> It's a real bummer that this all has to be open-coded.  I have to wonder
> if there were any alternatives that you tried that were simpler.

What do you mean by "open-coded"? Do you mean the function can be inlined?


> Is READ_ONCE()/smp_store_release() really strong enough here?  It
> guarantees ordering, but you need ordering *and* a guarantee that your
> write is visible to the reader.  Don't you need actual barriers for
> that?  Otherwise, you might be seeing a stale HPET value, and the spin
> loop that you did waiting for it to be up-to-date was worthless.  The
> seqlock code uses barriers, btw.

The cmpxchg() and smp_store_release() act as the lock/unlock sequence
with the proper barriers. Another important point is that the hpet value
is made visible to the other readers before the sequence number. This is
what the smp_store_release() is providing. cmpxchg() is a full barrier,
even though smp_store_release() is not. However, the x86 architecture
will guarantee that the writes become visible in order, I think.
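
Schematically, the pairing I am relying on is (annotations are mine,
not the literal patch comments):

	/* Writer - the CPU that won the cmpxchg: */
	old = cmpxchg(&hpet_save.seq, seq, new);	/* lock; full barrier */
	hpet_save.hpet = hpet_readl(HPET_COUNTER);	/* publish the value ... */
	smp_store_release(&hpet_save.seq, new + 1);	/* ... then unlock */

	/* Reader - a contended CPU: */
	while (READ_ONCE(hpet_save.seq) == seq)		/* wait for the unlock */
		cpu_relax();
	return (cycle_t)READ_ONCE(hpet_save.hpet);	/* x86 keeps these two
							   reads in program order */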

> Also, since you're fundamentally reading a second-hand HPET value, does
> that have any impact on the precision of the HPET as a timesource?  Or,
> is it so coarse already that this isn't an issue?

There can always be unexpected latency in the returned time value, such 
as an interrupt or NMI. I think as long as the time won't go backward, 
it should be fine.

Cheers,
Longman


* Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-08-11 23:22   ` Waiman Long
@ 2016-08-12  0:31     ` Dave Hansen
  2016-08-12 17:01       ` Waiman Long
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Hansen @ 2016-08-12  0:31 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel, x86,
	Jiang Liu, Borislav Petkov, Andy Lutomirski, Prarit Bhargava,
	Scott J Norton, Douglas Hatch, Randy Wright, John Stultz

On 08/11/2016 04:22 PM, Waiman Long wrote:
> On 08/11/2016 03:32 PM, Dave Hansen wrote:
>> It's a real bummer that this all has to be open-coded.  I have to wonder
>> if there were any alternatives that you tried that were simpler.
> 
> What do you mean by "open-coded"? Do you mean the function can be inlined?

I just mean that it's implementing its own locking instead of being able
to use spinlocks or seqlocks, or some other existing primitive.

>> Is READ_ONCE()/smp_store_release() really strong enough here?  It
>> guarantees ordering, but you need ordering *and* a guarantee that your
>> write is visible to the reader.  Don't you need actual barriers for
>> that?  Otherwise, you might be seeing a stale HPET value, and the spin
>> loop that you did waiting for it to be up-to-date was worthless.  The
>> seqlock code uses barriers, btw.
> 
> The cmpxchg() and smp_store_release() act as the lock/unlock sequence
> with the proper barriers. Another important point is that the hpet value
> is made visible to the other readers before the sequence number. This is
> what the smp_store_release() is providing. cmpxchg() is a full barrier,
> even though smp_store_release() is not. However, the x86 architecture
> will guarantee that the writes become visible in order, I think.

The contended case (where HPET_SEQ_LOCKED(seq)) doesn't do the cmpxchg.
 So it's entirely relying on the READ_ONCE() on the "reader" side and
the cmpxchg/smp_store_release() on the "writer".  This probably works in
practice, but I'm not sure it's guaranteed behavior.


* Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-08-12  0:31     ` Dave Hansen
@ 2016-08-12 17:01       ` Waiman Long
  2016-08-12 17:16         ` Dave Hansen
  0 siblings, 1 reply; 17+ messages in thread
From: Waiman Long @ 2016-08-12 17:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel, x86,
	Jiang Liu, Borislav Petkov, Andy Lutomirski, Prarit Bhargava,
	Scott J Norton, Douglas Hatch, Randy Wright, John Stultz

On 08/11/2016 08:31 PM, Dave Hansen wrote:
> On 08/11/2016 04:22 PM, Waiman Long wrote:
>> On 08/11/2016 03:32 PM, Dave Hansen wrote:
>>> It's a real bummer that this all has to be open-coded.  I have to wonder
>>> if there were any alternatives that you tried that were simpler.
>> What do you mean by "open-coded"? Do you mean the function can be inlined?
> I just mean that it's implementing its own locking instead of being able
> to use spinlocks or seqlocks, or some other existing primitive.

The reason for using a special lock is that I want both the sequence
number update and the locking to be done together atomically. They can
be made separate as is done in the seqlock. However, that would make the
code more complex in order to ensure that all the threads see a
consistent set of lock state and sequence number.
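
In other words, the lsb of the combined word is the lock, so a single
cmpxchg() takes the lock and advances the sequence number at the same
time. A sketch of the encoding (just the patch restated, not new code):

	/*
	 * hpet_save.seq == 2n   : unlocked, n updates completed
	 * hpet_save.seq == 2n+1 : locked, update n+1 in progress
	 */
	old = cmpxchg(&hpet_save.seq, seq, seq + 1);	/* even -> odd: lock + bump */
	hpet_save.hpet = hpet_readl(HPET_COUNTER);	/* update under the bit lock */
	smp_store_release(&hpet_save.seq, seq + 2);	/* odd -> even: unlock + bump */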

>>> Is READ_ONCE()/smp_store_release() really strong enough here?  It
>>> guarantees ordering, but you need ordering *and* a guarantee that your
>>> write is visible to the reader.  Don't you need actual barriers for
>>> that?  Otherwise, you might be seeing a stale HPET value, and the spin
>>> loop that you did waiting for it to be up-to-date was worthless.  The
>>> seqlock code uses barriers, btw.
>> The cmpxchg() and smp_store_release() act as the lock/unlock sequence
>> with the proper barriers. Another important point is that the hpet value
>> is made visible to the other readers before the sequence number. This is
>> what the smp_store_release() is providing. cmpxchg() is a full barrier,
>> even though smp_store_release() is not. However, the x86 architecture
>> will guarantee that the writes become visible in order, I think.
> The contended case (where HPET_SEQ_LOCKED(seq)) doesn't do the cmpxchg.
>   So it's entirely relying on the READ_ONCE() on the "reader" side and
> the cmpxchg/smp_store_release() on the "writer".  This probably works in
> practice, but I'm not sure it's guaranteed behavior.
>

It is true that the latency before the sequence number change becomes
visible to others can be unpredictable. All the writer-side code does is
make sure that the new HPET value is visible before the sequence number
change. Do you know of a way to reduce the latency without introducing
too much overhead, like changing the smp_store_release() to
smp_store_mb(), maybe?

Cheers,
Longman


* Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-08-12 17:01       ` Waiman Long
@ 2016-08-12 17:16         ` Dave Hansen
  2016-08-12 18:31           ` Waiman Long
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Hansen @ 2016-08-12 17:16 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel, x86,
	Jiang Liu, Borislav Petkov, Andy Lutomirski, Prarit Bhargava,
	Scott J Norton, Douglas Hatch, Randy Wright, John Stultz

On 08/12/2016 10:01 AM, Waiman Long wrote:
> The reason for using a special lock is that I want both the sequence
> number update and the locking to be done together atomically. They can
> be made separate as is done in the seqlock. However, that would make the
> code more complex in order to ensure that all the threads see a
> consistent set of lock state and sequence number.

Why do we need a sequence number?  The "cached" HPET itself could be used.

I'm thinking something like below could use a spinlock instead of
doing a custom cmpxchg sequence.  The spin_is_locked() should allow the
contended "readers" to avoid using atomics.

spinlock_t hpet_lock;
u32 hpet_value;
...
{
	u32 old_hpet = READ_ONCE(hpet_value);
	u32 new_hpet;

	// need to ensure that the spin_is_locked() is ordered after
	// the READ_ONCE().
	smp_rmb();
	// spin_is_locked() doesn't do atomics
	if (!spin_is_locked(&hpet_lock) && spin_trylock(&hpet_lock)) {
		WRITE_ONCE(hpet_value, real_read_hpet());
		spin_unlock(&hpet_lock);
		return hpet_value;
	}
	// Contended case.  We spin here waiting for the guy who holds
	// the lock to write a new value to 'hpet_value'.
	//
	// We know that our old_hpet is older than our check for the
	// spinlock being locked. So, someone must either have already
	// updated it or be updating it.
	do {
		cpu_relax();
		// We do not do a rmb() here.  We don't need a guarantee
		// that this read is up-to-date, just that it will
		// _eventually_ see an up-to-date value.
		new_hpet = READ_ONCE(hpet_value);
	} while (old_hpet == new_hpet);
	return new_hpet;
}


* Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-08-12 17:16         ` Dave Hansen
@ 2016-08-12 18:31           ` Waiman Long
  2016-08-12 20:18             ` Andy Lutomirski
  0 siblings, 1 reply; 17+ messages in thread
From: Waiman Long @ 2016-08-12 18:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel, x86,
	Jiang Liu, Borislav Petkov, Andy Lutomirski, Prarit Bhargava,
	Scott J Norton, Douglas Hatch, Randy Wright, John Stultz

On 08/12/2016 01:16 PM, Dave Hansen wrote:
> On 08/12/2016 10:01 AM, Waiman Long wrote:
>> The reason for using a special lock is that I want both the sequence
>> number update and the locking to be done together atomically. They can
>> be made separate as is done in the seqlock. However, that would make the
>> code more complex in order to ensure that all the threads see a
>> consistent set of lock state and sequence number.
> Why do we need a sequence number?  The "cached" HPET itself could be used.
>
> I'm thinking something like below could use a spinlock instead of
> doing a custom cmpxchg sequence.  The spin_is_locked() should allow the
> contended "readers" to avoid using atomics.
>
> spinlock_t hpet_lock;
> u32 hpet_value;
> ...
> {
> 	u32 old_hpet = READ_ONCE(hpet_value);
> 	u32 new_hpet;
>
> 	// need to ensure that the spin_is_locked() is ordered after
> 	// the READ_ONCE().
> 	smp_rmb();
> 	// spin_is_locked() doesn't do atomics
> 	if (!spin_is_locked(&hpet_lock) && spin_trylock(&hpet_lock)) {
> 		WRITE_ONCE(hpet_value, real_read_hpet());
> 		spin_unlock(&hpet_lock);
> 		return hpet_value;
> 	}
> 	// Contended case.  We spin here waiting for the guy who holds
> 	// the lock to write a new value to 'hpet_value'.
> 	//
> 	// We know that our old_hpet is older than our check for the
> 	// spinlock being locked. So, someone must either have already
> 	// updated it or be updating it.
> 	do {
> 		cpu_relax();
> 		// We do not do a rmb() here.  We don't need a guarantee
> 		// that this read is up-to-date, just that it will
> 		// _eventually_ see an up-to-date value.
> 		new_hpet = READ_ONCE(hpet_value);
> 	} while (old_hpet == new_hpet);
> 	return new_hpet;
> }

Yes, I think that work too. I will update my patch accordingly. Thanks 
for the input.

Cheers,
Longman


* Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-08-12 18:31           ` Waiman Long
@ 2016-08-12 20:18             ` Andy Lutomirski
  2016-08-12 21:10               ` Waiman Long
  2016-08-12 21:16               ` Dave Hansen
  0 siblings, 2 replies; 17+ messages in thread
From: Andy Lutomirski @ 2016-08-12 20:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, John Stultz, Ingo Molnar, linux-kernel,
	Borislav Petkov, Jiang Liu, Randy Wright, Scott J Norton,
	Douglas Hatch, Dave Hansen, Prarit Bhargava, X86 ML,
	H. Peter Anvin

On Aug 12, 2016 9:31 PM, "Waiman Long" <waiman.long@hpe.com> wrote:
>
> On 08/12/2016 01:16 PM, Dave Hansen wrote:
>>
>> On 08/12/2016 10:01 AM, Waiman Long wrote:
>>>
>>> The reason for using a special lock is that I want both the sequence
>>> number update and the locking to be done together atomically. They can
>>> be made separate as is done in the seqlock. However, that would make the
>>> code more complex in order to ensure that all the threads see a
>>> consistent set of lock state and sequence number.
>>
>> Why do we need a sequence number?  The "cached" HPET itself could be used.
>>
>> I'm thinking something like below could use a spinlock instead of
>> doing a custom cmpxchg sequence.  The spin_is_locked() should allow the
>> contended "readers" to avoid using atomics.
>>
>> spinlock_t hpet_lock;
>> u32 hpet_value;
>> ...
>> {
>>         u32 old_hpet = READ_ONCE(hpet_value);
>>         u32 new_hpet;
>>
>>         // need to ensure that the spin_is_locked() is ordered after
>>         // the READ_ONCE().
>>         smp_rmb();
>>         // spin_is_locked() doesn't do atomics
>>         if (!spin_is_locked(&hpet_lock) && spin_trylock(&hpet_lock)) {
>>                 WRITE_ONCE(hpet_value, real_read_hpet());
>>                 spin_unlock(&hpet_lock);
>>                 return hpet_value;
>>         }
>>         // Contended case.  We spin here waiting for the guy who holds
>>         // the lock to write a new value to 'hpet_value'.
>>         //
>>         // We know that our old_hpet is older than our check for the
>>         // spinlock being locked. So, someone must either have already
>>         // updated it or be updating it.
>>         do {
>>                 cpu_relax();
>>                 // We do not do a rmb() here.  We don't need a guarantee
>>                 // that this read is up-to-date, just that it will
>>                 // _eventually_ see an up-to-date value.
>>                 new_hpet = READ_ONCE(hpet_value);
>>         } while (old_hpet == new_hpet);
>>         return new_hpet;
>> }
>
>
> Yes, I think that work too. I will update my patch accordingly. Thanks for the input.

Why is Dave more convincing than I was a couple months ago when I
asked a similar question? :)

I don't think this is right.  If the HPET ever returns the same value
twice in a row (unlikely because it's generally too slow to read, but
it's plausible that someone will make a fast HPET some day), then this
could deadlock.

Also, does this code need to be NMI-safe?  This implementation is
deadlocky if it's called from an NMI.

The original code was wait-free, right?  That was a nice property, too.

--Andy


* Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-08-12 20:18             ` Andy Lutomirski
@ 2016-08-12 21:10               ` Waiman Long
  2016-08-12 21:20                 ` Dave Hansen
  2016-08-12 21:16               ` Dave Hansen
  1 sibling, 1 reply; 17+ messages in thread
From: Waiman Long @ 2016-08-12 21:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, John Stultz, Ingo Molnar, linux-kernel,
	Borislav Petkov, Jiang Liu, Randy Wright, Scott J Norton,
	Douglas Hatch, Dave Hansen, Prarit Bhargava, X86 ML,
	H. Peter Anvin

On 08/12/2016 04:18 PM, Andy Lutomirski wrote:
> On Aug 12, 2016 9:31 PM, "Waiman Long" <waiman.long@hpe.com> wrote:
>> On 08/12/2016 01:16 PM, Dave Hansen wrote:
>>> On 08/12/2016 10:01 AM, Waiman Long wrote:
>>>> The reason for using a special lock is that I want both the sequence
>>>> number update and the locking to be done together atomically. They can
>>>> be made separate as is done in the seqlock. However, that would make the
>>>> code more complex in order to ensure that all the threads see a
>>>> consistent set of lock state and sequence number.
>>> Why do we need a sequence number?  The "cached" HPET itself could be used.
>>>
>>> I'm thinking something like below could use a spinlock instead of
>>> doing a custom cmpxchg sequence.  The spin_is_locked() should allow the
>>> contended "readers" to avoid using atomics.
>>>
>>> spinlock_t hpet_lock;
>>> u32 hpet_value;
>>> ...
>>> {
>>>          u32 old_hpet = READ_ONCE(hpet_value);
>>>          u32 new_hpet;
>>>
>>>          // need to ensure that the spin_is_locked() is ordered after
>>>          // the READ_ONCE().
>>>          smp_rmb();
>>>          // spin_is_locked() doesn't do atomics
>>>          if (!spin_is_locked(&hpet_lock) && spin_trylock(&hpet_lock)) {
>>>                  WRITE_ONCE(hpet_value, real_read_hpet());
>>>                  spin_unlock(&hpet_lock);
>>>                  return hpet_value;
>>>          }
>>>          // Contended case.  We spin here waiting for the guy who holds
>>>          // the lock to write a new value to 'hpet_value'.
>>>          //
>>>          // We know that our old_hpet is older than our check for the
>>>          // spinlock being locked. So, someone must either have already
>>>          // updated it or be updating it.
>>>          do {
>>>                  cpu_relax();
>>>                  // We do not do a rmb() here.  We don't need a guarantee
>>>                  // that this read is up-to-date, just that it will
>>>                  // _eventually_ see an up-to-date value.
>>>                  new_hpet = READ_ONCE(hpet_value);
>>>          } while (old_hpet == new_hpet);
>>>          return new_hpet;
>>> }
>>
>> Yes, I think that work too. I will update my patch accordingly. Thanks for the input.
> Why is Dave more convincing than I was a couple months ago when I
> asked a similar question? :)

I thought I had made some changes in accordance with your comments
previously. Maybe I missed something :-[

> I don't think this is right.  If the HPET ever returns the same value
> twice in a row (unlikely because it's generally too slow to read, but
> it's plausible that someone will make a fast HPET some day), then this
> could deadlock.

What is the deadlock scenario you are talking about?

>
> Also, does this code need to be NMI-safe?  This implementation is
> deadlocky if it's called from an NMI.

The code isn't NMI-safe, as NMIs are not maskable; neither is the
original code. We can check the interrupt state and read the HPET
directly if in NMI.
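
Something like this would do as a guard (hypothetical; this check is
not in the posted v4 patch):

	if (in_nmi())
		return (cycle_t)hpet_readl(HPET_COUNTER);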


>
> The original code was wait-free, right?  That was a nice property, too.
>
> --Andy

The v4 patch isn't wait-free, as we need to make sure that we read the
new HPET value instead of a stale one.

Cheers,
Longman


* Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-08-12 20:18             ` Andy Lutomirski
  2016-08-12 21:10               ` Waiman Long
@ 2016-08-12 21:16               ` Dave Hansen
  2016-08-12 21:32                 ` Waiman Long
  1 sibling, 1 reply; 17+ messages in thread
From: Dave Hansen @ 2016-08-12 21:16 UTC (permalink / raw)
  To: Andy Lutomirski, Waiman Long
  Cc: Thomas Gleixner, John Stultz, Ingo Molnar, linux-kernel,
	Borislav Petkov, Jiang Liu, Randy Wright, Scott J Norton,
	Douglas Hatch, Prarit Bhargava, X86 ML, H. Peter Anvin

On 08/12/2016 01:18 PM, Andy Lutomirski wrote:
> I don't think this is right.  If the HPET ever returns the same value
> twice in a row (unlikely because it's generally too slow to read, but
> it's plausible that someone will make a fast HPET some day), then this
> could deadlock.

True...

I guess that means we've got to do some kind of sequence counter,
preferably in the same cacheline as the HPET value itself, or _something_
that is guaranteed to change on each write to the cached value.
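
One possible shape (just a sketch, not something from this thread):
pack a counter next to the cached value so that every update changes
the 64-bit word even when the HPET reading itself has not advanced:

	union cached_hpet {
		u64 whole;		/* can be updated in one store */
		struct {
			u32 value;	/* last HPET reading */
			u32 seq;	/* bumped on every update */
		};
	};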

> Also, does this code need to be NMI-safe?  This implementation is
> deadlocky if it's called from an NMI.

Urg.  Can't we just do

	if (in_nmi())
		return read_real_hpet();

?

> The original code was wait-free, right?  That was a nice property, too.

You mean no spins?  I don't think this one really spins ever either.


* Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-08-12 21:10               ` Waiman Long
@ 2016-08-12 21:20                 ` Dave Hansen
  2016-08-12 21:32                   ` Waiman Long
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Hansen @ 2016-08-12 21:20 UTC (permalink / raw)
  To: Waiman Long, Andy Lutomirski
  Cc: Thomas Gleixner, John Stultz, Ingo Molnar, linux-kernel,
	Borislav Petkov, Jiang Liu, Randy Wright, Scott J Norton,
	Douglas Hatch, Prarit Bhargava, X86 ML, H. Peter Anvin

On 08/12/2016 02:10 PM, Waiman Long wrote:
>> I don't think this is right.  If the HPET ever returns the same value
>> twice in a row (unlikely because it's generally too slow to read, but
>> it's plausible that someone will make a fast HPET some day), then this
>> could deadlock.
> 
> What is the deadlock scenario you are talking about?

A reader loops waiting for the HPET update by looking for the value to
change.  If the HPET updater does an update, but the HPET itself hasn't
advanced, it will write the same value as was there before.  The reader
will keep looping thinking there was no update.
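
Concretely (a hypothetical trace, using the names from my earlier
sketch):

	reader:  old_hpet = READ_ONCE(hpet_value);		/* sees 1000 */
	updater: WRITE_ONCE(hpet_value, real_read_hpet());	/* HPET still 1000 */
	updater: spin_unlock(&hpet_lock);
	reader:  new_hpet = READ_ONCE(hpet_value);		/* 1000 again */
	reader:  while (old_hpet == new_hpet) ...		/* spins indefinitely */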


* Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-08-12 21:16               ` Dave Hansen
@ 2016-08-12 21:32                 ` Waiman Long
  0 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2016-08-12 21:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Thomas Gleixner, John Stultz, Ingo Molnar,
	linux-kernel, Borislav Petkov, Jiang Liu, Randy Wright,
	Scott J Norton, Douglas Hatch, Prarit Bhargava, X86 ML,
	H. Peter Anvin

On 08/12/2016 05:16 PM, Dave Hansen wrote:
> On 08/12/2016 01:18 PM, Andy Lutomirski wrote:
>> I don't think this is right.  If the HPET ever returns the same value
>> twice in a row (unlikely because it's generally too slow to read, but
>> it's plausible that someone will make a fast HPET some day), then this
>> could deadlock.
> True...
>
> I guess that means we've got to do some kind of sequence counter,
> preferably in the same cacheline as the HPET value itself, or _something_
> that is guaranteed to change on each write to the cached value.

I have done something similar in the v5 patch that I just sent out.

>> Also, does this code need to be NMI-safe?  This implementation is
>> deadlocky if it's called from an NMI.
> Urg.  Can't we just do
>
> 	if (in_nmi())
> 		return read_real_hpet();
>
> ?

Yes, I am doing that in my v5 patch.

>> The original code was wait-free, right?  That was a nice property, too.
> You mean no spins?  I don't think this one really spins ever either.
>

In the contended case, the reader needs to wait until the new HPET value 
is available. I consider this a kind of waiting.

Cheers,
Longman


* Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-08-12 21:20                 ` Dave Hansen
@ 2016-08-12 21:32                   ` Waiman Long
  0 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2016-08-12 21:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Thomas Gleixner, John Stultz, Ingo Molnar,
	linux-kernel, Borislav Petkov, Jiang Liu, Randy Wright,
	Scott J Norton, Douglas Hatch, Prarit Bhargava, X86 ML,
	H. Peter Anvin

On 08/12/2016 05:20 PM, Dave Hansen wrote:
> On 08/12/2016 02:10 PM, Waiman Long wrote:
>>> I don't think this is right.  If the HPET ever returns the same value
>>> twice in a row (unlikely because it's generally too slow to read, but
>>> it's plausible that someone will make a fast HPET some day), then this
>>> could deadlock.
>> What is the deadlock scenario you are talking about?
> A reader loops waiting for the HPET update by looking for the value to
> change.  If the HPET updater does an update, but the HPET itself hasn't
> advanced, it will write the same value as was there before.  The reader
> will keep looping thinking there was no update.

I should have addressed this problem in my new v5 patch.

Cheers,
Longman


* Re: [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-06-17 20:20 Waiman Long
@ 2016-07-13 15:02 ` Waiman Long
  0 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2016-07-13 15:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel, x86,
	Jiang Liu, Borislav Petkov, Andy Lutomirski, Scott J Norton,
	Douglas Hatch, Randy Wright

On 06/17/2016 04:20 PM, Waiman Long wrote:
> On a large system with many CPUs, using HPET as the clock source can
> have a significant impact on the overall system performance because
> of the following reasons:
>   1) There is a single HPET counter shared by all the CPUs.
>   2) HPET counter reading is a very slow operation.
>
> [...]

This patch was created in response to a problem we have on the 16-socket 
Broadwell-EX systems (up to 768 logical CPUs) that were under 
development. About 10% of the kernel boots experienced soft lockups:

[   71.618132] NetLabel: Initializing
[   71.621967] NetLabel:  domain hash size = 128
[   71.626848] NetLabel:  protocols = UNLABELED CIPSOv4
[   71.632418] NetLabel:  unlabeled traffic allowed by default
[   71.638679] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0, 0, 0, 0, 0
[   71.646504] hpet0: 8 comparators, 64-bit 14.318180 MHz counter
[   71.655313] Switching to clocksource hpet
[   95.679135] BUG: soft lockup - CPU#144 stuck for 23s! [swapper/144:0]
[   95.693363] BUG: soft lockup - CPU#145 stuck for 23s! [swapper/145:0]
[   95.694203] Modules linked in:
[   95.694697] CPU: 145 PID: 0 Comm: swapper/145 Not tainted
3.10.0-327.el7.x86_64 #1
[   95.695580] BUG: soft lockup - CPU#582 stuck for 23s! [swapper/582:0]
[   95.696145] Hardware name: HP Superdome2 16s x86, BIOS Bundle: 
008.001.006
SFW: 041.063.152 01/16/2016
[   95.698128] BUG: soft lockup - CPU#357 stuck for 23s! [swapper/357:0]
[   95.699774] task: ffff8cf7fecf4500 ti: ffff89787c748000 task.ti: 
ffff89787c748000

During the bootup process, there is a short period where the clocksource
is switched to hpet to calibrate the TSCs. It is switched back to tsc
once the calibration is done, and it is during this short period that
the soft lockups may happen.

This patch eliminates the sporadic boot soft lockup problem. Please 
consider merging it upstream.

Thanks,
Longman


* [RESEND PATCH v4] x86/hpet: Reduce HPET counter read contention
@ 2016-06-17 20:20 Waiman Long
  2016-07-13 15:02 ` Waiman Long
  0 siblings, 1 reply; 17+ messages in thread
From: Waiman Long @ 2016-06-17 20:20 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: linux-kernel, x86, Jiang Liu, Borislav Petkov, Andy Lutomirski,
	Scott J Norton, Douglas Hatch, Randy Wright, Waiman Long

On a large system with many CPUs, using HPET as the clock source can
have a significant impact on the overall system performance because
of the following reasons:
 1) There is a single HPET counter shared by all the CPUs.
 2) HPET counter reading is a very slow operation.

Using HPET as the default clock source may happen when, for example,
the TSC clock calibration exceeds the allowable tolerance. Sometimes
the performance slowdown can be so severe that the system may crash
because of an NMI watchdog soft lockup.

This patch attempts to reduce HPET read contention by using the fact
that if more than one CPU is trying to access the HPET at the same
time, it is more efficient to have one CPU in the group read the HPET
counter and share it with the rest of the group than to have each
group member read the HPET counter individually.

This is done by using a combination word with a sequence number and
a bit lock. The CPU that gets the bit lock will be responsible for
reading the HPET counter and updating the sequence number. The others
will monitor the change in sequence number and grab the HPET counter
value accordingly. This change is enabled only on SMP configurations.

On a 4-socket Haswell-EX box with 72 cores (HT off), running the
AIM7 compute workload (1500 users) on a 4.6-rc1 kernel (HZ=1000)
with and without the patch produced the following performance numbers
(with HPET or TSC as the clock source):

TSC		= 646515 jobs/min
HPET w/o patch	= 566708 jobs/min
HPET with patch	= 638791 jobs/min

The perf profile showed a reduction of the %CPU time consumed by
read_hpet from 4.99% without patch to 1.41% with patch.

On a 16-socket IvyBridge-EX system with 240 cores (HT on), on the
other hand, the performance numbers of the same benchmark were:

TSC		= 3145329 jobs/min
HPET w/o patch	= 1108537 jobs/min
HPET with patch	= 3019934 jobs/min

The corresponding perf profile showed a drop in the CPU consumption of
the read_hpet function from more than 34% to just 2.96%.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 v3->v4:
  - Move hpet_save inside the CONFIG_SMP block to fix a compilation
    warning in non-SMP build.

 v2->v3:
  - Make the hpet optimization the default for SMP configuration. So
    no documentation change is needed.
  - Remove threshold checking code as it should not be necessary and
    can be potentially unsafe.

 v1->v2:
  - Reduce the CPU threshold to 32.
  - Add a kernel parameter to explicitly enable or disable hpet
    optimization.
  - Change hpet_save.hpet type to u32 to make sure that read & write
    is atomic on i386.

 arch/x86/kernel/hpet.c |   84 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 84 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index a1f0e4a..bc5bb53 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -759,12 +759,96 @@ static int hpet_cpuhp_notify(struct notifier_block *n,
 #endif
 
 /*
+ * Reading the HPET counter is a very slow operation. If a large number of
+ * CPUs are trying to access the HPET counter simultaneously, it can cause
+ * massive delay and slow down system performance dramatically. This may
+ * happen when HPET is the default clock source instead of TSC. For a
+ * really large system with hundreds of CPUs, the slowdown may be so
+ * severe that it may actually crash the system because of an NMI watchdog
+ * soft lockup, for example.
+ *
+ * If multiple CPUs are trying to access the HPET counter at the same time,
+ * we don't actually need to read the counter multiple times. Instead, the
+ * other CPUs can use the counter value read by the first CPU in the group.
+ *
+ * A sequence number whose lsb is a lock bit is used to control which CPU
+ * has the right to read the HPET counter directly and which CPUs are going
+ * to get the indirect value read by the lock holder. For the latter group,
+ * if the sequence number differs from the expected locked value, they
+ * can assume that the saved HPET value is up-to-date and return it.
+ */
+#define HPET_SEQ_LOCKED(seq)	((seq) & 1)	/* Odd == locked */
+
+/*
  * Clock source related code
  */
+#ifdef CONFIG_SMP
+static struct {
+	/* Sequence number + bit lock */
+	int seq ____cacheline_aligned_in_smp;
+
+	/* Current HPET value		*/
+	u32 hpet ____cacheline_aligned_in_smp;
+} hpet_save;
+
+static cycle_t read_hpet(struct clocksource *cs)
+{
+	int seq;
+
+	seq = READ_ONCE(hpet_save.seq);
+	if (!HPET_SEQ_LOCKED(seq)) {
+		int old, new = seq + 1;
+		unsigned long flags;
+
+		local_irq_save(flags);
+		/*
+		 * Set the lock bit (lsb) to get the right to read HPET
+		 * counter directly. If successful, read the counter, save
+		 * its value, and increment the sequence number. Otherwise,
+		 * increment the sequence number to the expected locked value
+		 * for comparison later on.
+		 */
+		old = cmpxchg(&hpet_save.seq, seq, new);
+		if (old == seq) {
+			u32 time;
+
+			time = hpet_save.hpet = hpet_readl(HPET_COUNTER);
+
+			/* Unlock */
+			smp_store_release(&hpet_save.seq, new + 1);
+			local_irq_restore(flags);
+			return (cycle_t)time;
+		}
+		local_irq_restore(flags);
+		seq = new;
+	}
+
+	/*
+	 * Wait until the locked sequence number changes which indicates
+	 * that the saved HPET value is up-to-date.
+	 */
+	while (READ_ONCE(hpet_save.seq) == seq) {
+		/*
+		 * Since reading the HPET is much slower than a single
+		 * cpu_relax() instruction, we use two here in an attempt
+		 * to reduce the amount of cacheline contention in the
+		 * hpet_save.seq cacheline.
+		 */
+		cpu_relax();
+		cpu_relax();
+	}
+
+	return (cycle_t)READ_ONCE(hpet_save.hpet);
+}
+#else /* CONFIG_SMP */
+/*
+ * For UP
+ */
 static cycle_t read_hpet(struct clocksource *cs)
 {
 	return (cycle_t)hpet_readl(HPET_COUNTER);
 }
+#endif /* CONFIG_SMP */
 
 static struct clocksource clocksource_hpet = {
 	.name		= "hpet",
-- 
1.7.1

