linux-kernel.vger.kernel.org archive mirror
* [PATCH v4] x86/hpet: Reduce HPET counter read contention
@ 2016-04-12 18:46 Waiman Long
  2016-04-13  6:18 ` Ingo Molnar
  2016-05-12 23:20 ` Waiman Long
  0 siblings, 2 replies; 6+ messages in thread
From: Waiman Long @ 2016-04-12 18:46 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: linux-kernel, x86, Jiang Liu, Borislav Petkov, Andy Lutomirski,
	Scott J Norton, Douglas Hatch, Randy Wright, Waiman Long

On a large system with many CPUs, using HPET as the clock source can
have a significant impact on the overall system performance because
of the following reasons:
 1) There is a single HPET counter shared by all the CPUs.
 2) HPET counter reading is a very slow operation.

Using HPET as the default clock source may happen when, for example,
the TSC clock calibration exceeds the allowable tolerance. Sometimes
the performance slowdown can be so severe that the system may crash
because of an NMI watchdog soft lockup.

This patch attempts to reduce HPET read contention by using the fact
that if more than one CPU is trying to access HPET at the same time,
it is more efficient to have one CPU in the group read the HPET
counter and share it with the rest of the group than to have each
group member read the HPET counter individually.

This is done by using a combination word with a sequence number and
a bit lock. The CPU that gets the bit lock is responsible for reading
the HPET counter and updating the sequence number. The others monitor
the change in the sequence number and pick up the saved HPET value
accordingly. This change is enabled only on SMP configurations.

On a 4-socket Haswell-EX box with 72 cores (HT off), running the
AIM7 compute workload (1500 users) on a 4.6-rc1 kernel (HZ=1000)
with and without the patch has the following performance numbers
(with HPET or TSC as clock source):

TSC		= 646515 jobs/min
HPET w/o patch	= 566708 jobs/min
HPET with patch	= 638791 jobs/min

The perf profile showed a reduction of the %CPU time consumed by
read_hpet from 4.99% without patch to 1.41% with patch.

On a 16-socket IvyBridge-EX system with 240 cores (HT on), on the
other hand, the performance numbers of the same benchmark were:

TSC		= 3145329 jobs/min
HPET w/o patch	= 1108537 jobs/min
HPET with patch	= 3019934 jobs/min

The corresponding perf profile showed a drop of CPU consumption of
the read_hpet function from more than 34% to just 2.96%.
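
As a rough sketch (the actual commands used for the numbers above may
have differed), a profile like this can be gathered while the
benchmark is running with something like:

  # sample all CPUs for a minute during the AIM7 run, then see how
  # much time is attributed to read_hpet
  perf record -a -g -- sleep 60
  perf report --stdio --sort symbol | grep read_hpet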

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 v3->v4:
  - Move hpet_save inside the CONFIG_SMP block to fix a compilation
    warning in non-SMP build.

 v2->v3:
  - Make the hpet optimization the default for SMP configuration. So
    no documentation change is needed.
  - Remove threshold checking code as it should not be necessary and
    can be potentially unsafe.

 v1->v2:
  - Reduce the CPU threshold to 32.
  - Add a kernel parameter to explicitly enable or disable hpet
    optimization.
  - Change hpet_save.hpet type to u32 to make sure that read & write
    is atomic on i386.

 arch/x86/kernel/hpet.c |   84 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 84 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index a1f0e4a..bc5bb53 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -759,12 +759,96 @@ static int hpet_cpuhp_notify(struct notifier_block *n,
 #endif
 
 /*
+ * Reading the HPET counter is a very slow operation. If a large number of
+ * CPUs are trying to access the HPET counter simultaneously, it can cause
+ * massive delay and slow down system performance dramatically. This may
+ * happen when HPET is the default clock source instead of TSC. For a
+ * really large system with hundreds of CPUs, the slowdown may be so
+ * severe that it may actually crash the system because of an NMI watchdog
+ * soft lockup, for example.
+ *
+ * If multiple CPUs are trying to access the HPET counter at the same time,
+ * we don't actually need to read the counter multiple times. Instead, the
+ * other CPUs can use the counter value read by the first CPU in the group.
+ *
+ * A sequence number whose lsb is a lock bit is used to control which CPU
+ * has the right to read the HPET counter directly and which CPUs are going
+ * to get the indirect value read by the lock holder. For the latter group,
+ * if the sequence number differs from the expected locked value, they
+ * can assume that the saved HPET value is up-to-date and return it.
+ */
+#define HPET_SEQ_LOCKED(seq)	((seq) & 1)	/* Odd == locked */
+
+/*
  * Clock source related code
  */
+#ifdef CONFIG_SMP
+static struct {
+	/* Sequence number + bit lock */
+	int seq ____cacheline_aligned_in_smp;
+
+	/* Current HPET value		*/
+	u32 hpet ____cacheline_aligned_in_smp;
+} hpet_save;
+
+static cycle_t read_hpet(struct clocksource *cs)
+{
+	int seq;
+
+	seq = READ_ONCE(hpet_save.seq);
+	if (!HPET_SEQ_LOCKED(seq)) {
+		int old, new = seq + 1;
+		unsigned long flags;
+
+		local_irq_save(flags);
+		/*
+		 * Set the lock bit (lsb) to get the right to read HPET
+		 * counter directly. If successful, read the counter, save
+		 * its value, and increment the sequence number. Otherwise,
+		 * increment the sequence number to the expected locked value
+		 * for comparison later on.
+		 */
+		old = cmpxchg(&hpet_save.seq, seq, new);
+		if (old == seq) {
+			u32 time;
+
+			time = hpet_save.hpet = hpet_readl(HPET_COUNTER);
+
+			/* Unlock */
+			smp_store_release(&hpet_save.seq, new + 1);
+			local_irq_restore(flags);
+			return (cycle_t)time;
+		}
+		local_irq_restore(flags);
+		seq = new;
+	}
+
+	/*
+	 * Wait until the locked sequence number changes which indicates
+	 * that the saved HPET value is up-to-date.
+	 */
+	while (READ_ONCE(hpet_save.seq) == seq) {
+		/*
+		 * Since reading the HPET is much slower than a single
+		 * cpu_relax() instruction, we use two here in an attempt
+		 * to reduce the amount of cacheline contention in the
+		 * hpet_save.seq cacheline.
+		 */
+		cpu_relax();
+		cpu_relax();
+	}
+
+	return (cycle_t)READ_ONCE(hpet_save.hpet);
+}
+#else /* CONFIG_SMP */
+/*
+ * For UP
+ */
 static cycle_t read_hpet(struct clocksource *cs)
 {
 	return (cycle_t)hpet_readl(HPET_COUNTER);
 }
+#endif /* CONFIG_SMP */
 
 static struct clocksource clocksource_hpet = {
 	.name		= "hpet",
-- 
1.7.1


* Re: [PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-04-12 18:46 [PATCH v4] x86/hpet: Reduce HPET counter read contention Waiman Long
@ 2016-04-13  6:18 ` Ingo Molnar
  2016-04-13 15:37   ` Waiman Long
  2016-05-12 23:20 ` Waiman Long
  1 sibling, 1 reply; 6+ messages in thread
From: Ingo Molnar @ 2016-04-13  6:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel, x86,
	Jiang Liu, Borislav Petkov, Andy Lutomirski, Scott J Norton,
	Douglas Hatch, Randy Wright, Peter Zijlstra


* Waiman Long <Waiman.Long@hpe.com> wrote:

> On a large system with many CPUs, using HPET as the clock source can
> have a significant impact on the overall system performance because
> of the following reasons:
>  1) There is a single HPET counter shared by all the CPUs.
>  2) HPET counter reading is a very slow operation.
> 
> Using HPET as the default clock source may happen when, for example,
> the TSC clock calibration exceeds the allowable tolerance. Sometimes
> the performance slowdown can be so severe that the system may crash
> because of an NMI watchdog soft lockup.

>  /*
> + * Reading the HPET counter is a very slow operation. If a large number of
> + * CPUs are trying to access the HPET counter simultaneously, it can cause
> + * massive delay and slow down system performance dramatically. This may
> + * happen when HPET is the default clock source instead of TSC. For a
> + * really large system with hundreds of CPUs, the slowdown may be so
> + * severe that it may actually crash the system because of an NMI watchdog
> + * soft lockup, for example.
> + *
> + * If multiple CPUs are trying to access the HPET counter at the same time,
> + * we don't actually need to read the counter multiple times. Instead, the
> + * other CPUs can use the counter value read by the first CPU in the group.

Hm, weird, so how can this:

  static cycle_t read_hpet(struct clocksource *cs)
  {
         return (cycle_t)hpet_readl(HPET_COUNTER);
  }

... cause an actual slowdown of that magnitude? This goes straight to MMIO. So is 
the hardware so terminally broken?

How good is the TSC clocksource on the affected system? Could we simply always use 
the TSC (and not use the HPET at all as a clocksource), instead of trying to fix 
broken hardware?

Thanks,

	Ingo


* Re: [PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-04-13  6:18 ` Ingo Molnar
@ 2016-04-13 15:37   ` Waiman Long
  2016-04-14  0:25     ` Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Waiman Long @ 2016-04-13 15:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel, x86,
	Jiang Liu, Borislav Petkov, Andy Lutomirski, Scott J Norton,
	Douglas Hatch, Randy Wright, Peter Zijlstra

On 04/13/2016 02:18 AM, Ingo Molnar wrote:
> * Waiman Long <Waiman.Long@hpe.com> wrote:
>
>> On a large system with many CPUs, using HPET as the clock source can
>> have a significant impact on the overall system performance because
>> of the following reasons:
>>   1) There is a single HPET counter shared by all the CPUs.
>>   2) HPET counter reading is a very slow operation.
>>
>> Using HPET as the default clock source may happen when, for example,
>> the TSC clock calibration exceeds the allowable tolerance. Sometimes
>> the performance slowdown can be so severe that the system may crash
>> because of an NMI watchdog soft lockup.
>>   /*
>> + * Reading the HPET counter is a very slow operation. If a large number of
>> + * CPUs are trying to access the HPET counter simultaneously, it can cause
>> + * massive delay and slow down system performance dramatically. This may
>> + * happen when HPET is the default clock source instead of TSC. For a
>> + * really large system with hundreds of CPUs, the slowdown may be so
>> + * severe that it may actually crash the system because of an NMI watchdog
>> + * soft lockup, for example.
>> + *
>> + * If multiple CPUs are trying to access the HPET counter at the same time,
>> + * we don't actually need to read the counter multiple times. Instead, the
>> + * other CPUs can use the counter value read by the first CPU in the group.
> Hm, weird, so how can this:
>
>    static cycle_t read_hpet(struct clocksource *cs)
>    {
>           return (cycle_t)hpet_readl(HPET_COUNTER);
>    }
>
> ... cause an actual slowdown of that magnitude? This goes straight to MMIO. So is
> the hardware so terminally broken?

I only know that accessing the HPET counter is VERY slow. Andy said that 
it takes at least a few us. I haven't done that measurement myself.

I am not sure what kind of contention will happen when multiple CPUs
are accessing it at the same time. It is not just the clock tick
interrupt handler that needs to access the time; many system calls
will also cause the current time to be read. When we have hundreds of
CPUs in the system, it is not too hard to cause a soft lockup if hpet
is the default clock source.
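
As a rough illustration (an assumed sketch, not the workload that was
benchmarked), something like the program below keeps every CPU asking
for the time; with hpet as the current clocksource each of those reads
ends up hitting the single shared HPET counter, whether through the
kernel's read_hpet() or a vDSO mapping, depending on the kernel:

  /* hammer_clock.c - illustrative sketch only: spin on clock_gettime()
   * on every CPU so that all of them keep reading the current
   * clocksource.  Build with: gcc -O2 -pthread hammer_clock.c
   */
  #include <pthread.h>
  #include <time.h>
  #include <unistd.h>

  static void *spin(void *arg)
  {
          struct timespec ts;

          for (;;)
                  clock_gettime(CLOCK_MONOTONIC, &ts);
          return arg;
  }

  int main(void)
  {
          long i, ncpus = sysconf(_SC_NPROCESSORS_ONLN);
          pthread_t tid;

          for (i = 1; i < ncpus; i++)
                  pthread_create(&tid, NULL, spin, NULL);
          spin(NULL);     /* the main thread spins as well */
          return 0;
  }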

> How good is the TSC clocksource on the affected system? Could we simply always use
> the TSC (and not use the HPET at all as a clocksource), instead of trying to fix
> broken hardware?
>
> Thanks,
>
> 	Ingo

The TSC clocksource, on the other hand, is per cpu. So there won't be
much contention in accessing it. Normally TSC will be used as the
default clock source. However, if there is too much variation in the
actual clock speeds of the individual CPUs, it will cause the TSC
calibration to fail and revert to using hpet as the clock source.
During bootup, hpet will usually be selected as the default clock
source first. After a short time, the TSC will take over as the
default clock source. Problems can happen during that short transition
period too. In fact, we have 16-socket Broadwell-EX systems that have
this soft lockup problem once every few reboot cycles, which prompted
me to find a solution to fix it.
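
As a general aside (not something specific to these systems): the
clocksources the kernel considers usable, and the one currently in
use, can be checked through sysfs, and the selection can be overridden
on the kernel command line on machines where the TSC is known to be
trustworthy:

  # standard sysfs interface for the current/available clocksources
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  cat /sys/devices/system/clocksource/clocksource0/available_clocksource

  # boot-time overrides for a box with a known-good TSC (use with
  # care; they bypass the usual sanity checks):
  #   clocksource=tsc  tsc=reliable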

Cheers,
Longman


* Re: [PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-04-13 15:37   ` Waiman Long
@ 2016-04-14  0:25     ` Peter Zijlstra
  2016-04-14  2:10       ` Waiman Long
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2016-04-14  0:25 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	linux-kernel, x86, Jiang Liu, Borislav Petkov, Andy Lutomirski,
	Scott J Norton, Douglas Hatch, Randy Wright

On Wed, Apr 13, 2016 at 11:37:21AM -0400, Waiman Long wrote:
> The TSC clocksource, on the other hand, is per cpu. So there won't be much
> contention in accessing it. Normally TSC will be used as the default clock
> source. However, if there is too much variation in the actual clock speeds
> of the individual CPUs, 

Does the system actually have a clock rate skew? Not an offset?

> it will cause the TSC calibration to fail and revert to using hpet
> as the clock source. During bootup, hpet will usually be selected as
> the default clock source first. After a short time, the TSC will
> take over as the default clock source. Problems can happen during
> that short transition period too. In fact, we have 16-socket
> Broadwell-EX systems that have this soft lockup problem once every
> few reboot cycles, which prompted me to find a solution to fix it.

This 16 socket system is a completely broken trainwreck. Trying to use
HPET with _that_ many CPUs is absolutely insane.

Please tell your hardware engineers to fix the TSC clock domain.


* Re: [PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-04-14  0:25     ` Peter Zijlstra
@ 2016-04-14  2:10       ` Waiman Long
  0 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2016-04-14  2:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	linux-kernel, x86, Jiang Liu, Borislav Petkov, Andy Lutomirski,
	Scott J Norton, Douglas Hatch, Randy Wright

On 04/13/2016 08:25 PM, Peter Zijlstra wrote:
> On Wed, Apr 13, 2016 at 11:37:21AM -0400, Waiman Long wrote:
>> The TSC clocksource, on the other hand, is per cpu. So there won't be much
>> contention in accessing it. Normally TSC will be used as the default clock
>> source. However, if there is too much variation in the actual clock speeds
>> of the individual CPUs,
> Does the system actually have a clock rate skew? Not an offset?

No, the system that I was talking about didn't have this issue. I did
see some prototype machines that had a clock skew problem due to a
firmware issue.

>> it will cause the TSC calibration to fail and revert to using hpet
>> as the clock source. During bootup, hpet will usually be selected as
>> the default clock source first. After a short time, the TSC will
>> take over as the default clock source. Problems can happen during
>> that short transition period too. In fact, we have 16-socket
>> Broadwell-EX systems that have this soft lockup problem once every
>> few reboot cycles, which prompted me to find a solution to fix it.
> This 16 socket system is a completely broken trainwreck. Trying to use
> HPET with _that_ many CPUs is absolutely insane.
>
> Please tell your hardware engineers to fix the TSC clock domain.

I was talking about the way the clock source was brought up. If you look 
at the bootup kernel log, you will see something like (from a 4-socket 
system):

[    5.777423] clocksource: Switched to clocksource hpet
[    5.823689] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    7.870387] tsc: Refined TSC clocksource calibration: 2493.990 MHz
[    7.872299] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x23f30c707d3, max_idle_ns: 440795252535 ns
[    8.871787] clocksource: Switched to clocksource tsc

The TSC calibration itself takes some time and it needs a prior clock
source (hpet) as a reference. It is during that transition period,
between hpet and TSC as the default clocksource, that the 16-socket
system may occasionally hit a soft lockup. I don't think it is a
hardware issue; that system was using TSC as the clock source whenever
it booted up correctly.
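
As an aside, the same transition can also be spotted after the fact
with a quick search of the kernel log:

  # the clocksource switches during boot show up in the kernel log
  dmesg | grep -i 'switched to clocksource'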

Cheers,
Longman


* Re: [PATCH v4] x86/hpet: Reduce HPET counter read contention
  2016-04-12 18:46 [PATCH v4] x86/hpet: Reduce HPET counter read contention Waiman Long
  2016-04-13  6:18 ` Ingo Molnar
@ 2016-05-12 23:20 ` Waiman Long
  1 sibling, 0 replies; 6+ messages in thread
From: Waiman Long @ 2016-05-12 23:20 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, linux-kernel, x86,
	Jiang Liu, Borislav Petkov, Andy Lutomirski, Scott J Norton,
	Douglas Hatch, Randy Wright

On 04/12/2016 02:46 PM, Waiman Long wrote:
> On a large system with many CPUs, using HPET as the clock source can
> have a significant impact on the overall system performance because
> of the following reasons:
>   1) There is a single HPET counter shared by all the CPUs.
>   2) HPET counter reading is a very slow operation.
>
> Using HPET as the default clock source may happen when, for example,
> the TSC clock calibration exceeds the allowable tolerance. Sometimes
> the performance slowdown can be so severe that the system may crash
> because of an NMI watchdog soft lockup.
>
> This patch attempts to reduce HPET read contention by using the fact
> that if more than one CPU is trying to access HPET at the same time,
> it is more efficient to have one CPU in the group read the HPET
> counter and share it with the rest of the group than to have each
> group member read the HPET counter individually.
>
> This is done by using a combination word with a sequence number and
> a bit lock. The CPU that gets the bit lock is responsible for reading
> the HPET counter and updating the sequence number. The others monitor
> the change in the sequence number and pick up the saved HPET value
> accordingly. This change is enabled only on SMP configurations.
>
> On a 4-socket Haswell-EX box with 72 cores (HT off), running the
> AIM7 compute workload (1500 users) on a 4.6-rc1 kernel (HZ=1000)
> with and without the patch has the following performance numbers
> (with HPET or TSC as clock source):
>
> TSC		= 646515 jobs/min
> HPET w/o patch	= 566708 jobs/min
> HPET with patch	= 638791 jobs/min
>
> The perf profile showed a reduction of the %CPU time consumed by
> read_hpet from 4.99% without patch to 1.41% with patch.
>
> On a 16-socket IvyBridge-EX system with 240 cores (HT on), on the
> other hand, the performance numbers of the same benchmark were:
>
> TSC		= 3145329 jobs/min
> HPET w/o patch	= 1108537 jobs/min
> HPET with patch	= 3019934 jobs/min
>
> The corresponding perf profile showed a drop of CPU consumption of
> the read_hpet function from more than 34% to just 2.96%.
>
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ---
>   v3->v4:
>    - Move hpet_save inside the CONFIG_SMP block to fix a compilation
>      warning in non-SMP build.
>
>   v2->v3:
>    - Make the hpet optimization the default for SMP configuration. So
>      no documentation change is needed.
>    - Remove threshold checking code as it should not be necessary and
>      can be potentially unsafe.
>
>   v1->v2:
>    - Reduce the CPU threshold to 32.
>    - Add a kernel parameter to explicitly enable or disable hpet
>      optimization.
>    - Change hpet_save.hpet type to u32 to make sure that read & write
>      is atomic on i386.
>
>   arch/x86/kernel/hpet.c |   84 ++++++++++++++++++++++++++++++++++++++++++++++++
>   1 files changed, 84 insertions(+), 0 deletions(-)
>
>

I haven't received any feedback on this patch since mid-April. I would
like to know whether the current patch is good enough or whether some
additional changes are still needed to make it mergeable upstream.

Thanks,
Longman
