* [PATCH v2] x86/hpet: Reduce HPET counter read contention
@ 2016-04-08 20:11 Waiman Long
  2016-04-10  0:19 ` Thomas Gleixner
  0 siblings, 1 reply; 3+ messages in thread
From: Waiman Long @ 2016-04-08 20:11 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jonathan Corbet
  Cc: linux-kernel, linux-doc, x86, Jiang Liu, Borislav Petkov,
	Andy Lutomirski, Scott J Norton, Douglas Hatch, Randy Wright,
	Waiman Long

On a large system with many CPUs, using HPET as the clock source can
have a significant impact on the overall system performance because
of the following reasons:
 1) There is a single HPET counter shared by all the CPUs.
 2) HPET counter reading is a very slow operation.

HPET may end up as the default clock source when, for example, the TSC
clock calibration exceeds the allowable tolerance. The resulting
performance slowdown can be so severe that the system may crash
because of an NMI watchdog soft lockup, for example.

This patch attempts to reduce HPET read contention by exploiting the
fact that, if more than one task is trying to access the HPET at the
same time, it is more efficient for one task in the group to read the
HPET counter and share it with the rest of the group than for each
group member to read the HPET counter individually.

This is done by using a single word that combines a sequence number
with a bit lock. The task that gets the bit lock is responsible for
reading the HPET counter and updating the sequence number. The others
monitor the sequence number for a change and then pick up the saved
HPET value accordingly.
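
In rough pseudo-C, the fast path of the new read_hpet() looks like the
following (a simplified sketch only; the full version in the diff below
also disables interrupts around the locked section and has a reset
safeguard for a stuck lock holder):

	seq = READ_ONCE(hpet_save.seq);
	if (!(seq & 1)) {			/* even => unlocked */
		if (cmpxchg(&hpet_save.seq, seq, seq + 1) == seq) {
			/* We own the lock bit: read and publish. */
			time = hpet_readl(HPET_COUNTER);
			WRITE_ONCE(hpet_save.hpet, time);
			smp_store_release(&hpet_save.seq, seq + 2);
			return (cycle_t)time;
		}
		seq++;			/* the expected locked value */
	}
	/* Someone else is reading: wait for the unlock. */
	while (READ_ONCE(hpet_save.seq) == seq)
		cpu_relax();
	return (cycle_t)READ_ONCE(hpet_save.hpet);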

On a 4-socket Haswell-EX box with 72 cores (HT off), running the
AIM7 compute workload (1500 users) on a 4.6-rc1 kernel (HZ=1000)
with and without the patch yields the following performance numbers
(with HPET or TSC as the clock source):

TSC		= 646515 jobs/min
HPET w/o patch	= 566708 jobs/min
HPET with patch	= 638791 jobs/min

The perf profile showed a reduction of the %CPU time consumed by
read_hpet from 4.99% without patch to 1.41% with patch.

On a 16-socket IvyBridge-EX system with 240 cores (HT on), on the
other hand, the performance numbers of the same benchmark were:

TSC		= 3145329 jobs/min
HPET w/o patch	= 1108537 jobs/min
HPET with patch	= 3019934 jobs/min

The corresponding perf profile showed the CPU consumption of the
read_hpet function dropping from more than 34% to just 2.96%.

This optimization is enabled on systems with more than 32 CPUs. It can
also be explicitly enabled or disabled by using the new opt_read_hpet
kernel parameter.
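
For example, a smaller machine can be forced onto the shared-read path
for testing by booting with

	opt_read_hpet=1

on the kernel command line, while "opt_read_hpet=0" makes every CPU
read the counter directly even above the 32-CPU threshold.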

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---

 v1->v2:
  - Reduce the CPU threshold to 32.
  - Add a kernel parameter to explicitly enable or disable hpet
    optimization.
  - Change the hpet_save.hpet type to u32 to make sure that reads and
    writes are atomic on i386.

 Documentation/kernel-parameters.txt |    4 +
 arch/x86/kernel/hpet.c              |  119 ++++++++++++++++++++++++++++++++++-
 2 files changed, 122 insertions(+), 1 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index ecc74fa..9424c75 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2300,6 +2300,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			lock	 - Configure if Flex-OneNAND boundary should be locked.
 				   Once locked, the boundary cannot be changed.
 				   1 indicates lock status, 0 indicates unlock status.
+	opt_read_hpet=	[X86]
+			0 to disable read_hpet optimization
+			1 to enable read_hpet optimization
+			See arch/x86/kernel/hpet.c.
 
 	mtdset=		[ARM]
 			ARM/S3C2412 JIVE boot control
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index a1f0e4a..d0933fd 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -759,11 +759,121 @@ static int hpet_cpuhp_notify(struct notifier_block *n,
 #endif
 
 /*
+ * Reading the HPET counter is a very slow operation. If a large number of
+ * CPUs are trying to access the HPET counter simultaneously, it can cause
+ * massive delay and slow down system performance dramatically. This may
+ * happen when HPET is the default clock source instead of TSC. For a
+ * really large system with hundreds of CPUs, the slowdown may be so
+ * severe that it may actually crash the system because of an NMI watchdog
+ * soft lockup, for example.
+ *
+ * If multiple CPUs are trying to access the HPET counter at the same time,
+ * we don't actually need to read the counter multiple times. Instead, the
+ * other CPUs can use the counter value read by the first CPU in the group.
+ *
+ * A sequence number whose lsb is a lock bit is used to control which CPU
+ * has the right to read the HPET counter directly and which CPUs are going
+ * to get the indirect value read by the lock holder. For the latter group,
+ * if the sequence number differs from the expected locked value, they
+ * can assume that the saved HPET value is up-to-date and return it.
+ *
+ * This mechanism is automatically activated on systems with a large number
+ * of CPUs (> 32). It can also be explicitly enabled or disabled by using
+ * the "opt_read_hpet=1" or "opt_read_hpet=0" kernel command line options
+ * respectively, which override the CPU count check.
+ */
+static int opt_read_hpet __read_mostly = -1;	/* Optimize read_hpet() */
+static struct {
+	/* Sequence number + bit lock */
+	int seq ____cacheline_aligned_in_smp;
+
+	/* Current HPET value		*/
+	u32 hpet ____cacheline_aligned_in_smp;
+} hpet_save;
+#define HPET_SEQ_LOCKED(seq)	((seq) & 1)	/* Odd == locked */
+#define HPET_RESET_THRESHOLD	(1 << 14)
+#define HPET_REUSE_THRESHOLD	32
+
+static int __init get_read_hpet_opt(char *str)
+{
+	get_option(&str, &opt_read_hpet);
+	return 0;
+}
+early_param("opt_read_hpet", get_read_hpet_opt);
+
+/*
  * Clock source related code
  */
 static cycle_t read_hpet(struct clocksource *cs)
 {
-	return (cycle_t)hpet_readl(HPET_COUNTER);
+	int seq, cnt = 0;
+	u32 time;
+
+	if (opt_read_hpet <= 0)
+		return (cycle_t)hpet_readl(HPET_COUNTER);
+
+	seq = READ_ONCE(hpet_save.seq);
+	if (!HPET_SEQ_LOCKED(seq)) {
+		int old, new = seq + 1;
+		unsigned long flags;
+
+		local_irq_save(flags);
+		/*
+		 * Set the lock bit (lsb) to get the right to read HPET
+		 * counter directly. If successful, read the counter, save
+		 * its value, and increment the sequence number. Otherwise,
+		 * increment the sequence number to the expected locked value
+		 * for comparison later on.
+		 */
+		old = cmpxchg(&hpet_save.seq, seq, new);
+		if (old == seq) {
+			time = hpet_readl(HPET_COUNTER);
+			WRITE_ONCE(hpet_save.hpet, time);
+
+			/* Unlock */
+			smp_store_release(&hpet_save.seq, new + 1);
+			local_irq_restore(flags);
+			return (cycle_t)time;
+		}
+		local_irq_restore(flags);
+		seq = new;
+	}
+
+	/*
+	 * Wait until the locked sequence number changes which indicates
+	 * that the saved HPET value is up-to-date.
+	 */
+	while (READ_ONCE(hpet_save.seq) == seq) {
+		/*
+		 * Since reading the HPET is much slower than a single
+		 * cpu_relax() instruction, we use two here in an attempt
+		 * to reduce the amount of cacheline contention in the
+		 * hpet_save.seq cacheline.
+		 */
+		cpu_relax();
+		cpu_relax();
+
+		if (likely(++cnt <= HPET_RESET_THRESHOLD))
+			continue;
+
+		/*
+		 * In the unlikely event that it takes too long for the lock
+		 * holder to read the HPET, we do it ourselves and try to
+		 * reset the lock. This will also break a deadlock if it
+		 * happens, for example, when the process context lock holder
+		 * gets killed in the middle of reading the HPET counter.
+		 */
+		time = hpet_readl(HPET_COUNTER);
+		WRITE_ONCE(hpet_save.hpet, time);
+		if (READ_ONCE(hpet_save.seq) == seq) {
+			if (cmpxchg(&hpet_save.seq, seq, seq + 1) == seq)
+				pr_warn("read_hpet: reset hpet seq to 0x%x\n",
+					seq + 1);
+		}
+		return (cycle_t)time;
+	}
+
+	return (cycle_t)READ_ONCE(hpet_save.hpet);
 }
 
 static struct clocksource clocksource_hpet = {
@@ -956,6 +1066,13 @@ static __init int hpet_late_init(void)
 	hpet_reserve_platform_timers(hpet_readl(HPET_ID));
 	hpet_print_config();
 
+	/*
+	 * Reuse HPET value read by other CPUs if there are more than
+	 * HPET_REUSE_THRESHOLD CPUs in the system.
+	 */
+	if ((opt_read_hpet < 0) && (num_possible_cpus() > HPET_REUSE_THRESHOLD))
+		opt_read_hpet = 1;
+
 	if (hpet_msi_disable)
 		return 0;
 
-- 
1.7.1

* Re: [PATCH v2] x86/hpet: Reduce HPET counter read contention
  2016-04-08 20:11 [PATCH v2] x86/hpet: Reduce HPET counter read contention Waiman Long
@ 2016-04-10  0:19 ` Thomas Gleixner
  2016-04-11 20:06   ` Waiman Long
  0 siblings, 1 reply; 3+ messages in thread
From: Thomas Gleixner @ 2016-04-10  0:19 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, H. Peter Anvin, Jonathan Corbet, LKML, linux-doc,
	x86, Borislav Petkov, Andy Lutomirski, Scott J Norton,
	Douglas Hatch, Randy Wright

On Fri, 8 Apr 2016, Waiman Long wrote:
> This patch attempts to reduce HPET read contention by exploiting the
> fact that, if more than one task is trying to access the HPET at the
> same time, it is more efficient for one task in the group to read the
> HPET counter and share it with the rest of the group than for each
> group member to read the HPET counter individually.

That has nothing to do with tasks. clocksource reads can happen from almost
any context. The problem is concurrent access on multiple cpus.

> This optimization is enabled on systems with more than 32 CPUs. It can
> also be explicitly enabled or disabled by using the new opt_read_hpet
> kernel parameter.

Please not. What's wrong with enabling it unconditionally?
 
> +/*
>   * Clock source related code
>   */
>  static cycle_t read_hpet(struct clocksource *cs)
>  {
> -	return (cycle_t)hpet_readl(HPET_COUNTER);
> +	int seq, cnt = 0;
> +	u32 time;
> +
> +	if (opt_read_hpet <= 0)
> +		return (cycle_t)hpet_readl(HPET_COUNTER);

This wants to be conditional on CONFIG_SMP. No point in having all that muck
around for an UP kernel.

> +	seq = READ_ONCE(hpet_save.seq);
> +	if (!HPET_SEQ_LOCKED(seq)) {
> +		int old, new = seq + 1;
> +		unsigned long flags;
> +
> +		local_irq_save(flags);
> +		/*
> +		 * Set the lock bit (lsb) to get the right to read HPET
> +		 * counter directly. If successful, read the counter, save
> +		 * its value, and increment the sequence number. Otherwise,
> +		 * increment the sequence number to the expected locked value
> +		 * for comparison later on.
> +		 */
> +		old = cmpxchg(&hpet_save.seq, seq, new);
> +		if (old == seq) {
> +			time = hpet_readl(HPET_COUNTER);
> +			WRITE_ONCE(hpet_save.hpet, time);
> +
> +			/* Unlock */
> +			smp_store_release(&hpet_save.seq, new + 1);
> +			local_irq_restore(flags);
> +			return (cycle_t)time;
> +		}
> +		local_irq_restore(flags);
> +		seq = new;
> +	}
> +
> +	/*
> +	 * Wait until the locked sequence number changes which indicates
> +	 * that the saved HPET value is up-to-date.
> +	 */
> +	while (READ_ONCE(hpet_save.seq) == seq) {
> +		/*
> +		 * Since reading the HPET is much slower than a single
> +		 * cpu_relax() instruction, we use two here in an attempt
> +		 * to reduce the amount of cacheline contention in the
> +		 * hpet_save.seq cacheline.
> +		 */
> +		cpu_relax();
> +		cpu_relax();
> +
> +		if (likely(++cnt <= HPET_RESET_THRESHOLD))
> +			continue;
> +
> +		/*
> +		 * In the unlikely event that it takes too long for the lock
> +		 * holder to read the HPET, we do it ourselves and try to
> +		 * reset the lock. This will also break a deadlock if it
> +		 * happens, for example, when the process context lock holder
> +		 * gets killed in the middle of reading the HPET counter.
> +		 */
> +		time = hpet_readl(HPET_COUNTER);
> +		WRITE_ONCE(hpet_save.hpet, time);
> +		if (READ_ONCE(hpet_save.seq) == seq) {
> +			if (cmpxchg(&hpet_save.seq, seq, seq + 1) == seq)
> +				pr_warn("read_hpet: reset hpet seq to 0x%x\n",
> +					seq + 1);

This is voodoo programming and actively dangerous.

CPU0 	        CPU1	       		CPU2
lock_hpet()
T1=read_hpet()	wait_for_unlock()	
store_hpet(T1)	
		....			
		T2 = read_hpet()
unlock_hpet()				
					lock_hpet()
					T3 = read_hpet()
					store_hpet(T3)
					unlock_hpet()
					return T3
lock_hpet()
T4 = read_hpet()			wait_for_unlock()
store_hpet(T4)	
		store_hpet(T2)			
unlock_hpet()				return T2

CPU2 will observe time going backwards.

Thanks,

	tglx

* Re: [PATCH v2] x86/hpet: Reduce HPET counter read contention
  2016-04-10  0:19 ` Thomas Gleixner
@ 2016-04-11 20:06   ` Waiman Long
  0 siblings, 0 replies; 3+ messages in thread
From: Waiman Long @ 2016-04-11 20:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, H. Peter Anvin, Jonathan Corbet, LKML, linux-doc,
	x86, Borislav Petkov, Andy Lutomirski, Scott J Norton,
	Douglas Hatch, Randy Wright

On 04/09/2016 08:19 PM, Thomas Gleixner wrote:
> On Fri, 8 Apr 2016, Waiman Long wrote:
>> This patch attempts to reduce HPET read contention by exploiting the
>> fact that, if more than one task is trying to access the HPET at the
>> same time, it is more efficient for one task in the group to read the
>> HPET counter and share it with the rest of the group than for each
>> group member to read the HPET counter individually.
> That has nothing to do with tasks. clocksource reads can happen from almost
> any context. The problem is concurrent access on multiple cpus.

You are right. I should have said CPUs instead.

>> This optimization is enabled on systems with more than 32 CPUs. It can
>> also be explicitly enabled or disabled by using the new opt_read_hpet
>> kernel parameter.
> Please not. What's wrong with enabling it unconditionally?
>   

There is nothing wrong with enabling it unconditionally. I am just not
sure if that is the right thing to do. Since both you and Andy said we
should enable it unconditionally, I will do so in the next version of
the patch.

>> +/*
>>    * Clock source related code
>>    */
>>   static cycle_t read_hpet(struct clocksource *cs)
>>   {
>> -	return (cycle_t)hpet_readl(HPET_COUNTER);
>> +	int seq, cnt = 0;
>> +	u32 time;
>> +
>> +	if (opt_read_hpet <= 0)
>> +		return (cycle_t)hpet_readl(HPET_COUNTER);
> This wants to be conditional on CONFIG_SMP. No point in having all that muck
> around for an UP kernel.

Will do so.

>> +	seq = READ_ONCE(hpet_save.seq);
>> +	if (!HPET_SEQ_LOCKED(seq)) {
>> +		int old, new = seq + 1;
>> +		unsigned long flags;
>> +
>> +		local_irq_save(flags);
>> +		/*
>> +		 * Set the lock bit (lsb) to get the right to read HPET
>> +		 * counter directly. If successful, read the counter, save
>> +		 * its value, and increment the sequence number. Otherwise,
>> +		 * increment the sequence number to the expected locked value
>> +		 * for comparison later on.
>> +		 */
>> +		old = cmpxchg(&hpet_save.seq, seq, new);
>> +		if (old == seq) {
>> +			time = hpet_readl(HPET_COUNTER);
>> +			WRITE_ONCE(hpet_save.hpet, time);
>> +
>> +			/* Unlock */
>> +			smp_store_release(&hpet_save.seq, new + 1);
>> +			local_irq_restore(flags);
>> +			return (cycle_t)time;
>> +		}
>> +		local_irq_restore(flags);
>> +		seq = new;
>> +	}
>> +
>> +	/*
>> +	 * Wait until the locked sequence number changes which indicates
>> +	 * that the saved HPET value is up-to-date.
>> +	 */
>> +	while (READ_ONCE(hpet_save.seq) == seq) {
>> +		/*
>> +		 * Since reading the HPET is much slower than a single
>> +		 * cpu_relax() instruction, we use two here in an attempt
>> +		 * to reduce the amount of cacheline contention in the
>> +		 * hpet_save.seq cacheline.
>> +		 */
>> +		cpu_relax();
>> +		cpu_relax();
>> +
>> +		if (likely(++cnt <= HPET_RESET_THRESHOLD))
>> +			continue;
>> +
>> +		/*
>> +		 * In the unlikely event that it takes too long for the lock
>> +		 * holder to read the HPET, we do it ourselves and try to
>> +		 * reset the lock. This will also break a deadlock if it
>> +		 * happens, for example, when the process context lock holder
>> +		 * gets killed in the middle of reading the HPET counter.
>> +		 */
>> +		time = hpet_readl(HPET_COUNTER);
>> +		WRITE_ONCE(hpet_save.hpet, time);
>> +		if (READ_ONCE(hpet_save.seq) == seq) {
>> +			if (cmpxchg(&hpet_save.seq, seq, seq + 1) == seq)
>> +				pr_warn("read_hpet: reset hpet seq to 0x%x\n",
>> +					seq + 1);
> This is voodoo programming and actively dangerous.
>
> CPU0 	        CPU1	       		CPU2
> lock_hpet()
> T1=read_hpet()	wait_for_unlock()	
> store_hpet(T1)	
> 		....			
> 		T2 = read_hpet()
> unlock_hpet()				
> 					lock_hpet()
> 					T3 = read_hpet()
> 					store_hpet(T3)
> 					unlock_hpet()
> 					return T3
> lock_hpet()
> T4 = read_hpet()			wait_for_unlock()
> store_hpet(T4)	
> 		store_hpet(T2)			
> unlock_hpet()				return T2
>
> CPU2 will observe time going backwards.
>
> Thanks,
>
> 	tglx

That part is leftover code from my testing and debugging effort. I think 
using local_irq_save() should allow the critical section to be executed 
without interruption. In this case, I should be able to remove the 
threshold checking code without harm.
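
With that check gone, the slow path would reduce to something like the
following (a rough sketch only, not the code that was posted):

	/* Wait for the lock holder to publish a fresh HPET value. */
	while (READ_ONCE(hpet_save.seq) == seq) {
		cpu_relax();
		cpu_relax();
	}
	return (cycle_t)READ_ONCE(hpet_save.hpet);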

Thanks for the review.

Cheers,
Longman
