linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] acpi_pm: Reduce PMTMR counter read contention
@ 2019-01-22  7:23 Zhenzhong Duan
  2019-01-30  8:06 ` Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: Zhenzhong Duan @ 2019-01-22  7:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Daniel Lezcano, Thomas Gleixner, Waiman Long, Srinivas Eeda

On a large system with many CPUs, using PMTMR as the clock source can
have a significant impact on the overall system performance because
of the following reasons:
 1) There is a single PMTMR counter shared by all the CPUs.
 2) PMTMR counter reading is a very slow operation.

Using PMTMR as the default clock source may happen when, for example,
the TSC clock calibration exceeds the allowable tolerance and HPET
disabled by nohpet on kernel command line. Sometimes the performance
slowdown can be so severe that the system may crash because of a NMI
watchdog soft lockup, logs:

[   20.181521] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff,
max_idle_ns: 2085701024 ns
[   44.273786] BUG: soft lockup - CPU#48 stuck for 23s! [swapper/48:0]
[   44.279992] BUG: soft lockup - CPU#49 stuck for 23s! [migration/49:307]
[   44.285169] BUG: soft lockup - CPU#50 stuck for 23s! [migration/50:313]

Commit f99fd22e4d4b ("x86/hpet: Reduce HPET counter read contention")
fixed a similar issue for HPET, this patch adapts that design to PMTMR.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Tested-by: Kin Cho <kin.cho@oracle.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Srinivas Eeda <srinivas.eeda@oracle.com>
---
 drivers/clocksource/acpi_pm.c | 101 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 100 insertions(+), 1 deletion(-)

diff --git a/drivers/clocksource/acpi_pm.c b/drivers/clocksource/acpi_pm.c
index 1961e35..8b522eb 100644
--- a/drivers/clocksource/acpi_pm.c
+++ b/drivers/clocksource/acpi_pm.c
@@ -32,12 +32,111 @@
  */
 u32 pmtmr_ioport __read_mostly;
 
-static inline u32 read_pmtmr(void)
+static inline u32 pmtmr_readl(void)
 {
 	/* mask the output to 24 bits */
 	return inl(pmtmr_ioport) & ACPI_PM_MASK;
 }
 
+#if defined(CONFIG_SMP) && defined(CONFIG_64BIT)
+/*
+ * Reading the PMTMR counter is a very slow operation. If a large number of
+ * CPUs are trying to access the PMTMR counter simultaneously, it can cause
+ * massive delay and slow down system performance dramatically. This may
+ * happen when PMTMR is the default clock source instead of TSC. For a
+ * really large system with hundreds of CPUs, the slowdown may be so
+ * severe that it may actually crash the system because of a NMI watchdog
+ * soft lockup, for example.
+ *
+ * If multiple CPUs are trying to access the PMTMR counter at the same time,
+ * we don't actually need to read the counter multiple times. Instead, the
+ * other CPUs can use the counter value read by the first CPU in the group.
+ *
+ * This special feature is only enabled on x86-64 systems. It is unlikely
+ * that 32-bit x86 systems will have enough CPUs to require this feature
+ * with its associated locking overhead. And we also need 64-bit atomic
+ * read.
+ *
+ * The lock and the pmtmr value are stored together and can be read in a
+ * single atomic 64-bit read. It is explicitly assumed that arch_spinlock_t
+ * is 32 bits in size.
+ */
+union pmtmr_lock {
+	struct {
+		arch_spinlock_t lock;
+		u32 value;
+	};
+	u64 lockval;
+};
+
+static union pmtmr_lock pmtmr __cacheline_aligned = {
+	{ .lock = __ARCH_SPIN_LOCK_UNLOCKED, },
+};
+
+static u32 read_pmtmr(void)
+{
+	unsigned long flags;
+	union pmtmr_lock old, new;
+
+	BUILD_BUG_ON(sizeof(union pmtmr_lock) != 8);
+
+	/*
+	 * Read PMTMR directly if in NMI.
+	 */
+	if (in_nmi())
+		return (u64)pmtmr_readl();
+
+	/*
+	 * Read the current state of the lock and PMTMR value atomically.
+	 */
+	old.lockval = READ_ONCE(pmtmr.lockval);
+
+	if (arch_spin_is_locked(&old.lock))
+		goto contended;
+
+	local_irq_save(flags);
+	if (arch_spin_trylock(&pmtmr.lock)) {
+		new.value = pmtmr_readl();
+		/*
+		 * Use WRITE_ONCE() to prevent store tearing.
+		 */
+		WRITE_ONCE(pmtmr.value, new.value);
+		arch_spin_unlock(&pmtmr.lock);
+		local_irq_restore(flags);
+		return (u64)new.value;
+	}
+	local_irq_restore(flags);
+
+contended:
+	/*
+	 * Contended case
+	 * --------------
+	 * Wait until the PMTMR value change or the lock is free to indicate
+	 * its value is up-to-date.
+	 *
+	 * It is possible that old.value has already contained the latest
+	 * PMTMR value while the lock holder was in the process of releasing
+	 * the lock. Checking for lock state change will enable us to return
+	 * the value immediately instead of waiting for the next PMTMR reader
+	 * to come along.
+	 */
+	do {
+		cpu_relax();
+		new.lockval = READ_ONCE(pmtmr.lockval);
+	} while ((new.value == old.value) && arch_spin_is_locked(&new.lock));
+
+	return (u64)new.value;
+}
+#else
+/*
+ * For UP or 32-bit.
+ */
+static inline u32 read_pmtmr(void)
+{
+	return pmtmr_readl();
+}
+#endif
+
 u32 acpi_pm_read_verified(void)
 {
 	u32 v1 = 0, v2 = 0, v3 = 0;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] acpi_pm: Reduce PMTMR counter read contention
  2019-01-22  7:23 [PATCH] acpi_pm: Reduce PMTMR counter read contention Zhenzhong Duan
@ 2019-01-30  8:06 ` Thomas Gleixner
  2019-01-31  3:50   ` Zhenzhong Duan
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2019-01-30  8:06 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: linux-kernel, x86, Daniel Lezcano, Waiman Long, Srinivas Eeda

On Tue, 22 Jan 2019, Zhenzhong Duan wrote:

> On a large system with many CPUs, using PMTMR as the clock source can
> have a significant impact on the overall system performance because
> of the following reasons:
>  1) There is a single PMTMR counter shared by all the CPUs.
>  2) PMTMR counter reading is a very slow operation.
> 
> Using PMTMR as the default clock source may happen when, for example,
> the TSC clock calibration exceeds the allowable tolerance and HPET
> disabled by nohpet on kernel command line. Sometimes the performance

The question is why would anyone disable HPET on a larger machine when the
TSC is wreckaged?

I'm not against the change per se, but I really want to understand why we
need all the complexity for something which should never be used in a real
world deployment.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] acpi_pm: Reduce PMTMR counter read contention
  2019-01-30  8:06 ` Thomas Gleixner
@ 2019-01-31  3:50   ` Zhenzhong Duan
  2019-01-31 14:26     ` Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: Zhenzhong Duan @ 2019-01-31  3:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, Daniel Lezcano, Waiman Long, Srinivas Eeda

On 2019/1/30 16:06, Thomas Gleixner wrote:
> On Tue, 22 Jan 2019, Zhenzhong Duan wrote:
> 
>> On a large system with many CPUs, using PMTMR as the clock source can
>> have a significant impact on the overall system performance because
>> of the following reasons:
>>   1) There is a single PMTMR counter shared by all the CPUs.
>>   2) PMTMR counter reading is a very slow operation.
>>
>> Using PMTMR as the default clock source may happen when, for example,
>> the TSC clock calibration exceeds the allowable tolerance and HPET
>> disabled by nohpet on kernel command line. Sometimes the performance
> 
> The question is why would anyone disable HPET on a larger machine when the
> TSC is wreckaged?

There may be broken hardware where TSC is wreckaged.
On our instances(X8-8/X7-8), TSC isn't wreckaged. Sometimes we are lucky 
to pass the bootup stage, then TSC is the final default clocksource. See 
log:
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff 
max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[   13.963224] clocksource: jiffies: mask: 0xffffffff max_cycles: 
0xffffffff, max_idle_ns: 1911260446275000 ns
[   19.903175] clocksource: Switched to clocksource refined-jiffies
[   20.190467] clocksource: acpi_pm: mask: 0xffffff max_cycles: 
0xffffff, max_idle_ns: 2085701024 ns
[   20.201634] clocksource: Switched to clocksource acpi_pm
[   39.082577] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 
0x2113ba2fe3c, max_idle_ns: 440795266816 ns
[   39.138781] clocksource: Switched to clocksource tsc

When we are unlucky, logs:
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff 
max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[   19.905741] clocksource: Switched to clocksource refined-jiffies
[   20.181521] clocksource: acpi_pm: mask: 0xffffff max_cycles: 
0xffffff, max_idle_ns: 2085701024 ns
[   44.273786] watchdog: BUG: soft lockup - CPU#48 stuck for 23s! 
[swapper/48:0]
[   44.279992] watchdog: BUG: soft lockup - CPU#49 stuck for 23s! 
[migration/49:307]

So we paniced when acpi_pm is initializing and is chosed as default 
clocksource temporarily, it paniced just because we add nohpet parameter.
> 
> I'm not against the change per se, but I really want to understand why we
> need all the complexity for something which should never be used in a real
> world deployment.

Hmm, it's a strong word of "never be used". Customers may happen to use 
nohpet(sanity test?) and report bug to us. Sometimes they does report a 
bug that reproduce with their customed config. There may also be BIOS 
setting HPET disabled.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] acpi_pm: Reduce PMTMR counter read contention
  2019-01-31  3:50   ` Zhenzhong Duan
@ 2019-01-31 14:26     ` Thomas Gleixner
  2019-02-02  2:52       ` Zhenzhong Duan
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2019-01-31 14:26 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: linux-kernel, x86, Daniel Lezcano, Waiman Long, Srinivas Eeda

On Thu, 31 Jan 2019, Zhenzhong Duan wrote:

> On 2019/1/30 16:06, Thomas Gleixner wrote:
> > On Tue, 22 Jan 2019, Zhenzhong Duan wrote:
> > 
> > > On a large system with many CPUs, using PMTMR as the clock source can
> > > have a significant impact on the overall system performance because
> > > of the following reasons:
> > >   1) There is a single PMTMR counter shared by all the CPUs.
> > >   2) PMTMR counter reading is a very slow operation.
> > > 
> > > Using PMTMR as the default clock source may happen when, for example,
> > > the TSC clock calibration exceeds the allowable tolerance and HPET
> > > disabled by nohpet on kernel command line. Sometimes the performance
> > 
> > The question is why would anyone disable HPET on a larger machine when the
> > TSC is wreckaged?
> 
> There may be broken hardware where TSC is wreckaged.

I know that.

> > I'm not against the change per se, but I really want to understand why we
> > need all the complexity for something which should never be used in a real
> > world deployment.
> 
> Hmm, it's a strong word of "never be used". Customers may happen to use
> nohpet(sanity test?) and report bug to us. Sometimes they does report a bug
> that reproduce with their customed config. There may also be BIOS setting HPET
> disabled.

And because the customer MAY do completely nonsensical things (and there
are a lot more than the HPET) the kernel has to handle all of them?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] acpi_pm: Reduce PMTMR counter read contention
  2019-01-31 14:26     ` Thomas Gleixner
@ 2019-02-02  2:52       ` Zhenzhong Duan
  2019-02-02  7:55         ` Kin Cho
  2019-02-10 21:08         ` Thomas Gleixner
  0 siblings, 2 replies; 9+ messages in thread
From: Zhenzhong Duan @ 2019-02-02  2:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, Daniel Lezcano, Waiman Long, Srinivas Eeda, kin.cho

On 2019/1/31 22:26, Thomas Gleixner wrote:
> On Thu, 31 Jan 2019, Zhenzhong Duan wrote:
> 
>> On 2019/1/30 16:06, Thomas Gleixner wrote:
>>> On Tue, 22 Jan 2019, Zhenzhong Duan wrote:
>>>
>>>> On a large system with many CPUs, using PMTMR as the clock source can
>>>> have a significant impact on the overall system performance because
>>>> of the following reasons:
>>>>    1) There is a single PMTMR counter shared by all the CPUs.
>>>>    2) PMTMR counter reading is a very slow operation.
>>>>
>>>> Using PMTMR as the default clock source may happen when, for example,
>>>> the TSC clock calibration exceeds the allowable tolerance and HPET
>>>> disabled by nohpet on kernel command line. Sometimes the performance
>>>
>>> The question is why would anyone disable HPET on a larger machine when the
>>> TSC is wreckaged?
>>
>> There may be broken hardware where TSC is wreckaged.
> 
> I know that.
> 
>>> I'm not against the change per se, but I really want to understand why we
>>> need all the complexity for something which should never be used in a real
>>> world deployment.
>>
>> Hmm, it's a strong word of "never be used". Customers may happen to use
>> nohpet(sanity test?) and report bug to us. Sometimes they does report a bug
>> that reproduce with their customed config. There may also be BIOS setting HPET
>> disabled.
> 
> And because the customer MAY do completely nonsensical things (and there
> are a lot more than the HPET) the kernel has to handle all of them?

Ok, then. I don't have more suggestion to convince you. I just think of 
a simple fix as below. I think it will work for both hpet and pmtmr. We 
will test it when the env is available.

--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1353,6 +1353,7 @@ static int change_clocksource(void *data)

         write_seqcount_end(&tk_core.seq);
         raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
+       tick_clock_notify();

         return 0;
  }
@@ -1371,7 +1372,6 @@ int timekeeping_notify(struct clocksource *clock)
         if (tk->tkr_mono.clock == clock)
                 return 0;
         stop_machine(change_clocksource, clock, NULL);
-       tick_clock_notify();
         return tk->tkr_mono.clock == clock ? 0 : -1;
  }


Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] acpi_pm: Reduce PMTMR counter read contention
  2019-02-02  2:52       ` Zhenzhong Duan
@ 2019-02-02  7:55         ` Kin Cho
  2019-02-10 21:08         ` Thomas Gleixner
  1 sibling, 0 replies; 9+ messages in thread
From: Kin Cho @ 2019-02-02  7:55 UTC (permalink / raw)
  To: zhenzhong.duan, Thomas Gleixner
  Cc: linux-kernel, x86, Daniel Lezcano, Waiman Long, Srinivas Eeda

Zhenzhong,
The machine is running test this weekend, I'll try your simple fix next 
week.

All,
We're not aware of a specific customer need for acpi_pm/PMTMR,
but if we must keep acpi_pm/PMTMR in the kernel, let's fix it so it
actually works, even on machine like ours.
On our hardware currently it's broken both during clocksource selection
and as a permanent clocksource.

Thanks,

-kin

On 2/1/19 6:52 PM, Zhenzhong Duan wrote:
> On 2019/1/31 22:26, Thomas Gleixner wrote:
>> On Thu, 31 Jan 2019, Zhenzhong Duan wrote:
>>
>>> On 2019/1/30 16:06, Thomas Gleixner wrote:
>>>> On Tue, 22 Jan 2019, Zhenzhong Duan wrote:
>>>>
>>>>> On a large system with many CPUs, using PMTMR as the clock source can
>>>>> have a significant impact on the overall system performance because
>>>>> of the following reasons:
>>>>>    1) There is a single PMTMR counter shared by all the CPUs.
>>>>>    2) PMTMR counter reading is a very slow operation.
>>>>>
>>>>> Using PMTMR as the default clock source may happen when, for example,
>>>>> the TSC clock calibration exceeds the allowable tolerance and HPET
>>>>> disabled by nohpet on kernel command line. Sometimes the performance
>>>>
>>>> The question is why would anyone disable HPET on a larger machine 
>>>> when the
>>>> TSC is wreckaged?
>>>
>>> There may be broken hardware where TSC is wreckaged.
>>
>> I know that.
>>
>>>> I'm not against the change per se, but I really want to understand 
>>>> why we
>>>> need all the complexity for something which should never be used in 
>>>> a real
>>>> world deployment.
>>>
>>> Hmm, it's a strong word of "never be used". Customers may happen to use
>>> nohpet(sanity test?) and report bug to us. Sometimes they does 
>>> report a bug
>>> that reproduce with their customed config. There may also be BIOS 
>>> setting HPET
>>> disabled.
>>
>> And because the customer MAY do completely nonsensical things (and there
>> are a lot more than the HPET) the kernel has to handle all of them?
>
> Ok, then. I don't have more suggestion to convince you. I just think 
> of a simple fix as below. I think it will work for both hpet and 
> pmtmr. We will test it when the env is available.
>
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -1353,6 +1353,7 @@ static int change_clocksource(void *data)
>
>         write_seqcount_end(&tk_core.seq);
>         raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
> +       tick_clock_notify();
>
>         return 0;
>  }
> @@ -1371,7 +1372,6 @@ int timekeeping_notify(struct clocksource *clock)
>         if (tk->tkr_mono.clock == clock)
>                 return 0;
>         stop_machine(change_clocksource, clock, NULL);
> -       tick_clock_notify();
>         return tk->tkr_mono.clock == clock ? 0 : -1;
>  }
>
>
> Thanks
> Zhenzhong


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] acpi_pm: Reduce PMTMR counter read contention
  2019-02-02  2:52       ` Zhenzhong Duan
  2019-02-02  7:55         ` Kin Cho
@ 2019-02-10 21:08         ` Thomas Gleixner
  2019-02-18  3:48           ` Zhenzhong Duan
  1 sibling, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2019-02-10 21:08 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: linux-kernel, x86, Daniel Lezcano, Waiman Long, Srinivas Eeda, kin.cho

On Sat, 2 Feb 2019, Zhenzhong Duan wrote:
> On 2019/1/31 22:26, Thomas Gleixner wrote:

> > > > I'm not against the change per se, but I really want to understand
> > > > why we need all the complexity for something which should never be
> > > > used in a real world deployment.
> > > >
> > > Hmm, it's a strong word of "never be used". Customers may happen to
> > > use nohpet(sanity test?) and report bug to us. Sometimes they does
> > > report a bug that reproduce with their customed config. There may
> > > also be BIOS setting HPET disabled.
> > 
> > And because the customer MAY do completely nonsensical things (and there
> > are a lot more than the HPET) the kernel has to handle all of them?
> 
> Ok, then. I don't have more suggestion to convince you.

You give up too fast :)

The point is, that we really want proper justifications for changes like
this. Some 'may, could and more handwaving' simply does not cut it.

So if you can just describe a realistic scenario, which does not involve
thoughtless flipping of BIOS options, then this becomes way more palatable.

> I just think of a simple fix as below. I think it will work for both hpet
> and pmtmr. We will test it when the env is available.

> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -1353,6 +1353,7 @@ static int change_clocksource(void *data)
> 
>         write_seqcount_end(&tk_core.seq);
>         raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
> +       tick_clock_notify();
> 
>         return 0;
>  }
> @@ -1371,7 +1372,6 @@ int timekeeping_notify(struct clocksource *clock)
>         if (tk->tkr_mono.clock == clock)
>                 return 0;
>         stop_machine(change_clocksource, clock, NULL);
> -       tick_clock_notify();
>         return tk->tkr_mono.clock == clock ? 0 : -1;
>  }

This won't resolve the concurrency issues of HPET or PMTIMER in any
way. Instead it breaks the careful orchestrated mechanism of clocksource
change.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] acpi_pm: Reduce PMTMR counter read contention
  2019-02-10 21:08         ` Thomas Gleixner
@ 2019-02-18  3:48           ` Zhenzhong Duan
  2019-02-28  3:33             ` Zhenzhong Duan
  0 siblings, 1 reply; 9+ messages in thread
From: Zhenzhong Duan @ 2019-02-18  3:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, Daniel Lezcano, Waiman Long, Srinivas Eeda, kin.cho

On 2019/2/11 5:08, Thomas Gleixner wrote:
> On Sat, 2 Feb 2019, Zhenzhong Duan wrote:
>> On 2019/1/31 22:26, Thomas Gleixner wrote:
> 
>>>>> I'm not against the change per se, but I really want to understand
>>>>> why we need all the complexity for something which should never be
>>>>> used in a real world deployment.
>>>>>
>>>> Hmm, it's a strong word of "never be used". Customers may happen to
>>>> use nohpet(sanity test?) and report bug to us. Sometimes they does
>>>> report a bug that reproduce with their customed config. There may
>>>> also be BIOS setting HPET disabled.
>>>
>>> And because the customer MAY do completely nonsensical things (and there
>>> are a lot more than the HPET) the kernel has to handle all of them?
>>
>> Ok, then. I don't have more suggestion to convince you.
> 
> You give up too fast :)

Ah, because I thought of a simple fix.
> 
> The point is, that we really want proper justifications for changes like
> this. Some 'may, could and more handwaving' simply does not cut it.
> 
> So if you can just describe a realistic scenario, which does not involve
> thoughtless flipping of BIOS options, then this becomes way more palatable.

I indeed don't see a realistic scenario in a product env needing to use 
nohpet. My only justification is now that we have nohpet as kernel 
parameter, we should fix the softlockup in large machines for enterprise 
use.
> 
>> I just think of a simple fix as below. I think it will work for both hpet
>> and pmtmr. We will test it when the env is available.
> 
>> --- a/kernel/time/timekeeping.c
>> +++ b/kernel/time/timekeeping.c
>> @@ -1353,6 +1353,7 @@ static int change_clocksource(void *data)
>>
>>          write_seqcount_end(&tk_core.seq);
>>          raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
>> +       tick_clock_notify();
>>
>>          return 0;
>>   }
>> @@ -1371,7 +1372,6 @@ int timekeeping_notify(struct clocksource *clock)
>>          if (tk->tkr_mono.clock == clock)
>>                  return 0;
>>          stop_machine(change_clocksource, clock, NULL);
>> -       tick_clock_notify();
>>          return tk->tkr_mono.clock == clock ? 0 : -1;
>>   }
> 
> This won't resolve the concurrency issues of HPET or PMTIMER in any
> way.

Just got chance to test and Kin confirmed it fix the softlockup of 
PMTMR(with nohpet) and HPET(without nohpet, revert previous hpet commit) 
at bootup stage.

My understandig is, at bootup stage tick device is firstly initialized 
in periodic mode and then switch to one-shot mode. In periodic mode 
clock event interrupt is triggered every 1ms(HZ=1000), contention in 
HPET or PMTIMER exceeds 1ms and delayed the clock interrupt. Then CPUs 
continue to process interrupt one by one without a break, 
tick_clock_notify() have no chance to be called and we never switch to 
one-shot mode.

In one-shot mode, the contention is still there but next event is always 
set with a future value. We may missed some ticks, but the timer code is 
smart enough to pick up those missed ticks.

By moving tick_clock_notify() in stop_machine, kernel changes to 
one-shot mode early before the contention accumulate and lockup system.

> Instead it breaks the careful orchestrated mechanism of clocksource
> change.

Sorry, I didn't get a idea how it breaks, tick_clock_notify() is a 
simple function setting bitmask in percpu variable. Could you explain a bit?

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] acpi_pm: Reduce PMTMR counter read contention
  2019-02-18  3:48           ` Zhenzhong Duan
@ 2019-02-28  3:33             ` Zhenzhong Duan
  0 siblings, 0 replies; 9+ messages in thread
From: Zhenzhong Duan @ 2019-02-28  3:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, x86, Daniel Lezcano, Waiman Long, Srinivas Eeda,
	kin.cho, linux_lkml_grp

On 2019/2/18 11:48, Zhenzhong Duan wrote:
> On 2019/2/11 5:08, Thomas Gleixner wrote:
>> On Sat, 2 Feb 2019, Zhenzhong Duan wrote:
>>> On 2019/1/31 22:26, Thomas Gleixner wrote:
>>
>>>>>> I'm not against the change per se, but I really want to understand
>>>>>> why we need all the complexity for something which should never be
>>>>>> used in a real world deployment.
>>>>>>
>>>>> Hmm, it's a strong word of "never be used". Customers may happen to
>>>>> use nohpet(sanity test?) and report bug to us. Sometimes they does
>>>>> report a bug that reproduce with their customed config. There may
>>>>> also be BIOS setting HPET disabled.
>>>>
>>>> And because the customer MAY do completely nonsensical things (and 
>>>> there
>>>> are a lot more than the HPET) the kernel has to handle all of them?
>>>
>>> Ok, then. I don't have more suggestion to convince you.
>>
>> You give up too fast :)
> 
> Ah, because I thought of a simple fix.
>>
>> The point is, that we really want proper justifications for changes like
>> this. Some 'may, could and more handwaving' simply does not cut it.
>>
>> So if you can just describe a realistic scenario, which does not involve
>> thoughtless flipping of BIOS options, then this becomes way more 
>> palatable.
> 
> I indeed don't see a realistic scenario in a product env needing to use 
> nohpet. My only justification is now that we have nohpet as kernel 
> parameter, we should fix the softlockup in large machines for enterprise 
> use.
>>
>>> I just think of a simple fix as below. I think it will work for both 
>>> hpet
>>> and pmtmr. We will test it when the env is available.
>>
>>> --- a/kernel/time/timekeeping.c
>>> +++ b/kernel/time/timekeeping.c
>>> @@ -1353,6 +1353,7 @@ static int change_clocksource(void *data)
>>>
>>>          write_seqcount_end(&tk_core.seq);
>>>          raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
>>> +       tick_clock_notify();
>>>
>>>          return 0;
>>>   }
>>> @@ -1371,7 +1372,6 @@ int timekeeping_notify(struct clocksource *clock)
>>>          if (tk->tkr_mono.clock == clock)
>>>                  return 0;
>>>          stop_machine(change_clocksource, clock, NULL);
>>> -       tick_clock_notify();
>>>          return tk->tkr_mono.clock == clock ? 0 : -1;
>>>   }
>>
>> This won't resolve the concurrency issues of HPET or PMTIMER in any
>> way.
> 
> Just got chance to test and Kin confirmed it fix the softlockup of 
> PMTMR(with nohpet) and HPET(without nohpet, revert previous hpet commit) 
> at bootup stage.
> 
> My understandig is, at bootup stage tick device is firstly initialized 
> in periodic mode and then switch to one-shot mode. In periodic mode 
> clock event interrupt is triggered every 1ms(HZ=1000), contention in 
> HPET or PMTIMER exceeds 1ms and delayed the clock interrupt. Then CPUs 
> continue to process interrupt one by one without a break, 
> tick_clock_notify() have no chance to be called and we never switch to 
> one-shot mode.
> 
> In one-shot mode, the contention is still there but next event is always 
> set with a future value. We may missed some ticks, but the timer code is 
> smart enough to pick up those missed ticks.
> 
> By moving tick_clock_notify() in stop_machine, kernel changes to 
> one-shot mode early before the contention accumulate and lockup system.
> 
>> Instead it breaks the careful orchestrated mechanism of clocksource
>> change.
> 
> Sorry, I didn't get a idea how it breaks, tick_clock_notify() is a 
> simple function setting bitmask in percpu variable. Could you explain a 
> bit?

Hi Thomas,

May I have your further comments? I think applying a simple patch to fix 
both hpet and pmtmr softlockup is better?

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-02-28  3:33 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-22  7:23 [PATCH] acpi_pm: Reduce PMTMR counter read contention Zhenzhong Duan
2019-01-30  8:06 ` Thomas Gleixner
2019-01-31  3:50   ` Zhenzhong Duan
2019-01-31 14:26     ` Thomas Gleixner
2019-02-02  2:52       ` Zhenzhong Duan
2019-02-02  7:55         ` Kin Cho
2019-02-10 21:08         ` Thomas Gleixner
2019-02-18  3:48           ` Zhenzhong Duan
2019-02-28  3:33             ` Zhenzhong Duan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).