linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* RE: [PATCH] idle using PNI monitor/mwait (take 2)
@ 2003-09-06  1:03 Nakajima, Jun
  2003-09-06  9:58 ` Jamie Lokier
  0 siblings, 1 reply; 7+ messages in thread
From: Nakajima, Jun @ 2003-09-06  1:03 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andrew Morton, linux-kernel, Saxena, Sunil, Mallick, Asit K,
	Pallipadi, Venkatesh

Thanks for the explanation.

> It isn't defensive.  The local_irq_disable() is needed to avoid a race
> condition.  It was added to fix high latency spikes.
> 
> Without it, after checking need_resched, that flag can be set by an
> interrupt before we do "hlt".
> 
> The result is a halted CPU even though need_resched is set, and the
> idle loop isn't broken until the next interrupt, often a timer tick.

I'm aware of the window, but did not realize that it could cause high
latency spikes. Is that still the case with 2.6 where we have higher HZ
(1000)? Anyway, I think it's a cheap way of removing such spikes.

> So you can remove it from your loop.
Okay we'll remove local_irq_enable() at entry. So in that case we can
remove the local_irq_enable() below as well?

static void poll_idle (void)
{
	int oldval;

=>	local_irq_enable();

	/*
	 * Deal with another CPU just having chosen a thread to
	 * run here:
	 */
	oldval = test_and_clear_thread_flag(TIF_NEED_RESCHED);
...

Thanks,
Jun

> -----Original Message-----
> From: Jamie Lokier [mailto:jamie@shareable.org]
> Sent: Friday, September 05, 2003 2:14 PM
> To: Nakajima, Jun
> Cc: Andrew Morton; linux-kernel@vger.kernel.org; Saxena, Sunil;
Mallick,
> Asit K; Pallipadi, Venkatesh
> Subject: Re: [PATCH] idle using PNI monitor/mwait (take 2)
> 
> Nakajima, Jun wrote:
> > We are doing this as defensive programming (because of bogus device
> > drivers, for example), like the other idle routines (default_idle,
and
> > poll_idle) always do.
> >
> > BTW, I'm not sure that local_irq_disable() is really required below
(as
> > you know, "sti" is hiding in safe_halt()).
> >
> > void default_idle(void)
> > {
> > 	if (!hlt_counter && current_cpu_data.hlt_works_ok) {
> > =>		local_irq_disable();
> > 		if (!need_resched())
> > 			safe_halt();
> > 		else
> > 			local_irq_enable();
> > 	}
> > }
> 
> 
> It isn't defensive.  The local_irq_disable() is needed to avoid a race
> condition.  It was added to fix high latency spikes.
> 
> (A comment to that effect in default_idle would be nice).
> 
> Without it, after checking need_resched, that flag can be set by an
> interrupt before we do "hlt".
> 
> The result is a halted CPU even though need_resched is set, and the
> idle loop isn't broken until the next interrupt, often a timer tick.
> 
> The "sti" in safe_halt() is immediately before the "hlt", which means
> it doesn't enable interrupts until after the "hlt" instruction.  This
> means that any interrupt will wake the CPU.
> 
> local_irq_disable() isn't required in the monitor/mwait loop, because
> you check need_resched between the monitor and mwait.  (If Intel had
> implemented monitor+mwait as a single instruction, then you'd need
it).
> 
> So you can remove it from your loop.
> 
> Enjoy :)
> -- Jamie


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] idle using PNI monitor/mwait (take 2)
  2003-09-06  1:03 [PATCH] idle using PNI monitor/mwait (take 2) Nakajima, Jun
@ 2003-09-06  9:58 ` Jamie Lokier
  0 siblings, 0 replies; 7+ messages in thread
From: Jamie Lokier @ 2003-09-06  9:58 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: Andrew Morton, linux-kernel, Saxena, Sunil, Mallick, Asit K,
	Pallipadi, Venkatesh

Nakajima, Jun wrote:
> > The result is a halted CPU even though need_resched is set, and the
> > idle loop isn't broken until the next interrupt, often a timer tick.
> 
> I'm aware of the window, but did not realize that it could cause high
> latency spikes. Is that still the case with 2.6 where we have higher HZ
> (1000)? Anyway, I think it's a cheap way of removing such spikes.

HZ is irrelevant.  If you receive an interrupt, and typical wakeup
latency is 0.05ms, then waiting until the next timer tick in 1ms is a spike.
Some applications tolerate that, some don't.

> > So you can remove it from your loop.
> Okay we'll remove local_irq_enable() at entry. So in that case we can
> remove the local_irq_enable() below as well?
> 
> static void poll_idle (void)
> {
> 	int oldval;
> 
> =>	local_irq_enable();

I misunderstood you and now you misunderstood me :)

It's ok to _disable_ irqs for the reasons I gave.  I thought of that
because you pointed to the _disable_ in your quote of default_idle.

The _enable_ is there for a different reason.

Scheduling is not allowed with interrupts disabled.  So we know that
when schedule() returns, local irqs are enabled.  So poll_idle() doesn't
need to enable them.  I suggest this change:

        - remove the local_irq_enable() from poll_idle().

        - add local_irq_enable() at the start of cpu_idle(), before the loop.

-- Jamie

diff -urN --exclude-from=dontdiff orig-2.6.0-test4/arch/i386/kernel/process.c idle_irqs-2.6.0-test4/arch/i386/kernel/process.c
--- orig-2.6.0-test4/arch/i386/kernel/process.c	2003-09-02 23:05:06.000000000 +0100
+++ idle_irqs-2.6.0-test4/arch/i386/kernel/process.c	2003-09-06 10:50:59.000000000 +0100
@@ -105,8 +105,6 @@
 {
 	int oldval;
 
-	local_irq_enable();
-
 	/*
 	 * Deal with another CPU just having chosen a thread to
 	 * run here:
@@ -136,6 +134,8 @@
  */
 void cpu_idle (void)
 {
+	local_irq_enable();
+
 	/* endless idle loop with no priority at all */
 	while (1) {
 		while (!need_resched()) {

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] idle using PNI monitor/mwait (take 2)
  2003-09-05 21:14 ` Jamie Lokier
@ 2003-09-05 22:03   ` Jamie Lokier
  0 siblings, 0 replies; 7+ messages in thread
From: Jamie Lokier @ 2003-09-05 22:03 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: Andrew Morton, linux-kernel, Saxena, Sunil, Mallick, Asit K,
	Pallipadi, Venkatesh

Jamie Lokier wrote:
> local_irq_disable() isn't required in the monitor/mwait loop, because
> you check need_resched between the monitor and mwait.  (If Intel had
> implemented monitor+mwait as a single instruction, then you'd need it).
> 
> So you can remove it from your loop.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Excuse me, I hadn't looked at your code closely.  Everything I said up
to that last line is fine.  But the last line doesn't apply because
you don't have a local_irq_disable().

You can remove your local_irq_enable() instead, if you want, but for
a different reason.  That one _is_ defensive. :)

-- Jamie

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] idle using PNI monitor/mwait (take 2)
  2003-09-05 20:41 Nakajima, Jun
@ 2003-09-05 21:14 ` Jamie Lokier
  2003-09-05 22:03   ` Jamie Lokier
  0 siblings, 1 reply; 7+ messages in thread
From: Jamie Lokier @ 2003-09-05 21:14 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: Andrew Morton, linux-kernel, Saxena, Sunil, Mallick, Asit K,
	Pallipadi, Venkatesh

Nakajima, Jun wrote:
> We are doing this as defensive programming (because of bogus device
> drivers, for example), like the other idle routines (default_idle, and
> poll_idle) always do. 
> 
> BTW, I'm not sure that local_irq_disable() is really required below (as
> you know, "sti" is hiding in safe_halt()).
> 
> void default_idle(void)
> {
> 	if (!hlt_counter && current_cpu_data.hlt_works_ok) {
> =>		local_irq_disable();
> 		if (!need_resched())
> 			safe_halt();
> 		else
> 			local_irq_enable();
> 	}
> }


It isn't defensive.  The local_irq_disable() is needed to avoid a race
condition.  It was added to fix high latency spikes.

(A comment to that effect in default_idle would be nice).

Without it, after checking need_resched, that flag can be set by an
interrupt before we do "hlt".

The result is a halted CPU even though need_resched is set, and the
idle loop isn't broken until the next interrupt, often a timer tick.

The "sti" in safe_halt() is immediately before the "hlt", which means
it doesn't enable interrupts until after the "hlt" instruction.  This
means that any interrupt will wake the CPU.

local_irq_disable() isn't required in the monitor/mwait loop, because
you check need_resched between the monitor and mwait.  (If Intel had
implemented monitor+mwait as a single instruction, then you'd need it).

So you can remove it from your loop.

Enjoy :)
-- Jamie


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] idle using PNI monitor/mwait (take 2)
@ 2003-09-05 20:41 Nakajima, Jun
  2003-09-05 21:14 ` Jamie Lokier
  0 siblings, 1 reply; 7+ messages in thread
From: Nakajima, Jun @ 2003-09-05 20:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Saxena, Sunil, Mallick, Asit K, Pallipadi, Venkatesh

We are doing this as defensive programming (because of bogus device
drivers, for example), like the other idle routines (default_idle, and
poll_idle) always do. 

BTW, I'm not sure that local_irq_disable() is really required below (as
you know, "sti" is hiding in safe_halt()).

void default_idle(void)
{
	if (!hlt_counter && current_cpu_data.hlt_works_ok) {
=>		local_irq_disable();
		if (!need_resched())
			safe_halt();
		else
			local_irq_enable();
	}
}

Thanks,
Jun


> -----Original Message-----
> From: Andrew Morton [mailto:akpm@osdl.org]
> Sent: Friday, September 05, 2003 10:14 AM
> To: Nakajima, Jun
> Cc: linux-kernel@vger.kernel.org; Saxena, Sunil; Mallick, Asit K;
> Pallipadi, Venkatesh
> Subject: Re: [PATCH] idle using PNI monitor/mwait (take 2)
> 
> "Nakajima, Jun" <jun.nakajima@intel.com> wrote:
> >
> > Attached is a patch that enables PNI (Prescott New Instructions)
> > monitor/mwait in the kernel idle.
> 
> Thanks, looks good.
> 
> Why is there a local_irq_enable() on entry to mwait_idle()?


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] idle using PNI monitor/mwait (take 2)
  2003-09-05  2:19 Nakajima, Jun
@ 2003-09-05 17:14 ` Andrew Morton
  0 siblings, 0 replies; 7+ messages in thread
From: Andrew Morton @ 2003-09-05 17:14 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: linux-kernel, sunil.saxena, asit.k.mallick, venkatesh.pallipadi

"Nakajima, Jun" <jun.nakajima@intel.com> wrote:
>
> Attached is a patch that enables PNI (Prescott New Instructions)
> monitor/mwait in the kernel idle. 

Thanks, looks good.

Why is there a local_irq_enable() on entry to mwait_idle()?


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] idle using PNI monitor/mwait (take 2)
@ 2003-09-05  2:19 Nakajima, Jun
  2003-09-05 17:14 ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Nakajima, Jun @ 2003-09-05  2:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Saxena, Sunil, Mallick, Asit K, Pallipadi, Venkatesh

[-- Attachment #1: Type: text/plain, Size: 4329 bytes --]

Hi Andrew,

Attached is a patch that enables PNI (Prescott New Instructions)
monitor/mwait in the kernel idle. It is for mm4, but it should apply to
mm5. 

Thanks,
Jun

-------
diff -purN linux-2.6.0-test4-mm4/arch/i386/kernel/cpu/intel.c
linux-2.6.0-test4-mm4-mwait/arch/i386/kernel/cpu/intel.c
--- linux-2.6.0-test4-mm4/arch/i386/kernel/cpu/intel.c	2003-09-02
17:56:48.000000000 -0700
+++ linux-2.6.0-test4-mm4-mwait/arch/i386/kernel/cpu/intel.c
2003-09-03 12:12:56.000000000 -0700
@@ -12,6 +12,8 @@
 
 #include "cpu.h"
 
+extern void select_idle_routine(const struct cpuinfo_x86 *c);
+
 #ifdef CONFIG_X86_INTEL_USERCOPY
 /*
  * Alignment at which movsl is preferred for bulk memory copies.
@@ -163,7 +165,7 @@ static void __init init_intel(struct cpu
 	}
 #endif
 
-
+	select_idle_routine(c);
 	if (c->cpuid_level > 1) {
 		/* supports eax=2  call */
 		int i, j, n;
diff -purN linux-2.6.0-test4-mm4/arch/i386/kernel/process.c
linux-2.6.0-test4-mm4-mwait/arch/i386/kernel/process.c
--- linux-2.6.0-test4-mm4/arch/i386/kernel/process.c	2003-09-02
17:56:48.000000000 -0700
+++ linux-2.6.0-test4-mm4-mwait/arch/i386/kernel/process.c
2003-09-02 20:18:47.000000000 -0700
@@ -152,11 +152,56 @@ void cpu_idle (void)
 	}
 }
 
+/*
+ * This uses new MONITOR/MWAIT instructions on P4 processors with PNI, 
+ * which can obviate IPI to trigger checking of need_resched.
+ * We execute MONITOR against need_resched and enter optimized wait
state 
+ * through MWAIT. Whenever someone changes need_resched, we would be
woken 
+ * up from MWAIT (without an IPI).
+ */
+static void mwait_idle (void)
+{
+	local_irq_enable();
+
+	if (!need_resched()) {
+		set_thread_flag(TIF_POLLING_NRFLAG);
+		do {
+			__monitor((void *)&current_thread_info()->flags,
0, 0);
+			if (need_resched())
+				break;
+			__mwait(0, 0);
+		} while (!need_resched());
+		clear_thread_flag(TIF_POLLING_NRFLAG);
+	}
+}
+
+void __init select_idle_routine(const struct cpuinfo_x86 *c)
+{
+	if (cpu_has(c, X86_FEATURE_MWAIT)) {
+		printk("Monitor/Mwait feature present.\n");
+		/* 
+		 * Skip, if setup has overridden idle.
+		 * Also, take care of system with asymmetric CPUs.
+		 * Use, mwait_idle only if all cpus support it.
+		 * If not, we fallback to default_idle()
+		 */
+		if (!pm_idle) {
+			pm_idle = mwait_idle;
+		}
+		return;
+	}
+	pm_idle = default_idle;
+	return;
+}
+
 static int __init idle_setup (char *str)
 {
 	if (!strncmp(str, "poll", 4)) {
 		printk("using polling idle threads.\n");
 		pm_idle = poll_idle;
+	} else if (!strncmp(str, "halt", 4)) {
+		printk("using halt in idle threads.\n");
+		pm_idle = default_idle;
 	}
 
 	return 1;
diff -purN linux-2.6.0-test4-mm4/include/asm-i386/cpufeature.h
linux-2.6.0-test4-mm4-mwait/include/asm-i386/cpufeature.h
--- linux-2.6.0-test4-mm4/include/asm-i386/cpufeature.h	2003-08-22
16:58:04.000000000 -0700
+++ linux-2.6.0-test4-mm4-mwait/include/asm-i386/cpufeature.h
2003-09-02 20:18:47.000000000 -0700
@@ -71,6 +71,8 @@
 
 /* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
 #define X86_FEATURE_EST		(4*32+ 7) /* Enhanced SpeedStep
*/
+#define X86_FEATURE_MWAIT	(4*32+ 3) /* Monitor/Mwait support */
+
 
 /* VIA/Cyrix/Centaur-defined CPU features, CPUID level 0xC0000001, word
5 */
 #define X86_FEATURE_XSTORE	(5*32+ 2) /* on-CPU RNG present (xstore
insn) */
diff -purN linux-2.6.0-test4-mm4/include/asm-i386/processor.h
linux-2.6.0-test4-mm4-mwait/include/asm-i386/processor.h
--- linux-2.6.0-test4-mm4/include/asm-i386/processor.h	2003-09-02
17:56:48.000000000 -0700
+++ linux-2.6.0-test4-mm4-mwait/include/asm-i386/processor.h
2003-09-02 20:18:47.000000000 -0700
@@ -272,6 +272,22 @@ extern int MCA_bus;
 #define pc98 0
 #endif
 
+static __inline__ void __monitor(const void *eax, unsigned long ecx, 
+		unsigned long edx)
+{
+	/* "monitor %eax,%ecx,%edx;" */
+	asm volatile(
+		".byte 0x0f,0x01,0xc8;"
+		: :"a" (eax), "c" (ecx), "d"(edx));
+}
+
+static __inline__ void __mwait(unsigned long eax, unsigned long ecx)
+{
+	/* "mwait %eax,%ecx;" */
+	asm volatile(
+		".byte 0x0f,0x01,0xc9;"
+		: :"a" (eax), "c" (ecx));
+}
 
 /* from system description table in BIOS.  Mostly for MCA use, but
 others may find it useful. */




[-- Attachment #2: mm4-mwait.patch --]
[-- Type: application/octet-stream, Size: 3999 bytes --]

diff -purN linux-2.6.0-test4-mm4/arch/i386/kernel/cpu/intel.c linux-2.6.0-test4-mm4-mwait/arch/i386/kernel/cpu/intel.c
--- linux-2.6.0-test4-mm4/arch/i386/kernel/cpu/intel.c	2003-09-02 17:56:48.000000000 -0700
+++ linux-2.6.0-test4-mm4-mwait/arch/i386/kernel/cpu/intel.c	2003-09-03 12:12:56.000000000 -0700
@@ -12,6 +12,8 @@
 
 #include "cpu.h"
 
+extern void select_idle_routine(const struct cpuinfo_x86 *c);
+
 #ifdef CONFIG_X86_INTEL_USERCOPY
 /*
  * Alignment at which movsl is preferred for bulk memory copies.
@@ -163,7 +165,7 @@ static void __init init_intel(struct cpu
 	}
 #endif
 
-
+	select_idle_routine(c);
 	if (c->cpuid_level > 1) {
 		/* supports eax=2  call */
 		int i, j, n;
diff -purN linux-2.6.0-test4-mm4/arch/i386/kernel/process.c linux-2.6.0-test4-mm4-mwait/arch/i386/kernel/process.c
--- linux-2.6.0-test4-mm4/arch/i386/kernel/process.c	2003-09-02 17:56:48.000000000 -0700
+++ linux-2.6.0-test4-mm4-mwait/arch/i386/kernel/process.c	2003-09-02 20:18:47.000000000 -0700
@@ -152,11 +152,56 @@ void cpu_idle (void)
 	}
 }
 
+/*
+ * This uses new MONITOR/MWAIT instructions on P4 processors with PNI, 
+ * which can obviate IPI to trigger checking of need_resched.
+ * We execute MONITOR against need_resched and enter optimized wait state 
+ * through MWAIT. Whenever someone changes need_resched, we would be woken 
+ * up from MWAIT (without an IPI).
+ */
+static void mwait_idle (void)
+{
+	local_irq_enable();
+
+	if (!need_resched()) {
+		set_thread_flag(TIF_POLLING_NRFLAG);
+		do {
+			__monitor((void *)&current_thread_info()->flags, 0, 0);
+			if (need_resched())
+				break;
+			__mwait(0, 0);
+		} while (!need_resched());
+		clear_thread_flag(TIF_POLLING_NRFLAG);
+	}
+}
+
+void __init select_idle_routine(const struct cpuinfo_x86 *c)
+{
+	if (cpu_has(c, X86_FEATURE_MWAIT)) {
+		printk("Monitor/Mwait feature present.\n");
+		/* 
+		 * Skip, if setup has overridden idle.
+		 * Also, take care of system with asymmetric CPUs.
+		 * Use, mwait_idle only if all cpus support it.
+		 * If not, we fallback to default_idle()
+		 */
+		if (!pm_idle) {
+			pm_idle = mwait_idle;
+		}
+		return;
+	}
+	pm_idle = default_idle;
+	return;
+}
+
 static int __init idle_setup (char *str)
 {
 	if (!strncmp(str, "poll", 4)) {
 		printk("using polling idle threads.\n");
 		pm_idle = poll_idle;
+	} else if (!strncmp(str, "halt", 4)) {
+		printk("using halt in idle threads.\n");
+		pm_idle = default_idle;
 	}
 
 	return 1;
diff -purN linux-2.6.0-test4-mm4/include/asm-i386/cpufeature.h linux-2.6.0-test4-mm4-mwait/include/asm-i386/cpufeature.h
--- linux-2.6.0-test4-mm4/include/asm-i386/cpufeature.h	2003-08-22 16:58:04.000000000 -0700
+++ linux-2.6.0-test4-mm4-mwait/include/asm-i386/cpufeature.h	2003-09-02 20:18:47.000000000 -0700
@@ -71,6 +71,8 @@
 
 /* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
 #define X86_FEATURE_EST		(4*32+ 7) /* Enhanced SpeedStep */
+#define X86_FEATURE_MWAIT	(4*32+ 3) /* Monitor/Mwait support */
+
 
 /* VIA/Cyrix/Centaur-defined CPU features, CPUID level 0xC0000001, word 5 */
 #define X86_FEATURE_XSTORE	(5*32+ 2) /* on-CPU RNG present (xstore insn) */
diff -purN linux-2.6.0-test4-mm4/include/asm-i386/processor.h linux-2.6.0-test4-mm4-mwait/include/asm-i386/processor.h
--- linux-2.6.0-test4-mm4/include/asm-i386/processor.h	2003-09-02 17:56:48.000000000 -0700
+++ linux-2.6.0-test4-mm4-mwait/include/asm-i386/processor.h	2003-09-02 20:18:47.000000000 -0700
@@ -272,6 +272,22 @@ extern int MCA_bus;
 #define pc98 0
 #endif
 
+static __inline__ void __monitor(const void *eax, unsigned long ecx, 
+		unsigned long edx)
+{
+	/* "monitor %eax,%ecx,%edx;" */
+	asm volatile(
+		".byte 0x0f,0x01,0xc8;"
+		: :"a" (eax), "c" (ecx), "d"(edx));
+}
+
+static __inline__ void __mwait(unsigned long eax, unsigned long ecx)
+{
+	/* "mwait %eax,%ecx;" */
+	asm volatile(
+		".byte 0x0f,0x01,0xc9;"
+		: :"a" (eax), "c" (ecx));
+}
 
 /* from system description table in BIOS.  Mostly for MCA use, but
 others may find it useful. */

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2003-09-06  9:58 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-09-06  1:03 [PATCH] idle using PNI monitor/mwait (take 2) Nakajima, Jun
2003-09-06  9:58 ` Jamie Lokier
  -- strict thread matches above, loose matches on Subject: below --
2003-09-05 20:41 Nakajima, Jun
2003-09-05 21:14 ` Jamie Lokier
2003-09-05 22:03   ` Jamie Lokier
2003-09-05  2:19 Nakajima, Jun
2003-09-05 17:14 ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).