* A few questions and issues with dynticks, NOHZ and powertop
@ 2010-04-03 22:33 Dominik Brodowski
  2010-04-03 23:53 ` Dmitry Torokhov
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Dominik Brodowski @ 2010-04-03 22:33 UTC
  To: linux-kernel, Thomas Gleixner, Ingo Molnar, Peter Zijlstra
  Cc: Alan Stern, Arjan van de Ven, Dmitry Torokhov

Hey!

Before I'm off hiding some Easter eggs, here are some questions and
issues related to "dynticks", NOHZ, and powertop:

1) single-CPU systems, SMP-capable kernel and RCU 
2) dual-core CPU[*] and select_nohz_load_balancer()
3) USB, autosuspend failure, excessive ticks
4) SynPS/2 touchpad and hundreds of IRQs per second
5) powertop: 1 + 1 = 1


1) single-CPU systems, SMP-capable kernel and RCU

CONFIG_TREE_RCU=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FAST_NO_HZ=y

Booting a SMP-capable kernel with "nosmp", or manually offlining one CPU
(or -- though I haven't tested it -- booting a SMP-capable kernel on a
system with merely one CPU) means that up to about half of the calls to
tick_nohz_stop_sched_tick() are aborted due to rcu_needs_cpu(). This is
quite strange to me: AFAIK, RCU is an excellent tool for SMP, but not really
needed for UP? And all updates seem to be local to the CPU anyway.
Therefore, I'd presume that rcu_needs_cpu() should return 0 on
one-CPU-systems. Or could RCU switch between TINY_RCU on UP and TREE_RCU on
SMP (using alternatives or whatever)?


2) dual-core CPU[*] and select_nohz_load_balancer()
[*] (Intel(R) Core(TM)2 Duo CPU T7250)

# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
CONFIG_SCHED_HRTICK=y

CONFIG_SCHED_MC is ignored, as mc_capable() returns 0 on a one-socket,
dual-core system. Quite surprisingly, even under moderate load (~98.0% idle)
while writing this bug report, up to half of the calls to
tick_nohz_stop_sched_tick() are aborted due to select_nohz_load_balancer(1):

		if (atomic_read(&nohz.load_balancer) == -1) {
			/* make me the ilb owner */
			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
				return 1;

I'm not really sure, but I guess this is caused by the following sequence,
which can occur under light load whenever, every once in a while, there is
parallel work for both CPUs:

CPU #0					CPU #1

<active>				<active>
<idle>					<active>
  tick_nohz_stop_sched_tick(1)		<active>
   select_nohz_load_balancer(1)		<active>
    => becomes ilb owner		<idle>
   => tick is not stopped		 tick_nohz_stop_sched_tick(1)
  => CPU goes to sleep for 1 tick	  => as it isn't the ILB owner, tick
  <sleep for 1 tick>			     is stopped	.
  ---> scheduler_tick()			  <sleeeeeeeep>
  tick_nohz_stop_sched_tick(0)
<still idle>
  tick_nohz_stop_sched_tick(1)
   select_nohz_load_balancer(1)
    => is ilb owner, all CPUs idle,
       may go to sleep.

If both CPUs have hardly anything to do, letting the _active_ CPU do ilb
allows us to enter deep sleep states earlier, and longer:

current ILB model (* = ILB)

	tick ---------- tick -------- tick ----- IRQ
CPU0:   active|IDLE(C2)--|*|IDLE (C3)             |
CPU1:   active....| IDLE (C3)                     |
core:   .......???| C2   |           C3           |

ILB-by-active-CPU-on-light-load:

	tick ---------- tick -------- tick ----- IRQ
CPU0:   active|IDLE(C3)                           |
CPU1:   active....*| IDLE (C3)                    |
core:   .......????|               C3             |
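
The ownership handoff in the snippet and timeline above can be approximated
in userspace with C11 atomics. This is a minimal sketch, not the kernel code:
ilb_owner, ILB_NONE and try_claim_ilb() are hypothetical stand-ins for
nohz.load_balancer and the atomic_cmpxchg() test, and the return convention
(true = this CPU keeps its tick) only mirrors the "return 1" path quoted above.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define ILB_NONE (-1)	/* no CPU currently owns idle load balancing */

/* Hypothetical userspace stand-in for nohz.load_balancer. */
static atomic_int ilb_owner = ILB_NONE;

/*
 * Try to become the idle-load-balancer owner.  Returns true only for the
 * one caller whose compare-and-swap replaced ILB_NONE with its own id;
 * in the quoted kernel snippet that caller keeps its tick running.
 */
static bool try_claim_ilb(int cpu)
{
	int expected = ILB_NONE;

	return atomic_compare_exchange_strong(&ilb_owner, &expected, cpu);
}
```

With two CPUs going idle one after the other, the first caller gets true and
the second gets false, which matches the timeline: CPU #0 keeps ticking for
one more period while CPU #1 stops its tick right away.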


3) USB: built-in UHCI and a built-in 0a5c:2101 Broadcom Corp. A-Link
BlueUsbA2 Bluetooth module; built-in EHCI and a built-in 0ac8:c302 Z-Star
Microelectronics Corp. Vega USB 2.0 Camera.

usbcore.autosuspend is enabled (= 2), of course.

Recent USB suspend statistics
Active  Device name
100.0%	USB device  7-1 : BCM92045NMD (Broadcom Corp)
100.0%	USB device  1-2 : Vega USB 2.0 Camera. (Vimicro Corp.)
100.0%	USB device usb7 : UHCI Host Controller (Linux 2.6.34-rc3 uhci_hcd)
100.0%	USB device usb1 : EHCI Host Controller (Linux 2.6.34-rc3 ehci_hcd)

Booting into /bin/bash on a SMP kernel booted with "nosmp" leads to ~ 10
wakeups per second; disabling the cursor helps halfway (~ 5 wakeups); and
manually unbinding the USB host drivers from the USB host devices finally
leads to ~ 1.1 wakeups per second. What's keeping USB from suspending these
unused devices here?


4) SynPS/2 touchpad: 
Why does moving the touchpad lead to sooo many IRQs? I can't look as fast
as the mouse pointer seems to get new data:
  62,5% (473,1)       <interrupt> : PS/2 keyboard/mouse/touchpad 


5) powertop and hrtimer_start_range_ns (tick_sched_timer) on a SMP kernel
booted with "nosmp":

Wakeups-from-idle per second :  9.9     interval: 15.0s
...
  48.5% (  9.4)     <kernel core> : hrtimer_start (tick_sched_timer) 
  26.1% (  5.1)     <kernel core> : cursor_timer_handler (cursor_timer_handler) 
  20.6% (  4.0)     <kernel core> : usb_hcd_poll_rh_status (rh_timer_func) 
   1.0% (  0.2)     <kernel core> : arm_supers_timer (sync_supers_timer_fn) 
   0.7% (  0.1)       <interrupt> : ata_piix 
   ...

According to http://www.linuxpowertop.org , the count in the brackets is how
many wakeups per second were caused by one source. Adding all _except_
  48.5% (  9.4)     <kernel core> : hrtimer_start (tick_sched_timer)
up leads to the 9.9; adding also the 9.4 leads to 19.3 wakeups-from-idle per
second. However, http://www.linuxpowertop.org says:

>  "Should "Wakeups-from-idle per second" equal the sum of the
>  wakeups/second/core listed on the "Top causes for wakeups" list?
>
>  It should be higher, since there are some causes for wakeups that are nearly
>  impossible to detect by software."
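
The mismatch can be checked with plain arithmetic. A minimal sketch, using
only the five entries visible in the listing above (the elided "..." rows are
not included); sum_tenths() and listed_tenths are hypothetical names, and the
rates are stored as tenths of a wakeup per second to avoid floating-point
rounding:

```c
#include <stddef.h>

/* Sum per-source wakeup rates given in tenths of a wakeup per second. */
static int sum_tenths(const int *rates, size_t n)
{
	int total = 0;

	for (size_t i = 0; i < n; i++)
		total += rates[i];
	return total;
}

/* The five entries visible in the listing: 9.4, 5.1, 4.0, 0.2, 0.1 */
static const int listed_tenths[] = { 94, 51, 40, 2, 1 };
```

The visible entries alone sum to 18.8 wakeups per second, already well above
the reported 9.9; leaving out the tick_sched_timer row gives 9.4, and the
elided rows presumably make up the difference to 9.9.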


Best, and Happy Easter,

	Dominik

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-03 22:33 A few questions and issues with dynticks, NOHZ and powertop Dominik Brodowski
@ 2010-04-03 23:53 ` Dmitry Torokhov
  2010-04-04 10:35   ` Dominik Brodowski
  2010-04-04 10:47   ` Dominik Brodowski
  2010-04-04 15:17 ` Alan Stern
  2010-04-08 19:59 ` [RFC PATCH] nohz/sched: disable ilb on !mc_capable() Dominik Brodowski
  2 siblings, 2 replies; 30+ messages in thread
From: Dmitry Torokhov @ 2010-04-03 23:53 UTC
  To: Dominik Brodowski, linux-kernel, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Alan Stern, Arjan van de Ven

Hi Dominik,

On Sun, Apr 04, 2010 at 12:33:28AM +0200, Dominik Brodowski wrote:
> 
> 4) SynPS/2 touchpad: 
> Why does moving the touchpad lead to sooo many IRQs? I can't look as fast
> as the mouse pointer seems to get new data:
>   62,5% (473,1)       <interrupt> : PS/2 keyboard/mouse/touchpad 
> 

80 pps @ 6 bytes/packet = 480 interrupts/sec.

You can try using psmouse.rate=40 to limit it to 40 pps which should
bring it to the rate of standard PS/2 mouse at the expense of
sensitivity...
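
The arithmetic above can be written as a one-line helper. This is only a
sketch: ps2_irqs_per_sec() is a hypothetical name, and it assumes, as the
calculation above does, one interrupt per received byte.

```c
/*
 * Interrupts per second for a PS/2 device, assuming one interrupt per
 * received byte.
 */
static unsigned int ps2_irqs_per_sec(unsigned int packets_per_sec,
				     unsigned int bytes_per_packet)
{
	return packets_per_sec * bytes_per_packet;
}
```

For 80 pps and 6-byte packets this gives 480, close to the 473,1 reported by
powertop; at 40 pps the same 6-byte packets would give 240, assuming the
packet size does not change.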

-- 
Dmitry


* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-03 23:53 ` Dmitry Torokhov
@ 2010-04-04 10:35   ` Dominik Brodowski
  2010-04-05 20:54     ` Dmitry Torokhov
  2010-04-04 10:47   ` Dominik Brodowski
  1 sibling, 1 reply; 30+ messages in thread
From: Dominik Brodowski @ 2010-04-04 10:35 UTC
  To: Dmitry Torokhov, power; +Cc: linux-kernel, Arjan van de Ven

Hi Dmitry,

On Sat, Apr 03, 2010 at 04:53:26PM -0700, Dmitry Torokhov wrote:
> On Sun, Apr 04, 2010 at 12:33:28AM +0200, Dominik Brodowski wrote:
> > 
> > 4) SynPS/2 touchpad: 
> > Why does moving the touchpad lead to sooo many IRQs? I can't look as fast
> > as the mouse pointer seems to get new data:
> >   62,5% (473,1)       <interrupt> : PS/2 keyboard/mouse/touchpad 
> > 
> 
> 80 pps @ 6 bytes/packet = 480 interrupts/sec.
> 
> You can try using psmouse.rate=40 to limit it to 40 pps which should
> bring it to the rate of standard PS/2 mouse at the expense of
> sensitivity...

Excellent. Maybe this could be added to the Tips&Tricks section at the
Powertop website?

I guess obtaining all 6 bytes at once is not really possible? It
seems a new byte only appears ~1.75 ms after the last one, at least on my
notebook, so waiting for this is not an option...

Best,
	Dominik


PS: Dmitry, got something small for you in return:


[PATCH] i8042: spelling fix

Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>

diff --git a/drivers/input/serio/i8042.c b/drivers/input/serio/i8042.c
index 9302ba0..f61233e 100644
--- a/drivers/input/serio/i8042.c
+++ b/drivers/input/serio/i8042.c
@@ -38,7 +38,7 @@ MODULE_PARM_DESC(noaux, "Do not probe or use AUX (mouse) port.");
 
 static bool i8042_nomux;
 module_param_named(nomux, i8042_nomux, bool, 0);
-MODULE_PARM_DESC(nomux, "Do not check whether an active multiplexing conrtoller is present.");
+MODULE_PARM_DESC(nomux, "Do not check whether an active multiplexing controller is present.");
 
 static bool i8042_unlock;
 module_param_named(unlock, i8042_unlock, bool, 0);


* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-03 23:53 ` Dmitry Torokhov
  2010-04-04 10:35   ` Dominik Brodowski
@ 2010-04-04 10:47   ` Dominik Brodowski
  2010-04-05  3:42     ` Arjan van de Ven
  1 sibling, 1 reply; 30+ messages in thread
From: Dominik Brodowski @ 2010-04-04 10:47 UTC
  To: Dmitry Torokhov, Adam Belay, Len Brown
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Alan Stern, Arjan van de Ven

Hey,

On Sat, Apr 03, 2010 at 04:53:26PM -0700, Dmitry Torokhov wrote:
> On Sun, Apr 04, 2010 at 12:33:28AM +0200, Dominik Brodowski wrote:
> > 
> > 4) SynPS/2 touchpad: 
> > Why does moving the touchpad lead to sooo many IRQs? I can't look as fast
> > as the mouse pointer seems to get new data:
> >   62,5% (473,1)       <interrupt> : PS/2 keyboard/mouse/touchpad 
> > 
> 
> 80 pps @ 6 bytes/packet = 480 interrupts/sec.
> 
> You can try using psmouse.rate=40 to limit it to 40 pps which should
> bring it to the rate of standard PS/2 mouse at the expense of
> sensitivity...

as a sidenote: if we know -- like here -- that the next IRQ will be issued
soon, in approximately 1.75 ms (well, at least on my system), might it make
sense to make tick_nohz_get_sleep_length() smarter to know about this?

Best,
	Dominik


* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-03 22:33 A few questions and issues with dynticks, NOHZ and powertop Dominik Brodowski
  2010-04-03 23:53 ` Dmitry Torokhov
@ 2010-04-04 15:17 ` Alan Stern
  2010-04-04 16:39   ` Dominik Brodowski
  2010-04-08 19:59 ` [RFC PATCH] nohz/sched: disable ilb on !mc_capable() Dominik Brodowski
  2 siblings, 1 reply; 30+ messages in thread
From: Alan Stern @ 2010-04-04 15:17 UTC
  To: Dominik Brodowski
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Arjan van de Ven, Dmitry Torokhov

On Sun, 4 Apr 2010, Dominik Brodowski wrote:

> Booting a SMP-capable kernel with "nosmp", or manually offlining one CPU
> (or -- though I haven't tested it -- booting a SMP-capable kernel on a
> system with merely one CPU) means that up to about half of the calls to
> tick_nohz_stop_sched_tick() are aborted due to rcu_needs_cpu(). This is
> quite strange to me: AFAIK, RCU is an excellent tool for SMP, but not really
> needed for UP?

I can't answer the real question here, not knowing enough about the RCU
implementation.  However, your impression is wrong: RCU very definitely
_is_ useful and needed on UP systems.  It coordinates among processes
(and interrupt handlers) as well as among processors.


> 3) USB: built-in UHCI and a built-in 0a5c:2101 Broadcom Corp. A-Link
> BlueUsbA2 Bluetooth module; built-in EHCI and a built-in 0ac8:c302 Z-Star
> Microelectronics Corp. Vega USB 2.0 Camera.
> 
> usbcore.autosuspend is enabled (= 2), of course.
> 
> Recent USB suspend statistics
> Active  Device name
> 100.0%	USB device  7-1 : BCM92045NMD (Broadcom Corp)
> 100.0%	USB device  1-2 : Vega USB 2.0 Camera. (Vimicro Corp.)
> 100.0%	USB device usb7 : UHCI Host Controller (Linux 2.6.34-rc3 uhci_hcd)
> 100.0%	USB device usb1 : EHCI Host Controller (Linux 2.6.34-rc3 ehci_hcd)
> 
> Booting into /bin/bash on a SMP kernel booted with "nosmp" leads to ~ 10
> wakeups per second; disabling the cursor helps halfway (~ 5 wakeups); and
> manually unbinding the USB host drivers from the USB host devices finally
> leads to ~ 1.1 wakeups per second. What's keeping USB from suspending these
> unused devices here?

Either the drivers don't support autosuspend or the devices aren't
enabled for autosuspend.  By default, autosuspend is disabled for
(almost) all non-hub devices.  You or your distribution must enable
it manually by doing:

	echo auto >/sys/bus/usb/devices/.../power/level

If the driver doesn't support autosuspend then enabling it won't be
enough; you'll also have to unbind the driver from the device.  The
easiest way to do this is to unconfigure the device:

	echo 0 >/sys/bus/usb/devices/.../bConfigurationValue
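
The same writes can be done from a program. A minimal sketch, assuming
nothing beyond standard C stdio: write_attr() is a hypothetical helper, and
the caller has to supply the full attribute path, since the device part is
elided ("...") in the commands above.

```c
#include <stdio.h>

/*
 * Write a short string to a sysfs-style attribute file.
 * Returns 0 on success, -1 on failure.
 */
static int write_attr(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(value, f) == EOF) {
		fclose(f);
		return -1;
	}
	/* stdio buffers the data, so a write error may only show up here */
	return fclose(f) == 0 ? 0 : -1;
}
```

Passing "auto" for a device's power/level attribute, or "0" for its
bConfigurationValue attribute, corresponds to the two echo commands above.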

Alan Stern



* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-04 15:17 ` Alan Stern
@ 2010-04-04 16:39   ` Dominik Brodowski
  2010-04-04 20:47     ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Dominik Brodowski @ 2010-04-04 16:39 UTC
  To: Alan Stern
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Arjan van de Ven, Dmitry Torokhov

On Sun, Apr 04, 2010 at 11:17:37AM -0400, Alan Stern wrote:
> On Sun, 4 Apr 2010, Dominik Brodowski wrote:
> 
> > Booting a SMP-capable kernel with "nosmp", or manually offlining one CPU
> > (or -- though I haven't tested it -- booting a SMP-capable kernel on a
> > system with merely one CPU) means that up to about half of the calls to
> > tick_nohz_stop_sched_tick() are aborted due to rcu_needs_cpu(). This is
> > quite strange to me: AFAIK, RCU is an excellent tool for SMP, but not really
> > needed for UP?
> 
> I can't answer the real question here, not knowing enough about the RCU
> implementation.  However, your impression is wrong: RCU very definitely
> _is_ useful and needed on UP systems.  It coordinates among processes
> (and interrupt handlers) as well as among processors.

Okay, but still: can't this be sped up by much on UP (especially if
CONFIG_RCU_FAST_NO_HZ is set), so that we can go to sleep right away?

> > 3) USB: built-in UHCI and a built-in 0a5c:2101 Broadcom Corp. A-Link
> > BlueUsbA2 Bluetooth module; built-in EHCI and a built-in 0ac8:c302 Z-Star
> > Microelectronics Corp. Vega USB 2.0 Camera.
> > 
> > usbcore.autosuspend is enabled (= 2), of course.
> > 
> > Recent USB suspend statistics
> > Active  Device name
> > 100.0%	USB device  7-1 : BCM92045NMD (Broadcom Corp)
> > 100.0%	USB device  1-2 : Vega USB 2.0 Camera. (Vimicro Corp.)
> > 100.0%	USB device usb7 : UHCI Host Controller (Linux 2.6.34-rc3 uhci_hcd)
> > 100.0%	USB device usb1 : EHCI Host Controller (Linux 2.6.34-rc3 ehci_hcd)
> > 
> > Booting into /bin/bash on a SMP kernel booted with "nosmp" leads to ~ 10
> > wakeups per second; disabling the cursor helps halfway (~ 5 wakeups); and
> > manually unbinding the USB host drivers from the USB host devices finally
> > leads to ~ 1.1 wakeups per second. What's keeping USB from suspending these
> > unused devices here?
> 
> Either the drivers don't support autosuspend or the devices aren't
> enabled for autosuspend.  By default, autosuspend is disabled for
> (almost) all non-hub devices.  You or your distribution must enable
> it manually by doing:
> 
> 	echo auto >/sys/bus/usb/devices/.../power/level
> 
> If the driver doesn't support autosuspend then enabling it won't be
> enough; you'll also have to unbind the driver from the device.  The
> easiest way to do this is to unconfigure the device:
> 
> 	echo 0 >/sys/bus/usb/devices/.../bConfigurationValue

Thanks! This way, it works, even without manually unbinding the host
drivers.

Best,
	Dominik


* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-04 16:39   ` Dominik Brodowski
@ 2010-04-04 20:47     ` Paul E. McKenney
  2010-04-04 23:37       ` Paul E. McKenney
  2010-04-05 21:03       ` Dominik Brodowski
  0 siblings, 2 replies; 30+ messages in thread
From: Paul E. McKenney @ 2010-04-04 20:47 UTC
  To: Dominik Brodowski, Alan Stern, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Arjan van de Ven, Dmitry Torokhov

On Sun, Apr 04, 2010 at 06:39:24PM +0200, Dominik Brodowski wrote:
> On Sun, Apr 04, 2010 at 11:17:37AM -0400, Alan Stern wrote:
> > On Sun, 4 Apr 2010, Dominik Brodowski wrote:
> > 
> > > Booting a SMP-capable kernel with "nosmp", or manually offlining one CPU
> > > (or -- though I haven't tested it -- booting a SMP-capable kernel on a
> > > system with merely one CPU) means that up to about half of the calls to
> > > tick_nohz_stop_sched_tick() are aborted due to rcu_needs_cpu(). This is
> > > quite strange to me: AFAIK, RCU is an excellent tool for SMP, but not really
> > > needed for UP?
> > 
> > I can't answer the real question here, not knowing enough about the RCU
> > implementation.  However, your impression is wrong: RCU very definitely
> > _is_ useful and needed on UP systems.  It coordinates among processes
> > (and interrupt handlers) as well as among processors.
> 
> Okay, but still: can't this be sped up by much on UP (especially if
> CONFIG_RCU_FAST_NO_HZ is set), so that we can go to sleep right away?

One situation that will prevent CONFIG_RCU_FAST_NO_HZ from putting the
machine to sleep right away is if there is an RCU callback posted that
spawns another RCU callback, and so on.  CONFIG_RCU_FAST_NO_HZ will handle
one callback that spawns another, but it gives up if the second callback
spawns a third.

Might this be what is happening to you?

If so, would you be willing to patch your kernel?  RCU_NEEDS_CPU_FLUSHES
is currently set to 5, and might be set to (say) 8.  This is defined
in kernel/rcutree_plugin.h, near line 990.

Another thing to try would be running with TINY_RCU, at least if it is
OK that RCU be non-preemptible.

							Thanx, Paul

> > > 3) USB: built-in UHCI and a built-in 0a5c:2101 Broadcom Corp. A-Link
> > > BlueUsbA2 Bluetooth module; built-in EHCI and a built-in 0ac8:c302 Z-Star
> > > Microelectronics Corp. Vega USB 2.0 Camera.
> > > 
> > > usbcore.autosuspend is enabled (= 2), of course.
> > > 
> > > Recent USB suspend statistics
> > > Active  Device name
> > > 100.0%	USB device  7-1 : BCM92045NMD (Broadcom Corp)
> > > 100.0%	USB device  1-2 : Vega USB 2.0 Camera. (Vimicro Corp.)
> > > 100.0%	USB device usb7 : UHCI Host Controller (Linux 2.6.34-rc3 uhci_hcd)
> > > 100.0%	USB device usb1 : EHCI Host Controller (Linux 2.6.34-rc3 ehci_hcd)
> > > 
> > > Booting into /bin/bash on a SMP kernel booted with "nosmp" leads to ~ 10
> > > wakeups per second; disabling the cursor helps halfway (~ 5 wakeups); and
> > > manually unbinding the USB host drivers from the USB host devices finally
> > > leads to ~ 1.1 wakeups per second. What's keeping USB from suspending these
> > > unused devices here?
> > 
> > Either the drivers don't support autosuspend or the devices aren't
> > enabled for autosuspend.  By default, autosuspend is disabled for
> > (almost) all non-hub devices.  You or your distribution must enable
> > it manually by doing:
> > 
> > 	echo auto >/sys/bus/usb/devices/.../power/level
> > 
> > If the driver doesn't support autosuspend then enabling it won't be
> > enough; you'll also have to unbind the driver from the device.  The
> > easiest way to do this is to unconfigure the device:
> > 
> > 	echo 0 >/sys/bus/usb/devices/.../bConfigurationValue
> 
> Thanks! This way, it works, even without manually unbinding the host
> drivers.
> 
> Best,
> 	Dominik


* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-04 20:47     ` Paul E. McKenney
@ 2010-04-04 23:37       ` Paul E. McKenney
  2010-04-05  3:44         ` Arjan van de Ven
  2010-04-05 21:03       ` Dominik Brodowski
  1 sibling, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2010-04-04 23:37 UTC
  To: Dominik Brodowski, Alan Stern, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Arjan van de Ven, Dmitry Torokhov

On Sun, Apr 04, 2010 at 01:47:25PM -0700, Paul E. McKenney wrote:
> On Sun, Apr 04, 2010 at 06:39:24PM +0200, Dominik Brodowski wrote:
> > On Sun, Apr 04, 2010 at 11:17:37AM -0400, Alan Stern wrote:
> > > On Sun, 4 Apr 2010, Dominik Brodowski wrote:
> > > 
> > > > Booting a SMP-capable kernel with "nosmp", or manually offlining one CPU
> > > > (or -- though I haven't tested it -- booting a SMP-capable kernel on a
> > > > system with merely one CPU) means that up to about half of the calls to
> > > > tick_nohz_stop_sched_tick() are aborted due to rcu_needs_cpu(). This is
> > > > quite strange to me: AFAIK, RCU is an excellent tool for SMP, but not really
> > > > needed for UP?
> > > 
> > > I can't answer the real question here, not knowing enough about the RCU
> > > implementation.  However, your impression is wrong: RCU very definitely
> > > _is_ useful and needed on UP systems.  It coordinates among processes
> > > (and interrupt handlers) as well as among processors.
> > 
> > Okay, but still: can't this be sped up by much on UP (especially if
> > CONFIG_RCU_FAST_NO_HZ is set), so that we can go to sleep right away?
> 
> One situation that will prevent CONFIG_RCU_FAST_NO_HZ from putting the
> machine to sleep right away is if there is an RCU callback posted that
> spawns another RCU callback, and so on.  CONFIG_RCU_FAST_NO_HZ will handle
> one callback that spawns another, but it gives up if the second callback
> spawns a third.
> 
> Might this be what is happening to you?
> 
> If so, would you be willing to patch your kernel?  RCU_NEEDS_CPU_FLUSHES
> is currently set to 5, and might be set to (say) 8.  This is defined
> in kernel/rcutree_plugin.h, near line 990.
> 
> Another thing to try would be running with TINY_RCU, at least if it is
> OK that RCU be non-preemptible.

And you did mention offlining some CPUs above.  The following patch
(from Lai Jiangshan) is needed to handle this case.

							Thanx, Paul

------------------------------------------------------------------------

commit 6a2ae79877827355b747c0b91133a963b74ed396
Author: Lai Jiangshan <laijs@cn.fujitsu.com>
Date:   Tue Mar 30 18:40:36 2010 +0800

    rcu: ignore offline CPUs in last non-dyntick-idle CPU check
    
    Offline CPUs are not in nohz_cpu_mask, but can be ignored when checking
    for the last non-dyntick-idle CPU.  This patch therefore only checks
    online CPUs for not being dyntick idle, allowing fast entry into
    full-system dyntick-idle state even when there are some offline CPUs.
    
    Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 79b53bd..687c4e9 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -1016,7 +1016,7 @@ int rcu_needs_cpu(int cpu)
 
 	/* Don't bother unless we are the last non-dyntick-idle CPU. */
 	for_each_cpu_not(thatcpu, nohz_cpu_mask)
-		if (thatcpu != cpu) {
+		if (cpu_online(thatcpu) && thatcpu != cpu) {
 			per_cpu(rcu_dyntick_drain, cpu) = 0;
 			per_cpu(rcu_dyntick_holdoff, cpu) = jiffies - 1;
 			return rcu_needs_cpu_quick_check(cpu);
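
The effect of the added cpu_online() test can be illustrated outside the
kernel with plain bitmasks. This is a sketch under stated assumptions:
other_cpu_active() is a hypothetical name, and the two 64-bit masks are
simplified stand-ins for nohz_cpu_mask and the set of online CPUs, not the
kernel's cpumask API.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if some CPU other than 'cpu' is online and not in
 * dyntick-idle mode.  Bit n of each mask describes CPU n; nr_cpus must
 * be at most 64 in this sketch.
 */
static bool other_cpu_active(uint64_t nohz_mask, uint64_t online_mask,
			     int cpu, int nr_cpus)
{
	for (int that = 0; that < nr_cpus; that++) {
		bool idle = (nohz_mask >> that) & 1;
		bool online = (online_mask >> that) & 1;

		if (!idle && online && that != cpu)
			return true;
	}
	return false;
}
```

On a two-CPU system with CPU 1 offline, CPU 1 is absent from the nohz mask;
without the online test it would count as busy, whereas here CPU 0 is
correctly seen as the last non-dyntick-idle CPU.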


* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-04 10:47   ` Dominik Brodowski
@ 2010-04-05  3:42     ` Arjan van de Ven
  2010-04-05 20:41       ` Dominik Brodowski
  0 siblings, 1 reply; 30+ messages in thread
From: Arjan van de Ven @ 2010-04-05  3:42 UTC
  To: Dominik Brodowski, Dmitry Torokhov, Adam Belay, Len Brown,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Alan Stern

On 4/4/2010 3:47, Dominik Brodowski wrote:
> Hey,
>
> On Sat, Apr 03, 2010 at 04:53:26PM -0700, Dmitry Torokhov wrote:
>> On Sun, Apr 04, 2010 at 12:33:28AM +0200, Dominik Brodowski wrote:
>>>
>>> 4) SynPS/2 touchpad:
>>> Why does moving the touchpad lead to sooo many IRQs? I can't look as fast
>>> as the mouse pointer seems to get new data:
>>>    62,5% (473,1)<interrupt>  : PS/2 keyboard/mouse/touchpad
>>>
>>
>> 80 pps @ 6 bytes/packet = 480 interrupts/sec.
>>
>> You can try using psmouse.rate=40 to limit it to 40 pps which should
>> bring it to the rate of standard PS/2 mouse at the expense of
>> sensitivity...
>
> as a sidenote: if we know -- like here -- that the next IRQ will be issued
> soon, in approximately 1.75 ms (well, at least on my system), might it make
> sense to make tick_nohz_get_sleep_length() smarter to know about this?

yes and no.

if you are very sure (95%+ or so) then absolutely it needs to know about this
so that the C state selection code can make a better decision.
Right now it tries to look at history to guess this delay.

Unfortunately we do not currently have such a concept in the code to make this
work... but it'd be really nice to have.
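
What such a concept might look like, purely as an illustration: the sketch
below is hypothetical and does not correspond to any existing kernel
interface. expected_idle_us(), its parameters and the fixed 95% threshold
are assumptions taken from the wording above.

```c
#include <stdint.h>

/*
 * Pick the expected idle duration in microseconds.  The device hint is
 * used only when it is both trusted (confidence at or above the
 * threshold) and earlier than the next timer.
 */
static uint32_t expected_idle_us(uint32_t next_timer_us, uint32_t irq_hint_us,
				 unsigned int confidence_pct)
{
	const unsigned int threshold_pct = 95;	/* "95%+ or so" */

	if (confidence_pct >= threshold_pct && irq_hint_us < next_timer_us)
		return irq_hint_us;
	return next_timer_us;
}
```

With a next timer 10 ms away and a trusted hint of 1750 us (the inter-byte
gap mentioned earlier in the thread), the result is 1750; with a low
confidence value the timer distance is returned unchanged.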


* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-04 23:37       ` Paul E. McKenney
@ 2010-04-05  3:44         ` Arjan van de Ven
  2010-04-05  4:22           ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Arjan van de Ven @ 2010-04-05  3:44 UTC
  To: paulmck
  Cc: Dominik Brodowski, Alan Stern, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Dmitry Torokhov

> And you did mention offlining some CPUs above.  The following patch
> (from Lai Jiangshan) is needed to handle this case.

btw on x86... don't offline CPUs if you want to save power.. it doesn't.
(at least not during idle.. and when you're busy it might save power, but it won't save you energy normally)


* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05  3:44         ` Arjan van de Ven
@ 2010-04-05  4:22           ` Paul E. McKenney
  2010-04-05 14:40             ` Arjan van de Ven
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2010-04-05  4:22 UTC
  To: Arjan van de Ven
  Cc: Dominik Brodowski, Alan Stern, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Dmitry Torokhov

On Sun, Apr 04, 2010 at 08:44:05PM -0700, Arjan van de Ven wrote:
> >And you did mention offlining some CPUs above.  The following patch
> >(from Lai Jiangshan) is needed to handle this case.
> 
> btw on x86... don't offline CPUs if you want to save power.. it doesn't.
> (at least not during idle.. and when you're busy it might save power,
> but it won't save you energy normally)

Hmmm...  The fact that offlining CPUs doesn't save power could form
the basis of an interesting rationalization for my having ignored
offlined CPUs in my original patch, I suppose.  ;-)

So the proper approach is to affinity everything away from the CPUs
in question so that they stay in dyntick-idle mode?  I must confess
that I find this quite counter-intuitive -- and I suspect that I am
not the only one who would expect offlined CPUs to drop to the
lowest possible power consumption.

							Thanx, Paul


* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05  4:22           ` Paul E. McKenney
@ 2010-04-05 14:40             ` Arjan van de Ven
  2010-04-05 15:14               ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Arjan van de Ven @ 2010-04-05 14:40 UTC
  To: paulmck
  Cc: Dominik Brodowski, Alan Stern, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Dmitry Torokhov

On 4/4/2010 21:22, Paul E. McKenney wrote:
> On Sun, Apr 04, 2010 at 08:44:05PM -0700, Arjan van de Ven wrote:
>>> And you did mention offlining some CPUs above.  The following patch
>>> (from Lai Jiangshan) is needed to handle this case.
>>
>> btw on x86... don't offline CPUs if you want to save power.. it doesn't.
>> (at least not during idle.. and when you're busy it might save power,
>> but it won't save you energy normally)
>
> Hmmm...  The fact that offlining CPUs doesn't save power could form
> the basis of an interesting rationalization for my having ignored
> offlined CPUs in my original patch, I suppose.  ;-)
>
> So the proper approach is to affinity everything away from the CPUs
> in question so that they stay in dyntick-idle mode?

that is actually equivalently bad ;)

>  I must confess
> that I find this quite counter-intuitive -- and I suspect that I am
> not the only one who would expect offlined CPUs to drop to the
> lowest possible power consumption.

on x86 (other archs might be different), nowadays idle is VERY efficient.
(and the way to offline a cpu is to put it into the same deep C state as idle would)

the "offline is same power as idle" is only the part where you don't win energy.
the part where you lose energy is the part where you realize that you can only really put
the memory controller and memory in power saving state if all cpus are in idle.
If you offline a cpu (versus leaving it idle), and you have several tasks to run during an
activity burst (which are more common the more we group activity to save power), it can happen
that there are so many things to do that all remaining cpus can't handle it without the scheduler
delaying some tasks. At which point you delay putting the memory (controllers) in low power state
and you lose. In general, the winning strategy seems to be to finish things as quickly as you can,
eg on as many cpus as you have in the system to then let them all go idle.

(this is not always true if you're picking cpu frequency/voltages, but for idle states it tends
to be true. there's always cases where it isn't.... power is like that)


* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05 14:40             ` Arjan van de Ven
@ 2010-04-05 15:14               ` Paul E. McKenney
  2010-04-05 16:07                 ` Arjan van de Ven
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2010-04-05 15:14 UTC
  To: Arjan van de Ven
  Cc: Dominik Brodowski, Alan Stern, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Dmitry Torokhov

On Mon, Apr 05, 2010 at 07:40:23AM -0700, Arjan van de Ven wrote:
> On 4/4/2010 21:22, Paul E. McKenney wrote:
> >On Sun, Apr 04, 2010 at 08:44:05PM -0700, Arjan van de Ven wrote:
> >>>And you did mention offlining some CPUs above.  The following patch
> >>>(from Lai Jiangshan) is needed to handle this case.
> >>
> >>btw on x86... don't offline CPUs if you want to save power.. it doesn't.
> >>(at least not during idle.. and when you're busy it might save power,
> >>but it won't save you energy normally)
> >
> >Hmmm...  The fact that offlining CPUs doesn't save power could form
> >the basis of an interesting rationalization for my having ignored
> >offlined CPUs in my original patch, I suppose.  ;-)
> >
> >So the proper approach is to affinity everything away from the CPUs
> >in question so that they stay in dyntick-idle mode?
> 
> that is actually equivalently bad ;)
> 
> > I must confess
> >that I find this quite counter-intuitive -- and I suspect that I am
> >not the only one who would expect offlined CPUs to drop to the
> >lowest possible power consumption.
> 
> on x86 (other archs might be different), nowadays idle is VERY efficient.
> (and the way to offline a cpu is to put it into the same deep C state as idle would)
> 
> the "offline is same power as idle" is only the part where you don't win energy.
> the part where you lose energy is the part where you realize that you can only really put
> the memory controller and memory in power saving state if all cpus are in idle.
> If you offline a cpu (versus leaving it idle), and you have several tasks to run during an
> activity burst (which are more common the more we group activity to save power), it can happen
> that there are so many things to do that all remaining cpus can't handle it without the scheduler
> delaying some tasks. At which point you delay putting the memory (controllers) in low power state
> and you lose. In general, the winning strategy seems to be to finish things as quickly as you can,
> eg on as many cpus as you have in the system to then let them all go idle.
> 
> (this is not always true if you're picking cpu frequency/voltages, but for idle states it tends
> to be true. there's always cases where it isn't.... power is like that)

So the main issue is that for many workloads, it is best to run full bore
and get done quickly, thus allowing the entire machine to be powered down?

If so, it seems likely that there would be some workloads that were sometimes
unable to use all the CPUs, in which case shutting down (idling, offlining,
dyntick-idling, whatever) the excess CPUs might nevertheless be the right
thing to do.

Or am I missing your point?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05 15:14               ` Paul E. McKenney
@ 2010-04-05 16:07                 ` Arjan van de Ven
  2010-04-05 16:22                   ` Paul E. McKenney
  2010-04-05 18:44                   ` david
  0 siblings, 2 replies; 30+ messages in thread
From: Arjan van de Ven @ 2010-04-05 16:07 UTC (permalink / raw)
  To: paulmck
  Cc: Dominik Brodowski, Alan Stern, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Dmitry Torokhov

On 4/5/2010 8:14, Paul E. McKenney wrote:
> So the main issue is that for many workloads, it is best to run full bore
> and get done quickly, thus allowing the entire machine to be powered down?

yep

>
> If so, it seems likely that there would be some workloads that were sometimes
> unable to use all the CPUs, in which case shutting down (idling, offlining,
> dyntick-idling, whatever) the excess CPUs might nevertheless be the right
> thing to do.

but the point is that the normal scheduler + idle behavior gives you exactly that
in a natural way !
If you don't have enough work (tasks) to keep all cores busy, the others are and stay idle.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05 16:07                 ` Arjan van de Ven
@ 2010-04-05 16:22                   ` Paul E. McKenney
  2010-04-05 16:23                     ` Arjan van de Ven
  2010-04-05 18:44                   ` david
  1 sibling, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2010-04-05 16:22 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Dominik Brodowski, Alan Stern, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Dmitry Torokhov

On Mon, Apr 05, 2010 at 09:07:33AM -0700, Arjan van de Ven wrote:
> On 4/5/2010 8:14, Paul E. McKenney wrote:
> >So the main issue is that for many workloads, it is best to run full bore
> >and get done quickly, thus allowing the entire machine to be powered down?
> 
> yep
> 
> >If so, it seems likely that there would be some workloads that were sometimes
> >unable to use all the CPUs, in which case shutting down (idling, offlining,
> >dyntick-idling, whatever) the excess CPUs might nevertheless be the right
> >thing to do.
> 
> but the point is that the normal scheduler + idle behavior gives you exactly that
> in a natural way !
> If you don't have enough work (tasks) to keep all cores busy, the others are and stay idle.

So your earlier objection was not to dyntick-idle as such, but rather
to artificially constraining the scheduler to induce dyntick-idle?

						Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05 16:22                   ` Paul E. McKenney
@ 2010-04-05 16:23                     ` Arjan van de Ven
  2010-04-05 16:40                       ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Arjan van de Ven @ 2010-04-05 16:23 UTC (permalink / raw)
  To: paulmck
  Cc: Dominik Brodowski, Alan Stern, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Dmitry Torokhov

On 4/5/2010 9:22, Paul E. McKenney wrote:
> On Mon, Apr 05, 2010 at 09:07:33AM -0700, Arjan van de Ven wrote:
>> On 4/5/2010 8:14, Paul E. McKenney wrote:
>>> So the main issue is that for many workloads, it is best to run full bore
>>> and get done quickly, thus allowing the entire machine to be powered down?
>>
>> yep
>>
>>> If so, it seems likely that there would be some workloads that were sometimes
>>> unable to use all the CPUs, in which case shutting down (idling, offlining,
>>> dyntick-idling, whatever) the excess CPUs might nevertheless be the right
>>> thing to do.
>>
>> but the point is that the normal scheduler + idle behavior gives you exactly that
>> in a natural way !
>> If you don't have enough work (tasks) to keep all cores busy, the others are and stay idle.
>
> So your earlier objection was not to dyntick-idle as such, but rather
> to artificially constraining the scheduler to induce dyntick-idle?

my objection was against the notion that offlining cpus helps power/energy ;-)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05 16:23                     ` Arjan van de Ven
@ 2010-04-05 16:40                       ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2010-04-05 16:40 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Dominik Brodowski, Alan Stern, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Dmitry Torokhov

On Mon, Apr 05, 2010 at 09:23:16AM -0700, Arjan van de Ven wrote:
> On 4/5/2010 9:22, Paul E. McKenney wrote:
> >On Mon, Apr 05, 2010 at 09:07:33AM -0700, Arjan van de Ven wrote:
> >>On 4/5/2010 8:14, Paul E. McKenney wrote:
> >>>So the main issue is that for many workloads, it is best to run full bore
> >>>and get done quickly, thus allowing the entire machine to be powered down?
> >>
> >>yep
> >>
> >>>If so, it seems likely that there would be some workloads that were sometimes
> >>>unable to use all the CPUs, in which case shutting down (idling, offlining,
> >>>dyntick-idling, whatever) the excess CPUs might nevertheless be the right
> >>>thing to do.
> >>
> >>but the point is that the normal scheduler + idle behavior gives you exactly that
> >>in a natural way !
> >>If you don't have enough work (tasks) to keep all cores busy, the others are and stay idle.
> >
> >So your earlier objection was not to dyntick-idle as such, but rather
> >to artificially constraining the scheduler to induce dyntick-idle?
> 
> my objection was against the notion that offlining cpus helps power/energy ;-)

Fair enough, at least in general.  I should hasten to add that Lai's
patch also helps in the case where NR_CPUS is greater than the number
of CPUs on the system.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05 16:07                 ` Arjan van de Ven
  2010-04-05 16:22                   ` Paul E. McKenney
@ 2010-04-05 18:44                   ` david
  2010-04-05 19:48                     ` Arjan van de Ven
  1 sibling, 1 reply; 30+ messages in thread
From: david @ 2010-04-05 18:44 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: paulmck, Dominik Brodowski, Alan Stern, linux-kernel,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Dmitry Torokhov

On Mon, 5 Apr 2010, Arjan van de Ven wrote:

> On 4/5/2010 8:14, Paul E. McKenney wrote:
>> So the main issue is that for many workloads, it is best to run full bore
>> and get done quickly, thus allowing the entire machine to be powered down?
>
> yep

Race To Idle works extremely well in a batch type situation where there is 
not going to be any work to do after you finish what you have.

It doesn't work quite as well if you are going to have new work to do in 
the near future.

You cannot power down the entire machine if you have to look for user 
input.

It takes time (and power) to shut down and start back up. If you are going
to have more work to do before you can make the complete cycle (and save
more power than it costs to make the transitions), it's best to stay at
full power, even if you are idle.


As an example, video/audio playback.

This requires relatively little cpu, but it needs it frequently (to keep
the hardware buffers filled), and you cannot power down even when the cpu 
is idle.

But you could save power by disabling cores, switching to a slower clock 
speed, etc while still having one core remaining awake all the time.


The key is to look at what you are waiting for. If you are just waiting 
for the processing to finish, race to idle is great. However if you are 
waiting for the outside world or for a clock tick you need to look into 
the exact situation more closely.
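
The break-even condition described above can be written down as a small
calculation. This is only a sketch: the function names are made up, and
all power and energy figures are invented for illustration; real values
depend on the hardware.

```python
def break_even_idle_s(transition_energy_j, shallow_power_w, deep_power_w):
    """Idle duration above which entering the deeper state saves energy.

    Staying shallow for t seconds costs shallow_power_w * t; going deep is
    modelled as costing transition_energy_j + deep_power_w * t.
    """
    if shallow_power_w <= deep_power_w:
        raise ValueError("deep state must draw less power than shallow state")
    return transition_energy_j / (shallow_power_w - deep_power_w)

def deep_state_pays_off(expected_idle_s, transition_energy_j,
                        shallow_power_w, deep_power_w):
    """True if the expected idle period is longer than the break-even time."""
    return expected_idle_s > break_even_idle_s(
        transition_energy_j, shallow_power_w, deep_power_w)

# Hypothetical figures: 2 mJ round-trip cost, 1.5 W shallow, 0.5 W deep.
threshold = break_even_idle_s(0.002, 1.5, 0.5)
long_idle = deep_state_pays_off(0.010, 0.002, 1.5, 0.5)   # 10 ms of idle
short_idle = deep_state_pays_off(0.001, 0.002, 1.5, 0.5)  # 1 ms of idle
```

With these assumed numbers the threshold is 2 ms, so a 10 ms idle period
is worth the transition and a 1 ms one is not.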

David Lang

>> If so, it seems likely that there would be some workloads that were 
>> sometimes
>> unable to use all the CPUs, in which case shutting down (idling, offlining,
>> dyntick-idling, whatever) the excess CPUs might nevertheless be the right
>> thing to do.
>
> but the point is that the normal scheduler + idle behavior gives you exactly 
> that
> in a natural way !
> If you don't have enough work (tasks) to keep all cores busy, the others are 
> and stay idle.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05 18:44                   ` david
@ 2010-04-05 19:48                     ` Arjan van de Ven
  2010-04-05 20:34                       ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Arjan van de Ven @ 2010-04-05 19:48 UTC (permalink / raw)
  To: david
  Cc: paulmck, Dominik Brodowski, Alan Stern, linux-kernel,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Dmitry Torokhov

On 4/5/2010 11:44, david@lang.hm wrote:
> On Mon, 5 Apr 2010, Arjan van de Ven wrote:
>
>> On 4/5/2010 8:14, Paul E. McKenney wrote:
>>> So the main issue is that for many workloads, it is best to run full
>>> bore
>>> and get done quickly, thus allowing the entire machine to be powered
>>> down?
>>
>> yep
>
> Race To Idle works extremely well in a batch type situation where there
> is not going to be any work to do after you finish what you have.
>
> It doesn't work quite as well if you are going to have new work to do in
> the near future.
>
> You cannot power down the entire machine if you have to look for user
> input.
>
> It takes time (and power) to shut down and start back up, if you are
> going to have more work to do before you can make the complete cycle
> (and save more power than it costs to make the transitions), it's best
> to stay at full power, even if you are idle.

for the things we're talking about here (memory controllers etc) we're talking
about single to low double digit microseconds (or even less) of time to go up and down.
Many of the things you talk about are in the millisecond timeframe.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05 19:48                     ` Arjan van de Ven
@ 2010-04-05 20:34                       ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2010-04-05 20:34 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: david, Dominik Brodowski, Alan Stern, linux-kernel,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Dmitry Torokhov

On Mon, Apr 05, 2010 at 12:48:59PM -0700, Arjan van de Ven wrote:
> On 4/5/2010 11:44, david@lang.hm wrote:
> >On Mon, 5 Apr 2010, Arjan van de Ven wrote:
> >
> >>On 4/5/2010 8:14, Paul E. McKenney wrote:
> >>>So the main issue is that for many workloads, it is best to run full
> >>>bore
> >>>and get done quickly, thus allowing the entire machine to be powered
> >>>down?
> >>
> >>yep
> >
> >Race To Idle works extremely well in a batch type situation where there
> >is not going to be any work to do after you finish what you have.
> >
> >It doesn't work quite as well if you are going to have new work to do in
> >the near future.
> >
> >You cannot power down the entire machine if you have to look for user
> >input.
> >
> >It takes time (and power) to shut down and start back up, if you are
> >going to have more work to do before you can make the complete cycle
> >(and save more power than it costs to make the transitions), it's best
> >to stay at full power, even if you are idle.
> 
> for the things we're talking about here (memory controllers etc) we're talking
> about single to low double digit microseconds (or even less) of time to go up and down.
> Many of the things you talk about are in the millisecond timeframe.

So the decision will depend not only on the workload itself, but also
on the hardware that the workload uses.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05  3:42     ` Arjan van de Ven
@ 2010-04-05 20:41       ` Dominik Brodowski
  2010-04-05 20:52         ` Dmitry Torokhov
  0 siblings, 1 reply; 30+ messages in thread
From: Dominik Brodowski @ 2010-04-05 20:41 UTC (permalink / raw)
  To: Arjan van de Ven, Dmitry Torokhov
  Cc: Adam Belay, Len Brown, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Alan Stern

On Sun, Apr 04, 2010 at 08:42:43PM -0700, Arjan van de Ven wrote:
> On 4/4/2010 3:47, Dominik Brodowski wrote:
> >Hey,
> >
> >On Sat, Apr 03, 2010 at 04:53:26PM -0700, Dmitry Torokhov wrote:
> >>On Sun, Apr 04, 2010 at 12:33:28AM +0200, Dominik Brodowski wrote:
> >>>
> >>>4) SynPS/2 touchpad:
> >>>Why does moving the touchpad lead to sooo many IRQs? I can't look as fast
> >>>as the mouse pointer seems to get new data:
> >>>   62,5% (473,1)<interrupt>  : PS/2 keyboard/mouse/touchpad
> >>>
> >>
> >>80 pps @ 6 bytes/packet = 480 interrupts/sec.
> >>
> >>You can try using psmouse.rate=40 to limit it to 40 pps which should
> >>bring it to the rate of standard PS/2 mouse at the expense of
> >>sensitivity...
> >
> >as a sidenote: if we know -- like here -- that the next IRQ will be issued
> >soon, in approximately 1.75 ms (well, at least on my system), might it make
> >sense to make tick_nohz_get_sleep_length() smarter to know about this?
> 
> yes and no.
> 
> if you are very sure (95%+ or so) then absolutely it needs to know about this
> so that the C state selection code can make a better decision.
> Right now it tries to look at history to guess this delay.
> 
> Unfortunately we do not currently have such a concept in the code to make this
> work... but it'd be really nice to have.

Dmitry, are we "very sure" in this touchpad case?

Best,
	Dominik


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05 20:41       ` Dominik Brodowski
@ 2010-04-05 20:52         ` Dmitry Torokhov
  0 siblings, 0 replies; 30+ messages in thread
From: Dmitry Torokhov @ 2010-04-05 20:52 UTC (permalink / raw)
  To: Dominik Brodowski, Arjan van de Ven, Adam Belay, Len Brown,
	linux-kernel, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Alan Stern

On Mon, Apr 05, 2010 at 10:41:43PM +0200, Dominik Brodowski wrote:
> On Sun, Apr 04, 2010 at 08:42:43PM -0700, Arjan van de Ven wrote:
> > On 4/4/2010 3:47, Dominik Brodowski wrote:
> > >Hey,
> > >
> > >On Sat, Apr 03, 2010 at 04:53:26PM -0700, Dmitry Torokhov wrote:
> > >>On Sun, Apr 04, 2010 at 12:33:28AM +0200, Dominik Brodowski wrote:
> > >>>
> > >>>4) SynPS/2 touchpad:
> > >>>Why does moving the touchpad lead to sooo many IRQs? I can't look as fast
> > >>>as the mouse pointer seems to get new data:
> > >>>   62,5% (473,1)<interrupt>  : PS/2 keyboard/mouse/touchpad
> > >>>
> > >>
> > >>80 pps @ 6 bytes/packet = 480 interrupts/sec.
> > >>
> > >>You can try using psmouse.rate=40 to limit it to 40 pps which should
> > >>bring it to the rate of standard PS/2 mouse at the expense of
> > >>sensitivity...
> > >
> > >as a sidenote: if we know -- like here -- that the next IRQ will be issued
> > >soon, in approximately 1.75 ms (well, at least on my system), might it make
> > >sense to make tick_nohz_get_sleep_length() smarter to know about this?
> > 
> > yes and no.
> > 
> > if you are very sure (95%+ or so) then absolutely it needs to know about this
> > so that the C state selection code can make a better decision.
> > Right now it tries to look at history to guess this delay.
> > 
> > Unfortunately we do not currently have such a concept in the code to make this
> > work... but it'd be really nice to have.
> 
> Dmitry, are we "very sure" in this touchpad case?
> 

Psmouse driver tries to not rely on any timing data really... But yes,
we do expect the next interrupt to arrive "shortly" and I guess the
driver could do some data gathering to collect average time between
interrupts. The question is - is it worth it?
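
One way the "data gathering" mentioned above might look is a running
average of the time between interrupts. This is a user-space sketch, not
psmouse code; the class name, the smoothing factor and the sample
timestamps are all assumptions made for illustration.

```python
class IntervalEstimator:
    """Exponentially weighted moving average of inter-interrupt gaps."""

    def __init__(self, alpha=0.25):
        self.alpha = alpha          # weight given to the newest sample
        self.last_ts_us = None      # timestamp of the previous interrupt
        self.avg_us = None          # estimate; None until two interrupts seen

    def on_interrupt(self, ts_us):
        """Record an interrupt timestamp and return the current estimate."""
        if self.last_ts_us is not None:
            gap = ts_us - self.last_ts_us
            if self.avg_us is None:
                self.avg_us = float(gap)
            else:
                self.avg_us += self.alpha * (gap - self.avg_us)
        self.last_ts_us = ts_us
        return self.avg_us

est = IntervalEstimator()
for ts in (0, 1750, 3500, 5250):   # hypothetical timestamps, 1.75 ms apart
    est.on_interrupt(ts)
print(est.avg_us)  # 1750.0
```

Whether such an estimate would be reliable enough to feed idle-state
selection is exactly the open question.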

-- 
Dmitry

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-04 10:35   ` Dominik Brodowski
@ 2010-04-05 20:54     ` Dmitry Torokhov
  0 siblings, 0 replies; 30+ messages in thread
From: Dmitry Torokhov @ 2010-04-05 20:54 UTC (permalink / raw)
  To: Dominik Brodowski, power, linux-kernel, Arjan van de Ven

On Sun, Apr 04, 2010 at 12:35:15PM +0200, Dominik Brodowski wrote:
> Hi Dmitry,
> 
> On Sat, Apr 03, 2010 at 04:53:26PM -0700, Dmitry Torokhov wrote:
> > On Sun, Apr 04, 2010 at 12:33:28AM +0200, Dominik Brodowski wrote:
> > > 
> > > 4) SynPS/2 touchpad: 
> > > Why does moving the touchpad lead to sooo many IRQs? I can't look as fast
> > > as the mouse pointer seems to get new data:
> > >   62,5% (473,1)       <interrupt> : PS/2 keyboard/mouse/touchpad 
> > > 
> > 
> > 80 pps @ 6 bytes/packet = 480 interrupts/sec.
> > 
> > You can try using psmouse.rate=40 to limit it to 40 pps which should
> > bring it to the rate of standard PS/2 mouse at the expense of
> > sensitivity...
> 
> Excellent. Maybe this could be added to the Tips&Tricks section at the
> Powertop website?
> 

Do not see the point - the touchpad does not generate interrupts while
you are not touching it and while you are touching it the machine is not
going to sleep. The only reason I see for lowering the rate is if your
keyboard controller cannot handle it (some Toshibas couldn't).

> I guess obtaining all 6 bytes at once is not really possible?

No. i8042 is byte-oriented.
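
Because each byte arrives with its own interrupt, the interrupt rate is
simply packets per second times bytes per packet. A quick check of the
figures quoted above (the helper name is made up for this example):

```python
def interrupts_per_second(packets_per_second, bytes_per_packet):
    # One interrupt per byte, since the controller hands over one byte
    # at a time.
    return packets_per_second * bytes_per_packet

default_rate = interrupts_per_second(80, 6)   # 480
reduced_rate = interrupts_per_second(40, 6)   # 240, i.e. psmouse.rate=40
```

The 480 per second figure is close to the 473,1 interrupts per second
that powertop reported.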

> It
> seems a new byte only appears ~1.75 ms after the last one, at least on my
> notebook, so waiting for this is not an option...
> 
> Best,
> 	Dominik
> 
> 
> PS: Dmitry, got something small for you in return:

Thanks, will apply.
> 
> 
> [PATCH] i8042: spelling fix
> 
> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
> 
> diff --git a/drivers/input/serio/i8042.c b/drivers/input/serio/i8042.c
> index 9302ba0..f61233e 100644
> --- a/drivers/input/serio/i8042.c
> +++ b/drivers/input/serio/i8042.c
> @@ -38,7 +38,7 @@ MODULE_PARM_DESC(noaux, "Do not probe or use AUX (mouse) port.");
>  
>  static bool i8042_nomux;
>  module_param_named(nomux, i8042_nomux, bool, 0);
> -MODULE_PARM_DESC(nomux, "Do not check whether an active multiplexing conrtoller is present.");
> +MODULE_PARM_DESC(nomux, "Do not check whether an active multiplexing controller is present.");
>  
>  static bool i8042_unlock;
>  module_param_named(unlock, i8042_unlock, bool, 0);

-- 
Dmitry

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-04 20:47     ` Paul E. McKenney
  2010-04-04 23:37       ` Paul E. McKenney
@ 2010-04-05 21:03       ` Dominik Brodowski
  2010-04-05 21:38         ` Paul E. McKenney
  1 sibling, 1 reply; 30+ messages in thread
From: Dominik Brodowski @ 2010-04-05 21:03 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alan Stern, linux-kernel, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Arjan van de Ven, Dmitry Torokhov

Paul,

I really appreciate your reply -- thanks! I've done some more testing in
the meantime:

On Sun, Apr 04, 2010 at 01:47:25PM -0700, Paul E. McKenney wrote:
> On Sun, Apr 04, 2010 at 06:39:24PM +0200, Dominik Brodowski wrote:
> > On Sun, Apr 04, 2010 at 11:17:37AM -0400, Alan Stern wrote:
> > > On Sun, 4 Apr 2010, Dominik Brodowski wrote:
> > > 
> > > > Booting a SMP-capable kernel with "nosmp", or manually offlining one CPU
> > > > (or -- though I haven't tested it -- booting a SMP-capable kernel on a
> > > > system with merely one CPU) means that in up to about half of the calls to
> > > > tick_nohz_stop_sched_tick() are aborted due to rcu_needs_cpu(). This is
> > > > quite strange to me: AFAIK, RCU is an excellent tool for SMP, but not really
> > > > needed for UP?
> > > 
> > > I can't answer the real question here, not knowing enough about the RCU
> > > implementation.  However, your impression is wrong: RCU very definitely
> > > _is_ useful and needed on UP systems.  It coordinates among processes
> > > (and interrupt handlers) as well as among processors.
> > 
> > Okay, but still: can't this be sped up by much on UP (especially if
> > CONFIG_RCU_FAST_NO_HZ is set), so that we can go to sleep right away?
> 
> One situation that will prevent CONFIG_RCU_FAST_NO_HZ from putting the
> machine to sleep right away is if there is an RCU callback posted that
> spawns another RCU callback, and so on.  CONFIG_RCU_FAST_NO_HZ will handle
> one callback that spawns another, but it gives up if the second callback
> spawns a third.

Will the remaining callbacks be executed immediately afterwards (due to a
need_resched() etc.), or only after the next tick?

> Might this be what is happening to you?
> 
> If so, would you be willing to patch your kernel?  RCU_NEEDS_CPU_FLUSHES
> is currently set to 5, and might be set to (say) 8.  This is defined
> in kernel/rcutree_plugin.h, near line 990.

Applied the patch by Lai Jiangshan, and tested 5 and 8:

5:	  Wakeups-from-idle: 33.4		(hrtimer_sched_timer: 78 %)
		34% of calls to tick_nohz_stop_sched_tick fail due to
			rcu_needs_cpu()
8:	  Wakeups-from-idle: 36.5		(hrtimer_sched_timer: 83 %)
		37% of calls to tick_nohz_stop_sched_tick fail due to
			rcu_needs_cpu()

> Another thing to try would be running with TINY_RCU, at least if it is
> OK that RCU be non-preemptible.

tick_nohz_stop_sched_tick() doesn't fail in this case because of
rcu_needs_cpu(). However, the improvements are hardly recognizable:

TINY_RCU: Wakeups-from-idle: 33.9		(hrtimer_sched_timer: 53 %)

> And you did mention offlining some CPUs above. 

... just for testing how NOHZ works on UP systems ;)

Best,
	Dominik

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05 21:03       ` Dominik Brodowski
@ 2010-04-05 21:38         ` Paul E. McKenney
  2010-04-05 22:11           ` Dominik Brodowski
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2010-04-05 21:38 UTC (permalink / raw)
  To: Dominik Brodowski, Alan Stern, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Arjan van de Ven, Dmitry Torokhov

On Mon, Apr 05, 2010 at 11:03:40PM +0200, Dominik Brodowski wrote:
> Paul,
> 
> I really appreciate your reply -- thanks! I've done some more testing in
> the meantime:
> 
> On Sun, Apr 04, 2010 at 01:47:25PM -0700, Paul E. McKenney wrote:
> > On Sun, Apr 04, 2010 at 06:39:24PM +0200, Dominik Brodowski wrote:
> > > On Sun, Apr 04, 2010 at 11:17:37AM -0400, Alan Stern wrote:
> > > > On Sun, 4 Apr 2010, Dominik Brodowski wrote:
> > > > 
> > > > > Booting a SMP-capable kernel with "nosmp", or manually offlining one CPU
> > > > > (or -- though I haven't tested it -- booting a SMP-capable kernel on a
> > > > > system with merely one CPU) means that in up to about half of the calls to
> > > > > tick_nohz_stop_sched_tick() are aborted due to rcu_needs_cpu(). This is
> > > > > quite strange to me: AFAIK, RCU is an excellent tool for SMP, but not really
> > > > > needed for UP?
> > > > 
> > > > I can't answer the real question here, not knowing enough about the RCU
> > > > implementation.  However, your impression is wrong: RCU very definitely
> > > > _is_ useful and needed on UP systems.  It coordinates among processes
> > > > (and interrupt handlers) as well as among processors.
> > > 
> > > Okay, but still: can't this be sped up by much on UP (especially if
> > > CONFIG_RCU_FAST_NO_HZ is set), so that we can go to sleep right away?
> > 
> > One situation that will prevent CONFIG_RCU_FAST_NO_HZ from putting the
> > machine to sleep right away is if there is an RCU callback posted that
> > spawns another RCU callback, and so on.  CONFIG_RCU_FAST_NO_HZ will handle
> > one callback that spawns another, but it gives up if the second callback
> > spawns a third.
> 
> Will the remaining callbacks be executed immediately afterwards (due to a
> need_resched() etc.), or only after the next tick?

Only after the next tick.  To see why, imagine an RCU callback that
re-registers itself -- which is a perfectly legal thing to do.  The
only thing that will happen if we run through grace periods faster is
that we will have more invocations of that same callback to deal with.

So we try for a bit, and if that doesn't get rid of all of the callbacks,
we hold off until the next jiffy.
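
A small simulation of why accelerating grace periods cannot help with a
self-re-registering callback. It is a toy model, not kernel code: the
queue, the attempt limit and the function names are assumptions chosen
for illustration.

```python
from collections import deque

def try_to_drain(callbacks, max_attempts):
    """Run queued callbacks for up to max_attempts passes.

    Each callback may return a new callback to be queued (re-registration).
    Returns (drained, passes, invocations).
    """
    queue = deque(callbacks)
    invocations = 0
    for attempt in range(1, max_attempts + 1):
        next_queue = deque()
        while queue:
            cb = queue.popleft()
            invocations += 1
            follow_up = cb()
            if follow_up is not None:
                next_queue.append(follow_up)
        queue = next_queue
        if not queue:
            return True, attempt, invocations
    return False, max_attempts, invocations

def one_shot():
    return None

def self_rearming():
    return self_rearming   # re-registers itself every time it runs

print(try_to_drain([one_shot], max_attempts=5))        # (True, 1, 1)
print(try_to_drain([self_rearming], max_attempts=5))   # (False, 5, 5)
```

Raising the limit in this model only buys more invocations of the same
callback; the queue never empties.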

> > Might this be what is happening to you?
> > 
> > If so, would you be willing to patch your kernel?  RCU_NEEDS_CPU_FLUSHES
> > is currently set to 5, and might be set to (say) 8.  This is defined
> > in kernel/rcutree_plugin.h, near line 990.
> 
> Applied the patch by Lai Jiangshan, and tested 5 and 8:
> 
> 5:	  Wakeups-from-idle: 33.4		(hrtimer_sched_timer: 78 %)
> 		34% of calls to tick_nohz_stop_sched_tick fail due to
> 			rcu_needs_cpu()
> 8:	  Wakeups-from-idle: 36.5		(hrtimer_sched_timer: 83 %)
> 		37% of calls to tick_nohz_stop_sched_tick fail due to
> 			rcu_needs_cpu()

I don't recall your posting wakeups-from-idle for the original -- did
we get improvement?  You did say "roughly 50%", but...

OK, I see what is happening...

What happens in the CONFIG_RCU_FAST_NO_HZ case is as follows:

o	Check to see if the holdoff period is in effect, and if so,
	just check to see if RCU needs the CPU for later processing
	without attempting to accelerate grace periods.

o	Check to see if there is some other non-dyntick-idle CPU.
	If there is, reset holdoff state and just check to see if
	RCU needs the CPU for later processing without attempting to
	accelerate grace periods.

o	Check for initialization and hitting the RCU_NEEDS_CPU_FLUSHES
	limit, again doing the "just check" thing if we hit the limit.

o	For each of RCU-sched and RCU-bh, note a quiescent state
	and force the grace-period machinery, noting in each case
	whether or not there are callbacks left to invoke.

o	If there are callbacks left to invoke, raise RCU_SOFTIRQ.
	This softirq will process the callbacks.  (Why not just invoke
	the softirq function directly?	Because lockdep yells at you
	and I do not believe that this is a false positive.)

o	If there are callbacks left to invoke, tell the caller that
	this CPU cannot yet enter dyntick-idle state.

But if we told the caller that this CPU cannot yet enter dyntick-idle
state, then we also raised RCU_SOFTIRQ.  Once the softirq returns, we
should once again try to enter dyntick-idle state.

So a significant fraction of calls to rcu_needs_cpu() saying "no" does
not necessarily mean that we are taking significant time to get the
grace periods and callbacks out of the way.  The funny loop involving
softirq is required due to locking-design issues.
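
The sequence of checks described above can be condensed into a sketch.
This is an illustrative model in Python, not the code in
kernel/rcutree_plugin.h: the state dictionary, its keys and the return
convention are assumptions, and the grace-period machinery is reduced to
a count of pending callbacks.

```python
FLUSH_LIMIT = 5   # stands in for RCU_NEEDS_CPU_FLUSHES

def needs_cpu(state):
    """Return True if this CPU may not yet enter dyntick-idle.

    `state` is a plain dict standing in for per-CPU RCU state:
      holdoff        - True while we have given up until the next jiffy
      other_cpu_busy - True if some other CPU is not dyntick-idle
      flushes        - accelerated attempts made so far
      pending        - number of callbacks still waiting
      softirq_raised - set when the softirq is asked to run callbacks
    """
    # Holdoff in effect: just report, do not accelerate.
    if state["holdoff"]:
        return state["pending"] > 0
    # Another CPU is awake: reset the attempt counter and just report.
    if state["other_cpu_busy"]:
        state["flushes"] = 0
        return state["pending"] > 0
    # Too many attempts: give up until the next jiffy.
    state["flushes"] += 1
    if state["flushes"] > FLUSH_LIMIT:
        state["holdoff"] = True
        return state["pending"] > 0
    # Forcing the grace period is not modelled; if callbacks remain,
    # ask the softirq to run them and refuse dyntick-idle for now.
    if state["pending"] > 0:
        state["softirq_raised"] = True
        return True
    return False

cpu = {"holdoff": False, "other_cpu_busy": False, "flushes": 0,
       "pending": 1, "softirq_raised": False}
first = needs_cpu(cpu)    # True: callbacks pending, softirq raised
cpu["pending"] = 0        # pretend the softirq ran the callbacks
second = needs_cpu(cpu)   # False: nothing left, the tick can stop
```

The second call corresponds to the retry after the softirq returns: a
refusal is followed shortly by an acceptance, so a high refusal count on
its own does not imply long delays.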

Or are you seeing significant delays between successive calls to
rcu_needs_cpu() on your setup?

> > Another thing to try would be running with TINY_RCU, at least if it is
> > OK that RCU be non-preemptible.
> 
> tick_nohz_stop_sched_tick() doesn't fail in this case because of
> rcu_needs_cpu(). However, the improvements are hardly recognizable:
> 
> TINY_RCU: Wakeups-from-idle: 33.9		(hrtimer_sched_timer: 53 %)

TINY_RCU is set up to automatically do CONFIG_RCU_FAST_NO_HZ, and do
the same softirq dance, or that is the theory, anyway.  Again, are you
seeing significant delays between successive calls to rcu_needs_cpu()?

> > And you did mention offlining some CPUs above. 
> 
> ... just for testing how NOHZ works on UP systems ;)

;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05 21:38         ` Paul E. McKenney
@ 2010-04-05 22:11           ` Dominik Brodowski
  2010-04-05 22:31             ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Dominik Brodowski @ 2010-04-05 22:11 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alan Stern, linux-kernel, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Arjan van de Ven, Dmitry Torokhov

On Mon, Apr 05, 2010 at 02:38:52PM -0700, Paul E. McKenney wrote:
> On Mon, Apr 05, 2010 at 11:03:40PM +0200, Dominik Brodowski wrote:
> > Paul,
> > 
> > I really appreciate your reply -- thanks! I've done some more testing in
> > the meantime:
> > 
> > On Sun, Apr 04, 2010 at 01:47:25PM -0700, Paul E. McKenney wrote:
> > > On Sun, Apr 04, 2010 at 06:39:24PM +0200, Dominik Brodowski wrote:
> > > > On Sun, Apr 04, 2010 at 11:17:37AM -0400, Alan Stern wrote:
> > > > > On Sun, 4 Apr 2010, Dominik Brodowski wrote:
> > > > > 
> > > > > > Booting a SMP-capable kernel with "nosmp", or manually offlining one CPU
> > > > > > (or -- though I haven't tested it -- booting a SMP-capable kernel on a
> > > > > > system with merely one CPU) means that in up to about half of the calls to
> > > > > > tick_nohz_stop_sched_tick() are aborted due to rcu_needs_cpu(). This is
> > > > > > quite strange to me: AFAIK, RCU is an excellent tool for SMP, but not really
> > > > > > needed for UP?
> > > > > 
> > > > > I can't answer the real question here, not knowing enough about the RCU
> > > > > implementation.  However, your impression is wrong: RCU very definitely
> > > > > _is_ useful and needed on UP systems.  It coordinates among processes
> > > > > (and interrupt handlers) as well as among processors.
> > > > 
> > > > Okay, but still: can't this be sped up by much on UP (especially if
> > > > CONFIG_RCU_FAST_NO_HZ is set), so that we can go to sleep right away?
> > > 
> > > One situation that will prevent CONFIG_RCU_FAST_NO_HZ from putting the
> > > machine to sleep right away is if there is an RCU callback posted that
> > > spawns another RCU callback, and so on.  CONFIG_RCU_FAST_NO_HZ will handle
> > > one callback that spawns another, but it gives up if the second callback
> > > spawns a third.
> > 
> > Will the remaining callbacks be executed immediately afterwards (due to a
> > need_resched() etc.), or only after the next tick?
> 
> Only after the next tick.  To see why, imagine an RCU callback that
> re-registers itself -- which is a perfectly legal thing to do.  The
> only thing that will happen if we run through grace periods faster is
> that we will have more invocations of that same callback to deal with.
> 
> So we try for a bit, and if that doesn't get rid of all of the callbacks,
> we hold off until the next jiffy.
> 
> > > Might this be what is happening to you?
> > > 
> > > If so, would you be willing to patch your kernel?  RCU_NEEDS_CPU_FLUSHES
> > > is currently set to 5, and might be set to (say) 8.  This is defined
> > > in kernel/rcutree_plugin.h, near line 990.
> > 
> > Applied the patch by Lai Jiangshan, and tested 5 and 8:
> > 
> > 5:	  Wakeups-from-idle: 33.4		(hrtimer_sched_timer: 78 %)
> > 		34% of calls to tick_nohz_stop_sched_tick fail due to
> > 			rcu_needs_cpu()
> > 8:	  Wakeups-from-idle: 36.5		(hrtimer_sched_timer: 83 %)
> > 		37% of calls to tick_nohz_stop_sched_tick fail due to
> > 			rcu_needs_cpu()
> 
> I don't recall your posting wakeups-from-idle for the original -- did
> we get improvement?  You did say "roughly 50%", but...

Actually, no. I'd say the 5-to-8 change has no significant effect at all;
for the patch by Lai Jiangshan, I'd need to re-run the test.

> OK, I see what is happening...
> 
> What happens in the CONFIG_RCU_FAST_NO_HZ case is as follows:
> 
> o	Check to see if the holdoff period is in effect, and if so,
> 	just check to see if RCU needs the CPU for later processing
> 	without attempting to accelerate grace periods.
> 
> o	Check to see if there is some other non-dyntick-idle CPU.
> 	If there is, reset holdoff state and just check to see if
> 	RCU needs the CPU for later processing without attempting to
> 	accelerate grace periods.
> 
> o	Check for initialization and hitting the RCU_NEEDS_CPU_FLUSHES
> 	limit, again doing the "just check" thing if we hit the limit.
> 
> o	For each of RCU-sched and RCU-bh, note a quiescent state
> 	and force the grace-period machinery, noting in each case
> 	whether or not there are callbacks left to invoke.
> 
> o	If there are callbacks left to invoke, raise RCU_SOFTIRQ.
> 	This softirq will process the callbacks.  (Why not just invoke
> 	the softirq function directly?	Because lockdep yells at you
> 	and I do not believe that this is a false positive.)
> 
> o	If there are callbacks left to invoke, tell the caller that
> 	this CPU cannot yet enter dyntick-idle state.
> 
> But if we told the caller that this CPU cannot yet enter dyntick-idle
> state, then we also raised RCU_SOFTIRQ.  Once the softirq returns, we
> should once again try to enter dyntick-idle state.
> 
> So a significant fraction of calls to rcu_needs_cpu() saying "no" does
> not necessarily mean that we are taking significant time to get the
> grace periods and callbacks out of the way.  The funny loop involving
> softirq is required due to locking-design issues.
> 
> Or are you seeing significant delays between successive calls to
> rcu_needs_cpu() on your setup?

Will check this, but all the data I'm seeing points to rcu_needs_cpu() not
leading to additional wakeups. It might just be wrong reports by powertop,
after all, for the UP case. Quoting my original mail:

> 5) powertop and hrtimer_start_range_ns (tick_sched_timer) on a SMP kernel
> booted with "nosmp":
> 
> Wakeups-from-idle per second :  9.9     interval: 15.0s
> ...
>   48.5% (  9.4)     <kernel core> : hrtimer_start (tick_sched_timer)
>   26.1% (  5.1)     <kernel core> : cursor_timer_handler (cursor_timer_handle
>   20.6% (  4.0)     <kernel core> : usb_hcd_poll_rh_status (rh_timer_func)
>    1.0% (  0.2)     <kernel core> : arm_supers_timer (sync_supers_timer_fn)
>    0.7% (  0.1)       <interrupt> : ata_piix
>    ...
> 
> According to http://www.linuxpowertop.org , the count in the brackets is how
> many wakeups per second were caused by one source. Adding up all _except_
>   48.5% (  9.4)     <kernel core> : hrtimer_start (tick_sched_timer)
> leads to the 9.9.
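
The mismatch can be checked with a bit of arithmetic. A minimal sketch,
assuming that the percentage column is relative to the sum of all
per-source counts (my reading of the output, not documented behaviour);
implied_total() and total_without() are hypothetical helpers:

```c
#include <assert.h>

/* Total wakeups/s implied by a single row "pct% ( count)". */
static double implied_total(double pct, double count)
{
	assert(pct > 0.0);
	return count * 100.0 / pct;
}

/* What is left of that total once the given row is taken out. */
static double total_without(double pct, double count)
{
	return implied_total(pct, count) - count;
}
```

For the tick_sched_timer row (48.5 %, 9.4), implied_total() gives about
19.4 and total_without() about 10.0, which is close to the 9.9 in the
header line. So the header appears to leave that row out.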

Back to your mail:

> > tick_nohz_stop_sched_tick() doesn't fail in this case because of
> > rcu_needs_cpu(). However, the improvements are hardly recognizable:
> > 
> > TINY_RCU: Wakeups-from-idle: 33.9		(hrtimer_sched_timer: 53 %)
> 
> TINY_RCU is set up to automatically do CONFIG_RCU_FAST_NO_HZ, and do
> the same softirq dance, or that is the theory, anyway.  Again, are you
> seeing significant delays between successive calls to rcu_needs_cpu()?

Actually, rcu_needs_cpu() is statically defined to return 0 on TINY_RCU in
include/linux/rcutiny.h .

Best,
	Dominik


* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05 22:11           ` Dominik Brodowski
@ 2010-04-05 22:31             ` Paul E. McKenney
  2010-04-06 20:45               ` Dominik Brodowski
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2010-04-05 22:31 UTC (permalink / raw)
  To: Dominik Brodowski, Alan Stern, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Arjan van de Ven, Dmitry Torokhov

On Tue, Apr 06, 2010 at 12:11:23AM +0200, Dominik Brodowski wrote:
> On Mon, Apr 05, 2010 at 02:38:52PM -0700, Paul E. McKenney wrote:
> > On Mon, Apr 05, 2010 at 11:03:40PM +0200, Dominik Brodowski wrote:
> > > Paul,
> > > 
> > > I really appreciate your reply -- thanks! I've done some more testing in
> > > the meantime:
> > > 
> > > On Sun, Apr 04, 2010 at 01:47:25PM -0700, Paul E. McKenney wrote:
> > > > On Sun, Apr 04, 2010 at 06:39:24PM +0200, Dominik Brodowski wrote:
> > > > > On Sun, Apr 04, 2010 at 11:17:37AM -0400, Alan Stern wrote:
> > > > > > On Sun, 4 Apr 2010, Dominik Brodowski wrote:
> > > > > > 
> > > > > > > Booting a SMP-capable kernel with "nosmp", or manually offlining one CPU
> > > > > > > (or -- though I haven't tested it -- booting a SMP-capable kernel on a
> > > > > > > system with merely one CPU) means that in up to about half of the calls to
> > > > > > > tick_nohz_stop_sched_tick() are aborted due to rcu_needs_cpu(). This is
> > > > > > > quite strange to me: AFAIK, RCU is an excellent tool for SMP, but not really
> > > > > > > needed for UP?
> > > > > > 
> > > > > > I can't answer the real question here, not knowing enough about the RCU
> > > > > > implementation.  However, your impression is wrong: RCU very definitely
> > > > > > _is_ useful and needed on UP systems.  It coordinates among processes
> > > > > > (and interrupt handlers) as well as among processors.
> > > > > 
> > > > > Okay, but still: can't this be sped up by much on UP (especially if
> > > > > CONFIG_RCU_FAST_NO_HZ is set), so that we can go to sleep right away?
> > > > 
> > > > One situation that will prevent CONFIG_RCU_FAST_NO_HZ from putting the
> > > > machine to sleep right away is if there is an RCU callback posted that
> > > > spawns another RCU callback, and so on.  CONFIG_RCU_FAST_NO_HZ will handle
> > > > one callback that spawns another, but it gives up if the second callback
> > > > spawns a third.
> > > 
> > > Will the remaining callbacks be executed immediately afterwards (due to a
> > > need_resched() etc.), or only after the next tick?
> > 
> > Only after the next tick.  To see why, imagine an RCU callback that
> > re-registers itself -- which is a perfectly legal thing to do.  The
> > only thing that will happen if we run through grace periods faster is
> > that we will have more invocations of that same callback to deal with.
> > 
> > So we try for a bit, and if that doesn't get rid of all of the callbacks,
> > we hold off until the next jiffy.
> > 
> > > > Might this be what is happening to you?
> > > > 
> > > > If so, would you be willing to patch your kernel?  RCU_NEEDS_CPU_FLUSHES
> > > > is currently set to 5, and might be set to (say) 8.  This is defined
> > > > in kernel/rcutree_plugin.h, near line 990.
> > > 
> > > Applied the patch by Lai Jiangshan, and tested 5 and 8:
> > > 
> > > 5:	  Wakeups-from-idle: 33.4		(hrtimer_sched_timer: 78 %)
> > > 		34% of calls to tick_nohz_stop_sched_tick fail due to
> > > 			rcu_needs_cpu()
> > > 8:	  Wakeups-from-idle: 36.5		(hrtimer_sched_timer: 83 %)
> > > 		37% of calls to tick_nohz_stop_sched_tick fail due to
> > > 			rcu_needs_cpu()
> > 
> > I don't recall your posting wakeups-from-idle for the original -- did
> > we get improvement?  You did say "roughly 50%", but...
> 
> Actually, no. I'd say the 5-to-8 change has no significant effect at all;
> for the Patch by Lai Jiangshan, I'd need to re-run the test.
> 
> > OK, I see what is happening...
> > 
> > What happens in the CONFIG_RCU_FAST_NO_HZ case is as follows:
> > 
> > o	Check to see if the holdoff period is in effect, and if so,
> > 	just check to see if RCU needs the CPU for later processing
> > 	without attempting to accelerate grace periods.
> > 
> > o	Check to see if there is some other non-dyntick-idle CPU.
> > 	If there is, reset holdoff state and just check to see if
> > 	RCU needs the CPU for later processing without attempting to
> > 	accelerate grace periods.
> > 
> > o	Check for initialization and hitting the RCU_NEEDS_CPU_FLUSHES
> > 	limit, again doing the "just check" thing if we hit the limit.
> > 
> > o	For each of RCU-sched and RCU-bh, note a quiescent state
> > 	and force the grace-period machinery, noting in each case
> > 	whether or not there are callbacks left to invoke.
> > 
> > o	If there are callbacks left to invoke, raise RCU_SOFTIRQ.
> > 	This softirq will process the callbacks.  (Why not just invoke
> > 	the softirq function directly?	Because lockdep yells at you
> > 	and I do not believe that this is a false positive.)
> > 
> > o	If there are callbacks left to invoke, tell the caller that
> > 	this CPU cannot yet enter dyntick-idle state.
> > 
> > But if we told the caller that this CPU cannot yet enter dyntick-idle
> > state, then we also raised RCU_SOFTIRQ.  Once the softirq returns, we
> > should once again try to enter dyntick-idle state.
> > 
> > So a significant fraction of calls to rcu_needs_cpu() saying "no" does
> > not necessarily mean that we are taking significant time to get the
> > grace periods and callbacks out of the way.  The funny loop involving
> > softirq is required due to locking-design issues.
> > 
> > Or are you seeing significant delays between successive calls to
> > rcu_needs_cpu() on your setup?
> 
> Will check this, but all the data I'm seeing points to rcu_needs_cpu() not
> leading to additional wakeups. It might just be wrong reports by powertop,
> after all, for the UP case.

OK, for all I know, powertop might need some adjustment to allow for
the presence of CONFIG_RCU_FAST_NO_HZ.

>                             Quoting my original mail:
> 
> > 5) powertop and hrtimer_start_range_ns (tick_sched_timer) on a SMP kernel
> > booted with "nosmp":
> > 
> > Wakeups-from-idle per second :  9.9     interval: 15.0s
> > ...
> >   48.5% (  9.4)     <kernel core> : hrtimer_start (tick_sched_timer)
> >   26.1% (  5.1)     <kernel core> : cursor_timer_handler (cursor_timer_handle
> >   20.6% (  4.0)     <kernel core> : usb_hcd_poll_rh_status (rh_timer_func)
> >    1.0% (  0.2)     <kernel core> : arm_supers_timer (sync_supers_timer_fn)
> >    0.7% (  0.1)       <interrupt> : ata_piix
> >    ...
> > 
> > According to http://www.linuxpowertop.org , the count in the brackets is how
> > many wakeups per second were caused by one source. Adding up all _except_
> >   48.5% (  9.4)     <kernel core> : hrtimer_start (tick_sched_timer)
> > leads to the 9.9.

OK, so you further instrumented the hrtimer_sched_timer (or was it
tick_sched_timer?) to find the number that you were attributing to
rcu_needs_cpu()?

> Back to your mail:
> 
> > > tick_nohz_stop_sched_tick() doesn't fail in this case because of
> > > rcu_needs_cpu(). However, the improvements are hardly recognizable:
> > > 
> > > TINY_RCU: Wakeups-from-idle: 33.9		(hrtimer_sched_timer: 53 %)
> > 
> > TINY_RCU is set up to automatically do CONFIG_RCU_FAST_NO_HZ, and do
> > the same softirq dance, or that is the theory, anyway.  Again, are you
> > seeing significant delays between successive calls to rcu_needs_cpu()?
> 
> Actually, rcu_needs_cpu() is statically defined to return 0 on TINY_RCU in
> include/linux/rcutiny.h .

Exactly!  ;-)

							Thanx, Paul


* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-05 22:31             ` Paul E. McKenney
@ 2010-04-06 20:45               ` Dominik Brodowski
  2010-04-06 20:59                 ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Dominik Brodowski @ 2010-04-06 20:45 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alan Stern, linux-kernel, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Arjan van de Ven, Dmitry Torokhov

On Mon, Apr 05, 2010 at 03:31:30PM -0700, Paul E. McKenney wrote:
> > > 5) powertop and hrtimer_start_range_ns (tick_sched_timer) on a SMP kernel
> > > booted with "nosmp":
> > > 
> > > Wakeups-from-idle per second :  9.9     interval: 15.0s
> > > ...
> > >   48.5% (  9.4)     <kernel core> : hrtimer_start (tick_sched_timer)
> > >   26.1% (  5.1)     <kernel core> : cursor_timer_handler (cursor_timer_handle
> > >   20.6% (  4.0)     <kernel core> : usb_hcd_poll_rh_status (rh_timer_func)
> > >    1.0% (  0.2)     <kernel core> : arm_supers_timer (sync_supers_timer_fn)
> > >    0.7% (  0.1)       <interrupt> : ata_piix
> > >    ...
> > > 
> > > According to http://www.linuxpowertop.org , the count in the brackets is how
> > > many wakeups per second were caused by one source. Adding up all _except_
> > >   48.5% (  9.4)     <kernel core> : hrtimer_start (tick_sched_timer)
> > > leads to the 9.9.
> 
> OK, so you further instrumented the hrtimer_sched_timer (or was it
> tick_sched_timer?) to find the number that you were attributing to
> rcu_needs_cpu()?

Yes: I instrumented tick_nohz_stop_sched_tick() to see why the
tick_sched_timer did not get stopped. Or why powertop thinks it did not
get stopped...

Patch below (works only for 1 or 2 CPUs) for all interested parties, 
_NOT_ intended for submission, though ;)

0x0001: CPU #1
0x0002: inidle
0x0004: not inidle
0x0008: INACTIVE
0x0010: need_resched
0x0020: softirq
0x0040: rcu_needs_
0x0080: printk_needs_
0x0100: arch_needs_
0x0200: off-by-one 
0x0400: skip
0x0800: loadbalance:
0x1000: cancel timer (SUCCESS)
0x2000: timer was already in the past
0x4000: hrtimer_start() (SUCCESS)
0x8000: stop_tick (SUCCESS)
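
For reading the resulting KERN_DEBUG lines, a small user-space decoder
along these lines may help. It is only a sketch: flag_names[] and
decode_debugdata() are hypothetical and not part of the patch, and bit 0
is taken to be the CPU number, as in the table above (which is why this
only works for 1 or 2 CPUs).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Bit meanings copied from the table above; the struct is illustrative. */
struct flag_name {
	unsigned long long bit;
	const char *name;
};

static const struct flag_name flag_names[] = {
	{ 0x0002, "inidle" },
	{ 0x0004, "not inidle" },
	{ 0x0008, "INACTIVE" },
	{ 0x0010, "need_resched" },
	{ 0x0020, "softirq" },
	{ 0x0040, "rcu_needs_" },
	{ 0x0080, "printk_needs_" },
	{ 0x0100, "arch_needs_" },
	{ 0x0200, "off-by-one" },
	{ 0x0400, "skip" },
	{ 0x0800, "loadbalance" },
	{ 0x1000, "cancel timer" },
	{ 0x2000, "timer in the past" },
	{ 0x4000, "hrtimer_start" },
	{ 0x8000, "stop_tick" },
};

/* Hypothetical helper: write a readable form of v into buf. */
static void decode_debugdata(unsigned long long v, char *buf, size_t len)
{
	size_t i;

	assert(buf != NULL && len > 0);
	/* Bit 0 carries the CPU number (0 or 1). */
	snprintf(buf, len, "cpu%llu", v & 0x1);
	for (i = 0; i < sizeof(flag_names) / sizeof(flag_names[0]); i++) {
		if (v & flag_names[i].bit) {
			strncat(buf, " ", len - strlen(buf) - 1);
			strncat(buf, flag_names[i].name, len - strlen(buf) - 1);
		}
	}
}
```

With a sufficiently large buffer, decode_debugdata(0x242, ...) yields
"cpu0 inidle rcu_needs_ off-by-one", i.e. a call from the idle loop on
CPU 0 where the tick was kept because of rcu_needs_cpu().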

After fixing USB autosuspend and disabling the fbcon cursor, I'm down to
~0.9 wakeups per second when booting with "nosmp".

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f992762..6ba7bae 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -210,19 +210,35 @@ void tick_nohz_stop_sched_tick(int inidle)
 	struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
 	u64 time_delta;
 	int cpu;
+	static int testrun_counter = 0;
+	int testrun = 0;
+	u64 debugdata = 0;
+
+	testrun_counter++;
+
+	if ((testrun_counter % 123) == 0)
+		testrun = 1;
 
 	local_irq_save(flags);
 
 	cpu = smp_processor_id();
 	ts = &per_cpu(tick_cpu_sched, cpu);
 
+	debugdata = cpu;
+
+	if (inidle)
+		debugdata |= 0x2;
+
+
 	/*
 	 * Call to tick_nohz_start_idle stops the last_update_time from being
 	 * updated. Thus, it must not be called in the event we are called from
 	 * irq_exit() with the prior state different than idle.
 	 */
-	if (!inidle && !ts->inidle)
+	if (!inidle && !ts->inidle) {
+		debugdata |= 0x4;
 		goto end;
+	}
 
 	/*
 	 * Set ts->inidle unconditionally. Even if the system did not
@@ -245,11 +261,15 @@ void tick_nohz_stop_sched_tick(int inidle)
 			tick_do_timer_cpu = TICK_DO_TIMER_NONE;
 	}
 
-	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
+	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) {
+		debugdata |= 0x8;		
 		goto end;
+	}
 
-	if (need_resched())
+	if (need_resched()) {
+		debugdata |= 0x10;
 		goto end;
+	}
 
 	if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
 		static int ratelimit;
@@ -259,6 +279,7 @@ void tick_nohz_stop_sched_tick(int inidle)
 			       (unsigned int) local_softirq_pending());
 			ratelimit++;
 		}
+		debugdata |= 0x20;
 		goto end;
 	}
 
@@ -275,6 +296,12 @@ void tick_nohz_stop_sched_tick(int inidle)
 	    arch_needs_cpu(cpu)) {
 		next_jiffies = last_jiffies + 1;
 		delta_jiffies = 1;
+		if (rcu_needs_cpu(cpu))
+			debugdata |= 0x40;
+		if (printk_needs_cpu(cpu))
+			debugdata |= 0x80;
+		if (arch_needs_cpu(cpu))
+			debugdata |= 0x100;
 	} else {
 		/* Get the next timer wheel timer */
 		next_jiffies = get_next_timer_interrupt(last_jiffies);
@@ -284,8 +311,10 @@ void tick_nohz_stop_sched_tick(int inidle)
 	 * Do not stop the tick, if we are only one off
 	 * or if the cpu is required for rcu
 	 */
-	if (!ts->tick_stopped && delta_jiffies == 1)
+	if (!ts->tick_stopped && delta_jiffies == 1) {
+		debugdata |= 0x200;
 		goto out;
+	}
 
 	/* Schedule the tick, if we are at least one jiffie off */
 	if ((long)delta_jiffies >= 1) {
@@ -341,8 +370,10 @@ void tick_nohz_stop_sched_tick(int inidle)
 			cpumask_set_cpu(cpu, nohz_cpu_mask);
 
 		/* Skip reprogram of event if its not changed */
-		if (ts->tick_stopped && ktime_equal(expires, dev->next_event))
+		if (ts->tick_stopped && ktime_equal(expires, dev->next_event)) {
+			debugdata |= 0x400;
 			goto out;
+		}
 
 		/*
 		 * nohz_stop_sched_tick can be called several times before
@@ -357,8 +388,10 @@ void tick_nohz_stop_sched_tick(int inidle)
 				 * sched tick not stopped!
 				 */
 				cpumask_clear_cpu(cpu, nohz_cpu_mask);
+				debugdata |= 0x800;
 				goto out;
 			}
+			debugdata |= 0x8000;
 
 			ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
@@ -375,7 +408,8 @@ void tick_nohz_stop_sched_tick(int inidle)
 		 * If the expiration time == KTIME_MAX, then
 		 * in this case we simply stop the tick timer.
 		 */
-		 if (unlikely(expires.tv64 == KTIME_MAX)) {
+		if (unlikely(expires.tv64 == KTIME_MAX)) {
+			debugdata |= 0x1000;
 			if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
 				hrtimer_cancel(&ts->sched_timer);
 			goto out;
@@ -384,9 +418,13 @@ void tick_nohz_stop_sched_tick(int inidle)
 		if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
 			hrtimer_start(&ts->sched_timer, expires,
 				      HRTIMER_MODE_ABS_PINNED);
+			debugdata |= 0x2000;
 			/* Check, if the timer was already in the past */
-			if (hrtimer_active(&ts->sched_timer))
+			if (hrtimer_active(&ts->sched_timer)) {
+				debugdata &= ~(0x2000);
+				debugdata |= 0x4000;
 				goto out;
+			}
 		} else if (!tick_program_event(expires, 0))
 				goto out;
 		/*
@@ -403,6 +441,8 @@ out:
 	ts->last_jiffies = last_jiffies;
 	ts->sleep_length = ktime_sub(dev->next_event, now);
 end:
+	if (testrun)
+		printk(KERN_DEBUG "0x%0llx\n", debugdata);
 	local_irq_restore(flags);
 }
 


* Re: A few questions and issues with dynticks, NOHZ and powertop
  2010-04-06 20:45               ` Dominik Brodowski
@ 2010-04-06 20:59                 ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2010-04-06 20:59 UTC (permalink / raw)
  To: Dominik Brodowski, Alan Stern, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Arjan van de Ven, Dmitry Torokhov

On Tue, Apr 06, 2010 at 10:45:37PM +0200, Dominik Brodowski wrote:
> On Mon, Apr 05, 2010 at 03:31:30PM -0700, Paul E. McKenney wrote:
> > > > 5) powertop and hrtimer_start_range_ns (tick_sched_timer) on a SMP kernel
> > > > booted with "nosmp":
> > > > 
> > > > Wakeups-from-idle per second :  9.9     interval: 15.0s
> > > > ...
> > > >   48.5% (  9.4)     <kernel core> : hrtimer_start (tick_sched_timer)
> > > >   26.1% (  5.1)     <kernel core> : cursor_timer_handler (cursor_timer_handle
> > > >   20.6% (  4.0)     <kernel core> : usb_hcd_poll_rh_status (rh_timer_func)
> > > >    1.0% (  0.2)     <kernel core> : arm_supers_timer (sync_supers_timer_fn)
> > > >    0.7% (  0.1)       <interrupt> : ata_piix
> > > >    ...
> > > > 
> > > > According to http://www.linuxpowertop.org , the count in the brackets is how
> > > > many wakeups per second were caused by one source. Adding up all _except_
> > > >   48.5% (  9.4)     <kernel core> : hrtimer_start (tick_sched_timer)
> > > > leads to the 9.9.
> > 
> > OK, so you further instrumented the hrtimer_sched_timer (or was it
> > tick_sched_timer?) to find the number that you were attributing to
> > rcu_needs_cpu()?
> 
> That's what I did -- to tick_nohz_stop_sched_tick(), to see why the
> tick_sched_timer did not get stopped. Or why powertop thinks it did not
> get stopped...
> 
> Patch below (works only for 1 or 2 CPUs) for all interested parties, 
> _NOT_ intended for submission, though ;)

Cool!

If you ever do want to submit such a patch, I should give you multiple
non-zero return values from rcu_needs_cpu() to allow distinguishing between
the following:

1.	RCU needs the CPU, but is trying to get done with it quickly.

2.	RCU needs the CPU, and has given up on trying to get done with
	it quickly.

Unless you tell me otherwise, I will assume that you do -not- need this.
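
For illustration only, such a set of return values might look roughly
like the sketch below. Everything in it is hypothetical: the enum, its
constants and rcu_cpu_need_model() do not exist in the kernel, and tying
"has given up" to the flush limit is an assumption based on the
description earlier in this thread.

```c
#include <assert.h>

/* Hypothetical return codes; today only 0 versus non-zero is promised. */
enum rcu_cpu_need {
	RCU_CPU_NOT_NEEDED = 0,		/* CPU may enter dyntick-idle */
	RCU_CPU_NEEDED_BRIEFLY,		/* 1. still trying to finish quickly */
	RCU_CPU_NEEDED_HELD_OFF,	/* 2. gave up on finishing quickly */
};

/*
 * Hypothetical model: callbacks_pending and flushes_tried stand in for
 * RCU's internal state, flush_limit for RCU_NEEDS_CPU_FLUSHES.
 */
static enum rcu_cpu_need rcu_cpu_need_model(int callbacks_pending,
					    int flushes_tried,
					    int flush_limit)
{
	assert(flush_limit > 0);
	if (!callbacks_pending)
		return RCU_CPU_NOT_NEEDED;
	if (flushes_tried < flush_limit)
		return RCU_CPU_NEEDED_BRIEFLY;
	return RCU_CPU_NEEDED_HELD_OFF;
}
```

Both "needed" values are non-zero, so a caller that only tests the result
for truth would behave as before, while instrumentation could tell the
two cases apart.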

							Thanx, Paul

> 0x0001: CPU #1
> 0x0002: inidle
> 0x0004: not inidle
> 0x0008: INACTIVE
> 0x0010: need_resched
> 0x0020: softirq
> 0x0040: rcu_needs_
> 0x0080: printk_needs_
> 0x0100: arch_needs_
> 0x0200: off-by-one 
> 0x0400: skip
> 0x0800: loadbalance:
> 0x1000: cancel timer (SUCCESS)
> 0x2000: timer was already in the past
> 0x4000: hrtimer_start() (SUCCESS)
> 0x8000: stop_tick (SUCCESS)
> 
> Fixing USB-autosuspend and disabling the fbcon cursor, I'm down to
> ~0.9 wakeups-per-second booting with "nosmp".
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index f992762..6ba7bae 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -210,19 +210,35 @@ void tick_nohz_stop_sched_tick(int inidle)
>  	struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
>  	u64 time_delta;
>  	int cpu;
> +	static int testrun_counter = 0;
> +	int testrun = 0;
> +	u64 debugdata = 0;
> +
> +	testrun_counter++;
> +
> +	if ((testrun_counter % 123) == 0)
> +		testrun = 1;
> 
>  	local_irq_save(flags);
> 
>  	cpu = smp_processor_id();
>  	ts = &per_cpu(tick_cpu_sched, cpu);
> 
> +	debugdata = cpu;
> +
> +	if (inidle)
> +		debugdata |= 0x2;
> +
> +
>  	/*
>  	 * Call to tick_nohz_start_idle stops the last_update_time from being
>  	 * updated. Thus, it must not be called in the event we are called from
>  	 * irq_exit() with the prior state different than idle.
>  	 */
> -	if (!inidle && !ts->inidle)
> +	if (!inidle && !ts->inidle) {
> +		debugdata |= 0x4;
>  		goto end;
> +	}
> 
>  	/*
>  	 * Set ts->inidle unconditionally. Even if the system did not
> @@ -245,11 +261,15 @@ void tick_nohz_stop_sched_tick(int inidle)
>  			tick_do_timer_cpu = TICK_DO_TIMER_NONE;
>  	}
> 
> -	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
> +	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) {
> +		debugdata |= 0x8;		
>  		goto end;
> +	}
> 
> -	if (need_resched())
> +	if (need_resched()) {
> +		debugdata |= 0x10;
>  		goto end;
> +	}
> 
>  	if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
>  		static int ratelimit;
> @@ -259,6 +279,7 @@ void tick_nohz_stop_sched_tick(int inidle)
>  			       (unsigned int) local_softirq_pending());
>  			ratelimit++;
>  		}
> +		debugdata |= 0x20;
>  		goto end;
>  	}
> 
> @@ -275,6 +296,12 @@ void tick_nohz_stop_sched_tick(int inidle)
>  	    arch_needs_cpu(cpu)) {
>  		next_jiffies = last_jiffies + 1;
>  		delta_jiffies = 1;
> +		if (rcu_needs_cpu(cpu))
> +			debugdata |= 0x40;
> +		if (printk_needs_cpu(cpu))
> +			debugdata |= 0x80;
> +		if (arch_needs_cpu(cpu))
> +			debugdata |= 0x100;
>  	} else {
>  		/* Get the next timer wheel timer */
>  		next_jiffies = get_next_timer_interrupt(last_jiffies);
> @@ -284,8 +311,10 @@ void tick_nohz_stop_sched_tick(int inidle)
>  	 * Do not stop the tick, if we are only one off
>  	 * or if the cpu is required for rcu
>  	 */
> -	if (!ts->tick_stopped && delta_jiffies == 1)
> +	if (!ts->tick_stopped && delta_jiffies == 1) {
> +		debugdata |= 0x200;
>  		goto out;
> +	}
> 
>  	/* Schedule the tick, if we are at least one jiffie off */
>  	if ((long)delta_jiffies >= 1) {
> @@ -341,8 +370,10 @@ void tick_nohz_stop_sched_tick(int inidle)
>  			cpumask_set_cpu(cpu, nohz_cpu_mask);
> 
>  		/* Skip reprogram of event if its not changed */
> -		if (ts->tick_stopped && ktime_equal(expires, dev->next_event))
> +		if (ts->tick_stopped && ktime_equal(expires, dev->next_event)) {
> +			debugdata |= 0x400;
>  			goto out;
> +		}
> 
>  		/*
>  		 * nohz_stop_sched_tick can be called several times before
> @@ -357,8 +388,10 @@ void tick_nohz_stop_sched_tick(int inidle)
>  				 * sched tick not stopped!
>  				 */
>  				cpumask_clear_cpu(cpu, nohz_cpu_mask);
> +				debugdata |= 0x800;
>  				goto out;
>  			}
> +			debugdata |= 0x8000;
> 
>  			ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
>  			ts->tick_stopped = 1;
> @@ -375,7 +408,8 @@ void tick_nohz_stop_sched_tick(int inidle)
>  		 * If the expiration time == KTIME_MAX, then
>  		 * in this case we simply stop the tick timer.
>  		 */
> -		 if (unlikely(expires.tv64 == KTIME_MAX)) {
> +		if (unlikely(expires.tv64 == KTIME_MAX)) {
> +			debugdata |= 0x1000;
>  			if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
>  				hrtimer_cancel(&ts->sched_timer);
>  			goto out;
> @@ -384,9 +418,13 @@ void tick_nohz_stop_sched_tick(int inidle)
>  		if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
>  			hrtimer_start(&ts->sched_timer, expires,
>  				      HRTIMER_MODE_ABS_PINNED);
> +			debugdata |= 0x2000;
>  			/* Check, if the timer was already in the past */
> -			if (hrtimer_active(&ts->sched_timer))
> +			if (hrtimer_active(&ts->sched_timer)) {
> +				debugdata &= ~(0x2000);
> +				debugdata |= 0x4000;
>  				goto out;
> +			}
>  		} else if (!tick_program_event(expires, 0))
>  				goto out;
>  		/*
> @@ -403,6 +441,8 @@ out:
>  	ts->last_jiffies = last_jiffies;
>  	ts->sleep_length = ktime_sub(dev->next_event, now);
>  end:
> +	if (testrun)
> +		printk(KERN_DEBUG "0x%0llx\n", debugdata);
>  	local_irq_restore(flags);
>  }
> 


* [RFC PATCH] nohz/sched: disable ilb on !mc_capable()
  2010-04-03 22:33 A few questions and issues with dynticks, NOHZ and powertop Dominik Brodowski
  2010-04-03 23:53 ` Dmitry Torokhov
  2010-04-04 15:17 ` Alan Stern
@ 2010-04-08 19:59 ` Dominik Brodowski
  2 siblings, 0 replies; 30+ messages in thread
From: Dominik Brodowski @ 2010-04-08 19:59 UTC (permalink / raw)
  To: linux-kernel, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Arjan van de Ven

On Sun, Apr 04, 2010 at 12:33:28AM +0200, Dominik Brodowski wrote:
>
> 2) dual-core CPU[*] and select_nohz_load_balancer()
> [*] (Intel(R) Core(TM)2 Duo CPU T7250)
> 
> # CONFIG_SCHED_SMT is not set
> CONFIG_SCHED_MC=y
> CONFIG_SCHED_HRTICK=y
> 
> CONFIG_SCHED_MC is ignored, as mc_capable() returns 0 on a one-socket,
> dual-core system. Quite surprisingly, even under moderate load (~98.0% idle)
> while writing this bugreport, up to half of the calls to
> tick_nohz_stop_sched_tick() are aborted due to select_nohz_load_balancer(1):
> 
> 		if (atomic_read(&nohz.load_balancer) == -1) {
> 			/* make me the ilb owner */
> 			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
> 				return 1;
> 
> I'm not really sure, but I guess this is caused by the following phenomenon
> under minor load but still, every once in a while, parallel work for both
> CPUs:
> 
> CPU #0					CPU #1
> 
> <active>				<active>
> <idle>					<active>
>   tick_nohz_stop_sched_tick(1)		<active>
>    select_nohz_load_balancer(1)		<active>
>     => becomes ilb owner		<idle>
>    => tick is not stopped		 tick_nohz_stop_sched_tick(1)
>   => CPU goes to sleep for 1 tick	  => as it isn't the ILB owner, tick
>   <sleep for 1 tick>			     is stopped	.
>   ---> scheduler_tick()			  <sleeeeeeeep>
>   tick_nohz_stop_sched_tick(0)
> <still idle>
>   tick_nohz_stop_sched_tick(1)
>    select_nohz_load_balancer(1)
>     => is ilb owner, all CPUs idle,
>        may go to sleep.
> 
> If both CPUs have hardly anything to do, letting the _active_ CPU do ilb
> allows us to enter deep sleep states earlier, and longer:
> 
> current ILB model (* = ILB)
> 
> 	tick ---------- tick -------- tick ----- IRQ
> CPU0:   active|IDLE(C2)--|*|IDLE (C3)             |
> CPU1:   active....| IDLE (C3)                     |
> core:   .......???| C2   |           C3           |
> 
> ILB-by-active-CPU-on-light-load:
> 
> 	tick ---------- tick -------- tick ----- IRQ
> CPU0:   active|IDLE(C3)                           |
> CPU1:   active....*| IDLE (C3)                    |
> core:   .......????|               C3             |
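
The ilb-owner election in the fragment quoted above can be modelled in
user space with C11 atomics. This is a sketch, not the scheduler code:
model_load_balancer and model_try_become_ilb_owner() are hypothetical
stand-ins, and only the "-1 means no owner" compare-and-exchange step is
covered.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for nohz.load_balancer: -1 means "no ilb owner". */
static atomic_int model_load_balancer = -1;

/*
 * Returns 1 if this CPU just became the ilb owner (its tick is then not
 * stopped), 0 if some CPU already owns the role.
 */
static int model_try_become_ilb_owner(int cpu)
{
	int expected = -1;

	assert(cpu >= 0);
	/* Counterpart of atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1 */
	if (atomic_compare_exchange_strong(&model_load_balancer,
					   &expected, cpu))
		return 1;
	return 0;
}
```

As in the timeline above, the first idle CPU to call this wins and keeps
its tick for the moment; the second one finds an owner already present
and can stop its tick right away.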

Tested this a bit further, and thought about it a bit further:

On systems like my laptop, which has one physical CPU with two cores
( = SMP, !mc_capable() ), "idle load balancing" seems to be _not_
necessary at all:

- if both cores are active, ilb is inactive anyway.

- if no core is active, ilb is inactive anyway.

- if only one core is active and busy, it seems to attempt to balance its
  load on each tick anyway. ilb wouldn't act any quicker.

The attached patch decreases the number of wakeups on my completely idle
notebook ( init=/bin/bash ) from ~2 wakeups per second[*] to ~0.7. During
normal system usage, the number of wakeups per second seems to decrease as
well, but this is harder to detect. More importantly, over 80 % of all calls
to tick_nohz_stop_sched_tick() succeed immediately[**].

[*] needs a USB-autosuspend bugfix, manual enabling of USB autosuspend, and
    disabling of the blinking fb cursor.

[**] about 10% return due to rcu_needs_cpu(), which often means the CPU can
    go to sleep pretty soon afterwards.

The remaining reports of "tick_sched_timer" in powertop(1) seem to be
related to timer ticks when one CPU is active for at least one jiffy. So
this is probably not a real "wakeup" at all.

Best,
	Dominik


From: Dominik Brodowski <linux@dominikbrodowski.net>
Date: Thu, 8 Apr 2010 21:51:18 +0200
Subject: [PATCH] nohz/sched: disable ilb on !mc_capable()

Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5a5ea2c..8ad8a03 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3290,6 +3290,9 @@ int select_nohz_load_balancer(int stop_tick)
 	if (stop_tick) {
 		cpu_rq(cpu)->in_nohz_recently = 1;
 
+		if (!mc_capable())
+			return 0;
+
 		if (!cpu_active(cpu)) {
 			if (atomic_read(&nohz.load_balancer) != cpu)
 				return 0;
@@ -3339,6 +3342,9 @@ int select_nohz_load_balancer(int stop_tick)
 		if (!cpumask_test_cpu(cpu, nohz.cpu_mask))
 			return 0;
 
+		if (!mc_capable())
+			return 0;
+
 		cpumask_clear_cpu(cpu, nohz.cpu_mask);
 
 		if (atomic_read(&nohz.load_balancer) == cpu)


Thread overview: 30+ messages
2010-04-03 22:33 A few questions and issues with dynticks, NOHZ and powertop Dominik Brodowski
2010-04-03 23:53 ` Dmitry Torokhov
2010-04-04 10:35   ` Dominik Brodowski
2010-04-05 20:54     ` Dmitry Torokhov
2010-04-04 10:47   ` Dominik Brodowski
2010-04-05  3:42     ` Arjan van de Ven
2010-04-05 20:41       ` Dominik Brodowski
2010-04-05 20:52         ` Dmitry Torokhov
2010-04-04 15:17 ` Alan Stern
2010-04-04 16:39   ` Dominik Brodowski
2010-04-04 20:47     ` Paul E. McKenney
2010-04-04 23:37       ` Paul E. McKenney
2010-04-05  3:44         ` Arjan van de Ven
2010-04-05  4:22           ` Paul E. McKenney
2010-04-05 14:40             ` Arjan van de Ven
2010-04-05 15:14               ` Paul E. McKenney
2010-04-05 16:07                 ` Arjan van de Ven
2010-04-05 16:22                   ` Paul E. McKenney
2010-04-05 16:23                     ` Arjan van de Ven
2010-04-05 16:40                       ` Paul E. McKenney
2010-04-05 18:44                   ` david
2010-04-05 19:48                     ` Arjan van de Ven
2010-04-05 20:34                       ` Paul E. McKenney
2010-04-05 21:03       ` Dominik Brodowski
2010-04-05 21:38         ` Paul E. McKenney
2010-04-05 22:11           ` Dominik Brodowski
2010-04-05 22:31             ` Paul E. McKenney
2010-04-06 20:45               ` Dominik Brodowski
2010-04-06 20:59                 ` Paul E. McKenney
2010-04-08 19:59 ` [RFC PATCH] nohz/sched: disable ilb on !mc_capable() Dominik Brodowski
